"Coherence theorems"
Had a conversation today about whether "coherence theorems" exist, what they are, to what extent you're shooting yourself in the foot if you're not an expected utility maximizer, to what extent agents will self-modify to be more coherent, etc. The last question is interesting and non-trivial. I think that in order to get self-modification, you have to have some preferences that are more basic than others. So the thing that happens is supposed to be something like:
- you realize that your current preferences will result in you having almost no influence in the future with high probability
- you realize that you don't want that
- so you change your preferences so that you stop shooting yourself in the foot etc. (toy sketch of this below)
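To make "shooting yourself in the foot" concrete, here's a minimal money-pump sketch, the standard story for why an agent with cyclic preferences bleeds resources. Everything here (the `run_pump` function, the fee, the three-item cycle) is made up for illustration, not taken from any particular theorem:

```python
# Toy money pump: an agent with cyclic preferences A > B > C > A will pay
# a small fee for every "upgrade", so a trader can cycle it forever.
# All names and numbers are illustrative, not from a specific theorem.

FEE = 1.0

def run_pump(prefs, holding, wealth, rounds):
    """prefs maps a held item to the item the agent strictly prefers;
    the agent pays FEE to swap whenever such an offer exists."""
    for _ in range(rounds):
        offer = prefs.get(holding)
        if offer is None:       # no strict preference: agent declines, pump stops
            break
        wealth -= FEE           # pays to trade up
        holding = offer
    return holding, wealth

cyclic = {"B": "A", "C": "B", "A": "C"}    # A > B > C > A: intransitive
print(run_pump(cyclic, "A", 10.0, 9))      # ('A', 1.0): back where it started, poorer

# After "self-modifying" to a transitive ranking A > B > C, the agent
# trades up at most twice and then refuses further offers.
transitive = {"B": "A", "C": "B"}          # A > B > C: no cycle
print(run_pump(transitive, "C", 10.0, 9))  # ('A', 8.0): finite loss, then stable
```

The cyclic agent pays a fee every round forever and ends up holding what it started with; after switching to a transitive ranking it trades up at most twice and stops. That's the sense in which self-modifying toward coherence plugs the leak.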
The reason this comes up is that the person I was talking to made a distinction between "selection theorems" and "coherence theorems", where "if you do this you'll die out in the future" is just a selection theorem, and to be a "coherence theorem" you have to end up wanting to self-modify or something. But if you assume agents don't want to be selected against, then selection theorems become coherence theorems.
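Here's the selection-theorem version of the same point, under the toy assumptions that future "influence" is proportional to resources and that pumpable agents leak a fixed fraction of their wealth each round. The `incoherent_share` function, leak rate, and population sizes are all invented for illustration:

```python
# Toy selection dynamic: pumpable agents leak a fixed fraction of their
# wealth each round; coherent agents keep theirs. Leak rate and
# population sizes are invented for illustration.

LEAK = 0.05   # fraction of wealth a pumpable agent loses per round

def incoherent_share(n_incoherent, n_coherent, rounds):
    """Share of total wealth held by the pumpable agents after `rounds`,
    with everyone starting at wealth 1."""
    incoherent = float(n_incoherent)
    coherent = float(n_coherent)
    for _ in range(rounds):
        incoherent *= 1 - LEAK      # pumped every round
    return incoherent / (incoherent + coherent)

print(incoherent_share(50, 50, 0))     # 0.5: equal influence to start
print(incoherent_share(50, 50, 200))   # ≈ 3.5e-5: selected down to nothing
```

The incoherent agents' share of total wealth decays to roughly zero, which is the "almost no influence in the future with high probability" step. The move from selection theorem to coherence theorem is the agent noticing this in advance and not wanting it.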
Anyway IDK this post is a bit rambly and I probably wouldn't post it if I weren't trying to add some momentum here. But I think that questions of how irrational agents end up self-modifying are kind of underrated / not thought about as much as they could be.