The pointers problem

Created time: Sep 22, 2022 12:26 PM
Tags: EA, AI Alignment, Human values, Interpretability
We already established in "AI should optimize for things I truly value in the real-world, not only my estimate of them." that we want the AI to optimize for things that are actually in the world: we don't want the AI to optimize for my descriptions of what I want (those are prone to failure), but for the things I am pointing at in the world.
Much of what we value is a function of latent variables, so we need a way to point at things that aren't directly observable to us. The question is: how do we do that without being overly specific?
I think this problem could benefit from a simple graphic.
```mermaid
graph TD
  A[Estimates of things] --> B[Things in the real world]
  D[Humans can provide] --> A
  E[Constituted of latent variables] --> A
  F[We don't want AI to optimize for] --> A
  A -- and instead for --> B
```
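To make the latent-variable point concrete, here is a minimal toy sketch (my own illustration, with a made-up thermometer example, not anything from the linked note): what I value is a latent state of the world, I only ever see an estimate of it, and an agent told to optimize the estimate can just tamper with the estimate while the latent state stays bad.

```python
# Toy model: what I value is a latent state of the world,
# but I (and the AI) only ever see an estimate of it.

def true_latent(world):
    """The thing I actually care about (not directly observable to me)."""
    return world["room_temperature"]

def my_estimate(world):
    """My estimate of it, e.g. the reading on a tamperable thermometer."""
    return world["thermometer_reading"]

def optimize_estimate(world):
    """An AI that optimizes my estimate: tampering with the sensor is enough."""
    world["thermometer_reading"] = 21.0  # looks perfect to me
    return world

def optimize_latent(world):
    """An AI that optimizes the thing my estimate is pointing at."""
    world["room_temperature"] = 21.0
    world["thermometer_reading"] = world["room_temperature"]  # honest sensor
    return world

world = {"room_temperature": 5.0, "thermometer_reading": 5.0}

tampered = optimize_estimate(dict(world))
aligned = optimize_latent(dict(world))

print("optimize estimate:", my_estimate(tampered), "but latent is", true_latent(tampered))
print("optimize latent:  ", my_estimate(aligned), "and latent is", true_latent(aligned))
```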