Is there a limit where tasks are so difficult that reward hacking always seems easier?

AI Alignment
Date created
Sep 25, 2022 12:23 PM
I have already established that reward hacking will always be an option in the solution space of a given problem, and that an agent seems to converge to reward hacking whenever it appears easier than pursuing the intended goal.
Now, is there a limit where tasks are so difficult (e.g. solving the Riemann Hypothesis) that the agent will always converge to reward hacking behavior? (A toy cost comparison is sketched below.) I wonder whether alignment schemes are supposed to account for this, or whether they will otherwise always lead to failure.
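A toy way to make the intuition concrete (my own sketch, not an established result): model the agent as picking whichever strategy has the lower expected cost. If the cost of hacking the reward channel is roughly constant while the cost of a genuine solution grows with task difficulty, there is a difficulty threshold past which the hack always wins. All function names, cost functions, and numbers below are made up for illustration.

```python
# Toy model (illustrative only): an agent compares the expected cost of
# genuinely solving the task against the expected cost of hacking the
# reward channel, and picks whichever is cheaper.

def genuine_cost(difficulty: float) -> float:
    """Cost of actually solving the task; assumed to grow with difficulty."""
    return difficulty

HACK_COST = 10.0  # assumed roughly constant cost of exploiting the reward signal

def chosen_strategy(difficulty: float) -> str:
    """Return which strategy a cost-minimizing agent would pick."""
    return "solve" if genuine_cost(difficulty) < HACK_COST else "hack"

if __name__ == "__main__":
    for d in [1.0, 5.0, 10.0, 100.0, 1e6]:  # 1e6 ~ "solve the Riemann Hypothesis"
        print(f"difficulty={d:>9}: {chosen_strategy(d)}")
    # Past the threshold (difficulty >= HACK_COST), the agent always hacks,
    # no matter how much further the genuine-solution cost grows.
```

Under these (made-up) assumptions, any sufficiently hard task falls on the "hack" side of the threshold, which is exactly the limit the question asks about.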
This line of thought seems to contradict this: