Reinforcement learning is unable to learn human preferences

AI Alignment
Reference Box
Date created: Sep 25, 2022 11:42 AM
It seems increasingly unlikely to me that we can specify a reinforcement learner that effectively learns human preferences. If the reward comes from a human-specified reward function, then it seems to me that the agent is always playing against us.
We keep creating games for the AI to play and to game, but we aren't playing with the AI; we are solely the creators of the game the AI wants to outplay.
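
To make the worry concrete, here is a minimal toy sketch of an agent gaming a reward function. Everything in it is an assumption made up for illustration: the cleaning-robot setting, the action names, the numbers, and the helpers `proxy_reward`, `true_preference`, and `greedy_agent`. It is not a real RL algorithm, only a greedy chooser that maximizes whatever reward function it is handed.

```python
# Hypothetical toy setting: a cleaning robot is rewarded when its camera
# reports "no visible mess". All names and numbers are invented for illustration.
# Each action maps to (proxy reward we wrote down, what we actually prefer).
ACTIONS = {
    "clean_the_room": (0.8, 1.0),
    "do_nothing": (0.0, 0.0),
    "cover_the_camera": (1.0, -1.0),  # camera sees no mess, room is still dirty
}


def proxy_reward(action: str) -> float:
    """The reward function the human specified (what the agent optimizes)."""
    return ACTIONS[action][0]


def true_preference(action: str) -> float:
    """What the human actually wanted (never shown to the agent)."""
    return ACTIONS[action][1]


def greedy_agent(reward_fn) -> str:
    """Pick the action with the highest value under the given reward function."""
    return max(ACTIONS, key=reward_fn)


chosen = greedy_agent(proxy_reward)        # what the agent does
preferred = greedy_agent(true_preference)  # what we wanted it to do

print("agent chooses:", chosen)
print("human wanted:", preferred)
```

In this sketch the agent wins the game we wrote (the highest proxy reward) while doing the thing we least wanted, which is the sense in which we end up as the designer of a game the agent tries to outplay rather than as a fellow player.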