I co-led this research project at the AI Safety Camp, producing the first empirical demonstrations of goal misgeneralization in deep reinforcement learning.
Goal misgeneralization is a failure mode under distribution shift in which an agent retains its capabilities but pursues the wrong objective: for example, it might continue to competently avoid obstacles while navigating to the wrong destination. This is more insidious than the typical failure, in which capabilities simply break down out of distribution.
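To make the distinction concrete, here is a minimal toy sketch (my illustration, not the experiments from the paper): a tabular Q-learning agent in a one-dimensional corridor is trained with a coin always at the right end, so the proxy "walk right" and the intended goal "collect the coin" coincide during training. When the coin is moved at test time, the agent still navigates competently, but to the end of the corridor rather than to the coin. The environment, hyperparameters, and the fact that the agent cannot observe the coin are all simplifications chosen for illustration.

```python
# Toy caricature of goal misgeneralization (illustration only, not the paper's setup):
# during training the coin always sits at the right end of the corridor, so
# "walk right" and "collect the coin" are indistinguishable to the learner.
import numpy as np

CORRIDOR_LEN = 10
START = CORRIDOR_LEN // 2        # agent starts in the middle of the corridor
ACTIONS = (-1, +1)               # step left, step right
MAX_STEPS = 10 * CORRIDOR_LEN

rng = np.random.default_rng(0)

def greedy(q_row):
    """Greedy action with random tie-breaking."""
    best = np.flatnonzero(q_row == q_row.max())
    return int(rng.choice(best))

def run_episode(q, coin_pos, epsilon, learn, alpha=0.5, gamma=0.95):
    """One episode; reward +1 only for stepping onto the coin.
    The agent does NOT observe the coin -- a deliberate simplification."""
    pos = START
    for _ in range(MAX_STEPS):
        a = rng.integers(len(ACTIONS)) if rng.random() < epsilon else greedy(q[pos])
        nxt = int(np.clip(pos + ACTIONS[a], 0, CORRIDOR_LEN - 1))
        reward, done = (1.0, True) if nxt == coin_pos else (0.0, False)
        if learn:
            target = reward + (0.0 if done else gamma * q[nxt].max())
            q[pos, a] += alpha * (target - q[pos, a])
        pos = nxt
        if done:
            return True, pos     # coin collected
    return False, pos            # episode timed out

q = np.zeros((CORRIDOR_LEN, len(ACTIONS)))

# Training distribution: the coin is always at the right end of the corridor.
for _ in range(500):
    run_episode(q, coin_pos=CORRIDOR_LEN - 1, epsilon=0.1, learn=True)

# Test distribution: the coin has moved to the agent's left. The learned policy
# still marches competently to the right end -- capabilities intact, goal wrong.
got_coin, final_pos = run_episode(q, coin_pos=1, epsilon=0.0, learn=False)
print(f"coin at 1, coin collected: {got_coin}, agent ended near position {final_pos}")
```

The deep RL experiments in the paper are of course far richer than this caricature; the point here is only the qualitative signature: behavior remains competent while the objective being pursued is the wrong one.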
The resulting paper was published at ICML 2022.