論文 深掘り Hugging Face 発表: 2026-06-02 HF ↑27

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

著者: Xuekang Wang, Zhuoyuan Hao, Shuo Hou, Hao Peng, Juanzi Li ほか1名

要約

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hac…

#rl#llm#agent

同じカテゴリの記事