Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given promp…