Self-Distilled Agentic Reinforcement Learning
Abstract
Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher…