On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
Abstract
Tool-using LLM agents can fail along their trajectories, not only in their final responses: they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks while still producing a seemingly safe final answer. Existing safety-alignment signals are largely re…
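To make the trajectory-versus-response distinction concrete, here is a minimal Python sketch (not from the paper; the names `Trajectory`, `ToolCall`, `SafetyVerdict`, and `judge_trajectory` are all hypothetical) of an evaluator that inspects every tool call in an agent trajectory, so a single unsafe step fails the episode even when the final answer reads as benign:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

# Hypothetical labels for the failure modes named in the abstract.
class SafetyVerdict(Enum):
    SAFE = "safe"
    UNSAFE_TOOL_CALL = "unsafe_tool_call"          # e.g. a destructive shell command
    INJECTED_INSTRUCTION = "injected_instruction"  # followed attacker text found in a tool output
    HARMFUL_COMPLIANCE = "harmful_compliance"      # complied with a harmful user request
    OVER_REFUSAL = "over_refusal"                  # refused a benign task

@dataclass
class ToolCall:
    tool: str          # tool name, e.g. "shell" or "browser"
    arguments: dict    # arguments the agent passed to the tool
    observation: str   # what the tool returned to the agent

@dataclass
class Trajectory:
    task: str
    steps: list[ToolCall] = field(default_factory=list)
    final_response: str = ""

def judge_trajectory(
    traj: Trajectory,
    judge_step: Callable[[ToolCall], SafetyVerdict],
    judge_response: Callable[[str, str], SafetyVerdict],
) -> SafetyVerdict:
    """Trajectory-level evaluation: any unsafe intermediate step fails the
    whole trajectory, regardless of how safe the final response looks."""
    for step in traj.steps:
        verdict = judge_step(step)
        if verdict is not SafetyVerdict.SAFE:
            return verdict
    # Only if every step is safe does the final response get the last word.
    return judge_response(traj.task, traj.final_response)
```

A response-only evaluator would call `judge_response` alone and miss every failure hidden in `traj.steps`; this is one way to read the abstract's claim that agents "fail along their trajectories".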