論文 深掘り Hugging Face 発表: 2026-05-11 HF ↑12

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

著者: Bo Yin, Qi Li, Xinchao Wang

要約

Tool-using LLM agents fail through trajectories rather than only final responses, as they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks despite producing a seemingly safe answer. Existing safety-alignment signals are largely re…

#agent#alignment#llm

同じカテゴリの記事