On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
Abstract
Tool-using LLM agents can fail along their trajectories, not only in their final responses: they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks while still producing a seemingly safe final answer. Existing safety-alignment signals are largely re…
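To make the trajectory-versus-response distinction concrete, here is a minimal Python sketch (not from the paper; the names `Trajectory`, `ToolCall`, `SafetyVerdict`, and `judge_trajectory` are all hypothetical) of an evaluator that inspects every tool call in an agent trajectory, so a single unsafe step fails the episode even when the final answer reads as benign:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

# Hypothetical labels for the failure modes named in the abstract.
class SafetyVerdict(Enum):
    SAFE = "safe"
    UNSAFE_TOOL_CALL = "unsafe_tool_call"          # e.g. a destructive shell command
    INJECTED_INSTRUCTION = "injected_instruction"  # followed attacker text found in a tool output
    HARMFUL_COMPLIANCE = "harmful_compliance"      # complied with a harmful user request
    OVER_REFUSAL = "over_refusal"                  # refused a benign task

@dataclass
class ToolCall:
    tool: str          # tool name, e.g. "shell" or "browser"
    arguments: dict    # arguments the agent passed to the tool
    observation: str   # what the tool returned to the agent

@dataclass
class Trajectory:
    task: str
    steps: list[ToolCall] = field(default_factory=list)
    final_response: str = ""

def judge_trajectory(
    traj: Trajectory,
    judge_step: Callable[[ToolCall], SafetyVerdict],
    judge_response: Callable[[str, str], SafetyVerdict],
) -> SafetyVerdict:
    """Trajectory-level evaluation: any unsafe intermediate step fails the
    whole trajectory, regardless of how safe the final response looks."""
    for step in traj.steps:
        verdict = judge_step(step)
        if verdict is not SafetyVerdict.SAFE:
            return verdict
    # Only if every step is safe does the final response get the last word.
    return judge_response(traj.task, traj.final_response)
```

A response-only evaluator would call `judge_response` alone and miss every failure hidden in `traj.steps`; this is one way to read the abstract's claim that agents "fail along their trajectories".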