論文深掘り arXiv 発表: 2026-05-04

A decoupled diffusion planner that adapts to changing cost limits by using cost-conditioned generation for safety and reward gradients for performance

著者: Rufeng Chen, Zhaofan Zhang, Zhejiang Yang, Hechang Chen, Sihong Xie

要約

Offline safe reinforcement learning often requires policies to adapt at deployment time to safety budgets that vary across episodes or change within a single episode. While diffusion-based planners enable flexible trajectory generation, existing guidance schemes often treat reward improvement and co…

#diffusion#rl#alignment#benchmark

A decoupled diffusion planner that adapts to changing cost limits by using cost-conditioned generation for safety and reward gradients for performance

要約

同じカテゴリの記事

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

World-R1: テキストから動画生成における3D制約の強化学習による整合