論文 Hugging Face 発表: 2026-05-12 HF ↑3

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

著者: Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen, Chen ほか6名

要約

Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions…

#llm#benchmark#agent#alignment

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

要約

同じカテゴリの記事

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

World-R1: テキストから動画生成における3D制約の強化学習による整合