Paper | Hugging Face | Published: 2026-06-10 | HF ↑3

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Authors: Yunhan Wang, Jiaan Wang, Lianzhe Huang, Xianfeng Zeng, Fandong Meng

Abstract

Search agents (large language models augmented with search tools) have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-set contamination and parametric memorization. Consequently, models …

#agent #benchmark #llm

Articles in the same category

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

World-R1: Aligning 3D Constraints in Text-to-Video Generation via Reinforcement Learning