EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge
Abstract
Search Agents — large language models augmented with search tools — have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-set contamination and parametric memorization. Consequently, models …