Paper | Hugging Face | Published: 2026-06-01 | HF ↑4

Benchmarking Visual State Tracking in Multimodal Video Understanding

Authors: Sihyun Yu, Nanye Ma, Pinzhi Huang, Hyunseok Lee, Shusheng Yang, and 6 others

Abstract

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Mod…

#llm #benchmark #agent #multimodal #coding

Articles in the same category

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

World-R1: Aligning 3D Constraints in Text-to-Video Generation via Reinforcement Learning