← アーカイブ一覧
論文 深掘り Hugging Face 2026-05-31 HF ↑9
Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large coll...
#agent#benchmark#rl#multimodal
論文 深掘り Hugging Face 2026-05-31 HF ↑41
Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400...
#agent#benchmark#llm
論文 深掘り Hugging Face 2026-05-31 HF ↑12
The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic...
#benchmark#agent#llm
論文 Hugging Face 2026-05-31 HF ↑20
While video streaming understanding has made significant strides, real-world applications, such as live sports broadcasting, autonomous driving, and multi-screen collaboration, inherently demand continuous, multi-stream interactions. However, existing benchmarks are confined to single-stream paradig...
#llm#benchmark#agent
論文 Hugging Face 2026-05-31 HF ↑11
In open-ended environments, exploration is fundamental for autonomous agents, yet current language model agents struggle with this. Effective exploration requires memory, but retaining raw interaction histories is computationally expensive over long trajectories. While latent memory offers a solutio...
#agent#llm#benchmark
論文 Hugging Face 2026-05-31 HF ↑9
Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory:...
#benchmark#rag#diffusion
論文 Hugging Face 2026-05-31 HF ↑20
The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, leading to log...
#multimodal#benchmark
論文 Hugging Face 2026-05-31 HF ↑53
Parameter-efficient fine-tuning (PEFT) is usually treated as a cheaper alternative to full fine-tuning. We study a broader role: small trainable adapters as persistent local state on top of strong shared foundation models. In this framing, the base model provides shared competence while adapters car...
#fine-tuning#benchmark
論文 arXiv 2026-06-01
Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frame...
#llm#benchmark#multimodal
論文 arXiv 2026-06-01
Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do n...
#benchmark#llm#alignment#multimodal