Paper · Deep Dive · Hugging Face · Published: 2026-05-20 · HF ↑30

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Authors: Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, and 16 others

Abstract

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous…

#llm #multimodal #benchmark
