InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
Abstract
Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temp…