2026-05-05

16 items

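Paper deep dive Hugging Face 2026-05-03 HF ↑64

MolmoAct2: Action Reasoning Models for Real-World Deployment

An "open-source revolution" in robot AI is getting under way, and the cost of entry looks set to fall to a fraction of what it was

Vision-Language-Action (VLA) models, which aim to serve as general-purpose robot controllers, face obstacles to real-world deployment: closed models, dependence on expensive hardware, and high latency. In this work, Allen AI presents MolmoAct2, a fully open action reasoning model. It improves on five axes: MolmoER, a VLM backbone specialized for spatial and embodied reasoning (trained on 3.3 million samples); three new datasets for low- to mid-cost platforms (including MolmoAct2-BimanualYAM, at 720 hours the largest open bimanual dataset); OpenFAST, an open action tokenizer; a new architecture that integrates a flow-matching continuous action expert through KV-cache conditioning; and MolmoThink, an adaptive inference scheme that re-predicts depth tokens only for regions that have changed. The authors report that MolmoAct2 outperforms Pi-05 on seven benchmarks and that MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 on 13 embodied reasoning benchmarks. Model weights, training code, and data are all released.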

#multimodal#robotics#benchmark#fine-tuning

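Paper deep dive Hugging Face 2026-05-03 HF ↑3

PhysicianBench: A Benchmark for Evaluating LLM Agents in Real Electronic Health Record Environments

"Medical AI has the knowledge but cannot act": the performance gap among clinical agents can now be measured in numbers

PhysicianBench is a proposed benchmark that evaluates LLM agents on physician tasks in an electronic health record (EHR) environment. Existing evaluations of medical agents are limited to static knowledge recall or single-step actions and fail to reproduce the complex, long-horizon workflows of real clinical practice. PhysicianBench consists of 100 long-horizon tasks based on real consultation cases between primary care and specialty care, covers 21 specialties and multiple workflow types, and requires an average of 27 tool calls per task. It uses the same standard APIs as commercial EHRs and evaluates execution results in a verifiable way against 670 checkpoints. In an evaluation of 13 LLM agents, the best-performing model reached a success rate of only 46% (pass@1) and open-source models topped out at 19%, showing a large gap between current agent capabilities and the demands of real clinical practice.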

#agent#benchmark#llm

Industry news OpenAI 2026-05-05

New ways to buy ChatGPT ads

OpenAI expands ChatGPT ads with a beta self-serve Ads Manager, CPC bidding, and enhanced measurement tools, built to protect privacy and keep conversations separate from ads.

Paper Hugging Face 2026-05-03 HF ↑2

T^2PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

Recent progress in multi-turn reinforcement learning (RL) has significantly improved reasoning LLMs' performances on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads ...

#rl#llm#agent#benchmark

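Paper Hugging Face 2026-05-03 HF ↑3

AcademiClaw: Students Set Challenges for AI Agents

Recent AI agent benchmarks skew toward assistant-level tasks and do not adequately evaluate academic-level capabilities. This work proposes AcademiClaw, a bilingual benchmark for the OpenClaw ecosystem made up of 80 complex, long-horizon tasks collected from the real academic workflows of university students (homework, research projects, competitions, and personal projects). The tasks were selected through rigorous expert review from 230 student-submitted candidates and span more than 25 specialist domains, from olympiad mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack debugging; 16 tasks require CUDA GPU execution. Each task runs in a Docker sandbox and is scored with a multi-dimensional rubric that combines six complementary methods. In experiments with six frontier models, the best pass rate was only 55%, and the benchmark yields detailed diagnostics that aggregate metrics hide, such as clear capability boundaries across task domains and a mismatch between token consumption and output quality.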

#agent#benchmark#rl#alignment

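Paper Hugging Face 2026-05-03 HF ↑1

Perceptual Flow Network for Visually Grounded Reasoning

Large vision-language models (LVLMs) are trained with generic optimization objectives such as standard maximum likelihood estimation (MLE), which do not adequately constrain visual reasoning trajectories and leave the models prone to language bias and hallucination. Existing methods introduce geometric priors from visual experts as additional supervision, but the authors argue that this over-emphasizes geometric precision and is of limited use for reasoning. To address this, the paper proposes the Perceptual Flow Network (PFlowNet). PFlowNet decouples perception from reasoning and establishes a self-conditioned generation process, removing rigid alignment to expert priors. It also uses variational reinforcement learning to combine multi-dimensional rewards with vicinal geometric shaping, encouraging reasoning-oriented perceptual behavior while preserving visual reliability. The authors present theoretical performance guarantees and report new state-of-the-art results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).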

#rl#multimodal#alignment

Paper arXiv 2026-05-04

TOC-SR: Task-Optimal Compact diffusion for Image Super Resolution

Diffusion models have recently demonstrated strong performance for image restoration tasks, including super-resolution. However, their large model size and iterative sampling procedures make them computationally expensive for practical deployment. In this work, we present TOC-SR, a framework for bui...

#diffusion

Industry news Microsoft Research 2026-05-05

Microsoft at NSDI 2026: Advances in large-scale networked systems

Microsoft researchers share advances in building and operating large-scale distributed systems, spanning datacenters, networking, and the growing intersection with AI during NSDI '26.

Model OpenAI 2026-05-05

GPT-5.5 Instant System Card


Model OpenAI 2026-05-05

GPT-5.5 Instant: smarter, clearer, and more personalized

GPT-5.5 Instant updates ChatGPT's default model with smarter, more accurate answers, reduced hallucinations, and improved personalization controls.

Paper deep dive arXiv 2026-05-04

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems t...

#agent#llm#rl#benchmark

Paper deep dive arXiv 2026-05-04

A decoupled diffusion planner that adapts to changing cost limits by using cost-conditioned generation for safety and reward gradients for performance

Offline safe reinforcement learning often requires policies to adapt at deployment time to safety budgets that vary across episodes or change within a single episode. While diffusion-based planners enable flexible trajectory generation, existing guidance schemes often treat reward improvement and co...

#diffusion#rl#alignment#benchmark

Paper deep dive arXiv 2026-05-04

Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation: A Controlled Empirical Study

Retrieval-Augmented Generation (RAG) offers a well-established path to grounding large language model (LLM) outputs in external knowledge, yet the question of which retrieval strategy works best in a high-stakes domain such as biomedicine has not received the controlled, multi-metric treatment it de...

#rag#benchmark#llm

Industry news NVIDIA 2026-05-05

NVIDIA and ServiceNow Partner on New Autonomous AI Agents for Enterprises

Enterprise AI has learned to generate. It has learned to reason. Now companies are asking the next question: How should AI act? Early agent systems have shown what’s possible, moving beyond simple prompts to take on more complex tasks. The next step is bringing those capabilities into enterprise env...

#agent

Industry news OpenAI 2026-05-04

OpenAI and PwC collaborate to reimagine the office of the CFO

OpenAI and PwC are partnering to help enterprises use AI agents to automate finance workflows, improve forecasting, strengthen controls, and modernize the CFO function.

#agent

Industry news OpenAI 2026-05-04

How OpenAI delivers low-latency voice AI at scale

How OpenAI rebuilt its WebRTC stack to power real-time Voice AI with low latency, global scale, and seamless conversational turn-taking.

#speech