Paper Hugging Face 2026-05-20 HF ↑48
Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly long-document curation or heuristic context synthesis. We observe that agents produce massive trajectories when solving problems, invoking tools and r...
#agent #llm #fine-tuning #benchmark
Paper Deep dive Hugging Face 2026-05-20 HF ↑30
Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous...
#llm #multimodal #benchmark
Paper Deep dive Hugging Face 2026-05-20 HF ↑62
Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral...
#llm #benchmark #multimodal #agent
Paper Hugging Face 2026-05-20 HF ↑4
Model cards describe model behavior through a mixture of textual descriptions and structured artifacts, including performance, configuration, and dataset tables. Existing model search systems rely predominantly on semantic similarity over text, which can produce homogeneous result sets and limit exp...
#alignment #benchmark
Paper Deep dive Hugging Face 2026-05-20 HF ↑10
Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. De...
#coding #benchmark
Paper Hugging Face 2026-05-20 HF ↑16
The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs o...
#llm #multimodal #rl #agent #benchmark
Paper Hugging Face 2026-05-20 HF ↑18
Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens...
#llm #benchmark #multimodal #fine-tuning
Paper Hugging Face 2026-05-20 HF ↑22
Spreadsheet systems (e.g., Microsoft Excel, Google Sheets) play a central role in modern data-centric workflows. As AI agents grow increasingly capable of automating complex tasks, such as controlling computers and generating presentations, building an AI-driven spreadsheet agent has emerged as a pr...
#agent #llm #rl #fine-tuning #benchmark
Paper Hugging Face 2026-05-20 HF ↑26
Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks re...
#benchmark #diffusion #fine-tuning #coding
Paper Hugging Face 2026-05-20 HF ↑16
Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. ...
#agent #diffusion #benchmark
Company news OpenAI 2026-05-22
OpenAI is named a leader in the 2026 Gartner Magic Quadrant for Enterprise AI Coding Agents, with Codex recognized for innovation and enterprise-scale deployment....
#agent #coding
Company news OpenAI 2026-05-22
How Virgin Atlantic used Codex to ship its revamped mobile app on a fixed holiday travel deadline, reaching near-total unit test coverage and zero P1 defects....
Paper Deep dive arXiv 2026-05-21
Survival analysis aims to estimate a time-to-event distribution from data with censored observations. Many existing methods either impose structural assumptions on the hazard function or discretize the time axis, which may limit flexibility and introduce approximation errors. We propose the Survival...
#diffusion #benchmark
Company news OpenAI 2026-05-21
AdventHealth is using ChatGPT for Healthcare to streamline workflows, reduce administrative burden, and return more time to patient care....
Company news Hugging Face 2026-05-22
Specialization Beats Scale: A Strategic Variable Most AI Procurement Decisions Overlook...
Paper Deep dive arXiv 2026-05-21
Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. De...
#coding #benchmark
Paper Deep dive arXiv 2026-05-21
The Flexible Job Shop Scheduling Problem (FJSP) is the optimal allocation of a set of jobs to machines. Two primary challenges persist in FJSP: the unpredictable arrival of future jobs and the combinatorial complexity of the problem, rendering it intractable for conventional mixed-integer linear pro...
#agent #coding #rl #benchmark
Paper arXiv 2026-05-21
Parameter-efficient fine-tuning enables fast personalization of text-to-image diffusion models, but composing multiple custom concepts remains challenging due to representation interference. Existing modular methods either rely on expensive post-hoc fusion or freeze adaptation subspaces, which limit...
#diffusion #fine-tuning #vision
Paper arXiv 2026-05-21
AI models are already deployed in societies affected by armed conflict, and journalists, humanitarian workers, governments and ordinary citizens rely on them for information or for their work processes. No established practice exists for checking whether their outputs can make those conflicts worse....
#alignment #benchmark #llm
Paper arXiv 2026-05-21
Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated messa...
#llm #benchmark