2026-05-22

20 items

Paper Hugging Face 2026-05-20 HF ↑48

ACC: Compiling Agent Trajectories for Long-Context Training

Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly long-document curation or heuristic context synthesis. We observe that agents produce massive trajectories when solving problems, invoking tools and r...

#agent #llm #fine-tuning #benchmark

Paper Deep Dive Hugging Face 2026-05-20 HF ↑30

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous...

#llm #multimodal #benchmark

Paper Deep Dive Hugging Face 2026-05-20 HF ↑62

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral...

#llm #benchmark #multimodal #agent

Paper Hugging Face 2026-05-20 HF ↑4

Diversed Model Discovery via Structured Table Discovery

Model cards describe model behavior through a mixture of textual descriptions and structured artifacts, including performance, configuration, and dataset tables. Existing model search systems rely predominantly on semantic similarity over text, which can produce homogeneous result sets and limit exp...

#alignment #benchmark

Paper Deep Dive Hugging Face 2026-05-20 HF ↑10

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. De...
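
The fixed-size recurrent state mentioned in this abstract can be illustrated with a minimal sketch of plain causal linear attention. This is an assumption-laden illustration, not the Gated DeltaNet-2 update itself (the truncated abstract does not specify it): it uses an identity feature map, no normalization, and no gating or erase step, and the function names `linear_attention_recurrent` and `linear_attention_quadratic` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def linear_attention_recurrent(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Causal linear attention computed with a fixed-size state (d_k x d_v).

    Assumption: identity feature map, no normalization, no gating.
    Memory is constant in sequence length; time is linear in it.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # the compressed memory
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Write step: add the outer product k v^T to the state.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read step: output is q^T state.
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


def linear_attention_quadratic(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Reference form: o_t = sum over s <= t of (q_t . k_s) * v_s.

    Keeps every past key and value, like an unbounded cache.
    """
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for s in range(t + 1):
            weight = sum(a * b for a, b in zip(q, keys[s]))
            for j in range(len(out)):
                out[j] += weight * values[s][j]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    q = [[1, 0], [0, 1], [1, 1]]
    k = [[1, 2], [3, 4], [0, 1]]
    v = [[1, 0, 2], [0, 1, 1], [2, 2, 0]]
    print(linear_attention_recurrent(q, k, v))
    print(linear_attention_quadratic(q, k, v))
```

Both functions return the same outputs; the recurrent form only ever stores the `d_k x d_v` state, which is why decoding memory does not grow with sequence length. Because this sketch only adds to the state, it never erases old associations, which is the limitation that delta-rule style updates aim to address.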

#coding #benchmark

Paper Hugging Face 2026-05-20 HF ↑16

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs o...

#llm #multimodal #rl #agent #benchmark

Paper Hugging Face 2026-05-20 HF ↑18

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens...

#llm #benchmark #multimodal #fine-tuning

Paper Hugging Face 2026-05-20 HF ↑26

WorldKV: Efficient World Memory with World Retrieval and Compression

Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks re...

#benchmark #diffusion #fine-tuning #coding

Paper Hugging Face 2026-05-20 HF ↑16

Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. ...

#agent #diffusion #benchmark

Company News OpenAI 2026-05-22

How Virgin Atlantic ships faster with Codex

How Virgin Atlantic used Codex to ship its revamped mobile app on a fixed holiday travel deadline, reaching near-total unit test coverage and zero P1 defects....

Paper Deep Dive arXiv 2026-05-21

SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis

Survival analysis aims to estimate a time-to-event distribution from data with censored observations. Many existing methods either impose structural assumptions on the hazard function or discretize the time axis, which may limit flexibility and introduce approximation errors. We propose the Survival...

#diffusion #benchmark

Paper Deep Dive arXiv 2026-05-21

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. De...

#coding #benchmark

Paper Deep Dive arXiv 2026-05-21

Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals

The Flexible Job Shop Scheduling Problem (FJSP) is the optimal allocation of a set of jobs to machines. Two primary challenges persist in FJSP: the unpredictable arrival of future jobs and the combinatorial complexity of the problem, rendering it intractable for conventional mixed-integer linear pro...

#agent #coding #rl #benchmark

Paper arXiv 2026-05-21

SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation

Parameter-efficient fine-tuning enables fast personalization of text-to-image diffusion models, but composing multiple custom concepts remains challenging due to representation interference. Existing modular methods either rely on expensive post-hoc fusion or freeze adaptation subspaces, which limit...

#diffusion #fine-tuning #vision

Paper arXiv 2026-05-21

AMEL: Accumulated Message Effects on LLM Judgments

Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated messa...

#llm #benchmark