2026-05-14

20 items

Paper Deep Dive Hugging Face 2026-05-12 HF ↑30

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

Recent image editing models have achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse...

#benchmark #rl #multimodal

Paper Deep Dive Hugging Face 2026-05-12 HF ↑30

Qwen-Image-VAE-2.0 Technical Report

We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Conne...

#benchmark #diffusion #alignment #coding

Paper Deep Dive Hugging Face 2026-05-12 HF ↑60

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, p...

#benchmark #multimodal #agent

Paper Hugging Face 2026-05-12 HF ↑14

Useful Memories Become Faulty When Continuously Updated by LLMs

Learning from past experience benefits from two complementary forms of memory: episodic traces -- raw trajectories of what happened -- and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM r...

#agent #llm

Paper Hugging Face 2026-05-12 HF ↑19

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

In-context learning (ICL) adapts large language models (LLMs) to new tasks by conditioning on demonstrations in the prompt without parameter updates. With long-context models, many-shot ICL can use dozens to hundreds of examples and achieve performance comparable to fine-tuning, yet current understa...

#llm #benchmark #fine-tuning
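
To make the many-shot setup in this abstract concrete, here is a minimal sketch of how a prompt with chain-of-thought demonstrations could be assembled. The `Demo` type, the `build_prompt` helper, the `max_shots` parameter, and the prompt layout are all illustrative assumptions, not the paper's format.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    question: str
    rationale: str  # chain-of-thought text shown before the answer
    answer: str


def build_prompt(demos: list[Demo], query: str, max_shots: int = 100) -> str:
    """Concatenate up to max_shots demonstrations, then the unanswered query."""
    blocks = [
        f"Q: {d.question}\nReasoning: {d.rationale}\nA: {d.answer}"
        for d in demos[:max_shots]
    ]
    # The model is only conditioned on these examples; no parameters change.
    blocks.append(f"Q: {query}\nReasoning:")
    return "\n\n".join(blocks)


# Hypothetical toy demonstrations, for illustration only.
demos = [
    Demo("2 + 3?", "Add the two numbers.", "5"),
    Demo("4 * 2?", "Multiply the two numbers.", "8"),
]
prompt = build_prompt(demos, "7 - 1?")
```

Raising `max_shots` is what moves this from few-shot to many-shot prompting; the practical limit is the model's context length.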

Paper Hugging Face 2026-05-12 HF ↑74

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merge...

#llm #rl #benchmark
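
For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update: a frozen base matrix W plus a small adapter B·A, applied either without merging or after merging into a single matrix. It is a generic illustration in pure Python; the function names and toy matrices are assumptions, and the truncated abstract does not say how MinT itself applies adapters.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply matrix a by matrix b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def lora_forward(w, a, b, x, scale=1.0):
    """y = W x + scale * B (A x), without forming the merged matrix."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [y + scale * d for y, d in zip(base, delta)]


def merged_forward(w, a, b, x, scale=1.0):
    """y = (W + scale * B A) x, with the adapter merged into the base weights."""
    ba = matmul(b, a)
    merged = [
        [wij + scale * dij for wij, dij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, ba)
    ]
    return matvec(merged, x)


# Toy values (assumed): a 3x3 base weight and a rank-1 adapter.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
A = [[1.0, 2.0, 0.0]]      # 1x3 down-projection
B = [[1.0], [0.0], [2.0]]  # 3x1 up-projection
x = [1.0, 1.0, 1.0]

unmerged = lora_forward(W, A, B, x)
merged = merged_forward(W, A, B, x)
```

Both paths give the same output, but the unmerged path keeps W shared, so many adapters can in principle sit on top of one base-model copy.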

Paper Hugging Face 2026-05-12 HF ↑10

Asymmetric Flow Models

Flow-based generation in high-dimensional spaces is difficult because velocity prediction requires modeling high-dimensional noise, even when data has strong low-rank structure. We present Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise predictio...

#fine-tuning #diffusion #vision #benchmark

Paper Hugging Face 2026-05-12 HF ↑7

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive question answering, but existing systems remain brittle on multi-hop questions, where solving the task requires chaining multiple retrieval and reasoning steps. Key challenges are that current methods represe...

#rag #benchmark

Paper Hugging Face 2026-05-12 HF ↑3

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions...

#llm #benchmark #agent #alignment

Paper Hugging Face 2026-05-12 HF ↑18

FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

Vision-Language-Action (VLA) policies are commonly trained from dense robot demonstration trajectories, often collected through teleoperation, by sampling every recorded frame as if it provided equally useful supervision. We argue that this convention creates a temporal supervision imbalance: long l...

#alignment #robotics #benchmark

Company News OpenAI 2026-05-14

Work with Codex from anywhere

Use Codex anywhere with the ChatGPT mobile app. Monitor, steer, and approve coding tasks in real time across devices and remote environments....

#coding

Company News OpenAI 2026-05-14

Helping ChatGPT better recognize context in sensitive conversations

Learn how new ChatGPT safety updates improve context awareness in sensitive conversations, helping detect risk over time and respond more safely....

#alignment

Paper Deep Dive arXiv 2026-05-13

Identifying AI Web Scrapers Using Canary Tokens

From pre-training to query-time augmentation, web-scraped data helps to improve the quality and contextual relevancy of content generated by large language models (LLMs). However, large-scale web scraping to feed LLMs can affect site stability and raise legal, privacy, or ethics concerns. If website...

#llm #agent #robotics
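
As a rough illustration of the canary-token idea named in the title, the sketch below gives each visitor a unique marker embedded in the served page, so that a marker resurfacing later can be traced back to whoever fetched it. Everything here is hypothetical (the `SECRET` key, the helper names, the HTML-comment embedding, the visitor IDs); the truncated abstract does not describe the paper's actual design.

```python
import hashlib
import hmac

SECRET = b"replace-with-a-private-key"  # hypothetical site-side secret


def canary_for(visitor_id: str) -> str:
    """Derive a stable, hard-to-guess marker for one visitor."""
    digest = hmac.new(SECRET, visitor_id.encode(), hashlib.sha256).hexdigest()
    return f"canary-{digest[:16]}"


def serve_page(body: str, visitor_id: str, issued: dict[str, str]) -> str:
    """Embed the visitor's marker in the page and record who received it."""
    token = canary_for(visitor_id)
    issued[token] = visitor_id
    return f"{body}\n<!-- {token} -->"


def identify(text: str, issued: dict[str, str]) -> list[str]:
    """Return the visitors whose markers appear in some later text."""
    return [visitor for token, visitor in issued.items() if token in text]


issued: dict[str, str] = {}
page_a = serve_page("hello", "bot-A", issued)
page_b = serve_page("hello", "bot-B", issued)
leak_sources = identify(page_a, issued)
```

Deriving the marker with an HMAC means the site only needs to keep the secret and the issued-token table; visitors cannot forge each other's markers without the key.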

Paper Deep Dive arXiv 2026-05-13

RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-L...

#llm #fine-tuning #coding #benchmark

Company News OpenAI 2026-05-13

Building a safe, effective sandbox to enable Codex on Windows

Learn how OpenAI built a secure sandbox for Codex on Windows, enabling safe, efficient coding agents with controlled file access and network restrictions....

#agent #coding

Company News OpenAI 2026-05-13

Our response to the TanStack npm supply chain attack

OpenAI details its response to the TanStack “Mini Shai-Hulud” supply chain attack, outlines protections taken to secure systems and signing certificates, and explains why macOS users must update OpenAI apps by June 12, 2026. Learn what happened, what was affected, and how OpenAI is strengthening def...

Company News Hugging Face 2026-05-14

Granite Embedding Multilingual R2: Open Apache 2.0 Multilingual Embeddings with 32K Context — Best Sub-100M Retrieval Quality

#benchmark

Paper arXiv 2026-05-13

QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling

Modeling long-range dependencies in sequential data remains a central challenge in machine learning. Transformers address this challenge through attention mechanisms, but their quadratic complexity with respect to sequence length limits scalability to long contexts. State-space models (SSMs) provide...

#benchmark
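
The quadratic cost of attention mentioned in this abstract can be seen by counting pairwise scores in a plain softmax attention. This is a minimal single-head sketch in pure Python with assumed toy inputs; it illustrates standard attention only and says nothing about how QLAM works.

```python
import math


def attention(queries, keys, values):
    """Single-head softmax attention; also returns how many scores were computed."""
    dim = len(queries[0])
    outputs, score_count = [], 0
    for q in queries:
        # One score per (query, key) pair: n queries x n keys = n^2 scores.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        score_count += len(scores)
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))
        ])
    return outputs, score_count


def toy_sequence(n):
    """Hypothetical length-n sequence of 2-dimensional token vectors."""
    return [[float(i % 3), 1.0] for i in range(n)]


_, count_small = attention(*[toy_sequence(8)] * 3)
_, count_large = attention(*[toy_sequence(16)] * 3)
```

Doubling the sequence length from 8 to 16 quadruples the number of scores, which is the scaling behavior that limits long contexts.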

Paper Deep Dive arXiv 2026-05-13

Neurosymbolic Auditing of Natural-Language Software Requirements

Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains, these defects propagate into formal models that verify the wrong specification and into implementations that ship unsafe behavior. We show that large language models, equipped wi...

#alignment #llm #benchmark

Paper arXiv 2026-05-13

The WidthWall: A Strict Expressivity Hierarchy for Hypergraph Neural Networks

Hypergraphs provide a natural framework to model higher-order interactions in scientific, social, and biological systems. Hypergraph neural networks (HGNNs) aim to learn from such data, yet it remains unclear which higher-order structures these models can represent. We show that hypergraph expressiv...