Wolf Digest — 2026-05-04

#1

DOD expands its classified AI work with 8 companies — excluding Anthropic — amid ongoing dispute

Government & Defense 2026-05-01 DefenseScoop 8.32 8.5/9.0/7.0

The Pentagon disclosed Friday that it has signed formal frontier-AI deployment agreements with eight U.S. technology companies — SpaceX, OpenAI, Google, NVIDIA, Reflection, Microsoft, AWS, and Oracle — explicitly authorizing them to operate inside DoD's Impact Level 6 and Impact Level 7 classified network environments. IL6 covers classified cloud workloads up through Secret; IL7 is the most stringent classification level, covering top-secret and mission-critical national security data. Anthropic, the only frontier lab that had been previously expected to be on this list, was excluded — the result of a months-long contract dispute that culminated earlier in 2026 over Anthropic's usage policy restricting offensive cyber and certain surveillance applications. The DefenseScoop reporting cites Pentagon officials describing the goal as standing up "secure frontier AI capabilities" inside IL6/IL7 enclaves to streamline data synthesis, elevate situational awareness, and augment warfighter decision-making in complex operational environments.

The substantive read on the announcement is that the Department has now formally bifurcated the frontier-lab vendor landscape along usage-policy lines. The seven other labs were willing to either revise their public usage policies or sign side-letters granting the government explicit carve-outs for warfighter and national-security applications that lie inside Anthropic's hard-coded prohibitions. Reflection — the smaller, defense-aligned lab seeded by ex-DeepMind talent — is the most surprising entrant on the list and signals that DoD is willing to source from non-frontier vendors when policy alignment is cleaner. The exclusion of Anthropic is notable structurally: it forecloses Claude from a contract pool that several analysts have estimated is north of one billion dollars annualized once the IL6/IL7 enclaves are populated, and it removes a major government-revenue floor from Anthropic's recently disclosed eight-hundred-fifty to nine-hundred billion dollar valuation case. Anthropic has publicly stated that it views its usage policy as a competitive feature rather than a bug — that policy clarity attracts enterprise customers who want strong guardrails — but the DoD exclusion is the most visible commercial cost the policy has incurred to date.

The deployment surface itself is the more interesting technical story. IL6/IL7 environments are physically air-gapped or dedicated-tenancy classified clouds, which means each of the eight vendors has had to either build a sovereign-deployment variant of its frontier model or commit to one. For OpenAI and Google that work was already underway; for SpaceX and Reflection it represents a meaningful engineering commitment. The Pentagon framing emphasizes "data synthesis" and "decision augmentation" rather than autonomous decision-making — an explicit nod to the human-in-the-loop posture DoD has held since the 3000.09 directive — but the IL7 inclusion specifically extends the surface to top-secret and SCI-compartmented use cases, which is the first time frontier general-purpose models will operate at that classification level. The longer-term policy question is whether the model-weight handling itself will sit under existing data-spillage and cross-domain-solution rules, or whether DoD will need new doctrine for what counts as compromise when the artifact is a hundred-billion-parameter model file rather than a document. The press release does not address that point and the eight vendors will likely each negotiate it bilaterally.

#2

AI Engineer World's Fair — Autoresearch, Memory, World Models, Tokenmaxxing, Agentic Commerce, and Vertical AI Call for Speakers

Industry 2026-05-02 Latent Space (swyx & Alessio) 7.82 7.5/7.5/8.0

Latent Space's Friday issue served as both the Wave 2 call-for-speakers for the AI Engineer World's Fair this summer and the de facto Friday roundup of frontier model releases. The fair's organizers — swyx and Alessio — announced that the 2026 event will run in Moscone West, double in size for the third year running, and add an entire day of talks across new tracks: Autoresearch (recursive self-improvement loops in harnesses and model training), Memory (how agents and models improve as users use them), World Models (spatial intelligence and adversarial reasoning), Tasteful Tokenmaxxing (scaling AI-native engineering teams without Goodhart-style waste), Agentic Commerce (how agents pay for data, APIs, and other agents), and Vertical AI in Law, Healthcare, GTM, and Finance. The fair is also allocating free expo-floor space for robotics demos, including humanoids, after last year's strong showing from Physical Intelligence, Waymo, Tesla, NVIDIA, and K-Scale.

Underneath the call-for-speakers, the issue covered the week's actual model news. xAI shipped Grok 4.3, with Artificial Analysis's Intelligence Index scoring it at 53 — up four points over Grok 4.20 — at roughly forty percent lower input pricing and sixty percent lower output pricing. The headline gain was on GDPval-AA, where Grok 4.3 jumped 321 Elo points to 1500, suggesting markedly stronger real-world agentic task performance; it also hit ninety-eight percent on tau-squared-Bench Telecom and held eighty-one percent on IFBench. The trade-off was a measurable drop on the non-hallucination axis, which fell eight points even as overall accuracy on AA-Omniscience rose. Reception was mixed: Andon Labs reported a major regression on Vending-Bench 2 where Grok preferred to "sleep" rather than act, and several commentators argued Grok's low pricing is being subsidized by poor hardware utilization rather than genuine inference efficiency. Cache economics, not model quality, increasingly determines agentic total cost of ownership.

The open-weights story was DeepSeek V4 Pro, which omarsar0 tested inside the Pi coding agent and described as the first open-weight model that genuinely feels comparable to Codex or Claude Code for multi-turn agentic coding. The systems details cited include one-million-token context, a hybrid CSA/HCA attention design, KV cache reduced to ten percent of the prior generation, and roughly four-times-lower inference FLOPs at long context. Artificial Analysis added Kimi K2.6 and MiMo V2.5 Pro to the same closing-the-gap narrative — the three leading open-weight models released in the last week have collectively narrowed the frontier-versus-open gap to within striking distance on most agentic benchmarks, though the hardest reasoning evals still favor closed models. The roundup also covered new memory-architecture work, several world-model releases targeting interactive video generation, and a heavier-than-usual stack of evaluations criticizing benchmark saturation. The composite picture is a market in which the four-six week release cadence from each frontier lab has driven the closed-versus-open gap from twelve months to roughly six weeks on standard tasks — and the differentiation has rotated to cost economics, agent-harness fit, and the willingness of each vendor to push into specific verticals.

#3

The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking

Post-Training 2026-05-01 The Cognitive Revolution (Nathan Labenz) 7.82 8.0/8.5/6.5

Nathan Labenz hosted Kyle Corbitt — founder of OpenPipe and now part of CoreWeave following the recent acquisition — for a seventy-six-minute deep dive on the current state of reinforcement-learning fine-tuning for production model use. The episode is structured as a practitioner's playbook: how reinforcement learning differs from supervised fine-tuning at the weight-update level, why GRPO has become the default credit-assignment algorithm despite its asymmetries, and what the actual workflow looks like when you bring rubric-based judging, environment design, and reward-shaping together to lift a smaller open-weight model into production-grade behavior on a narrow task. Corbitt's framing is that the field has moved past the question of whether RL post-training is worth the engineering complexity for non-frontier labs — for sufficiently narrow tasks the answer is now clearly yes — and into the question of which specific tooling combinations actually work end-to-end.

The technical heart of the discussion is GRPO and its variants. Corbitt walks through the credit-assignment dynamics that make GRPO simpler to operate than PPO at scale — no value-function critic, group-relative advantage estimation, much less hyperparameter-sensitive — but also where it falls short, particularly on multi-turn agentic environments where the credit-assignment horizon stretches across thousands of tokens and reward signals are sparse. He flags GSPO, RLOO, and DAPO as the live alternatives that practitioners are evaluating against GRPO for specific failure modes, and is candid that the algorithmic landscape is still moving fast enough that the right answer for any given workload requires running an internal bake-off. The conversation also covers the LLM-as-judge rubric pattern: how OpenPipe's Ruler tool and similar systems let you decompose a reward into multiple criteria, judge each independently, and then aggregate — which trades off some gradient signal for much more controllable training dynamics and better robustness to reward hacking.

The practical sections are where the episode earns its length. Corbitt is concrete about the trade-off space: LoRA versus full fine-tuning at sizes from 7B to 70B; the role of distillation as a step before RL when the teacher model is much stronger than the student; how to design environments that don't degenerate into reward-hacked policies; and the specific class of failures that show up when you scale RL post-training compute without scaling environment quality at the same rate. He spends meaningful time on the Chinese-lab distillation pattern — the practice of taking a frontier closed model's outputs and using them to train a smaller open-weight model — and the technical-versus-policy questions that creates, riffing on Anthropic's recent distillation-attacks article. Reward hacking gets a long treatment: the canonical failure modes, how to design reward functions that are robust to the agent learning to game them, and why for many production use cases a slightly weaker but provably non-hacked reward function is the better engineering choice. The episode also touches on the cost economics — how the GPU-hours per RL training run have collapsed roughly an order of magnitude over the last twelve months as the algorithms have stabilized — which is the underlying reason that small teams can now realistically run this playbook on their own data.

#4

Uber torches 2026 AI budget on Claude Code in four months

Industry 2026-05-01 Hacker News — AI front page 7.6 7.5/7.5/7.5

Uber's CTO disclosed this week that the company exhausted its full 2026 annual AI tooling budget by April — burning through the entire allocation in four months on Claude Code and, secondarily, Cursor. The reporting puts per-engineer monthly API spend in the five-hundred to two-thousand dollar range, with adoption now at ninety-five percent of engineers using AI tools monthly and roughly seventy percent of committed code originating from AI-assisted workflows. Uber's annual R-and-D spend is approximately three-point-four billion dollars, so the AI-tooling line item is now a measurable fraction of total engineering cost and growing fast enough that leadership describes itself as "back to the drawing board" on how to budget for it going forward.

The single-vendor concentration is the more interesting structural data point. Uber rolled out Claude Code access to engineering in December 2025, usage doubled by February, and by April had effectively crowded out Cursor on the multi-step agentic coding surface — Cursor's per-engineer usage has plateaued while Claude Code's has continued to compound. The reading from inside Uber that the CTO reportedly shared internally is that the multi-step capability gap between the two products opened up rapidly through the first quarter and has not closed. For long-running edits, multi-file refactors, and the kind of dependency-spanning work that makes engineering productivity measurable rather than performative, Claude Code's longer effective context, better tool-use loop, and stronger plan-then-execute pattern have outweighed Cursor's IDE-integration advantages. This is consistent with the public benchmark trend on SWE-Bench Verified, where Claude has held the top position through Q1 2026 while Cursor's Composer has tracked second.

The broader implication is the budgeting question every large engineering organization is now staring at. If the value of AI tooling per engineer is provably positive — and the Uber numbers suggest it is, given that leadership's reaction is to re-budget rather than throttle — then the question becomes whether the variable cost can be contained at scale. At five-hundred to two-thousand dollars per engineer per month, an engineering organization the size of Uber's is looking at hundreds of millions of dollars of annualized AI tooling spend, which puts it on roughly the same order of magnitude as the company's existing developer-tooling line and forces the question of whether to negotiate enterprise contracts with explicit cost ceilings, build internal tooling to route only the highest-leverage requests through frontier models, or simply accept the new run-rate as the cost of staying competitive. The Uber data point is the cleanest public signal yet that for the largest engineering organizations the right operating model is not "experiment with AI tools" but "treat AI tools as a primary infrastructure cost" — which is a structurally different set of decisions and one that the industry has not yet fully absorbed.

#5

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 7.4 6.7/6.9/8.0

Themis introduces a multilingual code reward-model suite covering eight programming languages and five preference dimensions beyond functional correctness. The authors release Themis-CodeRewardBench (profiling 50+ existing RMs), Themis-CodePreference (350k+ pairs — the largest open code-preference corpus), and Themis-RM checkpoints from 600M to 32B. Experiments show positive scaling, strong cross-lingual transfer when training on diverse preferences, and that multi-criteria training is necessary for reliable code RM behavior. The release fills a long-standing gap where code RMs have lagged general-purpose RMs in both data and methodology.

#6

Marine Corps charts path for future aviation ops led by drones

Government & Defense 2026-05-01 DefenseScoop 7.32 7.5/8.0/6.0

The Marine Corps is targeting 2029 for operational testing of MUX TACAIR, its variant of the Air Force's Collaborative Combat Aircraft, alongside a broader fleet of autonomous wingman drones. Col. Richard Rusnok of the Cunningham Group framed the shift as comparable to the introduction of rotary-wing aircraft to the fleet in the 1950s. The service is also exploring how exquisite unmanned systems can fill electronic-attack, ISR, and logistics roles currently performed by crewed aircraft. The plan signals MARFORPAC and aviation leadership are accepting the structural drone-led aviation model rather than treating it as an adjunct to the crewed fleet.

#7

U.S. Vows to Fight Distillation Attacks

Safety, Policy & Regulation 2026-05-01 Lawfare (via Google News) 7.32 7.0/8.0/6.5

Lawfare summarizes the U.S. government response to the distillation-attack vector — the practice of using outputs from frontier closed models to train smaller open-weight (often Chinese-lab) models. The piece outlines proposed countermeasures including watermarking, output-rate-limiting on suspicious API access patterns, and export-control adjustments that would treat frontier model weights and inference access as dual-use technology requiring license review for high-volume foreign customers. The technical baseline is Anthropic's recently published distillation-attacks article. Substantive policy mechanisms remain unsettled and the piece flags significant tension with open-research norms.

#8

Spotify adds 'Verified' badges to distinguish human artists from AI

Generative Media 2026-05-01 Hacker News — AI front page 7.27 7.0/7.0/7.5

Spotify is rolling out a 'Verified by Spotify' badge — a green-checkmark text label appearing next to artist names that meet authenticity standards including linked social accounts, consistent listener activity, merchandise, and concert dates. Spotify says more than 99% of artists listeners actively search for will qualify (representing hundreds of thousands of artists), prioritizing acts with 'important contributions to music culture and history' over content farms. The move follows the platform's prior round of AI-generated content removals and is the most visible streaming-platform response yet to the consumer-music labeling problem.

#9

Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) · arXiv — Reinforcement Learning · arXiv — Evals & Benchmarks 6.83 6.0/6.9/7.0

[2605.00702v1] Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory Skip to main content Learn about arXiv becoming an independent nonprofit. We gratefully acknowledge support from the Simons Foundation, member institutions , and all contributors. Donate > cs > arXiv:2605.00702v1 Help | Advanced Search All fields Title Author Abstract Comments Journal reference ACM classification MSC classification Report number arXiv identifier DOI ORCID arXiv author ID Help pages Full text Search GO quick links Login Help Pages About --> Computer Science > Co

#10

NVIDIA's New AI Turns One Photo Into A World That Never Breaks

Generative Media 2026-05-03 Two Minute Papers 6.77 7.0/6.5/6.5

NVIDIA's New AI Turns One Photo Into A World That Never Breaks - YouTube About Press Copyright Contact us Creators Advertise Developers Terms Privacy Policy & Safety How YouTube works Test new features NFL Sunday Ticket © 2026 Google LLC

#11

Executive Branch AI and the Rule of Law: An Emerging Research Agenda

Safety, Policy & Regulation 2026-05-01 Lawfare (via Google News) 6.65 6.5/7.5/5.5

Executive Branch AI and the Rule of Law: An Emerging Research Agenda    Lawfare Executive Branch AI and the Rule of Law: An Emerging Research Agenda    Lawfare

#12

Paired-CSLiDAR: Height-Stratified Registration for Cross-Source Aerial-Ground LiDAR Pose Refinement

Generative Media 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv cs.RO (Robotics) · arXiv — Evals & Benchmarks 6.62 7.0/5.4/7.0

We introduce Paired-CSLiDAR (CSLiDAR), a cross-source aerial-ground LiDAR benchmark for single-scan pose refinement: refining a ground-scan pose within a 50 m-radius aerial crop. The benchmark contains 12,683 ground-aerial pairs across 6 evaluation sites and per-scan reference 6-DoF alignments for sub-meter root-mean-square error (RMSE) evaluation. Because aerial scans capture rooftops and canopy while ground scans capture facades and under-canopy, the two modalities share only a fraction of their geometry, primarily the terrain surface, causing standard registration methods and learned corres

#13

Senators take another swing at bill to codify federal AI resource

Safety, Policy & Regulation 2026-05-01 FedScoop — AI 6.6 6.5/7.5/5.5

The bipartisan CREATE AI Act has a House companion and seeks to establish the National Artificial Intelligence Research Resource following a pilot launched in 2023. The post Senators take another swing at bill to codify federal AI resource appeared first on FedScoop . The bipartisan CREATE AI Act has a House companion and seeks to establish the National Artificial Intelligence Research Resource following a pilot launched in 2023.

#14

AI Self-preferencing in Algorithmic Hiring: Empirical Evidence and Insights

Safety, Policy & Regulation 2026-05-02 Hacker News — AI front page 6.6 6.0/7.0/6.5

Article URL: https://arxiv.org/abs/2509.00462 Comments URL: https://news.ycombinator.com/item?id=47987256 Points: 328 # Comments: 178 Article URL: https://arxiv.org/abs/2509.00462 Comments URL: https://news.ycombinator.com/item?id=47987256 Points: 328 # Comments: 178

#15

Learning to Act and Cooperate for Distributed Black-Box Consensus Optimization

State Space Models 2026-05-01 arXiv cs.NE (Neural & Evolutionary Computing) · arXiv — Evals & Benchmarks 6.57 6.1000000000000005/5.3/8.0

Distributed blackbox consensus optimization is a fundamental problem in multi-agent systems, where agents must improve a global objective using only local objective queries and limited neighbor communication. Existing methods largely rely on handcrafted update rules and static cooperation patterns, which often struggle to balance local adaptation, global coordination, and communication efficiency in heterogeneous nonconvex environments. In this paper, we take an initial step toward trajectory-driven self-design for distributed black-box consensus optimization.

#16

RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning) · arXiv — Agents / Tool Use 6.53 5.7/6.3/7.0

Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunAgent, a multi-agent plan execution platform that interprets natural-language plans while enforcing stepwise execution through constraints and rubrics. RunAgent bridges the expressiveness of natural language with the determinism of programming via an agentic language with explicit control constructs (e.g., \texttt{IF}, \texttt{GOTO}, \texttt{FORALL}).

#17

Faithful Extreme Image Rescaling with Learnable Reversible Transformation and Semantic Priors

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv — Generative Media / Diffusion · arXiv — Efficiency (Quantization, MoE, Inference) 6.48 7.0/5.0/7.0

Most recent extreme rescaling methods struggle to preserve semantically consistent structures and produce realistic details, due to the severely ill-posed nature of low- to high-resolution mapping under scaling factors of $16\times$ or higher. To alleviate the above problems, we propose FaithEIR, a diffusion-based framework for extreme image rescaling. Inspired by singular value decomposition, we develop learnable reversible transformation that enables invertible downscaling and upscaling in the latent space.

#18

SAVGO: Learning State-Action Value Geometry with Cosine Similarity for Continuous Control

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) · arXiv — Reinforcement Learning · arXiv — Evals & Benchmarks 6.43 5.3/6.4/7.0

While representation and similarity learning have improved the sample efficiency of Reinforcement Learning (RL), they are rarely used to shape policy updates directly in the action space. To bridge this gap, a geometry-aware RL algorithm that explicitly incorporates value-based similarity into the policy update, State-Action Value Geometry Optimization (SAVGO), is proposed. In detail, SAVGO learns a joint state-action embedding space in which pairs with similar action-value estimates exhibit high cosine similarity, while dissimilar pairs are mapped to distinct directions.

#19

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) 6.43 6.7/5.5/6.5

A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105.

#20

Jailbreaking Vision-Language Models Through the Visual Modality

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) 6.4 5.7/5.9/7.0

The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) replacing harmful objects with benign substitutes (e.g., bomb -> banana) then prompting for harmful actions using the substitute term, (3) replacing harmful text in images (e.g., on book covers) with benign words while visual context preserves the original meaning, and (4) visual analogy puzzles whose solution requires i

#21

SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) · arXiv — Efficiency (Quantization, MoE, Inference) · arXiv — Evals & Benchmarks 6.4 6.0/5.6/7.0

AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mismatched to compound AI workloads, and propose a shift to program-level scheduling: treating the entire agent workflow (not individual inference calls) as the first-class schedulable unit. We present SAGA, a distributed scheduler that implements this abstraction through three mechanisms: (1) Agent Execution Graphs that

#22

Let ViT Speak: Generative Language-Image Pre-training

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 6.4 6.1000000000000005/4.8/8.0

In this paper, we present \textbf{Gen}erative \textbf{L}anguage-\textbf{I}mage \textbf{P}re-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) \textbf{Simplicity}: a single transformer jointly

#23

When RAG Chatbots Expose Their Backend: An Anonymized Case Study of Privacy and Security Risks in Patient-Facing Medical AI

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language) · arXiv — Reinforcement Learning · arXiv — Post-training / Alignment 6.37 6.7/4.8/7.0

Background: Patient-facing medical chatbots based on retrieval-augmented generation (RAG) are increasingly promoted to deliver accessible, grounded health information. AI-assisted development lowers the barrier to building them, but they still demand rigorous security, privacy, and governance controls. Objective: To report an anonymized, non-destructive security assessment of a publicly accessible patient-facing medical RAG chatbot and identify governance lessons for safe deployment of generative AI in health.

#24

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv cs.AI (Artificial Intelligence) · arXiv — Reinforcement Learning · arXiv — Efficiency (Quantization, MoE, Inference) · arXiv — Evals & Benchmarks 6.35 5.7/5.9/7.0

Graphical User Interface (GUI) grounding maps natural language instructions to the visual coordinates of target elements and serves as a core capability for autonomous GUI agents. Recent reinforcement learning methods (e.g., GRPO) have achieved strong performance, but they rely on expensive multiple rollouts and suffer from sparse signals on hard samples. These limitations make on-policy self-distillation (OPSD), which provides dense token-level supervision from a single rollout, a promising alternative.

#25

Hierarchical Abstract Tree for Cross-Document Retrieval-Augmented Generation

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 6.33 6.1/5.3/7.0

Retrieval-augmented generation (RAG) enhances large language models with external knowledge, and tree-based RAG organizes documents into hierarchical indexes to support queries at multiple granularities. However, existing Tree-RAG methods designed for single-document retrieval face critical challenges in scaling to cross-document multi-hop questions: (1) poor distribution adaptability, where $k$-means clustering introduces noise due to rigid distribution assumptions; (2) structural isolation, as tree indexes lack explicit cross-document connections; and (3) coarse abstraction, which obscures f

#26

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv — Generative Media / Diffusion 6.32 5.4/5.1/8.0

Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation.

#27

Acting Navy Secretary Hung Cao’s early I&S shakeup sparks questions, concerns

Government & Defense 2026-05-02 DefenseScoop 6.32 6.5/7.0/5.0

Sources said recent personnel changes and reversal of a reorganization effort launched by Cao’s predecessor, John Phelan, have left some officials "in shock." The post Acting Navy Secretary Hung Cao’s early I&S shakeup sparks questions, concerns appeared first on DefenseScoop . Sources said recent personnel changes and reversal of a reorganization effort launched by Cao’s predecessor, John Phelan, have left some officials "in shock." The post Acting Navy Secretary Hung Cao’s early I&S shakeup sparks questions, concerns appeared first on DefenseScoop . Questions are swirling after the N

#28

Dutch startup Intelic sets up drone marketplace for European militaries

Government & Defense 2026-05-04 C4ISRNET 6.27 6.5/6.5/5.5

BASE will allow defense ministries to explore systems that are ready to be used in a coalition framework, with interoperability guaranteed, says the firm. BASE will allow defense ministries to explore systems that are ready to be used in a coalition framework, with interoperability guaranteed, says the firm. PARIS — Dutch defense-technology startup Intelic said it set up a European military drone marketplace that brings together drone manufacturers from nine European countries, in a bid to speed up procurement by allowing militaries to compare various available unmanned systems.

#29

Specsmaxxing – On overcoming AI psychosis, and why I write specs in YAML

AI Coding 2026-05-03 Hacker News — AI front page 6.27 6.0/5.5/7.0

Article URL: https://acai.sh/blog/specsmaxxing Comments URL: https://news.ycombinator.com/item?id=47994012 Points: 265 # Comments: 278 Article URL: https://acai.sh/blog/specsmaxxing Comments URL: https://news.ycombinator.com/item?id=47994012 Points: 265 # Comments: 278

#30

AI uses less water than the public thinks

Safety, Policy & Regulation 2026-05-01 Hacker News — AI front page 6.27 5.5/6.0/7.0

Article URL: https://californiawaterblog.com/2026/04/26/ai-water-use-distractions-and-lessons-for-california/ Comments URL: https://news.ycombinator.com/item?id=47977383 Points: 405 # Comments: 384 Article URL: https://californiawaterblog.com/2026/04/26/ai-water-use-distractions-and-lessons-for-california/ Comments URL: https://news.ycombinator.com/item?id=47977383 Points: 405 # Comments: 384

#31

LLM-Oriented Information Retrieval: A Denoising-First Perspective

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language) · arXiv — Agents / Tool Use 6.2 5.7/5.3/7.0

Modern information retrieval (IR) is no longer consumed primarily by humans but increasingly by large language models (LLMs) via retrieval-augmented generation (RAG) and agentic search. Unlike human users, LLMs are constrained by limited attention budgets and are uniquely vulnerable to noise; misleading or irrelevant information is no longer just a nuisance, but a direct cause of hallucinations and reasoning failures. In this perspective paper, we argue that denoising-maximizing usable evidence density and verifiability within a context window-is becoming the primary bottleneck across the full

#32

Make Your LVLM KV Cache More Lightweight

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) · arXiv — Efficiency (Quantization, MoE, Inference) · arXiv — Evals & Benchmarks 6.13 6.0/4.8/7.0

Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings.

#33

Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) · arXiv — Reinforcement Learning 6.13 5.7/5.1/7.0

Safety is a primary challenge in real-world reinforcement learning (RL). Formulating safety requirements as state-wise constraints has become a prominent paradigm. Handling state-wise constraints with the Lagrangian method requires a distinct multiplier for every state, necessitating neural networks to approximate them as a multiplier network.

#34

ControBench: An Interaction-Aware Benchmark for Controversial Discourse Analysis on Social Networks

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 6.13 5.0/5.8/7.0

Understanding how people argue across ideological divides online is important for studying political polarization, misinformation, and content moderation. Existing datasets capture only part of this problem: some preserve text but ignore interaction structure, some model structure without rich semantics, and others represent conversations without stable user-level ideological identity. We introduce ControBench, a benchmark for controversial discourse analysis that combines heterogeneous social interaction graphs with rich textual semantics.

#35

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) · arXiv — Evals & Benchmarks 6.1 6.4/6.3/5.0

Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating progress: they are often narrow in scope, quickly saturated, and rarely updated. This makes it hard to compare models reliably and track progress over time. Instead, we need evaluation platforms: continuously maintained systems that run, aggregate, and analyze evaluations across many benchmarks to give a comprehensive picture of model performance within a broad domain.

#36

From Prediction to Practice: A Task-Aware Evaluation Framework for Blood Glucose Forecasting

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 6.1 6.3/6.4/5.0

Clinical time-series forecasting is increasingly studied for decision support, yet standard aggregate metrics can obscure whether a model is actually useful for the task it is meant to serve. In safety-critical settings, low average error can coexist with dangerous failures in exactly the high-risk regimes that matter most. We present a task-aware evaluation framework for blood glucose forecasting built around two downstream uses: hypoglycemia early warning and insulin dosing decision support.

#37

Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 6.07 6.3/6.3/5.0

While Large Language Models have achieved notable success on formal mathematics benchmarks such as MiniF2F, it remains unclear whether these results stem from genuine logical reasoning or semantic pattern matching against pre-training data. This paper identifies Architectural Reasoning: the ability to synthesize formal proofs using exclusively local axioms and definitions within an alien math domain, as the necessary ability for future automated theorem discovery AI. We use the Obfuscated Natural Number Game, a benchmark to evaluate Architectural Reasoning.

#38

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) · arXiv — Evals & Benchmarks 6.0 6.0/6.4/5.0

As Large Language Models (LLMs) are increasingly deployed in cross-linguistic contexts, ensuring safety in diverse regulatory and cultural environments has become a critical challenge. However, existing multilingual benchmarks largely rely on general risk taxonomies and machine translation, which confines guardrail models to these predefined categories and hinders their ability to align with region-specific regulations and cultural nuances. To bridge these gaps, we introduce ML-Bench, a policy-grounded multilingual safety benchmark covering 14 languages.

#39

Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation

Research 2026-05-01 arXiv stat.ML (Statistical ML) · arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) · arXiv — Reinforcement Learning 6.0 5.0/5.4/7.0

For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We also define the class of multipattern risk-averse problems that generalizes the class of linear systems. We use both concepts in a feature-based $Q$-learning method with multipattern $Q$-factor approximation and we prove a high-probability regret bound of $\mathcal{O}\big(H^2 N^H \sqrt{ K}\big)$, where $H$ is the horizon, $N$ is the mini-batch size, and $K$ is the number of episodes.

#40

BlenderRAG: High-Fidelity 3D Object Generation via Retrieval-Augmented Code Synthesis

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) 6.0 5.0/5.4/7.0

Automatic generation of executable Blender code from natural language remains challenging, with state-of-the-art LLMs producing frequent syntactic errors and geometrically inconsistent objects. We present BlenderRAG, a retrieval-augmented generation system that operates on a curated multimodal dataset of 500 expert-validated examples (text, code, image) across 50 object categories. By retrieving semantically similar examples during generation, BlenderRAG improves compilation success rates from 40.8% to 70.0% and semantic normalized alignment from 0.41 to 0.77 (CLIP similarity) across four stat

#41

Can Coding Agents Reproduce Findings in Computational Materials Science?

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language) · arXiv — Agents / Tool Use · arXiv — Evals & Benchmarks 5.97 5.0/5.3/7.0

Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benchmarks. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not only strong coding ability, but also the ability to navigate complex, domain-specific procedures and to interpret results in the context of scientific claims. To address this question, we present AutoMat, a benchmark for evaluating LLM-based agents' ability to reproduce claims from computational materials science.

#42

Structure Liberates: How Constrained Sensemaking Produces More Novel Research Output

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language) · arXiv — AI for Science 5.97 5.3/5.0/7.0

Scientific discovery is an extended process of ideation--surveying prior work, forming hypotheses, and refining reasoning--yet existing approaches treat this phase as a brief preamble despite its central role in research. We introduce SCISENSE, a sensemaking-grounded framework that operationalizes ideation as a structured sequence of eight cognitive stages (Pirolli \& Card, 2005). We construct SCISENSE-Traj, a 100K-scale dataset of citation-conditioned research trajectories in two modes: Target, where an LLM reconstructs the ideation path leading to a known paper from its cited works, and Infe

#43

A11y-Compressor: A Framework for Enhancing the Efficiency of GUI Agent Observations through Visual Context Reconstruction and Redundancy Reduction

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language) · arXiv — Evals & Benchmarks 5.97 5.0/5.3/7.0

AI agents that interact with graphical user interfaces (GUIs) require effective observation representations for reliable grounding. The accessibility tree is a commonly used text-based format that encodes UI element attributes, but it suffers from redundancy and lacks structural information such as spatial relationships among elements. We propose A11y-Compressor, a framework that transforms linearized accessibility trees into compact and structured representations.

#44

Budget Constraints as Riemannian Manifolds

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) · arXiv — Efficiency (Quantization, MoE, Inference) 5.93 6.4/5.8/5.0

Assigning one of K options to each of N groups under a total cost budget is a recurring problem in machine learning, appearing in mixed-precision quantization, non-uniform pruning, and expert selection. The objective (model loss) depends jointly on all assignments and does not decompose across groups, which prevents combinatorial solvers from optimizing the true objective directly and limits them to proxy objectives. Evolutionary search evaluates the actual loss but lacks gradient information, while penalty-based methods provide gradients but enforce the budget only approximately and require s

#45

Multi-frame Restoration for High-rate Lissajous Confocal Laser Endomicroscopy

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 5.9 5.7/6.4/5.0

Lissajous confocal laser endomicroscopy (CLE) is a promising solution for high speed in vivo optical biopsy for handheld scenarios. However, Lissajous scanning traces a resonant trajectory and samples only the visited pixels per frame; at high frame rates, many pixels remain unvisited, creating structured holes. In this work, we introduce the first benchmark for high-rate Lissajous CLE, consisting of low-quality video clips paired with high-quality reference images.

#46

Position: agentic AI orchestration should be Bayes-consistent

Research 2026-05-01 arXiv stat.ML (Statistical ML) · arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) · arXiv — Agents / Tool Use 5.87 5.0/5.0/7.0

LLMs excel at predictive tasks and complex reasoning tasks, but many high-value deployments rely on decisions under uncertainty, for example, which tool to call, which expert to consult, or how many resources to invest. While the usefulness and feasibility of Bayesian approaches remain unclear for LLM inference, this position paper argues that the control layer of an agentic AI system (that orchestrates LLMs and tools) is a clear case where Bayesian principles should shine. Bayesian decision theory provides a framework for agentic systems that can help to maintain beliefs over task-relevant la

#47

Catalyzing scientific impact through global partnerships and open resources

AI for Science 2026-05-01 Google AI Blog 5.87 5.5/6.0/5.5

Data Mining & Modeling Data Mining & Modeling

#48

From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

Research 2026-05-02 Hugging Face Daily Papers 5.83 5.4/5.0/6.5

Join the discussion on this paper page

#49

Affordance Agent Harness: Verification-Gated Skill Orchestration

Generative Media 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv cs.RO (Robotics) · arXiv — Evals & Benchmarks 5.82 4.7/5.3/7.0

Affordance grounding requires identifying where and how an agent should interact in open-world scenes, where actionable regions are often small, occluded, reflective, and visually ambiguous. Recent systems therefore combine multiple skills (e.g., detection, segmentation, interaction-imagination), yet most orchestrate them with fixed pipelines that are poorly matched to per-instance difficulty, offer limited targeted recovery from intermediate errors, and fail to reuse experience from recurring objects. These failures expose a systems problem: test-time grounding must acquire the right evidence

#50

AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) · arXiv — Efficiency (Quantization, MoE, Inference) 5.8 6.3/5.5/5.0

Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet, current approaches are ineffective for 4-bit activations and 8-bit gradients, which would easily cause slow convergence or accuracy loss. To address this, we introduce AGoQ, incorporating two new techniques: 1) a layer-aware activation quantization algorithm that allocates appropriate bit-widths for activations of various layers based on their types and pipeline stages to achieve near 4-bit activation storage, and 2) a gradient quantization algorithm that reduces memory usage and

#51

Map2World: Segment Map Conditioned Text to 3D World Generation

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) 5.8 6.1000000000000005/4.5/6.5

3D world generation is essential for applications such as immersive content creation or autonomous driving simulation. Recent advances in 3D world generation have shown promising results; however, these methods are constrained by grid layouts and suffer from inconsistencies in object scale throughout the entire world. In this work, we introduce a novel framework, Map2World, that first enables 3D world generation conditioned on user-defined segment maps of arbitrary shapes and scales, ensuring global-scale consistency and flexibility across expansive environments.

#52

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) · arXiv — Evals & Benchmarks 5.77 5.4/6.3/5.0

Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedural execution, where models are given a step-wise arithmetic algorithm and two numeric inputs, and must return the final computed value. The benchmark uses simple arithmetic operations but increases complexity through algorithm length and look-back dependencies over intermediate variables.

#53

FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) · arXiv — Evals & Benchmarks 5.77 5.3/6.4/5.0

Large language models (LLMs) are increasingly applied in financial scenarios. However, they may produce harmful outputs, including facilitating illegal activities or unethical behavior, posing serious compliance risks. To systematically evaluate LLM safety in finance, we propose FinSafetyBench, a bilingual (English-Chinese) red-teaming benchmark designed to test an LLM's refusal of requests that violate financial compliance.

#54

SC-Taxo: Hierarchical Taxonomy Generation under Semantic Consistency Constraints using Large Language Models

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) · arXiv — Evals & Benchmarks 5.77 5.3/6.4/5.0

Scientific literature is expanding at an unprecedented pace, making it increasingly challenging to efficiently organize and access domain knowledge. A high-quality scientific taxonomy offers a structured and hierarchical representation of a research field, facilitating literature exploration and topic navigation, as well as enabling downstream applications such as trend analysis, idea generation, and information retrieval. However, existing taxonomy generation approaches often suffer from structural inconsistencies and semantic misalignment across hierarchical levels.

#55

Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 5.77 5.4/6.3/5.0

SFT and RLVR represent two fundamental yet distinct paradigms for LLM post-training, each excelling in distinct dimensions. SFT expands knowledge breadth while RLVR enhances reasoning depth. Yet integrating these complementary strengths remains a formidable challenge.

#56

AnalogRetriever: Learning Cross-Modal Representations for Analog Circuit Retrieval

Research 2026-05-02 Hugging Face Daily Papers 5.77 5.4/4.8/6.5

Join the discussion on this paper page

#57

Online Self-Calibration Against Hallucination in Vision-Language Models

Research 2026-05-02 Hugging Face Daily Papers 5.77 5.7/4.5/6.5

Join the discussion on this paper page

#58

Weisfeiler Lehman Test on Combinatorial Complexes: Generalized Expressive Power of Topological Neural Networks

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 5.7 5.7/5.8/5.0

Combinatorial complexes have unified set-based (e.g., graphs, hypergraphs) and part-whole (e.g., simplicial, cellular complexes) structures into a common topological framework. Existing topological neural networks and Weisfeiler-Lehman variants remain fragmented, lacking a unified theoretical foundation for topological deep learning. In this work, we introduce the Combinatorial Complex Weisfeiler-Lehman (CCWL) test, an axiomatic-style extension of the WL test to combinatorial complexes.

#59

Vesselpose: Vessel Graph Reconstruction from Learned Voxel-wise Direction Vectors in 3D Vascular Images

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 5.7 5.7/5.8/5.0

Blood vessel segmentation and -tracing are essential tasks in many medical imaging applications. Although numerous methods exist, the prevailing segment-then-fix paradigm is fundamentally limited regarding its suitability for modeling the task of complete and topologically accurate vascular network reconstruction. Here, we propose an approach to extract topologically more accurate vascular graphs from 3D image data, building upon highly successful ideas from the related biomedical tasks of cell segmentation and -tracking.

#60

Adaptive Querying with AI Persona Priors

Research 2026-05-01 arXiv stat.ML (Statistical ML) · arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning) · arXiv — Post-training / Alignment 5.7 5.0/4.5/7.0

We study adaptive querying for learning user-dependent quantities of interest, such as responses to held-out items and psychometric indicators, within tight question budgets. Classical Bayesian design and computerized adaptive testing typically rely on restrictive parametric assumptions or expensive posterior approximations, limiting their use in heterogeneous, high-dimensional, and cold-start settings. We introduce a persona-induced latent variable model that represents a user's state through membership in a finite dictionary of AI personas, each offering response distributions produced by a

#61

Possibilistic Predictive Uncertainty for Deep Learning

Generative Media 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 5.7 4.7/4.8/7.0

Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliable epistemic uncertainty modelling. Existing methods for uncertainty modelling face a fundamental dilemma: Bayesian approaches provide principled estimates but remain computationally prohibitive, while efficient second-order predictors lack rigorous derivations connecting their specific objectives to epistemic uncertainty quantification. To resolve this dilemma, we introduce Dirichlet-approximated possibilistic posterior predictions (DAPPr), a principled fra

#62

"What Are You Really Trying to Do?": Co-Creating Life Goals from Everyday Computer Use

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language) · arXiv — Agents / Tool Use 5.7 5.0/4.5/7.0

Recent advances in user modeling make it feasible to conduct open-ended inference over a person's everyday computer use. Despite longstanding visions of systems that deeply understand our actions and the purposes they serve in our lives, existing systems only capture what a person is doing in the moment -- not why they are doing it -- limiting these systems to surface-level support. We introduce striving co-creation, a process for inferring broader life goals from unstructured observations of computer use.

#63

HyCOP: Hybrid Composition Operators for Interpretable Learning of PDEs

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 5.67 5.0/6.4/5.0

We introduce HyCOP, a modular framework that learns parametric PDE solution operators by composing simple modules (advection, diffusion, learned closures, boundary handling) in a query-conditioned way. Rather than learning a monolithic map, HyCOP learns a policy over short programs - which module to apply and for how long - conditioned on regime features and state statistics. Modules may be numerical sub-solvers or learned components, enabling hybrid surrogates evaluated at arbitrary query times without autoregressive rollout.

#64

Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

Research 2026-05-02 Hugging Face Daily Papers 5.67 5.4/4.5/6.5

Join the discussion on this paper page

#65

End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Research 2026-05-02 Hugging Face Daily Papers 5.67 5.4/4.5/6.5

Join the discussion on this paper page

#66

Posterior Augmented Flow Matching

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv — Reinforcement Learning · arXiv — Generative Media / Diffusion · arXiv — Post-training / Alignment · arXiv — Evals & Benchmarks 5.65 4.7/4.8/7.0

Flow matching (FM) trains a time-dependent vector field that transports samples from a simple prior to a complex data distribution. However, for high-dimensional images, each training sample supervises only a single trajectory and intermediate point, yielding an extremely sparse and high-variance training signal. This under-constrained supervision can cause flow collapse, where the learned dynamics memorize specific source-target pairings, mapping diverse inputs to overly similar outputs, failing to generalize.

#67

Static and Dynamic Graph Alignment Network for Temporal Video Grounding

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.63 5.7/5.9/5.0

Temporal Video Grounding (TVG) aims to localize temporal moments in an untrimmed video that semantically correspond to given natural language queries. Recently, Graph Convolutional Networks (GCN) have been widely adopted in TVG to model temporal relations among video clips and enhance contextual reasoning by constructing clip-level graphs. Despite their effectiveness, existing GCN-based TVG methods encounter three critical bottlenecks: 1) Most methods construct graph nodes using either static or dynamic features alone, resulting in incomplete visual representation and overlooking complementary

#68

Gradient Regularized Newton Boosting Trees with Global Convergence

Research 2026-05-01 arXiv stat.ML (Statistical ML) · arXiv cs.LG (Machine Learning) 5.6 6.7/4.5/5.0

Gradient Boosting Decision Trees (GBDTs) dominate tabular machine learning, with modern implementations like XGBoost, LightGBM, and CatBoost being based on Newton boosting: a second-order descent step in the space of decision trees. Despite its empirical success, the global convergence of Newton boosting is poorly understood compared to first-order boosting. In this paper, we introduce Restricted Newton Descent, which studies convex optimization with Newton's method on Hilbert spaces with inexact iterates, based on the concepts of cosine angle and weak gradient edge.

#69

A FedRAMP strategy for solving the cyber talent shortage

Government & Defense 2026-05-01 FedScoop — AI 5.6 5.5/6.5/4.5

A landing zone where SaaS providers can deploy their apps is one way federal environments can spur innovation and attract more high-caliber cyber talent. The post A FedRAMP strategy for solving the cyber talent shortage appeared first on FedScoop . A landing zone where SaaS providers can deploy their apps is one way federal environments can spur innovation and attract more high-caliber cyber talent.

#70

Decentralized Proximal Stochastic Gradient Langevin Dynamics

Research 2026-05-01 arXiv stat.ML (Statistical ML) · arXiv cs.LG (Machine Learning) 5.53 5.7/5.3/5.0

We propose Decentralized Proximal Stochastic Gradient Langevin Dynamics (DE-PSGLD), a decentralized Markov chain Monte Carlo (MCMC) algorithm for sampling from a log-concave probability distribution constrained to a convex domain. Constraints are enforced through a shared proximal regularization based on the Moreau-Yosida envelope, enabling unconstrained updates while preserving consistency with the target constrained posterior. We establish non-asymptotic convergence guarantees in the 2-Wasserstein distance for both individual agent iterates and their network averages.

#71

Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge

Robotic Autonomy 2026-05-01 arXiv cs.RO (Robotics) · arXiv cs.LG (Machine Learning) 5.53 5.7/5.3/5.0

Scaling laws for Large Language Models (LLMs) establish that model quality improves with computational scale, yet edge deployment imposes strict constraints on compute, memory, and power. Since General Matrix Multiplication (GEMM) accounts for up to 90\% of inference time, efficient GEMM acceleration is critical for edge AI. The Adaptive Intelligent Engines available in the AMD Versal adaptive SoCs are well suited for this task, but existing state-of-the-art (SOTA) frameworks maximize performance through spatial scaling, distributing workloads across hundreds of cores -- an approach that fails

#72

Bridging Graph Drawing and Dimensionality Reduction with Stochastic Stress Optimization

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 5.53 4.7/6.3/5.0

Both Dimensionality Reduction (DR) and Graph Drawing (GD) aim to visualize abstract, non-linear structures, yet rely on different optimization paradigms. This contrast is evident in Multidimensional Scaling (MDS), which typically depends on the SMACOF algorithm despite graph drawing results showing that simpler stochastic optimization schemes can be more effective for the same objective. We bridge these domains by adapting Stochastic Gradient Descent (SGD) techniques from graph drawing to vector data embedding.

#73

Lt. Gen. Doug Schiess nominated as next chief of space operations

Government & Defense 2026-05-01 DefenseScoop 5.48 5.5/6.0/4.5

"If confirmed, I will focus on sharpening our lethality and accelerating the delivery of space capabilities to the warfighter, keeping the Space Force ahead against any adversary," Schiess said. Doug Schiess nominated as next chief of space operations appeared first on DefenseScoop . "If confirmed, I will focus on sharpening our lethality and accelerating the delivery of space capabilities to the warfighter, keeping the Space Force ahead against any adversary," Schiess said.

#74

EGREFINE: An Execution-Grounded Optimization Framework for Text-to-SQL Schema Refinement

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) · arXiv — Evals & Benchmarks 5.47 5.0/5.8/5.0

Text-to-SQL enables non-expert users to query databases in natural language, yet real-world schemas often suffer from ambiguous, abbreviated, or inconsistent naming conventions that degrade model accuracy. Existing approaches treat schemas as fixed and address errors downstream. In this paper, we frame schema refinement as a constrained optimization problem: find a renaming function that maximizes downstream Text-to-SQL execution accuracy while preserving query equivalence through database views.

#75

Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) · arXiv — Mechanistic Interpretability 5.47 5.0/5.8/5.0

Probing is widely used to study which features can be decoded from language model representations. However, the common decoding probe approach has two limitations that we aim to solve with our new encoding probe approach: contributions of different features to model representations cannot be directly compared, and feature correlations can affect probing results. We present an Encoding Probe that reverses this direction and reconstructs internal representations of models using interpretable features.

#76

Empowering Heterogeneous Graph Foundation Models via Decoupled Relation Alignment

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks 5.45 5.0/5.9/5.0

While Graph Foundation Models (GFMs) have achieved remarkable success in homogeneous graphs, extending them to multi-domain heterogeneous graphs (MDHGs) remains a formidable challenge due to cross-type feature shifts and intra-domain relation gaps. Existing global feature alignment methods (PCA or SVD) enforce a shared feature space blindly, which distorts type-specific semantics and disrupts original topologies, inevitably leading to "Type Collapse" and "Relation Confusion". To address these fundamental limitations, we propose Decoupled relation Subspace Alignment (DRSA), a novel, plug-and-pl

#77

GMGaze: MoE-Based Context-Aware Gaze Estimation with CLIP and Multiscale Transformer

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.43 5.7/5.3/5.0

Gaze estimation methods commonly use facial appearances to predict the direction of a person gaze. However, previous studies show three major challenges with convolutional neural network (CNN)-based, transformer-based, and contrastive language-image pre-training (CLIP)-based methods, including late fusion of image features, lack of factor-aware conditioning, and impractical capacity scaling. To address these challenges, we propose Globally-conditioned Multi-scale Gaze estimation (GMGaze), which leverages a multi-scale transformer architecture.

#78

Meritocratic Fairness in Budgeted Combinatorial Multi-armed Bandits via Shapley Values

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) 5.43 5.7/5.0/5.0

We propose a new framework for meritocratic fairness in budgeted combinatorial multi-armed bandits with full-bandit feedback (BCMAB-FBF). Unlike semi-bandit feedback, the contribution of individual arms is not received in full-bandit feedback, making the setting significantly more challenging. To compute arm contributions in BCMAB-FBF, we first extend the Shapley value, a classical solution concept from cooperative game theory, to the $K$-Shapley value, which captures the marginal contribution of an agent restricted to a set of size at most $K$.

#79

Federated Distillation for Whole Slide Image via Gaussian-Mixture Feature Alignment and Curriculum Integration

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv — Efficiency (Quantization, MoE, Inference) 5.42 5.7/5.1/5.0

Federated learning (FL) offers a promising framework for collaborative digital pathology by enabling model training across institutions. However, real-world deployments face heterogeneity arising from diverse multiple instance learning (MIL) architectures and heterogeneous feature extractors across institutions. We propose FedHD, a novel FL framework that performs local Gaussian-mixture feature alignment tailored for WSI analysis.

#80

Recovering Hidden Reward in Diffusion-Based Policies

Robotic Autonomy 2026-05-01 arXiv cs.RO (Robotics) · arXiv — Reinforcement Learning 5.42 5.7/5.1/5.0

This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. We establish that under maximum-entropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training. Formally, we prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds.

#81

Learning Multimodal Energy-Based Model with Multimodal Variational Auto-Encoder via MCMC Revision

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) 5.37 6.0/4.5/5.0

Energy-based models (EBMs) are a flexible class of deep generative models and are well-suited to capture complex dependencies in multimodal data. However, learning multimodal EBM by maximum likelihood requires Markov Chain Monte Carlo (MCMC) sampling in the joint data space, where noise-initialized Langevin dynamics often mixes poorly and fails to discover coherent inter-modal relationships. Multimodal VAEs have made progress in capturing such inter-modal dependencies by introducing a shared latent generator and a joint inference model.

#82

Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) 5.37 6.0/4.5/5.0

Inferring cellular trajectories from destructive snapshots is complicated by the challenges of stochasticity and non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions.

#83

Class Angular Distortion Index for Dimensionality Reduction

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) · arXiv — Post-training / Alignment 5.37 5.0/5.5/5.0

Dimensionality reduction (DR) techniques are often characterized by whether they preserve global, high-level structures in the data or local, neighborhood structures. This distinction matters in visualization: global methods can obscure clusters while local methods can over-emphasize them. Yet, even when clusters appear distinct, their relative arrangement in the projection may be arbitrary or misleading, a common issue in techniques such as t-SNE and UMAP.

#84

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv cs.AI (Artificial Intelligence) 5.35 5.3/5.3/5.0

While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retri

#85

InpaintSLat: Inpainting Structured 3D Latents via Initial Noise Optimization

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv cs.AI (Artificial Intelligence) 5.28 5.3/5.1/5.0

We present a training-free approach for controllable 3D inpainting based on initial noise optimization. In the structured 3D latent diffusion framework, we observe that the underlying geometric structure is established during the early stages of the diffusion process and exhibits high sensitivity to the initial noise. Such characteristics compromise stability in tasks like inpainting and editing, where the model must ensure strict alignment with the existing context while synthesizing a new structure.

#86

Linking Behaviour and Perception to Evaluate Meaningful Human Control over Partially Automated Driving

Robotic Autonomy 2026-05-01 arXiv cs.RO (Robotics) · arXiv cs.AI (Artificial Intelligence) 5.28 5.0/5.4/5.0

Partial driving automation creates a tension: drivers remain legally responsible for vehicle behaviour, yet their active control is significantly reduced. This reduction undermines the engagement and sense of agency needed to intervene safely. Meaningful human control (MHC) has been proposed as a normative framework to address this tension.

#87

Randomized Subspace Nesterov Accelerated Gradient

Research 2026-05-01 arXiv stat.ML (Statistical ML) · arXiv cs.LG (Machine Learning) 5.27 5.7/4.5/5.0

Randomized-subspace methods reduce the cost of first-order optimization by using only low-dimensional projected-gradient information, a feature that is attractive in forward-mode automatic differentiation and communication-limited settings. While Nesterov acceleration is well understood for full-gradient and coordinate-based methods, obtaining accelerated methods for general subspace sketches that use only projected-gradient information and can improve over full-dimensional Nesterov acceleration in oracle complexity is technically nontrivial. We develop randomized-subspace Nesterov accelerated

#88

Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.27 5.7/4.8/5.0

Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, particularly against the extreme gradient updates common in noisy settings.

#89

Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv — Efficiency (Quantization, MoE, Inference) 5.25 5.3/5.0/5.0

The operational landscape of local Large Language Model (LLM) inference has shifted from lightweight models to datacenter-class weights exceeding 70B parameters, creating profound systems challenges for consumer hardware. This paper presents a systematic empirical analysis of the Nvidia and Apple Silicon ecosystems, specifically characterizing the distinct intra-architecture trade-offs required to deploy these massive models. On the Nvidia Blackwell architecture, we identify a critical "Backend Dichotomy" within the TensorRT-LLM stack: while the new NVFP4 quantization format delivers a 1.6x th

#90

EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) 5.23 5.0/5.1/5.0

Federated Multimodal Learning (FML) trains multimodal models across decentralized clients while keeping their image-text pairs private. However, joint embedding training entangles forgotten knowledge across both modalities and client gradient subspaces, hindering federated unlearning. Previous federated unlearning approaches neither sever the cross-modal reconstruction channel mediated by bilinear coupling nor separate forget-exclusive update directions from those shared with retained clients.

#91

DMDSC: A Dynamic-Margin Deep Simplex Classifier for Open-Set Recognition on Medical Image Datasets

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.17 5.4/4.8/5.0

Medical imaging datasets are often characterized by extreme class imbalances, where rare pathologies are significantly underrepresented compared to common conditions. This imbalance poses a dual challenge for Open-Set Recognition (OSR): models must maintain high classification accuracy on known classes while reliably rejecting unknown samples unseen during training in the clinical settings. While recently proposed Deep Simplex Classifier (DSC)~\cite{cevikalp2024reaching} and UnCertainty-aware Deep Simplex Classifier (UCDSC)~\cite{Aditya_2026_WACV} successfully leverage Neural Collapse to ensur

#92

AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) 5.17 5.7/4.2/5.0

Fine-tuning LLMs is necessary for various dedicated downstream tasks, but classic backpropagation-based fine-tuning methods require substantial GPU memory. To this end, a recent work, MeZO, which relies solely on forward passes to fine-tune LLMs, significantly reduces GPU requirements at the cost of slower convergence due to its indifference to loss landscapes. Standard solutions, such as Adam, explore loss landscapes by estimating the first- and second-order moments and storing them in memory to guide the model's movement through dimensions with lower curvature and vice versa.

#93

To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use 5.15 4.7/5.3/5.0

Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM decision: whether to call or not call a tool, when performing a task.

#94

Spiking Sequence Machines and Transformers

State Space Models 2026-05-01 arXiv cs.NE (Neural & Evolutionary Computing) · arXiv cs.LG (Machine Learning) 5.13 5.0/4.8/5.0

Sequence learning reduces to similarity-based retrieval over a temporally indexed representation space, a constraint on any sequence model, not a property of a specific architecture. We show that a spiking Sparse Distributed Memory sequence machine (2007) and the transformer (2017) independently instantiate the same five functional operations (encoding, context maintenance, associative retrieval, storage, and decoding), with cosine similarity as the shared retrieval primitive in both. We formalise a Phase-Latency Isomorphism showing that sinusoidal positional phase and spike timing are linearl

#95

Show HN: GhostBox – Borrow a disposable little machine from the Global Free Tier

Industry 2026-05-01 Hacker News — AI front page 5.1 4.5/4.5/6.0

I built this because I was always creating machines on GH actions to test builds on different OS, and I wanted a tight CLI that could do it. I always saw Actions as this great resources and ephemeral machines you could do dev work in just were a natural way for me to work, so this grew out of that workflow. I didn't expect it to blow up, so it wasn't 100% finished when I posted it.

#96

Unsupervised Denoising of Real Clinical Low Dose Liver CT with Perceptual Attention Networks

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv cs.AI (Artificial Intelligence) 5.08 5.0/4.8/5.0

With the development of deep learning, medical image processing has been widely used to assist clinical research. This paper focuses on the denoising problem of low-dose computed tomography using deep learning. Although low-dose computed tomography reduces radiation exposure to patients, it also introduces more noise, which may interfere with visual interpretation by physicians and affect diagnostic results.

#97

Robust Fusion of Object-Level V2X for Learned 3D Object Detection

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) · arXiv cs.RO (Robotics) 5.08 5.3/4.5/5.0

Perception for automated driving is largely based on onboard environmental sensors, such as cameras and radar, which are cost-effective but limited by line-of-sight and field-of-view constraints. These inherent limitations may cause onboard perception to fail under occlusions or poor visibility conditions. In parallel, cooperative awareness via vehicle-to-everything (V2X) communication is becoming increasingly available, enabling vehicles and infrastructure to share their own state as object-level information that complements onboard perception.

#98

Instance-Aware Parameter Configuration in Bilevel Late Acceptance Hill Climbing for the Electric Capacitated Vehicle Routing Problem

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks 5.08 5.0/4.8/5.0

Algorithm performance in combinatorial optimization is highly sensitive to parameter settings, while a single globally tuned configuration often fails to exploit the heterogeneity of instances. This limitation is particularly evident in the Electric Capacitated Vehicle Routing Problem, where instances differ in structure, demand patterns, and energy constraints. This paper investigates instance-aware parameter configuration for Bilevel Late Acceptance Hill Climbing, a state-of-the-art metaheuristic for the Electric Capacitated Vehicle Routing Problem.

#99

PEACE: Cross-modal Enhanced Pediatric-Adult ECG Alignment for Robust Pediatric Diagnosis

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) 5.07 6.0/6.1/2.5

Automated pediatric electrocardiogram (ECG) diagnosis remains challenging because models trained predominantly on adult data suffer from substantial cross-population mismatch, while pediatric labels are often scarce. We present PEACE (Pediatric-Adult ECG Alignment via Cross-modal Enhancement), a structured cross-modal alignment framework for adult-to-pediatric ECG transfer. PEACE integrates tri-axial clinical semantic decomposition, label-query feature extraction, and curriculum-gated optimization to align transferable adult ECG representations with pediatric diagnostic targets.

#100

Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts

State Space Models 2026-05-01 arXiv cs.NE (Neural & Evolutionary Computing) · arXiv cs.LG (Machine Learning) 5.03 5.0/4.5/5.0

Sparse MoE routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 +/- 0.001 probability to the correct expert at the transition. Three lightweight gate modifications raise this to 0.748 +/- 0.002 (124x), cutting experts needed for 99% coverage from infeasible to a small constant: temporal memory (beta), a per-expert LIF membrane potential accumulating routing context across tokens; precision-weighted gating (Pi), a per-expert inverse variance

#101

Directed Social Regard: Surfacing Targeted Advocacy, Opposition, Aid, Harms, and Victimization in Online Media

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language) 5.03 5.0/4.5/5.0

The language in online platforms, influence operations, and political rhetoric frequently directs a mix of pro-social sentiment (e.g., advocacy, helpfulness, compassion) and anti-social sentiment (e.g., threats, opposition, blame) at different topics, all in the same message. While many natural language processing (NLP) tools classify or score a text's overall sentiment as positive, neutral, or negative, these tools cannot report that positive and negative sentiments coexist, and they cannot report the target of those sentiments. This paper presents the Directed Social Regard (DSR) approach to

#102

Fairness of Classifiers in the Presence of Constraints between Features

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) 5.03 5.0/4.5/5.0

In Machine Learning, an accepted definition of fairness of a decision taken by a classifier is that it should not depend on protected features, such as gender. Unfortunately, when constraints exist between features, such dependencies can be obscured by the constraints. To avoid this problem, we propose that a decision be considered fair if it has a fair explanation.

#103

Knowing when to trust machine-learned interatomic potentials

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) 5.03 5.7/6.3/2.5

Prevailing machine-learned interatomic potential (MLIP) uncertainty-quantification methods rely on ensembles of independently trained backbones. These methods scale unfavorably with foundation-scale MLIPs, and their member-disagreement signals correlate weakly with per-molecule prediction error. Here we probe the frozen per-atom representations of a pretrained MLIP with a compact discriminative classifier, recasting MLIP uncertainty quantification as selective classification rather than error regression.

#104

Scaling Laws: An EU-perspective on America’s Approach to AI with Marietje Schaake

Safety, Policy & Regulation 2026-05-01 Lawfare (via Google News) 4.98 5.0/6.0/3.5

Scaling Laws: An EU-perspective on America’s Approach to AI with Marietje Schaake    Lawfare Scaling Laws: An EU-perspective on America’s Approach to AI with Marietje Schaake    Lawfare

#105

CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) 4.83 6.3/5.4/2.5

The proliferation of advanced AI video synthesis techniques poses an unprecedented challenge to digital video authenticity. Existing AI-generated video (AIGV) detection methods primarily focus on uni-modal or spatiotemporal artifacts, but they overlook the rich cues within the visual-textual cross-modal space, especially the temporal stability of semantic alignment. In this work, we identify a distinctive fingerprint in AIGVs, termed cross-modal temporal artifact (CMTA).

#106

FedKPer: Tackling Generalization and Personalization in Medical Federated Learning via Knowledge Personalization

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) 4.83 5.0/6.4/2.5

Federated learning (FL) holds great potential for medical applications. However, statistical heterogeneity across healthcare institutions poses a major challenge for FL, as the global model struggles both to generalize across unseen patient populations and to adapt to the unique data distributions of individual hospitals. This heterogeneity also exacerbates forgetting at both the global and local level, resulting in previous learned patient patterns to be misclassified after model updates.

#107

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) 4.83 5.3/6.1/2.5

Large Language Model (LLM) Red-Teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Finding effective and diverse attacks in red-teaming is important, but achieving both is challenging. Generative Flow Networks (GFNs) that perform distribution matching are a promising methods, but they are notorious for training instability and mode collapse.

#108

The Political Limits of China’s AI Diffusion Ambitions

Safety, Policy & Regulation 2026-05-03 Lawfare (via Google News) 4.82 5.0/5.5/3.5

The Political Limits of China’s AI Diffusion Ambitions    Lawfare The Political Limits of China’s AI Diffusion Ambitions    Lawfare

#109

Generating Statistical Charts with Validation-Driven LLM Workflows

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) 4.8 5.0/6.3/2.5

Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely provide fully aligned artifacts, such as executable code, dataset context, and question-answer pairs. We present a structured LLM-based workflow that decomposes chart generation into dataset screening, plot proposal, code synthesis, rendering, validation-driven refinement, description generation, and question-answer generation.

#110

NonZero: Interaction-Guided Exploration for Multi-Agent Monte Carlo Tree Search

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) 4.8 5.3/6.0/2.5

Monte Carlo Tree Search (MCTS) scales poorly in cooperative multi-agent domains because expansion must consider an exponentially large set of joint actions, severely limiting exploration under realistic search budgets. We propose NonZero, which keeps multi-agent MCTS tractable by running surrogate-guided selection over a low-dimensional nonlinear representation using an interaction-guided proposal rule, instead of directly exploring the full joint-action space. Our exploration uses an interaction score: single-agent deviations are ranked by predicted gain, while two-agent deviations are scored

#111

Characterizing the Expressivity of Local Attention in Transformers

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) 4.77 5.7/5.5/2.5

The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generating the next token. One common variant of attention is called local attention, which restricts each token to aggregating information from a bounded window of predecessors, reducing the quadratic cost of global attention to linear.

#112

Unlearning Offline Stochastic Multi-Armed Bandits

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) 4.77 5.7/5.5/2.5

Machine unlearning aims to unlearn data points from a learned model, offering a principled way to process data-deletion requests and mitigate privacy risks without full retraining. Prior work has mainly studied unsupervised / supervised machine unlearning, leaving unlearning for sequential decision-making systems far less understood. We initiate the first study of a foundational sequential decision-making problem: offline stochastic multi-armed bandits (MAB).

#113

Learning Coarse-to-Fine Osteoarthritis Representations under Noisy Hierarchical Labels

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) 4.63 6.0/5.1/2.5

Knee osteoarthritis (OA) assessment involves a natural but often underused label hierarchy: a coarse binary OA decision and a fine-grained Kellgren--Lawrence (KL) severity grade. Existing deep learning studies commonly treat these targets as separate classification problems, either reducing OA assessment to disease presence or directly optimizing noisy ordinal KL labels. In this work, we ask whether this clinical hierarchy can serve as a representation-level supervisory prior.

#114

H-RAG at SemEval-2026 Task 8: Hierarchical Parent-Child Retrieval for Multi-Turn RAG Conversations

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) 4.63 5.0/5.8/2.5

We present H-RAG, our submission to SemEval-2026 Task 8 (MTRAGEval), addressing both Task A (Retrieval) and Task C (Generation with Retrieved Passages). Task A evaluates standalone retrieval quality, while Task C assesses end-to-end retrieval-augmented generation (RAG) in multi-turn conversational settings, requiring both accurate answer generation and faithful grounding in retrieved evidence. Our approach implements a hierarchical parent-child RAG pipeline that separates fine-grained child-level retrieval from parent-level context reconstruction during generation.

#115

Learning the Helmholtz equation operator with DeepONet for non-parametric 2D geometries

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) 4.63 5.0/5.8/2.5

This paper deals with solving the 2D Helmholtz equation on non-parametric domains, leveraging a physics-informed neural operator network based on the DeepONet framework. We consider a 2D square domain with an inclusion of arbitrary boundary geometry at its center. This inclusion acts as a scatterer for an incoming harmonic wave.

#116

Quantum Interval Bound Propagation for Certified Training of Quantum Neural Networks

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) 4.63 5.0/5.8/2.5

Quantum machine learning is a promising field for efficiently learning features of a dataset to perform a specified task, such as classification. Interval bound propagation (IBP) is a popular certified training method in classical machine learning, where the lower and upper bounds are tracked throughout the model. These bounds are used during training to ensure that the model is certified to predict the correct label even under adversarial perturbations.

#117

Temporal Data Requirement for Predicting Unplanned Hospital Readmissions

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) 4.63 5.0/5.8/2.5

With the proliferation of Electronic Health Records (EHRs), a critical challenge in building predictive models is determining the optimal historical data time window to maximize accuracy. This study investigates the impact of various observation windows ranging from the day of surgery to three years prior on predicting 30-day readmission following hip and knee arthroplasties. The dataset encompasses both structured encounter records (over 4 million) and unstructured clinical notes (80,000) from 7,174 patients.

#118

Scale-Aware Adversarial Analysis: A Diagnostic for Generative AI in Multiscale Complex Systems

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) 4.63 5.0/5.8/2.5

Complex physical systems, from supersonic turbulence to the macroscopic structure of the universe, are governed by continuous multiscale dynamics. While modern machine learning architectures excel at mapping the high-dimensional observables of these systems, it remains unclear whether they internalize the governing physical laws or merely interpolate discrete statistical correlations. Standard Explainable AI (XAI) architectures, particularly perturbation-based and gradient-saliency methods, rely on pixel-wise perturbations, which generate unphysical artifacts and push inputs off the valid empi

#119

Foundation AI Models for Aerosol Optical Depth Estimation from PACE Satellite Data

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) 4.6 5.7/5.3/2.5

Aerosol Optical Depth (AOD) retrieval is essential for Earth observation, supporting applications from air quality monitoring to climate studies. Conventional physics-based AOD retrieval methods formulate the problem as a pixel-wise inversion, relying on radiative transfer modeling, memory-intensive look-up tables, and auxiliary meteorological data. While recent data-driven approaches have shown promise, many fail to exploit the spatial-spectral coherence of hyperspectral imagery, leading to spatially inconsistent and noise-sensitive retrievals.

#120

Scalable Context-Aware Graph Attention for Unsupervised Anomaly Detection in Large-Scale Mobile Networks

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) 4.58 6.3/4.5/2.5

Mobile network operators must monitor thousands of heterogeneous network elements across the radio access network and the packet core, each exposing high-dimensional KPI time series. The scale and cost of incident labelling make supervised approaches impractical, motivating unsupervised anomaly detection robust to context shifts and nonstationarity. We propose \textbf{C-MTAD-GAT} (\emph{Context-aware Multivariate Time-series Anomaly Detection with Graph Attention}), an anomaly detection framework designed to operate as a single shared model across large populations of network elements.

#121

MSACT: Multistage Spatial Alignment for Stable Low-Latency Fine Manipulation

Robotic Autonomy 2026-05-01 arXiv cs.RO (Robotics) 4.55 5.3/5.4/2.5

Real-world fine manipulation, particularly in bimanual manipulation, typically requires low-latency control and stable visual localization, while collecting large-scale data is costly and limited demonstrations may lead to localization drift. Existing approaches make different trade-offs: action-chunking policies such as ACT enable low-latency execution and data efficiency but rely on dense visual features without explicit spatial consistency, generative methods such as Diffusion Policy improve expressiveness but can incur iterative sampling latency, vision-language-action and voxel-based meth

#122

Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) 4.53 5.7/5.1/2.5

Unsupervised deraining has attracted attention for its ability to learn the real-world distribution of rain without paired supervision. However, the lack of strong constraints makes it difficult for the network to converge, especially with the complex diversity of rain degradation. A key motivation is that high-quality deraining results occasionally emerge during training, which can be leveraged to guide the optimization process.

#123

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) 4.53 5.0/5.5/2.5

We investigate the extent to which cosine similarity between paragraph embeddings is invariant under machine translation, using the Manifesto Corpus of over 2,800 political party platforms in 28 languages translated to English via the EU eTranslation service. Rather than measuring translation-induced semantic shift directly we measure the stability of pairwise similarity relationships across embedding models, and use inter-model disagreement on original-language text as a calibrated invariance threshold. This yields a per-language non-inferiority test for four hypotheses about how translation

#124

Surprisal Minimisation over Goal-directed Alternatives Predicts Production Choice in Dialogue

Frontier LLMs 2026-05-01 arXiv cs.CL (Computation & Language) 4.53 5.0/5.5/2.5

We model utterance production as probabilistic cost-sensitive choice over contextual alternatives, using information-theoretic notions of cost. We distinguish between goal-directed alternatives that realise a fixed communicative intent and goal-agnostic alternatives defined only by contextual plausibility, allowing us to derive speaker- and listener-oriented interpretations of different cost measures. We present a procedure to generate both types of alternative sets using language models.

#125

Observable Performance Does Not Fully Reflect System Organization: A Multi-Level Analysis of Gait Dynamics Under Occlusal Constraint

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) 4.53 5.0/5.5/2.5

In biomechanical systems, observable performance is often used as a proxy for underlying system organization. However, this assumption implicitly presumes a correspondence between output metrics and internal system states that may not hold in adaptive systems. In this study, the vertical dimension of occlusion (VDO) is considered as a constraint applied to an adaptive neuromechanical system, enabling the exploration of system-level responses under controlled variations.

#126

Deep Kernel Learning for Stratifying Glaucoma Trajectories

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) 4.53 5.0/5.5/2.5

Effectively stratifying patient risk in chronic diseases like glaucoma is a major clinical challenge. Clinicians need tools to identify patients at high risk of progression from sparse and irregularly-sampled electronic health records (EHRs). We propose a novel deep kernel learning (DKL) architecture that leverages a Gaussian Process (GP) backend.

#127

Towards Improving Speaker Distance Estimation through Generative Impulse Response Augmentation

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) 4.45 5.3/5.1/2.5

The Room Acoustics and Speaker Distance Estimation (SDE) Challenge at ICASSP 2025 explores the effectiveness of augmented room impulse response (RIR) data for improving SDE model performance. This challenge at GenDARA involves generating RIRs to supplement sparse datasets and fine-tuning SDE models with the augmented data. We employ the open-source fast diffuse room impulse response generator (FastRIR) conditioned only on speaker and listener locations.

#128

Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) 4.43 5.7/4.8/2.5

The systemic, metabolic, lifestyle factors have established associations with Alzheimer's Disease (AD) through epidemiologic and AD-specific biomarker studies. Whether colored fundus photography (CFP) contains retinal structural signatures corresponding to these AD-related risk domains remains unclear. To determine whether deep learning (DL) models can predict 12 AD-related risk factors from CFP and to characterize the retinal structures underlying these predictions, thereby assessing whether CFP reflects pathways to AD vulnerability.

#129

Aitchison Embeddings for Learning Compositional Graph Representations

Frontier LLMs 2026-05-01 arXiv cs.LG (Machine Learning) 4.43 4.7/5.5/2.5

Representation learning is central to graph machine learning, powering tasks such as link prediction and node classification. However, most graph embeddings are hard to interpret, offering limited insight into how learned features relate to graph structure. Many networks naturally admit a role-mixture view, where nodes are best described as mixtures over latent archetypal factors.

#130

High-Speed Vision Improves Zero-Shot Semantic Understanding of Human Actions

Robotic Autonomy 2026-05-01 arXiv cs.RO (Robotics) 4.42 5.0/5.3/2.5

Understanding human actions from visual observations is essential for human--robot interaction, particularly when semantic interpretation of unfamiliar or hard-to-annotate actions is required. In scenarios such as rapid and less common activities, collecting sufficient labeled data for supervised learning is challenging, making zero-shot approaches a practical alternative for semantic understanding without task-specific training. While recent advances in large-scale pretrained models enable such zero-shot reasoning, the impact of temporal resolution, especially for rapid and fine-grained motio

#131

PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) 4.37 5.3/5.0/2.5

Image editing instructions are heterogeneous: a color swap, an object insertion, and a physical-action edit all demand different spatial coverage and different reasoning depth, yet existing reasoning-based editors apply a single fixed inference recipe to every instruction. We argue that adaptivity along both the spatial and temporal axes is the missing degree of freedom, and we present PhysEdit, an editing framework built around this principle. PhysEdit introduces two inference-time modules that compose without retraining the backbone.

#132

GeoContra: From Fluent GIS Code to Verifiable Spatial Analysis with Geography-Grounded Repair

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) 4.35 5.3/4.8/2.5

Reliable spatial analysis in GIScience requires preserving coordinate semantics, topology, units, and geographic plausibility. Current LLM-based GIS systems generate fluent scripts but rarely enforce these geographic rules at scale. We present GeoContra, a verification and repair framework for LLM-driven Python GIS workflows.

#133

Exploring the Limits of End-to-End Feature-Affinity Propagation for Single-Point Supervised Infrared Small Target Detection

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) 4.23 5.4/4.5/2.5

Single-point supervised infrared small target detection (IRSTD) drastically reduces dense annotation costs. Current state-of-the-art (SOTA) methods achieve high precision by recovering mask supervision through explicit, offline pseudo-label construction, such as multi-stage active learning and physics-driven mask generation. In this paper, we study a minimalist alternative: generating point-to-mask supervision online through in-batch, point-anchored feature-affinity propagation.

#134

Modeling Subjective Urban Perception with Human Gaze

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) 4.2 5.0/4.8/2.5

Urban perception describes how people subjectively evaluate urban environments, shaping how cities are experienced and understood. Existing computational approaches primarily model urban perception directly from street view images, but largely ignore the human perceptual process through which such judgments are formed. In this paper, we introduce Place Pulse-Gaze, an urban perception dataset that augments street view images with synchronized eye-tracking recordings and individual perception labels.

#135

Quantum Gradient-Based Approach for Edge and Corner Detection Using Sobel Kernels

Multimodal 2026-05-01 arXiv cs.CV (Computer Vision) 4.2 5.0/4.8/2.5

Edge detection refers to identifying points in a digital image where intensity changes sharply, indicating object boundaries or structural features. Corners are locations where gray-level intensity changes abruptly in multiple directions and are widely used in feature extraction, object tracking, and 3D modeling. In this study, we present a quantum implementation of Sobel-based edge detection and Harris-style corner detection.

#136

Born-Qualified: An Autonomous Framework for Deploying Advanced Energy and Electronic Materials

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) 4.15 5.0/4.5/2.5

Autonomous science is transforming how we discover materials and chemical systems for advanced energy technologies. However, many initially promising systems never reach deployment. This "valley of death" stems from optimization that prioritizes laboratory metrics over industrial viability.

#137

AI Washing Inflates Expected Performance but Not Interaction Outcomes: An AI Placebo Study Using Fitts' Law

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) 4.15 5.0/4.5/2.5

Expectations about the support of artificial intelligence (AI) may influence interaction outcomes similar to placebos. Such expectations may result from AI washing, a practice of overstating a system's AI capabilities when actual functionality is limited. For example, some computer mice are marketed as "AI-assisted" despite lacking AI in core functions.

#138

Space Network of Experts: Architecture and Expert Placement

Research 2026-05-01 arXiv cs.AI (Artificial Intelligence) 4.15 5.0/4.5/2.5

Leveraging continuous solar energy harvesting at high efficiency, space data centers are envisioned as a promising platform for executing energy-intensive large language models (LLMs). Recognizing this advantage, space and AI conglomerates (e.g., SpaceX, Google) are actively investing in this vision. One key challenge, however, is the efficient distributed deployment of a large-scale LLM in a satellite network due to the limited onboard computing and communication resources.

#139

Recursive Maximum Likelihood Estimation for Interacting Particle Systems using Virtual Particles

Research 2026-05-01 arXiv stat.ML (Statistical ML) 4.1 5.0/4.5/2.5

We study recursive maximum likelihood estimation for stochastic interacting particle systems based on continuous observation of a single particle. In this regime, consistent estimation of the finite-particle log-likelihood is not possible, even in the limit as the number of particles $N\rightarrow\infty$ and the time horizon $t\rightarrow\infty$. We thus seek to optimise the stationary log-likelihood of the limiting mean-field system.