
Wolf Digest — Thursday, April 30, 2026

Coverage window: 2026-04-29 03:27 ET – 2026-04-30 03:18 ET
13m 57s · top-4 narrated briefing
#1 · Industry
Sources: Anthropic could raise a new $50B round at a valuation of $900B
TechCrunch is reporting, citing sources familiar with the matter, that Anthropic has received multiple pre-emptive offers for a new financing round in the $850 billion to $900 billion valuation range, with a target raise of around $50 billion. If it lands at the upper end, that w…
8.7 · 1 src
#2 · Infrastructure
Building the compute infrastructure for the Intelligence Age
OpenAI published a research-page essay framing Stargate — its multi-site data-center program with Oracle, SoftBank, and a roster of regional partners — as the operational core of how the company plans to scale toward AGI. The post is light on novel numbers and heavy on positionin…
8.2 · 1 src
#3 · Evaluations & Benchmarks
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM, presented in a new paper that surfaced through arXiv's robotics, vision, and AI feeds along with Hugging Face's Daily Papers, is an attempt to unify two streams of work that have been moving in parallel for the last year: world-model video synthesis and direct robotic acti…
8.0 · 6 srcs
#1
Industry 2026-04-30 TechCrunch — AI 8.7 9.5/8.5/8.5

TechCrunch is reporting, citing sources familiar with the matter, that Anthropic has received multiple pre-emptive offers for a new financing round in the $850 billion to $900 billion valuation range, with a target raise of around $50 billion. If it lands at the upper end, that would put Anthropic within striking distance of OpenAI's most recent valuation marks and would represent a roughly 7×–10× revaluation from where the company sat on its previous primary in 2024. The round is being characterized as pre-emptive — investors approached the company rather than the other way around — and the size puts it in the same bucket as the largest private financings in technology history.

The number is striking against Anthropic's reported revenue trajectory. Public estimates over the last twelve months have placed Anthropic's annualized run rate somewhere in the $4–7 billion range, dominated by Claude API revenue, Claude Code, and a fast-growing enterprise contract book. A $900 billion mark implies multiples that only make sense if buyers are pricing in a continued 3–5× per-year revenue ramp and an eventual stable share of the frontier-LLM market — and if they believe Claude's trajectory in coding and agentic workflows persists into the next training cycle. The same investors are presumably modeling Claude Opus 4.7 (released this month) and a likely Opus 5 generation later this year as the products that bear out those assumptions.
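The implied-multiple arithmetic above can be sanity-checked in a few lines. This is a minimal sketch using only the ranges reported in the story (a $900B mark against a $4–7B run rate and a hypothetical 4×/year ramp), not any disclosed figures; the function names are illustrative.

```python
# Sanity-check the implied revenue multiples behind a $900B mark.
# Inputs are the reported ranges from the story, not disclosed numbers.

def implied_multiple(valuation_b: float, run_rate_b: float) -> float:
    """Valuation over annualized revenue, both in $B."""
    return valuation_b / run_rate_b

def forward_multiple(valuation_b: float, run_rate_b: float,
                     growth_per_year: float, years: int) -> float:
    """Multiple against revenue projected forward at a constant yearly ramp."""
    return valuation_b / (run_rate_b * growth_per_year ** years)

print(implied_multiple(900, 4))                   # 225x trailing at the low-end run rate
print(round(implied_multiple(900, 7), 1))         # ~128.6x at the high end
print(round(forward_multiple(900, 7, 4, 2), 1))   # ~8.0x after two years of a 4x ramp
```

The point of the exercise: a triple-digit trailing multiple compresses to single digits within two years only if the ramp the story describes actually holds.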

The capital, if raised, points squarely at compute. Anthropic's training and serving spend is the bulk of its cost base, and the partnerships disclosed earlier this month — the expanded $50B+ AWS commitment for Trainium-class capacity, the Google–Broadcom collaboration on TPU access, and the NEC build in Japan — all imply a forward training spend that dwarfs current revenue. A $50 billion equity round materially extends Anthropic's runway against that build-out and meaningfully reduces the chance the company has to take on debt or convert future capacity commitments into dilution.

The signal to the rest of the industry is the harder thing to read. If Anthropic prints at $900B, it cements a market structure in which two private labs (OpenAI, Anthropic) and one or two public-cloud-aligned labs (Google DeepMind, possibly xAI) carry valuations and capital pools an order of magnitude beyond every other frontier player. That reshapes the talent market, the compute-allocation negotiation with TSMC and the hyperscalers, and the willingness of GPU-rich nation states to back national-champion labs as counterweights. Worth tracking whether the round closes near the floated number or compresses on the way to a final term sheet — a $50B round at $700B is a meaningfully different signal than the same round at $900B.

#2
Infrastructure 2026-04-29 OpenAI Research 8.2 8.5/8.5/7.0

OpenAI published a research-page essay framing Stargate — its multi-site data-center program with Oracle, SoftBank, and a roster of regional partners — as the operational core of how the company plans to scale toward AGI. The post is light on novel numbers and heavy on positioning: it lays out a mental model in which compute capacity is the binding constraint on capability progress, sets out a plan for adding "multiple gigawatts" of new training and serving capacity over the coming year, and asserts that frontier capability gains and product reach are now coupled to data-center construction in a way that can't be wished away with software-only optimizations.

The substantive content concentrates on three things. First, the post confirms continued buildout of the Abilene, Texas Stargate campus alongside additional U.S. sites in coordination with Oracle, plus the previously announced UAE expansion and new partnerships across Asia. Second, it sketches a power-procurement strategy that mixes long-term PPAs for clean firm power, behind-the-meter natural gas, and direct relationships with utilities to expedite grid interconnect timelines that would otherwise gate construction. Third, it positions Stargate's $500 billion+ committed capex as both training infrastructure and the inference base for hundreds of millions of free-tier and enterprise users — pre-empting the criticism that the largest training runs cannot economically justify their dedicated capacity by reframing the same hardware as serving infrastructure for the consumer footprint.

What's notable about the essay is the rhetorical shift. OpenAI has historically described compute as a means; this post is the clearest articulation yet that, internally, the company treats compute capacity as the objective function to be maximized — and reasoning, agents, and product features as the natural consequences of having the substrate. That mirrors the message Anthropic's public communications have been leaning into for the last six months, and the equivalent framing Google DeepMind has offered on Alphabet's earnings calls. It also sets the framing for the simultaneous Anthropic financing reporting, the Microsoft–OpenAI exclusivity unwind, and the Q1 cloud earnings prints from Google and AWS that all landed within the same 48-hour window: the entire frontier conversation has rotated from model architectures to gigawatts and grid interconnect.

The post does not commit to a specific GPU count, training-cluster size, or compute-vs-revenue ratio, and it does not disclose unit economics. Anyone trying to forecast 2026–2027 frontier compute supply will still need to reconstruct from public hyperscaler capex disclosures, Nvidia and Broadcom shipment guidance, and energy-permit filings. But the public framing is now unambiguous: OpenAI sees the next two years' capability frontier as being decided in concrete pours and substation upgrades, not in clever post-training tricks.

#3
Evaluations & Benchmarks 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv cs.RO (Robotics) · arXiv cs.CV (Computer Vision) · arXiv — Generative Media / Diffusion · arXiv — Evals & Benchmarks · Hugging Face Daily Papers 8.0 8.0/6.9/9.0

X-WAM, presented in a new paper that surfaced through arXiv's robotics, vision, and AI feeds along with Hugging Face's Daily Papers, is an attempt to unify two streams of work that have been moving in parallel for the last year: world-model video synthesis and direct robotic action prediction. Prior unified models such as UWM operate in 2D pixel space and consistently force a tradeoff — either you get high-fidelity rollouts that are too slow to act on, or you get fast action decoding from a model that hallucinates physics. X-WAM resolves that tradeoff by predicting multi-view RGB-D videos rather than flat pixels, which gives the model a depth representation that's grounded in geometry, and by adding what the authors call Asynchronous Noise Sampling — a denoising schedule that lets the model emit actions after a small number of diffusion steps while continuing to sharpen the corresponding video for full-resolution scene reconstruction.

The architectural move that makes this work is small but interesting. X-WAM reuses the final blocks of a pretrained video Diffusion Transformer to spawn a depth-prediction branch, rather than training a fresh depth head from scratch. That means the spatial reasoning rides on top of the visual priors already baked into a strong video DiT, and the additional parameter count is modest. ANS, the asynchronous denoising contribution, applies a specialized schedule at inference time so action tokens are decoded under a low-step regime while video tokens run the full denoising chain. Crucially, training samples from the joint distribution of action and video timesteps rather than fully decoupling them, which keeps the inference distribution aligned with the training distribution — an alignment that earlier asymmetric-rollout schemes have struggled with.
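The asymmetric schedule described above can be sketched concretely. The step counts, function names, and the constraint tying the two timesteps together are illustrative assumptions, not values from the paper — the point is only the shape of the mechanism: actions finish denoising early while video runs the full chain, and training samples the two timesteps jointly rather than independently.

```python
import random

K_ACTION = 4    # actions are fully decoded after a handful of steps
K_VIDEO = 50    # video tokens run the full denoising chain

def inference_schedule():
    """Yield (step, update_actions, update_video): the action branch
    finishes early while video keeps sharpening to full resolution."""
    for step in range(K_VIDEO):
        yield step, step < K_ACTION, True

def sample_training_timesteps(rng: random.Random):
    """Jointly sample (t_action, t_video), with the action branch never
    noisier than the video branch — an illustrative stand-in for the
    joint action/video timestep distribution the paper trains under."""
    t_video = rng.random()
    t_action = rng.random() * t_video
    return t_action, t_video
```

Sampling the pair jointly (rather than fully decoupling them) is what keeps the training distribution aligned with this asymmetric inference regime.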

The model is pretrained on over 5,800 hours of robotic data and evaluated on RoboCasa and RoboTwin 2.0. Reported success rates are 79.2% on RoboCasa and 90.7% on RoboTwin 2.0, both ahead of the prior unified baselines, while the visual generations match or exceed the dedicated 4D world models the team compares against. The combined result — fast policy decoding plus high-fidelity simulation rolling out of the same network — is the kind of capability that has been nominally promised by the world-model line of work for three years and has consistently failed to materialize at deployable speeds.

The paper landed on arXiv this morning and was almost immediately picked up by Hugging Face's Daily Papers, which placed it in the multi-source bucket for today's run. The right way to read it is alongside Physical Intelligence's recent π-series releases and the broader push from robot-learning groups to converge VLA models, world models, and policy heads into a single trainable object. If X-WAM's results replicate at scale, the gap between robot simulation and robot deployment narrows by another notch — and the unified-model design choice (one network, joint training) becomes harder to argue against.

How it was discussed
  • Cross-listed in 3 arXiv categorical feeds: arXiv cs.AI, arXiv cs.RO, arXiv cs.CV.
  • Matched topical feeds: Generative Media / Diffusion, Evals & Benchmarks — wide thematic overlap.
  • Hugging Face Daily Papers picked it up — community-curated visibility signal.
cs.AI cs.RO cs.CV
#4
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv — Efficiency (Quantization, MoE, Inference) · arXiv — Evals & Benchmarks · Hugging Face Daily Papers 7.9 8.3/6.3/9.0

TIDE is the first systematic framework for cross-architecture distillation of diffusion language models — the kind of distillation where teacher and student differ not only in size but in the underlying generation paradigm itself. Within-architecture distillation for diffusion LLMs has been studied for about a year and the techniques are well understood; what has remained open is whether you can distill a diffusion LLM down to a much smaller dense or MoE student that runs on different attention machinery and even a different tokenizer. TIDE's authors argue this gap is the binding constraint on practical deployment of diffusion LLMs because the existing teacher families are 8B+ dense or 16B MoE, both of which are too large for most edge or low-latency serving budgets.

The framework decomposes into three modular components. TIDAL is the cross-timestep distillation strength controller — it modulates the loss weight as a function of both training progress and the diffusion timestep, on the principle that the teacher's predictive reliability depends on the noise level at which it's queried. CompDemo enriches the teacher's context with complementary mask splits, which gives it a stronger signal under heavy masking conditions where diffusion LLMs tend to degrade. Reverse CALM is the cross-tokenizer alignment objective: it inverts the chunk-level likelihood matching used in earlier work to produce bounded gradients and to filter noise from both ends of the alignment, which the authors describe as the technical move that makes the cross-tokenizer setting tractable at all.
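The TIDAL idea — loss weight as a function of both training progress and diffusion timestep — can be illustrated with a toy controller. The functional form below (exponential reliability decay with noise, linear warm-up over training) is an invented placeholder consistent with the stated principle, not the paper's actual schedule.

```python
import math

# Illustrative TIDAL-style controller: down-weight the distillation loss
# at heavy-noise timesteps (where the teacher is less reliable) and ramp
# trust in the teacher over training. Exact form is an assumption.

def tidal_weight(progress: float, timestep: float,
                 sharpness: float = 4.0) -> float:
    """progress, timestep in [0, 1]; timestep=1 is heaviest noise."""
    teacher_reliability = math.exp(-sharpness * timestep)
    ramp = min(1.0, 2.0 * progress)  # warm up over the first half of training
    return teacher_reliability * ramp
```

Early in training at heavy noise the weight is near zero; late in training at light noise it approaches one — the qualitative behavior the component is described as providing.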

The headline empirical claim is that distilling 8B dense and 16B MoE teachers down to a 0.6B student through two heterogeneous pipelines beats baseline by an average of 1.53 points across eight benchmarks, with the largest gains concentrated in code generation. HumanEval climbs from 32.3 with the autoregressive baseline to 48.78 with the TIDE-distilled diffusion student — a ~50% relative improvement that's hard to dismiss. The result implies that the parallel-decoding and bidirectional-context advantages of diffusion LLMs survive aggressive size compression as long as the distillation is done correctly across the architectural boundary.

This matters for the broader frontier-LLM conversation because it provides a credible path to deploying diffusion LLMs at the scale that autoregressive LLMs already serve at. Diffusion language modeling has been a slow-burn research thread; its inference throughput advantages have been documented but the parameter inefficiency has been real, and the distillation literature has been the gating factor on whether you could get a sub-billion-parameter diffusion model that matters. TIDE's HumanEval result suggests that a small diffusion student can outperform an autoregressive baseline of comparable size, which inverts the prior conventional wisdom about which paradigm dominates at the edge.

How it was discussed
  • Cross-listed in 3 arXiv categorical feeds: arXiv cs.LG, arXiv cs.CL, arXiv cs.AI.
  • Matched topical feeds: Efficiency (Quantization, MoE, Inference), Evals & Benchmarks — wide thematic overlap.
  • Hugging Face Daily Papers picked it up — community-curated visibility signal.
cs.LG cs.CL cs.AI
#5
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv — Reinforcement Learning · arXiv — Evals & Benchmarks · Hugging Face Daily Papers 7.8 8.0/6.3/9.0

ClawGym is a scalable framework for building what the authors call "Claw-style" personal agents — agents that operate over local files, tool calls, and persistent workspace state, the modal computer-use deployment surface. The framework targets the gap that's been frustrating practical claw/agent development since the GPT-4 era: there's no shared infrastructure for synthesizing verifiable training data, integrating it with agent training, and running diagnostic evaluation. Every group rolls its own pipeline and the resulting agents underperform because the data and the training loop are starved of signal.

The contribution comes in three pieces. ClawGym-SynData is a 13.5K-task dataset synthesized through persona-driven intent generation paired with skill-grounded operations, where each task is matched to a realistic mock workspace and verified through a hybrid mechanism that combines automated checks with LLM judges. The persona-and-skill combinatorial generation is the move that gets to coverage at scale — instead of hand-curating tasks, the framework samples from intent distributions plausible for given personas and from operation distributions over the available tool set. ClawGym-Agents is the family of trained models, produced first via supervised fine-tuning on rollouts and then refined with a lightweight reinforcement-learning pipeline that parallelizes rollouts across per-task sandboxes — a system design choice that's necessary to make the RL phase tractable since claw tasks involve real filesystem and tool side effects. ClawGym-Bench is a 200-instance evaluation slate filtered through automated quality checks and human-LLM review, designed to be reliable enough to compare different training recipes.
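The persona-and-skill combinatorial move can be sketched in miniature. The personas, skills, and task template below are invented placeholders; the real pipeline uses LLM-driven intent generation and a hybrid automated-check/LLM-judge verifier, where this sketch only leaves a hook.

```python
import itertools
import random

# Toy version of persona x skill task synthesis: sample from the cross
# product of persona-plausible intents and toolset operations instead of
# hand-curating tasks. All names here are illustrative.

PERSONAS = ["accountant", "grad student", "ops engineer"]
SKILLS = ["rename files by date", "summarize a folder of PDFs",
          "archive logs older than 30 days"]

def synthesize_tasks(n: int, seed: int = 0):
    rng = random.Random(seed)
    pool = list(itertools.product(PERSONAS, SKILLS))
    tasks = []
    for persona, skill in rng.choices(pool, k=n):
        tasks.append({
            "persona": persona,
            "intent": f"As a {persona}, {skill} in my workspace.",
            "verify": skill,  # hook for the automated half of verification
        })
    return tasks
```

Even this toy version shows why the combinatorial framing scales: coverage grows with the product of the two axes, not with curation effort.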

The 13.5K-task scale puts ClawGym in the same league as the largest open agentic-task corpora released in the last year and substantially ahead of most of them on verifiability — most prior corpora trade verifiability for scale or vice versa. The paper claims competitive performance from the resulting ClawGym-Agents family, though the absolute benchmark numbers are best read alongside the codebase release the authors promise. The framework's open release at github.com/ClawGym, once it lands, will be useful for any group trying to do agentic post-training without building data and evaluation from scratch.

This sits at the intersection of two important threads: the agentic capability push that everyone from Anthropic to OpenAI is pursuing in product, and the open-research push to give the community a trainable substrate that doesn't require frontier-lab data resources to be useful. The cross-source coverage on this paper — three arXiv categorical feeds plus two virtual feeds plus Hugging Face Daily Papers — is the cleanest signal yet that the field is treating agentic-data infrastructure as a first-class research surface rather than a tooling problem.

How it was discussed
  • Cross-listed in 3 arXiv categorical feeds: arXiv cs.LG, arXiv cs.CL, arXiv cs.AI.
  • Matched topical feeds: Reinforcement Learning, Evals & Benchmarks — wide thematic overlap.
  • Hugging Face Daily Papers picked it up — community-curated visibility signal.
cs.LG cs.CL cs.AI
#6
Infrastructure 2026-04-29 Dwarkesh Podcast 7.8 7.5/8.0/7.5

Dwarkesh Patel's latest interview is a meaningful departure in format. Rather than the usual conversational long-form, the episode is a blackboard lecture in which Reiner Pope — formerly of Google's TPU compiler and software-efficiency teams, now CEO of MatX — walks through the math of how frontier LLMs are trained and served. The premise Patel opens with is the right one: from a small set of equations, the published API prices, and chalk, you can recover most of what the frontier labs are doing internally. The episode tries to make that derivation legible.

The substantive content runs through the standard scaling-law and inference-economics arithmetic but with a level of mechanical precision that the public discourse usually elides. Pope walks through the parameter, FLOP, and memory-bandwidth budgets that determine whether a given model architecture is training-bound, inference-bound, or memory-bound, and shows how API price points reveal the rough cost-per-token at which a model is being served — which in turn tells you the likely model size, the precision regime, and whether the lab is running a dense or MoE backbone. He extends the same analysis to inference-time techniques like speculative decoding and test-time scaling, deriving the conditions under which they're economically dominant rather than treating them as catalog items. The episode also covers TPU and GPU architectural tradeoffs at a level of detail that is uncommon in public conversation, including the bandwidth and arithmetic-intensity envelopes that constrain modern accelerators.
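The kind of back-of-envelope serving arithmetic the lecture walks through can be reproduced in a few functions. Every number below is an illustrative placeholder (not a figure from the episode), and the formulas are the standard decode-time approximations: roughly 2 FLOPs per parameter per token, and a memory-bound regime whenever streaming the weights takes longer than the batch's compute.

```python
# Back-of-envelope decode economics: cost per token and the
# memory-bound vs compute-bound crossover. Illustrative inputs only.

def flops_per_token(params: float) -> float:
    """Decode-time FLOPs per token ~ 2 * parameter count (multiply + add)."""
    return 2.0 * params

def cost_per_million_tokens(params: float, chip_flops: float,
                            utilization: float,
                            chip_cost_per_hour: float) -> float:
    """$ per 1M generated tokens on one accelerator, compute-bound case."""
    tokens_per_sec = chip_flops * utilization / flops_per_token(params)
    return chip_cost_per_hour / (tokens_per_sec * 3600) * 1e6

def is_memory_bound(params: float, batch: int,
                    chip_flops: float, hbm_bytes_per_sec: float,
                    bytes_per_param: float = 1.0) -> bool:
    """Each decode step streams all weights once; compute scales with
    batch. Memory-bound when weight streaming exceeds compute time."""
    t_mem = params * bytes_per_param / hbm_bytes_per_sec
    t_compute = batch * flops_per_token(params) / chip_flops
    return t_mem > t_compute
```

This is the mechanism behind reading model size off API prices: posted cost-per-token plus plausible hardware envelopes over-determine the parameter count, precision, and batching regime.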

Pope is now CEO of MatX, the chip startup focused on building accelerators specifically optimized for transformer inference economics rather than general-purpose deep learning. That biographical detail is relevant: the lecture lands within a broader argument for purpose-built silicon, and listeners should weigh the framing accordingly. Pope's prior work on TPU compilers and on the open scaling book he co-authored gives him unusual credibility on the math, even setting aside the MatX framing.

The episode is best treated as a study guide. Patel publishes flashcards and practice problems alongside it — an admission that the material rewards re-listening with a notepad. For practitioners trying to forecast frontier model trajectories from public information, the playbook Pope lays out is one of the most useful summaries available. The release lands at the same time as OpenAI's compute-infrastructure essay and the hyperscaler Q1 prints, all of which together rotate the field's frame from architectures and post-training tricks toward the gigawatt-and-bandwidth substrate. Worth watching the YouTube version specifically — the chalkboard is doing real work.

#7
Interpretability 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision) · arXiv — Reinforcement Learning · arXiv — Mechanistic Interpretability · arXiv — Post-training / Alignment 7.5 7.3/7.6/7.5

This paper takes on a long-standing failure mode of vision-language models: they confidently produce factual hallucinations on long-tail or specialized visual content because they have no calibrated mechanism for refusing queries that fall outside their parametric knowledge. The authors propose a framework that delineates the model's knowledge boundaries through targeted probing and aligns the model to refuse appropriately rather than confabulate. The dataset they construct — Visual-Idk, for "Visual I don't know" — is built by multi-sample consistency probing across the model's own outputs, which is the right way to identify where a given model genuinely lacks coverage rather than measuring against an external oracle.
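The multi-sample consistency probe can be sketched as follows. `ask` stands in for a real VLM call, and the sample count and agreement threshold are illustrative choices, not the paper's settings; the mechanism — low self-agreement marks a query as outside parametric knowledge — is what the sketch shows.

```python
from collections import Counter

# Toy consistency probe: query the model k times and treat low
# self-agreement as "outside parametric knowledge". Threshold and k
# are illustrative, not taken from the paper.

def consistency_label(ask, query, k: int = 8, threshold: float = 0.6):
    """Returns ("known", answer) on stable agreement, else ("idk", None)."""
    answers = [ask(query) for _ in range(k)]
    top_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / k
    return ("known", top_answer) if agreement >= threshold else ("idk", None)
```

Queries labeled "idk" become refusal targets in a Visual-Idk-style dataset, while stable queries keep their consensus answer — which is what grounds the boundary in the model's own coverage rather than an external oracle.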

The training recipe is two-stage. Supervised fine-tuning teaches the model the basic refusal pattern on the Visual-Idk dataset; preference-aware optimization (the authors evaluate both DPO and ORPO) then sharpens the boundary so the model refuses on uncertain queries while remaining responsive on queries within its competence. The headline result is a Truthful Rate increase from 57.9% to 67.3% on the Visual-Idk slate — a 9.4-point absolute improvement, which is substantial in a setting where most prior work has produced either marginal gains or required external retrieval scaffolding to get there.

What makes the paper more than a routine refusal-tuning study is the internal probing the authors run to verify that the model has learned to recognize its own boundaries rather than memorize refusal templates. The probing distinguishes "actually knows it doesn't know" behavior from "pattern-matched to refuse on superficial cues" — a distinction that has plagued the alignment literature on refusal more generally. The probing results suggest the trained model genuinely tracks knowledge boundaries internally, which is consistent with what mechanistic interpretability work on text-only refusals has been finding over the last year.

The framework generalizes to out-of-distribution medical and perceptual domains in the authors' evaluation, which is the right test — knowledge boundaries are precisely the place where naive fine-tuning approaches fail to transfer. Multi-source coverage on this paper is heavy: it surfaced through arXiv's AI and CV feeds, the post-training and interpretability virtual feeds, and the reinforcement-learning virtual feed, since DPO and ORPO are RL-adjacent. The combination places it in the cluster of trustworthy-VLM work that's becoming foundational for medical and scientific deployments where confabulation is a deal-breaker.

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.AI, arXiv cs.CV.
  • Matched topical feeds: Reinforcement Learning, Mechanistic Interpretability, Post-training / Alignment — wide thematic overlap.
cs.AI cs.CV
#8
Agents & Tool Use 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Reinforcement Learning · arXiv — Agents / Tool Use · Hugging Face Daily Papers 7.4 8.3/6.3/7.5

We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Reinforcement Learning, Agents / Tool Use — wide thematic overlap.
  • Hugging Face Daily Papers picked it up — community-curated visibility signal.
cs.CV
#9
Industry 2026-04-29 Hacker News 7.4 7.5/6.0/8.6

Hacker News discussion (275 points) — Why AI companies want you to be afraid of them. Visit the comments thread for community caveats and the linked article for primary reporting.

#10
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.CL (Computation & Language) · Hugging Face Daily Papers 7.2 7.5/6.8/7.0

RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central systems challenge. Many existing efficiency methods improve throughput by changing the rollout or optimization regime, for example, through off-policy execution, replay, or lower-precision generation. We study speculative decoding as a lossless acceleration primitive for RL rollouts that preserves the target model's output distribution. We implement speculative decoding in NeMo-RL with a vLLM backend, supporting both synchronous and asynchronous pipelines and enabling speculation during RL rollouts. This benefit is realizable across speculation mechanisms,…
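The "lossless" property the abstract leans on comes from the standard speculative-sampling acceptance rule: accept a drafted token with probability min(1, p_target/p_draft), otherwise resample from the normalized residual, which makes the output distribution exactly the target model's. The toy version below works over an explicit vocabulary dict; real systems such as vLLM operate on logits, but the accept/reject math is the same.

```python
import random

# Toy speculative-sampling acceptance step over an explicit vocabulary.
# p_target and p_draft map token -> probability; the accepted-or-resampled
# output is distributed exactly as p_target, which is why RL rollouts
# accelerated this way stay on-policy.

def accept_or_resample(token, p_target, p_draft, rng):
    """Returns (emitted_token, was_accepted)."""
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token, True
    # Rejected: sample from the normalized positive residual.
    residual = {t: max(p_target[t] - p_draft[t], 0.0) for t in p_target}
    z = sum(residual.values())
    r, acc = rng.random() * z, 0.0
    for t, w in residual.items():
        acc += w
        if r <= acc:
            return t, False
    return token, False  # numerical fallback
```

Exactness of the output distribution is the whole point for RL: unlike off-policy or low-precision rollout tricks, nothing about the training signal changes.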

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv cs.CL.
  • Hugging Face Daily Papers picked it up — community-curated visibility signal.
cs.LG cs.CL
#11
Post-Training 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv — Reinforcement Learning · arXiv — Post-training / Alignment 7.2 7.0/6.8/7.5

Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language, exhibiting a phenomenon known as language confusion. Prior mitigation approaches based on sequence-level fine-tuning, such as DPO, ORPO, and GRPO, operate at the level of entire responses and can lead to unintended degradation of general model capabilities, motivating the need for more fine-grained alternatives. To address this, we introduce Token-Level Policy Optimization (TLPO), a fine-tuning framework designed to mitigate language confusion through localized, token-level updates. TLPO identifies error-prone positions, explores alternative…

How it was discussed
  • Cross-listed in 3 arXiv categorical feeds: arXiv cs.LG, arXiv cs.CL, arXiv cs.AI.
  • Matched topical feeds: Reinforcement Learning, Post-training / Alignment — wide thematic overlap.
cs.LG cs.CL cs.AI
#14
Efficiency 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.RO (Robotics) · arXiv cs.CV (Computer Vision) · arXiv — Efficiency (Quantization, MoE, Inference) 7.1 7.3/7.8/6.0

Deploying accurate object detection for Vulnerable Road User (VRU) safety on edge hardware requires balancing model capacity against computational constraints. Large models achieve high accuracy but fail under INT8 quantization required for edge deployment, while small models sacrifice detection performance. This paper presents a knowledge distillation (KD) framework that trains a compact YOLOv8-S student (11.2M parameters) to mimic a YOLOv8-L teacher (43.7M parameters), achieving 3.9x compression while preserving quantization robustness. We evaluate on full-scale BDD100K (70K training images) with Post-Training Quantization to INT8. The teacher suffers catastrophic degradation under INT8…
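The distillation objective the abstract describes can be sketched in its generic classification-head form. Detection-specific terms (box regression, objectness) are omitted, and the temperature-scaled KL shape below follows the standard Hinton-style recipe rather than this paper's exact loss — it is a sketch of the family, not the method.

```python
import math

# Generic soft-target distillation loss: the student matches the
# teacher's temperature-softened output distribution. Classification
# head only; the paper's detection-specific terms are omitted.

def softmax(logits, temp=1.0):
    exps = [math.exp(x / temp) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, temp: float = 2.0) -> float:
    """KL(teacher_T || student_T), scaled by T^2 per the standard recipe."""
    p = softmax(teacher_logits, temp)
    q = softmax(student_logits, temp)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temp * temp * kl
```

The loss is zero when student and teacher agree and grows with divergence; the temperature softens the teacher's distribution so the student also learns the relative ordering of wrong classes.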

How it was discussed
  • Cross-listed in 3 arXiv categorical feeds: arXiv cs.LG, arXiv cs.RO, arXiv cs.CV.
  • Matched topical feeds: Efficiency (Quantization, MoE, Inference) — wide thematic overlap.
cs.LG cs.RO cs.CV
#15
Agents & Tool Use 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv — Reinforcement Learning · arXiv — Agents / Tool Use · arXiv — Evals & Benchmarks 7.1 7.3/6.3/7.5

Live future prediction refers to the task of making predictions about real-world events before they unfold. This task is increasingly studied using large language model-based agent systems, and it is important for building agents that can continually learn from the real world. Just as interactive environments have often driven progress in agents, advancing live future prediction naturally motivates viewing it as a learning environment. Prior works have explored future prediction from several different angles, but have generally not framed it as a unified learning environment. This task is appealing for learning because…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv cs.AI.
  • Matched topical feeds: Reinforcement Learning, Agents / Tool Use, Evals & Benchmarks — wide thematic overlap.
cs.LG cs.AI
#16
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — State Space Models · arXiv — Reinforcement Learning · arXiv — Post-training / Alignment · arXiv — Evals & Benchmarks 7.1 7.3/6.3/7.5

The rapid growth of molecular foundation models and general-purpose large language models has encouraged a scale-centric view of artificial intelligence in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and task-specific graph neural networks (GNNs). We test this assumption on 22 molecular property and activity endpoints, including public ADMET and Tox21 benchmarks and two internal anti-infective activity datasets. Across 167,056 held-out task–molecule evaluations under structure-similarity-separated five-fold cross-validation (37,756 ADMET, 77,946 Tox21, 49,266 anti-TB and 2,088 antimalaria), classical machine-learning (ML) models such as RF(ECFP4) and…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: State Space Models, Reinforcement Learning, Post-training / Alignment, Evals & Benchmarks — wide thematic overlap.
cs.LG
#17
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv — Reinforcement Learning · arXiv — Evals & Benchmarks 7.0 7.3/7.6/6.0

Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self-Alignment for Safety), a transformer-based framework that enables test-time adaptation in offline safe RL without retraining. In SAS, the main mechanism is self-alignment: at test time, the pretrained agent generates several imagined trajectories and selects those satisfying the Lyapunov condition. These feasible segments are then recycled as in-context prompts, allowing the agent to realign its behavior toward safety while avoiding parameter updates…
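The self-alignment loop can be sketched as a filter over imagined rollouts. The non-increasing-cost check below is a discrete stand-in for the Lyapunov condition, and `imagine` is a placeholder for the pretrained agent's rollout generator; both names are illustrative, not from the paper.

```python
# Toy self-alignment loop: imagine candidate trajectories, keep those
# whose step-wise safety costs never increase (a stand-in for the
# Lyapunov condition), and return them for use as in-context prompts.

def lyapunov_ok(costs):
    """Safety-cost sequence must be non-increasing along the trajectory."""
    return all(b <= a for a, b in zip(costs, costs[1:]))

def self_align(imagine, n: int = 16):
    """imagine() -> (trajectory, costs). Returns the feasible
    trajectories to prepend to the agent's context at test time."""
    feasible = []
    for _ in range(n):
        traj, costs = imagine()
        if lyapunov_ok(costs):
            feasible.append(traj)
    return feasible
```

Because the realignment happens entirely through the context, the parameters never change — which is the "without retraining" claim in the abstract.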

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv cs.AI.
  • Matched topical feeds: Reinforcement Learning, Evals & Benchmarks — wide thematic overlap.
cs.LG cs.AI
#24
Frontier LLMs 2026-04-29 OpenAI Research 6.9 7.0/6.5/7.0

How goblin outputs spread in AI models: timeline, root cause, and fixes behind personality-driven quirks in GPT-5 behavior.

#25
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Reinforcement Learning · arXiv — Efficiency (Quantization, MoE, Inference) · arXiv — Evals & Benchmarks 6.8 7.3/6.8/6.0

Improving large language model (LLM) reasoning requires supervision that is both aligned with the model's own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exploration but offers sparse, high-variance credit; supervised fine-tuning and distillation provide dense targets but often train on fixed trajectories or rely on stronger teachers. Recent privileged on-policy self-distillation explores a middle ground by scoring student rollouts with the same model under verified solution context. We revisit this setting through a contextual re-scoring lens: for reasoning, the important choices are…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Reinforcement Learning, Efficiency (Quantization, MoE, Inference), Evals & Benchmarks — wide thematic overlap.
cs.LG
#26
Generative Media 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv — Generative Media / Diffusion 6.7 7.0/6.9/6.0

When do language diffusion models memorize their training data, and how can we quantitatively assess their true generative regime? We address these questions by showing that Uniform-based Discrete Diffusion Models (UDDMs) fundamentally behave as Associative Memories (AMs) with emergent creative capabilities. The core idea of an AM is to reliably recover stored data points as memories by establishing distinct basins of attraction around them. Historically, models like Hopfield networks use an explicit energy function to guarantee these stable attractors. We broaden this perspective by leveraging the observation that energy is not…

How it was discussed
  • Cross-listed in 3 arXiv categorical feeds: arXiv cs.LG, arXiv cs.CL, arXiv cs.AI.
  • Matched topical feeds: Generative Media / Diffusion — wide thematic overlap.
cs.LG cs.CL cs.AI
#27
Robotic Autonomy 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv cs.RO (Robotics) · arXiv — Reinforcement Learning 6.7 7.0/7.0/6.0

This paper presents a hierarchical decision-making framework for unmanned aerial vehicle (UAV) missions motivated by search-and-rescue (SAR) scenarios under limited simulation training. The framework combines a fixed rule-based high-level advisor with an online goal-conditioned low-level reinforcement learning (RL) controller. To stress-test early adaptation, we also consider a strict no-pretraining deployment regime. The high-level advisor is defined offline from a structured task specification and compiled into deterministic rules. It provides interpretable mission- and safety-aware guidance through recommended actions, avoided actions, and regime-dependent arbitration weights. The low-level controller learns online from task-defined…
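The advisor's arbitration step, as described, amounts to re-weighting the low-level controller's action preferences with recommended/avoided sets. A toy sketch under assumed semantics (the weights, action names, and scoring here are hypothetical, not the paper's):

```python
def arbitrate(action_values, recommended, avoided, w_rec=0.2, w_avoid=0.5):
    # Bias the low-level policy's action values toward the advisor's
    # recommended actions and away from avoided ones, then pick the
    # best action. Weights would be regime-dependent in practice.
    adjusted = {}
    for action, value in action_values.items():
        if action in avoided:
            adjusted[action] = value - w_avoid
        elif action in recommended:
            adjusted[action] = value + w_rec
        else:
            adjusted[action] = value
    return max(adjusted, key=adjusted.get)
```

Because the rules are compiled offline from the task specification, every override is traceable to a named rule, which is where the interpretability claim comes from.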

How it was discussed
  • Cross-listed in 3 arXiv categorical feeds: arXiv cs.LG, arXiv cs.AI, arXiv cs.RO.
  • Matched topical feeds: Reinforcement Learning — wide thematic overlap.
cs.LG cs.AI cs.RO
#29
Industry 2026-04-29 TechCrunch — AI 6.7 7.5/5.9/6.5

AI-generated video has gone from novelty to creative tool almost overnight, and Runway has a front row seat to the shift. The New York-based company has raised close to $860 million at a $5.3 billion valuation, and its models are going toe-to-toe with the most well-funded labs in the world, including Google and OpenAI. The technology goes way beyond […]

#31
Agents & Tool Use 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use · arXiv — AI for Science 6.6 7.3/6.3/6.0

AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulation, and hypothesis generation workflows across domains. However, the effectiveness of these models is fundamentally constrained by the AI-readiness of scientific data, for which no scalable and systematic evaluation mechanism currently exists. In this work, we propose SciHorizon-DataEVA, a novel agentic system for scalable AI-readiness evaluation of heterogeneous scientific data. At the evaluation-criteria level, we introduce the Sci-TQA2 principles, which organize AI-readiness into four complementary dimensions: Governance Trustworthiness, Data Quality, AI Compatibility, and Scientific Adaptability. Each…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv cs.AI.
  • Matched topical feeds: Agents / Tool Use, AI for Science — wide thematic overlap.
cs.LG cs.AI
#32
Interpretability 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision) · arXiv — Mechanistic Interpretability 6.6 7.3/6.3/6.0

Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requiring out-of-distribution reasoning. A central, yet unverified, assumption in neuro-symbolic AI is that compositional reasoning will emerge as a byproduct of successful symbol grounding. This work presents the first systematic empirical analysis to challenge this assumption by disentangling the contributions of grounding and reasoning. To operationalize this investigation, we introduce the Iterative Logic Tensor Network ($i$LTN), a fully differentiable architecture designed for multi-step deduction. Using a formal taxonomy of generalization -- probing for…

How it was discussed
  • Cross-listed in 3 arXiv categorical feeds: arXiv cs.LG, arXiv cs.AI, arXiv cs.CV.
  • Matched topical feeds: Mechanistic Interpretability — wide thematic overlap.
cs.LG cs.AI cs.CV
#33
Evaluations & Benchmarks 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv — Efficiency (Quantization, MoE, Inference) · arXiv — Evals & Benchmarks 6.6 6.5/7.6/5.5

Small language models (SLMs) offer computational efficiency for scalable deployment, yet they often fall short of the reasoning power exhibited by their larger counterparts (LLMs). To mitigate this gap, current approaches invoke an LLM to generate tokens at points of reasoning divergence, but these external calls introduce substantial latency and costs. Alternatively, standard distillation is often hindered by the capacity limitation, as SLMs struggle to accurately mimic the LLM's complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM's preferred token consistently resides within…
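The "local sufficiency" observation lends itself to a simple membership check; this sketch assumes token-level probability lists and is illustrative only, not the paper's method:

```python
def topk_ids(probs, k):
    # Indices of the k highest-probability tokens.
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def locally_sufficient(slm_probs, llm_preferred, k=5):
    # "Local sufficiency": at a divergence point, the large model's
    # preferred token already sits inside the small model's top-k
    # candidate set, so the SLM only needs to re-rank a few local
    # candidates instead of mimicking the LLM's full distribution.
    return llm_preferred in topk_ids(slm_probs, k)
```

When the check holds, no external LLM call is needed at inference, which is the latency/cost win the abstract is pointing at.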

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CL.
  • Matched topical feeds: Efficiency (Quantization, MoE, Inference), Evals & Benchmarks — wide thematic overlap.
cs.CL
#34
Agents & Tool Use 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use · arXiv — Evals & Benchmarks 6.6 7.3/6.3/6.0

Most enterprise document AI today is a pipeline. Parse, index, retrieve, generate. Each of those stages has been studied to death on its own -- what's still hard is evaluating the system as a whole. We built EnterpriseDocBench to take a swing at it: parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness, all on the same corpus. The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot). We ran three pipelines through it -- BM25, dense embedding, and a hybrid…
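Of the three pipelines, the hybrid one is the least standardized. A common way to blend BM25 and dense scores (not necessarily what the benchmark's authors did) is min-max normalization followed by a weighted sum:

```python
def minmax(scores):
    # Rescale scores to [0, 1] so lexical and dense scores are
    # comparable before blending.
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25, dense, alpha=0.5):
    # Per-document weighted blend of BM25 and dense-embedding
    # scores; alpha weights the lexical side.
    return [alpha * b + (1 - alpha) * d
            for b, d in zip(minmax(bm25), minmax(dense))]
```

Normalization matters here because raw BM25 scores and cosine similarities live on very different scales.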

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.CL, arXiv cs.AI.
  • Matched topical feeds: Agents / Tool Use, Evals & Benchmarks — wide thematic overlap.
cs.CL cs.AI
#35
Agents & Tool Use 2026-04-29 arXiv cs.RO (Robotics) · arXiv cs.CV (Computer Vision) · arXiv — Robotic Autonomy / Embodied AI · arXiv — Agents / Tool Use 6.6 7.3/6.3/6.0

Embodied AI and robotic systems increasingly depend on scalable, diverse, and physically grounded 3D content for simulation-based training and real-world deployment. While 3D generative modeling has advanced rapidly, embodied applications impose requirements far beyond visual realism: generated objects must carry kinematic structure and material properties, scenes must support interaction and task execution, and the resulting content must bridge the gap between simulation and reality. This work presents the first survey of 3D generation for embodied AI and organizes the literature around three roles that 3D generation plays in embodied systems.…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.RO, arXiv cs.CV.
  • Matched topical feeds: Robotic Autonomy / Embodied AI, Agents / Tool Use — wide thematic overlap.
cs.RO cs.CV
#36
Evaluations & Benchmarks 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision) · arXiv — Efficiency (Quantization, MoE, Inference) · arXiv — Evals & Benchmarks 6.5 7.3/6.2/6.0

The rapid advancement of object detection architectures has positioned single-stage detectors as the dominant solution for real-time visual perception. A primary source of computational overhead in these models lies in the deep backbone stages, where C2f bottleneck modules at high stride levels accumulate a disproportionate share of parameters due to quadratic scaling with channel width. This work introduces QYOLO, a quantum-inspired channel mixing framework that achieves genuine architectural compression by replacing the two deepest backbone C2f modules at P4/16 (512 channels) and P5/32 (1024 channels) with a compact QMixBlock.…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.AI, arXiv cs.CV.
  • Matched topical feeds: Efficiency (Quantization, MoE, Inference), Evals & Benchmarks — wide thematic overlap.
cs.AI cs.CV
#37
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Mechanistic Interpretability · arXiv — Evals & Benchmarks 6.5 6.2/7.6/5.5

Sparse Autoencoders (SAEs) have demonstrated significant success in interpreting Large Language Models (LLMs) by decomposing dense representations into sparse, semantic components. However, their potential for analyzing Vision Transformers (ViTs) remains largely under-explored. In this work, we present the first application of SAEs to the ViT [CLS] token for out-of-distribution (OOD) detection, addressing the limitation of existing methods that rely on entangled feature representations. We propose a novel framework utilizing a Top-k SAE to disentangle the dense [CLS] features into a structured latent space. Through this analysis, we reveal that in-distribution…
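The Top-k SAE forward pass the abstract describes can be sketched in a few lines of NumPy. The dimensions, the tied decoder weights, and the absence of a ReLU below are simplifying assumptions, not details from the paper:

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, k):
    # Encode the dense [CLS] feature, zero out everything except
    # the k largest latent activations, then decode. The Top-k
    # constraint is what yields a disentangled sparse code.
    z = x @ W_enc + b_enc
    z_sparse = z.copy()
    z_sparse[np.argsort(z)[:-k]] = 0.0   # drop all but the top k
    return z_sparse, z_sparse @ W_dec
```

For OOD detection, one would then score inputs by reconstruction error or by which latent units fire, rather than by the entangled dense feature.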

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Mechanistic Interpretability, Evals & Benchmarks — wide thematic overlap.
cs.CV
#40
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Reinforcement Learning · arXiv — Evals & Benchmarks 6.4 6.5/7.0/5.5

Predictive safety filters (PSFs) leverage model predictive control to enforce constraint satisfaction during deep reinforcement learning (RL) exploration, yet their reliance on first-principles models or Gaussian processes limits scalability and broader applicability. Meanwhile, model-based RL (MBRL) methods routinely employ probabilistic ensemble (PE) neural networks to capture complex, high-dimensional dynamics from data with minimal prior knowledge. However, existing attempts to integrate PEs into PSFs lack rigorous uncertainty quantification. We introduce the Uncertainty-Aware Predictive Safety Filter (UPSi), a PSF that provides rigorous safety predictions using PE dynamics models by formulating future outcomes…
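One standard way a probabilistic ensemble feeds a safety filter (a generic sketch, not UPSi's actual formulation) is to penalize predictions by their ensemble spread before checking the constraint:

```python
def ensemble_stats(preds):
    # Mean and variance across an ensemble of dynamics-model
    # predictions for the same (state, action) pair.
    n = len(preds)
    mean = sum(preds) / n
    return mean, sum((p - mean) ** 2 for p in preds) / n

def safe_under_uncertainty(preds, limit, beta=2.0):
    # Pessimistic check: the predicted constraint value plus a
    # beta-sigma uncertainty margin must stay within the limit.
    mean, var = ensemble_stats(preds)
    return mean + beta * var ** 0.5 <= limit
```

The point the abstract makes is that this kind of heuristic margin lacks rigorous guarantees, which is the gap the paper targets.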

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Reinforcement Learning, Evals & Benchmarks — wide thematic overlap.
cs.LG
#41
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Mechanistic Interpretability · arXiv — Evals & Benchmarks 6.4 6.5/7.0/5.5

Multimodal large language models (MLLMs) have achieved impressive progress on general multimodal tasks, yet they remain brittle on dial-based measurement reading. In this paper, we study this problem through controlled benchmarks and feature-space probing, and show that current MLLMs not only achieve unsatisfactory accuracy on dial-based readout, but also suffer sharp performance drops under viewpoint and illumination changes even when the underlying dial state remains fixed. Our probing analysis further reveals that same-state samples under appearance variation are not consistently clustered, while neighboring states fail to preserve the local structure…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Mechanistic Interpretability, Evals & Benchmarks — wide thematic overlap.
cs.CV
#43
Infrastructure 2026-04-30 Latent Space 6.4 6.5/6.2/6.5

Just as we covered World Models early this year, we’ll be releasing a short miniseries on the CPU compute/sandbox industry on the pod over the coming weeks, and it’s a good time to explain why. In recent days: Noam Brown: “inference compute is a strategic resource, currently undervalued.” Sam Altman: “To a significant degree, we have to become an AI inference company now.” Taken individually, these comments might seem unremarkable: normal reactions to a very successful GPT 5.5…

#44
Industry 2026-04-29 Hacker News 6.4 6.6/6.0/6.6

Hacker News discussion (157 points) — "People who don't use AI will be left behind". Visit the comments thread for community caveats and the linked article for primary reporting.

#46
Interpretability 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.CL (Computation & Language) · arXiv — Mechanistic Interpretability 6.3 6.2/7.0/5.5

Large language models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next-token prediction. Subsequent stages of post-training often introduce new facts outside the parametric knowledge, giving rise to hallucinations. While it has been demonstrated that supervised fine-tuning (SFT) on new knowledge may exacerbate the problem, the underlying mechanisms are still poorly understood. We conduct a controlled fine-tuning experiment, focusing on closed-book QA, and find latent directions that causally contribute to hallucinations. Specifically, we fine-tune Llama 3.1 8B, Gemma 2 9B and Mistral 7B v03 on…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv cs.CL.
  • Matched topical feeds: Mechanistic Interpretability — wide thematic overlap.
cs.LG cs.CL
#47
Evaluations & Benchmarks 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks 6.3 6.5/6.8/5.5

Large reasoning models such as DeepSeek-R1 and OpenAI o1 generate extended chains of thought spanning thousands of tokens, yet their integration with retrieval-augmented generation (RAG) remains fundamentally misaligned. Current RAG systems optimize for providing context before reasoning begins, while reasoning models require evidence injection during multi-step inference chains. We introduce ReaLM-Retrieve, a reasoning-aware retrieval framework that addresses this mismatch through three key innovations: (1) a step-level uncertainty detector that identifies knowledge gaps at reasoning-step granularity rather than token or sentence level; (2) a retrieval intervention policy that learns when external…
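A step-level uncertainty detector could be as simple as thresholding mean token entropy per reasoning step. This toy version is a plausible baseline, not the paper's learned detector:

```python
import math

def step_entropy(token_dists):
    # Mean per-token Shannon entropy over one reasoning step,
    # given each token's probability distribution.
    ents = [-sum(p * math.log(p) for p in dist if p > 0)
            for dist in token_dists]
    return sum(ents) / len(ents)

def needs_retrieval(token_dists, threshold=1.0):
    # Trigger retrieval at reasoning-step granularity when the
    # model's own token distributions look uncertain.
    return step_entropy(token_dists) > threshold
```

The framing's contribution is operating at step granularity: retrieval fires mid-chain where the knowledge gap occurs, not once up front.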

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.CL, arXiv cs.AI.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CL cs.AI
#48
Evaluations & Benchmarks 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv cs.RO (Robotics) · arXiv — Evals & Benchmarks 6.3 6.5/6.9/5.5

Large language models (LLMs) are increasingly considered for deployment as the control component of robotic health attendants, yet their safety in this context remains poorly characterized. We introduce a dataset of 270 harmful instructions spanning nine prohibited behavior categories grounded in the American Medical Association Principles of Medical Ethics, and use it to evaluate 72 LLMs in a simulation environment based on the Robotic Health Attendant framework. The mean violation rate across all models was 54.4%, with more than half exceeding 50%, and violation rates varied substantially across behavior categories,…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.AI, arXiv cs.RO.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.AI cs.RO
#49
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Post-training / Alignment · arXiv — Evals & Benchmarks 6.3 6.5/6.8/5.5

Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Post-training / Alignment, Evals & Benchmarks — wide thematic overlap.
cs.CV
#52
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks 6.2 6.5/6.3/5.5

I propose the Random Cloud method, a training-free approach to neural architecture search that discovers minimal feedforward network topologies through stochastic exploration and progressive structural reduction. Unlike post-training pruning methods that require a full train-prune-retrain cycle, this method evaluates randomly initialized networks without backpropagation, progressively reduces their topology, and only trains the best minimal candidate at the end. I evaluate on 7 classification benchmarks against magnitude pruning and random pruning baselines. The Random Cloud matches or outperforms both baselines in 6 of 7 datasets, achieving statistically significant improvements on Sonar…
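The search loop the abstract describes (evaluate without backprop, progressively reduce, keep the best minimal candidate) might look like this in miniature; the proxy `score` and the fixed shrink schedule below are stand-ins for the paper's stochastic exploration:

```python
def random_cloud_search(score, widths, shrink=0.5, tol=0.02, rounds=3):
    # Training-free search: candidates are scored by a cheap proxy
    # (no backpropagation); layer widths are progressively reduced
    # as long as the proxy stays within `tol` of the best seen.
    best, best_score = list(widths), score(widths)
    for _ in range(rounds):
        cand = [max(1, int(w * shrink)) for w in best]
        if score(cand) >= best_score - tol:
            best, best_score = cand, score(cand)
    return best
```

Only the surviving minimal topology would then be trained for real, which is where the claimed savings over train-prune-retrain come from.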

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv cs.AI.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.LG cs.AI
#53
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Post-training / Alignment · arXiv — Evals & Benchmarks 6.2 6.5/6.3/5.5

Despite being resource-intensive to train, 3D convolutional neural networks (CNNs) have been the standard approach to classify CT and MRI scans. Recent work suggests that deep multiple instance learning (MIL) may be a more efficient alternative for 3D brain scans, especially when the pre-trained image encoder used to embed each 2D slice is frozen and only the pooling operation and classifier are trained. In this paper, we provide a systematic comparison of simple MIL, attention-based MIL, 3D CNNs, and 3D ViTs across three CT and four MRI datasets, including two…
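For reference, the attention-based MIL pooling being compared is typically a softmax-weighted average of frozen slice embeddings. A dependency-free sketch (the scoring vector `w` would normally be a learned parameter):

```python
import math

def attention_mil_pool(instance_embs, w):
    # Attention-based MIL: score each (frozen) 2D-slice embedding
    # with a vector w, softmax the scores, and return the
    # attention-weighted average as the whole-scan embedding.
    scores = [sum(a * b for a, b in zip(e, w)) for e in instance_embs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]     # stable softmax
    alphas = [x / sum(exps) for x in exps]
    return [sum(a * e[j] for a, e in zip(alphas, instance_embs))
            for j in range(len(instance_embs[0]))]
```

With the encoder frozen, only `w` and the downstream classifier train, which is the efficiency argument for MIL over full 3D CNNs.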

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Post-training / Alignment, Evals & Benchmarks — wide thematic overlap.
cs.LG
#54
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv stat.ML (Statistical ML) · arXiv — Evals & Benchmarks 6.2 6.5/6.3/5.5

Uncertainty estimation is essential for robust decision-making in the presence of ambiguous or out-of-distribution inputs. Gaussian Processes (GPs) are classical kernel-based models that offer principled uncertainty quantification and perform well on small- to medium-scale datasets. Alternatively, formulating the weight space learning problem under tensor network assumptions yields scalable tensor network kernel machines. However, these assumptions break Gaussianity, complicating standard probabilistic inference. This raises a fundamental question: how can tensor network kernel machines provide principled uncertainty estimates? We propose a novel Bayesian Tensor Network Kernel Machine (LA-TNKM) that employs a (linearized)…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv stat.ML.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.LG stat.ML
#55
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Efficiency (Quantization, MoE, Inference) · arXiv — Evals & Benchmarks 6.2 6.5/6.3/5.5

GPU-accelerated Self-Organizing Map (SOM) implementations are among the most competitive options for large-scale SOM analysis, but growing dataset sizes increasingly challenge their practical use because workloads no longer fit cleanly within device-memory limits. We introduce FloatSOM, a SOM framework for scalable training and deployment that supports multi-GPU execution, out-of-memory disk-backed streaming, and novel topologies beyond regular lattices. We evaluate FloatSOM on 14 synthetic and real benchmark datasets together with controlled speed scaling benchmarks, and show that these improved topologies, combined with topology-aware hyperparameter fine-tuning, yield lower quantization error than current…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Efficiency (Quantization, MoE, Inference), Evals & Benchmarks — wide thematic overlap.
cs.LG
#56
Evaluations & Benchmarks 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks 6.2 6.5/6.3/5.5

The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While recent methodologies utilise generic Large Language Models (LLMs) for data augmentation, these approaches remain prone to error and often lack the reasoning depth required for classical grammar. In this work, we introduce Naamah, a high-quality silver-standard Sanskrit NER dataset comprising 102,942 sentences. We propose a methodology that combines entity extraction from DBpedia with the generative capabilities of a 24B parameter hybrid reasoning model to create grammatically natural and…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.CL, arXiv cs.AI.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CL cs.AI
#57
Robotic Autonomy 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv cs.RO (Robotics) · arXiv — Reinforcement Learning 6.2 6.2/6.7/5.5

Annotating long-horizon robotic demonstrations with precise temporal action boundaries is crucial for training and evaluating action segmentation and manipulation policy learning methods. Existing annotation tools, however, are often limited: they are designed primarily for vision-only data, do not natively support synchronized visualization of robot-specific time-series signals (e.g., gripper state or force/torque), or require substantial effort to adapt to different dataset formats. In this paper, we introduce ATLAS, an annotation tool tailored for long-horizon robotic action segmentation. ATLAS provides time-synchronized visualization of multi-modal robotic data, including multi-view video and proprioceptive signals,…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.AI, arXiv cs.RO.
  • Matched topical feeds: Reinforcement Learning — wide thematic overlap.
cs.AI cs.RO
#59
Industry 2026-04-29 TechCrunch — AI 6.2 6.0/5.9/6.5

Meta is losing billions on Reality Labs each quarter, and its growing AI expenditures will only push that spending higher.

#61
Industry 2026-04-29 TechCrunch — AI 6.2 6.0/5.9/6.5

Google TV just got more Gemini features, including the ability to transform photos and videos with its Nano Banana and Veo tools.

#63
Safety, Policy & Regulation 2026-04-29 Hacker News 6.2 6.3/6.0/6.1

Hacker News discussion (127 points) — Ramp's Sheets AI Exfiltrates Financials. Visit the comments thread for community caveats and the linked article for primary reporting.

#65
Government & Defense 2026-04-29 DefenseScoop 6.2 6.5/7.5/4.5

The U.S. military will soon have a new sub-unified command focused on autonomous warfare, Secretary of Defense Pete Hegseth told lawmakers Wednesday. Sub-unified commands, which combatant commanders can set up with the approval of the SecDef, are joint organizations designed to conduct operations and certain missions assigned to the geographic or functional combatant command that they fall under. The designation typically signifies that the organization’s mission is enduring and a high priority for military leadership. Examples of sub-unified commands include…

#66
State Space Models 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv cs.NE (Neural & Evolutionary Computing) 6.1 6.2/6.3/5.5

Can Neural Assemblies -- groups of neurons that fire together and strengthen through co-activation -- learn the direction of causal influence between variables? While established as a computationally general substrate for classification, parsing, and planning, neural assemblies have not yet been shown to internalize causal directionality. We demonstrate that the inherent operations of neural assemblies -- projection, local plasticity control, and sparse winner selection -- are sufficient for directional learning. We introduce DIRECT (DIRectional Edge Coupling/Training), a mechanism that co-activates source and target assemblies under an adaptive gain schedule to…

How it was discussed
  • Cross-listed in 3 arXiv categorical feeds: arXiv cs.LG, arXiv cs.AI, arXiv cs.NE.
cs.LG cs.AI cs.NE
#67
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) 6.1 6.2/6.3/5.5

Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free-text triage documentation contributes to mistriage and workflow inefficiencies. This study evaluates whether open-source small language models (SLMs) can serve as reliable, privacy-preserving decision-support tools for clinical triage. We systematically compared multiple SLMs across diverse prompting pipelines and found that clinical vignettes, concise summaries of triage narratives, yielded the most accurate predictions. Among the SLMs tested, Qwen2.5-7B demonstrated the strongest balance of accuracy, stability, and computational efficiency. Through large-scale domain adaptation using expert-curated and…

How it was discussed
  • Cross-listed in 3 arXiv categorical feeds: arXiv cs.LG, arXiv cs.CL, arXiv cs.AI.
cs.LG cs.CL cs.AI
#68
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 6.1 6.2/6.3/5.5

Recent advances in self-evolution video understanding frameworks have demonstrated the potential of autonomous learning without human annotations. However, existing methods often suffer from weakly controlled optimization and uncontrolled difficulty progression, as they lack structured guidance throughout the iterative learning process. To address these limitations, we propose CurEvo, a curriculum-guided self-evolution framework that introduces curriculum learning into self-evolution to achieve more structured and progressive model improvement. CurEvo dynamically regulates task difficulty, refines evaluation criteria, and balances data diversity according to model competence, forming a curriculum-guided feedback loop that aligns learning complexity…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.LG cs.CV
#69
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Reinforcement Learning · arXiv — Evals & Benchmarks 6.1 6.2/6.3/5.5

Electric truck operations require routing decisions that remain feasible under limited battery range, long charging times, uncertain travel and energy consumption, and competition for shared charging infrastructure. These features make electric truck routing a coupled logistics and energy problem, limiting the practicality of heuristics-based methods and rendering them computationally infeasible at scale. This paper proposes a learning-based framework for stochastic electric truck routing under charging constraints and operational uncertainty. The problem, solved with reinforcement learning, is formulated as an event-driven semi-Markov decision process with shared charging resources, stochastic travel and…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Reinforcement Learning, Evals & Benchmarks — wide thematic overlap.
cs.LG
#70
Multimodal 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision) 6.1 6.2/6.3/5.5

Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource-constrained embedded platforms. Conversely, fully offloading inference to the cloud is often impractical in bandwidth-limited environments, where transmitting raw visual data introduces substantial latency overhead. While recent edge-cloud collaborative architectures attempt to partition VLM workloads across devices, they typically rely on transmitting fixed-size representations, lacking adaptability to dynamic network conditions and failing to fully exploit semantic redundancy. In this paper, we propose a progressive semantic communication framework for…

How it was discussed
  • Cross-listed in 3 arXiv categorical feeds: arXiv cs.LG, arXiv cs.AI, arXiv cs.CV.
cs.LG cs.AI cs.CV
#71
Post-Training 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Reinforcement Learning · arXiv — Post-training / Alignment 6.1 6.2/6.3/5.5

We study sequential interventions under prerequisite constraints. In this setting, admissible intervention sequences are paths in the ideal lattice of a finite prerequisite poset rather than unconstrained action strings. We give an exact local-to-global theory of order sensitivity on this state space. First, we prove that any two admissible paths with the same endpoints differ by a finite sequence of elementary diamond swaps. Second, for edge-additive path valuations, we show that path-independence is equivalent to vanishing diamond curvature, yielding an endpoint potential with a canonical Möbius parameterization on the ideal…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Reinforcement Learning, Post-training / Alignment — wide thematic overlap.
cs.LG
#72
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) 6.1 6.2/6.3/5.5

We describe our system for SemEval-2026 Task 6 (CLARITY: Unmasking Political Question Evasions), which classifies English political interview responses by coarse-grained clarity (3-way) and fine-grained evasion strategy (9-way). Since responses frequently exceed the 512-token limit of standard Transformer encoders, we apply an overlapping sliding-window chunking strategy with element-wise Max-Pooling aggregation over chunk representations. A shared RoBERTa-large encoder supplies two task-specific heads trained jointly via a multi-task objective, with inference-time ensembling over 7-fold stratified cross-validation. Our system achieves a Macro-F1 of 0.80 on Subtask 1 and 0.51 on Subtask 2, ranking…
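A minimal version of the overlapping sliding-window chunking with element-wise max-pooling (toy token lists stand in for real embeddings; window and stride are scaled down from the 512-token setting for illustration):

```python
def chunk_with_overlap(tokens, window=512, stride=256):
    # Overlapping sliding windows over a long token sequence, so
    # no token is dropped by the encoder's fixed input limit.
    chunks = []
    for start in range(0, max(1, len(tokens) - window + stride), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return chunks

def maxpool(chunk_vecs):
    # Element-wise max over the per-chunk representation vectors,
    # yielding one fixed-size vector for the whole response.
    return [max(col) for col in zip(*chunk_vecs)]
```

The overlap means boundary tokens appear in two chunks, so evidence split across a window edge still reaches the pooled representation.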

How it was discussed
  • Cross-listed in 3 arXiv categorical feeds: arXiv cs.LG, arXiv cs.CL, arXiv cs.AI.
cs.LG cs.CL cs.AI
#73
Agents & Tool Use 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv — Agents / Tool Use · arXiv — Evals & Benchmarks 6.1 6.2/6.3/5.5

Autonomous LLM agents increasingly operate in long-horizon, interactive settings where success depends on reusing experience accumulated over extended histories. However, existing agent memory systems are fundamentally constrained by text-context budgets: storing or revisiting raw trajectories is prohibitively token-expensive, while summarization and text-only retrieval trade token savings for information loss and fragmented evidence. To address this limitation, we propose Optical Context Retrieval Memory (OCR-Memory), a memory framework that leverages the visual modality as a high-density representation of agent experience, enabling retention of arbitrarily long histories with minimal prompt overhead at retrieval…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CL.
  • Matched topical feeds: Agents / Tool Use, Evals & Benchmarks — wide thematic overlap.
cs.CL
#74
Post-Training 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv — Post-training / Alignment 6.1 6.2/6.3/5.5

Timely and reliable multilingual communication is critical during natural and human-induced disasters, but developing effective solutions for crisis communication is limited by the scarcity of curated parallel data. We propose a domain-adaptive pipeline that expands a small reference corpus by retrieving and filtering data from general corpora. We use the resulting dataset to fine-tune a small language model for crisis-domain translation and then apply preference optimization to bias outputs toward CEFR A2-level English. Automatic and human evaluation shows that this approach improves readability, while maintaining strong adequacy. Our results indicate…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.CL, arXiv cs.AI.
  • Matched topical feeds: Post-training / Alignment — wide thematic overlap.
cs.CL cs.AI
#75
State Space Models 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv cs.NE (Neural & Evolutionary Computing) 6.1 6.2/6.3/5.5

This paper investigates efficient methods for utilizing text-only data to improve speech recognition, focusing on encoder-dominated models that facilitate faster recognition. We provide a comprehensive comparison of techniques to integrate text-only data, including modality matching and dynamic downsampling to reach text-level representations within the encoder. Our experiments on the LibriSpeech corpus show that a larger encoder with a smaller decoder can equal or surpass the performance of architectures with larger decoders. We demonstrate that simple configurations, such as random duration models, are often more effective than complex alternatives, significantly simplifying…

How it was discussed
  • Cross-listed in 3 arXiv categorical feeds: arXiv cs.CL, arXiv cs.AI, arXiv cs.NE.
cs.CL cs.AI cs.NE
#76
Interpretability 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision) · arXiv — Mechanistic Interpretability 6.1 6.2/6.3/5.5

Transformer-based architectures have established a dominant paradigm in global semantic perception; however, they remain fundamentally constrained by the profound spatial heterogeneity inherent in natural images. Specifically, the imposition of a uniform global receptive field across regions of varying information density inevitably leads to local feature degradation, particularly in dense conflict zones populated by microscopic targets. To address this mechanistic limitation, we propose ViCrop-Det, a training-free inference framework that introduces adaptive spatial trust region shrinkage. Inspired by the use of attention entropy in anomaly segmentation, ViCrop-Det leverages the detection decoder's cross-attention…
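The excerpt says ViCrop-Det takes inspiration from attention entropy. As a hedged illustration of that signal only (not the paper's formulation), the Shannon entropy of a normalized attention map separates diffuse attention from attention concentrated on a few cells:

```python
import numpy as np

def attention_entropy(attn):
    """Shannon entropy (nats) of a nonnegative attention map after
    normalization; high values mean attention is spread thin, low
    values mean it is concentrated on few cells."""
    p = np.asarray(attn, dtype=float).ravel()
    p = p / p.sum()
    p = p[p > 0]                    # 0 * log(0) := 0
    return float(-(p * np.log(p)).sum())
```

Uniform attention over n cells gives the maximum value log(n); a one-hot map gives 0.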

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.AI, arXiv cs.CV.
  • Matched topical feeds: Mechanistic Interpretability — wide thematic overlap.
cs.AI cs.CV
#77
Agents & Tool Use 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use · arXiv — Efficiency (Quantization, MoE, Inference) 6.1 6.5/6.2/5.5

Operating and maintaining (O&M) large-scale online engine systems (search, recommendation, advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. While LLM-based agents are a natural fit for these tasks, the deployment bottleneck is not reasoning capability but orchestration: selecting, for each operational event, the relevant data (metrics, logs, change events) and the applicable operational knowledge (handbook rules and practitioner experience). Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. We present Bian…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.AI.
  • Matched topical feeds: Agents / Tool Use, Efficiency (Quantization, MoE, Inference) — wide thematic overlap.
cs.AI
#78
Evaluations & Benchmarks 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 6.1 6.5/6.2/5.5

Open-vocabulary change detection aims to identify semantic changes in bi-temporal remote sensing images without predefined categories. Recent methods combine foundation models such as SAM, DINO and CLIP, but typically process each timestamp independently or interact only at the final comparison stage. Such paradigms suffer from insufficient temporal coupling during semantic reasoning, which limits their ability to distinguish genuine semantic changes from non-semantic appearance discrepancies. In addition, patch-dominant inference on high-resolution images often weakens global semantic continuity and produces fragmented change regions. To address these issues, we propose MemOVCD, a training-free…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.AI, arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.AI cs.CV
#79
Agents & Tool Use 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use · arXiv — Evals & Benchmarks 6.1 6.5/6.2/5.5

Recent advances in large language models and agentic frameworks have enabled virtual customer assistants (VCAs) for complex support. We present SecMate, a multi-agent VCA for cybersecurity troubleshooting that integrates device, user, and service specificity from conversational and device-level signals. Device specificity is provided by a lightweight local diagnostic utility, while user specificity relies on implicit proficiency inference and profile-aware troubleshooting. Service specificity is achieved through a proactive, context-aware recommender. We evaluate SecMate in a controlled study with 144 participants and 711 conversations. Device-level evidence increased correct resolutions from about 50%…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.AI.
  • Matched topical feeds: Agents / Tool Use, Evals & Benchmarks — wide thematic overlap.
cs.AI
#80
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — State Space Models · arXiv — Evals & Benchmarks 6.1 6.2/6.3/5.5

Temporal modeling remains a fundamental challenge in video understanding, particularly as sequence lengths scale. Traditional video models relying on dense spatiotemporal attention suffer from quadratic computational costs for long videos. To circumvent these costs, recent approaches adapt image models for videos via Parameter-Efficient Fine-Tuning (PEFT) methods such as adapters. However, deeply inserting these modules incurs prohibitive activation memory overhead during back-propagation. While recent efficient State Space Models (SSMs) introduce linear complexity, they disrupt 2D spatial relationships and rely on extensive masked pre-training to recover spatial awareness. To overcome these limitations,…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: State Space Models, Evals & Benchmarks — wide thematic overlap.
cs.CV
#81
Industry 2026-04-30 Hacker News 6.1 6.1/6.0/5.9

Hacker News discussion (113 points) — Claude.ai and API unavailable [fixed]. Visit the comments thread for community caveats and the linked article for primary reporting.

#82
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 6.0 5.5/7.4/5.0

Federated Unlearning (FU) is an emerging paradigm in Federated Learning (FL) that enables participating clients to fully remove their contributions from a trained global model, driven by data protection regulations that mandate the right to be forgotten. However, existing FU methods mostly rely on synchronous coordination. This requirement forces the entire federation to halt and wait for stragglers to complete erasure, creating significant delays due to device heterogeneity. Furthermore, these methods often face the problem that the influence of erased data is merely suppressed temporarily and resurfaces during subsequent training,…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.LG
#83
Efficiency 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Efficiency (Quantization, MoE, Inference) 6.0 5.8/7.0/5.0

Post-training quantization (PTQ) has become an important technique for reducing the inference cost of Large Language Models (LLMs). While recent mixed-precision methods improve ultra-low bit quantization by preserving critical subspaces in high precision, they typically construct these subspaces relying solely on activation statistics. This ignores the fundamental nature of linear operations, where the output perturbation is jointly driven by both activation and weight quantization noise. In this paper, we propose CoQuant, a joint weight-activation subspace projection method. By theoretically modeling the expected output error, CoQuant formulates a closed-form weighted PCA…
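The excerpt does not reproduce CoQuant's closed form, but the general shape of a weighted PCA, with per-sample weights standing in for the error-derived importance terms, can be sketched (all names here are illustrative, not from the paper):

```python
import numpy as np

def weighted_pca_subspace(X, w, k):
    """Top-k principal directions of the rows of X under nonnegative
    per-sample weights w (a stand-in for error-derived importances).
    X: (n, d) samples; w: (n,) weights; returns a (d, k) basis."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mu = w @ X                          # weighted mean
    Xc = X - mu
    cov = (Xc * w[:, None]).T @ Xc      # weighted covariance, (d, d)
    vals, vecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order]
```

The high-precision subspace is then spanned by the returned directions, with everything else quantized aggressively.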

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Efficiency (Quantization, MoE, Inference) — wide thematic overlap.
cs.LG
#84
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 6.0 5.8/7.6/4.5

Vision-language models have shown strong performance, but they often generalize poorly to specialized domains. While semi-supervised vision-language learning mitigates this limitation by leveraging a small set of labeled image-text pairs together with abundant unlabeled images, existing methods remain fundamentally pairwise and fail to model the global structure of multimodal representation manifolds. Existing topology-based alignment methods rely on persistence diagram matching, which neither guarantees geometric alignment nor utilizes the image-text pairing information central to vision-language learning. We propose Topology-Aware Multimodal Representation Alignment (ToMA), a framework that uses persistent homology to identify…

cs.LG
#85
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv stat.ML (Statistical ML) 5.9 5.5/6.9/5.0

We introduce Hyper Input Convex Neural Networks (HyCNNs), a novel neural network architecture designed for learning convex functions. HyCNNs combine the principles of Maxout networks with input convex neural networks (ICNNs) to create a neural network that is always convex in the input, is theoretically capable of leveraging depth, and performs reliably when trained at scale compared to ICNNs. Concretely, we prove that HyCNNs require exponentially fewer parameters than ICNNs to approximate quadratic functions up to a given precision. Across a series of synthetic experiments, we demonstrate that HyCNNs outperform existing…
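The excerpt does not give HyCNN's architecture, but the ICNN property it builds on is easy to demonstrate: nonnegative weights on previous activations plus a convex, nondecreasing activation keep the output convex in the input. A minimal numpy sketch (layer count and shapes are illustrative):

```python
import numpy as np

def icnn_scalar(x, Wx0, b0, Wz1, Wx1, b1):
    """Two-layer input-convex function of x.
    Convexity argument: z1 is convex in x (ReLU of an affine map); the
    second layer takes a nonnegative combination of convex terms
    (requires Wz1 >= 0 elementwise) plus an affine passthrough in x,
    then applies ReLU again -- all convexity-preserving steps."""
    z1 = np.maximum(Wx0 @ x + b0, 0.0)
    z2 = np.maximum(Wz1 @ z1 + Wx1 @ x + b1, 0.0)
    return float(z2.sum())
```

The midpoint inequality f((a+b)/2) ≤ (f(a)+f(b))/2 then holds for any inputs, which is what makes such networks usable for learning convex functions.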

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv stat.ML.
cs.LG stat.ML
#86
Efficiency 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Efficiency (Quantization, MoE, Inference) 5.9 5.5/7.0/5.0

Mixture-of-Experts (MoE) models offer high capacity with efficient inference cost by activating a small subset of expert models per input. However, deploying MoE models requires all experts to reside in memory, creating a gap between the resource used by activated experts and the provisioned resources. This underutilization is further pronounced in multi-tenant scenarios. In this paper, we propose FaaSMoE, a multi-tenant MoE serving architecture built on Function-as-a-Service (FaaS) platforms. FaaSMoE decouples the control and execution planes of MoE by deploying experts as stateless FaaS functions, enabling on-demand and scale-to-zero expert…
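The scale-to-zero idea hinges on sparse routing: only the experts the gate selects for a given input need a live function instance. A hedged sketch of standard top-k gating (FaaSMoE's actual router is not described in the excerpt):

```python
import numpy as np

def topk_route(gate_logits, k=2):
    """Select the k highest-scoring experts and softmax-renormalize
    their gate weights. In a FaaS deployment only the returned expert
    IDs need warm instances; the rest can stay scaled to zero."""
    idx = np.argsort(gate_logits)[::-1][:k]
    g = np.exp(gate_logits[idx] - gate_logits[idx].max())
    return idx, g / g.sum()
```

The gap the paper targets is exactly the difference between the k activated experts and the full expert set that conventional serving keeps resident in memory.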

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Efficiency (Quantization, MoE, Inference) — wide thematic overlap.
cs.LG
#87
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv stat.ML (Statistical ML) 5.9 5.5/6.9/5.0

Deep learning methods have proved highly effective for classification and image recognition problems. In this paper, we ask whether this success can be transferred to hypothesis testing: if a neural network can distinguish, for example, an image of a handwritten digit from another, can it also distinguish an "image of a sample" (such as a scatter plot) generated under a given statistical model from one generated outside that model? Motivated by this idea, we propose a novel procedure called deep-testing, which approaches the classical inferential problem of hypothesis testing through…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv stat.ML.
cs.LG stat.ML
#88
Frontier LLMs 2026-04-29 arXiv cs.CL (Computation & Language) 5.9 5.5/7.6/4.5

Differential Privacy (DP) for text has matured from disjointed word-level substitutions to contiguous sentence-level rewriting by leveraging the generative capacity of language models. While this form of text privatization is best suited for balancing formal privacy guarantees with grammatical coherence, its impact on the register identity of text remains largely unexplored. By conducting a multidimensional stylistic profiling of differentially-private rewriting, we demonstrate that the cost of privacy extends far beyond lexical variation. Specifically, we find that rewriting under privacy constraints induces a systematic functional mutation of the text's communicative signature. This…

cs.CL
#89
Robotic Autonomy 2026-04-29 arXiv cs.RO (Robotics) 5.9 5.8/7.4/4.5

Assisting humans in open-world outdoor environments requires robots to translate high-level natural-language intentions into safe, long-horizon, and socially compliant navigation behavior. Existing map-based methods rely on costly pre-built HD maps, while learning-based policies are mostly limited to indoor and short-horizon settings. To bridge this gap, we propose Walk with Me, a map-free framework for long-horizon social navigation from high-level human instructions. Walk with Me leverages GPS context and lightweight candidate points-of-interest from a public map API for semantic destination grounding and waypoint proposal. A High-Level Vision-Language Model grounds abstract instructions…

cs.RO
#90
Evaluations & Benchmarks 2026-04-29 arXiv cs.RO (Robotics) · arXiv — Evals & Benchmarks 5.9 5.8/6.7/5.0

Deploying a neuro-symbolic task planner on a new domain today requires significant manual effort: a domain expert must author relaxation and complementary rules, and hundreds of training problems must be solved to supervise a Graph Neural Network (GNN) object scorer. We propose LLM-Flax, a three-stage framework that eliminates all three sources of manual effort using a locally hosted LLM given only a PDDL domain file. Stage 1 automatically generates relaxation and complementary rules via structured prompting with format validation and self-correction. Stage 2 introduces LLM-guided failure recovery with a feasibility-gated…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.RO.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.RO
#91
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.9 5.8/6.8/5.0

Fine-grained RGBT image semantic segmentation is crucial for all-weather unmanned aerial vehicle (UAV) scene understanding. However, UAV RGBT semantic segmentation faces two coupled challenges: cross-modal spatial misalignment caused by sensor parallax and platform vibration, and severe semantic confusion among fine-grained ground objects under top-down aerial views. To address these issues, we propose a Graph-based Semantic Calibration Network (GSCNet) for unaligned UAV RGBT image semantic segmentation. Specifically, we design a Feature Decoupling and Alignment Module (FDAM) that decouples each modality into shared structural and private perceptual components and performs deformable alignment…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CV
#92
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.9 5.8/6.8/5.0

Existing 3D anomaly detection methods are built on a rigid prior: normal geometry is pose-invariant and can be canonicalized through registration or alignment. This prior does not hold for articulated objects with hinge or sliding joints, where valid pose changes induce structured geometric variations that cannot be collapsed to a single canonical template, causing pose-induced deformations to be misidentified as anomalies while true structural defects are obscured. No existing benchmark addresses this challenge. We introduce ArtiAD, the first large-scale benchmark for articulated 3D anomaly detection, comprising 15,229 point clouds across…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CV
#93
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.9 5.8/6.6/5.0

Recent methods demonstrate that large-scale pretrained models, such as CLIP vision transformers, effectively detect AI-generated images (AIGIs) from unseen generative models when used as feature extractors. Many state-of-the-art methods for AI-generated image detection build upon the original CLIP-ViT to enhance this generalization. Since CLIP's release, numerous vision foundation models (VFMs) have emerged, incorporating architectural improvements and different training paradigms. Despite these advances, their potential for AIGI detection and AI image forensics remains largely unexplored. In this work, we present a comprehensive benchmark across multiple VFM families, covering diverse pretraining objectives,…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CV
#94
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.9 5.8/6.6/5.0

Despite the rapid progress in data-driven 3D vision, aerial geometric 3D vision remains a formidable challenge due to the severe scarcity of large-scale, high-fidelity training data. Existing benchmarks, predominantly biased toward ground-level or object-centric views, do not account for complex viewpoint transformations and diverse environmental conditions in UAV-based sensing. To bridge this critical gap, we propose AirZoo, a unified large-scale dataset and benchmark for grounding aerial geometric 3D vision. AirZoo possesses three appealing properties: 1) Scalable Generation Pipeline: Leveraging freely available, world-scale photogrammetric 3D meshes, it renders vast outdoor environments…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CV
#95
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.9 5.8/6.8/5.0

Long-term video understanding requires interpreting complex temporal events and reasoning over procedural activities. While instructional video corpora, like HowTo100M, offer rich resources for model training, they present significant challenges, including noisy ASR transcripts and inconsistent temporal alignments between narration and visual content. In this work, we introduce an automated, training-free pipeline to extract high-quality procedural annotations from in-the-wild instructional videos. Our approach segments videos into coherent shots, filters poorly aligned content, and leverages state-of-the-art multimodal and large language models (Qwen2.5-VL and DeepSeek-R1) to generate structured, temporally grounded procedural steps. This…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CV
#96
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.9 5.8/6.6/5.0

Hyperspectral imaging (HSI) semantic segmentation typically relies on in-domain training, but limited data availability often restricts model performance in real-world applications. Current approaches to leverage foundation models in proximal sensing use cross-modality techniques, bridging RGB and HSI to exploit vision foundation models. However, these methods either discard spectral information or introduce architectural complexity. We propose cross-domain transfer as an alternative, reusing HSI foundation models - originally trained in remote sensing - for proximal sensing applications. By eliminating the need to bridge modality gaps, our approach preserves spectral information while maintaining…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CV
#97
AI Coding 2026-04-30 Simon Willison · Hacker News 5.9 6.0/6.7/5.0

Zig has one of the most stringent anti-LLM policies of any major open source project: No LLMs for issues. No LLMs for pull requests. No LLMs for comments on the bug tracker, including translation. English is encouraged, but not required. You are welcome to post in your native language and rely on others to have their own translation tools of choice to interpret your words. The most prominent project written in Zig may be the Bun JavaScript runtime, which was…

How it was discussed
  • Simon Willison reported it.
  • Hacker News reported it.
#98
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 5.8 5.8/6.3/5.0

The Alternating Direction Method of Multipliers (ADMM) is a widely used method for structured convex optimization, and its practical performance depends strongly on the choice of penalty and relaxation parameters. Motivated by settings such as Model Predictive Control (MPC), where one repeatedly solves related optimization problems with fixed structure and changing parameter values, we propose learning online updates of the relaxation parameter to improve performance on problem classes of interest. This choice is computationally attractive in OSQP-like architectures, since adapting relaxation does not trigger the matrix refactorizations associated with penalty…
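To see where the relaxation parameter enters, here is plain relaxed ADMM on a toy scalar lasso split; the problem and all constants are illustrative (the paper's setting is OSQP-like QPs with learned online updates of this parameter):

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding: proximal operator of t * |.|."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso_scalar(a, lam, rho=1.0, alpha=1.6, iters=200):
    """Relaxed ADMM for min_x 0.5*(x - a)^2 + lam*|x|, split as x = z.
    alpha is the over-relaxation parameter (alpha = 1.0 is plain ADMM);
    it blends the fresh x-update with the previous z before the z and
    dual updates, which is what the paper proposes to adapt online."""
    x = z = u = 0.0
    for _ in range(iters):
        x = (a + rho * (z - u)) / (1.0 + rho)  # prox of the smooth term
        xh = alpha * x + (1.0 - alpha) * z     # relaxation step
        z = soft(xh + u, lam / rho)            # prox of lam * |.|
        u = u + xh - z                         # dual ascent
    return float(z)
```

The solution of this toy problem is the soft-threshold of a by lam, which makes the iteration easy to check; the computational point in the abstract is that changing alpha, unlike rho, requires no matrix refactorization.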

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.LG
#99
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 5.8 5.8/6.3/5.0

Norway's electricity market is heavily dominated by hydropower, but the 2021--2022 energy crisis and stronger integration with Continental Europe have fundamentally altered price formation, reducing the reliability of forecasting models calibrated on historical data. Despite the critical need for updated models, a unified benchmark evaluating feature contributions across all structurally diverse Norwegian bidding zones remains lacking. Here we present a comprehensive evaluation of electricity price forecasting across all five Norwegian Nord Pool bidding zones. We constructed a multimodal hourly dataset spanning 2019--2025 and evaluated eight forecasting model families including LightGBM,…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.LG
#100
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) 5.8 5.8/6.3/5.0

Large Language Model (LLM)-based agents exhibit systemic failures in compositional generalization, limiting their robustness in interactive environments. This work introduces AGEL-Comp, a neuro-symbolic AI agent architecture designed to address this challenge by grounding actions of the agent. AGEL-Comp integrates three core innovations: (1) a dynamic Causal Program Graph (CPG) as a world model, representing procedural and causal knowledge as a directed hypergraph; (2) an Inductive Logic Programming (ILP) engine that synthesizes new Horn clauses from experiential feedback, grounding symbolic knowledge through interaction; and (3) a hybrid reasoning core where an…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv cs.AI.
cs.LG cs.AI
#101
Government & Defense 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.CV (Computer Vision) 5.8 5.8/6.3/5.0

One of the most exciting applications of vision models involves pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present a framework that learns pixel-accurate feature descriptors from videos, LILA. The core element of our training framework is linear in-context learning. LILA leverages…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv cs.CV.
cs.LG cs.CV
#102
AI Coding 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv — Evals & Benchmarks 5.8 5.8/6.3/5.0

LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CL.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CL
#103
Frontier LLMs 2026-04-29 arXiv cs.CL (Computation & Language) 5.8 5.8/7.0/4.5

Effective mental health counseling is a complex, theory-driven process requiring the simultaneous integration of psychological frameworks, real-time distress signals, and strategic intervention planning. This level of clinical reasoning is critical for safety and therapeutic effectiveness but is often missing in general-purpose Large Language Models (LLMs). We introduce SAGE (Strategy-Aware Graph-Enhanced), a novel framework designed to bridge the gap between structured clinical knowledge and generative AI. SAGE constructs a heterogeneous graph that unifies conversational dynamics with a psychologically grounded layer, explicitly anchoring interactions in a theory-driven lexicon. Our architecture first employs…

cs.CL
#104
Evaluations & Benchmarks 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv — Evals & Benchmarks 5.8 5.8/6.3/5.0

Speech Sound Disorders (SSD) affect roughly five percent of children, yet speech-language pathologists face severe staffing shortages and unmanageable caseloads. We test a hierarchical approach to SSD classification on the granular multi-task SLPHelmUltraSuitePlus benchmark. We propose a cascading approach from binary classification to type and symptom classification. By fine-tuning Speech Representation Models (SRMs) and using targeted data augmentation, we mitigate biases found by previous works and improve upon all clinical tasks in the benchmark. We also apply our data augmentation approach to Automatic Speech Recognition (ASR). Our results demonstrate that…
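The cascade itself is a simple control-flow pattern: later, finer-grained stages run only when the earlier stage fires. A schematic sketch with hypothetical stage classifiers (the paper's actual stages are fine-tuned speech representation models):

```python
def cascade_classify(x, gate, type_clf, symptom_clf):
    """Hierarchical cascade: binary gate -> disorder type -> symptom.
    gate, type_clf, symptom_clf are placeholder callables standing in
    for the three fine-tuned models."""
    if not gate(x):
        return {"disordered": False}
    t = type_clf(x)
    return {"disordered": True, "type": t, "symptom": symptom_clf(x, t)}
```

One design consequence: errors at the binary gate propagate to every downstream stage, which is why the gate usually gets the most attention in such pipelines.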

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CL.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CL
#105
Evaluations & Benchmarks 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv — Evals & Benchmarks 5.8 5.8/6.3/5.0

LLMs and speech assistants are increasingly used for task-oriented interactions, yet their evaluation often relies on controlled scenarios that fail to capture the variability and complexity of real user requests. Drink ordering, for example, involves diverse named entities, drink types, sizes, customizations, and brand-specific terminology, as well as spontaneous speech phenomena such as hesitations and self-corrections. To address this gap, we introduce StarDrinks, a test set in English and Korean containing speech utterances, transcriptions, and annotated slots. Our dataset supports speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR evaluation, providing…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CL.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CL
#106
Evaluations & Benchmarks 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv — Evals & Benchmarks 5.8 5.8/6.3/5.0

Stylistic personalization - making LLMs write in a specific individual's style, rather than merely adapting to task preferences - lacks evaluation grounded in authorship science. We show that grounding evaluation in authorship verification theory transforms what benchmarks can measure. Drawing on three measurement traditions - LUAR, a trained authorship verification model; an LLM-as-judge with decoupled trait matching; and classical function-word stylometrics - we evaluate four inference-time personalization methods across 50 authors and 1,000 generations. The theory-grounded metric, LUAR, provides what ad hoc alternatives cannot: calibrated baselines, with a human ceiling…
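Of the three measurement traditions, function-word stylometry is the simplest to sketch. An illustrative profile-and-compare routine (the word list and the similarity measure are illustrative choices, not the paper's):

```python
from collections import Counter
import math

# Illustrative subset; real stylometric lists run to hundreds of words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]

def fw_profile(text):
    """Relative function-word frequencies: topic-neutral and hard to
    fake, the classical authorship signal."""
    toks = text.lower().split()
    counts = Counter(toks)
    n = max(len(toks), 1)
    return [counts[w] / n for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two profiles (0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Comparing a generation's profile against the target author's profile gives a cheap baseline that trained verifiers like LUAR are evaluated against.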

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CL.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CL
#107
Robotic Autonomy 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv cs.RO (Robotics) 5.8 5.5/6.7/5.0

Skill libraries in deployed robotic systems are continually updated through fine-tuning, fresh demonstrations, or domain adaptation, yet existing typed-composition methods (BLADE, SymSkill, Generative Skill Chaining) treat the library as frozen at test time and do not analyze how composition outcomes change when a skill is replaced. We introduce a paired-sampling cross-version swap protocol on robosuite manipulation tasks to characterize this dimension of compositional skill learning. On a dual-arm peg-in-hole task we discover a dominant-skill effect: one ECM achieves 86.7% atomic success rate while every other ECM is at or below…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.AI, arXiv cs.RO.
cs.AI cs.RO
#108
Multimodal 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision) 5.8 5.5/6.9/5.0

The bottleneck in learning-based industrial defect detection is often limited not by model capacity, but by the scarcity of labeled defect data: defects are rare, annotations are expensive, and collecting balanced training sets is slow. We present an end-to-end pipeline for synthetic defect generation and annotation, combining Vision-Language-Model-based prompts, LoRA-adapted diffusion, mask-guided inpainting, and sample filtering with automatic label derivation, and demonstrate the potential of augmenting real data with realistic synthetic samples to overcome data scarcity. The evaluation is conducted on a challenging dataset of pitting defects on ball screw drives,…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.AI, arXiv cs.CV.
cs.AI cs.CV
#109
Research 2026-04-29 arXiv cs.AI (Artificial Intelligence) 5.8 5.8/6.9/4.5

Alignment faking (AF) occurs when an LLM strategically complies with training objectives to avoid value modification, reverting to prior preferences once monitoring is lifted. Current detection methods focus on conversational settings and rely primarily on Chain-of-Thought (CoT) analysis, which provides a reliable signal when strategic reasoning surfaces, but cannot distinguish deception from capability failures if traces are absent or unfaithful. We formalize AF as a composite behavioural event and detect it through observable tool selection, where the LLM selects the safe tool when unmonitored, but switches to the unsafe tool…

cs.AI
#110
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.8 5.5/6.6/5.0

Depth ambiguity and joint uncertainty are the two main obstacles to obtaining accurate human pose predictions with the 2D-to-3D lifting methods proposed in the literature. In particular, these issues arise because 2D joint locations can map to multiple 3D positions, inducing multiple possible final poses. Following these considerations, we propose leveraging the generation capability of diffusion-based models to predict multiple hypotheses and aggregate them into a final accurate pose. Therefore, we introduce SnapPose3D, a pose-lifting framework trained deterministically to denoise 3D poses conditioned on both visual context and 2D pose…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CV
#111
Post-Training 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Post-training / Alignment 5.8 5.5/6.8/5.0

Accurate quantification of the geometry of curvilinear biological structures is essential for understanding cellular mechanics and disease-related morphological alterations. Microtubule curvature is a key descriptor of filament rigidity and mechanical perturbations. However, reliable curvature extraction from fluorescence microscopy images remains challenging due to noise, low contrast, and partial filament visibility. Existing approaches rely on segmentation pipelines with pre- or post-processing, which are highly sensitive to segmentation errors and often fail under adverse imaging conditions. In this work, we propose MTCurv, a deep learning framework for direct, segmentation-free regression of microtubule…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Post-training / Alignment — wide thematic overlap.
cs.CV
#112
Multimodal 2026-04-29 arXiv cs.CV (Computer Vision) 5.8 5.5/7.3/4.5

Adversarial Training (AT) is one of the most effective methods for developing robust deep neural networks (DNNs). However, AT faces a trade-off problem between clean accuracy and adversarial robustness. In this work, we reveal a surprising phenomenon for the first time: Varying input perturbation intensities for training samples near decision boundaries in AT have minimal impact on model robustness. This finding directly exposes the inconsistency between accuracy and robustness score fluctuations, leading us to identify the misalignment between input and latent spaces as a critical driver of the robustness-accuracy trade-off.…

cs.CV
#114
Safety, Policy & Regulation 2026-04-29 MIT Tech Review 5.8 6.0/6.9/4.5

Today, nuclear energy enjoys a rare moment of support across the political spectrum in the US. Interest from tech companies that are scrambling to meet demand for massive data centers has sparked a resurgence of money and attention in the industry. That newfound interest is exactly why it’s time to talk about an old problem: nuclear waste. In the US alone, nuclear reactors produce about 2,000 metric tons of high-level waste each year. And there’s nowhere to put it. Though…

#119
Government & Defense 2026-04-29 DefenseScoop 5.8 6.5/6.2/4.5

I co-founded Kessel Run, the Department of War’s (DoW) first software factory, with a simple mission: to continuously deliver valuable software that warfighters love. At our peak, we deployed five applications from concept to operations in an average of 124 days, reducing target development timelines by 85%. Section 31, the U.S. Space Force’s first software factory, deployed eight applications to operations in an average of 64 days and reduced conjunction analysis from three hours to 15 minutes. These outcomes…

#120
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv stat.ML (Statistical ML) 5.7 5.5/6.3/5.0

In Orabona and Pál [2016], we introduced the shifted KT potentials to remove the $\ln \ln T$ factor in the parameter-free learning-with-experts bound. In this short technical note, I show that this is equivalent to changing the prior in the Krichevsky--Trofimov algorithm. Then, I show how to use the same idea to remove the $\ln \ln T$ factor in the data-independent bound for the Squint algorithm.

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv stat.ML.
cs.LG stat.ML
#121
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv stat.ML (Statistical ML) 5.7 5.5/6.3/5.0

Learning curves are a fundamental primitive in supervised learning, describing how an algorithm's performance improves with more data and providing a quantitative measure of its generalization ability. Formally, a learning curve plots the decay of an algorithm's error for a fixed underlying distribution as a function of the number of training samples. Prior work on revenue-maximizing learning algorithms, starting with the seminal work of Cole and Roughgarden [STOC, 2014], adopts a distribution-free perspective, which parallels the PAC learning framework in learning theory. This approach evaluates performance against the hardest possible…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv stat.ML.
cs.LG stat.ML
#122
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv stat.ML (Statistical ML) 5.7 5.5/6.3/5.0

We prove pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with MultiLayer Perceptron (MLP) blocks to a continuous-time stochastic interacting particle system. We also identify the stochastic partial differential equation describing the evolution of the tokens' distribution in this limit and prove propagation of chaos when the number of such tokens is large. The bounds we establish are quantitative and the limits we consider commute. We further prove that the limiting stochastic model displays synchronization by noise and establish exponential dissipation of the interaction…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv stat.ML.
cs.LG stat.ML
#123
Government & Defense 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.CL (Computation & Language) 5.7 5.5/6.3/5.0

Patient portals now give individuals direct access to their electronic health records (EHRs), yet access alone does not ensure patients understand or act on the complex clinical information contained in these records. The ArchEHR-QA 2026 shared task addresses this challenge by focusing on grounded question answering over EHRs, and this paper presents the system developed by the HealthNLP_Retrievers team for this task. The proposed approach uses a multi-stage cascaded pipeline powered by the Gemini 2.5 Pro large language model to interpret patient-authored questions and retrieve relevant evidence from lengthy clinical…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv cs.CL.
cs.LG cs.CL
#124
Multimodal 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.CV (Computer Vision) 5.7 5.5/6.3/5.0

We present KAYRA, an end-to-end karyotyping system that operates inside the operational constraints of a clinical cytogenetic laboratory. KAYRA is architected as a containerized microservice pipeline whose ML stack combines an EfficientNet-B5 + U-Net semantic segmenter, a Mask R-CNN (ResNet-50 + FPN) instance detector, and a ResNet-18 classifier, orchestrated through a cascaded ROI-narrowing strategy that focuses each downstream model on the chromosome-bearing region. The same container images are deployed both as a cloud service and as an on-premise installation, supporting clinical environments where patient-data egress is not permitted as well…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv cs.CV.
cs.LG cs.CV
#125
Efficiency 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Efficiency (Quantization, MoE, Inference) 5.7 5.5/6.3/5.0

Long-context LLM serving is bottlenecked by the cost of attending over ever-growing KV caches. Dynamic sparse attention promises relief by accessing only a small, query-dependent subset of the KV state per decoding step and extending the KV storage to CPU memory. In practice, however, these algorithmic savings rarely translate into end-to-end system-level gains because sparse methods typically operate at different granularities and thus rely on ad hoc, per-algorithm implementations. At the same time, hierarchical KV storage introduces a new systems bottleneck: retrieving fine-grained, irregular KV subsets across the GPU-CPU boundary…
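The per-step "query-dependent subset" idea can be sketched in a few lines of NumPy. This is a generic top-k sparse attention decode step under my own assumptions, not the paper's system; as the comment notes, a real implementation would use an index so the full CPU-resident cache is never scanned:

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """One decode step attending only to the k highest-scoring cached keys.
    Scores are computed densely here for clarity; a production system
    would estimate them so the whole KV cache is never touched."""
    scores = K @ q / np.sqrt(q.shape[0])
    idx = np.argpartition(scores, -k)[-k:]        # query-dependent KV subset
    s = scores[idx] - scores[idx].max()
    w = np.exp(s)
    w /= w.sum()
    return w @ V[idx]

rng = np.random.default_rng(0)
T, d = 64, 16
q, K, V = rng.normal(size=d), rng.normal(size=(T, d)), rng.normal(size=(T, d))
sparse_out = topk_sparse_attention(q, K, V, k=8)
dense_out = topk_sparse_attention(q, K, V, k=T)   # k = T recovers dense attention
```

The systems point in the abstract is that `V[idx]` becomes an irregular gather across the GPU-CPU boundary once the cache spills to host memory.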

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Efficiency (Quantization, MoE, Inference) — wide thematic overlap.
cs.LG
#126
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 5.7 5.5/6.3/5.0

We present a quantum feature-selection framework based on a higher-order unconstrained binary optimization (HUBO) formulation that explicitly incorporates multivariate dependencies beyond standard quadratic encodings. In contrast to QUBO-based approaches, the proposed model includes one-, two-, and three-body interaction terms derived from mutual-information measures, enabling the objective function to capture feature relevance, pairwise redundancy, and higher-order statistical structure within a unified energy model. To suppress trivial all-selected solutions, we further include structured linear penalties that promote sparsity while preserving informative variables. The resulting HUBO instances are optimized with digitized counterdiabatic quantum…
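A HUBO objective with one-, two-, and three-body terms is easy to write down explicitly. The sketch below uses my own coefficient conventions (negative one-body weight = relevant feature, positive pair/triple weights = redundancy, `lam` = sparsity penalty); the paper's actual mutual-information-derived coefficients are not in the excerpt:

```python
import numpy as np

def hubo_energy(z, h, J2, J3, lam):
    """Energy of a 0/1 selection vector z with one-, two-, and three-body
    interaction terms plus a linear sparsity penalty. Coefficient signs
    are illustrative assumptions, not the paper's encoding."""
    e = float(h @ z) + lam * float(z.sum())       # relevance + sparsity
    for (i, j), c in J2.items():                  # pairwise redundancy
        e += c * z[i] * z[j]
    for (i, j, k), c in J3.items():               # higher-order redundancy
        e += c * z[i] * z[j] * z[k]
    return e

h = np.array([-1.0, -1.0, 0.0])                   # per-feature relevance
J2 = {(0, 1): 0.5}
J3 = {(0, 1, 2): 2.0}
z = np.array([1, 1, 0])
e = hubo_energy(z, h, J2, J3, lam=0.1)
```

Minimizing this energy over all 0/1 vectors is the optimization the paper hands to a digitized counterdiabatic quantum solver.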

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.LG
#127
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) 5.7 5.5/6.3/5.0

The Probabilistic Transformer (PT) establishes that the Transformer's self-attention plus its feed-forward block is mathematically equivalent to Mean-Field Variational Inference (MFVI) on a Conditional Random Field (CRF). Under this equivalence the Transformer ceases to be a black-box neural network and becomes a programmable factor graph: graph topology, factor potentials, and the message-passing schedule are all explicit and inspectable primitives that can be engineered. PT was originally developed for natural language and in this report we investigate its potential for time series. We first lift PT into the Spatial-Temporal Probabilistic Transformer…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv cs.AI.
cs.LG cs.AI
#128
Efficiency 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Efficiency (Quantization, MoE, Inference) 5.7 5.5/6.3/5.0

Municipal solid waste incineration is increasingly central to urban waste management, yet its sustainability benefit depends on controlling carbon emissions and multiple air pollutants under highly heterogeneous operating conditions. Current data-driven models are often accurate within individual plants but are difficult to transfer across facilities, limiting their value for scalable emission-control strategies. Here we show that multi-site emission behaviour can be represented through transferable system-level structures when physical constraints, operating-regime heterogeneity and carbon--pollutant coupling are jointly considered. We develop a physics-informed transfer learning framework built on a carbon--pollutant mixture-of-experts model,…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Efficiency (Quantization, MoE, Inference) — wide thematic overlap.
cs.LG
#129
Efficiency 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Efficiency (Quantization, MoE, Inference) 5.7 5.5/6.3/5.0

Dynamic quantization has emerged as a practical approach to increasing the utilization and efficiency of the machine learning serving flow. Unlike static quantization, which applies quantization offline, dynamic quantization operates on tensors at run-time, adapting its parameters to the actual input data. Today's mainstream machine learning frameworks, including ML compilers and inference engines, frequently recommend dynamic quantization as an initial step for optimizing model serving. This is because dynamic quantization can significantly reduce memory usage and computational load, leading to faster token generation and improved model serving efficiency without substantial loss…
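The static/dynamic distinction the abstract draws comes down to when the scale is computed. A minimal NumPy sketch of symmetric per-tensor int8 dynamic quantization, with the scale derived from the actual input at run-time rather than fixed offline (this is a generic illustration, not the paper's scheme):

```python
import numpy as np

def dynamic_quantize_int8(x):
    """Symmetric per-tensor int8 quantization whose scale is computed
    at run-time from the input values (the defining property of
    dynamic quantization)."""
    amax = float(np.max(np.abs(x)))
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
q, scale = dynamic_quantize_int8(x)
max_err = float(np.max(np.abs(dequantize(q, scale) - x)))
```

Because the scale tracks each tensor, rounding error stays bounded by half a quantization step even when activation ranges shift between requests.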

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Efficiency (Quantization, MoE, Inference) — wide thematic overlap.
cs.LG
#130
Robotic Autonomy 2026-04-29 arXiv cs.LG (Machine Learning) 5.7 5.5/7.0/4.5

Safety-critical prediction systems, such as autonomous vehicles, weather forecasters, and medical monitors, commonly rely on probabilistic forecasters. These forecasters make predictions about possible future outcomes, and their quality and robustness need to be validated and certified. Often, only accuracy -- the mean of the predictions -- is evaluated against true outcomes. However, for safety-critical scenarios and decision making under uncertainty, the full distributional properties of the forecasts should be checked: do the observed prediction errors actually follow the forecasted probability distributions? To this end, we introduce a framework for calibration…
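One standard way to check "do the errors follow the forecasted distributions?" is the probability integral transform (PIT): for a calibrated forecaster, the forecast CDF evaluated at the outcomes is uniform on [0, 1]. The sketch below illustrates this classical check for Gaussian forecasts; whether the paper's framework uses PIT is not stated in the excerpt:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
mu = rng.normal(size=2000)            # forecast means
sigma = np.ones(2000)                 # forecast standard deviations
y = rng.normal(mu, sigma)             # outcomes actually drawn from the forecasts

# PIT: for a calibrated Gaussian forecaster, Phi((y - mu) / sigma)
# should be uniform on [0, 1].
pit = 0.5 * (1.0 + np.vectorize(erf)((y - mu) / (sigma * sqrt(2.0))))

# Empirical coverage of the central 80% prediction interval.
coverage_80 = float(np.mean((pit > 0.1) & (pit < 0.9)))
```

A forecaster that is accurate on the mean but overconfident in its spread would pass an accuracy check yet show central-interval coverage well below the nominal level.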

cs.LG
#131
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) 5.7 5.5/6.3/5.0

Accurate end-to-end tail-latency forecasting is critical for proactive SLO management in microservice systems. However, modeling long-range dependency propagation and non-stationary, bursty workloads while maintaining inference efficiency at scale remains challenging. We present STLGT (Scalable Trace-based Linear Graph Transformer), a per-API predictor that encodes traces as span graphs for multi-step p95 tail-latency forecasting. STLGT uses a structure-aware linear graph Transformer to propagate cross-service dependencies with inference time linear in span graph size, and a decoupled temporal module to capture workload dynamics. Across a personalized education microservice application, DeathStarBench, and Alibaba traces,…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.LG, arXiv cs.AI.
cs.LG cs.AI
#132
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.7 5.5/7.0/4.5

Runtime monitoring is essential to ensure the safety of ML applications in safety-critical domains. However, current research is fragmented, with independent methods emerging from different communities. In this paper, we propose a unified framework categorising runtime monitoring approaches into three distinct types: Operational Design Domain (ODD) monitoring, which ensures compliance with expected operating conditions; Out-of-Distribution (OOD) monitoring, which rejects inputs that deviate from the training data; and Out-of-Model-Scope (OMS) monitoring, which detects anomalous model behaviour based on its internal states or outputs. We demonstrate the benefits of this categorization with a…

cs.LG
#133
Evaluations & Benchmarks 2026-04-29 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks 5.7 5.5/6.3/5.0

Federated Split Learning has been identified as an efficient approach to address the computational resource constraints of clients in classical federated learning, while guaranteeing data privacy for distributed model training across data owners. However, it faces some critical challenges when such a training strategy meets large language models (LLMs) for fine-tuning. These include setting the cut layer adaptively across different clients to address data and device heterogeneity, which significantly affects system performance. In addition, efficiently reducing the communication overhead during the fine-tuning procedure is another challenge.…
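The "cut layer" idea is simple to show concretely: the client runs the first few layers and ships the intermediate ("smashed") activations to the server, which runs the rest. A minimal NumPy sketch with a toy MLP (layer sizes and the per-client cut choice are illustrative assumptions):

```python
import numpy as np

def forward(layers, x):
    """Toy MLP forward pass over a list of weight matrices."""
    for W in layers:
        x = np.tanh(x @ W)
    return x

rng = np.random.default_rng(0)
widths = [8, 16, 16, 4]
layers = [rng.normal(size=(a, b)) for a, b in zip(widths[:-1], widths[1:])]
x = rng.normal(size=(2, 8))

cut = 1                                     # chosen per client, e.g. by its compute budget
smashed = forward(layers[:cut], x)          # client-side activations sent to the server
out_split = forward(layers[cut:], smashed)  # server completes the forward pass
out_full = forward(layers, x)               # identical to the unsplit computation
```

A deeper cut keeps more computation on the client but shrinks what must cross the network, which is exactly the heterogeneity trade-off the abstract describes.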

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.LG.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.LG
#134
Frontier LLMs 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) 5.7 5.5/6.3/5.0

We introduce HalluCiteChecker, a toolkit for detecting and verifying hallucinated citations in scientific papers. While AI assistant technologies have transformed the academic writing process, including citation recommendation, they have also led to the emergence of hallucinated citations that do not correspond to any existing work. Such citations not only undermine the credibility of scientific papers but also impose an additional burden on reviewers and authors, who must manually verify their validity during the review process. In this study, we formalize hallucinated citation detection as an NLP task and provide a…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.CL, arXiv cs.AI.
cs.CL cs.AI
#135
Frontier LLMs 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) 5.7 5.5/6.3/5.0

Trust in clinical artificial intelligence (AI) cannot be reduced to model accuracy, fluency of generation, or overall positive user impression. In medicine, trust must be engineered as a measurable system property grounded in evidence, supervision, and operational boundaries of AI autonomy. This article proposes a practical framework for trustworthy clinical AI built around three principles: evidence, supervision, and staged autonomy. Rather than replacing deterministic clinical logic wholesale with end-to-end black-box models, the proposed approach combines a deterministic core, a patient-specific AI assistant for contextual validation, a multi-tier model escalation mechanism,…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.CL, arXiv cs.AI.
cs.CL cs.AI
#136
Frontier LLMs 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) 5.7 5.5/6.3/5.0

Generating sports game reports from structured tables is a complex table-to-text task that demands both precise data interpretation and fluent narrative generation. Traditional model-based approaches require large, annotated datasets, while prompt-based methods using large language models (LLMs) often struggle with hallucination due to weak table comprehension. To overcome these challenges, we propose Tree-of-Text, a tree-structured prompting framework that guides LLMs through a three-stage generation process: (1) Content Planning, where relevant operations and arguments are selected from the input tables; (2) Operation Execution, which breaks down large tables into manageable sub-tables;…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.CL, arXiv cs.AI.
cs.CL cs.AI
#137
Efficiency 2026-04-29 arXiv cs.CL (Computation & Language) · arXiv — Efficiency (Quantization, MoE, Inference) 5.7 5.5/6.3/5.0

Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes this decay to train-inference mismatch and proposes test-time training (TTT) as a remedy, yet we observe that long-range decay persists even in TTT-trained drafters. We revisit long-range decay from the perspective of context information preservation. In hidden-state reuse, we argue the target hidden state acts as a biased context compression: it aggregates historical token information according to the attention query at the current position, yielding a…
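For readers new to speculative decoding, the basic draft-then-verify loop (here the greedy variant, not the paper's hidden-state-reuse drafter) looks like this; `draft_next` and `target_next` are hypothetical stand-ins for the two models' next-token calls:

```python
def speculative_step(draft_next, target_next, prefix, k):
    """Greedy speculative decoding: draft k tokens cheaply, verify against
    the target model, keep the longest agreeing prefix plus one target
    token. Verification is shown token-by-token for clarity; in practice
    it is a single batched target forward pass."""
    ctx, draft = list(prefix), []
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    ctx, out = list(prefix), []
    for t in draft:
        tgt = target_next(ctx)
        if tgt != t:
            out.append(tgt)               # first disagreement: take target's token
            return out
        out.append(t)
        ctx.append(t)
    out.append(target_next(ctx))          # all drafts accepted: free bonus token
    return out
```

The long-range decay the abstract analyzes is the tendency of draft accuracy to fall as the speculative step index grows, which shortens the accepted prefix and erodes the speedup.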

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CL.
  • Matched topical feeds: Efficiency (Quantization, MoE, Inference) — wide thematic overlap.
cs.CL
#138
Evaluations & Benchmarks 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks 5.7 5.8/6.2/5.0

Standard density functional theory (DFT) routinely misclassifies the electronic ground state of correlated and structurally complex compounds, predicting metallic behaviour for materials that experiments report as semiconductors. Each such mismatch encodes a specific non-ideality -- magnetic ordering, electron correlation, an alternative polymorph, or a defect -- that the calculation excluded, but extracting that signal at scale has remained a manual exercise. Here we introduce XDFT, a closed-loop agent that diagnoses the mismatch automatically: it draws candidate hypotheses from a curated catalogue, executes the corresponding first-principles tests, and updates a global…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.AI.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.AI
#139
Evaluations & Benchmarks 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks 5.7 5.8/6.2/5.0

Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at the cost of increased computation, yet often exhibit diminishing returns on hard problems. We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time. Based on this insight, we propose a training-free framework that formulates test-time scaling as an instance-level routing problem, rather…
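The routing signal described here, agreement among sampled answers as a difficulty proxy, can be sketched in a few lines. The threshold and escalation target are illustrative assumptions, not the paper's routing policy:

```python
from collections import Counter

def route_by_disagreement(sampled_answers, threshold=0.6):
    """Agreement rate among k sampled answers as a difficulty proxy:
    high agreement -> accept the majority answer cheaply; low agreement
    -> escalate to a costlier strategy (more samples, tree search, ...)."""
    answer, votes = Counter(sampled_answers).most_common(1)[0]
    agreement = votes / len(sampled_answers)
    strategy = "accept" if agreement >= threshold else "escalate"
    return strategy, answer, agreement
```

The appeal of this signal is that it is training-free: it reuses samples the system would draw anyway and spends extra compute only where disagreement indicates the instance is hard.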

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.AI.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.AI
#140
Evaluations & Benchmarks 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks 5.7 5.8/6.2/5.0

As Competency-Based Education (CBE) gains traction around the world, the shift from marks-based assessment to qualitative competency mapping imposes a heavy manual burden on educators. This paper tackles that bottleneck by proposing a "Human-in-the-Loop" benchmarking framework to assess the effectiveness of multiple LLMs in automating secondary-level mathematics assessment. Based on the Grade 10 Optional Mathematics curriculum in Nepal, we created a multi-dimensional rubric for four topics and four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation. The multi-provider ensemble consisted of open-weight models -- Eagle (Llama 3.1-8B)…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.AI.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.AI
#141
Evaluations & Benchmarks 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks 5.7 5.8/6.2/5.0

Technology mapping is a critical yet challenging stage in logic synthesis. While Large Language Models (LLMs) have been applied to generate optimization scripts, their potential for core algorithm enhancement remains untapped. We introduce MappingEvolve, an open-source framework that pioneers the use of LLMs to directly evolve technology mapping code. Our method abstracts the mapping process into distinct optimization operators and employs a hierarchical agent-based architecture, comprising a Planner, Evolver, and Evaluator, to guide the evolutionary search. This structured approach enables strategic and effective code modifications. Experiments show our method significantly…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.AI.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.AI
#142
Research 2026-04-29 arXiv cs.AI (Artificial Intelligence) 5.7 5.8/6.7/4.5

Multi-agent deliberation systems using large language models (LLMs) are increasingly proposed for policy simulation, yet they suffer from artificial consensus: evaluator agents converge on the same option regardless of their assigned value perspectives. We present the AI Council, a three-phase deliberation framework, and conduct 120 deliberations across two policy scenarios to test two interventions. First, architectural heterogeneity (assigning a different 7-9B parameter model to each value perspective) significantly reduces first-choice concentration compared to a homogeneous baseline (child welfare: 70.9% to 46.1%, p < 0.001, r = 0.58; housing: 46.0% to…

cs.AI
#143
Research 2026-04-29 arXiv cs.AI (Artificial Intelligence) 5.7 5.5/6.9/4.5

The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which often exceed available device memory. Although NVMe-based offloading offers scalable capacity, existing file-based designs rely heavily on the kernel page cache, leading to cache thrashing, unpredictable latency, and high software overhead under memory pressure. We present DUAL-BLADE, a dual-path KV residency framework that dynamically assigns KV tensors to either a page-cache path or an NVMe-direct path based on runtime memory availability.…
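The dual-path routing decision can be sketched as a small state machine: route KV tensors through the kernel page cache when memory is plentiful, switch to a direct NVMe path under pressure, with hysteresis so the system does not flap at the boundary. The thresholds and hysteresis policy below are my own assumptions; the paper's actual runtime policy is not described in the excerpt:

```python
class KVPathRouter:
    """Route KV tensors between the kernel page cache and an O_DIRECT
    NVMe path based on the free-memory ratio, with hysteresis to avoid
    flapping at the threshold. Thresholds are illustrative."""

    def __init__(self, low=0.10, high=0.25):
        self.low, self.high = low, high
        self.direct_mode = False

    def route(self, free_ratio):
        if free_ratio < self.low:
            self.direct_mode = True       # memory pressure: bypass page cache
        elif free_ratio > self.high:
            self.direct_mode = False      # pressure relieved: page cache again
        return "nvme-direct" if self.direct_mode else "page-cache"
```

Hysteresis matters here because KV traffic itself perturbs free memory, so a single threshold would oscillate exactly when the device is busiest.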

cs.AI
#144
Research 2026-04-29 arXiv cs.AI (Artificial Intelligence) 5.7 5.5/6.9/4.5

Generative AI tools are widely used by youth and have introduced new privacy and safety challenges. While prior research has explored youth safety in GenAI within a Western context, it often overlooks the cultural, religious, and social dimensions of technology use that strongly shape youths' digital experiences in countries like Saudi Arabia. To address this gap, this study explores how children (aged 7 to 17), parents, and teachers interact with GenAI tools and perceive their risks through a non-Western lens. Through a mixed-methods approach, we analyzed 736 Reddit and 1,262 X (Twitter) posts and…

cs.AI
#145
Multimodal 2026-04-29 arXiv cs.RO (Robotics) · arXiv cs.CV (Computer Vision) 5.7 5.8/6.2/5.0

Breakthrough progress in vision-based navigation through unknown environments has been achieved by using multimodal large language models (MLLMs). These models can plan a sequence of motions by evaluating the current view at each time step against the task and goal given to the agent. However, current zero-shot Vision-and-Language Navigation (VLN) agents powered by MLLMs still tend to drift off course, halt prematurely, and achieve low overall success rates. We propose Three-Step Nav to counteract these failures with a three-view protocol: First, "look forward" to extract global landmarks and sketch a…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.RO, arXiv cs.CV.
cs.RO cs.CV
#146
Robotic Autonomy 2026-04-29 arXiv cs.RO (Robotics) 5.7 5.8/6.7/4.5

Robotic manipulation critically requires reasoning about future spatial-temporal interactions, yet existing VLA policies and world-model-enhanced policies do not fully model action-relevant spatial-temporal interaction structure. We propose STARRY, a world-model-enhanced action-generation policy that aligns spatial-temporal prediction with action generation. STARRY jointly denoises future spatial-temporal latents and action sequences, and introduces Geometry-Aware Selective Attention Modulation to convert predicted depth and end-effector geometry into token-aligned weights for selective action-attention modulation. On RoboTwin 2.0, STARRY achieves 93.82% / 93.30% average success under Clean and Randomized settings. Real-world experiments further improve average success from 42.5%…

cs.RO
#147
Multimodal 2026-04-29 arXiv cs.CV (Computer Vision) 5.7 5.8/6.6/4.5

Surgical training involves didactic teaching, mentor-led learning, surgical skills laboratories, and direct exposure to surgery; however, increasing clinical pressures have limited operating room (OR) exposure. This work leverages virtual reality (VR) to provide a safe and immersive training environment. Existing VR training is often based on standardized scenarios not tailored to individual clinical cases. This study addresses this limitation using artificial intelligence (AI) based computer vision methods to generate patient-specific simulations from computed tomography (CT) and magnetic resonance imaging (MRI). This study focuses on patient-specific spinal decompression simulation for spinal…

cs.CV
#148
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.7 5.8/6.0/5.0

Structured information extraction from long, multilingual scanned financial documents is a core requirement in industrial KYC and compliance workflows. These documents are typically non-machine-readable, noisy, and visually heterogeneous. They usually span dozens of pages while containing only sparse task-relevant information. Although recent vision-language models achieve strong benchmark performance, directly applying them end to end to full financial reports often leads to unreliable extraction under real-world conditions. We present a multistage extraction framework that integrates image preprocessing, multilingual OCR, hybrid page-level retrieval, and compact VLM-based structured extraction.…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CV
#149
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.7 5.8/6.0/5.0

Industrial object detection systems typically rely on large annotated datasets, which are expensive to collect and challenging to maintain in industrial scenarios where the inventory of objects changes frequently. This work addresses the challenge of few-shot object detection in such industrial scenarios, where only a limited number of labeled samples are available for newly introduced objects. We present a detection framework that leverages vision foundation models to recognize objects with minimal supervision. The method constructs class prototypes from a small set of reference samples by extracting feature representations. For a…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CV
#151
Government & Defense 2026-04-29 FedScoop 5.7 5.5/7.0/4.5

The Office of Management and Budget’s public tally of governmentwide AI use again grew in 2025 — this time amid the Trump administration’s push to use the technology in the name of efficiency. Per OMB’s recent publication on GitHub, the U.S. government reported about 3,600 AI use cases across agencies, a nearly 70% increase in disclosed applications of the technology from the previous reporting year. As with previous disclosures, the accounting captures pre-deployment uses, pilot projects, those in active…

#152
Multimodal 2026-04-29 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision) 5.6 5.5/6.2/5.0

Reliable celestial attitude determination is a critical requirement for autonomous spacecraft navigation, yet traditional "Lost-in-Space" (LIS) algorithms often suffer from high computational overhead and sensitivity to sensor-induced noise. While deep learning has emerged as a promising alternative, standard regression models are often confounded by the non-Euclidean topology of the celestial sphere and by the periodic boundary conditions of Right Ascension (RA) and Declination (Dec). In this paper, we present Star-Fusion, a multi-modal architecture that reformulates orientation estimation as a discrete topological classification task. Our approach leverages spherical K-Means clustering to…

How it was discussed
  • Cross-listed in 2 arXiv categorical feeds: arXiv cs.AI, arXiv cs.CV.
cs.AI cs.CV
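The core reformulation here — orientation estimation as discrete classification over sphere bins, which sidesteps the RA wrap-around that confounds regression — can be sketched simply. The six axis-aligned bin centers below are illustrative placeholders, not the paper's spherical K-Means clusters:

```python
from math import cos, sin, radians

def radec_to_unit(ra_deg, dec_deg):
    # Convert Right Ascension / Declination to a unit vector on the celestial sphere.
    ra, dec = radians(ra_deg), radians(dec_deg)
    return (cos(dec) * cos(ra), cos(dec) * sin(ra), sin(dec))

# Illustrative bin centers (the paper derives its bins via spherical K-Means).
BINS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def classify_attitude(ra_deg, dec_deg):
    # Assign the pointing direction to the nearest bin by dot product.
    # Working on unit vectors avoids the periodic boundary at RA = 360°.
    v = radec_to_unit(ra_deg, dec_deg)
    return max(range(len(BINS)),
               key=lambda i: sum(a * b for a, b in zip(v, BINS[i])))

print(classify_attitude(0.0, 0.0))    # +x bin
print(classify_attitude(359.0, 0.0))  # still the +x bin: no RA discontinuity
```

A regression model would see RA = 0.0 and RA = 359.0 as maximally distant targets; the classification view makes them neighbors for free.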
#153
Government & Defense 2026-04-29 arXiv cs.RO (Robotics) · arXiv — AI, Defense & National Security 5.6 5.5/6.2/5.0

Human-robot interaction is emerging as an important paradigm for integrating persons with disabilities into the workplace. While these systems can enable individuals to work, their design is mostly personalized, hindering widespread use beyond the individual user. The universal design paradigm is a central pillar of inclusive design, describing usability of systems by all. To incorporate universal design into process design for human-robot workplaces, expert knowledge is required that is often not available. To simplify process design of human-robot workplaces, we propose a persona-based design approach. First, typical impairments prevalent in…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.RO.
  • Matched topical feeds: AI, Defense & National Security — wide thematic overlap.
cs.RO
#154
Robotic Autonomy 2026-04-29 arXiv cs.RO (Robotics) 5.6 5.5/6.7/4.5

Navigating quadruped robots in unstructured 3D environments poses significant challenges, requiring goal-directed motion, effective exploration to escape from local minima, and posture adaptation to traverse narrow, height-constrained spaces. Conventional approaches employ a sequential mapping-planning pipeline but suffer from accumulated perception errors and high computational overhead, restricting their applicability on resource-constrained platforms. To address these challenges, we propose Hierarchical Posture-Adaptive Navigation (HiPAN), a framework that operates directly on onboard depth images at deployment. HiPAN adopts a hierarchical design: a high-level policy generates strategic navigation commands (planar velocity and body posture), which…

cs.RO
#155
Post-Training 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Post-training / Alignment 5.6 5.5/6.2/5.0

We introduce ProcFunc, a library for Blender-based procedural 3D generation in Python. ProcFunc provides a library of easy-to-use Python functions, which streamline creating, combining, analyzing, and executing procedural generation code. ProcFunc makes it easy to create large-scale diverse training data, by combinatorial compositions of semantic components. VLMs can use ProcFunc to edit procedural material and geometry code and can create new procedural code with significantly fewer coding errors. Finally, as an example use case, we use ProcFunc to develop a new procedural generator of indoor rooms, which includes a collection…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Post-training / Alignment — wide thematic overlap.
cs.CV
#156
Multimodal 2026-04-29 arXiv cs.CV (Computer Vision) 5.6 5.5/6.6/4.5

The task of capturing and rendering 3D dynamic scenes from 2D images has become increasingly popular in recent years. However, most conventional cameras are bandwidth-limited to 30-60 FPS, restricting these methods to static or slowly evolving scenes. While overcoming bandwidth limitations is difficult for general scenes, recent years have seen a flurry of computational imaging methods that yield high-speed videos using conventional cameras for specific applications (e.g., motion capture and particle image velocimetry). However, most of these methods require modifications to a camera's optics or the addition of mechanically moving…

cs.CV
#157
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.6 5.5/6.0/5.0

Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. We present AnimateAnyMesh++, a feed-forward framework for text-driven animation of arbitrary 3D meshes with substantial upgrades in data, architecture, and generative capability. First, we expand the DyMesh-XL dataset by mining dynamic content from Objaverse-XL, increasing the number of unique identities from 60K to 300K and substantially broadening category and motion diversity. Second, we redesign DyMeshVAE-Flex with power-law…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CV
#158
Multimodal 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Generative Media / Diffusion 5.6 5.5/6.2/5.0

Synthesizing a target concept from a single reference image is challenging in diffusion-based personalized text-to-image generation, particularly for sticker personalization where prompts often require explicit attribute edits. With only one reference, test-time fine-tuning (TTF) methods tend to overfit, producing visual entanglement, where background artifacts are absorbed into the learned concept, and structural rigidity, where the model memorizes reference-specific spatial configurations and loses contextual controllability. To address these issues, we introduce SEmantic-aware single-image sticker personALization (SEAL), a plug-and-play, architecture-agnostic adaptation module that integrates into existing personalization pipelines without modifying their U-Net-based…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Generative Media / Diffusion — wide thematic overlap.
cs.CV
#159
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.6 5.5/6.0/5.0

Detectors often suffer from degraded performance, primarily due to the distributional gap between the source and target domains. This issue is especially evident in single-source domains with limited data, as models tend to rely on confounders (e.g., illumination, co-occurrence, and style) from the source domain, leading to spurious correlations that hinder generalization. To this end, this paper proposes a novel Basis-driven framework for domain generalization, namely Bridge, that incorporates causal inference into object detection. By learning the low-rank bases for front-door adjustment, Bridge blocks confounders' effects to mitigate spurious correlations,…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CV
#160
Efficiency 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Efficiency (Quantization, MoE, Inference) 5.6 5.5/6.2/5.0

3D Gaussian Splatting (3DGS) achieves high-quality novel view synthesis with real-time rendering, but its storage cost remains prohibitive for practical deployment. Existing post-training compression methods still rely on many coupled hyperparameters across pruning, transformation, quantization, and entropy coding, making it difficult to control the final compressed size and fully exploit the rate-distortion trade-off. We propose MesonGS++, a size-aware post-training codec for 3D Gaussian compression. On the codec side, MesonGS++ combines joint importance-based pruning, octree geometry coding, attribute transformation, selective vector quantization for higher-degree spherical harmonics, and group-wise mixed-precision quantization with…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Efficiency (Quantization, MoE, Inference) — wide thematic overlap.
cs.CV
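The "group-wise mixed-precision quantization" ingredient has a simple core: quantize each attribute group uniformly, but at a bit width matched to its importance. A minimal sketch with hypothetical attribute groups (not the paper's actual grouping or rate-allocation rule):

```python
def quantize_group(values, bits):
    # Uniform quantization of one attribute group at a given bit width.
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    dequant = [lo + c * scale for c in codes]
    return codes, dequant, scale

# Mixed precision: spend more bits on perceptually important attributes,
# fewer on higher-degree spherical-harmonic coefficients.
base_color = [0.12, 0.55, 0.93, 0.41]
high_sh    = [0.02, -0.01, 0.03, 0.00]
_, base_hat, s_base = quantize_group(base_color, bits=8)
_, sh_hat,   s_sh   = quantize_group(high_sh,   bits=4)

max_err_base = max(abs(a - b) for a, b in zip(base_color, base_hat))
print(max_err_base <= s_base / 2 + 1e-12)  # error bounded by half a quantization step
```

The rate-distortion lever the paper automates is exactly the `bits` choice per group: halving the bit width halves storage for that group while roughly doubling its quantization step.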
#161
Multimodal 2026-04-29 arXiv cs.CV (Computer Vision) 5.6 5.5/6.6/4.5

Optical vibration sensing enables recovering the scene sound directly from the surface vibration of nearby objects, turning everyday objects into “visual microphones”. However, most prior methods have focused on capturing the vibrations of specific objects with highly favorable vibration responses. These include objects where the surface vibrations are generated by the object itself (e.g., speaker membrane or guitar body) or objects consisting of a thin membrane which is highly reactive to sound (e.g., a chip bag or the leaf of a plant). In this paper, we tackle sound recovery for…

cs.CV
#162
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.6 5.5/6.0/5.0

Face Recognition (FR) is used in a variety of application domains, from entertainment and banking to security and surveillance. Such applications rely on the FR model to be robust and perform well in a variety of settings. To achieve this, state-of-the-art FR models typically use expressive adaptive margin loss functions, which tie the feature norm to concepts related to sample quality, such as recognizability and perceptual image quality. Recently, through the development of Face Image Quality Assessment (FIQA) techniques, biometric utility has become the preferred measure of face-image quality and…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CV
#163
Evaluations & Benchmarks 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks 5.6 5.5/6.0/5.0

The rapid evolution of deepfake technology poses an unprecedented threat to the authenticity of Graphics Interchange Format (GIF) imagery, which serves as a representative of short-loop temporal media in social networks. However, existing proactive forensics works are designed for static images, which limits their applicability to animated GIFs. To bridge this gap, we propose GIFGuard, the first spatiotemporal watermarking framework tailored for deepfake proactive forensics in GIFs. In the embedding stage, we propose the Spatiotemporal Adaptive Residual Encoder (STARE) to ensure robustness against high-level semantic tampering. It employs a 3D…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Evals & Benchmarks — wide thematic overlap.
cs.CV
#164
Multimodal 2026-04-29 arXiv cs.CV (Computer Vision) · arXiv — Generative Media / Diffusion 5.6 5.5/6.2/5.0

Diffusion models have achieved remarkable success in synthesizing complex static and temporal visuals, a breakthrough largely driven by Classifier-Free Guidance (CFG). However, despite its pivotal role in aligning generated content with textual prompts, standard CFG relies on a globally uniform scalar. This homogeneous amplification traps models in a well-documented "detail-artifact dilemma": low guidance scales fail to inject intricate semantics, while high scales inevitably cause structural degradation, color over-saturation, and temporal inconsistencies in videos. In this paper, we expose the physical root of this flaw through the lens of differential geometry.…

How it was discussed
  • Cross-listed in 1 arXiv categorical feed: arXiv cs.CV.
  • Matched topical feeds: Generative Media / Diffusion — wide thematic overlap.
cs.CV
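The "globally uniform scalar" the paper critiques is the standard classifier-free guidance update: one scale factor amplifies the conditional direction identically at every location. A toy sketch of that baseline (1-D stand-in for a noise-prediction tensor):

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    # Standard classifier-free guidance: a single global scalar amplifies the
    # conditional direction uniformly across every element of the prediction.
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.10, -0.20, 0.05]   # unconditional noise prediction (toy 1-D "image")
eps_c = [0.30,  0.00, 0.25]   # text-conditional prediction
print(cfg_combine(eps_u, eps_c, scale=1.0))  # scale 1 recovers the conditional branch
print(cfg_combine(eps_u, eps_c, scale=7.5))  # a typical high scale over-amplifies everywhere
```

Because `scale` is one number, detail-poor and detail-rich regions get the same amplification — the "detail-artifact dilemma" the abstract describes; the paper's contribution is replacing this scalar with something spatially adaptive.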
#165
Multimodal 2026-04-29 arXiv cs.CV (Computer Vision) 5.6 5.5/6.8/4.5

Visual data compression is shifting from human-centered reconstruction to machine-oriented representation coding. In this setting, an image is often mapped to a compact semantic embedding, which is then compressed and transmitted for downstream inference. We propose an adaptive transform-coding method for semantic-feature compression motivated by the conditional rate-distortion function of a Gaussian mixture model. The scheme uses mode-dependent transforms and quantizers selected according to the inferred source component, enabling more efficient coding of heterogeneous feature distributions. Evaluations on features from widely used vision backbones and foundation models show that the…

cs.CV
#166
Multimodal 2026-04-29 arXiv cs.CV (Computer Vision) 5.6 5.5/6.6/4.5

Purpose: Rapid and reliable diagnostic tools are crucial for managing respiratory diseases like COVID-19, where chest X-ray analysis coupled with artificial intelligence techniques has proven invaluable. However, most existing works on X-ray images have not considered lung segmentation, raising concerns about their reliability. Additionally, some have employed disproportionate and impractical augmentation techniques, making models less generalized and prone to overfitting. This study presents a critical analysis of both issues and proposes a methodology (SDL-COVID) for more reliable classification of chest X-rays for COVID-19 detection. Methods: We use class activation mapping…

cs.CV
#167
Safety, Policy & Regulation 2026-04-29 MIT Tech Review 5.6 6.0/6.2/4.5

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology. It’s time to make a plan for nuclear waste. Today, nuclear energy enjoys rare support across the political spectrum. Public approval has spiked, and Big Tech is throwing money around to meet rising electricity demand. That newfound interest is exactly why it’s time to talk about an old problem: nuclear waste. In the US, nuclear…

#168
Government & Defense 2026-04-29 DefenseScoop 5.6 5.5/6.7/4.5

The U.S. government is modernizing its Small Business Innovation Research and Small Business Technology Transfer (SBIR/STTR) programs to get after contemporary warfare and national security gaps, senior officials involved in the work said on Wednesday. Referred to collectively as “America’s seed fund,” that decades-old pair of federal programs provides technology-focused small businesses and startups with early-stage investments and support to commercialize their products, and ultimately field them for use by federal agencies and the military. “I think what you’re going…

#169
Government & Defense 2026-04-29 FedScoop 5.6 5.5/6.6/4.5

Agencies would be pushed to pick up the pace on the elimination of legacy IT systems under a new bill from a bipartisan group of House lawmakers. The Legacy IT Reduction Act of 2026 (H.R. 8408) from Reps. Maxwell Frost, D-Fla., William Timmons, R-S.C., Eric Burlison, R-Mo., and Byron Donalds, R-Fla., would require agency chief information officers to lead the charge on lessening the federal government’s reliance on and expenditures for aging systems. The first step in that reduction…

#170
Government & Defense 2026-04-29 War on the Rocks 5.6 5.5/6.5/4.5

In 2024, Judd Devermont wrote, “Human Geography Is Mission-Critical,” where he argued that the United States should focus on behaviors and attitudes informed by human geography to craft better strategy. Two years later, we asked Judd to revisit his arguments. In your 2024 article, you argued that the United States needed to focus its attention on behaviors and attitudes informed by human geography to craft strategy that adequately navigates a more complex world and threat…

#171
Government & Defense 2026-04-29 War on the Rocks 5.6 5.5/6.5/4.5

The United States and Canada are both racing to rebuild their defense industrial bases, recognizing that future conflicts will be determined not only by military capability, but by the ability to produce at scale. But they cannot succeed alone — and importantly, they do not need to start from scratch. After decades of reliance on globalized supply chains for everything from consumer products to critical defense technologies, the United States is reasserting a more active industrial policy, using tools ranging from…

#172
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.5 5.5/6.3/4.5

This paper extends and explains the Multiple Additive Neural Networks (MANN) methodology, an enhancement to the traditional Gradient Boosting framework, utilizing nearly shallow neural networks instead of decision trees as base learners. This innovative approach leverages neural network architectures, notably Convolutional Neural Networks (CNNs) and Capsule Neural Networks, to extend its application to both structured data and unstructured data such as images and audio. For structured data the advantages of capsule neural networks as feature extractors are used and combined with MANN as a classifier. MANN's unique architecture promotes continuous…

cs.LG
#173
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.5 5.5/6.3/4.5

This paper proposes a novel algorithm for semi-supervised learning. This algorithm learns graph cuts that maximize the margin with respect to the labels induced by the harmonic function solution. We motivate the approach, compare it to existing work, and prove a bound on its generalization error. The quality of our solutions is evaluated on a synthetic problem and three UCI ML repository datasets. In most cases, we outperform manifold regularization of support vector machines, which is a state-of-the-art approach to semi-supervised max-margin learning.

cs.LG
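The harmonic function solution the abstract builds on has a simple characterization: labeled nodes are pinned to their labels, and every unlabeled node takes the average of its neighbors' values. A minimal Gauss-Seidel sketch on a toy path graph (not the paper's max-margin graph-cut algorithm itself):

```python
def harmonic_solution(adj, labels, iters=2000):
    # Fix labeled nodes; relax each unlabeled node to the mean of its
    # neighbors (Gauss-Seidel on the graph Laplacian system).
    f = {v: labels.get(v, 0.5) for v in adj}
    for _ in range(iters):
        for v in adj:
            if v not in labels:
                f[v] = sum(f[u] for u in adj[v]) / len(adj[v])
    return f

# Path graph 0-1-2-3 with the endpoints labeled 0 and 1.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
f = harmonic_solution(adj, labels={0: 0.0, 3: 1.0})
print(round(f[1], 4), round(f[2], 4))  # interior values interpolate: 1/3 and 2/3
```

On a path the solution is linear interpolation between the labels; on a general similarity graph it spreads label mass along dense regions, which is what makes the induced labels useful targets for the margin-maximizing cut.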
#174
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.5 5.5/6.3/4.5

Motivated by sensing modalities in modern autonomous systems that involve hardware-constrained spatial sampling over large arrays with limited coherence time, we develop a novel framework for rapid super-resolution multi-signal direction-of-arrival (DoA) estimation based on Hankel-structured sensing and data matrix decomposition of arbitrary rank, under both the $L_2$ and $L_1$-norm formulation. The resulting $L_2$-norm estimator is shown to be maximum-likelihood optimal in white Gaussian noise. The $L_1$-norm estimator is shown to be maximum-likelihood optimal in independent, identically distributed (i.i.d.) isotropic Laplace noise, offering broad robustness to impulsive interference and corrupted measurements…

cs.LG
#175
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.5 5.5/6.3/4.5

We consider the problems of computing the optimal rank-$1$ Hankel and Toeplitz-structured approximation of arbitrary matrices under $L_2$ and $L_1$-norm error. Such problems arise naturally in engineered systems, including the basic few-shot signal Direction-of-Arrival (DoA) estimation problem that is of importance to modern autonomous systems applications. We develop accurate and computationally efficient structured matrix decomposition algorithms for both formulations and then derive analytically grounded small-sample-support DoA estimators for practical sensing system deployments. The resulting estimators under the $L_2$ and $L_1$ norms are formally shown to be maximum-likelihood optimal under white…

cs.LG
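The rank-1 Hankel building blocks in this pair of papers can be illustrated compactly: a single damped exponential produces an exactly rank-1 Hankel matrix, and the L2-optimal rank-1 approximation is the dominant singular pair. A plain-Python sketch via power iteration (a generic illustration, not the papers' arbitrary-rank or L1-norm algorithms):

```python
def hankel(x, rows):
    # Hankel matrix whose anti-diagonals are constant: H[i][j] = x[i+j].
    cols = len(x) - rows + 1
    return [[x[i + j] for j in range(cols)] for i in range(rows)]

def rank1_approx(H, iters=200):
    # Dominant singular pair by power iteration; L2-optimal rank-1 approximation.
    m, n = len(H), len(H[0])
    v = [1.0] * n
    for _ in range(iters):
        u = [sum(H[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [sum(H[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    u = [sum(H[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = sum(x * x for x in u) ** 0.5
    u = [x / sigma for x in u]
    return [[sigma * u[i] * v[j] for j in range(n)] for i in range(m)]

# One damped exponential: x[i+j] = 0.9**i * 0.9**j, so H is exactly rank 1
# and the approximation should reproduce it almost perfectly.
x = [0.9 ** n for n in range(8)]
H = hankel(x, rows=4)
A = rank1_approx(H)
err = max(abs(H[i][j] - A[i][j]) for i in range(4) for j in range(5))
print(err < 1e-9)
```

With several superimposed signals the Hankel matrix has higher rank, which is where the structured arbitrary-rank decompositions and L1-robust variants in these papers come in.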
#176
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.5 5.5/6.3/4.5

We investigate variational quantum classifiers (VQCs) for land-cover classification from multispectral satellite imagery, adopting a feature-map perspective in which the quantum circuit defines a nonlinear data embedding while the readout determines how this representation is exploited. Using the EuroSAT-MS dataset, we perform a systematic one-vs-one evaluation across all class pairs under a controlled experimental protocol, comparing classical baselines (logistic regression, SVMs, neural networks) with VQCs employing both linear readout and quantum-kernel SVM strategies. Our results show that, while VQCs with linear readout do not outperform strong classical baselines such as…

cs.LG
#177
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.5 5.5/6.3/4.5

With the increasing availability of online information, recommender systems have become an important tool for many web-based systems. Due to the continuous aspect of recommendation environments, these systems increasingly rely on contextual multi-armed bandits (CMAB) to deliver personalized and real-time suggestions. A critical yet underexplored component in these systems is the representation of user state, which typically encapsulates the user's interaction history and is deeply correlated with the model's decisions and learning. In this paper, we investigate the impact of different embedding-based state representations derived from matrix factorization models on…

cs.LG
#178
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.5 5.5/6.3/4.5

We propose a dual-channel reservoir-computing scheme for inferring the dynamics of two distinct chaotic systems with a single machine. By augmenting a standard reservoir with a system-label channel and a parameter-control channel, the machine can be trained from time series collected from a few sampled states of the two systems. We show that the trained machine not only predicts the short-time evolution of the sampled states, but also reproduces the long-term statistical properties of unseen states, thereby enabling reconstruction of the bifurcation diagrams of both systems from partial observations. The…

cs.LG
#179
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.5 5.5/6.3/4.5

Federated learning (FL) trains a shared model from updates contributed by distributed clients, often implicitly assuming that contributing clients are representative of the target population. In practice, this representativeness assumption can fail at two distinct stages, inducing selection bias. First, eligibility rules such as device constraints, software requirements, or user consent determine which clients are ever enrolled and reachable for training, inducing \emph{enrollment bias}. Second, among enrolled clients, user and system factors such as battery state, network status, and local time determine which clients participate in each communication round, inducing…

cs.LG
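A standard remedy for the round-level participation bias described here is inverse-propensity (Horvitz-Thompson) weighting of client updates; it is a natural baseline for this setting, though the abstract does not confirm it is the paper's method. A toy sketch with scalar updates:

```python
def ipw_mean(participant_updates, n_population):
    # Horvitz-Thompson estimate of the population-average update.
    # participant_updates: (update, participation_probability) pairs for the
    # clients that actually showed up this round; each update is weighted
    # by 1 / P(participation) to debias the selection.
    total = sum(u / p for u, p in participant_updates)
    return total / n_population

# The client with p = 0.5 is under-represented in any given round;
# weighting by 1/p restores its expected share of the average.
est = ipw_mean([(2.0, 0.5), (1.0, 1.0)], n_population=3)
print(est)  # (2/0.5 + 1/1) / 3 = 5/3
```

Note this only corrects the second stage (round-level participation among enrolled clients); the enrollment bias the paper distinguishes cannot be fixed this way, since never-enrolled clients have participation probability zero.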
#180
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.5 5.5/6.3/4.5

Digital twins provide a powerful paradigm for diagnostic and prognostic tasks in the monitoring and control of engineered systems; however, their deployment for complex structures remains challenged by model-form uncertainty, arising from unknown nonlinear dynamics, and by sparse sensing. These limitations hinder reliable online state estimation using either purely physics-based or purely data-driven approaches. This work introduces the Physics-Guided Graph Neural ODE (PiGGO) framework, a physics-informed, graph-based Bayesian state estimation approach in which a learned graph neural ordinary differential equation (GNODE) serves as the continuous-time state-transition model within an extended…

cs.LG
#182
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.5 5.5/6.3/4.5

DNNs have gained widespread adoption in feature interaction recommendation models. However, there has been a longstanding debate on their roles. On one hand, some works claim that DNNs possess the ability to implicitly capture high-order feature interactions. Conversely, recent studies have highlighted the limitations of DNNs in effectively learning dot products, specifically second-order interactions, let alone higher-order interactions. In this paper, we present a novel perspective to understand the effectiveness of DNNs: their impact on the dimensional robustness of the representations. In particular, we conduct extensive experiments involving both parallel…

cs.LG
#183
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.5 5.5/6.3/4.5

Industrial systems increasingly depend on Machine Learning (ML), and operate on heterogeneous nodes that must satisfy tight latency, energy, and memory constraints. Dynamic ML models, which reconfigure their computational footprint at runtime, promise high energy efficiency and lower average latency for modest accuracy tradeoffs; however, their deployment is complex due to the additional hyperparameters they rely on. These hyperparameters, controlling the accuracy versus average latency tradeoff, are often tuned on a calibration dataset that must match the test time distribution, an assumption that rarely holds in real-world scenarios, leading to…

cs.LG
#184
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.5 5.5/6.3/4.5

We study three problems that involve identifying homogeneous halfspaces under Gaussian distributions: agnostic learning, one-sided reliable learning, and fairness auditing. In each of these problems, we are given labeled examples $(\mathbf{x}, \mathrm{y})$ drawn from an unknown distribution on $\mathbb{R}^d\times\{-1, +1\}$, whose marginal distribution on $\mathbf{x}$ is standard Gaussian and on $\mathrm{y}$ is arbitrary. The goal of each problem is to output a homogeneous halfspace that approaches the best-fitting homogeneous halfspace in terms of its corresponding loss measure. We prove near-optimal computational hardness results for these problems under the widely believed…

cs.LG
#185
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.5 5.5/6.3/4.5

We prove that any continuous function f from [0,1]^n to R representable by a finite computation tree with N internal nodes and compositional sparsity s = O(1) admits a deep Kolmogorov-Arnold Network (KAN) representation. Each internal node is realised by a primitive KAN block with controlled block depth and Lipschitz product. The layer-wise Lipschitz product satisfies the primary domain-sensitive bound independent of the input dimension n. It simplifies to P(KAN_f) <= max(C*,1)^L_f with L_f <= c_max * N. For the standard operations {+,-,x,sin,cos} with x nodes on [0,1]-bounded inputs we…

cs.LG
#186
Frontier LLMs 2026-04-29 arXiv cs.LG (Machine Learning) 5.5 5.5/6.3/4.5

Balancing differential privacy (DP) with recommendation accuracy is a key challenge in privacy-preserving recommender systems, since DP-noise degrades accuracy. We address this trade-off at both the data and model levels. At the data level, we apply DP only to the most stereotypical user data likely to reveal sensitive attributes, such as gender or age, to reduce unnecessary perturbation; we refer to this as targeted DP. At the model level, we use meta-learning to improve robustness to remaining DP-noise. This achieves a better trade-off between accuracy and privacy than standard approaches:…

cs.LG
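The data-level idea ("targeted DP") is to add Laplace noise only to the entries flagged as revealing sensitive attributes. A minimal sketch: the flagging rule, noise scale, and rating format are illustrative assumptions, not the paper's specifics:

```python
import random
from math import log

def laplace_noise(scale, rng):
    # Sample Laplace(0, scale) by inverse-CDF from a uniform draw.
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * log(1.0 - 2.0 * abs(u))

def targeted_dp(ratings, sensitive_idx, epsilon, rng):
    # Perturb only the items flagged as stereotypically revealing a sensitive
    # attribute (e.g., gender or age); all other entries pass through untouched.
    scale = 1.0 / epsilon  # assumes sensitivity 1 for bounded ratings
    return [r + laplace_noise(scale, rng) if i in sensitive_idx else r
            for i, r in enumerate(ratings)]

rng = random.Random(0)
ratings = [5.0, 3.0, 4.0, 1.0]
noisy = targeted_dp(ratings, sensitive_idx={1, 3}, epsilon=1.0, rng=rng)
print(noisy[0] == ratings[0], noisy[2] == ratings[2])  # untargeted entries unchanged
```

Leaving the non-stereotypical entries clean is what buys back recommendation accuracy; the model-level meta-learning in the paper then handles robustness to the noise that remains.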
#187
Frontier LLMs 2026-04-29 arXiv cs.CL (Computation & Language) 5.5 5.5/6.3/4.5

Many of the thousands of attested languages share common configurations of features, creating a spectrum from typologically very rare (e.g., object-verb-subject word order) or impossible languages to very common combinations of features (e.g., subject-object-verb word order). One central question is under what conditions such typological tendencies can be predicted, and specifically whether the learning bias of language models (LMs) is sufficient to reproduce such patterns. In this study, we add one dimensionality to such analysis -- the learning scenario for LMs -- to explore its interaction with the inductive bias…

cs.CL
#188
Frontier LLMs 2026-04-29 arXiv cs.CL (Computation & Language) 5.5 5.5/6.3/4.5

Parametric Retrieval-Augmented Generation (PRAG) encodes external documents into lightweight parameter modules that can be retrieved and merged at inference time, offering a promising alternative to in-context retrieval augmentation. Despite its potential, many PRAG implementations train document adapters with task-supervised objectives, which may cause each adapter to encode both document-specific facts and reusable task-solving behavior. This entanglement may make adapter composition less reliable: when multiple adapters are merged at inference time, their overlapping task behaviors can accumulate together with document-specific updates, potentially making the merged adapter less stable and less focused…

cs.CL
#189
Government & Defense 2026-04-29 arXiv cs.CL (Computation & Language) 5.5 5.5/6.3/4.5

Languages of the world vary concerning the order of subject, object and verb. The most frequent dominant orders are SOV and SVO, and researchers have tailored models to this fact. However, there are still languages whose dominant order does not conform to these expectations or even lack a dominant order. Here we show that across linguistic families and macroareas, word order variation within languages is shaped by the principle of swap distance minimization even when the dominant order is not SOV/SVO and even when a dominant order is lacking.

cs.CL
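Swap distance between two word orders is the minimum number of adjacent transpositions needed to turn one into the other, i.e. an inversion count. A short sketch using the single letters S, O, V as words:

```python
def swap_distance(order_a, order_b):
    # Minimum number of adjacent transpositions turning order_a into order_b,
    # computed as the inversion count of order_a relative to order_b.
    pos = {w: i for i, w in enumerate(order_b)}
    perm = [pos[w] for w in order_a]
    return sum(1 for i in range(len(perm)) for j in range(i + 1, len(perm))
               if perm[i] > perm[j])

print(swap_distance("SOV", "SVO"))  # 1: swap the adjacent O and V
print(swap_distance("SOV", "OVS"))  # 2
```

Swap distance minimization then predicts that the non-dominant orders a language actually uses cluster near its dominant order under this metric, which is the pattern the paper reports across families and macroareas.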
#190
Frontier LLMs 2026-04-29 arXiv cs.CL (Computation & Language) 5.5 5.5/6.3/4.5

Aspect-based Sentiment Analysis (ABSA) extracts fine-grained opinions toward specific aspects within text but remains largely English-focused despite major advances in transformer-based and instruction-tuned models. This work presents a multilingual evaluation of state-of-the-art ABSA approaches across seven languages (English, German, French, Dutch, Russian, Spanish, and Czech) and four subtasks (ACD, ACSA, TASD, ASQP). We systematically compare different transformer architectures under zero-resource, data-only, and full-resource settings, using cross-lingual transfer, code-switching and machine translation. Fine-tuned Large Language Models (LLMs) achieve the highest overall scores, particularly in complex generative tasks, while few-shot counterparts approach…

cs.CL
#191
Frontier LLMs 2026-04-29 arXiv cs.CL (Computation & Language) 5.5 5.5/6.3/4.5

As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instructions embedded in submissions to manipulate outcomes -- emerges as a critical threat to scholarly integrity. To counter this, we propose a novel adversarial framework where a Generator model, trained to create sophisticated attack prompts, is jointly optimized with a Defender model tasked with their detection. This system is trained using a loss function inspired by Information Retrieval Generative Adversarial Networks, which fosters a dynamic co-evolution between the two models, forcing…

cs.CL
#192
Frontier LLMs 2026-04-29 arXiv cs.CL (Computation & Language) 5.5 5.5/6.3/4.5

Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. This is the first large-scale…

cs.CL
#193
Research 2026-04-29 arXiv cs.AI (Artificial Intelligence) 5.5 5.8/6.2/4.5

Background & Objectives: In the last decade, machine learning research has grown rapidly, but large models are reaching their soft limits, demonstrating diminishing returns, and still lack solid reasoning abilities. These limits could be surpassed through a synergistic combination of machine learning scalability and rigid reasoning. Methods: In this work, we propose a theoretical framework for reasoning through object-relations in an automated manner, integrated with Artificial Neural Networks. We present a formal analysis of the Reasoning, and we show the theory in practice through a paradigm integrating Reasoning and Machine Learning.…

cs.AI
#194
Multimodal 2026-04-29 arXiv cs.CV (Computer Vision) 5.5 5.8/6.0/4.5

We propose UAPAR, an Uncertainty-Aware Pedestrian Attribute Recognition framework. To the best of our knowledge, this is the first EDL-based uncertainty-aware framework for pedestrian attribute recognition (PAR). Unlike conventional deterministic methods, which fail to assess prediction reliability on low-quality samples, UAPAR effectively identifies unreliable predictions and thus enhances system robustness in complex real-world scenarios. To achieve this, UAPAR incorporates Evidential Deep Learning (EDL) into a CLIP-based architecture. Specifically, a Region-Aware Evidence Reasoning module employs cross-attention and spatial prior masks to capture fine-grained local features, which are further processed by an…

cs.CV
#195
Multimodal 2026-04-29 arXiv cs.CV (Computer Vision) 5.5 5.8/6.0/4.5

Audio-visual deepfakes have reached a level of realism that makes perceptual detection unreliable, threatening media integrity and biometric security. While multimodal detection has shown promise, most approaches are framed as binary classification tasks that often latch onto dataset-specific artifacts rather than genuine generative traces. We argue that a detector incapable of identifying how a video was forged is likely learning the wrong signal. Unlike binary detection, attribution-guided learning imposes a stronger geometric constraint on the shared embedding space, forcing the model to encode generator-specific forensic content rather than shortcuts. We propose the…

cs.CV
#196
Safety, Policy & Regulation 2026-04-30 AI Alignment Forum 5.5 5.0/6.9/4.5

One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety research. For example, misaligned AIs may try to: perform sloppy research in order to slow down the rate of research progress; make AI systems appear safer than they are; or train a successor model to be misaligned. Whether we should worry about those things depends substantially on how hard it is to sabotage…

#197
Research 2026-04-29 arXiv cs.AI (Artificial Intelligence) 5.4 5.5/6.2/4.5

This paper provides a concise yet comprehensive review of recent advancements in millimeter-wave (mm-wave) oscillators below 100 GHz and sub-terahertz (sub-THz/THz) oscillators above 100 GHz for next-generation computing and communication systems, including 5G, 6G, and beyond. Various design approaches, including CMOS, SiGe, and III-V semiconductor technologies, are explored in terms of performance metrics such as phase noise, output power, efficiency, frequency tunability, and stability. The review highlights key challenges in achieving high-performance and reliable oscillator designs while discussing emerging techniques for performance enhancement. By evaluating recent design trends, this work…

cs.AI
#198
Research 2026-04-29 arXiv cs.AI (Artificial Intelligence) 5.4 5.5/6.2/4.5

When generative AI (genAI) systems are used in high-stakes decision-making, their recommended role is to aid, rather than replace, human decision-making. However, there is little empirical exploration of how professionals making high-stakes decisions, such as those related to employment, perceive their agency and level of control when working with genAI systems. Through interviews with 22 recruiting professionals, we investigate how genAI subtly influences control over everyday workflows and even individual hiring decisions. Our findings highlight a pressing conflict: while recruiters believe they have final authority across the recruiting pipeline, genAI…

cs.AI
#199
Research 2026-04-29 arXiv cs.AI (Artificial Intelligence) 5.4 5.5/6.2/4.5

We introduce a toolkit for uncovering spurious correlations between recording characteristics and target class in speech datasets. Spurious correlations may arise due to heterogeneous recording conditions, a common scenario for health-related datasets. When present both in the training and test data, these correlations result in an overestimation of the system performance -- a dangerous situation, especially in high-stakes applications where systems are required to satisfy minimum performance requirements. Our toolkit implements a diagnostic method based on the detection of the target class using only the non-speech regions in the audio.…

cs.AI
#200
AI Coding 2026-04-29 arXiv cs.AI (Artificial Intelligence) 5.4 5.5/6.2/4.5

Large language models (LLMs) accelerate software development but often exhibit instability, non-determinism, and weak adherence to development discipline in unconstrained workflows. While test-driven development (TDD) provides a structured Red-Green-Refactor process, existing LLM-based approaches typically use tests as auxiliary inputs rather than enforceable process constraints. We present an AI-native TDD framework that operationalizes classical TDD principles as structured prompt-level and workflow-level governance mechanisms. Extracted principles are formalized in a machine-readable manifesto and distributed across planning, generation, repair, and validation stages within a layered architecture that separates model proposal from deterministic engine…

cs.AI
#201
Research 2026-04-29 arXiv cs.AI (Artificial Intelligence) 5.4 5.5/6.2/4.5

Reusing verification artefacts requires identifying structural and semantic similarities across programs and their specifications. In this paper, we focus on graph construction as a foundational step toward this goal. We present a pipeline that converts imperative programs and their annotations into typed, attributed graphs. Our experiments cover datasets including C with ACSL, Java with JML, and Dafny for C#. The pipeline integrates abstract syntax tree parsing with semantic embeddings derived from models such as SentenceTransformer and CodeBERT. This enables the generation of graph representations that capture both structural relationships and…

cs.AI
#202
Research 2026-04-29 arXiv cs.AI (Artificial Intelligence) 5.4 5.5/6.2/4.5

Over the past 25 years, I have been involved in some intriguing developments in the foundations of physics, exploring the quantum reality problem, the relationship between quantum theory and gravity and the interplay between consciousness and physical laws. These investigations make it plausible that we will find physics beyond quantum theory, potentially including both new evolution laws and new types of measurement. There is also a significant chance they could have potentially transformative impact on information processing and on the development of and our future with AI.

cs.AI
#203
Research 2026-04-29 arXiv cs.AI (Artificial Intelligence) 5.4 5.5/6.2/4.5

This paper presents Quantum Gatekeeper, a context-bound image steganography framework where successful payload recovery depends on both cryptographic decryption and the reconstruction of a precise extraction path. The system integrates lossless least significant bit (LSB) embedding with a deterministic variational quantum circuit (VQC)-derived gate key, multi-factor contextual binding, and authenticated encryption. Payload extraction is contingent upon four requisite factors: a password, a shared secret, a user-supplied context string, and a reference image signature. Any deviation in these factors causes the system to read from an incorrect pixel sequence or fail…
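The lossless LSB embedding that the framework builds on has a simple generic form, sketched below. This is a plain illustration of LSB steganography, not Quantum Gatekeeper itself, which layers a key-derived extraction path and authenticated encryption on top; the point of the contextual binding is that a reader who walks the wrong pixel sequence recovers garbage.

```python
def lsb_embed(pixels, payload_bits):
    # Write one payload bit into the least significant bit of each pixel;
    # all other bits are untouched, so the embedding is lossless in the
    # upper 7 bits of every carrier value.
    assert len(payload_bits) <= len(pixels)
    out = list(pixels)
    for i, bit in enumerate(payload_bits):
        out[i] = (out[i] & ~1) | (bit & 1)
    return out

def lsb_extract(pixels, n_bits):
    # Recovery requires reading the same pixel sequence in the same order.
    return [p & 1 for p in pixels[:n_bits]]

cover = [200, 37, 96, 141, 58, 77]
stego = lsb_embed(cover, [1, 0, 1, 1])
```

A wrong ordering of `pixels` at extraction time, as induced by the paper's four contextual factors, would yield an unrelated bit string rather than the payload.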

cs.AI
#204
Robotic Autonomy 2026-04-29 arXiv cs.RO (Robotics) 5.4 5.5/6.2/4.5

This paper presents a planning pipeline framework for locomotion in rope-assisted robots climbing vertical surfaces. The proposed framework is formulated as a bi-level optimization scheme that addresses a mixed-integer problem: selecting feasible terrain regions for landing while simultaneously optimizing the control inputs, namely rope tensions and leg forces, and landing location. The outer level of the optimization is solved using the Cross-Entropy Method, while the inner level relies on gradient-based nonlinear optimization to compute dynamically feasible motions. The approach is validated on a novel climbing robot platform, ALPINE, across a…

cs.RO
#205
Robotic Autonomy 2026-04-29 arXiv cs.RO (Robotics) 5.4 5.5/6.2/4.5

Safe navigation in cluttered environments is an important challenge for autonomous systems. Robots navigating through obstacle-ridden scenarios must operate safely in the presence of obstacles, goals, and ego objects of varying geometries. In this work, reachable set representations of the robot's real-time capabilities in the state space are utilized to capture safe navigation requirements, while neural radiance fields (NeRFs) are utilized to compute, store, and manipulate the volumetric representations of the obstacles, or ego vehicle, as needed. Constrained optimal control is employed to represent…

cs.RO
#206
Robotic Autonomy 2026-04-29 arXiv cs.RO (Robotics) 5.4 5.5/6.2/4.5

Origami-inspired robotic grippers have shown promising potential for object manipulation tasks due to their compact volume and mechanical flexibility. However, robust capture of objects with random shapes in dynamic working environments often comes at the cost of additional actuation channels and control complexity. Here, we introduce a tendon-driven origami tentacle gripper capable of universal object gripping by exploiting a synergy between local, deterministic deformation programming and global, stochastic entanglements. Each origami tentacle is made by cutting thin Mylar sheets; it features carefully placed holes for routing an actuation tendon, origami…

cs.RO
#207
Robotic Autonomy 2026-04-29 arXiv cs.RO (Robotics) 5.4 5.5/6.2/4.5

Approximating collision-free space is fundamental to robot planning in complex environments. Convex geometric representations, such as polytopes and ellipsoids, are widely employed due to their structural properties, which can be easily integrated with convex optimization. Iterative optimization-based inflation methods can generate large volume polytopes in cluttered environments, but their efficiency degrades as the obstacle set becomes more complex or when sensor data are noisy. These methods are also sensitive to initialization and often rely on accurate geometric models. In this paper, we propose the STAR-Filter, a lightweight framework that employs…

cs.RO
#208
Robotic Autonomy 2026-04-29 arXiv cs.RO (Robotics) 5.4 5.5/6.2/4.5

As with every emerging technology, new tools in the hands of artists reshape the nature of artwork creation. Current frameworks for robotics in arts deploy the robot as an autonomous creator or a collaborator, thus leaving a certain gap between the human artist and the machine. Now, we stand at the dawn of an era where artists can escape physical limitations and reshape their creative identity by inhabiting an alternative body. This new paradigm allows artists not only to command a robot remotely, but also to *be* a robot,…

cs.RO
#209
Robotic Autonomy 2026-04-29 arXiv cs.RO (Robotics) 5.4 5.5/6.2/4.5

Dynamical systems (DS) methods for Learning-from-Demonstration (LfD) provide stable, continuous policies from few demonstrations. First-order dynamical systems (DS) are effective for many point-to-point and periodic tasks, as long as a unique velocity is defined for each state. For tasks with intersections (e.g., drawing an "8"), extensions such as second-order dynamics or phase variables are often used. However, by incorporating velocity, second-order models become sensitive to disturbances near intersections, as velocity is used to disambiguate motion direction. Moreover, this disambiguation may fail when nearly identical position-velocity pairs correspond to different onward…

cs.RO
#210
Robotic Autonomy 2026-04-29 arXiv cs.RO (Robotics) 5.4 5.5/6.2/4.5

In multi-agent systems, should limited resources be concentrated into a few capable agents or distributed among many simpler ones? This work formulates the split over $n$ resource sharing problem where a group of $n$ agents equally shares a common resource (e.g., monetary budget, computational resources, physical size). We present a case study in multi-agent coverage where the area of the disk-shaped footprint of agents scales as $1/n$. A formal analysis reveals that the initial coverage rate grows with $n$. However, if the speed of agents decreases proportionally with their radii,…
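The scaling argument in the case study can be worked through on the back of an envelope. The sketch below is my illustration with made-up units, not the paper's formal analysis: splitting a fixed disk area over $n$ agents shrinks each radius like $1/\sqrt{n}$, so the combined sweep rate grows like $\sqrt{n}$ when speed is fixed, and that gain vanishes when speed shrinks proportionally with radius.

```python
import math

def initial_coverage_rate(n, total_footprint_area=1.0, speed=1.0):
    # Total disk area is split over n agents: pi * r^2 = area / n,
    # so r ~ 1/sqrt(n); combined sweep rate n * (2r) * speed ~ sqrt(n).
    radius = math.sqrt(total_footprint_area / (n * math.pi))
    return n * 2 * radius * speed

def rate_with_radius_scaled_speed(n, total_footprint_area=1.0):
    # If speed also decreases proportionally with radius, the sqrt(n)
    # gain cancels and the sweep rate is independent of n.
    radius = math.sqrt(total_footprint_area / (n * math.pi))
    return n * 2 * radius * radius
```

Under these assumptions, four agents cover at twice the initial rate of one, matching the abstract's claim that the initial coverage rate grows with $n$ but only under a fixed-speed regime.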

cs.RO
#211
Robotic Autonomy 2026-04-29 arXiv cs.NE (Neural & Evolutionary Computing) 5.4 5.5/6.0/4.5

We present a Spatially Embedded Evolutionary Algorithm where robot individuals exist in a physically simulated 2D environment, must navigate to encounter potential mates, and compete for survival under various spatially-aware selection pressures. Using HyperNEAT evolved neural controllers for ARIEL gecko-inspired quadrupeds in MuJoCo, we investigate how spatial structure fundamentally alters evolutionary dynamics. Our experiments show a modest 4.9% difference in peak fitness between proximity-based and random pairing, possibly within stochastic variation, while combining spatial parent selection with stochastic death selection produces unstable population dynamics. We discover a continuous phase transition…

cs.NE
#212
State Space Models 2026-04-29 arXiv cs.NE (Neural & Evolutionary Computing) 5.4 5.5/6.0/4.5

This paper presents an application of the biologically realistic JASTAP neural network model to classification tasks. The JASTAP neural network model is presented as an alternative to the basic multi-layer perceptron model. An evolutionary procedure previously applied to the simultaneous solution of feature selection and neural network training on standard multi-layer perceptrons is extended with the JASTAP model. Preliminary results on the standard IRIS data set give evidence that this extension allows the use of smaller neural networks that can handle noisier data without any degradation in classification accuracy.

cs.NE
#213
Multimodal 2026-04-29 arXiv cs.CV (Computer Vision) 5.4 5.5/6.0/4.5

Accurate BRDF acquisition is important for realistic rendering, but dense gonioreflectometer measurements are slow and expensive. We study how to select a small number of BRDF measurements that are most useful for reconstructing material appearance under a learned reflectance prior. Our method combines a set encoder for sparse coordinate-value observations, a pretrained hypernetwork-based BRDF reconstructor, and a differentiable renderer. During sampler training, the reconstructor is kept fixed and gradients from BRDF-space and rendered-image losses are used to optimize measurement locations. This separates sample selection from prior fitting and encourages the…

cs.CV
#214
Multimodal 2026-04-29 arXiv cs.CV (Computer Vision) 5.4 5.5/6.0/4.5

Traditional iterative reconstruction methods are accurate but computationally expensive, limiting their use in high-throughput and real-time ptychography. Recent deep learning approaches improve speed, but often predict phase as a Euclidean scalar despite its $2\pi$ periodicity, which can introduce wrapping artifacts, discontinuities at $\pm\pi$, and a mismatch between the loss and the underlying signal geometry. We present a deep learning framework for ptychographic reconstruction that models phase on the unit circle using cosine and sine components. Phase error is optimized with a differentiable geodesic loss, which avoids branch-cut discontinuities and provides…

cs.CV
#215
Multimodal 2026-04-29 arXiv cs.CV (Computer Vision) 5.4 5.5/6.0/4.5

Aerial-Ground Re-Identification (AG-ReID) is constrained by the viewpoint-domain gap, as drastic viewpoint disparities occlude or distort discriminative features, making cross-viewpoint image retrieval challenging. While existing methods rely on paired cross-view annotations, real-world deployments, such as wilderness search-and-rescue (SAR), often lack target-domain data, requiring retrieval from ground-level references alone. To our knowledge, we are the first to address this challenge by formalizing the Single-View AG-ReID (SV AG-ReID) setting, where models trained on a single real viewpoint must generalize to an unseen viewpoint. We propose 3D Lifting-based Elevated Novel-view Synthesis (3D-LENS), a…

cs.CV
#216
Multimodal 2026-04-29 arXiv cs.CV (Computer Vision) 5.4 5.5/6.0/4.5

Monocular depth estimation (MDE) is a fundamental yet inherently ill-posed task. Recent vision foundation models (VFMs), particularly DINO-based transformers, have significantly improved accuracy and generalization for dense prediction. Prior works generally follow a unified paradigm: sampling a fixed set of intermediate transformer layers at uniform intervals to build multi-scale features. This common practice implicitly assumes that geometric information is uniformly distributed across layers, which may underutilize the structural 3D cues encoded in VFMs. In this study, we present a systematic layer-wise analysis of DINOv3, revealing that 3D information is distributed…

cs.CV
#217
Research 2026-04-29 arXiv stat.ML (Statistical ML) 5.4 5.5/6.0/4.5

We prove that any random variable $X$ whose moment generating function is point-wise upper bounded by that of $ G \sim \mathcal{N}(0,1) $ must be dominated by $ G/\mathbb{E}[|G|] $ in convex order, meaning $ \mathbb{E}[f(X)] \le \mathbb{E}[f(G/\mathbb{E}[|G|])] $ for all convex $f$. Equality is attained by taking $ X \sim \mathrm{Unif}(\{-1,1\}) $ and $ f(x) = |x| $.
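The stated equality case invites a quick Monte Carlo sanity check. This is my illustration of the claim, not part of the paper: $X \sim \mathrm{Unif}(\{-1,1\})$ is sub-Gaussian (its MGF $\cosh t \le e^{t^2/2}$), and for $f(x) = |x|$ both sides of the bound equal 1, while a strictly convex $f(x) = x^2$ shows the domination with slack.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(1_000_000)
scale = np.sqrt(2.0 / np.pi)      # E[|G|] for G ~ N(0, 1)

# Equality case f(x) = |x|: E[|X|] = 1 and E[|G| / E|G|] = 1.
lhs_abs = 1.0
rhs_abs = np.abs(g / scale).mean()

# Strict case f(x) = x^2: E[X^2] = 1 <= E[(G / E|G|)^2] = pi / 2.
lhs_sq = 1.0
rhs_sq = np.mean((g / scale) ** 2)
```

The second pair illustrates the convex-order direction: the normalized Gaussian dominates, with the gap $\pi/2 - 1$ for the quadratic test function.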

stat.ML
#218
Research 2026-04-29 arXiv stat.ML (Statistical ML) 5.4 5.5/6.0/4.5

We show that if the conditional distribution p(C | T) factors through a sufficient statistic φ(T), then the Information Bottleneck (IB) problem for (T, C) is exactly equivalent to the IB problem for (φ(T), C). The reduction is loss-free: it preserves the full IB curve, the Lagrangian optimum at every trade-off parameter β, and the optimal representations up to pullback through φ. As a result, the computational complexity of solving the IB problem is governed by the dimension of the sufficient statistic rather than the ambient dimension of the source.…

stat.ML
#219
AI Coding 2026-04-29 Simon Willison 5.4 5.5/6.2/4.5

Release: llm 0.32a1 Fixed a bug in 0.32a0 where tool-calling conversations were not correctly reinflated from SQLite. #1426 Tags: llm

#220
AI Coding 2026-04-29 Simon Willison 5.4 5.5/6.2/4.5

I just released LLM 0.32a0, an alpha release of my LLM Python library and CLI tool for accessing LLMs, with some consequential changes that I've been working towards for quite a while. Previous versions of LLM modeled the world in terms of prompts and responses. Send the model a text prompt, get back a text response:

import llm
model = llm.get_model("gpt-5.5")
response = model.prompt("Capital of France?")
print(response…

#221
AI Coding 2026-04-29 Simon Willison 5.4 5.5/6.2/4.5

Release: llm 0.32a0 See the annotated release notes . Tags: llm

#229
Government & Defense 2026-04-29 DefenseScoop 5.4 5.5/6.2/4.5

As the National Geospatial-Intelligence Agency integrates artificial intelligence into HR workflows, the organization is taking a prudent approach to ensure its workforce doesn’t become overdependent on the technology. “My biggest fear is that in five years, we’re going to lose a lot of expertise because we have automated so many of the things that have helped those individuals really understand their tradecraft, understand HR and the nuances and complexities and be able to grow,” Sasha Muth, deputy director of human…

#230
Government & Defense 2026-04-29 C4ISRNET 5.4 5.5/6.0/4.5

KYIV — President Volodymyr Zelenskiy has leveraged Ukraine’s expertise in drone warfare into a series of successful diplomatic deals during visits to the Middle East and Europe, showcasing how Kyiv is using military prowess to boost its diplomatic clout. Since Russia’s invasion in 2022, Zelenskiy has sought to strengthen Kyiv’s alliances, both with Western allies and with countries of the “global south,” to restrict Russia’s diplomatic sway. The Iran war has confirmed how central drones are to modern warfare and…

#231
Government & Defense 2026-04-29 FedScoop 5.4 5.5/6.0/4.5

Two DOGE associates dispatched to the Treasury Department in the early days of the second Trump administration flouted various IT security rules while the agency itself fell short on implementing proper cyber controls, a new watchdog report found. The Government Accountability Office examined access that a pair of DOGE staffers had to Bureau of the Fiscal Service payment systems from Jan. 20-April 11, 2025. The audit aimed to determine what the DOGE duo planned to do with BFS systems, and…

#232
Government & Defense 2026-04-29 War on the Rocks 5.4 5.5/6.0/4.5

On April 25, armed groups launched near-simultaneous attacks against military installations and key strategic sites across Mali. Claimed by Jama’at Nusrat al-Islam wal-Muslimin, a jihadist group, and conducted in coordination with Tuareg separatist forces from the Front de libération de l’Azawad, the attacks targeted multiple nodes across the country’s security architecture, from the capital Bamako to Gao, Mopti, and Kidal. While the attacks themselves were a shock, they should be understood as the logical endpoint of a deteriorating security trajectory…

#235
Industry 2026-04-29 Gradient Flow 5.2 5.0/5.9/4.5

Recent results suggest that research mathematics is no longer a purely speculative test case for AI. A growing set of examples shows AI contributing not just to short contest puzzles, but to open-ended mathematical work that requires literature search, cross-domain connection-making, revision, and verification. The important lesson for enterprise AI teams is not that AI has suddenly become a mathematician. It is that progress accelerates in settings where outputs can be checked, workflows are iterative, and human experts remain responsible…

#236
Research 2026-04-29 Computerphile 5.2 5.0/5.9/4.5

What my research team and I have been most interested in over the last few years is trying to make EDA tools more reliable. By EDA tools I mean the tools that hardware engineers use to make hardware. The basic flow would be: a hardware engineer comes up with a design for what they want the functionality of the hardware to be, and they'll agonize over…

Items: 236
Multi-source: 106
Long-form (≥7.5): 7
Sources OK / attempted: 58 / 69
Top category: Evaluations & Benchmarks (50 items)