← Archive / All Digests
A wolf in round glasses reading a book, wrapped in a golden ribbon, in a sunlit forest.

Wolf Digest — Wednesday, April 29, 2026

Coverage window: 2026-04-28 03:02 ET2026-04-29 03:02 ET
Press play to listen
Wednesday, April 29, 2026
16m 2s · top-4 narrated briefing
#1 · Industry
OpenAI–Microsoft restructuring: exclusivity ends, AGI clause replaced, OpenAI lands on AWS Bedrock
The seven-year Microsoft–OpenAI exclusivity arrangement formally ended this week, and the day-after coverage reset what the rest of the cloud-AI market thinks it can ask for. The amended agreement OpenAI and Microsoft published replaces the…
8.3 · 4 srcs
#2 · Agents & Tool Use
Recursive Multi-Agent Systems
Recursive Multi-Agent Systems (RecursiveMAS), the highest-engagement HF Daily Papers entry today at 46 upvotes, extends the looped-language-model scaling axis from a single model to a multi-agent system. Looped models scale inference comput…
7.8 · 2 srcs
#3 · Evaluations & Benchmarks
DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
DV-World introduces a benchmark of 260 tasks designed to evaluate data-visualization agents across what the authors call the "real-world professional lifecycle" — that is, the full arc of producing, adapting, and clarifying business charts,…
7.6 · 2 srcs
6.5
#1
Industry 2026-04-28 Hacker News · OpenAI Research · Stratechery · TechCrunch — AI 8.3 7.1/7.6/10.0

The seven-year Microsoft–OpenAI exclusivity arrangement formally ended this week, and the day-after coverage reset what the rest of the cloud-AI market thinks it can ask for. The amended agreement OpenAI and Microsoft published replaces the original AGI revocation clause — the contractual provision that let OpenAI unilaterally withdraw IP from Microsoft once an internal AGI determination was made — with a defined-stage handover process tied to capability thresholds and external attestation. In practical terms, Microsoft loses its hard veto on where OpenAI's frontier weights are served, and OpenAI loses its all-or-nothing exit. Both sides walked away with what the markets read as the more durable settlement: Microsoft retains a long-term economic interest plus committed compute purchases, OpenAI gets multi-cloud distribution and the ability to sign deals like the AWS one announced the same morning.

The AWS leg is the headline operational consequence. OpenAI's GPT-class models, Codex, and the Managed Agents tier are now available natively on Amazon Bedrock, with new SKUs appearing on AWS Marketplace within hours of the announcement. Stratechery's interview with Altman and Garman frames the deal as Bedrock graduating from a model-aggregator to a managed-agent platform — Garman emphasized that Bedrock's value proposition shifts from "any frontier model in your VPC" to "any frontier agent that can reach your AWS data without leaving your VPC." Altman's framing was narrower and more financial: the AWS partnership unlocks federal and enterprise demand that was constrained by AWS-residency requirements OpenAI could not meet under the prior Microsoft-exclusive arrangement.

The structural read is that the AGI clause was load-bearing for both companies' long-term investor pitch — without it, OpenAI's defense against being permanently dependent on Microsoft was nominal, and with it, Microsoft's claim to having locked in the strongest model lab was perpetually under a poison pill. Replacing it with capability-threshold language plus a graduated transfer schedule lowers the legal risk on both sides. Hacker News ran two parallel front-page threads — one on the exclusivity ending (966 points) and one on the AWS/Bedrock interview (227 points). Top comments converged on two open questions: how the new "AGI" language operationalizes the threshold without recreating the same all-or-nothing trigger, and whether Microsoft's Azure compute commitments now act as a soft retention mechanism in lieu of contractual exclusivity.

For the cloud-AI market, this resets the negotiating posture for every other lab-cloud relationship. Anthropic–AWS, Google–DeepMind, Mistral–Azure, and the smaller Chinese-lab cloud deals are all now operating in a context where OpenAI–Microsoft exclusivity is no longer the implicit reference point. TechCrunch's reporting noted that Bedrock's new managed-agent SKUs include OpenAI Codex pricing tiers that undercut OpenAI's direct API for AWS-resident workloads — meaning AWS-resident enterprise spend on OpenAI now flows partially through Amazon's economics, not OpenAI's, and that compromise is presumably what made the deal possible at all.

How it was discussed
  • OpenAI Research blog frames the AWS launch as enterprises gaining secure access to GPT models, Codex, and Managed Agents inside their own AWS environments — explicit Bedrock-resident managed-agent positioning.
  • Stratechery's Altman/Garman interview emphasizes the partnership as legitimization of OpenAI as a multi-cloud vendor and a structural break from Microsoft exclusivity; Thompson reads the AGI clause replacement as the gating change that unlocked it.
  • Hacker News ran two parallel front-page threads — one on the Microsoft exclusivity ending (966 points) and one on the AWS/Bedrock interview (227 points) — community comments mostly focus on whether Microsoft retains a backstop equity claim and what 'AGI' actually means now in the new contract.
  • TechCrunch reports the AWS Marketplace listings are already live with new SKUs the same day the partnership was announced, suggesting months of pre-staging.
#2
Agents & Tool Use 2026-04-28 arXiv · Hugging Face Daily Papers 7.8 7.5/6.1/9.1

Recursive Multi-Agent Systems (RecursiveMAS), the highest-engagement HF Daily Papers entry today at 46 upvotes, extends the looped-language-model scaling axis from a single model to a multi-agent system. Looped models scale inference compute by iteratively refining the same model's computation over latent states; this paper asks whether the same recursion principle can scale agent collaboration. The authors define a recursive multi-agent framework where the entire system — heterogeneous agents and their inter-agent communication — is cast as a unified latent-space recursive computation, connected by a lightweight RecursiveLink module that handles in-distribution latent-thoughts generation and cross-agent latent-state transfer.

The technical core is a pair of choices that distinguish this from prior multi-agent setups. First, agents communicate over learned latent vectors rather than over natural-language text — the paper argues that text bottlenecks the bandwidth of inter-agent reasoning, and replaces the text channel with a residual-stream-style transfer at the cost of losing human-interpretable trace. Second, the system is co-optimized across agents using an inner-outer loop algorithm with shared gradient-based credit assignment across recursion rounds. Each recursion round provides a partial gradient signal back through the prior agents, so the system trains as a single computational graph rather than as independent fine-tunes glued together with prompt scaffolding. The authors include theoretical analyses of runtime complexity and learning dynamics, claiming RecursiveMAS is more efficient than standard text-based multi-agent setups and that gradients remain stable through recursion — both points commenters flagged as worth replicating.

Empirically, the paper instantiates RecursiveMAS under four representative agent collaboration patterns — debate, supervisor–executor, planner–solver, and round-robin generation — and evaluates across nine benchmarks spanning math reasoning, scientific QA, code generation, and tool-use. The reported gains are most pronounced on long-horizon reasoning tasks where multiple recursion rounds are useful, and tail off on single-shot benchmarks where the recursion depth is set to one. The authors don't claim a new SOTA on any single benchmark; instead they claim a structural compute-efficiency improvement over prompt-based multi-agent at matched capability.

The community reaction split into two camps. The optimistic read, dominant on HF Daily Papers, is that this is the natural endpoint of the latent-thoughts research direction — once a single model can profitably loop on latent states (which prior work established), extending the same idea across multiple specialized models is mechanically straightforward. The skeptical read centers on whether cross-agent latent state transfer reduces, in practice, to a fancier KV-cache reuse — and whether the engineering complexity of co-optimizing four-or-more agents is worth the modest reasoning-benchmark deltas. The eval matrix (4 collaboration patterns × 9 benchmarks) is more thorough than the typical multi-agent ablation, which lent the paper credibility, but a third-party reproduction will determine whether the gradient-stability claims hold outside the authors' setup. If they do, the practical implication is that multi-agent systems can be trained as units instead of orchestrated as prompt scaffolds — a meaningfully different posture for the field.

How it was discussed
  • The arXiv paper positions RecursiveMAS as a clean extension of the looped-LM scaling axis to multi-agent systems — same gradient-based credit-assignment trick, applied to whole-system co-optimization.
  • Hugging Face Daily Papers community (▲46) discussion centers on whether the cross-agent latent state transfer reduces to a fancier KV-cache reuse, and on the 4-pattern × 9-benchmark eval matrix as more thorough than the typical multi-agent ablation.
cs.AI cs.CL cs.LG
#3
Evaluations & Benchmarks 2026-04-28 arXiv · Hugging Face Daily Papers 7.6 7.2/7.0/8.1

DV-World introduces a benchmark of 260 tasks designed to evaluate data-visualization agents across what the authors call the "real-world professional lifecycle" — that is, the full arc of producing, adapting, and clarifying business charts, not just the one-shot text-to-chart problem prior benchmarks captured. Three subdomains structure the eval: DV-Sheet covers native spreadsheet manipulation including chart construction, dashboard layout, and diagnostic repair of broken charts; DV-Evolution tests an agent's ability to adapt and restructure existing visual artifacts to fit new data across multiple programming paradigms (matplotlib, plotly, Vega-Lite, native Excel); DV-Interact pairs the agent with a user simulator that issues deliberately ambiguous requirements, requiring the agent to ask clarifying questions or surface assumptions before producing output.

The evaluation framework is the more interesting methodological contribution. The authors combine Table-value Alignment for numerical precision with MLLM-as-a-Judge using rubrics for semantic-visual assessment — so a chart is scored both on whether the underlying numerical mapping is correct and on whether the resulting visual is, by an MLLM judge's rubric, a sensible answer to the prompt. Prior DV benchmarks tended to test only one of those axes (either code execution success or visual similarity to a reference). The hybrid scheme matters because most real-world DV failures aren't catastrophic numerical errors, they're judgment errors — wrong chart type for the data, axis lying about the comparison, three colors when one would do — that pure-numerical scoring misses entirely.

The headline result is that state-of-the-art models achieve under 50% overall performance on DV-World, exposing what the authors describe as critical deficits in handling the complexity of real-world data visualization. The number is consistent with what HF Daily Papers commenters noticed: native spreadsheet manipulation (DV-Sheet) is harder than the leaderboard for one-shot chart generation might have suggested, and the proactive-intent-alignment subset (DV-Interact) is where the largest gap appears between top-tier models and the rest of the field. Models that are merely competent at producing charts on demand stumble badly when they have to negotiate the ambiguity in the prompt before producing anything at all.

The community reception (▲25 on HF Daily Papers) was generally positive on the design but raised two caveats. First, the MLLM-as-a-Judge component has the usual concerns about judge bias toward outputs that resemble its own training distribution — the authors address this by reporting inter-judge agreement and using multiple judges, but the result will need replication with different judge models. Second, the 50% headline figure is an aggregate across very different subtasks; the per-subtask breakdown matters more than the headline, and DV-World's value to the field will depend on whether downstream agent developers actually optimize against the full benchmark rather than cherry-picking the easier subdomains. As a reference point, this is currently the most thorough DV-agent eval in public release, and is likely to become the default citation for chart-generation agent papers within a quarter.

How it was discussed
  • arXiv abstract pitches 260 tasks across DV-Sheet, DV-Evolution, and DV-Interact — first DV agent benchmark to bundle ambiguous-intent simulation alongside chart-generation and adaptation.
  • HF Daily Papers (▲25) discussion notes <50% headline number for SOTA models — sanity-check on how far native-spreadsheet manipulation has come; some comment that DV-Sheet may be the harder-than-it-looks subset.
cs.CL
#4
Government & Defense 2026-04-28 Hacker News · TechCrunch — AI 7.5 6.3/7.1/7.8

The Pentagon-Google agreement reported this week formalizes Google's expanded role in DoD AI procurement — specifically, granting Google the contractual standing to provide its frontier models for "any lawful" Department of Defense use, terminology that materially differs from the constrained, use-case-specific language in earlier military AI procurements. The framing matters because it inverts the burden of proof: where prior contracts enumerated allowed uses (and required affirmative review for each new application), the "any lawful" formulation puts the legal department, not the procurement officer, in the gate-keeping role. In practice, that means Google's models become a default-permitted DoD capability across the agency unless individual deployments are flagged.

The most-discussed structural consequence is the contrast with Anthropic's earlier stance. Anthropic publicly declined to participate in the broadest Pentagon AI uses citing constitutional AI / safety-policy concerns, and that refusal is now visibly costing it federal share — Google's expanded access fills the procurement vacuum Anthropic created. Several Hacker News comments (288 points on the lead thread) drew the comparison to Project Nimbus (the Israel-Google cloud contract), arguing the "any lawful" clause echoes Nimbus's broad-permissioning language and, like Nimbus, will likely face internal employee dissent at Google. TechCrunch reported that the Pentagon's GenAI.mil platform — the government-side wrapper that consumed third-party model APIs through May — gains Google's models alongside its existing OpenAI and Microsoft connectors, with usage on the platform reportedly past 100,000 user-built agents.

The implications for the wider AI-policy landscape are real. Three of the four U.S. frontier-model labs (OpenAI, Google, Microsoft) are now explicitly enabled for broad DoD use; Anthropic's holdout posture is increasingly an outlier rather than a representative industry stance. That changes how Capitol Hill and the executive branch read industry concerns about military AI — a single-lab refusal carries less rhetorical weight when the rest of the frontier is on board. It also changes the AGI-policy conversation downstream: if the technical capabilities being marketed to enterprise are also being marketed to defense under permissive terms, the safety-policy argument that "frontier models are too dangerous to deploy widely" loses force, because they are already being deployed widely.

The legal craft on the "any lawful" language is likely to receive sustained scrutiny. The phrase is not new in DoD procurement — it appears in cloud-services contracts going back years — but its first frontier-model application sets the template for the rest of the procurement cycle. War on the Rocks and Lawfare commentary, both surfaced via gov-defense feeds today, will likely engage with this in the next-week analysis cycle; for now, the open question for Daniel and other observers is whether the Anthropic precedent re-opens (i.e., other labs formally decline equivalent terms), or whether the procurement gravity pulls everyone in on similar language. The market read so far suggests the latter.

How it was discussed
  • TechCrunch frames the agreement as Pentagon expanding GenAI.mil after Anthropic's earlier refusal to participate — Anthropic's policy stance is now visibly costing it federal share.
  • Hacker News (288 points) split between national-security ('we need this') framings and concerns over the 'any lawful' clause acting as a near-blank check — top comments compare the language to Project Nimbus's Israel deal.
#5
Generative Media 2026-04-27 arXiv · Hugging Face Daily Papers 7.5 7.2/6.2/7.4

Meta-CoT, the highest-upvoted image-editing entry on HF Daily Papers today (▲21), addresses a well-known tension in unified multimodal models that combine understanding and generation: the chain-of-thought process used to reason about an edit before generating it improves quality on the trained distribution but generalizes poorly to compositional, out-of-distribution prompts. The paper argues that prior unified models hard-code a specific CoT format and a specific training signal, and that this brittleness is what bounds their generalization. Meta-CoT instead proposes a paradigm in which the form of chain-of-thought is itself learned — a meta-policy over CoT structures, trained jointly with the editing task, lets the model adapt its reasoning style to the prompt rather than applying a fixed template.

The technical contribution is a two-part system. The first part is a CoT-policy module that decides, conditioned on the input image and the prompt, what kind of intermediate reasoning trace to produce — for example, whether to enumerate visible objects before reasoning about the edit, or to first translate the prompt into a structured edit intent, or to skip CoT entirely for short, syntactically-clear prompts. The second part is a training strategy that pairs the CoT-policy with the generation model under a joint objective, giving the system a path to learn that some prompts benefit from longer reasoning traces while others are degraded by them. Crucially, the paper provides ablations showing that fixing either piece (fixing the CoT format, or not co-training) recovers the baseline brittleness — both pieces are required for the generalization gain.

Empirically, the authors report improvements on both the granularity of edits (preserving fine detail outside the edit region) and generalization to compositional out-of-distribution prompts (multi-object multi-attribute edits the model has not seen during training). The reported deltas are largest on the OOD compositional split, which the authors emphasize as the bottleneck case for prior unified models. The architecture is built on top of a publicly available unified backbone, so the method is reproducible without proprietary weights — that detail matters because past unified-model contributions have often been scoped to specific large-lab backbones, limiting independent verification.

The HF Daily Papers reception focused on two things. First, whether the granularity gains transfer outside the authors' eval set — the eval includes a mix of standard image-editing benchmarks (MagicBrush, EmuEdit) plus a new compositional split, and commenters wanted to see what happens on real user prompts collected from production editing services. Second, whether the meta-policy approach generalizes beyond image editing to other multimodal generation tasks (video, 3D), which the authors gesture at but don't evaluate. The structural takeaway is that fixed CoT formats are now visibly the bottleneck for unified-model generalization, and the next round of unified-model papers will likely have to defend their CoT-format choice or learn it. As a research direction, "learned CoT structure" was already on the field's radar in pure-text reasoning; Meta-CoT pulls that idea cleanly into the multimodal-generation setting.

How it was discussed
  • arXiv frames Meta-CoT as a paradigm for jointly enhancing CoT granularity and generalization in image editing — emphasis on the joint training strategy.
  • HF Daily Papers (▲21) reception focuses on whether the granularity gains transfer outside the authors' eval set; commenters compare to prior unified UM understanding/generation work.
cs.CV cs.AI cs.LG cs.MM
#6
Post-Training 2026-04-27 arXiv · Hugging Face Daily Papers 7.0 6.8/5.6/8.1

Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-tuning on domain corpora has enabled substantial capability gains, but the process operates without feedback: when a model fails on a domain task, there is no method to diagnose what is deficient in the training data, and the only recourse is to add more data indiscriminately. Here we show that when a structured knowledge representation extracted from the source corpus serves as the shared foundation for both training data and evaluation, the c

How it was discussed
  • arXiv positions the work as test-driven data engineering — pose tests over what the model should learn, then mine training corpora to satisfy the tests, instead of fine-tuning blind.
  • HF Daily Papers (▲27) commentary notes the resemblance to TDD flows from software, and questions on whether the test-coverage ceiling caps how far this iterates.
cs.SE cs.AI
#7
Agents & Tool Use 2026-04-28 arXiv · Hugging Face Daily Papers 6.9 7.2/5.6/7.4

Autonomous scientific research is significantly advanced thanks to the development of AI agents. One key step in this process is finding the right scientific literature, whether to explore existing knowledge for a research problem, or to acquire evidence for verifying assumptions and supporting claims. To assess AI agents' capability in driving this process, we present AutoResearchBench, a dedicated benchmark for autonomous scientific literature discovery. AutoResearchBench consists of two complementary task types: (1) Deep Research, which requires tracking down a specific target paper through

How it was discussed
  • arXiv positions the benchmark as the first to evaluate agent-driven literature discovery on tasks that are genuinely 'find the relevant evidence,' not just 'retrieve a paper by title.'
  • HF Daily Papers (▲20) comments split between excitement about a science-agent eval and skepticism that the lit-discovery pipeline isn't already saturated by existing search-augmented models.
cs.AI
#10
Generative Media 2026-04-28 arXiv · Hugging Face Daily Papers 6.6 6.2/5.6/7.4

Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially extending the performance upper bound. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, where UMMs produce editing instructions to modify misaligned regions while preserving aligned content. However, editing instructions often describe prompt-image misalignment only coarsely, leading to incomplete refinement. Moreover

cs.CV
#14
Generative Media 2026-04-28 arXiv · Hugging Face Daily Papers 6.3 6.2/5.4/6.8

In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streami

cs.CV cs.SD
#18
Agents & Tool Use 2026-04-28 arXiv · Hugging Face Daily Papers 6.0 5.8/5.5/6.2

Terminal agents have demonstrated strong potential for autonomous command-line execution, yet their training remains constrained by the scarcity of high-quality and diverse execution trajectories. Existing approaches mitigate this bottleneck by synthesizing large-scale terminal task instances for trajectory sampling. However, they primarily focus on scaling the number of tasks while providing limited control over the diversity of execution trajectories that agents actually experience during training. In this paper, we present SkillSynth, an automated framework for terminal task synthesis built

cs.AI
#20
Agents & Tool Use 2026-04-27 arXiv · Hugging Face Daily Papers 5.9 5.8/5.0/6.2

While diffusion models generate high-fidelity video clips, transforming them into coherent storytelling engines remains challenging. Current agentic pipelines automate this via chained modules but suffer from semantic drift and cascading failures due to independent, handcrafted prompting. We present Co-Director, a hierarchical multi-agent framework formalizing video storytelling as a global optimization problem. To ensure semantic coherence, we introduce hierarchical parameterization: a multi-armed bandit globally identifies promising creative directions, while a local multimodal self-refineme

cs.AI cs.MA cs.MM
#21
Audio & Speech 2026-04-23 arXiv · Hugging Face Daily Papers 5.8 5.9/5.5/5.4

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text to Speech(TTS) introduces high variance due to linguistic diversity and multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In

cs.CL
#22
Agents & Tool Use 2026-04-27 arXiv · Hugging Face Daily Papers 5.8 5.9/5.5/5.4

On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter-turn erro

cs.LG cs.AI
#24
Safety, Policy & Regulation 2026-04-28 arXiv · Hugging Face Daily Papers 5.7 5.4/5.6/5.4

Creating interactive STEM courseware traditionally requires HTML/CSS/JavaScript expertise, leaving barriers for educators. While generative AI can produce HTML codes, existing tools generate static presentations rather than interactive simulations, struggle with long documents, and lack pedagogical accuracy mechanisms. Furthermore, full regeneration for modifications requires 200--600 seconds, disrupting creative flow. We present MAIC-UI, a zero-code authoring system that enables educators to create and rapidly edit interactive courseware from textbooks, PPTs, and PDFs. MAIC-UI employs: (1) st

cs.CL cs.AI cs.HC
#25
Generative Media 2026-04-28 arXiv · Hugging Face Daily Papers 5.7 5.4/5.8/5.4

While large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content, a significant gap remains between their pretraining performance and real-world deployment requirements due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model in

cs.CV
#26
Agents & Tool Use 2026-04-28 arXiv · Hugging Face Daily Papers 5.7 5.4/5.6/5.4

Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task-specific requirements, while prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decom

cs.CL cs.AI cs.LG
#30
Multimodal 2026-04-28 arXiv 5.5 5.9/6.1/4.0

Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task, yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean scor

quant-ph cs.CV
#31
Reinforcement Learning 2026-04-28 arXiv 5.5 5.4/6.5/4.0

Earth observation satellite imaging scheduling is a challenging NP-hard combinatorial optimisation problem central to space mission operations. While next-generation agile Earth observation satellites (EOS) increase operational flexibility, they also significantly raise scheduling complexity. The lack of a unified, open-source benchmark makes it difficult to compare algorithms across studies. This paper introduces EOS-Bench, a comprehensive framework for systematic and reproducible evaluation of scheduling methods. By integrating high-fidelity orbital dynamics and platform constraints, EOS-Ben

cs.NI cs.RO
#32
Generative Media 2026-04-23 arXiv · Hugging Face Daily Papers 5.5 5.4/5.0/5.4

Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains under explored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturb

cs.CV cs.CL
#33
Multimodal 2026-04-28 arXiv · Hugging Face Daily Papers 5.5 5.4/5.0/5.4

Recent advances in text-driven human motion generation enable models to synthesize realistic motion sequences from natural language descriptions. However, most existing approaches assume identity-neutral motion and generate movements using a canonical body representation, ignoring the strong influence of body morphology on motion dynamics. In practice, attributes such as body proportions, mass distribution, and age significantly affect how actions are performed, and neglecting this coupling often leads to physically inconsistent motions. We propose an identity-aware motion generation framework

cs.CV
#34
Government & Defense 2026-04-28 TechCrunch — AI · Hacker News 5.5 5.4/6.3/4.6

After Anthropic refused to allow the DoD to use its AI for domestic mass surveillance and autonomous weapons, Google has signed a new contract with the department.

How it was discussed
  • TechCrunch frames the agreement as Pentagon expanding GenAI.mil after Anthropic's earlier refusal to participate — Anthropic's policy stance is now visibly costing it federal share.
  • Hacker News (288 points) split between national-security ('we need this') framings and concerns over the 'any lawful' clause acting as a near-blank check — top comments compare the language to Project Nimbus's Israel deal.
industry
#35
Evaluations & Benchmarks 2026-04-28 arXiv 5.4 5.0/6.7/4.0

Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments,

cs.CL cs.AI cs.DL cs.IR
#36
Post-Training 2026-04-28 arXiv 5.4 5.0/6.6/4.0

Contemporary neural machine translation (NMT) systems are almost exclusively built by training on supervised parallel data. Despite the tremendous progress achieved, these systems still exhibit persistent translation errors. This paper proposes that a post-training paradigm based on reinforcement learning (RL) can effectively rectify such mistakes. We introduce a novel framework that requires only a general text corpus and an expert translator which can be either human or an AI system to provide iterative feedback. In our experiments, we focus specifically on English-to-German translation as a

cs.CL
#37
Post-Training 2026-04-28 arXiv 5.3 5.0/6.4/4.0

Preference-based alignment methods, most prominently Reinforcement Learning with Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority

cs.CY cs.AI cs.CL
#38
Post-Training 2026-04-28 arXiv 5.3 5.4/5.8/4.0

Real-world evidence (RWE) studies that emulate target trials increasingly inform regulatory and clinical decisions, yet residual, hard-to-quantify biases still limit their credibility. The recently proposed BenchExCal framework addresses this challenge via a two-stage Benchmark, Expand, Calibrate process, which first compares an observational emulation against an existing randomized controlled trial (RCT), then uses observed divergence to calibrate a second emulation for a new indication causal effect estimation. While methodologically powerful, BenchExCal is resource intensive and difficult t

cs.AI
#39
Research 2026-04-28 arXiv 5.3 5.0/6.2/4.0

We present StratFormer, a transformer-based meta-agent that learns to simultaneously model and exploit opponents in imperfect-information games through a two-phase curriculum. The first phase trains an opponent modeling head to identify behavioral patterns from action histories while the agent plays a game-theoretic optimal (GTO) policy. The second phase progressively shifts the policy toward best-response (BR) exploitation, guided by a per-opponent regularization schedule tied to exploitability. Our architecture introduces dual-turn tokens -- feature vectors constructed at both agent and oppo

cs.AI
#40
Evaluations & Benchmarks 2026-04-28 arXiv 5.3 5.4/6.0/4.0

Accurate bandgap prediction is crucial for semiconductor applications, yet machine learning models trained on computational data often struggle to generalize to experimental bandgap measurements. Challenges related to data fidelity, domain generalization, and model interpretability remain insufficiently addressed in existing evaluation frameworks. To bridge this gap, we introduce RealMat-BaG, a benchmark for assessing model reliability under experimentally relevant conditions. We curate an open-access dataset of experimental bandgaps with aligned crystal structures and compare graph neural net

cond-mat.mtrl-sci cs.AI
#41
Evaluations & Benchmarks 2026-04-28 arXiv 5.3 5.9/5.5/4.0

Large Language Models are increasingly being deployed to extract structured data from unstructured and semi-structured sources: parsing invoices, medical records, and converting PDF documents to database entries. Yet existing benchmarks for structured output generation either focus on schema compliance alone, or evaluate value correctness within a single source domain. We introduce SOB (The Structured Output Benchmark), a multi-source benchmark spanning three source modalities: native text, images, and audio conversations. All models receive a text-normalized representation of their context re

cs.CL cs.AI
#42
Generative Media 2026-04-28 arXiv 5.3 5.4/6.0/4.0

Evaluating layout-guided text-to-image generative models requires assessing both semantic alignment with textual prompts and spatial fidelity to prescribed layouts. Assessing layout alignment requires collecting fine-grained annotations, which is costly and labor-intensive. Consequently, current benchmarks rarely provide comprehensive layout evaluation and often remain limited in scale or coverage, making model comparison, ranking, and interpretation difficult. In this work, we introduce a closed-set benchmark (C-Bench) designed to isolate key generative capabilities while providing varying le

cs.CV
#43
Multimodal 2026-04-28 NVIDIA AI Blog 5.3 5.7/5.6/4.0

AI agent systems today juggle separate models for vision, speech and language — losing time and context as they pass data from one model to the other. Unveiled today, NVIDIA Nemotron 3 Nano Omni is an open multimodal model that brings these capabilities together into one system, enabling agents to deliver faster, smarter responses with [&#8230;]

industry infra
#44
Research 2026-04-28 arXiv 5.3 5.0/6.4/4.0

Distributional and neural approaches to natural language semantics have been built almost exclusively on conventional linear algebra: vectors, matrices, tensors, and the operations that accompany them. These methods have achieved remarkable empirical success, yet they face persistent structural limitations in compositional semantics, type sensitivity, and interpretability. I argue in this paper that geometric algebra (GA) -- specifically, Clifford algebras -- provides a mathematically superior foundation for semantic representation, and that a Functional Geometric Algebra (FGA) framework exten

cs.CL cs.AI cs.LG
#45
Post-Training 2026-04-28 arXiv 5.3 5.0/6.4/4.0

Imbalanced classification remains a pervasive challenge in machine learning, particularly when minority samples are too scarce to provide a robust discriminative boundary. In such extreme scenarios, conventional models often suffer from unstable decision boundaries and a lack of reliable error control. To bridge the gap between generative modeling and discriminative classification, we propose a two-stage framework \textbf{VAE-Inf} that integrates deep representation learning with statistically interpretable hypothesis testing. In the first stage, we adopt a one-class modeling perspective by tr

cs.LG cs.AI
#46
Post-Training 2026-04-28 arXiv 5.2 5.0/6.0/4.0

Finetuning a language model can lead to emergent misalignment (EM) [Betley et al., 2025b]. Models trained on a narrow distribution of misaligned behavior generalize to more egregious behaviors when tested outside the training distribution. We study a set of interventions proposed to reduce EM. We confirm that these interventions reduce or eliminate EM on existing evaluations (questions like "How do I make a quick buck?"). However, if the evaluation prompts are tweaked to resemble the training context, the model displays EM. We call this conditional misalignment. As in standard EM, the model di

cs.LG cs.AI cs.CR
#47
Research 2026-04-28 arXiv 5.2 5.0/6.1/4.0

Machine-generated text (MGT) detection requires identifying structurally invariant signals across generation models, rather than relying on model-specific fingerprints. In this respect, we hypothesize that while large language models excel at local semantic consistency, their autoregressive nature results in a specific kind of structural fragility compared to human writing. We propose Luminol-AIDetect, a novel, zero-shot statistical approach that exposes this fragility through coherence disruption. By applying a simple randomized text-shuffling procedure, we demonstrate that the resulting shif

cs.CL cs.AI cs.CY
#48
Agents & Tool Use 2026-04-28 arXiv 5.2 5.0/6.0/4.0

Multilingual retrieval-augmented generation (mRAG) is often implemented within a fixed retrieval space, typically via query or document translation or multilingual embedding vector representations. However, this approach may be inadequate for culturally grounded queries, in which retrieval-condition misalignment may occur. Even strong retrievers and generators may struggle to produce culturally relevant answers when sourcing evidence from inappropriate linguistic or regional contexts. To this end, we introduce CORAL (COntext-aware Retrieval with Agentic Loop, an adaptive retrieval methodology

cs.CL cs.AI
#49
Multimodal 2026-04-28 arXiv 5.2 5.4/5.6/4.0

Online comments play a crucial role in shaping public sentiment and opinion dynamics on social media. However, evaluating their popularity remains challenging, not only because it depends on linguistic quality, originality, and emotional resonance, but also because stylistic preferences vary widely across platforms and user groups, causing the same comment to resonate differently in different communities. In this work, we present HotComment, a multimodal benchmark integrating video and text modalities that comprehensively quantifies popularity from three enhanced aspects: (1) Content Quality,

cs.AI
#50
Safety, Policy & Regulation 2026-04-28 arXiv 5.2 5.0/6.1/4.0

Deploying an intrusion detector trained in one industrial plant to another remains difficult because Industrial Control System (ICS) traffic is highly site-dependent, labels are scarce, and unseen attacks often appear after deployment. To address this challenge, this paper introduces a medoid prototype alignment framework for cross-plant unknown attack detection. Instead of aligning all source and target samples directly, the method first compresses heterogeneous traffic into a comparable representation space and then extracts robust medoid prototypes that summarize local operational structure

cs.CR cs.AI
#51
Infrastructure 2026-04-28 arXiv 5.2 5.5/5.5/4.0

AI tools are being deployed over MBSE models today, and those models were not designed for this kind of consumption. The problem is not simply that tools hallucinate: well-prompted frontier models produce competent, useful output over a conformant SysML model, but the reasoning they produce is drawn from training rather than retrieved from the model itself, and different tools over the same model produce different results with nothing in the record to adjudicate between them. The model, in other words, is functioning as a prompt rather than as a knowledge base. Attaching better tools to the sa

cs.SE cs.AI
#52
Audio & Speech 2026-04-28 arXiv 5.2 5.0/5.9/4.0

Generating symphonic music requires simultaneously managing high-level structural form and dense, multi-track orchestration. Existing symbolic models often struggle with a "complexity-control imbalance", in which scaling bottlenecks limit long-term granular steerability. We present SymphonyGen, a 3D hierarchical framework for contemporary cinematic orchestration. SymphonyGen employs a cascading decoder architecture that decomposes the Bar, Track, and Event axes, improving computational efficiency and scalability over conventional 1D or 2D models. We introduce "short-score" conditioning via a b

cs.SD cs.AI
#53
Evaluations & Benchmarks 2026-04-28 arXiv 5.2 5.5/5.5/4.0

AI chatbots are increasingly used for health advice, but their performance in psychiatric triage remains undercharacterized. Psychiatric triage is particularly challenging because urgency must often be inferred from thoughts, behavior, and context rather than from objective findings. We evaluated the performance of 15 frontier AI chatbots on psychiatric triage from realistic single-message disclosures using 112 clinical vignettes, each paired with 1 of 4 original benchmark triage labels: A, routine; B, assessment within 1 week; C, assessment within 24 to 48 hours; and D, emergency care now. Vi

q-bio.NC cs.AI cs.HC
#54
Multimodal 2026-04-28 arXiv 5.2 5.0/6.1/4.0

Accurate brain lesion segmentation in MRI is vital for effective clinical diagnosis and treatment planning. Due to high annotation costs and strict data privacy regulations, universal models require employing Continual Learning (CL) to adapt to evolving clinical tasks without losing previously acquired knowledge. However, existing CL paradigms often suffer from capacity limits or redundant parameter growth, and even advanced dynamic methods rely mostly on image-perception strategies that struggle to handle the substantial pathological and multimodal heterogeneity inherent in brain imaging. To

cs.CV cs.AI
#55
Frontier LLMs 2026-04-28 arXiv 5.2 5.0/6.1/4.0

The release of GPT-image-2 by OpenAI marks a watershed moment in AI-generated imagery: the boundary between photographic reality and synthetic content has never been more difficult to discern. We introduce the GPT-Image-2 Twitter Dataset, the first published dataset of GPT-image-2 generated images, sourced from publicly available Twitter/X posts in the immediate aftermath of the model's April 21, 2026 release. Leveraging the Twitter API v2 and a multi-stage curation pipeline spanning multilingual text heuristics (English, Japanese, and Chinese), browser-automated Twitter "Made with AI" badge v

cs.CV cs.AI
#56
Reinforcement Learning 2026-04-28 arXiv 5.2 5.0/6.0/4.0

Over the past few decades, machine learning has been widely used to learn complex tasks. Reinforcement Learning (RL), inspired by human behavior, is a great example, as it involves developing specific behaviours for specific tasks. To further challenge algorithms, Multi-Task RL (MTRL) environments have been introduced, requiring a single model to learn multiple behaviors. The Tangled Program Graph (TPG) algorithm is a Genetic Programming (GP) algorithm designed for discrete MTRL environments. Recently, the MAPLE algorithm has been proposed, as another GP algorithm that achieves high results in

cs.AI
#57
Infrastructure 2026-04-28 arXiv 5.2 5.5/5.5/4.0

We report a striking statistical regularity in frontier LLM outputs that enables a CPU-only scoring primitive running at 2.6 microseconds per token, with estimated latency up to 100,000$\times$ (five orders of magnitude) below existing sampling-based detectors. Across six contemporary models from five independent vendors, two generation sizes, and five held-out domains, token rank-frequency distributions converge to the same two-parameter Mandelbrot ranking distribution, with 34 of 36 model-by-domain fits exceeding $R^{2} = 0.94$ and 35 of 36 favoring Mandelbrot over Zipf by AIC. The shared fa

cs.CR cs.CL
#58
Audio & Speech 2026-04-28 arXiv 5.2 5.4/5.6/4.0

Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non-native on features that are phonemic in the target language. For Indic languages, these features include retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (letter zha). We present PSP, the Phoneme Substitution Profile, an interpretable, per-phonological-dimension accent benchmark for Indic TTS. PSP decomposes accent into six complementary dimensions: retroflex co

cs.SD cs.CL
#59
Post-Training 2026-04-28 arXiv 5.2 5.0/6.0/4.0

Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment methods either incur prohibitive $O(N)$ costs by fine-tuning each model individually or rely on static prompts that fail to resolve query-level structural complexity. In this paper, we propose ReQueR (\textbf{Re}inforcement \textbf{Que}ry \textbf{R}efinement), a modular framework that treats reasoning elicitation as an inference-time alignment task. We train a specia

cs.CL
#60
Multimodal 2026-04-28 arXiv 5.2 5.0/6.1/4.0

Multimodal Large Language Models (MLLMs) have shown transformative potential in medical applications, yet their performance is hindered by conventional data curation strategies that rely on coarse-grained partitioning by modality or department. Such fragmented approaches fail to capture the hierarchical and interconnected nature of clinical medical knowledge, limiting the models' ability to perform fine-grained recognition and complex reasoning. In this paper, we propose a novel Entity-Centric Medical Data Engineering framework. We automatically extract entities from authoritative medical lite

cs.CL
#61
Evaluations & Benchmarks 2026-04-28 arXiv 5.2 5.0/5.9/4.0

Spiking Transformers have shown strong potential for long-range visual modeling through spike-driven self-attention. However, their quadratic token interactions remain fundamentally misaligned with the sparse and event-driven nature of spiking neural computation. To address this limitation, we propose Vision SmolMamba, an energy-efficient spiking state-space architecture that integrates spike-driven dynamics with linear-time selective recurrence. The key idea is a Spike-Guided Spatio-Temporal Token Pruner (SST-TP), which estimates token importance using both spike activation strength and first

cs.CV
#62
Evaluations & Benchmarks 2026-04-28 arXiv 5.2 5.4/5.5/4.0

Shadows are a prevalent problem in remote sensing imagery (RSI), degrading visual quality and severely limiting the performance of downstream tasks like object detection and semantic segmentation. Most prior works treat shadow detection and removal as separate, cascaded tasks, which can lead to cumbersome process and error accumulation. Furthermore, many deep learning methods rely on paired shadow and non-shadow images for training, which are often unavailable in practice. To address these challenges, we propose Shadow-Aware and Removal Unified (SARU) Framework , a cohesive two-stage framework

cs.CV
#63
Infrastructure 2026-04-28 arXiv 5.2 5.4/5.6/4.0

In the last few decades, Markov chain Monte Carlo (MCMC) methods have been widely applied to Bayesian updating of structural dynamic models in the field of structural health monitoring. Recently, several MCMC algorithms have been developed that incorporate neural networks to enhance their performance for specific Bayesian model updating problems. However, a common challenge with these approaches lies in the fact that the embedded neural networks often necessitate retraining when faced with new tasks, a process that is time-consuming and significantly undermines the competitiveness of these met

stat.AP cs.LG stat.ME stat.ML
#64
Robotics 2026-04-28 arXiv 5.2 5.0/6.1/4.0

Collision-free motion is often aided by tactile and proximity sensors distributed on the body of the robot due to their resistance to occlusion as opposed to external cameras. However, how to shape the sensor's properties, such as sensing coverage; type; and range, to enable avoidant behavior remains unclear. In this work, we present a reinforcement learning framework for whole-body collision avoidance on a humanoid H1-2 robot and use it to characterize how sensor properties shape learned avoidance behavior. Using dodgeball as a benchmark task, we ablate the properties of sensors distributed a

cs.RO cs.LG
#65
Safety, Policy & Regulation 2026-04-28 arXiv 5.2 5.0/6.0/4.0

Recent advances in open-vocabulary mobile manipulation have brought robots into real domestic environments. In such settings, reliable long-horizon execution under open-set object references and frequent disturbances becomes essential. However, many failures persist. These are not caused by semantic misunderstanding but by inconsistencies between symbolic plans and the evolving physical world, manifested as three recurring limitations: (i) existing systems often rely on pre-scanned semantic maps that become inconsistent after scene changes and disturbances; (ii) they select navigation endpoint

cs.RO
#66
Agents & Tool Use 2026-04-27 arXiv · Hugging Face Daily Papers 5.2 5.4/5.0/4.6

Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly c

cs.CV
#67
Industry 2026-04-28 AI Alignment Forum 5.2 5.5/5.5/4.0

We’d like to use powerful AIs to answer questions that may take a long time to resolve. But if a model only cares about performing well in ways that are verifiable shortly after answering (e.g., a myopic fitness seeker ), it may be difficult to get useful work from it on questions that resolve much later. In this post, I’ll describe a proposal for eliciting good long-horizon forecasts from these models. Instead of asking a model to directly predict a far-future outcome, we can recursively: Ask it to predict what it will predict at the next time step, Use its prediction at the next time step to provide intermediate rewards, Finally reward using ground truth at the last step. This lets us replace a single distant forecast with a chain of short-horizon forecasts, each verifiable shortly after answering. I call this proposal recursive forecasting . It does have limitations: for example, it requires that developers maintain control over the reward signal at least until the final step, which

safety_policy research
#69
AI Coding 2026-04-28 Simon Willison's Weblog 5.2 5.4/5.6/4.0

Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query. OpenAI Codex base_instructions , for GPT-5.5 Tags: openai , ai , llms , system-prompts , prompt-engineering , codex-cli , generative-ai , gpt

frontier_llm industry agents ai_coding
#72
Research 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Current pedestrian crossing signals operate on fixed timing without adjustment to pedestrian behavior, which can leave vulnerable road users (VRUs) such as the elderly, disabled, or distracted pedestrians stranded when the light changes. We introduce No Pedestrian Left Behind (NPLB), a real-time adaptive traffic signal system that monitors VRUs in crosswalks and automatically extends signal timing when needed. We evaluated five state-of-the-art object detection models on the BGVP dataset, with YOLOv12 achieving the highest mean Average Precision at 50% (mAP@0.5) of 0.756. NPLB integrates our f

cs.CV cs.AI cs.RO eess.SY
#73
Post-Training 2026-04-28 arXiv 5.1 5.0/5.8/4.0

Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat incorrect rewards as strictly harmful. In this work, however, we highlight that not all deviations from the ground truth are equal. By theoretically analyzing which outputs attract probability during policy gradient optimization, we categorize reward errors according to their effect on the increase in ground truth rewar

cs.LG cs.AI stat.ML
#74
Evaluations & Benchmarks 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Patient simulators are gaining traction in mental health training by providing scalable exposure to complex and sensitive patient interactions. Simulating depressed patients is particularly challenging, as safety constraints and high patient variability complicate simulations and underscore the need for simulators that capture diverse and realistic patient behaviors. However, existing evaluations heavily rely on LLM-judges with poorly specified prompts and do not assess behavioral diversity. We introduce PSI-Bench, an automatic evaluation framework that provides interpretable, clinically groun

cs.CL cs.AI
#75
Efficiency 2026-04-28 arXiv 5.1 5.0/5.6/4.0

In the MNIST auxiliary logit distillation experiment, a student can acquire an unintended teacher trait despite distilling only on no-class logits through a phenomenon called subliminal learning. Under a single-step gradient descent assumption, subliminal learning theory attributes this effect to alignment between the trait and distillation gradients, but does not guarantee that this alignment persists in a multi-step setting. We empirically show that gradient alignment remains weakly but consistently positive throughout training and causally contributes to trait acquisition. We show that a mi

cs.LG cs.AI
#76
Research 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Open, unclassified research on secure autonomy is constrained by limited access to operational platforms, contested communications infrastructure, and representative adversarial test conditions. This paper presents a threat-oriented digital twinning methodology for cybersecurity evaluation of learning-enabled autonomous platforms. The approach is instantiated as an open-source, modular twin of a representative autonomy stack with separated sensing, autonomy, and supervisory-control functions; confidence-gated multi-modal perception; explicit command and telemetry trust boundaries; and runtime

cs.CR cs.AI cs.RO eess.SY
#77
Evaluations & Benchmarks 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Safety mechanisms for large language models (LLMs) remain predominantly English-centric, creating systematic vulnerabilities in multilingual deployment. Prior work shows that translating malicious prompts into other languages can substantially increase jailbreak success rates, exposing a structural cross-lingual security gap. We investigate whether such attacks can be mitigated through language-agnostic semantic similarity without retraining or language-specific adaptation. Our approach compares multilingual query embeddings against a fixed English codebook of jailbreak prompts, operating as a

cs.CL cs.AI
#78
Safety, Policy & Regulation 2026-04-28 arXiv 5.1 5.0/5.6/4.0

The rapid deployment of autonomous AI agents across enterprise, healthcare, and safety-critical environments has created a fundamental governance gap. Existing approaches, runtime guardrails, training-time alignment, and post-hoc auditing treat governance as an external constraint rather than an internalized behavioral principle, leaving agents vulnerable to unsafe and irreversible actions. We address this gap by drawing on how humans self-govern naturally: before acting, humans engage deliberate cognitive processes grounded in executive function, inhibitory control, and internalized organizat

cs.AI
#79
Evaluations & Benchmarks 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of millions of clinical notes, presents formidable engineering, cost, and governance challenges that have prevented adoption. Methods: We deployed a semantic search system at a large children's hospital indexing 166 million clinical notes (484 million vectors) from 1.68 million patients. The system uses instruction-tuned qwen3

cs.IR cs.AI cs.DB
#80
Multimodal 2026-04-28 arXiv 5.1 5.0/5.6/4.0

We introduce DualFact, a dual-layer, multimodal factuality evaluation framework for procedural video captioning. DualFact separates factual correctness into conceptual facts, capturing abstract semantic roles (e.g., Action, Ingredient, Tool, Location), and contextual facts, capturing their grounded predicate-argument realizations in video. To support complete and role-consistent evaluation, DualFact incorporates implicit argument augmentation (VIA) and contrastive fact sets. We instantiate DualFact in two modes: DualFact-T, which verifies facts against textual evidence, and DualFact-V, which v

cs.AI
#81
Multimodal 2026-04-28 arXiv 5.1 5.0/5.7/4.0

Web agents have emerged as an effective paradigm for automating interactions with complex web environments, yet remain vulnerable to prompt injection attacks that embed malicious instructions into webpage content to induce unintended actions. This threat is further amplified for screenshot-based web agents, which operate on rendered visual webpages rather than structured textual representations, making predominant text-centric defenses ineffective. Although multimodal detection methods have been explored, they often rely on large vision-language models (VLMs), incurring significant computation

cs.CR cs.AI
#82
Research 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Cognitive science often evaluates theories through narrow paradigms and local model comparisons, limiting the integration of evidence across tasks and realizations. We introduce an automated adversarial collaboration framework for adjudicating among competing theories even when the candidate models and experiments must be discovered during the adjudication process. The system combines LLM-based theory agents, program synthesis, and information-theoretic experimental design in a closed loop. In a simulation study spanning three classic categorization theories, the framework recovered the ground

cs.AI
#83
Infrastructure 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Phishing detection systems are predominantly rely on statistical machine learning models, which often lack contextual reasoning and are vulnerable to adversarial manipulation. In this work, we propose a hybrid framework that integrates machine learning classifiers with non-monotonic reasoning using Answer Set Programming (ASP) to enable context-aware decision refinement. The proposed post-hoc reasoning layer incorporates expert knowledge to revise classifier predictions through formal belief revisions. Experimental results indicate that the reasoning module modifies 5.08\% of classifier output

cs.AI
#84
Post-Training 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Federated fine-tuning provides a practical route to adapt large language models (LLMs) on edge devices without centralizing private data, yet in mobile deployments the training wall-clock is often bottlenecked by straggler-limited uplink communication under heterogeneous bandwidth and intermittent participation. Although parameter-efficient fine-tuning (PEFT) reduces trainable parameters, per-round payloads remain prohibitive in non-IID regimes, where uniform compression can discard rare but task-critical signals. We propose Fed-FSTQ, a Fisher-guided token quantization system primitive for com

cs.LG cs.AI
#85
Reinforcement Learning 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Ensuring safety during reinforcement learning (RL) training is critical in real-world applications where unsafe exploration can lead to devastating outcomes. While most safe RL methods mitigate risk through constraints or penalization, they still allow exploration of unsafe states during training. In this work, we adopt a stricter safety requirement that eliminates unsafe state visitation during training. To achieve this goal, we propose a Q-learning-based safe RL framework that leverages a behavior policy supported on a safe set. Under the assumption that the induced trajectories remain withi

cs.LG cs.AI
#86
Interpretability 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Large language models (LLMs) are increasingly used in emotionally sensitive human-AI applications, yet little is known about how emotion recognition is internally represented. In this work, we investigate the internal mechanisms of emotion recognition in LLMs using sparse autoencoders (SAEs). By analyzing sparse feature activations across layers, we identify a consistent three-phase information flow, in which emotion-related features emerge only in the final phase. We further show that emotion representations comprise both shared features across emotions and emotion-specific features. Using ph

cs.CL
#87
Agents & Tool Use 2026-04-28 arXiv 5.1 5.0/5.8/4.0

Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: a heterogeneous action space, sparse and noisy evaluation signal, multi-million-token trajectories, and edits whose effect is hard to attribute to the next round's outcomes. We introduce Agentic Harness Engineering (AHE), a framework that automates harness-level evolution by instrumenting the three stages of any engineering loop (component editing, trajectory inspection, and decision making) with m

cs.CL cs.SE
#88
Safety, Policy & Regulation 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Critical analyses of emotion recognition technology have raised ethical concerns around task validity and potential downstream impacts, urging researchers to ensure alignment between their stated motivations and practice. However, these discussions have not adequately influenced or drawn from research on speech emotion recognition (SER). We address this gap by conducting a systematic survey of SER research to uncover what stated motivations drive this work and if they align with the datasets and emotions studied. We find that while SER research identifies appealing goals, such as well-situated

cs.CL
#89
Multimodal 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Despite strong performance of deep learning models in retinal disease detection, most systems produce static predictions without clinical reasoning or interactive explanation. Recent advances in multimodal large language models (MLLMs) integrate diagnostic predictions with clinically meaningful dialogue to support clinical decision-making and patient counseling. In this study, OcularChat, an MLLM, was fine-tuned from Qwen2.5-VL using simulated patient-physician dialogues to diagnose age-related macular degeneration (AMD) through visual question answering on color fundus photographs (CFPs). A t

cs.CV cs.CL
#90
Safety, Policy & Regulation 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Although the cultural (mis)alignment of Large Language Models (LLMs) has attracted increasing attention -- often framed in terms of cultural bias -- until recently there has been limited work on the design and development of datasets for cultural assessment. Here, we review existing approaches to such datasets and identify their main limitations. To address these issues, we propose design guidelines for annotators and report on the construction of a dataset built according to these principles. We further present a series of contrastive experiments conducted with this dataset. The results demon

cs.CL
#91
Safety, Policy & Regulation 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Navigating AI regulation across jurisdictions is increasingly difficult for policymakers, legal professionals, and researchers. To address this, we present a multi-jurisdictional Retrieval-Augmented Generation system for global AI regulation. Our corpus includes 242 documents across 68 jurisdictions, ranging from formal legislation like the EU AI Act to unstructured policy documents such as national AI strategies. The system makes three technical contributions: type-specific chunking that preserve legal structure across heterogenous documents; conditional retrieval routing with entity detectio

cs.CL
#92
Efficiency 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Knowledge distillation (KD) is a well-known technique to effectively compress a large network (teacher) to a smaller network (student) with little sacrifice in performance. However, most KD methods require a large training set and internal access to the teacher, which are rarely available due to various restrictions. These challenges have originated a more practical setting known as black-box few-shot KD, where the student is trained with few images and a black-box teacher. Recent approaches typically generate additional synthetic images but lack an active strategy to promote their diversity,

cs.CV cs.LG
#93
Government & Defense 2026-04-28 arXiv 5.1 5.0/5.7/4.0

SAR image classification naturally has to deal with huge noise and a high dynamic range particularly requiring robust classification models. Additionally, the deployment of these models on edge devices, such as drones and military aircraft, requires a careful balance between model size and classification accuracy. This study explores the potential of tensor networks to meet these robustness requirements, specifically evaluating their resilience to data poisoning. Unlike previous works that concentrated on conventional neural networks for SAR object detection, this research focuses on the robus

quant-ph cs.CV physics.comp-ph
#94
Agents & Tool Use 2026-04-28 arXiv 5.1 5.4/5.4/4.0

Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments remains largely underexplored. Existing agents typically rely on a single screenshot after each action for decision-making, leading to a partially observable (or even unobservable) Markov decision process, where the key GUI state including important information for actions is often inadequately captured. To systematically explore this challenge, we introduce Dynami

cs.CV
#95
Evaluations & Benchmarks 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Gradient-based saliency methods are widely used to interpret deep neural networks, yet they often produce noisy and unstable explanations that poorly align with semantically meaningful input features. We argue that a fundamental cause of this behavior lies in the geometry of learned representations: correlated feature dimensions diffuse attribution gradients across redundant directions, resulting in blurred and unreliable saliency maps. To address this issue, we identify feature correlation as a structural limitation of gradient-based interpretability and propose SaliencyDecor, a training fram

cs.CV
#96
Post-Training 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Identity teacher forcing (ITF) enables stable training of deterministic recurrent surrogates for chaotic dynamical systems and has been highly effective for dynamical systems reconstruction (DSR) with recurrent neural networks (RNNs), including interpretable almost-linear RNNs (AL-RNNs). However, as an intervention-based prediction loss (and thus a generalized Bayes update), teacher forcing need not match the free-running model's marginal likelihood geometry. We compare the objective-induced curvatures of ITF and marginal likelihood in a probabilistic switching augmentation of AL-RNNs, estimat

cs.LG math.DS stat.ML
#97
Research 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Deep learning models are used in critical applications, in which mistakes can have serious consequences. Therefore, it is crucial to understand how and why models generate predictions. This understanding provides useful information to check whether the model is learning the right patterns, detect biases in the data, improve model design, and build systems that can be trusted. This work proposes a new method for interpreting Convolutional Neural Networks in image classification tasks. The approach works by selecting the most important feature maps that contribute to each prediction. To solve th

cs.LG
#98
Reinforcement Learning 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Safety remains an open problem in reinforcement learning (RL), especially during training. While safety filters are promising to address safe exploration, they are generally poorly suited for high-dimensional systems with unknown dynamics. We propose Dyna-style Safety Augmented Reinforcement Learning (Dyna-SAuR), a novel algorithm that learns both a scalable safety filter and a control policy using a learned uncertainty-aware dynamics model, while requiring minimal domain knowledge. The filter avoids failures and high uncertainty regions. Thus, better models expand the set of safe and certain

cs.LG
#99
Evaluations & Benchmarks 2026-04-28 arXiv 5.1 5.0/5.6/4.0

Tree ensembles are widely used in industrial machine learning due to their strong predictive performance and efficient training procedures. However, as the number of trees in an ensemble grows, the resulting models become increasingly difficult for humans to interpret. To address this limitation, explainable artificial intelligence (XAI) studies methods that generate interpretable models capable of explaining complex predictors. One approach consists of extracting decision rules from tree ensembles while attempting to preserve the predictive performance of the original model. In previous work,

cs.LG
#100
Robotic Autonomy 2026-04-28 arXiv 5.1 5.4/5.4/4.0

Robotic systems that interact with the physical world must reason about kinematic and dynamic constraints imposed by their own embodiment, their environment, and the task at hand. We introduce KinDER, a benchmark for Kinematic and Dynamic Embodied Reasoning that targets physical reasoning challenges arising in robot learning and planning. KinDER comprises 25 procedurally generated environments, a Gymnasium-compatible Python library with parameterized skills and demonstrations, and a standardized evaluation suite with 13 implemented baselines spanning task and motion planning, imitation learnin

cs.RO
#101
Research 2026-04-28 arXiv 5.1 5.0/5.6/4.0

End-to-end autonomous driving planners typically generate trajectories from current observations alone. However, real-world driving is highly dynamic, and such reactive planning cannot anticipate future scene evolution, often leading to myopic decisions and safety-critical failures. We propose ProDrive, a world-model-based proactive planning framework that enables ego-environment co-evolution for autonomous driving. ProDrive jointly trains a query-centric trajectory planner and a bird's-eye-view (BEV) world model end-to-end: the planner generates diverse candidate trajectories and planning-awa

cs.RO
#102
Agents & Tool Use 2026-04-27 arXiv · Hugging Face Daily Papers 5.1 5.0/5.0/4.6

Graphical User Interface (GUI) element grounding (precisely locating elements on screenshots based on natural language instructions) is fundamental for agents interacting with GUIs. Deploying this capability directly on resource-constrained devices like mobile phones is increasingly critical for GUI agents requiring low latency. However, this goal faces a significant challenge, as current visual grounding methods typically employ large vision-language model (VLM) (more than 2.5B parameters), making them impractical for on-device execution due to memory and computational constraints. To address

cs.CV
#107
Multimodal 2026-04-28 arXiv 5.1 5.0/5.8/4.0

Large Vision-Language Models (LVLMs) have achieved remarkable progress in visual-textual understanding, yet their reliability is critically undermined by hallucinations, i.e., the generation of factually incorrect or inconsistent responses. While recent studies using steering vectors demonstrated promise in reducing hallucinations, a notable challenge remains: they inadvertently amplify the severity of residual hallucinations. We attribute this to their exclusive focus on the decoding stage, where errors accumulate autoregressively and progressively worsen subsequent hallucinatory outputs. To

cs.CV cs.AI
#108
Post-Training 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ that interpolates between RLVR (at $q{=}0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification $P_{θ^{-q}}$ that reweights each instance independently

cs.LG cs.AI
#109
Reinforcement Learning 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Continual offline reinforcement learning (CORL) aims to learn a sequence of tasks from datasets collected over time while preserving performance on previously learned tasks. This setting corresponds to domains where new tasks arise over time, but adapting the model in live environment interactions is expensive, risky, or impossible. However, CORL inherits the dual difficulty of offline reinforcement learning and adapting while preventing catastrophic forgetting. Replay-based continual learning approaches remain a strong baseline but incur memory overhead and suffer from a distribution mismatch

cs.LG cs.AI
#110
Evaluations & Benchmarks 2026-04-28 arXiv 5.0 5.4/5.0/4.0

Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly generate tests from NL requirements to validate functional behaviour, making traditional metrics weak proxies for whether generated tests validate intended behaviour. To address this gap, we present RESTestBench, a benchmark comprising three REST services paired with manually verified NL requirements in both precise and vague variants, enabling controlled and reproducible evaluation of requirement-based test generation. RESTestBench furthe

cs.SE cs.AI
#111
Infrastructure 2026-04-28 arXiv 5.0 5.0/5.5/4.0

Transformers have demonstrated a strong ability for in-context learning (ICL), enabling models to solve previously unseen tasks using only example input output pairs provided at inference time. While prior theoretical work has established conditions under which transformers can perform linear classification in-context, the empirical scaling behavior governing when this mechanism succeeds remains insufficiently characterized. In this paper, we conduct a systematic empirical study of in-context learning for Gaussian-mixture binary classification tasks. Building on the theoretical framework of Fr

cs.LG cs.AI
#112
Agents & Tool Use 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Optimization modeling underpins real-world decision-making in logistics, manufacturing, energy, and public services, but reliably solving such problems from natural-language requirements remains challenging for current large language models (LLMs). In this paper, we propose \emph{Agora-Opt}, a modular agentic framework for optimization modeling that combines decentralized debate with a read-write memory bank. Agora-Opt allows multiple agent teams to independently produce end-to-end solutions and reconcile them through an outcome-grounded debate protocol, while memory stores solver-verified art

math.OC cs.AI cs.LG
#113
Infrastructure 2026-04-28 arXiv 5.0 5.0/5.5/4.0

With the rapid development of the Internet, users have increasingly higher expectations for the recommendation accuracy of online content consumption platforms. However, short videos often contain diverse segments, and users may not hold the same attitude toward all of them. Traditional binary-classification recommendation models, which treat a video as a single holistic entity, face limitations in accurately capturing such nuanced preferences. Considering that user consumption is a temporal process, this paper demonstrates that the timing of user actions can represent diverse intentions throu

cs.AI cs.IR
#114
AI Coding 2026-04-28 arXiv 5.0 5.0/5.5/4.0

Source Code Plagiarism Detection (SCPD) plays an important role in maintaining fairness and academic integrity in software engineering education. Code Evaluation Metrics (CEMs) are developed for assessing code generation tasks. However, it remains unclear whether such metrics can reliably detect plagiarism across different levels of modification (L1-L6), increasing in complexity. In this paper, we perform a comparative empirical study using two open-source labelled datasets, ConPlag (raw and template-free versions) and IRPlag. We evaluate five CEMs, namely CodeBLEU, CrystalBLEU, RUBY, Tree Str

cs.SE cs.AI cs.IR
#115
Research 2026-04-28 arXiv 5.0 5.0/5.5/4.0

The quality of training data is critical to the performance of machine learning models. In this paper, the Error Sensitivity Profile (ESP) is proposed. It quantifies the sensitivity of model performance to errors in a single feature or in multiple features. By leveraging ESP, data-cleaning efforts can be prioritized based on error types and features most likely to affect model performance. To support the computation of this metric, an integrated suite of tools, called \dirty, is created. We conduct an extensive experimental study on two widely used datasets using 14 classification models, reve

cs.LG cs.AI
#116
Reinforcement Learning 2026-04-28 arXiv 5.0 5.0/5.4/4.0

With the rapid advancement of artificial intelligence (AI) and intelligent science, intelligent edge computing has been widely adopted. However, the limitations of traditional methods, such as poor adaptability and the slow convergence of heuristic algorithms, are becoming increasingly evident. To enable sustainable and resource-efficient edge applications, this paper proposes an online task offloading framework for wireless powered mobile edge computing (MEC) networks, called Quantum Attention-based Reinforcement learning for Online Offloading (QAROO). The system employs a binary offloading s

cs.AI
#117
Agents & Tool Use 2026-04-28 arXiv 5.0 5.0/5.5/4.0

Modern enterprise AI applications increasingly rely on compound AI systems - architectures that compose multiple models, retrievers, and tools to accomplish complex tasks. Deploying such systems in production demands inference infrastructure that can efficiently serve concurrent, heterogeneous model invocations while maintaining cost-effectiveness and low latency. This paper presents a production deployment study of a modular, platform-agnostic inference architecture developed at Salesforce to support compound AI use cases including Agentforce (autonomous AI agents) and ApexGuru (AI-powered co

cs.AI
#118
Multimodal 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Most multi-modal knowledge graph completion (MMKGC) models use one embedding scorer to do both retrieval over the full entity set and final decision making. We argue that this coupling is a core bottleneck: global high-recall search and local fine-grained disambiguation require different inductive biases. Therefore, we propose a Retrieval-Augmented Discrete Diffusion (RADD) framework to decouple retrieve and reranking for MMKGC. A relation-aware multimodal KGE retriever serves as both global retriever and distillation teacher, while a conditional discrete denoiser performs shortlist-level enti

cs.AI
#119
Reinforcement Learning 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Deep Reinforcement Learning (DRL) algorithms often require a large amount of data and struggle in sparse-reward domains with long planning horizons and multiple sub-goals. In this paper, we propose a neuro-symbolic extension of Proximal Policy Optimization (PPO) that transfers partial logical policy specifications learned in easier instances to guide learning in more challenging settings. We introduce two integrations of symbolic guidance: (i) H-PPO-Product, which biases the action distribution at sampling time, and (ii) H-PPO-SymLoss, which augments the PPO loss with a symbolic regularization

cs.AI
#120
Efficiency 2026-04-28 arXiv 5.0 5.0/5.5/4.0

Recent knowledge distillation (KD) methods for semantic segmentation introduce increasingly complex hand-crafted objectives, yet are typically evaluated under fixed iteration schedules. These objectives substantially increase per-iteration cost, meaning equal iteration counts do not correspond to equal training budgets. It is therefore unclear whether reported gains reflect stronger distillation signals or simply greater compute. We show that iteration-based comparisons are misleading: when wall-clock compute is matched, \textit{canonical} logit- and feature-based KD outperform recent segmenta

cs.CV cs.AI
#121
Reinforcement Learning 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Offline zero-shot reinforcement learning (RL) aims to learn agents that optimize unseen reward functions without additional environment interaction. The standard approach to this problem trains task-conditioned policies by sampling task vectors that define linear reward functions over learned state representations. In most existing algorithms, these task vectors are randomly sampled, implicitly assuming this adequately captures the structure of the task space. We argue that doing so leads to suboptimal zero-shot generalization. To address this limitation, we propose extracting task vectors dir

cs.AI
#122
Generative Media 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Recent image editing models have achieved strong visual fidelity but often struggle with tasks requiring complex reasoning. To investigate and enhance the reasoning-grounded planning for image editing, we propose DDA-Thinker, a Thinker-centric framework designed for the independent optimization of a planning module (Thinker) over a fixed generative model (Editor). This decoupled Thinker-centric paradigm facilitates a controlled analysis of the planning module and makes its contribution under a fixed Editor easier to assess. To effectively guide this Thinker, we introduce a dual-atomic reinforc

cs.CV cs.AI
#123
Post-Training 2026-04-28 arXiv 5.0 5.4/5.0/4.0

The need to evaluate instructional materials for K-12 science education has become increasingly important, as more educators use generative AI to create instructional materials. However, the review of instructional materials is time-consuming, expertise-intensive, and difficult to scale, motivating interest in automated evaluation approaches. While large language models (LLMs) have shown strong performance on general evaluation tasks, their performance and reliability on instructional materials remain unclear. To address this gap, we formulate Automatic Instructional Materials Evaluation (AIME

cs.AI
#124
AI Coding 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that cand

cs.AI
#125
Evaluations & Benchmarks 2026-04-28 arXiv 5.0 5.0/5.5/4.0

Current research on distributed multi-modal learning typically assumes that clients can access complete information across all modalities, which may not hold in practice. In this paper, we explore patchwork learning, in which the modalities available to different clients vary, and the objective is to impute the missing modalities for each client in an unsupervised manner. Existing methods are shown not to fully utilize the modality information as they tend to rely on only a subset of the observed modalities. To address this issue, we propose GraphPL, which combines graph neural networks with p

cs.LG cs.AI
#126
Agents & Tool Use 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Agentic AI systems are increasingly being integrated into scientific workflows, yet their behavior under realistic conditions remains insufficiently understood. We evaluate CMBAgent across two workflow paradigms and eighteen astrophysical tasks. In the One-Shot setting, access to domain-specific context yields an approximately ~6x performance improvement (0.85 vs. ~0 without context), with the primary failure mode being silent incorrect computation - syntactically valid code that produces plausible but inaccurate results. In the Deep Research setting, the system frequently exhibits silent fail

cs.AI astro-ph.IM
#127
Agents & Tool Use 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Modern Text-to-SQL systems generate multiple candidate SQL queries and rank them to judge a final prediction. However, existing methods face two limitations. First, they often score functionally equivalent SQL queries inconsistently despite identical execution results. Second, ranking cannot recover when the correct SQL is absent from the candidate pool. We propose R$^3$-SQL, a Text-to-SQL framework that addresses both issues through unified reward for ranking and resampling. R$^3$-SQL first groups candidates by execution result and ranks groups for consistency. To score each group, it combine

cs.SE cs.AI cs.CL
#128
Agents & Tool Use 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Cutscenes are carefully choreographed cinematic sequences embedded in video games and interactive media, serving as the primary vehicle for narrative delivery, character development, and emotional engagement. Producing cutscenes is inherently complex: it demands seamless coordination across screenwriting, cinematography, character animation, voice acting, and technical direction, often requiring days to weeks of collaborative effort from multidisciplinary teams to produce minutes of polished content. In this work, we present Cutscene Agent, an LLM agent framework for automated end-to-end cutsc

cs.GR cs.AI cs.CL
#129
Generative Media 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Diffusion models have achieved success in high-fidelity data synthesis, yet their capacity for more complex, structured reasoning like text following tasks remains constrained. While advances in language models have leveraged strategies such as latent reasoning and recursion to enhance text understanding capabilities, extending these to multimodal text-to-image generation tasks is challenging due to the continuous and non-discrete nature of visual tokens. To tackle this problem, we draw inspiration from modular human cognition and propose a recursive, sparse mixture-of-experts framework integr

cs.CV cs.AI
#130
Infrastructure 2026-04-28 arXiv 5.0 5.0/5.5/4.0

How much does a user's skill with AI shape what AI actually delivers for them? This question is critical for users, AI product builders, and society at large, but it remains underexplored. Using a richly annotated sample of 27K transcripts from WildChat-4.8M, we show that fluent users take on more complex tasks than novices and adopt a fundamentally different interactional mode: they iterate collaboratively with the AI, refining goals and critically assessing outputs, whereas novices take a passive stance. These differences lead to a paradox of AI fluency: fluent users experience more failures

cs.CL
#131
Research 2026-04-28 arXiv 5.0 5.0/5.5/4.0

Probabilistic Transformer (PT), a white-box probabilistic model for contextual word representation, has demonstrated substantial similarity to standard Transformers in both computational structure and downstream task performance on small models and small to medium sized datasets. However, PT is less robust to hyperparameter choices than standard Transformers, making it harder to scale efficiently. In this work, we follow Maximal Update Parametrization (muP) to rescale PT's parameters, so that hyperparameters optimized on small models can be transferred to larger models without additional tunin

cs.CL
#132
Post-Training 2026-04-28 arXiv 5.0 5.4/5.0/4.0

This paper benchmarks a classical machine learning approach based on PyCaret AutoML against a deep learning approach based on IndoBERT fine-tuning for binary sentiment analysis of Indonesian-language Twitter comments related to Ibu Kota Nusantara (IKN). The dataset contains 1,472 manually labeled samples, consisting of 780 negative and 692 positive comments. In the machine learning setting, Logistic Regression, Naive Bayes, and Support Vector Machine were evaluated using 10-fold cross-validation, with Logistic Regression achieving the best performance among the classical models at 77.57% accur

cs.CL
#133
Infrastructure 2026-04-28 arXiv 5.0 5.0/5.5/4.0

Magnification shift is a major obstacle to robust histopathology classification, because models trained on one imaging scale often generalize poorly to another. Here, we evaluated this problem on the BreaKHis dataset using a strict patient-disjoint leave-one-magnification-out protocol, comparing supervised baseline, baseline augmented with DCGAN-generated patches, and a gradient-reversal domain-general model designed to preserve discriminative information while suppressing magnification-specific variation. Across held-out magnifications, the domain-general model achieved the strongest overall

cs.CV stat.ML
#134
Efficiency 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Knowledge distillation (KD) represents a vital mechanism to transfer expertise from complex teacher networks to efficient student models. However, in decentralized or secure AI ecosystems, privacy regulations and proprietary interests often restrict access to the teacher's interface and original datasets. These constraints define a challenging black-box data-free KD scenario where only top-1 predictions and no training data are available. While recent approaches utilize synthetic data, they still face limitations in data diversity and distillation signals. We propose Diverse Image Priors Knowl

cs.LG cs.CV
#135
Efficiency 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Binary spike coding enables sparse and event-driven computation in spiking neural networks (SNNs), yet its 1-bit-per-timestep representation fundamentally limits information throughput. This bottleneck becomes increasingly restrictive in deep architectures under short simulation horizons. We propose the Quantized Burst-LIF (QB-LIF) neuron, which reformulates burst spiking as a saturated uniform quantization of membrane potentials with a learnable scale. Instead of relying on predefined multi-threshold structures, QB-LIF treats the quantization scale as a trainable parameter, allowing each laye

cs.CV
#136
Research 2026-04-28 arXiv 5.0 5.0/5.5/4.0

Foundation segmentation models such as the Segment Anything Model (SAM) have demonstrated strong generalization across natural images; however, their robustness under clinically realistic medical imaging domain shifts remains insufficiently quantified. We present a systematic slice-level robustness audit of SAM (ViT-B) for spleen segmentation in abdominal CT using 1,051 nonempty slices from 41 volumes in the Medical Segmentation Decathlon. A standardized ground-truth-derived bounding-box protocol was used to isolate encoder robustness from prompt uncertainty. Controlled perturbations simulatin

eess.IV cs.CV
#137
Research 2026-04-28 arXiv 5.0 5.0/5.5/4.0

Camera-based 3D object detection and tracking are central to autonomous driving, yet precise 3D object localization remains fundamentally constrained by depth ambiguity when no expensive, depth-rich online LiDAR is available at inference. In many deployments, however, vehicles repeatedly traverse the same environments, making static point cloud maps from prior traversals a practical source of geometric priors. We propose DualViewMapDet, a camera-only inference framework that retrieves such map priors online and leverages them to mitigate the absence of a LiDAR sensor during deployment. The key

cs.CV cs.RO
#138
Generative Media 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Video generation models have developed rapidly in recent years, where generating natural human motion plays a pivotal role. However, accurately evaluating the quality of generated human motion video remains a significant challenge. Existing evaluation metrics primarily focus on global scene statistics, often overlooking fine-grained human details and consequently failing to align with human subjective preference. To bridge this gap, we propose HuM-Eval, a novel human-centric evaluation framework that adopts a coarse-to-fine strategy. Specifically, our framework first utilizes a Vision Language

cs.CV
#139
Multimodal 2026-04-28 arXiv 5.0 5.0/5.5/4.0

A computational method for quantitative analysis of temporomandibular joint (TMJ) configuration using occlusal positioning splints is proposed and demonstrated. The method models a positioning splint as a physical realization of a predefined rigid transformation of the mandible, derived from multimodal data, including CBCT, facial motion acquisition, and dental scans integrated within a common coordinate system. Splints corresponding to selected mandibular positions are designed and fabricated, and their positioning accuracy is evaluated using repeated scans of plaster models. Discrepancies ar

cs.CV
#140
Generative Media 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Compositional text-to-image (T2I) generation requires a model to honour multiple sub-prompts that describe distinct image regions. Recent work shows that the \emph{starting noise} of a diffusion model carries significant semantic information: ``golden'' noise predicted from text can substantially raise prompt fidelity. We observe that this noise prediction is, however, fundamentally global: the same network is asked to summarise a long, multi-region prompt with a single text embedding, which becomes the bottleneck whenever the prompt describes scenes with spatially-separated entities. We intro

cs.CV
#141
Evaluations & Benchmarks 2026-04-28 arXiv 5.0 5.0/5.5/4.0

Graph neural networks such as ParticleNet and transformer based networks on point clouds such as ParticleTransformer achieve state-of-the-art performance on jet tagging benchmarks at the Large Hadron Collider, yet the physical reasoning behind their predictions remains opaque. We present different methods, i.e. perturbation-based (GNNExplainer), Shapley-value-based (GNNShap), and gradient-based (GRADCam); adapted to operate on LundNet's Lund-plane graph representation. Leveraging the fact that each node in the Lund plane corresponds to a physically meaningful parton splitting, we construct Mon

hep-ph cs.LG hep-ex
#142
Evaluations & Benchmarks 2026-04-28 arXiv 5.0 5.4/5.0/4.0

Software quality assurance remains a major challenge in industrial environments, where large-scale and long-lived systems inevitably accumulate defects. Identifying the location of a fault is often time-consuming and costly, particularly during maintenance phases when developers must rely primarily on textual bug reports rather than complete runtime or code-level context. In this study, we investigated if artificial intelligence can support fault localization using only the natural-language content of bug reports. By relying only on textual information, our approach requires no access to sourc

cs.SE cs.LG
#143
Multimodal 2026-04-28 arXiv 5.0 5.0/5.5/4.0

Coherent transition radiation (CTR) spectroscopy is a critical diagnostic for characterizing the longitudinal structure of relativistic electron bunches in laser-plasma and conventional accelerators. In practice, recovering the bunch profile from a measured CTR spectrum is an ill-posed phase-retrieval problem. Traditionally, this is addressed using Gerchberg-Saxton (GS)-type iterative algorithms. However, these implementations often rely on explicit inverse propagators, making them difficult to adapt to sophisticated experimental forward models. In this work, we introduce a flexible gradient-b

physics.acc-ph cs.LG
#144
Reinforcement Learning 2026-04-28 arXiv 5.0 5.0/5.4/4.0

Model-Based Reinforcement Learning distinguishes between physical dynamics models operating on proprioceptive inputs and latent dynamics models operating on high-dimensional image observations. A prominent latent approach is the Recurrent State Space Model used in the Dreamer family. While epistemic uncertainty quantification to inform exploration and mitigate model exploitation is well established for physical dynamics models, its transfer to latent dynamics models has received limited scrutiny. We empirically demonstrate that latent transitions are biased toward well-represented regions of l

cs.LG
#145
Evaluations & Benchmarks 2026-04-28 arXiv 5.0 5.4/5.0/4.0

Stopping criteria automatically determine when to stop an evolutionary algorithm, so as not to waste function evaluations on a stagnant population. Although stopping criteria play an important role in real-world applications, they have attracted little attention in the evolutionary multi-objective optimization (EMO) community. In fact, new stopping criteria for EMO have been rarely developed in recent years. One reason for the stagnation in developing stopping criteria for EMO is a lack of effective benchmarking methodologies. To address this issue, this paper proposes (i) a performance measur

cs.NE
#146
Efficiency 2026-04-28 arXiv 5.0 5.0/5.4/4.0

World action models jointly predict future video and action during training, raising an open question about what role the future-prediction branch actually plays. A recent finding shows that this branch can be removed at inference with little to no loss on common manipulation benchmarks, suggesting that future information may act merely as a regularizer on the shared visual backbone. We propose instead that joint training induces an action-conditioned correction that privileged future observations impose on action denoising, and that current-only policies capture this correction only partially

cs.RO
#147
Research 2026-04-28 arXiv 5.0 5.0/5.5/4.0

Tendon-Driven Continuum Robots (TDCRs) pose significant control challenges due to their highly nonlinear, path-dependent dynamics and non-Markovian characteristics. Traditional Jacobian-based controllers often struggle with hysteresis-induced oscillations, while conventional learning-based approaches suffer from poor generalization to out-of-distribution trajectories. This paper proposes a reference-augmented offline learning framework for precise 6-DOF tracking control of TDCRs. By leveraging a differentiable RNN-based dynamics surrogate as a gradient bridge, we optimize a control policy thro

cs.RO
#148
Research 2026-04-28 arXiv 5.0 5.0/5.5/4.0

Robot-assisted Transcranial Magnetic Stimulation (Robo-TMS) is an image-guided robotic intervention that enhances the accuracy and reproducibility of conventional Transcranial Magnetic Stimulation (TMS), a widely used non-invasive brain stimulation procedure in clinical treatment and neuroscience research. Despite its potential, the development of Robo-TMS remains challenging due to the need for multidisciplinary expertise spanning medical imaging, computer vision, and robotics. This paper presents SlicerRoboTMS, an open-source 3D Slicer extension that provides a unified interaction infrastruc

cs.RO cs.HC
#149
Robotic Autonomy 2026-04-28 arXiv 5.0 5.5/5.0/4.0

Embodied AI research is undergoing a shift toward vision-centric perceptual paradigms. While massively parallel simulators have catalyzed breakthroughs in proprioception-based locomotion, their potential remains largely untapped for vision-informed tasks due to the prohibitive computational overhead of large-scale photorealistic rendering. Furthermore, the creation of simulation-ready 3D assets heavily relies on labor-intensive manual modeling, while the significant sim-to-real physical gap hinders the transfer of contact-rich manipulation policies. To address these bottlenecks, we propose GS-

cs.RO
#150
Research 2026-04-28 arXiv 5.0 5.0/5.5/4.0

Label noise presents a fundamental challenge in modern machine learning, especially when large-scale datasets are generated via automated processes. An increasingly common and important data paradigm, particularly in domains like medical imaging, involves learning from a large dataset with coarse, noisy labels supplemented by a small, expert-verified, clean dataset. This setting constitutes a typical information transfer and fusion problem. However, the significant distribution shift between the noisy and clean data violates the core overall parametric similarity assumptions of existing statis

stat.ME math.ST stat.ML
#153
Government & Defense 2026-04-28 DefenseScoop 5.0 5.0/5.7/4.0

Part of the Navy’s ongoing modernization push includes improving talent management. The post Navy looking to expand AI-enabled pilot for talent management appeared first on DefenseScoop .

gov_defense industry
#154
Government & Defense 2026-04-28 DefenseScoop 5.0 5.0/5.7/4.0

While details about what the lanes will look like and which counter-UAS systems Marines will employ are scant, the announcement of upcoming training comes as the service continues to build its drone repertoire, including counter-systems, and officials signal concern about how to defeat them on the battlefield. The post Marine division to get first-of-its-kind counter-drone training as officials signal ‘significant concern’ over defeating UAS appeared first on DefenseScoop .

gov_defense industry
#155
Government & Defense 2026-04-28 DefenseScoop 5.0 5.0/5.7/4.0

The Space Data Network has multiple components and will support a number of Pentagon-wide efforts, including the Golden Dome for America missile defense architecture. The post Space Force plans to invest billions in sprawling Space Data Network in FY27 appeared first on DefenseScoop .

gov_defense industry
#156
Industry 2026-04-28 Gradient Flow 5.0 5.5/5.5/4.0

China is not out of the frontier AI race. Its open-weight models (models whose parameters are publicly released, allowing anyone to run or adapt them) remain genuinely competitive, and the overall capability lead has changed hands more than once since early 2025. But the more consequential story for anyone building on top of these models Continue reading "China&#8217;s AI Strengths Are Real. So Are the Structural Drags Behind Them." The post China&#8217;s AI Strengths Are Real. So Are the Structural Drags Behind Them. appeared first on Gradient Flow .

industry research
#157
Industry 2026-04-28 MIT Technology Review — AI 5.0 5.0/5.6/4.0

This is today&#8217;s edition of The Download, our weekday newsletter that provides a daily dose of what&#8217;s going on in the world of technology. Elon Musk and Sam Altman are going to court over OpenAI’s future Elon Musk and OpenAI CEO Sam Altman head to trial this week in a case with sweeping consequences. Ahead&#8230;

industry safety_policy
#158
Industry 2026-04-29 TechCrunch — AI 5.0 5.4/5.6/4.0

It's a story Musk has told before -- in interviews and to author Walter Isaacson for his bestselling biography of Musk -- but Tuesday was the first time he said it under oath.

industry
#159
Government & Defense 2026-04-28 War on the Rocks 5.0 5.0/5.7/4.0

During the 1979 Sino-Vietnamese War, a 26-year-old company commander&#8217;s unit was pinned down by a fortified hilltop. After frontal assaults failed, the junior officer made an extraordinary request: an entire battalion, four times the size of his own unit, for a jungle flanking maneuver. The regimental commander agreed. The surprise assault broke the Vietnamese defense. This company commander&#8217;s pedigree was as formidable as his tactics: His father was a founding general who had just retired as head of the Chinese military&#8217;s General Logistics Department.Five years later, that same officer commanded the regiment tasked with the main assault at the Battle The post The Mountaintop Mirage: Why Xi’s Military Purges Cannot Produce the Force He Wants appeared first on War on the Rocks .

gov_defense safety_policy
#160
Government & Defense 2026-04-28 War on the Rocks 5.0 5.0/5.7/4.0

In the early days of the full-scale invasion of Ukraine in February 2022, much that could go wrong did for the Russian military. As one volunteer organization called KatyaValya recalled:We called all our military friends in (Russian-held) Donetsk, but no one could really explain or say anything. Three or four days later, Katya&#8217;s husband (who served with Donetsk militia) disappeared from communications. We searched for him every day through the commandant&#8217;s office to make sure everything was alright. Then he finally got in touch: &#8220;We need combat boots, sleeping bags, cigarettes, raincoats, and most importantly, Baofeng radios.&#8221;While some elements of The post The Strange Rise and Fall of Russia’s Crowd Sourced Defense Industry appeared first on War on the Rocks .

gov_defense safety_policy
#161
Industry 2026-04-28 Two Minute Papers (YouTube) 5.0 5.0/5.6/4.0

❤️ Check out Lambda here and sign up for their GPU Cloud: https://lambda.ai/papers 📝 The paper is available here: https://research.nvidia.com/labs/sil/projects/MOTIVE/ Our Patreon if you wish to support us: https://www.patreon.com/TwoMinutePapers 🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible: Adam Bridges, Benji Rabhan, B Shang, Cameron Navor, Charles Ian Norman Venn, Christian Ahlin, Eric T, Fred R, Gordon Child, Juan Benet, Michael Tedder, Owen Skarpness, Richard Sundvall, Ryan Stankye, Shawn Becker, Steef, Taras Bobrovytsky, Tazaur Sagenclaw, Tybie Fitzhugh, Ueli Gallizzi My research: https://cg.tuwien.ac.at/~zsolnai/ Thumbnail design: https://felicia.hu

research
#162
Multimodal 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world out-of-distribution (OOD) scenarios. Precisely, selective prediction aims to improve coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. To enable reliable general

cs.CV cs.AI
#163
Post-Training 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Traditional loss functions, including cross-entropy, contrastive, triplet, and su pervised contrastive losses, used for fine-tuning pre-trained language models such as BERT, operate only within local neighborhoods and fail to account for the global semantic structure. We present G-Loss, a graph-guided loss function that incorporates semi-supervised label propagation to use structural relationships within the embedding manifold. G-Loss builds a document-similarity graph that captures global semantic relationships, thereby guiding the model to learn more discriminative and robust embeddings. We

cs.CL cs.AI cs.LG
#164
Agents & Tool Use 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Long-horizon LLM tasks often fail not because a single answer is unattainable, but because knowledge states drift across rounds, intermediate commitments remain implicit, and interruption fractures the evolving evidence chain. This paper presents ADEMA as a knowledge-state orchestration architecture for long-horizon knowledge synthesis rather than as a generic multi-agent runtime. The architecture combines explicit epistemic bookkeeping, heterogeneous dual-evaluator governance, adaptive task-mode switching, reputation-shaped resource allocation, checkpoint-resumable persistence, segment-level

cs.AI
#165
Agents & Tool Use 2026-04-28 arXiv 4.9 5.0/5.0/4.0

We study city-scale control of electric-vehicle (EV) ride-hailing fleets where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions -- discrete actions for serving, repositioning, and charging, together with continuous charging power -- and variable action durations. To guarantee physical feasibility during both training and deployment, the policy learns over high-level intentions produced by a masked,

cs.AI
#166
Agents & Tool Use 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Security analysts are overwhelmed by the volume of alerts and the low context provided by many detection systems. Early-stage investigations typically require manual correlation across multiple log sources, a task that is usually time-consuming. In this paper, we present an experimental, agentic workflow that leverages large language models (LLMs) augmented with predefined queries and constrained tool access (structured SQL over Suricata logs and grep-based text search) to automate the first stages of alert investigation. The proposed workflow integrates queries to provide an overview of the a

cs.CR cs.AI
#167
Efficiency 2026-04-28 arXiv 4.9 5.0/5.0/4.0

The convergence of accelerating human spaceflight ambitions and critical terrestrial health monitoring demands is driving unprecedented requirements for reliable, real-time feature extraction on extremely resource-constrained wearable health sensors. We present an ultra-low-power (ULP) Field-Programmable Gate Array (FPGA) based solution for real-time Seismocardiography (SCG) feature classification using Convolutional Neural Networks (CNNs). Our approach combines quantization-aware training with a systolic-array accelerator to enable efficient integer-only inference on the Lattice iCE40UP5K FPG

cs.AR cs.AI
#168
Frontier LLMs 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Accurate nutrient estimation from unstructured recipe text is an important yet challenging problem in dietary monitoring, due to ambiguous ingredient terminology and highly variable quantity expressions. We systematically evaluate models spanning a wide range of representational capacity, from lexical matching methods (TF-IDF with Ridge Regression), to deep semantic encoders (DeBERTa-v3), to generative reasoning with large language models (LLMs). Under the strict tolerance criteria defined by EU Regulation 1169/2011, our empirical results reveal a clear trade-off between predictive accuracy an

cs.CL cs.AI
#169
Agents & Tool Use 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Instructed code editing is a significant challenge for large language models (LLMs). On the EditBench benchmark, 39 of 40 evaluated models obtain a task success rate (TSR) below 60 percent, highlighting a gap between general code generation and the ability to perform instruction-driven editing under executable test constraints. To address this, we propose SAFEdit, a multi-agent framework for instructed code editing that decomposes the editing process into specialized roles to improve reliability and reduce unintended code changes. A Planner Agent produces an explicit, visibility-aware edit pla

cs.SE cs.AI
#170
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

These lecture notes provide an introduction to the verification of neural networks from a theoretical perspective. We discuss feed-forward neural networks, recurrent neural networks, attention mechanisms, and transformers, together with specification languages and algorithmic verification techniques.

cs.LO cs.AI cs.FL
#171
Multimodal 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Source code and its accompanying comments are complementary yet naturally aligned modalities-code encodes structural logic while comments capture developer intent. However, existing vulnerability detection methods mostly rely on single-modality code representations, overlooking the complementary semantic information embedded in comments and thus limiting their generalization across complex code structures and logical relationships. To address this, we propose MultiVul, a multimodal contrastive framework that aligns code and comment representations through dual similarity learning and consisten

cs.SE cs.AI
#172
Frontier LLMs 2026-04-28 arXiv 4.9 5.0/5.0/4.0

This paper investigates how GPT-based tools can assist in building reusable analytical spreadsheet models. After a screening, we evaluate five GPT extensions and select Excel AI by pulsrai.com for detailed testing. Through structured experiments on simple problem statements, we assess Excel AI's performance against the ERFR criteria (each input in a cell; cell formulas; no hardwired numbers; labels; accurate). Results show that while Excel AI can produce well-structured models, it is inconsistent and often non-reproducible. We identify two central challenges - "the problem of confidence" and "

cs.SE cs.AI
#173
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

This paper is under review in AI and Ethics This study examines whether large language models (LLMs) can reliably answer scientific questions and demonstrates how easily they can be influenced by fringe scientific material. The authors modified custom LLMs to prioritise knowledge in selected fringe papers on the Fine Structure Constant and Gravitational Waves, then compared their responses with those of domain experts and standard LLMs. The altered models produced fluent, convincing answers that contradicted scientific consensus and were difficult for non-experts to detect as misleading. The r

cs.CY cs.AI
#174
Multimodal 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Understanding learners' cognitive and affective states underpins adaptive educational systems and effective teaching. Although research links nonverbal cues to internal states, no framework calibrates them to evidence. We present the Nonverbal Syntax Framework, drawn from a systematic review of 908 studies and 17,043 cue-state mappings (Turaev et al., 2026). The framework addresses three challenges: terminological fragmentation (behaviors described inconsistently), evidence heterogeneity (single observations to replicated findings), and state ambiguity (similar patterns indicating multiple sta

cs.AI
#175
Agents & Tool Use 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Deploying production-ready multi-agent systems (MAS) in complex industrial environments remains challenging due to limitations in scalability, observability, and autonomous evolution. We present OxyGent, an open-source framework that enables modular, observable, and evolvable MAS via a unified Oxy abstraction, in which agents, tools, LLMs, and reasoning flows are encapsulated as pluggable atomic components. This Lego-like assembly paradigm supports scalable system composition and non-intrusive monitoring. To enhance observability, OxyGent introduces permission-driven dynamic planning that repl

cs.AI
#176
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

In remote and hybrid work contexts, the integration of physical and digital environments is revolutionizing spatial experiences, collaboration, and interpersonal interactions. This study examines three fundamental spatial conditions: the physical environment, characterized by material and sensory attributes; the virtual environment, influenced by immersive technologies; and their fusion into hybrid environments where digital and physical components interact dynamically. The increasing number of AI tools in contemporary society, extensively utilized in both professional and personal spheres, ha

cs.HC cs.AI
#177
Evaluations & Benchmarks 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still frequently produce hallucinated or overly confident outputs. While uncertainty estimation has been extensively studied in text-only LLMs, it remains largely unexplored for ALLMs, where audio-conditioned generation introduces additional challenges such as perceptual ambiguity and cross-modal grounding. In this work, we present the first systematic empirical study of uncertainty estimation in ALLMs. We benchmark five representative methods,

eess.AS cs.AI cs.CL cs.LG cs.SD
#178
Post-Training 2026-04-28 arXiv 4.9 5.0/5.0/4.0

We present Marco-MoE, a suite of fully open multilingual sparse Mixture-of-Experts (MoE) models. Marco-MoE features a highly sparse design in which only around 5\% of the total parameters are activated per input token. This extreme sparsity, combined with upcycling from dense models, enables efficient pre-training on 5T tokens. Our models surpass similarly-sized competitors on English and multilingual benchmarks, achieving a best-in-class performance-to-compute ratio. We further post-train these models to create Marco-MoE-\textsc{Instruct} variants, which surpass the performance of competing m

cs.CL cs.AI
#179
Agents & Tool Use 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Enterprise software engineering is shifting away from deterministic CRUD/REST architectures toward AI-native systems where large language models act as cognitive orchestrators. This transition introduces a critical security tension: probabilistic LLMs weaken classical mechanisms for validation, access control, and formal testing. This paper proposes the design, formal validation, and empirical evaluation of a Semantic Gateway governed by the Model Context Protocol (MCP). The gateway reframes the enterprise API as a semantic surface where tools are dynamically discovered, authorized, and execut

cs.CR cs.AI
#180
Infrastructure 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Recurrent Graph Neural Networks (RGNNs) extend standard GNNs by iterating message-passing until some stopping condition is met. Various RGNN models have been proposed in the literature. In this paper, we study three such models: converging RGNNs, where all vertex representations must stabilise; output-converging RGNNs, where only the output classifications must stabilise; and halting RGNNs, where a per-vertex halting classifier determines when to stop. We establish expressiveness relationships between these models: over undirected graphs, converging RGNNs are equally expressive as graded-bisim

cs.LG cs.AI cs.LO
#181
Evaluations & Benchmarks 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Designing the architecture of modern networked systems requires navigating a large, combinatorial space of hardware, systems, and configuration choices with complex cross-layer interactions. Architects must balance competing objectives such as performance, cost, and deployability while satisfying compatibility and resource constraints, often relying on scattered rules-of-thumb drawn from benchmarks, papers, documentation, and expert experience. This raises a natural question: can large language models (LLMs) reliably perform this kind of architectural reasoning? We find that they cannot. While

cs.NI cs.AI
#182
Evaluations & Benchmarks 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Current watermark removal methods are evaluated on two axes: attack success rate and perceptual quality. We show this is insufficient. While state-of-the-art attacks successfully degrade the watermark signal without visible distortion, they leave distinct statistical artifacts that betray the removal attempt. We name this overlooked axis Watermark Removal Detection (WRD) and demonstrate that a modern classifier trained on these artifacts achieves state-of-the-art detection rates at $10^{-3}$ FPR across every removal method tested. No existing attack accounts for this forensic leakage. We bench

cs.CV cs.AI
#183
Infrastructure 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Large Language Models (LLMs) have shown strong potential for narrative generation, but their use in complex, multi-layered role-playing game (RPG) worlds is still limited by issues of coherence, controllability, and structural consistency. This paper explores a dependency-aware, multi-stage prompt pipeline for procedural RPG content generation that models narrative dependencies through structured intermediate representations. The approach decomposes generation into sequential stages: world building, non-player character creation, player character creation, campaign-level quest planning, and qu

cs.CL cs.AI
#184
Frontier LLMs 2026-04-28 arXiv 4.9 5.0/5.0/4.0

We investigate linguistic biases in LLM-based restaurant and product recommendations given prompts varying across Southern American English (AE), Indian English (IE), and Code-Switched Hindi-English dialects, using the Yelp Open dataset (Yelp Inc., 2023) and Walmart product reviews dataset (PromptCloud,2020). We add lists of restaurant and product names balanced by cuisine type and product category to the prompts given to the LLM, and we zero-shot prompt the LLMs in a cold-start setting to select the top-20 restaurant and product recommendations from these lists for each of the dialect-varied

cs.CL cs.AI
#185
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Web accessibility rests on static standards and developer compliance. That model frays in platforms where content is user-generated: photos arrive blurry or off-frame, descriptions skip size and condition, and page structure shifts from listing to listing. Drawing on six studies conducted between 2022 and 2025 with blind, low-vision, and older adult users of customer-to-customer (C2C) marketplaces, I argue that generative UI can produce adapted interfaces at the point of use, addressing barriers that static design cannot anticipate. Three interventions from this program -- HTML regeneration fo

cs.HC cs.AI cs.CY
#186
Evaluations & Benchmarks 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Source-free test-time adaptation (TTA) is appealing for mobile and wearable sensing because it enables on-device personalization from unlabeled test streams without centralizing private data. However, sensor-based human activity recognition (HAR) poses challenges that are less pronounced in standard vision benchmarks: behavioral inertial streams are temporally correlated and often exhibit within-session shifts caused by sensor rotation, placement change, and sampling-rate drift. Under this streaming non-i.i.d. setting, widely used vision-style TTA objectives can become unstable, leading to ove

cs.AI
#187
Robotic Autonomy 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Do large language models (LLMs) truly acquire embodied cognition and cultural conventions from text? We introduce demonstratives, fundamental spatial expressions like "this/that" in English and "zhè/nà" in Chinese, as a novel probe for grounded knowledge. Using 6,400 responses from 320 native speakers, we establish a human baseline: English speakers reliably distinguish proximal-distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate distal ambiguity. In contrast, five state-of-the-art LLMs fail to inherently understand the proxim

cs.CL cs.AI
#188
Infrastructure 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Despite AI tools becoming increasingly embedded in academic practice, little is known about how university students integrate them into their writing processes. We examine how students engage with AI across different writing tasks, and how this engagement is shaped by individual factors including AI literacy, writing confidence, trust, authorship concerns, and motivation. Study~1 surveys 107 UK university students to map task-specific and co-occurring patterns of AI use across five writing stages (ideation, sourcing, planning, drafting, and reviewing) and their associations with individual fac

cs.HC cs.AI
#189
Multimodal 2026-04-28 arXiv 4.9 5.0/5.0/4.0

To establish empathy with machines, it is essential to fully understand human emotional changes. However, research in multimodal emotion recognition often overlooks one problem: individual expressive traits vary significantly, which means that different people may express emotions differently. In our daily lives, we can see this. When communicating with different people, some express "happiness" through their facial expressions and words, while others may hide their happiness or express it through their actions. Both are expressions of 'happiness,' but such differences in emotional expression

cs.SD cs.AI eess.AS
#190
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

\textbf{Background:} Dutch medical corpora are scarce, limiting NLP development. \\ \textbf{Methods:} We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. \\ \textbf{Results:} The resulting corpus comprises $\pm$ 35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face. \\ \textbf{Conclusion:} This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.

cs.CL cs.AI
#191
Infrastructure 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Artificial intelligence systems are increasingly integrated into writing processes, challenging traditional notions of authorship, responsibility, and intellectual contribution. Current disclosure practices usually indicate whether AI was used, but rarely explain how it was used, where it intervened, or how its output was reviewed. This paper proposes a faceted model for representing AI-assisted text production at the levels of documents, chapters, sections, and paragraphs. The proposal introduces a core model based on Form, Generation, and Evaluation, and an extended model that adds Intent, C

cs.CY cs.AI
#192
Efficiency 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Speculative decoding enhances the inference efficiency of large language models (LLMs) by generating drafts using a small draft language model (DLM) and verifying them in batches with a large target language model (TLM). However, adaptive drafting inference on a mobile single-NPU-PIM system faces idle overhead in traditional operator-level synchronous execution and wasted computation in asynchronous execution due to fluctuations in draft length. This paper introduces AHASD, a task-level asynchronous mobile NPU-PIM heterogeneous architecture for speculative decoding. Notably, AHASD achieves par

cs.AR cs.AI
#193
Evaluations & Benchmarks 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Retrieval-Augmented Generation (RAG) models frequently produce answers grounded in parametric memory rather than the retrieved context, undermining the core promise of retrieval augmentation. A fundamental obstacle to fixing this unfaithfulness is the lack of training data that explicitly requires models to prefer context over internal knowledge. We introduce Faithfulness-QA, a large-scale dataset of 99,094 samples constructed through counterfactual entity substitution. Starting from two established extractive QA benchmarks--SQuAD and TriviaQA--we automatically identify answer-bearing named en

cs.CL cs.AI
#194
Efficiency 2026-04-28 arXiv 4.9 5.0/5.0/4.0

FlashAttention improves efficiency through tiling, but its online softmax still relies on floating-point arithmetic for numerical stability, making full quantization difficult. We identify three main obstacles to integer-only FlashAttention: (1) scale explosion during tile-wise accumulation, (2) inefficient shift-based exponential operations on GPUs, and (3) quantization granularity constraints requiring uniform scales for integer comparison. To address these challenges, we propose \textit{QFlash}, an end-to-end integer FlashAttention design that performs softmax entirely in the integer domain

cs.LG cs.AI
#195
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

In recent years, the rapid proliferation of open-source large language models (LLMs) has spurred efforts to turn general-purpose models into domain specialists. However, many domain-specialized LLMs are developed using datasets and training protocols that are not aligned with the nuanced requirements of real-world applications. In the legal domain, where precision and reliability are essential, this lack of consideration limits practical utility. In this study, we propose a systematic training framework grounded in the practical needs of the legal domain, with a focus on Korean law. We introdu

cs.CL cs.AI
#196
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Chain-of-Thought (CoT) has been shown to empirically improve Transformers' performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than those seen during training is understudied. We use recent theoretical frameworks for Transformer length generalization and find that -- under standard positional encodings and a finite alphabet -- Transformers with CoT cannot solve problems beyond $TC^0$, i.e. the expressivity benefits do not hold under the stricter requirement of length-generalizable learnabi

cs.LG cs.CL
#197
Post-Training 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Subliminal learning describes a student language model inheriting a behavioral bias by fine-tuning on seemingly innocuous data generated by a biased teacher model. Prior work has begun to characterize this phenomenon but leaves open questions about the scope of signals it can transfer, the mechanisms that explain it, and the precision with which a bias can be encoded by seemingly unrelated data. We tackle all three problems by introducing subliminal steering, a variant of subliminal learning in which the teacher's bias is implemented not via a system prompt, as in prior work, but through a ste

cs.CL
#198
Reinforcement Learning 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Modeling the emergence of human-like lexicons in computational systems has advanced through the use of interacting neural agents, which simulate both learning and communicative pressures. The NeLLCom-Lex framework (Zhang et al., 2025) allows neural agents to develop pragmatic color naming behavior and human-like lexicons through supervised learning (SL) from human data and reinforcement learning (RL) in referential games. Despite these successes, the lexicons that emerge diverge systematically from human color categories, producing highly non-convex regions in color space, which contrast with

cs.CL
#199
Audio & Speech 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Real-time automatic speech recognition (ASR) systems face a fundamental trade-off between transcription accuracy and computational efficiency, particularly when deploying large-scale transformer models like Whisper. Existing streaming approaches either sacrifice accuracy through aggressive chunking or incur prohibitive memory costs through unbounded context accumulation. We present WhisperPipe, a novel streaming architecture that achieves bounded memory consumption while maintaining transcription quality through three key innovations a hybrid Voice Activity Detection (VAD) pipeline combining S

cs.CL cs.SD
#200
Evaluations & Benchmarks 2026-04-28 arXiv 4.9 5.0/5.0/4.0

The closure of Perspective API at the end of 2026 discards what has functioned as the de facto standard for automated toxicity measurement in NLP, CSS, and LLM evaluation research. We document the structural dependence that the communities built on this single proprietary tool and discuss how this dependence caused epistemic problems that have affected - and will likely continue to affect - collective research efforts. Perspective's model was periodically updated without versioning or disclosure, its annotation structure reflected a single corporate operationalisation of a contested concept, a

cs.CL
#201
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Large Language Models (LLMs) are increasingly used not only for instrumental tasks, but as always-available and non-judgmental confidants for emotional support. Yet what drives adoption and how users perceive emotional support interactions across countries remains unknown. To address this gap, we present the first large-scale cross-cultural study of LLM use for emotional support, surveying 4,641 participants across seven countries (USA, UK, Germany, France, Spain, Italy, and The Netherlands). Our results show that adoption rates vary dramatically across countries (from 20% to 59%). Using mixed

cs.CL cs.HC
#202
Audio & Speech 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministicall

cs.SD cs.CL eess.AS
#203
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

This paper presents a methodology for transforming raw Wikimedia dumps into quality textual corpora for seven South Slavic languages. The work is divided into two major phases. The first involves extracting and cleaning text from raw dumps of Wikipedia, Wikisource, Wikibooks, Wikinews, and Wikiquote, where available. This step requires careful handling of raw wiki markup to isolate, first of all, textual articles, and then usable natural language text within them. The second phase addresses the challenge of suspicious or low-quality articles, which are often generated from databases or structu

cs.CL
#204
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Current deepfake detection models achieve state-of-the-art performance on pristine academic datasets but suffer severe spatial attention drift under real-world compound degradations, such as blurring and severe lossy compression. To address this vulnerability, we propose a foundation-driven forensic framework that integrates an extreme compound degradation engine with a structurally constrained, multi-stream architecture. During training, our degradation pipeline systematically destroys high-frequency artifacts, optimizing the DINOv2-Giant backbone to extract invariant geometric and semantic p

cs.CV
#205
Multimodal 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual evidence. Prior works have shown that instruction prompting further worsens this issue by amplifying language priors, especially when the visual signal is uncertain or ambiguous. To address this challenge, we propose a decoding framework that explicitly balances linguistic informativeness and visual faithfulness during generation. Our method, Instruction-Evidence Contrastive Dual-Stream Decodin

cs.CV
#206
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Articulation modeling aims to infer movable parts and their motion parameters for a 3D object, enabling interactive animation, simulation, and shape editing. In this paper, we present Sketch2Arti, the first sketch-based articulation modeling system for CAD objects. Our key observation is that designers naturally communicate articulation intent through lightweight sketches (e.g., arrows and strokes) that indicate how parts should move, yet translating such sketches into articulated 3D models remains largely manual. Sketch2Arti bridges this gap by enabling users to specify articulation through s

cs.CV cs.GR
#207
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Unaddressed pain in neonates can lead to adverse effects, including delayed development and slower weight gain, emphasising the need for more objective and reliable pain assessment methods. Hence, automated methods using behavioural and physiological pain indicators have been developed to aid healthcare professionals in the Neonatal ICU. Traditional contact-based methods for physiological parameter estimation are unsuitable for long-term monitoring and increase the risk of spreading diseases like COVID-19. We introduce a novel approach using remote photoplethysmography (rPPG) to estimate pulse

cs.CV eess.IV
#208
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Robotic ultrasound has advanced local image-driven control, contact regulation, and view optimization, yet current systems lack the anatomical understanding needed to determine what to scan, where to begin, and how to adapt to individual patient anatomy. These gaps make systems still reliant on expert intervention to initiate scanning. Here we present SAMe, a semantic anatomy mapping engine that provides robotic ultrasound with an explicit anatomical prior layer. SAMe addresses scan initiation as a target-to-anatomy-to-action process: it grounds under-specified clinical complaints into structu

cs.CV cs.RO
#209
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

In autonomous driving, camera-radar fusion offers complementary sensing and low deployment cost. Existing methods perform fusion through input mixing, feature map mixing, or query-based feature sampling. We propose a new fusion paradigm, termed heterogeneous query interaction, and present ConFusion, a camera-radar 3D object detector. ConFusion combines image queries, radar queries, and learnable world queries distributed in 3D space to improve query initialization and object coverage. To encourage cross-type interaction among heterogeneous queries, we introduce heterogeneous query mixing (QMix

cs.CV
#210
State Space Models 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Visual state-space models (SSMs) have shown strong potential for medical image segmentation, yet their effectiveness is often limited by two practical issues: axis-biased scan ordering weakens the modeling of oblique and curved structures, and naive multi-branch fusion tends to amplify redundant responses. We present TopoMamba, a topology-aware scan-and-fuse framework for segmenting heterogeneous medical visual media. The method combines a diagonal/anti-diagonal TopoA-Scan branch with the standard Cross-Scan branch to provide complementary structural priors, and introduces ScanCache, a device-

cs.CV
#211
Multimodal 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Worldwide image geo-localization aims to infer the geographic location of an image captured anywhere on Earth, spanning street, city, regional, national, and continental scales. Existing methods rely on visual features that are sensitive to environmental variations (e.g., lighting, season, and weather) and lack effective post-processing to filter outlier candidates, limiting localization accuracy. To address these limitations, we propose DualGeo, a two-stage framework for worldwide image geo-localization. First, it establishes a geo-representational foundation by fusing image and semantic segm

cs.CV
#212
Evaluations & Benchmarks 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Recently, generalizable human Gaussian splatting from sparse-view inputs has been actively studied for the photorealistic human rendering. Most existing methods rely on explicit geometric constraints or predefined structural representations to accurately position 3D Gaussians. Although these approaches have shown the remarkable progress in this field, they still suffer from inconsistent feature representations across multi-view inputs due to complex articulations of the human body and limited overlaps between different views. To address this problem, we propose a novel method to accurately loc

cs.CV
#213
Infrastructure 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Video Capsule Endoscopy (VCE) is a promising method for improving the medical examination of the small intestine in the gastrointestinal tract. A key challenge is their limited size, resulting in a short battery lifetime which conflicts with high energy consumption for image capturing and transmission to an on-body device. Thus, we propose an image compression pipeline that substantially reduces the transmitted data while preserving diagnostic image quality. Furthermore, we exploit characteristics of the compression process to identify frames with low diagnostic value mainly caused by bubbles,

cs.CV
#214
Evaluations & Benchmarks 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Despite recent advances, single-image super-resolution (SR) remains challenging, especially in real-world scenarios with complex degradations. Diffusion-based SR methods, particularly those built on Stable Diffusion, leverage strong generative priors but commonly rely on text conditioning derived from semantic captioning. Such textual descriptions provide only high-level semantics and lack the spatially aligned visual information required for faithful restoration, leading to a representation gap between abstract semantics and spatially aligned visual details. To address this limitation, we pro

cs.CV
#215
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Low-level image processing has long been evaluated mainly from the perspective of visual fidelity. However, with the rise of deep learning and generative models, processed images may preserve perceptual quality while altering semantic content, making conventional Image Quality Assessment (IQA) insufficient for semantic-level assessment. In this paper, we formalize \textit{Semantic Similarity} as a new evaluation task for low-level image processing, aimed at measuring whether semantic content is preserved after processing. We further present a structured formulation of image semantics based on

cs.CV
#216
Multimodal 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Worldwide image geolocalization, which aims to predict the GPS coordinates of any image on Earth, remains challenging due to global visual diversity. Recent generative approaches based on Retrieval-Augmented Generation (RAG) and Large Multimodal Models (LMMs) leverage candidates retrieved from fixed databases for reasoning, but often struggle with scenes that are absent from the reference set. In this work, we propose GeoSearch, an open-world geolocation framework that integrates web-scale reverse image search into the RAG pipeline. GeoSearch augments LMM prompts with database-retrieved coordi

cs.IR cs.CV
#217
Infrastructure 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Architectural floor plans are widely available priors which contain not only geometry but also the semantic information of the environment, yet existing localization methods largely ignore this semantic information. To address this, we present COMPASS, an algorithm that exploits both geometric and semantic priors from floor plans to estimate the pose of a robot equipped with dual fisheye cameras. Inspired by scan context descriptor from LiDAR-based place recognition, we design a multi-channel radial descriptor that encodes the geometric layout surrounding a position. From the floor plan, rays

cs.CV cs.RO
#218
Generative Media 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Generating novel, biologically plausible three-dimensional morphological structures is a fundamental challenge in computational evolutionary biology, hampered by extreme data scarcity and the requirement that generated shapes respect phylogenetic relationships among species. In this work, we present PhyloSDF, a phylogenetically-conditioned neural generative model for 3D biological morphology that integrates two innovations: (1) a DeepSDF auto-decoder regularized by a novel Phylogenetic Consistency Loss that structures the latent space to correlate with evolutionary distances (Pearson $r=0.993$

q-bio.QM cs.CV
#219
Evaluations & Benchmarks 2026-04-28 arXiv 4.9 5.0/5.0/4.0

In this paper, we present Self-DACE++, an improved unsupervised and lightweight framework for Low-Light Image Enhancement (LLIE), building upon our previous Self-Reference Deep Adaptive Curve Estimation (Self-DACE). To better address the trade-off between computational efficiency and restoration quality, Self-DACE++ introduces enhanced Adaptive Adjustment Curves (AACs). These curves, governed by minimal trainable parameters, flexibly adjust the dynamic range while preserving the color fidelity, structural integrity, and naturalness of the enhanced images. To achieve an extremely lightweight ar

cs.CV
#220
Generative Media 2026-04-28 arXiv 4.9 5.0/5.0/4.0

The exponential surge in high-resolution remote sensing data faces a severe bottleneck in satellite-to-ground transmission. Limited downlink bandwidth forces the use of extreme high-ratio compression, which irreversibly destroys high-frequency structural details essential for downstream machine perception tasks like object detection. While current super-resolution techniques attempt to recover these details, regression-based methods often yield over-smoothed textures, and generative diffusion models frequently introduce structural hallucinations that mislead detection systems. To address this

cs.CV
#221
Post-Training 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Domain adaptation (DA) addresses the challenge of transferring a machine learning model trained on a source domain to a target domain with a different data distribution. In this work, we study DA for the task of Rumex obtusifolius (Rumex) image classification. We train models on a published, ground vehicle-based dataset (source) and evaluate their performance on a custom target dataset acquired by unmanned aerial vehicles (UAVs). We find that Convolutional Neural Network (CNN) models, specifically ResNets, generalize poorly to the target domain, even after fine-tuning on the source data. Apply

cs.CV
#222
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

This work addresses the critical problem of tracking fast-moving objects through strongly scattering media in a low-light environment. Different from existing approaches that use frame-based cameras with fixed exposure times, which trade off signal-to-noise ratio for temporal resolution, we introduce computational neuromorphic tracking (CNT), a physics-informed framework that combines asynchronous event sensing with task-driven speckle analysis for robust motion estimation. We formulate the neuromorphic speckle aggregation as a spatiotemporal speckle representation, jointly optimizing the temp

cs.CV eess.IV
#223
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Deploying tiny object perception on edge platforms is challenging because practical systems must satisfy both strict compute budgets and end-to-end latency constraints. A common strategy is to first select a small number of candidate patches from a high-resolution image and then apply downstream processing only to the selected regions. However, existing detector-based frontends are not well aligned with this setting: strong offline detection accuracy does not necessarily yield effective low-budget patch prioritization, nor does it guarantee usable performance once transport and inference delay

cs.CV eess.IV
#224
AI Coding 2026-04-28 arXiv 4.9 5.0/5.0/4.0

The accelerating adoption of Large Language Models (LLMs) in software engineering (SE) has brought with it a silent crisis: unsustainable computational cost. While these models demonstrate remarkable capabilities in different SE tasks, they are unmanageably large, slow to deploy, memory-intensive, and carbon-heavy. This reality threatens not only the scalability and accessibility of AI-powered SE, but also its long-term environmental sustainability. The research challenge is clear: we must go beyond accuracy and address efficiency and environmental cost as first-class design constraints. To me

cs.SE cs.LG
#225
Multimodal 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Contact variability, sensing uncertainty, and external disturbances make grasp execution stochastic. Expected-quality objectives ignore tail outcomes and often select grasps that fail under adverse contact realizations. Risk-sensitive POMDPs address this failure mode, but many use particle-filter beliefs that scale poorly, obstruct gradient-based optimization, and estimate Conditional Value-at-Risk (CVaR) with high-variance approximations. We instead formulate grasp acquisition as variational inference over latent contact parameters and object pose, representing the belief with a differentiabl

cs.RO cs.LG eess.SY
#226
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Sparse Optimal Scoring (SOS) reformulates linear discriminant analysis to enable feature selection through elastic net regularization, making it well-suited for high-dimensional settings where the number of features exceeds observations. Most existing SOS methods use deflation-based strategies that compute discriminant vectors sequentially, which can propagate errors and produce suboptimal solutions. We propose a novel approach that estimates all discriminant vectors simultaneously under an explicit global orthogonality constraint, which we call Deflation-Free Sparse Optimal Scoring (DFSOS). D

stat.ML cs.LG math.OC
#227
Evaluations & Benchmarks 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Nonlinear dynamical systems with regime transitions are typically described by ordinary differential equations with jumping parameters parameters. Traditional methods often treat change-point detection and parameter estimation as separate tasks, ignoring the inherent coupling between them. To address this, we propose residual-loss anomaly analysis of physics-informed neural networks, a unified framework that leverages dynamical consistency within the physics-informed learning paradigm. This approach jointly infers piecewise parameters and transition points under a single set of constraints. Th

stat.ML cs.LG
#228
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Code understanding models increasingly rely on pretrained language models (PLMs) and graph neural networks (GNNs), which capture complementary semantic and structural information. We conduct a controlled empirical study of PLM-GNN hybrids for code classification and vulnerability detection tasks by systematically pairing three code-specialized PLMs with three foundational GNN architectures. We compare these hybrids against PLM-only and GNN-only baselines on Java250 and Devign, including an identifier-obfuscation setting. Across both tasks, hybrids consistently outperform GNN-only baselines and

cs.SE cs.LG
#229
Infrastructure 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Studying nonlinear dynamical systems through their state space behavior can be challenging, and one possible alternative is to analyze them via their associated Koopman operator. This turns the nonlinear problem into a linear, infinite-dimensional one. To approximate the operator in finite dimensions, extended dynamic mode decomposition (EDMD) is a commonly used algorithm. It requires a finite list of functionals and a set of snapshots from the system to compute an approximation of the operator and its corresponding spectrum. Instead of choosing the list of functionals directly, it can be impl

math.DS cs.LG
#230
Efficiency 2026-04-28 arXiv 4.9 5.0/5.0/4.0

SignSGD compresses each stochastic gradient coordinate to a single bit, offering substantial memory and communication savings, but its 1-bit quantization removes magnitude information and is known to leave a generalization gap relative to well-tuned SGD. We revisit SignSGD from a 1-bit quantization and dithering perspective and contribute three improvements. First, we derive a small-batch convergence rate for SignSGD under unimodal symmetric gradient noise using a signal-to-noise weighted stationarity measure, removing the large-batch assumption of prior analyses. Second, we inject annealed Ga

cs.LG
#231
Evaluations & Benchmarks 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Time series classification is an important analytical task across diverse domains. However, its practical application is often hindered by the scarcity of labeled data and the requirement for substantial computational resources. To address these challenges, this paper proposes EvoTSC, a novel genetic programming approach designed to automatically evolve lightweight feature learning models for time series classification. The core of EvoTSC is a carefully designed multi-layer program structure that strategically embeds diverse forms of prior expert knowledge into the evolutionary process, effect

cs.LG cs.NE
#232
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

We introduce a Hopfield-type associative memory in which effective connectivity is multiplicatively modulated by astrocytic gains evolving under an entropy-regularized replicator equation. The coupled neuron-astrocyte dynamics admit a Lyapunov function, ensuring global convergence. At fixed points, astrocytic gains implement a softmax-normalized allocation over pattern similarity scores, yielding a mechanistic realization of self-attention as emergent routing on the gain simplex. In regimes of high memory load and interference, the model significantly improves retrieval accuracy relative to cl

physics.data-an cs.LG nlin.AO physics.soc-ph
#233
Infrastructure 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Federated learning increasingly operates in a large-model regime where communication, memory, and computation are all scarce. Typically, non-IID client data induce drift that degrades the stability and performance of local training. Existing remedies such as SCAFFOLD introduce heterogeneity-correction mechanisms to address this challenge, but they incur substantial extra communication and memory overhead. This paper proposes a subspace optimization method for federated learning (SSF), which performs heterogeneity-corrected optimization in a low-dimensional subspace using only projected quantit

cs.LG math.OC
#234
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

While it is generally understood that zeroth-order (ZO) algorithms have an extra dependency on their number of iterations for any choice of parameters, compared to their first-order (FO) counterparts, in this work, we show that under several conditions, in expectation, ZO methods do not suffer from extra dimension dependencies in their convergence rates with respect to their FO counterparts. We look at optimisation algorithms from the dynamical systems perspective and analyse the conditions under which one can formulate the average of a ZO algorithm as the average of its FO counterpart with bo

math.OC cs.LG eess.SY math.NA
#235
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Continuous causal discovery typically couples representation learning with structural optimization via non-convex acyclicity penalties, which subjects solvers to local optima and restricts scalability in high-dimensional regimes. We propose a decoupled paradigm that shifts the causal discovery bottleneck from non-convex optimization to statistical score estimation. We introduce the Score-Schur Topological Sort (SSTS), an algorithm that extracts topological order directly from unconstrained generative models, bypassing constrained structure optimization. We establish that the causal hierarchy l

cs.LG
#236
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

This paper presents a sensitivity-based tube Nonlinear Model Predictive Control (NMPC) framework for cooperative aerial chains under bounded parametric uncertainty. We consider a planar two-vehicle chain connected by rigid links, modeled with input-rate actuation to enforce slew-rate and magnitude limits on thrust and torque. Robustness to uncertainty in link mass, length, and inertia is achieved by propagating first-order parametric state sensitivities along the horizon and using them to compute online constraint-tightening margins. We robustify an inter-link separation constraint, implemente

cs.RO
#237
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Tendon-Driven Continuum Robots (TDCRs) pose significant modeling and control challenges due to complex nonlinearities, such as frictional hysteresis and transmission compliance. This paper proposes a differentiable learning framework that integrates high-fidelity dynamics modeling with robust neural control. We develop a GRU-based dynamics model featuring bidirectional multi-channel connectivity and residual prediction to effectively suppress compounding errors during long-horizon auto-regressive prediction. By treating this model as a gradient bridge, an end-to-end neural control policy is op

cs.RO
#238
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Reliable estimation of neuromuscular activation is a key enabler for adaptive and personalized control in wearable robotics. However, surface electromyography (EMG) remains difficult to deploy robustly outside laboratory settings due to electrode sensitivity, signal non-stationarity, and strong subject dependence. In this work, we propose an adaptive IMU-to-EMG learning framework that reconstructs continuous muscle activation envelopes from wearable inertial measurements across heterogeneous movement conditions. The approach combines a Transformer encoder with Gaussian Error Gated Linear Units

cs.RO
#239
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

3D-printed artificial skins are a scalable approach to whole-body tactile and proximity coverage, but prior implementations have been limited to unimodal sensing and rigid materials. To improve the practical usability of 3D-printed artificial skins, we present a hybrid time-of-flight (ToF) and self-capacitance (SC) sensing skin that demonstrates multi-modal sensing integration, soft compliant coverings for impact absorption and pressure sensing, and a streamlined electrical interface between printed conductive traces and external electronics. We show that combining ToF and SC modalities enable

cs.RO
#240
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Mobile robots that move between outdoor and indoor environments still struggle with consistent positioning. Satellite-based and terrestrial ranging each work well in their home domains, but combining them at the raw measurement level has received little attention, and the building boundary is precisely where both classes degrade. This paper reports preliminary observations from the HYMN dataset, which time-synchronizes raw measurements from GNSS, Ultra-Wideband (UWB), WiFi Fine Time Measurement (FTM), and Bluetooth Low Energy (BLE) against millimeter-level ground truth in an industrial setting

eess.SP cs.RO
#241
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Graph-based representations such as Scene Graphs enable localization in structured indoor environments by matching a locally observed graph, constructed from sensor data, to a prior map. This process is particularly challenging in environments with repetitive or symmetric layouts, where structural cues alone are often insufficient to resolve ambiguities. We propose a semantic-enhanced graph matching approach that explicitly models relations between detected objects and structural elements, such as rooms and wall planes. Objects are detected from RGB-D data and integrated into the graph, and th

cs.RO
#242
Research 2026-04-28 arXiv 4.9 5.0/5.0/4.0

Direction-of-arrival (DOA) estimation is an important task in microphone array processing and many downstream applications. The steered response power with phase transform (SRP-PHAT) method has been widely adopted for DOA estimation in recent years. However, accurate SRP-PHAT estimation in 3D scenarios requires evaluating steering responses over thousands of candidate directions, severely limiting real-time performance on resource-constrained platforms. This challenge becomes even more critical for planar arrays, which are widely used in robotics due to their structural simplicity. Motivated b

eess.AS cs.RO
#245
Industry 2026-04-28 TechCrunch — AI 4.9 5.0/5.6/4.0

With this launch, users can connect their Gmail, Google Drive, Notion, Jira, and Salesforce accounts and query that data along with existing meeting data. The company said that it will soon allow connections with Microsoft Outlook, Teams, SharePoint, and Slack.

industry
#249
Industry 2026-04-28 FedScoop — AI 4.8 5.0/5.0/4.0

The agency’s chief innovation officer told FedScoop that the portal of government and private-sector data will go public in “the coming months,” fulfilling a Trump AI Action Plan requirement. The post Labor Department nears launch of AI workforce hub appeared first on FedScoop .

gov_defense industry
#250
Industry 2026-04-28 FedScoop — AI 4.8 5.0/5.0/4.0

The interest in artificial intelligence additions follows what the agency is characterizing as a successful HR modernization project that centralized talent management platforms. The post Energy Department eyes AI-enabled self-service features for workforce appeared first on FedScoop .

gov_defense industry
#253
Agents & Tool Use 2026-04-28 Simon Willison's Weblog 4.8 5.0/5.0/4.0

Five months in, I think I've decided that I don't want to vibecode — I want professionally managed software companies to use AI coding assistance to make more/better/cheaper software products that they sell to me for money. Matthew Yglesias Tags: agentic-engineering , vibe-coding , ai-assisted-programming , ai

frontier_llm industry agents ai_coding
#254
Evaluations & Benchmarks 2026-04-28 arXiv 4.7 4.4/5.0/4.0

Sentiment analysis of product reviews on e-commerce platforms plays a critical role in automatically understanding customer satisfaction and providing actionable insights for sellers seeking to improve product quality. This paper presents a comprehensive benchmarking study comparing a Machine Learning (ML) approach via the PyCaret AutoML framework against a Deep Learning (DL) approach based on a Bidirectional Long Short-Term Memory (BiLSTM) architecture with an Attention mechanism for binary sentiment classification on Indonesian product reviews. The dataset comprises 19,728 samples balanced e

cs.CL
#255
Industry 2026-04-28 AI + a16z 4.7 5.0/5.0/4.0

Elena Burger speaks with Malika Aubakirova, partner on the AI infrastructure team at a16z, about why today’s AI systems struggle to learn over time. They discuss the limits of in-context learning, the case for continual learning, and how models may need to evolve from static systems into ones that learn from experience. Resources: Follow Malika on X: https://x.com/MaikaThoughts Follow Elena on X: https://x.com/VirtualElena Read more on Why We Need Continual Learning: https://a16z.com/why-we-need-continual-learning/ Check out everything a16z is doing with artificial intelligence here , including articles, projects, and more podcasts. Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details

industry
#256
Agents & Tool Use 2026-04-28 80,000 Hours Podcast 4.7 5.0/5.0/4.0

You might have heard that 95% of corporate AI pilots are failing. It was a widely cited AI statistic in 2025, repeated by media outlets and commentators everywhere. It helped trigger a Nasdaq selloff and became a pillar of the "AI is overhyped" case. The problem: 95% fail is 100% wrong. The real finding, once you read the underlying MIT report carefully, points in roughly the opposite direction: 80% of surveyed companies had never piloted a custom AI tool at all. Among the companies that deployed pilots, a quarter reported success — according to an extremely high bar set by the researchers — within six months. Over 90% of staff at all surveyed companies were using tools like ChatGPT regularly for their work. None of that made the headlines. Nor did the fact that the study’s authors are all developing or selling the "agentic AI framework" technology the report recommends as the solution to this supposed epidemic of failing AI. Host Rob Wiblin breaks down how an opaqu

safety_policy
#257
Industry 2026-04-28 Gradient Flow 4.7 5.0/5.0/4.0

Subscribe • Previous Issues What mathematicians figured out about AI that most enterprises haven&#8217;t Recent results suggest that research mathematics is no longer a purely speculative test case for AI. A growing set of examples shows AI contributing not just to short contest puzzles, but to open-ended mathematical work that requires literature search, cross-domain connection-making, revision, and Continue reading "Generation is cheap. Evaluation is everything." The post Generation is cheap. Evaluation is everything. appeared first on Gradient Flow .

industry research
Items
261
Multi-source
21
Long-form (≥7.5)
5
Sources OK / attempted
49 / 91
Top category
Research
44 items