Wolf Digest — 2026-06-26

#1

White House asks OpenAI to limit GPT-5.6 to vetted partners before any public release

Frontier LLMs 2026-06-25 TechCrunch — AI 7.9 7.8/8.4/7.5

OpenAI’s next model, GPT-5.6, will not ship the way its predecessors did. According to a report in The Information relayed by TechCrunch, the company plans to make the model available only to a small set of close partners rather than the general public, and it is doing so at the request of the federal government. At an internal meeting this week, Sam Altman reportedly told staff that the administration would be “approving access customer by customer” during a preview window, and that if the limited rollout goes smoothly OpenAI hopes to follow with a broader release roughly a couple of weeks later. The two agencies named as having asked for the restricted release are the Office of the National Cyber Director and the Office of Science and Technology Policy, and OpenAI staff are described as having worked closely with the government on the launch.

The mechanism here is new even if the actors are familiar. Earlier this month an executive order directed certain AI companies to voluntarily submit new models to the government for testing and evaluation prior to public release, and GPT-5.6 appears to be the first frontier model to move through that pipeline in practice. The stated rationale is cyber capability. Frontier models have grown markedly better at discovering and exploiting software vulnerabilities, and at writing functional malware; the concern is that a freely available model strong enough to find and weaponize bugs faster than human analysts could shift the offense-defense balance for anyone running complex software infrastructure. Because the most capable versions of these systems stay behind closed doors, it remains hard to quantify precisely how dangerous they are, which is part of what makes the gating decision contentious.

The move also lands OpenAI in roughly the posture Anthropic adopted on its own initiative. Earlier this year Anthropic restricted its frontier cyber model, Claude Mythos, to a limited group of partners under a program it called Project Glasswing, arguing the model was too powerful to release openly. Observers have debated whether that framing is substantive or partly a marketing posture, and the same question now attaches to a government-mediated release. What is concrete is the precedent: a staged, customer-by-customer preview of a flagship general-purpose model, reviewed by federal cyber and science-policy offices before wider distribution. For anyone tracking model-release cadence, the practical implications are a slower and more discretionary path from training run to public API, a larger role for government review in deciding who gets early access, and a developing norm in which the most capable frontier systems are treated less like products to be launched and more like capabilities to be metered out. Whether the “couple of weeks later” broad release actually materializes on that timeline is the detail worth watching.

#2

IBM unveils a two-layer ‘nanostack’ CFET chip it says extends the transistor roadmap a decade

Infrastructure 2026-06-25 MIT Technology Review — AI 7.7 7.8/7.9/7.4

IBM has built a prototype chip carrying roughly 100 billion transistors on an area the size of a fingernail, about twice the density of the company’s previous state-of-the-art design from 2021. The advance does not come from shrinking transistors further. Conventional scaling has stalled because transistors are now only a few dozen nanometers across, close to the point where quantum-mechanical effects interfere with their operation, so IBM’s approach is to build upward instead of smaller. Its new architecture, which it calls a nanostack, vertically stacks transistors in two layers on a single silicon die, a structure known in the literature as a complementary field-effect transistor, or CFET. Compared with IBM’s prior architecture, the company reports the new design can do as much as 50 percent more work in the same time and be up to 70 percent more energy efficient.

The fabrication is layer-by-layer, “like a cake” in the words of the team: transistors are built on a first silicon layer, a second silicon layer is placed on top, a second set of transistors is fabricated directly on that, and electrical connections are then formed between the two tiers. IBM says its variant is distinguished by staggering the upper-layer transistors rather than seating them directly above the lower ones, which simplifies wiring. That contrasts with bond-after-fabrication approaches such as AMD’s 3D V-Cache and Huawei’s forthcoming LogicFolding, where each layer is made separately and then bonded; building the second layer in place allows tighter alignment, which matters at these dimensions. The channel in each transistor is made of three nanosheets, each fifteen atoms thick and spaced nine nanometers apart. IBM markets the generation as “sub-nanometer” or “0.7 nanometer,” but that label is a naming convention, not a physical measurement; the actual spacing between transistors has held near forty nanometers for years.

The reception was unusually strong for an architecture announcement. Jay Gambetta, director of IBM Research, called it “a meaningful leap forward” and said he expects nanostacking to be widely used in data centers within a decade, where the efficiency gains could help operators manage energy consumption. Dan Hutcheson of TechInsights called it “transformational,” estimating it puts another ten to fifteen years on the industry roadmap. IBM intends to license the layout to semiconductor manufacturers and expects it to appear across many chip types, including GPUs and CPUs. The hard problems are yield and heat. Because a stacked chip fails if either layer fails, the defect rate is structurally higher than for single-layer parts, which raises cost. The other constraint is the thermal budget: building the upper layer without melting the connections beneath it requires keeping the process below 400 degrees Celsius, and IBM has stayed quiet about exactly how it manages that. Academic groups are pushing the same idea further, with at least one demonstrating a junctionless-transistor stack fabricated below 200 degrees, though only as a proof of principle. For a field whose progress is increasingly bounded by power and packaging rather than raw lithography, a credible path to denser, more efficient logic is directly load-bearing for the cost of training and serving large models.
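The yield penalty from stacking can be made concrete with a short calculation. What follows is a minimal sketch, assuming the two tiers fail independently; the per-layer yields are made-up illustrative values, not IBM figures, and `stacked_yield` is a hypothetical helper name.

```python
def stacked_yield(layer_yields):
    """Fraction of good chips when every layer must work.

    Assumes layers fail independently, which is a simplification.
    """
    result = 1.0
    for y in layer_yields:
        result *= y
    return result


# Illustrative numbers only, not IBM data.
single = stacked_yield([0.90])
double = stacked_yield([0.90, 0.90])

print(f"single-layer yield: {single:.2f}")
print(f"two-layer yield:    {double:.2f}")
```

Under these assumptions a 90 percent per-layer yield drops to about 81 percent for the two-layer part, which is the structural cost the item describes.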

#3

OpenAI reports internal Codex output tokens grew up to 56x in nine months, led by its research team

AI Coding 2026-06-26 Latent Space (swyx and Alessio) 7.5 7.2/7.4/7.9

OpenAI’s economic-research group published internal data showing how quickly its own employees have leaned on Codex, the company’s coding agent, and the numbers are steep. Measured as the change in combined output tokens among active internal users, median usage in June 2026 was 56 times its November 2025 level in the Research organization, 32 times in Customer Support, 27 times in Engineering, and 13 times in Legal, where growth was more gradual. For context, the report notes that through August 2025 the average OpenAI worker spent less than ten percent of their tokens on Codex, so the growth is measured against a low and recent baseline.

What makes the data interesting is less the absolute multiples than what they imply about where agentic coding actually takes hold. The research team, not engineering, posted the largest jump, and the gains show up across functions that are not traditionally code-heavy, including support and legal. OpenAI frames this as agents reshaping work “in every department,” with Codex increasingly handling longer-running and more cross-functional tasks rather than single-shot completions. The newsletter writing it up, AI News from the Latent Space team, treats OpenAI’s internal Codex telemetry as a leading indicator: a company with unlimited free access and strong incentives to dogfood its own tools is a natural upper bound on adoption, and even there the steep ramp only arrived once the surrounding workflow caught up.

The caveat embedded in the same data is that capability and usage are not the same thing. Employees had unlimited access throughout the period and were, by the report’s own framing, underusing the tools well into late 2025 despite no cost barrier. The inflection came when the organization built out the review loops, tooling, and persistent multi-step workflows that let an agent be trusted with longer tasks, a pattern echoed in the surrounding commentary about “skills and concurrent agents.” That is a useful corrective to the assumption that better models automatically translate into proportional productivity. The token-growth curve is real and large, but it tracks the maturation of the harness and the habits around the model at least as much as the raw intelligence of the model itself. For practitioners, the read is that the bottleneck on agentic coding adoption is increasingly organizational and infrastructural rather than a question of whether the underlying model is good enough.

#4

General Intuition raises $320M at a $2.3B valuation to turn gameplay action data into embodied agents

Robotic Autonomy 2026-06-25 TechCrunch — AI 7.5 7.3/7.4/7.8

General Intuition has raised 320 million dollars at a 2.3 billion dollar valuation, bringing total disclosed funding to 454 million dollars following a 134 million dollar round when it launched last October. The round was led by Khosla Ventures, with participation from General Catalyst, Jeff Bezos, Eric Schmidt, Nico Rosberg, and researchers at Google DeepMind and MIT. The company is a spin-out of Medal, the gameplay-clip platform run by chief executive Pim de Witte, and its bet is that the enormous corpus of recorded gameplay sitting inside a service like Medal is an unusually good substrate for training agents that reason about space and motion.

The technical thesis is specific: what matters is not the video itself but the action labels attached to it, the record of exactly which buttons a player pressed and when. Pairing pixels with the actions that produced them gives a model the cause-and-effect structure that passive video lacks, and General Intuition is building a single agentic foundation model for what it calls spatial-temporal reasoning that is meant to generalize across gameplay, simulation, and physical embodiment. As demonstrations, the company showed an agent playing a Fortnite-style game for a hundred hours straight, and a quadruped robot fine-tuned to walk with just eight minutes of real-world data on top of the pretrained model. The claim of strong transfer from game action data to a physical robot with minimal real-world fine-tuning is the crux of the pitch, and the part most worth scrutinizing as more results appear.

The company was co-founded by de Witte along with Eloi Alonso, Adam Jelley, and Vincent Micheli, several of whom have research backgrounds in world models and reinforcement learning. Most of the new capital is earmarked for compute, supported by a deal with CoreWeave, and the company says it plans to broaden access to its model through an API by the end of the summer. De Witte has also said the company bars the use of its agents to harm humans. Strategically the raise sits at the intersection of two threads the field has been circling: the search for scalable sources of action-labeled data to train embodied policies, and the idea that games are a cheap, high-volume proxy for the messy spatial reasoning the physical world demands. Whether gameplay distributions transfer cleanly to robots and real environments remains the open empirical question, but the size of the round and the names attached to it signal serious conviction that action data, not just more video or more language, is a missing ingredient for general-purpose embodied agents.

#5

Credit-card and search data show Anthropic’s Claude gaining fast among paying consumers

Industry 2026-06-25 TechCrunch — AI 7.3 7.0/7.2/7.7

Transaction-analytics firm Indagari, which tracks billions of anonymized card payments from about 28 million U.S. consumers, finds Claude’s paying-consumer count and revenue up roughly 75 percent since January 2026, suggesting Anthropic’s base extends well beyond its usual enterprise-and-developer reputation around Claude Code. A second signal comes from education platform DataCamp, where “Claude” is now the most-searched term on the site, ahead of “AI” itself, and self-directed learner demand for Claude courses is outpacing ChatGPT three to one, up 18x in the last 30 days. The caveats are real: ChatGPT remains far larger by every absolute measure, with many more paying users according to both Indagari and Sensor Tower data. The story is trajectory, not parity.

#6

Ai2 dissects where its Olmo hybrid model beats a transformer, token by token

Recurrent & Linear Attention 2026-06-25 Allen Institute for AI (Ai2), Hugging Face Blog 7.2 7.1/7.4/7.2

Ai2 compared its strongest 7B transformer (Olmo 3) against an architecturally matched Olmo Hybrid, which keeps a few attention layers but replaces the rest with recurrent layers carrying fixed-size, lossy compressed memory. Scoring every token by a per-token loss gap across prose, Wikipedia, books, papers, and code, the hybrid has lower loss on most tokens and is strongest on meaning-bearing content words (nouns, verbs, adjectives), with a loss gap near 0.04 versus about 0.02 on function words like “the,” “of,” and “is.” The hybrid’s edge vanishes exactly where attention is needed: closing braces, where bracket-matching demands precise recall, and verbatim repeated n-grams, where the advantage shrinks toward zero as the repeat lengthens. It is a clean, mechanistic account of the recall-versus-compression tradeoff that motivates hybrid designs.
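The per-token comparison can be sketched in a few lines. This is a minimal illustration, assuming the gap is defined as transformer loss minus hybrid loss (so positive values favor the hybrid) and that each token already carries a class label; the function name, the class labels, and every number in `sample` are hypothetical and not taken from Ai2's data.

```python
from collections import defaultdict


def mean_loss_gap_by_class(tokens):
    """tokens: iterable of (token_class, transformer_loss, hybrid_loss).

    Returns the mean of (transformer_loss - hybrid_loss) per class.
    A positive gap means the hybrid had lower loss on that class.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cls, loss_tf, loss_hy in tokens:
        sums[cls] += loss_tf - loss_hy
        counts[cls] += 1
    return {cls: sums[cls] / counts[cls] for cls in sums}


# Toy values chosen for illustration only.
sample = [
    ("content", 3.10, 3.06),
    ("content", 2.50, 2.46),
    ("function", 0.52, 0.50),
    ("function", 0.30, 0.28),
    ("closing_brace", 0.10, 0.10),
]
gaps = mean_loss_gap_by_class(sample)
for cls, gap in gaps.items():
    print(f"{cls}: {gap:.3f}")
```

Grouping by token class is what lets the analysis separate content words from function words and from recall-heavy tokens such as closing braces, instead of reporting one averaged loss.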

#7

Naveen Rao’s Unconventional AI releases Un-0, an oscillator-architecture image model it claims could cut inference power 1,000x

Efficiency 2026-06-25 TechCrunch — AI 7.1 7.3/7.2/6.8

Naveen Rao, formerly head of AI at Databricks, has unveiled Unconventional AI, which is pursuing an oscillator-based computer architecture for inference. The company released Un-0, an image-generation model, alongside a paper showing a fully functional generator built on a software simulation of the new architecture that it says performs on par with state-of-the-art diffusion systems such as Stable Diffusion or GPT Image 1. Rao calls Un-0 the “hello world” of a new kind of computer and projects the approach could ultimately reduce power use by as much as 1,000x. For now everything runs on a simulation of Unconventional’s oscillator chips; the company, with fewer than 50 employees, plans to publish chip schematics and build out a full inference stack. The 1,000x figure is a projection against unbuilt silicon, but the framing — energy as the binding constraint on inference — is increasingly the industry consensus.

#8

DanceOPD composes text-to-image and editing skills with on-policy generative field distillation

Generative Media 2026-06-25 arXiv, Hugging Face Daily Papers 7.0 7.1/6.9/7.0

Unifying text-to-image generation with local and global editing in one model is hard because the capabilities conflict — editing degrades T2I, and global and local edits interfere with each other. DanceOPD frames each capability as a velocity field over a shared flow-matching state space and routes each training sample to one capability field, querying a single low-noise student-induced state and training with a plain velocity-MSE objective. Because the student learns from fields evaluated on its own rollout states rather than off-policy teacher trajectories, the capabilities compose without the usual destructive interference. It is one of several on-policy distillation papers in today’s batch, part of a visible shift toward training students on their own generations rather than fixed teacher data.

cs.CV cs.LG

#9

OPID extracts hierarchical skill supervision from an agent’s own trajectories for agentic RL

Reinforcement Learning 2026-06-25 arXiv, Hugging Face Daily Papers 7.0 7.0/7.1/6.9

Outcome-based RL gives language agents a stable backbone but only sparse trajectory-level reward, leaving unclear which intermediate decisions to reinforce. OPID (On-Policy Skill Distillation) derives dense token-level supervision directly from completed on-policy trajectories, representing hindsight as hierarchical skills: episode-level skills capture global workflows and failure-avoidance rules, while step-level skills capture local decisions. Crucially it avoids the external skill memories or retrieved privileged context that earlier skill-conditioned methods rely on, which are costly to maintain and tend to mismatch the current policy’s state distribution in multi-turn interaction. Together with DanceOPD, ReNIO, and V-Zero in today’s feed, it marks on-policy distillation as one of the week’s most active threads.

cs.CL cs.LG

#10

iLLaDA: an 8B masked diffusion language model posts large gains over LLaDA across math and code

Research 2026-06-25 arXiv, Hugging Face Daily Papers 6.9 7.1/7.0/6.7

iLLaDA is an 8B masked diffusion language model trained from scratch with fully bidirectional attention, keeping the masked-diffusion objective through both pre-training (scaled to 12 trillion tokens) and supervised fine-tuning (a 25-billion-token instruction corpus for 12 epochs). With variable-length generation for efficiency and confidence-based scoring for multiple-choice evaluation, it improves broadly over LLaDA: the base model gains 21.6 points on BBH and 14.9 on ARC-Challenge, and the instruct model gains 14.5 on MATH and 16.5 on HumanEval. Despite non-autoregressive training it stays competitive with comparable autoregressive baselines, adding evidence that diffusion LMs are closing the gap with the dominant AR-plus-causal-attention recipe at the 8B scale.

cs.CL cs.LG

#11

A data-management study asks whether agent memory systems are ready for production

Agents & Tool Use 2026-06-25 arXiv, Hugging Face Daily Papers 6.9 6.8/7.1/6.9

Agent memory has grown from simple retrieval augmentation into a full data-management layer — persistent storage, retrieval, update, consolidation, and lifecycle governance — yet it is still benchmarked mostly through end-to-end task scores (F1, BLEU) that treat the system as a black box. This paper decomposes agent memory into four core modules (representation, storage, retrieval, and management) and studies them as a data-management problem, surfacing the operational costs, architectural trade-offs, and robustness-under-update concerns that aggregate task metrics hide. The framing is a useful corrective for a subfield where “memory” is often a single retrieval call: it argues for evaluating the storage system itself, not just whether the agent answered the question.

cs.AI cs.CL

#12

ViQ learns text-aligned discrete visual tokens at native resolution

Multimodal 2026-06-25 arXiv, Hugging Face Daily Papers 6.8 6.8/6.7/6.9

Discretizing images the way text is tokenized loses information, and prior work trades off detail against semantics: reconstruction-oriented codes lack meaning, while semantically strong features blur detail. ViQ structures quantization in two stages — a text-aligned pre-training stage followed by detail recovery — to balance high-level semantics with low-level detail while accepting inputs at native resolution. The goal is a single general-purpose discrete visual representation usable across multimodal modeling, where one tokenizer serves both understanding and generation without the usual semantics-versus-fidelity compromise.

cs.CV

#13

CARVE fixes ‘memory-blind’ gating in delta-rule linear attention

Recurrent & Linear Attention 2026-06-25 arXiv 6.7 7.0/6.8/6.4

Recurrent models must forget to remember, yet state-of-the-art delta-rule architectures decide what to erase by looking only at the arriving token, not the memory being modified. CARVE (Content-Aware Recurrent with Value Efficiency) targets three coupled defects it identifies in the leading GDN-2 design: this memory-blind gating; a value-axis erase mask that wastes parameters at the scale of the value projection; and an incompatibility, proved by the authors, that mathematically blocks the WY-form triangular chunk solver that makes recurrent training competitive with transformers. By making the gate content-aware and restoring chunk-parallel training, CARVE aims to recover both expressivity and the hardware-efficient parallelism that linear-attention models depend on. A genuinely architectural contribution in a category that is mostly incremental.
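For readers unfamiliar with the delta rule, here is a minimal sketch of the generic update that this family of models builds on. It is not CARVE's or GDN-2's actual formulation: it shows only the textbook write S <- S + beta * (v - S k) k^T, with the write strength `beta` passed in by hand. In a memory-blind design, as the item describes it, `beta` would be computed from the arriving token alone. The helper names are hypothetical.

```python
def read(S, k):
    """Retrieve the value currently stored under key k: S @ k."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]


def delta_rule_step(S, k, v, beta):
    """One delta-rule write: S <- S + beta * (v - S k) k^T.

    S is a d_v x d_k matrix stored as a list of rows.
    The old value under k is partly erased and replaced by v.
    """
    err = [v_i - p_i for v_i, p_i in zip(v, read(S, k))]
    return [
        [s_ij + beta * e_i * k_j for s_ij, k_j in zip(row, k)]
        for row, e_i in zip(S, err)
    ]


S = [[0.0, 0.0], [0.0, 0.0]]
k = [1.0, 0.0]  # unit-norm key
S = delta_rule_step(S, k, v=[2.0, -1.0], beta=1.0)
S = delta_rule_step(S, k, v=[5.0, 3.0], beta=1.0)  # overwrites the old value
print(read(S, k))  # [5.0, 3.0]
```

With a unit-norm key and `beta=1.0` the second write fully replaces the first, which is the "forget to remember" behavior; smaller `beta` values blend old and new content.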

cs.LG cs.CL

#14

Do thinking tokens help with safety? A first-token probe says refusal is largely decided before reasoning

Safety, Policy & Regulation 2026-06-25 arXiv, Hugging Face Daily Papers 6.7 6.7/6.9/6.5

The common intuition is that chain-of-thought gives a model a “safe space” to reconsider whether an answer violates its principles, improving alignment. This paper finds otherwise. Across open-weight reasoning models from the GPT-OSS, Qwen, Olmo, and Phi families, a trained probe on the first token’s hidden representation predicts the eventual refuse-or-comply outcome with 0.84 to 0.95 AUROC and roughly 88 percent balanced accuracy — before any visible thinking. The reasoning trace behaves more like prefix completion than deliberative revision, with the final decision rarely changing. The implication for safety tuning is pointed: deliberation is not where the refusal decision is being made, so safety interventions aimed at the thinking process may be targeting the wrong stage.
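The two headline metrics can be computed from probe outputs as follows. This is a minimal sketch using the standard definitions of AUROC and balanced accuracy; the probe scores, the outcome labels, and the 0.5 threshold are hypothetical and not drawn from the paper.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in sample count, fine for a sketch.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(scores, labels, threshold=0.5):
    """Mean of true-positive rate and true-negative rate at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Hypothetical probe outputs: 1 = model eventually refuses, 0 = complies.
probe_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
outcomes = [1, 1, 1, 0, 0, 0]

print(f"AUROC: {auroc(probe_scores, outcomes):.3f}")
print(f"balanced accuracy: {balanced_accuracy(probe_scores, outcomes):.3f}")
```

AUROC is threshold-free while balanced accuracy depends on the chosen cutoff, which is one reason a paper might report both.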

cs.CL cs.AI

#15

Adobe acquires image and video enhancement maker Topaz Labs

Generative Media 2026-06-25 TechCrunch — AI 6.5 6.4/6.2/7.0

Adobe is acquiring Topaz Labs, maker of widely used AI upscaling, denoising, and sharpening tools for photo and video, and says it will integrate Topaz’s technology across its applications. Topaz’s models are popular with photographers and video editors for restoring and enlarging footage, and the deal slots neatly into Adobe’s push to own more of the generative and enhancement stack inside Creative Cloud. Terms were not disclosed.

#16

EBench diagnoses generalist mobile-manipulation policies beyond a single success rate

Robotic Autonomy 2026-06-25 arXiv, Hugging Face Daily Papers 6.5 6.6/6.5/6.4

EBench is a simulation benchmark of 26 manipulation tasks annotated along five capability dimensions and four generalization dimensions, built to replace the single success-rate scalar with a capability profile. Evaluating current VLA policies — including Pi-0, Pi-0.5, XVLA, and InternVLA-A1 — it shows that models with similar headline success rates have strikingly different profiles: Pi-0.5 has the best test success and train-test retention, InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA is strong on a disjoint set of atomic skills. The point is that aggregate success rate hides where a generalist policy actually generalizes and where it breaks, which is exactly the information needed to improve it.

cs.RO cs.LG

#17

‘Hallucination in world models is predictable and preventable,’ with new data and coverage-aware training

Robotic Autonomy 2026-06-25 arXiv, Hugging Face Daily Papers 6.5 6.6/6.6/6.3

Generative world models render fluent but physically drifting rollouts. This work hypothesizes that such hallucination concentrates in low-coverage regions of the state-action space, where lightweight data-centric signals can both detect and mitigate it. The authors introduce MMBench2, a 427-hour, 210-task dataset for visual world modeling with ground-truth actions, rewards, and live simulators, train a 350M-parameter world model on it, and identify three distinct hallucination modes — perceptual, action-marginalized, and scene-diverging — each tied to a different pipeline stage, with a predictive signal for each. A coverage-aware sampling technique then closes the gaps at training time. The framing of hallucination as a measurable coverage problem rather than an intrinsic limitation is the useful contribution for anyone building action-conditioned world models.

cs.CV cs.LG cs.RO

#18

Wan-Streamer models real-time, full-duplex audio-video interaction in one transformer

Multimodal 2026-06-25 arXiv, Hugging Face Daily Papers 6.5 6.6/6.4/6.5

Wan-Streamer is a native-streaming, end-to-end interactive foundation model for low-latency, full-duplex audio-visual interaction. Rather than cascading separate voice-activity-detection, ASR, language, TTS, animation, and video-generation modules, it represents language, audio, and video as interleaved input and output tokens within a single transformer, coordinated by block-causal attention for incremental streaming. Perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are all learned jointly in one model, cutting the pipeline latency and error-stacking that plague cascaded systems. It is an ambitious unification of the real-time conversational-avatar stack into a single sequence model.

cs.CV cs.CL

#19

101st Airborne unit tests the limits of AI and uncrewed systems in a Fort Polk breach exercise

Government & Defense 2026-06-25 DefenseScoop 6.5 6.5/6.6/6.3

At the Joint Readiness Training Center at Fort Polk, the 3rd Mobile Brigade Combat Team of the 101st Airborne ran an exercise built around making an obstacle breach “uncontested” for assaulting riflemen by putting uncrewed systems forward of soldiers. The unit combined attack drones, uncrewed ground vehicles, and AI-enabled tooling to find and suppress threats ahead of the breach, and the after-action assessment is candid about where the technology held up and where it did not. It is a concrete data point on how the Army is integrating autonomy and AI at the small-unit tactical level rather than in slideware, including the friction of making heterogeneous robotic systems work under field conditions.

#20

Amazon commits another $13B to AI infrastructure in India

Infrastructure 2026-06-25 TechCrunch — AI 6.4 6.3/6.4/6.5

Amazon is investing a fresh 13 billion dollars in AI and cloud infrastructure in India, deepening a build-out as global hyperscalers race to add data-center and compute capacity in the country. The commitment continues a pattern of very large regional infrastructure pledges tied to AI demand, and positions AWS to capture Indian enterprise and developer workloads as local adoption accelerates.

#21

Patronus AI raises $50M to build ‘digital worlds’ that stress-test AI agents

Agents & Tool Use 2026-06-25 TechCrunch — AI 6.4 6.3/6.3/6.6

Patronus AI, founded by former Meta AI researchers, raised 50 million dollars to build simulated “digital worlds” for stress-testing AI agents before deployment, with investors describing nearly insatiable demand. As agents move into production, evaluation and red-teaming of multi-step agent behavior in controlled environments is becoming its own infrastructure layer — the same need that General Intuition’s simulation work and a wave of today’s agent-evaluation papers are circling from different directions.

#22

The Verification Horizon: why reward design, not generation, is now the hard part for coding agents

AI Coding 2026-06-25 arXiv, Hugging Face Daily Papers 6.4 6.5/6.6/6.1

The classical intuition that verifying a solution is easier than producing one is being inverted for coding agents: as models reason better and harnesses grow more capable, generating candidate solutions is cheap, but reliably verifying them is the bottleneck. Every verifier is a proxy for human intent, never the intent itself, so verification faces a twofold difficulty — intent is underspecified by nature, and optimization during training widens the proxy-intent gap, surfacing as reward hacking or signal saturation. The paper characterizes verification-signal quality along scalability, faithfulness, and other axes, arguing there is “no silver bullet” reward for coding agents and that the field’s progress increasingly hinges on better verification rather than better generation.

cs.SE cs.CL

#23

Qwen-Image-Agent closes the ‘context gap’ in real-world image generation

Generative Media 2026-06-25 arXiv, Hugging Face Daily Papers 6.4 6.4/6.2/6.5

Text-to-image models struggle with real-world prompts that are underspecified, implicit, or dependent on current knowledge. Qwen-Image-Agent frames this as a Context Gap between the user’s partial input and the generation context the model actually needs, and closes it with an agentic loop: Context-Aware Planning identifies what context is missing and how to acquire it, while Context Grounding gathers it through reasoning, search, memory, and feedback before generation. The system treats image generation as a context-construction problem rather than a one-shot mapping, with a new benchmark to evaluate this agentic style of generation.

cs.CV cs.AI

#24

GUI versus CLI: a matched benchmark isolates execution bottlenecks in computer-use agents

Agents & Tool Use 2026-06-25 arXiv, Hugging Face Daily Papers 6.4 6.4/6.3/6.4

Comparisons between screen-only GUI agents and command-line agents usually confound modality with differences in tasks, states, and verifiers. This work introduces a matched benchmark of 440 desktop tasks across 18 applications and 12 workflow categories where both agent types get identical goals, states, and final-state verifiers while restricted to modality-native actions. The strongest GUI agent reaches 59.1 percent full pass rate, beating the strongest original-skill CLI agent at 48.2 percent — but verifier-guided skill augmentation lifts CLI to 69.3 percent, showing most of the CLI deficit comes from incomplete skill coverage rather than model capability. The takeaway is that the GUI-versus-CLI gap is largely an engineering question about skill libraries, not an intrinsic property of the interface.

cs.AI cs.CL

#25

Pentagon’s new post-quantum cryptography strategy frames quantum computing as an ‘existential threat’

Government & Defense 2026-06-25 DefenseScoop 6.4 6.3/6.6/6.2

The Department of Defense released a Post-Quantum Cryptography Strategy that characterizes the full realization of quantum computing as an existential threat to U.S. national security and military dominance, on the logic that a sufficiently capable quantum computer would break the public-key cryptography securing military communications and weapons systems. The document sets direction for migrating to quantum-resistant algorithms across defense systems. It lands alongside a parallel federal push giving civilian agencies a four-month window to finalize quantum-ready migration plans, signaling that “harvest now, decrypt later” risk has moved from research concern to formal program of record.

#26

Microsoft Research uses AI to generate and test explanations of how the brain works

AI for Science 2026-06-25 Microsoft Research Blog 6.3 6.3/6.4/6.2

Microsoft Research describes a system that uses AI to propose mechanistic explanations of neural activity and then design experiments to test them, aiming to turn large-scale brain data into falsifiable hypotheses rather than just predictive models. The approach pairs models that fit neural recordings with an explanation-and-experiment loop, positioning AI as an active participant in the scientific method for neuroscience rather than a black-box predictor — part of the broader trend of AI systems that generate and refine scientific hypotheses.

#27

GitHub benchmarks the Copilot agentic harness across models and tasks

AI Coding 2026-06-25 GitHub Blog — AI and ML 6.3 6.2/6.2/6.4

GitHub published an evaluation of its Copilot agentic harness — the shared scaffolding that turns a raw model into a task-completing agent — measuring performance and efficiency across different underlying models and task types. The central argument is that the harness, not just the model, shapes how effectively the underlying intelligence is applied, and that holding the harness fixed while swapping models isolates where gains actually come from. It is a useful counterpart to OpenAI’s internal Codex-usage data: both point at tooling and scaffolding as the decisive variable in agentic-coding outcomes.

#28

JetSpec breaks the scaling ceiling of speculative decoding with parallel tree drafting

Efficiency 2026-06-25 arXiv / Hugging Face Daily Papers 6.3 6.4/6.3/6.1

Speculative decoding speeds up autoregressive LLMs by drafting tokens and verifying them in parallel, but gains stall when larger draft budgets raise overhead or lower acceptance. Prior head-based methods face a causality-efficiency dilemma: autoregressive drafters give high-acceptance path-conditioned candidates but cost grows with tree depth, while bidirectional block-diffusion drafters generate all positions in one pass but produce mutually inconsistent trees. JetSpec is a head-based framework that combines one-forward drafting efficiency with parallel tree drafting, aiming to keep acceptance high while holding drafting cost flat as the tree grows — pushing past the budget ceiling that has limited speculative decoding speedups.
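To make the draft-then-verify loop concrete, here is a minimal sketch of the plain speculative decoding baseline that this entry takes as its starting point. It is not JetSpec: it drafts a single linear chain rather than a tree, uses greedy decoding, and the two toy "models" (`target_model`, `draft_model`) are hypothetical stand-ins invented for illustration. In a real system the target would score all drafted positions in one parallel forward pass; the loop below only mimics that logic.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken,
                     k: int) -> Tuple[List[int], int]:
    """One greedy draft-then-verify round.

    Returns (tokens to append, number of drafted tokens accepted).
    """
    # 1. Draft k tokens autoregressively with the cheap model.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # 2. Verify against the target's own greedy choice at each position.
    accepted: List[int] = []
    for tok in drafted:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted, len(accepted) - 1
        accepted.append(tok)

    # 3. Every draft was accepted, so the target contributes one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted, k


# Hypothetical toy models over the token ids 0..6 (assumptions for illustration).
def target_model(ctx: List[int]) -> int:
    return (sum(ctx) + len(ctx)) % 7


def draft_model(ctx: List[int]) -> int:
    # Agrees with the target except when the context length is 3 mod 4.
    t = target_model(ctx)
    return t if len(ctx) % 4 != 3 else (t + 1) % 7


def generate(prefix: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Generate n_new tokens; also report how many verify rounds were needed."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < n_new:
        new_tokens, _ = speculative_step(out, draft_model, target_model, k)
        out.extend(new_tokens)
        rounds += 1
    return out[: len(prefix) + n_new], rounds
```

The output always matches what the target would have produced on its own, and the speedup comes from needing fewer target rounds than tokens. The ceiling described above shows up when `k` grows: drafting cost rises with `k`, while the accepted prefix is capped by the first mismatch.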

cs.CL cs.LG

#29

‘Plans don’t persist’: context management is load-bearing for LLM agents

Agents & Tool Use 2026-06-25 arXiv / Hugging Face Daily Papers 6.3 6.3/6.4/6.1

Long-horizon agents compress, summarize, and evict old tokens to continue past finite context windows — safe only if the dropped information is no longer needed or has been internalized. Plans are the stress case: written early, used for many steps, and first to be evicted. The authors introduce “replay pairing,” running a trajectory with and without the plan in history and measuring hidden-state cosine distance. On Llama-3.1-70B the plan signal spikes to 0.453 one step after the plan, then falls 4.1x in a single action-observation step (12.4x on HotpotQA), evidence that standard agents do not carry plans forward as persistent internal state but depend on the plan staying literally in context. A sharp, quantified warning for anyone aggressively trimming agent context.
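The measurement described above can be sketched in a few lines. This is an illustrative reading of "replay pairing", not the authors' code: it assumes you already have one hidden-state vector per agent step from two replays of the same trajectory (one with the plan in history, one without), and the function names `plan_signal` and `fold_drop` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def plan_signal(with_plan: List[Sequence[float]],
                without_plan: List[Sequence[float]]) -> List[float]:
    """Per-step distance between paired hidden states from the two replays."""
    if len(with_plan) != len(without_plan):
        raise ValueError("replays must cover the same steps")
    return [cosine_distance(a, b) for a, b in zip(with_plan, without_plan)]


def fold_drop(signal: Sequence[float], step: int) -> float:
    """How many times smaller the signal is one step after `step`."""
    if signal[step + 1] == 0.0:
        raise ValueError("signal vanished entirely; fold drop is unbounded")
    return signal[step] / signal[step + 1]


# Made-up 2-d hidden states, purely for illustration.
toy_with = [[1.0, 0.0], [0.6, 0.8], [0.1, 1.0]]
toy_without = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_signal = plan_signal(toy_with, toy_without)
```

Taken at face value, a 4.1x drop from the reported peak of 0.453 would leave a signal near 0.11 after a single action-observation step.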

cs.CL cs.AI

#30

Detecting jailbreaks from how predictive entropy evolves across intermediate layers

Interpretability 2026-06-25 arXiv / Hugging Face Daily Papers 6.2 6.3/6.3/6.0

Most jailbreak defenses operate on prompts or outputs; this work probes internal representations. Using the logit lens to track token-level predictive entropy across layers of a frozen LLM, the authors find that static aggregate statistics of prompt entropy carry little signal, but features capturing how entropy evolves across token positions — such as monotonic rank-based trend scores — are far more discriminative. The signal is concentrated in intermediate layers and degrades at the final layer, suggesting jailbreak-relevant information about harmful intent is encoded mid-network and partly washed out by the time it reaches the output. A useful localization result for building internal, representation-level safety monitors.
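The contrast between static statistics and trend features can be shown with a small sketch. It assumes the per-position next-token distributions at one intermediate layer have already been obtained (for example via a logit-lens projection, which is not implemented here). It uses Kendall's tau against token position as one plausible "monotonic rank-based trend score"; the paper's exact features may differ.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kendall_trend(values: Sequence[float]) -> float:
    """Kendall tau-a between position and value.

    +1 means strictly increasing across positions, -1 strictly decreasing,
    values near 0 mean no monotonic trend.
    """
    n = len(values)
    if n < 2:
        return 0.0
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            score += (diff > 0) - (diff < 0)  # +1 concordant, -1 discordant
    return score / (n * (n - 1) / 2)


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


# Two made-up entropy profiles over four token positions at a single layer.
rising: List[float] = [0.5, 1.0, 1.5, 2.0]
falling: List[float] = [2.0, 1.5, 1.0, 0.5]
```

The two toy profiles have the same mean entropy, so a static aggregate cannot separate them, while the trend score gives them opposite signs. That is the kind of distinction the entry attributes to features tracking how entropy evolves across positions.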

cs.CL cs.CR

#31

‘The Hitchhiker’s Guide to Agentic AI’: a full-stack practitioner reference

Agents & Tool Use 2026-06-25 arXiv / Hugging Face Daily Papers 6.2 6.1/6.0/6.5

This is a book-length practitioner reference for building autonomous AI systems, organized around the thesis that good agentic systems require understanding every layer of the pipeline. It opens with the LLM substrate (transformer architecture, GPU systems, training and fine-tuning including SFT, LoRA, and MoE, compression, and inference optimization), develops the alignment-and-reasoning layer (RLHF, PPO, DPO and variants, GRPO, reward modeling, and RL for reasoning with chain-of-thought and test-time scaling), and devotes its second half to agentic AI proper, including agentic training and trajectory-based methods. Useful less as a research result than as a consolidated map of the field for people building production agents.

cs.AI

#32

RoPE-aware bit allocation makes KV-cache quantization position-sensitive

Efficiency 2026-06-25 arXiv / Hugging Face Daily Papers 6.2 6.3/6.2/6.0

Low-bit KV-cache quantizers usually treat each cached key as a flat vector. Under RoPE, however, a key’s contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks, which turns key-cache quantization into a block-wise bit-allocation problem in which high-energy RoPE blocks deserve more bits. Block-GTQ computes a label-free energy score per RoPE block and greedily allocates integer bit-widths by marginal gain. Under matched bit budgets it cuts per-layer mean absolute error by 32 to 80 percent at 2- and 3-bit key-only quantization and wins all 367 layer comparisons against uniform allocation. It is a clean, well-motivated refinement for long-context inference memory.
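Greedy allocation by marginal gain can be sketched as follows. This is an illustration of the general technique, not Block-GTQ itself. It assumes a textbook distortion model in which a block with energy `e` quantized at `b` bits incurs error `e * 4**(-b)`, so each extra bit cuts that block's error by a factor of four. The energies and the `max_bits` cap are made-up values, and the paper's energy score is not reproduced.

```python
import heapq
from typing import List, Sequence


def greedy_bit_allocation(energies: Sequence[float], total_bits: int,
                          max_bits: int = 8) -> List[int]:
    """Hand out integer bits one at a time to the block with the largest gain."""
    bits = [0] * len(energies)

    def gain(i: int) -> float:
        # Error removed by raising block i from bits[i] to bits[i] + 1
        # under the assumed model: e * 4**-b - e * 4**-(b + 1).
        return energies[i] * 0.75 * 4.0 ** (-bits[i])

    # Max-heap via negated gains; ties break toward the lower block index.
    heap = [(-gain(i), i) for i in range(len(energies))]
    heapq.heapify(heap)

    remaining = total_bits
    while remaining > 0 and heap:
        _, i = heapq.heappop(heap)
        bits[i] += 1
        remaining -= 1
        if bits[i] < max_bits:
            heapq.heappush(heap, (-gain(i), i))
    return bits


def total_distortion(energies: Sequence[float], bits: Sequence[int]) -> float:
    """Total error under the same assumed model."""
    return sum(e * 4.0 ** (-b) for e, b in zip(energies, bits))


# Four hypothetical RoPE blocks with an average budget of 2 bits per block.
block_energy = [16.0, 4.0, 1.0, 1.0]
allocated = greedy_bit_allocation(block_energy, total_bits=8)
uniform = [2, 2, 2, 2]
```

On this toy input the greedy allocation shifts bits from the two low-energy blocks to the high-energy one and ends with lower modeled error than the uniform split at the same budget. Because the per-bit gain shrinks as a block receives more bits, the greedy choice is a natural fit under this particular model.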

cs.LG cs.CL

#33

In-Context World Modeling lets robot policies infer system parameters on the fly

Robotic Autonomy 2026-06-25 arXiv / Hugging Face Daily Papers 6.2 6.3/6.2/6.0

Vision-Language-Action models often fail under altered camera viewpoints or robot morphologies because they condition only on current observations and language, implicitly assuming a fixed execution context. In-Context World Modeling treats system identification as in-context adaptation: the policy infers essential system variables from a short history of self-generated, task-agnostic interactions, using the context window to understand how the system behaves rather than what task to do. The aim is generalization to new embodiments and setups without the data-intensive fine-tuning that VLA models usually require for each new environment.

cs.RO cs.LG

#34

Why multi-step tool-use RL collapses, and which supervisory signals fix it

Reinforcement Learning 2026-06-25 arXiv / Hugging Face Daily Papers 6.2 6.2/6.3/6.0

Agentic RL for tool use often destabilizes, and some models exhibit catastrophic collapse where performance abruptly drops and tool-invocation structure breaks. The authors trace this to unexpected probability spikes in specific control tokens that disrupt structured execution — while the underlying tool-use capability stays intact, merely obscured by broken formatting. They then survey supervisory signals (off-policy supervision, hint-based guidance, erroneous-example supervision, and others) under synchronous and interleaved training, finding that interleaving supervision stabilizes training and recovers the lost structure. A practical diagnosis of a failure mode that anyone training tool-using agents with RL will recognize.

cs.LG cs.CL

#35

‘Look Light, Think Heavy’: where multimodal chain-of-thought helps and where it does not

Multimodal 2026-06-25 arXiv / Hugging Face Daily Papers 6.1 6.2/6.2/6.0

Chain-of-thought is standard for boosting LLM reasoning, but its value in multimodal tasks is unclear. Evaluating 12 multimodal tasks across perception and reasoning with 14 non-reasoning and 8 reasoning models, this study finds CoT is “not a free lunch”: it helps on reasoning-heavy tasks but can hurt on perception-dominated ones, where extended deliberation over a fundamentally perceptual judgment adds noise rather than signal. The result argues for selective, task-aware use of multimodal reasoning rather than always-on CoT, and clarifies when the “think harder” reflex actually pays off in vision-language settings.

cs.CV cs.CL

#36

How post-training reshapes biological reasoning models, stage by stage

AI for Science 2026-06-25 arXiv / Hugging Face Daily Papers 6.1 6.2/6.3/5.9

Scientific reasoning models for biology pair language models with foundation models over DNA, RNA, and proteins, but how each post-training stage shapes them is poorly understood. The authors train and evaluate more than 100 models across genomics, transcriptomics, and proteins under controlled variation of backbone, continued pre-training, supervised fine-tuning, and RL. They find that each stage reshapes generalization distinctly rather than adding uniform gains: continued pre-training aligns the model with biological “language” and improves downstream performance, while later stages can induce over-specialization that buys in-domain gains at the cost of out-of-domain robustness. A careful, large-N anatomy of the post-training recipe for AI-for-biology.

q-bio cs.LG

#37

Congressional Research Service says fewer than 3% of AI bill-summary outputs met its standards

Government & Defense 2026-06-25 FedScoop — AI 6.1 6.0/6.3/6.0

The Congressional Research Service tested AI on bill summaries to work through a backlog, but its director told the House Administration Committee that fewer than 3 percent of the AI-generated results met the agency’s standards. It is a sober, concrete data point on the gap between general-purpose model capability and the accuracy bar for high-stakes legislative work, and a counterweight to broad claims about AI readily automating government analytical functions — the hard part is the last few percent of reliability, not the first draft.

#38

PrivacyAlign grounds agent privacy decisions in human judgment

Safety, Policy & Regulation 2026-06-25 arXiv / Hugging Face Daily Papers 6.1 6.1/6.2/5.9

Every message, post, or tool call an agent makes is a contextual judgment about what is appropriate to share, with whom, and under what conditions — making privacy a core alignment problem for agents acting on a user’s behalf. Because these judgments depend on social norms, human judgment helps define violations rather than just label them. PrivacyAlign provides 1,350 samples with 3,516 detailed annotations from 599 annotators across scenarios where current LLMs actually leak, and uses them to ground both alignment training and automated evaluation. The contribution is putting human contextual judgment at the center of agentic privacy rather than relying on the unreliable proxies prior work used.

cs.CL cs.CR

#39

Philippines deploys U.S.-made Triton autonomous naval drones in the West Philippine Sea

Government & Defense 2026-06-25 C4ISRNET 6.0 6.0/6.1/6.0

The Philippine Navy is deploying four Triton autonomous underwater and surface drones to protect subsea cables and monitor incursions by Chinese vessels and maritime militia in disputed waters. The deployment is a concrete example of uncrewed maritime autonomy moving into contested operational use for persistent surveillance, where autonomous surface and subsurface platforms extend a navy’s sensing reach without putting crews forward.

#40

Poland buys Shield AI’s V-BAT drones for Baltic naval operations

Government & Defense 2026-06-25 C4ISRNET 6.0 6.0/6.0/6.0

Poland’s Ministry of National Defence has contracted U.S. firm Shield AI to supply its V-BAT vertical-takeoff drone for the Polish Navy, expanding the country’s uncrewed maritime ISR capacity in the Baltic. The purchase adds to a string of European procurements of autonomous and AI-enabled platforms from defense-tech companies, and underscores how vendors built around autonomy software, not just airframes, are winning operational contracts.

#41

Netris raises $15M from a16z to help AI neoclouds go live faster

Infrastructure 2026-06-25 TechCrunch — AI 6.0 5.9/5.9/6.2

Netris raised a $15 million Series A led by a16z for software that runs on network switches and helps neocloud operators reduce the time it takes to bring GPU clusters online. As specialized AI cloud providers proliferate, the unglamorous work of automating network provisioning becomes a real bottleneck, and Netris is betting that switch-level software is where time-to-revenue for new clusters is won or lost.

#42

Speaker verification for non-verbal vocalizations via conditional distillation and a mixture-of-experts

Audio & Speech 2026-06-25 arXiv / Hugging Face Daily Papers 6.0 6.1/6.0/5.9

Expressive TTS and voice-conversion systems increasingly generate non-verbal vocalizations such as laughs and sighs, so verifying that those segments preserve speaker identity matters. Current speaker-verification systems generalize poorly to them, and fine-tuning on non-verbal data causes catastrophic forgetting of speech performance. This first systematic study across 10 non-verbal vocalization types pairs frozen Data2Vec self-supervised features with ECAPA-TDNN and adds a mixture-of-experts module, retaining speech verification accuracy while extending reliable identity checking to non-verbal segments. A targeted fix for an evaluation gap that expressive speech generation is rapidly creating.

eess.AS cs.SD

#43

Autodata trains an agent to act as a data scientist that builds its own training data

Agents & Tool Use 2026-06-25 arXiv / Hugging Face Daily Papers 6.0 6.0/6.1/5.9

Autodata is a method for training AI agents to act as data scientists that construct high-quality training and evaluation data, and to meta-optimize so the agent learns to create progressively stronger data. Its concrete instantiation, Agentic Self-Instruct, is tested on computer-science research tasks, legal reasoning, and reasoning over mathematical objects, where it beats classical synthetic-data-creation methods, with the meta-optimized data scientist yielding further gains. It sits in the same trajectory as today’s on-policy-distillation work: pushing more of the data-generation loop inside a learned, self-improving agent rather than a fixed pipeline.

cs.LG cs.AI

#44

‘Human in the loop’ is not the same as AI governance, argues a federal-tech analysis

Safety, Policy & Regulation 2026-06-25 FedScoop — AI 5.9 5.9/6.1/5.8

As agencies accelerate AI adoption, “keep a human in the loop” has become the default assurance of accountability. This analysis argues the phrase is doing too much work: a person nominally in the loop who lacks the time, information, or authority to meaningfully intervene provides the appearance of oversight without its substance. The piece presses for governance defined by genuine decision authority and auditability rather than the mere presence of a human checkpoint — a distinction that matters as automated decision systems scale across government.

#45

Computerphile: is AI music classification a modern ‘Clever Hans’?

Interpretability 2026-06-25 Computerphile 5.9 5.8/5.9/6.0

In a Computerphile segment, King’s College London researcher David Kelly uses the Clever Hans story — the horse that appeared to do arithmetic but was reading its handler’s cues — as a lens on shortcut learning in AI music classifiers, asking whether systems that score well are tracking the musical property of interest or an incidental correlate. It is an accessible framing of a core interpretability and evaluation concern: high benchmark accuracy can mask a model exploiting spurious signal rather than the intended structure, and distinguishing the two requires probing what the model is actually keying on.