Wolf Digest — 2026-06-23

#1

OpenAI launches Daybreak, a security tool suite, and the Patch the Planet open-source initiative

Industry 2026-06-22 OpenAI Research; TechCrunch — AI 8.1 8.0/8.5/7.8

OpenAI introduced Daybreak, a package of tools aimed at letting organizations find, validate, and patch software vulnerabilities at scale, anchored by two products: Codex Security, a security-oriented layer over its Codex agent, and GPT-5.5-Cyber, a model tuned for offensive and defensive security work. The framing is that the same agentic capability that can discover a vulnerability in a codebase can also be pointed at fixing it, and OpenAI is packaging that loop into a workflow organizations can run continuously rather than as one-off audits.

Alongside the tooling, OpenAI announced Patch the Planet, an initiative to harden open-source software in partnership with the security firm Trail of Bits. The name nods to the 1995 film Hackers and its catchphrase. Under the program, Trail of Bits security engineers work directly with open-source maintainers: they review machine-generated findings before those findings reach maintainers, collaborate on patches and regression tests, and build reusable workflows so a project keeps improving after the first round of fixes lands. OpenAI's stated goal is to reduce, not add to, the triage burden on maintainers who are already being asked to process more vulnerability reports more quickly with the same limited resources. TechCrunch likened the Trail of Bits engineers to code paramedics who triage and stabilize issues, all supported by OpenAI's software.

The backdrop is the structural fragility of open source: it underpins essentially all commercial software, yet much of it is maintained by small, unfunded teams, and a single bug can cascade widely, as the log4j incident demonstrated. The newer wrinkle is that automated vulnerability discovery cuts both ways, because a model that can enumerate exploitable bugs lowers the cost of attack as much as defense. That dual-use tension is the same one raised by Anthropic's Mythos security model, and it is the explicit motivation OpenAI gives for shipping defensive tooling and maintainer support together rather than the model alone.

The practical questions are about scale and durability. TechCrunch noted that it is unclear how Patch the Planet sustains itself across the long tail of open-source projects, or how human-in-the-loop review keeps pace if the tooling generates findings faster than Trail of Bits engineers can vet them. It also read the launch partly as competitive positioning against Anthropic in the AI-security arena.

How it was discussed

OpenAI's own framing centers the tooling, Codex Security and GPT-5.5-Cyber, and the find-validate-patch loop run continuously.
TechCrunch emphasized the Trail of Bits human-review layer and questioned how the maintainer program scales over the long tail of open source.
TechCrunch and OpenAI both situate the launch against Anthropic's Mythos, where automated bug-finding lowers the cost of attack as much as defense.

agents safety_policy ai_coding cybersecurity

#2

EnterpriseClawBench: an agent benchmark from real workplace sessions where the best config hits only 0.663

Evaluations & Benchmarks 2026-06-22 Hugging Face Daily Papers; AK (@_akhaliq) Daily Papers; arXiv cs.CL (Computation & Language); arXiv — Evals & Benchmarks 7.6 7.6/7.7/7.5

EnterpriseClawBench is a benchmark for agents that operate inside real workspaces, where they read heterogeneous files, invoke tools, and produce business artifacts. Its distinguishing feature is provenance: rather than synthetic tasks, it is constructed from a large archive of proprietary, real-world agent sessions captured inside actual enterprises. From that archive the authors distill 852 reproducible tasks, each paired with recovered fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics that define what a correct deliverable looks like.

Because the underlying sessions contain confidential enterprise content, the benchmark data itself is not released. The reusable contribution is instead the construction and evaluation protocol, the recipe for turning recorded agent sessions into a graded benchmark, which sidesteps the data-sharing problem while still letting other teams reproduce the methodology on their own session archives. The headline empirical result is sobering: the strongest configuration the authors tested, Codex paired with GPT-5.5, reaches only 0.663 on the suite. Read as a success rate, that leaves roughly a third of real enterprise tasks unsolved by the best agent stack tested.

The authors argue that enterprise agent evaluation cannot be reduced to a single accuracy number. They insist results be reported as harness-and-model combinations rather than model alone, because the scaffold materially changes outcomes, and that the report card must also include artifact delivery, visual quality of the produced documents, cost, runtime, and stability. That multi-axis view reframes the gap between agent demos and deployment as not only a capability shortfall but a reliability and economics problem, which is precisely the dimension that matters to organizations deciding whether to put agents into production knowledge work. The main caveat is the closed data: reproducibility rests on re-running the protocol against comparable private sessions rather than on a public leaderboard, so cross-paper comparisons will be harder to calibrate.
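The multi-axis report card described above can be pictured as a small record type. This is only a sketch: the class name `AgentReport` and its field names are hypothetical, not taken from the paper, and the only value filled in is the 0.663 score quoted in this digest. Axes the digest gives no numbers for are left as `None` rather than guessed.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class AgentReport:
    """One row of a hypothetical harness-and-model report card."""
    harness: str
    model: str
    task_score: float                          # suite score in [0, 1]
    artifact_delivery: Optional[float] = None  # not reported in this digest
    visual_quality: Optional[float] = None     # not reported in this digest
    cost_usd: Optional[float] = None           # not reported in this digest
    runtime_s: Optional[float] = None          # not reported in this digest
    stability: Optional[float] = None          # not reported in this digest

    @property
    def config(self) -> str:
        # Results are keyed by the harness-model pair, not the model alone.
        return f"{self.harness} + {self.model}"

    @property
    def shortfall(self) -> float:
        # Share of the score left unearned.
        return 1.0 - self.task_score

    def missing_axes(self) -> List[str]:
        # Axes a complete report would still need to fill in.
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


report = AgentReport(harness="Codex", model="GPT-5.5", task_score=0.663)
print(report.config, round(report.shortfall, 3), report.missing_axes())
```

Keeping the harness as a first-class field makes it impossible to log a score against a model name alone, which is the reporting discipline the authors ask for.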

How it was discussed

Curated onto Hugging Face Daily Papers and surfaced by akhaliq, signaling strong community interest in realistic agent evals.
The arXiv evals framing stresses reporting harness-model pairs plus cost, runtime, and artifact quality, not a single accuracy figure.

agents frontier_llm enterprise

#3

MLST documentary: John Jumper dissects AlphaFold's architecture and its limits

AI for Science 2026-06-22 Machine Learning Street Talk; Machine Learning Street Talk (MLST) 7.5 7.0/8.0/7.5

Machine Learning Street Talk released a documentary-cut interview with John Jumper, who shared the 2024 Nobel Prize in Chemistry for AlphaFold and recently left Google DeepMind for Anthropic. The film is less about the move than about a precise, deflationary account of what AlphaFold actually does. Jumper walks through the architecture in detail: multiple-sequence alignments as input, the Evoformer that mixes evolutionary and pairwise representations, invariant point attention operating in three-dimensional coordinate frames, and the FAPE loss that scores predicted structures. He also corrects the popular equivariance narrative, noting that ablations valued the equivariant machinery at roughly 2.5 of 30 GDT points rather than crediting it with the whole result.

He is blunt about the limits. AlphaFold predicts one experiment, the static folded structure, extraordinarily well, but it is not a model of the cell, it does not capture dynamics, and on any given drug target it is, in his words, wrong nine times out of ten. From there the conversation traces the downstream story: the AlphaFold Database of more than 200 million predicted structures, AlphaFold 3 and its handling of ligands, and the spinout of Isomorphic Labs to pursue drug discovery. Jumper also voices a quarrel with the bitter lesson, arguing that in regimes of finite data and expensive experiments, human hypotheses and inductive structure still earn their keep, rather than scale alone settling the question.

The documentary closes with Emmanuel Nji of BioStruct Africa on what changes when structural work that once took a year and roughly a hundred thousand dollars of crystallography per structure now takes months, and on training the next thousand structural biologists across Africa. For an ML audience the value is hearing a foundational AI-for-science figure separate the genuine breakthrough from the hype around it, with specific numbers attached to each claim, and locate where learned priors end and laboratory experiments remain irreplaceable.

ai_science alphafold protein-folding

#4

Tmax: an open RL recipe for terminal agents reaching 27% on Terminal-Bench 2.0 at 9B parameters

AI Coding 2026-06-22 Hugging Face Daily Papers; AK (@_akhaliq) Daily Papers; arXiv cs.CL (Computation & Language); arXiv — Evals & Benchmarks 7.5 7.6/7.5/7.4

Tmax is presented as the strongest open reinforcement-learning recipe to date for terminal-using agents, the language models that drive a shell to accomplish tasks. The motivation is that terminal agents have become one of the most popular downstream applications of language models, yet very little academic work examines RL-based training for them, held back by difficult benchmarks, scarce data, and the absence of simple baseline recipes others can build on.

The central result is efficiency: Tmax reaches 27 percent on Terminal-Bench 2.0 with only 9 billion parameters, outperforming substantially larger models from prior work. The method's core is data generation. The authors introduce a taxonomy that combines difficulty control, personas, and verifier diversification, which lets them cheaply synthesize large numbers of terminal environments suitable for both supervised fine-tuning and reinforcement learning. They then release the resulting terminal dataset, which they report is more than 2.5 times larger than the previously available terminal-agent dataset, lowering the barrier for other groups to train competitive terminal agents without proprietary data.

The work is co-authored by Nathan Lambert, who frames it on Interconnects as part of closing the gap between open recipes and the frontier for agentic use. The value for practitioners is twofold: a reproducible small-model baseline that beats larger systems on a recognized terminal benchmark, and an open data pipeline whose taxonomy can be extended to new tool environments. The main caveat is that 27 percent on Terminal-Bench 2.0, while strong for the parameter count, still leaves terminal agents far from reliable on the harder tasks in the suite, so the contribution is best read as a strong open starting point rather than a solved problem.

How it was discussed

Nathan Lambert frames Tmax on Interconnects as an open RL recipe pulling open agents toward the closed frontier.
Picked up by Hugging Face Daily Papers and akhaliq, reflecting community appetite for open terminal-agent baselines and data.

agents rl terminal-bench

#5

Gray Swan's Kolter and Fredrikson on red-teaming after Mythos

Safety, Policy & Regulation 2026-06-22 Latent Space Podcast; Latent Space (swyx & Alessio) 7.1 6.8/7.5/7.0

Latent Space hosts Zico Kolter, a CMU professor and OpenAI board member who sits on its Safety and Security Committee, and Matt Fredrikson, also of CMU and chief executive of Gray Swan, who co-authored an influential paper on indirect prompt injection. The conversation argues that jailbreaks and indirect prompt injection have moved from a niche concern to a central problem now that agents take consequential actions, and that the export-control attention on Anthropic's Mythos and Fable models has put model security in the spotlight. They draw on years of red-teaming work, from HackAPrompt-style competitions to Gray Swan's arena, to argue that injection through retrieved or rendered content remains largely unsolved at the model layer.

How it was discussed

Kolter ties model security to OpenAI's board-level safety work; Fredrikson frames it through Gray Swan's red-teaming arena.
Both place the discussion against the Mythos and Fable export controls now making model security a mainstream topic.

agents prompt-injection red-teaming

#6

SpaceX signs $150M-per-month compute deal with open-source lab Reflection AI

Infrastructure 2026-06-22 TechCrunch — AI 7.0 7.2/7.0/6.8

Reflection AI, an open-source AI lab, will pay SpaceX 150 million dollars a month from July 1, 2026 through 2029 for immediate access to Nvidia's latest GB300 accelerators and supporting hardware in SpaceX's Colossus 2 data center near Memphis, Tennessee. The arrangement underscores how SpaceX's Colossus build-out is being monetized as a neocloud, selling frontier GPU capacity to outside labs, and how an open-weights lab is willing to commit roughly 5.4 billion dollars over the term to secure GB300 supply rather than wait in line for scarce capacity.
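The roughly 5.4 billion dollar total can be checked with simple month counting. The sketch below assumes a 36-month term (July 2026 to July 2029), because that is the term that reproduces the quoted figure; the summary does not state the end month, and a term running to the end of December 2029 would instead span 42 months.

```python
from datetime import date

MONTHLY_USD = 150_000_000  # monthly payment quoted in the summary


def months_between(start: date, end: date) -> int:
    """Whole months from start (inclusive) to end (exclusive)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


start = date(2026, 7, 1)

# Assumption: a 36-month term, which matches the quoted total.
months_36 = months_between(start, date(2029, 7, 1))
total_36 = MONTHLY_USD * months_36

# Alternative reading: payments continue through December 2029.
months_42 = months_between(start, date(2030, 1, 1))
total_42 = MONTHLY_USD * months_42

print(months_36, total_36)  # 36 5400000000
print(months_42, total_42)  # 42 6300000000
```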

compute gb300 neocloud

#7

MIT Tech Review recaps Anthropic's dispute with the US government over the Mythos model

Safety, Policy & Regulation 2026-06-22 MIT Technology Review — AI 7.0 6.9/7.6/6.5

MIT Technology Review recaps the state of Anthropic's dispute with the federal government. By its account, Anthropic said in April it had built a model called Mythos that was strong enough at working with code to pose a global cybersecurity threat, and gave access to a small group of cybersecurity experts under restricted terms, after which the government moved to control the technology. The piece lays out the points to watch as the situation develops, including how access is governed, how export controls on frontier security models are scoped, and how other labs respond. It is reported here as the policy backdrop to OpenAI's Daybreak launch and the day's red-teaming discussion.

anthropic mythos export-controls

#8

PhoneBuddy: training open models for reliable agentic phone use

Agents & Tool Use 2026-06-22 Hugging Face Daily Papers; AK (@_akhaliq) Daily Papers; arXiv — Agents / Tool Use; arXiv cs.CL (Computation & Language) 6.9 7.1/6.8/6.8

PhoneBuddy is a training recipe and open-model line for agents that operate real phones. The core difficulty is that the deployment environment, real devices running real apps, is slow, stateful, side-effectful, and hard to reset or verify, while scalable mock environments only approximate it. The authors pair a real-app environment with PhoneWorld, a mock environment that reconstructs runnable mock apps from real GUI usage structure, build a shared supervised fine-tuning stage from both, then compare real-app RL against mixed RL across both. Across a 150-task human evaluation on real phones spanning apps, mini-apps, and cross-app workflows, success rates improve materially over the supervised baseline.

How it was discussed

Surfaced via Hugging Face Daily Papers and akhaliq, alongside the arXiv agents feed.

agents gui rl

#9

PlanBench-XL: long-horizon planning over 1,665 tools, where GPT-5.4 reaches 51.9%

Evaluations & Benchmarks 2026-06-21 Hugging Face Daily Papers; AK (@_akhaliq) Daily Papers 6.8 6.8/6.9/6.7

PlanBench-XL is an interactive benchmark of 327 retail tasks over 1,665 tools that tests whether agents can iteratively retrieve usable tools, invoke them to uncover intermediate evidence, and chain those results toward a goal under retrieval-limited tool visibility. An optional blocking mechanism injects missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime. Across ten leading models, massive-tool planning remains hard: GPT-5.4 achieves 51.9 percent, indicating that discovering and composing the right tools from a large ecosystem, rather than reasoning over a fixed toolset, is a live bottleneck.

How it was discussed

Highlighted on Hugging Face Daily Papers and akhaliq as a stress test for large-ecosystem tool planning.

agents tool-use planning

#10

SelfCompact: letting an agent decide when to compact its own context

Agents & Tool Use 2026-06-22 Hugging Face Daily Papers; AK (@_akhaliq) Daily Papers; arXiv — Agents / Tool Use; arXiv cs.CL (Computation & Language); arXiv — Evals & Benchmarks 6.8 6.9/6.8/6.7

Long agent traces of chain-of-thought and tool calls accumulate stale content that anchors later generations and eventually overruns the context window. Standard scaffolds compact at fixed token thresholds, which ignores trajectory structure and risks discarding partial results mid-derivation. SelfCompact instead lets the model decide when and how to compact, pairing a compaction tool the model invokes to summarize accumulated context with a lightweight rubric specifying when to fire (a sub-task resolved, the trajectory converging) and when to suppress (mid-derivation, or when stuck). The authors show both elements are needed: the tool alone is used unevenly across open-weight models, often at unhelpful moments, until the rubric guides its timing.
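The contrast between threshold-based and rubric-guided compaction can be sketched as two predicates. This is illustrative only: in SelfCompact the model itself makes the call with the rubric as guidance, whereas here the rubric is hard-coded, and the `TrajectoryState` fields and the 100,000-token limit are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class TrajectoryState:
    """Hypothetical snapshot of an agent trajectory."""
    tokens_used: int
    subtask_resolved: bool = False
    converging: bool = False
    mid_derivation: bool = False
    stuck: bool = False


def fixed_threshold_compact(state: TrajectoryState, limit: int = 100_000) -> bool:
    # Baseline scaffold: fire purely on token count, ignoring trajectory structure.
    return state.tokens_used >= limit


def rubric_compact(state: TrajectoryState) -> bool:
    # Suppress conditions win: never compact mid-derivation or while stuck.
    if state.mid_derivation or state.stuck:
        return False
    # Fire conditions: a sub-task has resolved or the trajectory is converging.
    return state.subtask_resolved or state.converging


# Over the token limit but mid-derivation: the baseline fires, the rubric holds off.
busy = TrajectoryState(tokens_used=120_000, mid_derivation=True)
# Well under the limit with a sub-task just resolved: only the rubric fires.
done = TrajectoryState(tokens_used=20_000, subtask_resolved=True)

print(fixed_threshold_compact(busy), rubric_compact(busy))  # True False
print(fixed_threshold_compact(done), rubric_compact(done))  # False True
```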

How it was discussed

Curated by Hugging Face Daily Papers and akhaliq across the arXiv agents and evals feeds.

agents context-management efficiency

#11

Groq confirms $650M raise and leans into its neocloud business

Industry 2026-06-22 TechCrunch — AI 6.8 6.9/6.5/7.0

AI chipmaker Groq confirmed a 650 million dollar raise and is hiring new executives as it leans into its neocloud business, selling inference capacity built on its LPU inference accelerators. The move follows a 20 billion dollar arrangement with Nvidia that TechCrunch characterizes as a not-acqui-hire, and it positions Groq to re-staff and compete on low-latency inference serving rather than purely on chip sales.

groq funding inference

#12

Can LLMs reliably self-report adversarial prefills? Largely not

Safety, Policy & Regulation 2026-06-22 arXiv cs.CL (Computation & Language); arXiv — Evals & Benchmarks; arXiv — Post-training / Alignment; arXiv — Reinforcement Learning 6.7 6.5/7.2/6.4

This paper tests whether a model can recognize that its own prior response was elicited by an adversarial prefill attack. Across ten open-weight instruction-tuned models from 3B to 70B and four safety benchmarks, no model reliably recognizes its own compromised outputs, claiming genuine intent on prefilled responses at an average rate of 27.3 percent. The introspective signal stems largely from safety and refusal-related reasoning: orthogonalizing model weights against the refusal direction collapses the gap between claiming rates on prefilled versus natural outputs to near zero, though that direction is not the unique mediator. The signal is also probe-dependent, with internal-intention versus external-tampering framings eliciting qualitatively different answers.
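Orthogonalizing weights against a direction usually means removing, from everything a matrix writes, the component along that direction: W' = W - r̂ r̂ᵀ W. The sketch below shows that operation on a toy matrix in plain Python. It assumes this common form of directional ablation; the summary does not spell out the paper's exact procedure, and the matrix and direction here are made-up values.

```python
import math
from typing import List

Matrix = List[List[float]]


def orthogonalize(W: Matrix, r: List[float]) -> Matrix:
    """Return W - r_hat r_hat^T W, where W has shape (d_out, d_in) and r has length d_out."""
    norm = math.sqrt(sum(x * x for x in r))
    r_hat = [x / norm for x in r]
    d_out, d_in = len(W), len(W[0])
    # proj[j] is the component of column j of W along r_hat.
    proj = [sum(r_hat[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[W[i][j] - r_hat[i] * proj[j] for j in range(d_in)] for i in range(d_out)]


def matvec(W: Matrix, x: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


W = [[1.0, 2.0], [3.0, 4.0]]   # toy weights, not from any real model
direction = [1.0, 1.0]         # stand-in for a "refusal direction"

W_ablated = orthogonalize(W, direction)
out = matvec(W_ablated, [0.5, -2.0])
# After ablation, the output has no component along the direction.
along = sum(o * d for o, d in zip(out, direction))
print(W_ablated, along)
```

Whatever input is applied, the ablated matrix can no longer write along the chosen direction, which is why the operation is a natural probe for whether a behavior is mediated by it.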

interpretability introspection jailbreaks

#13

Google DeepMind and A24 commit $75M to build AI filmmaking tools

Generative Media 2026-06-22 TechCrunch — AI 6.7 6.8/6.3/7.0

Google DeepMind and the studio A24 are teaming up, with a reported 75 million dollar commitment, to build AI tools for filmmaking. The partnership pairs DeepMind's generative video and image models with a studio known for auteur-driven work, signaling a push to move generative media from demos into production pipelines and to shape how AI tooling is adopted by working filmmakers.

generative-video hollywood a24

#14

G2PO: graph-structured credit assignment for long-horizon agentic RL

Reinforcement Learning 2026-06-22 arXiv — Agents / Tool Use; arXiv cs.CL (Computation & Language); arXiv — Evals & Benchmarks; arXiv — Reinforcement Learning 6.6 6.7/6.6/6.5

Group-based RL has improved LLM agents, and recent frameworks shifted from trajectory-level to step-level updates for finer credit assignment. But long-horizon agentic RL still suffers from sparse, delayed reward, and existing step-level methods treat exploration as isolated linear trajectories, ignoring the graph structure of state transitions and producing high-variance value estimates and myopic credit assignment. Group-Graph Policy Optimization (G2PO) models the shared graph of states across a group of rollouts, aggregating value signal over converging paths to deliver lower-variance estimates and less localized credit assignment for multi-turn agents.

How it was discussed

Cross-listed across the arXiv agents, evals, and reinforcement-learning feeds.

rl agents credit-assignment

#15

ToolGraph: self-evolving multi-turn tool-calling via divergence-point preferences

Agents & Tool Use 2026-06-22 arXiv cs.CL (Computation & Language); arXiv — Evals & Benchmarks; arXiv — Post-training / Alignment; arXiv — Reinforcement Learning 6.6 6.7/6.6/6.5

Multi-turn tool-using agents must coordinate long tool sequences while tracking dialogue state and policy constraints, yet most approaches separate inference-time orchestration from parameter-level learning, leaving tool selection weakly structured and preference updates vulnerable to train-deployment prompt mismatch. ToolGraph combines schema-derived topology, transition weights estimated from successful rollouts, and history-aware controls for write prerequisites and repeated-search loops, then trains on divergence-point preferences, the steps where successful and unsuccessful trajectories first split, to sharpen tool selection for within-benchmark self-improvement.
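A divergence point, as described above, is the first step at which a successful and an unsuccessful trajectory differ. The sketch below finds that step and builds a preference pair from it. It is a simplified illustration, not ToolGraph's implementation: trajectories are reduced to lists of tool names, and the tool names themselves are hypothetical.

```python
from typing import List, Optional, Tuple


def divergence_point(success: List[str], failure: List[str]) -> Optional[int]:
    """Index of the first step where the two trajectories differ, or None if identical."""
    for i, (good, bad) in enumerate(zip(success, failure)):
        if good != bad:
            return i
    if len(success) != len(failure):
        # One trajectory is a strict prefix of the other.
        return min(len(success), len(failure))
    return None


def preference_pair(
    success: List[str], failure: List[str]
) -> Optional[Tuple[List[str], str, str]]:
    """Return (shared_prefix, chosen_step, rejected_step) at the divergence point."""
    i = divergence_point(success, failure)
    if i is None or i >= len(success) or i >= len(failure):
        return None  # no step where both trajectories made a different choice
    return success[:i], success[i], failure[i]


# Hypothetical tool-call sequences for the same task.
good_run = ["search_orders", "get_order", "refund_order"]
bad_run = ["search_orders", "get_order", "cancel_order"]

pair = preference_pair(good_run, bad_run)
print(pair)  # (['search_orders', 'get_order'], 'refund_order', 'cancel_order')
```

Training on the pair at the split, rather than on whole trajectories, concentrates the preference signal on the one decision that separated success from failure.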

How it was discussed

Spans the arXiv language, evals, alignment, and RL feeds.

agents tool-use post-training

#16

World Action Models: a survey clarifying a blurred field

Robotic Autonomy 2026-06-18 Hugging Face Daily Papers; AK (@_akhaliq) Daily Papers 6.6 6.4/6.9/6.5

World Action Models (WAMs) are embodied predictive-action models that make a forecast of the future available to the action policy. Some recent WAMs repurpose large video-generation models; a parallel line uses language or vision-language backbones with no video core. This survey argues the rapid expansion has blurred the boundaries among broad world models, video-generation models, action-grounded video world models, vision-language-action policies, and WAMs, and it offers the field a common account. It organizes existing work along two axes: what each method must generate (rendered futures, latent futures, or video-generation-free action reasoning), and a decomposition by predictive substrate, backbone, action coupling, and deployment regime.

How it was discussed

Brought forward by Hugging Face Daily Papers and akhaliq curation.

world-models vla survey

#17

CoorDex: dexterous humanoid loco-manipulation on the move

Robotics 2026-06-22 arXiv cs.AI (Artificial Intelligence); arXiv cs.LG (Machine Learning); arXiv cs.RO (Robotics); arXiv — Reinforcement Learning 6.5 6.7/6.3/6.5

Humanoid loco-manipulation is usually simplified into a stop-and-go process, walking to an object, stopping to manipulate, then resuming, and commonly relies on low-degree-of-freedom end effectors that act like an open-close grasp. CoorDex is a learning pipeline that converts high-dimensional body and dexterous-hand control into coordinated latent residual control, enabling high-DoF dexterous loco-manipulation while the robot is moving. Starting from simulated whole-body and hand demonstrations, it trains a coordinated controller that couples body and hand priors rather than treating them as separate stages.

How it was discussed

Cross-listed across the arXiv AI, ML, robotics, and RL feeds.

robotics humanoid dexterous-manipulation

#18

AIR: adaptive interleaved reasoning with code in multimodal LLMs

Multimodal 2026-06-22 arXiv cs.AI (Artificial Intelligence); arXiv cs.CV (Computer Vision); arXiv — Evals & Benchmarks; arXiv — Reinforcement Learning 6.5 6.6/6.4/6.5

Following the o3-style paradigm of interleaving code execution with reasoning, most multimodal work has focused on tool use within vision-perception tasks, relying on predefined heuristics for visual manipulation and remaining unable to handle numerical computation. AIR gives multimodal LLMs adaptive interleaved reasoning that decides when to call code versus reason in language, extending interleaved code reasoning beyond visual operations to numerical and mixed problems, and trains the policy so the model learns to invoke computation only when it helps.

How it was discussed

Appears across the arXiv AI, vision, evals, and RL feeds.

multimodal tool-use reasoning

#19

Scaling linear mode connectivity and model merging to billion-parameter transformers

Post-Training 2026-06-22 arXiv cs.AI (Artificial Intelligence); arXiv cs.LG (Machine Learning); arXiv — Post-training / Alignment; arXiv — Reinforcement Learning 6.5 6.5/6.8/6.2

Linear mode connectivity offers a foundation for understanding and merging independently trained networks, but existing methods optimize the interpolation path from a single endpoint, limiting scalability to large pretrained transformers. This work applies functionality-preserving weight transformations to align functionally equivalent solutions and introduces a dual procedure in which both models jointly learn their transformations toward a shared interpolation path. The bidirectional optimization substantially reduces interpolation barriers, enabling more reliable merging across billion-parameter architectures than one-sided alignment allows.
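An interpolation barrier is commonly measured as the largest amount by which the loss along the straight line between two models exceeds the linear interpolation of the endpoint losses. The toy below computes that quantity for a one-parameter loss with a sign symmetry, and shows how a function-preserving transformation (here a sign flip) removes the barrier. It is a deliberately tiny, one-sided illustration under assumed definitions; it does not reproduce the paper's jointly learned, bidirectional transformations.

```python
from typing import Callable, List

Params = List[float]


def interpolate(a: Params, b: Params, alpha: float) -> Params:
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def barrier(loss: Callable[[Params], float], a: Params, b: Params, steps: int = 11) -> float:
    """Max over alpha of loss(interp) minus the linear blend of endpoint losses."""
    loss_a, loss_b = loss(a), loss(b)
    worst = 0.0
    for i in range(steps):
        alpha = i / (steps - 1)
        gap = loss(interpolate(a, b, alpha)) - ((1.0 - alpha) * loss_a + alpha * loss_b)
        worst = max(worst, gap)
    return worst


def toy_loss(w: Params) -> float:
    # Double well with minima at w = -1 and w = +1; flipping the sign of w leaves it unchanged.
    return (w[0] ** 2 - 1.0) ** 2


model_a = [-1.0]
model_b = [1.0]

raw_barrier = barrier(toy_loss, model_a, model_b)

# Function-preserving transformation: the sign flip maps model_b to an equivalent solution.
aligned_b = [-w for w in model_b]
aligned_barrier = barrier(toy_loss, model_a, aligned_b)

print(raw_barrier, aligned_barrier)  # 1.0 0.0
```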

How it was discussed

Cross-posted to the arXiv AI, ML, alignment, and RL feeds.

model-merging mode-connectivity post-training

#20

Unlimited OCR Works: tackling the KV-cache blowup in LLM-decoder OCR

Multimodal 2026-06-22 Hugging Face Daily Papers; AK (@_akhaliq) Daily Papers; arXiv cs.CL (Computation & Language); arXiv — Efficiency (Quantization, MoE, Inference) 6.5 6.6/6.3/6.6

End-to-end OCR models such as DeepSeek-OCR put OCR back in the spotlight by using a language model as the decoder to leverage the prior distribution of language. The downside is that as output sequences lengthen, the accumulated KV cache drives up memory and progressively slows generation, unlike humans, whose reading efficiency does not decay with length. Unlimited OCR Works targets that decline directly, proposing mechanisms to keep memory and throughput roughly constant as documents grow so that long-document OCR stays efficient rather than degrading.
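The memory growth at issue follows from the standard KV-cache size formula for a transformer decoder: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below evaluates it for a hypothetical decoder configuration; the layer count, head count, and head dimension are illustrative assumptions, not the settings of DeepSeek-OCR or of the paper.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # fp16 / bf16
) -> int:
    """KV-cache size for one sequence: keys plus values, across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical decoder: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **config)
page_sized = kv_cache_bytes(8_192, **config)
book_sized = kv_cache_bytes(131_072, **config)

print(per_token)             # 131072 bytes, i.e. 128 KiB per generated token
print(page_sized / 2**30)    # 1.0 GiB
print(book_sized / 2**30)    # 16.0 GiB
```

The cache grows linearly with output length, so a document sixteen times longer needs sixteen times the memory, and each new token attends over all of it.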

How it was discussed

Surfaced on Hugging Face Daily Papers and akhaliq, riding the DeepSeek-OCR wave of interest.

ocr kv-cache efficiency

#21

Kamera: position-invariant multimodal KV-cache reuse without training

Efficiency 2026-06-22 arXiv cs.AI (Artificial Intelligence); arXiv cs.CV (Computer Vision); arXiv — Efficiency (Quantization, MoE, Inference); arXiv — Evals & Benchmarks 6.4 6.6/6.3/6.3

Multimodal agents repeatedly re-examine the same video frames, UI screenshots, and rendered artifacts as the context window slides and reasoning iterates, yet every look-back re-encodes from scratch because prefix caches only serve reuse at a fixed leading position. Kamera shows this recompute is avoidable and identifies exactly what naive KV reuse loses: the cross-chunk conditioning a chunk absorbs from its neighbors. That loss is asymmetric, and the direct readout of a cached chunk is recovered exactly and for free by a standard state-merge, enabling training-free reuse of multimodal KV state regardless of position.
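One well-known way to combine attention computed separately over two chunks of keys and values is a log-sum-exp merge, which reproduces attention over the concatenated chunks exactly. The sketch below assumes that this is the kind of "standard state-merge" the summary refers to; that reading is an assumption, and the code is not Kamera's implementation. It uses plain Python, a single query, and omits the usual 1/sqrt(d) scaling for brevity.

```python
import math
from typing import List, Tuple

Vector = List[float]
State = Tuple[Vector, float]  # (attention output, log-sum-exp of scores)


def partial_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> State:
    """Softmax attention of one query over one chunk, plus its log-sum-exp."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    out = [
        sum(w * v[d] for w, v in zip(weights, values)) / denom
        for d in range(len(values[0]))
    ]
    return out, m + math.log(denom)


def merge_states(a: State, b: State) -> State:
    """Combine two chunk-level results into the result for both chunks together."""
    (out_a, lse_a), (out_b, lse_b) = a, b
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    total = wa + wb
    merged = [(wa * x + wb * y) / total for x, y in zip(out_a, out_b)]
    return merged, m + math.log(total)


# Toy data: one query, two chunks of keys and values.
q = [0.3, -1.2]
k1, v1 = [[1.0, 0.0], [0.5, 0.5]], [[1.0, 2.0], [3.0, 4.0]]
k2, v2 = [[-1.0, 2.0]], [[5.0, 6.0]]

full_out, full_lse = partial_attention(q, k1 + k2, v1 + v2)
merged_out, merged_lse = merge_states(
    partial_attention(q, k1, v1), partial_attention(q, k2, v2)
)
print(full_out, merged_out)
```

The merge only needs each chunk's output and one scalar, so a cached chunk can be folded in without recomputing its attention scores.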

How it was discussed

Cross-listed across the arXiv AI, vision, efficiency, and evals feeds.

efficiency kv-cache multimodal

#22

Interconnects argues GLM-5.2 is the step change for open agents

Frontier LLMs 2026-06-22 Interconnects (Nathan Lambert) 6.4 6.3/6.6/6.3

Nathan Lambert revisits Z.ai's GLM-5.2, the MIT-licensed mixture-of-experts model released June 13, to argue its real significance is agentic rather than just leaderboard standing. His thesis is that GLM-5.2 is the first open-weights model to deliver a step change in reliable agentic and tool-use behavior, narrowing the gap to closed frontier systems for builders who need self-hostable agents. He situates it against the current open-versus-closed dynamics and ties it to his own companion release, the Tmax open RL recipe for terminal agents. It is reported here as analysis of an already-covered model release rather than as new launch news.

open-weights glm agents

#23

MeshFlow: direct triangle-mesh generation with equivariant flow matching

Generative Media 2026-06-22 Hugging Face Daily Papers; AK (@_akhaliq) Daily Papers; arXiv cs.CV (Computer Vision); arXiv — Generative Media / Diffusion 6.4 6.5/6.2/6.5

Meshes are among the most common 3D representations but are hard to generate directly because they carry strong symmetries, including permutation invariance over faces and over vertices within a face. MeshFlow generates triangle meshes directly as triangle soups, avoiding serialization into long autoregressive sequences, and uses equivariant optimal-transport flow matching that respects those permutation symmetries. The result is a generator aligned to the structure of mesh data rather than fighting it through sequence modeling.

How it was discussed

Brought forward by Hugging Face Daily Papers and akhaliq across the vision and generative-media feeds.

3d-generation flow-matching equivariance

#24

Open problem: is AdamW effective under heavy-tailed gradient noise?

Research 2026-06-22 arXiv cs.AI (Artificial Intelligence); arXiv cs.LG (Machine Learning); arXiv — Evals & Benchmarks; arXiv stat.ML (Statistical ML) 6.4 6.2/6.9/6.1

AdamW is the de facto optimizer for training large language models, yet its theory lives mostly in finite-variance regimes, even though empirical evidence indicates stochastic gradient noise in LLM pretraining is typically heavy-tailed. Recent results show sign-based optimizers like Lion and Muon achieve sharp heavy-tailed rates and that AdaGrad can converge under heavy tails, but no rigorous convergence theory exists for AdamW in this regime. The paper poses this as an open problem and surveys what would be required to settle whether AdamW remains effective when gradient noise lacks finite variance.

How it was discussed

Cross-listed across the arXiv AI, ML, evals, and statistics feeds.

optimization adamw theory

#25

Flowing With Purpose: latent-action-guided flow-matching policies for manipulation

Robotic Autonomy 2026-06-22 arXiv cs.RO (Robotics); arXiv — Evals & Benchmarks; arXiv — Generative Media / Diffusion 6.3 6.5/6.0/6.4

Flow matching has become a standard for behavior cloning in robotic manipulation, but state-of-the-art flow-matching policies rely on a globally fixed isotropic source distribution that mismatches the fragmented, heteroscedastic structure of robotic action spaces, forcing the model to learn entangled vector fields and bottlenecking training. This work introduces latent-action guidance for the source distribution, shaping the starting noise by inferred latent actions so the learned flow is less entangled and the policy trains more efficiently and performs better on manipulation tasks.

How it was discussed

Spans the arXiv robotics, evals, and generative-media feeds.

robotics flow-matching manipulation

#26

PhySciBench: evaluating deep-research agents in the physical sciences

AI for Science 2026-06-17 Hugging Face Daily Papers; AK (@_akhaliq) Daily Papers 6.3 6.3/6.5/6.1

Deep-research agents are LLM-based systems for autonomous multi-step scientific reasoning, with clear potential to accelerate the physical sciences, yet rigorous evaluation in this domain is lacking. PhySciBench introduces 200 expert-curated questions balanced between physics and chemistry across six task categories that reflect real research workflows, plus a multi-agent framework, giving the field a benchmark to measure how well autonomous research agents handle genuine physical-science problems rather than generic reasoning puzzles.

How it was discussed

Surfaced via Hugging Face Daily Papers and akhaliq curation.

ai-science agents benchmark

#27

Army selects Anduril to lead the NGC2 common data layer baseline

Government & Defense 2026-06-22 DefenseScoop 6.3 6.2/6.6/6.1

After months of testing between Anduril and Lockheed Martin, the US Army has selected Anduril to lead the common data layer baseline for its Next Generation Command and Control (NGC2) initiative, the effort to make the service's disparate systems interoperate. The decision boosts Anduril's role as a key software integrator on a flagship Army modernization program centered on moving data fluidly across the battlefield network.

defense anduril ngc2

#28

Shield AI completes acquisition of simulation firm Aechelon Technology

Government & Defense 2026-06-22 Shield AI 6.2 6.3/6.0/6.3

Shield AI completed its acquisition of Aechelon Technology, a maker of high-fidelity, physics-based simulation and image generation used in flight training and synthetic environments. The deal, tied to a previously announced raise at a 12.7 billion dollar valuation, pairs Aechelon's simulation stack with Shield AI's autonomy software for AI pilots and uncrewed aircraft, strengthening the training-and-simulation pipeline behind its autonomous systems.

defense autonomy simulation

#29

DIA issues RFI for an AI platform to streamline its procurement

Government & Defense 2026-06-22 DefenseScoop 6.1 6.0/6.3/6.0

The Defense Intelligence Agency, with a workforce exceeding 16,000, issued a request for information on June 17 to inform a potential AI prototype project aimed at overcoming inefficiencies in its procurement enterprise. The RFI is an early step toward a possible prototype launch, signaling the intelligence agency's interest in applying AI to back-office acquisition workflows rather than only to mission analysis.

defense procurement rfi

#30

Army seeks autonomous ground vehicles to recover disabled equipment under fire

Government & Defense 2026-06-22 DefenseScoop 6.1 6.2/6.0/6.1

In a June 17 request for information, the US Army said it is interested in a robust, ruggedized autonomous ground vehicle to recover broken or disabled platforms from contested environments, continuing its exploration of robots for dangerous battlefield tasks. The notice frames vehicle recovery as a candidate mission for ground autonomy, where removing crews from exposed recovery operations is the explicit motivation.

defense ground-autonomy robotics

#31

NVIDIA debuts AI-for-science software at ISC: DAQIRI, ALCHEMI, and cuPhoton

Infrastructure 2026-06-22 NVIDIA AI Blog 6.1 6.2/6.1/6.0

At the ISC conference in Hamburg, NVIDIA introduced software to accelerate AI for science, spanning chemistry, materials discovery, and the search for dark matter. The releases include the DAQIRI library, new ALCHEMI NIM microservices for materials and chemistry workflows, and the cuPhoton reference code for photonics and experimental physics, extending NVIDIA's accelerated-computing stack from model training into domain-specific scientific simulation.

hpc ai-for-science isc

#32

NVIDIA Vera CPUs to power new Los Alamos supercomputers for agentic science

Infrastructure 2026-06-22 NVIDIA AI Blog 6.0 6.1/6.0/5.9

Three new Los Alamos National Laboratory supercomputers, named Mission, Vision, and Veritas and built with HPE and NVIDIA, will use NVIDIA Vera CPUs to accelerate scientific discovery, including agentic scientific-AI workloads. The systems extend the trend of national labs standing up large CPU-plus-accelerator machines aimed at coupling simulation with autonomous AI research agents rather than batch HPC alone.

hpc los-alamos vera

#33

Gradient Flow lays out the bear case for AI data centers

Industry 2026-06-22 Gradient Flow (Ben Lorica) 6.0 5.8/6.5/5.7

Ben Lorica argues that the economics of AI data centers look increasingly weak and names them his leading candidate for what could pop the AI bubble within six to twelve months. The concern is not that AI stops improving or that demand vanishes, but that capital spending has raced far ahead of proven revenue while the financed assets, chiefly accelerators, depreciate quickly. The piece is a financial-sustainability counterweight to the day's compute build-out news.

data-centers economics compute

#34

Import AI 462: superpersuasion, self-sustaining AI, and paths to ASI

Safety, Policy & Regulation 2026-06-22 Import AI (Jack Clark) 6.0 5.8/6.4/5.8

Jack Clark's Import AI 462 examines superpersuasion, the prospect of models that systematically out-argue humans, alongside discussion of self-sustaining AI systems and candidate technical paths toward artificial superintelligence. The newsletter synthesizes recent research and frames the safety and governance questions these trajectories raise for the field.

asi persuasion governance

#35

Causal Discovery in the Era of Agents: a call to reframe the LLM role

Research 2026-06-22 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers, arXiv cs.AI (Artificial Intelligence), arXiv cs.LG (Machine Learning) 5.9 5.9/6.1/5.7

Recent attempts to combine LLMs with causal discovery have them infer pairwise directions, propose graph structures, or inject model outputs as priors, but these can obscure whether a causal claim is supported by data and assumptions or merely by textual associations, prompt artifacts, and hallucinated mechanisms. The paper argues for a different role: agents should inspect data, retrieve context, and explain and clarify method assumptions, acting as transparent assistants to principled causal inference rather than as oracles that emit causal directions.

How it was discussed

Highlighted on Hugging Face Daily Papers and by AK (@_akhaliq), and listed in the arXiv cs.AI and cs.LG feeds.

causal-discovery agents methodology

#36

PP-OCRv6 lands on Hugging Face: 50-language OCR from 1.5M to 34.5M parameters

Multimodal 2026-06-22 Hugging Face Blog 5.8 6.0/5.5/5.9

PaddlePaddle released PP-OCRv6 on Hugging Face, a family of OCR models spanning 1.5 million to 34.5 million parameters and covering 50 languages. The lineup targets practical document recognition at small footprints, offering a lightweight open alternative to LLM-decoder OCR systems for multilingual text extraction on modest hardware.

ocr multilingual open-weights