
Wolf Digest — 2026-04-28

Coverage window: 2026-04-27 03:25 ET → 2026-04-28 03:02 ET
Must-read · top 3
#1 · Industry
OpenAI–Microsoft AGI clause expires; deal restructured to enable $50B AWS partnership
OpenAI restructured its agreement with Microsoft, retiring the long-standing 'AGI clause' that would have voided Microsoft's commercial IP rights upon AGI declaration. In exchange for relaxing the…
Score 8.6
#2 · Industry
Ineffable Intelligence raises $1.1B at $5.1B valuation to build AI that learns without human data
David Silver — AlphaGo / AlphaZero co-author and DeepMind veteran — has closed $1.1B at a $5.1B valuation for Ineffable Intelligence, founded just months ago. The pitch is a return to learning…
Score 8.2
#3 · Industry
Musk vs. Altman trial opens in N.D. Cal. — verdict could constrain OpenAI's for-profit conversion ahead of IPO
After years of pre-trial maneuvering, Musk's lawsuit against OpenAI is going to trial in Northern California. The court could rule on whether OpenAI is permitted to operate as a for-profit and could…
Score 8.0
#1

OpenAI–Microsoft AGI clause expires; deal restructured to enable $50B AWS partnership

Industry 2026-04-27 TechCrunch · MIT Technology Review · Simon Willison
8.6
I 8.0 Im 8.5 P 9.0

OpenAI and Microsoft restructured the central commercial clause of their relationship today, retiring the long-running 'AGI clause' that for six years had said Microsoft's commercial intellectual-property rights to OpenAI technology would be voided once OpenAI's board declared that artificial general intelligence had been achieved. That clause was the load-bearing piece of OpenAI's nominal nonprofit governance: it kept the for-profit subsidiary structurally subordinate to the original mission, and it gave OpenAI a unilateral exit from its largest commercial counterparty. Removing it changes the company's commercial topology meaningfully.

In exchange, Microsoft accepts that OpenAI may sell ChatGPT Enterprise and the OpenAI API on Amazon Web Services. That clears the path for the previously rumored fifty-billion-dollar AWS compute deal, which the companies had been unable to finalize while the Azure exclusivity was still binding. Microsoft is not walking away with nothing: the new agreement gives Microsoft a richer revenue share on OpenAI products that flow through other clouds, plus continued first-look rights on OpenAI's research artifacts.

Simon Willison documented the slow-motion disappearance of the AGI clause from the openai.com website. It first appeared in the July 2019 announcement of Microsoft's initial billion-dollar investment, where it was framed in plain text as a structural commitment to ensure 'AGI is used for the benefit of all.' Over the years it migrated to footnotes, then to a blog comment thread, and by mid-2025 it had been removed entirely from the partnership's public-facing description. Today's announcement is the legal version of that fade.

The timing is the part that matters strategically. The Musk-versus-Altman lawsuit went to trial this same week in Northern California, and the central legal claim is whether OpenAI is permitted to convert from a capped-profit structure to a fully for-profit structure ahead of its planned initial public offering. Until today, the AGI clause was a complication for that conversion: any IPO prospectus would have had to disclose that Microsoft's commercial rights could vanish on a future board vote. With the clause retired, the prospectus risk profile is materially cleaner.

For practitioners, the on-the-ground change is that ChatGPT Enterprise and OpenAI API endpoints will be procurable on AWS in addition to Azure. That is a meaningful procurement unlock for organizations whose security-review and credential infrastructure are AWS-native and which have been blocked from adopting OpenAI for that reason. It also changes the competitive picture for Anthropic, which has had AWS effectively to itself in the frontier-model-on-AWS slot.

The caveats Willison and the Hacker News commentary flagged are worth keeping in mind. First, the new agreement still gives Microsoft preferential pricing and capacity allocation, which limits how aggressive an AWS rollout can be. Second, the AGI clause was always more rhetorical than enforceable — declaring AGI is a board action that no rational board would ever take while it would void a primary revenue stream. Its retirement codifies what was already the practical reality. Third, the trial outcome is still pending; if Musk's claims succeed, the for-profit conversion itself could be enjoined, regardless of how the Microsoft side has been cleaned up.

OpenAI · Microsoft · AGI clause · AWS · for-profit

#2

Ineffable Intelligence raises $1.1B at $5.1B valuation to build AI that learns without human data

Industry
8.2

David Silver — co-author of AlphaGo, AlphaZero, and the 'era of experience' position paper Silver and Sutton circulated in 2024 — has closed one-point-one billion dollars at a five-point-one-billion valuation for Ineffable Intelligence, a London-based AI lab founded only a few months ago. The round is one of the largest pre-product seed-and-Series-A bets in the history of the field, and it is explicitly framed as a bet on a different paradigm: AI systems that learn primarily from environment interaction and self-play, rather than from human-generated text and human-labeled preference data.

The thesis traces directly to Silver and Sutton's argument that the post-2020 frontier-model era — defined by imitating human-written corpora and then post-training with human preferences — has saturated against the bottleneck of available human-generated data. Their alternative is an 'experience-first' regime: agents acting in rich, simulated, or real environments, generating their own data through interaction, and being trained on the outcomes of those actions rather than against fixed datasets. AlphaZero is the canonical proof of concept; the open question, which Ineffable is now well-funded to pursue, is whether the same approach can extend to language and general reasoning at frontier scale.
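
To make the loop concrete, here is a toy sketch of an experience-first agent (a multi-armed bandit, chosen for brevity; it illustrates the paradigm, not Ineffable's system): every training signal is generated by the agent's own actions rather than drawn from a human corpus.

```python
import random

# Toy 'experience-first' loop (an illustration of the paradigm, not
# Ineffable's system): every training signal below is generated by the
# agent's own actions; no fixed human-written dataset is involved.
ARMS = 5
payoffs = [random.random() for _ in range(ARMS)]  # hidden environment dynamics
values = [0.0] * ARMS                             # the agent's learned estimates
counts = [0] * ARMS

for _ in range(10_000):
    # act: epsilon-greedy over the agent's own current estimates
    if random.random() < 0.1:
        arm = random.randrange(ARMS)
    else:
        arm = max(range(ARMS), key=lambda a: values[a])
    reward = payoffs[arm] + random.gauss(0, 0.1)  # outcome supplied by the environment
    # learn: incremental update on self-generated experience
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print("true best arm:", max(range(ARMS), key=lambda a: payoffs[a]),
      "| learned best arm:", max(range(ARMS), key=lambda a: values[a]))
```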

The funding profile suggests the round is structured for the unusually long capability runway this kind of work requires. Burn-rate math at frontier compute pricing implies the round funds something like eighteen to twenty-four months of training experiments at a hundred-thousand-H200-equivalent scale, which is the right order of magnitude for the simulation-heavy training Silver's prior research implies. Investor reporting points to a mix of European sovereign wealth, U.S. crossover funds, and at least one strategic from a hyperscaler, though the company has not confirmed names.
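
For reference, the arithmetic behind that runway estimate looks roughly like this; the per-GPU-hour rate is an assumed amortized at-cost figure, not a reported number.

```python
# Back-of-envelope runway check. Every input is an assumption, not a reported
# figure; $0.70/GPU-hr is a guessed amortized at-cost rate for a self-operated
# cluster, well below on-demand rental pricing.
gpus = 100_000                # H200-equivalent accelerators
cost_per_gpu_hour = 0.70      # USD/hr, assumed
hours_per_month = 730

monthly_burn = gpus * cost_per_gpu_hour * hours_per_month  # ≈ $51M/month
runway_months = 1.1e9 / monthly_burn                       # ≈ 21.5 months

print(f"burn ≈ ${monthly_burn / 1e6:.0f}M/month, runway ≈ {runway_months:.0f} months")
```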

What the bet competes with is interesting. The dominant frontier-lab thesis — held at OpenAI, Anthropic, Google, and Meta — is that the lab with the best post-training pipeline plus the cleanest synthetic data plus the most efficient long-context inference wins, with experience-style RL from environment interaction as a useful but secondary technique. Silver's wager runs in the other direction: the experience side is primary, and the imitation side is a starting condition rather than the engine. If correct, the implications for compute allocation, eval design, and what counts as a 'model release' all shift.

The risks are well-documented in the academic literature. Self-play in language has a notorious mode-collapse problem; environments rich enough to drive general capability are hard to specify; and reward functions for open-ended behavior degrade in characteristic ways. Several DeepMind alumni and prominent RL researchers have flagged these as the real bottleneck rather than compute. Ineffable will have to ship a credible technical roadmap in the next twelve months — most likely a domain-specific demonstration analogous to AlphaProof or AlphaCode — for the round to seem prescient rather than expensive.

RL · self-play · post-AlphaGo · DeepMind alumni · frontier_lab

#3

Musk vs. Altman trial opens in N.D. Cal. — verdict could constrain OpenAI's for-profit conversion ahead of IPO

Industry
8.0

The Musk-versus-Altman trial opened in the Northern District of California this week, the culmination of more than two years of pre-trial filings, mediation attempts, and amended complaints. The case has narrowed to a tractable legal core: whether OpenAI's planned conversion from a capped-profit limited-partnership structure to a fully for-profit corporation violates the original donor-restriction terms under which Musk and other early funders contributed to OpenAI's founding nonprofit, OpenAI Inc. Musk's complaint asks for an injunction against the conversion and, in the alternative, a constructive trust over the equity that would otherwise vest in for-profit insiders.

If Musk prevails on the donor-restriction theory, the practical remedy menu includes blocking the for-profit conversion, restructuring the equity allocation to flow back into the nonprofit, or — at the more extreme end of the requested relief — removing Sam Altman from his executive role. The first remedy would force OpenAI to maintain the existing capped-profit structure indefinitely, which would in turn likely block the planned initial public offering, since public-equity markets are not configured to absorb capped-return common stock at scale. The second remedy could be reconciled with an IPO but would require new disclosure language. The third is the dramatic possibility that has driven most of the press coverage and is also the legally weakest of the three.

What makes the trial interesting beyond the personality drama is the timing relative to the rest of OpenAI's corporate restructuring. The Microsoft AGI-clause retirement and the AWS partnership announcement, both concluded earlier the same day, were almost certainly accelerated to clean up the prospectus risk profile in advance of trial. They confirm what the market had inferred from leaked S-1 drafts: an IPO is being prepared on a timeline that the trial could materially affect. Counsel for OpenAI has signaled that they intend to argue the donor-restriction theory is inapplicable because the early funders received bargained-for benefits — research access, board representation, and downstream product-rights — rather than charitable-deduction-eligible donations.

For ML practitioners and the broader research community, the substantive consequences depend less on the verdict than on the discovery record, which has produced extensive disclosures about how OpenAI's research and product roadmaps were prioritized internally, how decisions about model releases were weighed against safety review, and how the relationship with Microsoft constrained, or failed to constrain, that decision-making. Several documents released in pre-trial briefing have already been mined by safety researchers for what they reveal about the boundary between OpenAI's safety-research and product-engineering organizations.

The expected duration is short — a four-to-six-week bench trial, no jury, with a single judge issuing a written opinion sometime in the early summer. A directed verdict in either direction is possible at the close of plaintiff's case. The IPO timeline is sensitive enough that any extended hold by the court will itself function as a partial victory for Musk, regardless of the eventual ruling on the merits.

OpenAI · Musk · Altman · for-profit conversion · litigation
#4
7.8
I 7.5 Im 7.5 P 8.0

DeepSeek released a preview build of V4 late Friday, and the community has spent the weekend parsing what the open release actually means. The headline numbers in the model card place V4 in the contested frontier band on the standard suite — MMLU-Pro, GPQA-Diamond, SWE-Bench Verified, ARC-AGI-2 — clustered with the late-cycle GPT-5.5, Claude Opus 4.6, and Gemini 3 results, with no individual benchmark on which V4 is dominant but several on which it is competitive at materially lower serving cost. The V4-preview pricing card lists token costs roughly forty percent below the V3 rate, which on a like-for-like quality basis is the largest open-weights cost-per-token decline since the V3 launch.

The technical contributions worth flagging are concentrated in three areas. First, an extended context-handling regime — preliminary measurements suggest near-lossless retrieval out to 256K tokens on the BABILong benchmark, which is the regime that has historically broken sliding-window and ring-attention-style approaches. Second, a refreshed inference stack co-released with the LMSYS team, building on the SGLang-plus-Miles serving work covered in Sunday's digest, with batched-prefix optimizations that exploit the V4 attention pattern more aggressively than V3-era serving could. Third, the post-training recipe is described as 'reasoning-first': the SFT stage uses heavily curated reasoning trajectories from V3.5-Code and V3.5-Math, then RL fine-tuning runs on a verifier-grounded reward stack rather than human preferences for many of the reasoning capabilities. The DPO/KTO references in the V3 paper appear to have been replaced by a process-reward training objective.
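
To make the verifier-grounded-versus-preference distinction concrete, here is a minimal sketch of the generic technique (an illustration, not DeepSeek's actual reward stack): a programmatic checker scores the final answer, and a process-reward stand-in scores the intermediate steps.

```python
import re

# Sketch of a verifier-grounded reward stack (the generic technique, not
# DeepSeek's actual recipe): a programmatic checker scores the final answer,
# and a process-reward stand-in scores intermediate steps.

def outcome_reward(response: str, gold_answer: str) -> float:
    """Binary verifier: extract the final \\boxed{...} answer and check it."""
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if m and m.group(1).strip() == gold_answer else 0.0

def process_reward(steps: list[str], prm_score) -> float:
    """Mean per-step score; `prm_score` stands in for a learned step scorer
    returning values in [0, 1]."""
    return sum(prm_score(s) for s in steps) / len(steps) if steps else 0.0

def total_reward(response: str, gold: str, prm_score, alpha: float = 0.5) -> float:
    steps = [s for s in response.split("\n") if s.strip()]
    return alpha * outcome_reward(response, gold) + (1 - alpha) * process_reward(steps, prm_score)

# Example with a trivial heuristic in place of the PRM:
print(total_reward("Step 1: 2+2=4\n\\boxed{4}", "4", lambda s: 1.0 if "=" in s else 0.5))
```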

Independent eval coverage from Artificial Analysis posted Sunday afternoon largely confirms the cost numbers and the long-context claims, with one caveat: the long-context retrieval gains soften noticeably on tasks that mix retrieval and multi-hop reasoning, indicating the gains are mostly on the retrieval side rather than the deep-context-reasoning side. The Chatbot Arena leaderboard slot will not update for several days yet, but informal A/B comparisons posted to the Hugging Face Hub show V4 as competitive on coding and meaningfully ahead on math-style reasoning, with the open question of whether the model is overfit to math-eval style.

The community context sharpens the read. DeepSeek released V3 in December 2024, V3.1 in mid-2025, and V3.5-Code and V3.5-Math as separate specialist releases in early 2026. V4 is positioned as the unification: one base model that subsumes the math and code specialists at frontier general quality. If that consolidation holds, the practical implication is that DeepSeek now has a single open-weights model competitive with closed frontier offerings across most workloads at substantially lower inference cost — which is a meaningful structural change to the open-vs-closed competitive picture rather than a one-off benchmark headline. Caveats: this is still a preview build; the production release with the full model card, eval suite, and license terms is expected within a week; and several of the reported numbers will need to be verified once independent eval harnesses publish.

DeepSeek · V4 · long context · open-weights
#5

China blocks Meta's $2B Manus acquisition after months-long antitrust probe

Industry 2026-04-27 TechCrunch · CNBC (Hacker News)
7.4
I 7.0 Im 7.5 P 7.5

China's State Administration for Market Regulation (SAMR) ordered Meta to unwind its acquisition of Manus, the Singapore-incorporated agent company that became famous in 2025 for its general-purpose autonomous web agent. The block is a significant escalation in tech-deal scrutiny of U.S. AI companies acquiring assets with operations in China, and it removes Meta's clearest path to a frontier-class web-agent stack.

Meta · Manus · antitrust · China · M&A

#6

Cognition announced that Mercedes-Benz is deploying Devin and Windsurf across its global engineering organization, starting with legacy modernization, cloud-native development, and logistics — three of the highest-leverage applied-AI workloads in automotive. The footprint reportedly includes both software engineering teams and the connected-vehicle stack. It's the largest publicly named enterprise rollout for Cognition since the Goldman Sachs case study.

Devin · Windsurf · Mercedes-Benz · enterprise · legacy modernization
#7
7.0
I 7.0 Im 7.0 P 6.5

Pixel-embedding architecture that beats ViT/SigLIP-class encoders on a suite of multimodal benchmarks for both understanding and generation. Notable for collapsing the encoder-decoder split — pixel embeddings flow directly into the language model with no separate vision tower.
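
A minimal sketch of that collapsed design point, with hypothetical shapes and names rather than the paper's actual architecture: patchified raw pixels are projected straight into the language model's embedding width.

```python
import torch
import torch.nn as nn

# Minimal sketch of the collapsed design point (hypothetical shapes and names,
# not the paper's architecture): patchified raw pixels are linearly projected
# straight into the language model's embedding stream, with no vision tower.

class PixelEmbed(nn.Module):
    def __init__(self, patch: int = 16, d_model: int = 4096):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(patch * patch * 3, d_model)  # pixels -> LM width

    def forward(self, img: torch.Tensor) -> torch.Tensor:   # img: (B, 3, H, W)
        B, C, H, W = img.shape
        p = self.patch
        patches = img.unfold(2, p, p).unfold(3, p, p)        # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        return self.proj(patches)  # (B, num_patches, d_model): interleave with text tokens

print(PixelEmbed()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 196, 4096])
```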

Tuna-2
#8
6.8
I 6.5 Im 7.0 P 6.0

OpenAI achieved FedRAMP Moderate authorization, clearing ChatGPT Enterprise and the OpenAI API for U.S. federal agency use under the GSA's compliance baseline. The certification covers data residency, encryption-at-rest, audit logging, and continuous monitoring requirements, opening procurement paths into agencies that have been waiting on Moderate before piloting frontier models — likely closing the gap with Anthropic's existing FedRAMP coverage and Microsoft's Azure OpenAI Government tenant.

FedRAMP · federal AI · compliance · GovCloud
#9
6.8
I 6.0 Im 6.5 P 7.5

A breach disclosure on Oravys claims roughly 4TB of voice training samples plus identifying contractor metadata were exfiltrated from Mercor, the labor marketplace used by frontier labs for RLHF and voice-data collection. If accurate, the dataset is large enough to seed targeted voice-cloning attacks against thousands of named individuals; HN discussion focused on weak rotation of contractor PII and the lack of a regulatory disclosure pathway equivalent to the GLBA framework that data brokers operate under.

data breach · voice cloning · RLHF labelers · Mercor
#10
6.7
I 6.5 Im 7.0 P 6.0

DoD officials disclosed that GenAI.mil — the department's CDAO-fronted internal generative-AI platform — now has more than 100,000 user-built agents, with the latest Google model joining Anthropic's Claude and OpenAI's models in the catalogue. The 100k figure is significant for benchmarking what 'AI uptake at the DoD' actually looks like at scale; it also reinforces the agentic-platform-not-just-chat framing that DoD pushed in late 2025.

DoD · GenAI.mil · Google Gemini · agent platform

#11

Ming-Chi Kuo's note describes an OpenAI phone with MediaTek (SoC), Qualcomm (modem), and Luxshare (assembly), pitched as an agent-first device where AI agents replace traditional apps. This is the second OpenAI hardware leak in two months after the Ive earbuds; the agent-replaces-apps framing maps to the same thesis Mustafa Suleyman pushed at Microsoft Build 2024 and Bret Taylor articulated in Sierra's recent fundraising deck.

OpenAI · hardware · agentic UX · Ming-Chi Kuo
#12
6.7
I 6.5 Im 6.5 P 6.5

Qasar Younis and Peter Ludwig (founders of Applied Intuition) discuss the company's pivot from automotive simulation to a broader 'AI for vehicles' platform spanning mining rigs, drones, trucks, and warships. The conversation covers their CARLA-style simulation stack, the latency budget for real-time control, and how the defense-tech angle is now driving a meaningful share of revenue.

Applied Intuition · physical AI · defense · Qasar Younis
#13

From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

Agents & Tool Use 2026-04-24 Hugging Face Daily Papers · arXiv
6.7
I 6.5 Im 6.5 P 6.5

Multi-agent organization framework that treats heterogeneous agents as a corporate hierarchy: skill-tree-based dispatching, project-management-style task graphs, and human-org budget metaphors as routing primitives. Strong gains on long-horizon planning benchmarks.

Skills-to-Talent
#16

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Multimodal 2026-04-27 Hugging Face Daily Papers · arXiv
6.7
I 6.5 Im 6.5 P 6.5

RL fine-tuning of text-to-video models with explicit 3D-constraint rewards (depth and pose consistency); reports SOTA on long-horizon video physics benchmarks vs. supervised + DPO baselines.
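
A sketch of what a reward of this shape could look like; the estimators, weights, and penalty forms below are assumptions for illustration, not the paper's recipe.

```python
import torch

# Illustrative 3D-constraint reward: frames are scored for temporal depth
# consistency and camera-pose smoothness, and the combined score becomes the
# RL reward. The estimators, weights, and penalty forms are assumptions.

def reward_3d(frames, depth_fn, pose_fn, w_depth=0.5, w_pose=0.5):
    depths = [depth_fn(f) for f in frames]              # per-frame depth maps
    depth_drift = torch.stack([
        (depths[i + 1] - depths[i]).abs().mean() for i in range(len(depths) - 1)
    ]).mean()                                           # frame-to-frame depth inconsistency
    poses = pose_fn(frames)                             # estimated camera trajectory (T, 3)
    pose_jitter = (poses[1:] - poses[:-1]).norm(dim=-1).std()  # erratic camera motion
    return -(w_depth * depth_drift + w_pose * pose_jitter)     # higher = more 3D-consistent

# Stub estimators so the sketch runs; real ones would be monocular-depth and
# pose-estimation models.
frames = torch.randn(8, 3, 64, 64)
print(reward_3d(frames,
                depth_fn=lambda f: f.mean(0),
                pose_fn=lambda fs: torch.cumsum(torch.randn(len(fs), 3), dim=0)))
```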

World-R1

#17

RAND released 'The AGI Rideout Strategy for Reducing Strategic Risk and Promoting Stability in the Transition to Artificial General Intelligence.' The framework treats the years immediately preceding and during AGI emergence as a deterrence-stability problem analogous to the early nuclear era: prioritize avoiding kinetic conflict, harden U.S. compute and grid resilience, and pursue tacit confidence-building measures with the PRC rather than verifiable export-control regimes. It's the most comprehensive grand-strategy document RAND has issued on AGI to date.

AGI · national security · deterrence · policy
#18
6.6
I 7.3 Im 6.7 P 5.4

Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences.

RL
#19
6.5
I 6.5 Im 6.0 P 6.5

Cognition shipped a terminal entry point for Devin: start a session locally on the laptop, hand the same session over to Devin's cloud workers when the work outgrows the local box. The pattern mirrors Claude Code and Codex CLI but is more explicit about migration mid-task, which addresses the long-running complaint about cloud-only agent sessions for sensitive monorepos.

Devin · terminal · local-first · agent UX
#20
6.5
I 6.0 Im 6.0 P 7.0

swyx's Monday digest argues that the GPT-Image-2 explosion — and the way image generation is being threaded through agent tool-calls — is best read as a step toward general capability, not an isolated modality drop. The piece pulls together the rendering quality jump, agent integration, and recent diffusion-transformer scaling work into a single thesis.

GPT-Image-2 · image generation · tool use
#21
6.5
I 7.1 Im 6.5 P 5.2

Large Language Models (LLMs) excel academically but struggle with social intelligence tasks, such as creating good compromises. In this paper, we present methods for generating empathically neutral compromises between two opposing viewpoints. We first compared four different prompt engineering methods using Claude 3 Opus and a dataset of 2,400 contrasting views on shared places. A subset of the generated compromises was evaluated for acceptability in a 50-participant study.

NLP
#23

Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

Safety, Policy & Regulation 2026-04-26 Hugging Face Daily Papers · arXiv
6.4
I 6.0 Im 6.5 P 6.0

Survey of safety threats specific to vision-language-action models — physical execution, prompt injection through visual modality, sensor spoofing — and the evaluation harnesses now standard in robotics safety work.

VLA Safety survey
#24
6.3
I 6.8 Im 6.4 P 5.1

Visual reinforcement learning aims to empower an agent to learn policies from visual observations, yet it remains vulnerable to dynamic visual perturbations, such as unpredictable shifts in corruption types. To systematically study this, we introduce the Visual Degraded Control Suite (VDCS), a benchmark extending DeepMind Control Suite with Markov-switching degradations to simulate non-stationary real-world perturbations. Experiments on VDCS reveal severe performance degradation in existing methods.

RL
#25
6.2
I 6.0 Im 6.0 P 6.0

Microsoft's VibeVoice — released January 21, 2026 but only now picking up community attention — is a Whisper-class ASR model with speaker diarization integrated into the model rather than bolted on. Willison ran the 4-bit MLX port locally on a Mac via mlx-audio (5.71GB). MIT-licensed. The diarization-in-the-model design point is the technically interesting part — most production stacks today still chain Whisper to a separate pyannote model.

ASR · diarization · VibeVoice · MLX
#26
6.2
I 6.0 Im 6.5 P 5.5

NVIDIA's NV-Raw2Insights-US — published with Hugging Face — is a physics-informed AI imaging model for adaptive ultrasound that operates on raw RF channel data rather than reconstructed B-mode images. The pitch is improved tissue contrast and fewer reconstruction artifacts on the same probe hardware, with integration paths for handheld scanners. Pairs naturally with the Arc/Tempus medical-AI thread.

medical imaging · physics-informed · NVIDIA · ultrasound
#27

Stabilizing Efficient Reasoning with Step-Level Advantage Selection

Post-Training 2026-04-27 Hugging Face Daily Papers · arXiv
6.2
I 6.0 Im 6.5 P 5.5

Stabilizes reasoning-RL fine-tuning by selecting advantage estimates at the step level rather than trajectory level. Reports cleaner training curves on math reasoning vs. GRPO-style baselines.
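
The contrast with trajectory-level estimation is easiest to see side by side; this is an illustration of the idea, not the paper's exact estimator.

```python
# GRPO-style: one trajectory-level advantage broadcast to every token of a
# rollout. Step-level: each step carries its own advantage, so a sound step
# inside a failed trajectory is not uniformly penalized.

def trajectory_advantages(rewards):          # rewards: one scalar per rollout
    mu = sum(rewards) / len(rewards)
    sd = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mu) / sd for r in rewards]  # same value applied to every step

def step_advantages(step_scores):            # step_scores: per-step scores per rollout
    flat = [s for rollout in step_scores for s in rollout]
    mu = sum(flat) / len(flat)
    sd = (sum((s - mu) ** 2 for s in flat) / len(flat)) ** 0.5 or 1.0
    return [[(s - mu) / sd for s in rollout] for rollout in step_scores]

print(trajectory_advantages([1.0, 0.0]))          # whole-rollout credit
print(step_advantages([[0.9, 0.8], [0.7, 0.1]]))  # per-step credit
```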

Step-Level Advantage Selection
#28

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

Efficiency 2026-04-03 Hugging Face Daily Papers · arXiv
6.2
I 6.0 Im 6.5 P 5.5

Adaptive depth-wise KV-cache sharing via stochastic routing across layers. Reduces inference memory at fixed quality — competing approach to recent layer-wise KV-merging work.
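
A toy sketch of the routing idea as the abstract describes it; the names, the reuse-most-recent policy, and the fixed share probability are all invented for illustration.

```python
import torch
import torch.nn as nn

# Toy depth-wise KV sharing with a stochastic router. Names, the
# reuse-most-recent policy, and the fixed share probability are invented
# for illustration; the paper adapts the routing rather than fixing it.

class ToyLayer(nn.Module):
    def __init__(self, d: int = 64):
        super().__init__()
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(d, d) for _ in range(3))

    def compute_kv(self, x):
        return self.k_proj(x), self.v_proj(x)

    def attend(self, x, k, v):
        scores = self.q_proj(x) @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        return x + torch.softmax(scores, dim=-1) @ v

def forward_with_kv_routing(layers, x, share_prob: float = 0.5):
    kv_bank = []                                   # caches actually materialized
    for layer in layers:
        if kv_bank and torch.rand(()) < share_prob:
            k, v = kv_bank[-1]                     # reuse: no new cache memory
        else:
            k, v = layer.compute_kv(x)
            kv_bank.append((k, v))                 # fresh: consumes cache memory
        x = layer.attend(x, k, v)
    return x, len(kv_bank)

_, n_caches = forward_with_kv_routing([ToyLayer() for _ in range(12)], torch.randn(1, 8, 64))
print(f"materialized {n_caches}/12 KV caches")
```

Only the layers that materialize fresh K/V consume cache memory, which is where the claimed inference-memory reduction comes from.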

Stochastic KV Routing
#29
6.2
I 6.7 Im 6.3 P 5.0

Mechanistic interpretability has made it possible to localize circuits underlying specific behaviors in language models, but existing methods are expensive, model-specific, and difficult to scale to larger architectures. We introduce Differentiable Faithfulness Alignment (DFA), a framework that transfers circuit information from a smaller source model to a larger target model through a learned differentiable alignment.

mech interp · SAE
#30
6.1
I 5.5 Im 6.0 P 6.5

Forbes profile of Mistral that picked up traction this week: the case that the lab's geographic non-Americanness is its biggest commercial moat in EU and Gulf-state procurement. The piece pairs naturally with the Cohere/Aleph Alpha sovereign-AI merger covered in the prior digest.

Mistral · Europe · sovereign AI · Forbes
#31
6.1
I 6.6 Im 6.2 P 5.0

Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability.

state space · Mamba
#32
6.1
I 6.6 Im 6.2 P 5.0

Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents.

RL
#33
6.0
I 5.5 Im 5.5 P 6.5

Nick Levine, David Duvenaud, and Alec Radford released talkie-1930-13b-base — a 13B model trained on 260B tokens of pre-1931 English. The exercise both probes how much modern-style language understanding survives without contemporary data and supplies a clean baseline for studying which behaviors are inherited from post-1930 corpus shifts. The data-curation methodology paper is the more interesting artifact than the weights themselves.

historical text · Radford · Duvenaud · data curation
#35
6.0
I 6.4 Im 6.1 P 4.9

Biological function emerges from coupled constraints across sequence, structure, regulation, evolution, and cellular context, yet most foundation models in biology are trained within one modality or for a fixed forward task. We present MIMIC, a generative multimodal foundation model trained on our newly curated and aligned dataset, LORE, linking nucleic acid, protein, evolutionary, structural, regulatory, and semantic/contextual modalities within partially observed biomolecular states.

mech interp · SAE
#36
6.0
I 6.4 Im 6.0 P 4.8

Computer-Aided Design (CAD) models are defined by their construction history: a parametric recipe that encodes design intent. However, existing large-scale 3D datasets predominantly consist of boundary representations (B-Reps) or meshes, stripping away this critical procedural information. To address this scarcity, we introduce Zero-to-CAD, a scalable framework for synthesizing executable CAD construction sequences.

agent · tool use
#37
5.9
I 6.4 Im 6.0 P 4.8

Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation.

RL

The development of practical (multimodal) large language model assistants for Korean weather forecasters is hindered by the absence of a multidimensional, expert-level evaluation framework grounded in authoritative sources. To address this, we introduce K-MetBench, a diagnostic benchmark grounded in national qualification exams. It exposes critical gaps across four dimensions: expert visual reasoning of charts, logical validity via expert-verified rationales, Korean-specific geo-cultural comprehension, and fine-grained domain analysis.

benchmark

While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research.

benchmark
#40
5.8
I 5.0 Im 5.5 P 6.5

Thompson revises his AR-vs-VR thesis after a hands-on with Meta's Ray-Ban Display: the device meaningfully reframes what AR is for, namely a context-bearing surface for an always-on AI assistant. The piece reads as a partial walking-back of his earlier skepticism on heads-up display form factors.

Meta · AR · Ray-Ban Display · AI hardware
#41
5.8
I 5.5 Im 6.0 P 5.5

Long-form post arguing — with extensive Gemini-3 input/output transcripts — that frontier models already encode foundational meta-ethical reasoning more carefully than most humans. The argument's strongest leg is the consistency under perturbation of model responses to dilemma framings, not a normative claim about which ethics are correct.

alignment · ethics · Gemini-3 · moral epistemics
#42
5.8
I 6.0 Im 6.0 P 5.0

The U.S. Navy completed the first flight test of the MQ-25 Stingray, Boeing's carrier-based tanker drone — the Navy's first operational carrier-launched unmanned aircraft. Significance is not the autonomy stack (the MQ-25 is largely teleoperated) but the precedent for unmanned-launch-and-recovery from a deployed carrier deck.

MQ-25 · Boeing · Navy · carrier UAV
#43
5.8
I 6.2 Im 5.9 P 4.7

Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding, which can extract image-related claims from the response and compare them one by one with the visual evidence in the image, ultimately returning claims that contain perceptual errors.

RL
#44
5.8
I 6.2 Im 5.9 P 4.7

Pathology foundation models (PFMs) have recently emerged as powerful pretrained encoders for computational pathology, enabling transfer learning across a wide range of downstream tasks. However, systematic comparisons of these models for clinically meaningful prediction problems remain limited, especially in the context of survival prediction under external validation. In this study, we benchmark widely used and recently proposed PFMs for breast cancer survival prediction from whole-slide histopathology images.

benchmark
#45
5.8
I 6.2 Im 5.9 P 4.7

The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multi-modal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics.

benchmark
#46

Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment

Agents & Tool Use 2026-04-21 Hugging Face Daily Papers · arXiv
5.7
I 5.5 Im 6.0 P 5.0

Treats actor-observer asymmetry in agent self-models as an alignment problem; proposes dialectical training that exposes agents to both first-person and third-person perspective queries on their own actions.

Dialectical Alignment
#47

Efficient Agent Evaluation via Diversity-Guided User Simulation

Evaluations & Benchmarks 2026-04-23 Hugging Face Daily Papers · arXiv
5.7
I 5.5 Im 5.5 P 5.5

User-simulation framework for agent evaluation that uses a diversity-guided sampling strategy to surface long-tail interaction modes; reduces eval cost vs. exhaustive scenarios.

Diversity-Guided User Simulation
#48
5.7
I 6.1 Im 5.8 P 4.7

Many sequential decision-making tasks involve optimizing multiple conflicting objectives, requiring policies that adapt to different user preferences. In multi-objective reinforcement learning (MORL), one widely studied approach addresses this by training a single policy network conditioned on preference-weighted rewards. In this paper, we explore a novel algorithmic perspective: leveraging reward-free reinforcement learning (RFRL) for MORL.

RL
#49
5.7
I 6.1 Im 5.8 P 4.7

Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint grows linearly with sequence length, leading to a major memory bottleneck. To mitigate this overhead, KV cache pruning methods discard cached tokens with low attention scores during inference.
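
The generic mechanism this line of work builds on fits in a few lines; this sketches attention-score-based pruning in general, not this paper's specific method.

```python
import torch

# Generic attention-score-based KV pruning (the technique this abstract
# builds on, not the paper's specific method): keep only the cached tokens
# with the highest accumulated attention mass.

def prune_kv(keys, values, attn_scores, keep: int):
    """keys/values: (seq, d); attn_scores: (queries, seq) recent attention weights."""
    importance = attn_scores.sum(dim=0)                 # accumulated mass per cached token
    idx = importance.topk(keep).indices.sort().values   # top-k tokens, original order kept
    return keys[idx], values[idx]

keys, values = torch.randn(1024, 128), torch.randn(1024, 128)
attn = torch.softmax(torch.randn(32, 1024), dim=-1)     # attention from 32 recent queries
k2, v2 = prune_kv(keys, values, attn, keep=256)
print(k2.shape)  # torch.Size([256, 128])
```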

MoE · quantization
#50
5.7
I 6.1 Im 5.8 P 4.7

Large language models have demonstrated strong reasoning capabilities in general knowledge question answering. However, their ability to handle temporal information remains limited. To address this limitation, existing approaches often involve external tools or manual verification and are tailored to specific scenarios, leading to poor generalizability.

NLP
#51
5.7
I 6.1 Im 5.8 P 4.7

Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling.

RL

The Observational Medical Outcomes Partnership Common Data Model (OMOP CDM), maintained by the Observational Health Data Sciences and Informatics (OHDSI) collaboration, enabled the harmonisation of electronic health records data of nearly one billion patients in 83 countries. Yet generating real-world evidence (RWE) from these repositories remains a manual process requiring clinical, epidemiological and technical expertise.

agent · tool use
#53
5.7
I 6.1 Im 5.8 P 4.7

Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings.

diffusion

Large language models deployed at runtime can misbehave in ways that clean-data validation cannot anticipate: training-time backdoors lie dormant until triggered, jailbreaks subvert safety alignment, and prompt injections override the deployer's instructions. Existing runtime defenses address these threats one at a time and often assume a clean reference model, trigger knowledge, or editable weights, assumptions that rarely hold for opaque third-party artifacts.

NLP
#55
5.7
I 6.0 Im 5.8 P 4.6

Diagnostic prediction and clinical reasoning are critical tasks in healthcare applications. While Large Language Models (LLMs) have shown strong capabilities in commonsense reasoning, they still struggle with diagnostic reasoning due to limited domain knowledge. Existing approaches often rely on internal model knowledge or static knowledge bases, resulting in knowledge insufficiency and limited adaptability, which hinder their capacity to perform diagnostic reasoning.

benchmark

Classical robot ethics is often framed around obedience, most famously through Asimov's laws. This framing is too narrow for contemporary AI systems, which are adaptive, generative, embodied, and embedded in physical, psychological, and social worlds. We argue that future human-AI relations should be understood not as master-tool obedience, but as conditional mutualism under governance: a co-evolutionary relationship in which humans and AI systems can develop, specialize, and coordinate while institutions keep the relation reciprocal, reversible, psychologically safe, and socially legitimate.

VLA · embodied
#57

Kwai Summary Attention Technical Report

Agents & Tool Use 2026-04-27 arXiv
5.7
I 6.0 Im 5.8 P 4.6

Long-context ability has become one of the most important iteration directions for next-generation Large Language Models, particularly in semantic understanding/reasoning, code agentic intelligence, and recommendation systems. However, standard softmax attention exhibits quadratic time complexity with respect to sequence length. As sequence length increases, this incurs substantial overhead in long-context settings, causing the training and inference costs of extremely long sequences to deteriorate rapidly.

agent · tool use
#58
5.7
I 6.0 Im 5.8 P 4.6

Large language models (LLMs) are increasingly deployed, yet their outputs can be highly sensitive to routine, non-adversarial variation in how users phrase queries, a gap not well addressed by existing red-teaming efforts. We propose Green Shielding, a user-centric agenda for building evidence-backed deployment guidance by characterizing how benign input variation shifts model behavior.

agent · tool use

Dwarkesh's weekend musings cover the intelligence-vs-power split (capability gain doesn't entail power gain), what counts as 'data' once synthetic-from-RL outputs dominate corpora, and a half-formed thought on whether a Manhattan Project framing actually maps onto compute-bound rather than knowledge-bound problems.

Dwarkesh · essay · intelligence vs power
#63

Speech translation in Google Meet rolls out to mobile

Audio & Speech 2026-04-27 Simon Willison
5.3
I 5.0 Im 5.0 P 5.5

Google Meet's real-time speech translation — already in desktop GA — is rolling out on mobile. The voice-cloning-on-the-fly experience now reaches the form factor where most cross-language conversations actually happen. Willison's hands-on report says the 'kind-of worked' qualifier is doing meaningful work — usable for short exchanges, still rough on overlapping speech.

Google Meet · speech translation · real-time
#64
5.2
I 4.5 Im 5.5 P 5.5

Meta's first contract with Overview Energy is a small-volume off-take for night-time space-beamed solar power. Symbolic for the data-center power-availability arms race more than for the watts-per-rack math: Meta is signaling that any energy source that works at all in the 2030 timeframe is worth a foot in the door.

space solar · data centers · Meta · Overview Energy
#65

War on the Rocks: From Slogan to Standard — defining 'affordable mass'

Government & Defense 2026-04-27 War on the Rocks
4.9
I 4.5 Im 5.5 P 4.5

Op-ed proposing operational definitions for 'affordable mass' — the term that has eaten Pentagon procurement language since 2021. Argues that without a quantitative definition (cost-per-effect, sortie generation, attrition tolerance), the concept is vulnerable to the same definition-by-vendor-pitch cycle that hollowed out 'jointness.'

affordable mass · CCA · Replicator · doctrine
#66

Pip 26.1: lockfiles and dependency cooldowns

AI Coding 2026-04-28 Simon Willison
4.8
I 4.0 Im 4.5 P 5.5

pip 26.1 introduces native lockfiles (long overdue) and a 'dependency cooldown' window that delays adopting just-published versions. The cooldown is the supply-chain-attack-mitigation worth flagging — it pushes pip towards the same posture npm took post-2024 incidents, where freshly published versions are quarantined for hours by default.

pip · lockfiles · Python · supply chain
#67

AI Alignment Forum: From nothing to important actions — agents that act morally

Safety, Policy & Regulation 2026-04-27 AI Alignment Forum
4.8
I 4.5 Im 5.0 P 4.5

Phenomenology-flavored alignment post on the gap between an agent's representational space (perceiving values) and behavior (acting on them). Argues that current safety setups train the perception side but leave the action-selection side underdetermined.

agent ethics · moral patient · alignment
#68
4.8
I 4.5 Im 5.0 P 4.5

SOCOM published a sources-sought notice for the ANCHOR (Advanced Naval Capabilities and Operational Readiness) initiative, naming six modernization focus areas. ANCHOR has an explicit autonomy-stack lane and is the SOCOM-side counterpart to the Navy's USV proliferation program.

SOCOM · ANCHOR · naval · industrial base
#69
4.8
I 4.5 Im 5.0 P 4.5

Joint Army-Navy effort aims to field a containerized 150 kW high-energy laser to counter incoming cruise missiles. AI relevance is in the fire-control and target-discrimination loop rather than the lasing physics; the program is the highest-profile DEW push since Iron Beam reached IOC.

DEW · laser weapon · Army-Navy · counter-CM
#70
4.7
I 4.0 Im 4.5 P 5.5

Skye is raising on a thesis that an AI-aware iPhone home screen — a launcher layer that proxies user intent to the right app — will dominate the post-app era. Pre-launch funding plus the OpenAI phone leak the same day frame the home-screen-as-agent-surface fight.

Skye · AI launcher · iPhone · consumer
#71
4.4
I 4.0 Im 4.5 P 4.0

Anthropic named Theo Hourmouzis as its first General Manager for Australia and New Zealand, expanding regional GTM as Claude Enterprise pushes into the public-sector market in both countries. Routine GM appointment, but it's the first dedicated leader Anthropic has placed in ANZ.

Anthropic · executive hire · ANZ · go-to-market
#72
4.1
I 3.5 Im 4.5 P 4.0

Interior continues to grapple with a multi-decade probate-records backlog despite AI-assisted document processing. Useful counterweight to GenAI.mil triumph stories: the document-extraction problem on heterogeneous historical records remains harder than vendors imply.

DOI · probate · document AI · case study
Run summary
Items: 72
Multi-source: 18
Long-form ≥7.5: 4
Sources OK / attempted: 38 / 50
Top category: #10 · Industry