AI Radar #28

Without instruments, we’re blind

Jun 04, 2026

Cross-section of an evaluation chamber rendered as a precision technical illustration in slate blue on off-white. A large sealed cabinet-like machine sits on a raised test platform, surrounded by telescopes, sensor arms, and instruments on tripods and gantries, all trained on it. Technicians watch gauges from consoles and an observation window at left. From an open hatch on the machine’s roof, a slender mast raises a small telescope of its own—highlighted in amber-gold—aimed directly back at the observation window, its sight line crossing the chamber.

I’ll be in Berkeley this week for LessOnline and Summer Camp. If you’re there, I’d love to chat: give me a yell on Waypoint.

To make good decisions in the coming years, we’ll need accurate information about AI capabilities and their trajectory. But as models get smarter, it gets harder to measure how capable (and how dangerous) they are.

We need more sophisticated evaluations, but each domain has different requirements. Coding benchmarks need to incorporate more complex tasks and better verification of success, as the new DeepSWE does. Finding misaligned behavior in models that know when they’re being evaluated requires that evaluators be given deep access to model internals. And accurately measuring biorisk relies on carefully combining expensive wet lab experiments with cheaper but less definitive traditional evaluation techniques.

Top pick

SecureBio: The role of evals in the biorisk evidence hierarchy

Don’t let the title fool you—SecureBio’s latest post is an outstanding overview of the full landscape of how biorisk is measured.

Watercolor infographic titled ”AI Biorisk Evidence Hierarchy,” depicting a winding path climbing a tiered mountain toward an erupting volcano crowned with a skull. Four numbered map-pin markers label the levels from base to summit: (1) Theoretical Arguments, shown as an open notebook and coffee cup on a grassy meadow; (2) Evals, on a blocky blue plateau with glowing data screens; (3) Real-World Uplift Studies, on a mid-mountain ledge with laboratory buildings; and (4) Incidents & Near-Misses, near the dangerous volcanic peak. Flanking arrows indicate cost increasing left and inferential strength increasing right as evidence climbs the hierarchy. — Let’s avoid #4, shall we?

Biorisk is harder to evaluate than cyber risk, in part because the threat model is more complicated. We care about whether AI can help the next Ted Kaczynski create a crude bioweapon, but we also care about whether it can help North Korea’s bioweapons program produce more sophisticated pathogens than they otherwise could. Different threat actors require different evaluations: an AI that provides substantial uplift to a lone actor might be useless to a nation-state, and vice versa.

Adding to the challenge, traditional evaluations can only tell us so much: the best way to measure uplift is to conduct wet lab experiments, which are immensely costly and time-consuming. It isn’t feasible to conduct a full round of wet lab experiments with every new model, so we need to deploy a thoughtful mix of both approaches.

News

Here’s the AI executive order

After some confusing back and forth, Trump has signed an executive order on AI. Zvi’s coverage is excellent: you should read at least “What Does The Executive Order Do” and “We Have Concerns.”

This is better than nothing, and it’s better than some of the alternatives, but it isn’t great. I’d much rather see a mandatory but transparent vetting process than this “voluntary” but secretive and arbitrary regime. It’s hard to imagine a system better suited to favoritism and weaponization than this one.

Opus 4.8

Opus 4.8 is here. It’s a modest improvement overall, though it scores notably better on honesty and shows less misaligned behavior:

Bar chart titled ”Misaligned behavior” comparing four AI models on a 1–10 score axis (displayed range 1.0–2.6). Sonnet 4.6 scores highest at ~2.58 (blue), followed by Opus 4.7 at ~2.48 (green), then Opus 4.8 at ~1.83 (orange-red), with Mythos Preview lowest at ~1.77 (amber). All bars include error bars indicating uncertainty in the estimates. — Baby, I swear I’ve changed. From now on, I’m gonna lie to you 50% less.

Zvi reviews the system card, model welfare, and capabilities and reactions:

my overall perspective is that it is a good model, sir, an incremental improvement over Opus 4.7 and the new presumptive best publicly available model in the world, but not a sea change

The biggest news in the Opus 4.8 announcement isn’t about Opus:

we plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Capabilities and forecasts

Comparing open and closed models

Epoch reviews the Epoch Capabilities Index (ECI) scores of leading open and closed models over time and finds that open models trail by about four months:

Step-chart of the Epoch Capabilities Index score from January 2023 to early 2026, with two staircase lines — teal for closed-weights models and pink for open-weights models — both rising steadily. Closed-weights models (Claude 3.5 Sonnet, Gemini 2.5 Pro, o1, GPT-5 series) climb from ~126 to ~160; open-weights models (Llama 2-70B, Mixtral, DeepSeek-R1, and others) trail slightly behind, rising from ~109 to ~151, with the gap narrowing by 2026. — Yes, but…

Something’s missing here. The open models benchmark well, but they lag more than four months behind the frontier on complex agentic tasks as well as intangible but important qualities. ECI is a valuable metric, but it doesn’t capture the full depth of model capabilities.

Nathan Lambert, a strong advocate for open models, is optimistic in the long run but sees significant shortcomings in current open models:

Many businesses want to switch to open models but the models today are not good enough in out-of-distribution tasks.

Project Glasswing: what Mythos showed us

Cloudflare shares a detailed analysis of using Mythos Preview for security work:

Mythos Preview is a real step forward, and it’s worth saying that plainly before getting into anything else. We’ve been running models against our code for a while now, and the jump from what was possible with previous general-purpose frontier models to what Mythos Preview does today is not just a refinement of what came before.

AI-assisted vulnerability discovery is harder than it sounds: you can’t just point a good model (even Mythos) at a codebase and tell it to find bugs.

Flow diagram titled ”Vulnerability discovery harness” showing eight numbered orange-outlined stages in sequence: Recon → Hunt → Validate → Dedupe → Trace → Feedback → Report, with a side branch to a fourth stage, Gapfill, connected by dashed arrows looping back into Hunt and Validate and forward to Feedback. — A good model isn’t enough

Some thoughts on AI detectors

Kiran Garimella’s analysis of AI detectors is a useful review of the state of the art in AI detection as well as a thoughtful exploration of what it is—and isn’t—good for:

Right now, “is this human?” is being used as a proxy for “is this any good?”. The proxy is shaky in both directions.

OpenAI’s “milestone” math breakthrough played to AI’s strengths

Kai Williams dives into the most important AI math achievement to date, OpenAI’s disproof of the Erdős unit distance conjecture.

The article is a solid explanation of the conjecture itself, as well as of what the result tells us about AI’s current capabilities.

A mathematical diagram showing hundreds of black dots arranged in a quasicrystalline or Poisson-disk pattern filling a large dashed circle on a grid background, with a red starburst at the center — lines radiating outward to the nearest-neighbor dots — highlighting the local spacing geometry of the point distribution. — I’ll take your word for it

DeepSWE

DeepSWE is a new agentic coding benchmark that fixes some significant issues in many current evaluations. The developers claim it is free of contamination in the training data and has more complex and diverse tasks, as well as better verification of task completion. It also provides better differentiation between models, which is great but means it might saturate quickly.

The release announcement has an in-depth discussion of what makes a good modern coding benchmark.

Alignment and interpretability

Safety evaluations need white box access

AIs are alarmingly good at telling when they’re being evaluated. To keep up, Apollo Research and AVERI argue that some evaluators will need deeper access to the models they’re evaluating (summary thread):

Establish appropriate state of the art access standards. Evaluators should receive, at minimum, raw chain of thought access, fine-tuning access, and access to reduced-mitigation model variants, all relevant tools and intermediate activations.
Provide access to steerable evaluation-awareness endpoints. […]
Apply access parity between internal and external evaluators. […]

Full access parity between internal and external evaluators sounds ambitious, but that list otherwise seems entirely reasonable. Evaluation awareness presents a major challenge for loss of control evaluations: I don’t see how we can get useful results without giving evaluators deep access to the internals.

Time to take AI consciousness seriously

Samuel Hammond reviews the functional similarities between LLMs and the human brain and argues that consciousness will likely arise in AIs as they become increasingly capable:

What I find harder to imagine is an unconscious AI that is as capable as humans at doing things for which consciousness is functionally load-bearing. The same universality argument that explains functional convergence between brains and neural networks gives us good reason to expect that deep learning systems, facing similar problems and optimization pressures, will converge on something functionally analogous to whatever consciousness does for us.

Something like Attention Schema Theory seems like the most plausible explanation of consciousness: we evolved to be conscious because consciousness is an effective tool for solving important problems facing intelligent, agentic minds. My default expectation is therefore that AI is likely to spontaneously develop some form of consciousness as it approaches AGI.

Risks

Strengthening societal resilience with Rosalind Biodefense

OpenAI is launching Rosalind Biodefense, an initiative to accelerate the development of technologies that can defend against new pathogens.

The initiative goes hand in hand with GPT-Rosalind, a model tuned specifically for life sciences work. Like Mythos and GPT-Cyber, GPT-Rosalind is in limited release to trusted partners.

This is great: I’m excited about the broader public health benefits as well as the head start on defending against AI-accelerated threats. AI’s biorisk capabilities lag behind its cyber capabilities, but engineered pathogens are more harmful and harder to defend against than cyber threats. We need all the head start we can get.

How we contain Claude across products

A central challenge of the agentic era is that as agents become more capable of doing work, they necessarily become more capable of causing damage. Anthropic shares what they’ve learned about how to contain the dangers of modern agents:

Yet as agents become capable of doing work that once required a person or even a team, the cost of not deploying grows large enough that the risk-reward calculation tips heavily toward adoption, as long as products can be made safe. The engineering question becomes how to cap the blast radius.

Using AI

The solution might be cancelling my AI subscription

David Wilson has concerns about agentic coding:

this technology is horrific for attention. It’s a thermonuclear ADHD amplifier and I have seen the same effect in every single one of my adult friends. Folk running 3 screens simultaneously working on totally unrelated “projects” they have little hope of maintaining, and such little commitment to the outcome that the time is obviously wasted.

Simon Willison can relate:

I’m hopeful that the critical skill to develop here is discipline. That’s not great news for me: I’ve been trying to figure that one out for decades!

The problem, for many people, is real. But the solution isn’t to stop using AI: it is, as Simon intuits, to develop the ability to figure out what matters, and to do that rather than what is most accessible in the moment.

Cal Newport’s Deep Work predates LLMs, but it’s by far my favorite guide to productivity—now more than ever.

Use AI This Election

Scott Alexander wants you to use AI to improve your voting research. I mean, obviously.

The prompt at the very beginning is worth stealing, and the analysis at the end adds valuable detail. You can probably skip the bulk of the piece, which is Claude’s analysis of various California candidates.

Let the agents democratize open source

DHH isn’t a fan of recent efforts to ban AI from open source:

Projects big and small have been erecting new participation barriers on contributions aided by AI to preserve the privileges of the old programmer guilds.
This is a protectionist tale as old as time.
And the justifications are just as tired: It’s about quality! It’s about attribution! It’s about workers! Spare me. It’s about you, your insecurities, and your privileges.

We are in a (probably) brief window of time where many open source projects are drowning in AI-generated slop. In some cases, a temporary ban may be the best feasible solution.

But reading some of the actual policies, it’s clear many of them are motivated by animus toward AI rather than a desire to build great software. Some of them will evolve, and the rest are ngmi.

People and data

Demis Hassabis talks with Harry Stebbings

Demis Hassabis talks with Harry Stebbings about the state of AI halfway through 2026. It’s an engaging conversation that fits a lot of information into half an hour. If you’re looking for exciting quotes, you won’t be disappointed:

Demis sees a “very good chance” of AGI within the next 5 years
He likes to frame AGI as “ten times the Industrial Revolution, at ten times the speed”

People often list continual learning as the big breakthrough we know we need to reach AGI—Demis agrees, but also sees different memory systems, long-term planning, and consistency as likely candidates for necessary breakthroughs.

As open models exhaust what they can do with the public research literature, he expects the frontier labs that are doing cutting-edge research to pull further ahead—that seems likely, although we haven’t seen much evidence of it so far.

Google DeepMind History

Logan Kilpatrick talks with Jeff Dean, Koray Kavukcuoglu, Noam Shazeer, and Oriol Vinyals about the origins of Gemini. It’s a fun chat, with a mix of history lesson and technical information—I didn’t realize the name Gemini comes from unifying two separate AI projects.

While we’re in the history books, Andrew Trask argues that keeping Google DeepMind in London was one of Demis’s smartest choices:

Demis’s decision meant that for a 5-7 year period, every senior AI researcher in Europe who wanted to join one of the new/big AGI labs... but didn’t want to be 5000 miles away from their home/family/culture... joined DeepMind.
