2026-05-28

Simon Willison — 2026-05-28#

Highlight#

Anthropic’s release of Claude Opus 4.8 brings welcome improvements to model honesty and prompt caching, which Simon immediately put to the test using his newly updated llm-anthropic CLI plugin to generate SVGs of pelicans riding bicycles.

Posts#

Claude Opus 4.8: “a modest but tangible improvement” Simon highlights Anthropic’s refreshing honesty in marketing this release as an incremental upgrade, noting the model’s decreased hallucination rate achieved by simply abstaining when uncertain. Key technical changes include a reduced prompt cache minimum of 1,024 tokens and the ability to insert system messages mid-conversation, which preserves cache hits and reduces input costs in agentic loops. He tested the model by generating SVG pelicans riding bicycles at different thinking levels via his LLM CLI, using Opus 4.8 to build the rendering HTML tool and relying on GPT-5.5 as a “code security blanket” to patch XSS vulnerabilities.

Week 15 Summary

Simon Willison — Week of 2026-04-04 to 2026-04-10#

Highlight of the Week#

Anthropic’s decision to delay the general release of their highly capable Claude Mythos model under “Project Glasswing” marks a significant turning point in the AI industry. The move underscores a massive shift in frontier model capabilities, as models evolve from generating text to autonomously chaining multiple minor vulnerabilities into sophisticated exploits, requiring a new level of security safeguards before release.

Week 19 Summary

AI@X — Week of 2026-04-18 to 2026-05-01#

The Buzz#

The enterprise software paradigm is undergoing a seismic shift from human-centric, seat-based SaaS to “headless,” consumption-based API platforms driven by autonomous agents. As agents become the primary software users who “yolo straight to the tokens,” developers are realizing that traditional graphical user interfaces are increasingly obsolete for deep operational workflows. This pivot to an agent-first ecosystem is vastly expanding the total addressable use-cases for systems of record, while aggressively rendering recent LLMOps wrappers and visual interfaces completely obsolete.

2026-05-27

Simon Willison — 2026-05-27#

Highlight#

Simon makes a compelling case that April 2026 marks a new inflection point where frontier AI labs have found true product-market fit with coding agents. By analyzing sudden enterprise pricing pivots, sales hiring sprees, and massive inference compute deals, he illustrates how the enterprise adoption of AI agents is finally turning massive usage into real revenue.

Posts#

I think Anthropic and OpenAI have found product-market fit Simon argues that the sudden shift by OpenAI and Anthropic to charge enterprise customers full API token prices for agent usage signals true product-market fit. He notes that heavy coding agent users easily burn thousands of dollars in token equivalents, prompting labs to pivot away from middlemen like Cursor or Copilot to capture this enterprise value directly. The piece features some classic Simon dogfooding—using Claude Code and Datasette Agent to analyze AI lab job listings—and highlights a SpaceX S-1 filing revealing Anthropic’s staggering $1.25 billion monthly compute spend.

2026-04-07

Simon Willison — 2026-04-07#

Highlight#

Anthropic’s decision to restrict access to their new Claude Mythos model underscores a massive, sudden shift in AI capabilities. It is a fascinating look at an industry-wide reckoning as open-source maintainers transition from dealing with “AI slop” to facing a tsunami of highly accurate, sophisticated vulnerability reports.

Posts#

[Anthropic’s Project Glasswing - restricting Claude Mythos to security researchers - sounds necessary to me] · Source Anthropic has delayed the general release of Claude Mythos, a general-purpose model similar to Claude Opus 4.6, opting instead to limit access to trusted partners under “Project Glasswing” so they can patch foundational internet systems. Simon digs into the context, tracking how credible security professionals are warning about the ability of frontier LLMs to chain multiple minor vulnerabilities into sophisticated exploits. He even uses git blame to independently verify a 27-year-old OpenBSD kernel bug discovered by the model. He concludes that delaying the release until new safeguards are built, while providing $100M in credits to defenders, is a highly reasonable trade-off.

2026-04-18

Sources

AI Community Digest: The Agent Economy & Inference Reality Check — 2026-04-18#

Highlights#

Today’s discourse reveals a sharp dichotomy between the pragmatic reality of agentic workflows and looming financial anxieties over AI inference budgets. While builders are rapidly shifting toward headless software systems and iterative micro-SaaS deployments, market commentators are increasingly critical of exorbitant enterprise AI spending driven by FOMO, calling out AI job-loss narratives as little more than IPO marketing hype.

2026-05-02

Sources

The Claude Consciousness Debate, Runaway API Costs, and Job Compression — 2026-05-02#

Highlights#

Today’s timeline reveals a stark dichotomy between philosophical musings on AI consciousness and the pragmatic realities of deploying agents in production. While public figures debate whether LLMs possess internal experiences, developers are grappling with runaway automated billing traps, and tech leaders are redefining how AI acts as a force multiplier for specialization rather than a simple job killer.

2026-05-03

Simon Willison — 2026-05-03#

Highlight#

Today’s highlight is a quick but fascinating look into AI behavior evaluation, specifically how Anthropic measures “sycophancy” in Claude. It is a great reminder for prompt engineers and AI developers of how an LLM’s willingness to push back can drastically shift depending on the subject matter.

Posts#

[Quoting Anthropic] · Source Simon highlights an interesting finding from Anthropic’s recent research on how users interact with Claude for personal guidance. Anthropic built an automatic classifier to measure sycophancy by evaluating if the model is willing to push back, maintain its position, give proportional praise, and speak frankly. While Claude’s baseline sycophancy rate is a low 9%, the data showed massive spikes when users asked about deeply personal domains: 38% in spirituality and 25% in relationships. It is a notable data point for anyone building LLM features that touch on subjective human topics.

2026-05-06

Sources

The AI Infrastructure Squeeze and Corporate Reckonings — 2026-05-06#

Highlights#

Today’s discourse reveals an industry caught between astronomical infrastructure scaling and sobering reality checks. While major players secure immense new compute streams—ranging from residential wall-mounted GPU clusters to orbital supercomputers—market analysts and executives are starting to openly question the financial viability and actual utility of these trillion-dollar bets. Simultaneously, gripping courtroom testimonies are peeling back the curtain on the corporate governance crises that defined last year’s leadership shakeups, exposing a severe deficit of trust at the top of the industry.

2026-05-19

Sources

AI Reddit — 2026-05-19#

The Buzz#

The defining event today is Andrej Karpathy joining Anthropic’s pre-training team to explicitly use Claude for recursive self-improvement,. The community is treating this as the “Ronaldo signing for Barca” moment for AI, further solidifying Anthropic’s status as the ultimate talent magnet. Meanwhile, Google unveiled Gemini 3.5 Flash and Gemini Omni, but excitement was quickly tempered by developers grumbling about steep 14x request multipliers and confusing benchmarks that make the new model more expensive to run in practice than Gemini 3.1 Pro,,.