Engineering @ Scale — 2026-05-29

Signal of the Day

Netflix’s approach to service topology reveals that no single data source provides a complete system dependency map at scale. By combining eBPF network flows for completeness, IPC metrics for endpoint context, and distributed tracing for actual runtime behavior, they built a real-time, multi-layer graph capable of sub-second traversal across thousands of microservices.

Deep Dives

GitHub Slashes Agent Workflow Token Spend up to 62% with Daily Audits and MCP Pruning · GitHub Scaling agentic CI workflows quickly introduces massive and unpredictable token consumption costs. GitHub addressed this by implementing daily “auditor” and “optimizer” agents to track spend regressions across models. By systematically pruning unused Model Context Protocol (MCP) tools and substituting expensive MCP calls with standard gh CLI commands, they achieved up to a 62% reduction in token costs. They also introduced a token-usage.jsonl file that tracks an “Effective Tokens” metric, providing a standardized way to observe and control operational LLM expenses.
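
The summary above does not say how “Effective Tokens” is defined or what a token-usage.jsonl row contains, so the following is only a minimal sketch of the auditing idea. The field names, the weights, and the 25% regression threshold are all assumptions made for illustration, not GitHub's schema.

```python
import json
from collections import defaultdict

# Hypothetical rows in the style of token-usage.jsonl (one JSON object per line).
# Field names are illustrative assumptions, not GitHub's actual schema.
SAMPLE_JSONL = """\
{"date": "2026-05-27", "workflow": "triage", "input_tokens": 1000, "cached_tokens": 400, "output_tokens": 200}
{"date": "2026-05-28", "workflow": "triage", "input_tokens": 3000, "cached_tokens": 400, "output_tokens": 300}
{"date": "2026-05-27", "workflow": "docs", "input_tokens": 500, "cached_tokens": 0, "output_tokens": 100}
{"date": "2026-05-28", "workflow": "docs", "input_tokens": 520, "cached_tokens": 0, "output_tokens": 100}
"""

CACHED_WEIGHT = 0.1  # assumption: cached input tokens count at 10%
OUTPUT_WEIGHT = 4.0  # assumption: output tokens count 4x an input token


def effective_tokens(record):
    """Collapse one usage record into a single cost-weighted number."""
    uncached = record["input_tokens"] - record["cached_tokens"]
    return (uncached
            + CACHED_WEIGHT * record["cached_tokens"]
            + OUTPUT_WEIGHT * record["output_tokens"])


def daily_totals(jsonl_text):
    """Sum effective tokens per (workflow, date)."""
    totals = defaultdict(float)
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            totals[(record["workflow"], record["date"])] += effective_tokens(record)
    return dict(totals)


def regressions(totals, previous_day, day, threshold=1.25):
    """Return workflows whose spend grew by more than `threshold`x day over day."""
    flagged = []
    for (workflow, date), value in totals.items():
        if date != day:
            continue
        baseline = totals.get((workflow, previous_day))
        if baseline and value / baseline > threshold:
            flagged.append(workflow)
    return sorted(flagged)


totals = daily_totals(SAMPLE_JSONL)
flagged = regressions(totals, "2026-05-27", "2026-05-28")
```

A daily job in this style would hand the flagged workflows to a human or to a follow-up optimizer step; the weighting is the part most worth replacing with real per-model pricing.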

Presentation: Building Evals for AI Adoption: From Principles to Practice · Multiple Companies Traditional software metrics are insufficient for modern AI architectures, leading to a hidden production risk termed “evaluation debt”. Drawing on experience from Twitter, Walmart, and Netflix, Mallika Rao outlines a five-layer evaluation stack that encompasses everything from base infrastructure to the user experience. The core problem is that LLMs can fail semantically without triggering traditional operational alarms. To combat these silent semantic failures, engineering leaders must implement a diagnostic maturity model that continuously assesses model quality in production.

AI-Assisted Migration Tool Helps Teams Move from ingress-nginx to Higress in Minutes · CNCF Migrating Kubernetes networking infrastructure is typically a high-risk, time-intensive manual process. The Cloud Native Computing Foundation showcased an AI-assisted tooling approach that successfully translated 60 ingress-nginx resources to Higress. This automated migration was completed in approximately 30 minutes, drastically reducing the operational toil and human error associated with rewriting gateway configurations. The approach signals a broader pattern in which AI is used to safely accelerate deterministic core infrastructure modernization, not just product feature work.

Comprehensive observability for Amazon SageMaker AI LLM inference: From GPU utilization to LLM quality · Amazon LLM inference introduces unpredictable token consumption and variable outputs, making traditional infrastructure monitoring inadequate. AWS proposes a dual-dimensional observability architecture separating “quantity” (infrastructure health, GPU compute/memory percentages) from “quality” (factual accuracy, safety, and relevance). They route enhanced operational metrics to one CloudWatch namespace (/aws/sagemaker/InferenceComponents/) and custom LLM-as-a-judge quality scores to another (/aws/sagemaker/inference-quality/). Using Amazon Managed Grafana, operators can correlate latency spikes or cost drivers directly with model degradation without requiring custom application-level instrumentation.
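
As a rough illustration of the two-namespace split, the sketch below sorts metric samples into one batch per namespace. The two namespace strings come from the summary above; the metric names, the `EndpointName` dimension, and the helper `build_batches` are hypothetical. Nothing is sent to AWS here: the function only builds dictionaries shaped like the arguments of CloudWatch's `PutMetricData` call.

```python
from datetime import datetime, timezone

INFRA_NAMESPACE = "/aws/sagemaker/InferenceComponents/"   # from the summary above
QUALITY_NAMESPACE = "/aws/sagemaker/inference-quality/"   # from the summary above

# Hypothetical metric names for the LLM-as-a-judge scores.
QUALITY_METRICS = {"FactualAccuracy", "Safety", "Relevance"}


def build_batches(endpoint_name, samples, now=None):
    """Group (name, value, unit) samples into one PutMetricData-style batch per namespace."""
    timestamp = now or datetime.now(timezone.utc)
    by_namespace = {INFRA_NAMESPACE: [], QUALITY_NAMESPACE: []}
    for name, value, unit in samples:
        namespace = QUALITY_NAMESPACE if name in QUALITY_METRICS else INFRA_NAMESPACE
        by_namespace[namespace].append({
            "MetricName": name,
            # Assumed dimension; a shared dimension is what lets dashboards join the two views.
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Timestamp": timestamp,
            "Value": float(value),
            "Unit": unit,
        })
    # Skip namespaces that received no samples.
    return [{"Namespace": ns, "MetricData": data}
            for ns, data in by_namespace.items() if data]


batches = build_batches("demo-endpoint", [
    ("GPUUtilization", 87.5, "Percent"),
    ("FactualAccuracy", 0.92, "None"),
])
```

With boto3 installed and credentials configured, each batch could be passed as keyword arguments to a CloudWatch client's `put_metric_data`. Keeping a common dimension across both namespaces is the design choice that makes latency and quality series easy to correlate on one dashboard.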

From Silos to Service Topology: Why Netflix Built a Real-Time Service Map · Netflix Netflix operates thousands of microservices where a single user action triggers a massive call cascade, making blast-radius analysis critical for incident resolution. Stale architecture diagrams fail in dynamic environments, so Netflix engineered a multi-layer topology graph sourced from three distinct telemetry streams. They capture eBPF flows for network-layer ground truth, IPC metrics for rich application-layer context, and distributed traces for runtime behavior. These streams are processed via multi-region Kafka and Apache Pekko Streams to resolve network intermediaries into direct app-to-app dependencies. By executing parallel queries across physically separated graph partitions, engineers achieve sub-second traversal to instantly visualize dependencies during outages.
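
To make the three-layer idea concrete, here is a small in-memory sketch: network flows are collapsed across intermediaries into direct app-to-app edges, IPC and trace observations annotate those edges, and a reverse traversal answers the blast-radius question. The service names, the sample observations, and every function name are invented for illustration; Netflix's pipeline runs on Kafka and Pekko Streams over partitioned graph storage, which this toy does not attempt to model.

```python
from collections import defaultdict, deque

# Hypothetical observations from the three telemetry streams.
EBPF_FLOWS = [("checkout", "lb-payments"), ("lb-payments", "payments"), ("payments", "ledger")]
IPC_CALLS = [("checkout", "payments", "POST /charge")]
TRACE_EDGES = [("checkout", "payments"), ("payments", "ledger")]
INTERMEDIARIES = {"lb-payments"}  # load balancers, proxies, and similar hops


def resolve_intermediaries(flows, intermediaries):
    """Collapse src -> proxy -> dst chains into direct src -> dst edges."""
    next_hops = defaultdict(set)
    for src, dst in flows:
        next_hops[src].add(dst)
    direct = set()
    for src, dst in flows:
        if src in intermediaries:
            continue
        pending, seen = [dst], set()
        while pending:
            node = pending.pop()
            if node not in intermediaries:
                direct.add((src, node))
            elif node not in seen:
                seen.add(node)
                pending.extend(next_hops[node])
    return direct


def build_graph(flows, ipc_calls, trace_edges, intermediaries):
    """Merge the three sources into one edge map, recording which layers saw each edge."""
    graph = defaultdict(lambda: {"layers": set(), "endpoints": set()})
    for edge in resolve_intermediaries(flows, intermediaries):
        graph[edge]["layers"].add("network")
    for src, dst, endpoint in ipc_calls:
        graph[(src, dst)]["layers"].add("ipc")
        graph[(src, dst)]["endpoints"].add(endpoint)
    for edge in trace_edges:
        graph[edge]["layers"].add("trace")
    return dict(graph)


def blast_radius(graph, failed_service):
    """Return every service that directly or transitively calls `failed_service`."""
    callers = defaultdict(set)
    for src, dst in graph:
        callers[dst].add(src)
    affected, queue = set(), deque([failed_service])
    while queue:
        for caller in callers[queue.popleft()]:
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


graph = build_graph(EBPF_FLOWS, IPC_CALLS, TRACE_EDGES, INTERMEDIARIES)
impacted = blast_radius(graph, "ledger")
```

Recording which layers observed an edge is useful in its own right: an edge seen only at the network layer may point to a dependency that is not instrumented for IPC metrics or tracing.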

High-Throughput Graph Abstraction at Netflix: Part I · Netflix To support OLTP graph workloads at the scale of 10 million operations per second across 650 TB of data, Netflix built a highly optimized Graph Abstraction layer. The architecture relies on an underlying Key-Value (KV) store that physically decouples edge links from edge properties to prevent wide-row partition hotspots. An in-memory strongly-typed schema allows the system to reject non-conforming data, optimize query planning by eliminating impossible traversal paths, and deduplicate bidirectional edges. The system relies on eventual consistency across regions, using Kafka for entropy repair, and so trades synchronous atomic writes for single-digit millisecond read latencies and massive throughput.
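
The three schema-driven behaviors described above can be sketched with two plain dictionaries standing in for separate KV namespaces. This is a toy under stated assumptions: the `SCHEMA` contents, the `type:id` node-id convention, and the `GraphStore` class are hypothetical, and the canonical-key trick is just one plausible way to deduplicate bidirectional edges, not necessarily the one Netflix uses.

```python
# Hypothetical schema: edge type -> (source node type, target node type, bidirectional?)
SCHEMA = {
    "WATCHED": ("user", "title", False),
    "FRIEND_OF": ("user", "user", True),
}


def node_type(node_id):
    """Assumed id convention: '<type>:<id>', for example 'user:42'."""
    return node_id.split(":", 1)[0]


class GraphStore:
    def __init__(self, schema):
        self.schema = schema
        self.links = {}  # adjacency namespace: (node, edge_type, direction) -> neighbor ids
        self.props = {}  # property namespace: canonical edge key -> properties

    def _edge_key(self, src, edge_type, dst):
        # Bidirectional edges share one property record regardless of write direction.
        if self.schema[edge_type][2]:
            src, dst = sorted((src, dst))
        return (src, edge_type, dst)

    def add_edge(self, src, edge_type, dst, **properties):
        rule = self.schema.get(edge_type)
        if rule is None or (node_type(src), node_type(dst)) != rule[:2]:
            raise ValueError(f"edge {src} -[{edge_type}]-> {dst} violates the schema")
        self.links.setdefault((src, edge_type, "out"), set()).add(dst)
        self.links.setdefault((dst, edge_type, "in"), set()).add(src)
        self.props[self._edge_key(src, edge_type, dst)] = properties

    def neighbors(self, node, edge_type):
        # Reads touch only the compact link namespace, never the property rows.
        found = set(self.links.get((node, edge_type, "out"), set()))
        if self.schema[edge_type][2]:
            found |= self.links.get((node, edge_type, "in"), set())
        return found

    def can_traverse(self, source_type, edge_type):
        """Planner check: can this edge type ever leave a node of this type?"""
        rule = self.schema.get(edge_type)
        if rule is None:
            return False
        return rule[0] == source_type or (rule[2] and rule[1] == source_type)


store = GraphStore(SCHEMA)
store.add_edge("user:a", "FRIEND_OF", "user:b", since=2024)
store.add_edge("user:b", "FRIEND_OF", "user:a", since=2024)
store.add_edge("user:a", "WATCHED", "title:x", progress=0.5)
```

Because link sets and property records live under different keys, a node with millions of edges grows a large adjacency entry without dragging every edge's properties into the same row.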

MUFG aims to become AI-native with OpenAI · MUFG Enterprise AI adoption in highly regulated financial sectors requires robust primitives that guarantee data privacy. MUFG is deploying ChatGPT Enterprise to systematically reform workflows and build an AI-native organization at scale. By leveraging enterprise-grade deployments, the bank aims to safely deliver new AI-powered financial services while maintaining strict institutional compliance.

Strengthening societal resilience with Rosalind Biodefense · OpenAI Deploying frontier models for high-risk biological domains necessitates strict access control and domain specialization. OpenAI introduced Rosalind Biodefense, restricting access to a specialized GPT-Rosalind model exclusively to vetted developers and U.S. government partners. This targeted deployment model demonstrates how organizations can safely accelerate public health and pandemic preparedness research without widely releasing sensitive dual-use capabilities.

A shared playbook for trustworthy third party evaluations · OpenAI Evaluating frontier AI models requires standardized methodologies that extend beyond internal provider benchmarks. OpenAI released guidance for third-party evaluators, establishing a framework to assess model capabilities, safeguards, and overall validity. This playbook offers engineering and safety teams a structured approach to independently verify the operational limits and security boundaries of advanced AI systems in production.

Boston Children’s uses AI to unlock new diagnoses · Boston Children’s Hospital In clinical environments, processing massive amounts of unstructured medical data is a severe operational bottleneck. Boston Children’s Hospital integrated OpenAI technology to alleviate operational burdens and assist clinical workflows. This implementation directly contributed to diagnosing over 40 rare disease cases, proving that LLMs can act as powerful data synthesis tools to enhance highly specialized human decision-making.

How Braintrust turns customer requests into code with Codex · Braintrust Accelerating the development loop from product request to deployed feature is a constant engineering challenge. Braintrust engineers utilized OpenAI’s Codex, powered by GPT-5.5, to drastically speed up experimentation and code generation. This integration demonstrates how teams are embedding advanced coding agents directly into their development cycles to scale output without linearly scaling headcount.

Team-wide provider allowlist on AI Gateway · Vercel Regulated teams often struggle to enforce AI vendor compliance across distributed development environments where engineers can arbitrarily route traffic. Vercel’s AI Gateway solves this by enforcing a centralized, team-wide provider allowlist at the gateway level, overriding any request-level or agent-modified filters. If an unapproved provider is requested, the gateway strictly blocks it or falls back to an allowed alternative, ensuring that new vendors aren’t silently introduced.
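
The enforcement rule described above reduces to a small decision function. The sketch below is a hypothetical model of that behavior, not Vercel's API: the provider names, the `select_provider` function, and the `allow_fallback` switch are assumptions made for illustration.

```python
class ProviderBlocked(Exception):
    """Raised when no requested provider is on the team allowlist."""


def select_provider(requested_order, team_allowlist, allow_fallback=True):
    """Pick a provider for a request, with the team allowlist taking precedence.

    `requested_order` is the request's own preference list. It can narrow the
    choice but never widen it beyond `team_allowlist`.
    """
    permitted = [p for p in requested_order if p in team_allowlist]
    if requested_order and requested_order[0] in team_allowlist:
        return requested_order[0]
    if allow_fallback and permitted:
        return permitted[0]  # first allowed alternative the request listed
    raise ProviderBlocked(f"none of {list(requested_order)} is approved for this team")


TEAM_ALLOWLIST = {"anthropic", "openai"}  # hypothetical team policy

first_choice = select_provider(["openai", "anthropic"], TEAM_ALLOWLIST)
fallback_choice = select_provider(["newvendor", "anthropic"], TEAM_ALLOWLIST)
```

Evaluating the policy at the gateway rather than in client code is what keeps an agent-modified request from introducing a new vendor: the request can only reorder providers the team has already approved.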

Port 8080 is now available in Vercel Sandboxes · Vercel Cloud developer environments frequently require binding to common web ports for seamless application previewing. Vercel Sandboxes freed up port 8080 by migrating their internal controller to port 23456. This allows engineers to bind port 8080 and expose it through an ingress domain, significantly reducing configuration friction for containerized or legacy applications that hardcode this port for local web traffic.

Building a real-time power outage map with Next.js on Vercel · Endeavour Energy Endeavour Energy’s legacy outage map repeatedly failed during storms because tight coupling between the frontend and the CMS left it unable to absorb 17x traffic spikes. They decoupled the architecture into three independent layers: a Next.js frontend on Vercel for edge caching, Supabase for real-time data, and their existing Sitecore CMS for content. Vercel Cron Jobs sync upstream data into Supabase every five minutes, eliminating the manual orchestration that previously delayed updates. This headless approach allowed them to achieve sub-second page loads during peak weather events without migrating the CMS or provisioning costly year-round failover hardware.

How Conductor moved parallel coding agents from the laptop to the cloud with Vercel Sandbox · Conductor Running multi-agent IDEs locally burns through developer hardware resources and halts when a laptop is closed. Conductor migrated its remote execution layer to Vercel Sandboxes, allowing engineers to spin up multiple parallel coding agents on isolated codebase branches. This model-agnostic cloud approach decouples agent computation from local constraints, enabling asynchronous execution while preserving the speed and feel of a local development environment.

Run Docker containers inside Vercel Sandbox · Vercel AI coding agents often require complex system dependencies to test the code they generate. Vercel Sandboxes now support running the Docker daemon natively, allowing agents to build containers, install packages, and spin up services like Redis or Postgres in an isolated cloud environment. By also introducing FUSE filesystem drivers and VPN clients, this feature bridges the gap between lightweight serverless compute and full-system virtualization for automated development agents.

Function invocations now billed per unit · Vercel Coarse, package-based billing for serverless compute often penalizes teams with bursty or highly variable traffic. Vercel has transitioned Function invocations to a strictly per-unit pricing model for Pro and Enterprise tiers. Priced at $0.0000006 per invocation, this granular model aligns infrastructure costs directly with actual consumption, providing better cost predictability for scaled applications.
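
The quoted unit price is easier to reason about per million invocations. The arithmetic below uses only the $0.0000006 figure from the summary above and ignores any included usage, discounts, or other line items, which the summary does not describe; the function name is illustrative.

```python
from decimal import Decimal

PRICE_PER_INVOCATION = Decimal("0.0000006")  # USD, from the summary above


def invocation_cost(invocations):
    """Cost in USD for a number of invocations at the flat per-unit price."""
    return PRICE_PER_INVOCATION * invocations


per_million = invocation_cost(1_000_000)       # 0.60 USD
bursty_month = invocation_cost(250_000_000)    # 150 USD
```

`Decimal` is used instead of floats so that very small unit prices multiply out exactly.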

Protecting against inference theft · Vercel With standard HTTP requests costing fractions of a cent but an LLM prompt costing up to $2, attackers are wrapping victim AI APIs in OpenAI-compatible adapters and reselling the stolen inference. Traditional IP rate limits fail because attackers fan out through residential proxies, and session-based auth only has to be defeated once, after which the attacker amortizes that cost across every subsequent request. Vercel implemented BotID deep analysis to run an invisible, client-side ML challenge and verify it on every request directly inside the Next.js route handler. By forcing the verification cost onto every call, the economic asymmetry is flipped, making inference theft unprofitable for the attacker.
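
The amortization argument can be written as a one-line cost model. All dollar figures in the sketch are made-up placeholders chosen only to show the shape of the effect; the summary above gives no attacker-side numbers, and the function names are illustrative.

```python
def attacker_cost_per_request(one_time_bypass_cost, per_request_challenge_cost, requests):
    """Average cost the attacker pays for each stolen request."""
    return one_time_bypass_cost / requests + per_request_challenge_cost


def is_profitable(resale_price, cost_per_request):
    return resale_price > cost_per_request


# Placeholder numbers, not measurements.
RESALE_PRICE = 0.05  # what the attacker charges per resold prompt

# Session auth defeated once: the fixed cost shrinks toward zero as volume grows.
session_only = attacker_cost_per_request(5.00, 0.0, requests=10_000)

# A challenge that must be solved on every call: the cost no longer amortizes.
per_request = attacker_cost_per_request(5.00, 0.08, requests=10_000)
```

Under these placeholder values the first scenario stays profitable at any meaningful volume, while the second loses money on every call regardless of volume, which is the asymmetry the per-request check is meant to create.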

Check out real-life AI prototypes from the Futures Lab. · Google Bridging the gap between academic research and applied engineering is crucial for expanding AI accessibility. Students at the University of Waterloo collaborated with the Futures Lab to develop real-world AI prototypes, including interactive sign language tutors. This underscores a growing industry trend of applying advanced AI models to niche, high-impact accessibility and educational use cases.

11 demos of Gemini Omni and Gemini 3.5 in action · Google Visualizing model capabilities is key for engineers determining technology adoption paths. Google demonstrated the capabilities of its newly announced Gemini Omni and Gemini 3.5 models through technical videos at I/O 2026. These demos provide practical insights into the latency, reasoning thresholds, and multi-modal handling of the latest generation of Gemini models.

Take our I/O 2026 quiz, vibe coded in Google AI Studio. · Google Rapid prototyping of web applications is increasingly driven by AI-native tools rather than traditional scaffolding. Google showcased this by using Google AI Studio to entirely “vibe code” an interactive quiz application. This highlights how the developer experience is shifting toward higher-level, prompt-driven application generation for bounded, low-complexity frontends.

Gemini Flash Gets Pricey, AI Act Delays, Agents Drive Online Traffic · Multiple Companies The landscape of AI engineering is rapidly shifting across talent, model architecture, and infrastructure. The rise of the “Forward Deployed Engineer” is accelerating to help clients tune agentic workflows, though generalist AI Engineers remain dominant. On the model side, Google’s new Mixture-of-Experts Gemini 3.5 Flash achieved top-tier agentic benchmarking but significantly raised the cost-per-token, challenging the notion that “Flash” tier models will remain strictly low-cost. Meanwhile, AI-driven internet traffic tripled in 2025, heavily driven by crawlers and agentic browsers hitting e-commerce pages, fundamentally altering traffic patterns and security postures for infrastructure teams.

Planning Generated Images In Stages · Meta / UCSD / WPI / Northwestern Diffusion models often fail at precise spatial relationships because they generate whole images simultaneously. Meta and academic researchers fine-tuned the BAGEL-7B model to break composition into discrete flow-matching stages: plan, sketch, inspect, and refine. By generating an element, utilizing a VLM to check it against the prompt, and issuing corrective instructions in a loop, they achieved an 83% accuracy score on the GenEval benchmark. This staged generation approach mimics “chain-of-thought” reasoning for visual domains, proving that iterative feedback loops yield higher fidelity than scaling dataset size alone.
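
Stripped of the model itself, the staged approach is a generate, inspect, correct loop per planned element. The sketch below shows only that control flow. The "canvas" is a dictionary rather than an image, and `toy_generate` and `toy_inspect` are stand-ins invented here for the image model and the VLM checker; none of this reflects the actual BAGEL-7B fine-tuning or its flow-matching stages.

```python
def staged_generate(plan, generate, inspect, max_rounds=3):
    """Add planned elements one at a time, retrying each with corrective feedback."""
    canvas = {}
    for element in plan:
        instruction = element
        for _ in range(max_rounds):
            canvas = generate(canvas, instruction)
            feedback = inspect(canvas, element)
            if feedback is None:        # checker is satisfied with this element
                break
            instruction = feedback      # refine using the corrective instruction
    return canvas


def toy_generate(canvas, instruction):
    # Stand-in for the image model: a first attempt lands in the center,
    # a corrective instruction lands where it was asked to.
    position = instruction["where"] if instruction.get("corrective") else "center"
    return {**canvas, instruction["name"]: position}


def toy_inspect(canvas, element):
    # Stand-in for the VLM check against the prompt.
    if canvas.get(element["name"]) == element["where"]:
        return None
    return {**element, "corrective": True}


PLAN = [{"name": "cat", "where": "left"}, {"name": "dog", "where": "right"}]
result = staged_generate(PLAN, toy_generate, toy_inspect)
```

Bounding the retries with `max_rounds` matters in practice: a checker that is never satisfied would otherwise stall generation on a single element.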

Open Source Ecosystems · Anthropic / Stainless While open protocols like the Model Context Protocol (MCP) attempt to standardize agentic tool use, platforms are actively capturing the developer experience layer above them. Anthropic’s acquisition of Stainless—a dominant tool for converting business APIs into MCP-compliant SDKs—illustrates this strategy of “complement capture”. Because baseline model capabilities are converging, companies are shifting their moats toward developer tooling and connector reliability. For engineering leaders, this highlights that protocol openness does not guarantee immunity from vendor lock-in if the critical-path integration layers and SDK generators are consolidated by private platform owners.

Patterns Across Companies

A dominant theme this period is the necessity of decoupling and specialized observability to handle the unpredictability of AI workloads. AWS, Vercel, and GitHub are all treating LLM behaviors (cost spikes, semantic degradation, and inference theft) as fundamental infrastructure problems requiring strict, per-request verification and distinct metric namespaces. Additionally, the shift toward running agentic workflows is forcing execution infrastructure changes everywhere—from Vercel supporting Docker and persistent states in Sandboxes to Anthropic acquiring API-generation layers—indicating that the immediate battleground has moved from the models themselves to the developer execution and integration environments.