Engineering @ Scale — 2026-04-02

Signal of the Day

The rise of AI crawler traffic is fundamentally breaking traditional LRU (Least Recently Used) cache algorithms. Cloudflare observed that AI agents sweeping sites for RAG and model training maintain a 70-100% unique access ratio, churning the long tail of cached content and severely degrading cache performance for human users. The fix requires moving away from LRU toward algorithms such as S3FIFO and splitting CDNs into separate cache tiers based on latency tolerance.
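
The failure mode is easy to reproduce. Below is a toy simulation (all sizes, key names, and traffic shapes are illustrative, not Cloudflare's data): a small LRU cache serves a human "hot set" of repeated requests, with and without an interleaved crawler whose accesses are nearly 100% unique.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used key when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        """Return True on a hit; on a miss, admit the key, evicting the LRU entry."""
        if key in self.store:
            self.store.move_to_end(key)  # mark as most recently used
            return True
        self.store[key] = True
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used
        return False

def human_hit_rate(rounds=20, hot_set=80, capacity=100, crawler=False):
    """Hit rate seen by human traffic, optionally with crawler churn."""
    cache = LRUCache(capacity)
    hits = total = 0
    for i in range(rounds):
        for k in range(hot_set):
            hits += cache.get(f"hot-{k}")        # human: repeated hot keys
            total += 1
            if crawler:
                cache.get(f"crawl-{i}-{k}")      # crawler: every key unique
    return hits / total

baseline = human_hit_rate()              # hot set fits: near-perfect hits
churned = human_hit_rate(crawler=True)   # unique keys evict the hot set
```

With the crawler interleaved, every unique miss evicts a hot entry before the human traffic can reuse it, so the human-facing hit rate collapses even though the cache is the same size.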

Deep Dives

KernelEvolve: How Meta’s Ranking Engineer Agent Optimizes AI Infrastructure · Meta Hand-tuning performance kernels across a heterogeneous fleet (NVIDIA, AMD, custom MTIA silicon) for rapidly evolving ML architectures is no longer scalable for human engineers. Meta built KernelEvolve to treat low-level kernel optimization as a Monte Carlo tree search problem. The agent continuously generates candidates via an LLM, compiles them, profiles them on hardware (capturing signals such as memory constraints), and feeds the diagnostics back into the prompt dynamically. Instead of using static prompts, Meta relies on a retrieval-augmented knowledge base that updates itself using in-context reinforcement learning to refine optimization patterns over time. The takeaway is that treating infrastructure optimization as an agentic search space is becoming a requirement to leverage diverse AI accelerators effectively.
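
The generate → compile → profile → feedback loop can be sketched as below. This is a heavily simplified stand-in, not Meta's system: `propose_candidate` replaces the LLM call, the "kernel" is reduced to a single tile-size parameter, the profiler is a synthetic cost model, and the search is a greedy hill-climb rather than Monte Carlo tree search.

```python
import random

def profile(tile_size):
    """Synthetic profiler: pretend performance peaks at tile_size == 64."""
    return abs(tile_size - 64)  # lower cost is better

def propose_candidate(tile_size, diagnostics, rng):
    """Stand-in for the LLM: a real agent would put `diagnostics` into the
    prompt; here we just nudge the parameter in a random direction."""
    return max(1, tile_size + rng.choice([-16, -8, 8, 16]))

def optimize(iterations=200, seed=0):
    rng = random.Random(seed)
    best, best_cost = 8, profile(8)
    diagnostics = ""
    for _ in range(iterations):
        cand = propose_candidate(best, diagnostics, rng)
        cost = profile(cand)                       # "compile and run on hardware"
        diagnostics = f"tile={cand} cost={cost}"   # fed back into the next prompt
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost

best_tile, best_cost = optimize()
```

The structural point survives the simplification: the optimizer never needs an analytical model of the hardware, only the ability to run candidates and feed measurements back into the generator.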

Smarter Live Streaming at Scale: Rolling Out VBR for All Netflix Live Events · Netflix Netflix migrated its live streaming pipeline from Constant Bitrate (CBR) to Variable Bitrate (VBR) to reduce overall network traffic, but the shift broke their server capacity planning. VBR drops bitrates significantly during static scenes, which tricked Netflix’s traffic-steering logic into over-admitting client sessions to seemingly under-utilized servers. When visually complex scenes occurred, the sudden bandwidth spikes caused cascading packet drops and buffering. To mitigate this, Netflix separated current traffic metrics from capacity reservations, choosing to reserve server headroom based on each stream’s nominal bitrate rather than its current utilization. For any team moving to variable-load systems, this is a textbook lesson in reserving capacity against potential spikes rather than instantaneous load to prevent system destabilization.
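
The fix reduces to a small admission-control rule: admit sessions against each stream's nominal bitrate reservation, never its instantaneous VBR usage. The sketch below is illustrative (the class, field names, and numbers are assumptions, not Netflix's actual steering logic):

```python
from dataclasses import dataclass

@dataclass
class Server:
    capacity_mbps: float
    reserved_mbps: float = 0.0   # sum of nominal bitrates of admitted sessions
    current_mbps: float = 0.0    # what VBR is actually sending right now

    def admit(self, nominal_mbps):
        """Reserve headroom for the spike the stream *could* produce,
        even if a static scene means it is sending far less right now."""
        if self.reserved_mbps + nominal_mbps > self.capacity_mbps:
            return False
        self.reserved_mbps += nominal_mbps
        return True

server = Server(capacity_mbps=100)
# During a static scene, each 8 Mbps (nominal) stream may send only 2 Mbps.
admitted = sum(server.admit(nominal_mbps=8) for _ in range(20))
```

Steering on `current_mbps` would have admitted all 20 sessions (20 × 2 Mbps looks comfortably under capacity) and then faced a 160 Mbps spike on the next complex scene; reserving against nominal bitrate admits only 12 and keeps the spike within capacity.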

Improving storage efficiency in Magic Pocket, our immutable blob store · Dropbox A new background erasure-coding service unintentionally created a long tail of severely under-filled volumes in Dropbox’s exabyte-scale immutable blob store, inflating storage overhead. Their steady-state garbage collection strategy (L1) was designed to top off already dense volumes and lacked the throughput to efficiently clear these sparse volumes. Dropbox implemented a dynamic programming-based strategy (L2) to batch and pack moderately under-filled volumes, alongside a streaming pipeline (L3) to completely rewrite the sparsest tail. The L3 strategy aggressively reclaims raw storage space, but requires heavy metadata rewrites to generate new blob identifiers. This highlights that exabyte-scale garbage collection cannot rely on a single heuristic; systems require layered compaction strategies targeting different segments of the fragmentation distribution while carefully managing the load on downstream metadata clusters.
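
The L2 idea of batching moderately under-filled volumes can be illustrated with a subset-sum-style dynamic program: pick the combination of source volumes whose live bytes come closest to filling one target volume. This is a toy stand-in for Dropbox's packing strategy, with invented sizes:

```python
def best_pack(sizes, capacity):
    """Return (indices, total) for the subset of source volumes whose
    combined size is the largest achievable without exceeding `capacity`
    (0/1 knapsack where value == size)."""
    # reachable[total] = one list of indices achieving that total size
    reachable = {0: []}
    for i, size in enumerate(sizes):
        # Snapshot items so volume i is considered at most once per subset.
        for total, picked in list(reachable.items()):
            new_total = total + size
            if new_total <= capacity and new_total not in reachable:
                reachable[new_total] = picked + [i]
    best_total = max(reachable)
    return reachable[best_total], best_total

# Live bytes (GB) left in five sparse volumes; a target volume holds 100 GB.
picked, total = best_pack([62, 55, 38, 27, 9], capacity=100)
```

Here the DP finds that volumes 0 and 2 (62 + 38 GB) fill the target exactly, leaving the remaining sparse volumes for the next batch; greedy first-fit would have paired 62 with 27 and stranded more slack.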

Optimizing Vercel Sandbox snapshots · Vercel Vercel’s Firecracker microVM sandboxes relied on S3-backed disk snapshots to persist state, but p75 restore times were unacceptably slow at over 40 seconds. Vercel first optimized the network path by replacing sequential S3 downloads with parallel HTTP Range requests and multi-threaded decompression, streaming bytes directly into the decompressor to avoid intermediate disk writes. However, to achieve sub-second boot times, they ultimately implemented an LRU cache storing the uncompressed disk images directly on local NVMe storage. By accepting higher local storage costs, they achieved a 95% cache hit rate that bypasses S3 entirely. Network and CPU optimizations are essential, but skipping the network entirely via local caching remains the only reliable path to sub-second VM instantiation.
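
The parallel range-read step can be sketched as follows: split the object into byte ranges, fetch chunks concurrently, and reassemble them in order. The `fetch_range` stub stands in for an HTTP GET with a `Range` header against S3; everything here is an illustrative assumption, not Vercel's code.

```python
import concurrent.futures

BLOB = bytes(range(256)) * 1024  # pretend this object lives in S3

def fetch_range(start, end):
    """Stub for: GET url with header Range: bytes={start}-{end-1}."""
    return BLOB[start:end]

def parallel_download(size, chunk_size=64 * 1024, workers=8):
    """Fetch [0, size) as concurrent chunk reads, reassembled in order."""
    ranges = [(off, min(off + chunk_size, size))
              for off in range(0, size, chunk_size)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda r: fetch_range(*r), ranges)
    return b"".join(chunks)  # map() yields results in submission order

data = parallel_download(len(BLOB))
```

In the real pipeline the joined bytes would be streamed straight into the decompressor rather than buffered, and the NVMe LRU cache sits in front of this path so that 95% of restores never reach it at all.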

Scaling seismic foundation models on AWS: Distributed training with Amazon SageMaker HyperPod and expanding context windows · TGS Training Vision Transformers on massive 3D seismic data demanded throughput the storage layer could not sustain, leaving GPUs idle. After migrating to 16 nodes of H200 GPUs, TGS evaluated DeepSpeed ZeRO-2, ZeRO-3, and FSDP2 for sharding distributed training state, ultimately selecting ZeRO-2 to minimize communication overhead despite its lower memory efficiency. Surprisingly, they found that high-speed distributed filesystems (like FSx for Lustre) became a bottleneck as the cluster grew due to shared volume limits. Instead, they opted to stream training data directly from S3 using multi-threaded connections, cutting storage costs by 90% while sustaining 80 GB/s of cluster-wide throughput. For massive distributed training workloads, direct object storage streaming scales more predictably per node than provisioned shared file systems.
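
The mechanism for hiding object-storage latency is a bounded prefetch pipeline: producer threads stream shards into a queue so the training loop never waits on I/O. A minimal sketch, assuming a `download_shard` stand-in for a multi-threaded S3 GET (names and sizes are illustrative):

```python
import queue
import threading
import time

def download_shard(shard_id):
    """Stub for a multi-threaded S3 GET of one training shard."""
    time.sleep(0.01)  # simulate network latency
    return f"shard-{shard_id}"

def prefetch(shard_ids, workers=4, depth=8):
    """Yield shards as they arrive; `depth` bounds buffered shards in RAM."""
    q = queue.Queue(maxsize=depth)
    it = iter(shard_ids)
    lock = threading.Lock()

    def producer():
        while True:
            with lock:                 # iterators are not thread-safe
                sid = next(it, None)
            if sid is None:
                return
            q.put(download_shard(sid))  # blocks when the buffer is full

    threads = [threading.Thread(target=producer) for _ in range(workers)]
    for t in threads:
        t.start()
    for _ in shard_ids:                # consumer: one item per shard
        yield q.get()
    for t in threads:
        t.join()

shards = list(prefetch(range(16)))
```

Because each consumer node opens its own connections to S3, aggregate throughput grows with the node count instead of contending on a shared filesystem volume; the bounded queue keeps memory use constant.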

Persist session state with filesystem configuration and execute shell commands · Amazon Web Services AI coding agents often lose their workspace (installed dependencies, generated code) when ephemeral microVMs terminate, and routing deterministic commands (like npm test) back through the LLM adds unnecessary cost, latency, and non-determinism. Amazon Bedrock AgentCore solved this by introducing managed session storage to persist the workspace directory across VM stop/resume cycles. Rather than using sidecars or orchestration logic outside the runtime, they deliberately run deterministic shell commands inside the same microVM container as the agent. In modern agentic architectures, the best practice is separating the LLM’s probabilistic reasoning loop from deterministic execution, binding them together via a shared, persistent filesystem.
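
The separation of probabilistic reasoning from deterministic execution can be sketched as a simple dispatch: known deterministic commands run directly in the persistent workspace, and only open-ended steps reach the model. Everything here (the command allowlist, the stubbed LLM branch, the temp-dir workspace) is an illustrative assumption, not the AgentCore API.

```python
import pathlib
import subprocess
import tempfile

DETERMINISTIC = {"ls", "cat", "python3"}  # commands that never need the LLM

def run_step(step, workspace):
    """Route a step: shell out for deterministic commands, else 'ask the LLM'."""
    cmd = step.split()
    if cmd[0] in DETERMINISTIC:
        # Deterministic path: cheap, fast, reproducible; no model call.
        return subprocess.run(cmd, cwd=workspace, capture_output=True,
                              text=True).stdout
    # Probabilistic path: a real agent would invoke the model here.
    return f"[llm] reasoning about: {step}"

workspace = pathlib.Path(tempfile.mkdtemp())       # stands in for the managed,
(workspace / "notes.txt").write_text("persisted")  # session-persistent volume

out = run_step("cat notes.txt", workspace)
plan = run_step("refactor the parser", workspace)
```

Both paths read and write the same workspace directory, which is the binding the article describes: the filesystem, not the prompt, carries state between the reasoning loop and the deterministic tools.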

Patterns Across Companies

A recurring theme this cycle is how AI and massive-scale workloads are breaking legacy infrastructure assumptions. Cloudflare discovered AI crawlers defeat LRU caches by churning long-tail content. Netflix found that optimizing bitrates with VBR broke their server capacity heuristics. Dropbox found that new erasure coding paths fragmented their immutable blob storage to the point where standard garbage collection failed. Additionally, we are seeing a shift away from complex intermediate layers in favor of direct access: Vercel eliminated intermediate disk writes to speed up VM snapshots, and TGS achieved linear scaling by streaming training data directly from S3 rather than relying on shared Lustre filesystems.