
Engineering @ Scale — 2026-04-27

Signal of the Day

Amazon bridged the semantic gap in product search by using large LLMs offline to generate a 29-million-edge commonsense knowledge graph, then instruction-tuning a smaller, more efficient model (COSMO-LM) for real-time production serving. The takeaway: treat frontier models as data synthesizers rather than as production-serving endpoints.

Deep Dives

How Amazon Uses LLMs to Recommend Products · Amazon

Traditional recommendation systems match keyword text and purchase history, so they fail when a query requires human “common sense” (e.g., matching “shoes for pregnant women” with “slip-resistant”). Amazon addressed this by feeding query-purchase behavior into OPT-175B to generate millions of candidate explanations, which were then heavily filtered through a generalization classifier. Because running a 175B-parameter model in real time is prohibitively expensive, Amazon distilled the filtered data into a structured knowledge graph and instruction-tuned a smaller LLaMA-based model (COSMO-LM) for low-latency serving. The generalizable pattern for GenAI architecture: use large, slow models only for offline synthetic data generation and quality filtering, then deploy smaller, task-specific models on the production critical path.
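The offline half of this pattern can be sketched in a few lines. This is a minimal illustration, not Amazon's pipeline: the `Candidate` schema, the relation names, the 0.8 threshold, and the prompt wording are all assumptions made for the example, and the classifier score is treated as a precomputed number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One teacher-generated explanation (hypothetical schema)."""
    query: str
    product: str
    relation: str   # hypothetical relation label, e.g. "used_for"
    tail: str       # the commonsense explanation text
    score: float    # plausibility score from an assumed filter classifier


def filter_candidates(candidates, threshold=0.8):
    """Keep only candidates scored at or above the (assumed) threshold."""
    return [c for c in candidates if c.score >= threshold]


def to_edges(candidates):
    """Deduplicate candidates into sorted (head, relation, tail) graph edges."""
    return sorted({(c.query, c.relation, c.tail) for c in candidates})


def to_instruction_examples(candidates):
    """Turn filtered candidates into instruction-tuning records for a small model."""
    return [
        {
            "instruction": f"Why might a shopper searching '{c.query}' buy '{c.product}'?",
            "output": f"{c.relation}: {c.tail}",
        }
        for c in candidates
    ]


if __name__ == "__main__":
    raw = [
        Candidate("shoes for pregnant women", "slip-on sneakers",
                  "used_for", "slip-resistant support", 0.93),
        Candidate("shoes for pregnant women", "slip-on sneakers",
                  "capable_of", "being a shoe", 0.20),
    ]
    kept = filter_candidates(raw)
    print(to_edges(kept))
    print(to_instruction_examples(kept))
```

Only the filtered records reach the graph and the tuning set, so the expensive teacher model never sits on the serving path.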

Introducing AMS: Activation-based model scanner for open-weight LLM safety verification · Google

Testing open-weight models for safety with behavioral prompt benchmarks is notoriously slow, incomplete, and easily gamed by fine-tuning. Google’s GKE team introduced AMS (Activation-based Model Scanner), an open-source tool that drops behavioral testing entirely in favor of measuring the geometric structure of a model’s internal activation space. By checking for the collapse of internal “direction vectors” that separate harmful from benign content, AMS can verify model integrity in 10–40 seconds without generating a single output token. This structural approach to LLM security lets infrastructure teams build fast, harder-to-game safety gates directly into CI/CD pipelines and supply-chain verification.
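To make the idea of a "direction vector" concrete, here is a generic difference-of-means sketch. It is not AMS's algorithm or API: the function names, the separation metric, and the threshold of 3.0 are assumptions chosen for illustration, and the activations are toy 2-D lists standing in for hidden states captured on harmful and benign prompts.

```python
import math
import statistics


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [statistics.fmean(col) for col in zip(*rows)]


def separation_score(harmful, benign):
    """Gap between class means along the difference-of-means direction,
    divided by the pooled standard deviation of the projections."""
    diff = [h - b for h, b in zip(mean_vector(harmful), mean_vector(benign))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        return 0.0  # the two classes share a mean: the direction has collapsed
    unit = [d / norm for d in diff]

    def project(rows):
        return [sum(u * x for u, x in zip(unit, row)) for row in rows]

    proj_h, proj_b = project(harmful), project(benign)
    gap = statistics.fmean(proj_h) - statistics.fmean(proj_b)
    pooled = math.sqrt(
        (statistics.pvariance(proj_h) + statistics.pvariance(proj_b)) / 2
    )
    return math.inf if pooled == 0.0 else gap / pooled


def passes_gate(harmful, benign, threshold=3.0):
    """Hypothetical CI gate: fail the model if separation drops below threshold."""
    return separation_score(harmful, benign) >= threshold


if __name__ == "__main__":
    intact_h = [[2.0, 0.1], [2.2, -0.1], [1.8, 0.0]]
    intact_b = [[-2.0, 0.1], [-1.8, -0.1], [-2.2, 0.0]]
    collapsed_h = [[0.2, 0.0], [-0.1, 0.1], [0.0, -0.1]]
    collapsed_b = [[0.0, 0.1], [0.1, -0.1], [-0.1, 0.0]]
    print(passes_gate(intact_h, intact_b))        # well-separated classes
    print(passes_gate(collapsed_h, collapsed_b))  # overlapping classes
```

Because the check only needs a forward pass to collect activations, it involves no token generation, which is consistent with the short runtimes described above.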

How Popsa used Amazon Nova to inspire customers with personalised title suggestions · Popsa

Popsa needed to automatically generate creative, brand-aligned titles for photo books while adhering to a strict 36-character layout limit and producing structured JSON output. They replaced their legacy heuristic graph algorithm with a Retrieval-Augmented Generation (RAG) approach, using an “LLM-as-a-judge” to evaluate generated titles against hard constraints during testing. To overcome the latency of generating multi-option JSON, they migrated to Amazon Bedrock’s ConverseStream API, which let them parse incomplete JSON streams in real time and render the first valid suggestion in under one second. The architecture shows that teams can adopt generative AI under strict UI constraints by shifting complexity into real-time stream parsing and rigorous offline LLM-based evaluation.
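A minimal sketch of the stream-parsing idea follows. It makes no Bedrock calls: `chunks` is any iterable of text fragments standing in for streamed deltas, and the payload shape (a top-level JSON array of objects with a `title` key) is an assumption for the example, not Popsa's actual schema. Only the 36-character limit comes from the text.

```python
import json

MAX_TITLE_CHARS = 36  # layout limit described in the text


def first_valid_title(chunks, max_chars=MAX_TITLE_CHARS):
    """Return the first complete, within-limit title from a streamed JSON array.

    Assumes the stream spells out something like
    [{"title": "..."}, {"title": "..."}] one fragment at a time.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    pos = 0  # index just past the last fully parsed object
    for chunk in chunks:
        buffer += chunk
        while True:
            start = buffer.find("{", pos)
            if start == -1:
                break  # no new object has started yet
            try:
                obj, end = decoder.raw_decode(buffer, start)
            except json.JSONDecodeError:
                break  # object is still incomplete; wait for more text
            pos = end
            title = obj.get("title")
            if isinstance(title, str) and 0 < len(title) <= max_chars:
                return title  # render immediately; stop reading the stream
    return None


if __name__ == "__main__":
    stream = [
        '[{"title": "An extremely long title that overflows the layout"}, {"ti',
        'tle": "Summer in Lisbon"}, {"title": "Never reached"}]',
    ]
    print(first_valid_title(stream))
```

The function returns as soon as one object parses and passes the length check, so the UI does not wait for the array to close. Over-long titles are skipped rather than truncated, which is one possible policy among several.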

Deloitte optimizes EKS environment provisioning and achieves 89% faster testing environments using Amazon EKS and vCluster · Deloitte

Provisioning dedicated EKS clusters for QA testing took Deloitte 30–45 minutes per environment and duplicated infrastructure such as DNS, ALBs, and monitoring agents for every cluster. The platform team designed a multi-tenant solution using vCluster on top of a shared Amazon EKS host cluster running in Auto Mode. This design let them deploy shared ingress and storage controllers exactly once on the host while giving QA teams lightweight, isolated virtual clusters that provision in under 5 minutes. The tradeoff of sharing the underlying compute is a slight increase in host-cluster management complexity, but it eliminates redundant cloud resources and cuts costs substantially.

Build and deploy an automatic sync solution for Amazon Bedrock Knowledge Bases · Deloitte

Keeping Amazon Bedrock Knowledge Bases continuously synced with S3 document updates is difficult because of strict API rate limits and service quotas, such as a hard limit of one ingestion job per knowledge base at a time. Deloitte built a serverless, event-driven orchestration layer that uses EventBridge to capture S3 changes and immediately buffers them in an SQS queue. An AWS Step Functions state machine consumes these messages, checks DynamoDB to track active jobs, and enforces the concurrency quotas before triggering new Bedrock ingestion jobs. This is a useful blueprint for any backend system where high-throughput asynchronous events must be safely throttled into a strictly rate-limited third-party API without losing data.
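The throttling logic can be illustrated with an in-memory simulation. This is a sketch under stated assumptions, not the published solution: a `deque` stands in for the SQS queue, a dict stands in for the DynamoDB table of active jobs, and `start_job` is a hypothetical callback that would wrap the real ingestion call. The event shape (`kb_id`, `key`) is also made up for the example.

```python
from collections import deque


def dispatch(queue, active_jobs, start_job):
    """Make one pass over the queue.

    Starts a job for each knowledge base that has none running and
    re-queues events for busy knowledge bases, so no event is dropped.
    """
    started = []
    for _ in range(len(queue)):
        event = queue.popleft()
        kb_id = event["kb_id"]
        if kb_id in active_jobs:
            queue.append(event)  # quota: one ingestion job per KB at a time
            continue
        active_jobs[kb_id] = start_job(kb_id)
        started.append(kb_id)
    return started


def complete(active_jobs, kb_id):
    """Mark a knowledge base as idle once its ingestion job finishes."""
    active_jobs.pop(kb_id, None)


if __name__ == "__main__":
    queue = deque([
        {"kb_id": "kb-a", "key": "docs/1.pdf"},
        {"kb_id": "kb-a", "key": "docs/2.pdf"},
        {"kb_id": "kb-b", "key": "docs/3.pdf"},
    ])
    active = {}

    def fake_start(kb_id):
        return f"job-{kb_id}"  # placeholder for a real job identifier

    print(dispatch(queue, active, fake_start))  # first pass
    complete(active, "kb-a")
    print(dispatch(queue, active, fake_start))  # second pass after kb-a frees up
```

A production version would likely coalesce several pending events for the same knowledge base into a single sync; the sketch keeps one job per event to stay short.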

Uber Migrates 75,000+ Test Classes from JUnit 4 to JUnit 5 Using Automated Code Transformation · Uber

Uber faced significant operational risk in modernizing the testing infrastructure of a massive monorepo: more than 75,000 test classes had to move from JUnit 4 to JUnit 5. The engineering team avoided manual, error-prone rewrites by using OpenRewrite for automated code transformation, wrapped in internal orchestration tooling. To ensure correctness with zero developer downtime, they enabled the JUnit Platform for dual execution with Bazel and validated every change through CI. This is a prime example of treating codebase migrations as a programmatic infrastructure problem rather than a manual engineering chore.

Presentation: Building a Future-Proof Observability Platform to Empower Engineers · Skyscanner

Skyscanner needed to scale its observability practices across more than 800 microservices without being locked into a specific vendor’s ecosystem. The engineering team standardized on OpenTelemetry, decoupling application instrumentation from proprietary backends. By treating the platform as an internal product with developers as its customers, they drove a cultural shift that systematically reduced technical debt and incident rates. Standardizing on open protocols like OpenTelemetry is becoming the default strategy for keeping control of telemetry data and preserving vendor portability at scale.

An open-source spec for orchestration: Symphony · OpenAI

Engineers face significant context switching when translating requirements from issue trackers into coding environments. OpenAI released Symphony, an open-source specification for Codex orchestration that turns issue trackers into always-on agent systems. By defining a standardized orchestration layer, it tightly couples task definitions with automated code generation. This points toward a growing industry focus on standardizing the interfaces between autonomous agents and existing project management tools.

Patterns Across Companies

A dominant theme this period is the maturing way top organizations deploy LLMs: companies are moving away from treating massive models as runtime black boxes and are instead using them as offline engines that generate structured artifacts and data. Amazon used a 175B-parameter model purely to build a knowledge graph and train a smaller, faster model for production, while Google’s AMS moves safety checks out of behavioral prompting and into the geometry of the model’s activations. Meanwhile, platform engineering continues to focus on resource consolidation, using tools like vCluster and automated code transformation to abstract complexity away from feature developers and reduce cloud waste.