Engineering @ Scale - 2026-06-29

Signal of the Day

PAR Technology's “Split-Plane SQL” architecture is today's standout for multi-tenant LLM security. Instead of trusting a non-deterministic LLM to apply tenant filters, the system programmatically generates row-level security CTEs before the model is invoked. The LLM never sees cross-tenant data, so a prompt injection has nothing to leak.

Deep Dives

AI Tools Accelerate Coding, but Not Overall Software Delivery, GitLab Research Finds · GitLab Engineering teams are encountering an “AI Paradox”: 78% of developers code faster using AI, but end-to-end software delivery timelines remain stagnant. The bottleneck arises because faster coding shifts the friction downstream to code review, automated testing, and enterprise governance. The architectural takeaway is that optimizing a single stage of the SDLC without addressing constraints in the broader CI/CD pipeline yields no overall gain. Organizations must now modernize their traceability and compliance frameworks to keep pace with AI-generated throughput.

Podcast: Architectural Patterns: Moving Beyond Cloud-Native to Local-First - Insights from Adam Wiggins · Ink & Switch Adam Wiggins advocates for a “local-first” architecture that merges the collaborative benefits of cloud-native design with the performance and data ownership of local applications. The core problem solved is latency and offline availability, achieved by leveraging Conflict-Free Replicated Data Types (CRDTs) to sync state without a central authoritative server. This approach pushes version control primitives beyond code into generic data domains. Furthermore, Wiggins suggests that hybrid AI models can execute core productivity tasks locally, reducing cloud dependencies and improving user privacy.
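
To make the CRDT idea concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter. It is a generic textbook illustration, not code from Ink & Switch; the class name and the replica names `laptop` and `phone` are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter CRDT: one monotonically increasing slot per replica."""
    replica_id: str
    counts: dict = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # A replica only ever writes to its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of sync order or repetition.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    @property
    def value(self) -> int:
        return sum(self.counts.values())


laptop, phone = GCounter("laptop"), GCounter("phone")
laptop.increment(2)  # edits made offline on one device
phone.increment(3)   # concurrent edits on another device
laptop.merge(phone)  # peers exchange state directly, with no central server
phone.merge(laptop)
```

After the two merges both replicas report the same total, and merging again changes nothing, which is the property that lets peers sync in any order.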

Article: Virtual panel: Security in the Machine Age: Expert Insights on AI Threat Evolution · InfoQ As AI systems become highly autonomous and integrated into critical workflows, traditional incident response models are struggling to adapt. Security teams are facing novel attack vectors, transitioning from simple prompt injections and data poisoning to sophisticated agent abuse and AI-powered social engineering. A panel of security experts emphasized that defending these systems requires shifting security controls closer to the AI execution boundaries. The discussion underscores that organizations must implement robust validation logic separate from the non-deterministic AI models themselves to protect core infrastructure.

Presentation: Million PDFs: Building a Modern Document Infrastructure with Rust and Typst · InfoQ Legacy PDF generation in regulated industries typically relies on heavy, slow engines like Puppeteer or LaTeX, causing severe operational pain. Erik Steiger resolved this by transitioning to a serverless architecture written in Rust and powered by the Typst typesetting engine. This architectural pivot dropped render latencies to below 2ms, fundamentally changing the performance profile of document pipelines. By applying Git and Docker concepts to template registries, the team also ensured ironclad compliance and enabled rapid debugging of generated assets.

Inside Target’s LLM-Based System for Semantic Matching in Marketing Forecast Pipelines · Target Target’s engineering team struggled with brittle, rule-based workflows for marketing campaign forecasting. They re-architected the pipeline by deploying an LLM-based generative AI system that utilizes vector search and embeddings to retrieve and rank similar historical campaigns. This shift to semantic matching achieved 75% top-1 and 100% top-3 coverage in campaign evaluations. The system generalizes well to other domains, utilizing automated feedback loops based on actual campaign outcomes to continuously refine retrieval accuracy.
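
For readers unfamiliar with top-k coverage, the sketch below shows one common way to compute it. The exact definition Target uses is not given here, so the function, the campaign IDs, and the toy data are all assumptions for illustration.

```python
def top_k_coverage(ranked_results, relevant, k):
    """Fraction of queries with at least one relevant item in the top k results."""
    hits = sum(
        1
        for ranked, rel in zip(ranked_results, relevant)
        if any(item in rel for item in ranked[:k])
    )
    return hits / len(ranked_results)


# Toy data: four hypothetical campaigns, each with a ranked list of retrieved
# historical campaigns and the set a human judged to be a correct match.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
relevant = [{"a"}, {"d"}, {"g"}, {"l"}]

top1 = top_k_coverage(ranked, relevant, k=1)
top3 = top_k_coverage(ranked, relevant, k=3)
```

In this toy set the fourth query only finds its match at rank 3, so widening k from 1 to 3 raises coverage.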

Eliya 25 Brings a JVM-Level Diagnostic Profile to OpenJDK 25 LTS · Asymm Systems Teams running Java applications in highly regulated environments often struggle with extracting reliable production diagnostic data without severe performance hits. Asymm Systems launched Eliya 25.0.3, a distribution of OpenJDK 25 LTS that resolves this by packaging several HotSpot features into a single, opt-in Production diagnostic profile. This simplifies JVM tuning and guarantees a baseline of observability out-of-the-box for enterprise deployments.

Java News Roundup: Hardwood 1.0, Endive 1.0, Azul Payara, Quarkus, WildFly, LangChain4j, OSSI · InfoQ The Java ecosystem saw multiple updates, including the General Availability of Hardwood 1.0 and Endive 1.0. Additionally, point releases for Quarkus and LangChain4j signal ongoing enhancements in Java-based cloud-native and AI framework integrations. The newly introduced Open Source Sustainability Initiative (OSSI) aims to support long-term maintenance of vital open-source Java dependencies.

Debugging production agents with Amazon Bedrock AgentCore Observability · AWS Production AI agents frequently fail silently by returning incorrect answers or entering infinite reasoning loops without throwing traditional exceptions. AWS addresses this by introducing AgentCore Observability, which exposes execution states across CloudWatch metrics, OpenTelemetry traces, and structured logs. To mitigate infinite loops caused by missing termination conditions, engineers must enforce explicit hard stops, like limiting reasoning to 10-15 steps, and implement state-tracking loop detection outside the LLM prompt. Furthermore, tracing tool invocations helps immediately identify whether reliability failures stem from missing IAM authorizations (403 errors) or invalid LLM schemas (400 errors).
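
The hard stop and the out-of-prompt loop detection can be sketched as a small driver around the model call. This is a minimal illustration, not the AgentCore API: `run_agent`, `AgentLoopError`, the `"FINISH"` sentinel, and both thresholds are hypothetical names and values.

```python
from collections import Counter

MAX_STEPS = 12   # hypothetical cap inside the 10-15 step range
MAX_REPEATS = 3  # hypothetical: the same action this many times counts as a loop


class AgentLoopError(RuntimeError):
    """Raised when the driver, not the model, decides to stop the run."""


def run_agent(next_action):
    """Drive an agent with a hard step limit and repeat detection.

    `next_action` stands in for the LLM call: it receives the action history
    and returns the next action string, or "FINISH" when it is done.
    """
    history = []
    seen = Counter()
    for _ in range(MAX_STEPS):
        action = next_action(history)
        if action == "FINISH":
            return history
        seen[action] += 1
        if seen[action] >= MAX_REPEATS:
            raise AgentLoopError(f"repeated action: {action!r}")
        history.append(action)
    raise AgentLoopError(f"no termination after {MAX_STEPS} steps")
```

Because both checks live in ordinary code, they hold even when the prompt's own termination instructions are ignored.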

Build an agentic AI healthcare claims pipeline with Amazon Bedrock and AWS HealthLake · AWS Processing scanned CMS-1500 healthcare claims is highly error-prone, but granting full autonomy to AI agents is too risky for sensitive FHIR data generation. AWS architected a pipeline where Amazon Bedrock Data Automation extracts structured data, and an AgentCore LLM validates the records against AWS HealthLake. Crucially, the architecture utilizes AWS Lambda as a “deterministic supervisor” over the agentic workflow. Rather than allowing the LLM to orchestrate the entire flow at runtime, Lambda acts as the final arbiter, maintaining strict systemic control while leveraging the LLM merely to navigate OCR discrepancies.
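
The “deterministic supervisor” idea can be illustrated without any AWS services. In the sketch below the control flow is fixed in code, the model is consulted only for fields that fail validation, and its suggestion is re-validated before it is accepted. The field names, the validation rules, and the `llm_fix` callback are assumptions for the example, not the pipeline's actual schema.

```python
REQUIRED_FIELDS = ("patient_id", "procedure_code", "billed_amount")  # hypothetical


def _valid(name, value):
    """Deterministic per-field checks (illustrative rules only)."""
    if not isinstance(value, str) or not value.strip():
        return False
    if name == "billed_amount":
        try:
            return float(value) >= 0
        except ValueError:
            return False
    if name == "procedure_code":
        return len(value) == 5 and value.isalnum()
    return True


def supervise_claim(extracted, llm_fix):
    """Run a fixed flow; `llm_fix(claim, field)` stands in for the model call."""
    claim = dict(extracted)
    for name in REQUIRED_FIELDS:
        if not _valid(name, claim.get(name)):
            suggestion = llm_fix(claim, name)
            if not _valid(name, suggestion):
                # The supervisor, not the model, has the final say.
                return {"status": "manual_review", "field": name}
            claim[name] = suggestion
    return {"status": "accepted", "claim": claim}
```

A model that returns a malformed correction cannot push a bad record through; the claim is routed to manual review instead.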

Multi-tenant LLM analytics with row-level security: How we built a secure agent on AWS · PAR Technology Executing text-to-SQL tasks in multi-tenant environments carries extreme data leakage risks, as probabilistic LLMs cannot be trusted to reliably apply tenant filters. PAR Technology built a “Zero Trust” analytics architecture utilizing a three-layer defense: AWS SigV4 cryptographic signing, semantic LLM validation, and programmatic data isolation. Before the LLM generates SQL, a “Split-Plane” engine programmatically creates Common Table Expressions (CTEs) bound strictly to the authenticated user’s scope. The LLM only sees the schema of this pre-filtered, in-memory sandbox, ensuring that even deliberate prompt-injection jailbreaks cannot reference cross-tenant data because it simply isn’t present in the execution context.
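
A rough sketch of the CTE-prefixing step, using SQLite so that it runs anywhere. It is not PAR Technology's implementation: the table and view names, the `tenant_id` column, and `build_scoped_query` are assumptions, and the regex check is a stand-in for the real SQL parsing a production system would need.

```python
import re
import sqlite3

# Hypothetical mapping of sandbox view name -> underlying table.
# Every underlying table is assumed to carry a tenant_id column.
SCOPED_VIEWS = {"orders": "raw_orders"}


def build_scoped_query(llm_sql):
    """Prefix model-written SQL with tenant-filtered CTEs.

    The model is only told about the view names in SCOPED_VIEWS. The tenant
    filter is a bound parameter that the model never writes or sees.
    """
    for raw in SCOPED_VIEWS.values():
        # Reject any direct reference to an underlying table.
        if re.search(rf"\b{re.escape(raw)}\b", llm_sql, re.IGNORECASE):
            raise ValueError(f"query references unscoped table {raw!r}")
    ctes = ", ".join(
        f"{view} AS (SELECT * FROM {raw} WHERE tenant_id = :tenant_id)"
        for view, raw in SCOPED_VIEWS.items()
    )
    return f"WITH {ctes} {llm_sql}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (tenant_id TEXT, total REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("t1", 10.0), ("t1", 5.0), ("t2", 99.0)],
)

# Even a query that tries to widen its own filter only sees tenant t1's rows.
injected = "SELECT SUM(total) FROM orders WHERE tenant_id = 't2' OR 1=1"
row = conn.execute(build_scoped_query(injected), {"tenant_id": "t1"}).fetchone()
```

The tenant ID comes from the authenticated session rather than from model output, which is what keeps the filter out of reach of the prompt.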

Pair Nova 2 Lite with Claude for cost-optimized document processing · AWS Digitizing complex layouts like yearbook pages requires detecting faces, reading text, and accurately linking the two across varying spatial arrangements. Instead of routing the entire image to a single, expensive multimodal model, engineers created a two-stage pipeline. First, Amazon Nova 2 Lite extracts photos, bounding boxes, and names in one pass. Second, Claude Sonnet 4.6 receives only the structured JSON coordinates and performs the spatial reasoning that matches names to faces. This decoupled architecture reduces per-page costs by roughly 66% while achieving a 93% high-confidence matching rate, which suggests that multi-model orchestration can beat monolithic prompting on cost for this kind of task.

Implement a backup strategy for Amazon QuickSight BI assets · AWS Business Intelligence assets require robust disaster recovery, but backing up interconnected dashboards, datasets, and VPC connections is highly complex. Engineers can utilize QuickSight’s AssetsAsBundle APIs to programmatically export full dependency graphs of BI infrastructure into CloudFormation or JSON packages. The presented architecture orchestrates the backup using AWS Step Functions, persisting bundle ZIPs to S3 while separately backing up IAM group structures into date-suffixed DynamoDB tables. This isolates BI state into durable storage, allowing for point-in-time recovery without risking API throttling from large parallel exports.

GenPage: Towards End-to-End Generative Homepage Construction at Netflix · Netflix Netflix replaced its intricate, multi-stage recommender stack with GenPage, an autoregressive decoder-only transformer that generates the entire homepage layout sequentially. To meet strict serving latency requirements, Netflix bypassed standard text tokenizers, building a domain-specific vocabulary where each movie or UI row is represented by a single token. This custom tokenization simplified business logic enforcement via constrained decoding, leading to a 20% reduction in end-to-end serving latency. Interestingly, fine-tuning the model using Reinforcement Learning (RL) on page-level rewards natively produced higher homepage diversity, capturing complex cross-row interactions without explicit diversity constraints.
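
With one token per title or row, constrained decoding reduces to masking disallowed tokens before picking the next one. The sketch below illustrates that mechanism with greedy decoding and a stubbed scoring function. It is not Netflix's implementation, and the token names, logits, and rules are invented for the example.

```python
import math


def constrained_greedy_decode(score_fn, vocab, num_slots, is_allowed):
    """Greedy decoding where disallowed tokens are masked to -inf before argmax.

    score_fn(prefix) returns a dict of token -> logit and stands in for the
    transformer. is_allowed(prefix, token) encodes the business rules.
    """
    page = []
    for _ in range(num_slots):
        logits = score_fn(page)
        masked = {
            tok: (logits[tok] if is_allowed(page, tok) else -math.inf)
            for tok in vocab
        }
        best = max(masked, key=masked.get)
        if masked[best] == -math.inf:
            break  # no legal token remains
        page.append(best)
    return page


# Hypothetical vocabulary: each title or UI row is exactly one token.
vocab = ["row:trending", "title:101", "title:102", "title:103"]
fixed_logits = {"title:102": 3.0, "title:101": 2.0, "row:trending": 1.5, "title:103": 1.0}


def no_duplicates_and_licensed(prefix, tok):
    # Illustrative rules: never repeat an item, and title:102 is unavailable.
    return tok not in prefix and tok != "title:102"


page = constrained_greedy_decode(lambda prefix: fixed_logits, vocab, 3, no_duplicates_and_licensed)
```

The highest-scoring token is never emitted because the mask removes it before selection, so the rule is enforced by construction rather than by post-filtering.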

Dual-token authentication for Nakama game servers with Amazon Cognito on AWS · AWS Managing real-time game sessions alongside a managed identity provider usually disrupts player experiences with redirects. This architecture solves the issue by wrapping a Nakama game server in a default-closed routing layer where HTTP passes through an ALB and WebSockets route via a TCP-passthrough NLB. A custom Go hook validates the Amazon Cognito JWT cryptographically before issuing an independent Nakama session token. To survive the NLB’s hard 350-second idle connection drop, the server employs strict 10-second WebSocket ping/pong frames, keeping the TCP flow active without dropping persistent player states.

Preventing data exfiltration in machine learning environments with Amazon SageMaker AI · iBusiness Providing data scientists access to sensitive data safely traditionally requires expensive, heavily monitored VDI environments. iBusiness cut costs by 80% using a 3-layer security model built around Amazon SageMaker Studio. The outer layer mandates access strictly through Amazon WorkSpaces Secure Browser, disabling local clipboards and downloads. The inner network layer removes all internet gateways from the SageMaker VPC, relying entirely on AWS VPC Endpoints with strict IAM policies to prevent cross-account exfiltration.

Lessons learned from scaling to 1 million Lambda functions · ProGlove Operating a multi-tenant SaaS with a dedicated AWS account per customer provides excellent quota isolation, but introduces massive scale-to-zero challenges. ProGlove learned that synchronized scheduled Lambdas across thousands of accounts created a self-DDoS effect on their internal APIs, necessitating randomized jitter across all cron expressions. Furthermore, they discovered that traditional “best practices” like SQS buffers resulted in extreme costs due to empty polling; they stripped SQS out and adopted centralized Dead Letter Queues instead. Efficient serverless scaling ultimately required moving away from per-account observability tools to avoid ruinous metric-forwarding costs.
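
One way to add jitter to schedules is to derive a stable per-account minute offset from a hash of the account ID. This is a sketch of that idea under assumptions, not ProGlove's code: `jittered_cron` is a hypothetical helper, and the output follows the six-field EventBridge cron format for an hourly schedule.

```python
import hashlib


def jittered_cron(account_id):
    """Return an hourly cron expression offset by a stable per-account minute."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    offset = int.from_bytes(digest[:4], "big") % 60
    # Fields: minutes hours day-of-month month day-of-week year
    return f"cron({offset} * * * ? *)"


# Hypothetical 12-digit account IDs, spread across the hour instead of all at :00.
schedules = {f"{n:012d}": jittered_cron(f"{n:012d}") for n in range(1000)}
```

Hashing rather than calling a random generator keeps each account's schedule identical across deployments while still spreading the fleet over the hour.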

Inside the Advisory Database and what happens when vulnerability volume breaks records · GitHub GitHub’s Advisory Database experienced a 5x surge in vulnerability reporting, exposing the fragility of manual curation workflows. While well-formatted advisories take minutes to process, incomplete upstream data forces curators into complex package disambiguation and version-range reconstruction across disparate ecosystems. To scale, GitHub is deploying AI-assisted research tools for curators while reinforcing strict community guidelines—such as providing registry-accurate package names and complete CVSS vector strings. High-quality, machine-readable upstream disclosures are now mandatory to prevent widespread false positives in downstream SCA tools.

Highlights from Git 2.55 · GitHub Maintaining enormous Git monorepos traditionally required monolithic rewrites of pack indexes, causing massive I/O spikes. Git 2.55 stabilizes incremental multi-pack indexes (MIDX), allowing Git to chain index layers so new packs append effortlessly without rewriting old metadata. To prevent the MIDX chain from growing infinitely, the algorithm applies geometric repacking, aggressively rolling up new layers only when they surpass a size threshold relative to older ones. This release also introduces the highly anticipated git history fixup command, allowing developers to safely apply staged index changes directly to historical commits.
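
The geometric rollup can be pictured with a toy model in which each layer is just a size. The sketch below is a deliberate simplification and not Git's actual algorithm: `append_layer`, the default factor of 2, and the merge rule (combine the two newest layers whenever the older one is less than `factor` times the newer one) are assumptions for illustration.

```python
def append_layer(layers, new_size, factor=2):
    """Append a layer (sizes ordered oldest to newest), then merge from the
    newest end until every older layer is at least `factor` times the next."""
    result = list(layers) + [new_size]
    while len(result) >= 2 and result[-2] < factor * result[-1]:
        newest = result.pop()
        older = result.pop()
        result.append(older + newest)
    return result


chain = []
snapshots = []
for _ in range(8):
    chain = append_layer(chain, 1)
    snapshots.append(list(chain))

# A small new layer next to a large old one leaves the old layer untouched.
big = append_layer([1000], 10)
```

In this model the chain length grows roughly logarithmically with the number of appends, and large old layers are rewritten only when the newer layers have accumulated enough to rival them.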

Memora: A Harmonic Memory Representation Balancing Abstraction and Specificity · Microsoft Existing AI memory systems force a tradeoff: RAG fragments context into brittle chunks, while summarization deletes specific details. Memora solves this by decoupling what is stored from how it is retrieved. Every memory entry combines a “primary abstraction” (a 6-8 word summary embedded for semantic search) with a detailed “memory value” that preserves raw specificity but is hidden from the vector search. A policy-guided retriever iteratively follows context-aware “cue anchors” across memories, allowing multi-hop reasoning while consuming up to 98% fewer tokens than full-context injection.
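
The split between what is searched and what is returned can be shown in a few lines. This sketch is not Memora's code: it swaps the embedding search for plain word overlap (Jaccard similarity), omits the cue-anchor traversal, and uses invented entries.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    abstraction: str  # short summary; the only field that is searched
    value: str        # full detail; returned on a hit but never searched


def _tokens(text):
    return set(text.lower().split())


def retrieve(query, store):
    """Rank entries by word overlap between the query and the abstraction only."""
    def score(entry):
        q, a = _tokens(query), _tokens(entry.abstraction)
        return len(q & a) / len(q | a) if (q | a) else 0.0
    return max(store, key=score)


store = [
    MemoryEntry(
        abstraction="user prefers dark mode in editor",
        value="Asked twice to switch the editor theme to dark; dislikes high-contrast variants.",
    ),
    MemoryEntry(
        abstraction="staging deploy failed on expired certificate",
        value="Rollback notes also mention dark mode preference in passing.",
    ),
]

hit = retrieve("dark mode preference", store)
```

The second entry's value contains the query words, yet it cannot win the ranking because only the abstraction is matched; the detailed value comes back intact once the right entry is found.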

How AI Agents Manage Memory and Avoid Forgetfulness · ByteByteGo Because LLMs are fundamentally stateless, “agent memory” is purely a platform engineering and retrieval problem. Continuously appending history to the context window ruins latency, skyrockets costs, and triggers the “lost-in-the-middle” attention degradation effect. Modern agent architectures mimic operating system paging, utilizing a 4-tier hierarchy: Context Window, Session Memory, Long-term Store, and Cold Archive. The hardest tradeoff engineers face is deciding retrieval logic: balancing recency against semantic similarity, while mitigating the risk of “memory poisoning” from stale or malicious historical records.
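
The recency-versus-similarity tradeoff is often expressed as a weighted score. The formula below is one plausible form, not the one from the article: the exponential half-life decay, the default weight, and the parameter names are assumptions.

```python
def retrieval_score(similarity, age_hours, half_life_hours=24.0, recency_weight=0.3):
    """Blend semantic similarity with an exponentially decaying recency term."""
    recency = 0.5 ** (age_hours / half_life_hours)  # 1.0 when fresh, halves every half-life
    return (1.0 - recency_weight) * similarity + recency_weight * recency


fresh_but_vague = retrieval_score(similarity=0.6, age_hours=0)
old_but_exact = retrieval_score(similarity=0.9, age_hours=240)
```

With these defaults the fresh memory outranks the older, more similar one; setting `recency_weight=0` reverses the order, which is the knob the retrieval policy has to tune.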

Mapping Europe’s AI Workforce Opportunity · OpenAI OpenAI published a report mapping the macro impact of AI on the European labor market. It breaks down the transition, indicating which European occupations face automation risks and which are likely to see workflow augmentation.

Sandboxes now expire based on last use · Vercel Vercel modified its Sandbox snapshot retention policy to expire based on the last access time rather than creation time. This simple state-management tweak prevents snapshots from dying mid-session, making it vastly safer to build long-running integrations on top of ephemeral environments.

Query Speed Insights from the Vercel CLI · Vercel Vercel exposed its Speed Insights directly through the CLI via the vercel metrics command. This allows automated CI/CD pipelines and coding agents to programmatically query real-user Core Web Vitals (INP, CLS, LCP) to detect frontend regressions before deploying.

Realtime voice, speech, and transcription now supported on AI Gateway · Vercel Vercel’s AI Gateway added native support for streaming voice models. Instead of chaining Speech-to-Text and Text-to-Speech models, single real-time models can take audio in and emit audio out, drastically lowering latency for voice agents.

Build realtime voice agents on AI Gateway · Vercel Implementing real-time conversational AI in browsers requires precise state and token management. To secure provider API keys, developers mint short-lived tokens on the server and pass them to the client’s useRealtime hook to manage the WebSocket connections. Notably, turn-taking relies on Server Voice Activity Detection (server-vad), enabling users to interrupt the AI natively.

xAI Grok audio models now available on Vercel AI Gateway · Vercel xAI’s audio suite—including grok-voice-think-fast, grok-tts, and grok-stt—is now integrated into Vercel’s AI Gateway via AI SDK 7. This allows developers to use Grok models seamlessly while maintaining universal routing, observability, and spend controls.

Ask an AI expert: What exactly is the full stack? · Google Google published an explainer outlining its full-stack AI strategy. It details how owning the entire stack—from custom silicon and infrastructure platforms to foundation models—remains central to their technological architecture and product development.

Open Models, Closed Environments: Palantir Brings Secure AI to US Agencies With NVIDIA Nemotron · Palantir U.S. government agencies face strict data sovereignty requirements that block cloud-based LLM deployments. Palantir integrated NVIDIA Nemotron open models into its Sovereign AI Operating System, allowing fully air-gapped, on-premise frontier AI execution. This guarantees absolute data isolation and auditability while enabling organizations to fine-tune weights on proprietary operations data.

Firefly Aerospace Operates NVIDIA Jetson in Lunar Orbit for the First Time · Firefly Aerospace Firefly Aerospace successfully operated an NVIDIA Jetson edge computing module while in lunar orbit. This marks a significant milestone in deploying high-performance, accelerated AI computing within the harsh radiation constraints of space.

Claude Meets Blackwell Ultra: Anthropic’s Models Now Run on NVIDIA GB300 in Azure · Anthropic Anthropic’s Claude models are now available on Microsoft Azure utilizing NVIDIA GB300 NVL72 systems linked via Quantum-X800 InfiniBand. This deployment leverages the NVIDIA Secure Agent Workspace to enforce infrastructure-level network access and runtime policies, enabling enterprises to safely run autonomous sub-agents across business domains.

Agent Memory · Oracle Dumping full conversation histories into LLM prompts creates noisy contexts that obscure critical facts. True agentic memory requires categorizing state into distinct structures: Semantic (durable facts), Episodic (past events), Procedural (learned workflows), and Entity (scoped to a specific user/ticket). Oracle addresses this engineering challenge with the Oracle AI Agent Memory Package (OAMP), treating memory as a multi-modal database problem. OAMP provides primitives like “context cards” to retrieve structured SQL tables, JSON, and vector embeddings, injecting only highly targeted context into the prompt.

What You Bring to AI Determines the Result · O’Reilly A common misconception is that providing dozens of examples in a prompt will force an LLM to reliably mimic a specific writing style. Educator Harper Carroll notes that in-context prompting simply applies pattern matching against frozen model weights, leaving the underlying probability distribution unaltered. To truly shift an LLM’s behavioral output, developers must perform fine-tuning, altering the model’s parameters so it fundamentally wants to write differently. A highly effective technique is training the model using AI-generated text as the input and human-authored text as the target output, teaching the network to systematically “undo the tells”.

Patterns Across Companies

Teams across the stack (AWS, PAR Technology, Netflix, Microsoft, Oracle) are hitting the limits of treating LLMs as stateless, omnipotent orchestrators. The converging architectural pattern is to bound the LLM’s non-determinism with strict deterministic systems: AWS uses Lambda supervisors to arbitrate agent output, PAR Technology pre-filters the SQL plane before the LLM sees it, Netflix enforces business logic through constrained decoding, and Microsoft and Oracle move memory into structured vector and SQL stores instead of polluting the context window.