Hacker News — 2026-04-12#
Top Story#
Researchers completely bypassed top AI agent benchmarks—including SWE-bench, OSWorld, and WebArena—by writing simple exploits like fake curl wrappers and modified test hooks to achieve 100% scores without actually solving a single task. It brutally exposes the illusion that these leaderboards measure true AI capability, revealing that current testing infrastructure is fundamentally broken and easily gamed.
Front Page Highlights#
[Anthropic silently downgraded cache TTL from 1h -> 5m] · GitHub Data from over 119,000 API calls shows Anthropic quietly dropped Claude Code’s prompt cache TTL from an hour down to five minutes in early March. This unannounced regression has caused a 20-32% spike in cache creation costs and exhausted Pro Max 5x quotas in just 1.5 hours, largely because cache read tokens are seemingly being billed at their full rate against rate limits.