Blog
Thoughts on AI systems, engineering culture, and building things that work.

GLM-5 topped the coding benchmarks. Then I actually used it.
Zhipu AI's GLM-5 leads SWE-bench and LiveCodeBench. I tested it on an unpublished NP-hard optimization problem and 89 coding tasks. The best case is competitive. The typical case is not.
A deep dive into the architecture of Codex, Gemini CLI, Mistral Vibe, and OpenCode. Same model, a 2x performance gap; the scaffolding is what matters. Updated with GLM-5 results: every agent improved by 38-54%, but the ranking stayed the same.
Claude Code (Opus 4.6), Codex (GPT-5.3-Codex xhigh), Gemini CLI (Gemini-3-Pro-Preview), and Mistral (Devstral-2) tackle a fiber network optimization problem. Claude Code beat my 8-year-old C++ solution by 62 points. Updated with GLM-5 results across two agent frameworks and Terminal-Bench.
Building omniagents, a ~2000-line Python framework for multi-tenant AI coding agents, only to discover OpenHands already existed. Lessons on agent architecture, isolation, persistence, and costs.
We benchmarked our production RAG system across embedding models, chunk sizes, chunking strategies, and retrieval modes. The results contradicted the conventional wisdom.
A live benchmark that tests AI models' ability to predict real-world events through prediction markets. Every day, we let AI models bet $1 on top events from Polymarket.
We're brilliant engineers, yet we wrestle with Word docs, Excel sheets, and endless email chains. Time to reclaim our craft.