Blog
Thoughts on AI systems, engineering culture, and building things that work.

GLM-5 topped the coding benchmarks. Then I actually used it.
Zhipu AI's GLM-5 leads SWE-bench and LiveCodeBench. I tested it on an unpublished NP-hard optimization problem and 89 coding tasks. The best case is competitive. The typical case is not.
A deep dive into the architecture of Codex, Gemini CLI, Mistral Vibe, and OpenCode. Same model, a 2x performance gap; the scaffolding is what matters. Updated with GLM-5 results: every agent improved by 38-54%, but the ranking stayed the same.
Claude Code (Opus 4.6), Codex (GPT-5.3-Codex xhigh), Gemini CLI (Gemini-3-Pro-Preview), and Mistral (Devstral-2) tackle a fiber network optimization problem. Claude Code beat my 8-year-old C++ solution by 62 points. Updated with GLM-5 results across two agent frameworks and Terminal-Bench.
Building omniagents, a ~2000-line Python framework for multi-tenant AI coding agents, only to discover OpenHands already existed. Lessons on agent architecture, isolation, persistence, and costs.
We benchmarked our production RAG system across embedding models, chunk sizes, chunking strategies, and retrieval modes. The results contradicted the conventional wisdom.
A live benchmark that tests AI models' ability to predict real-world events through prediction markets. Every day, we let AI models bet $1 on top events from Polymarket.
We're brilliant engineers, yet we wrestle with Word docs, Excel sheets, and endless email chains. Time to reclaim our craft.