Research
Research in the open
We publish model by model and GPU by GPU, with the limits of each result stated.
MAY 2026 · TUTORIALS · GPU
Modern GPU Matmul Optimization
How to optimize a matmul kernel on a modern GPU, one optimization at a time: register tiling, vectorized loads, shared-memory staging, cp.async, TMA, warp specialization, split-K, and tensor cores. Each is introduced, then demonstrated by toggling it on a real kernel and watching the generated code diff and the latency change on an RTX 5090.
MAY 2026 · TUTORIALS · GPU
A Principled ML Compiler Stack in 5,000 Lines of Python — Part 3
Part 3 of the from-scratch ML compiler walkthrough: replacing the hand-coded heuristics from part 2 with an SP-MCTS search loop over Tile-IR rewrite parameters. The same six-IR pipeline, the same sixteen rules — just a tree walk on top picking the parameters that bench fastest.
MAY 2026 · TUTORIALS · GPU
A Principled ML Compiler Stack in 5,000 Lines of Python — Part 2
Part 2 of the from-scratch ML compiler walkthrough: how Loop IR is lowered to a GPU schedule. An overview of the Tile IR through three running examples (pointwise, reduction, matmul) where each picks up the rules they need from a stack of small rewrites.
APR 2026 · TUTORIALS · GPU
A Principled ML Compiler Stack in 5,000 Lines of Python — Part 1
ML compilers look like black boxes. They're not. I built one from scratch (tracing, fusion, scheduling, CUDA codegen) in 5,000 lines of Python. I walk a transformer's RMSNorm layer from PyTorch through decomposition, fusion, and tile-level scheduling, ending with the emitted CUDA kernel.
APR 2026 · TUTORIALS · GPU
Surfacing a 60% performance bug in cuBLAS
While benchmarking an FP32 SGEMM kernel on the RTX 5090, I found cuBLAS dispatching a tiny kernel for huge batched workloads — stuck at ~40% FMA utilization across the entire size range. The same library binary correctly escalates to 73% on the RTX PRO 6000 and 82% on the H200.
MAR 2026 · BENCHMARKS · LLM
Optimizing Qwen3 Coder for RTX 5090 and PRO 6000
I got Qwen3 Coder from 277 tok/s to 1,207 tok/s on a PRO 6000, and from 556 to 1,157 tok/s on an RTX 5090. Here's exactly how, with reproducible recipes.