Research in the open

We publish model by model and GPU by GPU, with the limits of each result stated.

MAY 2026 · TUTORIALS · GPU

Modern GPU Matmul Optimization

How to optimize a matmul kernel on a modern GPU, one optimization at a time: register tiling, vectorized loads, shared-memory staging, cp.async, TMA, warp specialization, split-K, and tensor cores. Each is introduced, then demonstrated by toggling it on a real kernel and watching the generated code diff and the latency change on an RTX 5090.

Dmitry Trifonov

MAY 2026 · TUTORIALS · GPU

A Principled ML Compiler Stack in 5,000 Lines of Python — Part 3

Part 3 of the from-scratch ML compiler walkthrough: replacing the hand-coded heuristics from part 2 with an SP-MCTS search loop over Tile-IR rewrite parameters. The six-IR pipeline and the sixteen rules are unchanged; a tree walk on top picks the parameters that benchmark fastest.

Dmitry Trifonov

MAY 2026 · TUTORIALS · GPU

A Principled ML Compiler Stack in 5,000 Lines of Python — Part 2

Part 2 of the from-scratch ML compiler walkthrough: how Loop IR is lowered to a GPU schedule. An overview of the Tile IR through three running examples (pointwise, reduction, matmul), each picking up the rules it needs from a stack of small rewrites.

Dmitry Trifonov

APR 2026 · TUTORIALS · GPU

A Principled ML Compiler Stack in 5,000 Lines of Python — Part 1

ML compilers look like black boxes. They're not. I built one from scratch (tracing, fusion, scheduling, CUDA codegen) in 5,000 lines of Python. I walk a transformer's RMSNorm layer from PyTorch through decomposition, fusion, and tile-level scheduling, ending with the emitted CUDA kernel.

Dmitry Trifonov

APR 2026 · TUTORIALS · GPU

Surfacing a 60% performance bug in cuBLAS

While benchmarking an FP32 SGEMM kernel on the RTX 5090, I found cuBLAS dispatching a tiny kernel for huge batched workloads — stuck at ~40% FMA utilization across the entire size range. The same library binary correctly escalates to 73% on the RTX PRO 6000 and 82% on the H200.

Dmitry Trifonov

MAR 2026 · BENCHMARKS · LLM

Optimizing Qwen3 Coder for RTX 5090 and PRO 6000

I got Qwen3 Coder from 277 tok/s to 1,207 tok/s on a PRO 6000, and from 556 to 1,157 tok/s on an RTX 5090. Here's exactly how, with reproducible recipes.

Dmitry Trifonov