Open source · ML compiler

Emmy

An open-source compiler that lowers PyTorch graphs to CUDA through six inspectable intermediate representations — about 5,000 lines of Python, built from scratch.

Read the code

# install
pip install deplodock

# compile a layer to CUDA
deplodock compile -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))"

Installs as deplodock for now — the package is being renamed to Emmy.
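For orientation, the layer in the command above computes a root-mean-square normalization over the last dimension. The following is a minimal pure-Python reference for a single row, not Emmy's generated code: the function name `rms_norm` and the `eps` default of `1e-6` are assumptions made for this sketch, and the real `nn.RMSNorm` picks its own epsilon and operates on tensors.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """Reference RMSNorm for one row: x / sqrt(mean(x^2) + eps) * weight.

    `eps` is an assumed value chosen for this sketch only.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    if weight is None:
        # No learned scale supplied: use an all-ones weight.
        weight = [1.0] * len(x)
    return [v * inv_rms * w for v, w in zip(x, weight)]


row = [3.0, 4.0]
out = rms_norm(row, eps=0.0)
```

With `eps=0.0` the output row has a mean square of 1, which is a quick sanity check for any compiled version of the same layer.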

The pipeline

A graph is lowered through six intermediate representations, each one printable. Scheduling comes from a search over Tile-IR rewrite rules rather than fixed heuristics.

Torch IR → Tensor IR → Loop IR → Tile IR → Kernel IR → CUDA
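To give a feel for what searching over rewrite rules can look like, here is a toy sketch. Everything in it is hypothetical: the schedule is just a pair of tile sizes, the two rules and the cost function are invented for illustration, and the exhaustive breadth-first strategy is a stand-in. It does not reflect Emmy's actual Tile IR, rule set, or cost model.

```python
from collections import deque

# Toy problem: a 2-D loop nest of size M x N, scheduled as (tile_m, tile_n).
M, N = 64, 64
MAX_THREADS = 256  # assumed per-tile budget, made up for this sketch


def double_tile_m(s):
    """Rewrite rule: double the M tile if it still fits the problem."""
    tm, tn = s
    return (tm * 2, tn) if tm * 2 <= M else None


def double_tile_n(s):
    """Rewrite rule: double the N tile if it still fits the problem."""
    tm, tn = s
    return (tm, tn * 2) if tn * 2 <= N else None


RULES = [double_tile_m, double_tile_n]


def cost(s):
    """Invented cost: tiles launched, plus a penalty for oversized tiles."""
    tm, tn = s
    tiles = (M // tm) * (N // tn)
    penalty = 1000 if tm * tn > MAX_THREADS else 0
    return tiles + penalty


def search(start):
    """Apply every rule to every reachable schedule and keep the cheapest."""
    seen = {start}
    queue = deque([start])
    best = start
    while queue:
        s = queue.popleft()
        if cost(s) < cost(best):
            best = s
        for rule in RULES:
            nxt = rule(s)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return best


best = search((1, 1))
```

The point of the design is that the schedule falls out of the rules plus a cost function rather than a hand-written heuristic: adding a rule widens the search space without touching the search loop.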

Benchmarks

Measured on consumer-class GPUs. Full methodology is in the blog series.

50–60% faster than cuBLAS

Batched FP32 SGEMM on the RTX 5090. For non-batched SGEMM, cuBLAS is still faster.

4.87× over eager

GELU pointwise, fused into one kernel.
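The general idea behind pointwise fusion can be sketched in plain Python. The three-pass split below is purely illustrative: how the eager baseline in this benchmark is actually decomposed is not stated here, and the function names are invented for the sketch. It uses the exact erf form of GELU.

```python
import math


def gelu_unfused(xs):
    # Three elementwise passes, each materializing a temporary list,
    # standing in for separate kernels with intermediate buffers.
    scaled = [x / math.sqrt(2.0) for x in xs]
    gate = [0.5 * (1.0 + math.erf(s)) for s in scaled]
    return [x * g for x, g in zip(xs, gate)]


def gelu_fused(xs):
    # One pass: each element is read once and written once,
    # with no intermediate buffers.
    return [0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))) for x in xs]


xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
unfused = gelu_unfused(xs)
fused = gelu_fused(xs)
```

Both versions compute the same values; the fused one simply avoids writing and re-reading temporaries, which is where a memory-bound pointwise op tends to spend its time.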

0.56× the speed of eager

TinyLlama-1.1B end-to-end — matmul-dominated, still behind eager.

Built in the open, under Apache-2.0.

View on GitHub → · How we built it →