About
An applied inference optimization lab
RiftStack works on the software layer between models and hardware. We build Emmy, a compiler that generates fast kernels, and a serving stack that runs them efficiently — so the same model costs less and responds faster on the GPUs teams already have.
Team
Track record
A small team building low-level systems for performance-critical software, with backgrounds at Apple, Roblox, and Ubisoft.
50–60% over cuBLAS
FP32 SGEMM in batched mode on the RTX 5090. In non-batched mode cuBLAS still wins.
1,157 tok/s
Qwen3 Coder on a single consumer-class RTX 5090.
ML compiler from scratch
Emmy — tracing, fusion, scheduling, and CUDA codegen in Python.







