RESEARCHER, EFFICIENT INFERENCE
ABOUT THE COMPANY
We're building autonomous research agents for recursive self-improvement (multi-agent systems that propose, run, and analyze machine learning experiments). We're a small team based in San Francisco, on-site.
ABOUT THE ROLE
You'll research how to make models efficient: quantization, speculative decoding, sparse and structured attention, distillation, mixture-of-experts inference, and the training-time techniques that make those methods possible. The work spans algorithm design, careful evaluation, and pushing methods to where they actually run.
This is a senior research role with a clear engineering edge. You'll spend time at the intersection of model architecture and inference performance, designing methods that move accuracy/latency/cost trade-offs in our favor, then partnering with engineers to make those wins real in production.
WHAT YOU'LL DO
Research and develop quantization methods: post-training quantization, quantization-aware training, mixed-precision regimes, low-bit-width arithmetic
Design and evaluate speculative decoding approaches: draft models, tree attention, parallel speculation, lookahead decoding
Investigate training-time efficiency methods that compose well with inference: distillation, sparse attention, mixture-of-experts, low-rank adaptation, pruning
Run controlled experiments at production scale; characterize what works on real workloads, not just toy benchmarks
Co-design methods with the inference engineering team: push results to where they actually run rather than stopping at the paper
Read deeply across the efficient ML / efficient inference literature; translate the most useful ideas into our stack
Publish when the work warrants it; share findings internally
Partner with model and training researchers so efficiency choices align with model architecture and post-training decisions
WHAT WE'RE LOOKING FOR
Strong track record of ML research on efficiency methods: quantization, speculative decoding, distillation, MoE, sparse attention, or adjacent
5+ years of hands-on research experience
Deep familiarity with both training and inference performance characteristics
Fluent in PyTorch, JAX, or equivalent; comfortable working at the kernel and serving-framework level when methods require it
Track record of moving efficiency research from prototype to production
Strong statistical expertise: you'd notice a flawed comparison before someone else points it out
Strong written communication
Published research at NeurIPS, ICML, ICLR, MLSys, or comparable venues
NICE TO HAVE
PhD in ML, systems, or related field
Open-source contributions to quantization, speculative-decoding, or efficient-inference libraries
Experience with hardware-aware optimization and accelerator-specific tooling
Background in numerical methods, low-precision arithmetic, or approximate computation
THIS ROLE IS PROBABLY NOT FOR YOU IF
You want to focus on pretraining large models from scratch (that's a different role)
You prefer abstract algorithmic research without hands-on implementation
You want a fixed benchmark with stable targets (our targets shift with what our models actually need to do)
Posted by Makermaker.ai on their own careers page. You apply directly, no recruiter in between.