RESEARCHER, EFFICIENT INFERENCE

San Francisco | Full-time | Posted 1mo ago

ABOUT THE COMPANY

We're building autonomous research agents for recursive self-improvement: multi-agent systems that propose, run, and analyze machine learning experiments. We're a small team based in San Francisco, working on-site.

ABOUT THE ROLE

You'll research methods for making models efficient: quantization, speculative decoding, sparse and structured attention, distillation, mixture-of-experts inference, and the training-time techniques that make those methods possible. The work spans algorithm design, careful evaluation, and pushing methods to where they actually run.

This is a senior research role with a clear engineering edge. You'll work at the intersection of model architecture and inference performance, designing methods that move accuracy/latency/cost trade-offs in our favor, then partnering with engineers to make those wins real in production.

WHAT YOU'LL DO

Research and develop quantization methods: post-training quantization, quantization-aware training, mixed-precision regimes, low-bit-width arithmetic
Design and evaluate speculative decoding approaches: draft models, tree attention, parallel speculation, lookahead decoding
Investigate training-time efficiency methods that compose well with inference: distillation, sparse attention, mixture-of-experts, low-rank adaptation, pruning
Run controlled experiments at production scale; characterize what works on real workloads, not just toy benchmarks
Co-design methods with the inference engineering team: push results to where they actually run rather than stopping at the paper
Read deeply across the efficient ML / efficient inference literature; translate the most useful ideas into our stack
Publish when the work warrants it; share findings internally
Partner with model and training researchers so efficiency choices align with model architecture and post-training decisions

WHAT WE'RE LOOKING FOR

Strong track record of ML research on efficiency methods: quantization, speculative decoding, distillation, MoE, sparse attention, or adjacent areas
5+ years of hands-on research experience
Deep familiarity with both training and inference performance characteristics
Fluent in PyTorch, JAX, or equivalent; comfortable working at the kernel and serving-framework level when methods require it
Track record of moving efficiency research from prototype to production
Strong statistical expertise: you'd notice a flawed comparison before someone else points it out
Strong written communication
Published research at NeurIPS, ICML, ICLR, MLSys, or comparable venues

NICE TO HAVE

PhD in ML, systems, or a related field
Open-source contributions to quantization, speculative-decoding, or efficient-inference libraries
Experience with hardware-aware optimization and accelerator-specific tooling
Background in numerical methods, low-precision arithmetic, or approximate computation

THIS ROLE IS PROBABLY NOT FOR YOU IF

You want to focus on pretraining large models from scratch (that's a different role)
You prefer abstract algorithmic research without hands-on implementation
You want a fixed benchmark with stable targets (our targets shift with what our models actually need to do)

Posted by Makermaker.ai on their own careers page. You apply directly, with no recruiter in between.