RESEARCHER, EFFICIENT INFERENCE

Makermaker.ai

San Francisco, Full-time, Posted 1mo ago

ABOUT THE COMPANY

We're building autonomous research agents for recursive self-improvement (multi-agent systems that propose, run, and analyze machine learning experiments). We're a small team based in San Francisco, working on-site.

ABOUT THE ROLE

You'll research how to make models more efficient: quantization, speculative decoding, sparse and structured attention, distillation, mixture-of-experts inference, and the training-time techniques that make those methods possible. The work spans algorithm design, careful evaluation, and pushing methods to where they actually run.

This is a senior research role with a clear engineering edge. You'll spend time at the intersection of model architecture and inference performance, designing methods that move accuracy/latency/cost trade-offs in our favor (then partnering with engineers to make those wins real in production).

WHAT YOU'LL DO

  • Research and develop quantization methods: post-training quantization, quantization-aware training, mixed-precision regimes, low-bit-width arithmetic

  • Design and evaluate speculative decoding approaches: draft models, tree attention, parallel speculation, lookahead decoding

  • Investigate training-time efficiency methods that compose well with inference: distillation, sparse attention, mixture-of-experts, low-rank adaptation, pruning

  • Run controlled experiments at production scale; characterize what works on real workloads, not just toy benchmarks

  • Co-design methods with the inference engineering team: push results to where they actually run rather than stopping at the paper

  • Read deeply across the efficient ML / efficient inference literature; translate the most useful ideas into our stack

  • Publish when the work warrants it; share findings internally

  • Partner with model and training researchers so efficiency choices align with model architecture and post-training decisions

WHAT WE'RE LOOKING FOR

  • Strong track record of ML research on efficiency methods: quantization, speculative decoding, distillation, MoE, sparse attention, or adjacent

  • 5+ years of hands-on research experience

  • Deep familiarity with both training and inference performance characteristics

  • Fluent in PyTorch, JAX, or equivalent; comfortable working at the kernel and serving-framework level when methods require it

  • Track record of moving efficiency research from prototype to production

  • Strong statistical expertise: you'd notice a flawed comparison before someone else points it out

  • Strong written communication

  • Published research at NeurIPS, ICML, ICLR, MLSys, or comparable venues

NICE TO HAVE

  • PhD in ML, systems, or related field

  • Open-source contributions to quantization, speculative-decoding, or efficient-inference libraries

  • Experience with hardware-aware optimization and accelerator-specific tooling

  • Background in numerical methods, low-precision arithmetic, or approximate computation

THIS ROLE IS PROBABLY NOT FOR YOU IF

  • You want to focus on pretraining large models from scratch (that's a different role)

  • You prefer abstract algorithmic research without hands-on implementation

  • You want a fixed benchmark with stable targets (our targets shift with what our models actually need to do)

Posted by Makermaker.ai on their own careers page; you apply directly, with no recruiter in between.
