RESEARCHER, POST-TRAINING
ABOUT THE COMPANY
We're building autonomous research agents for recursive self-improvement (multi-agent systems that propose, run, and analyze machine learning experiments). We're a small team based in San Francisco, working on-site.
ABOUT THE ROLE
You'll lead our work on model post-training: supervised fine-tuning, preference data, reinforcement learning from human and AI feedback, reward modeling, and the evaluation suites that tell us what's actually working. You'll own a research area that meaningfully shapes our model behavior and capability.
This is a hands-on senior research role. You'll set direction, run experiments, and ship into production. You'll partner with the data, infrastructure, and engineering teams to make the post-training pipeline reliable and fast: improvements there compound into every model we ship.
WHAT YOU'LL DO
Lead post-training research: SFT, RLHF/RLAIF, RLVR, DPO and successor methods, reward modeling, preference data design
Design and curate the data that goes into post-training (from sourcing, to filtering, to quality assessment)
Build and maintain the evaluation suites that measure what matters; resist Goodharting your own benchmarks
Run rigorous experiments (controls, ablations, statistical significance) and write up internal findings clearly
Scale data pipelines, and work with the infrastructure team to scale training
Identify and characterize failure modes (reward hacking, distribution drift, eval saturation) and design experiments to address them
Stay current on the post-training literature; bring useful methods in, ignore the noise
WHAT WE'RE LOOKING FOR
Strong track record of post-training research (SFT, RL, reward modeling) at a frontier-model lab or equivalent
5+ years of hands-on ML research experience
Comfort with large-scale data curation and preference-data pipelines
Experience designing evaluation suites for capabilities that aren't easily benchmarked
Fluent in PyTorch or equivalent; comfortable at the scale of distributed training
Strong statistical instincts: you'd notice a flawed comparison before someone else points it out
Strong written communication
NICE TO HAVE
PhD in ML, statistics, CS, or an adjacent field
Published research at NeurIPS, ICML, ICLR, COLM, RLC, or comparable venues
Experience with reward hacking detection, scaling reward models, or RLHF infrastructure
Synthetic data generation experience
Background in RL math (policy gradients, importance sampling, off-policy methods)
Open-source contributions to post-training infrastructure
THIS ROLE IS PROBABLY NOT FOR YOU IF
You're primarily interested in pretraining (that's a different role)
You'd rather invent novel methods in isolation than ship them into a model that real users run
You prefer stable benchmarks to evaluation work where the right answer isn't yet defined
Posted by Makermaker.ai on their own careers page; you apply directly, no recruiter in between.