Software Engineer, RL Training Infra
OpenAI
Generative AI company
San Francisco, United States
Mid
Software Engineering
About the role
Maintain and scale large RL training runs, debugging and improving training infrastructure.
Keep large-scale reinforcement learning (RL) training runs fast, reliable, and unblocked by debugging and improving the training, inference, orchestration, and distributed systems behind them.
Key Responsibilities
- Maintain and scale frontier RL training runs, addressing urgent engineering and infrastructure problems.
- Debug across training systems, inference, orchestration, and distributed infrastructure.
- Improve reliability, efficiency, and tooling for RL training.
- Support research-heavy integrations such as multi-agent capabilities and memory.
Requirements
- Strong generalist engineering skills with ML infrastructure experience.
- Experience debugging distributed systems and scaling model training.
- High attention to detail, the ability to learn quickly, and strong communication skills.
- Ability to operate under tight timelines and take ownership of operational issues.