Software Engineer, RL Training Infra

OpenAI (Generative AI company)

San Francisco, United States (Mid)

Software Engineering

About the role

Maintain and scale large RL training runs, debugging and improving training infrastructure.

Work to keep large-scale reinforcement learning (RL) training runs fast, reliable, and unblocked by debugging and improving training, inference, orchestration, and distributed systems.

Key Responsibilities

•Maintain and scale frontier RL training runs, addressing urgent engineering and infra problems.
•Debug across training systems, inference, orchestration, and distributed infrastructure.
•Improve reliability, efficiency, and tooling for RL training.
•Support research-heavy integrations such as multi-agent capabilities and memory.

Requirements

•Strong generalist engineering skills with ML infrastructure experience.
•Experience debugging distributed systems and scaling model training.
•High attention to detail, rapid learning, and strong communication.
•Ability to operate under tight timelines and take ownership of operational issues.

Level: Mid

Location: San Francisco, United States