Software Engineer, RL Training Infra

OpenAI · Generative AI company
San Francisco, United States · Mid level
Software Engineering

About the role

Maintain and scale large RL training runs, debugging and improving training infrastructure.

Work to keep large-scale reinforcement learning (RL) training runs fast, reliable, and unblocked by debugging and improving training, inference, orchestration, and distributed systems.

Key Responsibilities

  • Maintain and scale frontier RL training runs, addressing urgent engineering and infra problems.
  • Debug across training systems, inference, orchestration, and distributed infrastructure.
  • Improve reliability, efficiency, and tooling for RL training.
  • Support research-heavy integrations like multi-agent capabilities and memory.

Requirements

  • Strong generalist engineering skills with ML infrastructure experience.
  • Experience debugging distributed systems and scaling model training.
  • High attention to detail, rapid learning, and strong communication.
  • Ability to operate under tight timelines and take ownership of operational issues.
