Job Title: ML Platform Engineer - GPU Infrastructure
Job Summary
Support the team by designing, implementing, and maintaining the automation and ML workload enablement layer of the GPU cluster platform. This role focuses on optimizing GPU compute environments for AI/ML training and Isaac Sim simulation workloads, integrating GPU jobs into CI/CD pipelines, standardizing runtime environments, and supporting reliable storage and artifact management.
Required Experience
3 years of experience in ML Platform Engineering, DevOps, Infrastructure Engineering, or a related field
Bachelor's or Master's degree in Systems Engineering, Computer Science, Computer Engineering, or a related discipline
Responsibilities
Support GPU cluster platforms for AI/ML and simulation workloads
Optimize GPU compute environments for ML training and Isaac Sim execution
Integrate GPU workload execution into CI/CD pipelines
Standardize runtime environments using containers and automation tools
Manage storage, artifacts, and workload outputs
Troubleshoot and improve platform reliability, scalability, and performance
Collaborate with ML, infrastructure, and engineering teams
Required Skills
Experience with Linux, Kubernetes, Docker, and GPU infrastructure
Knowledge of CI/CD tools and automation scripting (Python/Bash)
Experience supporting AI/ML workloads and distributed systems
Familiarity with NVIDIA GPU technologies and containerized environments
Strong troubleshooting and performance optimization skills
Preferred Skills
Experience with Isaac Sim or simulation workloads
Exposure to cloud platforms (AWS, Azure, or GCP)
Knowledge of monitoring and observability tools such as Grafana or Prometheus