We are seeking a highly skilled GPU Cluster Software Engineer with strong expertise in VMware and CPU/GPU cluster technologies. This engineer will play a critical role in designing, implementing, and managing high-performance compute clusters that support advanced workloads, including AI, ML, and HPC applications.
Key Responsibilities
- Design, deploy, and manage enterprise-scale CPU/GPU clusters for high-performance workloads.
- Configure, maintain, and optimize VMware virtualization platforms (vSphere, ESXi, vCenter, vSAN).
- Integrate GPU virtualization technologies (e.g., NVIDIA GRID, vGPU) into VMware environments.
- Perform performance tuning, capacity planning, and resource optimization for compute clusters.
- Implement automation and orchestration tools to streamline cluster operations and provisioning.
- Monitor, troubleshoot, and optimize cluster performance to ensure system reliability.
- Collaborate with research and engineering teams to support compute-intensive applications (AI/ML/HPC).
- Ensure system scalability, security, and efficiency across multi-user environments.
Required Skills & Qualifications
- Hands-on expertise with VMware virtualization technologies (vSphere, ESXi, vCenter, vSAN).
- Proven experience in building and managing CPU/GPU clusters in enterprise or research environments.
- Strong knowledge of GPU virtualization (NVIDIA GRID, vGPU) and integration with VMware.
- Proficiency in cluster monitoring, troubleshooting, and optimization.
- Solid understanding of networking and storage concepts in clustered environments.
- Experience supporting compute-intensive workloads such as AI, ML, or HPC.
- Familiarity with automation/orchestration tools (e.g., Ansible, Terraform, Kubernetes, or similar).
- Excellent problem-solving skills and the ability to work in a fast-paced, collaborative environment.
Education
- Master's or Ph.D. in Computer Science, Computer Engineering, or a related field.