[Remote] Principal Software Engineer, DGX Cloud Production Engineering
Note: This is a remote position open to candidates in the USA. NVIDIA is a leader in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The company is looking for Principal Software Engineers to help shape the technical direction for production engineering, Kubernetes-based operations, automation, and reliability across large-scale GPU clusters.
Responsibilities
- Define and execute the technical strategy for DGX Cloud cluster operations, building the automation, GitOps, and Day 2 reliability needed to operate large-scale GPU clusters across NVIDIA Cloud Partners (NCPs) and on-prem environments
- Lead design and implementation of systems for cluster lifecycle, validation, repair, upgrades, observability, and readiness
- Establish patterns for Kubernetes-based GPU cluster operations across partner and on-prem environments
- Identify and eliminate operational toil through software, APIs, automation, and agent-assisted workflows
- Set technical standards for production readiness, SLOs, incident response, handoff gates, and operational acceptance
- Mentor engineers and influence platform, infrastructure, storage, networking, security, and workload teams
Skills
- 15+ years of experience building and operating large-scale distributed systems or cloud infrastructure
- Deep experience with Kubernetes, Linux, infrastructure automation, and production operations
- Strong programming experience in Go, Python, or similar
- Proven ability to lead complex cross-org technical initiatives
- Experience designing reliable systems with clear SLOs, observability, incident response, and automation
- BS/MS in Computer Science or equivalent experience
- Experience with GPU clusters, AI/ML infrastructure, Kubernetes operators, GitOps, BMaaS/VMaaS, managed Kubernetes, or multi-cloud fleet operations
- Experience building internal platforms, control planes, lifecycle automation, or production readiness frameworks
- Track record of turning operational pain into reusable software, APIs, and engineering standards
Benefits
- Equity
- Benefits