[Remote] Senior Systems Engineer, Storage - DGX Cloud
Note: This is a remote position open to candidates in the USA. NVIDIA is a leading technology company known for its innovative GPU cloud services. The Senior Systems Engineer will design, deploy, and operate solutions on Kubernetes for large-scale storage and data platforms, ensuring reliability and performance through automation and observability.
Responsibilities
- Design, deploy, and operate solutions on Kubernetes for large-scale storage and data platforms, including the manifests, Helm charts, and operators that run them
- Build tools, services, and automation that improve the lifecycle of storage and data systems – from provisioning and configuration through deployment, scaling, and day-2 operations
- Develop and operate telemetry and observability for production systems – metrics, logging, tracing, dashboards, and alerting – so that system health, availability, and latency are measurable and actionable
- Apply strong analytical troubleshooting skills to diagnose and resolve complex issues across distributed, containerized infrastructure
- Work closely with peers and partner teams to improve the lifecycle of services, from inception and design through deployment, operation, and refinement
- Scale systems sustainably through automation, infrastructure-as-code, and CI/CD, and evolve systems by pushing for changes that improve reliability and velocity
- Support services before they go live through activities such as deployment automation, capacity planning, and launch and readiness reviews
- Practice sustainable incident response and postmortems, and participate in an on-call rotation to support production systems
Skills
- BS degree (or equivalent experience) in Computer Science or related technical field involving coding
- 12+ years of practical experience
- Hands-on experience with Kubernetes – deploying, configuring, and operating workloads and solutions on Kubernetes in production
- Experience building tools and services for storage, data, or platform infrastructure, with solid software design fundamentals (algorithms, data structures, complexity analysis) on large-scale Linux-based systems
- Experience building and operating telemetry and observability using tools such as Prometheus, InfluxDB, Grafana, and the Elastic stack
- Strong analytical troubleshooting skills with a systematic, root-cause-driven approach to identifying and resolving complex problems
- Proficiency in one or more of the following: Python, Go, or Java
- Good knowledge of infrastructure configuration management and infrastructure-as-code tools such as Ansible, Chef, Puppet, ArgoCD, Git Pipelines, and Terraform
- Customer-first mindset with a focus on customer satisfaction and a passion for ensuring customer success
- Experience with Git, code review, pipelines, and CI/CD
- Experience using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker
- Interest in crafting, analyzing, and fixing large-scale distributed systems, with strong debugging skills and a systematic problem-solving approach
- Experience designing storage- or data-focused tooling and automating its operations at scale
- Ability to thrive in collaborative environments, work well with various teams, and adapt to different working styles
Benefits
- You will be eligible for equity and [benefits](https://www.nvidia.com/en-us/benefits/).