[Remote] DevOps Engineer - Atlanta, GA, Birmingham, AL, Louisville, KY, Richmond, VA, Charlotte, NC
Note: This is a remote position open to candidates in the USA. Dice is seeking an experienced Site Reliability Engineer (SRE) / DevOps Engineer with expertise in Incident Management and cloud-native platforms. The role involves ensuring the reliability and performance of distributed systems, managing incident response, and implementing automation and governance strategies.
Responsibilities
- Manage and improve platform reliability, availability, and performance across production environments
- Lead and participate in incident management, root cause analysis, remediation planning, and post-incident reviews
- Drive change control processes and ensure operational governance standards are followed
- Monitor and manage error budgets while implementing reliability improvements
- Design, build, and maintain scalable cloud infrastructure and automation frameworks
- Deploy and manage containerized applications using Kubernetes and Docker
- Develop and maintain CI/CD pipelines to support efficient software delivery
- Implement Infrastructure as Code (IaC) solutions for automated provisioning and configuration management
- Establish observability strategies using monitoring, logging, and alerting platforms
- Collaborate with development, infrastructure, security, and business teams to ensure platform stability
- Troubleshoot complex production issues across cloud, networking, infrastructure, and application layers
- Continuously improve operational processes, automation, and system resilience
Skills
- 7+ years of experience in Site Reliability Engineering (SRE), DevOps, Cloud Infrastructure, or Production Operations
- Strong experience managing workloads in cloud environments: Microsoft Azure, Amazon Web Services (AWS), Google Cloud Platform (GCP)
- Hands-on experience with: Kubernetes, Docker, CI/CD Pipelines, Infrastructure as Code (IaC)
- Strong scripting and automation expertise using: Python, Bash, PowerShell, Go (Golang)
- Experience with observability and monitoring platforms: Datadog, Grafana, Prometheus, Splunk
- Strong understanding of: Networking concepts, Linux Administration, Windows Administration, Distributed Systems, Cloud-Native Architectures
- Experience with: Incident Response, Production Troubleshooting, Operational Governance
- Experience implementing reliability engineering best practices and SRE methodologies
- Experience supporting large-scale enterprise production environments
- Familiarity with high-availability and disaster recovery architectures
- Experience automating operational workflows and infrastructure management
- Knowledge of security best practices within cloud environments
- Experience working in Agile and DevOps-driven organizations