[Remote] Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Talener is a fast-growing healthcare technology organization seeking a Site Reliability Engineer (SRE) to help scale and support a high-impact cloud platform focused on improving healthcare delivery nationwide. This role is critical to strengthening platform reliability, operational efficiency, observability, and automation across production environments.
Responsibilities
- Ensure the reliability, scalability, performance, and security of cloud-based infrastructure and applications
- Monitor, troubleshoot, and resolve production platform and application issues across distributed systems
- Lead incident response efforts, root cause analysis, and blameless post-mortems
- Build and maintain operational runbooks and automated remediation workflows
- Develop and enhance observability and telemetry solutions for proactive monitoring and alerting
- Collaborate closely with engineering, DevOps, QA, security, and operations teams to improve platform health and deployment processes
- Support infrastructure automation and configuration management initiatives
- Contribute to infrastructure-as-code (IaC) practices and CI/CD operational improvements
- Promote best practices around reliability engineering, incident management, and operational excellence
- Participate in an on-call rotation supporting production systems, including occasional off-hours support for West Coast operations
Skills
- 5+ years of experience in Site Reliability Engineering, DevOps, Cloud Infrastructure, or related disciplines
- Strong experience troubleshooting and supporting production environments
- Hands-on experience with observability and monitoring platforms such as Datadog, New Relic, or similar tools
- Experience working within Azure-based cloud environments and modern containerized infrastructure
- Knowledge of Docker, Kubernetes, and cloud-native application hosting environments
- Experience with infrastructure-as-code tools such as Terraform, Terragrunt, or OpenTofu
- Strong scripting and automation experience using PowerShell, Python, JavaScript, or similar languages
- Experience with source control and CI/CD tooling (Git, Azure DevOps, etc.)
- Understanding of cloud security principles, compliance frameworks, and operational best practices
- Strong collaboration and communication skills within Agile engineering environments
- Experience improving operational visibility through telemetry, dashboards, reports, and alerting systems
- Experience evolving incident response processes and operational tooling
- Passion for mentoring others and promoting operational excellence across teams
- Strong problem-solving mindset with a focus on continuous improvement and automation
Benefits
- Opportunity to work on mission-driven technology with meaningful real-world impact
- Collaborative engineering culture focused on innovation, reliability, and continuous learning
- Flexible environment that supports work-life balance while maintaining operational excellence