[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Ellucian powers innovation for higher education, serving over 21 million students globally. The company is seeking a Senior Site Reliability Engineer to ensure the reliability, performance, and cost-efficiency of its production systems, with a focus on DevOps practices and incident management.
Responsibilities
- Own and improve system reliability, availability, and performance for production environments
- Design, implement, and manage monitoring, alerting, and observability using Datadog (required)
- Lead incident response efforts, including troubleshooting, mitigation, and post-incident reviews
- Perform detailed root cause analysis (RCA) and drive permanent resolutions
- Partner with engineering and DevOps teams to build scalable, resilient infrastructure
- Automate operational processes to improve efficiency and reduce risk
- Analyze and optimize infrastructure and application costs
- Define and manage SLIs/SLOs to meet reliability targets
- Continuously improve deployment, monitoring, and operational practices
Skills
- 5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
- Strong, hands-on expertise with Datadog (APM, logs, metrics, dashboards, alerting)
- Experience with cloud platforms (AWS, Azure, or GCP)
- Proficiency in DevOps practices and tools (CI/CD, Infrastructure as Code such as Terraform)
- Strong troubleshooting skills and experience conducting root cause analysis in distributed systems
- Experience with containers and orchestration (Docker, Kubernetes)
- Scripting or programming experience (Python, Bash, or similar)
- Proven ability to analyze and optimize cloud costs
- Experience with cost management tools (e.g., AWS Cost Explorer, Azure Cost Management)
- Familiarity with cloud security and compliance best practices
- Experience supporting high-availability, customer-facing systems
- Strong collaboration and communication skills
Benefits
- Comprehensive health coverage: medical, dental, and vision
- Flexible time off
- Thrive Flex Lifestyle Account (LSA) that allows you to contribute towards your health, financial or learning interests
- 401(k) with match and BrightPlan to help you save for the future
- Parental Leave
- 5 charitable days to support the community that supports us
- Telemedicine
- Wellness:
  - Headspace Care (mental health)
  - Wellbeats (virtual fitness classes)
  - RethinkCare & Wellthy (caregiver support)
- Diversity and inclusion programs which provide access to internal employee resource groups
- Employee referral bonuses to encourage the addition of great new people to the team
- A learning culture fostered through:
  - Education Assistance Program
  - Professional development opportunities