[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Docusign brings agreements to life and serves over 1.5 million customers globally. The company is seeking a Senior Site Reliability Engineer to lead reliability initiatives for critical services, ensuring performance and scalability while driving improvements in observability and incident response.
Responsibilities
- Design, implement, and operate highly available, scalable services in cloud environments (primarily Azure, with some multi‑cloud scenarios)
- Define and evolve SLOs/SLIs, error budgets, and capacity strategies for owned services; use them to guide engineering trade‑offs and release decisions
- Analyze patterns in incidents and outages; own long‑term reliability improvements for your domain and contribute to reliability strategy across services
- Write high quality code that is easy to maintain and test
- Ensure design and architecture are extensible across projects, and participate in technical design and code reviews
- Identify operational toil and lead automation efforts to eliminate it—deployment, runbook, and remediation workflows that make incidents rarer and faster to resolve
- Develop robust, well‑tested tooling and shared libraries that are adopted across multiple teams
- Improve CI/CD pipelines and guardrails to reduce change failure rate while increasing deployment velocity
- Design and implement logging, metrics, tracing, and alerting for complex distributed systems; ensure signals are actionable and aligned to business impact
- Build and automate tools and solutions for incident impact analysis and effective mitigation
- Participate in and often lead incident response for Sev0–Sev2 events: triage, mitigation, coordination, and clear communication
- Perform and contribute to blameless post‑incident reviews, root‑cause analysis, and follow‑through on corrective actions
- Work with Operations and Incident Command teams during and after incidents to drive excellence in the incident management process
- Compose and analyze dashboards to highlight areas of the business that need attention and help drive organizational KPIs
- Create and respond to system-generated alerts to maintain system health
- Work with Operations and Engineers to fill any gaps in alerting and telemetry
- Act as the primary SRE partner for one or more engineering teams—shaping architecture, reviewing designs, and embedding reliability best practices
- Mentor and coach other SREs and software engineers on topics such as debugging, observability, incident management, and performance optimization
- Contribute to and help standardize SRE practices, runbooks, and production readiness criteria across CPE and product teams
- Work with Product Management, collaborators and other developers to understand design requirements and provide estimates for development
- Learn and grow in all key technologies at Docusign and be a partner to Engineering and Operations teams
Skills
- 8+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering roles with ownership of production systems at scale (or equivalent experience)
- Experience coding in at least one modern language (e.g., Go, Python, C#, Java), with the ability to design, implement, test, and debug production‑grade automation and services
- Practical experience operating large‑scale services in public cloud (Azure preferred; AWS/GCP acceptable with willingness to learn Azure)
- Experience with Linux, networking fundamentals, and common infrastructure components (load balancers, DNS, certificates, queues, caches, databases)
- Experience with Observability stacks (e.g., Prometheus/Grafana, OpenTelemetry/Chronicle, centralized logging)
- Experience with CI/CD systems and deployment strategies (blue/green, canary, rolling updates)
- Experience with incident management and on‑call operations for 24x7 services
- Experience building dashboards and analyzing metrics
- Strong analytical and problem-solving skills
- Experience in high‑availability, regulated, or customer‑facing SaaS environments
- Background in reliability practices such as chaos testing, capacity modeling, and performance tuning
- Exposure to release management/unified release practices and safe rollout strategies (feature flags, staged rollouts, configuration‑driven changes)
- Demonstrated leadership driving cross‑team initiatives: reliability programs, migrations, or major refactors
- Strong written and verbal communication skills; ability to explain complex technical topics to both engineers and non‑technical stakeholders
Benefits
- Bonus: Sales personnel are eligible for variable incentive pay dependent on their achievement of pre-established sales goals. Non-Sales roles are eligible for a company bonus plan, which is calculated as a percentage of eligible wages and dependent on company performance.
- Stock: This role is eligible to receive Restricted Stock Units (RSUs).
- Paid Time Off: earned time off, as well as paid company holidays based on region
- Paid Parental Leave: take up to six months off with your child after birth, adoption or foster care placement
- Full Health Benefits Plans: options for 100% employer paid and minimum employee contribution health plans from day one of employment
- Retirement Plans: select retirement and pension programs with potential for employer contributions
- Learning and Development: options for coaching, online courses and education reimbursements
- Compassionate Care Leave: paid time off following the loss of a loved one and other life-changing events