[Remote] Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Qlik is a Gartner Magic Quadrant Leader in data integration and analytics, serving over 40,000 global customers. They are seeking a Site Reliability Engineer to ensure the security, stability, and scalability of their Qlik and Talend Cloud services, while tackling complex challenges and driving improvements in performance and reliability.
Responsibilities
- Take on the responsibility of maintaining the reliability and availability of our cloud platforms, tackling complex problems and driving improvements to enhance performance and scalability
- Work closely with our Engineering organization, collaborating with Architecture, Platforms, and Domains teams to design and develop new infrastructure features and optimize cloud-related practices
- Design and develop effective tooling, alerts, and responses to identify and address reliability risks, utilizing your expertise in cloud technology and backend systems
- Act as a resource for fellow engineers, sharing your knowledge and expertise in cloud engineering, production service operations, incident management, and troubleshooting
- Stay updated on the latest industry trends and technologies, contributing to the adoption of best practices and driving continuous improvement within our cloud environment
- Ensure high reliability and availability of our cloud platforms, collaborating with cross-functional teams to implement new infrastructure features and optimize performance
- Define and evangelize cloud-related optimizations and best practices, driving improvements in reliability, scalability, and performance
- Analyze complex issues at the infrastructure, systems, network, and application levels, making recommendations and decisions to resolve them effectively
- Share your expertise with fellow engineers, providing guidance on cloud technologies, automation, security, and best practices
- Participate in on-call duties to maintain the availability and performance of our cloud infrastructure, providing regular updates on project status and activities
Skills
- Bachelor's or Master's degree in Computer Science or a relevant field
- Self-motivated with the ability to work autonomously and multitask effectively
- Strong analytical skills for solving complex problems and driving innovative solutions
- 10+ years of experience in software engineering and Site Reliability Engineering, focused on large-scale distributed systems, cloud infrastructure, and production operations
- 5+ years' experience with Infrastructure as Code (IaC) tools such as Terraform, Crossplane, Ansible, or similar
- 5+ years' experience working alongside a production system running on Kubernetes
- 5+ years of professional experience in cloud engineering, preferably with AWS and Azure
- 5+ years of professional experience operating and/or building microservices
- Proficiency in scripting and automation (e.g., Bash, Python, Go, C#) and software engineering concepts
- Proficiency with CI/CD and with cloud and microservice autoscaling
- Proficiency with observability stack tooling such as Prometheus, OpenTelemetry, and distributed tracing, and with SIEM tools such as Splunk
- Proficiency with Helm, including but not limited to managing Helm charts, creating custom charts from existing ones, and building new charts
- Ability to provide technical leadership during troubleshooting efforts and effectively communicate issues, impact, and resolution plans to senior leadership
- Proficiency with cloud security best practices across infrastructure and platform services, including identity and access management, encryption, network segmentation, secrets management, and least-privilege access controls
- Proficiency with incident management best practices and the ability to confidently drive an incident in a critical production environment
- Knowledge of infrastructure security review and compliance frameworks
- Experience working with database concepts and tooling such as MongoDB, Redis, OpenSearch and RDS
- Demonstrated ability to collaborate with development teams and provide expert guidance on implementing reliability best practices, ensuring systems are robust, scalable, and highly available
- Knowledge of event-driven architecture (e.g., pub/sub)
- Where applicable, experience with or interest in learning other tools such as ClickHouse, FireHydrant, Solace, Gloo, Istio, and other cloud-native tools
- Ability to obtain sufficient clearance status to work on IL5 systems with Qlik support
- Due to this requirement, must be a U.S. citizen or in the process of becoming one by January 2027
- Excellent English communication skills, both oral and written
- Curiosity and desire to learn
- Ability to take a rotating on-call shift (24/7)
- Certifications such as CKD, CKS, AWS Certified Solutions Architect Associate/Professional, AWS Certified Advanced Networking Specialty, AWS Certified Security Specialty
- Experience supporting FedRAMP or DoD IL4 certification initiatives by implementing security controls, driving audit readiness, and operationalizing compliant cloud infrastructure
- Experience with self-hosted Temporal workflow infrastructure, including deployment, upgrades, scaling, monitoring, troubleshooting, and performance optimization across Kubernetes environments
Benefits
- Medical, dental, and vision coverage
- Life and AD&D insurance
- Short- and long-term disability coverage
- Paid time off
- Paid parental / maternity leave
- Participation in a 401(k) program that includes company match
- Many additional voluntary benefits
- Genuine career progression pathways and mentoring programs
- Culture of innovation, technology, collaboration, and openness
- Flexible, diverse, and international work environment
- Extra “change the world” day plus another for personal development
- Participation in our Corporate Responsibility Employee Programs