[Remote] Principal Engineer, Compute Platform
Note: This is a remote position open to candidates in the USA. Pinterest is a platform that inspires creativity and innovation, and it is seeking a Principal Engineer to lead the consolidation and modernization of its compute infrastructure. This role involves designing and building a shared compute platform to support large-scale workloads, improving operational efficiency, and collaborating with various teams to meet unique customer needs.
Responsibilities
- Solving the challenges of replacing isolated pools of dedicated compute resources with a very large-scale shared compute platform, shifting from machine-based designs to container-based designs
- Working with leads across various platforms, especially stateful and data platforms, to build the right features and migration paths that work for them
- Owning and driving up utilization on the shared compute platform by designing and implementing workload stacking, bin-packing optimization, safe oversubscription, and similar techniques
- Working with multiple customers with unique requirements to make sure the platform addresses their needs and is not only a viable but a desirable solution for running their workloads
- Leading a group of engineers around design topics, execution, trade-offs, migration paths, observability, performance, and operability for the platform
- Evolving the platform towards a multi-cloud abstraction layer to enable running workloads across multiple cloud providers
- Serving as a role model by setting a high bar for production quality and engineering excellence in delivering a foundational technology that empowers the entire company
- Working closely with partners around capacity planning, cost visibility, fungibility of virtual machine instance types, and efficiency
- Putting special focus on the delivery of GPU resources through the platform, to enable and expedite AI workloads
- Leveraging AI tools to increase the velocity and ease of migrations, and creating self-service solutions for the customers of the platform as needed
- Helping the team apply AI to the operational aspects of running the cluster, including discovering, investigating, and root-causing issues
- Expediting feature development using AI coding tools and being a thought leader on striking the right balance between speed and safety by designing safeguards and layers of defense
Skills
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience
- 12+ years of relevant industry experience with large-scale production distributed systems
- 5+ years of experience with Kubernetes in production
- Experience working across SWE and SRE or Production Engineering teams to deliver robust production systems
- Ability to work with cross-functional partners across multiple organizations
- Passion for automation, reducing toil, and building proper tooling for getting the job done
- Experience with running distributed data systems and migrating them to Kubernetes is highly preferred
Benefits
- The position is eligible for equity.
- Information regarding the culture at Pinterest and benefits available for this position can be found here.
- In-Office Requirement Statement: This role requires being in the office for in-person collaboration 1-2 times per quarter and can therefore be based anywhere in the country.