Site Reliability Engineering Manager, Enterprise Technology Services
Apple Inc
Singapore, Singapore
Job posting number: #7289291 (Ref:apl-200574895)
Posted: October 22, 2024
Job Description
Summary
Imagine what we could do together. At Apple, new ideas have a way of becoming excellent products, services, and customer experiences very quickly. Bring passion and dedication to your job and there’s no telling what you could accomplish! The people here at Apple don’t just build products - they craft the kind of wonder that’s revolutionized entire industries. It’s the diversity of those people and their ideas that encourages the innovation that runs through everything we do, from amazing technology to industry-leading environmental efforts.
On the Enterprise Technology Services (ETS) team, we take pride in developing groundbreaking, world-changing platforms and services. Our ETS applications play a crucial role in supporting the Apple ecosystem by offering identity management, factory and device support, infrastructure support, platform support, and collaboration tools. Whether you're logging into Apple, making a purchase, or enabling Apple devices, our applications are there every step of the way, ensuring a flawless and secure experience.
As a Manager for Site Reliability and Operations (SRE), you will be responsible for leading a team of Site Reliability Engineers to ensure the reliability, scalability, and performance of production systems. This role combines technical expertise with leadership skills to drive operational excellence and foster a culture of collaboration and continuous improvement. As part of the role, you will work with the team to automate operations, optimize infrastructure, and troubleshoot issues in an exciting, fast-paced environment.
This role is designed for driven individuals who:
• Love learning new technologies and thrive in solving complex challenges.
• Are comfortable in a fast-paced, changing environment and able to manage competing priorities.
• Work effectively across teams and influence without authority.
• Are independent, motivated, and excited to take on ambitious projects.
• Excel at collaborating with engineering teams and can stay calm under pressure.
• Have a passion for delivering quality, reliable solutions in a dynamic, high-energy workplace.
This isn’t just another job—it’s a chance to shape the future of technology, contribute to mission-critical systems at a global scale, and grow alongside some of the brightest minds in the industry. Are you ready to make an impact?
Description
• Lead, mentor and develop a team of SREs.
• Foster a culture of reliability and excellence within the team.
• Promote continuous learning and knowledge sharing.
• Help the team build and maintain robust, highly available systems.
• Automate CI/CD processes.
• Ensure the availability and performance of production systems.
• Oversee incident response, post-mortem analysis, and root cause investigations.
• Implement and maintain service level objectives (SLOs) and service level indicators (SLIs).
• Work closely with development, product, and other engineering teams to ensure reliability is prioritized in the development lifecycle.
• Communicate effectively with stakeholders regarding reliability metrics, incident reports, and team progress.
• Develop and execute a strategic roadmap for the SRE team.
• Identify areas for improvement and propose solutions that align with business goals.
• Optimize resource allocation and usage for operational efficiency.
• Identify and assess risks to production systems and work to mitigate them.
• Establish and maintain disaster recovery and business continuity plans.
Minimum Qualifications
- BS degree or higher in Computer Science or a related field.
- 5+ years in a site reliability engineering, DevOps, or related role, with at least 2 years in a lead capacity.
- Strong understanding of systems architecture, cloud infrastructure, and monitoring tools.
- Proficiency in one or more programming languages, in particular Java.
- Proven experience in leading and mentoring engineering teams.
- Strong analytical skills and the ability to troubleshoot complex systems.
- Knowledge of the fundamentals of networking, databases, system administration, version control, and CI/CD automation.
- Strong problem-solving and communication skills.
Preferred Qualifications
- Knowledge of container-based technologies such as Docker, Kubernetes, or EKS.
- Knowledge of modern web services architectures and cloud platforms such as AWS and GCP.
- Exceptional analytical and troubleshooting skills in complex Unix/Linux system environments and application implementations.
- Ability to build tools from scratch.
- Ability to work in a collaborative environment.