
Manager, Incident Response & Management
- Singapore
- $208,000 per year
- Permanent
- Full-time
- Lead the global 24/7 team of regional managers and incident response managers, while staying hands-on enough to support frontline on-call responders with speed, cross-functional collaboration, and escalation.
- Develop and own Stripe's incident response and management strategy and cross-functional roadmap, ensuring it aligns with the company's reputation for reliability.
- Spearhead and manage Stripe's AI-First strategy for automating incident response workflows, partnering with the engineering team to implement the required tooling enhancements.
- Improve Stripe's incident response by analyzing user-facing incidents, extracting actionable insights and learnings, and leading the implementation of the resulting improvements.
- Collaborate closely with executive leadership, engineering, and operations teams to lead significant programs and reshape workflows and metrics for reliability and incident operations.
- Own the relevant time-to-X (TTx) metrics, particularly those related to communication and escalation, and collaborate with engineering leadership to implement the improvements each metric requires.
- Develop user-focused metrics and data to guide Stripe's incident response, reliability strategy, and user communications (including root cause analyses, or RCAs), ensuring impactful decision-making.
- 5+ years of management experience, including 2+ years managing managers, with a proven record of building, growing, and transforming teams.
- Extensive experience (4+ years) leading incident response for complex, large-scale distributed services with high SLOs/SLAs, coupled with deep expertise in crisis management.
- Demonstrated ability to lead, to influence other leaders, and to deliver complex strategic projects involving multiple stakeholders.
- Strong analytical skills and the ability to use data to drive business decisions.
- Proficiency in basic incident troubleshooting and a working understanding of system architecture; fluency in SQL, Splunk, or similar query languages.
- Exceptional communication abilities, capable of adapting incident updates for diverse audiences (executives, external users, internal teams).
- Affinity for a fast-paced work environment, crafting strategic and rapid fixes to high-intensity problems with a keen eye for detail and a high bar for quality.
- Comfort navigating ambiguity while identifying areas for process improvement and establishing best practices.
- Experience managing geographically dispersed teams.
- Experience using infrastructure and application monitoring tools such as Prometheus and Sentry.
- Experience in incident response at a high-growth technology company, preferably within the payments or e-commerce sectors.
- Proven ability to apply agentic and generative AI to transform incident response, coupled with a strong grasp of current industry trends in the incident response domain.
- Demonstrated history of driving engineering and process enhancements to improve incident response efficiency within a rapidly expanding technology organization.