
Lead Site Reliability Engineer, Cloud Technology
- Singapore
- Permanent
- Full-time
- Engage in and improve the lifecycle of cloud services, from inception and design through deployment and operation
- Automate repeated manual tasks, develop tools and automation to improve the efficiency of the platform and infrastructure.
- Analyze defects, propose improvements and drive efficiencies in systems and processes.
- Help develop new cloud engineering strategies and implementations for the firm
- As part of Site Reliability, ensure the reliability, availability, and performance of the cloud infrastructure and platform.
- Demonstrate site reliability principles and practices every day and champion the adoption of site reliability throughout your team
- Develop observability and telemetry tools.
- Author and improve the quality of technical engineering documentation
- Debug and solve issues in a production environment
- Participate in SRE on-call rotations and escalation workflows.
- Formal training or certification in software engineering or site reliability engineering and 5+ years of applied experience
- Bachelor's Degree in Computer Science or equivalent
- Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices with the ability to implement these practices within an application or platform
- Expertise in building solutions with AWS cloud services.
- Knowledge of Infrastructure as Code tools such as Terraform
- Fluency in at least one programming language such as Python or Java.
- Proficiency and experience in observability, including white-box and black-box monitoring, SLO alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.
- Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform)
- Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker)
- Experience with troubleshooting common networking technologies and issues
- Ability to identify and solve problems related to complex data structures and algorithms
- Drive to self-educate and evaluate new technology
- Ability to teach new programming languages to team members
- Ability to collaborate across different levels and stakeholder groups
- Excellent communication skills for working with stakeholders and domain experts across the company to design solutions to user problems
- Self-disciplined, self-managed, and self-motivated, with a strong sense of ownership, urgency, and drive
- AWS certifications are a bonus.