
Lead AI Infrastructure Engineer
- Singapore
- Permanent
- Full-time
Responsibilities:
- Design and operate GPU-based infrastructure (e.g., NVIDIA GB200, H100) across cloud and self-hosted environments.
- Architect scalable inference platforms that support real-time and batch serving with high availability, load balancing, and fault tolerance.
- Integrate inference workloads with orchestration frameworks such as Kubernetes, Slurm, and Ray, as well as observability stacks like Prometheus, Grafana, and OpenTelemetry.
- Automate infrastructure provisioning and deployment using Terraform, Helm, and CI/CD pipelines.
- Collaborate with ML engineers to co-design systems optimized for low-latency serving, continuous batching, and advanced inference optimization techniques (quantization, distillation, pruning, KV caching).
- Lead client engagements by shaping technical roadmaps that align AI infrastructure with business objectives, ensuring compliance, scalability, and performance.
- Champion DevOps and agile practices to accelerate delivery while maintaining reliability, quality, and resilience.
- Mentor and guide teams in best practices for AI infrastructure engineering, fostering a culture of technical excellence and innovation.
Requirements:
- Expertise in GPU-based infrastructure for AI (H100, GB200, or similar), including scaling across clusters.
- Strong knowledge of orchestration frameworks: Kubernetes, Ray, Slurm.
- Experience with inference-serving frameworks (vLLM, NVIDIA Triton, DeepSpeed).
- Proficiency in infrastructure automation (Terraform, Helm, CI/CD pipelines).
- Experience building resilient, high-throughput, low-latency systems for AI inference.
- Strong background in observability and monitoring: Prometheus, Grafana, OpenTelemetry.
- Familiarity with security, compliance, and governance concerns in AI infrastructure (data sovereignty, air-gapped deployments, audit logging).
- Solid understanding of DevOps, cloud-native architectures, and Infrastructure as Code.
- Exposure to multi-cloud and hybrid deployments (AWS, GCP, Azure, sovereign/private cloud).
- Experience with benchmarking and cost/performance tuning for AI systems.
- Background in MLOps or collaboration with ML teams on large-scale AI production systems.
- Proven ability to partner with senior client stakeholders (CTO, CIO, COO) and translate technical strategy into business outcomes.
- Skilled at leading multi-disciplinary teams and building trust across diverse technical and business functions.
- Strong communication skills, with the ability to explain complex AI infrastructure concepts to both technical and non-technical audiences.
- Comfortable navigating uncertainty, making pragmatic decisions, and adapting quickly to evolving technologies.
- Passionate about creating scalable, sustainable, and high-impact solutions that help transform industries with AI.