
Cloud DevOps Engineer #GeneralInternship
- Singapore
- Permanent
- Full-time
- Assist in deploying and supporting GPU clusters for AI and ML workloads.
- Support automation tasks for provisioning GPU resources in on-prem and cloud platforms.
- Learn and contribute to CI/CD pipeline setup for AI models and GPU-accelerated applications.
- Monitor basic cluster usage, health, and performance under supervision.
- Assist in automating infrastructure provisioning and monitoring.
- Support troubleshooting of system-level issues (e.g., Slurm, Kubernetes, GPU drivers, CUDA, InfiniBand networking) with guidance from senior engineers.
- Participate in system benchmarking and stay updated on advancements in GPU technologies.
- Help set up monitoring and logging tools (e.g., Zabbix, Prometheus, NVIDIA DCGM).
- Learn and apply basic security practices in a multi-tenant GPU cloud environment.
- Collaborate with senior engineers and administrators to streamline workflows.
- Provide user support under supervision for GPU-accelerated systems.
- Work closely with senior DevOps engineers to identify bottlenecks and improve processes.
- Gain hands-on experience in high-performance distributed computing for AI and HPC workloads.
- Currently pursuing a Bachelor’s degree in Computer Science/Engineering, Information Technology, Systems Engineering, or a related field.
- Basic knowledge of Linux system administration (Ubuntu, CentOS, Rocky Linux, etc.) through coursework or personal projects.
- Exposure to DevOps tools such as Jenkins, Kubernetes, Ansible, or Terraform.
- Understanding of core DevOps concepts (e.g., CI/CD, automation, monitoring) with willingness to learn further.
- Familiarity with scripting languages (Python, Bash) for simple tasks or assignments.
- Exposure to monitoring solutions such as Zabbix or Prometheus is a plus.
- Interest in AI frameworks such as TensorFlow or PyTorch, with coursework or project experience preferred.
- Awareness of cloud architectures (IaaS, PaaS) and GPU technologies, including NVIDIA GPUs.
- Good verbal and written communication skills in English.
- Collaborative mindset and ability to work effectively in a team environment.
- Strong interest in developing problem-solving and analytical skills for system optimization.
- Understanding of how collective communications (MPI, RDMA, and NCCL) work, as well as how GPU-specific acceleration works on a GPU cluster.
- Knowledge of DevOps/MLOps technologies for GPU clusters, such as Docker/containers, Kubernetes, and data center deployments.
- Understanding of AI and HPC networking technologies such as InfiniBand, RoCE, and DPUs.
- Understanding of how AI and HPC workloads interact with both GPU hardware and software infrastructure.