Cloud DevOps Engineer #GeneralInternship

Singtel

  • Singapore
  • Permanent
  • Full-time
  • 1 day ago
As a DevOps Engineer Intern for Singtel’s GPU Cloud, you will help implement processes and integrate operations to advance customers’ AI and HPC capabilities. You will be exposed to both physical data center implementation and software solutions in the Singtel RE:AI GPU Cloud. This position requires a forward-thinking individual who thrives in dynamic environments and is committed to driving continuous improvement in GPU infrastructure for AI and HPC environments. This is an excellent opportunity for someone eager to start their career in DevOps and grow their expertise in AI and HPC cloud platforms.
Responsibilities
  • Assist in deploying and supporting GPU clusters for AI and ML workloads.
  • Support automation tasks for provisioning GPU resources in on-prem and cloud platforms.
  • Learn and contribute to CI/CD pipeline setup for AI models and GPU-accelerated applications.
  • Monitor basic cluster usage, health, and performance under supervision.
  • Assist in automating infrastructure provisioning and monitoring.
  • Support troubleshooting of system-level issues (e.g., Slurm, Kubernetes, GPU drivers, CUDA, InfiniBand networking) with guidance from senior engineers.
  • Participate in system benchmarking and stay updated on advancements in GPU technologies.
  • Help set up monitoring and logging tools (e.g., Zabbix, Prometheus, NVIDIA DCGM).
  • Learn and apply basic security practices in a multi-tenant GPU cloud environment.
  • Collaborate with senior engineers and administrators to streamline workflows.
  • Provide user support under supervision for GPU-accelerated systems.
  • Work closely with senior DevOps engineers to identify bottlenecks and improve processes.
  • Gain hands-on learning experience in high-performance distributed computation for AI and HPC workloads.
Requirements
  • Currently pursuing a Bachelor’s degree in Computer Science/Engineering, Information Technology, Systems Engineering, or a related field.
  • Basic knowledge of Linux system administration (Ubuntu, CentOS, Rocky Linux, etc.) through coursework or personal projects.
  • Exposure to DevOps tools such as Jenkins, Kubernetes, Ansible, or Terraform.
  • Understanding of core DevOps concepts (e.g., CI/CD, automation, monitoring) with willingness to learn further.
  • Familiarity with scripting languages (Python, Bash) for simple tasks or assignments.
  • Exposure to monitoring solutions such as Zabbix or Prometheus is a plus.
  • Interest in AI frameworks such as TensorFlow or PyTorch, with coursework or project experience preferred.
  • Awareness of cloud architectures (IaaS, PaaS) and GPU technologies, including NVIDIA GPUs.
  • Good verbal and written communication skills in English.
  • Collaborative mindset and ability to work effectively in a team environment.
  • Strong interest in developing problem-solving and analytical skills for system optimization.
Desirable qualifications
  • Understanding of how collective communications (MPI, RDMA, and NCCL) work, as well as how GPU-specific acceleration works on a GPU cluster.
  • Knowledge of DevOps/MLOps technologies in GPU clusters, such as Docker/containers, Kubernetes, and data center deployments.
  • Understanding of AI and HPC networking technologies such as InfiniBand, RoCE, and DPUs.
  • Understanding of how AI and HPC workloads interact with both GPU hardware and software infrastructure.
