Research Scientist (Machine Learning Training Systems) - TikTok Applied Machine Learning

  • Singapore
  • Permanent
  • Full-time
  • 1 month ago
TikTok is the leading destination for short-form mobile video. Our mission is to inspire creativity and bring joy. TikTok has global offices including Los Angeles, New York, London, Paris, Berlin, Dubai, Singapore, Jakarta, Seoul and Tokyo.

Why Join Us
Creation is the core of TikTok's purpose. Our platform is built to help imaginations thrive. This is doubly true of the teams that make TikTok possible.
Together, we inspire creativity and bring joy - a mission we all believe in and aim towards achieving every day.
To us, every challenge, no matter how difficult, is an opportunity to learn, to innovate, and to grow as one team. Status quo? Never. Courage? Always.
At TikTok, we create together and grow together. That's how we drive impact - for ourselves, our company, and the communities we serve.
Join us.

About the Team
The Machine Learning (ML) Systems team within Applied Machine Learning provides an end-to-end (E2E) machine learning experience and machine learning resources for the company. The team builds heterogeneous ML training and inference systems based on GPUs and AI chips, and advances the state of the art in ML systems technology to accelerate models such as Stable Diffusion and LLMs. The team is also responsible for research and development of hardware acceleration technologies for AI and cloud computing, via technologies such as distributed systems, compilers, HPC, and RDMA networking. The team is reinventing the ML infrastructure for large-scale language models. We have published papers at top-tier conferences such as SIGCOMM, NSDI, EuroSys, OSDI, SOSP, MLSys, and NeurIPS.

Responsibilities:
- Research and develop our machine learning systems, including heterogeneous computing architecture, management, scheduling, and monitoring
- Manage cross-layer optimisation of system and AI algorithms and hardware for machine learning (GPU, ASIC)
- Implement both general-purpose training framework features and model-specific optimisations (e.g. LLMs, diffusion models)
- Improve efficiency and stability for extremely large-scale distributed training jobs
- Plan and lead the development of new and advanced data analytics techniques, methodologies and analytical solutions, from design through prototyping and testing.
- Identify and develop core data and AI science components for the delivery of projects, architect specialised database and computing environments, and explore and visualise complex data sets to provide incremental business value.
- Extract and integrate data from various sources, and create advanced models and algorithms suitable for the business use case.
- Conduct testing on data and AI models, interpret findings from testing, and evaluate model performance for scaling and deployment.
- Work in a team setting and apply proficiency in statistics, scripting, and the programming languages required by the firm.
- Work with relevant software platforms on which the solution is deployed.

Qualifications:
- Bachelor's degree or above, with knowledge of distributed and parallel computing principles and of recent advances in computing, storage, networking, and hardware technologies;
- At least 3 years of working experience;
- Familiar with machine learning algorithms, platforms and frameworks such as PyTorch and JAX;
- Have a basic understanding of how GPUs and/or ASICs work;
- Expert in at least one or two programming languages in a Linux environment: C/C++, CUDA, Python;

Preferred Qualifications:
The following experiences will be a big plus:
- GPU-based high-performance computing, RDMA high-performance networking (MPI, NCCL, ibverbs);
- Optimisation of distributed training frameworks such as DeepSpeed, FSDP, Megatron, GSPMD;
- AI compiler stacks such as torch.fx, XLA and MLIR;
- Large-scale data processing and parallel computing;
- Experience in designing and operating large-scale systems in cloud computing or machine learning;
- In-depth experience in CUDA programming and performance tuning (CUTLASS, Triton).

TikTok is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At TikTok, our mission is to inspire creativity and bring joy. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.

TikTok