
NPU Communications Engineer (Embedded Systems)
- Singapore
- Permanent
- Full-time
- Design and implement foundational collective communication operators (e.g., Send, Receive, Broadcast, Gather, Reduce, All Reduce, All Gather) tightly coupled with the NPU (Neural Processing Unit) hardware architecture.
- Optimize communication primitives to exploit hardware features like specialized communication links, on-chip interconnects, and DMA engines, minimizing latency and maximizing bandwidth.
- Analyze different communication modes (blocking/non-blocking, sync/async, reliable/unreliable) in the context of chip microarchitecture to enhance throughput and reduce stalls.
- Research and integrate communication algorithms (e.g., Ring, Hierarchical Decomposition) tailored for NPU topology and workload patterns, ensuring scalability across many compute nodes.
- Ensure software-hardware co-design compatibility, verifying correctness and performance across the chip's instruction set, system software stack, and runtime environment.
- Perform deep debugging and profiling using hardware-level tools and logs to rapidly identify bottlenecks or correctness issues and drive resolution.
- Collaborate cross-functionally with chip architects, firmware engineers, and system software teams to deliver optimized communication solutions aligned with the overall AI accelerator roadmap.
- Master's degree or higher in Computer Science, Electrical Engineering, Integrated Circuit Design, or related fields.
- Proficient in C/C++ and Python programming with strong software engineering skills; experience with assembly or low-level programming for hardware optimization is highly valued.
- Deep understanding of heterogeneous hardware platforms, especially NPU architecture including compute cores, on-chip memory hierarchies, and communication fabrics.
- Solid grasp of collective communication principles and algorithms, including the implementation of efficient communication primitives on hardware accelerators.
- Experience with performance profiling and debugging at hardware-software boundaries, including the use of tools such as logic analyzers, hardware performance counters, and trace logs.
- Excellent problem-solving skills and ability to work in a collaborative, cross-disciplinary environment.
- Bonus skills include knowledge of GPU/TPU/DPU/NPU architectures, CUDA/ROCm programming, RDMA, communication libraries like NCCL, and distributed AI training frameworks.
- A culture that values authenticity and diversity of thought and background;
- An inclusive and respectful environment with open workspaces and an exciting start-up spirit;
- A fast-growing company with the chance to network with industry pioneers and enthusiasts;
- Ability to contribute directly and make an impact on the future of the digital asset industry;
- Involvement in new projects, developing processes/systems;
- Personal accountability, autonomy, fast growth, and learning opportunities;
- Attractive welfare benefits and developmental opportunities such as training and mentoring.