I am a Ph.D. student at the National University of Singapore and a member of the Alibaba Platform for AI (PAI) team, jointly advised by Prof. Jialin Li and Wei Lin. My primary research interest is developing highly efficient distributed infrastructure for machine learning.
During my undergraduate studies, I had the privilege (and fun) of spending four years with the NTU HPC club, where we won the Overall Championship at the SC'17 Student Cluster Competition and set the competition's LINPACK world record.
Outside of my academic pursuits, I enjoy cooking, jogging, and skateboarding. I have even developed my own menu. My Erdős number is 5.
Download my résumé.
Doctor of Philosophy in Computer Science, 2021 - present
National University of Singapore
Bachelor of Engineering in Computer Science, 2015 - 2019
Nanyang Technological University
Visiting Student, Fall 2016
New York University
Scaling up deep neural networks has proven effective in improving model quality, but it also brings ever-growing training challenges, including training efficiency, programmability, and resource adaptability. We present Whale, a general and efficient distributed training framework for giant models. Whale generalizes the programming interface to support various parallel strategies and their hybrids by defining two new primitives in the form of model annotations, through which users can provide hints. The Whale runtime utilizes these annotations and performs graph optimizations to transform a local deep learning DAG for distributed multi-GPU execution. Whale further introduces a novel hardware-aware parallel strategy that allows giant models to be trained on heterogeneous GPUs in a balanced way. Deployed in a production cluster with 512 GPUs, Whale successfully trains M6, an industry-scale multimodal model with over ten trillion parameters, demonstrating great scalability and efficiency.
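As a rough illustration of the annotation idea, the sketch below shows how two hypothetical primitives, `replicate` and `split`, could mark different parts of a model for data parallelism versus sharding, leaving a runtime to rewrite the graph accordingly. The primitive names, the recording mechanism, and the op names are assumptions for illustration only, not Whale's actual API.

```python
# A minimal, hypothetical sketch of annotation-based hybrid parallelism.
# The primitive names (replicate/split) and this recording mechanism are
# illustrative assumptions; they are not Whale's real interface.
from contextlib import contextmanager

ANNOTATIONS = []       # (strategy, device_count, ops) records for a "runtime"
_in_scope = False      # whether an annotation scope is currently active

@contextmanager
def _scope(strategy, device_count):
    global _in_scope
    ANNOTATIONS.append((strategy, device_count, []))
    _in_scope = True
    try:
        yield
    finally:
        _in_scope = False

def replicate(device_count):
    """Mark enclosed ops for data parallelism on device_count GPUs."""
    return _scope("replicate", device_count)

def split(device_count):
    """Mark enclosed ops for sharding across device_count GPUs."""
    return _scope("split", device_count)

def op(name):
    """Stand-in for defining a model op; records it under the active scope."""
    if _in_scope:
        ANNOTATIONS[-1][2].append(name)
    return name

# Usage: annotate a two-stage model; a runtime could then transform the
# local DAG for multi-GPU execution based on ANNOTATIONS.
with replicate(2):       # small dense stage: plain data parallelism
    op("resnet_backbone")
with split(4):           # huge classification layer: shard across 4 GPUs
    op("wide_softmax")

for strategy, n, ops in ANNOTATIONS:
    print(f"{strategy} on {n} GPU(s): {ops}")
```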
We propose an efficient parameter sharing strategy for the Transformer architecture: the feed-forward network (FFN) is replaced with a Mixture-of-Experts (MoE) layer, and the trainable parameters are shared across layers except for the normalization layers. This achieves competitive performance across CV and NLP tasks with up to a 6x reduction in the number of unique parameters.
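A minimal PyTorch sketch of this sharing scheme follows. One set of attention and MoE expert weights is reused at every depth, while each depth keeps its own LayerNorms. The class name, sizes, and the soft (dense) routing are my own simplifications for illustration, not the paper's implementation.

```python
# Hypothetical sketch of cross-layer parameter sharing with an MoE FFN.
import torch
import torch.nn as nn

class SharedMoETransformer(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_experts=4, depth=6):
        super().__init__()
        # Shared across all depths: attention, router, and expert FFNs.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])
        # NOT shared: one pair of LayerNorms per depth, per the abstract.
        self.norms1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(depth)])
        self.norms2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(depth)])
        self.depth = depth

    def moe(self, x):
        # Soft routing (dense mixture), simplified for illustration.
        weights = self.router(x).softmax(dim=-1)               # (B, T, E)
        outs = torch.stack([e(x) for e in self.experts], -1)   # (B, T, D, E)
        return (outs * weights.unsqueeze(2)).sum(-1)           # (B, T, D)

    def forward(self, x):
        # The same attn/moe weights are applied at every depth;
        # only the LayerNorms differ per layer.
        for i in range(self.depth):
            h, _ = self.attn(x, x, x)
            x = self.norms1[i](x + h)
            x = self.norms2[i](x + self.moe(x))
        return x

model = SharedMoETransformer()
x = torch.randn(2, 16, 256)   # (batch, tokens, d_model)
print(model(x).shape)         # torch.Size([2, 16, 256])
```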