MLSys

TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks

We present a framework that speeds up the derivation of tensor parallel schedules for large neural networks by 160x.

ParaGAN: A Scalable Distributed Training Framework for Generative Adversarial Networks

We present ParaGAN, a cloud-based training framework for GANs that demonstrates near-optimal scaling performance over thousands of accelerators through system and training co-design.

TAP: Efficient Derivation of Tensor Parallel Plans for Large Neural Networks

We present a framework that drastically speeds up the derivation of tensor parallel plans for large neural networks.