In preparation for SC’17, Chen Hailin and I optimized one of the competition applications, MrBayes, using the AVX-512 instruction set. We achieved a 4x speedup in the end, the best performance on this application at SC’17.
We started by profiling MrBayes to identify the computationally demanding code. Then we read research papers and the source code to understand the underlying logic. We chose to port it to AVX-512 because the code is not well suited to the Intel compiler's automatic vectorization, and because the cloud component supported Skylake CPUs.
There were some challenges, though. The syntax for using ZMM registers with AVX-512 differs from that of AVX. Also, with 10 KLOC spread across just a handful of files, we sometimes got lost in the program 😅
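To give a rough idea of the syntax difference, here is a minimal sketch in C. It is not taken from our MrBayes port: the function names (`mul_scalar`, `mul_avx`, `mul_avx512`, `mul_dispatch`) and the element-wise multiply are made up for illustration, and it assumes an x86-64 machine with GCC or Clang. The AVX version works on 256-bit YMM registers (`__m256`, 8 floats), while the AVX-512 version works on 512-bit ZMM registers (`__m512`, 16 floats) and can handle the leftover elements with a mask register instead of a scalar loop.

```c
#include <immintrin.h>
#include <stddef.h>

/* Scalar reference: out[i] = a[i] * b[i]. */
static void mul_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = a[i] * b[i];
}

/* AVX: 256-bit YMM registers, 8 floats per iteration. */
__attribute__((target("avx")))
static void mul_avx(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_mul_ps(va, vb));
    }
    for (; i < n; i++) out[i] = a[i] * b[i]; /* scalar tail */
}

/* AVX-512: 512-bit ZMM registers, 16 floats per iteration. */
__attribute__((target("avx512f")))
static void mul_avx512(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(out + i, _mm512_mul_ps(va, vb));
    }
    if (i < n) {
        /* Masked tail: only the low (n - i) lanes are loaded and stored. */
        __mmask16 k = (__mmask16)((1u << (n - i)) - 1u);
        __m512 va = _mm512_maskz_loadu_ps(k, a + i);
        __m512 vb = _mm512_maskz_loadu_ps(k, b + i);
        _mm512_mask_storeu_ps(out + i, k, _mm512_mul_ps(va, vb));
    }
}

/* Pick the widest version the CPU supports at run time. */
static void mul_dispatch(const float *a, const float *b, float *out, size_t n) {
    if (__builtin_cpu_supports("avx512f")) mul_avx512(a, b, out, n);
    else if (__builtin_cpu_supports("avx")) mul_avx(a, b, out, n);
    else mul_scalar(a, b, out, n);
}
```

The per-function `target` attributes let the file compile without `-mavx512f`, and the run-time check keeps the sketch usable on CPUs without AVX-512.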
Our code is open sourced here.