TY - JOUR
T1 - Tree cutting approach for domain partitioning on forest-of-octrees-based block-structured static adaptive mesh refinement with lattice Boltzmann method
AU - Hasegawa, Yuta
AU - Aoki, Takayuki
AU - Kobayashi, Hiromichi
AU - Idomura, Yasuhiro
AU - Onodera, Naoyuki
N1 - Funding Information:
This work was supported in part by JSPS, Japan KAKENHI Grant Numbers JP26220002, JP19K24359 and 21K17755, High Performance Computing Infrastructure, Japan (HPCI, Project ID: hp180146) and Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures, Japan (JHPCN, Project ID: jh200050). Computations were performed on the TSUBAME2.5/3.0 supercomputer at Tokyo Institute of Technology, the DGX-2 system and the HPE SGI 8600 supercomputer at the Japan Atomic Energy Agency (JAEA), and the ABCI supercomputer at the National Institute of Advanced Industrial Science and Technology (AIST). We would also like to thank Dr. Watanabe at Kyushu University for useful discussions and suggestions about space-filling curves.
Publisher Copyright:
© 2021 The Author(s)
PY - 2021/12
Y1 - 2021/12
N2 - An aerodynamics simulation code based on the lattice Boltzmann method (LBM), using forest-of-octrees-based block-structured adaptive mesh refinement (AMR) with temporarily fixed refinement, was implemented, and its performance was evaluated on GPU-based supercomputers. Although the space-filling-curve-based (SFC) domain partitioning algorithm for octree-based AMR has been widely used on conventional CPU-based supercomputers, accelerated computation on GPU-based supercomputers revealed a bottleneck due to costly halo data communication. Our new tree cutting approach adopts a hybrid domain partitioning that combines a coarse structured block decomposition with SFC partitioning within each block. This hybrid approach improved the locality and the topology of the partitioned sub-domains and reduced the amount of halo communication to one-third of that of the original SFC approach. In the strong scaling test, the code achieved a maximum ×1.82 speedup at a performance of 2207 MLUPS (mega-lattice updates per second) on 128 GPUs (NVIDIA® Tesla® V100). In the weak scaling test, the code achieved 9620 MLUPS on 128 GPUs with 4.473 billion grid points, while maintaining a parallel efficiency of 93.4% from 8 to 128 GPUs.
AB - An aerodynamics simulation code based on the lattice Boltzmann method (LBM), using forest-of-octrees-based block-structured adaptive mesh refinement (AMR) with temporarily fixed refinement, was implemented, and its performance was evaluated on GPU-based supercomputers. Although the space-filling-curve-based (SFC) domain partitioning algorithm for octree-based AMR has been widely used on conventional CPU-based supercomputers, accelerated computation on GPU-based supercomputers revealed a bottleneck due to costly halo data communication. Our new tree cutting approach adopts a hybrid domain partitioning that combines a coarse structured block decomposition with SFC partitioning within each block. This hybrid approach improved the locality and the topology of the partitioned sub-domains and reduced the amount of halo communication to one-third of that of the original SFC approach. In the strong scaling test, the code achieved a maximum ×1.82 speedup at a performance of 2207 MLUPS (mega-lattice updates per second) on 128 GPUs (NVIDIA® Tesla® V100). In the weak scaling test, the code achieved 9620 MLUPS on 128 GPUs with 4.473 billion grid points, while maintaining a parallel efficiency of 93.4% from 8 to 128 GPUs.
KW - Adaptive mesh refinement (AMR)
KW - GPU
KW - Lattice Boltzmann method
KW - Static AMR
UR - http://www.scopus.com/inward/record.url?scp=85116323232&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85116323232&partnerID=8YFLogxK
U2 - 10.1016/j.parco.2021.102851
DO - 10.1016/j.parco.2021.102851
M3 - Article
AN - SCOPUS:85116323232
SN - 0167-8191
VL - 108
JO - Parallel Computing
JF - Parallel Computing
M1 - 102851
ER -