Flattening of Data-Dependent Nested Loops for Compile-Time Optimization of GPU Programs

Vadim Bulavintsev


Modern Graphics Processing Units (GPUs) belong to the “Single Instruction Multiple Data” (SIMD) computational architecture class. Due to inefficient execution of divergent branches, SIMD devices can lose performance on nested loops with data-dependent exit conditions. A specialized compile-time Control Flow Graph (CFG) transformation routine can solve this problem. The routine reduces loop nest level by merging the inner loop with the outer loop. The transformed program remains logically equivalent to the original one, while its branching pattern becomes better suited for execution on a SIMD device. The routine is implemented as a Low-Level Virtual Machine (LLVM) Transformation Pass. Depending on the dataset and nested loops parameters, the transformation reduces the worst-case running time of a specialized GPU benchmarking application up to 24 times.

Full Text:



S. Ryoo et al., “Optimization principles and application performance evaluation of a multithreaded GPU using CUDA,” Proc. 13th ACM SIGPLAN Symp. Principles and practice of parallel programming, pp. 73-82, 2008.

M. J. Flynn, “Some Computer Organizations and Their Effectiveness,” IEEE Trans. on Computers, vol. C-21, pp. 948–960, 1972.

CUDA Toolkit Documentation v10.1, Nvidia corp., Santa Clara, CA, 2019.

OpenCL User Guide v3.0, AMD corp.: Santa Clara, CA, 2015.

V. G. Bulavintsev, “An evaluation of CPU vs. GPU performance of some combinatorial algorithms for cryptoanalysis,” CMSE Bulletin of the South Ural State University, vol. 4(3), pp. 67–84, 2015.

F. E. Allen, “Control flow analysis,” ACM SIGPLAN Notices, vol. 5, no. 7, pp. 1–19, Jul. 1970.

C. Lattner, V. Adve, “LLVM: A compilation framework for lifelong program analysis and transformation,” Proc. 2004 Int. Symposium on Code Generation and Optimization, Washington DC, IEEE Computer Society, pp. 75-86, 2004.

The LLVM Compiler Infrastructure Project, 2019, http://llvm.org.

R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, F. K. Za-deck, “Efficiently computing static single assignment form and the control dependence graph,” ACM Trans. Programming Languages and Systems, TOPLAS 13.4, pp. 451–490, 1991.

V. G. Bulavintsev, NestedLoopsFusion GitHub project, 2019, https://github.com/ichorid/nestedloopsfusion.

J. Wu, et al., “gpucc: an open-source GPGPU compiler”, Proc. 2016 Int. Symp. Code Generation and Optimization, New York, ACM, pp. 105–116, 2016.

Clang: a C language family frontend for LLVM, 2019, http://clang.llvm.org.

V. G. Bulavintsev, A. A. Semenov, “GPU-based implementation of DPLL algorithm with limited non-chronological backtracking”, Prikladnaya Diskretnaya Matematika, Suppl., vol. 6, pp. 111-112, 2013.

O. Zaikin, “Application of parallel SAT solving algorithms for cryptanalysis of the shrinking and self-shrinking keystream generators,” International Journal of Open Information Technologies (INJOIT), vol.6, no. 10, pp. 29-33, 2018.

J.Allen, K. Kennedy, C. Porterfiel and J. Warren, “Conversion of control dependence to data dependence,” Proc. 10th ACM SIGACT-SIGPLAN Symp. Principles programming languages, pp. 177-189, 1983.

K. Kennedy, “Relaxing SIMD control flow constraints using loop transformations,” Proc. ACM SIGPLAN 1992 conf. Programming language design and implementation, vol. 27, no. 7, pp. 188-199, 1992.

R.Von Hanxleden, “Compiler support for machine-independent parallelization of irregular problems,” Doctoral diss., Rice Univ., 1995.

A. Ghuloum and A. Fisher, “Flattening and parallelizing irregular, recurrent loop nests,” ACM SIGPLAN Notices, vol. 30, no. 8, pp. 58-67, 1995.

W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic warp formation and scheduling for efficient GPU control flow,” Proc. 40th Annual IEEE/ACM Int. Symp. Microarchitecture (MICRO), pp. 407-420, 2007.

G. Diamos et al., “SIMD re-convergence at thread frontiers,” Proc. 44th Annual IEEE/ACM Int. Symp. Microarchitecture (MICRO), pp. 477-488, 2011.

M. Rhu and M. Erez, “The dual-path execution model for efficient GPU control flow,” Proc. 19th Int. Symp. High Performance Computer Architecture (HPCA), pp. 591-602, 2013.

M. Rhu and M. Erez, “Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation,” ACM SIGARCH Computer Architecture News, vol. 41, no. 3, pp. 356-367, 2013.

A. Vaidya, A. Shayesteh, D. Woo, R. Saharoy and M. Azimi, “SIMD divergence optimization through intra-warp compaction,” ACM SIGARCH Computer Architecture News, vol. 41, no. 3, pp. 368-379, 2013.

J. Lee, S. Seo, H. Lee and H. Sim, “Flattening-based mapping of imperfect loop nests for CGRAs,” Proc. 2014 Int. Conf. Hardware/Software Codesign and System Synthesis, p. 9, 2014

B. Coutinho, D. Sampaio, F. Pereira and W. Meira Jr., “Diver-gence analysis and optimizations,” 2011 Int. Conf. Parallel Ar-chitectures and Compilation Techniques, IEEE, pp. 320-329, 2011.

Z. Cui, Y. Liang, K. Rupnow and D. Chen, “An accurate GPU performance model for effective control flow divergence opti-mization,” 26th International Parallel and Distributed Processing Symposium, IEEE, pp. 83-94, 2012.

T. Schaub, S. Moll, R. Karrenberg and S. Hack, “The impact of the SIMD width on control-flow and memory divergence,” ACM Trans. Architecture and Code Optimization (TACO), vol. 11, no. 4, p. 54, 2015.

H. Wu, G. Diamos, S. Li and S. Yalamanchili, "Characterization and transformation of unstructured control flow in gpu applica-tions,” 1st International Workshop on Characterizing Applications for Heterogeneous Exascale Systems, 2011.

J. Anantpur and R. Govindarajan, “Taming control divergence in GPUs through control flow linearization,” Int. Conf. Compiler Construction, Springer, Berlin, Heidelberg, pp. 133-153, 2014.

T. Han and T. Abdelrahman, “Reducing branch divergence in GPU programs,” Proc. 4th Workshop on General Purpose Pro-cessing on Graphics Processing Units, ACM, p. 3, 2011.

A. Carminati, R. Starke and R. de Oliveira, “Combining loop unrolling strategies and code predication to reduce the worst-case execution time of real-time software,” Applied computing and informatics, vol. 13, no. 2, pp. 184-193, 2017.

T. Han and T. Abdelrahman, “Reducing divergence in GPGPU programs with loop merging,” Proc. 6th Workshop General Pur-pose Processor Using Graphics Processing Units, ACM, pp. 12-23, ACM, 2013.

S. Pop, R. Yazdani and Q. Neill, “Improving GCC’s auto-vectorization with if-conversion and loop flattening for AMD’s Bulldozer processors,” GCC Developers’ Summit, p. 89, 2010

GNU Compiler Collection, 2019, http://gcc.gnu.org.


  • There are currently no refbacks.

Abava  Absolutech IT-EDU 2019

ISSN: 2307-8162