CUDA Image Transposed

| File | Kernel | Memory pattern |
| --- | --- | --- |
| naive_transpose.cu | Row-major read, column-stride write | Write NOT coalesced |
| tiled_transpose.cu | Shared memory tile, bank-conflict-free | Coalesced read AND write |

Jargon

  • Bank conflict: Shared memory is divided into banks. When multiple threads in a warp access different addresses in the same bank, the accesses are serialized, which degrades performance. (This matters because it shows how the GPU architecture shapes memory access optimization.)
  • Stride: The step size between consecutive memory accesses. For example, in a row-major layout, walking down a column has a stride equal to the number of columns (the row width), which can lead to non-coalesced accesses.
  • Coalesced access: When threads in a warp access memory addresses that are contiguous or follow a specific pattern, allowing the GPU to combine these accesses into a single transaction, improving performance.
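
To make the stride and coalescing definitions concrete, here is a simplified host-side model that counts how many memory sectors one warp touches. It is only a sketch: it assumes 32 threads per warp, 4-byte floats, 32-byte sectors, and an aligned base address, and `sectorsTouchedByWarp` is an illustrative name that does not come from this repository.

```cuda
#include <cstddef>
#include <set>

// Simplified model (an assumption, not a profiler measurement):
// thread t of a warp accesses the float at element index t * strideElems.
// Returns the number of distinct 32-byte sectors the warp touches.
int sectorsTouchedByWarp(std::size_t strideElems) {
    const std::size_t kWarpSize = 32;     // threads per warp
    const std::size_t kSectorBytes = 32;  // assumed memory sector size
    std::set<std::size_t> sectors;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        std::size_t byteAddr = t * strideElems * sizeof(float);
        sectors.insert(byteAddr / kSectorBytes);
    }
    return static_cast<int>(sectors.size());
}
```

Under this model a stride of 1 touches 4 sectors (fully coalesced), while a stride of 4096 elements touches 32 sectors, one per thread.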

Profiling

GPU kernel profiling with NVIDIA Nsight Systems (nsys) and NVIDIA Nsight Compute (ncu) reveals the following performance metrics for the copy reference and the three transpose kernels:

| Kernel | Avg (ms) | Med (ms) | Instances |
| --- | --- | --- | --- |
| copyKernel (peak ref) | 0.351 | 0.345 | 100 |
| naiveTranspose | 1.131 | 1.084 | 101 |
| tiledTransposeNoPad (V1) | 0.618 | 0.609 | 100 |
| tiledTranspose +1Pad (V2) | 0.392 | 0.381 | 101 |

Data moved per kernel launch: 67.1 MB read + 67.1 MB write = 134.2 MB

| Kernel | Med (ms) | GB/s | vs Copy |
| --- | --- | --- | --- |
| Copy (peak) | 0.345 | 389 | baseline |
| Naive | 1.084 | 124 | 32% (strided write) |
| Tiled NoPad V1 | 0.609 | 220 | 57% (bank conflict) |
| Tiled +1Pad V2 | 0.381 | 352 | 90% |
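
The GB/s figures follow from dividing the bytes moved by the median kernel time. A minimal host-side sketch of that arithmetic, assuming 1 MB = 10^6 bytes and 1 GB = 10^9 bytes (the function name is illustrative):

```cuda
// Effective bandwidth in GB/s.
// Assumes 1 GB = 1e9 bytes, which is consistent with the figures reported above.
double effectiveBandwidthGBs(double bytesMoved, double kernelMs) {
    double seconds = kernelMs * 1e-3;  // milliseconds to seconds
    return bytesMoved / seconds / 1e9;
}
```

For example, `effectiveBandwidthGBs(134.2e6, 0.345)` gives roughly 389 GB/s for the copy kernel, and `effectiveBandwidthGBs(134.2e6, 1.084)` gives roughly 124 GB/s for the naive kernel.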

Analysis

Naive (32%)

  • Strided writes continuously evict data from the L2 cache
  • Each element results in a separate DRAM write
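
For reference, a naive transpose kernel with this access pattern might look like the sketch below. It is not necessarily the code in naive_transpose.cu: the names `naiveTransposeSketch` and `runNaiveTransposeSketch`, the 32x32 block size, and the omission of error checking are all assumptions made for brevity.

```cuda
#include <cstddef>
#include <cuda_runtime.h>
#include <vector>

// Sketch: out[x][y] = in[y][x] for a row-major input of `height` rows and `width` columns.
// Threads adjacent in threadIdx.x read adjacent floats (coalesced read),
// but write floats that are `height` elements apart (strided write).
__global__ void naiveTransposeSketch(const float* in, float* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // input column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // input row
    if (x < width && y < height) {
        out[x * height + y] = in[y * width + x];
    }
}

// Hypothetical host helper: copies data to the device, runs the kernel once,
// and returns the transposed matrix. Error checking is omitted for brevity.
std::vector<float> runNaiveTransposeSketch(const std::vector<float>& hostIn,
                                           int width, int height) {
    const std::size_t bytes = hostIn.size() * sizeof(float);
    std::vector<float> hostOut(hostIn.size());
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hostIn.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(32, 32);  // assumed block shape
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    naiveTransposeSketch<<<grid, block>>>(dIn, dOut, width, height);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hostOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return hostOut;
}
```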

Tiled NoPad V1 (57%)

  • 32-way shared memory bank conflict
  • Phase 2 (shared → global) is serialized
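
The 32-way figure can be checked with a small host-side model of the bank mapping. This sketch assumes 32 banks that are each 4 bytes wide, so float word `w` maps to bank `w % 32`; the function name is illustrative.

```cuda
#include <algorithm>
#include <array>

// Simplified model: 32 banks, each 4 bytes wide (float word w -> bank w % 32).
// Thread t of a warp reads tile[t][col] from a shared tile whose rows are
// `pitch` floats long. Returns the largest number of threads hitting one bank.
int columnReadConflictDegree(int pitch, int col) {
    const int kBanks = 32;
    std::array<int, kBanks> hits{};  // zero-initialized hit counters
    for (int t = 0; t < 32; ++t) {
        int word = t * pitch + col;  // linear word index of tile[t][col]
        ++hits[word % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With a row pitch of 32 floats every thread lands in the same bank, so the function returns 32. With a pitch of 33 the 32 threads land in 32 different banks and it returns 1.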

Tiled +1Pad V2 (90%)

  • bank conflict = 0
  • Bi-directional coalesced (read + write)
  • 90% of copy bandwidth
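
A padded tiled transpose along these lines could be sketched as follows. It is not necessarily the code in tiled_transpose.cu: the names `tiledTransposePaddedSketch`, `runTiledTransposeSketch`, and `kTile`, the 32x32 tile, and the lack of error checking are assumptions.

```cuda
#include <cstddef>
#include <cuda_runtime.h>
#include <vector>

constexpr int kTile = 32;  // assumed tile edge, equal to the warp size

// Sketch: transpose a row-major `height` x `width` input through a shared tile.
// Must be launched with kTile x kTile thread blocks.
__global__ void tiledTransposePaddedSketch(const float* in, float* out,
                                           int width, int height) {
    // One extra column shifts each row by one bank, so column reads are conflict-free.
    __shared__ float tile[kTile][kTile + 1];

    int x = blockIdx.x * kTile + threadIdx.x;  // input column
    int y = blockIdx.y * kTile + threadIdx.y;  // input row
    if (x < width && y < height) {
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read
    }
    __syncthreads();

    // Swap the block indices so threadIdx.x walks along an output row.
    int outCol = blockIdx.y * kTile + threadIdx.x;  // equals an input row index
    int outRow = blockIdx.x * kTile + threadIdx.y;  // equals an input column index
    if (outCol < height && outRow < width) {
        out[outRow * height + outCol] = tile[threadIdx.x][threadIdx.y];  // coalesced write
    }
}

// Hypothetical host helper; error checking is omitted for brevity.
std::vector<float> runTiledTransposeSketch(const std::vector<float>& hostIn,
                                           int width, int height) {
    const std::size_t bytes = hostIn.size() * sizeof(float);
    std::vector<float> hostOut(hostIn.size());
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hostIn.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(kTile, kTile);
    dim3 grid((width + kTile - 1) / kTile, (height + kTile - 1) / kTile);
    tiledTransposePaddedSketch<<<grid, block>>>(dIn, dOut, width, height);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hostOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return hostOut;
}
```

Changing the tile declaration to `tile[kTile][kTile]` would give the unpadded variant, where the column read `tile[threadIdx.x][threadIdx.y]` sends all 32 threads of a warp to the same bank.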

Resource