| File | Kernel | Memory pattern |
| --- | --- | --- |
| naive_transpose.cu | Row-major read, column-stride write | Write NOT coalesced |
| tiled_transpose.cu | Shared memory tile, bank-conflict-free | Coalesced read AND write |
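
To make the "shared memory tile, bank-conflict-free" pattern concrete, here is a minimal sketch of a padded tiled transpose. It is an illustration, not the code in `tiled_transpose.cu`: the names `tiledTransposePadded`, `launchTiledTranspose`, and `TILE_DIM`, the 32x32 thread block, and the row-major `rows x cols` layout are all assumptions.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int TILE_DIM = 32;  // assumed tile edge, equal to the warp size

// Transposes a row-major rows x cols matrix `in` into a cols x rows matrix `out`.
// Hypothetical kernel for illustration; expects a TILE_DIM x TILE_DIM thread block.
__global__ void tiledTransposePadded(float* out, const float* in, int rows, int cols)
{
    // The extra column shifts each tile row by one bank, so reading a tile
    // column below touches 32 different banks instead of one.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;  // column in `in`
    int y = blockIdx.y * TILE_DIM + threadIdx.y;  // row in `in`
    if (x < cols && y < rows) {
        // Consecutive threadIdx.x read consecutive addresses: coalesced read.
        tile[threadIdx.y][threadIdx.x] = in[(size_t)y * cols + x];
    }

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;  // column in `out` (a row of `in`)
    y = blockIdx.x * TILE_DIM + threadIdx.y;  // row in `out` (a column of `in`)
    if (x < rows && y < cols) {
        // Consecutive threadIdx.x write consecutive addresses: coalesced write.
        out[(size_t)y * rows + x] = tile[threadIdx.x][threadIdx.y];
    }
}

// Launches the kernel with one thread per matrix element (device pointers).
void launchTiledTranspose(float* dOut, const float* dIn, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    dim3 block(TILE_DIM, TILE_DIM);
    dim3 grid((cols + TILE_DIM - 1) / TILE_DIM, (rows + TILE_DIM - 1) / TILE_DIM);
    tiledTransposePadded<<<grid, block>>>(dOut, dIn, rows, cols);
}
```

Changing the tile declaration to `tile[TILE_DIM][TILE_DIM]` gives the unpadded variant, which keeps both global accesses coalesced but reintroduces the shared-memory bank conflict on the column read.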

Jargon

Profiling

Profiling the CUDA kernels with NVIDIA Nsight Systems (nsys) and NVIDIA Nsight Compute (ncu) gives the following timings for the copy reference and the three transpose kernels:

| Kernel | Avg (ms) | Med (ms) | Instances |
| --- | --- | --- | --- |
| copyKernel (peak ref) | 0.351 | 0.345 | 100 |
| naiveTranspose | 1.131 | 1.084 | 101 |
| tiledTransposeNoPad (V1) | 0.618 | 0.609 | 100 |
| tiledTranspose +1Pad (V2) | 0.392 | 0.381 | 101 |

Data moved per run: 67.1 MB read + 67.1 MB write = 134.2 MB. The ms column is the median (Med) time from the profiling table.

| Kernel | ms | GB/s | vs Copy |
| --- | --- | --- | --- |
| Copy (peak) | 0.345 | 389 GB/s | baseline |
| Naive | 1.084 | 124 GB/s | 32% (strided write) |
| Tiled NoPad V1 | 0.609 | 220 GB/s | 57% (bank conflict) |
| Tiled +1Pad V2 | 0.381 | 352 GB/s | 90% |
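
The GB/s figures follow from bytes moved divided by median time. The host-only sketch below reproduces that arithmetic; the helper names `effectiveBandwidthGBs` and `printBandwidthReport` are hypothetical, and 1 GB is assumed to mean 1e9 bytes.

```cuda
#include <cassert>
#include <cstdio>

// Effective bandwidth in GB/s, assuming 1 GB = 1e9 bytes.
double effectiveBandwidthGBs(double bytesMoved, double milliseconds)
{
    assert(milliseconds > 0.0);
    return bytesMoved / (milliseconds * 1e-3) / 1e9;
}

// Prints GB/s for the four median kernel times reported in this README.
void printBandwidthReport()
{
    const double bytesMoved = 67.1e6 + 67.1e6;  // 67.1 MB read + 67.1 MB write
    const char*  names[]    = {"Copy (peak)", "Naive", "Tiled NoPad V1", "Tiled +1Pad V2"};
    const double medianMs[] = {0.345, 1.084, 0.609, 0.381};

    for (int i = 0; i < 4; ++i) {
        std::printf("%-16s %6.3f ms  %4.0f GB/s\n",
                    names[i], medianMs[i],
                    effectiveBandwidthGBs(bytesMoved, medianMs[i]));
    }
}
```

For example, `effectiveBandwidthGBs(134.2e6, 0.345)` evaluates to roughly 389 GB/s, the copy baseline against which the other kernels are compared.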

Analysis

Naive (32%)

Tiled NoPad V1 (57%)

Tiled +1Pad V2 (90%)
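
The bank conflict behind the V1 versus V2 gap can be checked on the host with a small model. This is a sketch under stated assumptions: 32 shared-memory banks, each one 4-byte word wide, a 32-thread warp, and a `float` tile read by column (one lane per tile row). The function `maxBankConflictDegree` is a hypothetical helper, not part of the repository.

```cuda
#include <cassert>

constexpr int WARP_SIZE = 32;  // threads per warp
constexpr int NUM_BANKS = 32;  // assumed: 32 banks of 4-byte words

// Returns the largest number of lanes in one warp that land in the same bank
// when lane i reads element [i][col] of a float tile whose rows are
// `rowPitch` words long. A result of 1 means the access is conflict-free.
int maxBankConflictDegree(int rowPitch, int col)
{
    assert(rowPitch > 0 && col >= 0 && col < rowPitch);
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        int wordIndex = lane * rowPitch + col;  // word offset of tile[lane][col]
        int bank = wordIndex % NUM_BANKS;
        if (++hits[bank] > worst) {
            worst = hits[bank];
        }
    }
    return worst;
}
```

Under this model `maxBankConflictDegree(32, 0)` returns 32, because every lane maps to the same bank and the read is serialized, while `maxBankConflictDegree(33, 0)` returns 1, because the one-word pad moves each row to the next bank.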

Resource