File	Kernel	Memory pattern
`naive_transpose.cu`	Row-major read, column-stride write	Write NOT coalesced
`tiled_transpose.cu`	Shared memory tile, bank-conflict-free	Coalesced read AND write

Jargon

Bank conflict: Shared Memory is divded into banks. If multiple threads access the same bank, accesses are serialized, leading to performance degradation. (This is important because it gives a good understanding of gpu architecture and how to optimize memory access patterns.)
Stride: The step size between consecutive memory accesses. For example, in a row-major order, the stride for column access is equal to the number of columns, which can lead to non-coalesced accesses.
Coalesced access: When threads in a warp access memory addresses that are contiguous or follow a specific pattern, allowing the GPU to combine these accesses into a single transaction, improving performance.

Profiling

CPU Kernel Profiling via NVIDIA Nsight Systems (nsys) and NVIDIA Nsight Compute (ncu) reveals the following performance metrics for the three kernels:

Kernel	Avg (ms)	Med (ms)	Instances
copyKernel (peak ref)	0.351	0.345	100
naiveTranspose	1.131	1.084	101
tiledTransposeNoPad (V1)	0.618	0.609	100
tiledTranspose +1Pad (V2)	0.392	0.381	101

67.1MB read + 67.1MB write = 134.2MB | Kernel | ms | GB/s | vs Copy | | ————– | —– | ——– | ——————- | | Copy (peak) | 0.345 | 389 GB/s | baseline | | Naive | 1.084 | 124 GB/s | 32% (strided write) | | Tiled NoPad V1 | 0.609 | 220 GB/s | 57% (bank conflict) | | Tiled +1Pad V2 | 0.381 | 352 GB/s | 90% |

Analysis

Naive (32%)

Strided writes continuously evict data from the L2 cache
Each element results in a separate DRAM write

Tiled NoPad V1 (57%)

Shared memory 32-way bank conflict
Phase 2 (Shared→Global) Serialized

Tiled +1Pad V2 (90%)

bank conflict = 0
Bi-directional coalesced (read + write)
Copy bandwidth의 90%

Jargon

Profiling

Analysis

Resource