Useful Links & Learning Website
This is just a graphics resource to learn DirectX
While Vision Transformers (ViT) have shown strong performance in image classification, several limitations arise when applying them to a wider range of computer vision tasks.
ViT splits the image into fixed-size patches (e.g., 16x16), regardless of the underlying content
As a result:
For example, a small object such as a bird might get split across several patches, causing the network to lose its continuity entirely.
ViT lacks mechanisms to handle scale variability in visual tokens. In real-world images:
ViT’s self-attention has quadratic complexity with respect to the number of patches (i.e., ( O(n^2) )).
This becomes computationally expensive for high-resolution images commonly found in practice (e.g., 1920×1080).
ViT was initially designed for image classification, but many vision applications require more:
These architectural choices allow Transformers to be applied beyond classification—including detection and segmentation

The diagram above illustrates the general structure of a Transformer-based vision backbone. The Swin Transformer begins by splitting the input image into small non-overlapping patches (gray outlines) and gradually builds up hierarchical representations by merging neighboring patches at deeper layers—similar to CNNs.
To reduce computational complexity, Swin Transformer performs self-attention within non-overlapping windows (red outlines) rather than across the entire image.
This window-based local attention allows the model to efficiently process images of various sizes, since the number of windows scales with the image size.
This hierarchical merging process enables the model to learn increasingly abstract and global features, while maintaining efficiency and scalability.

In Swin Transformer, attention is applied locally within windows. However, instead of applying the same partitioning across all layers, it introduces a clever mechanism:
Shifted Windows, which allows the model to connect neighboring windows and capture richer context. Each patch in a window shares the same key set, which not only simplifies computation but also improves memory access efficiency—crucial for hardware acceleration, as explained further below.
Notably, all windows in Swin are non-overlapping.

In Layer 1 of the Swin Transformer block (left side of the original figure above), the image is divided into non-overlapping windows. Self-attention is applied within each window only, making it a local attention mechanism. This is referred to as W-MSA.
In Layer 2 (right side of the original figure), the windows are shifted relative to the previous layer. This allows self-attention to be computed across window boundaries, connecting adjacent regions. While traditional sliding windows can also cover neighboring areas, shifted windows maintain efficiency while introducing inter-window connections, enabling the model to capture global interactions gradually.
Traditional Multi-head Self-Attention (MSA) computes attention globally across all tokens in an image. While this enables long-range dependencies, it comes with a significant computational cost.
For reference, the complexity of global self-attention is:
$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C$
where $h \times w$ is the number of patch tokens in the feature map and $C$ is the channel dimension.
The problem is that as the input resolution increases, the quadratic $(hw)^2$ term grows dramatically, making global attention inefficient for high-resolution images.
To address this, Window-based Multi-head Self-Attention (W-MSA) restricts attention to local windows of size M × M. Its complexity is:
$\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC$
Here, M is a fixed window size (e.g., 7), making this approach linear in terms of image size. This dramatically reduces computational cost while retaining performance in local contexts.
| Method | Complexity | Scales with Image Size? |
|---|---|---|
| MSA | O((h·w)^2) | ❌ Quadratic |
| W-MSA | O(h·w) (when M is fixed) | ✅ Linear |
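To make the difference concrete, here is a small illustrative calculation (not from the paper's code) that plugs typical Swin-T Stage-1 numbers — h = w = 56 patch tokens, C = 96 channels, window size M = 7 — into the two complexity formulas above:

```python
# Rough cost comparison of global MSA vs. window-based W-MSA,
# using the two complexity formulas above. The values (h, w, C, M)
# are illustrative, not measured numbers.
def msa_cost(h, w, C):
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_cost(h, w, C, M):
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

h = w = 56   # feature-map resolution in patches
C = 96       # channel dimension
M = 7        # window size

print(f"MSA  : {msa_cost(h, w, C):,.0f}")
print(f"W-MSA: {wmsa_cost(h, w, C, M):,.0f}")
# Doubling the resolution multiplies the quadratic term of MSA by ~16x,
# while W-MSA grows only linearly with h*w.
```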
Also, each Swin Transformer block consists of:

The figure above illustrates the overall architecture of the Swin Transformer. Here’s how the input image is processed step-by-step:
Patch Partitioning
The input image is first split into non-overlapping patches of size 4×4, resulting in patch tokens of shape 4×4×3. Each patch is then flattened and passed through a Linear Projection to form an embedding vector.
Linear Embedding
These patch vectors are embedded into a fixed-dimensional space using a learnable linear layer. This prepares them for the Transformer blocks that follow.
Transformer Blocks (Stage 1)
The embedded patches are fed into Swin Transformer blocks, where self-attention is computed within local windows (W-MSA) and across shifted windows (SW-MSA). This stage captures fine-grained, low-level features.
Patch Merging (Stage 2)
After Stage 1, a Patch Merging Layer is applied. This merges each group of 2×2 neighboring patches, concatenating their features into a single vector (dimension becomes 4C). A linear layer then reduces this to 2C, effectively reducing spatial resolution (by a factor of 2) and increasing the channel capacity.
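A minimal sketch of this merging step, assuming a PyTorch-style tensor of shape (B, H, W, C); the class and variable names are mine, not the official implementation:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each 2x2 neighborhood: (B, H, W, C) -> (B, H/2, W/2, 2C)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                    # x: (B, H, W, C), H and W even
        x0 = x[:, 0::2, 0::2, :]             # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]             # bottom-left
        x2 = x[:, 0::2, 1::2, :]             # top-right
        x3 = x[:, 1::2, 1::2, :]             # bottom-right
        x = torch.cat([x0, x1, x2, x3], -1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))  # (B, H/2, W/2, 2C)

out = PatchMerging(96)(torch.randn(1, 56, 56, 96))
print(out.shape)  # torch.Size([1, 28, 28, 192])
```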
Hierarchical Feature Learning (Stages 3 & 4)
This patch merging process and Transformer block application are repeated multiple times, forming deeper stages. As a result, the feature maps get smaller in spatial dimensions but richer in representation—similar to how UNet or image pyramids work in traditional vision architectures.
Final MLP Head
At the end of the final stage, a Multi-Layer Perceptron (MLP) head is applied to perform the final prediction task, such as classification or detection.
To explain the second layer further, let’s look at the image below.

To wrap up: the Swin Transformer introduces a novel mechanism—Shifted Window Multi-head Self-Attention (SW-MSA)—to efficiently model long-range dependencies without incurring the high cost of global self-attention.
Why Shift Windows? The baseline attention mechanism, Window-based MSA (W-MSA), computes self-attention within non-overlapping windows. This is efficient, but it lacks cross-window communication.
To address this, shifted windows are applied in alternating Transformer blocks:

Shifting windows introduces new windows and disrupts alignment, potentially increasing computational complexity. Swin Transformer handles this elegantly using:
Unlike traditional Vision Transformers (ViT), which use absolute positional embeddings, Swin Transformer applies:
The details are shown below:
$\text{Attention}(Q, K, V) = \text{SoftMax}\!\left(\frac{QK^T}{\sqrt{d}} + B\right)V$

In conclusion, as shown in the figure above, the Relative Position Bias $B$ is generated from $\hat{B}$. For the matrices $Q$, $K$, and $V$, where $M^2$ is the number of patches in a window and the relative positions along each axis range over $[-M+1, M-1]$, a smaller bias matrix $\hat{B} \in \mathbb{R}^{(2M-1)\times(2M-1)}$ is parameterized, and the values of $B$ are looked up from $\hat{B}$.
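As a concrete illustration, the sketch below builds the relative-position index for a single M×M window and gathers B from the smaller table $\hat{B}$ (a simplified version of the idea; tensor names and sizes are illustrative):

```python
import torch

M, num_heads = 7, 3                                       # window size and head count (example values)
coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
coords = coords.flatten(1)                                # (2, M*M) patch coordinates inside the window
rel = coords[:, :, None] - coords[:, None, :]             # (2, M*M, M*M) pairwise offsets
rel = rel.permute(1, 2, 0).contiguous()                   # (M*M, M*M, 2), each offset in [-M+1, M-1]
rel[:, :, 0] += M - 1                                     # shift so offsets start at 0
rel[:, :, 1] += M - 1
rel[:, :, 0] *= 2 * M - 1                                 # flatten the 2D offset into one index
index = rel.sum(-1)                                       # (M*M, M*M), values in [0, (2M-1)^2 - 1]

B_hat = torch.zeros((2 * M - 1) ** 2, num_heads)          # learnable table \hat{B}
B = B_hat[index.view(-1)].view(M * M, M * M, num_heads)   # bias B added to the attention logits
print(B.shape)  # torch.Size([49, 49, 3])
```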
The basic idea of attention is that, at each time step during decoding, the decoder refers back to the entire input sentence encoded by the encoder. Instead of treating all parts of the input sentence equally, the model focuses more on the specific parts of the input that are most relevant to predicting the current output word.
The attention function can be thought of similarly to a Python dictionary. However, its role is to compute the similarity between a given query and all keys, then map each key to a value weighted by this similarity. The output is essentially a weighted sum of these values, where the weights reflect the relevance (similarity) to the query.

At each time step, the decoder refers back to all encoder hidden states, focusing more on the words that are most relevant. This approach helps address issues like vanishing gradients and the fixed-size output vector limitation of traditional RNNs.
The attention mechanism computes similarity scores based on dot products. Specifically, to calculate the attention scores, the decoder’s hidden state at time step 𝑡, denoted as 𝑠𝑡, is multiplied by each encoder hidden state ℎ1, ℎ2, …, ℎ𝑁. For example, in the third decoding stage, where the model predicts the next word after “je” and “suis,” it re-examines all encoder inputs to determine the relevant information.
Formally, assuming the encoder and decoder hidden states share the same dimensionality, the attention scores 𝑒𝑡 can be calculated as $e^t = [s_t^\top h_1, s_t^\top h_2, \dots, s_t^\top h_N]$. Here, 𝑒𝑡 is the vector of attention scores representing the relevance of each encoder hidden state to the current decoder state.

After computing the attention scores, we apply the softmax function to obtain the attention distribution, also known as the attention weights. By applying softmax to the attention scores 𝑒𝑡, we get a probability distribution 𝑎𝑡, where all values sum to 1. Each value in 𝑎𝑡 represents the weight (or importance) assigned to the corresponding encoder hidden state.
Formally: 𝑎𝑡 = softmax(𝑒𝑡) These attention weights determine how much each encoder hidden state should contribute to the current decoding step. In the diagram, the red rectangles represent the magnitude of attention weights applied to the encoder hidden states.

Once the attention weights 𝑎𝑡 are computed, we calculate the weighted sum of the encoder hidden states. This weighted sum is called the attention value, and it’s also commonly referred to as the context vector.
The higher the attention weight assigned to a particular hidden state, the more strongly it contributes to the final context vector—indicating a higher relevance to the current decoding step.
Formally, the context vector context 𝑡 (or attention value) is computed as: \[\text{context}_t = \sum_{i=1}^{N} a_i^t \cdot h_i\]
This vector summarizes the relevant parts of the input sequence, tailored for the current decoding time step 𝑡.

After computing the attention value (context vector), it is concatenated with the decoder’s hidden state at time step 𝑡, denoted as 𝑠𝑡.
This concatenated vector, often written as 𝑣𝑡 = [context𝑡;𝑠𝑡], combines both the information from the encoder (via attention) and the current state of the decoder. This merged vector 𝑣𝑡 is then used as input for predicting the output 𝑦𝑡, helping the model make more accurate predictions by leveraging both current decoding context and relevant encoder features.
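Putting the last few steps together, here is a small NumPy sketch of one decoding step — dot-product scores, softmax weights, the context vector, and the concatenation with the decoder state; the array shapes are illustrative:

```python
import numpy as np

def attention_step(s_t, H):
    """One dot-product attention step.
    s_t: decoder hidden state, shape (d,)
    H:   encoder hidden states, shape (N, d)
    """
    e_t = H @ s_t                              # dot-product scores, shape (N,)
    a_t = np.exp(e_t - e_t.max())
    a_t = a_t / a_t.sum()                      # softmax -> attention distribution (sums to 1)
    context = a_t @ H                          # weighted sum of encoder states, shape (d,)
    v_t = np.concatenate([context, s_t])       # [context_t ; s_t], shape (2d,)
    return a_t, context, v_t

H = np.random.randn(5, 8)      # 5 encoder hidden states of dimension 8
s_t = np.random.randn(8)       # current decoder hidden state
a_t, context, v_t = attention_step(s_t, H)
print(a_t.sum(), v_t.shape)    # ~1.0 and (16,)
```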

Before sending ( v_t ) (the concatenation of the context vector and the decoder hidden state) directly to the output layer, it is passed through an additional neural layer for transformation.
Specifically, ( v_t ) is multiplied by a weight matrix and then passed through a hyperbolic tangent (tanh) activation function. This results in a new vector ( \tilde{s}_t ), which serves as the final input to the output layer: \[\tilde{s}_t = \tanh(W_o \cdot v_t)\]
This transformation allows the model to learn a richer representation before making the final prediction for the output word ( y_t ).

| Name | Score Function | Defined by |
|---|---|---|
| dot | ( \text{score}(s_t, h_i) = s_t^\top h_i ) | Luong et al. (2015) |
| scaled dot | ( \text{score}(s_t, h_i) = \frac{s_t^\top h_i}{\sqrt{n}} ) | Vaswani et al. (2017) |
| general | ( \text{score}(s_t, h_i) = s_t^\top W_a h_i ) // ( W_a ) is a learnable weight matrix | Luong et al. (2015) |
| concat | ( \text{score}(s_t, h_i) = W_a^\top \tanh(W_b[s_t; h_i]) ) or ( W_a^\top \tanh(W_b s_t + W_c h_i) ) | Bahdanau et al. (2015) |
| location-base | ( \alpha_t = \text{softmax}(W_a s_t) ) // Only uses ( s_t ) when computing ( \alpha_t ) | Luong et al. (2015) |
The Encoder-Decoder structure consists of two main components: the encoder, which compresses the input sequence into a fixed-length vector (often called the context vector), and the decoder, which generates the output sequence based on this vector.
However, this approach has several limitations:
To address these issues, the Transformer architecture was introduced.
The Transformer replaces recurrence with attention mechanisms, specifically self-attention and cross-attention, to model dependencies in sequences. Just as CNNs use convolution to extract feature maps, attention computes similarity scores using dot products, followed by a weighted sum of values. This allows the model to focus on the most relevant parts of the input when processing each token.
Key features include:
Although the Transformer does not use RNNs, it still follows the Encoder-Decoder architecture, where it takes an input sequence and generates an output sequence. The overall structure is similar to that of traditional RNN-based models.
However, there is a key difference: In RNNs, the model processes the sequence step by step over time — each unit corresponds to a specific time step 𝑡. In Transformers, the model processes the entire sequence in parallel. Instead of time steps, it consists of 𝑁 repeated encoder and decoder blocks, each operating on the entire sequence simultaneously. This structural shift allows Transformers to overcome the sequential bottleneck of RNNs and achieve much better performance in terms of parallelization and long-range dependency modeling.

The final result of the computation is shown in the image below. The process continues from the input <sos> token until <eos> is produced. Notably, since the model is not divided into time steps like an RNN, positional encoding is necessary—in other words, positional information can be thought of as being added on top of each word's (one-hot) encoding.




I will first define the following steps. The diagram below can be considered the overall architecture.

The encoder is built by stacking num_layers (6) identical blocks. Each block contains sublayers such as multi-head self-attention and a position-wise FFNN.

In Sequence-to-Sequence attention, Q was the decoder hidden state at time step t, while K and V were the encoder hidden states.
In self-attention, however, Q, K, and V all come from the same source: the vectors of all the words in the input sentence.
Note that the word vectors are not used directly; they are first projected down to a smaller dimension. For example, if there are 4 words, each word's embedding vector is linearly projected to form the (lower-dimensional) Q, K, and V vectors.
After that, scaled dot-product attention is applied. Overall, this is performed as matrix operations (not just vector operations).
Among the parameters, there is a “number of heads,” which corresponds to parallel processing heads. Each head performs the following steps with Q and K: matrix multiplication → scaling → optional masking → softmax → matrix multiplication with V. Each head extracts different values, and by processing them in parallel, the values from each head are obtained.
Here, the mask refers to the padding mask. Padding tokens carry no actual meaning, so in the Transformer, keys that correspond to padding tokens are excluded from the similarity computation (this is sometimes compared to a skip connection, but they are not the same). In practice, a very large negative value is added to the attention scores at padded positions, so that after softmax their weights become close to zero.
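A minimal sketch of how such a padding mask is typically applied — a large negative value is added to the scores of padded keys before softmax; names and shapes are illustrative:

```python
import numpy as np

def masked_softmax(scores, key_is_pad):
    """scores: (q_len, k_len) attention scores; key_is_pad: (k_len,) boolean mask."""
    scores = scores.copy()
    scores[:, key_is_pad] = -1e9          # padded keys get a huge negative score
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

scores = np.random.randn(3, 5)
key_is_pad = np.array([False, False, False, True, True])   # last two keys are <pad>
weights = masked_softmax(scores, key_is_pad)
print(weights.round(3))       # the last two columns are (numerically) zero
```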
The position-wise FFNN can be considered a sublayer that both the encoder and decoder have. It performs the computations shown in the diagram below.

Another notable aspect is the use of skip connections and layer normalization (similar to the residual blocks in ResNet).
That is, the encoded information is passed to the decoder by feeding the encoder output into the decoder's multi-head attention. We will take a closer look at this part on the decoder side.

Like the encoder, the decoder also receives a sequence of words as input. However, to prevent the model from looking at future words beyond the current time step, an optional mask is applied during training. This masking prevents the decoder from attending to subsequent words and is called the look-ahead mask. Similar to the encoder, self-attention is performed.
In summary, the decoder can only attend to itself and the previous words, but not to future words.
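A small sketch of a look-ahead (causal) mask, built with the standard upper-triangular construction rather than any particular library's API:

```python
import numpy as np

def look_ahead_mask(seq_len):
    """True where attention is NOT allowed (future positions)."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

mask = look_ahead_mask(4)
print(mask.astype(int))
# [[0 1 1 1]
#  [0 0 1 1]
#  [0 0 0 1]
#  [0 0 0 0]]
# Scores at True positions are set to a large negative value before softmax,
# exactly like the padding mask, so the decoder cannot attend to future words.
```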

The second case is when the values received from the encoder are used together with the decoder's own values. This layer is not self-attention, which means the definitions of Q, K, and V are different.
Q corresponds to the decoder matrix, while K and V correspond to the encoder matrix. Mapping this to the example above, it looks like the following.

After that, the output probabilities are generated, and during inference, the token with the highest probability is produced.

You might have seen this in previous posts. Essentially, a Transformer is a self-attention model that operates in the following way:
Normalization → Multi-head Attention → Normalization → MLP.
Let’s take a look at the overall architecture diagram below.

Since it is a Vision Transformer, image data and the corresponding label data are used for training.

The formula is as follows: \[x \in \mathbb{R}^{C \times H \times W}\]
In the context of Vision Transformers, an Image Patch refers to the process of dividing an input image into smaller, fixed-size segments called patches. These patches are typically of size $ p \times p $ pixels and are used as input to the Transformer model, effectively treating the image as a sequence of patches, similar to how words are treated in natural language processing.
For example:
Given a 224 x 224 pixel image, if the patch size is set to 16 x 16 pixels, the image is divided into a grid of 14 x 14 patches (since $ 224 \div 16 = 14 $). Each patch is then flattened and processed as part of a sequence, which is fed into the Transformer architecture for vision tasks.
This approach allows the Transformer to leverage its sequence-processing capabilities for vision tasks, treating the grid of patches as a sequence of “tokens” akin to words in text processing.

CNNs and Vision Transformers: Image Patch Flattening. CNNs have traditionally excelled at image processing through convolution operations, but as networks grow deeper their parameter counts increase sharply, leading to issues such as overfitting and vanishing gradients, which are usually addressed with regularization or dropout. Transformers follow a different pattern: the original Transformer takes a sequence of word embeddings as input, and for vision tasks this is adapted by feeding in a sequence of image patches, enabling tasks like image label prediction. This is a distinctly different architectural pattern. Flattening of Image Patches: once the image has been split into patches, a flattening step converts each patch into a vector of size $p^2 \times c$, where:
Flattened Vector: Each patch $z_p$ is represented as a vector in $\mathbb{R}^{N \times (p^2 \cdot c)}$, where:
Mathematical Formulation \(z_p \in \mathbb{R}^{N \times (p^2 \cdot c)} \quad \text{where} \quad N = \frac{HW}{p^2}\)
This reflects the process of converting image patches into a sequence of flattened vectors suitable for input into a Transformer model.
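A short NumPy sketch of this split-and-flatten step (sizes follow the 224/16 example above; variable names are mine):

```python
import numpy as np

H = W = 224; C = 3; p = 16
img = np.random.rand(C, H, W)                     # x in R^{C x H x W}

N = (H // p) * (W // p)                           # N = HW / p^2 = 196 patches
patches = (img.reshape(C, H // p, p, W // p, p)   # split each spatial axis into (grid, p)
              .transpose(1, 3, 0, 2, 4)           # (14, 14, C, p, p): one cell per patch
              .reshape(N, p * p * C))             # z_p in R^{N x (p^2 * c)}
print(patches.shape)                              # (196, 768)
```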
After dividing the image into patches, embeddings are created from each patch. The split image patches undergo a linear transformation to initiate the encoding process. The resulting vectors are referred to as Patch Embedding Vectors, and they are characterized by a fixed length $ d $.
Each patch, after flattening, concatenates all pixel channels. It then undergoes a linear projection to the desired input dimension. This might seem confusing, but the key idea is:
After flattening, each patch becomes a single column vector (e.g., $256 \times 1$). The linear projection aims to reduce or transform the dimensions, typically for compatibility with the model’s input requirements. Example: If a $256 \times 1$ flattened patch is input into a linear layer with 512 nodes, a linear computation (multiplication by weights plus addition of bias) transforms it into a $512 \times 1$ output vector.

Through linear projection, the high-dimensional pixel values of each patch are mapped into a single feature vector the model can work with. In other words, the pixel data is transformed into a vector (a point in embedding space), giving a more concise representation that captures the essential information.
The latent vector represents the encoded representation of the image patches after processing. It is defined as:
A sequence of $N$ patch embeddings, where each embedding is scaled by a factor $\alpha$. The resulting vector is denoted as $[\alpha^1 E_1, \alpha^2 E_2, \dots, \alpha^N E_N]$.
Mathematical Formulation \([\alpha^1 E_1, \alpha^2 E_2, \dots, \alpha^N E_N] \in \mathbb{R}^{N \times D}\) where:
$E \in \mathbb{R}^{(p^2 \cdot c) \times D}$: The embedding matrix, with $p^2 \cdot c$ representing the flattened patch size (where $p$ is the patch dimension and $c$ is the number of channels) and $D$ is the desired embedding dimension.
Through this process, all patches can be embedded into vectors. The result is an $ N \times D $ array, where:
This $N \times D$ array represents the embedded patch sequence ready for input into the Transformer model.

To effectively train the model, a vector called the CLS Token (Classification Token) is prepended to the patch embeddings. This vector acts as a learnable parameter within the neural network.
After this step, the total number of embeddings becomes $n + 1$ (the $n$ patch embeddings plus the CLS token). Combined with the embedding size $D$, this results in an $(n+1) \times D$ array, which serves as the representation vector for further processing.
The initial embedding $ z_0 $ is formed by appending the CLS Token to the patch embeddings. It is defined as:
Mathematical Formulation \(z_0 = [\alpha_{cls}, \alpha^1_p E_1, \alpha^2_p E_2, \dots, \alpha^N_p E_N] \in \mathbb{R}^{(N+1) \times D}\)
This $z_0$ serves as the input representation vector for the Transformer model.

In the original Transformer, positional information is added to the word embeddings. For images, "which position?" is just as relevant, so a step is added that encodes positional information for each patch, reflecting its spatial location. This positional encoding is added directly to each embedding vector, including the CLS Token.
Process

The equation is as follows:
$z_0 = [\alpha_{cls}, \alpha^1_p E_1, \alpha^2_p E_2, \dots, \alpha^N_p E_N] + E_{pos} \in \mathbb{R}^{(N+1) \times D}$
where:

Similar to the previous discussion, the positional encoding method used here involves sin and cos functions, as seen in prior posts. This approach encodes the positional information of each patch, allowing the model to understand the spatial arrangement of the image patches.
The positional encoding $ PE(pos, 2i) $ and $ PE(pos, 2i+1) $ are defined using sine and cosine functions to encode the position of each patch. The formulas are: Mathematical Formulation \(PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)\) \(PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\)
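A compact sketch of these sine/cosine formulas (the same construction as in the original Transformer; the function and array names are mine):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]                   # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]                    # (1, d_model/2)
    angle = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                             # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                             # PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(n_positions=197, d_model=768)  # N+1 tokens incl. CLS
print(pe.shape)   # (197, 768)
```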
After the previous steps, you now have an array of size $(n + 1) \times d$, where $n$ is the number of patches and $ d $ is the embedding dimension. The next step is to apply self-attention, which allows the model to weigh the importance of different patches (including the CLS token) in relation to each other.
Create QKV (Query, Key, Value)
The embedding vector generated from the previous steps is linearly transformed into multiple large vectors, which are subdivided into three components: Q (Query), K (Key), and V (Value). These vectors are derived from the $(n + 1) \times d$ array, where $n + 1$ represents the number of patches plus the CLS token, and each component retains the same $n + 1$ length.

The equations are as follows: \[q = z \cdot w_q,\quad k = z \cdot w_k,\quad v = z \cdot w_v \quad \text{where} \quad w_q, w_k, w_v \in \mathbb{R}^{D \times D_h}\]
Now, we need to compute the attention score. As shown in the image below, the similarity is calculated by taking the dot product between the query and the key. Then, softmax is applied so that the sum of each row in matrix ( A ) becomes 1.

The equations are as follows: \[SA(z) = A \cdot v \in \mathbb{R}^{N \times D_h}, \quad \text{where} \quad A = \text{softmax} \left( \frac{q \cdot k^T}{\sqrt{D_h}} \right) \in \mathbb{R}^{N \times N}\]
After that, a weighted sum is computed over ( v ) using the attention weights.
Here’s the process: to obtain the aggregated contextual information for each element, we perform the computation using the first row of the attention matrix. At this point, we use the weights applied to ( V ) to generate the aggregated vector for the first image embedding. This operation is then applied to all patches.

The above process is ultimately repeated multiple times. \[MSA(z) = [SA_1(z); SA_2(z); \cdots; SA_k(z)] U_{\text{msa}} \quad \text{where} \quad U_{\text{msa}} \in \mathbb{R}^{(k \cdot D_h) \times D}\]
After stacking multiple attention heads, the result is mapped to a vector of dimension ( D ), which matches the patch embedding size.


Simply adds the input from the previous layer to the current layer.

The output from the previous steps is passed into a feed-forward network.

Repeat the above process L times. As shown in the second figure below, after L repetitions, the final classification is performed by passing the first vector y of the Encoder’s final output through an MLP Head consisting of a single hidden layer (dimension D × C).


The output $z_l$ at layer $l$ is computed by applying a Multi-Layer Perceptron (MLP) and Layer Normalization (LN) to the intermediate representation $z_l'$, combined with the residual connection $z_l'$. The process is defined as:
Mathematical Formulation \(z_l = MLP(LN(z_l')) + z_l' \quad \text{for} \quad l = 1, 2, \dots, L\) where: \(LN(z_l') = \gamma \frac{z_l' - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} + \beta\)
This iterative process refines the embeddings across $L$ layers.
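A minimal PyTorch-style sketch of one such encoder block (pre-norm attention and MLP, each wrapped in a residual connection); an illustration under the shapes used above, not the reference implementation:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm encoder block: z' = MSA(LN(z)) + z, then z_out = MLP(LN(z')) + z'."""
    def __init__(self, D=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(D)
        self.msa = nn.MultiheadAttention(D, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(D)
        self.mlp = nn.Sequential(nn.Linear(D, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, D))

    def forward(self, z):                                   # z: (B, N+1, D)
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]    # residual around attention
        z = z + self.mlp(self.ln2(z))                       # residual around MLP
        return z

blocks = nn.Sequential(*[EncoderBlock() for _ in range(2)])   # repeat L times (2 here for brevity)
z = torch.randn(1, 197, 768)                                  # (N+1) x D tokens including the CLS token
print(blocks(z).shape)                                        # torch.Size([1, 197, 768])
```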
The final step involves examining the output of the CLS Token (Classification Token). This vector serves as the last step in the Vision Transformer. In the final stage, it is passed through a Fully Connected Layer to compute Classification Probabilities, enabling the prediction of the image class.


The term Inductive Bias frequently appears in this context. For CNNs, the use of Convolution Operations, which are specific to images, introduces an inductive bias. This bias includes:
In contrast, Vision Transformers (ViTs) rely on MLP Layers to implicitly address Locality and Translation Equivariance. ViTs learn these properties through the Input Image Patches and refine them via Positional Embeddings (e.g., through fine-tuning).
Honestly, I do not want to build a game engine for its own sake. I wanted to build a product that uses a game engine, and for that the platform could just as well be Unreal Engine or Unity. Commercial engines certainly have clear advantages, but even so, the sheer amount of information they carry is not easy to digest. For example, DirectX 11 is nominally a low-level API, but in practice it is almost a high-level API; in particular, the driver handled most of the work for you. In contrast, DirectX 12 makes you handle most of that work yourself (RootSignature, PipelineState, etc.). There is also plenty to say about parallelism: command lists can reportedly be recorded in parallel (I have not actually tried this part yet).
In particular, before committing you specify in the CommandList, stream-style, what work to do and how much, and push it into the CommandQueue. Also, with calls like OMSetRenderTargets and IASetVertexBuffers, DX11 performs state transitions automatically, whereas in DX12 you need to specify states explicitly via D3D12_RESOURCE_BARRIER. There is more, but my goal today is not to talk about the difficulties of DX12 or to compare it with DX11 — today I want to share my development log.
There is something I have long wanted to build and express myself, and while searching YouTube for game-engine material I happened to come across a YouTuber called Cherno. He used to work at EA and is now building his own game engine, Hazel, and his other content is also of relatively high quality. Walnut, in particular, is a very nice template engine that bundles Vulkan and ImGui together — highly recommended. I will post other resources as I go.
For now, I decided to target a multiplatform desktop application. That is, I split the rendering part into a DX12 backend and a Vulkan backend and went through an abstraction step.
Let me describe the current project structure. Overall I am designing a HAL (Hardware Abstraction Layer), with the goal that the engine internals are platform-agnostic. For now, the abstraction layer was built around the "draw a triangle" milestone. Resources go through one abstraction pass, and IRenderBackend decides whether DirectX 12 or Vulkan is used. There is still a lot of work left, but I am deliberately not trying to do it all at once.
LunaEngine
| EntryPoint.cpp
| EntryPoint.h
| Layer.h
| LunaPCH.cpp
| LunaPCH.h
|
+---Application
| Application.cpp
| Application.h
| ApplicationSpecification.h
|
+---Graphics
| IBuffer.h
| IPipeline.h
| IShader.cpp
| IShader.h
| Texture.h
|
+---ImGui
| ImGuiBuild.cpp
| Roboto-Regular.embed
|
+---Input
| Input.cpp
| Input.h
| KeyCodes.h
|
+---Renderer
| | IRenderBackend.h
| | IRenderCommand.h
| | IRenderContext.cpp
| | IRenderContext.h
| | RenderQueue.cpp
| | RenderQueue.h
| |
| +---DX12
| | +---private
| | | BindPipelineCommand.cpp
| | | DrawCommands.cpp
| | | DX12Backend.cpp
| | | DX12Buffer.cpp
| | | DX12Pipeline.cpp
| | | DX12Shader.cpp
| | |
| | \---public
| | BindPipelineCommand.h
| | DrawCommands.h
| | DX12Backend.h
| | DX12Buffer.h
| | DX12Pipeline.h
| | DX12Shader.h
| |
| \---Vulkan
| VulkanBackend.cpp
| VulkanBackend.h
{
D3D12_DESCRIPTOR_HEAP_DESC desc = {};
desc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
desc.NumDescriptors = APP_SRV_HEAP_SIZE;
desc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;
if (g_pd3dDevice->CreateDescriptorHeap(&desc, IID_PPV_ARGS(&g_pd3dSrvDescHeap)) != S_OK)
return false;
g_pd3dSrvDescHeapAlloc.Create(g_pd3dDevice, g_pd3dSrvDescHeap);
}
For the current pass setup, as in Unreal/Unity, the UI is drawn last, and the Application layer calls it as follows.
void Application::Run()
{
_running = true;
while (ShouldContiueRunning())
{
glfwPollEvents();
float time = GetTime();
_frameTime = time - _lastFrameTime;
_lastFrameTime = time;
IRenderContext::BeginFrame();
IRenderContext::StartImGuiFrame();
if (ImGui::BeginMainMenuBar())
{
if (_menubarCallBack)
_menubarCallBack();
ImGui::EndMainMenuBar();
}
ImGui::DockSpaceOverViewport(ImGui::GetMainViewport(),ImGuiDockNodeFlags_PassthruCentralNode);
for (auto &layer : _layerStack)
layer->OnUpdate(_frameTime);
for (auto &layer : _layerStack)
layer->OnUIRender();
IRenderContext::DrawFrame();
IRenderContext::RenderImGui();
IRenderContext::EndFrame();
}
Shutdown();
}
I am only drawing a triangle, yet the triangle does not show up. So let's check it with RenderDoc.

The triangle is being drawn perfectly — but in the next pass it disappears.

While looking into this, I found that it happens because both are trying to draw to the RenderTarget, and the final update step runs into problems when Docking or Viewports are enabled in ImGui. In that case, ImGui creates separate resources so that these features can use independent GPU rendering resources. By default, when Viewports are enabled, the background color of a new viewport is gray, so when ImGui renders last it covers everything with gray. In other words, ImGui draws texture-based UI elements (that is, they are rendered into OS windows rather than managed purely inside ImGui).
ImGuiIO &io = ImGui::GetIO();
io.ConfigFlags |= ImGuiConfigFlags_DockingEnable;
io.ConfigFlags |= ImGuiConfigFlags_ViewportsEnable;
There are two ways to solve this.
ImGuiStyle& style = ImGui::GetStyle();
style.Colors[ImGuiCol_WindowBg].w = 0.0f; // Fully transparent window background
style.Colors[ImGuiCol_DockingEmptyBg].w = 0.0f; // Transparent dockspace background
style.Colors[ImGuiCol_ChildBg].w = 0.0f; // Transparent child window background
ImGui::PushStyleColor(ImGuiCol_DockingEmptyBg, ImVec4(0.0f, 0.0f, 0.0f, 0.0f));
ImGui::DockSpaceOverViewport(ImGui::GetMainViewport(), ImGuiDockNodeFlags_PassthruCentralNode);
ImGui::PopStyleColor();
Both approaches amount to the same idea, and the result came out as shown below. In the end, the right thing to do seems to be aggregating everything into a single RenderTarget, and importantly, I need to verify that the RenderTarget is indeed the one both ImGui and the triangle are drawn into. Below is the result after aggregation.

All dependencies are referenced from the PCH. Note that I have not yet found a way to disable vcpkg from premake. All dependencies live under src/vendor.
Series data refers to data where the state at one point in time is dependent on the states before (or after) it. Examples include input data for sentiment analysis, music analysis, or even large language models (LLMs). These tasks all rely on sequences of information—also known as time series data.
RNN, or Recurrent Neural Network, refers to a type of neural network where data is processed sequentially, one step at a time. This sequential nature allows RNNs to handle inputs such as words in a sentence, musical notes, or time-series sensor data.
The core building block of an RNN is the cell—a unit inside the hidden layer that performs activation and maintains memory. These are often referred to as memory cells, because they try to “remember” previous information in the sequence.
At each time step, the memory cell receives two inputs:
This structure enables recursive reuse of hidden states across time, making the network capable of learning temporal dependencies.
One of the key components of RNNs is the hidden state. Each memory cell carries a hidden state, and at every time step, this hidden state is updated using both the current input and the previous hidden state. This allows the network to “remember” past context as it processes the sequence step-by-step.
When the RNN is unrolled in time, it forms a chain-like structure, where each cell is connected to the next, passing along the hidden state as shown in the diagram.

In RNNs, each hidden state at time t depends on the hidden state from the previous time step (t-1). This recursive structure allows the network to “recall” past information over time.
However, this also leads to a major issue:
This technique helps stabilize training, especially in deep RNNs or long sequence tasks.

To address the issue of losing information from earlier time steps, a specialized architecture was introduced: LSTM (Long Short-Term Memory).
LSTM is designed to capture both short-term and long-term dependencies in sequential data, allowing the model to selectively retain or forget information over time.

Although the above equations may look quite complex, they essentially involve multiple gates, each performing a specific role. Mathematically, the representation of these gates can be expressed as shown in the image below.

If we break down these equations further and represent them in a simplified diagram, the image below might help make the concept easier to understand.

As mentioned earlier, each gate has a specific role. Let’s examine them one by one from left to right:

The Forget Gate is responsible for deleting unnecessary information. At the current time step 𝑡, the input 𝑥𝑡 and the previous hidden state ℎ𝑡 −1 are passed through a sigmoid function, producing values between 0 and 1. Values closer to 0 indicate that much of the information is discarded, while values closer to 1 mean the information is retained. This gating mechanism controls how the cell state is updated accordingly.

The Input Gate is used to process the new information to be added to the cell state. As shown on the right, the current input 𝑥𝑡 is multiplied by the weight matrix 𝑊𝑥𝑖, and the previous hidden state ℎ𝑡−1 is multiplied by 𝑊ℎ𝑖. Their sum is then passed through a sigmoid function. At the same time, the current input 𝑥𝑡 multiplied by 𝑊𝑥𝑔 and the previous hidden state ℎ𝑡−1 multiplied by 𝑊ℎ𝑔 are summed and passed through a hyperbolic tangent (tanh) function. The result of this operation is denoted as 𝑔𝑡
In other words, the combination of the sigmoid output (ranging from 0 to 1) and the tanh output (ranging from -1 to 1) determines how much new information is selected to update the cell state.

In a standard RNN, only the hidden state is passed along to the next time step. However, in an LSTM, both the hidden state and the cell state are passed forward. The Forget Gate selectively removes some information from the cell state, while the element-wise product of the input gate activation 𝑖𝑡 and the candidate values 𝑔𝑡 determines how much new information is added.
These two components—the retained memory and the newly selected information—are then combined (summed) to update the current cell state 𝐶𝑡 This updated cell state is passed on to the next time step 𝑡+1. If the forget gate output 𝑓𝑡 is zero, the previous cell state 𝐶𝑡−1 is effectively reset to zero, meaning the cell only retains the newly selected information.

Finally, the Output Gate computes the output at the current time step 𝑡. It takes the current input 𝑥𝑡 and the hidden state, passes them through a sigmoid function to produce the output gate activation 𝑜𝑡. Meanwhile, the current cell state 𝐶𝑡 is passed through a hyperbolic tangent (tanh) function, producing values between −1 and 1. The element-wise product of these two values filters the cell state output, resulting in the new hidden state, which is then passed on to the next time step.
As shown in the diagram, the LSTM architecture divides these operations into multiple gates. In contrast, the GRU (Gated Recurrent Unit) simplifies this by combining some of these functions into just two gates: the Update Gate and the Reset Gate. This reduction results in a simpler structure while still effectively updating the hidden state over time, making the GRU a streamlined variant of the LSTM.
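To tie the gate equations above together, here is a compact NumPy sketch of a single LSTM time step (the weight layout and variable names are my own illustrative choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,)."""
    z = W @ x_t + U @ h_prev + b
    H = h_prev.shape[0]
    f = sigmoid(z[0*H:1*H])          # forget gate: what to erase from the cell state
    i = sigmoid(z[1*H:2*H])          # input gate:  how much new information to admit
    g = np.tanh(z[2*H:3*H])          # candidate values g_t
    o = sigmoid(z[3*H:4*H])          # output gate
    c_t = f * c_prev + i * g         # updated cell state C_t
    h_t = o * np.tanh(c_t)           # new hidden state passed to the next time step
    return h_t, c_t

D, Hdim = 8, 16
h, c = np.zeros(Hdim), np.zeros(Hdim)
W, U, b = np.random.randn(4*Hdim, D), np.random.randn(4*Hdim, Hdim), np.zeros(4*Hdim)
h, c = lstm_step(np.random.randn(D), h, c, W, U, b)
print(h.shape, c.shape)   # (16,) (16,)
```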

During grad school, I studied Deep Learning, thinking it was a promising field worth diving into. I was genuinely excited about neural networks and how they were changing the landscape of tech.
But after graduation, I found myself doing something completely different—systems development. It felt disconnected from AI at first, and honestly, a bit frustrating. Still, I took it as a good opportunity and kept going. Surprisingly, that path led me back to AI—just from a different angle. I ended up working on systems that supported Computer Vision features. I wasn’t building models, but I was helping them run efficiently in real environments.
At first, I tried training image data using a simple Multi-Layer Perceptron (MLP). To do that, I had to flatten the 2D image into a 1D vector. While this made it technically possible to train, it came at a cost—the model lost all the local and topological information in the image. It couldn’t understand where features were located, only what values existed. This made learning abstract concepts in images inefficient and slow.
To solve this, I turned to Convolutional Neural Networks (CNNs), which preserve spatial information using the concept of a receptive field—like how a lifeguard watches over a specific area of a pool, each convolutional filter focuses on a local region of the image.
In CNNs, we use small filters (or kernels) that slide over the image. Each kernel has weights (e.g., a 3×3 filter) and performs convolution operations followed by a bias addition and an activation function (like ReLU or Sigmoid). This produces a feature map that captures localized patterns in the image.
For example, filters like the Sobel operator are hand-designed to detect edges, but in CNNs, these filters are learned automatically during training. As a result, CNNs can effectively capture local features and build up abstract representations layer by layer.
By using convolutional layers instead of fully connected layers, the model not only gains efficiency but also becomes much better at recognizing patterns in images.

One of the earliest and most influential CNN-based architectures was LeNet, developed by Yann LeCun. It was originally designed for handwritten digit recognition (e.g., MNIST) and laid the groundwork for modern convolutional networks.
The input to LeNet is a 32×32×1 grayscale image. The first convolutional layer applies 6 filters of size 5×5, resulting in a feature map of size 28×28×6. This means each of the 6 filters scans the input image and extracts a different local feature.
After convolution, a subsampling (or downsampling) layer is applied—usually a type of average pooling—which reduces the spatial resolution. This pattern of Convolution → Subsampling repeats, gradually extracting higher-level features while reducing dimensionality.
Finally, the network flattens the feature maps and passes them through one or more fully connected layers, similar to an MLP, to perform classification.

While traditional face detection algorithms like Haar Cascades could recognize faces fairly well—especially with properly preprocessed input—AlexNet took things to a whole new level. Designed to handle 224×224 RGB images, AlexNet leveraged the power of GPUs for parallel computation, which allowed it to scale deeper and wider than previous models.
One interesting feature of AlexNet was its split architecture: the network was divided into two parallel streams to take advantage of multi-GPU setups.
AlexNet also introduced several key innovations that became standard in deep learning:
AlexNet’s success in the 2012 ImageNet competition marked a turning point for deep learning, showing that with enough data, compute, and smart design choices, neural networks could outperform traditional hand-engineered features by a large margin.

VGGNet built on the success of AlexNet, using similar input dimensions (e.g., 224×224×3), but introduced a key design shift: replacing larger filters (like 5×5) with multiple 3×3 convolutions stacked in sequence.
This approach brought several advantages:
As a result, VGGNet significantly improved performance while maintaining a clean, uniform architecture. Because of its regular structure and strong performance, VGGNet became a popular backbone network for tasks like semantic segmentation and object detection.
However, deeper networks introduced a new problem: during backpropagation, gradients could vanish as they moved backward through many layers, especially toward the input. This vanishing gradient problem made training very deep models difficult, eventually motivating the development of architectures like ResNet, which addressed this with residual connections.


If VGGNet made networks deeper by stacking layers vertically, GoogLeNet (a.k.a. Inception v1) took a different approach—it went deeper in both width and depth (it goes deeper both vertically and horizontally).
GoogLeNet introduced the concept of the Inception Module, which allowed the network to process spatial information at multiple scales simultaneously. As the name suggests, this architecture digs deeper and deeper into the network structure.
The unique part of GoogLeNet is the Inception Module. Take a look at the diagram below to understand it better.

As shown in the diagram above, one way to increase the depth of the network is by extracting feature maps using different kernels, then applying average pooling or max pooling, and finally concatenating the results.
However, GoogLeNet went further and proposed a more efficient structure by combining multiple operations—like pooling and convolutions with different kernel sizes—in parallel. One of the key innovations was the use of 1×1 convolutions, either before or after other operations, forming what’s known as a bottleneck structure.
Using 1×1 convolutions significantly reduced the number of parameters and computation. For example, performing the same operation with 1×1 filters required only around 67,584 parameters (12,288 + 55,296)—a much smaller number compared to what would be needed without them.
Another interesting feature of GoogLeNet is the use of auxiliary classifiers. Instead of having a single softmax classifier at the end, it includes two additional softmax branches in the middle of the network. These auxiliary classifiers help mitigate the vanishing gradient problem by providing additional gradient signals during training.
Lastly, GoogLeNet replaces traditional fully connected layers with Global Average Pooling (GAP) near the end of the network. While the exact mechanism may seem abstract at first, the core idea is that GAP reduces each feature map to a single number by averaging spatial values, effectively summarizing global information without introducing additional parameters—unlike fully connected layers.

Residual Learning: Tackling the Vanishing Gradient Problem. As mentioned in the previous post, one of the biggest issues with deep neural networks is the vanishing gradient problem. As networks get deeper, gradients calculated during backpropagation tend to shrink. The more layers you have, the more the gradients approach zero, which means the weight updates—especially in early layers—become negligible. In other words, the network struggles to learn because the influence of the output on earlier layers diminishes.
To address this, the concept of the Residual Block was introduced. It uses a mechanism called a skip connection, which forms the basis of residual learning. This allows the gradient to flow directly through the network, helping to mitigate the vanishing gradient issue even in very deep architectures.
Traditionally, the goal was to learn a function H(x) that maps the input x to the desired output y—in other words, to minimize H(x) – y. However, residual learning takes a different approach: instead of learning H(x) directly, the network learns the residual function, which is H(x) – x. The idea is that if the desired mapping is similar to the input, it’s easier to learn the difference between the input and output than the output itself.
By reformulating the learning objective this way, the model becomes easier to optimize and performs better in very deep configurations.

Why F(x) + x Helps: Stabilizing Gradients with Residual Blocks: The diagram above shows what we’ve been building toward: by using Residual Blocks, we compute F(x) + x, where F(x) is the output of a few convolutional layers and x is the original input. The key idea here is that when you differentiate this structure during backpropagation, the gradient always retains a value of 1 through the skip connection, ensuring that at least some portion of the signal survives as it flows backward.
Of course, this doesn’t completely eliminate the vanishing gradient problem, but according to the original paper, the issue was significantly mitigated by using Batch Normalization. Whether BatchNorm fully solves the vanishing gradient issue or just partially helps is still up for debate. One could argue it’s a major breakthrough—or just a minor contributor. Either way, it plays an important role in training very deep networks. BatchNorm’s role is to normalize the output of each layer, helping stabilize the gradient flow and speed up convergence.
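A small PyTorch-style sketch of the residual block discussed here — two 3×3 convolutions with BatchNorm and an identity skip connection computing F(x) + x; a simplified illustration rather than the exact ResNet code:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))   # F(x): two conv layers with BatchNorm
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)                  # F(x) + x: the skip connection keeps gradient flow

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
print(y.shape)   # torch.Size([1, 64, 56, 56])
```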
As a result of stacking multiple residual blocks—50 to 152 layers deep—ResNet was able to achieve a depth 8× greater than VGGNet, while still being trainable.
This is how deeply layered networks with Residual Learning end up looking, as illustrated in the diagram below.

According to the paper, as the network depth increases, there is a noticeable trend in performance—but this trend is not necessarily due to overfitting.
Performance Analysis
| Cause | Explanation | Resolution |
|---|---|---|
| Vanishing Gradient | Weakened gradients in upper layers during backpropagation | Skip Connection |
| Weight Attenuation | Imbalanced parameter updates in deeper layers | Residual Learning Architecture |
| Optimization Issues | Non-convex functions increase local minima | Bottleneck Architecture |
The following diagram shows how these challenges have been addressed in the improved architecture.



As shown above, the layers are densely connected, meaning each layer is connected to every other layer in a feed-forward fashion. This dense connectivity is the key point, and can be seen as an extension of residual learning.
Key characteristics include:
| Component | Role | Mathematical Expression |
|---|---|---|
| Dense Block | Preserve feature map connections | xₗ = Hₗ([x₀, x₁, …, xₗ₋₁]) |
| Transition Layer | Reduce dimensions and prevent redundancy | T(x) = Conv₁×₁(BN(ReLU(x))) |
| Bottleneck Layer | Improve computational efficiency | Hₗ = Conv₃×₃(Conv₁×₁(x)) |
From the equations above, we can clearly see the difference from ResNet. In ResNet, the residual connection is defined as: xₗ = Hₗ(xₗ₋₁) + xₗ₋₁
This means each layer receives input only from the previous layer, and the outputs are summed. In contrast, DenseNet connects all preceding feature maps to the current layer as input, which increases the diversity of learned representations.
One drawback of ResNet is that if the two feature maps being summed come from different distributions, the addition operation may become less effective or even harmful.
In short: ResNet combines features by element-wise sum, whereas DenseNet combines them by concatenation.
Traditional models typically scale along a single dimension—either depth, width, or resolution. What sets this approach apart is the idea of scaling all three dimensions in a balanced way. This is the core of what’s called Compound Scaling.

So the key point of this architecture lies in how to find the optimal balance between depth, width, and resolution.
Let’s briefly look at what each dimension represents:
These scaling constants are found using a small grid search, which yielded: α = 1.2 (depth), β = 1.1 (width), γ = 1.15 (resolution).
These constants are then used to guide the compound scaling process in EfficientNet.

| Component | Technique | Mathematical Expression |
|---|---|---|
| MBConv | Inverted residual block | F̂(x) = T_proj(T_expand(x)) ⊙ SE(T_dw(x)) |
| SE Block | Channel-wise attention modulation | w_c = σ(W₂ δ(W₁ · GAP(x))) |
| Swish Activation | Smooth activation function | swish(x) = x · σ(βx) |
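As a worked example of compound scaling (assuming the common rule depth = α^φ, width = β^φ, resolution = γ^φ with the constants above; the printed numbers are just this arithmetic, not official model configurations):

```python
# Compound scaling: scale depth/width/resolution together from one coefficient phi.
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(phi):
    depth_mult = alpha ** phi        # more layers
    width_mult = beta ** phi         # more channels
    res_mult = gamma ** phi          # larger input resolution
    return depth_mult, width_mult, res_mult

for phi in (1, 2, 3):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, "
          f"FLOPs ~x{d * w**2 * r**2:.2f}")   # FLOPs scale roughly with d * w^2 * r^2
```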
Just from the name alone, it is clear where this model is meant to be used—on mobile devices. It is a deep learning model designed specifically for mobile and resource-constrained environments. At its core, the key challenge was: “How can we reduce the amount of computation?”—and that is exactly what this architecture set out to solve.

In MobileNet, the goal is to balance latency and accuracy. Ultimately, the model achieves successful lightweight optimization, making it suitable for mobile and embedded devices.
To understand how this is done, it’s important to grasp the concept of Depthwise Separable Convolution.
Unlike a standard convolution, a depthwise separable convolution learns a separate spatial filter for each input channel and then mixes channels with a 1×1 (pointwise) convolution. In a traditional convolution, each filter operates across all input channels at once, which couples spatial and channel mixing. Depthwise convolution, on the other hand, performs the convolution independently per channel—similar to grouped convolution—which dramatically reduces computation while retaining performance.
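As a rough sketch of the savings (using groups=channels for the depthwise step followed by a 1×1 pointwise convolution; the layer sizes are arbitrary examples), compare the parameter counts:

```python
import torch.nn as nn

cin, cout, k = 32, 64, 3

standard = nn.Conv2d(cin, cout, k, padding=1)          # mixes space and channels at once
depthwise_separable = nn.Sequential(
    nn.Conv2d(cin, cin, k, padding=1, groups=cin),     # depthwise: one 3x3 filter per channel
    nn.Conv2d(cin, cout, 1),                           # pointwise: 1x1 conv mixes channels
)

count = lambda m: sum(p.numel() for p in m.parameters())
print("standard:", count(standard))                        # 32*64*9 + 64 = 18,496
print("depthwise separable:", count(depthwise_separable))  # (32*9 + 32) + (32*64 + 64) = 2,432
```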


| Technique | Description |
|---|---|
| Channel Reduction | Reduce the number of channels using a width multiplier: channels × α (e.g., α = 1.0, 0.75, 0.5) |
| Compression | Reduce model size and parameters by setting a smaller α (e.g., α = 0.5) |
| Evenly Spaced Downsampling | Use stride = 2 in early layers (e.g., 224×224 → 112×112) |
| Shuffle Operation | Shuffle channels to promote cross-group information flow (e.g., ShuffleNet) |
| Knowledge Distillation & Compression | Model compression techniques like pruning, quantization, and distillation |
As we saw in the previous post, vector addition can be done with one block and multiple threads. So what if we want to add an array of size 8 split across several blocks? It is simpler than it sounds: use two blocks and assign four threads to each. Look at the code segment below — it is implemented with 2 blocks and 4 threads per block. What we really care about is how the addition actually happens, so even though it slows things down, we can observe it with printf. The code reads int i = blockDim.x * blockIdx.x + threadIdx.x. blockIdx.x and threadIdx.x are fairly self-explanatory, but blockDim is worth pausing to think about.
__global__ void addKernel(const int* a, const int* b, int* c, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
c[i] = a[i] + b[i];
printf("%u %u %u %u %u %d\n", blockDim.x, blockDim.y, blockDim.z, blockIdx.x, threadIdx.x, i);
}
addKernel <<<2, 4 >>> (dev_a, dev_b, dev_c, size);
First, let's look at the printf output of the code above. In the results below, the thread block is 4 x 1 x 1, which means each block holds 4 threads. Since we use 2 blocks in total, the block index comes out as 0 and 1, and threadIdx is printed as-is. The important point is that execution happened to start from 1 rather than 0: with multithreading, threads run in no guaranteed order — whichever finishes its work first prints first — so the only guarantee is that every operation completes, not the order. Looking at the computed values of i, we also see an ordering we might not have expected.

So how do we actually debug this? One way or another, we need to inspect things like the thread block index, thread index, and block dimension inside CUDA. Nsight shows this very well: as the figure below demonstrates, the visualization is surprisingly good, and it helps to match its internal components against the code we implemented above.

The code below is an example of processing a large amount of data. The 64 comes from the Maximum Threads Per Dimension entry in my machine's specs. Of course, using printf here could overload things, so just check the result values (although the figure below shows a run where printf was used anyway).
const int size = 1024 * 1024 * 64;
const int threadsPerBlock = 1024;
int numBlocks = (size + threadsPerBlock - 1) / threadsPerBlock; // int(ceil(float(size) / threadsPerBlock));
addKernel<<<numBlocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, size);

I expected closures — functors (lambda expressions) and captures — to be somewhat tricky, but the syntax is almost identical to C++. The only real difference is how values are copied depending on their type. (For example, in C++ you choose between capturing by reference with [&] or by copy with []; in Swift, a struct is a value type, so it is copied, while a class is a reference type, so the address is captured.)
To understand this, you need to be clear about when a function ends. Suppose a closure() is executed immediately inside some function: the function runs, the closure runs, everything works as expected. But what happens with asynchronous work? The closure inside the function is still waiting to run, yet the function itself may have already returned. The asynchronous work then executes somewhere else (e.g., from main) after the function has exited, and the closure can no longer be contained within the function's lifetime. The way to allow this is @escaping; by default, closures are non-escaping.
Look at the code below: we schedule asynchronous work on a DispatchQueue after 3 seconds, and this is exactly where the @escaping keyword comes in. In other words, the closure must be released from the function's flow so that it can still finish later, even though the function's own lifetime ends first. Note that omitting the @Sendable keyword produces an error: completion (a closure) is a reference type, and to use it inside asynchronous work the compiler needs a guarantee of thread safety, which it cannot infer on its own — so the type must be annotated explicitly.
func performAfterDelay(seconds: Double, completion: @escaping @Sendable () -> ()) {
DispatchQueue.main.asyncAfter(deadline: .now() + seconds) {
completion()
}
}
print("before")
performAfterDelay(seconds: 3, completion: {
print("Hello")
})
print("after")
From the closure's perspective, a (non-escaping) closure received as a parameter cannot be assigned to a variable or constant (this is worth knowing — sometimes you will not even get a compiler error). And if the closure is used inside a nested function, that nested function cannot be returned. In other words, within the function's flow, the closure must execute before the function finishes. But there is one more peculiarity — take a quick look at the code below.
func performAfterDelay(seconds: Double, completion: (@Sendable () -> Void)?) {
DispatchQueue.main.asyncAfter(deadline: .now() + seconds) {
completion?()
}
}
Broadly speaking, the closure may or may not be expected to produce a result, and here an Optional (?) type is used. When the parameter is an optional closure, it is already treated as escaping — it is no longer a plain (non-escaping) closure type, so you can think of it as escaping automatically.
This can be confusing, so let's sort it out, starting from the code. Suppose we implement it with a struct, which is a value type. Since a struct is a value type, you might expect that capturing it makes a copy, so the printed value would not change. However, a closure capture by default references the variable's memory: person.age is captured by reference, so the capture behaves like a reference. The second example below is a copy: passing a capture list makes the closure strongly hold the value at the moment the closure is created (i.e., it keeps the captured value), so when a capture list is used the value does not change afterwards. Note that if you try weak here, the compiler will tell you it can only be applied to class or class-bound protocol types — that is, it is only available for reference types.
struct Person {
var name: String
var age: Int
}
func captureRefTest() {
var person = Person(name: "John", age: 30)
var closure = {
print(person.age)
}
closure()
person.age = 40
closure()
}
captureRefTest()
func captureCopyTest() {
var person = Person(name: "Nick", age: 20)
var closure = { [person] in
print(person.age)
}
closure()
person.age = 40
closure()
}
captureCopyTest()
What about a class, then? A class is a reference type to begin with, so it does not capture by value in the first place. Even with a capture list, it therefore still behaves like a reference.
class Animal {
var name: String
var age: Int
init(name: String, age: Int) {
self.name = name
self.age = age
}
}
func captureTest() {
var animal = Animal(name: "Dog", age: 10)
var closure = { [weak animal] in
print(animal!.age)
}
closure()
animal.age = 20
closure()
}
captureTest()
In plain C, adding two vectors (arrays) is typically done as follows.
int main(void) {
// host side
const int SIZE = 6;
const int a[SIZE] = {1, 2, 3, 4, 5, 6 };
const int b[SIZE] = {10, 20, 30, 40, 50, 60 };
int c[SIZE] = {0};
for (register int i = 0; i < SIZE; ++i) {
c[i] = a[i] + b[i];
}
return 0;
}
Look at the body of the for-loop above — this is what we will call the kernel function (with the proper index value). A concrete example is shown below. The reason we pass idx is that defining the kernel function this way sets it up for parallel processing. But here it still runs on the CPU: the call stack looks like CPU[0] executes add_kernel(0, ...), and this repeats for indices 0 through SIZE - 1. In other words, this is still sequential execution.
void add_kernel(int idx, const int* a, const int* b, int*c) {
int i = idx;
c[i] = a[i] + b[i];
}
for (register int i = 0; i < SIZE; ++i) {
add_kernel(i, a, b, c);
}
What if we assume a multi-core CPU, i.e., parallel execution? With two cores, two elements can be processed in parallel at a time.
at time 0: CPU = core#0 = executes add_kernel(0, ...)
at time 0: CPU = core#1 = executes add_kernel(1, ...)
at time 1: CPU = core#0 = executes add_kernel(2, ...)
at time 1: CPU = core#1 = executes add_kernel(3, ...)
...
at time (n-1)/2: CPU = core#1 = executes add_kernel(SIZE - 1, ...)
So what about the GPU? A GPU has a huge number of cores, so it can exploit massive parallelism. As shown below, the entire for-loop can be processed in parallel at time 0.
at time 0: GPU = core#0 = executes add_kernel(0, ...)
at time 0: GPU = core#1 = executes add_kernel(1, ...)
at time 0: GPU = core#2 = executes add_kernel(2, ...)
at time 0: GPU = core#3 = executes add_kernel(3, ...)
...
at time 0: GPU = core#(n-1) = executes add_kernel(SIZE - 1, ...)
To summarize the above: the CPU works through the elements over time steps, while the GPU spreads them across cores.
| CPU Kernels | GPU Kernels |
|---|---|
| with a single CPU Core, For loop | a set of GPU Cores |
| sequential execution | parallel execution |
| for-loop | kernel launch |
| CPU[0] for time 0 | GPU[0] for core #0 |
| CPU[1] for time 1 | GPU[1] for core #1 |
| CPU[n-1] for time n-1 | GPU[n-1] for core #n-1 |
CUDA vector addition involves several steps.
In C/C++ on the CPU, invoking a function is simply a function call; the syntax looks like this:
void func_name(int param, ...);
for (int i = 0; i < SIZE; i++) {
func_name(param, ...)
}
On the GPU, however, things are quite different: the kernel (function) is invoked differently than in C++. This syntax is called the kernel launch syntax. Semantically, it means "use one set (block) of SIZE cores." More precisely, the first argument specifies how many thread blocks to use, and the second how many threads to use per thread block. For how threads are laid out inside a thread block, see the figure below the code.
__global__ void kernel_name(int param, ...);
kernel_name<<<1, SIZE>>>(param, ...);

An actual example file is shown below. addKernel is effectively the function call inside the GPU, and since we do not pass an index, the kernel finds the index of its own invocation internally.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size); // forward declaration (defined below)
__global__ void addKernel(int *c, const int *a, const int *b)
{
int i = threadIdx.x;
printf("%d\n", i);
c[i] = a[i] + b[i];
}
int main()
{
const int arraySize = 5;
const int a[arraySize] = { 1, 2, 3, 4, 5 };
const int b[arraySize] = { 10, 20, 30, 40, 50 };
int c[arraySize] = { 0 };
// Add vectors in parallel.
cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "addWithCuda failed!");
return 1;
}
printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
c[0], c[1], c[2], c[3], c[4]);
// cudaDeviceReset must be called before exiting in order for profiling and
// tracing tools such as Nsight and Visual Profiler to show complete traces.
cudaStatus = cudaDeviceReset();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaDeviceReset failed!");
return 1;
}
return 0;
}
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
// ...
int *dev_a = 0;
int *dev_b = 0;
int *dev_c = 0;
cudaError_t cudaStatus;
// Launch a kernel on the GPU with one thread for each element.
addKernel<<<1, size>>>(dev_c, dev_a, dev_b);
//...
// cudaDeviceSynchronize waits for the kernel to finish, and returns
cudaStatus = cudaDeviceSynchronize();
cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
Error:
cudaFree(dev_c);
cudaFree(dev_a);
cudaFree(dev_b);
return cudaStatus;
}
As shown below, cudaDeviceSynchronize() waits until the kernel finishes and then returns a cudaError_t; on success you get cudaSuccess. Finally, there is a cudaMemcpy(...) call that copies the result back to the CPU side, and if we end up at the error label, cudaFree is called.
Of course, the host keeps checking the status and waiting, but errors can also occur inside the kernel, even during the kernel launch itself. Those can be caught as shown below. There are two calls: cudaError_t err = cudaPeekAtLastError() and cudaError_t err = cudaGetLastError(). They do the same job, except that cudaGetLastError() resets the internal error flag while cudaPeekAtLastError() does not — peeking leaves the error state in place instead of clearing it after reporting the last error. It is also convenient to define a macro like the one below.
// Check for any errors launching the kernel
cudaError_t cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
goto Error;
}
cudaError_t err = cudaPeekAtLastError();
// CAUTION: we check CUDA error even in release mode
// #if defined(NDEBUG)
// #define CUDA_CHECK_ERROR() 0
// #else
#define CUDA_CHECK_ERROR() do { \
cudaError_t e = cudaGetLastError(); \
if (cudaSuccess != e) { \
printf("cuda failure \"%s\" at %s:%d\n", \
cudaGetErrorString(e), \
__FILE__, __LINE__); \
exit(1); \
} \
} while (0)
// #endif
One question may come up here: in C++ you can specify a return value, but a kernel cannot — it must return void. Because the kernel runs in parallel, a million parallel invocations would mean a million return values, which would really amount to error codes. So the real point is how to confirm explicitly that the computation has finished: you hand the GPU a block of memory, the kernel updates that array, and you can only verify the result once the data has been memcpy'd from the GPU back to the CPU.
As an example, a program that adds +1 to every element of a vector can be written as follows.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
__global__ void add_kernel(float *b, const float *a)
{
int i = threadIdx.x;
b[i] = a[i] + 1.0f;
}
int main()
{
const int arrSize = 8;
const float a[arrSize] = { 0., 1., 2., 3., 4., 5., 6., 7. };
float b[arrSize] = { 0., 0., 0., 0., 0., 0., 0., 0., };
printf("a = {%f,%f,%f,%f,%f,%f,%f,%f}\n", a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7]);
float* dev_a = nullptr;
float* dev_b = nullptr;
cudaError_t cudaStatus;
cudaMalloc((void**)&dev_a, arrSize * sizeof(float));
cudaMalloc((void**)&dev_b, arrSize * sizeof(float));
cudaMemcpy(dev_a, a, arrSize * sizeof(float), cudaMemcpyHostToDevice);
add_kernel <<<1, arrSize >>>(dev_b, dev_a);
cudaStatus = cudaPeekAtLastError();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
}
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
}
// Result
cudaStatus = cudaMemcpy(b, dev_b, arrSize * sizeof(float), cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
}
printf("b = {%f,%f,%f,%f,%f,%f,%f,%f}\n", b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7]);
cudaFree(dev_a);
cudaFree(dev_b);
return 0;
}
As a bonus tip, there is const char* cudaGetErrorName(cudaError_t err): pass in a cudaError_t and you get the enum's name back as a C string, which is a very handy debugging trick. Another is const char* cudaGetErrorString(cudaError_t err), which returns an explanation string for the error code. Both can be printed with std::cout.
When several CUDA processes are running, or when the process I am using spawns multiple threads that use the CUDA system concurrently, you may wonder how CUDA errors should be handled. If each CPU thread uses CUDA kernels independently, it is best to manage the CUDA error state per CPU thread.