Attention & Transformer

Attention Mechanism

The basic idea of attention is that, at each time step during decoding, the decoder refers back to the entire input sentence encoded by the encoder. Instead of treating all parts of the input sentence equally, the model focuses more on the specific parts of the input that are most relevant to predicting the current output word.

Attention Function (HashMap Analogy)

The attention function can be thought of similarly to a Python dictionary. However, its role is to compute the similarity between a given query and all keys, then map each key to a value weighted by this similarity. The output is essentially a weighted sum of these values, where the weights reflect the relevance (similarity) to the query.
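As a rough illustration of this analogy, here is a minimal NumPy sketch (toy vectors and dimensions made up for the example) of attention as a "soft" dictionary lookup: similarities between the query and all keys become softmax weights over the values.

import numpy as np

def soft_lookup(query, keys, values):
    """Attention as a soft dictionary: similarity -> weights -> weighted sum."""
    scores = keys @ query                      # dot-product similarity with every key
    weights = np.exp(scores - scores.max())    # softmax (numerically stable)
    weights /= weights.sum()
    return weights @ values                    # weighted sum of the values

# toy example: 3 key/value pairs, 4-dimensional vectors
rng = np.random.default_rng(0)
keys = rng.normal(size=(3, 4))
values = rng.normal(size=(3, 4))
query = keys[1] + 0.1 * rng.normal(size=4)     # a query close to the second key
print(soft_lookup(query, keys, values))        # output is dominated by values[1]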

alt text

  • Q (Query): the hidden state of the decoder cell at time step t
  • K (Keys): the hidden states of all encoder cells at every time step
  • V (Values): the hidden states of all encoder cells at every time step

Attention Features

At each time step, the decoder refers back to all encoder hidden states, focusing more on the words that are most relevant. This approach helps address issues like vanishing gradients and the fixed-size output vector limitation of traditional RNNs.

The attention mechanism computes similarity scores based on dot products. Specifically, to calculate the attention scores, the decoder's hidden state at time step $t$, denoted $s_t$, is multiplied by each encoder hidden state $h_1, h_2, \dots, h_N$. For example, in the third decoding stage, where the model predicts the next word after "je" and "suis," it re-examines all encoder inputs to determine the relevant information.

Formally, assuming the encoder and decoder hidden states share the same dimensionality, the attention scores $e_t$ can be calculated as: \[e_t = [s_t^\top h_1, s_t^\top h_2, \dots, s_t^\top h_N]\] Here, $e_t$ is the vector of attention scores representing the relevance of each encoder hidden state to the current decoder state.

alt text

After computing the attention scores, we apply the softmax function to obtain the attention distribution, also known as the attention weights. By applying softmax to the attention scores $e_t$, we get a probability distribution $a_t$, where all values sum to 1. Each value in $a_t$ represents the weight (or importance) assigned to the corresponding encoder hidden state.

Formally, $a_t = \text{softmax}(e_t)$. These attention weights determine how much each encoder hidden state should contribute to the current decoding step. In the diagram, the red rectangles represent the magnitude of attention weights applied to the encoder hidden states.

alt text

Once the attention weights $a_t$ are computed, we calculate the weighted sum of the encoder hidden states. This weighted sum is called the attention value, and it is also commonly referred to as the context vector.

The higher the attention weight assigned to a particular hidden state, the more strongly it contributes to the final context vector, indicating a higher relevance to the current decoding step.

Formally, the context vector $\text{context}_t$ (or attention value) is computed as: \[\text{context}_t = \sum_{i=1}^{N} a_i^t \cdot h_i\]

This vector summarizes the relevant parts of the input sequence, tailored to the current decoding time step $t$.

alt text

After computing the attention value (context vector), it is concatenated with the decoder's hidden state at time step $t$, denoted $s_t$.

This concatenated vector, often written as $v_t = [\text{context}_t; s_t]$, combines both the information from the encoder (via attention) and the current state of the decoder. This merged vector $v_t$ is then used as input for predicting the output $y_t$, helping the model make more accurate predictions by leveraging both the current decoding context and the relevant encoder features.

alt text

Computing ( \tilde{s}_t ), the Input to the Output Layer

Before sending ( v_t ) (the concatenation of the context vector and the decoder hidden state) directly to the output layer, it is passed through an additional neural layer for transformation.

Specifically, ( v_t ) is multiplied by a weight matrix and then passed through a hyperbolic tangent (tanh) activation function. This results in a new vector ( \tilde{s}_t ), which serves as the final input to the output layer: \[\tilde{s}_t = \tanh(W_o \cdot v_t)\]
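Putting the previous equations together, here is a minimal NumPy sketch of one decoding step with dot-product attention (toy dimensions; `W_o` and the hidden states are random placeholders rather than trained weights):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N, d = 5, 8                        # 5 encoder positions, hidden size 8
H = rng.normal(size=(N, d))        # encoder hidden states h_1 .. h_N
s_t = rng.normal(size=d)           # decoder hidden state at step t
W_o = rng.normal(size=(d, 2 * d))  # transform applied before the output layer

e_t = H @ s_t                              # attention scores e_t = [s_t^T h_i]
a_t = softmax(e_t)                         # attention distribution
context_t = a_t @ H                        # weighted sum of encoder states
v_t = np.concatenate([context_t, s_t])     # [context_t ; s_t]
s_tilde = np.tanh(W_o @ v_t)               # input to the output (softmax) layer
print(a_t, s_tilde.shape)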

This transformation allows the model to learn a richer representation before making the final prediction for the output word ( y_t ).

alt text

Different Attention & Score

| Name | Score Function | Defined by |
|------|----------------|------------|
| dot | $\text{score}(s_t, h_i) = s_t^\top h_i$ | Luong et al. (2015) |
| scaled dot | $\text{score}(s_t, h_i) = \frac{s_t^\top h_i}{\sqrt{n}}$ | Vaswani et al. (2017) |
| general | $\text{score}(s_t, h_i) = s_t^\top W_a h_i$, where $W_a$ is a learnable weight matrix | Luong et al. (2015) |
| concat | $\text{score}(s_t, h_i) = W_a^\top \tanh(W_b[s_t; h_i])$ or $W_a^\top \tanh(W_b s_t + W_c h_i)$ | Bahdanau et al. (2015) |
| location-based | $\alpha_t = \text{softmax}(W_a s_t)$ (uses only $s_t$ to compute $\alpha_t$) | Luong et al. (2015) |
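As a quick reference, here is a minimal NumPy sketch of these score functions (random toy weights; the shapes only illustrate the idea, and the location-based row pretends the source length equals the hidden size):

import numpy as np

d = 8
rng = np.random.default_rng(0)
s_t, h_i = rng.normal(size=d), rng.normal(size=d)
W_a, W_b, W_c = (rng.normal(size=(d, d)) for _ in range(3))
w_a = rng.normal(size=d)                      # vector used in the concat form

score_dot        = s_t @ h_i
score_scaled_dot = s_t @ h_i / np.sqrt(d)
score_general    = s_t @ W_a @ h_i
score_concat     = w_a @ np.tanh(W_b @ s_t + W_c @ h_i)
# location-based: the distribution over source positions depends only on s_t
alpha_location   = np.exp(W_a @ s_t) / np.exp(W_a @ s_t).sum()
print(score_dot, score_scaled_dot, score_general, score_concat, alpha_location.shape)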

RNN + Attention (seq2seq) Limitation

The Encoder-Decoder structure consists of two main components: the encoder, which compresses the input sequence into a fixed-length vector (often called the context vector), and the decoder, which generates the output sequence based on this vector.

However, this approach has several limitations:

  • Information loss during compression into a single vector
  • Lack of parallelism, since RNN-based encoders/decoders process sequences sequentially
  • High computational complexity

To address these issues, the Transformer architecture was introduced.

Transformer

The Transformer replaces recurrence with attention mechanisms, specifically self-attention and cross-attention, to model dependencies in sequences. Just as CNNs use convolution to extract feature maps, attention computes similarity scores using dot products, followed by a weighted sum of values. This allows the model to focus on the most relevant parts of the input when processing each token.

Key features include:

  • The ability to differentiate between individual tokens using attention weights
  • The use of self-attention within both the input (encoder) and output (decoder) sequences
  • Support for parallel computation, greatly improving training efficiency

Although the Transformer does not use RNNs, it still follows the Encoder-Decoder architecture, where it takes an input sequence and generates an output sequence. The overall structure is similar to that of traditional RNN-based models.

However, there is a key difference. In RNNs, the model processes the sequence step by step over time, so each unit corresponds to a specific time step $t$. In Transformers, the model processes the entire sequence in parallel. Instead of time steps, it consists of $N$ repeated encoder and decoder blocks, each operating on the entire sequence simultaneously. This structural shift allows Transformers to overcome the sequential bottleneck of RNNs and achieve much better parallelization and long-range dependency modeling.

alt text

The end of the computation is as shown in the image below: decoding starts from the input `<sos>` token and continues until `<eos>` is produced. In particular, since the model is not divided into time steps like an RNN, positional encoding is necessary; in other words, information about each word's position is injected on top of its embedding.

alt text

alt text

alt text

alt text

Transformer Components

I will first define the following steps. The diagram below can be considered the overall architecture.

alt text

Encoder (Self-Attention)

The encoder is a stack of num_layers (6) identical layers. Each layer contains sublayers such as multi-head self-attention and a position-wise FFNN.

alt text

In sequence-to-sequence models, Q, K, and V had distinct meanings: Q was the decoder hidden state at time step t, while K and V were the encoder hidden states.

In self-attention, however, Q, K, and V all come from the same source: the vectors of all the words in the input sentence.

Note that the word vectors are not used directly; they are first projected to a lower dimension. For example, if there are 4 words, each word's vector is linearly projected into smaller Q, K, and V vectors, and these projections define the dimensions of Q, K, and V.

After that, scaled dot-product attention is applied. Overall, this is performed as matrix operations (not just vector operations).


Multi-Head Attention

Among the hyperparameters there is a "number of heads," which corresponds to the number of parallel attention heads. Each head performs the following steps with Q and K: matrix multiplication → scaling → optional masking → softmax → matrix multiplication with V. Each head extracts different values, and by processing them in parallel, the outputs of all heads are obtained.
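Below is a minimal NumPy sketch of scaled dot-product attention and a multi-head wrapper (random toy weights, a single sentence, no batching), just to make the shapes concrete:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (len_q, len_k)
    if mask is not None:
        scores = scores + mask                # masked positions hold a large negative value
    return softmax(scores) @ V

def multi_head_attention(X, num_heads, rng):
    d_model = X.shape[-1]
    d_h = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_h)) for _ in range(3))
        heads.append(scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v))
    W_o = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ W_o   # concatenate heads, then project

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))           # 4 tokens, d_model = 8
print(multi_head_attention(X, num_heads=2, rng=rng).shape)   # (4, 8)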


Padding Mask

Here, the mask refers to the padding mask. Padding tokens carry no actual meaning, so in the Transformer, keys that correspond to padding tokens are excluded from attention (this is sometimes loosely compared to skipping a connection, but it is not the same thing). In practice, a very large negative value is added to the attention scores at those key positions, so that after softmax their attention weights become close to zero.
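A minimal sketch of building such a padding mask from token ids (assuming, hypothetically, that id 0 is the padding token):

import numpy as np

token_ids = np.array([7, 2, 5, 0, 0])            # a 5-token sentence with 2 padding slots
pad_mask = np.where(token_ids == 0, -1e9, 0.0)   # large negative value on padded keys

scores = np.random.default_rng(0).normal(size=(5, 5))   # toy attention scores
masked = scores + pad_mask[None, :]              # broadcast over the query dimension
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
print(weights[:, 3:].round(6))                   # columns for the padded keys are ~0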

Position-wise FFNN

The position-wise FFNN is a sublayer that both the encoder and the decoder contain. It applies the computation shown in the diagram below independently at each position.

alt text

Another notable aspect is the use of skip connections and layer normalization (similar to the residual blocks in ResNet).
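A minimal NumPy sketch of a position-wise FFNN sublayer with a residual connection and layer normalization (toy sizes and random weights; the learnable scale/shift of layer norm is omitted for brevity):

import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)          # gamma/beta omitted for brevity

def position_wise_ffn(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between two linear maps

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 4
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = layer_norm(x + position_wise_ffn(x, W1, b1, W2, b2))   # Add & Norm
print(out.shape)   # (4, 8)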


Encoder to Decoder

That is, the encoded information is passed to the decoder by feeding the encoder output into the decoder's multi-head attention layer. We will take a closer look at this part on the decoder side.

alt text

Decoder (Self-Attention / Look-Ahead Mask)

Like the encoder, the decoder also receives a sequence of words as input. However, to prevent the model from looking at future words beyond the current position, a mask is applied. This masking prevents the decoder from attending to subsequent words and is called the look-ahead mask. Apart from this, self-attention is performed just as in the encoder.

In summary, the decoder can only attend to itself and the previous words, but not to future words.
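A minimal sketch of such a look-ahead (causal) mask, where each query position can only attend to itself and earlier positions:

import numpy as np

seq_len = 5
# the upper-triangular part (future positions) gets a large negative value
look_ahead_mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
print((look_ahead_mask == 0).astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
# adding this mask to the attention scores before softmax zeroes out future words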

alt text

The second attention layer in the decoder uses the values received from the encoder together with the decoder states. This layer is not self-attention, which means the definitions of Q, K, and V differ.

Q corresponds to the decoder matrix, while K and V correspond to the encoder matrix. Mapping this to the example above, it looks like the following.

alt text

After that, the output probabilities are generated, and during inference, the token with the highest probability is produced.

Dot Product of Query and Key Values

alt text

Resource

Vision Transformer

You might have seen this in previous posts. Essentially, a Transformer is a self-attention model that operates in the following way:
Normalization → Multi-head Attention → Normalization → MLP.


Let's take a look at the overall architecture diagram below.

alt text

Input Data

Since it is a Vision Transformer, image data and the corresponding label data are used for training.

alt text

The formula is as follows: \[x \in \mathbb{R}^{C \times H \times W}\]

Image Patch in Vision Transformers

In the context of Vision Transformers, an Image Patch refers to the process of dividing an input image into smaller, fixed-size segments called patches. These patches are typically of size $ p \times p $ pixels and are used as input to the Transformer model, effectively treating the image as a sequence of patches, similar to how words are treated in natural language processing.

For example:

  • Given a 224 x 224 pixel image, if the patch size is set to 16 x 16 pixels, the image is divided into a grid of 14 x 14 patches (since $ 224 \div 16 = 14 $). Each patch is then flattened and processed as part of a sequence, which is fed into the Transformer architecture for vision tasks.

  • This approach allows the Transformer to leverage its sequence-processing capabilities for vision tasks, treating the grid of patches as a sequence of โ€œtokensโ€ akin to words in text processing.

alt text


CNNs and Vision Transformers: Image Patch Flattening

CNNs have traditionally excelled at image processing through convolution operations. One drawback is the rapid growth in the number of parameters as networks scale up, which leads to issues like overfitting and vanishing gradients and is typically addressed with regularization or dropout. Transformers take a different route: in the original Transformer, sequences of word embeddings are used as input, and for vision tasks this is adapted by feeding in sequences of image patches, enabling tasks like image label prediction. This represents a distinctly different architectural pattern.

Flattening of Image Patches

With the image split into patches, a flattening step converts each patch into a vector of size $p^2 \times c$, where:

  • Flatten Operation: The image is divided into patches, and each patch is flattened into a vector.
  • Patch and Input Size: An image of size $H \times W$ pixels is divided into $N$ patches, where $N = \frac{HW}{p^2}$.
  • Flattened Vector: The full sequence of flattened patches $z_p$ is a matrix in $\mathbb{R}^{N \times (p^2 \cdot c)}$, where:

  • $p$: Size of one side of the patch.
  • $c$: Number of color channels (e.g., 3 for RGB).
  • $N = \frac{HW}{p^2}$: Total number of patches, derived from the image dimensions $H$ (height) and $W$ (width) divided by the patch area $ p^2 $.

Mathematical Formulation \(z_p \in \mathbb{R}^{N \times (p^2 \cdot c)} \quad \text{where} \quad N = \frac{HW}{p^2}\)

This reflects the process of converting image patches into a sequence of flattened vectors suitable for input into a Transformer model.
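A minimal NumPy sketch of this patchify-and-flatten step (assuming a 224 x 224 RGB image and 16 x 16 patches, matching the example above):

import numpy as np

H, W, C, p = 224, 224, 3, 16
image = np.random.default_rng(0).random((H, W, C))

# split into (H/p) x (W/p) patches, then flatten each patch into a p*p*C vector
patches = image.reshape(H // p, p, W // p, p, C)    # (14, 16, 14, 16, 3)
patches = patches.transpose(0, 2, 1, 3, 4)          # (14, 14, 16, 16, 3)
z_p = patches.reshape(-1, p * p * C)                # (196, 768) = (N, p^2 * c)
print(z_p.shape)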

Patch Embeddings

After dividing the image into patches, embeddings are created from each patch. The split image patches undergo a linear transformation to initiate the encoding process. The resulting vectors are referred to as Patch Embedding Vectors, and they are characterized by a fixed length $ d $.

  • Each patch of size $ n \times n $ is reshaped into a single column vector of size $n^2 \times 1$.
    • Example: A patch of $16 \times 16$ pixels is flattened into a $256 \times 1$ vector.

Linear Projection

Each patch, after flattening, concatenates all pixel channels. It then undergoes a linear projection to the desired input dimension. This might seem confusing, but the key idea is:

After flattening, each patch becomes a single column vector (e.g., $256 \times 1$). The linear projection aims to reduce or transform the dimensions, typically for compatibility with the modelโ€™s input requirements. Example: If a $256 \times 1$ flattened patch is input into a linear layer with 512 nodes, a linear computation (multiplication by weights plus addition of bias) transforms it into a $512 \times 1$ output vector.
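A minimal sketch of that projection, following the text's single-channel example where a flattened 16 x 16 patch is a 256-dimensional vector and the target embedding size is 512 (the weights here are random placeholders for the learnable parameters):

import numpy as np

rng = np.random.default_rng(0)
patch_flat = rng.random(256)              # flattened 16x16 (single-channel) patch
W = rng.normal(size=(256, 512))           # learnable projection matrix
b = np.zeros(512)                         # learnable bias

embedding = patch_flat @ W + b            # linear projection to the model dimension
print(embedding.shape)                    # (512,)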

alt text

Projection of Two Features into One

By projecting multiple input features onto a single feature through linear projection, the model can work with a more compact representation. Ultimately, this maps pixel values into a format the model can understand: high-dimensional pixel data is transformed into a vector (a point in embedding space), giving a more concise representation that still captures the essential information.

Latent Vector Definition

The latent vector represents the encoded representation of the image patches after processing. It is defined as:

A sequence of $N$ patch embeddings, where $\alpha^i$ denotes the $i$-th flattened patch and $E$ is the embedding (projection) matrix applied to it. The resulting sequence is denoted $[\alpha^1 E_1, \alpha^2 E_2, \dots, \alpha^N E_N]$.

Mathematical Formulation \([\alpha^1 E_1, \alpha^2 E_2, \dots, \alpha^N E_N] \in \mathbb{R}^{N \times D}\) where:

$E \in \mathbb{R}^{(p^2 \cdot c) \times D}$: The embedding matrix, with $p^2 \cdot c$ representing the flattened patch size (where $p$ is the patch dimension and $c$ is the number of channels) and $D$ is the desired embedding dimension.

Through this process, all patches can be embedded into vectors. The result is an $ N \times D $ array, where:

  • $N$: The number of patches.
  • $D$: The embedding size for each patch.

This $N \times D$ array represents the embedded patch sequence ready for input into the Transformer model.

alt text

Appending a Class (Classification) Token

To effectively train the model, a vector called the CLS Token (Classification Token) is added to the patch embeddings. This vector:

  • Acts as a learnable parameter within the neural network.
  • Is randomly initialized.
  • Is a single token added uniformly to all data instances.

After this step, the total number of embeddings becomes $n + 1$ ($n$ patch embeddings plus the CLS token). Combined with the embedding size $D$, this results in an $(n+1) \times D$ array, which serves as the representation vector for further processing.

Initial Embedding with CLS Token

The initial embedding $ z_0 $ is formed by appending the CLS Token to the patch embeddings. It is defined as:

Mathematical Formulation \(z_0 = [\alpha_{cls}, \alpha^1_p E_1, \alpha^2_p E_2, \dots, \alpha^N_p E_N] \in \mathbb{R}^{(N+1) \times D}\)

  • $\alpha_{cls} $: The CLS Token, a learnable vector.
  • $\alpha^1_p E_1, \alpha^2_p E_2, \dots, \alpha^N_p E_N$: The scaled embeddings of $N$ patches.
  • $(N+1) \times D$: The resulting dimension, where $N$ is the number of patches and $D$ is the embedding size.

This $z_0$ serves as the input representation vector for the Transformer model.
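A minimal sketch of prepending a learnable CLS token to the patch embeddings (random initialization, toy sizes):

import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 512
patch_embeddings = rng.normal(size=(N, D))      # output of the linear projection
cls_token = rng.normal(size=(1, D))             # learnable, randomly initialized

z0 = np.concatenate([cls_token, patch_embeddings], axis=0)   # (N + 1, D)
print(z0.shape)   # (197, 512)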

alt text

Adding Positional Embedding Vectors

In the original Transformer, positional information was added to the word embeddings. For images, the question "which position?" is just as relevant, so a step is added that encodes positional information for each patch, reflecting its spatial location. This positional encoding is added directly to each embedding vector, including the CLS Token.

Process

  • Patch Embeddings: Represent the content of each patch.
  • Positional Embeddings: Encode the spatial order or position of each patch.
  • Combination: The positional embeddings are added to the patch embeddings and the CLS token to form the final input.

alt text

The equations are as follows:

$z_0 = [\alpha_{cls}, \alpha^1_p E_1, \alpha^2_p E_2, \dots, \alpha^N_p E_N] + E_{pos} \in \mathbb{R}^{(N+1) \times D}$

where:

  • $E_{pos} \in \mathbb{R}^{(N+1) \times D}$: The positional embedding matrix, with the same shape as the patch and CLS token sequence. This $z_0$ represents the final input to the Transformer, incorporating both the content (via patch embeddings and CLS token) and positional information.

alt text

Similar to the previous discussion, the positional encoding method used here involves sin and cos functions, as seen in prior posts. This approach encodes the positional information of each patch, allowing the model to understand the spatial arrangement of the image patches.

The positional encodings $PE(pos, 2i)$ and $PE(pos, 2i+1)$ are defined using sine and cosine functions to encode the position of each patch:

\[PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]
\[PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]

  • $pos$: The position of the patch.
  • $i$: The dimension index.
  • $d_{model}$: The dimensionality of the model.
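A minimal NumPy sketch of these sinusoidal encodings (toy sizes; an even $d_{model}$ is assumed):

import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    pos = np.arange(num_positions)[:, None]        # (num_positions, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model / 2)
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                     # even dimensions
    pe[:, 1::2] = np.cos(angle)                     # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(num_positions=197, d_model=512)
print(pe.shape)   # (197, 512): added element-wise to the patch + CLS embeddings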

Encoder

Encoder Input

After the previous steps, you now have an array of size $(n + 1) \times d$, where $n$ is the number of patches and $ d $ is the embedding dimension. The next step is to apply self-attention, which allows the model to weigh the importance of different patches (including the CLS token) in relation to each other.

Multi-Head Self Attention

Create QKV (Query, Key, Value)

The embedding vector generated from the previous steps is linearly transformed into multiple large vectors, which are subdivided into three components: Q (Query), K (Key), and V (Value). These vectors are derived from the $(n + 1) \times d$ array, where $n + 1$ represents the number of patches plus the CLS token, and each component retains the same $n + 1$ length.

alt text

The equations are as follows: \[q = z \cdot w_q,\quad k = z \cdot w_k,\quad v = z \cdot w_v \quad \text{where} \quad w_q, w_k, w_v \in \mathbb{R}^{D \times D_h}\]

  • Usually, $D_h$ is set to $D/k$, where $k$ is the number of attention heads, enabling parallel processing.

Self-Attention

Now, we need to compute the attention score. As shown in the image below, the similarity is calculated by taking the dot product between the query and the key. Then, softmax is applied so that the sum of each row in matrix ( A ) becomes 1.

alt text

The equations are as follows: \[SA(z) = A \cdot v \in \mathbb{R}^{N \times D_h}, \quad \text{where} \quad A = \text{softmax} \left( \frac{q \cdot k^T}{\sqrt{D_h}} \right) \in \mathbb{R}^{N \times N}\]

After that, a weighted sum is computed over ( v ) using the attention weights.

Here's the process: to obtain the aggregated contextual information for each element, we perform the computation using the first row of the attention matrix. At this point, we use the weights applied to ( V ) to generate the aggregated vector for the first image embedding. This operation is then applied to all patches.

alt text

Multi-Head Attention

The above process is ultimately repeated multiple times. \[MSA(z) = [SA_1(z); SA_2(z); \cdots; SA_k(z)] U_{\text{msa}} \quad \text{where} \quad U_{\text{msa}} \in \mathbb{R}^{(k \cdot D_h) \times D}\]
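Putting the QKV projection, the attention weights, and the multi-head concatenation together, here is a minimal NumPy sketch of $SA$ and $MSA$ as defined above (random weights, scaled down only to keep the toy numbers well behaved):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(z, W_q, W_k, W_v):
    q, k, v = z @ W_q, z @ W_k, z @ W_v                 # (N, D_h) each
    A = softmax(q @ k.T / np.sqrt(q.shape[-1]))          # (N, N) attention weights
    return A @ v                                         # (N, D_h)

rng = np.random.default_rng(0)
N, D, k_heads = 197, 512, 8
D_h = D // k_heads
z = rng.normal(size=(N, D))

heads = []
for _ in range(k_heads):
    W_q, W_k, W_v = (rng.normal(size=(D, D_h)) * 0.05 for _ in range(3))
    heads.append(self_attention(z, W_q, W_k, W_v))
U_msa = rng.normal(size=(k_heads * D_h, D)) * 0.05
msa = np.concatenate(heads, axis=-1) @ U_msa             # (N, D)
print(msa.shape)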

MLP (Additional Details)

  • Pretraining: Uses one hidden layer.
  • Finetuning: Uses a single linear layer.
  • In total, two linear layers are used, along with the GELU (Gaussian Error Linear Unit) activation function.

Last Attention Layer Setup

After stacking multiple attention heads, the result is mapped to a vector of dimension ( D ), which matches the patch embedding size.

alt text

Attention Layer Result

alt text

Residual Connection

Simply adds the input from the previous layer to the current layer.

alt text

Feed Forward Network

The output from the previous steps is passed into a feed-forward network.

alt text

Repeating Process

Repeat the above process L times. As shown in the second figure below, after L repetitions, the final classification is performed by passing the first vector y of the Encoder's final output through an MLP Head consisting of a single hidden layer (dimension D × C).

alt text

alt text

Transformer Layer Processing

The output $z_l$ at layer $l$ is computed by applying a Multi-Layer Perceptron (MLP) and Layer Normalization (LN) to the intermediate representation $z_l'$, combined with the residual connection $z_l'$. The process is defined as:

Mathematical Formulation \(z_l = MLP(LN(z_l')) + z_l' \quad \text{for} \quad l = 1, 2, \dots, L\) where: \(LN(z_l') = \gamma \frac{z_l' - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} + \beta\)

  • $LN$: Layer Normalization.
  • $MLP$: Multi-Layer Perceptron.
  • $\gamma, \beta$: Learnable scaling and shift parameters.
  • $\mu_i$: Mean of the input.
  • $\sigma_i^2$: Variance of the input.
  • $\epsilon$: Small constant for numerical stability.

This iterative process refines the embeddings across $L$ layers.
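A minimal sketch of one such encoder layer, combining the two sublayer equations $z_l' = MSA(LN(z_{l-1})) + z_{l-1}$ and $z_l = MLP(LN(z_l')) + z_l'$ (simple stand-ins are used for MSA and MLP; a real implementation uses the learned modules described earlier):

import numpy as np

def layer_norm(z, gamma=1.0, beta=0.0, eps=1e-6):
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

def encoder_layer(z, msa, mlp):
    z_prime = msa(layer_norm(z)) + z             # self-attention sublayer + residual
    return mlp(layer_norm(z_prime)) + z_prime    # MLP sublayer + residual

# toy stand-ins for MSA and MLP, just to show the data flow across L layers
rng = np.random.default_rng(0)
z = rng.normal(size=(197, 512))
out = encoder_layer(z, msa=lambda x: 0.1 * x, mlp=lambda x: np.tanh(x))
print(out.shape)   # (197, 512), repeated for l = 1 .. L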

Identifying Classification Token Output and Predicting Classification Probabilities

The final step involves examining the output of the CLS Token (Classification Token). This vector serves as the last step in the Vision Transformer. In the final stage, it is passed through a Fully Connected Layer to compute Classification Probabilities, enabling the prediction of the image class.

alt text

alt text

Inductive Bias

The term Inductive Bias frequently appears in this context. For CNNs, the use of convolution operations, which are specific to images, introduces an inductive bias. This bias includes:

  • Locality: Focusing on local regions of the image.
  • Two-Dimensional Neighborhood Structure: Capturing spatial relationships in a 2D grid.
  • Translation Equivariance: Ensuring the model responds consistently to translated inputs.

In contrast, Vision Transformers (ViTs) rely on MLP Layers to implicitly address Locality and Translation Equivariance. ViTs learn these properties through the Input Image Patches and refine them via Positional Embeddings (e.g., through fine-tuning).

Resource

Transformers for Image Recognition at Scale

Luna Game Engine Dev History

Honestly, I never set out to build a game engine. What I wanted was to build some kind of product that uses a game engine, and whether that platform turned out to be Unreal Engine or Unity, I could simply use it. But although the established commercial engines have clear advantages, the sheer volume of information in them is not easy to digest. For example, DirectX 11 is nominally a low-level API, but in practice it is almost a high-level one; the driver handled most of the work for you (like a staffing agency). In contrast, DirectX 12 makes you handle most of that work yourself (RootSignature, PipelineState, etc.). There is also plenty to say about parallelism: command lists can reportedly be recorded in parallel (I have not actually tried this part yet).

In particular, before committing anything, you record what work a CommandList should do and how much of it (in a stream-like fashion), and then push it onto the CommandQueue. Also, with calls like OMSetRenderTargets and IASetVertexBuffers, DX11 performs state transitions automatically, whereas in DX12 you must specify resource states explicitly through D3D12_RESOURCE_BARRIER. That said, my goal today is not to list the difficulties of DX12 or to compare it with DX11; today I want to share my development log.

Motivation

For a long time there have been things I wanted to build and express myself, and while searching YouTube for game engine material I happened to come across a YouTuber called Cherno. He used to work at EA and is now building his own game engine, Hazel. His other content is also of relatively high quality, and Walnut, a template engine that bundles Vulkan and ImGui very nicely, is something I highly recommend. I will also post other resources as development goes on.

Abstraction Layer

For now, I have decided to target a multiplatform desktop application. That means the rendering part is split into a DX12 backend and a Vulkan backend behind an abstraction layer.

Let me describe the current structure of the project. Overall I am planning a HAL (Hardware Abstraction Layer), and the design principle is that the engine internals should not depend on any particular platform. For now the abstraction layer has been built around drawing a triangle. Resources are abstracted once, and IRenderBackend decides whether rendering goes through DirectX 12 or Vulkan. There is still a lot left to do, but I am deliberately not trying to do it all at once.

LunaEngine
    |   EntryPoint.cpp
    |   EntryPoint.h
    |   Layer.h
    |   LunaPCH.cpp
    |   LunaPCH.h
    |
    +---Application
    |       Application.cpp
    |       Application.h
    |       ApplicationSpecification.h
    |
    +---Graphics
    |       IBuffer.h
    |       IPipeline.h
    |       IShader.cpp
    |       IShader.h
    |       Texture.h
    |
    +---ImGui
    |       ImGuiBuild.cpp
    |       Roboto-Regular.embed
    |
    +---Input
    |       Input.cpp
    |       Input.h
    |       KeyCodes.h
    |
    +---Renderer
    |   |   IRenderBackend.h
    |   |   IRenderCommand.h
    |   |   IRenderContext.cpp
    |   |   IRenderContext.h
    |   |   RenderQueue.cpp
    |   |   RenderQueue.h
    |   |
    |   +---DX12
    |   |   +---private
    |   |   |       BindPipelineCommand.cpp
    |   |   |       DrawCommands.cpp
    |   |   |       DX12Backend.cpp
    |   |   |       DX12Buffer.cpp
    |   |   |       DX12Pipeline.cpp
    |   |   |       DX12Shader.cpp
    |   |   |
    |   |   \---public
    |   |           BindPipelineCommand.h
    |   |           DrawCommands.h
    |   |           DX12Backend.h
    |   |           DX12Buffer.h
    |   |           DX12Pipeline.h
    |   |           DX12Shader.h
    |   |
    |   \---Vulkan
    |           VulkanBackend.cpp
    |           VulkanBackend.h

Issue

  1. First, if you want to use ImGui alongside your own rendering, you need to create a separate descriptor heap for ImGui. If you look at the Example-DX12 hook, you can see the descriptor heap being created as below, and my DX12Backend does it the same way.
{
    D3D12_DESCRIPTOR_HEAP_DESC desc = {};
    desc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
    desc.NumDescriptors = APP_SRV_HEAP_SIZE;
    desc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;
    if (g_pd3dDevice->CreateDescriptorHeap(&desc, IID_PPV_ARGS(&g_pd3dSrvDescHeap)) != S_OK)
        return false;
    g_pd3dSrvDescHeapAlloc.Create(g_pd3dDevice, g_pd3dSrvDescHeap);
}

For the current pass, the UI is drawn last, as in Unreal / Unity, so the Application layer calls it like this:

void Application::Run()
{
    _running = true;
    while (ShouldContiueRunning())
    {
        glfwPollEvents();
        float time = GetTime();
        _frameTime = time - _lastFrameTime;
        _lastFrameTime = time;

        IRenderContext::BeginFrame();
        IRenderContext::StartImGuiFrame();

        if (ImGui::BeginMainMenuBar())
        {
            if (_menubarCallBack)
                _menubarCallBack();
            ImGui::EndMainMenuBar();
        }
        
        ImGui::DockSpaceOverViewport(ImGui::GetMainViewport(),ImGuiDockNodeFlags_PassthruCentralNode);
        
        for (auto &layer : _layerStack)
            layer->OnUpdate(_frameTime);
        for (auto &layer : _layerStack)
            layer->OnUIRender();
        IRenderContext::DrawFrame();
        IRenderContext::RenderImGui();
        IRenderContext::EndFrame();
    }

    Shutdown();
}

I am only drawing a triangle, yet the triangle does not show up. So I checked it with RenderDoc.

alt text

The triangle is drawn perfectly, but in the next pass it disappears.

alt text

Looking into this, it happens because both passes try to draw to the render target, and there is reportedly an issue with the final update step when Docking or Viewports are enabled in ImGui. By default, ImGui creates its own resources on the GPU side so that its rendering works independently, and when Viewports are enabled, the background color of a new viewport is gray; so when ImGui renders last, it overwrites everything with gray. In other words, ImGui draws texture-based UI elements (that is, it renders to an OS window rather than being managed inside my renderer).

ImGuiIO &io = ImGui::GetIO();
io.ConfigFlags |= ImGuiConfigFlags_DockingEnable;
io.ConfigFlags |= ImGuiConfigFlags_ViewportsEnable;

There are two ways to work around this.

  1. ImGUI transparent Style
    ImGuiStyle& style = ImGui::GetStyle();
    style.Colors[ImGuiCol_WindowBg].w = 0.0f; // Fully transparent window background
    style.Colors[ImGuiCol_DockingEmptyBg].w = 0.0f; // Transparent dockspace background
    style.Colors[ImGuiCol_ChildBg].w = 0.0f; // Transparent child window background
    
  2. ImGUI Docking Clear
    ImGui::PushStyleColor(ImGuiCol_DockingEmptyBg, ImVec4(0.0f, 0.0f, 0.0f, 0.0f));
    ImGui::DockSpaceOverViewport(ImGui::GetMainViewport(), ImGuiDockNodeFlags_PassthruCentralNode);
    ImGui::PopStyleColor();
    

๋‘˜๋‹ค ๋ฐฉ๋ฒ•๋ก ์€ ๊ฐ™๋‹ค. ๊ทธ๋ž˜์„œ ์•„๋ž˜์™€ ๊ฐ™์ด ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์™”๋‹ค. ๊ฒฐ๊ตญ์—๋Š” ํ•ด์•ผํ•˜๋Š” ์ผ์€ ํ•˜๋‚˜์˜ RenderTarget ์— ๋ณ‘ํ•ฉ? (Aggregation) ์ด ๋งž๋Š”๊ฒƒ ๊ฐ™๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ค‘์š”ํ•œ๊ฑด RenderTarget ์ด ์ •ํ™•ํ•˜๊ฒŒ ImGUI ์™€ ์‚ผ๊ฐํ˜• ๊ทธ๋ฆฌ๋Š”๊ฒŒ ๋งž๋Š”์ง€ ํ™•์ธ์ด ํ•„์š”ํ•˜๋‹ค. ์•„๋ž˜๋Š” Aggregation ํ•œ ๊ฒฐ๊ณผ ์ด๋‹ค.

alt text

Dependencies

All dependencies are referenced from the PCH. Note that I have not yet found how to disable vcpkg in premake. Every dependency lives under src/vendor.

  • d3d12ma -> Direct3D Memory Allocation Library
  • dxheaders -> DirectX related headers
  • dxc -> HLSL compiler
  • volk -> Vulkan Loader
  • vulkan -> you need to download sdk from vulkan webpage

Math:

  • DirectXMath: optimized directX math
  • glm: OpenGL Math Library for Vulkan

UI / Window Manage

  • imgui(immediate mode gui)
  • glfw (window / input manager)
  • stb_image: image loading library.

Resource

RNN History

Series Data and Recurrent Neural Networks (RNNs)

Series data refers to data where the state at one point in time is dependent on the states before (or after) it. Examples include input data for sentiment analysis, music analysis, or even large language models (LLMs). These tasks all rely on sequences of informationโ€”also known as time series data.

What is RNN

RNN, or Recurrent Neural Network, refers to a type of neural network where data is processed sequentially, one step at a time. This sequential nature allows RNNs to handle inputs such as words in a sentence, musical notes, or time-series sensor data.

The core building block of an RNN is the cell, a unit inside the hidden layer that performs activation and maintains memory. These are often referred to as memory cells, because they try to "remember" previous information in the sequence.

At each time step, the memory cell receives two inputs:

  • The current input (e.g., the current word or note)
  • The hidden state from the previous time step

This structure enables recursive reuse of hidden states across time, making the network capable of learning temporal dependencies.

One of the key components of RNNs is the hidden state. Each memory cell carries a hidden state, and at every time step, this hidden state is updated using both the current input and the previous hidden state. This allows the network to "remember" past context as it processes the sequence step-by-step.

When the RNN is unrolled in time, it forms a chain-like structure, where each cell is connected to the next, passing along the hidden state as shown in the diagram.

alt text

In RNNs, each hidden state at time t depends on the hidden state from the previous time step (t-1). This recursive structure allows the network to "recall" past information over time.

However, this also leads to a major issue:

  • As information is passed along through many time steps, older data tends to be lost. The longer the sequence, the harder it becomes for the model to retain information from earlier time steps.
  • In addition, as the depth of the unrolled RNN increases, the model suffers from the vanishing gradient problem during backpropagationโ€”where gradients become too small to update earlier layers effectively.
  • This makes it extremely difficult to learn long-term dependencies. One common mitigation (for the exploding-gradient side of the problem) is gradient clipping: as shown in the diagram, a constant C is defined as a threshold, and if the gradient norm exceeds this value, the gradient is scaled back down; a short sketch follows below.

This technique helps stabilize training, especially in deep RNNs or long sequence tasks.
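A minimal sketch of gradient clipping by global norm (pure NumPy; C is the threshold from the diagram):

import numpy as np

def clip_by_global_norm(grads, C):
    """Rescale a list of gradient arrays so their combined norm is at most C."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > C:
        grads = [g * (C / total_norm) for g in grads]
    return grads

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped = clip_by_global_norm(grads, C=5.0)
print(clipped)   # every gradient scaled by 5/13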

alt text

LSTM & GRU

To address the issue of losing information from earlier time steps, a specialized architecture was introduced: LSTM (Long Short-Term Memory).

LSTM is designed to capture both short-term and long-term dependencies in sequential data, allowing the model to selectively retain or forget information over time.

alt text

Although the above equations may look quite complex, they essentially involve multiple gates, each performing a specific role. Mathematically, the representation of these gates can be expressed as shown in the image below.

alt text

If we break down these equations further and represent them in a simplified diagram, the image below might help make the concept easier to understand.

alt text

As mentioned earlier, each gate has a specific role. Let's examine them one by one from left to right:

1. Forget Gate (Decides Whether to Erase the Cell State or Not)

alt text

The Forget Gate is responsible for deleting unnecessary information. At the current time step $t$, the input $x_t$ and the previous hidden state $h_{t-1}$ are passed through a sigmoid function, producing values between 0 and 1. Values closer to 0 indicate that much of the information is discarded, while values closer to 1 mean the information is retained. This gating mechanism controls how the cell state is updated accordingly.

2. Input Gate

alt text

The Input Gate processes the new information to be added to the cell state. As shown on the right, the current input $x_t$ is multiplied by the weight matrix $W_{xi}$, the previous hidden state $h_{t-1}$ is multiplied by $W_{hi}$, and their sum is passed through a sigmoid function, giving $i_t$. At the same time, $x_t$ multiplied by $W_{xg}$ and $h_{t-1}$ multiplied by $W_{hg}$ are summed and passed through a hyperbolic tangent (tanh) function; the result of this operation is denoted $g_t$.

In other words, the combination of the sigmoid output (ranging from 0 to 1) and the tanh output (ranging from -1 to 1) determines how much new information is selected to update the cell state.

3. Cell Gate ***

alt text

In a standard RNN, only the hidden state is passed along to the next time step. In an LSTM, however, both the hidden state and the cell state are passed forward. The Forget Gate selectively removes some information from the cell state, while the element-wise product of the input gate activation $i_t$ and the candidate values $g_t$ determines how much new information is added.

These two components, the retained memory and the newly selected information, are then combined (summed) to update the current cell state $C_t$. This updated cell state is passed on to the next time step $t+1$. If the forget gate output $f_t$ is zero, the previous cell state $C_{t-1}$ is effectively reset to zero, meaning the cell only retains the newly selected information.

4. Output Gate & Hidden State

alt text

Finally, the Output Gate computes the output at the current time step $t$. It takes the current input $x_t$ and the hidden state, passes them through a sigmoid function to produce the output gate activation $o_t$. Meanwhile, the current cell state $C_t$ is passed through a hyperbolic tangent (tanh) function, producing values between -1 and 1. The element-wise product of these two values filters the cell state output, resulting in the new hidden state, which is then passed on to the next time step.
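Putting the four gates together, here is a minimal NumPy sketch of a single LSTM step (random toy weights; biases omitted for brevity):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev)   # forget gate
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev)   # input gate
    g_t = np.tanh(W["xg"] @ x_t + W["hg"] @ h_prev)   # candidate values
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev)   # output gate
    c_t = f_t * c_prev + i_t * g_t                    # updated cell state
    h_t = o_t * np.tanh(c_t)                          # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_hid = 4, 6
W = {name: rng.normal(size=(d_hid, d_in if name.startswith("x") else d_hid))
     for name in ["xf", "hf", "xi", "hi", "xg", "hg", "xo", "ho"]}
h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(rng.normal(size=d_in), h, c, W)
print(h.shape, c.shape)   # (6,) (6,)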

As shown in the diagram, the LSTM architecture divides these operations into multiple gates. In contrast, the GRU (Gated Recurrent Unit) simplifies this by combining some of these functions into just two gates: the Update Gate and the Reset Gate. This reduction results in a simpler structure while still effectively updating the hidden state over time, making the GRU a streamlined variant of the LSTM.

alt text

Resource

CNN History

From Deep Learning Aspirations to Systems Development: My Unexpected Path into AI

During grad school, I studied Deep Learning, thinking it was a promising field worth diving into. I was genuinely excited about neural networks and how they were changing the landscape of tech.

But after graduation, I found myself doing something completely different: systems development. It felt disconnected from AI at first, and honestly, a bit frustrating. Still, I took it as a good opportunity and kept going. Surprisingly, that path led me back to AI, just from a different angle. I ended up working on systems that supported Computer Vision features. I wasn't building models, but I was helping them run efficiently in real environments.

Why CNN?

At first, I tried training image data using a simple Multi-Layer Perceptron (MLP). To do that, I had to flatten the 2D image into a 1D vector. While this made it technically possible to train, it came at a cost: the model lost all the local and topological information in the image. It couldn't understand where features were located, only what values existed. This made learning abstract concepts in images inefficient and slow.

To solve this, I turned to Convolutional Neural Networks (CNNs), which preserve spatial information using the concept of a receptive field: like a lifeguard watching over a specific area of a pool, each convolutional filter focuses on a local region of the image.

Preliminary

In CNNs, we use small filters (or kernels) that slide over the image. Each kernel has weights (e.g., a 3×3 filter) and performs convolution operations followed by a bias addition and an activation function (like ReLU or Sigmoid). This produces a feature map that captures localized patterns in the image.

For example, filters like the Sobel operator are hand-designed to detect edges, but in CNNs, these filters are learned automatically during training. As a result, CNNs can effectively capture local features and build up abstract representations layer by layer.

By using convolutional layers instead of fully connected layers, the model not only gains efficiency but also becomes much better at recognizing patterns in images.

1. LeNet - Gradient Based Learning Applied to Document Recognition

alt text

One of the earliest and most influential CNN-based architectures was LeNet, developed by Yann LeCun. It was originally designed for handwritten digit recognition (e.g., MNIST) and laid the groundwork for modern convolutional networks.

The input to LeNet is a 32×32×1 grayscale image. The first convolutional layer applies 6 filters of size n×n (typically 5×5), resulting in a feature map of size 28×28×6. This means each of the 6 filters scans the input image and extracts different local features.

After convolution, a subsampling (or downsampling) layer is applied, usually a form of average pooling, which reduces the spatial resolution. This pattern of Convolution → Subsampling repeats, gradually extracting higher-level features while reducing dimensionality.

Finally, the network flattens the feature maps and passes them through one or more fully connected layers, similar to an MLP, to perform classification.

2. AlexNet - ImageNet Classification with Deep Convolutional Neural Network

alt text

While traditional face detection algorithms like Haar Cascades could recognize faces fairly well, especially with properly preprocessed input, AlexNet took things to a whole new level. Designed to handle 224×224 RGB images, AlexNet leveraged the power of GPUs for parallel computation, which allowed it to scale deeper and wider than previous models.

One interesting feature of AlexNet was its split architecture: the network was divided into two parallel streams to take advantage of multi-GPU setups.

AlexNet also introduced several key innovations that became standard in deep learning:

  • ReLU Activation: Instead of using sigmoid or tanh, AlexNet used the ReLU function f(x) = max(0, x) for faster convergence and to mitigate the vanishing gradient problem.
  • Dropout: To combat overfitting, AlexNet randomly dropped units during training, forcing the network to learn redundant representations.
  • Overlapping Pooling: Unlike previous networks that used non-overlapping pooling (e.g., pooling window size = stride), AlexNet used 3×3 pooling windows with stride 2, allowing windows to overlap. This reduced the output size and helped capture more spatial detail, improving translational invariance.
  • Local Response Normalization (LRN): Since ReLU can produce very large activations, LRN was introduced to normalize the responses across adjacent neurons at the same spatial location. This helped prevent a few highly activated neurons from dominating.
  • Softmax: At the output, a softmax layer was used to convert logits into probabilities, amplifying confident predictions and suppressing weaker ones.

AlexNet's success in the 2012 ImageNet competition marked a turning point for deep learning, showing that with enough data, compute, and smart design choices, neural networks could outperform traditional hand-engineered features by a large margin.

3. VGGNet - Very Deep Convolutional Networks for Large-Scale Image Recognition

alt text

VGGNet built on the success of AlexNet, using similar input dimensions (e.g., 224×224×3), but introduced a key design shift: replacing larger filters (like 5×5) with multiple 3×3 convolutions stacked in sequence.

This approach brought several advantages:

  • Deeper architectures (16 or 19 layers) could be created without excessively increasing the number of parameters.
  • Multiple 3×3 filters in a row have the same receptive field as a larger filter (e.g., two 3×3 filters ≈ one 5×5), but with fewer parameters and lower computational cost; a quick parameter count is shown after this list.
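For example, assuming $C$ input channels and $C$ output channels and ignoring biases, the parameter counts work out to:

\[\underbrace{2 \cdot (3 \cdot 3 \cdot C \cdot C)}_{\text{two stacked } 3\times3 \text{ convs}} = 18C^2 \qquad \text{vs.} \qquad \underbrace{5 \cdot 5 \cdot C \cdot C}_{\text{one } 5\times5 \text{ conv}} = 25C^2\]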

As a result, VGGNet significantly improved performance while maintaining a clean, uniform architecture. Because of its regular structure and strong performance, VGGNet became a popular backbone network for tasks like semantic segmentation and object detection.

However, deeper networks introduced a new problem: during backpropagation, gradients could vanish as they moved backward through many layers, especially toward the input. This vanishing gradient problem made training very deep models difficult, eventually motivating the development of architectures like ResNet, which addressed this with residual connections.

4. GoogLeNet - Going Deeper with Convolutions

alt text

alt text

If VGGNet made networks deeper by stacking layers vertically, GoogLeNet (a.k.a. Inception v1) took a different approach: it grew in both width and depth (it goes deeper both vertically and horizontally).

GoogLeNet introduced the concept of the Inception Module, which allowed the network to process spatial information at multiple scales simultaneously. As the name suggests, this architecture digs deeper and deeper into the network structure.

The unique part of GoogLeNet is the Inception Module. Take a look at the diagram below to understand it better.

alt text

As shown in the diagram above, one way to increase the depth of the network is by extracting feature maps using different kernels, then applying average pooling or max pooling, and finally concatenating the results.

However, GoogLeNet went further and proposed a more efficient structure by combining multiple operations, like pooling and convolutions with different kernel sizes, in parallel. One of the key innovations was the use of 1×1 convolutions, either before or after other operations, forming what's known as a bottleneck structure.

Using 1×1 convolutions significantly reduced the number of parameters and computation. For example, performing the same operation with 1×1 filters required only around 67,584 parameters (12,288 + 55,296), a much smaller number compared to what would be needed without them.

Another interesting feature of GoogLeNet is the use of auxiliary classifiers. Instead of having a single softmax classifier at the end, it includes two additional softmax branches in the middle of the network. These auxiliary classifiers help mitigate the vanishing gradient problem by providing additional gradient signals during training.

Lastly, GoogLeNet replaces traditional fully connected layers with Global Average Pooling (GAP) near the end of the network. While the exact mechanism may seem abstract at first, the core idea is that GAP reduces each feature map to a single number by averaging its spatial values, effectively summarizing global information without introducing additional parameters, unlike fully connected layers.

alt text

5. ResNet

Residual Learning: Tackling the Vanishing Gradient Problem. As mentioned in the previous post, one of the biggest issues with deep neural networks is the vanishing gradient problem. As networks get deeper, gradients calculated during backpropagation tend to shrink. The more layers you have, the more the gradients approach zero, which means the weight updates, especially in early layers, become negligible. In other words, the network struggles to learn because the influence of the output on earlier layers diminishes.

To address this, the concept of the Residual Block was introduced. It uses a mechanism called a skip connection, which forms the basis of residual learning. This allows the gradient to flow directly through the network, helping to mitigate the vanishing gradient issue even in very deep architectures.

Traditionally, the goal was to learn a function H(x) that maps the input x to the desired output y; in other words, to minimize H(x) - y. However, residual learning takes a different approach: instead of learning H(x) directly, the network learns the residual function H(x) - x. The idea is that if the desired mapping is similar to the input, it's easier to learn the difference between the input and output than the output itself.

By reformulating the learning objective this way, the model becomes easier to optimize and performs better in very deep configurations.

alt text

Why F(x) + x Helps: Stabilizing Gradients with Residual Blocks: The diagram above shows what we've been building toward: by using Residual Blocks, we compute F(x) + x, where F(x) is the output of a few convolutional layers and x is the original input. The key idea here is that when you differentiate this structure during backpropagation, the gradient always retains a value of 1 through the skip connection, ensuring that at least some portion of the signal survives as it flows backward.
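A minimal sketch of that structure, with a generic two-layer transform standing in for the convolutional layers (the point is only the identity skip path):

import numpy as np

def residual_block(x, W1, W2):
    Fx = np.maximum(0, x @ W1) @ W2   # F(x): two toy "layers" with a ReLU in between
    return Fx + x                     # skip connection: output = F(x) + x

rng = np.random.default_rng(0)
x = rng.normal(size=16)
W1, W2 = rng.normal(size=(16, 16)) * 0.1, rng.normal(size=(16, 16)) * 0.1
y = residual_block(x, W1, W2)
# d y / d x = dF/dx + I, so the identity term keeps the gradient from vanishing entirely
print(y.shape)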

Of course, this doesn't completely eliminate the vanishing gradient problem, but according to the original paper, the issue was significantly mitigated by using Batch Normalization. Whether BatchNorm fully solves the vanishing gradient issue or just partially helps is still up for debate. One could argue it's a major breakthrough, or just a minor contributor. Either way, it plays an important role in training very deep networks. BatchNorm's role is to normalize the output of each layer, helping stabilize the gradient flow and speed up convergence.

As a result of stacking multiple residual blocks, 50 to 152 layers deep, ResNet was able to achieve a depth 8× greater than VGGNet while still being trainable.

This is how deeply layered networks with Residual Learning end up looking, as illustrated in the diagram below.

alt text

According to the paper, as the network depth increases, there is a noticeable trend in performance, but this trend is not necessarily due to overfitting.

Performance Analysis

| Cause | Explanation | Resolution |
|-------|-------------|------------|
| Vanishing Gradient | Weakened gradients in upper layers during backpropagation | Skip Connection |
| Weight Attenuation | Imbalanced parameter updates in deeper layers | Residual Learning Architecture |
| Optimization Issues | Non-convex functions increase local minima | Bottleneck Architecture |

The following diagram shows how these challenges have been addressed in the improved architecture.

alt text

  1. Reduction stage: Reduce the number of parameters
  2. Processing stage: Extract local features
  3. Input/Output alignment: Match channel dimensions for $H(x) - x$

alt text

6. DenseNet

alt text

As shown above, the layers are densely connected, meaning each layer is connected to every other layer in a feed-forward fashion. This dense connectivity is the key point, and can be seen as an extension of residual learning.

Key characteristics include:

  1. Feature reuse: Each layer receives the outputs of all preceding layers as its input.
  2. Parameter efficiency: The number of feature maps is controlled by a growth rate (k), which limits how much the output grows per layer.
  3. Implicit multi-scale learning: Low-level and high-level features are naturally fused through dense connections, enabling the network to learn across multiple scales automatically.

| Component | Role | Mathematical Expression |
|-----------|------|-------------------------|
| Dense Block | Preserve feature map connections | $x_l = H_l([x_0, x_1, \dots, x_{l-1}])$ |
| Transition Layer | Reduce dimensions and prevent redundancy | $T(x) = \text{Conv}_{1\times1}(\text{BN}(\text{ReLU}(x)))$ |
| Bottleneck Layer | Improve computational efficiency | $H_l = \text{Conv}_{3\times3}(\text{Conv}_{1\times1}(x))$ |

From the equations above, we can clearly see the difference from ResNet. In ResNet, the residual connection is defined as: \[x_l = H_l(x_{l-1}) + x_{l-1}\]

This means each layer receives input only from the previous layer, and the outputs are summed. In contrast, DenseNet connects all preceding feature maps to the current layer as input, which increases the diversity of learned representations.

One drawback of ResNet is that if the two feature maps being summed come from different distributions, the addition operation may become less effective or even harmful.

In short:

  • ResNet uses sum
  • DenseNet uses concat
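A minimal sketch of the difference (toy vectors; H is a stand-in for a layer's transformation):

import numpy as np

H = lambda x: np.tanh(x)                     # stand-in for a layer's transformation
x = np.arange(4, dtype=float)

resnet_out   = H(x) + x                      # sum: the feature size stays (4,)
densenet_out = np.concatenate([x, H(x)])     # concat: features accumulate, shape (8,)
print(resnet_out.shape, densenet_out.shape)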

7. EfficientNet

Traditional models typically scale along a single dimension: either depth, width, or resolution. What sets this approach apart is the idea of scaling all three dimensions in a balanced way. This is the core of what's called Compound Scaling.

alt text

So the key point of this architecture lies in how to find the optimal balance between depth, width, and resolution.

Let's briefly look at what each dimension represents:

  • Depth: Increasing the number of layers → allows the model to capture more complex patterns
  • Width: Increasing the number of channels → improves the model's ability to learn fine-grained features
  • Resolution: Increasing the input image size → enables the use of higher-resolution spatial information

Based on these definitions, the architecture introduces a compound coefficient, denoted φ (phi), to uniformly scale all three dimensions.

This coefficient is combined with scaling constants found by a small grid search, which yielded: α = 1.2 (depth), β = 1.1 (width), γ = 1.15 (resolution).
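For reference, these constants enter the compound scaling rule from the EfficientNet paper, where depth, width, and resolution are scaled together by the compound coefficient φ under a constraint that keeps the FLOPS growth at roughly $2^{\phi}$:

\[d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi}, \qquad \text{s.t. } \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2,\ \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1\]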

These constants are then used to guide the compound scaling process in EfficientNet.

| Component | Technique | Mathematical Expression |
|-----------|-----------|-------------------------|
| MBConv | Inverted residual block | $\hat{F}(x) = T_{proj}(T_{expand}(x)) \odot SE(T_{dw}(x))$ |
| SE Block | Channel-wise attention modulation | $w_c = \sigma(W_2\, \delta(W_1 \cdot GAP(x)))$ |
| Swish Activation | Smooth activation function | $\text{swish}(x) = x \cdot \sigma(\beta x)$ |

8. MobileNet

Just from the name alone, it's clear where this model is meant to be used: on mobile devices. It's a deep learning model designed specifically for mobile and resource-constrained environments. At its core, the key challenge was: "How can we reduce the amount of computation?" and that's exactly what this architecture set out to solve.

alt text

In MobileNet, the goal is to balance latency and accuracy. Ultimately, the model achieves successful lightweight optimization, making it suitable for mobile and embedded devices.

To understand how this is done, itโ€™s important to grasp the concept of Depthwise Separable Convolution.

Unlike a standard convolution, where each filter operates across all input channels at once, a depthwise separable convolution first learns a separate spatial filter for each input channel (the depthwise step) and then combines the channels with a 1×1 pointwise convolution. Because the depthwise convolution works independently per channel, similar to grouped filters, this factorization dramatically reduces computation while retaining most of the performance.
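For reference, the cost comparison that motivates this factorization (from the MobileNet paper), for a $D_K \times D_K$ kernel, $M$ input channels, $N$ output channels, and a $D_F \times D_F$ feature map:

\[\underbrace{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}_{\text{depthwise + pointwise}} \quad \text{vs.} \quad \underbrace{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F}_{\text{standard convolution}}\]

which gives a reduction factor of $\frac{1}{N} + \frac{1}{D_K^2}$.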

alt text

alt text

| Technique | Description |
|-----------|-------------|
| Channel Reduction | Reduce the number of channels using a width multiplier: channels × α (e.g., α = 1.0, 0.75, 0.5) |
| Compression | Reduce model size and parameters by setting a smaller α (e.g., α = 0.5) |
| Evenly Spaced Downsampling | Use stride = 2 in early layers (e.g., 224×224 → 112×112) |
| Shuffle Operation | Shuffle channels to promote cross-group information flow (e.g., ShuffleNet) |
| Knowledge Distillation & Compression | Model compression techniques like pruning, quantization, and distillation |

Resource

CUDA Kernel - Example

As we saw in the previous post, vector addition can be done with a single block running many threads. So how would we add an array of total size 8 while splitting the work across multiple blocks? It is simpler than it sounds: use two blocks and assign four threads to each. Take a look at the code segment below; implementing it with 2 blocks and 4 threads per block is all it takes. What we really care about is how the addition actually unfolds, so even though it slows things down, we can observe it through printf. The index is computed as int i = blockDim.x * blockIdx.x + threadIdx.x. blockIdx.x and threadIdx are fairly intuitive, but blockDim deserves a moment of thought.

__global__ void addKernel(const int* a, const int* b, int* c, int size)
{
	int i = blockDim.x * blockIdx.x + threadIdx.x;

	if (i < size)
		c[i] = a[i] + b[i];

	printf("%u %u %u %u %u %d\n", blockDim.x, blockDim.y, blockDim.z, blockIdx.x, threadIdx.x, i);
}

addKernel<<<2, 4>>>(dev_a, dev_b, dev_c, size);

์ผ๋‹จ ์œ„์˜ ์ฝ”๋“œ๋ฅผ Printf ํ•œ ๊ฒฐ๊ณผ๊ฐ’์„ ํ•œ๋ฒˆ ๋ด๋ณด์ž. ์•„๋ž˜์˜ ๊ฒฐ๊ณผ๊ฐ’์„ ๋ณด์ž๋ฉด, Thread Block ์€ 4 x 1 x 1 ์ด๋‹ค. ๊ทธ๋ง์€ Thread ์˜ ๊ฐœ์ˆ˜ ํ•˜๋‚˜์˜ Block ๋‹น 4 ๊ฐœ์˜ Thread ๋ฅผ ์˜๋ฏธํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ , Block Index ๋Š” ์ด 2 ๊ฐœ์˜ Block ์„ ์‚ฌ์šฉํ•˜๋‹ˆ 0, 1 ๋กœ ๋‚˜์˜ค๋ฉฐ, ์ด์ œ ThreadIdx ๋Š” ๊ทธ๋Œ€๋กœ ๋‚˜์˜จ๋‹ค. ํ•˜์ง€๋งŒ ์—ฌ๊ธฐ์„œ ์ค‘์š”ํ•œ๊ฑด ๋ฐ”๋กœ 1 ๋ถ€ํ„ฐ ๋Œ์•„๊ฐ”๋‹ค๋Š” ์†Œ๋ฆฌ์ด๋‹ค. Multithreading ์„ ํ•˜๋‹ค ๋ณด๋ฉด ์ˆœ์„œ์— ์ƒ๊ด€์—†์ด ๋Œ๊ธฐ ๋•Œ๋ฌธ์— ๊ทธ ํ™˜๊ฒฝ ๋•Œ๋ฌธ์— ๋จผ์ € ์‹คํ–‰๋˜๋Š”๊ฑด ์ผ์ด ๋๋‚œ ์ˆœ์„œ๋Œ€๋กœ ๋˜์„œ ์ˆœ์„œ์™€ ์ƒ๊ด€์—†์ด Operation ๋งŒ ๋๋‚ด๋ฉด ๋œ๋‹ค๋Š” ๋ฐฉ์‹์—์„œ ์˜จ๊ฑฐ์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ  i ์˜ ๊ณ„์‚ฐ์˜ ๊ฒฐ๊ณผ ๊ฐ’๋“ค์„ ๋ด๋„ ์šฐ๋ฆฌ๊ฐ€ ์˜ˆ์ƒํ•˜์ง€ ๋ชปํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณผ์ˆ˜ ์žˆ๋‹ค.

Alt text

๊ฒฐ๊ตญ์—๋Š” ๊ทธ๋Ÿผ ์šฐ๋ฆฌ๊ฐ€ ์–ด๋–ป๊ฒŒ Debugging ํ•˜์ง€? ๋ผ๋Š” ์งˆ๋ฌธ์ด ์žˆ๋‹ค.. ์–ด์ฐŒ์ €์ฐŒ ๋ฌ๋“ ๊ฐ„์—, CUDA ์•ˆ์— ์žˆ๋Š” Thread Block Index, Thread Index, Block Dimension ๊ฐ™์€ ๊ฒฝ์šฐ๋ฅผ ๋ด์•ผํ•˜์ง€ ์•Š์„๊นŒ? ๋ผ๋Š”๊ฒŒ ํฌ์ธํŠธ์ด๋‹ค. Nsight ๋ฅผ ์“ฐ๋ฉด ๊ต‰์žฅํžˆ ์ž˜๋‚˜์™€์žˆ๋‹ค. ์•„๋ž˜์˜ ๊ทธ๋ฆผ์„ ๋ณด๋ฉด, ๊ต‰์žฅํžˆ Visualization ์ด ์ƒ๊ฐ๋ณด๋‹ค ์ž˜๋˜์–ด์žˆ๋‹ค. ์•ˆ์— ์žˆ๋Š” ๋‚ด๋ถ€ ๊ตฌ์„ฑ์š”์†Œ๋Š” ๊ฑฐ์˜ ์œ„์™€ ๊ตฌํ˜„ํ•œ ๋ถ€๋ถ„์„ ๋งค์นญํ•˜๋ฉด์„œ ๋ณด๋ฉด ์ข‹์„๊ฒƒ ๊ฐ™๋‹ค.

Alt text

์•„๋ž˜์˜ ์ฝ”๋“œ ๊ฐ™์€ ๊ฒฝ์šฐ ๋Œ€์šฉ๋Ÿ‰์˜ Data ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ์˜ˆ์ œ ์ฝ”๋“œ๋ผ๊ณ  ๋ณผ์ˆ˜ ์žˆ๋‹ค. 64 ๊ฐ™์€ ๊ฒฝ์šฐ๋Š” ๋‚ด ์ปดํ“จํ„ฐ ์ŠคํŽ™์ค‘์— Maximum Threads Per Dimension ์•ˆ์— ์žˆ๋Š” ๋‚ด์šฉ์„ ์ธ์šฉํ–ˆ๋‹ค. ๋ฌผ๋ก  ์ด๋•Œ๋Š” Printf ๋ฅผ ์“ฐ๋ฉด ๊ณผ๋ถ€ํ•˜๊ฐ€ ๊ฑธ๋ฆด์ˆ˜๋„ ์žˆ์œผ๋‹ˆ ๊ฒฐ๊ณผ๊ฐ’๋งŒ ํ™•์ธํ•˜์ž. ๋ฌผ๋ก  ์•„๋ž˜์˜ ๊ทธ๋ฆผ๊ฐ™์ด Printf ๋ฅผ ํ•œ๊ฒฝ์šฐ๋„ ๋ณผ์ˆ˜ ์žˆ๋‹ค.

const int size = 1024 * 1024 * 64;
const int threadsPerBlock = 1024;
int numBlocks = (size + threadsPerBlock - 1) / threadsPerBlock; // int(ceil(float(size) / threadsPerBlock));
addKernel<<<numBlocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, size);

Alt text

Resource

Closure Advanced

I expected the Closure => Functor (Lambda Expression) and capture parts to be somewhat tricky, but the syntax is essentially the same as in C++. The only difference is how the copy is made depending on the type. (For example, in C++ you choose whether to capture by reference with [&] or by copy with [=]; in Swift, a struct is a value type and is therefore copied, while a class is a reference type, so the address is captured.)

Escaping Closure

To understand this part, you need a clear picture of when a function ends. Suppose that inside some function, closure() is executed right away: the function runs, the closure runs, and everything works as expected. But what happens when asynchronous work is involved? The closure inside the function still has to wait to be executed, yet the function itself may already have finished. The asynchronous work then effectively completes somewhere outside the function (e.g., back in main), so the closure needs to outlive the function that received it. The way to allow this is @escaping; by default, closures are non-escaping.

์•„๋ž˜์˜ ์ฝ”๋“œ๋ฅผ ๋ด๋ณด์ž, 3 ์ดˆ ์ดํ›„์— DispatchQueue ์—์„œ ๋น„๋™๊ธฐ๋ฅผ ์‹ค์‹œ๋ฅผ ํ• ๊ฑด๋ฐ, ์ด๋•Œ ๋ฐ”๋กœ ๋“ค์–ด๊ฐ€์•ผํ•˜๋Š”๊ฒŒ @escaping keyword ์ด๋‹ค. ์ฆ‰ ์–ด๋–ค ํ•จ์ˆ˜์˜ ํ๋ฆ„์—์„œ, closure ๊ฐ€ stuc ํ•œ ๋ถ€๋ถ„์„ ํ’€์–ด์ค˜์•ผ closure ์˜ ๋๋งบ์Œ์ด ํ™•์‹คํ•˜๋‹ค๋Š”๊ฒƒ์ด๋‹ค. (๋ฌผ๋ก  ํ•จ์ˆ˜ ์ฃผ๊ธฐ๋Š” ๋๋‚˜๊ฒŒ ๋œ๋‹ค.) ์ฐธ๊ณ ๋กœ @Sendable Keyword ๋ฅผ ์•ˆ์“ด๋‹ค๋ฉด Error ๋ฅผ ํ‘œ์ถœํ• ๊ฒƒ์ด๋‹ค. ๊ทธ ์ด์œ ์ค‘์— ํ•˜๋‚˜๋Š” ์ผ๋‹จ compleition (closure type) ์€ referance type ์ด๋ฉฐ, ์ด๊ฒƒ์„ ๋น„๋™๊ธฐ์ฒ˜๋ฆฌ๋‚ด์—์„œ ์‚ฌ์šฉํ•˜๋ ค๋ฉด, Thread Safe ๋ผ๋Š”๊ฑธ ๋ณด์žฅ์„ ํ•ด์•ผํ•˜๋Š”๋ฐ, compiler ์—์„œ๋Š” ๋ชจ๋ฅธ๋‹ค. ๊ทธ๋ž˜์„œ ๋ช…์‹œ์ ์œผ๋กœ Type ์„ ์ง€์ •ํ•ด์ค˜์•ผํ•œ๋‹ค.

func performAfterDelay(seconds: Double, completion: @escaping @Sendable () -> ()) {
    DispatchQueue.main.asyncAfter(deadline: .now() + seconds) {
        completion()
    }
}
print("before")
performAfterDelay(seconds: 3, completion: {
    print("Hello")
})
print("after")

On the closure side, a (non-escaping) closure received as a parameter cannot be assigned to a variable or constant (this is worth knowing; occasionally the compiler error may not even fire). Likewise, if the closure is used inside a nested function, that nested function cannot be returned. In other words, within the function's flow the closure must be executed before the function ends. There is one more peculiar case, though; take a quick look at the code below.

func performAfterDelay(seconds: Double, completion: (@Sendable () -> Void)?) {
    DispatchQueue.main.asyncAfter(deadline: .now() + seconds) {
        completion?()
    }
}

์ž ๊ฑฐ์‹œ์ ์œผ๋กœ ๋ดค์„๋•Œ๋Š” closure ์— return ๊ฐ’์„ ๊ธฐ๋Œ€ํ• ์ˆ˜๋„ ์žˆ๊ณ , ์—†์„์ˆ˜๋„ ์žˆ๋‹ค. ๊ทธ๊ฑธ Optional(?) type ์„ ์‚ฌ์šฉํ–ˆ๋‹ค. Optional ์„ ์‚ฌ์šฉํ•œ๋‹ค๊ณ  ํ•˜๋ฉด, optional type argument ์ด๊ธฐ๋•Œ๋ฌธ์— ์ด๋ฏธ escaping ์ฒ˜๋ฆฌ๊ฐ€ ๋ ๊ฒƒ์ด๋‹ค. ๋ผ๊ณ  ๋งํ•œ๋‹ค. ์ฆ‰ closure type ์ด ๋”์ด์ƒ ์•„๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ์ž๋™์ ์œผ๋กœ escaping ํ•œ๋‹ค๊ณ  ๋ณด๋ฉด ๋ ๊ฒƒ ๊ฐ™๋‹ค.

Capture by Reference & Capture by Value

This can be confusing, so let's organize it, starting with the code. Suppose we implement it with a struct, i.e., a value type. Since a struct is a value type, you might assume that capturing it makes a copy, so the printed value would never change. However, a closure capture by default references the variable's memory: it holds a reference to person.age's storage, so the capture behaves like a reference. The example below that is the copy case: we pass a capture list, which means the closure holds on to the value as it was at the moment the closure was created (i.e., it keeps the captured value). Because the capture list captures by value, later changes are not reflected. Note that if you try weak here, the compiler will complain that weak applies only to class and class-bound protocol types, i.e., it is only usable with reference types.

struct Person {
    var name: String
    var age: Int
}

func captureRefTest() {
    var person = Person(name: "John", age: 30)
    var closure = {
        print(person.age)
    }
    
    closure()
    person.age = 40
    closure()
}

captureRefTest()

func captureCopyTest() {
    var person = Person(name: "Nick", age: 20)
    var closure = { [person] in
        print(person.age)
    }
    
    closure()
    person.age = 40
    closure()
}

captureCopyTest()

Then what about a class? A class is a reference type to begin with, so there is no value capture in the first place; even with a capture list, it still behaves like a reference.

class Animal {
    var name: String
    var age: Int
    
    init(name: String, age: Int) {
        self.name = name
        self.age = age
    }
}

func captureTest() {
    var animal = Animal(name: "Dog", age: 10)
    var closure = { [weak animal] in
        print(animal!.age)
    }
    
    closure()
    animal.age = 20
    closure()
}
captureTest()

CUDA Kernel

Kernel Function (Vector Addition)

์ผ๋ฐ˜์ ์œผ๋กœ c ์—์„œ๋Š” ๋‘๊ฐ€์ง€์˜ Vector(Array) ๋ฅผ ๋”ํ•œ๋‹ค๊ณ  ๊ฐ€์ •์„ ํ–ˆ์„๋•Œ, ์•„๋ž˜์˜ ๋ฐฉ์‹๋Œ€๋กœ ๋”ํ•œ๋‹ค.

int main(void) {
    // host side
    const int SIZE = 6;
    const int a[SIZE] = {1, 2, 3, 4, 5, 6 };
    const int b[SIZE] = {10, 20, 30, 40, 50, 60 };
    int c[SIZE] = {0};

    for (register int i = 0; i < SIZE; ++i) {
        c[i] = a[i] + b[i];
    }
    return 0;
}

์œ„์˜ For-Loop ์•ˆ์— ์žˆ๋Š” Body ๊ฐ€ ์žˆ๋‹ค, ์ด๋•Œ๋ฅผ `Kernel Function` ์ด๋ผ๊ณ ๋„ ํ•œ๋‹ค. (with proper value). ์‹ค์ œ ์˜ˆ์‹œ๋กœ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค. ์™œ ๊ตณ์ด idx ๋ฅผ ๋„˜๊ธฐ๋А๋ƒ๋Š” ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•ด์„œ `Kernel Function` ์„ Define ํ•˜๋Š”๊ฒƒ๊ณผ ๊ฐ™๋‹ค. ํ•˜์ง€๋งŒ ์—ฌ๊ธฐ๋„ ์•„์ง์€ CPU ์—์„œ ์ฒ˜๋ฆฌ๋ฅผ ํ•˜๋Š”๊ฑฐ๋‹ค. (CallStack ์—๋Š” CPU[0] executes add_kernel(0 ...)) ์ด๋Ÿฐ์‹์œผ๋กœ ์ˆ˜ํ–‰์ด SIZE - 1 ๋งŒํผ ๋ ๊ฑฐ๋‹ค. ์ฆ‰ ์ด๊ฑด sequential execution ์ด๋ผ๊ณ  ์ƒ๊ฐํ•œ๋‹ค.

void add_kernel(int idx, const int* a, const int* b, int*c) {
    int i = idx;
    c[i] = a[i] + b[i];
}

for (register int i = 0; i < SIZE; ++i) {
    add_kernel(i, a, b, c);
}

What if we assume a multi-core CPU, i.e., parallel execution? With two cores, two elements can be processed in parallel at each time step.

at time 0: CPU = core#0 = executes add_kernel(0, ...) 
at time 0: CPU = core#1 = executes add_kernel(1, ...)
at time 1: CPU = core#0 = executes add_kernel(2, ...) 
at time 1: CPU = core#1 = executes add_kernel(3, ...)
...
at time (n-1)/2: CPU = core#1 = executes add_kernel(SIZE - 1, ...)

Then what about the GPU? A GPU has a huge number of cores, so it can bring massive parallelism to bear: as shown below, the whole loop can be processed in parallel at time 0.

at time 0: GPU = core#0 = executes add_kernel(0, ...) 
at time 0: GPU = core#1 = executes add_kernel(1, ...)
at time 0: GPU = core#2 = executes add_kernel(2, ...) 
at time 0: GPU = core#3 = executes add_kernel(3, ...)
...
at time 0: GPU = core(#n-1) = executes add_kernel(SIZE - 1, ...)

์œ„์˜ ๋‚ด์šฉ์„ ์ •๋ฆฌ ํ•˜์ž๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค. ์ฆ‰ ์‹œ๊ฐ„ ์ˆœ์„œ๋ณ„๋กœ ์ฒ˜๋ฆฌ๋ฅผ ํ•˜๋Š”์ชฝ์€ CPU, ์ฝ”์–ด๋ณ„๋กœ ์ฒ˜๋ฆฌ๋ฅผ ํ•˜๋Š”๊ฑด GPU ๋ผ๊ณ  ๋ณผ์ˆ˜ ์žˆ๋‹ค.

| CPU Kernels | GPU Kernels |
|-------------|-------------|
| with a single CPU core, a for loop | a set of GPU cores |
| sequential execution | parallel execution |
| for-loop | kernel launch |
| CPU[0] for time 0 | GPU[0] for core #0 |
| CPU[1] for time 1 | GPU[1] for core #1 |
| CPU[n-1] for time n-1 | GPU[n-1] for core #n-1 |

A CUDA vector addition involves several steps (a condensed code sketch follows the list):

  1. host-side
    1. make A, B with source data
    2. prepare C for the result
  2. data copy host -> device
    1. cudaMemcpy from host to device
  3. addition in CUDA
    1. kernel launch for CUDA device
    2. result will be stored in device (VRAM)
  4. data copy device -> host
    1. cudaMemcpy from device to host
  5. host-side
    1. cout
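
Mapping those five steps onto the runtime API gives roughly the sketch below (error checks omitted; the complete example later in this post is the full version):

```cpp
#include "cuda_runtime.h"
#include <cstdio>

__global__ void addKernel(const int* a, const int* b, int* c, int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size)
        c[i] = a[i] + b[i];
}

int main()
{
    // 1. host-side: prepare A, B with source data and C for the result
    const int SIZE = 6;
    const int a[SIZE] = { 1, 2, 3, 4, 5, 6 };
    const int b[SIZE] = { 10, 20, 30, 40, 50, 60 };
    int c[SIZE] = { 0 };

    int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    cudaMalloc((void**)&dev_a, SIZE * sizeof(int));
    cudaMalloc((void**)&dev_b, SIZE * sizeof(int));
    cudaMalloc((void**)&dev_c, SIZE * sizeof(int));

    // 2. data copy host -> device
    cudaMemcpy(dev_a, a, SIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, SIZE * sizeof(int), cudaMemcpyHostToDevice);

    // 3. addition in CUDA: kernel launch, result stays on the device (VRAM)
    addKernel<<<1, SIZE>>>(dev_a, dev_b, dev_c, SIZE);
    cudaDeviceSynchronize();

    // 4. data copy device -> host
    cudaMemcpy(c, dev_c, SIZE * sizeof(int), cudaMemcpyDeviceToHost);

    // 5. host-side: print the result
    for (int i = 0; i < SIZE; ++i)
        printf("%d ", c[i]);
    printf("\n");

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
```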

Function Call vs Kernel Launch

๊ธฐ๋ณธ์ ์œผ๋กœ C/C++ CPU ์—์„œ๋Š” Function ์„ ๋ถ€๋ฅผ๋•Œ, Function Call ์ด๋ผ๊ณ  ํ•œ๋‹ค, ์ด์˜ Syntax ๋Š” ์•„๋ž˜์™€๊ฐ™๋‹ค.

void func_name(int param, ...);
for (int i = 0; i < SIZE; i++) {
    func_name(param, ...);
}

On the GPU things are quite different: the kernel (function) is invoked in a different way than in C++, using what is called the kernel launch syntax. Semantically, <<<1, SIZE>>> means using SIZE cores in one set: the first argument defines how many thread blocks to use, and the second how many threads each block gets. For the threads inside a thread block, refer to the figure below the code.

__global__ void kernel_name(int param, ...);
kernel_name<<<1, SIZE>>>(param, ...);

Thread Block Organization

A real example file is shown below. addKernel effectively becomes the function call inside the GPU, and since we do not pass an index, each thread finds the index of its own call inside the kernel.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    printf("%d\n", i)
    c[i] = a[i] + b[i];
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
    // ...
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;
    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);
    // ...
    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    cudaStatus = cudaDeviceSynchronize();

    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    return cudaStatus;
}

As shown in the code above, cudaDeviceSynchronize() waits until the kernel finishes and returns a cudaError_t; on success it returns cudaSuccess. Finally there is the cudaMemcpy(...) call that copies the result back to the CPU side, and if we end up jumping to the error label, cudaFree is called.

Of course, the host side keeps checking the status along the way, but errors can also occur inside the kernel or during the kernel launch itself. Those can be picked up as shown below. There are cudaError_t err = cudaPeekAtLastError() and cudaError_t err = cudaGetLastError(); both report the same thing, but cudaGetLastError() resets the internal error flag while cudaPeekAtLastError() does not, i.e., the error state is kept around rather than being cleared once read. It is also handy to define a macro like the one below.

// Check for any errors launching the kernel
cudaError_t cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
    goto Error;
}

cudaError_t err = cudaPeekAtLastError();

// CAUTION: we check CUDA error even in release mode
// #if defined(NDEBUG)
// #define CUDA_CHECK_ERROR()  0
// #else
#define CUDA_CHECK_ERROR()  do { \
    cudaError_t e = cudaGetLastError(); \
    if (cudaSuccess != e) { \
        printf("cuda failure \"%s\" at %s:%d\n", \
        cudaGetErrorString(e), \
        __FILE__, __LINE__); \
        exit(1); \
    } \
} while (0)
// #endif

๊ทผ๋ฐ ์—ฌ๊ธฐ์„œ ๊ถ๊ธˆ์ฆ์ด ์žˆ์„์ˆ˜ ์žˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด์„œ, c++ ์—์„œ๋Š” Return ์˜ ๋ฐ˜ํ™˜๊ฐ’์„ ์ง€์ •ํ• ์ˆ˜ ์žˆ์—ˆ์ง€๋งŒ, Kernel ์€ ๊ทธ๋ ‡์ง€ ๋ชปํ•˜๋‹ค. ๋ฌด์กฐ๊ฑด void ๋กœ return ํ•˜๊ฒŒ๋”ํ•ด์•ผํ•œ๋‹ค. ์ด๊ฑด ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ๋ฅผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, 100 ๋งŒ๊ฐœ์˜ ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ๋ฅผ ํ•œ๋‹ค๋ฉด 100 ๋งŒ๊ฐœ์˜ return ๊ฐ’์„ ๊ฐ€์ง€๊ฒŒ ๋˜๋Š”๋ฐ ์ด๊ฑด error code ์— ๋” ๊ฐ€๊น๋‹ค. ๊ทธ๋Ÿฌ๋ฉด ๊ณ„์‚ฐ์ด ๋๋‚ฌ๋‹ค๋ผ๋Š”๊ฑธ ๋ช…์‹œ์ ์œผ๋กœ ์–ด๋–ป๊ฒŒ ํ™•์ธํ•˜๋А๋ƒ๊ฐ€ ํฌ์ธํŠธ์ผ ์ผ๊ฒƒ ๊ฐ™๋‹ค. ๋ฐ”๋กœ Memory ๋ฅผ ๋˜์ ธ์คฌ์„๋–„, ๊ทธ ๋ฐฐ์—ด์„ update ํ•ด์„œ GPU ์—์„œ CPU ๋กœ ๋ฐ์ดํ„ฐ๊ฐ€ Memcopy ๊ฐ€ ๋ฌ์„๋•Œ๋งŒ ํ™•์ธ์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

As an example, a program that adds +1 to every element of a vector can be written as below.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

__global__ void add_kernel(float *b, const float *a)
{
    int i = threadIdx.x;
    b[i] = a[i] + 1.0f;
}

int main()
{
    const int arrSize = 8;
    const float a[arrSize] = { 0., 1., 2., 3., 4., 5., 6., 7. };
    float b[arrSize] = { 0., 0., 0., 0., 0., 0., 0., 0., };

    printf("a = {%f,%f,%f,%f,%f,%f,%f,%f\n", a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7]);

    float* dev_a = nullptr;
    float* dev_b = nullptr;
    cudaError_t cudaStatus;
    cudaMalloc((void**)&dev_a, arrSize * sizeof(float));
    cudaMalloc((void**)&dev_b, arrSize * sizeof(float));
    cudaMemcpy(dev_a, a, arrSize * sizeof(float), cudaMemcpyHostToDevice);
    add_kernel <<<1, arrSize >>>(dev_b, dev_a);
    
    cudaStatus = cudaPeekAtLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
    }

    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize r eturned error code %d after launching addKernel!\n", cudaStatus);
    }

    // Result
    cudaStatus = cudaMemcpy(b, dev_b, arrSize * sizeof(float), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
    }

    printf("b = {%f,%f,%f,%f,%f,%f,%f,%f\n", b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7]);

    cudaFree(dev_a);
    cudaFree(dev_b);
    return 0;
}

As a side note, one handy tip is const char* cudaGetErrorName(cudaError_t err): pass in a cudaError_t and you get the enum's name back as a string (char array), which makes for a great debugging aid. Another is const char* cudaGetErrorString(cudaError_t err), which returns an explanation string for the error code. Both can be printed with cout << ... << endl.
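
For instance, a minimal usage sketch (assuming <iostream> is included):

```cpp
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    // Prints the enum name followed by its human-readable explanation.
    std::cout << cudaGetErrorName(err) << ": " << cudaGetErrorString(err) << std::endl;
}
```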

cudaGetLastError() -> handled per CPU thread

์—ฌ๋Ÿฌ๊ฐ€์ง€์˜ Cuda Process ๊ฐ€ ๋Œ๋ฆด๋•Œ, ๋‚ด๊ฐ€ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ๋Š” ํ”„๋กœ์„ธ์Šค์—์„œ ์—ฌ๋Ÿฌ๊ฐ€์ง€์˜ Thread ๊ฐ€ ๊ฐˆ๋ผ์ ธ์„œ, ์ด๋“ค thread ๊ฐ€ Cuda system ์„ ๋™์‹œ์— ์‚ฌ์šฉํ•œ๋‹ค๊ณ  ํ•œ๋‹ค๋ผ๋ฉด, CUDA Error ๋ฅผ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ๊ณ ์ฐฐ์ด ์ƒ๊ธธ์ˆ˜๋„ ์žˆ๋‹ค. ๊ทธ๋ž˜์„œ ๊ฐ Cpu Thread ๊ฐ€ Cuda ์˜ ์ปค๋„์„ ๋…์ž์ ์œผ๋กœ ์‚ฌ์šฉํ•œ๋‹ค๊ณ  ๊ฐ€์ •์„ ํ•˜๋ฉด Cuda eror ๋Š” Cpu thread ๊ธฐ์ค€์œผ๋กœ err ์˜ ์ƒํƒœ ๊ด€๋ฆฌ๋ฅผ ํ•˜๋Š”๊ฒŒ ์ข‹๋‹ค.

Resource

Programmers: Target Number

Description

Given an array of non-negative integers numbers, and a target number target, write a function solution that returns the number of ways to add or subtract these numbers without changing their order to reach the target number.

For example, using the numbers [1, 1, 1, 1, 1], you can make the target number 3 in the following five ways:

-1+1+1+1+1 = 3
+1-1+1+1+1 = 3
+1+1-1+1+1 = 3
+1+1+1-1+1 = 3
+1+1+1+1-1 = 3

Thinking Process

Initially, when I tried to solve this problem, I thought it was a typical DP problem, in the sense that for each number in the array you can choose either its negative or its positive. Then I realized you don't really have to approach it with DP; a simple BFS or DFS is enough.

  • If we want to solve this recursively, we need a stopping condition, and that condition is the vector's size. Say we start by adding the first number in the vector; the index has then already advanced by 1. By carrying the index along, we can track which numbers we have used so far on the way to the target value.

Implementation

DFS

#include <string>
#include <vector>

using namespace std;
void searchTargetNumber(vector<int>& vec, int tar, int index, int sum, int& answer) {
    if (index == vec.size()) {
        if (sum == tar) {
            answer++;
        }
        return;
    }
    
    searchTargetNumber(vec, tar, index+1, sum + vec[index], answer);
    searchTargetNumber(vec, tar, index+1, sum - vec[index], answer);
}

int solution(vector<int> numbers, int target) {
    int answer = 0;
    searchTargetNumber(numbers, target, 0, 0, answer);
    return answer;
}

BFS

This actually worked on several test cases, but when there are many numbers in the vector it exceeds the time limit, because the search space is 2^n.

class Solution {
public:
    int findTargetSumWays(vector<int>& nums, int target) {
        int answer = 0;

        deque<pair<int, int>> dq;
        dq.push_back({nums[0], 0});
        dq.push_back({-nums[0], 0});
        
        while(!dq.empty()) {
            int value = dq.front().first;
            int index = dq.front().second;
            index += 1;
            dq.pop_front();

            if (index < nums.size()) {
                dq.push_back({value + nums[index], index});
                dq.push_back({value - nums[index], index});
            } 
            else {
                if (value == target)
                    answer++;
            }
        }
        return answer;
    }
};

Dynamic Programming

Top-down Approach: This one is a bit special. The thing to notice is how the memo is indexed: the current sum is offset by totalSum, which is an efficient way to index every possible sum. For example, if the total sum is 5, then -5 + 5 = 0 maps to index 0, -4 + 5 = 1, -3 + 5 = 2, and so on. This keeps the state space at O(n * totalSum).

class Solution {
public:
    int memoization(vector<int>& nums, int currentIndex, int currentSum, int target, vector<vector<int>>& memo) {
        if (currentIndex == nums.size()) {
            if (currentSum == target) {
                return 1;
            } else {
                return 0;
            }
        } else {
            // done
            if (memo[currentIndex][currentSum + totalSum] != numeric_limits<int>::min()) {
                return memo[currentIndex][currentSum + totalSum];
            }
            int add = memoization(nums, currentIndex + 1, currentSum + nums[currentIndex], target, memo);
            int subtract = memoization(nums, currentIndex + 1, currentSum - nums[currentIndex], target, memo);
            memo[currentIndex][currentSum + totalSum] = add + subtract;
            return memo[currentIndex][currentSum + totalSum];
        }
    }

    int findTargetSumWays(vector<int>& nums, int target) {
        totalSum = accumulate(nums.begin(), nums.end(), 0);
        vector<vector<int>> memo(nums.size(), vector<int>(2 * totalSum + 1, numeric_limits<int>::min()));
        return memoization(nums, 0, 0, target, memo);
    }

public:
    int totalSum;
};

Better Solution

Wow, people are clever; let me see if I understand correctly. The point here is to treat this problem as subset sum. I know the main goal is to count all the ways to reach the target, but I think it's good to break things up.

For example, take the list [1, 2, 3, 4] with the signed expression +1 - 2 + 3 + 4. We can split the numbers into two sets, one for + and one for -: s1 = [1, 3, 4] and s2 = [2]. Then totalSum = sum(s1) + sum(s2), while target = sum(s1) - sum(s2) (this covers every possible sign assignment). Adding the two equations gives 2·sum(s1) = totalSum + target, so sum(s1) = (totalSum + target) / 2. We call this diff; if diff is not a non-negative integer, there is nothing to compute. (The code below counts the complementary quantity (totalSum - target) / 2, the sum of the minus set, which gives the same answer.)

Then we can implement this idea. Note that the code below no longer thinks in terms of sign changes; each number is either selected into the subset or not, which is exactly subset sum. (If you are curious, look at other DP problems: filling the DP table is very similar to LCS or matrix-chain multiplication.)

#include <numeric>
#include <vector>
using namespace std;

vector<vector<int>> dp;

int cache(int j, int sum, vector<int>& nums) {
    if (j == 0) return sum == 0 ? 1 : 0;
    // already computed
    if (dp[j][sum] != -1) return dp[j][sum];
    int x = nums[j-1];
    int ans = cache(j-1, sum, nums);              // skip nums[j-1]
    if (sum >= x) ans += cache(j-1, sum-x, nums); // put nums[j-1] into the subset
    return dp[j][sum] = ans;
}

int findTargetSumWays(vector<int>& nums, int target) {
    const int n = nums.size();
    int sum = accumulate(nums.begin(), nums.end(), 0);
    int diff = sum - target;    // Check if it's possible to achieve the target
    if (diff < 0 || diff % 2 != 0) return 0;
    diff /= 2;
    dp.assign(n + 1, vector<int>(diff + 1, -1));
    return cache(n, diff, nums);
}

LeetCode 210: Course Schedule II [Medium]

class Solution {
public:
    vector<int> findOrder(int numCourses, vector<vector<int>>& prerequisites) {
        int n = prerequisites.size(); // number of prerequisite pairs
        vector<int> inDegree(numCourses, 0);
        vector<int> result;

        unordered_map<int, vector<int>> adj;
        for (int i = 0; i < n; i++) {
            adj[prerequisites[i][0]].push_back(prerequisites[i][1]);
        }

        // fill inDegree
        for (auto it : adj) {
            for (int node : adj[it.first])
                inDegree[node]++;
        }

        queue<int> q;
        for (int i = 0; i < numCourses; i++){
            if (inDegree[i] == 0)
                q.push(i);
        }


        while(!q.empty()) {
            int node = q.front();
            q.pop();

            result.push_back(node);
            for (int e : adj[node]) {
                inDegree[e]--;
                if (inDegree[e] == 0)
                    q.push(e);
            }
        }

        reverse(result.begin(), result.end());
        if (result.size() == numCourses){
            return result;
        }
        return {};
    }
};
