Commit 23435280 authored by David Peter

CNN

parent c95b9bfc
@@ -265,7 +265,7 @@
\input{\pwd/batch_norm}
\newsection{Convolutional Neural Networks}{dnn:cnn}
\glspl{cnn} \cite{Lecun1995} are a specialized kind of \gls{dnn} that are tremendously successful in many practical applications. \glspl{cnn} are most prominently used in image and video processing tasks, although there are many more use cases. Similar to \glspl{mlp}, \glspl{cnn} obtain their outstanding performance by stacking multiple layers. In this section, we will cover some of the basic \gls{cnn} layers such as the convolutional layer and pooling layer as well as more intricate layers such as the depthwise separable convolution layer and the \gls{mbc} convolution layer.
\newsubsection{Convolutional Layer}{dnn:convolutional_layer}
\input{\pwd/convolution}
@@ -6,7 +6,7 @@ Convolutional layers are the main building blocks of a \gls{cnn}. \glspl{cnn} ar
\end{equation}
where $K$ is a two-dimensional kernel with $k_1 \times k_2$ learnable parameters. The principle of the convolution operation is depicted in \Fref{fig:2d_convolution}. A kernel $K$ slides over the image $I$ and at every step, a single value in the output $S$ is obtained by computing the dot product between the kernel $K$ and the overlapping region of the image. The kernel is then moved by a stride of $s$ pixels and the process is repeated until the whole image is processed.
Sometimes, zero-padding is applied to the input image to control the shape of the output image. Zero-padding is a special type of padding where zeros are added around the edge of the input image. In \Fref{fig:2d_convolution}, no padding is applied to the input image. Therefore, the output has a smaller spatial size than the input. To preserve the spatial dimension, zero-padding with a margin of $\floor{k/2}$ needs to be applied to the input, assuming that the kernel has dimensions $k \times k$.
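To make these shape relationships concrete: for an $n \times n$ input, a $k \times k$ kernel, stride $s$ and padding margin $p$, the output spatial size is $\floor{(n + 2p - k)/s} + 1$. The following minimal PyTorch sketch (the input size, kernel size and stride are chosen purely for illustration) reproduces the cases discussed above:
\begin{verbatim}
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)  # a single-channel 28 x 28 input image

# 3x3 kernel, stride 1, no padding: the output shrinks to 26 x 26
conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)
print(conv(x).shape)  # torch.Size([1, 1, 26, 26])

# zero-padding with margin floor(k/2) = 1 preserves the 28 x 28 size
conv_same = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)
print(conv_same(x).shape)  # torch.Size([1, 1, 28, 28])

# with stride s = 2 the spatial size is roughly halved
conv_strided = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
print(conv_strided(x).shape)  # torch.Size([1, 1, 14, 14])
\end{verbatim}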
\begin{figure}
\centering
\begin{tikzpicture}
% **************************************************************************************************
% **************************************************************************************************
\Glspl{mbc} were introduced in MobileNetV2 \cite{Sandler2018} as a replacement for the depthwise separable convolutions used in MobileNetV1 \cite{Howard2017}. MobileNet models in general are a family of highly efficient models for mobile applications. Utilizing \glspl{mbc} as the main building block, MobileNetV2 attains a Top-1 accuracy of \SI{72}{\percent} on the ImageNet \cite{Imagenet} dataset using \SI{3.4}{\mega\nothing} parameters and a total of \SI{300}{\mega\nothing} \glspl{madds}. In comparison, MobileNetV1 only attains a Top-1 accuracy of \SI{70.6}{\percent} using \SI{4.2}{\mega\nothing} parameters and a total of \SI{575}{\mega\nothing} \glspl{madds}. Based on these promising results, we will utilize \glspl{mbc} from MobileNetV2 as the main building block for our resource-efficient models in this thesis.
\Glspl{mbc} consist of three separate convolutions: a $1 \times 1$ convolution, followed by a depthwise $k \times k$ convolution, followed by another $1 \times 1$ convolution. The first $1 \times 1$ convolution consists of a convolutional layer followed by batch normalization and the \gls{relu} activation function. Its purpose is to expand the number of channels by the expansion rate factor $e \geq 1$, transforming the input feature map into a higher-dimensional feature space. Then, a depthwise $k \times k$ convolution with stride $s$ is applied to the high-dimensional feature map, again followed by batch normalization and the \gls{relu} activation. Finally, a $1 \times 1$ convolution is applied to reduce the number of channels by a factor of $e' \leq 1$. If the input of the \gls{mbc} has the same shape as the output (e.g. when the stride is $s=1$ and $e=\frac{1}{e'}$), a residual connection is introduced, which adds the input of the \gls{mbc} to its output.
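A minimal PyTorch sketch of such a block is given below; the class name \texttt{MBConv}, the parameter names and the tensor shapes are illustrative and not taken from the MobileNetV2 reference implementation. It follows the three-stage structure described above and adds the residual connection when input and output shapes match:
\begin{verbatim}
import torch
import torch.nn as nn

class MBConv(nn.Module):
    # 1x1 expansion -> k x k depthwise -> 1x1 reduction
    def __init__(self, c_in, c_out, k=3, s=1, e=6):
        super().__init__()
        c_mid = e * c_in  # expanded number of channels e * C
        self.use_residual = (s == 1 and c_in == c_out)
        self.block = nn.Sequential(
            # 1x1 convolution: expand channels by the expansion rate e
            nn.Conv2d(c_in, c_mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU(inplace=True),
            # k x k depthwise convolution (groups = channels), stride s
            nn.Conv2d(c_mid, c_mid, kernel_size=k, stride=s,
                      padding=k // 2, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU(inplace=True),
            # 1x1 convolution: reduce back down to c_out channels
            nn.Conv2d(c_mid, c_out, kernel_size=1, bias=False),
        )

    def forward(self, x):
        out = self.block(x)
        # residual connection if input and output shapes match
        return x + out if self.use_residual else out

x = torch.randn(1, 16, 32, 32)        # C x H x W = 16 x 32 x 32
y = MBConv(16, 16, k=3, s=1, e=6)(x)  # same shape -> residual is used
print(y.shape)                        # torch.Size([1, 16, 32, 32])
\end{verbatim}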
\Fref{tab:mbc_convs} shows the three convolutions forming a \gls{mbc} together with the input and output dimensions of every convolution. An input with dimensions $C \times H \times W$ is transformed to an output with dimensions $C' \times \frac{H}{s} \times \frac{W}{s}$ using a stride $s$ and an expansion rate $e$.
\begin{table}
\centering
@@ -12,9 +12,9 @@
\toprule
Input & Convolutions & Output\\
\midrule
$C \times H \times W$ & $1 \times 1$ Conv, BN, \gls{relu} & $e \cdot C \times H \times W$ \\
$e \cdot C \times H \times W$ & $k \times k$ Depthwise-Conv (stride $s$), BN, \gls{relu} & $e \cdot C \times \frac{H}{s} \times \frac{W}{s}$ \\
$e \cdot C \times \frac{H}{s} \times \frac{W}{s}$ & $1 \times 1$ Conv & $C' \times \frac{H}{s} \times \frac{W}{s}$ \\
\bottomrule
\end{tabular}
\caption{The three convolutions forming a \gls{mbc} with a stride $s$ and an expansion rate $e$. A residual connection is introduced if the input of the \gls{mbc} has the same shape as the output.}
@@ -153,4 +153,4 @@ There are several reasons to use the pooling layer in \glspl{cnn}. The first rea
\label{fig:max_avg_pooling}
\end{figure}
Deep learning frameworks, e.g. PyTorch \cite{Torch2019}, also support adaptive pooling. In adaptive pooling, the filter size and stride are adapted automatically to produce a certain spatial output size. Adaptive pooling supports several summary statistics, most commonly the average and the maximum. A special case of adaptive pooling is global pooling, where the spatial output size is reduced to one. Given an input feature map of shape $C_1 \times H_1 \times W_1$, global pooling produces an output feature map of size $C_1 \times 1 \times 1$, since pooling operates on each channel independently. Global pooling is most commonly used after the convolutional part of a neural network, together with a flattening stage that reshapes the feature map to a one-dimensional output. Flattening takes the $C_1 \times 1 \times 1$ input and produces a vector of size $C_1$ that is then used as an input for the subsequent \gls{mlp} classifier.
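As a short PyTorch illustration of global average pooling followed by flattening (the feature-map shape is chosen arbitrarily):
\begin{verbatim}
import torch
import torch.nn as nn

x = torch.randn(1, 64, 7, 5)  # feature map with C x H x W = 64 x 7 x 5

# adaptive average pooling to spatial size 1 x 1 (global average pooling)
gap = nn.AdaptiveAvgPool2d(1)
pooled = gap(x)                            # shape [1, 64, 1, 1]
flat = torch.flatten(pooled, start_dim=1)  # shape [1, 64], MLP input
print(pooled.shape, flat.shape)
\end{verbatim}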