\glspl{cnn}\cite{Lecun1995} are a specialized kind of \gls{dnn} that are tremendously successful in many practical applications. \glspl{cnn} are most prominently used in image and video processing tasks, although there are many more use cases. Similar to \glspl{mlp}, \glspl{cnn} obtain their outstanding performance by stacking multiple layers. In this section, we will cover some of the basic \gls{cnn} layers such as the convolutional layer and pooling layer as well as more intricate layers such as the depthwise separable convolution layer and the \gls{mbc} convolution layer.

Convolutional layers are the main building blocks of a \gls{cnn}.

\begin{equation}
S(i, j) = (I * K)(i, j) = \sum_{m=1}^{k_1} \sum_{n=1}^{k_2} I(i + m, j + n) \, K(m, n),
\end{equation}

where $K$ is a two-dimensional kernel with $k_1\times k_2$ learnable parameters. The principle of the convolution operation is depicted in \Fref{fig:2d_convolution}. A kernel $K$ slides over the image $I$ and at every step, a single value in the output $S$ is obtained by computing the dot product between the kernel $K$ and the overlapping region of the image. The kernel is then moved by a stride of $s$ pixels. This process is repeated until the whole image has been covered.
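The sliding-window computation described above can be sketched in Python with NumPy. This is a minimal illustration of the principle, not an efficient implementation; the function name \texttt{conv2d} is our own.

```python
import numpy as np

def conv2d(I, K, s=1):
    """Naive 2D convolution (cross-correlation, as used in CNNs):
    slide the kernel K over the image I with stride s; each output
    value is the dot product of K with the overlapping image region."""
    k1, k2 = K.shape
    h = (I.shape[0] - k1) // s + 1
    w = (I.shape[1] - k2) // s + 1
    S = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            region = I[i * s:i * s + k1, j * s:j * s + k2]
            S[i, j] = np.sum(region * K)
    return S

# A 3x3 all-ones kernel on a 4x4 image yields a 2x2 output;
# each entry is the sum of the overlapping 3x3 region.
S = conv2d(np.arange(16, dtype=float).reshape(4, 4), np.ones((3, 3)))
```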

Sometimes, zero-padding is applied to the input image to control the shape of the output image. Zero-padding is a special type of padding where zeros are added around the edge of the input image. In \Fref{fig:2d_convolution}, no padding is applied to the input image. Therefore, the output has a smaller spatial size than the input. To preserve the spatial dimension, a zero-padding with a margin of $\floor{k/2}$ needs to be applied to the input, assuming that the kernel has dimensions $k \times k$.

\Glspl{mbc} were introduced in MobileNetV2 \cite{Sandler2018} as a replacement for the depthwise separable convolutions used in MobileNetV1 \cite{Howard2017}. MobileNet models in general are a family of highly efficient models for mobile applications. Utilizing \gls{mbc} as the main building block, MobileNetV2 attains a Top-1 accuracy of \SI{72}{\percent} on the ImageNet \cite{Imagenet} dataset using \SI{3.4}{\mega\nothing} parameters and a total of \SI{300}{\mega\nothing} \glspl{madds}. In comparison, MobileNetV1 only attains a Top-1 accuracy of \SI{70.6}{\percent} using \SI{4.2}{\mega\nothing} parameters and a total of \SI{575}{\mega\nothing} \glspl{madds}. Based on these promising results, we will utilize \glspl{mbc} from MobileNetV2 as the main building block for our resource-efficient models in this thesis.

\glspl{mbc} consist of three separate convolutions: a $1\times1$ convolution, followed by a depthwise separable $k \times k$ convolution, followed by another $1\times1$ convolution. The first $1\times1$ convolution consists of a convolutional layer followed by batch normalization and the \gls{relu} activation function. Its purpose is to expand the number of channels by the expansion rate factor $e \geq 1$ to transform the input feature map into a higher-dimensional feature space. Then, a depthwise separable $k \times k$ convolution with stride $s$ is applied to the high-dimensional feature map, again followed by batch normalization and the \gls{relu} activation. Finally, a $1\times1$ convolution is applied to the high-dimensional feature map in order to reduce the number of channels by a factor of $e' \leq 1$. If the input of the \gls{mbc} has the same shape as the output (e.g. when the stride is $s=1$ and $e=\frac{1}{e'}$), a residual connection is introduced, which adds the input of the \gls{mbc} to its output.
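The three stages described above can be sketched as a PyTorch module. This is an illustrative re-implementation under our own naming (\texttt{MBConv}, \texttt{c\_in}, \texttt{c\_out}), not the reference MobileNetV2 code; the hyperparameters $e$, $s$, and $k$ follow the text.

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Sketch of an inverted residual block: 1x1 expansion ->
    depthwise k x k convolution -> 1x1 projection, with a residual
    connection when input and output shapes match."""
    def __init__(self, c_in, c_out, k=3, s=1, e=6):
        super().__init__()
        c_mid = c_in * e
        self.use_residual = (s == 1 and c_in == c_out)
        self.block = nn.Sequential(
            # 1x1 expansion: widen the feature map by the factor e
            nn.Conv2d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU(inplace=True),
            # depthwise k x k convolution with stride s
            nn.Conv2d(c_mid, c_mid, k, stride=s, padding=k // 2,
                      groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU(inplace=True),
            # 1x1 projection back down to c_out channels (no activation)
            nn.Conv2d(c_mid, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y
```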

\Fref{tab:mbc_convs} shows the three convolutions forming a \gls{mbc}. The table also shows the input and output dimensions of every convolution. An input with dimensions $C \times H \times W$ is transformed to an output with dimensions $C' \times \frac{H}{s} \times \frac{W}{s}$ using a stride $s$ and an expansion rate $e$.

\caption{The three convolutions forming a \gls{mbc} with a stride $s$ and an expansion rate $e$. A residual connection is introduced if the input of the \gls{mbc} has the same shape as the output.}

There are several reasons to use the pooling layer in \glspl{cnn}.

\label{fig:max_avg_pooling}

\end{figure}

Deep learning frameworks, e.g. PyTorch \cite{Torch2019}, also support adaptive pooling. In adaptive pooling, the filter size and stride are adapted automatically to produce a certain spatial output size. Adaptive pooling supports several summary statistics, most commonly the average and the maximum. A special case of adaptive pooling is global pooling, where the spatial output size is reduced to one. Given an input feature map of shape $C_1\times H_1\times W_1$, global pooling produces an output feature map of size $C_1\times1\times1$. Global pooling is most commonly used after the convolutional part of a neural network together with a flattening stage that reshapes a feature map to a one-dimensional output. Flattening takes the $C_1\times1\times1$ input and produces a vector of size $C_1$ that is then used as an input for the subsequent \gls{mlp} classifier.
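In PyTorch, this global-pooling-plus-flattening stage can be written in two lines (a minimal sketch; the variable names are our own):

```python
import torch
import torch.nn as nn

# Global average pooling followed by flattening, as typically placed
# between the convolutional part of a CNN and the MLP classifier.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # any C x H x W -> C x 1 x 1
    nn.Flatten(),             # C x 1 x 1 -> vector of size C
)

x = torch.randn(2, 512, 7, 7)  # a batch of two 512 x 7 x 7 feature maps
v = head(x)                    # shape (2, 512)
```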