There are many aspects to consider when designing resource efficient \glspl{dnn} for \gls{kws}. In this chapter, we will introduce five methods that address two central aspects of resource efficient \glspl{dnn}: the amount of memory needed to store the parameters and the number of \glspl{madds} per forward pass.


The amount of memory needed to store the parameters depends on two factors: the number of parameters and their associated bit width. Decreasing the number of parameters is possible either by employing resource efficient convolutional layers or by establishing an accuracy-size tradeoff using a resource efficient \gls{nas} approach. Section~\ref{sec:kws:conv} deals with resource efficient convolutional layers, whereas Section~\ref{sec:kws:nas} deals with the resource efficient \gls{nas} approach. Quantizing weights and activations to decrease the amount of memory needed is discussed in Section~\ref{sec:kws:quantization}.


The number of \glspl{madds} per forward pass is another aspect that cannot be neglected in resource efficient \gls{kws} applications. In most cases, the number of \glspl{madds} is tied to the latency of a model. By decreasing the number of \glspl{madds}, we also implicitly decrease the latency of a model. Decreasing the number of \glspl{madds} is again possible either by employing resource efficient convolutional layers or by establishing an accuracy-size tradeoff using a resource efficient \gls{nas} approach. Another method to decrease the \glspl{madds} of a model is to replace the costly \gls{mfcc} feature extractor by a convolutional layer. This method is explained in Section~\ref{sec:kws:endtoend}. Finally, Section~\ref{sec:kws:multi_exit} introduces multi-exit models. As we will see, multi-exit models allow us to dynamically reduce the number of \glspl{madds} based on either the available computational resources or the complexity of the classification task.


This chapter evaluates the performance of resource efficient design methods for \gls{dnn} based \gls{kws} models. Section~\ref{sec:results:nas} evaluates the performance of models obtained with \gls{nas} by establishing a tradeoff between model accuracy and model size. Then, Section~\ref{sec:results:quantization} evaluates the performance and the model size of quantized models where either the weights, the activations, or both are quantized. This section also compares the performance of fixed bit-width quantization and learned bit-width quantization (i.e., mixed-precision quantization). In Section~\ref{sec:results:endtoend}, the performance of end-to-end models is evaluated and compared to ordinary models using \glspl{mfcc} from Section~\ref{sec:results:nas}. Finally, Section~\ref{sec:results:multiexit} compares the impact of different exit layer types on the performance of multi-exit models. This section also evaluates the associated computational cost of computing every exit.


This thesis follows the regular approach to only use \num{10} of the provided classes directly, with the rest grouped into a single \enquote{unknown} class and an added \enquote{silence} class for a total of \num{12} classes. The following \num{10} classes are used directly: \enquote{yes}, \enquote{no}, \enquote{up}, \enquote{down}, \enquote{left}, \enquote{right}, \enquote{on}, \enquote{off}, \enquote{stop}, \enquote{go}. To ensure that there is roughly the same number of samples in every class, only a subset of the samples from the \enquote{unknown} class is selected. The \enquote{silence} class is built artificially and includes wave files where every file contains a random section from a noise sample in the \gls{gsc} dataset.


The whole dataset is split into three parts: the training, the validation and the test dataset, containing \num{80}\%, \num{10}\% and \num{10}\% of the samples, respectively. Table~\ref{tab:utterances_per_class} shows the number of utterances per class in the train, test and validation dataset after applying the split.

Every machine learning algorithm needs a dataset to train and evaluate on. Since the goal of this thesis is to build \glspl{dnn} for \gls{kws}, the choice fell on the \gls{gsc} dataset \cite{Warden2018}. According to the author, the \gls{gsc} is a public dataset of spoken words designed to help train and evaluate \gls{kws} systems. Although the \gls{gsc} was only introduced in 2018, it is now widely used as a reference dataset and allows reproducible results for \gls{kws} systems. At the time of writing this thesis, there have been two versions of the \gls{gsc}. The first version contains approximately 65k samples over 30 classes while the second version contains approximately 110k samples over 35 classes. In this thesis, we will use the first version of the \gls{gsc} exclusively.



Section~\ref{sec:dataset:classes} lists the classes used for classification as well as the train-test-validation split used in this thesis. Then Section~\ref{sec:dataset:augmentation} explains the data augmentation that is employed during training. Finally, Section~\ref{sec:dataset:feature_extraction} deals with feature extraction, in particular the extraction of \glspl{mfcc}.


Most \gls{kws} systems consist of a feature extractor followed by a classifier to produce a prediction of the corresponding class. To extract features from a raw audio signal, the signal of length $L$ is split into overlapping frames of length $l$ with a stride $s$, giving a total of $T =\frac{L-l}{s}+1$ frames. For every frame, $F$ features are extracted to obtain a total of $T \times F$ features. Most \gls{kws} systems use \glspl{mfcc} as features. Section~\ref{sec:dataset:feature_extraction:mfcc} explains the extraction of \glspl{mfcc} in detail.
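As a quick numerical check of the framing formula, consider assumed typical values (not taken from the thesis): a one-second utterance sampled at 16\,kHz, 25\,ms frames and a 10\,ms stride.

```python
# Assumed typical values: 1 s of audio at 16 kHz, 25 ms frames, 10 ms stride.
L = 16000            # signal length in samples
l = 400              # frame length in samples (25 ms)
s = 160              # stride in samples (10 ms)
T = (L - l) // s + 1  # number of overlapping frames: T = (L - l) / s + 1
```

With these values, the signal is split into $T = 98$ frames.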

\newsubsection{Mel Frequency Cepstral Coefficients}{dataset:feature_extraction:mfcc}

The \gls{mfcc} feature extraction consists of six steps: (i) pre-emphasis, (ii) framing and windowing, (iii) computing the spectrum using the \gls{dft}, (iv) computing the Mel spectrum, (v) computing the \gls{dct} to obtain static \gls{mfcc} features, and finally (vi) computing the dynamic \gls{mfcc} features. All six steps are explained in detail below \cite{10.5555/3086729,10.5555/560905}.
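Steps (i)--(v) can be sketched in a compact NumPy implementation. This is a simplified illustration with assumed typical parameters (16\,kHz sampling rate, 25\,ms frames, 10\,ms stride, 20 Mel filters, 13 coefficients); the dynamic features of step (vi) and refinements such as liftering are omitted, and the final DCT-II is unnormalized.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, stride=160,
         n_fft=512, n_mels=20, n_mfcc=13, alpha=0.97):
    # (i) pre-emphasis
    x = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # (ii) framing and windowing (Hamming window)
    T = (len(x) - frame_len) // stride + 1
    idx = np.arange(frame_len)[None, :] + stride * np.arange(T)[:, None]
    frames = x[idx] * np.hamming(frame_len)
    # (iii) power spectrum via the DFT
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # (iv) Mel spectrum: triangular filters spaced evenly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # (v) DCT-II (up to scaling) to obtain the static MFCC features
    n = np.arange(n_mels)
    basis = np.cos(np.pi * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None] / n_mels)
    return log_mel @ basis.T
```

For a one-second signal at 16\,kHz, this yields a $98 \times 13$ feature matrix, matching $T=\frac{L-l}{s}+1$.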

...

...



where $f$ denotes the physical frequency in Hz and $f_{\mathrm{Mel}}$ denotes the perceived frequency in Mels. The Mel scale is used to design the Mel filter bank where the spacing between the filters is selected according to the Mel scale. Figure~\ref{fig:mel_filterbank} shows the triangular filter bank with 20 filters.


\sigma(z) = \frac{1}{1+\exp(-z)}.

\end{equation}


The shape of the sigmoid resembles the letter \enquote{S} and is shown in Figure~\ref{fig:sigmoid}. One problem with using the sigmoid as an activation function is that it saturates when the magnitude of the input $z$ becomes too large. Saturation causes the derivative to vanish, which in turn causes problems when training \glspl{mlp}. Because of this saturation, using the sigmoid as an activation function in the hidden layers of an \gls{mlp} is discouraged.
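The saturation is easy to verify numerically. Using the standard identity $\sigma'(z)=\sigma(z)\,(1-\sigma(z))$, the derivative already almost vanishes for moderately large inputs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # derivative via the identity sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)
```

At $z=0$ the derivative attains its maximum of $0.25$, while at $z=10$ it has already dropped below $10^{-4}$, illustrating the vanishing gradient in the saturated regime.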

\begin{figure}

\centering

...

...


g(z) = \max\{0, z\}.

\end{equation}


The shape of the \gls{relu} is depicted in Figure~\ref{fig:relu}. A positive characteristic of the \gls{relu} is that its derivative is easy to compute. For positive $z$, the derivative is simply one; for negative $z$, it is zero. Furthermore, the derivative of the \gls{relu} is large and does not vanish while the \gls{relu} is active. This has been shown to cause fewer problems during training of \glspl{mlp} than the sigmoid activation function.
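A minimal sketch of the \gls{relu} and its (sub)derivative, directly following the two cases above:

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # derivative is one when the ReLU is active, zero otherwise
    # (the subgradient at z = 0 is conventionally set to zero here)
    return 1.0 if z > 0 else 0.0
```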

\begin{figure}

\centering

...

...


Since the output of the softmax represents a probability distribution, it holds that $0 < \mathrm{softmax}(\vm{z})_i < 1$ as well as $\sum_i \mathrm{softmax}(\vm{z})_i =1$.
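Both properties can be checked numerically with a small implementation. The max-subtraction below is a standard trick for numerical stability and does not change the result:

```python
import numpy as np

def softmax(z):
    # subtract the maximum for numerical stability (result is unchanged)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)
```

For any input vector, the entries lie strictly between zero and one and sum to one, so the output can be read as a probability distribution over the classes.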


Applying the identity activation $g(z)=z$ is the same as applying no activation function. The identity activation function can be used in the output layer of a regression model. The shape of the identity function is depicted in Figure~\ref{fig:identity}.


where $K$ is a two-dimensional kernel with $k_1\times k_2$ learnable parameters. The principle of the convolution operation is depicted in Figure~\ref{fig:2d_convolution}. A kernel $K$ slides over the image $I$ and at every step, a single value in the output $S$ is obtained by computing the dot product between the kernel $K$ and the overlapping region of the image. The kernel is then moved by a stride of $s$ pixels. After the kernel is moved, the process is repeated until the whole image is processed.


Sometimes, zero-padding is applied to the input image to control the shape of the output image. Zero-padding is a special type of padding where zeros are added around the edge of the input image. In Figure~\ref{fig:2d_convolution}, no padding is applied to the input image. Therefore, the output has a smaller spatial size than the input. To preserve the spatial dimensions, zero-padding with a margin of $\floor{k/2}$ needs to be applied to the input, assuming that the kernel has dimensions $k \times k$.
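A direct, naive implementation of the strided, zero-padded operation described above might look as follows. Like most \gls{dnn} frameworks, it computes the cross-correlation variant (no kernel flipping); this is a sketch, not an efficient implementation:

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    # zero-padding: add `pad` rows/columns of zeros around the edge
    image = np.pad(image, pad)
    k1, k2 = kernel.shape
    H, W = image.shape
    out_h = (H - k1) // stride + 1
    out_w = (W - k2) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # dot product between the kernel and the overlapping region
            region = image[i * stride:i * stride + k1,
                           j * stride:j * stride + k2]
            out[i, j] = np.sum(region * kernel)
    return out
```

For a $5\times5$ input and a $3\times3$ kernel, the output is $3\times3$ without padding and $5\times5$ with a zero-padding margin of $\floor{3/2}=1$, matching the discussion above.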

\begin{figure}

\centering

\begin{tikzpicture}

...

...


\label{fig:2d_convolution}

\end{figure}


\glspl{cnn} are very well suited for image recognition tasks. Image recognition deals with RGB images with dimensions $C \times H \times W$ where $C$ is the number of channels, $H$ is the height and $W$ the width of the image. In the case of RGB images, the kernel needs an additional dimension $C$ resulting in the kernel dimensions $C \times k_1\times k_2$. Again, to compute a single value in the output, the dot product between the kernel $\vm{K}$ and the overlapping part of the image $\vm{I}$ is computed according to Figure~\ref{fig:3d_convolution}. However, in practice there is not only a single kernel but a number of $N_C$ convolution kernels. A convolution is performed between the input and every convolution kernel resulting in $N_C$ different outputs. The output of a convolution is sometimes called a feature map.


The depthwise separable convolution \cite{Howard2017} factorizes the standard convolution into a depthwise convolution and a pointwise $1\times1$ convolution. The depthwise convolution is performed first by applying a single filter to each input channel. Then, the pointwise convolution applies a $1\times1$ convolution to combine the outputs of the depthwise convolution. Factorizing the standard convolution into a depthwise and a pointwise convolution is often utilized in resource efficient \gls{dnn} applications. With depthwise separable convolutions, the number of computing operations and the model size are greatly reduced. A visualization of the depthwise separable convolution is depicted in Figure~\ref{fig:depthwise_sep_conv}.
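The savings can be quantified by counting parameters; per spatial position, the \glspl{madds} scale by the same factor. A sketch with illustrative channel counts (not values from the thesis):

```python
def standard_params(c_in, c_out, k):
    # one k x k filter per (input channel, output channel) pair
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k):
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution combining the channels
    return depthwise + pointwise
```

For example, with 64 input channels, 64 output channels and a $3\times3$ kernel, the standard convolution needs $36864$ parameters while the depthwise separable factorization needs only $4672$, a reduction of roughly $8\times$.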

where $l(f(\vm{x}_i; \vm{\theta}), \vm{y}_i)$ is the per-sample loss (e.g. \gls{mse}, cross entropy, etc.). After every batch has been used once to perform a gradient step, the model has undergone one epoch of training. In practice, several epochs of training are necessary to achieve reasonable performance.

\newsubsubsection{Momentum and Nesterov Momentum}{dnn:gradient_descent:momentum}


The method of momentum \cite{POLYAK19641} is commonly used to accelerate learning with gradient descent. Instead of performing parameter updates with the negative gradient directly, a velocity term $\vm{v}$ is used to update the parameters. The velocity $\vm{v}$ accumulates an exponential moving average of past gradients. By computing this moving average, the velocity $\vm{v}$ averages out oscillations of the gradient while preserving the direction towards the minimum. Gradient descent with momentum is depicted in Figure~\ref{fig:gd_momentum}.
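A minimal sketch of the update rule on a toy quadratic objective $f(\theta)=\tfrac{1}{2}\theta^2$ (the learning rate and momentum coefficient below are illustrative, not values used in the thesis):

```python
def gd_momentum(theta0, lr=0.1, mu=0.9, steps=100):
    # minimize f(theta) = 0.5 * theta**2, whose gradient is simply theta
    theta, v = theta0, 0.0
    for _ in range(steps):
        grad = theta
        v = mu * v - lr * grad  # velocity: exponential average of gradients
        theta = theta + v       # the parameter update uses the velocity
    return theta
```

Starting from $\theta_0=5$, the iterate converges towards the minimum at $\theta=0$.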


In Section~\ref{sec:dnn:over_underfitting}, we discussed the concept of over- and underfitting in machine learning. There, the notion of the training and test error was introduced but not further explained. The error (sometimes called the loss) is usually just a scalar value indicating how well a machine learning model performs on a given dataset. The smaller the error, the better the model performs. The loss function differs depending on the task. In this subsection, the most commonly used loss functions for classification and regression are introduced.

The \gls{mse} is the average of squared differences between the predicted and the real targets across all $N$ samples in the dataset. The \gls{mse} is most commonly used for regression tasks and depends on the model parameters $\vm{\theta}$. The \gls{mse} for a single sample is defined as


The first layer of the \gls{mlp} is called the input layer; it takes a vector $\vm{x}$ as input. Intermediate layers are called hidden layers because their outputs are usually not observed directly. The last layer of the \gls{mlp} is called the output layer, and its output is called the predicted target. In classification tasks with $C$ classes, the predicted target is conceptually a single categorical value $\hat{y}\in\{1, \dots, C\}$; in practice, it is a probability distribution described by the vector $\vm{\hat{y}}\in\REAL^C$. Likewise, the target is conceptually a single categorical value $y \in\{1, \dots, C\}$; in practice, it is described as a one-hot encoded vector $\vm{y}\in\REAL^C$ where a single element is set to one and all other elements are set to zero.


Figure~\ref{fig:mlp_structure} visualizes the structure of the \gls{mlp}. On the left (green), the vector $\vm{x}$ serves as the input to the \gls{mlp}. After the input layer, there are $L$ hidden layers (blue) followed by an output layer (red) that computes a probability distribution vector from which the predicted target is extracted.

\glspl{mbc} consist of three separate convolutions: a 1$\times$1 convolution, followed by a depthwise $k \times k$ convolution, followed again by a 1$\times$1 convolution. The first 1$\times$1 convolution consists of a convolutional layer followed by batch normalization and the \gls{relu} activation function. Its purpose is to expand the number of channels by the expansion rate factor $e \geq1$, transforming the input feature map into a higher dimensional feature space. Then, a depthwise $k \times k$ convolution with stride $s$ is applied to the high dimensional feature map, followed by batch normalization and the \gls{relu} activation. Finally, a $1\times1$ convolution is applied to the high dimensional feature map in order to reduce the number of channels by a factor of $e' \leq1$. If the input of the \gls{mbc} has the same shape as the output (e.g. when the stride is $s=1$ and $e=\frac{1}{e'}$), a residual connection is introduced, which adds the input of the \gls{mbc} to its output.


Table~\ref{tab:mbc_convs} shows the three convolutions forming a \gls{mbc}. The table also shows the input and output dimension of every convolution. An input with dimensions $C \times H \times W$ is transformed to an output with dimension $C' \times\frac{H}{s}\times\frac{W}{s}$ using a stride $s$ and an expansion rate $e$.
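Based on these dimensions, the parameter count of an \gls{mbc} can be sketched as follows (bias terms omitted; the channel counts in the example are illustrative, not taken from the thesis):

```python
def mbc_params(c_in, c_out, e, k):
    expand = c_in * (e * c_in)      # 1x1 expansion to e*C channels
    depthwise = (e * c_in) * k * k  # depthwise k x k convolution
    project = (e * c_in) * c_out    # 1x1 projection down to C' channels
    return expand + depthwise + project
```

For example, with $C=C'=32$, expansion rate $e=6$ and a $3\times3$ kernel, the block has $6144 + 1728 + 6144 = 14016$ parameters; note that the depthwise convolution, despite operating in the expanded feature space, contributes only a small fraction of the total.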


A model may fail to generalize well on the test dataset for one of two reasons: underfitting or overfitting. Underfitting occurs when the model capacity is too low and the model struggles to reduce the error on the training dataset. Overfitting occurs when the model capacity is too large and the model \enquote{memorizes} the training data. Overfitting becomes evident during training when the gap between the training and test error increases. When the model capacity is right, the model is able to generalize well.


As an example to illustrate underfitting and overfitting, three models with different capacities are depicted in Figure~\ref{fig:over_underfit}. Every model was trained to fit a function to the noisy samples of a quadratic function. In the figure on the left, the capacity of the model is too low since only a linear function was used to fit the data and therefore the shape of the quadratic function is matched poorly. In the figure on the right, the model capacity is too high since a higher order polynomial was used to fit the data and therefore the model overfits to the noisy samples. The model on the right has the lowest training error but at the same time will perform poorly on unseen data. In the figure in the middle, the model capacity is right and the model neither overfits nor underfits to the data. The capacity of this model is appropriate and therefore, this model will perform best on unseen data.
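This kind of experiment can be reproduced in a few lines on synthetic data (a hypothetical noisy quadratic with assumed noise level, not the thesis' actual figure data). Since the polynomial models are nested, the training error can only decrease with the degree, while the degree-9 fit chases the noise and would generalize poorly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = x ** 2 + rng.normal(0.0, 0.05, size=x.shape)  # noisy quadratic samples

errs = {}
for deg in (1, 2, 9):  # underfit, right capacity, overfit
    coeffs = np.polyfit(x, y, deg)
    errs[deg] = np.mean((np.polyval(coeffs, x) - y) ** 2)  # training error
```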

\begin{figure}

\centering

...

...


\label{fig:over_underfit}

\end{figure}


\glspl{dnn} usually have a large number of trainable parameters and are able to compute very complex mappings from inputs to targets. Therefore, \glspl{dnn} have a large capacity and are often prone to overfitting the training data. When training \glspl{dnn}, the training and test error often follow the stereotypical shape depicted in Figure~\ref{fig:ml_basics:over_underfitting:generalization_gap}. At the start of training, both the training and test error are high; the model underfits. Initially during training, both the training and test error decrease. At some point, the test error may increase while the training error still decreases. The increasing gap between the training and the test error is an indicator that the model has started to overfit the training data. Somewhere between the underfitting and the overfitting zone, there is a point at which the test error is lowest. At this point, the capacity of the model is optimal.

\begin{figure}

\centering

...

...



An intuitive approach to find the point of optimal capacity is to stop training when the test error increases. This approach is called early stopping. With early stopping, the dataset is split into three parts: the training dataset, the validation dataset and the test dataset. During training, the error on the validation set is monitored. If the error on the validation set does not decrease for some time or even increases, the training is stopped and the error on the test set is reported. Early stopping is a regularization method. In general, regularization is any modification we make to a learning algorithm that is intended to reduce its test error but not necessarily its training error \cite{Goodfellow2016}. Other methods for regularization, including weight decay and dropout, are explained in detail in Section~\ref{sec:dnn:regularization}.
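The monitoring loop behind early stopping can be sketched as follows; the patience value and the validation error sequence are purely illustrative:

```python
def early_stopping(val_errors, patience=3):
    # stop once the validation error has not improved for `patience` epochs
    best, since_improvement = float('inf'), 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, since_improvement = err, 0
        else:
            since_improvement += 1
        if since_improvement >= patience:
            return epoch  # epoch at which training is stopped
    return len(val_errors) - 1
```

Here, a run whose validation error stops improving after epoch 2 is terminated three epochs later, and the model parameters from the best epoch would then be restored.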

The perceptron is the basic building block of the \gls{mlp}. It takes an input vector $\vm{x}$ and computes the dot product between the input vector and a learnable weight vector $\vm{w}$. After computing the dot product, a scalar bias $b$ is added. Then an activation function, in the case of the perceptron a threshold function, is applied to restrict the output to either zero or one. In this basic form, the perceptron acts as a binary classifier.


In modern neural network implementations, the threshold function is usually replaced with a different activation function. An in-depth explanation of different activation functions follows in Section~\ref{sec:act_functions}. Mathematically, the perceptron is described by two equations

\begin{align}

&z = \sum_i w_i x_i + b \\

&a = g(z)

\end{align}

where $g$ is the activation function.
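The two equations translate directly into code. The following sketch uses the classic threshold function as $g$; the example weights realizing a logical AND are illustrative only:

```python
def perceptron(x, w, b):
    """Perceptron: z = w . x + b, followed by a threshold activation g."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # dot product plus bias
    return 1 if z >= 0 else 0                     # threshold activation g(z)

# With suitable weights, the perceptron realizes a logical AND:
perceptron([1, 1], w=[1, 1], b=-1.5)  # -> 1
perceptron([0, 1], w=[1, 1], b=-1.5)  # -> 0
```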


Figure~\ref{fig:perceptron_model} shows a schematic representation of the perceptron. Again, the perceptron simply computes a dot product between the input and its weights followed by an activation function. An additional fixed input of value one is included and multiplied by $b$ to account for the bias term. In some cases, the bias can be dropped completely.


There are several reasons to use the pooling layer in \glspl{cnn}. The first reason is to reduce the spatial size of the feature map, which reduces the computational cost in the later stages of the \gls{cnn}. Furthermore, pooling helps the model to be invariant to small translations of the input.


Figure~\ref{fig:max_avg_pooling} shows an example of max-pooling (left) and average pooling (right) applied to a two-dimensional image. In this figure, the filter size and the stride were both set to $2\times2$.
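The $2\times2$ pooling with stride $2$ can be reproduced with a short sketch using plain lists instead of a tensor library; the example image is an arbitrary stand-in for the one in the figure:

```python
def pool2x2(image, op):
    """Apply a 2x2 pooling window with stride 2; op reduces each window."""
    out = []
    for i in range(0, len(image) - 1, 2):
        row = []
        for j in range(0, len(image[0]) - 1, 2):
            window = [image[i][j], image[i][j + 1],
                      image[i + 1][j], image[i + 1][j + 1]]
            row.append(op(window))
        out.append(row)
    return out

img = [[1, 3, 2, 0],
       [4, 2, 1, 1],
       [0, 1, 5, 6],
       [2, 2, 7, 8]]
pool2x2(img, max)                        # max-pooling: [[4, 2], [2, 8]]
pool2x2(img, lambda w: sum(w) / len(w))  # average pooling: [[2.5, 1.0], [1.25, 6.5]]
```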


\newsection{Outline}{intro:outline}

%Outline: Thesis outline. Kapitel mit Sätzen beschreiben.


The outline of this thesis is as follows: Chapter~\ref{chp:dnn} provides the theoretical background to \glspl{dnn}. This chapter introduces various learning approaches, capacity, over- and underfitting, \glspl{mlp}, \glspl{cnn} and the training of \glspl{dnn}. Chapter~\ref{chp:kws} provides the theoretical background for resource efficient \gls{kws}. This chapter presents resource efficient convolutional layers, \gls{nas}, weight and activation quantization, end-to-end models and multi-exit models. The \gls{gsc} dataset, data augmentation and feature extraction are explained in detail in Chapter~\ref{chp:dataset}. The experimental results of this thesis are presented and discussed in Chapter~\ref{chp:results}. Finally, Chapter~\ref{chp:conclusion} provides the conclusion to this thesis.


Convolutional layers are the key building blocks of modern \glspl{dnn}. Therefore, many research efforts have been dedicated to reducing the number of \glspl{madds} and the number of parameters of convolutional layers. The basic principle of how the standard convolution, the depthwise separable convolution and the \gls{mbc} work has already been explained in Section~\ref{sec:dnn:cnn}. In this section, we will compare the standard convolution, the depthwise separable convolution and the \gls{mbc} in terms of \glspl{madds} and number of parameters.

The standard convolution takes a $C_{\mathrm{in}}\times H \times W$ input and applies a convolutional kernel $C_{\mathrm{out}}\times C_{\mathrm{in}}\times k \times k$ to produce a $C_{\mathrm{out}}\times H \times W$ output \cite{Sandler2018}. The number of \glspl{madds} to compute the standard convolution is

\begin{equation}

...

...


\begin{equation}

C_{\mathrm{in}}\cdot H \cdot W \cdot e \cdot (C_{\mathrm{in}} + k^2 + C_{\mathrm{out}}).

\end{equation}


Compared to Equation~\ref{dwise_sep_ops}, the \gls{mbc} has an extra term. However, with \glspl{mbc} we can utilize much smaller input and output dimensions without sacrificing performance and therefore reduce the number of \glspl{madds}. Table~\ref{tab:conv_ops_params} compares the number of \glspl{madds} and number of parameters for the standard convolution, the depthwise separable convolution and the \gls{mbc}.
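As a sketch, the three operation counts can be compared numerically. The \gls{mbc} expression matches the equation above; the standard and depthwise separable expressions are the usual ones implied by the stated kernel and output shapes \cite{Sandler2018}, and the example layer dimensions are arbitrary:

```python
def madds_standard(c_in, c_out, h, w, k):
    """Standard convolution: C_in * H * W * k^2 * C_out."""
    return c_in * h * w * k**2 * c_out

def madds_depthwise_separable(c_in, c_out, h, w, k):
    """Depthwise plus pointwise convolution: C_in * H * W * (k^2 + C_out)."""
    return c_in * h * w * (k**2 + c_out)

def madds_mbc(c_in, c_out, h, w, k, e):
    """MBC with expansion factor e: C_in * H * W * e * (C_in + k^2 + C_out)."""
    return c_in * h * w * e * (c_in + k**2 + c_out)

# Example layer: 32 -> 64 channels, 10x10 feature map, 3x3 kernel
madds_standard(32, 64, 10, 10, 3)             # 1,843,200
madds_depthwise_separable(32, 64, 10, 10, 3)  #   233,600
```

Even with the extra term, an \gls{mbc} operating on smaller input and output dimensions can undercut a standard convolution of comparable accuracy.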


In multi-exit models, there are usually two modes of prediction: anytime prediction and budget prediction \cite{PhuongL19}. In anytime prediction, a crude initial prediction is produced and then gradually improved as long as the computational budget allows it. If the computational budget is depleted, the multi-exit model outputs the prediction of the last evaluated exit or an ensemble of all evaluated exits. In budget prediction, the computational budget is constant and therefore a single exit that fits the computational budget is selected. Since in budget prediction mode only one exit is selected, only one prediction is reported.
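Anytime prediction can be sketched as a loop over exits that stops when the budget is depleted; the per-exit costs and prediction callbacks below are hypothetical stand-ins:

```python
def anytime_predict(exits, budget):
    """Evaluate exits in order; return the last prediction within budget."""
    prediction, spent = None, 0
    for cost, predict in exits:
        if spent + cost > budget:  # next exit would exceed the budget
            break
        spent += cost
        prediction = predict()     # refine the prediction
    return prediction

exits = [(1, lambda: "crude"), (2, lambda: "better"), (4, lambda: "best")]
anytime_predict(exits, budget=3)   # -> "better"
anytime_predict(exits, budget=10)  # -> "best"
```

Budget prediction, by contrast, would pick the single most expensive exit whose cost fits the fixed budget and evaluate only that one.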


Multi-exit models are usually trained by attaching a loss function to every exit and minimizing the sum of exit-wise losses. Recently, knowledge distillation has been used effectively to train multi-exit models as well. In \cite{PhuongL19}, self-distillation is performed as shown in Figure~\ref{fig:multi_exit_self_dist}. In self-distillation, the output of the final layer produces the soft targets that are then used to compute the distillation losses of the early exits. However, the soft targets can also be generated from a different teacher network. The multi-exit model is then trained on the weighted average of the distillation and the classification loss.
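The combined objective can be sketched as follows; the distillation weight \texttt{lam} and the per-exit loss values are hypothetical, and in a real setup each loss would come from the exit's logits and the (soft) targets:

```python
def multi_exit_loss(class_losses, distill_losses, lam=0.5):
    """Sum over exits of a weighted average of the classification
    loss and the distillation loss (one entry per exit)."""
    return sum((1 - lam) * lc + lam * ld
               for lc, ld in zip(class_losses, distill_losses))

# Two exits with example loss values:
multi_exit_loss([1.0, 2.0], [0.5, 1.5], lam=0.5)  # -> 2.5
```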


Conventional \gls{nas} methods require thousands of architectures to be trained to find the final architecture. Not only is this very time consuming, but in some cases it is even infeasible, especially for large-scale tasks such as ImageNet or when computational resources are limited. Therefore, several proposals have been made to reduce the computational overhead. So-called proxy tasks for reducing the search cost are training for fewer epochs, training on a smaller dataset or learning a smaller model that is then scaled up. As an example, \cite{Zoph2018} searches for the best convolutional layer on the CIFAR10 dataset, which is then stacked and applied to the much larger ImageNet dataset. Many other \gls{nas} methods implement this technique with great success \cite{Liu2018a,Real2019,Cai2018,Liu2019,Tan2019,Luo2018}. Another approach, called EfficientNet \cite{Tan2019a}, employs the \gls{nas} approach from \cite{Tan2019} to find a small baseline model which is subsequently scaled up. By scaling the three dimensions depth, width and resolution of the small baseline model, they obtain state-of-the-art performance.


Although model scaling and stacking achieve good performance, the convolutional layers optimized on the proxy task may not be optimal for the target task. Therefore, several approaches have been proposed to eliminate proxy tasks. Instead, architectures are directly trained on the target task and optimized for the hardware at hand. In \cite{Cai2019}, ProxylessNAS is introduced, where an overparameterized network with multiple parallel candidate operations per layer is used as the base model. Every candidate operation is gated by a binary gate which either prunes or keeps the operation. During architecture search, the binary gates are trained such that only one operation per layer remains and the targeted memory and latency requirements are met. Section~\ref{sec:kws:nas:proxyless} will explore ProxylessNAS in more detail.

\newsubsection{ProxylessNAS}{kws:nas:proxyless}


ProxylessNAS is a multi-objective \gls{nas} approach from \cite{Cai2019} that we will use to find resource efficient \gls{kws} models. In ProxylessNAS, hardware-aware optimization is performed to optimize the model accuracy and latency on different hardware platforms. In this thesis, however, we optimize for accuracy and the number of \glspl{madds} rather than latency. By optimizing for the number of \glspl{madds}, the model latency and the number of parameters are implicitly optimized as well. The trade-off between the accuracy and the number of \glspl{madds} is established by regularizing the architecture loss as explained in Section~\ref{sec:results:nas:experimental}.

ProxylessNAS constructs an overparameterized model with multiple parallel candidate operations per layer as the base model. A single operation is denoted by $o_i$. Every operation is assigned a real-valued architecture parameter $\alpha_i$. The $N$ architecture parameters are transformed to probability values $p_i$ by applying the softmax function.

Every candidate operation is gated by a binary gate $g_i$ which either prunes or keeps the operation. Only one gate per layer is active at a time. Gates are sampled randomly according to

...

...
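The softmax transformation of the architecture parameters and the sampling of a single active gate can be sketched as follows; the function names are ours, and the sampling simply draws one index according to the probabilities $p_i$:

```python
import math
import random

def softmax(alphas):
    """Transform architecture parameters alpha_i to probabilities p_i."""
    shifted = [a - max(alphas) for a in alphas]  # numerical stabilization
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def sample_gates(alphas):
    """Sample a one-hot gate vector: exactly one active gate per layer."""
    p = softmax(alphas)
    i = random.choices(range(len(p)), weights=p)[0]
    return [1 if j == i else 0 for j in range(len(p))]
```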



During architecture search, the training of the architecture parameters $\alpha_i$ and the weight parameters of the operations $o_i$ is performed as a two-step procedure. When training weight parameters, the architecture parameters are frozen and binary gates are sampled according to (\ref{eq:proxyless_gates}). Weight parameters are then updated via standard gradient descent on the training set. When training architecture parameters, the weight parameters are frozen and the architecture parameters are updated on the validation set. The two-step procedure for learning the weight and architecture parameters is visualized in Figure~\ref{fig:proxyless_nas_learning}.
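The alternation reduces to a plain loop; the two update callbacks below are hypothetical stand-ins for the actual gradient steps on the training and validation sets:

```python
def architecture_search(num_steps, weight_step, arch_step):
    """Alternate weight updates (training set, architecture frozen) with
    architecture updates (validation set, weights frozen)."""
    for _ in range(num_steps):
        weight_step()  # sample gates, gradient step on the training set
        arch_step()    # gradient step on the validation set

calls = []
architecture_search(2,
                    weight_step=lambda: calls.append("w"),
                    arch_step=lambda: calls.append("a"))
# calls == ["w", "a", "w", "a"]
```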

\begin{figure}

\centering

...

...


\label{fig:proxyless_nas_learning}

\end{figure}


Updating the architecture parameters via backpropagation requires computing $\partial\mathcal{L}/\partial\alpha_i$, which is not defined due to the sampling in (\ref{eq:proxyless_gates}). To solve this issue, the gradient $\partial\mathcal{L}/\partial p_i$ is estimated by $\partial\mathcal{L}/\partial g_i$. This estimation is an application of the \gls{ste} explained in Section~\ref{sec:kws:quantization}. With this estimation, backpropagation can be used to compute $\partial\mathcal{L}/\partial\alpha_i$ since the gates $g_i$ are part of the computation graph and thus $\partial\mathcal{L}/\partial g_i$ can be calculated using backpropagation.
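The \gls{ste} idea can be sketched for a generic non-differentiable operation: the forward pass applies it, while the backward pass pretends it was the identity. The uniform quantizer and its step size below are an arbitrary illustrative choice:

```python
def quantizer_forward(w, step=0.25):
    """Forward pass of a uniform quantizer Q with a fixed step size."""
    return [round(wi / step) * step for wi in w]

def quantizer_backward(grad_wrt_quantized):
    """Backward pass with the STE: the derivative of Q is replaced by the
    derivative of the identity, so the gradient passes through unchanged."""
    return list(grad_wrt_quantized)

quantizer_forward([0.3, -0.6])   # -> [0.25, -0.5]
quantizer_backward([0.1, -0.2])  # -> [0.1, -0.2]
```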

%\newsubsection{Single-Path NAS}{}

%Single-Path NAS was proposed in \cite{Stamoulis2019} as a hardware-efficient method for \gls{nas}. Compared to ProxylessNAS, Single-path NAS uses one single-path overparameterized network to encode all architectural decisions with shared convolutional kernel parameters. Parameter sharing drastically decreases the number of trainable parameters and therefore also reduces the search cost in comparison to ProxylessNAS.


\end{equation}

In the parameter-based approach, the scaling factor $\alpha$ is defined as a learnable parameter that is optimized using gradient descent.


Figure~\ref{fig:ste_quant} illustrates weight quantization using the \gls{ste} on a typical convolutional layer (without batch normalization). During the forward pass, the quantizer $Q$ performs the quantization of the auxiliary weight tensor $\mathbf{W}^l$ to obtain the quantized weight tensor $\mathbf{W}^l_q$. The quantized weight tensor is then used in the convolution. During the backward pass, the STE replaces the derivative of the quantizer $Q$ by the derivative of the identity function. In this way, the gradient can propagate back to $\mathbf{W}^l$.
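A minimal sketch of the statistics-based forward quantization of a weight tensor, assuming symmetric uniform quantization with the scaling factor $\alpha$ taken as the maximum absolute weight value; the bit width and helper name are our own choices:

```python
def quantize_weights(weights, bits=8):
    """Symmetric uniform quantization with statistics-based scaling
    alpha = max |w| (the forward-pass quantizer Q)."""
    alpha = max(abs(w) for w in weights)  # statistics-based scaling factor
    levels = 2 ** (bits - 1) - 1          # e.g. 127 integer levels for 8 bits
    return [round(w / alpha * levels) / levels * alpha for w in weights]

quantize_weights([1.0, -0.5, 0.1], bits=8)
```

During training, this function would play the role of $Q$ in the forward pass, with the \gls{ste} passing the gradient back to the auxiliary weights $\mathbf{W}^l$.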