\glspl{cnn}\cite{Lecun1995} are a specialized kind of \gls{dnn} that are tremendously successful in many practical applications. \glspl{cnn} are most prominently used in image and video processing tasks, although there are many more use cases. Similar to \glspl{mlp}, \glspl{cnn} obtain their outstanding performance by stacking multiple layers. In this section, we will cover some of the basic \gls{cnn} layers such as the convolutional layer and the pooling layer as well as more intricate layers such as the depthwise separable convolution layer and the \gls{mbc} layer.

The amount of memory needed to store the parameters depends on two factors: the number of parameters and the bit width of the parameters. Decreasing the number of parameters is possible by either employing resource efficient convolutional layers or by establishing an accuracy-size tradeoff using a resource efficient \gls{nas} approach. Section~\ref{sec:kws:conv} deals with resource efficient convolutional layers, whereas Section~\ref{sec:kws:nas} deals with the resource efficient \gls{nas} approach. Quantizing weights and activations to decrease the amount of memory needed is discussed in Section~\ref{sec:kws:quantization}.
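To make this concrete, the memory footprint can be estimated as the product of the parameter count and the bit width. The following minimal Python sketch (with illustrative numbers, not figures from this thesis) shows the calculation:

```python
# Sketch: parameter memory = number of parameters x bit width.
# The counts below are illustrative, not measurements from this work.
def param_memory_bytes(num_params: int, bit_width: int) -> int:
    """Memory in bytes needed to store `num_params` at `bit_width` bits each."""
    return num_params * bit_width // 8

full_precision = param_memory_bytes(1_000_000, 32)  # 32-bit float weights
quantized = param_memory_bytes(1_000_000, 8)        # 8-bit quantized weights
print(full_precision, quantized)  # 4000000 1000000
```

Quantizing the same model from 32 to 8 bit thus shrinks the parameter memory by a factor of four, independently of the architecture.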

The number of \glspl{madds} per forward pass is another aspect that cannot be neglected in resource efficient \gls{kws} systems. In most cases, the number of \glspl{madds} is tied to the latency of a model. By decreasing the number of \glspl{madds}, we also implicitly decrease the latency of a model. Decreasing the number of \glspl{madds} is again possible by either employing resource efficient convolutional layers or by establishing an accuracy-size tradeoff using a resource efficient \gls{nas} approach. Another method to decrease the \glspl{madds} of a model is to replace the costly \gls{mfcc} feature extractor by a convolutional layer. This method is explained in Section~\ref{sec:kws:endtoend}. Finally, Section~\ref{sec:kws:multi_exit} introduces multi-exit models. As we will see, multi-exit models allow us to dynamically reduce the number of \glspl{madds} based on either the available computational resources or the complexity of the classification task.

With weight and activation quantization, we were able to further reduce the memory footprint of our models.

With end-to-end \acrshort{kws} models, we were able to skip the extraction of hand-crafted speech features and instead perform classification on the raw audio waveforms. By removing the need for hand-crafted speech features we managed to find models with fewer parameters. However, we also observed a small negative performance impact of end-to-end models compared to ordinary models using \acrshortpl{mfcc} as speech features.

In our last experiment, we explored multi-exit models for \acrshort{kws} where we compared different exit topologies. Furthermore, we compared distillation based training to ordinary training. Multi-exit models increase the flexibility of a \acrshort{kws} system substantially, allowing us to interrupt the forward pass early if necessary. However, this increase in flexibility comes at the cost of an increased number of model parameters. We observed that the exit topology has a substantial impact on the performance of a multi-exit model. We also observed that distillation based training is beneficial for training multi-exit models.

As mentioned in the introduction of this chapter, there exist two versions of the \gls{gsc} dataset. In this thesis, we will use the first version for the training and the evaluation of our \gls{kws} models. The first version of the \gls{gsc} contains \num{64727} utterances from \num{1881} speakers. Each utterance is stored as a one second wave file sampled at \SI{16}{\kilo\hertz} with \num{16} bit.

This thesis follows the regular approach to only use \num{10} of the provided classes directly, with the rest grouped into a single \enquote{unknown} class and an added \enquote{silence} class for a total of \num{12} classes. The following \num{10} classes are used directly: \enquote{yes}, \enquote{no}, \enquote{up}, \enquote{down}, \enquote{left}, \enquote{right}, \enquote{on}, \enquote{off}, \enquote{stop}, \enquote{go}. To ensure that there is roughly the same number of samples in every class, only a subset of the samples from the \enquote{unknown} class is selected. The \enquote{silence} class is built artificially and includes wave files where every file contains a random section from a noise sample in the \gls{gsc} dataset.

The sigmoid activation function was commonly used but is nowadays mostly replaced by other activation functions. Nonetheless, there are still some cases where the sigmoid activation function is used. The sigmoid function is defined by

\begin{equation}
\sigma(z) = \frac{1}{1+\exp(-z)}.
\end{equation}

The shape of the sigmoid resembles the letter \enquote{S} and is shown in Figure~\ref{fig:sigmoid}. One problem with the sigmoid function as an activation function is that it saturates when the magnitude of the input $z$ becomes too large. Saturation causes the derivative to vanish, which in turn causes problems when training \glspl{mlp}. Because of the saturation, it is discouraged to use the sigmoid as an activation function in the hidden layers of a \gls{mlp}.
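The vanishing derivative can be verified numerically. The following illustrative Python sketch (not part of the thesis experiments) evaluates the sigmoid derivative $\sigma(z)(1-\sigma(z))$ near zero and in the saturated region:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z: float) -> float:
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25, the largest possible value
print(sigmoid_grad(10.0))  # ~4.5e-05, the gradient has almost vanished
```

Already at $z=10$ the gradient is four orders of magnitude smaller than at $z=0$, which illustrates why gradients struggle to propagate through saturated sigmoid units.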

\begin{figure}

\centering

...

...


However, the sigmoid activation can be used as an activation function in the output layer of a binary classifier. A binary classifier has a single output neuron in the output layer. If the sigmoid activation function is used, the output of the binary classifier gives the probability $p$ that the input belongs to class $y=1$. The probability that the input belongs to class $y=2$ is then simply $1-p$.

\newsubsubsection{Rectified Linear Units}{dnn:activation_functions:relu}

\glspl{relu}\cite{Nair2010} are often used as a replacement for the sigmoid activation function in the hidden layers of a \gls{mlp}. The \gls{relu} is defined by

\begin{equation}

g(z) = \max\{0, z\}.

\end{equation}

The shape of the \gls{relu} is depicted in Figure~\ref{fig:relu}. A positive aspect of the \gls{relu} is that the derivative is easy to compute. For positive $z$, the derivative is simply one. For negative $z$, the derivative is zero. Furthermore, the derivative of the \gls{relu} does not vanish when the \gls{relu} is active. This has been shown to cause fewer problems during the training of \glspl{mlp} compared to the sigmoid activation function.
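The piecewise definition and its derivative can be sketched in a few lines of Python (illustrative only):

```python
def relu(z: float) -> float:
    # g(z) = max{0, z}
    return max(0.0, z)

def relu_grad(z: float) -> float:
    # Derivative is 1 when the unit is active (z > 0), 0 otherwise.
    return 1.0 if z > 0 else 0.0

print(relu(2.5), relu(-3.0))        # 2.5 0.0
print(relu_grad(2.5), relu_grad(-3.0))  # 1.0 0.0
```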

\begin{figure}

\centering

...

...

The softmax activation function is commonly used in the output layer of \glspl{mlp}.

Since the output of the softmax represents a probability distribution, it holds that $0\leq\mathrm{softmax}(\vm{z})_i \leq1$ as well as $\sum_i \mathrm{softmax}(\vm{z})_i =1$.
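Both properties can be checked numerically. The following illustrative Python sketch uses the standard max-subtraction trick for numerical stability (which leaves the result unchanged):

```python
import math

def softmax(z):
    # Shift by max(z) for numerical stability; does not change the result.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2.0, 1.0, 0.1])
print(sum(p))  # 1.0 (up to floating point error)
```

All entries of `p` lie between zero and one, they sum to one, and the ordering of the logits is preserved.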

Applying the identity activation $g(z)=z$ is the same as applying no activation function. The identity activation function can be used in the output layer of a regression model. The shape of the identity function is depicted in Figure~\ref{fig:identity}.

Introduced in 2015, batch normalization \cite{Ioffe2015} has been widely adopted in the field of deep learning to speed up the learning of \glspl{dnn}. Batch normalization ensures that the activations inside a \gls{dnn} are normalized and therefore every hidden layer gets a standardized input. Given a vector of activations $\vm{z}=(z_1, \dots, z_N)^T$ of a hidden layer, a single activation is normalized by

Sometimes, zero-padding is applied to the input image to control the shape of the output feature map.

\node[anchor=south east, inner sep=0.01em, blue] at (mtr-3-5.south east) (xx) {\scalebox{.5}{$\times0$}};

\node[anchor=south east, inner sep=0.01em, blue] at (mtr-3-6.south east) (xx) {\scalebox{.5}{$\times1$}};

\end{tikzpicture}

\caption{Convolution operation applied to a two dimensional input image $I$. The parameters of the kernel $K$ are learnable. Figure from \cite{Velickovic2016}.}

\label{fig:2d_convolution}

\end{figure}

\glspl{cnn} are very well suited for image recognition tasks. Image recognition deals with RGB images with dimensions $C \times H \times W$ where $C$ is the number of channels, $H$ is the height and $W$ the width of the image. In the case of RGB images, the kernel needs an additional dimension $C$ resulting in the kernel dimensions $C \times k_1\times k_2$. Again, to compute a single value in the output, the dot product between the kernel $K$ and the overlapping part of the image $I$ is computed according to Figure~\ref{fig:3d_convolution}. However, in practice there is not only a single kernel but $N_C$ convolution kernels. A convolution is performed between the input and every convolution kernel, resulting in $N_C$ different outputs. The output of a convolution is sometimes called a feature map.
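The shapes involved can be made explicit with a naive reference implementation. The sketch below (an illustration, assuming stride one and no padding; the function name `conv2d` is hypothetical) computes $N_C$ feature maps from a $C \times H \times W$ input:

```python
import numpy as np

def conv2d(image, kernels):
    """Naive 'valid' convolution (stride 1, no padding) -- for illustration only.

    image:   (C, H, W) input
    kernels: (N_C, C, k1, k2) learnable kernels
    returns: (N_C, H - k1 + 1, W - k2 + 1) feature maps
    """
    n_c, c, k1, k2 = kernels.shape
    _, h, w = image.shape
    out = np.zeros((n_c, h - k1 + 1, w - k2 + 1))
    for n in range(n_c):
        for i in range(h - k1 + 1):
            for j in range(w - k2 + 1):
                # Dot product between the kernel and the overlapping patch.
                out[n, i, j] = np.sum(kernels[n] * image[:, i:i + k1, j:j + k2])
    return out

maps = conv2d(np.ones((3, 32, 32)), np.ones((8, 3, 3, 3)))
print(maps.shape)  # (8, 30, 30): one feature map per kernel
```

Production frameworks implement the same operation far more efficiently, but the shape bookkeeping is identical.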

\begin{figure}

\centering

\begin{tikzpicture}

...

...


\caption{Convolution operation applied to a three dimensional input image $I$ having three channels. The parameters of the kernel $K$ are learnable.}

The depthwise separable convolution \cite{Howard2017} factorizes the standard convolution into a depthwise convolution and a pointwise $1\times1$ convolution. The depthwise convolution is performed first by applying a single filter to each input channel. Then, the pointwise convolution applies a $1\times1$ convolution to combine the outputs of the depthwise convolution. Factorizing the standard convolution into a depthwise and a pointwise convolution is often utilized in resource efficient \glspl{dnn}. With depthwise separable convolutions, the number of computing operations and the model size are greatly reduced. A visualization of the depthwise separable convolution is depicted in Figure~\ref{fig:depthwise_sep_conv}.
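The savings follow directly from the well-known parameter counts $k^2 C_{in} C_{out}$ for a standard convolution versus $k^2 C_{in} + C_{in} C_{out}$ for the factorized version (biases ignored). A small illustrative sketch:

```python
def standard_conv_params(c_in, c_out, k):
    # k x k kernel, one per (input channel, output channel) pair.
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    # Depthwise k x k (one filter per input channel) + pointwise 1 x 1.
    return k * k * c_in + c_in * c_out

std = standard_conv_params(64, 128, 3)        # 73728
sep = depthwise_separable_params(64, 128, 3)  # 8768
print(sep / std)  # ~0.12, i.e. roughly 1/c_out + 1/k^2
```

For a $3\times3$ kernel the factorization saves roughly a factor of eight to nine in both parameters and \glspl{madds}.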

Knowledge distillation was introduced in \cite{Hinton2015} and has since been applied to many areas of deep learning. Knowledge distillation is an application of the student-teacher paradigm, where a small student network is trained to mimic the predictions of a large teacher network. The transfer of knowledge between the teacher and the student is called distillation. Student models trained with knowledge distillation usually achieve a better performance than models trained without knowledge distillation.

Usually, models are trained on one-hot-encoded targets. However, one-hot-encoded targets do not carry much information besides the information that the corresponding input belongs to a certain class. Furthermore, one-hot-encoded targets do not capture the relationship between different classes. For instance, in image recognition on MNIST, an image of the number 1 might look similar to an image of the number 7 and vice versa. This information is not captured in one-hot-encoded targets. It is, however, captured in the output probabilities (also called soft targets) of the teacher network. So instead of training models on one-hot-encoded targets, in knowledge distillation, models are trained on soft targets. By matching the targets of the student with the soft targets of the teacher, the student network is able to generalize in the same way as the teacher. Therefore, if the large teacher network generalizes well, the student network trained on the soft targets will also generalize well.

Training on soft targets does not necessarily require a labeled dataset since the soft targets can be produced by the teacher by just forwarding the input. Therefore, knowledge distillation can be used on unsupervised datasets as well.

\glspl{dnn} compute probabilities $q_i$ for every class $c \in\{1, \dots, C\}$ by applying a softmax activation to the output of the last layer $\vm{z}$. Knowledge distillation introduces the temperature scaled softmax activation

\end{equation}

where $\vm{z}_t$ and $\vm{z}_s$ are the outputs of the teacher and the student, respectively.

Again, the overall classification loss $\mathcal{L}_{\mathrm{cls}}$ is obtained by averaging over all per-example classification losses.

In knowledge distillation, the student network is then trained on $\mathcal{L}_{KD}$ consisting of the weighted average of the distillation and the classification loss
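A numerical sketch of this combined loss is given below. It follows the formulation of \cite{Hinton2015}; the temperature $T=4$ and weight $\alpha=0.5$ are illustrative choices, not values used in this thesis, and the $T^2$ scaling of the distillation term is the common convention for keeping gradient magnitudes comparable:

```python
import numpy as np

def softmax_t(z, temperature=1.0):
    # Temperature scaled softmax (max-shifted for numerical stability).
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kd_loss(z_student, z_teacher, onehot, temperature=4.0, alpha=0.5):
    """Weighted average of the distillation and the classification loss.

    Illustrative sketch; temperature and alpha are hypothetical values.
    """
    p_t = softmax_t(z_teacher, temperature)  # teacher soft targets
    p_s = softmax_t(z_student, temperature)  # tempered student outputs
    # Cross entropy against the soft targets, scaled by T^2.
    distill = -temperature**2 * np.sum(p_t * np.log(p_s))
    # Ordinary cross entropy against the one-hot-encoded target.
    cls = -np.sum(np.asarray(onehot, dtype=float) * np.log(softmax_t(z_student)))
    return alpha * distill + (1.0 - alpha) * cls

loss = kd_loss([2.0, 0.0, -1.0], [2.5, 0.0, -1.0], [1, 0, 0])
print(loss)
```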


\end{equation}

where $\epsilon > 0$ is called the step size or the learning rate. The learning rate is a hyperparameter that is usually chosen by hand. The learning rate should neither be too high nor too low in order for the gradient descent algorithm to converge. If the learning rate is too small, training takes very long since the parameters change only marginally at each iteration. On the other hand, if the learning rate is too large, the algorithm might diverge, causing the value of the loss function to increase.

In some applications, the learning rate is decreased over time. In that case, the value of the learning rate changes based on the current iteration step $k$ and is denoted by $\epsilon_k$.

The gradient $\nabla_{\vm{\theta}}\mathcal{L}(\vm{\theta})$ is the vector of partial derivatives of the loss with respect to all the weights and biases of a \gls{dnn}. To compute all partial derivatives, backpropagation \cite{Rumelhart1986} is employed. Modern deep learning frameworks, e.g. PyTorch, provide built-in methods which automatically compute the partial derivatives of the loss with respect to the weights and biases by means of backpropagation.

...

...


where $l(f(\vm{x}_i; \vm{\theta}), \vm{y}_i)$ is the per-sample loss (e.g. \gls{mse}, cross entropy). After every batch has been used once to perform a gradient step, the model has undergone one epoch of training. In practice, several epochs of training are necessary to achieve a reasonable performance.

\newsubsubsection{Momentum and Nesterov Momentum}{dnn:gradient_descent:momentum}

The method of momentum \cite{POLYAK19641} is commonly used to accelerate learning with gradient descent. Instead of performing parameter updates on the negative gradient, a velocity term $\vm{v}$ is used to update the parameters. The velocity $\vm{v}$ accumulates an exponential moving average of past gradients. By computing the moving average, the velocity $\vm{v}$ averages out oscillations of the gradient while preserving the direction towards the minimum. An illustration of gradient descent with momentum is depicted in Figure~\ref{fig:gd_momentum}.

\end{align}

A hyperparameter $\alpha \in [0, 1)$ controls how fast previous gradients decay.

The hyperparameter $\alpha$ is usually a fixed value with common values being $0.5$, $0.9$ or $0.99$.
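The momentum update can be sketched in a few lines of Python. The example below (illustrative; the function name and the quadratic objective are hypothetical) minimizes $f(\theta)=\theta^2$ with $\alpha=0.9$:

```python
import numpy as np

def gd_momentum(grad_fn, theta, lr=0.1, alpha=0.9, steps=200):
    """Gradient descent with momentum (velocity = moving average of gradients)."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = alpha * v - lr * grad_fn(theta)  # accumulate velocity
        theta = theta + v                    # update parameters
    return theta

# Minimize the quadratic f(theta) = theta^2, whose gradient is 2 * theta.
theta = gd_momentum(lambda t: 2.0 * t, np.array([5.0]))
print(theta)  # close to 0
```

The velocity overshoots and oscillates around the minimum before settling, which is exactly the damped behavior sketched in Figure~\ref{fig:gd_momentum}.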

Another improvement of gradient descent, called Nesterov's accelerated gradient method, was introduced in \cite{Sutskever2013}. This method was inspired by \cite{Nesterov1983AMF} and changes where the gradient is evaluated at every iteration step. The gradient descent update rule with Nesterov momentum is given by


\path (L2-3) -- node{$\hdots$} (L3-3);

\path (L2-6) -- node{$\hdots$} (L3-6);

\end{neuralnetwork}

\caption{General structure of a \gls{mlp}. The \gls{mlp} takes the vector $\vm{x}$ as input and produces an output vector $\vm{\hat{y}}$. There are $L$ hidden layers. The outputs of the hidden layers cannot be observed directly.}

\Glspl{mbc} were introduced in MobileNetV2 \cite{Sandler2018} as a replacement for the depthwise separable convolutions used in MobileNetV1 \cite{Howard2017}. MobileNet models are a family of highly efficient models for mobile applications. Utilizing \glspl{mbc} as the main building block, MobileNetV2 attains a Top-1 accuracy of \SI{72}{\percent} on the ImageNet \cite{Imagenet} dataset using \SI{3.4}{\mega\nothing} parameters and a total of \SI{300}{\mega\nothing} \glspl{madds}. In comparison, MobileNetV1 only attains a Top-1 accuracy of \SI{70.6}{\percent} using \SI{4.2}{\mega\nothing} parameters and a total of \SI{575}{\mega\nothing} \glspl{madds}. Based on these promising results, we will utilize \glspl{mbc} from MobileNetV2 as the main building block for our resource efficient models in this thesis.

\glspl{mbc} consist of three separate convolutions: a $1\times1$ convolution, followed by a depthwise $k \times k$ convolution, followed again by a $1\times1$ convolution. The first $1\times1$ convolution consists of a convolutional layer followed by a batch normalization and the \gls{relu} activation function. The purpose of the first $1\times1$ convolution is to expand the number of channels by the expansion rate factor $e \geq1$ to transform the input feature map into a higher dimensional feature space. Then, a depthwise $k \times k$ convolution with stride $s$ is applied to the high dimensional feature map, followed by a batch normalization and the \gls{relu} activation. Finally, a $1\times1$ convolution is applied to the high dimensional feature map in order to reduce the number of channels by a factor of $e' \leq1$. If the input of the \gls{mbc} has the same shape as the output (e.g. when the stride is $s=1$ and $e=\frac{1}{e'}$), a residual connection is introduced. A residual connection adds the input of the \gls{mbc} to the output of the \gls{mbc}.
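The channel bookkeeping of the three stages can be sketched as an approximate parameter count (an illustration only; batch normalization and biases are ignored, and the expansion rate $e=6$ is a hypothetical example value):

```python
def mbc_param_count(c_in, c_out, k=3, e=6):
    """Approximate parameter count of an MBC block (illustrative sketch).

    1x1 expansion -> depthwise k x k -> 1x1 projection.
    Batch norm parameters and biases are ignored.
    """
    c_mid = c_in * e           # expansion by factor e >= 1
    expand = c_in * c_mid      # 1x1 convolution
    depthwise = k * k * c_mid  # one k x k filter per expanded channel
    project = c_mid * c_out    # 1x1 convolution back down
    return expand + depthwise + project

# c_in == c_out, so a residual connection is possible for stride 1.
print(mbc_param_count(24, 24))  # 8208
```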

When we build machine learning models such as \glspl{dnn}, we want the models to generalize well to previously unseen data.

A model may not be able to generalize well on the test dataset. If this happens, there are two reasons: underfitting or overfitting. Underfitting occurs when the model capacity is too low and the model struggles to reduce the error on the training dataset. Overfitting occurs when the model capacity is too large and the model \enquote{memorizes} the training data. Overfitting becomes evident during training when the gap between the training and test error increases. When the model capacity is right, the model is able to generalize well.

As an example to illustrate underfitting and overfitting, three models with different capacities are depicted in Figure~\ref{fig:over_underfit}. Every model was trained to fit a function to the noisy samples of a quadratic function. In the figure on the left, the capacity of the model is too low since only a linear function was used to fit the data and therefore the shape of the quadratic function is matched poorly. In the figure on the right, the model capacity is too large since a higher order polynomial was used to fit the data and therefore the model overfits to the noisy samples. The model on the right has the lowest training error but at the same time will perform poorly on unseen data. In the figure in the middle, the model capacity is right and the model neither overfits nor underfits to the data. The capacity of this model is appropriate and therefore, this model will perform best on unseen data.
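This behavior is easy to reproduce. The illustrative Python sketch below (not the experiment behind Figure~\ref{fig:over_underfit}) fits polynomials of increasing degree to noisy samples of a quadratic and shows that the training error keeps shrinking with capacity, even once the model is only fitting noise:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = x**2 + 0.1 * rng.normal(size=x.size)  # noisy samples of a quadratic

def train_error(degree):
    # Least squares polynomial fit; return mean squared training error.
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Training error shrinks monotonically with capacity: the degree-9
# polynomial fits the noise, yet generalizes worst on unseen data.
print(train_error(1), train_error(2), train_error(9))
```

A low training error alone is therefore not evidence of a good model; the gap to the error on held-out data is what matters.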

\begin{figure}

\centering

...

...


An intuitive approach to find the point of optimal capacity is to stop training when the test error increases. This approach is called early stopping. With early stopping, the dataset is split into three parts: the training dataset, the validation dataset and the test dataset. During training, the error on the validation set is monitored. If the error on the validation set does not decrease for some time or even increases, the training is stopped and the error on the test set is reported. Early stopping is a regularization method. In general, regularization is any modification we make to a learning algorithm that is intended to reduce its test error but not necessarily its training error \cite{Goodfellow2016}. Other methods for regularization, including weight decay and dropout, are explained in detail in Section~\ref{sec:dnn:regularization}.
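The stopping rule just described can be sketched as follows. This is an illustration only; the `patience` parameter (how many epochs without improvement to tolerate) is a common convention, not something prescribed above:

```python
def early_stopping(val_errors, patience=3):
    """Return the epoch at which training would stop (illustrative sketch).

    Stops once the validation error has not improved for `patience` epochs.
    """
    best, best_epoch = float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch = err, epoch        # new best validation error
        elif epoch - best_epoch >= patience:
            return epoch                         # no improvement: stop
    return len(val_errors) - 1

# Validation error starts rising after epoch 3 -> training stops at epoch 6.
print(early_stopping([0.9, 0.7, 0.6, 0.5, 0.55, 0.6, 0.7]))  # 6
```

In practice one would also keep the model weights from the best epoch rather than the last one.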

The pooling layer is another type of layer commonly used in \glspl{cnn}. The pooling layer combines nearby pixel values of a feature map using a summary statistic to reduce the spatial size of the feature map. The number of channels of the feature map is preserved. The most common summary statistics are the average and the maximum.

There are several reasons to use the pooling layer in \glspl{cnn}. The first reason is to reduce the spatial size of the feature map. Reducing the spatial size of the feature map reduces the computational cost in the later stages of the \gls{cnn}. Furthermore, pooling helps the model to be invariant to small translational changes of the input.

Figure~\ref{fig:max_avg_pooling} shows an example of max pooling (left) and average pooling (right) applied to a two-dimensional image. In this figure, the filter size and the stride were both set to $2\times2$.
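The pooling operation described above can be sketched in a few lines of plain Python. The function below is a minimal illustration, not an optimized implementation: it slides a square window over a two-dimensional image (a list of lists) and reduces each window with a summary statistic, here the maximum or the average.

```python
def pool2d(image, size=2, stride=2, stat=max):
    """Apply pooling with a square window to a 2-D image (list of lists).

    `stat` is the summary statistic applied to each window,
    e.g. `max` for max pooling or `avg` for average pooling.
    """
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            # collect the values covered by the current window
            window = [image[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(stat(window))
        out.append(row)
    return out

def avg(window):
    return sum(window) / len(window)

image = [[1, 3, 2, 1],
         [4, 2, 5, 7],
         [1, 0, 3, 2],
         [6, 8, 4, 9]]

pooled_max = pool2d(image, stat=max)  # -> [[4, 7], [8, 9]]
pooled_avg = pool2d(image, stat=avg)  # -> [[2.5, 3.75], [3.75, 4.5]]
```

With a filter size and stride of $2\times2$, the $4\times4$ input is reduced to a $2\times2$ output, matching the behavior illustrated in Figure~\ref{fig:max_avg_pooling}.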

\begin{figure}
\centering
% TikZ drawing of the pooling example omitted here.
\caption{Max pooling (left) and average pooling (right) applied to a two-dimensional image with a filter size and a stride of $2\times2$. Figure from~\cite{Thoma2021}.}
\label{fig:max_avg_pooling}
\end{figure}

Deep learning frameworks, e.g. PyTorch \cite{Torch2019}, also support adaptive pooling. In adaptive pooling, the filter size and stride are adapted automatically to produce a desired spatial output size. Adaptive pooling supports several summary statistics, most commonly the average and the maximum. A special case of adaptive pooling is global pooling, where the spatial output size is reduced to one. Since pooling preserves the number of channels, global pooling maps an input feature map of shape $C\times H\times W$ to an output feature map of size $C\times1\times1$. Global pooling is most commonly used after the convolutional part of a neural network, together with a flattening stage that reshapes the feature map to a one-dimensional output. Flattening takes the $C\times1\times1$ input and produces a vector of size $C$ that is then used as the input to the subsequent \gls{mlp} classifier.
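Global average pooling followed by flattening can be sketched as follows. This is a hypothetical pure-Python illustration operating on a feature map stored as nested lists of shape $C\times H\times W$; in practice one would use a framework primitive such as adaptive pooling.

```python
def global_avg_pool(fmap):
    """Global average pooling: reduce each C x H x W channel to a scalar.

    The result is already one-dimensional (length C), so global pooling
    and flattening together turn the feature map into a vector that can
    feed an MLP classifier.
    """
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in fmap
    ]

# Feature map with C=2 channels of spatial size 2x2.
fmap = [[[1, 2],
         [3, 4]],
        [[0, 0],
         [0, 8]]]

vector = global_avg_pool(fmap)  # -> [2.5, 2.0]
```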

A simple regularization technique in deep learning is early stopping. Early stopping relies on the observation that, given a model with sufficiently large capacity, the validation error follows a ``U''-shaped curve over the course of training. Early stopping simply selects the model with the lowest validation error.

One way to implement early stopping is to store the model parameters every time the validation error decreases. If the validation error does not decrease any more, training can be stopped. After training is finished, the model with the lowest validation error is selected.
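The procedure described above can be sketched as a simple loop. This is an illustrative sketch, not a framework API: `val_errors` stands for the per-epoch validation errors, and the checkpointing of model parameters is only indicated by a comment.

```python
def early_stopping(val_errors, patience=3):
    """Return (best_epoch, best_error) using early stopping.

    Training is stopped once the validation error has not improved
    for `patience` consecutive epochs.
    """
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch
            epochs_without_improvement = 0
            # here one would store a checkpoint of the model parameters
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training early
    return best_epoch, best_error

# Validation error improves until epoch 2, then stalls; with patience=3
# training stops before the (spurious) late improvement is ever reached.
result = early_stopping([0.9, 0.7, 0.6, 0.65, 0.64, 0.66, 0.7, 0.5],
                        patience=3)  # -> (2, 0.6)
```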

Weight decay or $L^2$ penalty is a type of parameter norm penalty. Parameter norm penalties are regularization approaches that limit the capacity of a model by preventing the model weights from becoming too large. Weights that are too large are often a sign that the capacity is too high and that the model overfits the training data. Parameter norm penalties are applied by adding a parameter norm penalty term $\Omega(\vm{\theta})$ to the loss function $\mathcal{L}(\vm{\theta})$. The new regularized loss function is denoted by $\mathcal{L}_r(\vm{\theta})$ and is described by
\begin{equation}
\mathcal{L}_r(\vm{\theta}) = \mathcal{L}(\vm{\theta}) + \alpha\,\Omega(\vm{\theta}),
\end{equation}
where the hyperparameter $\alpha$ weights the contribution of the penalty term.
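As a small numerical illustration of weight decay, the snippet below computes the $L^2$ penalty $\Omega(\vm{\theta}) = \frac{1}{2}\lVert\vm{\theta}\rVert_2^2$ and adds it, weighted by a coefficient $\alpha$, to an (assumed, already computed) loss value. The function names and the choice of the $\frac{1}{2}$ factor follow one common convention and are illustrative only.

```python
def l2_penalty(theta):
    """L2 parameter norm penalty: 0.5 * ||theta||^2."""
    return 0.5 * sum(w * w for w in theta)

def regularized_loss(loss, theta, alpha):
    """Add the weighted L2 penalty to an already-computed loss value."""
    return loss + alpha * l2_penalty(theta)

# Example: loss 1.0, weights [3, 4] (||theta||^2 = 25), alpha = 0.1.
total = regularized_loss(1.0, [3.0, 4.0], alpha=0.1)  # -> 2.25
```

Because the penalty grows with the squared magnitude of the weights, minimizing the regularized loss pushes the weights towards smaller values.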

The standard convolution takes a $C_{\mathrm{in}} \times H \times W$ input and applies $C_{\mathrm{out}}$ kernels of spatial size $k\times k$. The number of \glspl{madds} to compute the standard convolution is
\begin{equation}
C_{\mathrm{in}}\cdot H \cdot W \cdot k^2 \cdot C_{\mathrm{out}}.
\end{equation}

Depthwise separable convolutions are a drop-in replacement for the standard convolutional layers. However, the cost to compute the depthwise separable convolution, i.e. the sum of the depthwise and the pointwise convolution, is lower than the computational cost of the standard convolution. The number of \glspl{madds} to compute the depthwise separable convolution is

\begin{equation}
\label{dwise_sep_ops}
C_{\mathrm{in}}\cdot H \cdot W \cdot (k^2 + C_{\mathrm{out}}).
\end{equation}
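The savings implied by these two operation counts are easy to check numerically. The helper functions below simply evaluate the two formulas for an example layer; the concrete layer dimensions are chosen for illustration only.

```python
def std_conv_madds(c_in, h, w, k, c_out):
    """MAdds of a standard convolution: C_in * H * W * k^2 * C_out."""
    return c_in * h * w * k * k * c_out

def dws_conv_madds(c_in, h, w, k, c_out):
    """MAdds of a depthwise separable convolution:
    C_in * H * W * (k^2 + C_out), i.e. depthwise plus pointwise part."""
    return c_in * h * w * (k * k + c_out)

# Example layer: 32 input channels, 10x10 feature map, 3x3 kernel,
# 64 output channels.
std = std_conv_madds(32, 10, 10, 3, 64)   # -> 1843200
dws = dws_conv_madds(32, 10, 10, 3, 64)   # -> 233600
speedup = std / dws                        # roughly 7.9x fewer MAdds
```

For this example, the depthwise separable convolution needs roughly an eighth of the \glspl{madds} of the standard convolution, while acting as a drop-in replacement.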

Most \gls{dnn} based \gls{kws} models use hand-crafted feature extractors to extract speech features from an audio signal that are then further processed by a \gls{dnn} to perform classification. However, this approach may not be optimal since hand-crafted feature extractors may not capture all important speech characteristics needed to obtain the best classification performance. End-to-end \gls{kws} models replace the hand-crafted feature extractors and instead perform classification on the raw audio signal.

One example of an end-to-end model applied to speaker recognition is SincNet \cite{Ravanelli2018}. SincNet uses parametrized \textit{sinc} functions as kernels in the first layer to perform a convolution with the raw audio signal $x$. This first SincNet layer consisting of parametrized \textit{sinc} functions is sometimes called SincConv. For a single SincConv kernel, the output $y$ given the kernel $g[n, \theta]$ parametrized by $\theta$ is simply
\begin{equation}
y[n] = x[n] * g[n, \theta],
\end{equation}
i.e. the convolution of the raw input signal with the parametrized kernel.
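A minimal sketch of such a SincConv-style kernel is given below. It builds a band-pass kernel from two cutoff frequencies $f_1 < f_2$ (in SincNet these are the learnable parameters $\theta$) as the difference of two \textit{sinc} low-pass filters, and convolves it with a raw signal. The sketch deliberately ignores the windowing and normalization used in practice, and the helper names are illustrative, not the SincNet implementation.

```python
import math

def sinc(x):
    """Unnormalized sinc: sin(x)/x, with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(x) / x

def sinc_kernel(f1, f2, length):
    """Band-pass kernel as the difference of two sinc low-pass filters.

    f1 < f2 are normalized cutoff frequencies; `length` is odd so the
    kernel is centered around n = 0.
    """
    half = length // 2
    return [2 * f2 * sinc(2 * math.pi * f2 * n)
            - 2 * f1 * sinc(2 * math.pi * f1 * n)
            for n in range(-half, half + 1)]

def conv_valid(x, g):
    """Valid-mode sliding dot product; since the sinc kernel is symmetric
    this equals the convolution of x with g."""
    return [sum(x[i + j] * g[j] for j in range(len(g)))
            for i in range(len(x) - len(g) + 1)]

kernel = sinc_kernel(f1=0.1, f2=0.3, length=5)
# At n = 0 the kernel value is 2*f2 - 2*f1 = 0.4.
signal = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
filtered = conv_valid(signal, kernel)
```

Only the two cutoff frequencies per kernel are learned, which is why a SincConv layer has far fewer parameters than a standard convolutional layer of the same length.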

where $\vm{y}$ is the target and $\mathrm{softmax}(\vm{z}^m_s; T=1)$ is the softmax with temperature $T=1$ applied to the output of the $m$-th exit of the multi-exit network. The per-example distillation loss is defined as