An architecture is trained until convergence after selection by the \gls{nas} procedure to optimize the performance. We use the same hyperparameters as in the architecture search process.

\newsubsection{Results and Discussions}{results:endtoend:discussion}

\Fref{fig:sincconv_vs_conv1d} shows the test accuracy vs the number of \glspl{madds} of end-to-end \gls{kws} models using Conv1D or SincConv at the input layer. All depicted models were obtained by \gls{nas}. The number of parameters corresponds to the circle area. We can clearly see that models with SincConvs outperform models with 1D convolutions. Models with 1D convolutions fail to generalize well, independent of the number of channels selected in the first layer. Models with SincConvs, on the other hand, generalize well. Furthermore, the test accuracy of SincConv models increases with the number of channels in the first layer, just as was the case in \Fref{sec:results:nas}, where the test accuracy of the models increased with the number of \glspl{mfcc}. We conclude from these results that it is beneficial to introduce some prior knowledge into the first layer of an end-to-end \gls{kws} model.
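The prior knowledge introduced by a SincConv layer is that each input-layer kernel is constrained to a parametric band-pass filter, so only two cutoff frequencies per channel are learned instead of all kernel taps. A minimal numpy sketch of such a kernel, following the SincNet-style parameterization; the function name and the example cutoff frequencies are illustrative, not taken from the text:

```python
import numpy as np

def sinc_bandpass_kernel(f_low, f_high, kernel_size, sample_rate=16000):
    """Band-pass FIR kernel parameterized only by two cutoff frequencies,
    as used in SincNet-style SincConv layers. Only f_low and f_high (Hz)
    would be learned, instead of all kernel_size taps of a Conv1D."""
    fl = f_low / sample_rate   # normalized cutoffs in cycles/sample
    fh = f_high / sample_rate
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # difference of two low-pass sinc filters yields a band-pass filter
    h = 2 * fh * np.sinc(2 * fh * n) - 2 * fl * np.sinc(2 * fl * n)
    # Hamming window reduces ripple in the frequency response
    h *= np.hamming(kernel_size)
    return h / np.abs(h).sum()

# one kernel per output channel; cutoffs here are arbitrary examples
kernel = sinc_bandpass_kernel(300.0, 3000.0, kernel_size=101)
```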

\label{fig:sincconv_vs_conv1d}

\end{figure}

After assessing the performance differences between models with 1D convolutions and SincConvs, we will now look at the performance of SincConv models compared to models using \glspl{mfcc} from \Fref{sec:results:nas}. \Fref{fig:sincconv_vs_mfcc} shows the test accuracy vs the number of \glspl{madds} of end-to-end \gls{kws} models compared to \gls{kws} models using \glspl{mfcc}. Only Pareto-optimal models are shown. The number of parameters corresponds to the circle area. We can observe that SincConv models contribute to the Pareto frontier. However, we also have to note that SincConv models do not reach the same performance as \gls{mfcc} models when the number of \glspl{madds} is similar. We can again observe the tradeoff between model size and accuracy, where larger models have a higher test accuracy than smaller models.

\begin{figure}

\centering

...

...

\centering

\begin{tabular}{ccccc}

\toprule

SincConv Kernels&$\beta$&\glspl{madds}& Parameters & Test Accuracy\\

\caption{Topology of the model used for performing multi-exit classification.}

\label{fig:multiex_model}

\end{figure}

...

...

\label{tab:exit_topology}

\end{table}

Models are trained with a learning rate of 0.05 and the stochastic gradient descent optimizer. During the course of training, the learning rate is decayed according to a cosine schedule. We use a batch size of 100 for training. For knowledge distillation, soft-labels are generated from a teacher model. The teacher model was obtained with \gls{nas} and achieves an accuracy of \SI{96.55}{\percent}. We weight the classification and distillation loss according to \Fref{eq:dist_weight} with $\alpha=0.3$.
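Since \Fref{eq:dist_weight} is not reproduced in this excerpt, the weighted objective can be sketched as follows, assuming the common convex combination of hard-label and soft-label cross entropy with weight $\alpha$:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, label, alpha=0.3):
    """Hypothetical form of the weighted loss: convex combination of the
    hard-label cross entropy and the soft-label (distillation) loss."""
    p_s = softmax(student_logits)
    # hard-label cross entropy against the ground-truth class
    ce = -np.log(p_s[label] + 1e-12)
    # soft-label cross entropy against the teacher's output distribution
    p_t = softmax(teacher_logits)
    dist = -(p_t * np.log(p_s + 1e-12)).sum()
    # alpha = 0.3 as in the text
    return alpha * dist + (1.0 - alpha) * ce

loss = distillation_loss(np.array([2.0, 0.5, -1.0]),
                         np.array([3.0, 0.0, -2.0]), label=0)
```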

\newsubsection{Results and Discussions}{results:multiexit:discussion}

\Fref{fig:multi_exit_compare_exits} shows the test accuracies at certain exits of a multi-exit model using different exit topologies. The model used in this figure was pretrained and then fixed before the exits were attached and trained. As expected, different exit topologies result in different test accuracies. We observe that exits with ordinary convolutions perform better than exits with depthwise separable convolutions. Furthermore, using more than one linear layer seems to be beneficial. However, the difference in accuracy becomes negligible between two and three linear layers. Since depthwise separable convolutions are more resource efficient than ordinary convolutions, we chose the exit with depthwise separable convolutions and two linear layers as the exit for subsequent experiments.
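The resource argument for the chosen exit can be made concrete by comparing weight counts of the two convolution types; the channel and kernel numbers below are illustrative, not taken from the exit topology table:

```python
def conv_params(c_in, c_out, k):
    """Weights of an ordinary k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    return c_in * k * k + c_in * c_out

# illustrative configuration: 64 -> 64 channels, 3 x 3 kernel
ordinary = conv_params(64, 64, 3)                  # full convolution
separable = depthwise_separable_params(64, 64, 3)  # depthwise separable
```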

\begin{figure}

\centering

...

...

Finally, we look at the resource consumption of the multi-exit model. Of course, adding exit layers to an existing \gls{dnn} increases the memory requirements. In our case, the exits alone need \SI{123.92}{\kilo\nothing} parameters while the model itself without exits uses only \SI{75.71}{\kilo\nothing} parameters. This results in an increase in memory consumption of more than 2.5 times. This may be detrimental for some applications. However, with an increase in memory consumption comes more flexibility during the forward pass at runtime. \Fref{tab:exit_params_ops} shows the number of parameters and \glspl{madds} for every exit as well as the cumulative number of \glspl{madds} to compute a prediction at certain exits. In the ideal case, more than two times the number of \glspl{madds} can be saved by using the prediction at exit 1 instead of passing the input through the whole model up to the final exit.
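The quoted memory overhead follows directly from the two parameter counts:

```python
# parameter counts from the text, in k parameters
base_params = 75.71    # model without exits
exit_params = 123.92   # all exit layers combined

# total memory grows to roughly 2.6x the exit-free model
factor = (base_params + exit_params) / base_params
```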

In this section, we explore the accuracy-resource tradeoff by utilizing \gls{nas}. We will compare the models found with \gls{nas} in terms of accuracy, number of parameters and number of \glspl{madds} needed per forward-pass. We will focus on \gls{kws} models that are supplied with \glspl{mfcc}. Using the raw audio signal instead of \glspl{mfcc} has many potential benefits. \Fref{sec:results:endtoend} will explore end-to-end \gls{kws} models with \gls{nas} where the models are supplied with the raw audio signal instead of \glspl{mfcc}.

Before extracting \glspl{mfcc}, we first augment the raw audio signal according to \Fref{sec:dataset:augmentation}. Then, we filter the audio signal with a lowpass filter with a cutoff frequency of $f_L=\SI{4}{\kilo\hertz}$ and a highpass filter with a cutoff frequency of $f_H=\SI{20}{\hertz}$ to remove any frequency bands that do not provide any useful information. For windowing, a Hann window with a window length of \SI{40}{\milli\second} and a stride of \SI{20}{\milli\second} is used. Then we extract either 10, 20, 30 or 40 \glspl{mfcc} from the audio signal. The number of Mel filters is selected to be 40 regardless of the number of \glspl{mfcc} extracted. We do not extract any dynamic \glspl{mfcc} such as delta and delta-delta coefficients.
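The feature extraction described above can be sketched in plain numpy. This is a simplified illustration: augmentation is omitted, and the 20 Hz to 4 kHz band limits are enforced in the Mel filterbank rather than by filtering the waveform.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_mfcc=10, n_mels=40,
         win_ms=40.0, hop_ms=20.0, f_min=20.0, f_max=4000.0):
    """Hann-windowed frames (40 ms window, 20 ms stride), a 40-filter
    Mel filterbank restricted to 20 Hz - 4 kHz, log energies and a
    DCT-II keeping the first n_mfcc static coefficients (no deltas)."""
    win = int(sr * win_ms / 1000)          # 640 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)          # 320 samples
    n_fft = win
    # slice the signal into overlapping, Hann-windowed frames
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(win)
    # power spectrum per frame
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # triangular Mel filterbank between f_min and f_max
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(spec @ fbank.T + 1e-10)
    # DCT-II over the Mel channels, keep the first n_mfcc coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (k[None, :] + 0.5)
                 * np.arange(n_mfcc)[:, None])
    return logmel @ dct.T

feats = mfcc(np.random.randn(16000))   # 1 s of audio -> 49 frames
```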

The overparameterized model used in this section is shown in Table~\ref{tab:model_structure}. The model consists of three stages: (i) an input stage, (ii) an intermediate stage, and (iii) an output stage. Stages (i) and (iii) are fixed to a 5$\times$11 convolution and a 1$\times$1 convolution, respectively. The number of \glspl{madds} of the whole model is lower bounded by the number of \glspl{madds} of stages (i) and (iii), which, however, is negligible. We apply batch normalization followed by the \gls{relu} non-linearity as an activation function after the convolutions of stages (i) and (iii). We use \glspl{mbc} as our main building blocks in stage (ii). Only convolutions from stage (ii) are optimized during \gls{nas}.

During NAS, we allow \glspl{mbc} with expansion rates $e \in\{1,2,3,4,5,6\}$ and kernel sizes $k \in\{3,5,7\}$ for selection. We also include the zero operation, which essentially means that the layer is skipped. Therefore, our overparameterized model has $\#e\cdot\#k +1=19$ binary gates per layer. For blocks where the input feature map size is equal to the output feature map size, we include skip connections. If a zero operation is selected as a block by \gls{nas}, the skip connection allows this particular layer to be skipped, resulting in an identity layer.
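The per-layer candidate set can be enumerated directly; the tuple encoding below is illustrative:

```python
from itertools import product

# candidate operations per searchable layer: every combination of
# MBC expansion rate and kernel size, plus the zero operation
expansion_rates = [1, 2, 3, 4, 5, 6]
kernel_sizes = [3, 5, 7]

candidates = [("mbc", e, k) for e, k in product(expansion_rates, kernel_sizes)]
candidates.append(("zero", 0, 0))  # selecting this skips the layer

n_gates = len(candidates)  # #e * #k + 1 = 19 binary gates per layer
```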

Our overparameterized model has the same number of channels across all layers except for the last 1$\times$1 convolution, where the number of feature maps is doubled with respect to the number of feature maps in the previous layers. We use models with 72 channels in stages (i) and (ii) and 144 channels in stage (iii).

...

...

where $\mathcal{L}_{\mathrm{CE}}$ is the cross entropy loss, $\mathrm{ops}_{\mathrm{exp}}$ the expected number of \glspl{madds}, $\mathrm{ops}_{\mathrm{target}}$ the target number of \glspl{madds} and $\beta$ the regularization parameter. The expected number of \glspl{madds} $\mathrm{ops}_{\mathrm{exp}}$ is the overall number of \glspl{madds} expected to be used in the convolutions and in the fully connected layer based on the probabilities $p_i$ of the final model. For stages (i) and (iii) we simply sum the number of \glspl{madds} used in the convolutions and in the fully connected layer. For stage (ii) layers, the expected number of \glspl{madds} is defined as the sum of the \glspl{madds} used in each candidate operation $o_i$, weighted by the probabilities $p_i$.

Establishing the accuracy-resource tradeoff using Equation~\ref{eq:regularize} is usually achieved by fixing $\mathrm{ops}_{\mathrm{target}}$ and choosing $\beta$ such that the number of \glspl{madds} of the final model after NAS is close to $\mathrm{ops}_{\mathrm{target}}$. If $\mathrm{ops}_{\mathrm{target}}$ is close to $\mathrm{ops}_{\mathrm{exp}}$, the right term of Equation~\ref{eq:regularize} becomes one, thus leaving $\mathcal{L}_{\mathrm{CE}}$ unchanged. However, if $\mathrm{ops}_{\mathrm{exp}}$ is larger or smaller than $\mathrm{ops}_{\mathrm{target}}$, $\mathcal{L}_{\mathrm{CE}}$ is scaled up or down, respectively.
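Since Equation~\ref{eq:regularize} itself is not reproduced in this excerpt, the sketch below assumes a multiplicative form consistent with the behavior just described: the cross entropy is scaled by $(\mathrm{ops}_{\mathrm{exp}}/\mathrm{ops}_{\mathrm{target}})^{\beta}$, which equals one when the expected cost matches the target.

```python
import numpy as np

def expected_madds(fixed_ops, layer_probs, layer_op_madds):
    """Expected MADDS: the fixed cost of stages (i) and (iii) plus, for
    every stage-(ii) layer, the MADDS of each candidate operation o_i
    weighted by its selection probability p_i."""
    exp_ops = fixed_ops
    for p, ops in zip(layer_probs, layer_op_madds):
        exp_ops += float(np.dot(p, ops))
    return exp_ops

def regularized_loss(ce_loss, ops_exp, ops_target, beta):
    """Assumed multiplicative regularizer: equals ce_loss when
    ops_exp == ops_target, scales it up (down) for costlier (cheaper)
    expected models; the thesis' exact equation may differ."""
    return ce_loss * (ops_exp / ops_target) ** beta
```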

...

...

A model is trained until convergence after selection by the \gls{nas} procedure to optimize the performance. We use the same hyperparameters as in the \gls{nas} process.

\newsubsection{Results and Discussions}{results:nas:discussion}

\Fref{fig:nas_mfcc_results} shows the models found using \gls{nas} for different tradeoffs $\beta\in\{0, 1, 2, 4, 8, 16\}$ using 10 (red), 20 (blue), 30 (yellow) and 40 (green) \glspl{mfcc}. Models on the Pareto frontier are emphasized by full color. The model size corresponds to the circle area. As already mentioned in the introduction of this section, the accuracy of a model is positively correlated with the amount of resources needed. Furthermore, we can observe that using more \glspl{mfcc} with larger tradeoff values $\beta$ produces models with higher accuracies than increasing the model size with fewer \glspl{mfcc} at lower tradeoff values. For example, all but the smallest 10 \gls{mfcc} models are dominated by the smallest 20 \gls{mfcc} model in terms of accuracy, number of parameters and number of \glspl{madds}.

\caption{Test accuracy vs number of \glspl{madds} for models obtained using \gls{nas} with different tradeoffs $\beta\in\{0, 1, 2, 4, 8, 16\}$. \gls{nas} was performed with 10 (red), 20 (blue), 30 (yellow) and 40 (green) \glspl{mfcc}. The number of parameters corresponds to the circle area. Models on the Pareto frontier are emphasized by full color.}

In \Fref{sec:results:nas} we investigated the use of \gls{nas} for finding resource efficient \glspl{dnn} for \gls{kws}. We argued that \gls{nas} may be used when a classification task requires the deployment of several models on devices with different computational capabilities. Another factor to consider when designing resource efficient \glspl{dnn} for \gls{kws} is weight and activation quantization. When quantization is employed, weights and activations are transformed from 32-bit floating-point numbers to low-precision fixed-point numbers with bitwidths ranging from 1 to 8 bits.

Quantization of weights and activations serves different purposes. Quantization of the weights allows for a more compact model representation, i.e. the amount of memory required to store the weights decreases. On the other hand, quantization of the activations reduces the memory requirements for storing the activations at runtime. When both weights and activations are quantized, the forward pass can be accelerated by using efficient fixed-point operations instead of the more expensive floating-point operations.
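A sketch of symmetric uniform quantization illustrates the general idea; the actual quantization scheme used in the thesis may differ (e.g. for 1-bit binarization):

```python
import numpy as np

def quantize_uniform(x, bits):
    """Symmetric uniform quantization of a float tensor to a fixed-point
    grid with the given bitwidth (1-8 bits). Returns the dequantized
    values, the integer codes, and the scale factor."""
    assert 1 <= bits <= 8
    # number of positive levels in a signed representation
    levels = 2 ** (bits - 1) - 1 if bits > 1 else 1
    scale = np.abs(x).max() / levels
    # round to the nearest grid point and clamp to the signed range
    q = np.clip(np.round(x / scale), -levels, levels)
    return q * scale, q.astype(np.int8), scale

w = np.random.randn(64).astype(np.float32)
w_hat, codes, scale = quantize_uniform(w, bits=8)
```

Storing the `int8` codes plus one scale factor instead of 32-bit floats is what yields the compact model representation described above.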