Commit 7fac8dd0 authored by David Peter's avatar David Peter

MAdds to MACs

parent 377984d4
......@@ -301,11 +301,11 @@
% --------------------------------------------------------------------------------------------------
% Resource Efficient Keyword Spotting
\newchapter{Resource Efficient Keyword Spotting}{kws}
There are many aspects to consider when designing resource efficient \glspl{dnn} for \gls{kws}. In this chapter, we will introduce five methods that address two very important aspects in resource efficient \glspl{dnn}: The amount of memory needed to store the parameters and the number of \gls{madds} per forward pass.
There are many aspects to consider when designing resource efficient \glspl{dnn} for \gls{kws}. In this chapter, we will introduce five methods that address two very important aspects in resource efficient \glspl{dnn}: The amount of memory needed to store the parameters and the number of \glspl{madds} per forward pass.
The amount of memory needed to store the parameters depends on two factors, the number of parameters and the associated bit width of the parameters. Decreasing the number of parameters is possible by either employing resource efficient convolutional layers or by establishing an accuracy-size tradeoff using a resource efficient \gls{nas} approach. \Fref{sec:kws:conv} deals with resource efficient convolutional layers whereas \Fref{sec:kws:nas} deals with the resource efficient \gls{nas} approach. Quantizing weights and activations to decrease the amount of memory needed is discussed in \Fref{sec:kws:quantization}.
The number of \gls{madds} per forward pass is another aspect that can not be neglected in resource efficient \gls{kws} applications. In most cases, the number \gls{madds} is tied to the latency of a model. By decreasing the number of \gls{madds}, we also implicitly decrease the latency of a model. Decreasing the number of \gls{madds} is again possible by either employing resource efficient convolutional layers or by establishing an accuracy-size tradeoff using a resource efficient \gls{nas} approach. Another method to decrease the \gls{madds} of a model is to replace the costly \gls{mfcc} extractor by a convolutional layer. This method is explained in \Fref{sec:kws:endtoend}. Finally, \Fref{sec:kws:multi_exit} will introduce multi-exit models. As we will see, multi-exit models allow us to dynamically reduce the number of \gls{madds} based on either the available computational resources or based on the complexity of the classification task.
The number of \glspl{madds} per forward pass is another aspect that cannot be neglected in resource efficient \gls{kws} applications. In most cases, the number of \glspl{madds} is tied to the latency of a model. By decreasing the number of \glspl{madds}, we also implicitly decrease the latency of a model. Decreasing the number of \glspl{madds} is again possible by either employing resource efficient convolutional layers or by establishing an accuracy-size tradeoff using a resource efficient \gls{nas} approach. Another method to decrease the \glspl{madds} of a model is to replace the costly \gls{mfcc} extractor by a convolutional layer. This method is explained in \Fref{sec:kws:endtoend}. Finally, \Fref{sec:kws:multi_exit} will introduce multi-exit models. As we will see, multi-exit models allow us to dynamically reduce the number of \glspl{madds} based on either the available computational resources or the complexity of the classification task.
\newsection{Resource Efficient Convolutional Layers}{kws:conv}
\input{\pwd/conv}
......
......@@ -9,7 +9,7 @@
\newacronym[]{dft}{DFT}{Discrete Fourier Transform}
\newacronym[]{dct}{DCT}{Discrete Cosine Transform}
\newacronym[shortplural=MBCs, longplural=Mobile Inverted Bottleneck Convolutions]{mbc}{MBC}{Mobile Inverted Bottleneck Convolution}
\newacronym[]{madds}{MAdds}{Multiply-Adds}
\newacronym[shortplural=MACs, longplural=Multiply-Accumulate Operations]{madds}{MAC}{Multiply-Accumulate Operation}
\newacronym[]{ste}{STE}{Straight-Through Estimator}
\newacronym[]{nas}{NAS}{Neural Architecture Search}
\newacronym[]{mfsts}{MFSTS}{Multi-Frame Shifted Time Similarity}
......
% **************************************************************************************************
% **************************************************************************************************
\Glspl{mbc} were introduced in MobileNetV2 \cite{Sandler2018} as a replacement for the depthwise separable convolutions used in MobileNetV1 \cite{Howard2017}. MobileNet models in general are a family of highly efficient models for mobile applications. Utilizing \gls{mbc} as the main building block, MobileNetV2 attains a Top-1 accuracy of $72.0\%$ on the ImageNet \cite{Imagenet} dataset using $3.4$M parameters and a total of $300$M \gls{madds}. In comparison, MobileNetV1 only attains a Top-1 accuracy of $70.6\%$ using $4.2$M parameters and a total of $575$M \gls{madds}. Based on these promising results, we will utilize \glspl{mbc} from MobileNetV2 as the main building block for our resource efficient models in this thesis.
\Glspl{mbc} were introduced in MobileNetV2 \cite{Sandler2018} as a replacement for the depthwise separable convolutions used in MobileNetV1 \cite{Howard2017}. MobileNet models in general are a family of highly efficient models for mobile applications. Utilizing \gls{mbc} as the main building block, MobileNetV2 attains a Top-1 accuracy of $72.0\%$ on the ImageNet \cite{Imagenet} dataset using $3.4$M parameters and a total of $300$M \glspl{madds}. In comparison, MobileNetV1 only attains a Top-1 accuracy of $70.6\%$ using $4.2$M parameters and a total of $575$M \glspl{madds}. Based on these promising results, we will utilize \glspl{mbc} from MobileNetV2 as the main building block for our resource efficient models in this thesis.
\glspl{mbc} consist of three separate convolutions: a 1$\times$1 convolution, followed by a depthwise $k \times k$ convolution, followed by another 1$\times$1 convolution. The first 1$\times$1 convolution consists of a convolutional layer followed by batch normalization and the \gls{relu} activation function. Its purpose is to expand the number of channels by the expansion rate factor $e \geq 1$, transforming the input feature map into a higher dimensional feature space. Then a depthwise $k \times k$ convolution with stride $s$ is applied to the high dimensional feature map, again followed by batch normalization and the \gls{relu} activation. Finally, a $1 \times 1$ convolution is applied to the high dimensional feature map in order to reduce the number of channels by a factor of $e' \leq 1$. If the input of the \gls{mbc} has the same shape as the output (e.g. when the stride is $s=1$ and $e=\frac{1}{e'}$), a residual connection is introduced, which adds the input of the \gls{mbc} to its output.
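To make this block structure concrete, the following Python sketch implements an \gls{mbc} in PyTorch. It is illustrative only; the linear (activation-free) projection after the last 1$\times$1 convolution follows MobileNetV2 and is an assumption, as are all names and defaults.
\begin{verbatim}
import torch
import torch.nn as nn

class MBC(nn.Module):
    def __init__(self, c_in, c_out, k=3, stride=1, e=6):
        super().__init__()
        c_mid = int(round(c_in * e))  # expanded number of channels
        self.use_residual = (stride == 1 and c_in == c_out)
        self.block = nn.Sequential(
            # 1x1 expansion: convolution + batch norm + ReLU
            nn.Conv2d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU(inplace=True),
            # k x k depthwise convolution with stride s
            nn.Conv2d(c_mid, c_mid, k, stride=stride, padding=k // 2,
                      groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU(inplace=True),
            # 1x1 projection back to c_out channels (linear, as in MobileNetV2)
            nn.Conv2d(c_mid, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y
\end{verbatim}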
......
% **************************************************************************************************
% **************************************************************************************************
Convolutional layers are the key building blocks of modern \glspl{dnn}. Therefore, many research efforts have been dedicated to reducing the number of \gls{madds} and the number of parameters of convolutional layers. The basic principle of how the standard convolution, the depthwise separable convolution and the \gls{mbc} work has already been explained in \Fref{sec:dnn:cnn}. In this section, we will compare the standard convolution, the depthwise separable convolution and the \gls{mbc} in terms of \gls{madds} and number of parameters.
Convolutional layers are the key building blocks of modern \glspl{dnn}. Therefore, many research efforts have been dedicated to reducing the number of \glspl{madds} and the number of parameters of convolutional layers. The basic principle of how the standard convolution, the depthwise separable convolution and the \gls{mbc} work has already been explained in \Fref{sec:dnn:cnn}. In this section, we will compare the standard convolution, the depthwise separable convolution and the \gls{mbc} in terms of \glspl{madds} and number of parameters.
The standard convolution takes a $C_{\mathrm{in}} \times H \times W$ input and applies a convolutional kernel $C_{\mathrm{out}} \times C_{\mathrm{in}} \times k \times k$ to produce a $C_{\mathrm{out}} \times H \times W$ output \cite{Sandler2018}. The number of \gls{madds} to compute the standard convolution is
The standard convolution takes a $C_{\mathrm{in}} \times H \times W$ input and applies a convolutional kernel $C_{\mathrm{out}} \times C_{\mathrm{in}} \times k \times k$ to produce a $C_{\mathrm{out}} \times H \times W$ output \cite{Sandler2018}. The number of \glspl{madds} to compute the standard convolution is
\begin{equation}
C_{\mathrm{in}} \cdot H \cdot W \cdot k^2 \cdot C_{\mathrm{out}}.
\end{equation}
Depthwise separable convolutions are a drop-in replacement for the standard convolution layers. However, the cost to compute the depthwise separable convolution, i.e. the sum of the depthwise and the pointwise convolution, is lower than the computational cost of the standard convolution. The number of \gls{madds} to compute the depthwise separable convolution is
Depthwise separable convolutions are a drop-in replacement for the standard convolution layers. However, the cost to compute the depthwise separable convolution, i.e. the sum of the depthwise and the pointwise convolution, is lower than the computational cost of the standard convolution. The number of \glspl{madds} to compute the depthwise separable convolution is
\begin{equation}
\label{dwise_sep_ops}
C_{\mathrm{in}} \cdot H \cdot W \cdot (k^2 + C_{\mathrm{out}})
\end{equation}
which is the sum of the depthwise and the pointwise convolution. Depthwise separable convolutions reduce the computational cost by a factor of $k^2 \cdot C_{\mathrm{out}} / (k^2 + C_{\mathrm{out}})$, which approaches $k^2$ when $C_{\mathrm{out}} \gg k^2$.
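As a concrete example, for a kernel size of $k=3$ and $C_{\mathrm{out}}=64$ output channels, the reduction factor is $9 \cdot 64 / (9 + 64) \approx 7.9$, already close to the upper bound of $k^2 = 9$.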
\glspl{mbc} consist of three separate convolutions, a pointwise convolution followed by a depthwise convolution followed by another pointwise convolution. Given an input with dimension $C_{\mathrm{in}} \times H \times W$, the expansion rate $e$, kernel size $k$ and the output dimension of $C_{\mathrm{out}} \times H \times W$, the number of \gls{madds} to compute the \gls{mbc} is
\glspl{mbc} consist of three separate convolutions, a pointwise convolution followed by a depthwise convolution followed by another pointwise convolution. Given an input with dimension $C_{\mathrm{in}} \times H \times W$, the expansion rate $e$, kernel size $k$ and the output dimension of $C_{\mathrm{out}} \times H \times W$, the number of \glspl{madds} to compute the \gls{mbc} is
\begin{equation}
C_{\mathrm{in}} \cdot H \cdot W \cdot e \cdot (C_{\mathrm{in}} + k^2 + C_{\mathrm{out}}).
\end{equation}
Compared to Equation~\ref{dwise_sep_ops}, the \gls{mbc} has an extra term. However, with \glspl{mbc} we can utilize much smaller input and output dimensions without sacrificing performance and therefore reduce the number of \gls{madds}. \Fref{tab:conv_ops_params} compares the number of \gls{madds} and number of parameters for the standard convolution, the depthwise separable convolution and the \gls{mbc}.
Compared to Equation~\ref{dwise_sep_ops}, the \gls{mbc} has an extra term. However, with \glspl{mbc} we can utilize much smaller input and output dimensions without sacrificing performance and therefore reduce the number of \glspl{madds}. \Fref{tab:conv_ops_params} compares the number of \glspl{madds} and number of parameters for the standard convolution, the depthwise separable convolution and the \gls{mbc}.
\begin{table}
\centering
\begin{tabular}{lll}
\toprule
& \gls{madds} & Parameters \\
& \glspl{madds} & Parameters \\
\midrule
Standard conv. & $C_{\mathrm{in}} \cdot H \cdot W \cdot k^2 \cdot C_{\mathrm{out}}$ & $C_{\mathrm{in}} \cdot k^2 \cdot C_{\mathrm{out}}$ \\
Depthwise separable conv. & $C_{\mathrm{in}} \cdot H \cdot W \cdot (k^2 + C_{\mathrm{out}})$ & $C_{\mathrm{in}} \cdot (k^2 + C_{\mathrm{out}})$ \\
\gls{mbc} & $C_{\mathrm{in}} \cdot H \cdot W \cdot e \cdot (C_{\mathrm{in}} + k^2 + C_{\mathrm{out}})$ & $C_{\mathrm{in}} \cdot e \cdot (C_{\mathrm{in}} + k^2 + C_{\mathrm{out}})$ \\
\bottomrule
\end{tabular}
\caption{\gls{madds} and parameters for three different convolutional layers.}
\caption{\glspl{madds} and parameters for three different convolutional layers.}
\label{tab:conv_ops_params}
\end{table}
\ No newline at end of file
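The expressions from \Fref{tab:conv_ops_params} can be evaluated directly, as in the following Python sketch (illustrative only; the function names and the example dimensions are arbitrary and not part of the thesis):
\begin{verbatim}
# MAC counts from the table above (biases and batch normalization ignored)

def macs_standard(c_in, c_out, h, w, k):
    return c_in * h * w * k**2 * c_out

def macs_depthwise_separable(c_in, c_out, h, w, k):
    return c_in * h * w * (k**2 + c_out)

def macs_mbc(c_in, c_out, h, w, k, e):
    return c_in * h * w * e * (c_in + k**2 + c_out)

# Example: 32x32 feature map, 3x3 kernel, 64 -> 64 channels, expansion rate 4
args = dict(c_in=64, c_out=64, h=32, w=32, k=3)
print(macs_standard(**args))             # 37,748,736
print(macs_depthwise_separable(**args))  # 4,784,128
print(macs_mbc(**args, e=4))             # 35,913,728
\end{verbatim}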
% **************************************************************************************************
% **************************************************************************************************
\gls{nas} has been successfully used to find models that are specifically tailored to the underlying hardware \cite{Cai2019,Tan2019} by, for instance, minimizing memory requirements, number of \gls{madds}, or latency of the resulting model. Therefore, \gls{nas} techniques are well-suited for finding resource efficient \glspl{dnn}. Popular \gls{nas} approaches use concepts such as reinforcement learning \cite{Pham2018},
\gls{nas} has been successfully used to find models that are specifically tailored to the underlying hardware \cite{Cai2019,Tan2019} by, for instance, minimizing memory requirements, number of \glspl{madds}, or latency of the resulting model. Therefore, \gls{nas} techniques are well-suited for finding resource efficient \glspl{dnn}. Popular \gls{nas} approaches use concepts such as reinforcement learning \cite{Pham2018},
gradient based methods \cite{Liu2019} or even evolutionary methods \cite{Liu2018} for exploring the search space.
Conventional \gls{nas} methods require thousands of architectures to be trained to find the final architecture. Not only is this very time-consuming but in some cases even infeasible, especially for large-scale tasks such as ImageNet or when computational resources are limited. Therefore, several proposals have been made to reduce the computational overhead. So-called proxy tasks for reducing the search cost are training for fewer epochs, training on a smaller dataset or learning a smaller model that is then scaled up. As an example, \cite{Zoph2018} searches for the best convolutional layer on the CIFAR10 dataset which is then stacked and applied to the much larger ImageNet dataset. Many other \gls{nas} methods implement this technique with great success \cite{Liu2018a,Real2019,Cai2018,Liu2019,Tan2019,Luo2018}. Another approach, called EfficientNet \cite{Tan2019a}, employs the \gls{nas} approach from \cite{Tan2019} to find a small baseline model which is subsequently scaled up. By scaling the three dimensions depth, width and resolution of the small baseline model, they obtain state-of-the-art performance.
......@@ -8,7 +8,7 @@ Conventional \gls{nas} methods require thousands of architectures to be trained
Although model scaling and stacking achieves good performances, the convolutional layers optimized on the proxy task may not be optimal for the target task. Therefore, several approaches have been proposed to get rid of proxy tasks. Instead, architectures are directly trained on the target task and optimized for the hardware at hand. In \cite{Cai2019} ProxylessNAS is introduced where an overparameterized network with multiple parallel candidate operations per layer is used as the base model. Every candidate operation is gated by a binary gate which either prunes or keeps the operation. During architecture search, the binary gates are then trained such that only one operation per layer remains and the targeted memory and latency requirements are met. \Fref{sec:kws:nas:proxyless} will explore ProxylessNAS in more detail.
\newsubsection{ProxylessNAS}{kws:nas:proxyless}
ProxylessNAS is a multi-objective \gls{nas} approach from \cite{Cai2019} that we will use to find resource efficient \gls{kws} models. In ProxylessNAS, hardware-aware optimization is performed to optimize the model accuracy and latency on different hardware platforms. However, in this thesis, we optimize for accuracy and the number of \gls{madds} and not for latency. By optimizing for the number of \gls{madds}, the model latency and the number of parameters is implicitly optimized as well. The trade-off between the accuracy and the number of \gls{madds} is established by regularizing the architecture loss as explained in \Fref{sec:results:nas:experimental}.
ProxylessNAS is a multi-objective \gls{nas} approach from \cite{Cai2019} that we will use to find resource efficient \gls{kws} models. In ProxylessNAS, hardware-aware optimization is performed to optimize the model accuracy and latency on different hardware platforms. However, in this thesis, we optimize for accuracy and the number of \glspl{madds} rather than for latency. By optimizing for the number of \glspl{madds}, the model latency and the number of parameters are implicitly optimized as well. The trade-off between the accuracy and the number of \glspl{madds} is established by regularizing the architecture loss as explained in \Fref{sec:results:nas:experimental}.
ProxylessNAS constructs an overparameterized model with multiple parallel candidate operations per layer as the base model. A single operation is denoted by $o_i$. Every operation is assigned a real-valued architecture parameter $\alpha_i$. The $N$ architecture parameters are transformed to probability values $p_i$ by applying the softmax function.
Every candidate operation is gated by a binary gate $g_i$ which either prunes or keeps the operation. Only one gate per layer is active at a time. Gates are sampled randomly according to
......
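To illustrate this sampling step, the following Python sketch (numpy, illustrative only) draws one active gate per layer: the architecture parameters $\alpha_i$ are turned into probabilities $p_i$ by the softmax, and a single candidate operation is kept with probability $p_i$, as described above.
\begin{verbatim}
import numpy as np

def sample_gates(alpha):
    # softmax over the architecture parameters alpha_i
    p = np.exp(alpha - np.max(alpha))
    p = p / p.sum()
    # one-hot binary gates g_i: exactly one candidate operation stays active
    g = np.zeros_like(p)
    g[np.random.choice(len(p), p=p)] = 1.0
    return p, g

p, g = sample_gates(np.array([0.3, 1.2, -0.5, 0.0]))
\end{verbatim}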
......@@ -4,7 +4,7 @@
However, hand-crafted speech features such as \gls{mfsts} and \glspl{mfcc} may not be optimal for \gls{kws}. Therefore, recent works have proposed to directly feed a \gls{dnn} with raw audio waveforms. In \cite{Ravanelli2018}, a \gls{cnn} for speaker recognition is proposed that encourages to learn parametrized sinc functions as kernels in the first layer. This layer is referred to as SincConv layer. During training, a low and high cutoff frequency per kernel is determined. Therefore, a custom filter bank is derived by training the SincConv layer that is specifically tailored to the desired application. SincConvs have also been recently applied to \gls{kws} tasks \cite{Mittermaier2020}.
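To give an impression of what a SincConv kernel looks like, the following Python sketch (numpy; the Hamming window and the sampling rate of \SI{16}{\kilo\hertz} are assumptions) constructs a single band-pass kernel from a low and a high cutoff frequency. In the actual SincConv layer, these two frequencies are the only trainable parameters per kernel and the kernels are applied as a 1D convolution over the raw waveform.
\begin{verbatim}
import numpy as np

def sinc_kernel(f_low, f_high, kernel_size=101, sr=16000):
    # normalized cutoff frequencies in cycles per sample
    f1, f2 = f_low / sr, f_high / sr
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # ideal band-pass impulse response: difference of two low-pass sinc filters
    band_pass = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    return band_pass * np.hamming(kernel_size)

kernel = sinc_kernel(f_low=300.0, f_high=3000.0)  # one band-pass kernel
\end{verbatim}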
In this section, we will search for end-to-end \gls{kws} models using \gls{nas}. The end-to-end models obtained by \gls{nas} will use a SincConv as the input layer. Instead of \glspl{mfcc}, end-to-end models are supplied with the raw audio waveforms instead of hand-crafted speech features. We will compare models obtained by \gls{nas} using SincConvs with models using ordinary 1D convolutions to assess whether SincConvs pose any benefit over 1D convolutions. Furthermore, this section will compare end-to-end models with models using \glspl{mfcc} from \Fref{sec:results:nas} in terms of accuracy, number of \gls{madds} and number of parameters.
In this section, we will search for end-to-end \gls{kws} models using \gls{nas}. The end-to-end models obtained by \gls{nas} will use a SincConv as the input layer. These end-to-end models are supplied with the raw audio waveforms instead of hand-crafted speech features such as \glspl{mfcc}. We will compare models obtained by \gls{nas} using SincConvs with models using ordinary 1D convolutions to assess whether SincConvs pose any benefit over 1D convolutions. Furthermore, this section will compare end-to-end models with models using \glspl{mfcc} from \Fref{sec:results:nas} in terms of accuracy, number of \glspl{madds} and number of parameters.
\newsubsection{Experimental Setup}{results:endtoend:setup}
\Fref{tab:model_structure_endtoend} shows the overparameterized model used in this section. This model is supplied with raw audio waveforms instead of \glspl{mfcc}. It consists of five stages with two input stages (i) and (ii), two intermediate stages (iii) and (iv), and one output stage (v). Stages (i), (ii) and (v) are fixed whereas stages (iii) and (iv) are optimized using \gls{nas}. \glspl{mbc} are used as the main building blocks in stages (iii) and (iv). Stride is only applied to the first convolution of each stage.
......@@ -38,24 +38,24 @@ We pretrain the architecture blocks for 40 epochs at a learning rate of 0.05 bef
An architecture is trained until convergence after selection by the \gls{nas} procedure to optimize the performance. We use the same hyperparameters as in the architecture search process.
\newsubsection{Results and Discussions}{results:endtoend:discussion}
\Fref{fig:sincconv_vs_conv1d} shows the test accuracy vs the number of \gls{madds} of end-to-end \gls{kws} models using Conv1D or SincConv at the input layer. Models depicted were obtained by \gls{nas}. The number of parameters corresponds to the circle area. We can clearly see that models with SincConvs outperform models with 1D convolutions. Models with 1D convolutions fail to generalize well independent of the number of channels selected in the first layer. Models with SincConvs on the other hand generalize well. Furthermore, the test accuracy of SincConv models increases with the number of channels in the first layer as this was the case in \Fref{sec:results:nas}, where the test accuracy of the models increased by increasing the number of \glspl{mfcc}. We follow from this results that it is beneficial to introduce some prior knowledge into the first layer of an end-to-end \gls{kws} model.
\Fref{fig:sincconv_vs_conv1d} shows the test accuracy vs the number of \glspl{madds} of end-to-end \gls{kws} models using Conv1D or SincConv at the input layer. Models depicted were obtained by \gls{nas}. The number of parameters corresponds to the circle area. We can clearly see that models with SincConvs outperform models with 1D convolutions. Models with 1D convolutions fail to generalize well, independent of the number of channels selected in the first layer. Models with SincConvs on the other hand generalize well. Furthermore, the test accuracy of SincConv models increases with the number of channels in the first layer, as was the case in \Fref{sec:results:nas}, where the test accuracy of the models increased with the number of \glspl{mfcc}. We conclude from these results that it is beneficial to introduce some prior knowledge into the first layer of an end-to-end \gls{kws} model.
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{\pwd/plots/sincconv_vs_conv1d.png}
\caption{Test accuracy vs number of \gls{madds} of end-to-end \gls{kws} models using Conv1D or SincConv at the input layer. The models depicted in this figure were obtained using \gls{nas}. The number of parameters corresponds to the circle area.}
\caption{Test accuracy vs number of \glspl{madds} of end-to-end \gls{kws} models using Conv1D or SincConv at the input layer. The models depicted in this figure were obtained using \gls{nas}. The number of parameters corresponds to the circle area.}
\label{fig:sincconv_vs_conv1d}
\end{figure}
After assessing the performance differences between models with 1D convolutions and SincConvs, we will now look at the performance of SincConv models compared to models using \glspl{mfcc} from \Fref{sec:results:nas}. \Fref{fig:sincconv_vs_mfcc} shows the test accuracy vs number \gls{madds} of end-to-end \gls{kws} models compared to \gls{kws} models using \glspl{mfcc}. Only Pareto-optimal models are shown. The number of parameters corresponds to the circle area. We can observe that SincConv models contribute to the Pareto frontier. However, we also have to note that SincConv models do not reach the same performance as \gls{mfcc} models when the number of \gls{madds} is matched. We can again observe the tradeoff between model size and accuracy, where larger models have a higher test accuracy than smaller models.
After assessing the performance differences between models with 1D convolutions and SincConvs, we will now look at the performance of SincConv models compared to models using \glspl{mfcc} from \Fref{sec:results:nas}. \Fref{fig:sincconv_vs_mfcc} shows the test accuracy vs the number of \glspl{madds} of end-to-end \gls{kws} models compared to \gls{kws} models using \glspl{mfcc}. Only Pareto-optimal models are shown. The number of parameters corresponds to the circle area. We can observe that SincConv models contribute to the Pareto frontier. However, we also have to note that SincConv models do not reach the same performance as \gls{mfcc} models when the number of \glspl{madds} is matched. We can again observe the tradeoff between model size and accuracy, where larger models have a higher test accuracy than smaller models.
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{\pwd/plots/sincconv_vs_mfcc.png}
\caption{Test accuracy vs number of \gls{madds} of end-to-end \gls{kws} models using SincConvs compared to \gls{kws} models using \glspl{mfcc}. All models depicted in this figure are Pareto-optimal models and were obtained using \gls{nas}. The number of parameters corresponds to the circle area.}
\caption{Test accuracy vs number of \glspl{madds} of end-to-end \gls{kws} models using SincConvs compared to \gls{kws} models using \glspl{mfcc}. All models depicted in this figure are Pareto-optimal models and were obtained using \gls{nas}. The number of parameters corresponds to the circle area.}
\label{fig:sincconv_vs_mfcc}
\end{figure}
\Fref{fig:nas_td_best} (a), (b) and (c) provides the topologies of some of the models from \Fref{fig:sincconv_vs_mfcc}. The model accuracy, number of \gls{madds}, number of parameters and tradeoff parameter $\beta$ of the models from \Fref{fig:nas_td_best} are included in \Fref{tab:nas_td_best}.
\Fref{fig:nas_td_best} (a), (b) and (c) provide the topologies of some of the models from \Fref{fig:sincconv_vs_mfcc}. The model accuracy, number of \glspl{madds}, number of parameters and tradeoff parameter $\beta$ of the models from \Fref{fig:nas_td_best} are included in \Fref{tab:nas_td_best}.
\begin{figure}
\centering
......@@ -70,13 +70,13 @@ After assessing the performance differences between models with 1D convolutions
\centering
\begin{tabular}{ccccc}
\toprule
\gls{mfcc} & $\beta$ & \gls{madds} & Parameters & Test Accuracy\\
\gls{mfcc} & $\beta$ & \glspl{madds} & Parameters & Test Accuracy\\
\midrule
40 & 4 & \SI{9.70}{\mega\nothing} & \SI{70.47}{\kilo\nothing} & \SI{94.39}{\percent} \\
60 & 2 & \SI{18.13}{\mega\nothing} & \SI{91.41}{\kilo\nothing} & \SI{95.38}{\percent} \\
80 & 0 & \SI{29.18}{\mega\nothing} & \SI{117.65}{\kilo\nothing} & \SI{95.74}{\percent} \\
\bottomrule
\end{tabular}
\caption{Test accuracy, \gls{madds} and number of parameters of some selected models. The best performing Pareto optimal models with 40, 60 and 80 SincConv kernels were selected respectively.}
\caption{Test accuracy, \glspl{madds} and number of parameters of some selected models. The best performing Pareto optimal models with 40, 60 and 80 SincConv kernels were selected respectively.}
\label{tab:nas_td_best}
\end{table}
\ No newline at end of file
......@@ -5,7 +5,7 @@ As we have seen in the previous sections of this chapter, there are many methods
In this section, we will explore different exit types and the associated accuracy at every exit. We will also explore the difference in accuracies if the model is trained together with the exits compared to using a pretrained model where only the exits are optimized. Finally, we will improve multi-exit models with distillation, where we will see further improvements in accuracy.
\newsubsection{Experimental Setup}{results:multiexit:setup}
For the following experiments in this section, we select one of the models from \Fref{sec:results:endtoend}. The topology of the model is visualized in \Fref{fig:multiex_model}. The model is an end-to-end model and was found using \gls{nas} with 60 SincConv channels, a tradeoff $\beta=4$ and uses \SI{10.82}{\mega\nothing} \gls{madds} and \SI{75.71}{\kilo\nothing} parameters.
For the following experiments in this section, we select one of the models from \Fref{sec:results:endtoend}. The topology of the model is visualized in \Fref{fig:multiex_model}. The model is an end-to-end model and was found using \gls{nas} with 60 SincConv channels, a tradeoff $\beta=4$ and uses \SI{10.82}{\mega\nothing} \glspl{madds} and \SI{75.71}{\kilo\nothing} parameters.
\begin{figure}
\centering
\includegraphics[height=\textwidth,angle=90]{\pwd/plots/multiex_model.png}
......@@ -62,12 +62,12 @@ Next, we look at knowledge distillation to improve the accuracy of multi-exit mo
\label{fig:multi_exit_opt_exit_model_with_without_dist.png}
\end{figure}
Finally, we look at the resource consumption of the multi-exit model. Of course, adding exit layers to an existing \gls{dnn} increases the memory requirements. In our case, the exits alone need \SI{123.92}{\kilo\nothing} parameters while the model itself without exits uses only \SI{75.71}{\kilo\nothing} parameters. This results in an increase in memory consumption of more than 2.5 times. This may be detrimental for some applications. However, with an increase in memory consumption comes more flexibility during the forward pass at runtime. \Fref{tab:exit_params_ops} shows the number of parameters and \gls{madds} for every exit as well as the cumulative number of \gls{madds} to compute a prediction at certain exits. In the ideal case, more than two times the number of \gls{madds} can be save by using the prediction at exit 1 instead of passing the input through the whole model up to the final exit.
Finally, we look at the resource consumption of the multi-exit model. Of course, adding exit layers to an existing \gls{dnn} increases the memory requirements. In our case, the exits alone need \SI{123.92}{\kilo\nothing} parameters while the model itself without exits uses only \SI{75.71}{\kilo\nothing} parameters. This results in an increase in memory consumption of more than 2.5 times. This may be detrimental for some applications. However, with an increase in memory consumption comes more flexibility during the forward pass at runtime. \Fref{tab:exit_params_ops} shows the number of parameters and \glspl{madds} for every exit as well as the cumulative number of \glspl{madds} to compute a prediction at certain exits. In the ideal case, the number of \glspl{madds} is reduced by more than a factor of two when using the prediction at exit 1 instead of passing the input through the whole model up to the final exit.
\begin{table}
\begin{center}
\begin{tabular}{lccc}
\toprule
Exit & \# Parameters & \# \gls{madds} & \# \gls{madds} \\
Exit & \# Parameters & \# \glspl{madds} & \# \glspl{madds} \\
& & & Cumulative \\
\midrule
Exit 1 & \SI{30.73}{\kilo\nothing} & \SI{1.59}{\mega\nothing} & \SI{5.35}{\mega\nothing} \\
......@@ -78,6 +78,6 @@ Finally, we look at the resource consumption of the multi-exit model. Of course,
\bottomrule
\end{tabular}
\end{center}
\caption{Number of parameters and \gls{madds} of every exit. The cumulative \gls{madds} is the number of \gls{madds} needed to compute a prediction at a certain exit.}
\caption{Number of parameters and \glspl{madds} of every exit. The cumulative \glspl{madds} is the number of \glspl{madds} needed to compute a prediction at a certain exit.}
\label{tab:exit_params_ops}
\end{table}
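The cumulative numbers in \Fref{tab:exit_params_ops} translate directly into savings at inference time. The following Python sketch (PyTorch, illustrative only; the confidence-threshold policy, a batch size of one, and all names are assumptions, since the runtime exit policy depends on the application) stops at the first exit whose softmax confidence exceeds a threshold:
\begin{verbatim}
import torch.nn.functional as F

def predict_with_early_exit(stages, exits, x, threshold=0.9):
    # `stages` and `exits` are matching lists of nn.Module objects
    for stage, exit_head in zip(stages, exits):
        x = stage(x)
        probs = F.softmax(exit_head(x), dim=-1)
        confidence, prediction = probs.max(dim=-1)
        if confidence.item() >= threshold:
            return prediction  # confident enough: skip all later stages
    return prediction          # otherwise fall back to the final exit
\end{verbatim}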
% **************************************************************************************************
% **************************************************************************************************
The purpose of this thesis is to find resource efficient \glspl{dnn} for \gls{kws}. A resource efficient \gls{dnn} should be as accurate as possible while using as few resources as possible. The two types of resources most crucial are the number of parameters and the number of \gls{madds} needed per forward-pass. In most cases, there is a tradeoff between the model accuracy and the required resources, where small models use less resources and large models use more resources.
The purpose of this thesis is to find resource efficient \glspl{dnn} for \gls{kws}. A resource efficient \gls{dnn} should be as accurate as possible while using as few resources as possible. The two types of resources most crucial are the number of parameters and the number of \glspl{madds} needed per forward pass. In most cases, there is a tradeoff between the model accuracy and the required resources, where small models use fewer resources and large models use more resources.
Exploring the accuracy-resource tradeoff is especially important for \gls{kws} applications where, given some classification task, different models need to be deployed on different devices with varying computational capabilities. For such applications, \gls{nas} can be used to automatically find the appropriate model for every device without the need for hand-tuning.
In this section, we explore the accuracy-resource tradeoff by utilizing \gls{nas}. We will compare the models found with \gls{nas} in terms of accuracy, number of parameters and number of \gls{madds} needed per forward-pass. We will focus on \gls{kws} models that are supplied with \glspl{mfcc}. Using the raw audio signal instead of \glspl{mfcc} has many potential benefits. \Fref{sec:results:endtoend} will explore end-to-end \gls{kws} models with \gls{nas} where the models are supplied with the raw audio signal instead of \glspl{mfcc}.
In this section, we explore the accuracy-resource tradeoff by utilizing \gls{nas}. We will compare the models found with \gls{nas} in terms of accuracy, number of parameters and number of \glspl{madds} needed per forward-pass. We will focus on \gls{kws} models that are supplied with \glspl{mfcc}. Using the raw audio signal instead of \glspl{mfcc} has many potential benefits. \Fref{sec:results:endtoend} will explore end-to-end \gls{kws} models with \gls{nas} where the models are supplied with the raw audio signal instead of \glspl{mfcc}.
\newsubsection{Experimental Setup}{results:nas:experimental}
Before extracting \glspl{mfcc}, we first augment the raw audio signal according to \Fref{sec:dataset:augmentation}. Then, we filter the audio signal with a lowpass filter with a cutoff frequency of $f_L=\SI{4}{\kilo\hertz}$ and a highpass filter with a cutoff frequency of $f_H=\SI{20}{\hertz}$ to remove any frequency bands that do not provide any useful information. For windowing, a Hann window with a window length of \SI{40}{\milli\second} and a stride of \SI{20}{\milli\second} is used. Then we extract either 10, 20, 30 or 40 \glspl{mfcc} from the audio signal. The number of Mel filters is selected to be 40 regardless of the number of \glspl{mfcc} extracted. We do not extract any dynamic \glspl{mfcc}.
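For reference, this feature extraction can be sketched in Python as follows (the Butterworth filter order and the choice of scipy and librosa are arbitrary illustration choices; the augmentation step is omitted):
\begin{verbatim}
import librosa
from scipy.signal import butter, sosfiltfilt

def extract_mfcc(audio, sr=16000, n_mfcc=40):
    # band-limit the signal to 20 Hz ... 4 kHz
    sos = butter(5, [20.0, 4000.0], btype="bandpass", fs=sr, output="sos")
    audio = sosfiltfilt(sos, audio)
    # 40 ms Hann window, 20 ms stride, 40 Mel filters
    return librosa.feature.mfcc(
        y=audio, sr=sr, n_mfcc=n_mfcc, n_mels=40,
        win_length=int(0.040 * sr), hop_length=int(0.020 * sr),
        window="hann")
\end{verbatim}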
The overparameterized model used in this section is shown in Table~\ref{tab:model_structure}. The model consists of three stages, (i) an input stage, (ii) an intermediate stage, and (iii) an output stage. Stages (i) and (iii) are fixed to a 5$\times$11 convolution and a 1$\times$1 convolution respectively. The number of \gls{madds} of the whole model is lower bounded by the number of \gls{madds} of stage (i) and stage (iii), which however is negligible. We apply batch normalization followed by the \gls{relu} non-linearity as an activation function after the convolutions of stages (i) and (iii). We use \glspl{mbc} as our main building blocks in stage (ii). Only convolutions from stage (ii) are optimized.
The overparameterized model used in this section is shown in Table~\ref{tab:model_structure}. The model consists of three stages, (i) an input stage, (ii) an intermediate stage, and (iii) an output stage. Stages (i) and (iii) are fixed to a 5$\times$11 convolution and a 1$\times$1 convolution respectively. The number of \glspl{madds} of the whole model is lower bounded by the number of \glspl{madds} of stage (i) and stage (iii), which however is negligible. We apply batch normalization followed by the \gls{relu} non-linearity as an activation function after the convolutions of stages (i) and (iii). We use \glspl{mbc} as our main building blocks in stage (ii). Only convolutions from stage (ii) are optimized.
During NAS, we allow \glspl{mbc} with expansion rates $e \in \{1,2,3,4,5,6\}$ and kernel sizes $k \in \{3,5,7\}$ for selection. We also include the zero operation. Therefore, our overparameterized model has $\#e\cdot\#k + 1=19$ binary gates per layer. For blocks where the input feature map size is equal to the output feature map size we include skip connections. If a zero operation is selected as a block by \gls{nas}, the skip connection allows this particular layer to be skipped resulting in an identity layer.
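The resulting per-layer search space can be enumerated explicitly (Python, illustrative only), which also confirms the number of binary gates stated above:
\begin{verbatim}
# all (expansion rate, kernel size) combinations plus the zero operation
candidates = [(e, k) for e in (1, 2, 3, 4, 5, 6) for k in (3, 5, 7)]
candidates.append("zero")
assert len(candidates) == 19  # binary gates per layer
\end{verbatim}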
......@@ -38,33 +38,33 @@ The accuracy-resource tradeoff of the model is established by regularizing the a
\label{eq:regularize}
\mathcal{L}_{\mathrm{arch}} = \mathcal{L}_{\mathrm{CE}} \cdot \left(\frac{\log{\mathrm{ops}_{\mathrm{exp}}}}{\log{\mathrm{ops}_{\mathrm{target}}}}\right)^\beta
\end{equation}
where $\mathcal{L}_{\mathrm{CE}}$ is the cross entropy loss, $\mathrm{ops}_{\mathrm{exp}}$ the expected number of MAdd operations, $\mathrm{ops}_{\mathrm{target}}$ the target number of MAdd operations and $\beta$ the regularization parameter. The expected number of MAdd operations $\mathrm{ops}_{\mathrm{exp}}$ is the overall number of \gls{madds} expected to be used in the convolutions and in the fully connected layer based on the probabilities $p_i$ of the final model. For stages (i) and (iii) we simply sum the number of \gls{madds} used in the convolutions and in the fully connected layer. For stage (ii) layers, the expected number of \gls{madds} is defined as the weighted sum of \gls{madds} used in $o_i$ weighted by the probabilities $p_i$.
where $\mathcal{L}_{\mathrm{CE}}$ is the cross entropy loss, $\mathrm{ops}_{\mathrm{exp}}$ the expected number of \glspl{madds}, $\mathrm{ops}_{\mathrm{target}}$ the target number of \glspl{madds} and $\beta$ the regularization parameter. The expected number of \glspl{madds} $\mathrm{ops}_{\mathrm{exp}}$ is the overall number of \glspl{madds} expected to be used in the convolutions and in the fully connected layer based on the probabilities $p_i$ of the final model. For stages (i) and (iii) we simply sum the number of \glspl{madds} used in the convolutions and in the fully connected layer. For stage (ii) layers, the expected number of \glspl{madds} is defined as the sum of the \glspl{madds} used in the candidate operations $o_i$, weighted by the probabilities $p_i$.
Establishing the accuracy-resource tradeoff using Equation~\ref{eq:regularize} is usually achieved by fixing $\mathrm{ops}_{\mathrm{target}}$ and choosing $\beta$ such that the number of \gls{madds} of the final model after NAS are close to $\mathrm{ops}_{\mathrm{target}}$. If $\mathrm{ops}_{\mathrm{target}}$ is close to $\mathrm{ops}_{\mathrm{exp}}$, the right term of Equation~\ref{eq:regularize} becomes one thus leaving $\mathcal{L}_{\mathrm{CE}}$ unchanged. However, if $\mathrm{ops}_{\mathrm{exp}}$ is larger or smaller than $\mathrm{ops}_{\mathrm{target}}$, $\mathcal{L}_{\mathrm{CE}}$ is scaled up or down respectively.
Establishing the accuracy-resource tradeoff using Equation~\ref{eq:regularize} is usually achieved by fixing $\mathrm{ops}_{\mathrm{target}}$ and choosing $\beta$ such that the number of \glspl{madds} of the final model after NAS are close to $\mathrm{ops}_{\mathrm{target}}$. If $\mathrm{ops}_{\mathrm{target}}$ is close to $\mathrm{ops}_{\mathrm{exp}}$, the right term of Equation~\ref{eq:regularize} becomes one thus leaving $\mathcal{L}_{\mathrm{CE}}$ unchanged. However, if $\mathrm{ops}_{\mathrm{exp}}$ is larger or smaller than $\mathrm{ops}_{\mathrm{target}}$, $\mathcal{L}_{\mathrm{CE}}$ is scaled up or down respectively.
In this thesis, we focus on obtaining a wide variety of models with a different number of \gls{madds} rather than reaching a certain number of target \gls{madds}. Therefore, we keep $\mathrm{ops}_{\mathrm{target}}$ fixed while varying $\beta$. Varying $\beta$ while fixing $\mathrm{ops}_{\mathrm{target}}$ resulted in more diverse models in terms of number of \gls{madds} than fixing $\beta$ and varying $\mathrm{ops}_{\mathrm{target}}$. We selected $\beta \in \{0, 1, 2, 4, 8, 16\}$ for all experiments in this section.
In this thesis, we focus on obtaining a wide variety of models with a different number of \glspl{madds} rather than reaching a certain number of target \glspl{madds}. Therefore, we keep $\mathrm{ops}_{\mathrm{target}}$ fixed while varying $\beta$. Varying $\beta$ while fixing $\mathrm{ops}_{\mathrm{target}}$ resulted in more diverse models in terms of number of \glspl{madds} than fixing $\beta$ and varying $\mathrm{ops}_{\mathrm{target}}$. We selected $\beta \in \{0, 1, 2, 4, 8, 16\}$ for all experiments in this section.
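A direct Python transcription of Equation~\ref{eq:regularize} reads as follows (illustrative only; in the actual search, $\mathcal{L}_{\mathrm{CE}}$ and $\mathrm{ops}_{\mathrm{exp}}$ are computed from the current batch and the gate probabilities, and the example numbers are arbitrary):
\begin{verbatim}
import math

def architecture_loss(ce_loss, ops_expected, ops_target, beta):
    # scale the cross entropy loss up (down) when the expected number of
    # MACs lies above (below) the target
    return ce_loss * (math.log(ops_expected) / math.log(ops_target)) ** beta

# example: expected MACs above the target inflate the loss
loss = architecture_loss(ce_loss=0.5, ops_expected=30e6,
                         ops_target=10e6, beta=4)
\end{verbatim}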
Before actually performing \gls{nas} we pretrain the model blocks for 40 epochs at a learning rate of 0.05. During pretraining, model blocks are randomly sampled and trained on one batch of the training set. Sampling and training repeats until 40 epochs of training are passed. Optimization is performed with stochastic gradient descent using a batch size of 100. After pretraining we perform \gls{nas} for 120 epochs with an initial learning rate of 0.2. The learning rate is decayed according to a cosine schedule \cite{Cai2019}. We use a batch size of 100 and optimize the model using stochastic gradient descent. Also, label smoothing is applied to the targets with a factor of 0.1.
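As a rough sketch, this training configuration corresponds to the following PyTorch setup (the stand-in model and the number of classes are placeholders; \texttt{label\_smoothing} requires a recent PyTorch version):
\begin{verbatim}
import torch
import torch.nn as nn

model = nn.Linear(40, 12)  # placeholder for the selected architecture
optimizer = torch.optim.SGD(model.parameters(), lr=0.2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=120)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
\end{verbatim}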
A model is trained until convergence after selection by the \gls{nas} procedure to optimize the performance. We use the same hyperparameters as in the \gls{nas} process.
\newsubsection{Results and Discussions}{results:nas:discussion}
\Fref{fig:nas_mfcc_results} shows the models found using \gls{nas} for different tradeoffs $\beta \in \{0, 1, 2, 4, 8, 16\}$ using 10 (red), 20 (blue), 30 (yellow) and 40 (green) \glspl{mfcc}. Models on the Pareto frontier are emphasized. The model size corresponds to the circle area. As already mentioned in the introduction of this section, the accuracy of a model is positively correlated with the amount of resources needed. Furthermore, we can observe that using more \glspl{mfcc} with larger tradeoff values $\beta$ produces models with higher accuracies than increasing the model size with fewer \glspl{mfcc} at lower tradeoff values. For example, all but the smallest 10 \gls{mfcc} models are dominated by the smallest 20 \gls{mfcc} model in terms of accuracy, number of parameters and number of \gls{madds}.
\Fref{fig:nas_mfcc_results} shows the models found using \gls{nas} for different tradeoffs $\beta \in \{0, 1, 2, 4, 8, 16\}$ using 10 (red), 20 (blue), 30 (yellow) and 40 (green) \glspl{mfcc}. Models on the Pareto frontier are emphasized. The model size corresponds to the circle area. As already mentioned in the introduction of this section, the accuracy of a model is positively correlated with the amount of resources needed. Furthermore, we can observe that using more \glspl{mfcc} with larger tradeoff values $\beta$ produces models with higher accuracies than increasing the model size with fewer \glspl{mfcc} at lower tradeoff values. For example, all but the smallest 10 \gls{mfcc} models are dominated by the smallest 20 \gls{mfcc} model in terms of accuracy, number of parameters and number of \glspl{madds}.
\begin{figure}
\centering
\includegraphics[width=\textwidth]{\pwd/plots/nas_mfcc_results.png}
\caption{Test accuracy vs number of \gls{madds} for models obtained using \gls{nas} with different tradeoffs $\beta \in \{0, 1, 2, 4, 8, 16\}$. \gls{nas} was performed with 10 (red), 20 (blue), 30 (yellow) and 40 (green) \glspl{mfcc}. The number of parameters corresponds to the circle area. Models on the Pareto frontier are emphasized.}
\caption{Test accuracy vs number of \glspl{madds} for models obtained using \gls{nas} with different tradeoffs $\beta \in \{0, 1, 2, 4, 8, 16\}$. \gls{nas} was performed with 10 (red), 20 (blue), 30 (yellow) and 40 (green) \glspl{mfcc}. The number of parameters corresponds to the circle area. Models on the Pareto frontier are emphasized.}
\label{fig:nas_mfcc_results}
\end{figure}
\Fref{tab:nas_mfcc_best} shows the test accuracy, \gls{madds} and number of parameters of some selected models using 10, 20, 30 and 40 \glspl{mfcc}. We can observe that with \gls{nas}, a wide variety of models with different resource requirements can be obtained. We can get very small models such as the first model that only uses \SI{5.56}{\mega\nothing} \gls{madds} and \SI{71.58}{\kilo\nothing} parameters but still achieves an accuracy of \SI{94.96}{\percent}. On the other hand we can get larger models that achieve an respectable accuracy of \SI{96.65}{\percent} using \SI{142.40}{\mega\nothing} \gls{madds} and \SI{889.69}{\kilo\nothing} parameters.
\Fref{tab:nas_mfcc_best} shows the test accuracy, \glspl{madds} and number of parameters of some selected models using 10, 20, 30 and 40 \glspl{mfcc}. We can observe that with \gls{nas}, a wide variety of models with different resource requirements can be obtained. We can get very small models such as the first model that only uses \SI{5.56}{\mega\nothing} \glspl{madds} and \SI{71.58}{\kilo\nothing} parameters but still achieves an accuracy of \SI{94.96}{\percent}. On the other hand, we can get larger models that achieve a respectable accuracy of \SI{96.65}{\percent} using \SI{142.40}{\mega\nothing} \glspl{madds} and \SI{889.69}{\kilo\nothing} parameters.
\begin{table}
\centering
\begin{tabular}{ccccc}
\toprule
\gls{mfcc} & $\beta$ & \gls{madds} & Parameters & Test Accuracy\\
\gls{mfcc} & $\beta$ & \glspl{madds} & Parameters & Test Accuracy\\
\midrule
10 & 16 & \SI{5.56}{\mega\nothing} & \SI{71.58}{\kilo\nothing} & \SI{94.96}{\percent} \\
20 & 8 & \SI{30.79}{\mega\nothing} & \SI{345.70}{\kilo\nothing} & \SI{96.10}{\percent} \\
......@@ -72,7 +72,7 @@ A model is trained until convergence after selection by the \gls{nas} procedure
40 & 2 & \SI{142.40}{\mega\nothing} & \SI{889.69}{\kilo\nothing} & \SI{96.65}{\percent} \\
\bottomrule
\end{tabular}
\caption{\gls{madds}, number of parameters and test accuracy of some selected models. The best performing Pareto optimal models with 10, 20, 30 and 40 \glspl{mfcc} were selected respectively.}
\caption{\glspl{madds}, number of parameters and test accuracy of some selected models. The best performing Pareto optimal models with 10, 20, 30 and 40 \glspl{mfcc} were selected respectively.}
\label{tab:nas_mfcc_best}
\end{table}
......
......@@ -19,7 +19,7 @@ Quantization by training distributions over discrete weights is another method f
In this section we will focus on quantization aware training using the \gls{ste}. Quantization aware training using the \gls{ste} does not introduce much overhead during the training of the quantized model. Furthermore, the performance of quantized models trained with the \gls{ste} is competitive. Quantization aware training using the \gls{ste} can also be used for quantizing both the weights and activations of a model. We will compare two techniques: (i) fixed bitwidth quantization and (ii) learned bitwidth quantization (also called mixed-precision quantization). For fixed bitwidth quantization, we will train a single model multiple times with different weight and activation bitwidths and compare the quantized models in terms of performance and memory requirements. Then, we will train the same model with learned weight and activation bitwidths and compare its performance and memory requirements to the fixed bitwidth models.
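To make the idea concrete, the following Python sketch (PyTorch; the symmetric uniform quantizer is an assumption and not necessarily the scheme used in the cited works) shows weight quantization with the \gls{ste}: the rounded weights are used in the forward pass, while the gradient passes through the rounding operation unchanged.
\begin{verbatim}
import torch

def fake_quantize(w, bits=8):
    # symmetric uniform quantization to 2^bits levels
    scale = w.abs().max().clamp_min(1e-8) / (2 ** (bits - 1) - 1)
    w_q = torch.round(w / scale).clamp(-(2 ** (bits - 1)),
                                       2 ** (bits - 1) - 1) * scale
    # straight-through estimator: forward uses w_q, gradient is the identity
    return w + (w_q - w).detach()

w = torch.randn(64, 64, requires_grad=True)
y = fake_quantize(w, bits=4).sum()
y.backward()  # w.grad is all ones, as if no quantization had happened
\end{verbatim}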
\newsubsection{Experimental Setup}{results:quantization:setup}
For the following experiments we selected one of the models previously found by \gls{nas} in \Fref{sec:results:nas}. The selected model is trained from scratch with different weight and activation bitwidths using either fixed bitwidth quantization or learned bitwidth quantization. The topology of the model is visualized in \Fref{fig:quant_model}. The model was found using \gls{nas} with 10 \glspl{mfcc}, a tradeoff $\beta=8$ and uses \SI{14.52}{\mega\nothing} \gls{madds} and \SI{188}{\kilo\nothing} parameters.
For the following experiments we selected one of the models previously found by \gls{nas} in \Fref{sec:results:nas}. The selected model is trained from scratch with different weight and activation bitwidths using either fixed bitwidth quantization or learned bitwidth quantization. The topology of the model is visualized in \Fref{fig:quant_model}. The model was found using \gls{nas} with 10 \glspl{mfcc}, a tradeoff $\beta=8$ and uses \SI{14.52}{\mega\nothing} \glspl{madds} and \SI{188}{\kilo\nothing} parameters.
\begin{figure}
\centering
\includegraphics[height=\textwidth,angle=90]{\pwd/plots/quant_model.png}
......
\addcontentsline{toc}{chapter}{Abstract (English)}
\begin{center}\Large\bfseries Abstract (English)\end{center}\vspace*{1cm}\noindent
This thesis explores different methods for designing \glspl{cnn} for \gls{kws} in limited resource environments. Our goal is to maximize the classification accuracy while minimizing the memory requirements as well as the number of \gls{madds} per forward pass. To achieve this goal, we first employ a differentiable \gls{nas} approach to optimize the structure of \glspl{cnn}. After a suitable \gls{kws} model is found with \gls{nas}, we conduct quantization of weights and activations to reduce the memory requirements even further. For quantization, we compare fixed bitwidth quantization and trained bitwidth quantization. Then, we perform \gls{nas} again to optimize the structure of end-to-end \gls{kws} models. End-to-end models perform classification directly on the raw audio waveforms, skipping the extraction of hand-crafted speech features such as \gls{mfcc}. We compare our models using \glspl{mfcc} to end-to-end models in terms of accuracy, memory requirements and number of \gls{madds} per forward pass. We also show that multi-exit models provide a lot of flexibility for \gls{kws} systems, allowing us to interrupt the forward pass early if necessary. All experiments are conducted on the \gls{gsc} dataset, a popular dataset for evaluating the classification accuracy of \gls{kws} applications.
This thesis explores different methods for designing \glspl{cnn} for \gls{kws} in limited resource environments. Our goal is to maximize the classification accuracy while minimizing the memory requirements as well as the number of \glspl{madds} per forward pass. To achieve this goal, we first employ a differentiable \gls{nas} approach to optimize the structure of \glspl{cnn}. After a suitable \gls{kws} model is found with \gls{nas}, we conduct quantization of weights and activations to reduce the memory requirements even further. For quantization, we compare fixed bitwidth quantization and trained bitwidth quantization. Then, we perform \gls{nas} again to optimize the structure of end-to-end \gls{kws} models. End-to-end models perform classification directly on the raw audio waveforms, skipping the extraction of hand-crafted speech features such as \gls{mfcc}. We compare our models using \glspl{mfcc} to end-to-end models in terms of accuracy, memory requirements and number of \glspl{madds} per forward pass. We also show that multi-exit models provide a lot of flexibility for \gls{kws} systems, allowing us to interrupt the forward pass early if necessary. All experiments are conducted on the \gls{gsc} dataset, a popular dataset for evaluating the classification accuracy of \gls{kws} applications.