Commit 07f9f005 authored by David Peter

Finish multi exit results

parent f95538e7
@@ -351,7 +351,7 @@
\newsection{End-to-end Keyword Spotting}{results:endtoend}
\input{\pwd/endtoend}
\newsection{Multi-exit Architectures}{results:multiexit}
\newsection{Multi-exit Models}{results:multiexit}
\input{\pwd/multiexit}
% --------------------------------------------------------------------------------------------------
......
@@ -15,7 +15,7 @@ During NAS, we allow \glspl{mbc} with expansion rates $e \in \{1,2,3,4,5,6\}$ an
\begin{center}
\begin{tabular}{clcccc}
\toprule
Stage & Operation & Kernel size & Stride (H, W) & Channels & Layers \\
Stage & Operation & Kernel Size & Stride (H, W) & Channels & Layers \\
\midrule
(i) & SincConv & 400 & 160 & 1 & 1 \\
(ii) & Conv & 3$\times$3 & 2, 2 & 10 & 1 \\
......
% **************************************************************************************************
% **************************************************************************************************
As we have seen in previous sections, resource efficiency of \glspl{dnn} for \gls{kws} can be achieved by many methods such as \gls{nas}, quantization and end-to-end architectures. In most cases, different methods operate orthogonal to each other, meaning that they optimized different aspects of a \gls{dnn} and therefore can be combined. In practice, resource efficient \gls{kws} applications may run on hardware with restricted resources, either because the device itself has not many computational resources (e.g. microcontrollers) or because the devices hardware is shared among many applications (e.g. smartphones). In both cases, multi-exit architectures can pose a substantial benefit over ordinary \glspl{dnn}. In multi-exit architectures not only one, but $N$ different classifiers are trained from a single architecture. This is achieved by attaching exit layers to the \gls{dnn} that take in intermediate outputs and compute a prediction. Having more than one classifiers at different depths of the \gls{dnn} allows to prematurely cancel the forward pass if the computational budget is used up or if the quality of the prediction is high enough.
As we have seen in the previous sections of this chapter, there are many methods to obtain resource efficient \glspl{dnn} for \gls{kws}, including \gls{nas}, quantization and end-to-end models. Another method that we consider is the use of multi-exit models. In practice, resource efficient \glspl{dnn} for \gls{kws} may run on hardware with restricted resources, either because the device itself has few computational resources (e.g. microcontrollers) or because the device hardware is shared among many applications (e.g. smartphones). In both cases, multi-exit models can pose a substantial benefit over ordinary \glspl{dnn}. In multi-exit models, not only one but $N$ different classifiers are trained from a single model. This is achieved by attaching several exit layers to the \gls{dnn} that take an intermediate result and compute a prediction. Having several classifiers at different depths of the \gls{dnn} makes it possible to cancel the forward pass prematurely if the computational budget is exhausted or if the confidence of the prediction is high enough.
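To make the early-exit mechanism concrete, the following PyTorch-style sketch shows confidence-based early exiting during inference. It is a minimal illustration under assumptions, not the implementation used in this work; the names \texttt{blocks} and \texttt{exits} as well as the threshold value are hypothetical.
\begin{verbatim}
import torch

@torch.no_grad()
def early_exit_forward(x, blocks, exits, threshold=0.9):
    # Run the forward pass block by block and stop at the first
    # exit whose prediction is confident enough (batch size 1).
    for block, exit_head in zip(blocks, exits):
        x = block(x)                       # intermediate result
        logits = exit_head(x)              # prediction at this exit
        confidence = torch.softmax(logits, dim=-1).max().item()
        if confidence >= threshold:        # confident enough: stop early
            return logits
    return logits                          # otherwise use the final exit
\end{verbatim}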
In this section, we will explore different exit types and the associated accuracy at every exit. We will also explore the difference in performance if the model is trained together with the exits compared to pretrained model where only the exits are optimized. Finally, we will improve multi-exit architectures with distillation, where we will see further improvements in accuracy.
In this section, we will explore different exit topologies and the associated accuracy at every exit. We will also explore the difference in accuracy if the model is trained together with the exits, compared to using a pretrained model where only the exits are optimized. Finally, we will improve multi-exit models with knowledge distillation, where we will see further improvements in accuracy.
\newsubsection{Experimental Setup}{results:multiexit:setup}
For the following experiments we selected one of the models previously found by \gls{nas}. The topology of the model is visualized in \Fref{fig:multiex_model}. The is an end-to-end model and was found using \gls{nas} with 60 SincConv channels, a tradeoff $\beta=4$ and uses $10.82$M \gls{madds} and $75.71$k parameters.
For the following experiments in this section, we select one of the models from \Fref{sec:results:endtoend}. The topology of the model is visualized in \Fref{fig:multiex_model}. The model is an end-to-end model found using \gls{nas} with 60 SincConv channels and a tradeoff parameter $\beta=4$; it uses \SI{10.82}{\mega\nothing} \gls{madds} and \SI{75.71}{\kilo\nothing} parameters.
\begin{figure}
\centering
\includegraphics[height=\textwidth,angle=90]{\pwd/plots/multiex_model.png}
@@ -13,13 +13,13 @@ For the following experiments we selected one of the models previously found by
\label{fig:multiex_model}
\end{figure}
There will be 5 exits in total, one exit after every of the first four \gls{mbc} blocks named exit 1 to exit 4 respectively and another exit at the final output named exit 5. A single exit consists of multiple layers depending on the exit topology according to \Fref{tab:exit_topology}. At first, there is either a ordinary convolution or a depthwise separable convolution. Then there is an additional pointwise convolution followed by global average pooling and either one, two or three fully connected layers. Convolutional layers use batch norm and \gls{relu} activations. Linear layers also use batch norm and \gls{relu} activations,except for the last one.
There will be 5 exits in total: one exit after each of the first four \glspl{mbc}, named exit 1 to exit 4 respectively, and another exit at the final output, named exit 5. A single exit consists of multiple layers depending on the exit topology according to \Fref{tab:exit_topology}. The first layer is either an ordinary convolution or a depthwise separable convolution. Then there is an additional pointwise convolution, followed by global average pooling and either one, two or three fully connected layers. Convolutional layers use batch norm and \gls{relu} activations. Linear layers also use batch norm and \gls{relu} activations, except for the last linear layer, which uses neither batch norm nor an activation function.
\begin{table}
\begin{center}
\begin{tabular}{lccc}
\toprule
Type & $K$ & $S$ & $C$ \\
Type & Kernel Size & Stride & Channels \\
\midrule
Conv/DS-Conv & 3$\times$3 & 1 & 40 \\
Conv & 1$\times$1 & 1 & 80 \\
@@ -28,41 +28,41 @@ There will be 5 exits in total, one exit after every of the first four \gls{mbc}
\bottomrule
\end{tabular}
\end{center}
\caption{Exit topology used in this section. $K$ denotes the kernel size, $S$ the stride, $C$ the number of channels. The first layer is either a convolution or a depthwise separable convolution. This convolution is followed by pointwise convolution. Finally, global average pooling is performed followed by either one, two or three fully connected layers. Convolutional layers use batch norm and \gls{relu} activations. Linear layers also use batch norm and \gls{relu} activations,except for the last one.}
\caption{Exit topology used in this section. The first layer is either a convolution or a depthwise separable convolution, followed by a pointwise convolution. Finally, global average pooling is performed, followed by either one, two or three fully connected layers. Convolutional layers use batch norm and \gls{relu} activations. Linear layers also use batch norm and \gls{relu} activations, except for the last linear layer, which uses neither batch norm nor an activation function.}
\label{tab:exit_topology}
\end{table}
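For illustration, the exit topology of \Fref{tab:exit_topology} with a depthwise separable convolution and two linear layers could be written as the following PyTorch-style sketch. The class name and the width of the hidden linear layer are assumptions; only the convolution shapes and channel counts follow the table.
\begin{verbatim}
import torch.nn as nn

class ExitBranch(nn.Module):
    # Sketch of one exit: DS-Conv (3x3, 40 channels), pointwise
    # Conv (1x1, 80 channels), global average pooling, 2 linear layers.
    def __init__(self, in_channels, num_classes, hidden=80):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1,
                      groups=in_channels, bias=False),  # depthwise 3x3
            nn.Conv2d(in_channels, 40, 1, bias=False),  # pointwise part
            nn.BatchNorm2d(40), nn.ReLU(),
            nn.Conv2d(40, 80, 1, bias=False),           # pointwise conv
            nn.BatchNorm2d(80), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # global avg pool
        )
        self.classifier = nn.Sequential(
            nn.Linear(80, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),  # no batch norm, no activation
        )

    def forward(self, x):
        return self.classifier(self.features(x))
\end{verbatim}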
Models are trained with a learning rate of 0.05 and the stochastic gradient descent optimizer. During the course of training, the learning rate is decayed according to a cosine schedule. We use a batch size of 100 for training. Additionally for knowledge distillation, soft-labels are generated from a teacher model. The teacher model was searched for with \gls{nas} and achieves and accuracy of 96.55\%. We weight the classification and distillation loss according to \Fref{eq:dist_weight} with $\alpha = 0.3$.
Models are trained with a learning rate of 0.05 and the stochastic gradient descent optimizer. During the course of training, the learning rate is decayed according to a cosine schedule. We use a batch size of 100 for training. For knowledge distillation, soft labels are generated from a teacher model. The teacher model was obtained with \gls{nas} and achieves an accuracy of \SI{96.55}{\percent}. We weight the classification and distillation losses according to \Fref{eq:dist_weight} with $\alpha = 0.3$.
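A possible implementation of the weighted loss is sketched below. We assume a convex combination of a cross-entropy term and a Kullback-Leibler distillation term; the exact form of \Fref{eq:dist_weight}, the temperature and the function names are assumptions.
\begin{verbatim}
import torch.nn.functional as F

def combined_loss(student_logits, teacher_logits, targets,
                  alpha=0.3, temperature=1.0):
    # Hard-label classification loss.
    ce = F.cross_entropy(student_logits, targets)
    # Soft-label distillation loss against the teacher's outputs.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Assumed weighting between classification and distillation loss.
    return (1.0 - alpha) * ce + alpha * kd
\end{verbatim}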
\newsubsection{Results and Discussions}{results:multiexit:discussion}
\Fref{fig:multi_exit_compare_exits} shows the test accuracies at certain exits of a multi-exit architecture using different exit topologies. The model used in this figure was pretrained and only the exits were trained. As expected, different exit topologies result in different accuracies. We observe that exits with ordinary convolutions perform better than exits with depthwise separable convolutions. Furthermore, using more than 1 linear layer looks to be beneficial. However, the difference in performance becomes negligible between two and three linear layers. Since depthwise-separable convolutions are more resource efficient than ordinary convolutions, we chose the exit with depthwise separable convolutions and two linear layers as the exit layer for subsequent experiments.
\Fref{fig:multi_exit_compare_exits} shows the test accuracies at certain exits of a multi-exit model using different exit topologies. The model used in this figure was pretrained and then fixed before the exits were attached and trained. As expected, different exit topologies result in different test accuracies. We observe that exits with ordinary convolutions perform better than exits with depthwise separable convolutions. Furthermore, using more than one linear layer appears to be beneficial. However, the difference in accuracy becomes negligible between two and three linear layers. Since depthwise separable convolutions are more resource efficient than ordinary convolutions, we choose the exit with a depthwise separable convolution and two linear layers for the subsequent experiments.
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{\pwd/plots/multi_exit_compare_exits.png}
\caption{Test accuracies at certain exits of a multi-exit architecture using different exit types. Every exit either includes an ordinary convolution or a depthwise-separable convolution followed by either one, two or three linear layers.}
\caption{Test accuracies at certain exits of a multi-exit model using different exit topologies. Every exit includes either an ordinary convolution or a depthwise separable convolution, followed by either one, two or three linear layers.}
\label{fig:multi_exit_compare_exits}
\end{figure}
Using the previously selected exit with an depthwise separable convolution and two linear layers, we now explore the test accuracies for two scenarios: (i) the model is pretrained and only the exits are optimized and (ii) the model is optimized together with the exits. Both scenarios are compared in \Fref{fig:multi_exit_opt_exit_vs_exit_and_model}. We can see that optimizing the model together with the exits (red) achieves better accuracies for all exits except the final one. We argue that the drop in accuracy of the last exit is due to the interactions between the earlier exits and the model, where optimizing the early exit might influence the model in a way that harms the performance on the final exit.
Using the exit with a depthwise separable convolution and two linear layers, we now explore the test accuracies of our multi-exit model for two scenarios: (i) the model is pretrained and fixed before the exits are attached, and only the exits are optimized, and (ii) the model is optimized together with the exits. Both scenarios are compared in \Fref{fig:multi_exit_opt_exit_vs_exit_and_model}. We can see that optimizing the model together with the exits (red) achieves better accuracies at all exits except the final one. We argue that the drop in accuracy at the last exit is due to interactions between the earlier exits and the model: optimizing the early exits might influence the model in a way that harms the accuracy of the final exit.
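The two scenarios differ only in which parameters receive gradients. Below is a minimal PyTorch-style sketch of one training step, assuming the joint objective is the sum of the per-exit classification losses; the helper \texttt{intermediate\_outputs} and all names are assumptions.
\begin{verbatim}
import torch.nn.functional as F

def training_step(model, exits, batch, optimizer, freeze_model):
    # Scenario (i):  freeze_model=True, only the exits are optimized.
    # Scenario (ii): freeze_model=False, model and exits train jointly.
    for p in model.parameters():
        p.requires_grad = not freeze_model

    x, targets = batch
    features = model.intermediate_outputs(x)  # hypothetical helper that
                                              # returns one tensor per exit
    # Joint objective: sum of the classification losses of all exits.
    loss = sum(F.cross_entropy(exit_head(f), targets)
               for exit_head, f in zip(exits, features))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
\end{verbatim}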
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{\pwd/plots/multi_exit_opt_exit_vs_exit_and_model.png}
\caption{Test accuracies at certain exits of a multi-exit architecture when only the exits are trained and the model parameters itself are fixed (blue) vs training the exits together with the model parameters (red). The exit-type used in this figure incudes a depthwise-separable convolution followed by two linear layers.}
\caption{Test accuracies at certain exits of a multi-exit model when only the exits are trained and the pretrained model parameters are fixed (blue) vs. training the exits together with the model parameters (red). The exit topology used in this figure includes a depthwise separable convolution followed by two linear layers.}
\label{fig:multi_exit_opt_exit_vs_exit_and_model}
\end{figure}
Next, we look at knowledge distillation to improve training of multi-exit architectures. \Fref{fig:multi_exit_opt_exit_model_with_without_dist.png} show the test accuracies at certain exits. Subfigure (a) shows the test accuracies for models where the exits are optimized together with the model. We can see using knowledge distillation improves the accuracy of all exits. A similar observation can be made in subfigure (b) where only the exits are optimized. Clearly, knowledge distillation improves the performance in this case as well.
Next, we look at knowledge distillation to improve the accuracy of multi-exit models. \Fref{fig:multi_exit_opt_exit_model_with_without_dist.png} shows the test accuracies at certain exits. Subplot (a) shows the test accuracies for models where the exits are optimized together with the model. We can see that knowledge distillation improves the accuracy of all exits. A similar observation can be made in subplot (b), where the model is pretrained and then fixed and only the exits are optimized. Clearly, knowledge distillation improves the accuracy in this case as well.
\begin{figure}
\centering
\subfigure[]{\includegraphics[width=0.8\textwidth]{\pwd/plots/multi_exit_opt_exit_model_no_dist.png}}
\subfigure[]{\includegraphics[width=0.8\textwidth]{\pwd/plots/multi_exit_opt_exit_model_with_dist.png}}
\caption{Test accuracies at certain exits of a multi-exit architecture comparing distillation based training vs ordinary training. In subplot (a), the exits are trained together with the model parameters. In subplot (b), only the exits are trained and the model parameters itself are fixed. The exit-type used in both subplots incudes a depthwise-separable convolution followed by two linear layers.}
\caption{Test accuracies at certain exits of a multi-exit model comparing distillation-based training vs. ordinary training. In subplot (a), the exits are trained together with the model parameters. In subplot (b), only the exits are trained and the pretrained model parameters are fixed. The exit topology used in both subplots includes a depthwise separable convolution followed by two linear layers.}
\label{fig:multi_exit_opt_exit_model_with_without_dist.png}
\end{figure}
Finally, we look at the resource consumption of the multi-exit architecture. Of course, adding exit layers to an existing \gls{dnn} increases the memory requirements. In our case, the exits alone need 123.92 k parameters while the model itself without exits uses only 75.71 k parameters. This results in increase in memory of more than 2.5 times and may be detrimental for some applications. However, with an increase in memory comes also more flexibility during the forward pass at runtime. \Fref{tab:exit_params_ops} shows the number of parameters and \gls{madds} for every exit as well as the cumulative number \gls{madds} to compute a prediction at certain exits. In the ideal case, more than two times the number of \gls{madds} can be save by using the predication at exit 1 instead of passing the data through the whole network up to the final exit (i.e. exit 5).
Finally, we look at the resource consumption of the multi-exit model. Of course, adding exit layers to an existing \gls{dnn} increases the memory requirements. In our case, the exits alone need \SI{123.92}{\kilo\nothing} parameters while the model itself without exits uses only \SI{75.71}{\kilo\nothing} parameters. This results in an increase in memory consumption of more than 2.5 times, which may be detrimental for some applications. However, with the increase in memory consumption comes more flexibility during the forward pass at runtime. \Fref{tab:exit_params_ops} shows the number of parameters and \gls{madds} of every exit as well as the cumulative number of \gls{madds} needed to compute a prediction at a certain exit. In the ideal case, more than two times the number of \gls{madds} can be saved by using the prediction at exit 1 instead of passing the input through the whole model up to the final exit.
\begin{table}
\begin{center}
\begin{tabular}{lccc}
@@ -70,14 +70,14 @@ Finally, we look at the resource consumption of the multi-exit architecture. Of
Exit & \# Parameters & \# \gls{madds} & \# \gls{madds} \\
& & & Cumulative \\
\midrule
Exit 1 & 30.73 k & 1.59 M & 5.35 M\\
Exit 2 & 30.73 k & 1.59 M & 5.84 M\\
Exit 3 & 30.73 k & 1.59 M & 6.20 M \\
Exit 4 & 31.73 k & 0.56 M & 6.75 M \\
Exit 5 & - & - & 10.82 M\\
Exit 1 & \SI{30.73}{\kilo\nothing} & \SI{1.59}{\mega\nothing} & \SI{5.35}{\mega\nothing} \\
Exit 2 & \SI{30.73}{\kilo\nothing} & \SI{1.59}{\mega\nothing} & \SI{5.84}{\mega\nothing} \\
Exit 3 & \SI{30.73}{\kilo\nothing} & \SI{1.59}{\mega\nothing} & \SI{6.20}{\mega\nothing} \\
Exit 4 & \SI{31.73}{\kilo\nothing} & \SI{0.56}{\mega\nothing} & \SI{6.75}{\mega\nothing} \\
Exit 5 & - & - & \SI{10.82}{\mega\nothing} \\
\bottomrule
\end{tabular}
\end{center}
\caption{Number of parameters and \gls{madds} of every exit. The cumulative \gls{madds} is the number of \gls{madds} needed to compute a prediction at a certain exit. }
\caption{Number of parameters and \gls{madds} of every exit. The cumulative \gls{madds} is the number of \gls{madds} needed to compute a prediction at a certain exit.}
\label{tab:exit_params_ops}
\end{table}
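The cumulative numbers in \Fref{tab:exit_params_ops} decompose into the backbone cost up to the block that feeds an exit plus the cost of the exit itself; in our notation (the symbols are introduced here for illustration),
\begin{equation*}
    \mathrm{MAdds}_{\mathrm{cum}}(i) = \sum_{j=1}^{i} \mathrm{MAdds}_{\mathrm{block}}(j) + \mathrm{MAdds}_{\mathrm{exit}}(i).
\end{equation*}
For example, reaching exit 1 requires $\SI{5.35}{\mega\nothing} - \SI{1.59}{\mega\nothing} = \SI{3.76}{\mega\nothing}$ \gls{madds} in the backbone before the exit itself is evaluated.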
@@ -19,7 +19,7 @@ Our overparameterized model has the same number of channels across all layers ex
\begin{center}
\begin{tabular}{clcccc}
\toprule
Stage & Operation & Kernel size & Stride (H, W) & Channels & Layers \\
Stage & Operation & Kernel Size & Stride (H, W) & Channels & Layers \\
\midrule
(i) & Conv & 5$\times$11 & 1, 2 & 72 & 1 \\
(ii) & MBC[e] / Identity & [k]$\times$[k] & 2, 2 & 72 & 12 \\
......