There are many aspects to consider when designing resource efficient \glspl{dnn} for \gls{kws}. In this chapter, we introduce five methods that address two very important aspects of resource efficient \glspl{dnn}: the amount of memory needed to store the parameters and the number of \glspl{madds} per forward pass.

The amount of memory needed to store the parameters depends on two factors, the number of parameters and the associated bit width of the parameters. Decreasing the number of parameters is possible by either employing resource efficient convolutional layers or by establishing an accuracy-size tradeoff using a resource efficient \gls{nas} approach. \Fref{sec:kws:conv} deals with resource efficient convolutional layers whereas \Fref{sec:kws:nas} deals with the resource efficient \gls{nas} approach. Quantizing weights and activations to decrease the amount of memory needed is discussed in \Fref{sec:kws:quantization}.

The number of \glspl{madds} per forward pass is another aspect that cannot be neglected in resource efficient \gls{kws} applications. In most cases, the number of \glspl{madds} is tied to the latency of a model; by decreasing the number of \glspl{madds}, we also implicitly decrease its latency. Decreasing the number of \glspl{madds} is again possible by either employing resource efficient convolutional layers or by establishing an accuracy-size tradeoff using a resource efficient \gls{nas} approach. Another method to decrease the \glspl{madds} of a model is to replace the costly \gls{mfcc} feature extractor by a convolutional layer. This method is explained in \Fref{sec:kws:endtoend}. Finally, \Fref{sec:kws:multi_exit} introduces multi-exit models. As we will see, multi-exit models allow us to dynamically reduce the number of \glspl{madds} based on either the available computational resources or the complexity of the classification task.

Multi-exit models are an extension of traditional \glspl{dnn}. Instead of computing a single prediction, multi-exit models compute $M$ predictions. A multi-exit model can be built from a traditional \gls{dnn} by extending the \gls{dnn} with so-called early exits. Early exits are simply convolutional classifiers that take the output of an intermediate layer to compute a prediction. There are also models designed specifically with the multi-exit paradigm in mind, for example the Multi-Scale DenseNet introduced in \cite{Huang18}.

The predictions of multi-exit models are computed at different depths of the network and therefore, later exits usually have a larger capacity and produce more accurate predictions than early exits. However, early predictions need fewer resources to compute. Multi-exit models can be beneficial when the computational budget at test-time is not constant or not known in advance. The computational budget may not be constant if there are jobs running concurrently or the processor speed is set dynamically. If the computational budget is limited at some point, the available budget can be used in multi-exit models to compute an early prediction which would not be possible in traditional \glspl{dnn}.

In many classification tasks, certain samples are very easy to classify. For easy samples, the output of a \gls{dnn} will report a very high probability that the sample belongs to a certain class $c$. Therefore, for easy samples, it is usually sufficient to use an early prediction. Whether to use a certain early prediction or to compute a later, more accurate prediction can be decided at runtime by observing the probability distribution of the prediction: if the probability of a certain class $c$ is higher than some threshold, the computation can be stopped; otherwise, the computation is continued. Using this approach, the computational resources can be utilized very efficiently.
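The threshold-based exit rule described above can be sketched in a few lines of Python. This is a minimal sketch: the exit functions and the threshold value are illustrative assumptions, not part of a specific \gls{kws} framework.

```python
def early_exit_predict(x, exits, threshold=0.9):
    """Evaluate the exits in order of increasing depth and stop as
    soon as the top class probability exceeds the threshold."""
    probs = None
    for exit_fn in exits:
        probs = exit_fn(x)            # class probability distribution
        if max(probs) >= threshold:   # easy sample: stop the computation
            break                     # hard sample: continue to the next exit
    return probs.index(max(probs))
```

For an easy sample the first exit is already confident enough, while a hard sample propagates to a later, more accurate exit.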

In multi-exit models, there are usually two modes of prediction: anytime prediction and budget prediction \cite{PhuongL19}. In anytime prediction, a crude initial prediction is produced and then gradually improved as long as the computational budget allows it. Once the computational budget is depleted, the multi-exit model outputs the prediction of the last evaluated exit or an ensemble of all evaluated exits. In budget prediction, the computational budget is constant and therefore a single exit that fits the computational budget is selected. Since only one exit is selected in budget prediction mode, only one prediction is reported.
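In an anytime setting, the exits are evaluated until the budget is depleted. The sketch below is illustrative and assumes a known, hypothetical per-exit cost:

```python
def anytime_predict(x, exits, costs, budget):
    """Gradually refine the prediction until the remaining budget
    no longer covers the cost of the next exit."""
    probs = None
    spent = 0
    for exit_fn, cost in zip(exits, costs):
        if spent + cost > budget:     # budget depleted: stop refining
            break
        probs = exit_fn(x)            # prediction of the current exit
        spent += cost
    return probs                      # last prediction that fit the budget
```

With a small budget only the crude first prediction is returned; a larger budget yields the refined prediction of a later exit.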

...

...

Multi-exit models are usually trained by attaching a loss function to every exit.

\caption{Self-distillation as used in \cite{PhuongL19} to train a multi-exit model. The output of the final layer produces the soft targets that are then used to compute the distillation losses in the early exits. Figure from \cite{PhuongL19}.}

During architecture search, the training of the architecture parameters $\alpha_i$ alternates with the training of the weight parameters.

\label{fig:proxyless_nas_learning}

\end{figure}

Updating the architecture parameters via backpropagation requires computing $\partial\mathcal{L}/\partial\alpha_i$ which is not defined due to the sampling in (\ref{eq:proxyless_gates}). To solve this issue, the gradient of $\partial\mathcal{L}/\partial p_i$ is estimated as $\partial\mathcal{L}/\partial g_i$. This estimation is an application of the \gls{ste} explained in \Fref{sec:kws:quantization}. With this estimation, backpropagation can be used to compute $\partial\mathcal{L}/\partial\alpha_i$ since the gates $g_i$ are part of the computation graph and thus $\partial\mathcal{L}/\partial g_i$ can be calculated using backpropagation.
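As a numerical sketch of this estimation, assuming a softmax parameterization $p = \mathrm{softmax}(\alpha)$ of the path probabilities, the estimated gradient $\partial\mathcal{L}/\partial g_i$ can be propagated through the softmax Jacobian to obtain $\partial\mathcal{L}/\partial\alpha_i$:

```python
import numpy as np

def softmax(alpha):
    e = np.exp(alpha - alpha.max())
    return e / e.sum()

def arch_param_grad(alpha, grad_g):
    """STE estimate: use dL/dg in place of the undefined dL/dp and
    backpropagate it through the softmax to obtain dL/dalpha."""
    p = softmax(alpha)
    jacobian = np.diag(p) - np.outer(p, p)  # dp/dalpha of the softmax
    return jacobian @ grad_g                # estimated dL/dalpha
```

Note that the resulting gradient sums to zero, reflecting the shift invariance of the softmax.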

%\newsubsection{Single-Path NAS}{}

%Single-Path NAS was proposed in \cite{Stamoulis2019} as a hardware-efficient method for \gls{nas}. Compared to ProxylessNAS, Single-path NAS uses one single-path overparameterized network to encode all architectural decisions with shared convolutional kernel parameters. Parameter sharing drastically decreases the number of trainable parameters and therefore also reduces the search cost in comparison to ProxylessNAS.

The binary quantizer maps a real-valued number $x \in \REAL$ to a quantized number $x_q$ according to
\begin{equation}
x_q = \begin{cases}
\alpha & \text{if } x \geq 0\\
-\alpha & \text{otherwise}
\end{cases}.
\end{equation}
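A minimal sketch of such a binary quantizer, assuming the common sign-based mapping to $\pm\alpha$:

```python
import numpy as np

def binary_quantize(w, alpha):
    """Map every real-valued entry of w to +alpha or -alpha."""
    return np.where(w >= 0.0, alpha, -alpha)
```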

In practice, the quantized integer weights are stored together with the scaling factor $\alpha$. Both the integer weights and the scaling factor can then be used to perform efficient and fast arithmetic computations with other quantized numbers. The scaling factor $\alpha$ is very important since it increases the performance of quantized \glspl{dnn} substantially. To compute the scaling factor $\alpha$ we consider two methods: a statistics-based approach and a parameter-based approach.

In the statistics-based approach, the maximum absolute weight value is used to compute the scaling factor. Given an auxiliary weight tensor $\vm{W}$ from which the quantized weight tensor is computed, the scaling factor is computed according to

\begin{equation}

\alpha = \max(|\vm{W}|).

\end{equation}
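A sketch of the statistics-based approach; the mapping of the scaled weights to integer levels is an illustrative assumption for a $k$ bit signed quantizer, not necessarily the exact quantizer used here:

```python
import numpy as np

def statistics_scale(w):
    """Statistics-based scaling factor: alpha = max(|W|)."""
    return np.abs(w).max()

def integer_quantize(w, bits):
    """Quantize w to signed integers in [-(2**(bits-1)-1), 2**(bits-1)-1];
    the integers are stored together with the scaling factor alpha."""
    alpha = statistics_scale(w)
    levels = 2 ** (bits - 1) - 1
    return np.round(w / alpha * levels).astype(int), alpha
```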

In the parameter-based approach, the scaling factor $\alpha$ is defined as a learnable parameter that is optimized using gradient descent.

\Fref{fig:ste_quant} illustrates weight quantization using the \gls{ste} on a typical convolutional layer (without batch normalization). During the forward pass, the quantizer $Q$ performs the quantization of the auxiliary weight tensor $\mathbf{W}^l$ to obtain the quantized weight tensor $\mathbf{W}^l_q$. The quantized weight tensor is then used in the convolution. During the backward pass, the \gls{ste} replaces the derivative of the quantizer $Q$ by the derivative of the identity function. In this way, the gradient can propagate back to $\mathbf{W}^l$.
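A minimal numerical sketch of one such training step, with a scalar multiplication standing in for the convolution (an illustrative simplification):

```python
def ste_weight_update(w_aux, x, grad_out, lr):
    """Forward with the sign-quantized weight, backward as if the
    quantizer were the identity (the STE), then update w_aux."""
    w_q = 1.0 if w_aux >= 0 else -1.0   # forward: binary quantization
    y = w_q * x                          # stand-in for the convolution
    grad_wq = grad_out * x               # dL/dW_q from backpropagation
    grad_w = grad_wq                     # STE: dL/dW := dL/dW_q
    return y, w_aux - lr * grad_w        # update the auxiliary weight
```

Even though the quantizer itself has zero gradient almost everywhere, the auxiliary weight still receives a useful update.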

\caption{Weight quantization using the \gls{ste} on a typical convolutional layer (without batch normalization). Red boxes have zero gradients whereas green boxes have non-zero gradients. Weight updates are performed to the blue circle. During the forward pass, the weight tensor $\mathbf{W}^l$ is quantized to obtain the $k$ bit weight tensor $\mathbf{W}^l_q$ used in the convolution. The activation function is then applied to the output of the convolution $\mathbf{a}^{l+1}$ to obtain $\mathbf{x}^{l+1}$ which is the input tensor of the subsequent layer. During the backward pass, the \gls{ste} replaces the derivative of the quantizer by the derivative of the identity function \textit{id}. Figure from \cite{Roth19}.}

For the following experiments, we selected one of the models previously found by the resource efficient \gls{nas} approach.

\label{fig:quant_model}

\end{figure}

Quantization aware training is performed using different quantizers for weights and activations. In both cases, a binary quantizer is used for 1 bit weights whereas an integer quantizer is used for 2 to 8 bits. The scaling factor for weight quantization is obtained using the statistics-based approach where the absolute maximum weight value of the auxiliary weight tensor is used as the scaling parameter $\alpha$. Scaling the weights is performed individually per channel, i.e., every channel has its own scaling factor. The scaling factor for activation quantization is obtained by defining the scaling factor as a learnable parameter that is optimized using gradient descent. For activations, a single scaling factor per tensor is used. We select different activation functions depending on the quantizer used. For binary activations, we use the $\mathrm{sign}$ activation whereas for 2 to 8 bit activations, we use the default \gls{relu} activation.

All models are trained with the same hyperparameters to ensure that the results are comparable. We train the models for 120 epochs with an initial learning rate of 0.001. The learning rate is decayed according to a cosine schedule over the course of training. We use a batch size of 100 and optimize the model using the Adam optimizer. Label smoothing is applied to the targets with a factor of 0.1. \glspl{mfcc} are extracted similarly to the experiment in the previous section (\Fref{sec:results:nas}).
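The cosine schedule mentioned above can be sketched as follows, assuming decay from the initial learning rate towards zero over the 120 epochs:

```python
import math

def cosine_lr(epoch, total_epochs=120, initial_lr=0.001):
    """Cosine learning-rate decay from initial_lr at epoch 0 towards 0."""
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))
```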