% **************************************************************************************************
% **************************************************************************************************
\gls{kws} with \glspl{dnn} is typically performed on hand-crafted speech features such as \glspl{mfcc} that are extracted from raw audio waveforms. Extracting \glspl{mfcc} involves computing the Fourier transform. However, the Fourier transform is computationally expensive and might exceed the capabilities of resource-constrained devices. Therefore, Ibrahim et al. \cite{Ibrahim2019} proposed to use simpler speech features derived in the time domain. These features, referred to as \gls{mfsts}, are obtained by computing constrained lag autocorrelations on overlapping speech frames to form a 2D map. A \gls{tcn} \cite{Chiu2018} is then used to classify keywords based on the \gls{mfsts}.
However, hand-crafted speech features such as \gls{mfsts} and \glspl{mfcc} may not be optimal for \gls{kws}. Therefore, recent works have proposed to feed \glspl{dnn} directly with raw audio waveforms. In \cite{Ravanelli2018}, a \gls{cnn} for speaker recognition is proposed whose first layer is encouraged to learn parametrized sinc functions as kernels. This layer is referred to as the SincConv layer. During training, a low and a high cutoff frequency are determined per kernel. Training the SincConv layer therefore yields a custom filter bank that is specifically tailored to the desired application. SincConvs have also recently been applied to \gls{kws} tasks \cite{Mittermaier2020}.
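The kernels of the SincConv layer in \cite{Ravanelli2018} are (up to windowing) parametrized band-pass filters of the form
\begin{equation*}
    g[n, f_1, f_2] = 2 f_2 \operatorname{sinc}(2 \pi f_2 n) - 2 f_1 \operatorname{sinc}(2 \pi f_1 n),
\end{equation*}
where $\operatorname{sinc}(x) = \sin(x)/x$ and only the lower and upper cutoff frequencies $f_1$ and $f_2$ of each kernel are learned.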
In this section, we will search for end-to-end \gls{kws} models using \gls{nas}. The end-to-end models obtained by \gls{nas} use a SincConv as the input layer and are supplied with raw audio waveforms as input instead of hand-crafted speech features such as \gls{mfsts} or \glspl{mfcc}. We will compare models obtained by \gls{nas} using SincConvs with models using ordinary 1D convolutions to assess whether SincConvs pose any benefit over 1D convolutions. Furthermore, this section will compare end-to-end models with the models using \glspl{mfcc} from \Fref{sec:results:nas} in terms of accuracy, number of \gls{madds} and number of parameters.
\newsubsection{Experimental Setup}{results:endtoend:setup}
\Fref{tab:model_structure_endtoend} shows the overparameterized model used in this section. This model is fed with raw audio waveforms instead of \glspl{mfcc}. It consists of five stages: two input stages (i) and (ii), two intermediate stages (iii) and (iv), and one output stage (v). Stages (i), (ii) and (v) are fixed, whereas stages (iii) and (iv) are optimized using \gls{nas}. \glspl{mbc} are again used as the main building blocks in stages (iii) and (iv). Stride is only applied to the first convolution of each stage.
During \gls{nas}, we allow \glspl{mbc} with expansion rates $e \in \{1,2,3,4,5,6\}$ and kernel sizes $k \in \{3,5,7\}$ for selection. We also include the zero operation, which effectively results in an identity layer. For blocks where the input feature map size equals the output feature map size, we include skip connections. For the tradeoff parameter, we select $\beta \in \{0, 1, 2, 4, 8, 16\}$.
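As an illustration, the following minimal Python sketch enumerates the resulting candidate operations per searchable block in stages (iii) and (iv). The operation names are purely illustrative and do not correspond to the actual implementation; skip connections, where applicable, are added on top of these candidates.
\begin{verbatim}
# Illustrative sketch: candidate operations per searchable block.
# Operation names are hypothetical placeholders.
from itertools import product

expansion_rates = [1, 2, 3, 4, 5, 6]
kernel_sizes = [3, 5, 7]

candidates = [("MBC", e, k) for e, k in product(expansion_rates, kernel_sizes)]
candidates.append(("Zero", None, None))  # zero operation -> identity layer

print(len(candidates))  # 19 candidate operations per block
\end{verbatim}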
\begin{table}
\begin{center}
\begin{tabular}{llcccc}
\toprule
Stage & Operation & Kernel size & Stride (H, W) & Channels & Layers \\
\midrule
(i) & SincConv & 400 & 160 & 1 & 1 \\
(ii) & Conv & 3$\times$3 & 2, 2 & 10 & 1 \\
(iii) & MBC[$e$] / Identity & [$k$]$\times$[$k$] & 2, 2 & 20 & 3 \\
(iv) & MBC[$e$] / Identity & [$k$]$\times$[$k$] & 2, 2 & 40 & 3 \\
(v) & Conv & 1$\times$1 & 1, 1 & 80 & 1 \\
\bottomrule
\end{tabular}
\end{center}
\caption{Overparameterized end-to-end model used for \gls{nas}. Stages (i), (ii) and (v) are fixed. For stages (iii) and (iv), the parameters $e$ (expansion rate) and $k$ (kernel size), as well as whether an identity layer is selected, are optimized using \gls{nas}.}
\label{tab:model_structure_endtoend}
\end{table}
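To clarify the MBC[$e$] notation in \Fref{tab:model_structure_endtoend}, the following is a minimal PyTorch sketch of a mobile inverted bottleneck block with expansion rate $e$ and kernel size $k$. It is a simplified illustration under our own assumptions and not the exact implementation used in this work.
\begin{verbatim}
# Minimal sketch of a mobile inverted bottleneck convolution (MBC) block.
# Simplified illustration only; not the exact implementation used here.
import torch.nn as nn

class MBCBlock(nn.Module):
    def __init__(self, c_in, c_out, e, k, stride=1):
        super().__init__()
        c_mid = c_in * e  # expansion rate e
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),               # 1x1 expansion
            nn.BatchNorm2d(c_mid), nn.ReLU(),
            nn.Conv2d(c_mid, c_mid, k, stride=stride,
                      padding=k // 2, groups=c_mid, bias=False),  # kxk depthwise
            nn.BatchNorm2d(c_mid), nn.ReLU(),
            nn.Conv2d(c_mid, c_out, 1, bias=False),               # 1x1 projection
            nn.BatchNorm2d(c_out),
        )
        # Skip connection only if input and output shapes match.
        self.use_skip = (stride == 1 and c_in == c_out)

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y
\end{verbatim}
In this sketch, an MBC4 block with kernel size 5 inside stage (iii) would correspond to \texttt{MBCBlock(c\_in=20, c\_out=20, e=4, k=5)}.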
Before performing classification, we select a window length of \SI{25}{\milli\second} and a hop length of \SI{10}{\milli\second} to split the raw audio waveforms into frames. Therefore, at a sampling frequency of $f_s = \SI{16}{\kilo\hertz}$, the SincConv filter length is 400 and the hop length is 160 samples. Before filtering with the SincConv, a Hamming window is applied to the frames.
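For clarity, these values follow directly from the window parameters and the sampling frequency:
\begin{equation*}
    \SI{25}{\milli\second} \cdot \SI{16}{\kilo\hertz} = 400 \qquad \text{and} \qquad \SI{10}{\milli\second} \cdot \SI{16}{\kilo\hertz} = 160 .
\end{equation*}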
We pretrain the architecture blocks for 40 epochs at a learning rate of 0.05 before \gls{nas} is performed. During pretraining, architecture blocks are randomly sampled and trained on one batch of the training set; sampling and training are repeated until 40 epochs of training have passed. Optimization is performed with stochastic gradient descent using a mini-batch size of 100. After pretraining, we perform the architecture search for 120 epochs with an initial learning rate of 0.2. The learning rate is decayed according to a cosine schedule \cite{Cai2019}. We use a batch size of 100 and optimize the model using the Adam optimizer. Label smoothing is applied to the targets with a factor of 0.1.
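As a rough illustration, the search-phase optimization settings described above could be configured as in the following PyTorch sketch. The network and data are dummy placeholders and the snippet is not the actual training code used in this work.
\begin{verbatim}
# Sketch of the search-phase optimization settings (assuming PyTorch).
# The model and data below are dummy placeholders, not the actual setup.
import torch

model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(16000, 12))    # placeholder network
train_loader = [(torch.randn(100, 16000),                  # one dummy mini-batch
                 torch.randint(0, 12, (100,)))]            # of size 100

criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing 0.1
optimizer = torch.optim.Adam(model.parameters(), lr=0.2)    # initial learning rate 0.2
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=120)                                    # cosine decay over 120 epochs

for epoch in range(120):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
\end{verbatim}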
After an architecture has been selected by the \gls{nas} procedure, it is trained until convergence to optimize its performance. We use the same hyperparameters as in the architecture search.
\newsubsection{Results and Discussions}{results:endtoend:discussion}
\Fref{fig:sincconv_vs_conv1d} shows the test accuracy vs the number of \gls{madds} of end-to-end \gls{kws} models using a Conv1D or a SincConv input layer. The models depicted were obtained by \gls{nas}. The number of parameters corresponds to the circle area. We can clearly see that models with SincConvs outperform models with 1D convolutions. Models with 1D convolutions fail to generalize well, regardless of the number of channels selected in the first layer, whereas models with SincConvs generalize well. Furthermore, the test accuracy of SincConv models increases with the number of channels in the first layer, as was the case in \Fref{sec:results:nas}, where the test accuracy of the models increased with the number of \glspl{mfcc}. We conclude from these results that it is beneficial to introduce some prior knowledge into the first layer of an end-to-end \gls{kws} model. In the case of SincConvs, this prior knowledge is that the kernels are constrained to mimic a bandpass filter bank.
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{\pwd/plots/sincconv_vs_conv1d.png}
\caption{Test accuracy vs number of \gls{madds} of end-to-end \gls{kws} models using a Conv1D or a SincConv input layer. All models depicted were obtained using \gls{nas}. The number of parameters corresponds to the circle area.}
\label{fig:sincconv_vs_conv1d}
\end{figure}
After assessing the performance differences between models with 1D convolutions and SincConvs, we now compare SincConv models to the models using \glspl{mfcc} from \Fref{sec:results:nas}. \Fref{fig:sincconv_vs_mfcc} shows the test accuracy vs the number of \gls{madds} of end-to-end \gls{kws} models compared to \gls{kws} models using \glspl{mfcc}. Only Pareto-optimal models are shown. The number of parameters corresponds to the circle area. We observe that SincConv models contribute to the Pareto frontier. However, SincConv models do not reach the same performance as \gls{mfcc} models when the number of \gls{madds} is matched. We can again observe the tradeoff between model size and accuracy, where larger models achieve a higher test accuracy than smaller models.
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{\pwd/plots/sincconv_vs_mfcc.png}
\caption{Test accuracy vs number of \gls{madds} of end-to-end \gls{kws} models using SincConvs compared to \gls{kws} models using \glspl{mfcc}. All models depicted in this figure are Pareto-optimal models and were obtained using \gls{nas}. The number of parameters corresponds to the circle area.}
\label{fig:sincconv_vs_mfcc}
\end{figure}
\Fref{fig:nas_td_best} (a), (b) and (c) shows the topologies of some of the models from \Fref{fig:sincconv_vs_mfcc}. The notation is the same as in \Fref{sec:results:nas}. In contrast to the \gls{mfcc} models, these models are fully convolutional and do not require any hand-crafted feature extraction: the raw audio waveform is processed by the SincConv layer, which produces either 40, 60 or 80 channels, and its output is then fed to a standard \gls{cnn} structure. The model accuracy, the number of \gls{madds}, the number of parameters and the tradeoff parameter $\beta$ of these models are listed in \Fref{tab:nas_td_best}.
\begin{figure}
\centering
\subfigure[]{\includegraphics[width=0.24\textwidth]{\pwd/plots/pless_sweep_td_SincConv_wm1_f40_3.png}}
\subfigure[]{\includegraphics[width=0.24\textwidth]{\pwd/plots/pless_sweep_td_SincConv_wm1_f60_2.png}}
\subfigure[]{\includegraphics[width=0.24\textwidth]{\pwd/plots/pless_sweep_td_SincConv_wm1_f80_0.png}}
\caption{Model structure of selected models: the best performing Pareto-optimal models with 40 (a), 60 (b) and 80 (c) SincConv kernels, respectively.}
\label{fig:nas_td_best}
\end{figure}
\begin{table}
\centering
\begin{tabular}{ccccc}
\toprule
SincConv kernels & $\beta$ & \gls{madds} & Parameters & Test Accuracy\\
\midrule
40 & 4 & \SI{9.70}{\mega\nothing} & \SI{70.47}{\kilo\nothing} & \SI{94.39}{\percent} \\
60 & 2 & \SI{18.13}{\mega\nothing} & \SI{91.41}{\kilo\nothing} & \SI{95.38}{\percent} \\
80 & 0 & \SI{29.18}{\mega\nothing} & \SI{117.65}{\kilo\nothing} & \SI{95.74}{\percent} \\
\bottomrule
\end{tabular}
\caption{Test accuracy, number of \gls{madds} and number of parameters of selected models: the best performing Pareto-optimal models with 40 (a), 60 (b) and 80 (c) SincConv kernels, respectively.}
\label{tab:nas_td_best}
\end{table}