Commit f95538e7 authored by David Peter

Update endtoend.tex

parent a1f4fcd6
@@ -2,13 +2,12 @@
% **************************************************************************************************
\gls{kws} with \glspl{dnn} is typically performed on hand-crafted speech features such as \glspl{mfcc}, which are extracted from raw audio waveforms. Extracting \glspl{mfcc} involves computing the Fourier transform, which is computationally expensive and might exceed the capabilities of resource-constrained devices. Therefore, Ibrahim et al. \cite{Ibrahim2019} proposed using simpler speech features derived in the time domain. These features, referred to as \gls{mfsts}, are obtained by computing constrained-lag autocorrelations on overlapping speech frames to form a 2D map. A \gls{tcn} \cite{Chiu2018} is then used to classify keywords on \gls{mfsts}.
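As a rough sketch (the exact windowing, normalization and lag range are design choices in \cite{Ibrahim2019} and are not reproduced here), such a time-domain feature map can be obtained from per-frame autocorrelations restricted to a small number of lags,
\[
r_t[\tau] = \sum_{n} x_t[n]\, x_t[n+\tau], \qquad \tau = 0, 1, \dots, \tau_{\max},
\]
where $x_t$ denotes the $t$-th overlapping speech frame and $\tau_{\max}$ constrains the lag. Stacking the vectors $r_t$ over all frames yields the 2D map that is fed to the classifier.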
However, hand-crafted speech features such as \gls{mfsts} and \glspl{mfcc} may not be optimal for \gls{kws}. Therefore, recent works have proposed to feed a \gls{dnn} directly with raw audio waveforms. In \cite{Ravanelli2018}, a \gls{cnn} for speaker recognition is proposed whose first layer is encouraged to learn parametrized sinc functions as kernels. This layer is referred to as the SincConv layer. During training, a low and a high cutoff frequency are determined for each kernel. Training the SincConv layer therefore yields a custom filter bank that is specifically tailored to the desired application. SincConvs have also recently been applied to \gls{kws} tasks \cite{Mittermaier2020}.
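Concretely, each SincConv kernel in \cite{Ravanelli2018} is parametrized as a band-pass filter given by the difference of two sinc low-pass filters (up to windowing and normalization details),
\[
g[n; f_1, f_2] = 2 f_2 \operatorname{sinc}(2 \pi f_2 n) - 2 f_1 \operatorname{sinc}(2 \pi f_1 n), \qquad \operatorname{sinc}(x) = \frac{\sin(x)}{x},
\]
where the low cutoff frequency $f_1$ and the high cutoff frequency $f_2$ are the only trainable parameters per kernel.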
In this section, we will search for end-to-end \gls{kws} models using \gls{nas}. The end-to-end models obtained by \gls{nas} will use a SincConv as the input layer and will be supplied with raw audio waveforms instead of hand-crafted speech features such as \glspl{mfcc}. We will compare models obtained by \gls{nas} using SincConvs with models using ordinary 1D convolutions to assess whether SincConvs pose any benefit over ordinary 1D convolutions. Furthermore, this section will compare end-to-end models with the models using \glspl{mfcc} from \Fref{sec:results:nas} in terms of accuracy, number of \gls{madds} and number of parameters.
\newsubsection{Experimental Setup}{results:endtoend:setup}
\Fref{tab:model_structure_endtoend} shows the overparameterized model used in this section. This model is supplied with raw audio waveforms instead of \glspl{mfcc}. It consists of five stages: two input stages (i) and (ii), two intermediate stages (iii) and (iv), and one output stage (v). Stages (i), (ii) and (v) are fixed, whereas stages (iii) and (iv) are optimized using \gls{nas}. \glspl{mbc} are used as the main building blocks in stages (iii) and (iv). Stride is only applied to the first convolution of each stage.
During \gls{nas}, we allow \glspl{mbc} with expansion rates $e \in \{1,2,3,4,5,6\}$ and kernel sizes $k \in \{3,5,7\}$ for selection. We also include the zero operation, which effectively results in an identity layer. For blocks where the input feature map size is equal to the output feature map size, we include skip connections. For the tradeoff parameter we select $\beta \in \{0, 1, 2, 4, 8, 16\}$.
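To make the per-block search space concrete, the following minimal Python sketch enumerates the candidate operations described above (the names \texttt{candidate\_ops}, \texttt{mbc}, \texttt{zero} and \texttt{skip} are illustrative and not taken from the actual implementation):
\begin{verbatim}
from itertools import product

# Minimal sketch of the per-block candidate set used during NAS.
# e is the MBC expansion rate, k the kernel size.
def candidate_ops(in_size, out_size):
    ops = [("mbc", e, k) for e, k in product((1, 2, 3, 4, 5, 6), (3, 5, 7))]
    ops.append(("zero",))      # zero operation: block effectively becomes an identity layer
    if in_size == out_size:
        ops.append(("skip",))  # skip connection only when feature map sizes match
    return ops
\end{verbatim}
This yields $6 \times 3 = 18$ \gls{mbc} variants per block, plus the zero operation and, where applicable, a skip connection.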
@@ -78,6 +77,6 @@ After assessing the performance differences between models with 1D convolutions
80 & 0 & \SI{29.18}{\mega\nothing} & \SI{117.65}{\kilo\nothing} & \SI{95.74}{\percent} \\
\bottomrule
\end{tabular}
\caption{Test accuracy, \gls{madds} and number of parameters of selected models. The best-performing Pareto-optimal models with 40, 60 and 80 SincConv kernels were selected.}
\label{tab:nas_td_best}
\end{table}
\ No newline at end of file