Every machine learning algorithm needs a dataset to train and evaluate on. Since the goal of this thesis is to build \glspl{dnn} for \gls{kws}, the choice fell on the \gls{gsc} dataset \cite{Warden2018}. According to the the author of the \gls{gsc}, the \gls{gsc} is a public dataset of spoken words designed to help train and evaluate \gls{kws} systems. The \gls{gsc} was only introduced recently in 2018. However, the \gls{gsc} is now widely used as reference dataset and allows reproducible results for \gls{kws} systems. By the time of the writing of this thesis, there exist two versions of the \gls{gsc}. The first version contains approximately 65k samples over 30 classes while the second version contains approximately 110k samples over 35 classes. In this thesis, we will use the first version of the \gls{gsc} exclusively.

Every machine learning algorithm needs a dataset to train and evaluate on. Since the goal of this thesis is to build \glspl{dnn} for \gls{kws}, the choice fell on the \gls{gsc} dataset \cite{Warden2018}. According to the author, the \gls{gsc} is a public dataset of spoken words designed to help train and evaluate \gls{kws} systems. The \gls{gsc} was only introduced recently in 2018. However, the \gls{gsc} is now widely used as reference dataset and allows reproducible results for \gls{kws} systems. By the time of writing this thesis, there have been two versions of the \gls{gsc}. The first version contains approximately 65k samples over 30 classes while the second version contains approximately 110k samples over 35 classes. In this thesis, we will use the first version of the \gls{gsc} exclusively.

\Fref{sec:dataset:classes} lists the classes used for classification as well as the train-test-validation split used in this thesis. Then \Fref{sec:dataset:augmentation} explains the data augmentation that is employed during training. Finally, \Fref{sec:dataset:feature_extraction} deals with feature extraction, in particular the extraction of \glspl{mfcc}.

where $x(n)$ are the samples of a frame and $N$ is the number of points used to compute the \gls{dft}.

where $x(n)$ are the samples of a frame and $N$ is the number of points used to compute the \gls{dft} and $k$ is the frequency bin.

\newsubsubsectionNoTOC{(iv) Mel spectrum}

The Mel spectrum is computed by applying a set of triangular bandpass filters, called the Mel-filter bank, to the magnitude spectrum $|X(k)|^2$. The Mel scale describes the relationship between the physical frequency of a tone and the frequency that is perceived by humans. Below \SI{1}{\kilo\hertz}, there is a linear relationship. Above \SI{1}{\kilo\hertz}, the relationship becomes logarithmic. To convert from a physical frequency to the frequency in the Mel scale there exists the following relationship