% **************************************************************************************************
% **************************************************************************************************
%Motivation: What is KWS? Why is it needed? Why must it be resource efficient?\\
\glspl{iva} like Google's Assistant, Apple's Siri and Amazon's Alexa have gained a lot of popularity in recent years. \glspl{iva} provide an alternative interface for human-computer interaction aside from the more traditional interfaces such as mouse, keyboard, touchscreen and display. \glspl{iva} are capable of interpreting human speech, following spoken commands and replying via synthesized voices. Modern \glspl{iva} provide many functionalities, including responses to questions, email management, to-do list management, calendar management, home automation control and media playback control. Part of the success of \glspl{iva} can be attributed to the rapid advancements in \gls{nlp} and \gls{ai}, both being key technologies in the development and the application of \glspl{iva}.
For \glspl{iva} to fulfil requests, a complex pipeline of different technologies is necessary. First, \gls{asr} is employed to convert a spoken command into a text transcription. Then, natural language understanding is used to interpret the text transcription and extract the intention of the user. A dialogue manager then produces a response to the spoken command. Finally, text-to-speech is used to convert the response from the dialogue manager into spoken words that are supplied back to the user.
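The four stages described above can be sketched as a simple chain of function calls. This is only an illustrative sketch: all function names and return values below are hypothetical placeholders, not an actual \gls{iva} implementation.

```python
# Illustrative sketch of the IVA pipeline: ASR -> NLU -> dialogue manager -> TTS.
# All function bodies are hypothetical stand-ins, not a real implementation.

def asr(audio: bytes) -> str:
    """Convert spoken audio into a text transcription (placeholder)."""
    return "turn on the lights"

def nlu(transcript: str) -> dict:
    """Interpret the transcription and extract the user's intention (placeholder)."""
    return {"intent": "home_automation", "action": "lights_on"}

def dialogue_manager(intent: dict) -> str:
    """Produce a textual response to the interpreted command (placeholder)."""
    return "Okay, turning on the lights."

def tts(response: str) -> bytes:
    """Convert the response text back into spoken words (placeholder)."""
    return response.encode("utf-8")

def handle_request(audio: bytes) -> bytes:
    # Chain the four stages in the order described in the text.
    return tts(dialogue_manager(nlu(asr(audio))))
```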
Since \gls{asr} systems are typically very complex and computation-intensive, running \gls{asr} as part of an \gls{iva} in always-on mode results in a steady high energy consumption. This is especially problematic for mobile devices whose batteries are drained quickly when running \gls{asr} permanently.
A common solution is to run a low-cost \gls{kws} system that permanently listens only for a limited set of prespecified keywords. Upon detection of a keyword, a full \gls{asr} system is triggered which then listens for a rich set of user commands. The requirements of a \gls{kws} system are:
\begin{itemize}
\item The system should be resource efficient to mitigate the aforementioned energy problem,
\item it should run in real-time, and
\item it should be accurate to maintain a high user experience.
\end{itemize}
\newsection{Scope of this Thesis}{intro:scope}
%Scope: Was wird abgehandelt? Contributions?\\
In this thesis, we will focus on \gls{kws} as one crucial aspect of the \gls{iva} pipeline. Recently, \glspl{dnn} have become the state-of-the-art in \gls{kws}, slowly replacing the more traditional \glspl{hmm}. While \glspl{hmm} achieve reasonable performance, they are hard to train and computationally expensive at runtime. Because of these limitations of \glspl{hmm}, we opted to focus solely on \gls{dnn} based \gls{kws} models. In particular, we will focus on \glspl{cnn} for \gls{kws}.
With \glspl{dnn}, it is possible to obtain resource efficient \gls{kws} models with competitive performance. Of course, there is a trade-off between the resource efficiency of a model and its performance. We will explore this trade-off in depth using different methods from the literature, including \gls{nas}, weight and activation quantization, end-to-end models and multi-exit models.
\gls{kws} will be performed on the \gls{gsc} dataset, a public dataset published by Google to enable the comparison of \gls{kws} models. The \gls{gsc} consists of 1-second long audio files of spoken words from many different speakers. Our models will be trained and evaluated on 10 keyword classes labeled \enquote{yes}, \enquote{no}, \enquote{up}, \enquote{down}, \enquote{left}, \enquote{right}, \enquote{on}, \enquote{off}, \enquote{stop}, \enquote{go}. Two additional classes are added called \enquote{unknown} and \enquote{silence}, where the \enquote{unknown} class is a collection of unrelated keywords and the \enquote{silence} class includes no keywords and only background noise.
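The class setup described above can be written down directly. This is merely a sketch of the 12-class label set used for training and evaluation, not code from the thesis itself.

```python
# The ten GSC keyword classes used in this thesis ...
KEYWORDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]

# ... plus the two additional classes: "unknown" collects unrelated keywords,
# "silence" contains only background noise.
CLASSES = KEYWORDS + ["unknown", "silence"]
```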
%Outline: Thesis outline. Describe the chapters in full sentences.
The outline of this thesis is as follows: Chapter 2 provides the theoretical background to \glspl{dnn}. This chapter explains the various learning approaches, capacity, over- and underfitting, \glspl{mlp}, \glspl{cnn} and the training of \glspl{dnn}. Chapter 3 provides the theoretical background for resource efficient \gls{kws}. This chapter explains resource efficient convolutional layers, \gls{nas}, weight and activation quantization, end-to-end models and multi-exit models. The \gls{gsc} dataset, data augmentation and feature extraction are explained in Chapter 4. The experimental results of this thesis are presented and discussed in Chapter 5. Finally, Chapter 6 provides the conclusion to this thesis.