Commit 24f895ca authored by David Peter

Remove unused acronyms

parent 6c91d484
@@ -13,10 +13,6 @@
\newacronym[]{ste}{STE}{Straight-Through Estimator}
\newacronym[]{nas}{NAS}{Neural Architecture Search}
\newacronym[]{mfsts}{MFSTS}{Multi-Frame Shifted Time Similarity}
\newacronym[]{tcn}{TCN}{Temporal Convolutional Neural Network}
\newacronym[shortplural=RNNs, longplural=Recurrent Neural Networks]{rnn}{RNN}{Recurrent Neural Network}
\newacronym[]{asr}{ASR}{Automatic Speech Recognition}
\newacronym[]{ai}{AI}{Artificial Intelligence}
\newacronym[shortplural=IVAs, longplural=Intelligent Virtual Assistants]{iva}{IVA}{Intelligent Virtual Assistant}
\newacronym[]{nlp}{NLP}{Natural Language Processing}
\newacronym[shortplural=HMMs, longplural=Hidden Markov Models]{hmm}{HMM}{Hidden Markov Model}
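These entries are consumed by the glossaries package: \gls{key} prints the long form with the short form in parentheses on first use and only the short form afterwards, while \glspl{key} uses the plural forms supplied via shortplural and longplural. A minimal, purely illustrative sketch (the sentence itself is not taken from the thesis):

% first use expands to "Hidden Markov Models (HMMs)", later uses print "HMM"/"HMMs"
\Glspl{hmm} are generative sequence models; a separate \gls{hmm} can be trained for every keyword.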
% **************************************************************************************************
% **************************************************************************************************
Every layer $l$ in the \gls{mlp} contains a certain number of neurons, and every neuron in layer $l$ is connected to every neuron in the previous layer $l-1$. In \glspl{mlp}, there are no recurrent connections. Neural networks with recurrent connections are called \glspl{rnn} and are not discussed in this thesis.
Every layer $l$ in the \gls{mlp} contains a certain number of neurons, and every neuron in layer $l$ is connected to every neuron in the previous layer $l-1$. In \glspl{mlp}, there are no recurrent connections. Neural networks with recurrent connections are called Recurrent Neural Networks and are not discussed in this thesis.
Stacking layers makes the \gls{mlp} more expressive and therefore increases its capacity. The capacity can also be increased when the number of neurons in a layer is increased. The number of stacked layers $L$ determines the depth of the network whereas the number of neurons per layer determines the width of the network. Both the number of layers and the number of neurons per layer are hyperparameters of the \gls{mlp} that are not directly optimized by the training algorithm but rather are hand-picked by the human expert designing the \gls{mlp}.
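The fully connected structure described above corresponds to the usual layer-wise forward pass. As a generic sketch (the notation here is assumed and may differ from the rest of the thesis), with weight matrix $\mathbf{W}^{(l)}$, bias vector $\mathbf{b}^{(l)}$, activation function $\sigma$ and layer output $\mathbf{a}^{(l)}$, where $\mathbf{a}^{(0)}$ is the network input, every layer computes
\begin{equation*}
    \mathbf{a}^{(l)} = \sigma\left(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right), \qquad l = 1, \dots, L.
\end{equation*}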
@@ -2,7 +2,7 @@
% **************************************************************************************************
\newsection{Motivation}{intro:motivation}
%Motivation: Was ist KWS? Warum braucht man es? Warum muss es resourcenschonend sein?\\
\glspl{iva} like Google's Assistant, Apple's Siri and Amazon's Alexa have gained a lot of popularity in recent years. \glspl{iva} provide an alternative interface for human-computer interaction in addition to the more traditional interfaces such as mouse, keyboard, touchscreen and display. \glspl{iva} are capable of interpreting human speech, following spoken commands and replying via synthesized voices. Modern \glspl{iva} provide many functionalities, including responses to questions, email management, to-do list management, calendar management, home automation control and media playback control. Part of the success of \glspl{iva} can be attributed to the rapid advancements in \gls{nlp} and \gls{ai}, both being key technologies in the development and the application of \glspl{iva}.
\glspl{iva} like Google's Assistant, Apple's Siri and Amazon's Alexa have gained a lot of popularity in recent years. \glspl{iva} provide an alternative interface for human-computer interaction in addition to the more traditional interfaces such as mouse, keyboard, touchscreen and display. \glspl{iva} are capable of interpreting human speech, following spoken commands and replying via synthesized voices. Modern \glspl{iva} provide many functionalities, including responses to questions, email management, to-do list management, calendar management, home automation control and media playback control. Part of the success of \glspl{iva} can be attributed to the rapid advancements in Natural Language Processing and Artificial Intelligence, both being key technologies in the development and the application of \glspl{iva}.
For \glspl{iva} to fulfil requests, a complex pipeline of different technologies is necessary. First, \gls{asr} is employed to convert a spoken command into a text transcription. Then, natural language understanding is used to interpret the text transcription and to extract the intention of the user. A dialogue manager then produces a response to the spoken command. Finally, text-to-speech synthesis converts the response of the dialogue manager into spoken words that are played back to the user.
% **************************************************************************************************
% **************************************************************************************************
\gls{kws} with \glspl{dnn} is typically performed on hand-crafted speech features such as \glspl{mfcc} that are extracted from raw audio waveforms. Extracting \glspl{mfcc} involves computing the Fourier transform, which is computationally expensive and might exceed the capabilities of resource-constrained devices. Therefore, Ibrahim et al. \cite{Ibrahim2019} proposed to use simpler speech features derived in the time domain. These features, referred to as \gls{mfsts}, are obtained by computing constrained-lag autocorrelations on overlapping speech frames to form a 2D map. A \gls{tcn} \cite{Chiu2018} is then used to classify keywords based on \gls{mfsts}.
\gls{kws} with \glspl{dnn} is typically performed on hand-crafted speech features such as \glspl{mfcc} that are extracted from raw audio waveforms. Extracting \glspl{mfcc} involves computing the Fourier transform, which is computationally expensive and might exceed the capabilities of resource-constrained devices. Therefore, Ibrahim et al. \cite{Ibrahim2019} proposed to use simpler speech features derived in the time domain. These features, referred to as \gls{mfsts}, are obtained by computing constrained-lag autocorrelations on overlapping speech frames to form a 2D map. A Temporal Convolutional Neural Network \cite{Chiu2018} is then used to classify keywords based on \gls{mfsts}.
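As a rough sketch of such a time-domain feature (the exact normalization and lag range used for \gls{mfsts} in \cite{Ibrahim2019} may differ), the autocorrelation of a speech frame $x_t[n]$ of length $N$, evaluated only for a constrained set of lags $\tau = 0, \dots, \tau_{\max}$ with $\tau_{\max} \ll N$, is
\begin{equation*}
    r_t[\tau] = \sum_{n=0}^{N-1-\tau} x_t[n] \, x_t[n+\tau],
\end{equation*}
and stacking the vectors $r_t$ over the frame index $t$ yields the 2D feature map.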
However, hand-crafted speech features such as \gls{mfsts} and \glspl{mfcc} may not be optimal for \gls{kws}. Therefore, recent works have proposed to feed a \gls{dnn} directly with raw audio waveforms. In \cite{Ravanelli2018}, a \gls{cnn} for speaker recognition is proposed whose first layer, referred to as SincConv layer, is encouraged to learn parametrized sinc functions as convolution kernels. During training, a low and a high cutoff frequency are determined for every kernel. Training the SincConv layer therefore yields a custom filter bank that is specifically tailored to the desired application. SincConv layers have also recently been applied to \gls{kws} tasks \cite{Mittermaier2020}.
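The kernels learned by the SincConv layer are parametrized band-pass filters. In the time-domain formulation of \cite{Ravanelli2018}, a kernel with learnable low and high cutoff frequencies $f_1 < f_2$ is (up to the windowing and normalization details of the original work)
\begin{equation*}
    g[n; f_1, f_2] = 2 f_2 \, \mathrm{sinc}(2 \pi f_2 n) - 2 f_1 \, \mathrm{sinc}(2 \pi f_1 n), \qquad \mathrm{sinc}(x) = \frac{\sin(x)}{x},
\end{equation*}
so that only the two cutoff frequencies have to be learned per kernel instead of all filter taps.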