The shape of the sigmoid resembles the letter \enquote{S} and is shown in \Fref{fig:sigmoid}.

\label{fig:sigmoid}

\end{figure}

However, the sigmoid activation can be used as an activation function in the output layer of a binary classifier. A binary classifier has a single output neuron in the output layer. If the sigmoid activation function is used, the output of the binary classifier gives the probability $p$ that the input belongs to class $y=1$. The probability that the input belongs to class $y=2$ is then simply $1-p$.
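As a minimal sketch of this, the two class probabilities can be computed from the output neuron's pre-activation (the value of $z$ below is an illustrative assumption, not taken from the text):

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real pre-activation to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative pre-activation of the single output neuron.
z = 1.25
p = sigmoid(z)   # probability that the input belongs to class y = 1
q = 1.0 - p      # probability that the input belongs to class y = 2
print(p, q)
```

Note that the two probabilities always sum to one, so a single output neuron suffices for the binary case.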

\newsubsubsection{Rectified Linear Units}{dnn:activation_functions:relu}

\glspl{relu} \cite{Nair2010} are often used as a replacement for the sigmoid activation function in the hidden layers of a \gls{mlp}. The \gls{relu} is defined by

...

...
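The \gls{relu} computes $g(z)=\max(0, z)$, which is straightforward to sketch as a plain scalar function (not tied to any particular library):

```python
def relu(z):
    """Rectified linear unit: g(z) = max(0, z)."""
    return max(0.0, z)

# Negative pre-activations are clipped to zero, positive ones pass through.
print([relu(z) for z in (-2.0, -0.5, 0.0, 1.5)])  # [0.0, 0.0, 0.0, 1.5]
```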

Applying the identity activation $g(z)=z$ is the same as applying no activation.

\label{fig:identity}

\end{figure}

The identity activation is seldom used as an activation function in the hidden layers of a \gls{mlp}. The expressive power of \glspl{mlp} is the result of the nonlinear relationship between the input and the targets. However, \glspl{mlp} with identity activation functions in the hidden layers can only compute a linear relationship between the input and the targets, making identity activations in the hidden layers of a \gls{mlp} useless. As we will see later, there are certain cases where the identity function is used as an activation function in some selected hidden layers of a \gls{dnn}.
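The collapse to a single linear map can be checked numerically. In the sketch below (arbitrary illustrative weight matrices, biases omitted for brevity), two identity-activation layers compute exactly the same function as one layer whose weight matrix is the product of the two:

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply two small matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Two hidden layers with identity activations (illustrative weights).
W1 = [[1.0, 2.0], [0.0, 1.0]]
W2 = [[0.5, -1.0], [2.0, 0.0]]
x = [3.0, -2.0]

two_layers = matvec(W2, matvec(W1, x))  # identity activation between layers
collapsed = matvec(matmul(W2, W1), x)   # one equivalent linear layer
print(two_layers == collapsed)          # prints True
```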

Introduced in 2015, batch normalization \cite{Ioffe2015} has been widely adopted in the field of deep learning to speed up the learning of \glspl{dnn}. Batch normalization ensures that the activations inside a \gls{dnn} are normalized and therefore every hidden layer gets a standardized input. Given a vector of activations $\vm{z}=(z_1, \dots, z_N)^T$ of a hidden layer, a single activation is normalized by

\begin{equation}
\overline{z}_i = \frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \quad \text{with} \quad \mu = \frac{1}{N}\sum_{i=1}^{N} z_i \quad \text{and} \quad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} \left(z_i - \mu\right)^2,
\end{equation}

where $\epsilon$ is a small value to avoid division by zero.

However, simply normalizing the activations can reduce the expressiveness of a \gls{dnn}. Therefore, two learnable parameters $\gamma$ and $\beta$ are introduced to change the mean and standard deviation of the normalized activations according to

\begin{equation}
\tilde{z}_i = \gamma\overline{z}_i + \beta.
\end{equation}
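The two steps, normalization followed by the learnable rescaling, can be sketched for a single activation vector as follows (a pure-Python sketch; the values of `gamma`, `beta`, and the sample activations are illustrative assumptions):

```python
import math

def batch_norm(z, gamma, beta, eps=1e-5):
    """Normalize activations to zero mean / unit variance, then rescale.

    gamma and beta are the learnable scale and shift parameters;
    eps avoids division by zero.
    """
    n = len(z)
    mu = sum(z) / n                               # mean of the activations
    var = sum((zi - mu) ** 2 for zi in z) / n     # variance of the activations
    return [gamma * (zi - mu) / math.sqrt(var + eps) + beta for zi in z]

z = [1.0, 2.0, 3.0, 4.0]
z_tilde = batch_norm(z, gamma=1.0, beta=0.0)
print(z_tilde)  # zero mean and unit variance (up to eps)
```

With $\gamma=1$ and $\beta=0$ the output is simply the normalized activations; during training both parameters are learned along with the weights.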

Batch normalization is most commonly applied before the activation of a \gls{dnn} layer. It is also possible to apply batch normalization after the activation function. However, most deep learning applications tend to apply batch normalization before the activation function.

One advantage of using batch normalization is that it allows the learning rate to be increased, therefore speeding up training. Another advantage is that models trained with batch normalization are more robust than models trained without it.

\glspl{dnn} are machine learning models inspired by the human brain. \glspl{dnn} have been applied to many fields including computer vision, speech recognition and machine translation. This chapter summarizes the most important concepts from the book \enquote{Deep Learning} by Goodfellow et al. \cite{Goodfellow2016}. The first two sections of this chapter introduce the three learning approaches for \glspl{dnn}, model capacity, over- and underfitting and validation sets. Then, the third and fourth section of this chapter introduce two architectures of \glspl{dnn}, the \gls{mlp} and the \gls{cnn}. Finally, the fifth section explains how \glspl{dnn} can be trained to perform well at certain tasks.

Every layer $l$ in the \gls{mlp} contains a certain number of neurons where every ...

Stacking layers makes the \gls{mlp} more expressive and therefore increases its capacity. The capacity can also be increased when the number of neurons in a layer is increased. The number of stacked layers $L$ determines the depth of the network whereas the number of neurons per layer determines the width of the network. Both the number of layers and the number of neurons per layer are hyperparameters of the \gls{mlp} that are not directly optimized by the training algorithm but rather are hand-picked by the human expert designing the \gls{mlp}.

The first layer of the \gls{mlp} is called the input layer that takes a vector $\vm{x}$ as the input. Intermediate layers are called hidden layers because the output of a hidden layer is usually not observed directly. The last layer of the \gls{mlp} is called the output layer. The output of the output layer is called the predicted target. The predicted target in classification tasks with $C$ classes is usually just a single categorical value $\hat{y}\in\{1, \dots, C\}$. In practice, the predicted target is a probability distribution described by the vector $\vm{\hat{y}}\in\REAL^C$. The target is also just a single categorical value $y \in\{1, \dots, C\}$. In practice, the target is described as a one-hot encoded vector $\vm{y}\in\REAL^C$ where a single element of the vector is set to one and all the other elements are set to zero.
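A small sketch of the two encodings, using this chapter's convention of class indices $1, \dots, C$ (the probability values below are illustrative):

```python
def argmax_class(v):
    """Index of the largest element, with class indices starting at 1."""
    return max(range(len(v)), key=lambda i: v[i]) + 1

def one_hot(y, C):
    """One-hot encode a categorical target y in {1, ..., C}."""
    return [1.0 if c == y else 0.0 for c in range(1, C + 1)]

y_hat_vec = [0.1, 0.7, 0.2]   # predicted probability distribution, C = 3
print(argmax_class(y_hat_vec))  # predicted target: class 2
print(one_hot(2, 3))            # target as one-hot vector: [0.0, 1.0, 0.0]
```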

\Fref{fig:mlp_structure} visualizes the structure of the \gls{mlp}. On the left (green), the vector $\vm{x}$ serves as the input to the \gls{mlp}. After the input layer, there are $L$ hidden layers (blue) followed by an output layer (red) that computes a probability distribution vector from which the predicted target is extracted.

When we build machine learning models such as \glspl{dnn}, we want the models to perform well on previously unseen inputs. This property, performing well on unseen inputs, is called generalization. In practice, generalization can be measured by evaluating the model on a test dataset that is separate from the dataset the model was trained on. The generalization of a model is measured by a generalization error, sometimes called the test error. The generalization error should ideally be as low as possible. The test dataset on which the generalization error is measured should not be arbitrary. Instead, the test dataset should follow the same distribution as the dataset that was used to train the model. Usually, the training and test datasets are obtained by randomly splitting a large dataset into two parts.
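Such a random split can be sketched as follows (the 80/20 ratio, the fixed seed, and the toy dataset are illustrative choices, not prescribed by the text):

```python
import random

def train_test_split(dataset, test_fraction=0.2, seed=0):
    """Randomly split a dataset into a training part and a test part."""
    data = list(dataset)
    random.Random(seed).shuffle(data)      # shuffle before splitting
    n_test = int(len(data) * test_fraction)
    return data[n_test:], data[:n_test]    # (train, test)

train, test = train_test_split(range(100))
print(len(train), len(test))  # 80 20
```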

A model may not be able to generalize well on the test dataset. If this happens, there are two reasons: underfitting or overfitting. Underfitting occurs when the model capacity is too low and the model struggles to reduce the error on the training dataset. Overfitting occurs when the model capacity is too large and the model \enquote{memorizes} the training data. Overfitting becomes evident during training when the gap between the training and test error increases. When the model capacity is right, the model is able to generalize well.

As an example to illustrate underfitting and overfitting, three models with different capacities are depicted in \Fref{fig:over_underfit}. Every model was trained to fit a function to the noisy samples of a quadratic function. In the figure on the left, the capacity of the model is too low since only a linear function was used to fit the data and therefore the shape of the quadratic function is matched poorly. In the figure on the right, the model capacity is too high since a higher order polynomial was used to fit the data and therefore the model overfits to the noisy samples. The model on the right has the lowest training error but at the same time will perform poorly on unseen data. In the figure in the middle, the model capacity is right and the model neither overfits nor underfits to the data. The capacity of this model is appropriate and therefore, this model will perform best on unseen data.

...

...


\caption{Underfitting (left), proper fit (middle), and overfitting (right) of a function fitted to the noisy samples of a quadratic function. Figure from \cite{user121799}.}

\label{fig:over_underfit}

\end{figure}

...

...


An intuitive approach to find the point of optimal capacity is to stop training when the test error increases. This approach is called early stopping. With early stopping, the dataset is split into three parts: the training dataset, the validation dataset and the test dataset. During training, the error on the validation set is monitored. If the error on the validation set does not decrease for some time or even increases, the training is stopped and the error on the test set is reported. Early stopping is a regularization method. In general, regularization is any modification we make to a learning algorithm that is intended to reduce its test error but not necessarily its training error \cite{Goodfellow2016}. Other methods for regularization, including weight decay and dropout, are explained in detail in \Fref{sec:dnn:regularization}.
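The monitoring logic can be sketched as follows; the `patience` parameter and the validation-error values are illustrative assumptions, not part of the original description:

```python
def early_stopping(val_errors, patience=3):
    """Return the epoch at which training would be stopped.

    Training stops once the validation error has not improved
    for `patience` consecutive epochs.
    """
    best, best_epoch = float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch = err, epoch      # new best validation error
        elif epoch - best_epoch >= patience:
            return epoch                       # no improvement: stop here
    return len(val_errors) - 1                 # never triggered

# Validation error starts to rise after epoch 3 (illustrative values).
errors = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.8]
print(early_stopping(errors))  # stops at epoch 6
```

In practice one would also keep the model weights from the best epoch rather than the last one.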

where $\alpha \in [0, \infty)$ is a hyperparameter. When $\alpha=0$, no regularization is performed.

Weight decay uses an $L^2$ parameter norm penalty by setting the penalty term to $\Omega(\vm{\theta})=\frac{1}{2}||\vm{w}||^2_2$. This type of regularization drives the magnitude of the weights towards zero.
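The penalty term can be sketched directly from its definition (the weight values and the choice of $\alpha$ below are illustrative):

```python
def l2_penalty(weights):
    """Weight decay penalty: Omega(theta) = 0.5 * ||w||_2^2."""
    return 0.5 * sum(w * w for w in weights)

# The penalty is added to the loss, scaled by the hyperparameter alpha.
alpha = 0.01
weights = [0.5, -1.0, 2.0]
print(l2_penalty(weights))          # 0.5 * (0.25 + 1.0 + 4.0) = 2.625
print(alpha * l2_penalty(weights))  # contribution added to the loss
```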

Label smoothing is a regularization technique for models used in classification tasks. The targets in classification tasks are usually represented as one-hot encoded vectors $\vm{y}\in\REAL^C$. A classifier with a softmax output can never predict one-hot targets exactly. Instead, the classifier will continue to increase the weights to try to match the targets as well as possible. However, the increase in the magnitude of the weights may lead to overfitting.

To solve this issue, label smoothing replaces the zeros in the one-hot target vectors with $\frac{\epsilon}{C-1}$ and the ones with $1-\epsilon$, where $\epsilon$ is a small value, usually set to $0.1$. By smoothing the targets, the classifier still learns the correct predictions without pursuing the hard zero and one probabilities of one-hot encoded vectors.
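The smoothing rule translates directly into code (the four-class target below is an illustrative example):

```python
def smooth_labels(one_hot, eps=0.1):
    """Replace the ones with 1 - eps and the zeros with eps / (C - 1)."""
    C = len(one_hot)
    return [1.0 - eps if v == 1.0 else eps / (C - 1) for v in one_hot]

target = [0.0, 1.0, 0.0, 0.0]  # one-hot target, C = 4 classes
smoothed = smooth_labels(target)
print(smoothed)  # the one becomes 0.9, each zero becomes 0.1 / 3
```

The smoothed vector still sums to one, so it remains a valid probability distribution over the classes.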