Training is an essential part of every machine learning model. Training refers to the (often iterative) process of tuning the model parameters with the goal of minimizing the error on a given dataset. To minimize the error, a loss function is needed. The loss function computes a scalar value based on the model predictions and the corresponding targets. The better the prediction, the lower the value of the loss function. Tuning the model parameters to minimize the loss function can then be performed by an iterative method called gradient descent.

This section introduces two important loss functions as well as the gradient descent method. In addition, it introduces regularization and knowledge distillation, two methods for improving the generalization performance of a model.

\mathrm{softmax}(\vm{z}; T \rightarrow\infty)_i &= \frac{1}{C}.

\end{align}

In the limit $T \rightarrow 0$, we get a \enquote{hard} probability distribution in the form of a one-hot-encoded vector, denoted by the indicator function $\mathbbm{1}(i = c)$. Increasing the temperature makes the distribution \enquote{softer}. In the limit $T \rightarrow \infty$, we get a uniform distribution.

For classification tasks, the teacher model usually produces a quite confident prediction that is close to a \enquote{hard} probability distribution. The additional information, such as similarities between classes, then resides in the very small probabilities of the soft targets. Such small probabilities, however, have little to no influence on the loss function of the student and therefore barely affect its training. Instead of using the soft targets of the teacher with temperature $T=1$, the temperature is therefore increased to \enquote{soften} the probability distribution.
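
To illustrate the effect of the temperature, consider a hypothetical logit vector $\vm{z} = (2, 1, 0)$ with $C = 3$ classes, and assume the usual temperature-scaled form $\mathrm{softmax}(\vm{z}; T)_i = \exp(z_i / T) / \sum_{j=1}^{C} \exp(z_j / T)$. The logits are chosen for illustration only and the results are rounded to three decimals:
\begin{align*}
\mathrm{softmax}(\vm{z}; T = 0.5) &\approx (0.867,\ 0.117,\ 0.016), \\
\mathrm{softmax}(\vm{z}; T = 1) &\approx (0.665,\ 0.245,\ 0.090), \\
\mathrm{softmax}(\vm{z}; T = 2) &\approx (0.506,\ 0.307,\ 0.186).
\end{align*}
Raising the temperature from $1$ to $2$ roughly doubles the probability of the least likely class (from $0.090$ to $0.186$), so this class contributes more strongly to the loss of the student.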

@@ -6,9 +6,9 @@ Tuning the parameters of a \gls{dnn} model with gradient descent is an iterative

\end{equation}

where $\epsilon > 0$ is called the step size or the learning rate. The learning rate is a hyperparameter that is usually chosen by hand. For the gradient descent algorithm to converge, the learning rate should be neither too high nor too low. If the learning rate is too small, training takes very long since the parameters are changed only marginally at each iteration. On the other hand, if the learning rate is too large, the algorithm might diverge, causing the value of the loss function to increase.
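
As a small worked example of this trade-off, assume the usual update $\theta_{k+1} = \theta_k - \epsilon \nabla_{\theta}\mathcal{L}(\theta_k)$ and consider the hypothetical one-dimensional loss $\mathcal{L}(\theta) = \theta^2$, chosen purely for illustration. Its gradient is $2\theta$, so each step multiplies the parameter by a constant factor:
\begin{equation*}
\theta_{k+1} = \theta_k - 2\epsilon\,\theta_k = (1 - 2\epsilon)\,\theta_k .
\end{equation*}
The iterates converge to the minimum at $\theta = 0$ only if $|1 - 2\epsilon| < 1$, i.e. $0 < \epsilon < 1$. For $\epsilon = 0.01$ the factor is $0.98$, so progress is slow. For $\epsilon = 1.1$ the factor is $-1.2$, so $|\theta_k|$ grows at every step and the loss increases.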

In some applications, the learning rate is decreased over time. In that case, the learning rate at iteration step $k$ is denoted by $\epsilon_k$.

The gradient $\nabla_{\vm{\theta}}\mathcal{L}(\vm{\theta})$ is the vector of partial derivatives of the loss with respect to all the weights and biases of a \gls{dnn}. To compute all partial derivatives, backpropagation \cite{Rumelhart1986} is employed. Modern deep learning frameworks, e.g. PyTorch, provide built-in methods which automatically compute the partial derivatives of the loss with respect to the weights and biases by means of backpropagation.

One problem of standard gradient descent is that computing the gradient over the whole dataset becomes very expensive for large datasets. In stochastic gradient descent, gradient steps are instead taken on small subsets of the dataset. This substantially reduces the computational cost of a single iteration. Therefore, nearly all modern deep learning models use stochastic gradient descent for training.

...

...

@@ -24,7 +24,7 @@ with

where $l(f(\vm{x}_i; \vm{\theta}), \vm{y}_i)$ is the per-sample loss (e.g. \gls{mse}, cross entropy, etc.). Once every batch has been used exactly once to perform a gradient step, the model has undergone one epoch of training. In practice, several epochs of training are necessary to achieve a reasonable performance.
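
As a rough illustration with hypothetical numbers, a dataset with $50\,000$ samples that is split into batches of $100$ samples yields
\begin{equation*}
\frac{50\,000}{100} = 500
\end{equation*}
gradient steps per epoch, so $10$ epochs of training correspond to $5\,000$ parameter updates.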

\newsubsubsection{Momentum and Nesterov Momentum}{dnn:gradient_descent:momentum}

The method of momentum \cite{POLYAK19641} is commonly used to accelerate learning with gradient descent. Instead of updating the parameters directly with the negative gradient, a velocity term $\vm{v}$ is used. The velocity $\vm{v}$ accumulates an exponential moving average of past gradients. By computing the moving average, the velocity $\vm{v}$ averages out oscillations of the gradient while preserving the direction towards the minimum. Gradient descent with momentum is illustrated in \Fref{fig:gd_momentum}.
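
One common way to write this update, given here only as a sketch, uses a momentum coefficient $\alpha \in [0, 1)$. The symbol $\alpha$ and the exact form are assumptions and may differ from the notation used elsewhere in this thesis:
\begin{align*}
\vm{v} &\leftarrow \alpha \vm{v} - \epsilon \nabla_{\vm{\theta}}\mathcal{L}(\vm{\theta}), \\
\vm{\theta} &\leftarrow \vm{\theta} + \vm{v}.
\end{align*}
With $\alpha = 0$, this reduces to plain gradient descent. If the gradient were a constant vector $\vm{g}$, the velocity would approach $-\epsilon \vm{g} / (1 - \alpha)$, so $\alpha = 0.9$ corresponds to steps up to ten times larger than $\epsilon \vm{g}$ along a consistent direction, while components of the gradient that alternate in sign largely cancel.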

In \Fref{sec:dnn:over_underfitting}, we discussed the concept of over- and underfitting in machine learning. The notions of training and test error were also introduced but not further explained. The error (sometimes called the loss) is usually just a scalar value. It indicates how well a machine learning model performs on a given dataset: the smaller the error, the better the model performs. The loss function differs depending on the task. In this subsection, the most commonly used loss functions for classification and regression are introduced.

The \gls{mse} is the average of squared differences between the predicted and the real targets across all $N$ samples in the dataset. The \gls{mse} is most commonly used for regression tasks and depends on the model parameters $\vm{\theta}$. The \gls{mse} for a single sample is defined as

...

...

@@ -15,9 +15,9 @@ where $\vm{y}_i$ is the $i$-th one-hot encoded target and $f(\vm{x}_i; \vm{\thet

For classification tasks, a different loss function has to be used. Typically, the choice for classification tasks with $C$ classes is the cross entropy loss. The cross entropy loss for a single sample is defined as

where $y_{ij}$ is the $j$-th element of the $i$-th one-hot encoded target and $f(\vm{x}_i; \vm{\theta})_j$ is the $j$-th element of the model prediction for the $i$-th input $\vm{x}_i$. The cross entropy loss over all samples is simply the average over all per-sample losses

Regularization is any technique that aims to decrease the test error, sometimes even at the expense of increasing the training error. In this section, several regularization methods are introduced, all of which are used in this thesis at some point.

A simple regularization technique in deep learning is early stopping. Early stopping relies on the observation that, given a model with large enough capacity, the test error follows a \enquote{U}-shaped curve over the course of training, and the best model is the one at the minimum of that curve. To assess the test error, the error on the validation set is monitored, and early stopping simply selects the model with the lowest validation error.

One way to implement early stopping is to store the model parameters every time the validation error decreases. If the validation error does not decrease any more, training can be stopped. After training is finished, the model with the lowest validation error is selected.
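
This selection rule can be written compactly. As a sketch with hypothetical notation, let $\vm{\theta}_k$ denote the model parameters after epoch $k$ and $E_{\mathrm{val}}(\vm{\theta}_k)$ the corresponding validation error, for $k = 1, \dots, K$:
\begin{equation*}
k^\ast = \operatorname*{arg\,min}_{k \in \{1, \dots, K\}} E_{\mathrm{val}}(\vm{\theta}_k), \qquad \vm{\theta}^\ast = \vm{\theta}_{k^\ast}.
\end{equation*}
In practice, training is often stopped once the validation error has not improved for a fixed number of epochs, commonly called the patience. This is one widely used stopping criterion and is stated here as an assumption, not as the specific rule of this thesis.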