Neural Networks

Fundamentals of Neural Networks

Optimizers

Gradient descent optimization algorithms are often used as black-box tools to train a neural network by helping its parameters converge to a good minimum efficiently. However, an optimizer's hyperparameters can considerably affect the eventual result of training, and knowing the specifics of each optimizer, along with its advantages and downsides, helps one decide when to use a particular optimizer and what effect fine-tuning its hyperparameters can have on the model. Let's discuss the types of gradient descent first.

Batch Gradient Descent / Vanilla Gradient Descent

  • Computes the gradient of the cost function w.r.t. the parameters over the entire training set.

  • Because it must compute gradients over the whole training set for a single update, it is slow and intractable for datasets that don't fit in memory.

  • Doesn't allow us to update the model online, i.e. with new examples on the fly.

  • Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

  • Vectorized implementations (computing the whole-batch gradient with matrix multiplications) significantly improve performance.
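A minimal sketch of one training run with batch gradient descent, assuming a hypothetical `loss_grad(params, X, y)` helper that returns the gradient of the cost over the full training set (names and defaults are illustrative, not from any particular library):

```python
import numpy as np

def batch_gradient_descent(params, X, y, loss_grad, lr=0.01, epochs=100):
    """Vanilla / batch gradient descent: one parameter update per full pass over the data."""
    for _ in range(epochs):
        grad = loss_grad(params, X, y)   # gradient computed over ALL training examples
        params = params - lr * grad      # a single update per epoch
    return params
```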

Stochastic Gradient Descent - SGD

  • In contrast, SGD performs a parameter update for each training example.

  • SGD performs frequent, high-variance updates that cause the objective function to fluctuate heavily, and it does away with the redundant computations batch gradient descent performs for similar examples.

  • While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD’s fluctuation, on the one hand, enables it to jump to new and potentially better local minima. On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting.

  • However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behavior as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.

  • Data is shuffled after every epoch.
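For contrast, a sketch of SGD under the same assumed `loss_grad` helper: the data is reshuffled every epoch and the parameters are updated once per example, which is where the high-variance, fluctuating behaviour comes from:

```python
import numpy as np

def sgd(params, X, y, loss_grad, lr=0.01, epochs=10):
    """Stochastic gradient descent: one noisy update per training example."""
    n = len(X)
    for _ in range(epochs):
        for i in np.random.permutation(n):                    # shuffle after every epoch
            grad = loss_grad(params, X[i:i + 1], y[i:i + 1])  # gradient of a single example
            params = params - lr * grad
    return params
```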

Mini-batch Gradient Descent

  • Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples.

  • Reduces the variance of the parameter updates, which can lead to more stable convergence.

  • Can make use of highly optimized matrix optimizations common to state-of-the-art deep learning libraries that make computing the gradient w.r.t. a mini-batch very efficient.
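A sketch of the mini-batch variant, again with the assumed `loss_grad` helper and NumPy arrays; typical batch sizes lie somewhere between 32 and 256, and the gradient over each slice can be computed with vectorized matrix operations:

```python
import numpy as np

def minibatch_gd(params, X, y, loss_grad, lr=0.01, batch_size=64, epochs=10):
    """Mini-batch gradient descent: one update per mini-batch of examples."""
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)                        # shuffle indices every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            grad = loss_grad(params, X[batch], y[batch])      # vectorized over the mini-batch
            params = params - lr * grad
    return params
```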

Takeaways

  • Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.

  • Learning rate schedules try to adjust the learning rate during training by e.g. annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset’s characteristics (a simple step-decay schedule is sketched after this list).

  • Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.

  • Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
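As a concrete example of a pre-defined schedule, here is a hypothetical step-decay rule; the drop factor and interval are illustrative and would normally be tuned per dataset:

```python
def step_decay_lr(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Anneal the learning rate by a fixed factor every few epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# e.g. initial_lr=0.1 yields 0.1 for epochs 0-9, 0.05 for 10-19, 0.025 for 20-29, ...
```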

Momentum

  • SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum.

  • Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations.

  • The momentum term gamma is usually set to 0.9 or a similar value.

  • Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance, i.e. γ < 1). The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.

u_t = \gamma \, u_{t-1} + \eta \, \nabla_\theta J(\theta)
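The same update written as code, a minimal sketch assuming a `grad` callable standing in for ∇_θ J(θ) and NumPy-like parameter arrays:

```python
def momentum_step(params, velocity, grad, lr=0.01, gamma=0.9):
    """Classical momentum: accumulate a velocity buffer and step along it."""
    velocity = gamma * velocity + lr * grad(params)   # u_t = γ·u_{t-1} + η·∇J(θ)
    params = params - velocity                        # θ ← θ - u_t
    return params, velocity
```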

Nesterov Accelerated Gradient

  • Nesterov Accelerated Gradient is a way to give our momentum term a kind of prescience, slowing the momentum down when nearing a minimum.

  • Computing the term θ − γ·u_{t−1} thus gives us an approximation of the next position of the parameters (the gradient is missing for the full update), a rough idea of where our parameters are going to be. We can now effectively look ahead by calculating the gradient not w.r.t. our current parameters but w.r.t. the approximate future position of our parameters:

u_t = \gamma \, u_{t-1} + \eta \, \nabla_\theta J(\theta - \gamma \, u_{t-1})
  • This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of RNNs on a number of tasks.

  • Now that we are able to adapt our updates to the slope of our error function and speed up SGD in turn, we would also like to adapt our updates to each individual parameter to perform larger or smaller updates depending on their importance.
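Before moving on to per-parameter methods, here is a minimal sketch of the Nesterov look-ahead step, using the same assumed `grad` helper as the momentum sketch; the only change is where the gradient is evaluated:

```python
def nesterov_step(params, velocity, grad, lr=0.01, gamma=0.9):
    """Nesterov accelerated gradient: evaluate the gradient at the look-ahead point."""
    lookahead = params - gamma * velocity              # approximate future position θ - γ·u_{t-1}
    velocity = gamma * velocity + lr * grad(lookahead)
    params = params - velocity
    return params, velocity
```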

Adagrad

  • It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.

  • Adagrad greatly improved the robustness of SGD, and Google used it to train large-scale neural nets which, among other things, learned to recognize cats in YouTube videos. Now, who doesn't like cats?

  • Adagrad was used to train GloVe word embeddings, as infrequent words require much larger updates than frequent ones.

  • One of Adagrad’s main benefits is that it eliminates the need to manually tune the learning rate. Most implementations use a default value of 0.01 for the global learning rate and leave it at that.

  • Adagrad’s main weakness is its accumulation of the squared gradients in the denominator. Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge. The following algorithms aim to resolve this flaw.

u_t = \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t
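A per-parameter sketch of the rule above, where `G` accumulates the sum of squared gradients element-wise (the `grad` helper and NumPy arrays are assumed as before):

```python
import numpy as np

def adagrad_step(params, G, grad, lr=0.01, eps=1e-8):
    """Adagrad: scale each parameter's step by its accumulated squared gradients."""
    g = grad(params)
    G = G + g ** 2                                    # G_t only ever grows (Adagrad's weakness)
    params = params - lr / np.sqrt(G + eps) * g       # u_t = η / sqrt(G_t + ε) ⊙ g_t
    return params, G
```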

Adadelta

  • Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size w.

  • With Adadelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule.

  • Adadelta has a very slow start because:

    • At step 0, the running average of past updates is zero, so the first update will be very small.

    • And because the first update is very small, the running average of updates stays very small early on, which is a kind of vicious circle.
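A hedged sketch of the Adadelta idea (the decay rate `rho` and `eps` values are the usual knobs, but exact details vary between implementations): running averages of squared gradients and of squared updates replace the global learning rate, and because the update average starts at zero the first steps are tiny, which is exactly the slow start described above:

```python
import numpy as np

def adadelta_step(params, avg_sq_grad, avg_sq_update, grad, rho=0.95, eps=1e-6):
    """Adadelta: no global learning rate; step sizes come from past updates themselves."""
    g = grad(params)
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g ** 2
    update = np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * g
    params = params - update
    avg_sq_update = rho * avg_sq_update + (1 - rho) * update ** 2
    return params, avg_sq_grad, avg_sq_update
```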


RMSprop

  • RMSprop likewise divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests setting the decay term γ to 0.9, while a good default value for the learning rate η is 0.001.

E[g^2]_t = 0.9 \, E[g^2]_{t-1} + 0.1 \, g^2_t

u_t = \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t
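The same two lines as code, with Hinton's suggested decay of 0.9 and learning rate of 0.001 as defaults (assumed `grad` helper as before):

```python
import numpy as np

def rmsprop_step(params, avg_sq_grad, grad, lr=0.001, decay=0.9, eps=1e-8):
    """RMSprop: divide the learning rate by a decaying average of squared gradients."""
    g = grad(params)
    avg_sq_grad = decay * avg_sq_grad + (1 - decay) * g ** 2    # E[g²]_t
    params = params - lr / np.sqrt(avg_sq_grad + eps) * g
    return params, avg_sq_grad
```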

Adam

  • Adaptive Moment Estimation is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.

  • The authors propose default values of 0.9 for β₁, 0.999 for β₂, and 10⁻⁸ for ε. They show empirically that Adam works well in practice and compares favorably to other adaptive learning-rate methods.
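A sketch of a single Adam step with those defaults, including the bias-correction terms from the original paper (`t` is the 1-based step count; `grad` is the assumed gradient helper):

```python
import numpy as np

def adam_step(params, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: decaying averages of past gradients (m) and past squared gradients (v)."""
    g = grad(params)
    m = beta1 * m + (1 - beta1) * g           # first moment, momentum-like
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment, RMSprop-like
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero-initialized averages
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```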

Nadam

  • As we have seen before, Adam can be viewed as a combination of RMSprop and momentum. RMSprop contributes the exponentially decaying average of past squared gradients, while momentum accounts for the exponentially decaying average of past gradients.

  • We have also seen that Nesterov accelerated gradient (NAG) is superior to vanilla momentum. Nadam (Nesterov-accelerated Adaptive Moment Estimation) thus combines Adam and NAG. In order to incorporate NAG into Adam, we need to modify its momentum term.

AMSGrad

Shuffling vs Curriculum Learning

  • Generally, we want to avoid providing the training examples in a meaningful order to our model as this may bias the optimization algorithm. Consequently, it is often a good idea to shuffle the training data after every epoch.

  • On the other hand, for some cases where we aim to solve progressively harder problems, supplying the training examples in a meaningful order may actually lead to improved performance and better convergence. The method for establishing this meaningful order is called Curriculum Learning.

Regularization

  • An effective regularizer is one that makes a profitable trade: it reduces variance significantly while not overly increasing the bias.

  • A neural network has a tendency to memorize its training data, especially if it has more than enough capacity. In such cases, the network fails catastrophically when subjected to the test data.

  • This is the classic case of a network failing to generalize. To counter this tendency, we add a regularizing layer or function to the model.

  • It discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.

Batch Normalization

  • To facilitate learning, we typically normalize the initial values of our parameters by initializing them with zero mean and unit variance. As training progresses and we update parameters to different extents, we lose this normalization, which slows down training and amplifies changes as the network becomes deeper.

  • Batch normalization reestablishes these normalizations for every mini-batch and changes are backpropagated through the operation as well. By making normalization part of the model architecture, we are able to use higher learning rates and pay less attention to the initialization parameters. Batch normalization additionally acts as a regularizer, reducing (and sometimes even eliminating) the need for Dropout.
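A minimal sketch of the batch-norm transform for one mini-batch, where `gamma` and `beta` are the learned scale and shift; the running statistics used at inference time are left out for brevity:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch to zero mean / unit variance per feature, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # re-establish the normalization for this batch
    return gamma * x_hat + beta               # learned parameters restore expressiveness
```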

Dropout

  • The idea of dropout is simple. Given a dropout rate, the Dropout layer randomly removes that fraction of units from participating in the next layer. For example, if the first layer has 256 units, then with dropout = 0.45 only about (1 − 0.45) × 256 ≈ 141 units from layer 1 participate in layer 2.

  • The Dropout layer makes neural networks robust to unforeseen input data because the network is trained to predict correctly, even if some units are missing.

  • It's worth noting that dropout is not used in the output layer, and it is only active during training: dropout is not applied when making predictions.
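A sketch of the common "inverted" dropout formulation: each unit is kept with probability 1 − rate and the survivors are rescaled during training, which is why the layer can simply be switched off at prediction time:

```python
import numpy as np

def dropout_forward(x, rate=0.45, training=True):
    """Inverted dropout: randomly zero out a fraction of units during training only."""
    if not training:
        return x                                   # no-op at prediction time
    mask = np.random.rand(*x.shape) >= rate        # keep each unit with probability 1 - rate
    return x * mask / (1.0 - rate)                 # rescale so the expected activation is unchanged
```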

L1 Norm / Lasso Regression

  • Lasso is another variation, in which the function below is minimized. It's clear that this variation differs from ridge regression only in how it penalizes high coefficients: it uses |β_j| (the modulus) instead of the squares of β as its penalty. In statistics, this is known as the L1 norm.
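Written out in standard least-squares notation, the lasso objective referred to above is (λ is the tuning parameter that controls the strength of the penalty):

\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert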

L2 Norm / Ridge Regression
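For comparison, the ridge objective penalizes the squares of the coefficients (the L2 norm) instead of their absolute values; in the same notation:

\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \;+\; \lambda \sum_{j=1}^{p} \beta_j^2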

L1 vs L2 Norm

  • This sheds light on the obvious disadvantage of ridge regression: model interpretability. Ridge will shrink the coefficients of the least important predictors very close to zero, but it will never make them exactly zero. In other words, the final model will include all predictors.

  • However, in the case of the lasso, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large.

  • Therefore, the lasso method also performs variable selection and is said to yield sparse models.
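A quick, illustrative way to see this difference is to fit scikit-learn's Lasso and Ridge on the same synthetic data and count how many coefficients come out exactly zero (`alpha` plays the role of λ here; the data and values are made up for demonstration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[:3] = [3.0, -2.0, 1.5]                 # only 3 informative features out of 20
y = X @ true_coef + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0)))   # typically many
print("ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)))   # typically none
```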

Shortcut Learning in Deep Neural Networks
