Optimizers in Machine Learning

Anjani Suman
Artificial Intelligence in Plain English
Aug 30, 2020


Optimizers are algorithms or methods used to change the attributes of a neural network, such as its weights and learning rate, in order to reduce the loss.

Different types of optimizers:

  1. Batch Gradient Descent (BGD)
  2. Stochastic gradient descent (SGD)
  3. Mini-batch gradient descent (MBGD)
  4. Adagrad
  5. Adadelta
  6. RMSProp
  7. Adam

Here I’m going to discuss all of them in detail:

1. Batch Gradient Descent (BGD):

Gradient update rule:
BGD uses the entire training set to calculate the gradient of the cost function with respect to the parameters.

Disadvantages:

Because this method calculates the gradient over the entire data set for a single update, it is very slow, it becomes unwieldy on very large data sets, and it cannot incorporate new data to update the model in real time.

Batch gradient descent can converge to a global minimum for convex functions and to a local minimum for non-convex functions.
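
To make the rule concrete, here is a minimal NumPy sketch of batch gradient descent on a toy least-squares problem; the data, the model, and the learning rate below are illustrative assumptions, not values from the article.

import numpy as np

# Toy data for illustration only: 100 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.zeros(3)            # parameters
learning_rate = 0.1

for epoch in range(100):
    # gradient of the mean-squared-error cost over the ENTIRE training set
    grad = 2.0 / len(X) * X.T @ (X @ w - y)
    w += -learning_rate * grad   # one parameter update per full pass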

2. Stochastic gradient descent (SGD):

Gradient update rule:
Compared with BGD, which computes the gradient using all the data at once, SGD updates the parameters using the gradient of a single sample at each step.

x += -learning_rate * dx

where x is a parameter, dx is its gradient, and learning_rate is a constant.
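
Spelled out as a loop, the rule looks like the following sketch; the toy least-squares data and the learning rate are my own assumptions, used only to make the single-sample update concrete.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # toy data (assumed)
y = X @ np.array([1.0, -2.0, 0.5])

x = np.zeros(3)                              # parameter vector, named x as in the rule above
learning_rate = 0.01

for epoch in range(10):
    for i in rng.permutation(len(X)):        # one sample per update
        dx = 2.0 * X[i] * (X[i] @ x - y[i])  # gradient of the loss on this single sample
        x += -learning_rate * dx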

For large data sets there are often similar samples, so BGD performs redundant gradient computations. SGD uses a single sample per update, so there is no such redundancy, it is faster, and new samples can be added on the fly.

Disadvantages:
However, because SGD updates far more frequently, the cost function oscillates severely.
Whereas BGD settles into the minimum of the basin it starts in, SGD's oscillations may let it jump to a better local minimum.

When the learning rate is slowly decreased over time, SGD shows almost the same convergence behaviour as BGD.

3. Mini-batch gradient descent (MBGD):

Gradient update rule:

MBGD uses a small batch of samples, that is, n samples per update. This reduces the variance of the parameter updates, so convergence is more stable.
It can also make full use of the highly optimized matrix operations in deep learning libraries for more efficient gradient calculations.

The difference from SGD is that each update acts not on a single sample but on a batch of n samples.

Setting the hyper-parameter: n is generally in the range 50 to 256.
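
As a rough sketch (the data, loss, and batch size are illustrative assumptions), the only change from SGD is that each update uses a slice of n samples:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 3))               # toy data (assumed)
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
learning_rate = 0.05
n = 64                                       # mini-batch size, within the 50-256 range above

for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), n):
        batch = idx[start:start + n]         # n samples per update
        Xb, yb = X[batch], y[batch]
        grad = 2.0 / len(Xb) * Xb.T @ (Xb @ w - yb)
        w += -learning_rate * grad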

Disadvantages:

Mini-batch gradient descent does not guarantee good convergence:

If the learning rate is too small, convergence is slow; if it is too large, the loss function oscillates around the minimum or even diverges.
One remedy is to start with a larger learning rate and reduce it once the change in the loss between two iterations falls below a certain threshold.
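
A minimal sketch of that schedule on a toy one-dimensional loss; the threshold, decay factor, and loss function are assumptions chosen for illustration:

# Reduce the learning rate when the loss stops improving between iterations.
w = 5.0
learning_rate = 0.8                          # deliberately large starting rate
threshold = 1e-6
previous_loss = float("inf")

for step in range(200):
    loss = (w - 2.0) ** 2                    # toy loss with its minimum at w = 2
    grad = 2.0 * (w - 2.0)
    w += -learning_rate * grad
    if abs(previous_loss - loss) < threshold:
        learning_rate *= 0.5                 # decay once progress stalls
    previous_loss = loss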

4. Adagrad:

Adagrad is an algorithm for gradient-based optimization which adapts the learning rate to the parameters, using low learning rates for parameters associated with frequently occurring features, and using high learning rates for parameters associated with infrequent features.

So, it is well-suited for dealing with sparse data.

A single global update rate may not be suitable for all parameters: some parameters may have reached a stage where only fine-tuning is needed, while others still need large adjustments because they correspond to only a few samples.

Adagrad addresses this problem: it adaptively assigns a different learning rate to each parameter. The implication is that as the total distance a parameter has been updated grows, its learning rate shrinks.
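
In code, this adaptation amounts to accumulating squared gradients per parameter; the toy gradient, learning rate, and epsilon below are illustrative assumptions:

import numpy as np

x = np.zeros(3)                              # parameters (toy example, assumed)
target = np.array([1.0, -2.0, 0.5])
cache = np.zeros_like(x)                     # running sum of squared gradients, one entry per parameter
learning_rate = 0.1
eps = 1e-8                                   # avoids division by zero

for step in range(100):
    dx = 2.0 * (x - target)                  # toy gradient: minimize ||x - target||^2
    cache += dx ** 2                         # grows monotonically, so the effective rate only shrinks
    x += -learning_rate * dx / (np.sqrt(cache) + eps)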

GloVe word embeddings are trained with Adagrad, since infrequent words require larger updates and frequent words require smaller updates.

Adagrad eliminates the need to manually tune the learning rate.

5. Adadelta:

There are three problems with the Adagrad algorithm:

1. The learning rate is monotonically decreasing.
2. The learning rate in the late training period is very small.
3. It requires manually setting a global initial learning rate.

Adadelta is an extension of Adagrad that tries to counter Adagrad's aggressive, monotonically decreasing learning rate.

It does this by restricting the window of accumulated past gradients to some fixed size w. The running average at time t then depends only on the previous average and the current gradient.

In Adadelta we do not need to set a default learning rate, because the step is scaled by the ratio of the running average of past parameter updates to the running average of past gradients.
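
A minimal sketch of the Adadelta update, assuming a toy quadratic gradient and a decay rate rho of my own choosing; note that no learning rate appears, only the ratio of the two running averages:

import numpy as np

x = np.zeros(3)                              # parameters (toy example, assumed)
target = np.array([1.0, -2.0, 0.5])
Eg2 = np.zeros_like(x)                       # running average of squared gradients
Edx2 = np.zeros_like(x)                      # running average of squared parameter updates
rho, eps = 0.95, 1e-6

for step in range(1000):
    g = 2.0 * (x - target)                   # toy gradient
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # ratio of running averages, no learning rate
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    x += dx

Because Edx2 starts at zero, the first steps are tiny; that is a property of the algorithm, not of this particular sketch.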

6. RMSProp:

RMSProp stands for Root Mean Square Propagation; it is an adaptive learning rate optimization algorithm proposed by Geoff Hinton.

RMSProp resolves Adagrad's radically diminishing learning rates by using a moving average of the squared gradients: it uses the magnitude of recent gradients to normalize the current gradient.

The difference is that RMSProp keeps an exponentially weighted average of the squared gradients. This damps the dimensions in which the gradient swings with large amplitude, so the oscillation in each dimension becomes smaller and, in turn, the network converges faster.

In RMSProp learning rate gets adjusted automatically and it chooses a different learning rate for each parameter.

RMSProp divides the learning rate by an exponentially decaying average of squared gradients.
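
The whole update fits in a few lines; the decay rate, learning rate, epsilon, and toy gradient in this sketch are assumptions of mine:

import numpy as np

x = np.zeros(3)                              # parameters (toy example, assumed)
target = np.array([1.0, -2.0, 0.5])
Eg2 = np.zeros_like(x)                       # exponentially decaying average of squared gradients
learning_rate, rho, eps = 0.01, 0.9, 1e-8

for step in range(500):
    g = 2.0 * (x - target)                   # toy gradient
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2     # moving average of squared gradients
    x += -learning_rate * g / (np.sqrt(Eg2) + eps)   # learning rate divided by the RMS of gradients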

7. Adam:

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients like Adadelta and RMSProp, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.

Adam can be viewed as a combination of Adagrad and RMSProp.

Adam uses exponential moving averages of the gradients to scale the learning rate, instead of the cumulative sum of squared gradients used in Adagrad.
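
Putting the two decaying averages together gives the following sketch; the toy gradient is an assumption of mine, and the hyper-parameter values are the defaults suggested in the Adam paper:

import numpy as np

x = np.zeros(3)                              # parameters (toy example, assumed)
target = np.array([1.0, -2.0, 0.5])
m = np.zeros_like(x)                         # decaying average of past gradients (momentum-like)
v = np.zeros_like(x)                         # decaying average of past squared gradients
learning_rate, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = 2.0 * (x - target)                   # toy gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    x += -learning_rate * m_hat / (np.sqrt(v_hat) + eps)

The bias-correction terms m_hat and v_hat compensate for the fact that m and v start at zero and would otherwise be biased toward zero in the early steps.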

Adam is computationally efficient and has very low memory requirements.

The Adam optimizer is one of the most popular and widely used gradient descent optimization algorithms.
