# Adam: A Method for Stochastic Optimization

## TLDR:

As we know, the learning rate is a hugely impactful parameter when training neural networks. Instead of using one flat learning rate over the entire course of training, the Adam optimization algorithm is an adaptation of stochastic gradient descent (SGD) that modifies the effective learning rate for each parameter over the course of training, based on a moving average of the gradient, $dLoss/dp$, and a moving average of the squared gradient, $(dLoss/dp)^2$. The basic idea is that if the gradient for a parameter keeps pointing in the same direction, you can expect it to keep pointing that way, so it's useful to take larger effective steps and get there fast. Alternatively, if the gradient is bouncing around, you probably want smaller steps so you can settle into a more delicately optimized parameter value. Adam didn't introduce either ingredient on its own: a moving average of the gradient is the idea behind momentum, and per-parameter adaptive learning rates were already used by AdaGrad and RMSProp. What was novel was combining parameter-specific adaptive learning rates with these moving averages of gradients and squared gradients.
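To make the intuition concrete, here is a minimal sketch (my own toy example, not code from the paper). It feeds two made-up gradient sequences through the two moving averages and looks at the ratio of the averaged gradient to the square root of the averaged squared gradient, which is the quantity that scales each step. The helper name `step_ratio` and both sequences are hypothetical, and the decay rates 0.9 and 0.999 are just the commonly used defaults.

```python
import numpy as np

def step_ratio(grads, beta1=0.9, beta2=0.999):
    """Return (averaged gradient) / sqrt(averaged squared gradient) for one parameter."""
    m, v = 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # moving average of the gradient
        v = beta2 * v + (1 - beta2) * g * g    # moving average of the squared gradient
    # Both averages start at zero, so rescale them to undo that startup bias.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / np.sqrt(v_hat)

steady = np.ones(200)                   # gradient always points the same way
bouncy = np.array([1.0, -1.0] * 100)    # gradient flips sign on every step

ratio_steady = step_ratio(steady)
ratio_bouncy = step_ratio(bouncy)
print(ratio_steady, ratio_bouncy)
```

For the steady sequence the ratio comes out at essentially 1, so the parameter moves by about the full learning rate each step. For the sign-flipping sequence the averaged gradient mostly cancels and the ratio shrinks to roughly 0.05 in magnitude, so the steps get much smaller.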

Adam tends to perform very well in practice and is quite popular. Because of the adaptive learning rates, it tends to require less learning rate tuning and often converges faster than traditional SGD. It also, I believe, can be combined effectively with other optimization approaches, like learning-rate schedulers (e.g. one-cycle-policy). Usually, the weight given to the previous moving average is set to about 0.9 for the average gradient and 0.999 for the average squared gradient, i.e. the latter leans much more heavily on its history and changes more slowly, because squared values can be very bouncy and have high variance.
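One way to read those two numbers is as a memory length. The sketch below is an illustration under the assumption of a long-running exponential moving average (ignoring the startup phase); the function name `weight_in_last_n` is made up for this example. It computes how much of the average's total weight falls on the most recent values.

```python
def weight_in_last_n(beta, n):
    """Fraction of a long-running exponential moving average's weight on the n newest values."""
    # The weights are (1 - beta) * beta**k for k = 0, 1, 2, ...; the first n of them sum to 1 - beta**n.
    return 1.0 - beta ** n

recent_gradient_weight = weight_in_last_n(0.9, 10)       # average gradient, last 10 steps
recent_squared_weight = weight_in_last_n(0.999, 1000)    # average squared gradient, last 1000 steps
print(recent_gradient_weight, recent_squared_weight)
```

Both fractions land a little under two thirds, which is why a decay rate of $\beta$ is often described as averaging over roughly the last $1/(1-\beta)$ steps: about 10 for the gradient average and about 1000 for the squared-gradient average.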

## Summary

"Adam: A Method for Stochastic Optimization" is a seminal paper that introduces the Adam optimization algorithm, an efficient method for adapting learning rates for each parameter in a neural network model. The key difference between Adam and plain stochastic gradient descent (SGD) is that Adam computes adaptive learning rates for different parameters using estimates of the first and second moments of the gradients, while SGD applies a single learning rate to all parameters.

In more detail, the Adam algorithm calculates an exponential moving average of the gradient and the squared gradient, and these moving averages are then used to scale the learning rate for each weight in the neural network. The moving averages themselves are estimates of the first moment (the mean: $\frac{1}{n}\sum_{i=1}^{n} x_i$) and the second raw moment (the uncentered variance: $\frac{1}{n}\sum_{i=1}^{n} x_i^2$) of the gradient.
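As a rough check that the moving averages really do track these two moments, the sketch below draws synthetic "gradients" from a normal distribution with a known mean and standard deviation (values chosen arbitrarily for illustration) and compares the averages with the true moments. It assumes the usual decay rates of 0.9 and 0.999, and it divides by $1 - \beta^t$ to undo the bias from starting the averages at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_std = 0.5, 2.0                      # arbitrary values for this illustration
grads = rng.normal(true_mean, true_std, size=5000)  # synthetic gradients for one parameter

beta1, beta2 = 0.9, 0.999
m, v = 0.0, 0.0
for t, g in enumerate(grads, start=1):
    m = beta1 * m + (1 - beta1) * g       # moving average of the gradient
    v = beta2 * v + (1 - beta2) * g**2    # moving average of the squared gradient

# Undo the bias toward zero introduced by the zero initialization.
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)

second_raw_moment = true_mean**2 + true_std**2   # E[g^2] = mean^2 + variance
print("first moment: ", true_mean, "estimate:", m_hat)
print("second moment:", second_raw_moment, "estimate:", v_hat)
```

The second-moment estimate averages over on the order of a thousand samples and should sit close to the true value. The first-moment estimate only averages over roughly the last ten samples, so it is a much noisier estimate of the mean.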

The algorithm is defined as follows:

1. Initialize the first and second moment vectors, $m$ and $v$, to 0.

2. For each iteration $t$:

a. Obtain the gradients $g_t$ on the current mini-batch.

b. Update biased first moment estimate: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$.

c. Update biased second raw moment estimate: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$.

d. Compute bias-corrected first moment estimate: $\hat{m}_t = m_t / (1 - \beta_1^t)$.

e. Compute bias-corrected second raw moment estimate: $\hat{v}_t = v_t / (1 - \beta_2^t)$.

f. Update parameters: $\theta_{t+1} = \theta_t - \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$.

Here, $\beta_1$ and $\beta_2$ are the decay rates for the moving averages (typically set to 0.9 and 0.999, respectively), $\alpha$ is the step size or learning rate (typically set to 0.001), and $\epsilon$ is a small constant added for numerical stability (typically set to $10^{-8}$).
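The steps above translate almost line for line into NumPy. What follows is a minimal sketch rather than a reference implementation: the function name `adam`, its signature, and the toy quadratic objective are all assumptions made for illustration, and the learning rate in the usage example is raised to 0.1 only so that the toy problem converges in a couple of thousand steps.

```python
import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, num_steps=1000):
    """Run Adam for num_steps iterations and return the final parameters.

    grad_fn maps a parameter vector to the gradient of the loss at that point.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)   # step 1: first moment vector
    v = np.zeros_like(theta)   # step 1: second moment vector
    for t in range(1, num_steps + 1):
        g = grad_fn(theta)                                   # step a: gradient
        m = beta1 * m + (1 - beta1) * g                      # step b: biased first moment
        v = beta2 * v + (1 - beta2) * g**2                   # step c: biased second raw moment
        m_hat = m / (1 - beta1**t)                           # step d: bias correction
        v_hat = v / (1 - beta2**t)                           # step e: bias correction
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # step f: parameter update
    return theta

# Toy objective (hypothetical): f(theta) = sum((theta - target)^2), minimized at target.
target = np.array([3.0, -2.0])

def quadratic_grad(theta):
    return 2.0 * (theta - target)

theta_final = adam(quadratic_grad, np.zeros(2), alpha=0.1, num_steps=2000)
print(theta_final)
```

Every operation is elementwise, so each parameter gets its own $m$, $v$, and effective step size. In practice you would normally use a library optimizer rather than a hand-rolled loop like this one.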

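One of the major advantages of Adam over traditional SGD is that it typically requires less tuning of the learning rate and often converges faster in practice. Its updates are also invariant to diagonal rescaling of the gradients: multiplying each parameter's gradient by its own positive constant leaves the update essentially unchanged, because $\hat{m}_t$ and $\sqrt{\hat{v}_t}$ scale by the same factor. Separately, the paper presents Adam as well suited to problems with noisy and/or sparse gradients.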
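The rescaling invariance is easy to check numerically. The sketch below is an illustration with made-up random gradients and a hypothetical helper `adam_updates`. It sets $\epsilon$ to 0 so the invariance is exact up to floating-point rounding; with the usual tiny $\epsilon$ it holds only approximately.

```python
import numpy as np

def adam_updates(grads, alpha=0.001, beta1=0.9, beta2=0.999, eps=0.0):
    """Return the parameter update Adam would apply at each step of a gradient sequence."""
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        updates.append(-alpha * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(updates)

rng = np.random.default_rng(0)
grads = rng.normal(size=(50, 3))        # 50 steps of gradients for 3 parameters
scale = np.array([0.01, 1.0, 100.0])    # a different positive scale per parameter

original = adam_updates(grads)
rescaled = adam_updates(grads * scale)  # diagonal rescaling of the gradients
print(np.allclose(original, rescaled))
```

Plain SGD would behave very differently here: its updates would shrink or grow by exactly the factors in `scale`.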