Bag of tricks for optimizing optimization hyperparameters:
1. Use a learning rate finder.
2. When the batch size is multiplied by $k$, multiply the learning rate by $\sqrt{k}$.
3. Use cyclical momentum: when the learning rate is high, momentum should be low, and vice versa.
4. Use weight decay: first set it to 0, then find the best learning rate, then tune the weight decay.
The paper "A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay" by Leslie N. Smith and Nicholay Topin in 2018 presents a systematic methodology for the selection and tuning of key hyperparameters in training neural networks.
Training a neural network involves numerous hyperparameters, such as the learning rate, batch size, momentum, and weight decay. These hyperparameters can significantly impact the model's performance, yet their optimal settings are often problem-dependent and can be challenging to determine. Traditionally, these hyperparameters have been tuned somewhat arbitrarily or through computationally expensive grid or random search methods.
Smith aims to provide a disciplined, systematic approach to selecting and tuning these critical hyperparameters, reducing the guesswork and computational resources that hyperparameter tuning typically requires.
The paper proposes strategies and techniques for tuning each of these hyperparameters; an illustrative code sketch for each item follows the list below:
a. Learning Rate: Smith recommends using a learning rate finder: train the model for a few epochs while the learning rate increases linearly or exponentially, and plot the loss against the learning rate. The learning rate associated with the steepest decrease in loss is chosen.
b. Batch Size: The paper proposes a relationship between batch size and learning rate: when the batch size is multiplied by $k$, the learning rate should be multiplied by $\sqrt{k}$.
c. Momentum: Smith recommends a cyclical momentum schedule: when the learning rate is high, the momentum should be low, and vice versa.
d. Weight Decay: Smith advises first setting the weight decay to 0, finding the optimal learning rate, and then tuning the weight decay.
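For (a), the sketch below is a minimal learning-rate range test in PyTorch, illustrating the idea rather than reproducing the paper's implementation. The names `model`, `loader`, and `loss_fn` are placeholders for whatever model, data loader, and loss you are already using: the learning rate grows exponentially each mini-batch and the (learning rate, loss) pairs are recorded so they can be plotted afterwards.

```python
import torch

def lr_range_test(model, loader, loss_fn, lr_min=1e-7, lr_max=10.0, num_iters=100):
    """Minimal LR range test: raise the learning rate exponentially each step
    and record (lr, loss); plot the result and pick an LR near the steepest
    drop in loss, before the loss blows up."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
    gamma = (lr_max / lr_min) ** (1.0 / num_iters)  # per-step LR multiplier
    history = []
    data_iter = iter(loader)
    for _ in range(num_iters):
        try:
            x, y = next(data_iter)
        except StopIteration:            # restart the loader if it runs out
            data_iter = iter(loader)
            x, y = next(data_iter)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        lr = optimizer.param_groups[0]["lr"]
        history.append((lr, loss.item()))
        for group in optimizer.param_groups:  # exponential increase of the LR
            group["lr"] = lr * gamma
    return history
```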
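For (b), a tiny helper that applies the square-root scaling rule quoted above; the function name and the example numbers are only for illustration.

```python
import math

def scale_lr_for_batch_size(base_lr: float, base_batch: int, new_batch: int) -> float:
    """If the batch size grows by a factor k, grow the learning rate by sqrt(k)."""
    k = new_batch / base_batch
    return base_lr * math.sqrt(k)

# An LR of 0.1 found at batch size 64 becomes 0.2 at batch size 256 (k = 4, sqrt(k) = 2).
print(scale_lr_for_batch_size(0.1, 64, 256))  # 0.2
```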
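For (c), one convenient way to get this behaviour in PyTorch is `torch.optim.lr_scheduler.OneCycleLR`, which with `cycle_momentum=True` moves momentum opposite to the learning rate (momentum is lowest while the learning rate is at its peak). The model, hyperparameter values, and step count below are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.95)

# LR ramps up to max_lr and back down over total_steps; momentum is cycled
# inversely, dropping toward base_momentum while the LR is at its peak.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,
    total_steps=1000,
    base_momentum=0.85,
    max_momentum=0.95,
    cycle_momentum=True,
)

for step in range(1000):
    # forward pass, loss computation and loss.backward() would go here
    optimizer.step()   # no-op without gradients; shown only for call order
    scheduler.step()
```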
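For (d), a sketch of the procedure as a small loop: hold the learning rate at the value found with weight decay set to 0, then compare a short grid of weight-decay values. `build_model` and `short_training_run` are hypothetical helpers standing in for whatever model construction and brief train-and-validate routine you already have; they are not part of the paper.

```python
import torch

def tune_weight_decay(build_model, short_training_run, best_lr,
                      candidates=(0.0, 1e-5, 1e-4, 1e-3)):
    """Fix the LR found with weight decay 0, then keep the weight decay
    whose short run gives the lowest validation loss.

    build_model() and short_training_run(model, optimizer) are assumed
    (hypothetical) helpers; the latter should return a validation loss."""
    results = {}
    for wd in candidates:
        model = build_model()  # fresh model per candidate
        optimizer = torch.optim.SGD(model.parameters(), lr=best_lr,
                                    momentum=0.9, weight_decay=wd)
        results[wd] = short_training_run(model, optimizer)
    return min(results, key=results.get)
```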
Smith validates the methodology on a variety of datasets and models, including CIFAR-10 and ImageNet, and finds that the approach yields competitive or superior performance compared to traditionally tuned models, often at a lower computational cost.
This paper offers a structured and more intuitive way to handle hyperparameter tuning, which can often be a complex and time-consuming part of model training. The methods proposed could potentially save researchers and practitioners a significant amount of time and computational resources.
Moreover, the findings challenge some common practices in deep learning, such as the use of a fixed momentum value, and could encourage further exploration of dynamic or cyclical hyperparameter schedules.
However, as with any methodology, the effectiveness of these techniques may depend on the specific task or dataset. For example, the optimal relationship between batch size and learning rate may differ across model architectures or optimization algorithms.