Basically, this is a super popular and effective method for regularizing neural networks and improving generalization. You zero out neuron outputs during training with some fixed probability, which forces the network not to over-rely on specific neurons and to learn important feature representations multiple times (in slightly different ways, necessarily). In theory and in practice this tends to prevent overfitting, which improves generalization. Later research has shown that approaches like dropout are not necessary to achieve surprisingly good generalization from neural networks, but it is still a very useful approach that is very commonly used.
During testing, dropout is not used. NOTE: I have tested using dropout during testing to get a distribution of output predictions as a method of approximating model uncertainty. There are various issues with this and it only gets at a specific type of model uncertainty, but it can still be quite useful and is quite easy to implement.
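For what it's worth, here's a minimal sketch of that trick (usually called Monte Carlo dropout) in PyTorch; `model` and `x` are placeholders for whatever network and input batch you're working with, and the only assumption is that the network's dropout layers are `nn.Dropout` modules:

```python
import torch

def mc_dropout_predict(model, x, n_samples=50):
    """Approximate a predictive distribution by keeping dropout active at test time."""
    model.eval()                              # put the whole model in eval mode...
    for m in model.modules():                 # ...then switch just the dropout layers back on
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)   # per-output mean and spread
```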
"Dropout: A Simple Way to Prevent Neural Networks from Overfitting" is a groundbreaking paper by Hinton et al. that introduced the concept of "dropout" as a simple and effective regularization technique to prevent overfitting in neural networks.
The dropout technique involves randomly "dropping out", or zeroing, a number of neuron outputs (output features) of the hidden layers in a neural network during training (you can apply this to the raw input features in a non-hidden layer, too!). Each neuron output ('hidden unit' in the paper) is kept with a probability of $p$ and set to zero with a probability of $1-p$. This introduces noise into the output values of a layer, which can be thought of as creating a "thinned" network. Each unique dropout configuration corresponds to a different thinned network, and all these networks share weights. During training, dropout samples from this exponential set of different thinned architectures.
In more detail, if $y$ is the vector of outputs of the layer that dropout is applied to, and $r$ is a vector of independent Bernoulli random variables, each of which has probability $p$ of being 1, then the operation of a dropout layer during training can be described by:
$$ r_j \sim \mathrm{Bernoulli}(p) $$
$$ \tilde{y} = r \odot y $$
where $\odot$ denotes element-wise multiplication.
During testing, no units are dropped out; instead, the layer's outputs are scaled by the retention probability $p$, to account for the fact that more units are active than at training time (on average only a fraction $p$ of them were active then). This can be seen as averaging over the ensemble of all the thinned subnetworks, and this ensemble effect helps to reduce overfitting.
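To make the two phases concrete, here's a minimal NumPy sketch of the paper's convention, with $p$ the retention probability (the 0.8 and the toy array are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(y, p=0.8):
    """Training: keep each output with probability p, zero it otherwise."""
    r = rng.binomial(n=1, p=p, size=y.shape)   # r_j ~ Bernoulli(p)
    return r * y                               # y_tilde = r (element-wise) y

def dropout_test(y, p=0.8):
    """Testing: keep everything, but scale by p so expected values match training."""
    return p * y

y = rng.normal(size=(4, 8))                    # stand-in for a layer's outputs
print(dropout_train(y))
print(dropout_test(y))
```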
The dropout method provides a computationally cheap and remarkably effective regularization method to combine the predictions of many different models in order to improve generalization. The paper shows that dropout improves the performance of neural networks on supervised learning tasks in speech recognition, document classification, and several visual object recognition tasks. It's now a standard technique for training neural networks, especially deep neural networks.
The key idea behind dropout is to introduce randomness in the hidden layers of the network during training, which helps to prevent overfitting. By randomly dropping out neurons, we are essentially creating a new network at each training step. This is akin to training multiple separate neural networks with different architectures in parallel.
The training process with dropout can be summarized as follows (a minimal training-loop sketch is given after the steps):
At each training step, a random subset of hidden units is dropped (each unit independently, with the same probability), temporarily removing their contribution to the network's computation for that step.
The learning algorithm runs as usual and backpropagates the error terms and updates the weights.
At the next training step, a different set of hidden units is dropped out.
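A minimal PyTorch sketch of those steps (the layer sizes, dropout probability, and toy data are all illustrative; note that PyTorch's `p` is the drop probability, and it rescales the kept units by $1/(1-p)$ during training, the so-called inverted-dropout convention, rather than scaling at test time as in the paper):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.5),   # fresh mask every forward pass
    nn.Linear(256, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 784)                  # toy batch
target = torch.randint(0, 10, (32,))      # toy labels

model.train()                             # dropout active: each step trains a thinned network
for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()                       # backprop through the currently-thinned network
    opt.step()

model.eval()                              # dropout switched off for evaluation
```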
The randomness introduced by dropout forces each hidden unit to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
During testing or evaluation, the dropout procedure is not applied and the full network is used. However, the output of each neuron is multiplied by the retention probability $p$ to compensate for the fact that, during training, only a fraction $p$ of the neurons were active on average.
The use of dropout has been found to significantly improve the performance of deep neural networks, especially those suffering from overfitting due to having a large number of parameters. It is now a commonly used technique in deep learning model training.
In terms of mathematical representation, if we denote the output from a dropout layer during training as $\tilde{y} = r \odot y$, where $r$ is a vector of independent Bernoulli random variables each equal to 1 with probability $p$, then the output from the same layer during testing is $y' = p \cdot y$, i.e. the outputs are scaled down by the retention probability $p$.
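One practical note, as a general observation about common frameworks rather than something from the paper: most modern implementations use the equivalent "inverted" form, scaling the kept units by $1/p$ during training and leaving the test-time forward pass untouched. A quick NumPy sketch of why the two conventions agree in expectation:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.8                                    # retention probability (illustrative)
y = rng.normal(loc=1.0, size=1_000_000)    # stand-in layer outputs with nonzero mean

r = rng.binomial(1, p, size=y.shape)       # shared Bernoulli mask for the comparison

paper_train    = r * y                     # paper: mask at train time...
paper_test     = p * y                     # ...scale by p at test time
inverted_train = r * y / p                 # inverted: scale by 1/p at train time...
inverted_test  = y                         # ...do nothing at test time

# Both conventions give train-time activations whose mean matches test time.
print(paper_train.mean(), paper_test.mean())        # both ~ p * E[y]
print(inverted_train.mean(), inverted_test.mean())  # both ~ E[y]
```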
Dropout can be used along with other regularization techniques such as weight decay and max-norm constraints. It can also be combined with other optimization methods and learning rate schedules. The paper suggests that using dropout prevents network units from co-adapting too much to the data, thus improving the network's ability to generalize to unseen data.
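For instance, the max-norm constraint the paper recommends alongside dropout can be approximated by projecting each unit's incoming weight vector back inside a fixed-radius ball after every optimizer step; a sketch in PyTorch (the `max_norm=3.0` value and the restriction to `nn.Linear` layers are just illustrative assumptions):

```python
import torch

def apply_max_norm(model, max_norm=3.0):
    """Clamp the L2 norm of each unit's incoming weight vector to at most max_norm."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                w = module.weight                        # shape: (out_features, in_features)
                norms = w.norm(dim=1, keepdim=True)      # one norm per output unit
                scale = (max_norm / (norms + 1e-12)).clamp(max=1.0)
                w.mul_(scale)                            # only ever shrinks weights

# Typical use: call apply_max_norm(model) right after each opt.step().
```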
Since the original Dropout was proposed, several variations have been developed to improve or alter the original mechanism. Here are a few examples:
Spatial Dropout: This is a variant of dropout designed for convolutional neural networks. In standard dropout, activations are dropped randomly and independently; in contrast, spatial dropout drops entire feature maps (channels) of the convolutional feature volume. Because nearby activations within a feature map are highly correlated, dropping them individually does little, whereas dropping whole channels forces the model to make use of all of its features rather than relying too heavily on any specific one, which helps it generalize and reduces overfitting.
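A minimal sketch of the channel-wise masking, assuming NCHW feature maps and the inverted-dropout scaling; in PyTorch the built-in `nn.Dropout2d` does essentially this:

```python
import torch

def spatial_dropout(x, p_drop=0.2, training=True):
    """Zero out whole feature maps (channels) of x with shape (N, C, H, W)."""
    if not training or p_drop == 0.0:
        return x
    keep = 1.0 - p_drop
    # one Bernoulli draw per (sample, channel), broadcast across all spatial positions
    mask = (torch.rand(x.shape[0], x.shape[1], 1, 1, device=x.device) < keep).float()
    return x * mask / keep
```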
DropConnect: This is another variant of dropout where instead of deactivating the neurons, it drops the connections between neurons. In other words, it sets weights within the network to zero with a certain probability during the training process.
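A sketch of the idea for a single fully connected layer (the original DropConnect paper uses a fancier sampling-based approximation at inference time; here I just use the usual rescaling shortcut, so treat it as an approximation):

```python
import torch
import torch.nn.functional as F

def dropconnect_linear(x, weight, bias=None, p_drop=0.5, training=True):
    """Linear layer whose individual weights, not activations, are randomly dropped."""
    if training:
        mask = (torch.rand_like(weight) >= p_drop).float()
        weight = weight * mask / (1.0 - p_drop)   # rescale the surviving connections
    return F.linear(x, weight, bias)
```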
Variational Dropout: This variant extends dropout into a Bayesian framework, where dropout is used as a variational inference technique to approximate the posterior distribution of the weights. Variational dropout can adapt the dropout rates automatically and can also be applied to recurrent architectures. NOTE: imo most forms of Bayesian neural networks - where weights have prior distributions - are kind of dumb, and are not doing what you might think they're doing when you hear 'Bayesian neural network'. Not sure exactly how this approach would work, but - as a Bayesian - I usually don't bother with Bayesian neural networks.
Alpha Dropout: This variant of dropout was developed for Self-Normalizing Neural Networks (SNNs). It preserves the self-normalizing property: the mean and variance of the inputs are maintained through the dropout layer, allowing the SNN to keep self-normalizing.
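In PyTorch this is available out of the box; a tiny usage sketch, paired with the SELU activation it is designed for (the sizes and rate are arbitrary):

```python
import torch.nn as nn

snn_block = nn.Sequential(
    nn.Linear(128, 128),
    nn.SELU(),
    nn.AlphaDropout(p=0.1),   # keeps mean and variance roughly intact, unlike plain Dropout
)
```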
DropBlock: This variant is designed for convolutional neural networks. DropBlock extends the idea of spatial dropout by dropping contiguous regions of a feature map (blocks) instead of dropping out individual elements independently.
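A simplified sketch of the masking (assuming NCHW feature maps and an odd `block_size`; the actual DropBlock paper restricts seed positions to the valid interior region and schedules the drop rate, which this skips):

```python
import torch
import torch.nn.functional as F

def dropblock(x, p_drop=0.1, block_size=5, training=True):
    """Zero contiguous block_size x block_size squares of each feature map in x: (N, C, H, W)."""
    if not training or p_drop == 0.0:
        return x
    n, c, h, w = x.shape
    # seed probability chosen so roughly p_drop of all activations end up inside a dropped block
    gamma = p_drop * (h * w) / (block_size ** 2) / ((h - block_size + 1) * (w - block_size + 1))
    seeds = (torch.rand(n, c, h, w, device=x.device) < gamma).float()
    # grow each seed into a block_size x block_size square of dropped positions
    drop_mask = F.max_pool2d(seeds, kernel_size=block_size, stride=1, padding=block_size // 2)
    keep_mask = 1.0 - drop_mask
    # renormalize so the expected magnitude of the surviving activations is preserved
    return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1.0)
```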
Each of these methods has different effects and can be better suited to some types of tasks or network architectures than others. The choice of which to use would typically be based on the specifics of the task and the network architecture in use.