This paper's main contributions are a new weight initialization method for rectifier networks (commonly called "He Initialization") and a new variant of the ReLU activation function, the Parametric Rectified Linear Unit (PReLU).
The authors noted that existing initialization methods, such as Xavier initialization, did not perform well for deep networks with rectified linear units (ReLUs). Xavier initialization is derived under the assumption that the activations are linear. ReLUs are not linear: they zero out negative inputs, so under Xavier scaling the variance of the signal shrinks from layer to layer, and very deep networks can stall during training.
To address this issue, the authors proposed a new method for initialization, which they referred to as "He Initialization". It is similar to Xavier initialization, but it takes into account the non-linearity of the ReLU function. The initialization method is defined as follows:
\[ W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right) \]
where \(n_{\text{in}}\) is the number of inputs to each neuron (the fan-in), \(W\) is the weight matrix, and \(\mathcal{N}(0, \sigma^2)\) denotes a Gaussian distribution with mean 0 and variance \(\sigma^2\). The weights therefore have standard deviation \(\sqrt{\frac{2}{n_{\text{in}}}}\).
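The factor of 2 has a simple reading: a ReLU zeroes roughly half of a zero-centered pre-activation, so each layer would otherwise halve the signal's second moment. The following is a minimal NumPy sketch of that effect, not the authors' code. It assumes square fully connected layers with no biases, and the function names (`he_normal`, `fan_in_normal`, `last_layer_variance`) are made up for illustration. The baseline uses standard deviation sqrt(1 / n_in), which is what Xavier scaling gives when fan-in equals fan-out.

```python
import numpy as np


def he_normal(n_in, n_out, rng):
    """Draw an (n_out, n_in) weight matrix with std sqrt(2 / n_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))


def fan_in_normal(n_in, n_out, rng):
    """Baseline: std sqrt(1 / n_in), i.e. Xavier scaling when n_in == n_out."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))


def last_layer_variance(init_fn, depth=20, width=512, batch=256, seed=0):
    """Push unit-variance inputs through `depth` ReLU layers (no biases)
    and return the variance of the final layer's pre-activations."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(width, batch))  # hypothetical unit-variance inputs
    y = x
    for _ in range(depth):
        y = init_fn(width, width, rng) @ x  # pre-activations of this layer
        x = np.maximum(y, 0.0)              # ReLU
    return float(y.var())


if __name__ == "__main__":
    print("He scaling:     ", last_layer_variance(he_normal))
    print("fan-in baseline:", last_layer_variance(fan_in_normal))
```

With these settings the He-scaled stack should keep the pre-activation variance at order one, while the baseline's variance should fall by roughly a factor of two per layer.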
The paper also introduces a new activation function called the Parametric Rectified Linear Unit (PReLU). The standard ReLU activation function is defined as \(f(x) = \max(0, x)\): it passes positive inputs through unchanged and outputs zero otherwise. Despite its advantages, ReLU has a drawback known as the "dying ReLU" problem, in which a neuron outputs 0 for every input, receives no gradient, and therefore stops learning during training.
The PReLU is defined as follows:
\[ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ a_i x & \text{if } x < 0 \end{cases} \]
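where \(a_i\) is a learnable parameter that sets the slope for negative inputs; the subscript \(i\) indicates that the slope may differ from channel to channel. When \(a_i\) is fixed at 0, PReLU reduces to the standard ReLU function. When \(a_i\) is fixed at a small value (e.g., 0.01), PReLU reduces to the Leaky ReLU function. In PReLU, however, \(a_i\) is learned jointly with the other parameters during training.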
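To make "learned during training" concrete, here is a minimal NumPy sketch of a PReLU forward pass and the gradients backpropagation would need. It is an illustration under stated assumptions rather than the paper's implementation: inputs are assumed to have shape `(batch, channels)` with one slope per channel, and the names `prelu_forward` and `prelu_backward` are hypothetical.

```python
import numpy as np


def prelu_forward(x, a):
    """PReLU: x where x >= 0, a * x where x < 0.

    x: array of shape (batch, channels)
    a: array of shape (channels,), one learnable slope per channel
    """
    return np.where(x >= 0, x, a * x)


def prelu_backward(x, a, grad_out):
    """Gradients of the loss w.r.t. the input x and the slopes a.

    grad_out: upstream gradient, same shape as x.
    """
    # df/dx is 1 on the positive side and a on the negative side.
    grad_x = np.where(x >= 0, 1.0, a) * grad_out
    # df/da is 0 on the positive side and x on the negative side;
    # each slope is shared across the batch, so sum over that axis.
    grad_a = (np.where(x >= 0, 0.0, x) * grad_out).sum(axis=0)
    return grad_x, grad_a


if __name__ == "__main__":
    x = np.array([[-2.0, 3.0], [1.0, -4.0]])
    a = np.array([0.25, 0.1])  # illustrative slopes, not prescribed values
    print(prelu_forward(x, a))
    print(prelu_backward(x, a, np.ones_like(x)))
```

Setting `a` to all zeros makes `prelu_forward` coincide with ReLU, and because `grad_a` is nonzero whenever a channel sees negative inputs, the slopes can be updated by the same optimizer as the weights.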
The authors evaluated their methods on the 1000-class ImageNet 2012 classification dataset used in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) and achieved top results. Using an ensemble of their models, they achieved a top-5 test error of 4.94%, surpassing the reported human-level performance of 5.1%.
The introduction of He Initialization and PReLU has had a significant impact on the field of deep learning, most notably by making it practical to train deeper networks.
While the He initialization and PReLU have been widely adopted, they are not without limitations:
He Initialization: While this method works well with ReLU and its variants, it might not be the best choice for other activation functions. Therefore, the choice of initialization method still depends on the specific activation function used in the network.
PReLU: While PReLU helps mitigate the dying ReLU problem, it introduces additional parameters to be learned, which adds somewhat to the complexity and computational cost of the model. In some cases, other techniques such as batch normalization, or other activation functions, might be preferred instead.
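The first limitation can be made concrete for the leaky family of rectifiers. The sketch below assumes the variance argument extends to a negative-side slope `a` by replacing the factor 2 with 2 / (1 + a^2), which is the form the paper gives for PReLU layers; the function name `rectifier_init_std` is hypothetical, and activations outside this family (for example tanh or sigmoid) are not covered.

```python
import math


def rectifier_init_std(n_in, negative_slope=0.0):
    """Standard deviation for zero-mean Gaussian weights feeding a
    rectifier whose negative-side slope is `negative_slope`.

    negative_slope = 0 recovers the ReLU case, sqrt(2 / n_in);
    negative_slope = 1 is a purely linear unit, sqrt(1 / n_in).
    """
    return math.sqrt(2.0 / ((1.0 + negative_slope ** 2) * n_in))


if __name__ == "__main__":
    print(rectifier_init_std(512))        # ReLU: 0.0625
    print(rectifier_init_std(512, 0.25))  # a PReLU-style slope
    print(rectifier_init_std(512, 1.0))   # linear unit
```

The scale shrinks as the slope grows, so a value tuned for ReLU is larger than a linear or near-linear unit needs.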
In conclusion, the paper "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" made significant contributions to the field of deep learning by introducing He initialization and the PReLU activation function. These methods have been widely adopted and have helped improve the performance of deep neural networks, particularly in computer vision tasks.