# Delving Deep into Rectifiers

This paper proposes a new initialization method for the weights in neural networks and introduces a new activation function called the Parametric ReLU (PReLU).

### Introduction

This paper's main contributions are a new initialization method for rectifier networks (commonly called "He initialization") and a new variant of the ReLU activation function, the Parametric Rectified Linear Unit (PReLU).

### He Initialization

The authors noted that existing initialization methods, such as Xavier initialization, did not perform well for networks with rectified linear units (ReLUs). Xavier initialization is derived under the assumption that the activations are linear. ReLU is not linear: it zeroes out the negative half of its input, so the variance of a layer's output no longer matches the variance of its input, and in deep networks this mismatch compounds layer after layer.
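To see why the linearity assumption matters, here is a small NumPy sketch (my own illustration, not from the paper) showing that a Xavier-scaled layer followed by ReLU cuts the signal's second moment roughly in half:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096
x = rng.standard_normal(n)  # zero-mean input signal, variance ~ 1

# Xavier-style scaling (std = sqrt(1/n)) preserves variance through the
# linear part of the layer...
W = rng.standard_normal((n, n)) * np.sqrt(1.0 / n)
pre = W @ x
print(np.var(pre))  # ~ 1

# ...but ReLU zeroes the negative half, so the second moment E[y^2] of the
# output is only about half of the pre-activation variance.
post = np.maximum(0.0, pre)
print(np.mean(post**2) / np.var(pre))  # ~ 0.5
```

Repeated over many layers, this halving shrinks the signal exponentially with depth, which is exactly the factor of 2 that He initialization compensates for.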

To address this issue, the authors proposed a new initialization scheme, now commonly known as "He initialization". It is similar to Xavier initialization but accounts for the non-linearity of the ReLU function. The weights are drawn as follows:

$$ W \sim \mathcal{N}\left(0, \sqrt{\frac{2}{n_{\text{in}}}}\right) $$

where $n_{\text{in}}$ is the number of input neurons, $W$ is the weight matrix, and $\mathcal{N}(0, \sqrt{2/n_{\text{in}}})$ represents a Gaussian distribution with mean 0 and standard deviation $\sqrt{2/n_{\text{in}}}$ (i.e., variance $2/n_{\text{in}}$).
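As a concrete illustration, here is a minimal NumPy sketch of this sampling rule (the helper name `he_init` is mine, not from the paper):

```python
import numpy as np

def he_init(n_in, n_out, rng=None):
    """Draw a weight matrix from a Gaussian with mean 0, std sqrt(2 / n_in)."""
    rng = rng or np.random.default_rng()
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

W = he_init(256, 128, rng=np.random.default_rng(0))
print(W.shape)  # (128, 256)
print(W.std())  # close to sqrt(2/256) ~ 0.0884
```

Deep-learning frameworks ship equivalents of this rule; the point of the sketch is only that the scale depends on the fan-in through the factor $\sqrt{2/n_{\text{in}}}$.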

### Parametric ReLU (PReLU)

The paper also introduces a new activation function called the Parametric Rectified Linear Unit (PReLU). The standard ReLU activation function is defined as $f(x) = \max(0, x)$: it passes positive inputs through unchanged and outputs zero otherwise. Despite its advantages, ReLU suffers from the "dying ReLU" problem, where a neuron can end up always outputting 0, effectively killing the neuron and preventing it from learning during training.

The PReLU is defined as follows:

$$ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ a_i x & \text{if } x < 0 \end{cases} $$

where $a_i$ is a learnable parameter (the subscript $i$ indicates the slope can differ per channel). When $a_i = 0$, PReLU reduces to the standard ReLU function. When $a_i$ is fixed at a small value (e.g., 0.01), it becomes the Leaky ReLU function. In PReLU, however, $a_i$ is learned jointly with the other parameters during training.
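A minimal NumPy sketch of the forward pass (using a single scalar slope `a` for simplicity, whereas the paper also considers a channel-wise $a_i$):

```python
import numpy as np

def prelu(x, a):
    """PReLU: pass positives through unchanged, scale negatives by slope a."""
    return np.where(x >= 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(prelu(x, 0.0))   # a = 0    -> behaves like standard ReLU
print(prelu(x, 0.25))  # a = 0.25 -> negatives are scaled, not zeroed
```

During training, the slope is updated by gradient descent like any other weight: the gradient of $f$ with respect to $a_i$ is $x$ where $x < 0$ and 0 elsewhere, so negative inputs still carry a learning signal.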

### Experimental Results

The authors evaluated their methods on the ImageNet 2012 classification dataset (ILSVRC) and achieved top results. Using an ensemble of their models, they achieved a 4.94% top-5 test error, surpassing the reported human-level performance of 5.1%.

### Implications

The introduction of He initialization and PReLU has had a significant impact on the field of deep learning:

• He Initialization: It has become common practice to use He initialization for neural networks with ReLU and its variants. This method helps mitigate the problem of vanishing/exploding gradients, enabling the training of deeper networks.

• PReLU: PReLU and the closely related Leaky ReLU are now widely used in various deep learning architectures. They help mitigate the "dying ReLU" problem, where some neurons become inactive and cease to contribute to the learning process.

### Limitations

While He initialization and PReLU have been widely adopted, they are not without limitations:

• He Initialization: While this method works well with ReLU and its variants, it might not be the best choice for other activation functions. Therefore, the choice of initialization method still depends on the specific activation function used in the network.

• PReLU: While PReLU helps mitigate the dying ReLU problem, it introduces additional learnable parameters, increasing the complexity and computational cost of the model. In some cases, techniques such as batch normalization or simpler activation functions may be preferred for their lower overhead.

### Conclusion

In conclusion, the paper "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" made significant contributions to the field of deep learning by introducing He initialization and the PReLU activation function. These methods have been widely adopted and have helped improve the performance of deep neural networks, particularly in computer vision tasks.

Tags: Delving Deep into Rectifiers, Optimization