Rethinking the Inception Architecture for Computer Vision

TLDR

Mostly, instead of using large convolutional layers (e.g. 5x5), use stacked, smaller convolutional layers (e.g. a 3x3 feeding into another 3x3), as this uses fewer parameters while maintaining the receptive field. Also, auxiliary classifiers (extra losses) help, though the paper argues they act mainly as regularizers.

Motivation

The authors begin by discussing the motivations behind their work. They found that the Inception v1 architecture, which was introduced in their previous paper titled "Going Deeper with Convolutions," was computationally expensive and had a large number of parameters. This led to problems with overfitting and made the model difficult to train.

Factorization into smaller convolutions

One of the key insights of the paper is that convolutions can be factorized into smaller ones. The authors show that a 5x5 convolution can be replaced with two 3x3 convolutions, and a 3x3 convolution can be replaced with a 1x3 followed by a 3x1 convolution.

This factorization not only reduces the computational cost but, according to the paper, also improves the performance of the model.

Mathematically, this is represented as:

\[ \text{5x5 convolution} \rightarrow \text{3x3 convolution} + \text{3x3 convolution} \]

\[ \text{3x3 convolution} \rightarrow \text{1x3 convolution} + \text{3x1 convolution} \]

The factorization in the Inception architecture is achieved by breaking down larger convolutions into a series of smaller ones. Let's go into detail with an example:

Consider a 5x5 convolution. It needs 25 multiply-adds for each output pixel (per pair of input and output channels). If we replace this single 5x5 convolution with two stacked 3x3 convolutions, we cover the same receptive field with fewer computations. Here's why:

A 3x3 convolution needs 9 multiply-adds for each output pixel. Stacking two of them costs \(2 \times 9 = 18\) multiply-adds, which is 28% fewer than the 25 required by the original 5x5 convolution. The two 3x3 convolutions still cover a 5x5 receptive field: each output of the second convolution depends on a 3x3 patch of first-layer outputs, and each of those depends on its own 3x3 patch of the input.
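The receptive-field claim can be checked numerically. The following is a minimal sketch in plain Python, assuming stride 1, no padding, and no nonlinearity between the layers; the helper names `conv2d_valid`, `receptive_field`, and `output_after_bump` are made up for this illustration.

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation on nested lists (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions: 1 + sum(k - 1)."""
    return 1 + sum(k - 1 for k in kernel_sizes)


# A 5x5 input patch shrinks to 3x3 after the first 3x3 layer
# and to a single value after the second one.
patch = [[1] * 5 for _ in range(5)]
ones3 = [[1] * 3 for _ in range(3)]
hidden = conv2d_valid(patch, ones3)
output = conv2d_valid(hidden, ones3)


def output_after_bump(i, j):
    """Recompute the single output after changing input pixel (i, j)."""
    bumped = [row[:] for row in patch]
    bumped[i][j] += 1
    return conv2d_valid(conv2d_valid(bumped, ones3), ones3)[0][0]


# Every one of the 25 input pixels influences the single output value.
all_pixels_matter = all(
    output_after_bump(i, j) != output[0][0]
    for i in range(5) for j in range(5)
)

print(receptive_field([3, 3]), receptive_field([5]), all_pixels_matter)
```

Changing any of the 25 input pixels changes the final output, which is what a 5x5 receptive field means.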

Similarly, a 3x3 convolution can be replaced by a 1x3 convolution followed by a 3x1 convolution. This works because the composition of the two convolutions also covers a 3x3 receptive field, but with \(3 + 3 = 6\) weights instead of 9, a saving of 33%.
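To make the 1x3 and 3x1 composition concrete, here is a small sketch in plain Python. It assumes stride 1, no padding, and no nonlinearity between the two layers; the kernel values and the helper `conv2d_valid` are hypothetical and chosen only for illustration.

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation on nested lists (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


row = [[1, 2, 1]]        # 1x3 kernel (3 weights)
col = [[1], [0], [-1]]   # 3x1 kernel (3 weights)

# The single 3x3 kernel that the pair is equivalent to: outer product col * row.
full = [[c[0] * r for r in row[0]] for c in col]

image = [[5 * i + j for j in range(5)] for i in range(5)]

two_step = conv2d_valid(conv2d_valid(image, row), col)  # 1x3, then 3x1
one_step = conv2d_valid(image, full)                    # one 3x3

print(two_step == one_step)
```

The two results match, so the pair does see a full 3x3 window. Note that without a nonlinearity in between, the pair can only express 3x3 kernels that are an outer product of a column and a row, so the factorization is a restricted form of a general 3x3 convolution rather than an exact replacement for every kernel.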

The motivation for these factorizations is to reduce the computational cost (number of parameters and operations) while maintaining a similar model capacity and receptive field size. This can help to improve the efficiency and performance of the model.
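The savings can be tallied for a whole layer. This is a rough sketch: it assumes stride 1, padding that keeps the output grid fixed, equal input and output channel counts, and no biases. The channel count (64) and grid size (17x17) are arbitrary illustration values, and `conv_cost` and `stack_cost` are hypothetical helper names.

```python
def conv_cost(kh, kw, c_in, c_out, out_h, out_w):
    """Return (weights, multiply_adds) for one conv layer, ignoring biases."""
    weights = kh * kw * c_in * c_out
    return weights, weights * out_h * out_w


def stack_cost(kernels, channels, out_h, out_w):
    """Sum the cost of several conv layers applied one after another."""
    total_weights = total_madds = 0
    for kh, kw in kernels:
        w, m = conv_cost(kh, kw, channels, channels, out_h, out_w)
        total_weights += w
        total_madds += m
    return total_weights, total_madds


C, H, W = 64, 17, 17  # illustrative sizes only

five = stack_cost([(5, 5)], C, H, W)
two_three = stack_cost([(3, 3), (3, 3)], C, H, W)
three = stack_cost([(3, 3)], C, H, W)
asymmetric = stack_cost([(1, 3), (3, 1)], C, H, W)

print("5x5 vs two 3x3 weights:", five[0], two_three[0])
print("3x3 vs 1x3 + 3x1 weights:", three[0], asymmetric[0])
print("cost ratios:", two_three[1] / five[1], asymmetric[1] / three[1])
```

Under these assumptions the ratios are 18/25 and 6/9 regardless of the channel count or grid size, since both appear as common factors.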

Auxiliary classifiers

Another element the paper revisits is the auxiliary classifier, which was already part of Inception v1. These are additional classifiers attached to intermediate layers of the network. Their original goal was to propagate the gradient back to the earlier layers of the network and so mitigate the vanishing gradient problem. The paper finds that they do little early in training and argues that they act mainly as regularizers.
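One common way to wire auxiliary classifiers into training is to add their losses to the main loss with a small weight. The sketch below shows that idea in plain Python for a single example. The function names and the default weight of 0.3 are assumptions made for illustration, not values taken from this summary.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, using the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def total_loss(main_logits, aux_logits_list, target, aux_weight=0.3):
    """Main loss plus down-weighted losses from the auxiliary heads.

    aux_weight is an illustrative assumption.
    """
    main = cross_entropy(main_logits, target)
    aux = sum(cross_entropy(a, target) for a in aux_logits_list)
    return main + aux_weight * aux


main_logits = [2.0, 0.5, -1.0]   # output of the final classifier
aux_logits = [[1.0, 0.8, 0.1]]   # output of one intermediate classifier
loss = total_loss(main_logits, aux_logits, target=0)
print(loss)
```

Because the auxiliary term enters the total loss, its gradient reaches the layers below the auxiliary head directly during backpropagation. At inference time the auxiliary heads are typically discarded.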

Inception v2 Architecture

The Inception v2 architecture consists of several inception modules, each built from parallel branches of convolutional layers. A module combines 1x1 convolutions, 3x3 convolutions, a branch with a larger receptive field (the 5x5 convolution of Inception v1, factorized here into two stacked 3x3 convolutions), and a pooling layer. The outputs of these branches are then concatenated along the channel dimension and fed into the next module.
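The concatenation step can be sketched at the level of tensor shapes. The example below is only an illustration in plain Python: the branch names and channel counts are hypothetical, and `concat_channels` is a made-up helper, not part of any library.

```python
def concat_channels(branch_shapes):
    """Concatenate (channels, height, width) shapes along the channel axis."""
    heights = {h for _, h, _ in branch_shapes}
    widths = {w for _, _, w in branch_shapes}
    if len(heights) != 1 or len(widths) != 1:
        raise ValueError("all branches must produce the same spatial size")
    channels = sum(c for c, _, _ in branch_shapes)
    return (channels, heights.pop(), widths.pop())


# Hypothetical output shapes of the parallel branches of one module.
branches = {
    "1x1": (64, 28, 28),
    "3x3": (128, 28, 28),
    "5x5_equivalent": (32, 28, 28),   # e.g. two stacked 3x3 convolutions
    "pool_projection": (32, 28, 28),
}

module_out = concat_channels(list(branches.values()))
print(module_out)  # (256, 28, 28)
```

Every branch has to keep the same spatial size, which is why the convolutions inside a module are padded and the pooling branch uses stride 1. Only the channel counts add up.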

The architecture also includes two auxiliary classifiers, which are added to the 4a and 4d modules.

Here is a simplified illustration of the Inception v2 architecture:

              ------------
             | Inception |
   --------  | Module 1a |   --------
  | Input |  ------------   | Output |
   --------  | Inception |   --------
             | Module 2a |
              ------------
                  ...
              ------------
             | Inception |
             | Module 4e |
              ------------
                |  |  |
   ------------------------   ------------------------
  | Auxiliary Classifier 1 | | Auxiliary Classifier 2 |
   ------------------------   ------------------------

Results

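The paper reports that its best model (which the authors call Inception-v3) reaches 21.2% top-1 and 5.6% top-5 error on the ImageNet classification task with single-frame evaluation, and 3.5% top-5 error with an ensemble of four models and multi-crop evaluation. This is a significant improvement over the previous Inception v1 architecture, whose top-5 error rate was 6.67%.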

Implications

The Inception v2 architecture introduced in this paper has had a significant impact on the field of computer vision. Its design principles, such as factorization into smaller convolutions and the use of auxiliary classifiers, have been widely adopted in other architectures. Moreover, the Inception v2 architecture itself has been used as a base model in many computer vision tasks, including image classification, object detection, and semantic segmentation.

The Inception module design allows the model to capture both local features (through small convolutions) and more abstract features (through larger convolutions and pooling) at each layer. The dimensionality reduction steps help to control the computational complexity of the model.


hills

22:39

30.05.23