Monday, November 12, 2018

Weight Distribution of MobileNet V1

Deep learning is increasingly deployed on devices with limited computing resources or a limited power budget. For example, when deep learning runs on a smartphone, power consumption becomes a primary concern. To reduce power consumption and increase computational efficiency, it is preferable to convert a deep learning algorithm from floating point to fixed point. To do that, we first need to know the distribution of the algorithm's parameters. In this blog post, we pick a popular deep learning model, MobileNet V1 [1], and plot the distributions of its weights.

As a first step, let us check the architecture of the MobileNet V1 network:

import numpy as np
import matplotlib.pyplot as plt
import keras

base_model = keras.applications.mobilenet.MobileNet(weights='imagenet')

base_model.summary()

This Python code loads the MobileNet V1 model with weights trained on ImageNet. base_model.summary() prints a summary of the network. The full summary is long and we won't post all of it here, but the first few lines look like this:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0
_________________________________________________________________
conv1_pad (ZeroPadding2D)    (None, 225, 225, 3)       0
_________________________________________________________________
conv1 (Conv2D)               (None, 112, 112, 32)      864
_________________________________________________________________
conv1_bn (BatchNormalization (None, 112, 112, 32)      128
_________________________________________________________________
conv1_relu (ReLU)            (None, 112, 112, 32)      0
_________________________________________________________________
conv_dw_1 (DepthwiseConv2D)  (None, 112, 112, 32)      288
_________________________________________________________________
conv_dw_1_bn (BatchNormaliza (None, 112, 112, 32)      128
_________________________________________________________________
conv_dw_1_relu (ReLU)        (None, 112, 112, 32)      0
_________________________________________________________________
conv_pw_1 (Conv2D)           (None, 112, 112, 64)      2048
_________________________________________________________________
conv_pw_1_bn (BatchNormaliza (None, 112, 112, 64)      256
_________________________________________________________________
conv_pw_1_relu (ReLU)        (None, 112, 112, 64)      0
_________________________________________________________________


Each line of the summary has three parts: the name of the layer, the output shape of the layer, and the number of parameters. Take the conv1 layer as an example. "conv1 (Conv2D)" tells us that its name is conv1 and that it performs a 2D convolution. The output shape is "(None, 112, 112, 32)", which means the output has 32 channels, each of size 112x112. The number of parameters is 864, calculated as 3x3x3x32=864, where 3x3 is the kernel size, 3 is the number of input channels, and 32 is the number of output channels. The next line, conv1_bn, is a batch normalization layer. Batch normalization was proposed in [2] for faster convergence of neural network training. Its equation is as follows:

\(y=\gamma\frac{x-E(x)}{\sqrt{Var(x)+\epsilon}}+\beta\)

Therefore, there are 4 parameters for each channel of batch normalization: \(\gamma\), \(\beta\), \(E(x)\), and \(Var(x)\) (\(\epsilon\) is a constant, not a parameter). Since conv1_bn has 32 channels, it has 4x32=128 parameters.
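
As a quick sanity check (a small sketch, assuming the base_model loaded above), we can ask Keras for the parameter counts of these two layers directly:

# Sanity check of the parameter counts discussed above
conv1 = base_model.get_layer('conv1')
conv1_bn = base_model.get_layer('conv1_bn')
print(conv1.count_params())                         # 864 = 3*3*3*32
print(conv1_bn.count_params())                      # 128 = 4*32
print([w.shape for w in conv1_bn.get_weights()])    # four arrays of shape (32,)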

MobileNet V1 is famous for decomposing a standard 2D convolution into a depthwise convolution followed by a 2D convolution with a 1x1 kernel (a pointwise convolution) to reduce complexity. The conv_dw_1 and conv_pw_1 layers in the summary show this. The conv_dw_1 layer applies exactly one 3x3 kernel to each input channel, so the output shape is the same as the input shape. Since conv_dw_1 has 32 channels, the number of parameters is 3x3x32=288. Next, the conv_pw_1 layer is a 2D convolution with a 1x1 kernel, so it has 1x1x32 (input channels) x 64 (output channels) = 2048 parameters. If a conventional 3x3 convolution were used for the same input and output, the number of parameters would be 3x3x32x64, which is much larger than 3x3x32+1x1x32x64.
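
To make the comparison concrete, the counts can be worked out with a few lines of arithmetic (a small illustration, not part of the post's original code):

# Parameter counts for the first depthwise-separable block (32 input channels, 64 output channels)
k, c_in, c_out = 3, 32, 64
standard = k * k * c_in * c_out                    # full 3x3 convolution: 18432
separable = k * k * c_in + 1 * 1 * c_in * c_out    # depthwise + pointwise: 288 + 2048 = 2336
print(standard, separable)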

After going through the summary of the network, it is time to plot its weights. To get the weights, we can use

W = base_model.get_weights()

The output W is a list with multiple elements, each of which is a NumPy array. We can further print out the shape of each element:

for i in range(len(W)):
  print(W[i].shape)

Again, the printed output is long, but the first dozen or so lines are sufficient to show how it works:

(3, 3, 3, 32)
(32,)
(32,)
(32,)
(32,)
(3, 3, 32, 1)
(32,)
(32,)
(32,)
(32,)
(1, 1, 32, 64)
(64,)
(64,)
(64,)
(64,)

The first line is the weights of the conv1 layer, with 3x3x3x32 parameters. The next four lines are for the conv1_bn layer. After checking the Keras source code for how weights are added to the model, we are convinced that these four arrays are, in order, \(\gamma\), \(\beta\), \(E(x)\) and \(Var(x)\). The remaining lines of the printout can be mapped to the network layers in the same way.
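
One way to check this mapping without digging through the source code is to print each weight's name next to its shape (a short sketch, assuming the TensorFlow backend, where each weight variable carries a name that follows the layer names in the summary):

# Print the name and shape of the first fifteen weight arrays;
# the names make the gamma/beta/mean/variance ordering explicit.
for w in base_model.weights[:15]:
    print(w.name, w.shape)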

We plot the weight distributions layer by layer. For instance, the distribution of the conv1 layer weights (the first element of W) can be plotted with:

plt.hist(W[0].flatten())
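
To produce one histogram per convolution layer, as in the figure below, one approach (a sketch, assuming a Keras version that exposes DepthwiseConv2D under keras.layers, as used by the MobileNet implementation) is to walk the model's layers and take each convolution kernel:

# Plot the kernel weight distribution of every convolution layer
conv_layers = [l for l in base_model.layers
               if isinstance(l, (keras.layers.Conv2D, keras.layers.DepthwiseConv2D))]
for l in conv_layers:
    kernel = l.get_weights()[0]   # the convolution (or depthwise) kernel
    plt.figure()
    plt.hist(kernel.flatten(), bins=50)
    plt.title(l.name)
plt.show()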

The figure below shows the weight distributions of the convolution layers of the MobileNet V1 model trained on ImageNet data:

Distributions of convolution weights

The plots show that the weight distributions are mostly symmetric. The dynamic range varies from [-0.5, 0.5] in conv_pw_13 to [-30, 25] in conv_dw_1.

Next, we plot the distributions of the weights used in the batch normalization layers. The batch normalization formula can be rewritten as

\(y=\gamma\frac{x}{\sqrt{Var(x)+\epsilon}}+\beta-\gamma\frac{E(x)}{\sqrt{Var(x)+\epsilon}}\)

We call \(\frac{\gamma}{\sqrt{Var(x)+\epsilon}}\) the scale and \(\beta-\gamma\frac{E(x)}{\sqrt{Var(x)+\epsilon}}\) the bias.
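
As a sketch of how these quantities can be computed from the stored batch normalization parameters (using the conv1_bn layer as an example and the layer's own epsilon):

# Effective scale and bias of a batch normalization layer
bn = base_model.get_layer('conv1_bn')
gamma, beta, mean, var = bn.get_weights()
scale = gamma / np.sqrt(var + bn.epsilon)
bias = beta - gamma * mean / np.sqrt(var + bn.epsilon)
plt.hist(scale, bins=20)
plt.show()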


Distribution of batch normalization scales

Distribution of batch normalization bias

It is observed that the batch normalization scales are mostly positive. The dynamic range of the bias distributions varies from [-2.5, 2.5] to [-30, 25], which appears to be no larger than the dynamic range of the convolution weight distributions.

Source code for this analysis can be found here.



[1] Andrew G. Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications"
[2] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift"
