Tuesday, November 27, 2018

A PyTorch implementation of Image Segmentation Using UNet, Stratification and K-Fold Learning

This PyTorch script is the by-product of attending a Kaggle competition for image segmentation (https://www.kaggle.com/c/tgs-salt-identification-challenge). In this competition, the original images come from a geological survey. The area covered by each original image can be divided into subareas with salt under the surface and subareas without salt under the surface (see the original mask image). The goal of the competition is to segment the test images into binary masks in which white means a salt area and black means a non-salt area.


Usually I write deep learning scripts using Keras. In this case, however, we chose PyTorch for pragmatic reasons. It is well known that UNet [1] performs well on segmentation tasks. The encoder part of a UNet is typically based on a well-known neural network model such as VGG or ResNet. Compared with Keras, PyTorch provides more options for pre-trained models. For instance, a pre-trained ResNet34 is available in PyTorch but not in Keras. In order to capture the benefit of transfer learning, PyTorch was chosen over Keras for this implementation.
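
For instance, loading a pre-trained ResNet34 in PyTorch takes a single line through torchvision (shown here just to illustrate the point):

import torchvision.models as models

# ResNet34 with ImageNet pre-trained weights
resnet34 = models.resnet34(pretrained=True)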

For performance enhancement, when dividing the training data into a training set and a validation set, stratification is used to ensure that images with various salt coverage percentages are all well represented. In the code below, all training images are divided into 10 categories based on their salt coverage ratio, from low to high.


# Stratification: bin the training images by the amount of salt in their masks,
# then split each bin into training and validation data.
import numpy as np

ind = np.arange(len(Y_train_shaped))
np.random.shuffle(ind)

# Salt coverage (number of salt pixels) of each shuffled mask
coverage = []
for i in range(len(Y_train_shaped)):
    coverage.append(np.sum(Y_train_shaped[ind[i]]))

# np.histogram with the default bins=10 gives the 10 coverage categories
hist, bin_edges = np.histogram(coverage)
# np.digitize returns index i such that bins[i-1] <= x < bins[i].
# Increase the last bin edge by 1 to avoid generating an extra category for the maximum value.
bin_edges[-1] = bin_edges[-1] + 1
cindex = np.digitize(coverage, bin_edges)

In the next step, the training images are divided into a training set and a validation set with an 8:2 ratio, and this 8:2 split is applied within each of the 10 salt-percentage-based categories. This is called stratification. In addition, we use 5-fold cross-validation: all training images are equally divided into 5 sets, numbered 0 to 4. Model 0 is then trained with set 0 as validation and sets 1/2/3/4 as training; model 1 is trained with set 1 as validation and sets 0/2/3/4 as training; and so on. When all folds are finished, we have 5 trained models, and the final test result can be an ensemble of the outputs of these 5 models. The benefit of K-fold learning is that all data is fully utilized and the ensemble improves performance; the drawback is higher computational complexity.


val_size = 2 / 10  # 20% of each category goes to validation in every fold

for k in range(5):  # 5-fold learning
    print('Training for fold ' + str(k) + ' of 5 starts!')
    train_idxs = []
    val_idxs = []
    for i in range(10):
        # Indices of all images that fall into coverage category i+1
        index_temp = ind[cindex == i + 1]
        list_temp = index_temp.tolist()
        val_samples = round(len(index_temp) * val_size)
        if k == 0:
            val_idxs = val_idxs + list_temp[:val_samples]
            train_idxs = train_idxs + list_temp[val_samples:]
        elif k == 4:
            # The last fold takes whatever remains so rounding does not drop samples
            val_idxs = val_idxs + list_temp[4 * val_samples:]
            train_idxs = train_idxs + list_temp[:4 * val_samples]
        else:
            val_idxs = val_idxs + list_temp[k * val_samples:(k + 1) * val_samples]
            train_idxs = train_idxs + list_temp[:k * val_samples] + list_temp[(k + 1) * val_samples:]

In this implementation, we use AlbuNet [2], which is a variation of UNet.


    # AlbuNet: UNet with a ResNet34 encoder pre-trained on ImageNet
    model = AlbuNet(pretrained=True, is_deconv=True)
    model.cuda()

    # Binary segmentation: binary cross-entropy on logits, optimized with Adam
    criterion = nn.BCEWithLogitsLoss()
    learning_rate = 1e-3
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
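
For reference, a per-fold training loop can look like the sketch below. This is a minimal illustration rather than the exact loop used in the original script; train_loader, val_loader and num_epochs are assumed to be defined elsewhere (for instance, DataLoaders built from train_idxs and val_idxs).

    # Minimal per-fold training loop (sketch; the names above are assumptions)
    for epoch in range(num_epochs):
        model.train()
        for images, masks in train_loader:
            images, masks = images.cuda(), masks.cuda()
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()

        # Validation pass
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for images, masks in val_loader:
                images, masks = images.cuda(), masks.cuda()
                val_loss += criterion(model(images), masks).item()
        print('Fold %d, epoch %d: validation loss %.4f' % (k, epoch, val_loss / len(val_loader)))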

AlbuNet is a UNet that uses ResNet34 as its encoder. Its topology is shown below:

[Figure: AlbuNet network topology]

The full source code of this implementation can be found here. You can run the script after downloading the dataset from Kaggle.


[1] Olaf Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation
[2] Vladimir Iglovikov and Alexey Shvets, TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation


Monday, November 12, 2018

Weight Distribution of MobileNet V1

Deep learning is increasingly used in devices with limited computing resources or limited power. For example, when deep learning is used on a smartphone, power consumption becomes a primary concern. In order to reduce power consumption and increase computational efficiency, it is preferable to convert deep learning algorithms from floating point to fixed point. To convert an implementation from floating point to fixed point, we first need to know the distribution of the parameters of the algorithm. In this blog, we choose a popular deep learning algorithm, MobileNet V1 [1], and plot the distributions of its weights.

As the first step, let us check the architecture of MobileNet V1 network:

import numpy as np
import matplotlib.pyplot as plt
import keras

# Load MobileNet V1 with weights trained on ImageNet
base_model = keras.applications.mobilenet.MobileNet(weights='imagenet')

base_model.summary()

This Python code loads the MobileNet V1 model with weights trained on ImageNet. base_model.summary() lists a summary of the network. The whole summary is long and we won't post it here, but the first few lines look like this:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0
_________________________________________________________________
conv1_pad (ZeroPadding2D)    (None, 225, 225, 3)       0
_________________________________________________________________
conv1 (Conv2D)               (None, 112, 112, 32)      864
_________________________________________________________________
conv1_bn (BatchNormalization (None, 112, 112, 32)      128
_________________________________________________________________
conv1_relu (ReLU)            (None, 112, 112, 32)      0
_________________________________________________________________
conv_dw_1 (DepthwiseConv2D)  (None, 112, 112, 32)      288
_________________________________________________________________
conv_dw_1_bn (BatchNormaliza (None, 112, 112, 32)      128
_________________________________________________________________
conv_dw_1_relu (ReLU)        (None, 112, 112, 32)      0
_________________________________________________________________
conv_pw_1 (Conv2D)           (None, 112, 112, 64)      2048
_________________________________________________________________
conv_pw_1_bn (BatchNormaliza (None, 112, 112, 64)      256
_________________________________________________________________
conv_pw_1_relu (ReLU)        (None, 112, 112, 64)      0
_________________________________________________________________


Each line of the summary includes three parts: the name of the layer, the output shape of the layer, and the number of parameters. Take the conv1 layer as an example. "conv1 (Conv2D)" tells us that its name is conv1 and that it performs a 2D convolution. The output shape is "(None, 112, 112, 32)", which means the output has 32 channels with an image size of 112x112 in each channel. The number of parameters is 864, calculated as 3x3x3x32=864, where 3x3 is the kernel size, 3 is the number of input channels, and 32 is the number of output channels. The next line, conv1_bn, is for batch normalization. Batch normalization was proposed in [2] for faster convergence of neural network training. The equation of batch normalization is as below

\(y=\gamma\frac{x-E(x)}{\sqrt{Var(x)+\epsilon}}+\beta\)

Therefore, there are 4 parameters for each channel of batch normalization: \(\gamma\), \(\beta\), \(E(x)\), \(Var(x)\) (\(\epsilon\) is a constant, not a parameter). Since conv1_bn has 32 channels, it has 128 parameters.

MobileNet V1 is famous for decomposing a normal 2D convolution into a depthwise convolution plus a 2D convolution with a 1x1 kernel for reduced complexity. The conv_dw_1 and conv_pw_1 layers in the summary show this. The conv_dw_1 layer applies one and only one 3x3 kernel to each input channel, so the output shape is the same as the input shape. Since conv_dw_1 has 32 channels, the number of parameters is 32x3x3=288. Next, the conv_pw_1 layer is a 2D convolution with a 1x1 kernel, so it has 1x1x32 (input channels) x 64 (output channels) = 2048 parameters. If a conventional 3x3 convolution were used for the same input and output, the number of parameters would be 3x3x32x64=18432, which is much larger than 3x3x32+1x1x32x64=2336.
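
To make this parameter arithmetic concrete, the short snippet below recomputes the counts discussed above (an illustration only; the numbers follow directly from the formulas in the text):

# Parameter counts of the first layers of MobileNet V1
conv1 = 3 * 3 * 3 * 32            # standard 3x3 conv, 3 input channels, 32 output channels -> 864
conv1_bn = 4 * 32                 # gamma, beta, E(x), Var(x) per channel -> 128
conv_dw_1 = 3 * 3 * 32            # one 3x3 kernel per input channel -> 288
conv_pw_1 = 1 * 1 * 32 * 64       # 1x1 conv mixing 32 channels into 64 -> 2048
standard = 3 * 3 * 32 * 64        # equivalent conventional 3x3 conv -> 18432

print(conv1, conv1_bn, conv_dw_1, conv_pw_1)
print(conv_dw_1 + conv_pw_1, 'vs', standard)   # 2336 vs 18432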

After going through the network summary, it is time to plot the weights. To get the weights, we can use

W = base_model.get_weights()

The output W is a list with multiple elements. Each element is a numpy array. We can further print out the shape of each element.

for i in range(len(W)):
  print(W[i].shape)

Again, the printed output is long, but listing the first several lines is sufficient to show how it works:
(3, 3, 3, 32)
(32,)
(32,)
(32,)
(32,)
(3, 3, 32, 1)
(32,)
(32,)
(32,)
(32,)
(1, 1, 32, 64)
(64,)
(64,)
(64,)
(64,)

The first line is for the weights of the conv1 layer with 3x3x3x32 parameters. The next four lines are for the conv1_bn layer. After checking the Keras source code for how weights are added to the model, we are convinced that these four arrays are, in order, \(\gamma\), \(\beta\), \(E(x)\) and \(Var(x)\). The remaining lines of the printout can be mapped to the network layers in the same way.
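
To map each array back to its layer by name rather than by position, one can also query the layers directly (a small sketch using the standard Keras layer API):

# Print the weight shapes layer by layer, tagged with the layer name
for layer in base_model.layers:
    shapes = [w.shape for w in layer.get_weights()]
    if shapes:
        print(layer.name, shapes)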

When plotting the distribution of the weights, we plot them layer by layer. For instance, we can plot the distribution of the conv1 layer weights (the first element of W) by using:

plt.hist(W[0].flatten())
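
To generate similar histograms for all convolution layers at once, one possible loop is sketched below (an illustration, assuming the standard Keras layer classes):

# Plot a weight histogram for every convolution layer
for layer in base_model.layers:
    if isinstance(layer, (keras.layers.Conv2D, keras.layers.DepthwiseConv2D)):
        plt.figure()
        plt.hist(layer.get_weights()[0].flatten(), bins=50)
        plt.title(layer.name)
plt.show()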

The figure below shows the weight distributions of the convolution layers of the MobileNet V1 model trained with ImageNet data:

Distributions of convolution weights


The plots show that the weight distributions are mostly symmetric. The dynamic range varies from [-0.5, 0.5] in conv_pw_13 to [-30, 25] in conv_dw_1.

Next we plot the distributions of the weights used in the batch normalization layers. The batch normalization formula can be rewritten as

\(y=\gamma\frac{x}{\sqrt{Var(x)+\epsilon}}+\beta-\gamma\frac{E(x)}{\sqrt{Var(x)+\epsilon}}\)

We call \(\frac{\gamma}{\sqrt{Var(x)+\epsilon}}\) the scale and \(\beta-\gamma\frac{E(x)}{\sqrt{Var(x)+\epsilon}}\) the bias.
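
For example, the scale and bias of the conv1_bn layer can be computed from the weight list W obtained earlier (a sketch; the epsilon value is an assumption based on the Keras default):

# conv1_bn weights: gamma, beta, moving mean, moving variance (W[1]..W[4])
gamma, beta, mean, var = W[1], W[2], W[3], W[4]
eps = 1e-3  # assumed epsilon of the BatchNormalization layers

scale = gamma / np.sqrt(var + eps)
bias = beta - gamma * mean / np.sqrt(var + eps)

plt.hist(scale)
plt.hist(bias)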


Distribution of batch normalization scales

Distribution of batch normalization bias

It is observed that the batch normalization scales are mostly positive. The dynamic range of the bias distributions varies from [-2.5, 2.5] to [-30, 25], which seems to be smaller than the dynamic range of the convolution-layer weight distributions.

Source code for this analysis can be found here.



[1] Andrew G. Howard et al., MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
[2] S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift