Thursday, December 20, 2018

Signal Processing Magic (4) -- DFT Analysis For A Tone

A tone is the most basic signal, yet it is quite useful. Because of its simplicity, a tone is often used to calibrate a system, so it is important to know how to analyze one. In signal processing, spectrum analysis is an important tool, and the spectrum is usually obtained through the DFT.



For a tone with duration T, the time-domain waveform looks like the plot below. As a digitized signal, it consists of many samples; as a complex waveform, it has both real and imaginary parts (blue curve for the real part, red curve for the imaginary part). With sampling period Ts and N samples in total, T=N*Ts. Assuming Ts=1us and N=10000, T=10ms.
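As a minimal sketch (not from the original post), such a complex tone can be generated in Python; the tone frequency of 1 kHz is an assumption for illustration:

import numpy as np

Ts = 1e-6                              # sampling period: 1 us
N = 10000                              # number of samples, so T = N*Ts = 10 ms
f0 = 1e3                               # assumed tone frequency: 1 kHz
t = np.arange(N) * Ts
x = np.exp(1j * 2 * np.pi * f0 * t)    # complex tone: blue curve = x.real, red curve = x.imag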




















In the spectrum of a tone, there are a few important features to note. The first is that the delta between the amplitude of the main lobe and the amplitude of the side lobe next to it is 13.3dB. The second is that the width of the main lobe is 2/T; in this case, 2/T=200Hz. These features can be used for tone evaluation. For example, if the amplitude delta between the main lobe and the first side lobe is larger than 13.3dB, this often means a windowing function has been applied to the tone, since windowing increases the amplitude delta.
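Continuing the sketch above, the spectrum can be computed with an FFT and zoomed in around the tone to check both features:

import matplotlib.pyplot as plt

X = np.fft.fftshift(np.fft.fft(x))
freq = np.fft.fftshift(np.fft.fftfreq(N, d=Ts))
mag_db = 20 * np.log10(np.abs(X) / np.max(np.abs(X)))

plt.plot(freq, mag_db)
plt.xlim(f0 - 1e3, f0 + 1e3)           # zoom in to see the 200 Hz main lobe and the side lobes
plt.xlabel('Frequency (Hz)')
plt.ylabel('Amplitude (dB)')
plt.show()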





















Both features can be explained by the relationship between a rectangular waveform and the sinc function. A tone with limited duration can be seen as a rectangular window imposed on a tone of infinite duration. A tone of infinite duration shows up in the spectrum as an impulse. For a tone with limited duration, we instead see a spectrum like the one above, and this comes from the Fourier transform of the rectangular window in the time domain. From signal processing 101, we know that a rectangular window in the time domain transforms into a sinc function in the frequency domain, where sinc(x) = sin(x)/x. sinc(0) = 1 corresponds to the maximum amplitude of the main lobe (point A in the plot below). Point B corresponds to the maximum amplitude of the first side lobe in the spectrum. At point B, the derivative of sinc(x) is 0. By numerical methods, the x value at point B can be found as 4.4934 (note that the x-axis of the sinc plot is scaled by Pi). From this we find that the amplitude delta between point A and point B is approximately 13.3dB.
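This 13.3dB figure can be checked numerically with a few lines of Python (a sketch using scipy to locate the first side-lobe peak):

import numpy as np
from scipy.optimize import minimize_scalar

def sinc(x):
    return np.sin(x) / x

# The first side lobe of |sinc(x)| lies between the first two zeros, pi and 2*pi
res = minimize_scalar(lambda x: -np.abs(sinc(x)), bounds=(np.pi, 2 * np.pi), method='bounded')
print(res.x)                                 # ~4.4934
print(-20 * np.log10(np.abs(sinc(res.x))))   # ~13.3 dB below the main lobe peak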






Tuesday, November 27, 2018

A PyTorch implementation of Image Segmentation Using UNet, Stratification and K-Fold Learning

This PyTorch script is the by-product of attending a Kaggle competition for image segmentation (https://www.kaggle.com/c/tgs-salt-identification-challenge). In this competition, the original images come from a geological survey. The whole area of an original image can be divided into subareas with salt under the surface and subareas without salt under the surface (see the original mask image). The competition goal is to segment test images into binary masks in which white means salt area and black means non-salt area.


Usually I write deep learning scripts using Keras. However, in this case, we chose to use PyTorch for pragmatic reasons. It is well known that UNet [1] provides good performance for segmentation tasks. Part of the UNet is based on well-known neural network models such as VGG or Resnet. Compared with Keras, PyTorch seems to provide more options for pre-trained models. For instance, a pre-trained model for Resnet34 is available in PyTorch but not in Keras. In order to capture the benefit of transfer learning, PyTorch was chosen over Keras for the implementation.

For performance enhancement, when dividing the training data into a training set and a validation set, stratification is used to ensure that images with various salt coverage percentages are all well represented. In the code below, all training images are divided into 10 categories based on their salt coverage ratio, from low to high.


import numpy as np

# Stratification: data binning based on salt size in mask. Divide each category into training and validation data
ind = np.arange(len(Y_train_shaped))
np.random.shuffle(ind)
coverage = []
for i in range(0, len(Y_train_shaped)):
  coverage.append(np.sum(Y_train_shaped[ind[i]]))  # total salt pixels per (shuffled) mask

# np.histogram defaults to 10 bins, which gives the 10 salt-coverage categories
hist, bin_edges = np.histogram(coverage)
# In np.digitize, each index i returned is such that bins[i-1] <= x < bins[i]
# Need to increase the last bin_edges by 1 to avoid generating a new category with digitize
bin_edges[len(bin_edges)-1] = bin_edges[len(bin_edges)-1] + 1
cindex = np.digitize(coverage, bin_edges)

In the next step, when the training images are divided into training and validation sets with an 8:2 ratio, they are divided 8:2 within each of these 10 salt-percentage-based categories. This is called stratification. In addition, we use 5-fold cross validation. That means all training images are divided equally into 5 sets: 0/1/2/3/4. Then model 0 is trained with set 0 as validation and sets 1/2/3/4 as training; model 1 is trained with set 1 as validation and sets 0/2/3/4 as training; and so on. When all folds are finished, we have 5 trained models, and the final test result can be an ensemble of the outputs of these 5 models. The benefits of K-fold learning are that all data is fully utilized and that the ensemble improves performance; the drawback is higher computational complexity.


val_size = 2/10
for ii in range(5): #5-fold learning
    k = ii
    print('Training for '+str(k)+' of 5 fold starts!')
    train_idxs = []
    val_idxs = []
    for i in range(0,10):
      index_temp = ind[cindex==i+1]
      list_temp = index_temp.T.tolist()
      val_samples = round(len(index_temp)*val_size)
      if (k == 0):
          val_idxs = val_idxs + list_temp[:val_samples]
          train_idxs = train_idxs + list_temp[val_samples:]
      elif (k == 4):
          val_idxs = val_idxs + list_temp[4*val_samples:]
          train_idxs = train_idxs + list_temp[:4*val_samples]
      else:
          val_idxs = val_idxs + list_temp[k*val_samples:(k+1)*val_samples]
          train_idxs = train_idxs + list_temp[:k*val_samples] + list_temp[(k+1)*val_samples:]

In this implementation, we use AlbuNet [2], which is a variant of UNet.


    model = AlbuNet(pretrained=True, is_deconv=True)
    model.cuda()

    criterion = nn.BCEWithLogitsLoss()
    learning_rate = 1e-3
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
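For completeness, below is a minimal sketch of one training epoch with this setup, assuming a DataLoader named train_loader that yields (image, mask) batches; the loader name is hypothetical and not part of the original script.

model.train()
for images, masks in train_loader:
    images, masks = images.cuda(), masks.cuda()
    optimizer.zero_grad()
    logits = model(images)            # raw logits; the sigmoid is applied inside the loss
    loss = criterion(logits, masks)   # nn.BCEWithLogitsLoss defined above
    loss.backward()
    optimizer.step()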

AlbuNet is UNet based on Resnet34. Its topology is shown below:



















The full source code of this implementation can be found here. You can run the script after downloading the dataset from Kaggle.


[1] Olaf Ronneberger et al, U-Net: Convolutional Networks for Biomedical Image Segmentation
[2] Vladimir Iglovikov and Alexey Shvets, TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation


Monday, November 12, 2018

Weight Distribution of MobileNet V1

Deep learning is increasingly used in devices with limited computing resources or limited power. For example, when deep learning is used in a smartphone, power consumption becomes a primary concern. In order to reduce power consumption and increase computation efficiency, it is preferred to convert a deep learning algorithm from floating point to fixed point. To convert an implementation from floating point to fixed point, we first need to know the distribution of the algorithm's parameters. In this blog, we choose a popular deep learning algorithm, MobileNet V1 [1], and plot the distributions of its weights.

As the first step, let us check the architecture of MobileNet V1 network:

import numpy as np
import matplotlib.pyplot as plt
import keras

base_model = keras.applications.mobilenet.MobileNet(weights='imagenet');

base_model.summary()

What this Python code does is load the MobileNet V1 model with weights trained on ImageNet. base_model.summary() lists the summary of the network. The whole summary is long and we won't post it here, but the first few lines look like this:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0
_________________________________________________________________
conv1_pad (ZeroPadding2D)    (None, 225, 225, 3)       0
_________________________________________________________________
conv1 (Conv2D)               (None, 112, 112, 32)      864
_________________________________________________________________
conv1_bn (BatchNormalization (None, 112, 112, 32)      128
_________________________________________________________________
conv1_relu (ReLU)            (None, 112, 112, 32)      0
_________________________________________________________________
conv_dw_1 (DepthwiseConv2D)  (None, 112, 112, 32)      288
_________________________________________________________________
conv_dw_1_bn (BatchNormaliza (None, 112, 112, 32)      128
_________________________________________________________________
conv_dw_1_relu (ReLU)        (None, 112, 112, 32)      0
_________________________________________________________________
conv_pw_1 (Conv2D)           (None, 112, 112, 64)      2048
_________________________________________________________________
conv_pw_1_bn (BatchNormaliza (None, 112, 112, 64)      256
_________________________________________________________________
conv_pw_1_relu (ReLU)        (None, 112, 112, 64)      0
_________________________________________________________________


Each line of the summary includes three parts: the name of the layer, the output shape of the layer, and the number of parameters. Take the conv1 layer as an example. "conv1 (Conv2D)" tells us that its name is conv1 and that it performs 2D convolution. The output shape is "(None, 112, 112, 32)", which means the output has 32 channels with image size 112x112 in each channel. The number of parameters is 864, calculated as 3x3x3x32=864, where 3x3 is the kernel size, 3 is the number of input channels, and 32 is the number of output channels. The next line, conv1_bn, is for batch normalization. Batch normalization was proposed in [2] for faster convergence of neural network training. The equation of batch normalization is as below

\(y=\gamma\frac{x-E(x)}{\sqrt{Var(x)+\epsilon}}+\beta\)

Therefore, there are 4 parameters for each channel of batch normalization: \(\gamma\), \(\beta\), \(E(x)\), \(Var(x)\) (\(\epsilon\) is a constant, not a parameter). Since conv1_bn has 32 channels, it has 4x32=128 parameters.

MobileNet V1 is famous for decomposing a normal 2D convolution into a depthwise convolution plus a 2D convolution with a 1x1 kernel for reduced complexity. The conv_dw_1 and conv_pw_1 layers in the summary show this. The conv_dw_1 layer applies exactly one 3x3 kernel to each input channel, and the resulting output shape is the same as the input shape. Since conv_dw_1 has 32 channels, the number of parameters is 32x3x3=288. Next, the conv_pw_1 layer is a 2D convolution with a 1x1 kernel, so it has 1x1x32 (input channels) x 64 (output channels) = 2048 parameters. If a conventional 3x3 convolution were used for the same input and output, the number of parameters would be 3x3x32x64, which is much larger than 3x3x32 + 1x1x32x64.
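A quick back-of-the-envelope check of these counts (3x3 kernel, 32 input channels, 64 output channels):

k, c_in, c_out = 3, 32, 64
depthwise = k * k * c_in                   # conv_dw_1: 288
pointwise = 1 * 1 * c_in * c_out           # conv_pw_1: 2048
full_conv = k * k * c_in * c_out           # standard 3x3 convolution: 18432
print(depthwise + pointwise, full_conv)    # 2336 vs 18432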

After going through the summary of the network, now it is time to plot its weights. To get the weights, we can use

W = base_model.get_weights();

The output W is a list with multiple elements. Each element is a numpy array. We can further print out the shape of each element.

for i in range(len(W)):
  print(W[i].shape)

Again, the printed output is long, but listing the first dozen lines is sufficient to show how it works.
(3, 3, 3, 32)
(32,)
(32,)
(32,)
(32,)
(3, 3, 32, 1)
(32,)
(32,)
(32,)
(32,)
(1, 1, 32, 64)
(64,)
(64,)
(64,)
(64,)

The first line is for the weights of the conv1 layer with 3x3x3x32 parameters. The next four lines are for the conv1_bn layer. After checking the source code for how weights are added to the model, we are convinced that these four lines correspond, in order, to \(\gamma\), \(\beta\), \(E(x)\) and \(Var(x)\). The remaining lines of the printout can be mapped to the network layers in the same way.

When plotting the distribution of weights, we plot them layer by layer. For instance, we can plot the distribution of the conv1 layer weights (the first element of W) by using:

plt.hist(W[0].flatten())

The figure below shows the weight distributions of the convolution layers of the MobileNet V1 model trained with ImageNet data:

Distributions of  convolution weights


The plot shows that the weight distributions are mostly symmetric. The dynamic range varies from [-0.5, 0.5] in conv_pw_13 to [-30, 25] in conv_dw_1.

Next we plot the distribution of weights used in batch normalization layers. The formula of batch normalization can be simplified to

\(y=\gamma\frac{x}{\sqrt{Var(x)+\epsilon}}+\beta-\gamma\frac{E(x)}{\sqrt{Var(x)+\epsilon}}\)

We call \(\frac{\gamma}{\sqrt{Var(x)+\epsilon}}\) scale and call \(\beta-\gamma\frac{E(x)}{\sqrt{Var(x)+\epsilon}}\) bias.
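As an illustration, here is a minimal sketch of how the scale and bias for conv1_bn could be computed from W and plotted, assuming the four batch-norm arrays sit at indices 1-4 in the order discussed above; the epsilon value is an assumption, not the exact constant of the trained model.

import numpy as np
import matplotlib.pyplot as plt

eps = 1e-3                                   # assumed epsilon for illustration
gamma, beta, mean, var = W[1], W[2], W[3], W[4]
scale = gamma / np.sqrt(var + eps)
bias = beta - gamma * mean / np.sqrt(var + eps)

plt.subplot(1, 2, 1); plt.hist(scale); plt.title('conv1_bn scale')
plt.subplot(1, 2, 2); plt.hist(bias); plt.title('conv1_bn bias')
plt.show()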


Distribution of batch normalization scales






































Distribution of batch normalization bias





































It is observed that the batch normalization scales are mostly positive. The dynamic range of the bias distributions varies from [-2.5, 2.5] to [-30, 25], which seems to be smaller than the dynamic range of the convolution weight distributions.

Source code for this analysis can be found here.



[1] Andrew G. Howard et al., MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
[2] S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Thursday, September 6, 2018

Is Python Image resize() Function Reversible?

During image processing tasks, we frequently need to change the image size. For instance, the original image size may be 100x100 but a certain neural network needs an input size of at least 255x255. That requires us to interpolate, which increases the image size. On the other hand, we might get some high-definition images with sizes much larger than 255x255. For these images, we perform image decimation to reduce the image size. For both image interpolation and decimation, we can use the resize() function defined in the skimage.transform Python module.

Some tasks require first changing the image size from 100x100 to 255x255, processing the image, and then changing it back to 100x100. These kinds of tasks motivate this article. The question we ask is this: if we change an image of size x0-by-y0 to size x1-by-y1 and then change it back to the original size, will the newly generated image be the same as the original image? If they are the same, we call the resize() function reversible; if not, we call it not reversible. If it is reversible, the whole image conversion process does not add any additional noise, which in general is a good thing to have.

First let us see what is offered in the skimage resize() function. Note that the manual uses the word "interpolation" for all image size changes, no matter whether the image becomes larger or smaller. According to the manual, six interpolation methods are provided in the function, controlled by a parameter named "order":


The order of interpolation. The order has to be in the range 0-5:
  • 0: Nearest-neighbor
  • 1: Bi-linear (default)
  • 2: Bi-quadratic
  • 3: Bi-cubic
  • 4: Bi-quartic
  • 5: Bi-quintic

Among these methods, nearest-neighbor, bi-linear and bi-cubic are probably the most well-known approaches, in order of increasing complexity. For each pixel in the new image, the nearest-neighbor method identifies the pixel in the old image closest to the new pixel and then makes the new pixel identical to that old pixel. The nearest-neighbor method only needs one pixel in the old image for interpolation. By comparison, the bi-linear method needs four pixels in the old image. Bi-linear interpolation can be explained by the figure below, borrowed from Wikipedia. To generate the new pixel P, bi-linear interpolation uses four points in the original image (Q11, Q12, Q21, Q22). Linear interpolation is performed first in the x dimension and then in the y dimension.


Bi-cubic interpolation is more complicated. It uses a polynomial with 16 coefficients; to determine these coefficients, it needs 16 points from the original image as inputs. The bi-cubic equation is as below:
\[f(x,y)=\begin{bmatrix} x^{3}&x^{2}&x&1\\  \end{bmatrix}\begin{bmatrix} a_{3,3}&a_{3,2}&a_{3,1}&a_{3,0}\\ a_{2,3}&a_{2,2}&a_{2,1}&a_{2,0}\\a_{1,3}&a_{1,2}&a_{1,1}&a_{1,0}\\ a_{0,3}&a_{0,2}&a_{0,1}&a_{0,0}\\ \end{bmatrix}\begin{bmatrix} y^{3}\\ y^{2}\\y\\1 \end{bmatrix}\]

In our experiments, we take two pictures, one for the image and another for the mask. Image and mask are a common setting for image segmentation tasks. These two original pictures are both of size 101x101. We resize both to 128x128 and then convert them back to the original size of 101x101. The evaluation metric is sum(abs(original image - output image after the two resize operations)). When this sum equals zero, we call the image resize operation reversible. We evaluate three interpolation approaches: nearest-neighbor, bi-linear, and bi-cubic.
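As a minimal sketch of this experiment (assuming img is a 101x101 numpy array loaded elsewhere; the keyword settings are illustrative, not the exact ones in the original script), the round-trip error can be computed like this:

import numpy as np
from skimage.transform import resize

def round_trip_error(img, order):
    # Resize up to 128x128 and back to the original size with the same interpolation order
    up = resize(img, (128, 128), order=order, mode='constant',
                preserve_range=True, anti_aliasing=False)
    back = resize(up, img.shape, order=order, mode='constant',
                  preserve_range=True, anti_aliasing=False)
    return np.sum(np.abs(img - back))

for order, name in [(0, 'nearest-neighbor'), (1, 'bi-linear'), (3, 'bi-cubic')]:
    print(name, round_trip_error(img.astype(float), order))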

Original: the left side is the original image while the right side is the mask. The mask is binary; each pixel of the mask is either 0 or 1.




















Nearest-neighbor: what is shown here is the delta between the original and the output of the two resize operations using nearest-neighbor interpolation. For the nearest-neighbor method, the delta is zero for both image and mask. The reason for the zero delta is this: the nearest-neighbor method is bi-directional. When a new pixel takes the old pixel nearest to it, then when resizing back, that old pixel finds the new pixel as its nearest neighbor as well. That means a resize operation using the nearest-neighbor method is reversible.




















Bi-linear: bi-linear interpolation is not reversible. As the diagram below shows, the delta between the original and the resize output is not all zero. Since the mask is binary, the delta for the mask on the right is more significant than for the image on the left. For the picture on the left, non-zero points show up at the image boundary and are also scattered sporadically over the whole image. For the picture on the right, non-zero points show up at the boundary and also along the border between the all-zeros area and the all-ones area of the original mask image.




















Bi-cubic: bi-cubic has the same issue as bi-linear. However, based on the metric of the sum of squared error, bi-cubic generates a smaller delta than bi-linear. For the image on the left, the sum of squared error is 3.14 for bi-linear versus 0.32 for bi-cubic; for the mask on the right, the sum of squared error is 13.14 for bi-linear versus 4.78 for bi-cubic.



What this study tells us is that, from the perspective of reducing the additional noise caused by resizing forward and backward, the best choice is nearest-neighbor interpolation since it does not introduce any additional noise. Other than that, bi-cubic is also a good choice although the computational complexity is higher. Note that when applying the resize() function, bi-linear is the default. The sample script used in this study can be found here.

Wednesday, September 5, 2018

My $1000 Deep Learning Box

For machine learning enthusiasts, it is always good to have your own box to do deep learning number crunching. The easiest way is probably to purchase a desktop from Dell, HP or another vendor, and then add a GPU on top. I considered this option but eventually gave it up. One reason for abandoning this idea is that it is hard to customize a commercially available desktop. For instance, if you want a 750W power supply to ensure the GPU won't be short of power, or need a motherboard which can support two GPUs, it is not easy to find such a desktop in the market at an acceptable price. There are other reasons as well. I want to use a Linux computer dedicated to machine learning, but most of the computers available in the market come with Windows. Therefore, after some contemplating, I decided to assemble my own computer, something I had not done before.

Hardware

After searching the Internet, I found lots of useful information. In particular, I took good reference from the computer component list provided in this blog, but I also made some modifications on top of Yanda's list. Here is the list of components I am using:

1. EVGA GeForce GTX 1060 SC GAMING, ACX 2.0 (Single Fan), 6GB GDDR5, DX12 OSD Support (PXOC), 06G-P4-6163-KR   (A GPU not as fancy as GTX1080. But it is a fair choice for personal usage and can be upgraded later. $278)

2. ASUS ROG Strix Z370-G Gaming LGA1151 (Intel 8th Gen) DDR4 DP HDMI M.2 Z370 Micro ATX Motherboard with onboard 802.11ac WiFi, Gigabit LAN and USB 3.1 (Motherboard recommended by Yanda. This board can support up to 2 GPUs, which means there is room to expand. $189)

3. WD Blue 3D NAND 500GB PC SSD - SATA III 6 Gb/s M.2 2280 Solid State Drive - WDS500G2B0B (this 500GB SDD drive is also recommended by Yanda. I like it since this SDD drive can be embedded into the Z370-G motherboard. $95)

4. Corsair Vengeance LPX 16GB (2x8GB) DDR4 DRAM 3200MHz C16 Desktop Memory Kit - Black (CMK16GX4M2B3200C16) (16GB DRAM, $170)

5. Intel 8th Gen Core i5-8400 Processor (For budget purposes, I did not purchase an i7 but instead settled on an 8th gen i5 processor, $180)

6. EVGA Supernova 750 G3, 80 Plus Gold 750W, Fully Modular, Eco Mode with New HDB Fan, 10 Year Warranty, Includes Power ON Self Tester, Compact 150mm Size, Power Supply 220-G3-0750-X1 (this 750W power supply is recommended by Amazon. It seems to be a quite popular choice and is so far so good for me. $97)

7. Thermaltake Versa H15 SPCC Micro ATX Mini Tower Computer Chassis CA-1D4-00S1NN-00 (This case is also recommended by Amazon. I think any case with good reviews and supports microATX standard should do the job. $41)

All seven components add up to $1050. If one chooses a better CPU (like an i7), a better GPU (like a GTX1080), or more than one GPU, the budget for this box will increase. But it also shows the benefit of DIY, since it is fairly easy to tune the computer configuration according to your own budget and needs.

I assembled everything myself. Even though it was the first time I had done that, the whole process was not that difficult. My computer is already up and running, and it seems that I have not blown up anything. However, you will want to read the manuals, especially the motherboard manual, carefully before starting. Youtube education videos can also be quite beneficial.

Software

I installed Ubuntu Linux on my computer. The Ubuntu website provides a good tutorial on how to create a bootable USB stick for Ubuntu. I followed that direction to install Ubuntu 18.04 and everything works fine. During my installation process, I did not allow installing third-party software, out of concern that it could mess up the Nvidia GPU installation done later.

The trickiest part of the whole box assembly procedure was installing CUDA. CUDA support for Linux is far from perfect. As of the time I wrote this blog, the CUDA website only supports Ubuntu 17 and 16, but not Ubuntu 18. An Internet search returns lots of results on how to install CUDA in Ubuntu, yet many of them are confusing, especially for an Ubuntu newbie like me. For example, I had followed a recommendation to turn off the X-server and it ended up with an endless black screen. What eventually saved me is this blog. The author lays out a relatively straightforward path to install CUDA in Ubuntu 18.04 and, most importantly, it works! The only thing missing in that blog is how to install Nvidia driver version 390. For the Nvidia driver installation, this discussion tells how to do it. It shows two ways and below is what I did. After the driver installation is done, remember to use nvidia-smi to verify the installation.
$ sudo add-apt-repository ppa:graphics-drivers/ppa
$ sudo apt update
$ sudo apt install nvidia-390
After installing CUDA, CUDNN and Tensorflow, you are still a few steps away from running your first deep learning code on the box. For example, you may want to install keras and other Python packages. However, these steps are straightforward. Finally, you should see a keras or tensorflow sample code running on the GPU, and that calls a good end to your day.
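One quick sanity check, assuming TensorFlow (1.x, as used at the time of this post) is already installed, is to list the devices it can see and confirm a GPU shows up:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())   # a /device:GPU:0 entry should appear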

Sunday, August 26, 2018

Image Augmentation For Deep Learning Without Using ImageDataGenerator

Image augmentation is an important technique used in image-related deep learning to reduce overfitting. Neural networks used in deep learning often have millions of parameters and they are good at memorizing training data. That means the converged network model performs well on the training data but poorly on other data such as validation or test data. We call this phenomenon "overfitting" since the model fits too well to the training data. There are multiple ways to alleviate the overfitting issue, and image augmentation is one of them. Image augmentation means that instead of feeding the training data only, we also feed randomly tilted, shifted, flipped or scaled training images to the model. Randomized images make it harder for the neural network to memorize the training data. Therefore, it may reduce the gap between training accuracy and validation accuracy. Ultimately, the data augmentation technique can improve validation accuracy.

Some deep learning libraries like Keras provide a good built-in function for image augmentation called ImageDataGenerator. This tool is fairly easy to use and fits well into the Keras framework. However, if another deep learning library like PyTorch is used, ImageDataGenerator is not available and users need to develop their own image augmentation code. In this article, we will demonstrate how to do that. In particular, we will show an example of how to augment images for the task of image segmentation, which involves a pair of images: image and mask. A similar method can also be used to augment a single image.

The baseline of augmentation code comes from EKami's work related to Kaggle carvana challenge (https://github.com/EKami/carvana-challenge/blob/original_unet/src/img/augmentation.py). This code is an excellent example of how image augmentation should be done. On top of it, we made the following improvements:
1. The original function augment_img(img, mask) does not always seem to change the image and mask at the same time. Digging deeper, the random_shift_scale_rotate() function is called twice, once for the image and once for the mask. However, np.random.random() used in that function can return different values for the image and for the mask. That seems to explain why image and mask are not always modified simultaneously. To address this issue, we create a new function named augment_img_reshape(). In this new function, a common random number is always shared between the image and mask conversions, which guarantees that both are modified simultaneously. A simplified sketch of this idea is shown after this list.
2. To suit deep learning, input images are often converted to channel-first format, (im_chan, im_width, im_height) for the image and (1, im_width, im_height) for the mask. This format does not match the format required by the baseline code, so we add format conversion support to augment_img_reshape().
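Below is a minimal sketch of the idea behind augment_img_reshape(); the function name and the specific transforms here are illustrative, not the exact code in the repository. The key point is that the random parameters are drawn once and applied to both arrays.

import numpy as np

def augment_pair(img, mask):
    # img: (channels, H, W), mask: (1, H, W), channel-first as described above
    if np.random.random() < 0.5:
        img = img[:, :, ::-1].copy()     # horizontal flip
        mask = mask[:, :, ::-1].copy()   # the same flip applied to the mask
    shift = np.random.randint(-5, 6)     # one shared shift amount for both arrays
    img = np.roll(img, shift, axis=2)
    mask = np.roll(mask, shift, axis=2)
    return img, mask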

As an example, running image_augmentation.py stored here generates both the original image/mask and the image/mask after augmentation. One example is shown below. Note that since augmentation is a random event, sometimes the images before and after augmentation are identical.




















Sunday, July 22, 2018

My First Kaggle Competition -- Plant Seedlings Classification

Kaggle.com is a website famous for hosting machine learning competitions. Recently I participated in one competition named Plant Seedlings Classification (https://www.kaggle.com/c/plant-seedlings-classification/) and would like to share what I have learned. The goal of this competition is image classification. For the training set, the host provides over four thousand images covering 12 different plant species, and the task is to categorize each test image into one of these 12 species. This problem is a typical computer vision challenge which can be solved by deep learning. The neural network used is a convolutional neural network (CNN). After the CNN is trained on the training images, it is applied to the test images for classification.

In order to achieve good performance, I tried various approaches and found the following two tips to be particularly useful:
1. Start from pre-trained network, and retrain it using the new data.
2. Ensemble by combining results of multiple models.

Before diving into more details, let me list the results of some approaches that I have tried.




Tip 1 was not my first choice. In fact, I started by designing a 10-layer network (conv/conv/pool/conv/conv/pool/conv/conv/pool/dense) myself (row 'My own CNN network' in the table, with 0.95717 accuracy). But then I found that by building the classifier on top of well-designed deep networks like VGG, Xception and Densenet, better accuracy can be achieved. Explaining what VGG or Densenet is lies beyond the scope of this article. However, it is fair to say that these networks have more layers than the network I designed, and some of them, such as Densenet or InceptionV3, apply more exotic topologies. Training these networks from scratch can be challenging, but we can start from pre-trained weights, after which training is much easier. An example of Keras code is shown below for building a model based on the pre-trained Xception network. The base model is initialized from an Xception network with pre-trained ImageNet weights and without the top part. Then a pooling layer and a dense layer are added on top of the base model. The output of the dense layer predicts the likelihood of which plant the image belongs to. This example can easily be extended to other networks such as VGG. The weights corresponding to the best result are saved and used for testing. Using a pre-trained network boosts accuracy by at least two percentage points compared with the self-designed network in the first row. The best accuracy for a single network is achieved by Densenet201 and Densenet169 at 0.98236.

base_model = Xception(weights='imagenet', input_shape=(img_size, img_size, 3), include_top=False)
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dropout(0.5)(x)
x = Dense(1024, activation='relu')(x)
x = Dropout(0.5)(x)
predictions = Dense(12, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)

model.compile(optimizer='Adadelta',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# datagen (defined elsewhere in the script) supplies the augmented training batches
model.fit_generator(datagen.flow(train_img, train_label, batch_size=batch_size), steps_per_epoch=len(train_img)//batch_size, epochs=100, verbose=1)
model.save_weights('Xception.h5')

To further improve performance, we need tip 2. If you ask machine learning practitioners the best way to improve performance, many of them will probably tell you "ensemble". The ensemble method combines the prediction results of multiple models. Since the prediction results generated by these models are more or less independent, combining them reduces the noise and enhances the results. However, the non-trivial thing about ensembling is that combining more models does not always mean better results. For example, in our experiments Densenet201 + Densenet169 + Densenet121 + InceptV3 + Xcep + InceptRosV2 + VGG16 + VGG19 combines the most models but does not give the best accuracy. Without a straightforward rule, finding the best ensemble takes lots of trials. Another question for the ensemble approach is how much weight should be assigned to each base model. Not all base models are equal; some behave better than others, and most likely a base model with better performance should be assigned a larger weight. In communications, there are schemes such as Maximum-Ratio Combining (MRC), which combines multiple signal branches and determines the combining weights using knowledge of the signal and noise levels. But in deep learning, it is not easy to obtain metrics such as signal and noise level. Thus in my work, I assign equal weight to each base model, but keep in mind that other ways of weight assignment do exist. After many trials, we eventually found that Densenet201 + Densenet169 + InceptV3 + Xcep + InceptRosV2 gives the best performance (0.98992). I should say that by the ensemble method alone, this ensemble model gives an accuracy of 0.9874. But after analyzing multiple submissions and correlating their results, we were able to correct two entries, which boosts the final accuracy to 0.98992. So we gamed the system a little bit, but not too much. This score of 0.98992 ranks in the top 3%. The top player on the leaderboard has an accuracy of 1.0!
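As a minimal sketch, equal-weight ensembling can be done by averaging the softmax outputs, assuming models is a list of trained Keras models and test_img is a batch of test images (both names are hypothetical):

import numpy as np

preds = [m.predict(test_img) for m in models]   # each is (num_images, 12)
ensemble_pred = np.mean(preds, axis=0)          # equal-weight average of the softmax outputs
labels = np.argmax(ensemble_pred, axis=1)       # final class prediction per image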






In summary, participating in the competition was lots of fun. Although we participated after the competition closed and our score does not count, working on this problem still taught us quite a few things. We also learned from many discussions and kernels on the Kaggle website. Here we want to give back by sharing our experience, and hopefully it can benefit our audience.

[Update on 11/16/2018] Code examples have been added to https://github.com/legendzhangn/blog/tree/master/plant_seedlings

Saturday, June 30, 2018

What Keras Source Code Tells Us about LSTM Processing

A Recurrent Neural Network (RNN) is a type of neural network which can save its states and process input sequences. With this recurrent architecture, an RNN can be used for applications such as language analysis, time series prediction and speech recognition. Sometimes it can be challenging to understand how an RNN processes its inputs. I ran into this problem when I started to learn RNNs, and reading the Keras source code helped to clarify the puzzle.

Under Keras, the input to an RNN has three dimensions (batch_size, timesteps, input_dim). batch_size controls how often the network is updated: the network is updated once after batch_size sequences are processed. For example, if there are 256 sequences and batch_size is set to 32, then after processing the whole set the network will have been updated 8 times. The timesteps parameter is how many times an RNN cell runs; in each run, it starts from the saved states and generates new states and outputs. input_dim is the number of features contained in an input. Another important parameter is the number of RNN units. For example, the instruction below creates an LSTM layer (LSTM is a popular type of RNN) with 4 units. Here, timesteps is set to 1 and input_dim is set to 2. This instruction does not define batch_size, which means batch_size will be defined later. From now on, we refer to the number of units as num_units.

model.add(LSTM(4, input_shape=(1, 2)))
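As a quick check (not in the original post), we can build this layer in a small model and print its weight shapes:

from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(4, input_shape=(1, 2)))
for w in model.get_weights():
    print(w.shape)
# Expected shapes: (2, 16), (4, 16), (16,) -- in recent Keras versions the i/f/c/o
# kernels, recurrent kernels and biases are stored concatenated, i.e.
# (input_dim, 4*num_units), (num_units, 4*num_units) and (4*num_units,).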

The Keras RNN source code can be found here. Reading the source code reveals how an RNN processes its input. How LSTM operates can be found in the call() function of the class LSTMCell. This part of the code (shown below) is related only to input_dim and num_units, not to the other parameters. For instance, the dimension of self.kernel_i/self.kernel_f/self.kernel_c/self.kernel_o is (input_dim, num_units). inputs is a vector of input_dim elements. Therefore, after the dot product between kernel and inputs, input_dim no longer exists and x_i/x_f/x_c/x_o become vectors of length num_units, which matches the dimensionality of the bias. Another observation is that in this stage, each LSTM unit works independently without interference from other units.

            if 0 < self.dropout < 1.:
                inputs_i = inputs * dp_mask[0]
                inputs_f = inputs * dp_mask[1]
                inputs_c = inputs * dp_mask[2]
                inputs_o = inputs * dp_mask[3]
            else:
                inputs_i = inputs
                inputs_f = inputs
                inputs_c = inputs
                inputs_o = inputs
            x_i = K.dot(inputs_i, self.kernel_i)
            x_f = K.dot(inputs_f, self.kernel_f)
            x_c = K.dot(inputs_c, self.kernel_c)
            x_o = K.dot(inputs_o, self.kernel_o)
            if self.use_bias:
                x_i = K.bias_add(x_i, self.bias_i)
                x_f = K.bias_add(x_f, self.bias_f)
                x_c = K.bias_add(x_c, self.bias_c)
                x_o = K.bias_add(x_o, self.bias_o)

            if 0 < self.recurrent_dropout < 1.:
                h_tm1_i = h_tm1 * rec_dp_mask[0]
                h_tm1_f = h_tm1 * rec_dp_mask[1]
                h_tm1_c = h_tm1 * rec_dp_mask[2]
                h_tm1_o = h_tm1 * rec_dp_mask[3]
            else:
                h_tm1_i = h_tm1
                h_tm1_f = h_tm1
                h_tm1_c = h_tm1
                h_tm1_o = h_tm1
            i = self.recurrent_activation(x_i + K.dot(h_tm1_i,
                                                      self.recurrent_kernel_i))
            f = self.recurrent_activation(x_f + K.dot(h_tm1_f,
                                                      self.recurrent_kernel_f))
            c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1_c,
                                                            self.recurrent_kernel_c))
            o = self.recurrent_activation(x_o + K.dot(h_tm1_o,
                                                      self.recurrent_kernel_o))

The code above has nothing to do with the concept of time, which is the signature of an RNN. So how is time introduced into the RNN implementation? The class RNN(Layer), defined in the same script recurrent.py, is the base class for RNNs. In the call() function of this class, we can see a call to the backend function K.rnn, where timesteps is passed as the input_length argument. We call it a backend function since Keras is built on TensorFlow or Theano, and the rnn() function is provided not by Keras but by the TensorFlow or Theano library, depending on which backend Keras relies on. Although I have not yet checked the source code of the rnn function, I believe that in the backend function the above LSTMCell logic is called timesteps times to generate the final output.

        last_output, outputs, states = K.rnn(step,
                                             inputs,
                                             initial_state,
                                             constants=constants,
                                             go_backwards=self.go_backwards,
                                             mask=mask,
                                             unroll=self.unroll,
                                             input_length=timesteps)

For reference purposes, the LSTM equations are provided here:
Input gate: \(i_{t}=\sigma(W^{(i)}x_{t}+U^{(i)}h_{t-1})\)
Forget: \(f_{t}=\sigma(W^{(f)}x_{t}+U^{(f)}h_{t-1})\)
Output: \(o_{t}=\sigma(W^{(o)}x_{t}+U^{(o)}h_{t-1})\)
New memory cell: \(\tilde{c}_{t}=tanh(W^{(c)}x_{t}+U^{(c)}h_{t-1})\)
Final memory cell: \(c_{t}=f_{t}\circ c_{t-1}+i_{t}\circ \tilde{c}_{t}\)
Final hidden state: \(h_{t}=o_{t}\circ tanh(c_{t})\)
where \(\circ\) denotes the pointwise (element-wise) product


In this blog, I summarized what I have found in Keras source code. Hopefully it can be useful to you.

The largest difference between an ordinary RNN and an LSTM is that the LSTM has an extra hidden state, the cell state \(c_t\).

Monday, June 4, 2018

How To Load/Save Data in Python

Loading and saving data is a common task in numerical analysis. In Python, the most convenient way is to use the save/load functions provided in the numpy package. An example is as follows:

import numpy as np
rand_array = np.random.rand(1, 10)
np.save('rand.npy', rand_array)
b = np.load('rand.npy')

This example shows how to save an array in a .npy file and then load it. This save/load function is quite similar to save/load in Matlab, except that Matlab allows saving multiple arrays in a single file, while np.save stores only one array per file (numpy's np.savez can bundle several arrays into one .npz file; see the sketch after the Matlab code below). For comparison, Matlab save/load code is shown below, where two arrays with random numbers are saved in the same file.


rand_array=rand(1,10);
rand_array2 = rand(1,10);
save('rand_file','rand_array','rand_array2');
myFile = load('rand_file');
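For completeness, numpy can also bundle several arrays into one file via np.savez, which is a closer analogue to Matlab's multi-variable save:

import numpy as np
a = np.random.rand(1, 10)
b = np.random.rand(1, 10)
np.savez('rand_file.npz', a=a, b=b)   # store both arrays in one .npz file
data = np.load('rand_file.npz')
print(data['a'], data['b'])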

Saturday, March 24, 2018

3D LIDAR DIY Using PulsedLight LIDAR-Lite Device and Servo

LIDAR stands for LIght Detection And Ranging. LIDAR uses a laser to measure distance, and it is capable of doing things not many other sensors can do: it can accurately measure the distance (tens to hundreds of meters) to surrounding objects, which produces a good picture of the surrounding environment. Because of this capability, LIDAR is a must-have for many self-driving car designs. Many tech hobbyists want to get their hands on a LIDAR device. However, commercial LIDAR devices are quite expensive, usually well above one thousand dollars. The good news is that with the availability of cheap ranging components like the PulsedLight LIDAR-Lite and open-source platforms like Arduino, we can now build our own 3D LIDAR scanner for well below $1000. In this blog, I will introduce such a device made by myself.

The PulsedLight LIDAR-Lite is used in this device to measure distance using a laser; the maximum distance LIDAR-Lite measures is 40 meters. Two servos, one for the vertical direction and another for the horizontal direction, are used to scan the 3D space. Both horizontal and vertical servos can rotate up to 180 degrees, which defines the coverage of this scanner. An Arduino platform controls all of them. By connecting the Arduino to a PC, the 3D scan data can be downloaded to the PC to construct a point cloud file. At the same time, the PC powers the device through USB, so the device does not need its own power supply. The figure below shows what the device looks like.

























When the device scans, it first fixes the vertical angle and sweeps horizontally. Then it moves one vertical step and sweeps horizontally again. This Youtube video shows the scanning process.




After the scan finishes, a point cloud is generated. Often several point clouds need to be stitched together to generate the whole picture of a building. The residential house model below was generated that way.
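As a rough illustration (not the actual firmware or PC code used here), each raw sample of (horizontal angle, vertical angle, distance) can be converted to a Cartesian point before stitching; the angle conventions below are assumptions.

import numpy as np

def scan_to_points(scan):
    # scan: iterable of (theta_deg, phi_deg, r_m) samples read from the Arduino
    points = []
    for theta_deg, phi_deg, r in scan:
        theta = np.radians(theta_deg)   # horizontal sweep angle
        phi = np.radians(phi_deg)       # vertical tilt angle
        x = r * np.cos(phi) * np.cos(theta)
        y = r * np.cos(phi) * np.sin(theta)
        z = r * np.sin(phi)
        points.append((x, y, z))
    return np.array(points)             # one (x, y, z) row per laser measurement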







A 3D model can be produced based on the point cloud (thanks Igor!). Below is the 3D model of the same residential house.
























After 3D printing, we have a plastic model of the house. We can view it side by side with the Google street map.









Sunday, February 18, 2018

Signal Processing Magic (3) -- Farrow Structure

The Farrow structure was invented by C. W. Farrow. It minimizes the number of multiplications used in fractional interpolation. The Farrow structure is often deployed when there is a fractional sampling rate change in the signal path. For example, the ADC rate may be set to 120MHz while the baseband signal processing only supports 100MHz; a filter then needs to be designed to support the rate conversion from 120MHz to 100MHz. The Farrow structure is one candidate for this kind of job.

The setting of the Farrow structure depends on the polynomial it implements. Let us take the example of using cubic Lagrange interpolation. The output y[n] can be written as a combination of the input x[n] as:
\[y[n]=C_{-2}x[n-2]+C_{-1}x[n-1]+C_{0}x[n]+C_{1}x[n+1]\]

Assuming the fractional delay is \(\tau\) and using the Lagrange formula, \(C_{-2}\) equals the division of \((-\tau-(-1))(-\tau)(-\tau-1)\) by \((-2-(-1))(-2)(-2-1)\). Thus,
\[C_{-2}=\frac{\tau^{3}}{6}-\frac{\tau}{6}\]
Extending the same formula to the other coefficients, we have
\[C_{-1}=-\frac{\tau^{3}}{2}+\frac{\tau^{2}}{2}+\tau\]
\[C_{0}=\frac{\tau^{3}}{2}-\tau^{2}-\frac{\tau}{2}+1\]
\[C_{1}=-\frac{\tau^{3}}{6}+\frac{\tau^{2}}{2}-\frac{\tau}{3}\]
Rewriting the y[n] equation, we have
\[y[n]=\frac{1}{6}[(x[n-2]-3x[n-1]+3x[n]-x[n+1])\tau^{3}+ \\
(3x[n-1]-6x[n]+3x[n+1])\tau^2+ \\
(-x[n-2]+6x[n-1]-3x[n]-2x[n+1])\tau+6x[n]] \]
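As a minimal Python sketch of this expansion (not a hardware description), one output sample can be computed with only three multiplications by \(\tau\) using Horner's rule:

def farrow_cubic(x, n, tau):
    # x: input sequence, n: sample index (2 <= n <= len(x)-2), tau: fractional delay as above
    c3 = (x[n-2] - 3*x[n-1] + 3*x[n] - x[n+1]) / 6.0   # branch FIR outputs (bracketed terms)
    c2 = (3*x[n-1] - 6*x[n] + 3*x[n+1]) / 6.0
    c1 = (-x[n-2] + 6*x[n-1] - 3*x[n] - 2*x[n+1]) / 6.0
    c0 = x[n]
    # Horner evaluation: only three multiplications by tau are needed
    return ((c3 * tau + c2) * tau + c1) * tau + c0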

The corresponding Farrow structure is shown below. It only needs three multipliers, and \(\tau\) is an input to all of them, which simplifies the hardware design. When the polynomial is different, the coefficients of the FIR filters change accordingly. We should also note that, due to the difference between input and output rates, one input sample does not necessarily generate one output sample; a control logic is needed to dictate when to generate outputs.



Thursday, February 15, 2018

Signal Processing Magic (2) -- Sigma Delta ADC

Sigma Delta ADC is a commonly used architecture for converting an analog signal to digital. The concept of an ADC is quite simple: let us say we have 4 bits, with bit3 representing 1 volt, bit2 0.5v, bit1 0.25v and bit0 0.125v; then we can quantize an analog input in volts to these 4 bits. For example, if the input is 1.2 volts, it becomes [bit3 bit2 bit1 bit0] = [1010] = 1.25v. The quantization error is 1.2-1.25 = -0.05v, and quantization noise comes from this quantization error. A straightforward thought is to use binary search to convert the analog input to digital; this is the basic idea behind the SAR (Successive Approximation Register) ADC.

Sigma Delta ADC, our main topic in this article, is constructed differently. Assuming x(n) is the input and y(n) is the output of the sigma delta modulator, y(n) has a much higher sampling rate than x(n) and y(n) is only 1 bit wide. So in terms of y(n), sigma delta can be seen as a trade-off between sampling rate and bit width. y(n) then passes through a lowpass filter and a decimator to produce the multi-bit ADC output we commonly work with.



Now let us do some math to figure out what really happens in sigma delta. First we calculate the transfer function between v(n) and u(n)

\[z^{-1}(U(z)+V(z))=V(z)\]
which gives
\[H(z)=\frac{V(z)}{U(z)}=\frac{z^{-1}}{1-z^{-1}}\]

The transfer function from input, x(n), to output, y(n), is equal to
\[\frac{Y(z)}{X(z)}=\frac{H(z)}{1+H(z)}=z^{-1}\]
which tells us there is a signal delay but no signal distortion.

e(n), added at the 1-bit ADC in the diagram, represents the quantization noise. The transfer function from e(n) to y(n) is
\[\frac{Y(z)}{E(z)}=\frac{1}{1+H(z)}=1-z^{-1}\]
The frequency response of Y(z)/E(z) is shown below. Its amplitude increases with frequency, which means the sigma delta modulator pushes noise from in-band to out-of-band. Reducing the in-band noise level is the major benefit of sigma delta. What we discussed here is a first-order sigma delta; if we go to a higher-order sigma delta, the in-band noise level is further reduced, but the cost is more sophisticated hardware and a higher out-of-band noise level.
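As a small sketch, the magnitude of this noise transfer function can be plotted directly to see the high-pass shaping:

import numpy as np
import matplotlib.pyplot as plt

w = np.linspace(0, np.pi, 512)         # normalized frequency, 0..pi rad/sample
ntf = np.abs(1 - np.exp(-1j * w))      # |1 - e^{-jw}| = 2*|sin(w/2)|
plt.plot(w / np.pi, 20 * np.log10(ntf + 1e-12))
plt.xlabel('Normalized frequency (x pi rad/sample)')
plt.ylabel('|Y/E| (dB)')
plt.show()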


Signal Processing Magic (1) -- CIC Filter

In the world of signal processing, there are many elegant schemes. They are beautiful and pragmatic at the same time; sometimes you wonder how the inventors came up with them. I feel obliged to introduce them to my audience. Today, let me start with a filter named CIC (Cascaded Integrator–Comb). The CIC filter is mainly used for integer up-sampling and down-sampling. The main benefit of this scheme is that no multiplication is needed, which is welcome in terms of hardware complexity.

The figure below shows the CIC filter architecture with N stages and a decimation rate of R. M is an extra parameter and can be 1 or 2; normally M is set to 1. The CIC filter can also be used for interpolation, in which case the flow is the opposite of the decimation case.



Accordingly, we can write the transfer function of the decimating CIC filter as:

\[(\frac{1-z^{-RM}}{1-z^{-1}})^{N}\]
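As a small sketch, this response can be plotted for a few values of N by evaluating the transfer function above on the unit circle (R=8 and M=1 are assumptions for illustration):

import numpy as np
import matplotlib.pyplot as plt

R, M = 8, 1
w = np.linspace(1e-6, np.pi, 1024)      # start slightly above 0 to avoid 0/0 at DC
for N in (1, 2, 3):
    H = (1 - np.exp(-1j * w * R * M)) / (1 - np.exp(-1j * w))
    mag = 20 * np.log10(np.abs(H) ** N / (R * M) ** N)   # normalize the DC gain to 0 dB
    plt.plot(w / np.pi, mag, label='N = %d' % N)
plt.xlabel('Normalized frequency (x pi rad/sample)')
plt.ylabel('Magnitude (dB)')
plt.legend()
plt.show()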


By increasing N with fixed R, the filter can have more rejection.




By increasing R with fixed N, the filter's cutoff frequency decreases proportionally.