Story Behind the Convolutional Neural Networks (CNN) with PyTorch Part I
Hello Folks!
Humans are actually awesome at detecting objects and classifying patterns, but with machines the story is different. A computer does not see objects; it observes patterns of numbers. Basically, images stored on a computer consist of matrices of numbers, where each number represents a pixel. For black-and-white images there is only one number per pixel to represent its darkness, while for colored images there are 3 numbers per pixel for the RGB channels (one for red, one for green, one for blue).
The process by which a machine analyzes and recognizes objects is called Computer Vision. Convolutional neural networks are the fascinating algorithms behind Computer Vision. In particular, this tutorial series will show you both the theory and the practical application of Convolutional Neural Networks in PyTorch.
Convolutional Neural Networks (CNN)
Convolutional Neural Networks, also known as ConvNets, leverage spatial information and hence are superb at classifying images. Basically, ConvNets aim to find the correlations between adjacent inputs in an image. In simple terms, a Convolutional Neural Network consists of one or more convolutional layers followed by fully connected layers. The architecture of a CNN is based on the structure of the 2D input image. Another benefit of CNNs is that they are easier to train and have far fewer parameters than fully connected networks with the same number of hidden units.
Before kicking off PyTorch, let’s talk more about the key intuitions behind Conv Neural Networks!
What is an Image?
A simple intuition is that every image is represented as a matrix of pixel values. In ConvNets we often refer to a Channel, which is a certain component of an image. If we take an image with a standard camera, it will have 3 channels (RGB): Red, Green, and Blue. Each channel is a two-dimensional matrix, stacked on the others, with pixel values ranging from 0 to 255.
A grayscale image, on the other hand, has just one channel. The pixels of a grayscale image range from 0 to 255, where zero represents black and 255 represents white.
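To make this concrete, here is a minimal sketch of inspecting an image’s channels with Pillow and NumPy; the file name photo.jpg is just a placeholder for any color image you have on disk:

```python
import numpy as np
from PIL import Image

# Load a color photo; the file name is a placeholder for any RGB image.
img = np.array(Image.open("photo.jpg"))
print(img.shape)               # (height, width, 3) -> one matrix per RGB channel
print(img.dtype)               # uint8 -> pixel values range from 0 to 255

# Converting to grayscale leaves a single channel.
gray = np.array(Image.open("photo.jpg").convert("L"))
print(gray.shape)              # (height, width)
print(gray.min(), gray.max())  # values between 0 (black) and 255 (white)
```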
Local receptive fields
In order to preserve the spatial information of an image, we store it as a matrix of pixels. A local receptive field is the small region of the input that a single hidden neuron is connected to. Moreover, we can encode more information about the image if these submatrices overlap.
Let’s take an example: if we take MNIST data with an image size of 28 x 28 and make every single submatrix 5 x 5, we will be able to generate 24 x 24 local receptive field neurons in the next hidden layer, since (28 - 5) / 1 + 1 = 24. In PyTorch, the number of pixels by which we slide the window is called the stride, and this is a hyperparameter that can be fine-tuned during the construction of our ConvNets.
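Here is a minimal sketch of this idea in PyTorch, using unfold to enumerate every 5 x 5 local receptive field of a 28 x 28 image (the random tensor is just a stand-in for an MNIST digit):

```python
import torch

image = torch.randn(28, 28)               # an MNIST-sized grayscale image

# Slide a 5 x 5 window over the image with stride 1: each position is
# one local receptive field feeding one hidden neuron.
patches = image.unfold(0, 5, 1).unfold(1, 5, 1)
print(patches.shape)                      # torch.Size([24, 24, 5, 5])
# (28 - 5) / 1 + 1 = 24 positions along each axis -> a 24 x 24 hidden layer
```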
Shared weights and bias
Shared weights are nothing but the sharing of weights by all neurons in a particular feature map. In simple words, the main goal is to use the same set of weights and bias for all the neurons in a hidden layer. Hence each feature map in a ConvNet learns to detect one feature, purely derived from the image, wherever that feature appears.
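This sharing is also where the parameter savings mentioned earlier come from. A minimal sketch, assuming an illustrative 8-filter, 5 x 5 configuration:

```python
import torch.nn as nn

# With shared weights, a conv layer's parameter count does not depend on
# the image size: the same 5 x 5 filter is reused at every position.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=5)
print(sum(p.numel() for p in conv.parameters()))   # 8 * (5*5*1) + 8 = 208

# A fully connected layer producing the same 24 x 24 x 8 outputs from a
# 28 x 28 input would need millions of parameters instead.
fc = nn.Linear(28 * 28, 24 * 24 * 8)
print(sum(p.numel() for p in fc.parameters()))     # 3,617,280
```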
The “Convolution” Factor in ConvNets
CNNs basically work behind the scenes with the convolution operator. The primary goal of convolution in ConvNets is to extract important features from the input image. It captures the spatial relationships between pixels of an image by using small squares of input data.
Let’s not bother with the math behind ConvNets. Instead, we will dive deep into how it works for an image!
As we know, every image is a matrix of pixel values. Let’s take an example of a 5 x 5 image having pixel values of 0 and 1 only. The point to note here is that a grayscale image has pixel values ranging from 0 to 255, but for ease of understanding we have taken only 0 and 1 here.
Let us consider another 3 x 3 matrix as shown below.
So the convolution of the 5 x 5 image (Figure I) and the 3 x 3 matrix (Figure II) can be computed as shown below.
Let’s put some effort into understanding the computation being done. Basically, we iterate over the green image by 1 pixel, a step also known as the Stride. For every position, we perform an element-wise multiplication (between the two matrices) followed by a summation. This operation yields a single integer that forms one element of the output matrix (pink). In CNN terminology, the 3 x 3 matrix shown in Figure II is called a Filter, Kernel, or Feature Detector, and the matrix obtained by sliding the filter over the image and computing the element-wise multiplication followed by summation is called the Feature Map. It is important to note that filters act as feature detectors on the original input image.
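Here is a minimal sketch of this computation in PyTorch. Since the figures are images, the pixel and filter values below are illustrative assumptions rather than the exact ones from Figure I and Figure II:

```python
import torch
import torch.nn.functional as F

# A 5 x 5 binary "image" (stand-in for Figure I).
image = torch.tensor([[1, 1, 1, 0, 0],
                      [0, 1, 1, 1, 0],
                      [0, 0, 1, 1, 1],
                      [0, 0, 1, 1, 0],
                      [0, 1, 1, 0, 0]], dtype=torch.float32)

# A 3 x 3 filter / kernel (stand-in for Figure II).
kernel = torch.tensor([[1, 0, 1],
                       [0, 1, 0],
                       [1, 0, 1]], dtype=torch.float32)

# conv2d expects (batch, channels, height, width); it slides the kernel
# over the image, multiplying element-wise and summing at each position.
feature_map = F.conv2d(image.view(1, 1, 5, 5),
                       kernel.view(1, 1, 3, 3), stride=1)
print(feature_map.view(3, 3))
# tensor([[4., 3., 4.],
#         [2., 4., 3.],
#         [2., 3., 4.]])
```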
Moreover, the size of the Feature Maps depends on the following three parameters.
- Depth corresponds to the number of filters we use for the convolution operation; each filter produces its own feature map.
- Stride is the number of pixels by which we slide our filter matrix over the input matrix. When the stride is 1, we move the filter one pixel at a time. When the stride is 2, the filter jumps 2 pixels at a time as we slide it around.
- Zero-padding pads the border of the input image with zeroes so that we can apply the filter to the bordering elements of our input image matrix. One advantage of padding is that it helps in controlling the size of the feature maps, as the sketch after this list shows.
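A minimal sketch of how all three parameters affect the output shape in PyTorch (the filter counts and sizes here are illustrative assumptions):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)  # one grayscale 28 x 28 image

# Depth: 8 filters -> 8 feature maps; stride 1, no padding: 28 -> 24
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=5, stride=1, padding=0)
print(conv(x).shape)        # torch.Size([1, 8, 24, 24])

# Stride 2: the filter jumps 2 pixels at a time: 28 -> 12
conv_s2 = nn.Conv2d(1, 8, kernel_size=5, stride=2, padding=0)
print(conv_s2(x).shape)     # torch.Size([1, 8, 12, 12])

# Zero-padding of 2 keeps the spatial size: (28 - 5 + 2*2) / 1 + 1 = 28
conv_pad = nn.Conv2d(1, 8, kernel_size=5, stride=1, padding=2)
print(conv_pad(x).shape)    # torch.Size([1, 8, 28, 28])
```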
Pooling Step in ConvNets
Pooling, also known as down-sampling, reduces the dimensionality of each feature map while still retaining the important information. Pooling is classified into two types:
- Max Pooling: In the case of max pooling, we take the maximum element from the rectified feature map within a 2 x 2 window. In simple words, max pooling uses the maximum value from each cluster of neurons in the prior layer.
- Average Pooling: Instead of taking the maximum value, we can also take the average (or sum) of all elements in the rectified feature map window.
In practice, Max Pooling has been shown to work better!
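Here is a minimal sketch of both pooling variants in PyTorch, applied to a small made-up 4 x 4 feature map:

```python
import torch
import torch.nn as nn

fmap = torch.tensor([[1., 1., 2., 4.],
                     [5., 6., 7., 8.],
                     [3., 2., 1., 0.],
                     [1., 2., 3., 4.]]).view(1, 1, 4, 4)

# Max pooling keeps the largest value in each 2 x 2 window.
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(max_pool(fmap).view(2, 2))
# tensor([[6., 8.],
#         [3., 4.]])

# Average pooling takes the mean of each 2 x 2 window instead.
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
print(avg_pool(fmap).view(2, 2))
# tensor([[3.2500, 5.2500],
#         [2.0000, 2.0000]])
```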
Flattening
This step is pretty simple, hence the shockingly short section. As the name suggests, we are going to flatten our pooled feature map into a single column, like in the image above. Later we are going to feed this data into an artificial neural network.
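In PyTorch, flattening is a one-liner; the shape below assumes the 8-filter, 12 x 12 pooled output from the earlier sketches:

```python
import torch

pooled = torch.randn(1, 8, 12, 12)       # (batch, channels, height, width)
flat = pooled.view(pooled.size(0), -1)   # flatten everything but the batch dim
print(flat.shape)                        # torch.Size([1, 1152])
```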
Fully Connected Layer
The term Fully Connected here implies that every neuron in the previous layer is connected to every neuron in the next layer. I recommend you read this post if you are unfamiliar with it. After adding the convolution and pooling layers, we merge them with a fully connected layer in order to wrap up the complete CNN architecture. Moreover, the output of the convolution and pooling layers is a 3-dimensional tensor, but a fully connected layer requires a 1-dimensional vector of inputs. So we flatten the output of the final pooling layer into a vector, and that becomes the input to the fully connected layer.
The purpose of the fully connected layer is to use the high-level features from the input image in order to classify the image into various classes based on the training data. Apart from classification, adding a fully connected layer is also a cheap way of learning non-linear combinations of these high-level features. One thing I need to point out here is that the probabilities output by the fully connected layer sum to 1. We can ensure this by using the softmax activation function in the output layer. The softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.
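A minimal sketch of softmax in PyTorch, with made-up scores:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])   # arbitrary real-valued scores
probs = F.softmax(logits, dim=0)
print(probs)          # tensor([0.6590, 0.2424, 0.0986])
print(probs.sum())    # tensor(1.) -> the probabilities sum to one
```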
The Whole Convolution Process-A Quick Recap for You
Let’s make a quick recap of what we learned about convolutional neural networks. In the diagram above, you can see the entire process of creating and optimizing a convolutional neural network that we covered throughout this post.
- Start with an input image.
- Next, we apply filters to the input image; the resulting feature maps give us a convolutional layer.
- Then we introduce non-linearity by applying a rectifier function such as ReLU to the feature maps.
- The image is then ready for the pooling step, the purpose of which is to down-sample and reduce the dimensionality of the feature maps.
- We then flatten our pooled features before pushing them into our artificial neural network.
Throughout this process, the weights of the filters (and hence the feature maps) are trained and repeatedly adjusted so that the network reaches the best performance and classifies images and objects as accurately as possible.
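Putting all of these steps together, here is a minimal sketch of a complete CNN in PyTorch. The layer sizes are illustrative assumptions for a 28 x 28 grayscale input, not a tuned architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=5)   # convolution: 28 -> 24
        self.pool = nn.MaxPool2d(2)                  # pooling: 24 -> 12
        self.fc = nn.Linear(8 * 12 * 12, num_classes)

    def forward(self, x):
        x = F.relu(self.conv(x))        # convolution + ReLU non-linearity
        x = self.pool(x)                # down-sampling
        x = x.view(x.size(0), -1)       # flattening
        return self.fc(x)               # fully connected layer (raw scores)

model = SimpleCNN()
scores = model(torch.randn(1, 1, 28, 28))
print(F.softmax(scores, dim=1).sum())   # probabilities sum to 1
```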
In Part II of this series, I will walk you through image classification using the great PyTorch!
References
There are a ton of CNN tutorials on the web, but the most comprehensive one is the Stanford CS231N course by Andrej Karpathy. The reading material is available here, and the video lectures are here. Highly recommended.
If you’re interested in applying CNN to natural language processing, this is a great article. Another very detailed one is available here.
All articles by Chris Olah are packed with great information and visualizations. CNN-related posts are available here and here.
This article is inspired by the most innovative explanation of ConvNets, which is available here.
If you like this post, please follow me as I will be posting some awesome tutorials on Machine Learning as well as Deep Learning.
Also, check out this Superb Post on Logistic Regression with PyTorch!
Cheers!