Image Classification

Yesterday, Google released an amazing tool called AutoDraw. It recognises the strokes you draw and uses Machine Learning to suggest well-drawn versions of them. In this demo, we are going to use the same Transfer Learning concept as in my previous post (Style Transfer Network). The pre-trained Convolutional Neural Network model is still VGG16 from Oxford. We will redefine its fully-connected (Dense) layer to predict the specific objects we want.


CNN – Convolutional Neural Network

I know the architecture graph below may be confusing, but don’t worry. We will go through the basic concepts of a CNN afterwards.

“Feature Evaluation of Deep Convolutional Neural Networks for Object Recognition and Detection” by Hirokatsu KATAOKA, Ph.D.

The idea here is that we keep all the convolutional layers, but replace the final fully connected layers with our own classifier. This way we can use VGGNet as a feature extractor for our images, then easily train a simple classifier on top of that. What we’ll do is take the output of the first fully connected layer, after thresholding with ReLUs. We can use those values as a code for each image, then build a classifier on top of those codes.

Introduction to Computer Vision

To begin with, we will use a handwritten digit image from a data-set called MNIST as an example of how a computer sees an image file. You may check the details on Yann LeCun’s website.

For example, take a 28 pixel by 28 pixel black and white handwritten image. We can actually interpret this as a big array of numbers:
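As a rough illustration of that idea, here is a minimal sketch in plain Python. The image is a made-up 5 by 5 example rather than a real 28 by 28 MNIST digit, and the pixel values and variable names are my own assumptions, not part of the data-set.

```python
# A hypothetical 5x5 greyscale "image": 0 is black, 255 is white.
# (A real MNIST digit would be 28x28; these values are invented.)
image = [
    [0,   0, 255,   0, 0],
    [0, 255, 255,   0, 0],
    [0,   0, 255,   0, 0],
    [0,   0, 255,   0, 0],
    [0, 255, 255, 255, 0],
]

height = len(image)
width = len(image[0])

# Scale every pixel to the 0..1 range, a common preprocessing step.
normalised = [[pixel / 255 for pixel in row] for row in image]

# Flatten the grid into one long list of numbers.
flat = [pixel for row in normalised for pixel in row]

print(height, width, len(flat))
```

To the computer, the picture is nothing more than this grid of numbers.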

That is it! Simple, right? But that is just a greyscale image. The next question is how we can feed a coloured image to the CNN. The answer: convolution!



Imagine you have a 4×4 image with the usual red, green and blue channels (its depth). Each entry corresponds to one pixel value in one channel, 0 for black and 1 for white (typically it’s between 0 and 255 for grayscale images). The image below shows the cross-section of an input volume of size 4 x 4 x 3. It comprises the 3 colour channel matrices of the input image.
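To make the 4 x 4 x 3 input volume a bit more concrete, the sketch below stores a made-up RGB image as rows of (R, G, B) pixels and splits it into its three channel matrices. The colours and the helper name split_channels are assumptions chosen for illustration.

```python
# A hypothetical 4x4 RGB image stored as rows of (R, G, B) pixels.
RED, GREEN, BLUE, WHITE = (255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)
image = [
    [RED,  RED,  GREEN, GREEN],
    [RED,  RED,  GREEN, GREEN],
    [BLUE, BLUE, WHITE, WHITE],
    [BLUE, BLUE, WHITE, WHITE],
]

def split_channels(img):
    """Return one height x width matrix per colour channel."""
    depth = len(img[0][0])
    return [[[pixel[c] for pixel in row] for row in img] for c in range(depth)]

red, green, blue = split_channels(image)

# Height x width x depth of the input volume.
shape = (len(image), len(image[0]), len(image[0][0]))
print(shape)
for row in red:
    print(row)
```

Each of the three matrices is one "slice" of the input volume.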

Source: An Illustrated Explanation


The first layer in a CNN is always a Convolutional Layer. For example, say we have an input image which is a 5x5x3 array of pixel values. Now, the best way to explain a conv layer is to imagine a flashlight that is shining over the top left of the image. Let’s say that the light this flashlight shines covers a 2 x 2 area. And now, let’s imagine this flashlight sliding across all the areas of the input image. In machine learning terms, this flashlight is called a filter (or sometimes a neuron or a kernel) and the region that it is shining over is called the receptive field.

Now this filter is also an array of numbers (the numbers are called weights or parameters). A very important note is that the depth of this filter has to be the same as the depth of the input (this makes sure that the math works out), so the dimensions of this filter are 2x2x3.

The image below shows how a filter produces its output from only one channel. Let’s take the first position of the filter as an example: the top left corner. As the filter slides, or convolves, around the input image, it multiplies the values in the filter with the original pixel values of the image (i.e. it computes element-wise multiplications). These products are all summed up, so now you have a single number. Remember, this number only represents the filter at the top left of the image. Now we repeat this process for every location on the input volume. (The next step would be moving the filter to the right by 1 unit, then right again by 1, and so on.)
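The sliding, multiplying and summing described above can be sketched in a few lines of plain Python. This is only a minimal illustration for a single channel with no padding; the function name convolve2d and the example values are my own assumptions.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the output matrix."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiply the receptive field by the filter, then sum.
            total = 0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# One hypothetical 5x5 channel and a 2x2 filter that responds to diagonals.
channel = [
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
]
kernel = [[1, 0],
          [0, 1]]

feature_map = convolve2d(channel, kernel)
for row in feature_map:
    print(row)
```

A 2 x 2 filter moved one step at a time over a 5 x 5 channel fits in 4 positions per row and per column, so the output is 4 x 4.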


So far, we have used only one channel as an example of how a feature is extracted from the input, but our input image actually has 3 channels that have not been convolved yet. To handle this 3D case, imagine the input as a stack of pancakes, where each pancake is called a feature map. Because each filter spans the full depth of the input, one filter produces one output feature map. With K filters, we map those three feature maps (RGB) to K feature maps. This process is repeated layer after layer, reducing the spatial dimensions and increasing the depth of your network.
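Here is a minimal sketch of that mapping from 3 feature maps to K feature maps, assuming stride 1 and no padding. The function name conv_layer, the sizes and the random weights are all hypothetical; a real network would learn the filter values during training.

```python
import random

def conv_layer(volume, filters):
    """volume: C channels, each H x W. filters: K filters, each C x F x F.
    Returns K feature maps (stride 1, no padding)."""
    height, width = len(volume[0]), len(volume[0][0])
    maps = []
    for filt in filters:
        size = len(filt[0])
        out_h, out_w = height - size + 1, width - size + 1
        fmap = [[0.0] * out_w for _ in range(out_h)]
        for i in range(out_h):
            for j in range(out_w):
                # Sum over every channel and every position in the receptive field.
                fmap[i][j] = sum(
                    volume[c][i + di][j + dj] * filt[c][di][dj]
                    for c in range(len(volume))
                    for di in range(size)
                    for dj in range(size)
                )
        maps.append(fmap)
    return maps

random.seed(0)
C, H, W, K, F = 3, 5, 5, 4, 2  # assumed sizes: 5x5x3 input, four 2x2x3 filters
volume = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
filters = [[[[random.uniform(-1, 1) for _ in range(F)] for _ in range(F)]
            for _ in range(C)] for _ in range(K)]

feature_maps = conv_layer(volume, filters)
output_shape = (len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0]))
print(output_shape)  # (4, 4, 4): K maps, each 4x4
```

Note that the output depth is set by the number of filters K, not by the depth of the input.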

In a normal, non-convolutional neural network, we would have ignored the adjacency between pixels. In a normal network, we would have connected every pixel in the input image to a neuron in the next layer. In doing so, we would not have taken advantage of the fact that pixels in an image are close together for a reason and have special meaning. By taking advantage of this local structure, our CNN learns to classify local patterns, like shapes and objects, in an image.
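One practical side effect of this local connectivity is the number of weights involved. The back-of-the-envelope sketch below compares the two approaches for an assumed 224 x 224 x 3 input with 64 output units or filters. The sizes are only illustrative, biases are ignored, and the function names are my own.

```python
def dense_weights(height, width, depth, neurons):
    """Every input value is connected to every neuron."""
    return height * width * depth * neurons

def conv_weights(filter_size, depth, num_filters):
    """Each filter is reused at every position of the image."""
    return filter_size * filter_size * depth * num_filters

# Assumed sizes for illustration only.
dense = dense_weights(224, 224, 3, 64)
conv = conv_weights(3, 3, 64)

print(dense)  # 9633792
print(conv)   # 1728
```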


Fully Connected layer

After a few iterations, we get a flattened matrix that represents our input image. On top of it, you can put your classifier. You have a representation where all the spatial information has been squeezed out, and only parameters that map to the content of the image remain. A “Fully Connected” layer is a standard, non-convolutional layer, where all inputs are connected to all output neurons. This is also referred to as a “dense” layer.
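A fully connected layer on top of flattened feature maps can be sketched as follows. The feature maps, weights and biases are made-up numbers and the helper names flatten and dense are assumptions; in the real demo these values would come from the trained network.

```python
def flatten(feature_maps):
    """Turn a stack of 2D feature maps into one flat list of values."""
    return [value for fmap in feature_maps for row in fmap for value in row]

def dense(inputs, weights, biases):
    """weights[o][i] connects input i to output neuron o."""
    return [
        sum(w * x for w, x in zip(w_row, inputs)) + b
        for w_row, b in zip(weights, biases)
    ]

# Two hypothetical 2x2 feature maps.
feature_maps = [
    [[1, 2], [3, 4]],
    [[0, 1], [1, 0]],
]
flat = flatten(feature_maps)

# Two output neurons, each connected to all 8 inputs.
weights = [
    [1, 0, 0, 0, 0, 0, 0, 1],
    [0.5] * 8,
]
biases = [0, -1]

scores = dense(flat, weights, biases)
print(flat)
print(scores)
```

Every output score depends on every input value, which is exactly what "fully connected" means.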

There is so much more we haven’t covered in this post, including the nonlinear and pooling layers as well as hyperparameters of the network such as filter sizes, stride, and padding. There are also techniques for improving a CNN, like:

  1. pooling
  2. 1×1 convolutions
  3. inception architecture

And I will try to make a part 2 asap to explain those confusing terms!


After all these steps, we have slid the filter horizontally and vertically across the image, like we’re painting it with a brush. On the output, we’ve drawn another image. It’s got a different width and a different height. More importantly, it’s got a different depth. Instead of just R, G, and B, you have an output that’s got many colour channels, K of them. This operation is called a convolution.

The general idea is that the layers will form a pyramid. At the bottom, you have this big image, but very shallow, just R, G, and B. You’re going to apply convolutions that progressively squeeze the spatial dimensions while increasing the depth, which corresponds roughly to the semantic complexity of your representation.
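The pyramid shape can be traced with a small calculation. The sketch below uses the standard output-size formula for a convolution, (size - filter + 2 x padding) / stride + 1, on an assumed stack of three layers. The input size, filter sizes, strides, padding and depths are all hypothetical and are not taken from the demo's network.

```python
def conv_output_size(size, filter_size, stride, padding):
    """Spatial size after a convolution along one dimension."""
    return (size - filter_size + 2 * padding) // stride + 1

# Assumed input: a 32x32 RGB image (height, width, depth).
shapes = [(32, 32, 3)]

# Assumed layers: (filter size, stride, padding, number of filters).
layers = [(3, 2, 1, 16), (3, 2, 1, 32), (3, 2, 1, 64)]

for filter_size, stride, padding, depth in layers:
    height, width, _ = shapes[-1]
    shapes.append((
        conv_output_size(height, filter_size, stride, padding),
        conv_output_size(width, filter_size, stride, padding),
        depth,
    ))

for shape in shapes:
    print(shape)
```

With these assumed settings the width and height halve at every layer while the depth doubles, which gives the pyramid described above.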

A simple CNN with 3 maxpool layers


Coding demo:

You may find the source code of this demo in this repo