Style Transfer Network

In this demo, I am going to use a pre-trained CNN model called VGG16 to implement a style transfer network, which carries the style of a style image over to an output image by means of transfer learning. The VGG model used here was trained on a dataset called ImageNet.

The ImageNet project is a large visual database designed for use in visual object recognition software research. As of 2016, over ten million URLs of images have been hand-annotated by ImageNet to indicate what objects are pictured; in at least one million of the images, bounding boxes are also provided. (Wiki)


VGG is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. The model achieves 92.7% top-5 test accuracy on ImageNet, which is a dataset of over 14 million images belonging to 1000 classes.

VGG16 architecture:

Reference Link:


And this is a subset to show you what ImageNet looks like.


Normally, when we use a pre-trained network, we freeze the feature extraction layers and fine-tune only the last fully connected layers, as in the image below. Style transfer is different. In this demo, we don’t really care about the prediction from the network, so we don’t need to connect the fully connected layers at the end of the network.


We’ll frame the style transfer task as an optimization problem, where we have some loss function that measures an error value that we want to minimize. Our loss function in this case can be decomposed into two parts:

  1. Content loss (based on what the CNN extracts from the picture: pixels > edges > object parts > object models)
  2. Style loss

We will initialize total loss to zero and add content and style loss to it. Then we run our output image and our reference image through the network.


Step 1: Calculating Content loss 

We’ll get a set of feature representations for both images. Then we will measure the Euclidean distance between them to calculate our loss.
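To make the distance concrete, here is a minimal NumPy sketch. It assumes the feature representations are already available as arrays of shape (height, width, channels); the random arrays below are hypothetical stand-ins for real VGG16 activations, and `content_loss` is a name I made up for illustration. It uses the squared Euclidean distance, which is a common choice, though the demo itself may scale it differently.

```python
import numpy as np

def content_loss(output_features, reference_features):
    """Squared Euclidean distance between two feature maps of the same shape."""
    diff = (np.asarray(output_features, dtype=np.float64)
            - np.asarray(reference_features, dtype=np.float64))
    return float(np.sum(diff ** 2))

# Hypothetical feature maps (height, width, channels), not real VGG16 output.
rng = np.random.default_rng(0)
reference_features = rng.normal(size=(4, 4, 8))
output_features = reference_features + 0.1   # every activation is off by 0.1

print(content_loss(reference_features, reference_features))  # identical features give 0.0
print(content_loss(output_features, reference_features))     # grows as the features drift apart
```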

The idea of distance is very useful in machine learning. It is central to an area called similarity learning.

Similarity learning is an area of supervised machine learning in artificial intelligence. It is closely related to regression and classification, but the goal is to learn from examples a similarity function that measures how similar or related two objects are. It has applications in ranking, in recommendation systems, visual identity tracking, face verification, and speaker verification. (Wiki)


Step 2: Calculating Style loss

We still pass both images through the network to observe their activations. But instead of comparing the raw activations directly like for content, we’ll add an extra step to measure the correlation of the activations. For both of our images we’ll take what’s called the Gram matrix of the activations at a given layer in the network. This will measure which features tend to activate together. It represents the tendency of features to co-occur in different parts of the image.
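Here is a small sketch of a Gram matrix, assuming the activations at one layer come as a (height, width, channels) array. The tiny hand-made feature map is hypothetical and only there to show the co-occurrence idea; `gram_matrix` is an illustrative helper name, not necessarily what the demo code uses.

```python
import numpy as np

def gram_matrix(features):
    """Return the (channels x channels) Gram matrix of a (H, W, C) feature map."""
    h, w, c = features.shape
    flat = features.reshape(h * w, c)   # one row per spatial position
    return flat.T @ flat                # entry (i, j): sum over positions of f_i * f_j

# Hypothetical 2x2 feature map with 3 channels.
features = np.zeros((2, 2, 3))
features[..., 0] = [[1, 0], [0, 1]]   # channel 0
features[..., 1] = [[1, 0], [0, 1]]   # channel 1 fires at the same positions as channel 0
features[..., 2] = [[0, 1], [1, 0]]   # channel 2 fires at the other positions

gram = gram_matrix(features)
print(gram)
```

Channels 0 and 1 activate together, so their off-diagonal entry is large, while channel 2 never overlaps with them and its entries against them are zero.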

Once we have this, we can define the style loss as a Euclidean distance between the Gram matrices for the reference image and output image.
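As a rough sketch of the style loss at a single layer, the code below takes the squared Euclidean distance between two Gram matrices. The integer feature maps are hypothetical stand-ins for real activations, the helper names are mine, and the normalization constant that many implementations divide by is left out on purpose to keep it short.

```python
import numpy as np

def gram_matrix(features):
    """(channels x channels) Gram matrix of a (H, W, C) feature map."""
    h, w, c = features.shape
    flat = features.reshape(h * w, c)
    return flat.T @ flat

def layer_style_loss(output_features, reference_features):
    """Squared Euclidean distance between the two Gram matrices (unnormalized)."""
    diff = gram_matrix(output_features) - gram_matrix(reference_features)
    return float(np.sum(diff ** 2))

# Hypothetical activations, kept integer-valued so the arithmetic is exact.
rng = np.random.default_rng(0)
reference = rng.integers(0, 5, size=(4, 4, 3)).astype(np.float64)
shuffled = reference[:, ::-1, :]   # same activations, mirrored left to right
rescaled = 2.0 * reference         # same layout, different activation strengths

print(layer_style_loss(shuffled, reference))   # 0.0: the Gram matrix ignores position
print(layer_style_loss(rescaled, reference))   # positive: the co-activation statistics changed
```

The mirrored feature map would have a large content loss but has zero style loss, which is why the Gram matrix is a reasonable way to capture style rather than layout.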


Step 3: Compute the total style loss as a weighted sum of the style loss at each layer
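The weighted sum can be sketched in a few lines. The per-layer losses and weights below are made-up numbers purely for illustration, and `total_style_loss` is a hypothetical helper name; the layers and weights the demo actually uses may differ.

```python
def total_style_loss(layer_losses, layer_weights):
    """Weighted sum of the per-layer style losses."""
    if len(layer_losses) != len(layer_weights):
        raise ValueError("need exactly one weight per layer loss")
    return sum(w * loss for w, loss in zip(layer_weights, layer_losses))

# Hypothetical style losses at three layers and their weights.
layer_losses = [4.0, 2.0, 1.0]
layer_weights = [0.5, 0.25, 0.25]

total = total_style_loss(layer_losses, layer_weights)
print(total)  # 0.5*4.0 + 0.25*2.0 + 0.25*1.0 = 2.75
```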

Reference Link:  Deep Learning for Computer Vision: Visualization (UPC 2016)

Step 4: Define gradients of the loss with respect to the output image

We need these gradients to iteratively minimize the loss, but this time we use them to update our output image, not the network's weights!
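To show the "update the image, not the weights" idea in isolation, here is a toy sketch. It is not VGG16: the frozen "network" is just a fixed random matrix (a loud assumption), so the gradient can be written by hand. In the real demo a deep learning framework would compute the gradient of the loss with respect to the image automatically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen network: features = weights @ image.
weights = rng.normal(size=(6, 4))
frozen_copy = weights.copy()              # kept only to confirm the weights never change
target_features = weights @ rng.normal(size=4)

image = np.zeros(4)                       # the "output image" we optimize

def loss_and_grad(image):
    """Squared feature distance and its gradient with respect to the image."""
    residual = weights @ image - target_features
    loss = float(residual @ residual)
    grad = 2.0 * weights.T @ residual     # d(loss)/d(image); the weights are constants
    return loss, grad

# Step size based on the largest singular value, so plain gradient descent is stable.
learning_rate = 0.5 / np.linalg.norm(weights, 2) ** 2

initial_loss, _ = loss_and_grad(image)
for _ in range(200):
    _, grad = loss_and_grad(image)
    image = image - learning_rate * grad  # update the image only
final_loss, _ = loss_and_grad(image)

print(initial_loss, final_loss)
```

The loss goes down while `weights` stays exactly as it started, which is the opposite of ordinary training, where the input is fixed and the weights move.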


Coding demo:

You may find the source code of this demo in this repo.