Image Captioning using Attention mechanism
You might say, "A dog is catching a frisbee." Someone else might say, "A dog and a child are playing with a frisbee." Both are absolutely correct! But imagine a computer producing this kind of caption for a given input image, as easily as a human does. This is one of the most interesting and challenging problems out there. In this blog I'm going to share a deep dive into how we can solve this problem using the advances in AI that we have today. Just follow me, let's go.
Agenda:
- Business problems
- Business problem to DL problem
- Business constraints/metrics
- Success metrics
- Data Collection
- Understanding the data
- Data Cleaning
- Data Preprocessing — Images, captions
- Data preparation
- Need for tf.data pipeline
- Model Architecture
- Inference using TensorFlow Lite
- It’s time to see the results
- Normal and TensorFlow Lite models (post-quantization comparison)
- Future work and Conclusion
- References
Business Problems:
First, we need to understand how important this problem is in real-world scenarios. Let's look at a few applications:
- Self-driving cars — Autonomous driving is one of the biggest challenges, and if we can properly caption the scene around the car, it can give a boost to the self-driving system.
- Aid to the blind — We can create a product for the blind that guides them while travelling on the roads without the support of anyone else. We can do this by first converting the scene into text and then the text into voice. Both are now well-known applications of Deep Learning. Refer to this link, where it's shown how Nvidia research is trying to create such a product.
- CCTV cameras are everywhere today, but along with viewing the world, if we can also generate relevant captions, then we can raise alarms as soon as there is some malicious activity going on somewhere. This could probably help reduce some crime and/or accidents.
- Automatic Captioning can help make Google Image Search as good as Google Search, as then every image could be first converted into a caption and then search can be performed based on the caption.
- There are many NLP applications right now which extract insights/summaries from given text data, essays, etc. In the same way, automatic captioning can provide such insights for images.
Business problem to DL problem:
Given an image, predict a description of it using a CNN encoder and an attention-based decoder.
Business metrics:
BLEU stands for Bilingual Evaluation Understudy. It is an algorithm that has been used for evaluating the quality of machine-translated text. We can use BLEU to check the quality of our generated captions (a small example of computing it follows the list below).
- BLEU is language independent
- Easy to understand
- It is easy to compute.
- It lies in [0, 1]; the higher the score, the better the quality of the caption.
- BLEU tells us how good our predicted caption is compared to the provided reference captions.
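As a minimal sketch, BLEU can be computed with NLTK as shown below; the reference and candidate captions are made-up examples, not from the dataset.

```python
# Hedged example: compute BLEU for one predicted caption against references.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog is catching a frisbee".split(),
    "a dog and a child are playing with a frisbee".split(),
]
candidate = "a dog catches a frisbee".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short captions
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")  # lies in [0, 1]; higher is better
```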
Success metrics:
We should define the success metrics separately from the evaluation metric.
So let's think from the user's perspective: as a user, I only care about how meaningful the description is and how quickly it is returned after I provide an input image. So my final goal before productionizing is to preserve these two things.
Since we're dealing with both images and text, we should also take care of memory consumption and CPU utilization while productionizing the model.
Here the problem is memory consumption and CPU utilization, so let's go into detail.
Let’s take one image
Please note that while modeling we pass not only the image but also the text. Let's assume one sentence has 6 words: then for a single image that sentence yields 6 data points on the way to the final generated description. We also know that each image has 5 sentences (refer: dataset), so one image contributes 6 * 5 = 30 data points. From this you can imagine how much memory would be consumed to load the entire dataset into RAM; even if we managed to load it, the system would slow down. To avoid this problem we will use tf.data, which we will discuss in a later part of the case study.
Business/deployment constraints:
- Low latency is required, because users can't wait long to see the results for a given image.
- Interpretability is pretty important: we should be able to explain why a particular word was generated from the image.
- Memory consumption/CPU utilization should be as low as possible for better image search or faster processing.
- The generated text should be semantically meaningful to the user.
- We should not miss the unique words for a given image; if we miss a word, the richness/meaning of the image is lost.
Data Collection:
There are many open-source datasets available for this problem, like Flickr 8k (containing 8k images), Flickr 30k (containing 30k images), MS COCO (containing 180k images), etc.
But for the purpose of this case study, I have used the Flickr 8k dataset, which you can download by filling in this form provided by the University of Illinois at Urbana-Champaign. Also, training a model on a large number of images may not be feasible on a system that is not a very high-end PC/laptop.
This dataset contains 8000 images each with 5 captions (as we have already seen in the Introduction section that an image can have multiple captions, all being relevant simultaneously).
These images are split as follows:
- Training Set — 6000 images
- Dev Set — 1000 images
- Test Set — 1000 images
Understanding the data:
If you have downloaded the data from the link that I have provided, then along with the images you will also get some text files related to them. One of the files is “Flickr8k.token.txt”, which contains the name of each image along with its 5 captions.
Thus every line contains the <image name>#i <caption>, where 0≤i≤4
i.e. the name of the image, caption number (0 to 4) and the actual caption.
I changed the data into this format for better understanding, stored it in a DataFrame called Data, and removed the unnecessary text from the caption data; a sketch of this step is below.
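Here is a minimal sketch of that step, assuming the "<image name>#i\t<caption>" layout described above; the column names and the cleaning rules are my own illustrative choices, not necessarily the original code.

```python
# Hedged sketch: parse Flickr8k.token.txt into a DataFrame and clean the captions.
import re
import pandas as pd

rows = []
with open("Flickr8k.token.txt") as f:
    for line in f:
        image_tag, caption = line.strip().split("\t")      # "<image>#i" and the caption
        image_name, _, caption_number = image_tag.partition("#")
        rows.append((image_name, int(caption_number), caption))

Data = pd.DataFrame(rows, columns=["image", "caption_number", "caption"])

def clean(text):
    text = text.lower()
    text = re.sub(r"[^a-z ]", "", text)                    # keep letters and spaces only
    return " ".join(w for w in text.split() if len(w) > 1)  # drop stray single characters

Data["caption"] = Data["caption"].apply(clean)
```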
Data preprocessing — images:
So we have all the images right now. Images are nothing but the input (X) to our model. As you may already know, any input to a model must be given in the form of a vector.
We need to convert every image into a fixed sized vector which can then be fed as input to the neural network. For this purpose, we opt for transfer learning by using the InceptionV3 model (Convolutional Neural Network) created by Google Research.
This model was trained on the ImageNet dataset to perform image classification on 1000 different classes of images. However, our purpose here is not to classify the image but just to get a fixed-length, informative vector for each image. This process is called automatic feature engineering.
Hence, we just remove the last softmax layer from the model and extract a 2048-length vector (bottleneck features) for every image.
We resize each image to the expected fixed size and then send it through the model to get the feature vector; a sketch of this step is below.
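A minimal sketch of this feature-extraction step is below; the image path is a placeholder, and the exact preprocessing in the original code may differ slightly.

```python
# Hedged sketch: extract a 2048-length bottleneck vector per image with InceptionV3
# (classification head removed, global average pooling applied).
import numpy as np
import tensorflow as tf

feature_extractor = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")   # output shape: (2048,)

def image_to_vector(path):
    img = tf.keras.preprocessing.image.load_img(path, target_size=(299, 299))
    x = tf.keras.preprocessing.image.img_to_array(img)
    x = tf.keras.applications.inception_v3.preprocess_input(x)      # scale to [-1, 1]
    return feature_extractor.predict(np.expand_dims(x, axis=0))[0]  # (2048,)

vector = image_to_vector("example.jpg")  # "example.jpg" is a placeholder path
print(vector.shape)                      # (2048,)
```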
Data preprocessing — Captions:
We must note that captions are something that we want to predict. So during the training period, captions will be the target variables (Y) that the model is learning to predict.
But the prediction of the entire caption, given the image does not happen at once. We will predict the caption word by word. Thus, we need to encode each word into a fixed sized vector.
So we will convert the captions into numeric form; a sketch of this step is below.
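A minimal sketch of this step with the Keras Tokenizer is shown below; the example captions and the padding choice are assumptions for illustration.

```python
# Hedged sketch: turn cleaned captions into padded sequences of word indices.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = ["startseq children at dog park endseq",
            "startseq a dog catches a frisbee endseq"]      # illustrative examples

tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(captions)

sequences = tokenizer.texts_to_sequences(captions)          # words -> indices
max_length = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_length, padding="post")
vocab_size = len(tokenizer.word_index) + 1                  # +1 for the padding index
```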
Data Preparation:
This is one of the most important steps in this case study. Here we will understand how to prepare the data in a manner which will be convenient to be given as input to the deep learning model.
Hereafter, I will try to explain the remaining steps by taking a small example as follows:
Now the questions that will be answered are: how do we frame this as a supervised learning problem?, what does the data matrix look like? how many data points do we have?, etc.
First we need to convert both images into their corresponding 2048-length feature vectors, as discussed above. Let “Image” denote such a feature vector.
Secondly, let’s build the vocabulary by adding the two tokens “startseq” and “endseq” in both of them: (Assume we have already performed the basic cleaning steps)
caption : startseq children at dog park endseq
Let’s give an index to each word in the vocabulary:
startseq-1, children-2, at-3, dog-4, park-5, endseq-6
In the first step we pass the image vector and the first word, and the model predicts the next word. In the next step we pass the image vector + first word + predicted word, and it gives the next word. We keep predicting words like this until the last word, i.e., endseq.
We must now understand that in every data point, it’s not just the image which goes as input to the system, but also, a partial caption which helps to predict the next word in the sequence.
Since we are processing sequences, we will employ a Recurrent Neural Network to read these partial captions (more on this later).
However, we have already discussed that we are not going to pass the actual English text of the caption, rather we are going to pass the sequence of indices where each index represents a unique word.
Since we have already created an index for each word, let's now replace the words with their indices and understand what the data matrix will look like (a sketch of this expansion is below):
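A minimal sketch of that expansion for the example caption is below; the toy word-to-index mapping mirrors the one given above (in practice the tokenizer builds it), and "Image" stands in for the 2048-length feature vector.

```python
# Hedged sketch: expand one (image, caption) pair into supervised data points:
# (image vector, partial caption) -> next word.
word_to_index = {"startseq": 1, "children": 2, "at": 3, "dog": 4, "park": 5, "endseq": 6}
caption = "startseq children at dog park endseq"
image_vector = "Image"  # placeholder for the 2048-length feature vector

seq = [word_to_index[w] for w in caption.split()]
data_points = []
for i in range(1, len(seq)):
    partial, next_word = seq[:i], seq[i]
    data_points.append((image_vector, partial, next_word))

# ('Image', [1], 2), ('Image', [1, 2], 3), ..., ('Image', [1, 2, 3, 4, 5], 6)
print(data_points)
```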
Need for a tf.data pipeline:
I hope this gives you a good sense as to how we can prepare the dataset for this problem. However, there is a big catch in this.
However, in our actual training dataset we have 6000 images, each having 5 captions. This makes a total of 30000 image-caption pairs.
Even if we assume that each caption is, on average, just 7 words long, this leads to a total of 30000 * 7, i.e. 210000 data points.
Size of the data matrix = n*m
Where n-> number of data points (assumed as 210000)
And m-> length of each data point
Clearly m= Length of image vector(2048) + Length of partial caption(x).
m = 2048 + x
But what is the value of x?
Well, you might think it is 34 (the maximum caption length), but wait, that's not quite right.
Every word (or index) will be mapped (embedded) to higher dimensional space through one of the word embedding techniques.
Later, during the model building stage, we will see that each word/index is mapped to a 200-long vector using embedding layer.
Now each sequence contains 34 indices, where each index is a vector of length 200. Therefore x = 34*200 = 6800
Hence, m = 2048 + 6800 = 8848.
Finally, size of the data matrix = 210000 * 8848 = 1,858,080,000 blocks.
Now even if we assume that one block takes 2 bytes, then to store this data matrix we will require more than 3 GB of main memory.
This is a pretty huge requirement, and even if we manage to load this much data into RAM, it will make the system very slow.
To avoid this, tf.data provides the prefetch transformation, which can be used to decouple the time when data is produced from the time when data is consumed. In particular, the transformation uses a background thread and an internal buffer to prefetch elements from the input dataset ahead of the time they are requested. The number of elements to prefetch should be equal to (or possibly greater than) the number of batches consumed by a single training step. You could either manually tune this value, or set it to tf.data.AUTOTUNE, which lets the tf.data runtime tune the value dynamically.
So how does using tf.data solve this problem?
If you know the basics of Deep Learning, then you must know that to train a model on a particular dataset, we use some version of Stochastic Gradient Descent (SGD) like Adam, Rmsprop, Adagrad, etc.
With SGD, we do not calculate the loss on the entire data set to update the gradients. Rather in every iteration, we calculate the loss on a batch of data points (typically 64, 128, 256, etc.) to update the gradients.
This means we do not need to store the entire dataset in memory at once; having just the current batch of points in memory is sufficient for our purpose.
To understand more about tf.data, please read here
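A minimal sketch of such a pipeline is below; image_names, padded_captions and the load_features helper (which reads a pre-computed .npy feature file per image) are illustrative placeholders, not the original code.

```python
# Hedged sketch: only image names and caption index sequences live in memory;
# feature vectors are loaded lazily per batch and batches are prefetched.
import numpy as np
import tensorflow as tf

image_names = ["img_001", "img_002"]                          # placeholders for real names
padded_captions = [[1, 2, 3, 4, 5, 6], [1, 4, 6, 0, 0, 0]]    # from the tokenizer step

def load_features(image_name, caption_seq):
    # Load the saved 2048-length feature vector for this image on demand.
    features = tf.numpy_function(
        lambda name: np.load(name.decode("utf-8") + ".npy").astype(np.float32),
        [image_name], tf.float32)
    features.set_shape([2048])        # shape information is lost through numpy_function
    return features, caption_seq

dataset = (tf.data.Dataset.from_tensor_slices((image_names, padded_captions))
           .map(load_features, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(buffer_size=1000)
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))   # overlap data loading with training
```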
Model Architecture:
This is the attention-based architecture: we have an encoder and a decoder connected through attention. When we pass the image to the encoder, it returns the featurization vectors. Based on the attention weights, a context vector is calculated, and from “startseq” + the context vector the decoder outputs the first word. In the next iteration we pass the generated first word along with the new context vector, and it outputs the second word. We keep predicting the next words like this until “endseq”.
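As a minimal sketch (assuming the encoder returns a set of feature vectors, one per spatial location, and the decoder carries a hidden state), a Bahdanau-style attention layer could look like this; the layer sizes are illustrative, not the trained model's.

```python
# Hedged sketch: additive (Bahdanau-style) attention over encoder features.
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects encoder features
        self.W2 = tf.keras.layers.Dense(units)   # projects decoder hidden state
        self.V = tf.keras.layers.Dense(1)        # scores each feature location

    def call(self, features, hidden):
        # features: (batch, num_locations, feature_dim), hidden: (batch, units)
        hidden_with_time = tf.expand_dims(hidden, 1)
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        attention_weights = tf.nn.softmax(score, axis=1)            # (batch, locations, 1)
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights

# Example: 64 locations with 2048-dim features, decoder hidden size 512 (assumed sizes).
attention = BahdanauAttention(units=512)
context, weights = attention(tf.random.normal((1, 64, 2048)), tf.random.normal((1, 512)))
```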
Inference using Tensorflow Lite model:
We will take the final trained encoder and decoder (attention) models and convert them into TensorFlow Lite models, which suit our low-latency and CPU constraints. After model quantization we will use the TFLite models for inference.
For more info about model quantization, please read here. A sketch of the conversion step is below.
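This is a minimal sketch of the conversion with post-training (dynamic-range) quantization; the tiny placeholder model only stands in for the trained encoder/decoder.

```python
# Hedged sketch: convert a Keras model to TensorFlow Lite with post-training quantization.
import tensorflow as tf

# Placeholder model so the snippet is self-contained; in the case study this
# would be the trained encoder or decoder.
decoder = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(2048,))])

converter = tf.lite.TFLiteConverter.from_keras_model(decoder)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enable post-training quantization
tflite_model = converter.convert()

with open("decoder.tflite", "wb") as f:
    f.write(tflite_model)
```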
So far, we have seen how to prepare the data and build the model. In this final step, we will understand how we test (infer with) our model by passing in new images, i.e. how we can generate a caption for a new test image.
How inference is performed using beam search:
Let's suppose the vocabulary size is 1000.
Iteration 1: we first send the image feature vector and “startseq” to the model. It computes a probability for each of the 1000 vocabulary words and keeps the top 3 predicted words (beam width = 3) along with their probabilities.
Iteration 2: for each of those 3 words it predicts the next word based on the previous words and the image vector, i.e. it scores 3 * 1000 = 3000 candidate continuations and again keeps the top 3 sequences.
It keeps predicting words like this until the final “endseq”, and then returns the predicted sentence with the maximum probability.
for more info about the beam search please watch this video
This is a simplified view of the model architecture, shown for understanding purposes.
The above step repeats until “endseq”, after which the final sentence is returned.
Beam search code implementation:
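Below is a minimal sketch of beam search over a next-word model; predict_next is a hypothetical function returning a probability distribution over the vocabulary given the image features and the partial sequence so far, and the default sizes are assumptions.

```python
# Hedged sketch: beam search that keeps the top-k partial captions by log-probability.
import numpy as np

def beam_search(image_features, predict_next, start_id, end_id, beam_width=3, max_len=34):
    # Each beam is (token_ids, cumulative log-probability).
    beams = [([start_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, log_prob in beams:
            if seq[-1] == end_id:                     # finished beams carry over unchanged
                candidates.append((seq, log_prob))
                continue
            probs = predict_next(image_features, seq)  # shape: (vocab_size,)
            top_ids = np.argsort(probs)[-beam_width:]  # best next-word candidates
            for idx in top_ids:
                candidates.append((seq + [int(idx)],
                                   log_prob + np.log(probs[idx] + 1e-12)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return max(beams, key=lambda c: c[1])[0]          # best sequence of token ids
```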
It’s time to see the results:
Taking some images from the test set:
predicted caption: a child in a pink dress is climbing up stairs in an entry way
predicted caption: a little girl is sitting in front of a large painted rainbow
predicted caption: man lays on a bench while his dog sits by him
Normal and TensorFlow Lite models (post-quantization comparison):
Future works and Conclusion:
- Working on larger datasets.
- Doing hyperparameter tuning and changing the architecture of the model.
- Implementing local attention mechanism.
For the complete code for this blog, please refer to my GitHub account.
Hope you liked this blog. Feel free to like and comment, and for more information, reach me on my LinkedIn profile below.
References: