Image Captioning using Attention mechanism
You might say, "A dog is catching a frisbee." Someone else might say, "A dog and a child are playing with a frisbee." Both are absolutely correct! But imagine a computer producing this kind of caption for a given input image, as easily as a human does. This is one of the most interesting and challenging problems out there. In this blog I'm going to share a deep dive into how we can solve this problem using the advances in AI that we have today. Just follow me, let's go.
Agenda:
- Business problems
- Business problem to DL problem
- Business constraints/metrics
- Success metrics
- Data Collection
- Understanding the data
- Data Cleaning
- Data Preprocessing — Images, captions
- Data preparation
- Need for tf.data pipeline
- Model Architecture
- Inference using TensorFlow Lite
- It’s time to see the results
- Normal and TensorFlow Lite models (post-quantization comparison)
- Future work and Conclusion
- References
Business Problems:
First, we need to understand how important this problem is in real-world scenarios. Let's look at a few applications:
- Self-driving cars — Autonomous driving is one of the biggest challenges, and if we can properly caption the scene around the car, it can give a boost to the self-driving system.
- Aid to the blind — We can create a product for the blind that guides them while travelling on the roads without the support of anyone else. We can do this by first converting the scene into text and then the text into voice. Both are now well-known applications of Deep Learning. Refer to this link, where it's shown how Nvidia research is trying to create such a product.
- CCTV cameras are everywhere today, but along with viewing the world, if we can also generate relevant captions, then we can raise alarms as soon as there is some malicious activity going on somewhere. This could probably help reduce some crime and/or accidents.
- Automatic Captioning can help make Google Image Search as good as Google Search, as then every image could be first converted into a caption and then search can be performed based on the caption.
- There are many NLP applications right now which extract insights/summaries from given text data, essays, etc. In the same way, automatic captioning can provide such insights for images.
Business problem to DL problem:
Given an image, predict a description of it using a CNN encoder and an attention-based decoder.
Business metrics:
BLEU stands for Bilingual Evaluation Understudy. It is an algorithm that has been used for evaluating the quality of machine-translated text. We can use BLEU to check the quality of our generated captions (a small example of computing it follows the list below).
- BLEU is language independent
- Easy to understand
- It is easy to compute.
- It lies in [0, 1]; the higher the score, the better the quality of the caption.
- BLEU tells us how good our predicted caption is compared to the provided reference captions.
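As a minimal sketch, BLEU can be computed with NLTK as shown below; the reference and candidate captions are made-up examples, not from the dataset.

```python
# Hedged example: compute BLEU for one predicted caption against references.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog is catching a frisbee".split(),
    "a dog and a child are playing with a frisbee".split(),
]
candidate = "a dog catches a frisbee".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short captions
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")  # lies in [0, 1]; higher is better
```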
Success metrics:
We should define the success metrics separately from the evaluation metric.
So let's think from the user's perspective: as a user, I only care about how meaningful the description is and how quickly it is returned after I provide an input image. So my final goal before productionizing is to preserve these two things.
Since we're dealing with both images and text, we should also take care of memory consumption and CPU utilization while productionizing the model.
Here the problem is memory consumption and CPU utilization, so let's go into detail.
Let’s take one image
Please note that while modeling we pass not only the image but also the text. Let's assume one sentence has 6 words: then for a single image that sentence yields 6 data points on the way to the final generated description. We also know that each image has 5 sentences (refer: dataset), so one image contributes 6 * 5 = 30 data points. From this you can imagine how much memory would be consumed to load the entire dataset into RAM; even if we managed to load it, the system would slow down. To avoid this problem we will use tf.data, which we will discuss in a later part of the case study.
Business/deployment constraints:
- Low latency is required, because users can't wait long to see the results for a given image.
- Interpretability is pretty important: we should be able to explain why a particular word was generated from the image.
- Memory consumption/CPU utilization should be as low as possible for better image search or faster processing.
- The generated text should be semantically meaningful to the user.
- We should not miss the unique words for a given image; if we miss a word, the richness/meaning of the image is lost.
Data Collection:
There are many open-source datasets available for this problem, like Flickr 8k (containing 8k images), Flickr 30k (containing 30k images), MS COCO (containing 180k images), etc.
But for the purpose of this case study, I have used the Flickr 8k dataset, which you can download by filling in this form provided by the University of Illinois at Urbana-Champaign. Also, training a model on a large number of images may not be feasible on a system that is not a very high-end PC/laptop.
This dataset contains 8000 images each with 5 captions (as we have already seen in the Introduction section that an image can have multiple captions, all being relevant simultaneously).
These images are split as follows:
- Training Set — 6000 images
- Dev Set — 1000 images
- Test Set — 1000 images
Understanding the data:
If you have downloaded the data from the link that I have provided, then along with the images you will also get some text files related to them. One of the files is “Flickr8k.token.txt”, which contains the name of each image along with its 5 captions.
Thus every line contains the <image name>#i <caption>, where 0≤i≤4
i.e. the name of the image, caption number (0 to 4) and the actual caption.
I changed the data into this format for better understanding, stored it in a DataFrame called Data, and removed the unnecessary text from the caption data; a sketch of this step is below.
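Here is a minimal sketch of that step, assuming the "<image name>#i\t<caption>" layout described above; the column names and the cleaning rules are my own illustrative choices, not necessarily the original code.

```python
# Hedged sketch: parse Flickr8k.token.txt into a DataFrame and clean the captions.
import re
import pandas as pd

rows = []
with open("Flickr8k.token.txt") as f:
    for line in f:
        image_tag, caption = line.strip().split("\t")      # "<image>#i" and the caption
        image_name, _, caption_number = image_tag.partition("#")
        rows.append((image_name, int(caption_number), caption))

Data = pd.DataFrame(rows, columns=["image", "caption_number", "caption"])

def clean(text):
    text = text.lower()
    text = re.sub(r"[^a-z ]", "", text)                    # keep letters and spaces only
    return " ".join(w for w in text.split() if len(w) > 1)  # drop stray single characters

Data["caption"] = Data["caption"].apply(clean)
```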
Data preprocessing — images:
So we have all the images right now. Images are nothing but the input (X) to our model. As you may already know, any input to a model must be given in the form of a vector.
We need to convert every image into a fixed sized vector which can then be fed as input to the neural network. For this purpose, we opt for transfer learning by using the InceptionV3 model (Convolutional Neural Network) created by Google Research.
This model was trained on the ImageNet dataset to perform image classification on 1000 different classes of images. However, our purpose here is not to classify the image but just to get a fixed-length, informative vector for each image. This process is called automatic feature engineering.
Hence, we just remove the last softmax layer from the model and extract a 2048-length vector (bottleneck features) for every image.
We resize each image to the expected fixed size and then send it through the model to get the feature vector; a sketch of this step is below.
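A minimal sketch of this feature-extraction step is below; the image path is a placeholder, and the exact preprocessing in the original code may differ slightly.

```python
# Hedged sketch: extract a 2048-length bottleneck vector per image with InceptionV3
# (classification head removed, global average pooling applied).
import numpy as np
import tensorflow as tf

feature_extractor = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")   # output shape: (2048,)

def image_to_vector(path):
    img = tf.keras.preprocessing.image.load_img(path, target_size=(299, 299))
    x = tf.keras.preprocessing.image.img_to_array(img)
    x = tf.keras.applications.inception_v3.preprocess_input(x)      # scale to [-1, 1]
    return feature_extractor.predict(np.expand_dims(x, axis=0))[0]  # (2048,)

vector = image_to_vector("example.jpg")  # "example.jpg" is a placeholder path
print(vector.shape)                      # (2048,)
```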
Data preprocessing — Captions:
We must note that captions are something that we want to predict. So during the training period, captions will be the target variables (Y) that the model is learning to predict.
But the prediction of the entire caption, given the image does not happen at once. We will predict the caption word by word. Thus, we need to encode each word into a fixed sized vector.
So we will convert the captions into numeric form; a sketch of this step is below.
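A minimal sketch of this step with the Keras Tokenizer is shown below; the example captions and the padding choice are assumptions for illustration.

```python
# Hedged sketch: turn cleaned captions into padded sequences of word indices.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = ["startseq children at dog park endseq",
            "startseq a dog catches a frisbee endseq"]      # illustrative examples

tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(captions)

sequences = tokenizer.texts_to_sequences(captions)          # words -> indices
max_length = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_length, padding="post")
vocab_size = len(tokenizer.word_index) + 1                  # +1 for the padding index
```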
Data Preparation:
This is one of the most important steps in this case study. Here we will understand how to prepare the data in a manner which will be convenient to be given as input to the deep learning model.
Hereafter, I will try to explain the remaining steps by taking a small example as follows:
Now the questions that will be answered are: how do we frame this as a supervised learning problem?, what does the data matrix look like? how many data points do we have?, etc.
First we need to convert both images into their corresponding 2048-length feature vectors, as discussed above. Let “Image” denote such a feature vector.
Secondly, let’s build the vocabulary by adding the two tokens “startseq” and “endseq” in both of them: (Assume we have already performed the basic cleaning steps)
caption : startseq children at dog park endseq
Let’s give an index to each word in the vocabulary:
startseq-1, children-2, at-3, dog-4, park-5, endseq-6
In the first step we pass the image vector and the first word, and the model predicts the next word. In the next step we pass the image vector + first word + predicted word, and it gives the next word. We keep predicting words like this until the last word, i.e., endseq.
We must now understand that in every data point, it’s not just the image which goes as input to the system, but also, a partial caption which helps to predict the next word in the sequence.
Since we are processing sequences, we will employ a Recurrent Neural Network to read these partial captions (more on this later).
However, we have already discussed that we are not going to pass the actual English text of the caption, rather we are going to pass the sequence of indices where each index represents a unique word.
Since we have already created an index for each word, let's now replace the words with their indices and understand what the data matrix will look like (a sketch of this expansion is below):
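A minimal sketch of that expansion for the example caption is below; the toy word-to-index mapping mirrors the one given above (in practice the tokenizer builds it), and "Image" stands in for the 2048-length feature vector.

```python
# Hedged sketch: expand one (image, caption) pair into supervised data points:
# (image vector, partial caption) -> next word.
word_to_index = {"startseq": 1, "children": 2, "at": 3, "dog": 4, "park": 5, "endseq": 6}
caption = "startseq children at dog park endseq"
image_vector = "Image"  # placeholder for the 2048-length feature vector

seq = [word_to_index[w] for w in caption.split()]
data_points = []
for i in range(1, len(seq)):
    partial, next_word = seq[:i], seq[i]
    data_points.append((image_vector, partial, next_word))

# ('Image', [1], 2), ('Image', [1, 2], 3), ..., ('Image', [1, 2, 3, 4, 5], 6)
print(data_points)
```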
Need for a tf.data pipeline:
I hope this gives you a good sense as to how we can prepare the dataset for this problem. However, there is a big catch in this.
However, in our actual training dataset we have 6000 images, each having 5 captions. This makes a total of 30000 image-caption pairs.
Even if we assume that each caption is, on average, just 7 words long, this leads to a total of 30000 * 7, i.e. 210000 data points.
Size of the data matrix = n*m
Where n-> number of data points (assumed as 210000)
And m-> length of each data point
Clearly m= Length of image vector(2048) + Length of partial caption(x).
m = 2048 + x
But what is the value of x?
Well, you might think it is 34 (the maximum caption length), but wait, that's not quite right.
Every word (or index) will be mapped (embedded) to higher dimensional space through one of the word embedding techniques.
Later, during the model building stage, we will see that each word/index is mapped to a 200-long vector using embedding layer.
Now each sequence contains 34 indices, where each index is a vector of length 200. Therefore x = 34*200 = 6800
Hence, m = 2048 + 6800 = 8848.
Finally, size of the data matrix = 210000 * 8848 = 1,858,080,000 blocks.
Now even if we assume that one block takes 2 bytes, then to store this data matrix we will require more than 3 GB of main memory.
This is a pretty huge requirement, and even if we manage to load this much data into RAM, it will make the system very slow.
To avoid this, tf.data provides the prefetch transformation, which can be used to decouple the time when data is produced from the time when data is consumed. In particular, the transformation uses a background thread and an internal buffer to prefetch elements from the input dataset ahead of the time they are requested. The number of elements to prefetch should be equal to (or possibly greater than) the number of batches consumed by a single training step. You could either manually tune this value, or set it to tf.data.AUTOTUNE, which lets the tf.data runtime tune the value dynamically.
So how does using tf.data solve this problem?
If you know the basics of Deep Learning, then you must know that to train a model on a particular dataset, we use some version of Stochastic Gradient Descent (SGD) like Adam, Rmsprop, Adagrad, etc.
With SGD, we do not calculate the loss on the entire data set to update the gradients. Rather in every iteration, we calculate the loss on a batch of data points (typically 64, 128, 256, etc.) to update the gradients.
This means we do not need to store the entire dataset in memory at once; having just the current batch of points in memory is sufficient for our purpose.
To understand more about tf.data, please read here
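A minimal sketch of such a pipeline is below; image_names, padded_captions and the load_features helper (which reads a pre-computed .npy feature file per image) are illustrative placeholders, not the original code.

```python
# Hedged sketch: only image names and caption index sequences live in memory;
# feature vectors are loaded lazily per batch and batches are prefetched.
import numpy as np
import tensorflow as tf

image_names = ["img_001", "img_002"]                          # placeholders for real names
padded_captions = [[1, 2, 3, 4, 5, 6], [1, 4, 6, 0, 0, 0]]    # from the tokenizer step

def load_features(image_name, caption_seq):
    # Load the saved 2048-length feature vector for this image on demand.
    features = tf.numpy_function(
        lambda name: np.load(name.decode("utf-8") + ".npy").astype(np.float32),
        [image_name], tf.float32)
    features.set_shape([2048])        # shape information is lost through numpy_function
    return features, caption_seq

dataset = (tf.data.Dataset.from_tensor_slices((image_names, padded_captions))
           .map(load_features, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(buffer_size=1000)
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))   # overlap data loading with training
```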
Model Architecture:
This is the attention-based architecture: we have an encoder and a decoder connected through attention. When we pass the image to the encoder, it returns the featurization vectors. Based on the attention weights, a context vector is calculated, and from “startseq” + the context vector the decoder outputs the first word. In the next iteration we pass the generated first word along with the new context vector, and it outputs the second word. We keep predicting the next words like this until “endseq”.
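As a minimal sketch (assuming the encoder returns a set of feature vectors, one per spatial location, and the decoder carries a hidden state), a Bahdanau-style attention layer could look like this; the layer sizes are illustrative, not the trained model's.

```python
# Hedged sketch: additive (Bahdanau-style) attention over encoder features.
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects encoder features
        self.W2 = tf.keras.layers.Dense(units)   # projects decoder hidden state
        self.V = tf.keras.layers.Dense(1)        # scores each feature location

    def call(self, features, hidden):
        # features: (batch, num_locations, feature_dim), hidden: (batch, units)
        hidden_with_time = tf.expand_dims(hidden, 1)
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        attention_weights = tf.nn.softmax(score, axis=1)            # (batch, locations, 1)
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights

# Example: 64 locations with 2048-dim features, decoder hidden size 512 (assumed sizes).
attention = BahdanauAttention(units=512)
context, weights = attention(tf.random.normal((1, 64, 2048)), tf.random.normal((1, 512)))
```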
Inference using Tensorflow Lite model:
We will take the final trained encoder and decoder (attention) models and convert them into TensorFlow Lite models, which suit our low-latency and CPU constraints. After model quantization we will use the TFLite models for inference.
For more info about model quantization, please read here. A sketch of the conversion step is below.
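This is a minimal sketch of the conversion with post-training (dynamic-range) quantization; the tiny placeholder model only stands in for the trained encoder/decoder.

```python
# Hedged sketch: convert a Keras model to TensorFlow Lite with post-training quantization.
import tensorflow as tf

# Placeholder model so the snippet is self-contained; in the case study this
# would be the trained encoder or decoder.
decoder = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(2048,))])

converter = tf.lite.TFLiteConverter.from_keras_model(decoder)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enable post-training quantization
tflite_model = converter.convert()

with open("decoder.tflite", "wb") as f:
    f.write(tflite_model)
```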
So far, we have seen how to prepare the data and build the model. In this final step, we will understand how we test (infer with) our model by passing in new images, i.e. how we can generate a caption for a new test image.
How inference is performed using beam search:
Let's suppose the vocabulary size is 1000.
Iteration 1: we first send the image feature vector and “startseq” to the model. It computes a probability for each of the 1000 vocabulary words and keeps the top 3 predicted words (beam width = 3) along with their probabilities.
Iteration 2: for each of those 3 words it predicts the next word based on the previous words and the image vector, i.e. it scores 3 * 1000 = 3000 candidate continuations and again keeps the top 3 sequences.
It keeps predicting words like this until the final “endseq”, and then returns the predicted sentence with the maximum probability.
for more info about the beam search please watch this video
This is a simplified view of the model architecture, shown for understanding purposes.
The above step repeats until “endseq”, after which the final sentence is returned.
Beam search code implementation:
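Below is a minimal sketch of beam search over a next-word model; predict_next is a hypothetical function returning a probability distribution over the vocabulary given the image features and the partial sequence so far, and the default sizes are assumptions.

```python
# Hedged sketch: beam search that keeps the top-k partial captions by log-probability.
import numpy as np

def beam_search(image_features, predict_next, start_id, end_id, beam_width=3, max_len=34):
    # Each beam is (token_ids, cumulative log-probability).
    beams = [([start_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, log_prob in beams:
            if seq[-1] == end_id:                     # finished beams carry over unchanged
                candidates.append((seq, log_prob))
                continue
            probs = predict_next(image_features, seq)  # shape: (vocab_size,)
            top_ids = np.argsort(probs)[-beam_width:]  # best next-word candidates
            for idx in top_ids:
                candidates.append((seq + [int(idx)],
                                   log_prob + np.log(probs[idx] + 1e-12)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return max(beams, key=lambda c: c[1])[0]          # best sequence of token ids
```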
It’s time to see the results:
Taking some images from the test set:
predicted caption: a child in a pink dress is climbing up stairs in an entry way
predicted caption: a little girl is sitting in front of a large painted rainbow
predicted caption: man lays on a bench while his dog sits by him
Normal and TensorFlow Lite models (post-quantization comparison):
Future works and Conclusion:
- Working on larger datasets.
- Doing hyperparameter tuning and changing the architecture of the model.
- Implementing local attention mechanism.
For the complete code for this blog, please refer to my GitHub account.
Hope you liked this blog. Feel free to like and comment, and for more information, reach me on my LinkedIn profile below.
References: