Introduction

Pictures convey information better than words alone; hence the adage, 'A picture is worth a thousand words'. Many of us enjoy reading a comic more than a novel printed in black-and-white text.
Pictures and images also facilitate communication between two people who do not share a common language. With this in mind, this project aims to convert a given text input into a sequence of images that conveys the same message. How enjoyable it would be if you could convert your favorite novel into a comic automatically! This tool would also be valuable in translation, as the image-domain representation could serve as an intermediate stage when converting between two languages.
It is necessary to represent the text and the image in a format in which the two can be compared. Let us assume that both the images and the text are represented as vectors. If an image and a piece of text convey the same meaning, then the word-vector and the image-vector should be close. For example, suppose the word 'cat' has word-vector v, the picture of a cat has image-vector i, and the picture of a table has image-vector t. Then ||v - i|| must be smaller than ||v - t||. In essence, the mapping function between word-vectors and image-vectors needs to be learned in order to represent a piece of text using a set of images.
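The nearest-neighbour comparison above can be sketched in a few lines. The vectors here are hand-picked toy values, not learned embeddings; they only illustrate the property that a matching image should lie closer to the word than a mismatched one.

```python
import numpy as np

# Hypothetical toy vectors standing in for learned embeddings.
v_cat_word = np.array([0.9, 0.1, 0.3])    # word-vector v for "cat"
i_cat_img = np.array([0.8, 0.2, 0.25])    # image-vector i of a cat picture
i_table_img = np.array([0.1, 0.9, 0.7])   # image-vector t of a table picture

def dist(a, b):
    """Euclidean distance ||a - b||."""
    return float(np.linalg.norm(a - b))

# A matching (word, image) pair should be closer than a mismatched one:
# ||v - i|| < ||v - t||.
assert dist(v_cat_word, i_cat_img) < dist(v_cat_word, i_table_img)
```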
Google's word2vec is a deep-learning-inspired method that focuses on learning the meaning of words. It attempts to grasp the meanings of words and the semantic relationships between them. Word2vec is a neural net that takes raw text as input and converts it into vectors without any human intervention. The accuracy of the tool increases with the size of the training data. With sufficient training, it is capable of learning relationships in the form of analogies. For example, the operation V(king) - V(man) + V(woman) results in V(queen), where V( ) stands for the vector representation of a word. We trained using the Freebase data set, which contains 100 billion words taken from various news articles.
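The analogy arithmetic can be demonstrated with hand-crafted stand-in vectors. The 4-dimensional embeddings below are hypothetical (roughly encoding royalty and gender), chosen only so the arithmetic works out; real word2vec vectors are learned from the corpus.

```python
import numpy as np

# Hypothetical toy embeddings; real word2vec vectors are learned, not hand-set.
vecs = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 1.0]),
    "woman": np.array([0.0, 0.0, 1.0, 1.0]),
}

# V(king) - V(man) + V(woman) should land nearest to V(queen).
target = vecs["king"] - vecs["man"] + vecs["woman"]
nearest = min(vecs, key=lambda w: float(np.linalg.norm(vecs[w] - target)))
print(nearest)  # → queen
```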
The ImageNet LSVRC-2010 contest required classifying 1.2 million high-resolution images into 1000 classes. The deep convolutional network trained by A. Krizhevsky et al., with 5 convolutional layers and 3 fully-connected layers, achieved one of the best results. The output of the penultimate layer, a vector of length 1000, is used as the vector representation of the image. We used the Flickr 8k data set, in which every image has 5 captions describing it. The images were passed through the net to obtain their image-vector representations.
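A minimal sketch of the feature-extraction idea, assuming a tiny fully-connected stand-in for the trained network (random weights here; the real network's weights come from ImageNet training, and the input would be the upstream convolutional features rather than a random vector): the image-vector is the penultimate layer's activations, while the final classification layer is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the last two layers of a trained net (weights are random
# here purely for illustration).
W_penult = rng.standard_normal((1000, 4096)) * 0.01  # penultimate layer
W_final = rng.standard_normal((10, 1000)) * 0.01     # final classifier layer

def image_vector(x):
    """Forward pass that returns the penultimate-layer activations
    (a length-1000 vector) as the image representation."""
    h = np.maximum(W_penult @ x, 0.0)  # ReLU activations, length 1000
    _logits = W_final @ h              # classifier output, unused for embedding
    return h

x = rng.standard_normal(4096)          # stand-in for upstream features
vec = image_vector(x)
assert vec.shape == (1000,)
```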
Mapping the vector representations:
In our work, word2vec represents each word as a 200-dimensional vector, whereas the image-vector is a 1000-dimensional vector, as mentioned before. We therefore need to map the two representations into a common embedding space where they can be compared. This is done using a siamese neural network, shown in the figure below.
|Structure of the siamese neural network to learn the mapping between the image and word vectors|
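A minimal sketch of the siamese mapping, assuming one linear-plus-tanh projection branch per modality and a contrastive-style loss (the branch sizes match the 200-d word and 1000-d image vectors above, but the 50-d shared space, the weights, and the loss form are illustrative assumptions, not the exact trained architecture):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical projection branches into a shared 50-d embedding space.
W_word = rng.standard_normal((50, 200)) * 0.05    # word branch
W_image = rng.standard_normal((50, 1000)) * 0.05  # image branch

def embed_word(v):
    return np.tanh(W_word @ v)

def embed_image(i):
    return np.tanh(W_image @ i)

def contrastive_loss(v, i, same, margin=1.0):
    """Pull matching (word, image) pairs together; push mismatched
    pairs at least `margin` apart in the shared space."""
    d = float(np.linalg.norm(embed_word(v) - embed_image(i)))
    return d ** 2 if same else max(0.0, margin - d) ** 2

# Random stand-in vectors for one matching and one mismatched pair.
v = rng.standard_normal(200)
i_pos = rng.standard_normal(1000)
i_neg = rng.standard_normal(1000)
loss = contrastive_loss(v, i_pos, same=True) + contrastive_loss(v, i_neg, same=False)
assert loss >= 0.0
```

Training would minimize this loss over many (word, image) pairs so that, after training, a text query can be matched to images by nearest-neighbour search in the shared space.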
Future work:
- Training on a larger data set for better accuracy
- Incorporating more information about the text to generate a sequence of images
- Developing a quantifiable measure that evaluates the accuracy of the image representation of the given text