Tuesday, 21 April 2015

Comic Polyglot - Team 42

Comic Polyglot

Everyone loves comics. Wouldn't it be wonderful if we could read all the amazing comics published in languages like Japanese, Spanish, French and German in English or Hindi? Periodical comics in different languages come out every other day or week, but they stay inaccessible for a long time because of the intensive manual translation effort required. This manual process involves teams of people working separately on text extraction, translation and infusion for every line of text in a comic. Our project attempts to automate this process and make comics available in other languages within a day of the original raw release, with minimal human effort. The hurdles we face include automating complex processes such as text detection, extraction, removal, optical character recognition, context-based language translation and text infusion. On completion, this project has the potential to disrupt the scanlation process and the fan community, and to affect the comic industry.

To create such a system, we need to break it down into smaller systems and attempt to solve them individually. The first hurdle that we face is text extraction, and the majority of our time in the Winter School was spent tackling this problem.

Various stages involved in approaching the problem:
  • Text Detection
  • Text Extraction
  • Translation
  • Text Infusion

Comics Survey:

Comics available today come in various formats, such as:
  • Simple black and white manga, with text characters placed all over the page. These generally have kanji-like characters in text boxes and speech/thought bubbles, but text is not limited to these areas. Mostly Japanese or Korean.
  • Colored comics that have text only in the speech/thought bubbles. The text is generally black and the bubble is white and roughly oval/elliptical in shape.
  • Colored comics that can have text placed anywhere inside the comic frames.

Initial Attempt:

During the Winter School, we initially attempted to extract Japanese characters from speech bubbles in a manga image set that we created. After discussing with the professors, we attempted to detect bubbles in three different ways:
  • Radon Transform 
  • Hough Transform
  • Condensation Algorithm


The Radon transform computes projections of an image matrix along specific directions by taking line integrals of the image (object/function) along a beam of parallel rays.

The value of these integrals is represented by a function R_θ(x'), where θ is the angle of inclination of the source/destination axis from the reference axis and x' is the distance of a ray (in the parallel beam) from the origin.

For the Radon transform we compute such curves for each angle from 0 to 180 degrees. This gives a heat map representing how the integrals change over the angles, which provides information about the object, such as its shape.
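
As a rough illustration of the projections described above, the following pure-Python sketch sums pixel intensities along parallel rays for one angle at a time. The names `radon_projection` and `make_disc`, the nearest-bin accumulation and the toy disc image are assumptions made for this example; they are not taken from the project's code.

```python
import math

def radon_projection(image, angle_deg):
    """Sum pixel values along parallel rays at one angle (nearest-bin approximation)."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0          # origin at the image centre
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half = int(math.ceil(math.hypot(h, w) / 2.0))  # largest possible |x'|
    bins = [0.0] * (2 * half + 1)
    for y in range(h):
        for x in range(w):
            # x' is the signed distance of this pixel's ray from the origin
            x_prime = (x - cx) * cos_t + (y - cy) * sin_t
            bins[int(round(x_prime)) + half] += image[y][x]
    return bins

def make_disc(size, radius):
    """Toy test image: a filled (solid) disc centred in a size x size grid."""
    c = (size - 1) / 2.0
    return [[1.0 if (x - c) ** 2 + (y - c) ** 2 <= radius ** 2 else 0.0
             for x in range(size)] for y in range(size)]

disc = make_disc(21, 6)
# One projection per angle; stacking them gives the heat map (sinogram).
sinogram = [radon_projection(disc, angle) for angle in range(0, 180, 45)]
```

For the centred solid disc, the projections at 0 and 90 degrees are identical and peak in the middle bin, which matches the straight band described for the centred circle.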

Here are some of the outputs obtained:

Circle at centre: since a circle is symmetric about every axis through its centre, the projection is the same at every angle, so the heat map is a straight band with high intensity in the middle that falls off gradually towards the ends.

Shifted circle: the shift gradually distorts the band, but the structure remains more or less the same.

Square:

Comic bubble: bubbles in comics are approximately circles with some distortions.

As we can see, the Radon transform detects circles very effectively, but only if they are completely filled, i.e. solid shapes as in the pictures above. Solid shapes are not the rule in comics.

Although this made the approach nearly unusable for practical bubble detection, we gave it one more try by applying the Radon transform to integral images, as suggested by our mentor Prof. Bhiksha. This also failed, as the Radon transform of integral images could not be narrowed down to a specific format/shape.

While investigating the above, we found a paper on text detection in natural environments [1]. We also found a MATLAB implementation [2] that was extremely helpful in getting us off the ground. We made some changes to the methods followed, which can be seen in the diagram below.

Text Detection Pipeline

[Figure: text detection pipeline diagram (mser - New Page.png)]
Text Extraction Pipeline
The image is read and converted to grayscale, which reduces its dimensionality.

Next we detect blobs in the image using Maximally Stable Extremal Regions (MSER). Blobs here are regions of similar intensity, which suits text because all characters have the same color/intensity. Text is distinguished from the background by the area of each region, since characters are all roughly the same size. Experimentally, we found that the area of such regions lies between 10 and 2000. The maximum is quite high because, during MSER detection, small regions lying close together form a bigger blob; in a comic, most of the text in one bubble ends up in the same MSER.
The drawback of MSER is that, along with most of the text, it also detects non-text parts.

To improve the result, we also run Canny edge detection on the initial grayscale image. The Canny edge detector finds all edges present in the image, i.e. text as well as non-text parts. This output is then masked with the previous MSER output to keep the part that is most significant in our context. Canny edges are very thin, so to make them significant (turn them into regions) we grow the edges relative to the gradient (gradient-grown edges). The result contains regions representing text as well as non-text.

Subtracting the output of this procedure from the MSER output leaves all of the text with significantly less non-text.

Filtering text and non-text parts based on connected component analysis.
The connected components obtained from the above analysis are filtered on three properties: area, eccentricity and solidity. Experimentally, we found that in the domain of comics these properties are governed by the following conditions:
  • Area lies between 10 and 2000
  • Eccentricity of a component that contains text is greater than 0.995
  • Solidity is less than 0.4
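
To make the connected component step concrete, here is a minimal pure-Python sketch of the area criterion only. Eccentricity and solidity are left out because they need region-shape measurements that are beyond a short example. The 4-connectivity, the function names and the toy mask are assumptions for illustration.

```python
from collections import deque

def connected_components(mask):
    """Group foreground pixels into 4-connected components (lists of (row, col))."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:  # breadth-first flood fill from the seed pixel
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components

def filter_by_area(components, min_area=10, max_area=2000):
    """Keep components whose pixel count lies in the reported 10 to 2000 range."""
    return [c for c in components if min_area <= len(c) <= max_area]

# Toy mask: one 3 x 4 block (area 12) and one isolated speck (area 1).
mask = [[0] * 8 for _ in range(8)]
for row in range(1, 4):
    for col in range(1, 5):
        mask[row][col] = 1
mask[6][6] = 1

kept = filter_by_area(connected_components(mask))
```

On the toy mask, the block survives the filter while the single-pixel speck is dropped as too small.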

Characters have a nearly constant stroke width, so we use this property to filter the components further. We apply a stroke width transform to every filtered connected component. If the variation of the stroke width within a connected component is greater than a threshold, the component is eliminated. We experimented with different stroke width variation thresholds on our dataset and found the optimal value to be 0.36.
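
The stroke width check can be sketched as follows. This assumes the variation is measured as standard deviation divided by mean, which is how the MathWorks example [2] defines its stroke width metric; the sample width lists are made-up values, and computing the stroke widths themselves is not shown.

```python
from statistics import mean, pstdev

MAX_STROKE_WIDTH_VARIATION = 0.36  # threshold reported in the text

def stroke_width_variation(stroke_widths):
    """Return std / mean of the stroke widths measured inside one component."""
    return pstdev(stroke_widths) / mean(stroke_widths)

def keep_component(stroke_widths, threshold=MAX_STROKE_WIDTH_VARIATION):
    """A component is kept as text only if its stroke width is nearly constant."""
    return stroke_width_variation(stroke_widths) <= threshold

letter_like = [3, 3, 4, 3, 3, 4, 3]     # hypothetical widths, nearly constant
texture_like = [1, 9, 2, 14, 3, 11, 1]  # hypothetical widths, highly variable
```

With these toy values, `keep_component(letter_like)` is true and `keep_component(texture_like)` is false.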

To extract the text-containing regions from the image, we apply a morphological transform to combine nearby regions and then construct a bounding box around each combined region to extract that part of the image.

Support Vector Machine (SVM)

The output of the text extraction pipeline contained all the text patches and a huge number of non-text patches, so we decided to design an SVM classifier to separate text from non-text patches. We trained an SVM using HOG features of the image patches. With an RBF kernel it gave poor results (accuracy was 44%). In our case the number of instances was far smaller than the number of features, so it was better to use the liblinear Python library, which provides a linear kernel, consumes less time and memory, and produced better accuracy. Training on 277 images and testing on 69 images gave an accuracy of 79% with HOG features (2 × 2). When trained with 4500 images, accuracy was 98%. One important caveat is that images containing both text and non-text were removed from that dataset; as a next step, we will count images that contain both text and non-text as part of the positive dataset.

Image Cleaning

This step processes a scanned page of text to clean the background behind the text.

The script works as follows [3]:
  • Optionally, crops the image
  • Optionally, converts to grayscale
  • Optionally, enhances the image to stretch or normalize
  • Creates a grayscale version of the enhanced image and applies a local area threshold using -lat and, optionally, some blurring for antialiasing to create a mask
  • Composites the mask with the corrected image to make the background white
  • Optionally, unrotates, sharpens, changes the saturation, trims and pads

Text Extraction

Optical Character Recognition: to retrieve text from images we used Tesseract OCR, one of the best available OCR engines. Tesseract works well on standard printed text, but on comics, where the fonts are cursive, it did not give very good results.

The following figure shows the output of OCR run on 60 image patches. The darker the colour, the more precisely the characters were recognized.

As observed from above, at some places OCR gave a very poor result.

To correct this, we used the Levenshtein distance to find the nearest match for each word in a collection of around 90,000 Spanish words, which we collected from a CMU dataset. This let us correct most of the words misdetected by OCR.
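
The correction step can be illustrated with a small sketch. The dynamic-programming edit distance below is the standard Levenshtein algorithm; `correct_word` and the four-word vocabulary are stand-ins for this example (the real word list had around 90,000 entries), and the linear scan is chosen for clarity rather than speed.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def correct_word(word, vocabulary):
    """Return the vocabulary entry with the smallest edit distance to word."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))

vocabulary = ["hola", "mundo", "gracias", "amigo"]  # tiny stand-in word list
corrected = [correct_word(w, vocabulary) for w in ["grac1as", "mund0"]]
```

Here the OCR-style errors "grac1as" and "mund0" map back to "gracias" and "mundo".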


Text detected by OCR was translated into the specified language using Google Translate.

Text Infusion

Text infusion refers to removing the original text and replacing it with the translated text.

The problem we faced here was that the removed patches were rectangular and contained background as well as text. If we simply remove the patch and place the new text, the result looks very odd. So we had to remove the text from the patch, not the background.

To handle this, we used the property of comic images that text is always black. We took the HSV transform of the image and separated all the black parts. We then applied connected component analysis again (as in the text detection pipeline) with finer values. Finally, we merged the reduced black part (mostly lines) with the colored part to get the patch with its text removed.
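
Separating the black parts in HSV space might look like the sketch below. It uses only the value (V) channel; the 0.25 cut-off, the function name and the toy patch are assumptions for illustration, not values from our experiments.

```python
import colorsys

def black_text_mask(rgb_image, max_value=0.25):
    """Mark pixels whose HSV value (brightness) is low enough to count as black."""
    mask = []
    for row in rgb_image:
        mask_row = []
        for r, g, b in row:
            # colorsys expects channels in the range 0..1
            _, _, value = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            mask_row.append(value <= max_value)
        mask.append(mask_row)
    return mask

# Toy 2 x 3 patch: white bubble, dark text pixels, and two colored pixels.
patch = [
    [(255, 255, 255), (10, 10, 10), (255, 255, 255)],
    [(200, 30, 30), (0, 0, 0), (250, 240, 90)],
]
mask = black_text_mask(patch)
```

Only the two dark pixels in the middle column are marked, so saturated colors such as the dark red are left with the colored part.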

We then placed the translated text in this patch and put the patch back into the image to get the translated comic.


Neural Network based Classifier

We worked on developing an artificial neural network (ANN) based system to recognize the text and non-text regions in a given comic image. The idea was to extract the text from the comic image, run OCR on it, use an appropriate language translation API to convert the detected text into English, and put the translated text back into the image at its appropriate location. The ANN was trained to classify patches into two categories. Currently we are working on training our network on images of Spanish comics.

Initial Approach to Training:

Most of the text in a comic strip is inside some kind of bubble, so our initial approach was to detect the bubbles in the image, so that at a later stage of the pipeline the text could be extracted from them by running OCR.

Dataset Preparation:

A positive dataset of 16×16 patches, each containing some part of a bubble, was built from ground-truth images of the variety of bubbles found in comic images; the aim was to detect the lines that are part of a bubble. The negative dataset contains patches of the same dimensions taken from other portions of the image that have no bubble in them.

Training & Accuracy:

Our ANN was trained on these patches, with each patch of the positive dataset labelled ‘1’ and each patch of the negative dataset labelled ‘0’. Network training reached an accuracy above 90% most of the time. We built the neural network with the Caffe framework [4], developed by researchers at UC Berkeley, and ran the final trained Caffe model through its Python wrapper. The results obtained were below the acceptable level, which might be due to the fact that the ground-truth images of the bubbles were not true replicas of the bubbles in the comic images: the neighbourhood of a 16×16 patch in a comic image was totally different from that in our ground-truth images. We therefore came up with a different approach that detects the text directly in the image.

Current Approach:

Currently we are working on building a robust feature model for each patch. The aim is to detect the text in comic images directly, rather than detecting bubbles as in the previous approach.

Dataset Preparation:

Text written inside bubbles in comic images was cropped out to create ground-truth text images that form our positive dataset; the rest of the comic strip was used to form the negative dataset.

A feature model is prepared by taking 16×16 non-intersecting patches from the text images as well as from the non-text images. Each patch is processed to extract HOG features and LBP features, which give a one-dimensional 1×36 HOG feature vector and a 1×36 LBP feature vector respectively.

The feature model is a 6×6 2-D matrix with 2 channels: the first channel holds the HOG features arranged in 6×6 form and the second channel holds the LBP features of the same patch arranged in 6×6 form. In this way a two-channel 6×6 feature model is prepared for each 16×16 patch from both classes, i.e. for every patch of the positive and the negative dataset. The feature models are labelled ‘1’ for a positive patch and ‘0’ for a negative patch. The idea is to come up with a robust feature model that also captures information about neighboring pixels.
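
The arrangement of the two 36-element vectors into a two-channel 6×6 model can be sketched as below. The row-major ordering, the function names and the placeholder feature values are assumptions; extracting the HOG and LBP features themselves is not shown.

```python
def to_grid(vector, side=6):
    """Reshape a flat vector of side*side values into a side x side grid (row-major)."""
    if len(vector) != side * side:
        raise ValueError("expected %d values, got %d" % (side * side, len(vector)))
    return [list(vector[r * side:(r + 1) * side]) for r in range(side)]

def build_feature_model(hog_vector, lbp_vector):
    """Stack HOG (channel 1) and LBP (channel 2) grids into a 2 x 6 x 6 model."""
    return [to_grid(hog_vector), to_grid(lbp_vector)]

# Placeholder 1x36 feature vectors standing in for real HOG / LBP output.
hog = [i / 36.0 for i in range(36)]
lbp = [float(i % 4) for i in range(36)]

model = build_feature_model(hog, lbp)
sample = (model, 1)  # label 1 marks a positive (text) patch, 0 a negative one
```

Each `(model, label)` pair is then one training example for the network.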

Training & Accuracy:

A convolutional neural network is being trained on these 6×6 two-channel feature models with their assigned labels. The currently observed accuracy is above 50%, which we hope to improve by using more features per patch during training and by improving our dataset selection.

[Figure: Untitled Diagram.jpg]

We are The Team 42:

  1. Akshay Dixit - VIT (akshaydixi@gmail.com)
  2. Gaurav Bansal - IIITA (gauravbansal9999@gmail.com)
  3. Aman Raj - DTU (amanraj9993@gmail.com)
  4. Priyanka Selvan - MSRIT (priyanka.nessie@gmail.com)
  5. Harshvardhan Solanki - IITK (harsh.vardhan0744@gmail.com)
  6. Farhat Abbas - NITS (abbasfarhat47@gmail.com)


[1] Xu-Cheng Yin, Xuwang Yin, Kaizhu Huang, Hong-Wei Hao. Robust Text Detection in Natural Scene Images. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 36, no. 5, pp. 970-983, 2014.

[2] MathWorks. Automatically Detect and Recognize Text in Natural Images. http://in.mathworks.com/help/vision/examples/automatically-detect-and-recognize-text-in-natural-images.html

[4] Caffe. http://caffe.berkeleyvision.org/

Wednesday, 4 March 2015

Story-board Problem


Pictures convey information better than words, hence the adage 'A picture is worth a thousand words'. Many of us enjoy reading comics more than reading a novel in black and white letters.
Pictures also help two people who do not share a common language to communicate. With this in mind, this project aims to convert a given text input into a sequence of images that conveys the same message. How enjoyable would it be if you could convert your favorite novel into a comic automatically? The tool would also be useful in translation, as the image-domain representation could serve as an intermediate stage when converting between two languages.


It is necessary to represent the text and the image in a format such that the two can be compared. Let us assume that both images and text are represented as vectors. If an image and a text convey the same meaning, then the word-vector and the image-vector need to be close. For example, suppose the word 'cat' has a word-vector v, the picture of a cat has an image-vector i and the picture of a table has an image-vector t. Then ||v - i|| must be smaller than ||v - t||. In other words, a mapping between word-vectors and image-vectors needs to be learnt in order to represent a piece of text using a set of images.
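
The distance criterion above can be checked on toy data. The three-dimensional vectors below are invented for illustration and are assumed to already live in a shared space; real word and image vectors are much larger and first need the learnt mapping.

```python
import math

def distance(a, b):
    """Euclidean distance ||a - b|| between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_image(word_vector, image_vectors):
    """Return the name of the image whose vector is nearest to the word vector."""
    return min(image_vectors, key=lambda name: distance(word_vector, image_vectors[name]))

v = [0.9, 0.1, 0.3]   # toy word-vector for 'cat'
i = [0.8, 0.2, 0.25]  # toy image-vector for a picture of a cat
t = [0.1, 0.9, 0.7]   # toy image-vector for a picture of a table

best = closest_image(v, {"cat": i, "table": t})
```

With these values ||v - i|| is smaller than ||v - t||, so the cat picture is selected for the word 'cat'.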

Word-Vector Conversion:

Google's word2vec is a deep-learning inspired method that focuses on learning the meaning of words and the semantic relationships between them. Word2vec is a neural net that takes raw text as input and converts words into vectors without any human intervention. The accuracy of the tool increases with the size of the training data. With sufficient training, it is capable of learning relationships in the form of analogies. For example, V(king) - V(man) + V(woman) results in a vector close to V(queen), where V( ) stands for the vector representation of a word. We trained using the freebase data set that contains 100 billion words taken from various news articles.
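
The analogy arithmetic can be illustrated without a trained model. The hand-made three-dimensional vectors below (roughly: royalty, maleness, femaleness) are purely hypothetical and only show the mechanics; real word2vec vectors are learnt from text and have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word whose vector is most similar to V(a) - V(b) + V(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(target, vectors[word]))

toy_vectors = {  # hypothetical values, not trained embeddings
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "table": [0.0, 0.5, 0.5],
}

result = analogy("king", "man", "woman", toy_vectors)
```

On this toy vocabulary the nearest remaining word to V(king) - V(man) + V(woman) is "queen".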

Image-Vector Conversion:

The ImageNet LSVRC-2010 contest involved classifying 1.2 million high-resolution images into 1000 classes. The deep convolutional network trained by A. Krizhevsky et al. achieved one of the best results. This is a deep neural net with 5 convolutional layers and 3 fully-connected layers. The output from the penultimate layer, which is a vector of length 1000, is used as the vector representation of the image. We used the Flickr 8k data set, where every image has 5 captions that describe it. The images were passed through the net to obtain the image-vector representation.

Mapping the vector representations:

In our work, word2vec represents each word using a 200-dimensional vector, whereas the image-vector is a 1000-dimensional vector as mentioned before. So we need to take the two representations to a common embedding space where they can be compared. This is done using a siamese neural network, described in the figure below.

Structure of the siamese neural network to learn the mapping between the image and word vectors

Future works:

  • Training on a larger data-set for better accuracy
  • Incorporating more information about the text to generate a sequence of images
  • Developing a quantifiable measure that evaluates the accuracy of the image representation of a given text

Friday, 9 January 2015



The objective is to build a voice forensics system that identifies physical features such as height, weight, age and sex, along with region of origin and other demographic information about a miscreant, from the voice evidence collected. The end objective is to build an extensive, if not comprehensive, one-of-a-kind voice print database to enable authorities to track criminals.


Security has become a great concern for the citizens of our nation. With incidents such as bomb attacks, ransom calls, and threats to life and property occurring more frequently, it is vital for the government to devise an effective mechanism to deal with them. Voice forensics has the potential to help law enforcement agencies by providing valuable information such as the height, weight, age and sex of the suspect from the available voice evidence. In the current state of crime investigation in India, we are technologically ill-equipped to investigate cases that have only audio as evidence. Our project tries to solve this problem. Through this project we wish to explore and improve the area of voice forensics. The ultimate aim of the project is to equip law enforcement agencies with tools to process voice samples and provide physical and demographic information about the miscreant that could be used as important evidence in an investigation. We propose to build a unique, one-of-a-kind voice print database for further research and analysis.


During criminal investigations, it is imperative to extract as much information as possible from the available evidence. Currently, the National Crime Records Bureau cites two methods of criminal identification: one using fingerprints, and the other a portrait building system. Fingerprint matching can provide accurate information about the criminal, but when fingerprint evidence is not available, or the person is not recorded in the database, it cannot support even approximate predictions about the person. Presently, there are 11 divisions under the CBI for forensics and crime investigation in India. Surprisingly, voice forensics is not yet one of them. With the technology we are developing, it would be possible for the CBI to investigate cases with evidence obtained from voice and speech as well. With this tool, voice could be used as reliable evidence in court proceedings as per Section 65B of the Indian Evidence Act, 1872.

Exploration of voice as possible evidence is quite recent, and some advanced voice identification software, such as VoiceGrid, is being developed. While voice-based technologies such as Siri and Cortana are used as personal digital assistants in mobile phones, VoiceGrid is a database-intensive tool that has been adopted by various state police organizations in the USA and Russia to identify miscreants from captured voice samples. These systems rely largely on an existing database to make exact or close matches. However, when a voice sample cannot be matched exactly, or is not in the database, it is very useful to extract physical and geographical information about the miscreant from the available voice sample. Hence, there is scope to develop much smarter and more efficient systems for voice forensics. In addition, there are no publicly available benchmarks for testing an attribute identification method. This is mainly due to the difficulty of procuring a large dataset for the models to work on and the absence of a framework for testing. Moreover, there exists no framework that does the work of:

1.    Collecting a large quantity of audio data from the citizens of our nation.
2.    Storing, analyzing and validating the collected audio samples and managing them securely.
3.    Performing formal research on the collected voice samples, including testing different models that predict the physical attributes of a person from their voice.
4.    Helping researchers across the country use this nationwide audio database for other interesting applications. (Anonymity of persons will be maintained for security purposes.)

Within this project, we propose to build a framework that would solve these problems. We aim to build the necessary technology for voice forensics and investigation. The long term aim of this project is to equip law enforcement agencies with the required tools to perform voice forensics and provide necessary evidence for enforcement of law and order. With the system we build, the officials should be able to estimate with good accuracy, the physical and geographical features of the suspect.


Voice samples were collected from 40 students who participated in the IPTSE CMU-NITK Winter School 2014. The age group of the participants was in the range of 19-22 years. The height, weight, age and sex of the students were recorded. Each student was asked to speak a set of 25 phonetically rich sentences randomly selected from the large TIMIT database. Thus, there were 25 recordings per person, making a total of 1000 recordings. The samples were recorded using an external microphone on Audacity, in a relatively quiet room. We ensured that the recordings were lossless. All other necessary conditions like distance between the speaker and the microphone were taken care of while recording the voice samples.


The framework we are developing consists of machine learning tools and classification and regression algorithms that extract and analyze features of the voice and learn the correlations between the physical features and the voice of the speaker. The framework depicts the pipeline of computations and analysis. The pipeline mainly consists of the following:

1.    Feature Extraction
2.    Normalization of data
3.    Clustering (Bag of Words Model)
4.    Machine Learning Algorithms
      • Classifier Models
      • Regression Models

The pipeline followed is depicted in the picture above. The following sections explain these stages in detail.


The initial results we obtained were themselves a proof of concept for what we were trying to build. Although the data set we used to test our system was small and biased (the male-female ratio was 3:1), we were still able to generate results with good accuracy. We could predict the gender of an unknown person from their voice with an accuracy of 95.2% and predict their height with an error of 6.5 cm. With more data and fine-tuning, our system could become reliable enough to reach our desired goals.


To make our tool publicly usable, we have developed a website. The website allows a user to upload a voice sample (only .wav files are accepted as of now) and outputs the physical characteristics of the speaker in the sample. To predict the physical features, the uploaded voice sample is run through the already trained model. In future, we intend to let users contribute training data as well. To ensure authenticity and security, only validated users will be allowed to upload their voice samples and physical characteristics. After the samples collected from the website have been inspected for genuineness, they will be used to train our models.


  • LASSO regression for height estimation
  • Augmenting 6000 features (speaker traits) with bag-of-words features
  • Since we have high accuracy for gender classification, we hope to see better results by using the predicted gender itself as a feature for height prediction
  • The data collected was biased, with a girls-to-boys ratio of 1:3. We need to test our models on a larger, unbiased data set and check their performance


We would like to extend our gratitude to our guides Prof Bhiksha Raj, Prof Rita Singh from CMU and Mr. Pulkit Agrawal, PhD student from University of California, Berkeley. A special thanks to the entire IPTSE Winter School Team for providing us the opportunity and resources to work on this project.


Team Voice Forensics:
1. Tejeswini Sundaram, BTech Computer Science, MIT Manipal
2. Priya Soundararajan, Int. M.Sc. Applied Mathematics, IIT Roorkee
3. Utkarsh Patange, BTech Computer Science, IIT Kanpur
4. Sakthivel Sivaraman, BTech Mechanical Engineering, NITK Surathkal

Friday, 2 January 2015

Project TexEmo - Text Based Emotion Detection System

What is Project TexEmo?

Emotions are universal and extend beyond the boundaries of language, literature, religion, age, etc. Communicating with people is not just about transmitting information or a message but also about expressing emotions. In today’s world, while we continuously strive to make machines smarter and more intelligent, we have not been able to make them detect and understand emotions. Through our project we aim to make machines detect emotions from any form of text, which constitutes about 70-80% of the information available to us. Understanding emotions opens up an array of possibilities: personalized information generation, such as advertisements and search results; development of powerful human-computer interaction machines; and evolution of more intuitive and emotionally characterized text-to-speech systems.

People Behind:
Satish Palaniappan  - SSNCE, Anna University
Dhruv Goel  - MIT Manipal
Skand Arora  - Amity University

For more info, follow us on our blog here!

The Famous Five

Team Famous Five

Predicting the Impact of an Image on Social Networks


Team Members:

  • Chirag Nagpal - AIT, Pune
  • Kodali Naveen - BITS, Hyderabad
  • Megha Arora - IIIT, Delhi
  • Nimisha Sharath - NITK, Surathkal
  • Rohan Katyal - IIIT, Delhi