Comic Polyglot
Everyone loves comics. Wouldn't it be wonderful if we could read all the amazing comics published in languages such as Japanese, Spanish, French and German in English or Hindi? Periodic comics in different languages come out every other day or week, but remain inaccessible for a long time because of the intensive manual translation effort required. This manual process involves teams of people working separately on text extraction, translation and infusion for every line of text in a comic. Our project attempts to automate this process and make comics available in other languages within a day of the original raw release, with minimal human effort. The hurdles we face include automating complex processes such as text detection, extraction, removal, optical character recognition, context-based language translation and text infusion. On completion, this project has the potential to disrupt the scanlation process and the fan community, and also to affect the comic industry.
To create such a system, we need to break it down into smaller systems and attempt to solve them individually. The first hurdle that we face is text extraction, and the majority of our time in the Winter School was spent tackling this problem.
Various stages involved in approaching the problem:
- Text Detection
- Text Extraction
- Text Infusion
Comics Survey: The comics available today come in various formats, such as:
- Simple black and white Manga comics which have characters all over the place. These generally have kanji-like characters in text boxes, speech/thought bubbles but they are not limited to these areas. Mostly Japanese or Korean.
- Colored comics that have text only in the speech/thought bubbles. These generally follow the pattern that the text is black and the bubble is white and generally ovular/elliptical in shape.
- Colored comics that can have text placed anywhere inside the comic frames.
Initial Attempt: During the Winter School, we initially attempted to extract Japanese characters from speech bubbles in a manga image set that we created. After discussing with the professors, we attempted to detect bubbles in three different ways:
- Radon Transform
- Hough Transform
- Condensation Algorithm
The Radon transform computes projections of an image matrix along specific directions by integrating the image (object/function) along the rays of a parallel beam.
The value of these integrals is represented by a function Rθ(x'), where θ is the angle of inclination of the source/destination from the reference axis and x' is the distance of a ray (in the parallel beam) from the origin.
For the Radon transform, we calculate such curves for each angle from 0 to 180 degrees. This gives a heat map representing how the integrals change over the angles, which provides information about the object, such as its shape.
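To make the projection idea concrete, here is a minimal pure-Python sketch. It is not the code we used: the function names (`radon_projection`, `sinogram`, `disk`) are made up for illustration, and each pixel is simply assigned to the nearest ray instead of being interpolated, so it only approximates the true line integrals.

```python
import math

def radon_projection(image, theta_deg):
    """Approximate R_theta(x') for one angle.

    image is a list of rows of numbers. Every pixel is added to the ray
    whose signed distance x' from the image centre is nearest (a crude
    nearest-bin approximation, not an interpolated line integral).
    """
    height, width = len(image), len(image[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    theta = math.radians(theta_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Largest possible |x'| is half the image diagonal.
    half = math.ceil(math.hypot(width, height) / 2.0)
    bins = [0.0] * (2 * half + 1)
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            x_prime = (x - cx) * cos_t + (y - cy) * sin_t
            bins[int(round(x_prime)) + half] += value
    return bins

def sinogram(image, angles=range(0, 180)):
    """One projection per angle; stacking them gives the heat map."""
    return [radon_projection(image, angle) for angle in angles]

def disk(size, cx, cy, radius):
    """Hypothetical test image: a filled circle on a size x size grid."""
    return [[1 if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0
             for x in range(size)] for y in range(size)]
```

For a filled disk at the centre of the image, `radon_projection` returns the same profile at 0 and 90 degrees, while moving the disk sideways moves the peak of the 0-degree profile by the same amount.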
Here are some of the outputs obtained from the Radon transform:
Circle at centre: Since a circle is symmetric about any axis, the integral is constant across angles (i.e. a straight line), with high intensity in the middle that reduces gradually towards the ends.
Shifted circle: Due to the shift there is a gradual distortion in the line, but the structure remains more or less the same.
Comic bubble: Bubbles in comics are approximately circles with some distortions.
As we can see, the Radon transform is capable of detecting circles very efficiently, but only if they are completely filled, i.e. solid shapes as shown in the pictures above. This is not the rule in comics.
This made it almost impractical for bubble detection, but we gave it one more try by applying the Radon transform over integral images, as suggested by our mentor Prof. Bhiksha. That also failed, as the Radon transform of integral images could not be narrowed down to a specific format/shape.
While investigating the above, we found a paper on text detection in natural environments. We also found a MATLAB implementation that was extremely helpful in getting us off the ground. We made some changes to the methods it follows, which can be seen in the diagram below.
Text Detection Pipeline
[Figure: Text Extraction Pipeline]
We now detect blobs in the image using Maximally Stable Extremal Regions (MSER). Here, blobs are regions of similar intensity, since all characters have the same colour/intensity. Text is distinguished from the background on the basis of the area occupied by each region, since characters are all of roughly constant size. Experimentally, we found that the area of such regions lies between 10 and 2000. The maximum value is quite high because, while detecting MSER regions, several small regions lying close together form a bigger blob; in comics, most of the text in one bubble ends up in the same MSER region.
The drawback of MSER is that it detects non-text parts along with most of the text.
To improve the result, we also apply Canny edge detection to the initial grayscale image. The Canny edge detector detects all edges present in the image, i.e. text as well as non-text parts. This output is then masked with the previous MSER output to keep the parts most significant in our context. Canny edges are very thin, so to make them significant (transform them into regions) we grow the edges relative to the gradient (gradient-grown edges). The result contains regions representing text as well as non-text.
Subtracting the output of this procedure from the MSER output gives all the text with a significantly smaller non-text part.
Filtering text and non-text parts based on connected component analysis:
The connected components obtained from the above analysis are filtered on the basis of three factors: area, eccentricity and solidity. Through experimental analysis, we found that in the domain of comics these factors are governed by the following conditions:
- Area lies between 10 and 2000
- Eccentricity of a component that contains text is greater than 0.995
- Solidity is less than 0.4
We know that all characters have a roughly constant stroke width, so we use this property to filter the content. We apply a stroke width transform to every filtered connected component. If the variation of the stroke width within a connected component is greater than a threshold, the component is eliminated. We experimented with different stroke width variation thresholds for our dataset and found the optimal value to be 0.36.
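The stroke width check can be sketched as follows. This is an illustration under assumptions, not our exact code: it assumes the "variation" is measured as standard deviation divided by mean (the text does not pin down the formula), it uses the population standard deviation, and it takes the per-pixel stroke widths of a component as already computed by a stroke width transform. The function names are hypothetical.

```python
import statistics

def stroke_width_variation(stroke_widths):
    """Relative spread of stroke widths in one connected component.

    Assumed metric: population standard deviation divided by the mean.
    stroke_widths must be non-empty and contain positive values.
    """
    mean = statistics.mean(stroke_widths)
    return statistics.pstdev(stroke_widths) / mean

def keep_component(stroke_widths, max_variation=0.36):
    """Keep a component only if its stroke width is nearly constant."""
    return stroke_width_variation(stroke_widths) <= max_variation
```

A component whose strokes are all 4 or 5 pixels wide passes this check, while one whose widths jump between 1 and 9 pixels is rejected as non-text.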
To extract text-containing regions from the image, we apply a morphological transform to combine nearby regions and then construct a bounding box around each combined region to extract that part of the image.
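A minimal sketch of this merge-then-box step on a binary mask is shown below. It assumes that the morphological transform is a dilation with a square structuring element and that components are 4-connected; both are illustrative choices, and the function names are hypothetical.

```python
from collections import deque

def dilate(mask, radius):
    """Binary dilation with a (2*radius+1) square structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out

def bounding_boxes(mask):
    """Return (left, top, right, bottom) for each 4-connected component."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:  # breadth-first flood fill of one component
                px, py = queue.popleft()
                left, right = min(left, px), max(right, px)
                top, bottom = min(top, py), max(bottom, py)
                for nx, ny in ((px + 1, py), (px - 1, py), (px, py + 1), (px, py - 1)):
                    if 0 <= nx < w and 0 <= ny < h and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes

def text_region_boxes(mask, radius=1):
    """Merge nearby regions by dilation, then box each merged region."""
    return bounding_boxes(dilate(mask, radius))
```

Because the boxes are computed on the dilated mask, each box is padded by `radius` pixels around the original text pixels, which is usually harmless when the patch is cropped for OCR.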
Support Vector Machine (SVM)
The output of the text extraction pipeline contained all the text patches and a huge number of non-text patches, so we decided to design an SVM classifier to distinguish text patches from non-text patches. We trained an SVM using HOG features of the image patches. With an RBF kernel it gave poor results (accuracy was 44%). In our case the number of instances was much smaller than the number of features, so it was better to use the liblinear Python library, which provides a linear kernel, consumes less time and memory, and produced better accuracy. Training on 277 images and testing on 69 images gave an accuracy of 79% for a HOG kernel (2 * 2). When trained with 4500 images, the accuracy was 98%. One important caveat is that images containing both text and non-text were removed from that dataset; as a next step, we will count such images as part of the positive dataset.
Image Cleaning
This step processes a scanned document of text to clean the text background.
The script does the following:
- Optionally, crops the image
- Optionally, converts to grayscale
- Optionally, enhances the image to stretch or normalize
- Creates a grayscale version of the enhanced image and applies a local area threshold using -lat and (optionally) some blurring for antialiasing to create a mask
- Composites the mask with the corrected image to make the background white
- Optionally unrotates, sharpens, changes the saturation, trims and pads
Text Extraction
Optical Character Recognition: To retrieve text from images we used Tesseract OCR (one of the best available OCR engines). Tesseract works fine for standard formal text, but for comics, where fonts are cursive, it did not give very good results.
The following figure shows the output of OCR run on 60 image patches. The darker the colour, the higher the precision of character recognition.
As observed above, OCR gave very poor results in some places.
To correct this, we used the Levenshtein distance to find the nearest match for each word in a collection of around 90,000 Spanish words, which we collected from a CMU dataset. This let us correct most of the words misdetected by OCR.
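The correction step can be sketched like this. It is a minimal illustration with a tiny made-up vocabulary standing in for the 90,000-word list; the function names are hypothetical, and ties between equally close words are broken by vocabulary order, which is an arbitrary choice.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]

def closest_word(word, vocabulary):
    """Return the vocabulary entry with the smallest edit distance to word."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))
```

For example, `closest_word("c0mer", ["comer", "correr", "casa"])` picks "comer", since only one substitution separates the two strings. Scanning the whole word list for every OCR token is slow for large vocabularies, so a real run would probably want to prefilter candidates, for instance by length.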
Translation
The text detected by OCR was translated into the specified language using Google Translate.
Text Infusion
Text infusion refers to removing the original text and replacing it with the translated text.
The problem we faced here was that the removed patches were rectangular and contained text as well as background. If we simply remove the patch and place the text, it looks very odd. We therefore had to remove the text, not the background, from the patch.
To handle this, we used the property of comic images that text is always black. We took the HSV transform of the image and separated all the black parts. We then applied connected component analysis again (as in the text detection pipeline) with finer values. Finally, we merged this reduced black part (mostly lines) with the other coloured parts to get the patch without the text.
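Separating the black part of a patch via HSV can be sketched as below. This is an illustration only: the 0.25 cut-off on the V (value) channel is an assumed number, not the one we tuned, and `black_mask` is a hypothetical name. Pixels are taken to be (r, g, b) tuples with channels in 0-255.

```python
import colorsys

def black_mask(rgb_image, max_value=0.25):
    """Mark pixels whose HSV value (brightness) is at most max_value.

    rgb_image is a list of rows of (r, g, b) tuples with 0-255 channels.
    Returns a same-sized grid of 1 (black-ish pixel) and 0 (anything else).
    """
    mask = []
    for row in rgb_image:
        mask_row = []
        for r, g, b in row:
            _, _, value = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            mask_row.append(1 if value <= max_value else 0)
        mask.append(mask_row)
    return mask
```

The resulting mask contains text strokes as well as black outlines, which is why a connected component pass is still needed afterwards to tell the two apart.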
We then placed the translated text in this patch and put it back into the image to get the translated comic.
Neural Network based Classifier
We worked on developing an artificial neural network (ANN) based system to recognize text and non-text regions in a given comic image. The idea was to extract the text from the comic image, run OCR on it to get the detected text, use an appropriate language translation API to convert it into English, and put the translated text back into the image at its appropriate location. The ANN was trained to classify patches into two categories. Currently we are working on training our network on images of Spanish comics.
Initial Approach to Training: Since most of the text in a comic strip is inside some kind of bubble, our initial approach was based on detecting the bubbles in the image, so that at a later stage of the pipeline the text can be extracted from them by running OCR.
Dataset Preparation: A positive dataset of 16X16 patches, each containing some part of a bubble, was made from ground-truth images of the variety of bubbles found in comic images; the aim is to detect the lines that are part of a bubble. The negative dataset contains patches of the same dimensions taken from other portions of the image that have no bubble in them.
Training & Accuracy: Our ANN was trained on these patches, with each patch of the positive dataset labelled '1' and each patch of the negative dataset labelled '0'. Training resulted in an accuracy above 90% most of the time. We used the Caffe framework, developed by researchers at UC Berkeley, to build the neural network, and ran the final trained Caffe model through its Python wrapper. The results obtained, however, were below the acceptable level, which might be due to the fact that the ground-truth bubble images were not true replicas of the bubbles in the comic images: the neighbourhood of a 16X16 patch in a comic image was totally different from that in our ground-truth image. We therefore came up with a different approach that detects the text directly in the image.
Current Approach: Currently we are working on building a robust feature model for each patch. The aim is to detect the text in comic images directly, rather than detecting bubbles as in the previous approach.
Dataset Preparation: The text written in bubbles in comic images was cut out to create ground-truth text images, which form our positive dataset; the rest of the comic strip was used to form the negative dataset.
A feature model is prepared by taking 16X16 non-intersecting patches from the text images as well as from the non-text images. Each patch is processed to extract HOG features and LBP features, which give a one-dimensional 1X36 HOG feature vector and a 1X36 LBP feature vector respectively.
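Cutting an image into non-intersecting patches can be sketched as follows. The function name is hypothetical, and the sketch assumes that partial patches at the right and bottom borders are simply dropped, which the text does not specify.

```python
def non_overlapping_patches(image, size=16):
    """Split a 2-D image (list of rows) into size x size tiles.

    Tiles are taken left to right, top to bottom, with no overlap.
    Border pixels that do not fill a whole tile are dropped (assumption).
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            patches.append([row[left:left + size]
                            for row in image[top:top + size]])
    return patches
```

A 40-pixel-wide, 35-pixel-high image yields four patches this way; the HOG and LBP descriptors would then be computed on each returned patch.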
The feature model is then a 6X6 2-D matrix with 2 channels: the first channel holds the HOG features arranged in 6X6 form, and the second channel holds the LBP features of the same patch arranged in 6X6 form. In this way, a 6X6 feature model with two channels is prepared for each 16X16 patch of the positive and negative datasets. The labels assigned to these feature models are '1' for a positive patch and '0' for a negative patch. The idea is to come up with a robust feature model that also encompasses information about the neighbouring pixels.
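Assembling the two-channel feature model from the two 36-element vectors can be sketched like this. The row-major arrangement of each vector into a 6X6 grid and the channel-first layout are assumptions made for illustration, and the function names are hypothetical; the descriptors themselves are taken as given.

```python
def to_grid(vector, side=6):
    """Arrange a flat vector of side*side values into side rows (row-major)."""
    if len(vector) != side * side:
        raise ValueError("expected %d values, got %d" % (side * side, len(vector)))
    return [list(vector[r * side:(r + 1) * side]) for r in range(side)]

def feature_model(hog_vector, lbp_vector, side=6):
    """Stack HOG (channel 1) and LBP (channel 2) grids: shape 2 x side x side."""
    return [to_grid(hog_vector, side), to_grid(lbp_vector, side)]

def labelled_example(hog_vector, lbp_vector, is_text):
    """Pair a feature model with its label: 1 for text, 0 for non-text."""
    return feature_model(hog_vector, lbp_vector), 1 if is_text else 0
```

Keeping the channels as the outermost dimension matches the channels-by-height-by-width ordering that Caffe expects for each input sample.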
Training & Accuracy: A convolutional neural network is being trained on these 6X6 two-channel feature models with their assigned labels. The currently observed accuracy is above 50%, which we hope to improve by using more features per patch during training and by improving our dataset selection.
We are The Team 42:
- Akshay Dixit - VIT (email@example.com)
- Gaurav Bansal - IIITA (firstname.lastname@example.org)
- Aman Raj - DTU (email@example.com)
- Priyanka Selvan - MSRIT (firstname.lastname@example.org)
- Harshvardhan Solanki - IITK (email@example.com)
- Farhat Abbas - NITS (firstname.lastname@example.org)
Xu-Cheng Yin, Xuwang Yin, Kaizhu Huang, Hong-Wei Hao. Robust Text Detection in Natural Scene Images. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 36, no. 5, pp. 970-983, 2014.
Mathworks. Automatically Detect and Recognize Text in Natural Images. http://in.mathworks.com/help/vision/examples/automatically-detect-and-recognize-text-in-natural-images.html
TEXTCLEANER Script. http://www.fmwconcepts.com/imagemagick/textcleaner/
Caffe. http://caffe.berkeleyvision.org/