Tuesday, 30 December 2014

Project NELS: Never Ending Learning of Sound

Synopsis:

Most robotic machines today make decisions based on visual input, but they seldom use sound as a sensory signal for deciding on actions. Imagine a machine that could understand different sounds the way you do: one that could sense a knock on your door or someone breaking into your house, and act on the sounds in its environment. This would enable more efficient decision making and make machines more intelligent. But how does one make a machine understand the millions of sounds that exist in the universe?

The main motivation behind this long-term project is to make machines continuously learn all the sounds that exist in the universe. Nature has its own grammar, in the sense that sounds follow certain rules that humans understand intuitively, but this information is currently unusable by machines. The NELS team is developing an artificially intelligent system that continually crawls the web for sound samples and automatically learns their meanings, associations and semantics with minimal human intervention.
  
The Never Ending Language Learner (NELL) and the Never Ending Image Learner (NEIL) are similar projects that aim to give machines the capability to read and to understand images, respectively. These projects have been very successful. The Never Ending Learning of Sound (NELS) is a similar effort, with the aim of making machines develop the capability to hear.

Approach to Project:

We have currently divided the project into 3 modules:

Module 1: Web Crawler - The web crawler continuously crawls the web to collect sound samples along with other information about each sample. Currently, the crawler is implemented for youtube.com, but we are in the process of putting a generic crawler in place. A major problem that remains to be dealt with is handling near-duplicate sound samples; we plan to use Locality Sensitive Hashing (LSH) techniques for this. A minimal sketch of the crawl-and-deduplication loop is shown below.
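To make the crawler concrete, here is a minimal sketch of such a crawl loop in Python. The URLs and helper names are illustrative assumptions only: the actual crawler targets youtube.com and stores richer metadata, and this sketch only catches exact duplicates (near-duplicates need the planned LSH step).

```python
# Minimal sketch of a crawl-and-dedup loop (illustrative URLs and helpers).
import hashlib
import requests

seen_hashes = set()  # fingerprints of clips we have already stored

def fingerprint(audio_bytes):
    """Exact-duplicate fingerprint; near-duplicates will need LSH (future work)."""
    return hashlib.sha1(audio_bytes).hexdigest()

def crawl(candidate_urls):
    """Download each candidate clip and skip the ones we have seen before."""
    for url in candidate_urls:
        audio = requests.get(url, timeout=30).content
        h = fingerprint(audio)
        if h in seen_hashes:
            continue  # exact duplicate, skip it
        seen_hashes.add(h)
        yield url, audio  # hand the new clip to the feature-extraction module
```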

Module 2: Artificially Intelligent System - We have trained classifiers on training data obtained from the findsounds.com sound database. These classifiers are used to classify newly encountered sound samples, which will in turn be used to re-train the classifiers. Re-training is not implemented as of now, but it forms an essential part of the project. We currently extract Mel Frequency Cepstral Coefficients (MFCCs) from the sound samples and apply the k-means algorithm to obtain a bag-of-words representation. The current binary SVM classifiers (chi-square kernel) are trained on these bag-of-words histograms; a sketch of this pipeline is shown below.
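The following sketch illustrates the MFCC / bag-of-words / chi-square-SVM pipeline using librosa and scikit-learn as stand-ins for OpenSMILE, Octave and LibSVM; the file names, labels and codebook size are illustrative assumptions, not the actual NELS configuration.

```python
# Sketch of the bag-of-audio-words pipeline (librosa/scikit-learn stand-ins).
import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def mfcc_frames(path):
    """Frame-level MFCCs for one clip, shape (frames, 13)."""
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

def bag_of_words(paths, codebook):
    """Quantize each clip's frames against the codebook into a normalized histogram."""
    hists = []
    for p in paths:
        words = codebook.predict(mfcc_frames(p))
        hist, _ = np.histogram(words, bins=np.arange(codebook.n_clusters + 1))
        hists.append(hist / hist.sum())
    return np.array(hists)

# Illustrative training clips and binary labels (1 = guitar, 0 = not guitar).
train_paths, labels = ["guitar_001.wav", "noise_001.wav"], [1, 0]

# 1. Learn the codebook (the "audio words") from all training frames.
all_frames = np.vstack([mfcc_frames(p) for p in train_paths])
codebook = KMeans(n_clusters=64, random_state=0).fit(all_frames)

# 2. Train a chi-square-kernel SVM on the bag-of-words histograms.
X_train = bag_of_words(train_paths, codebook)
clf = SVC(kernel="precomputed").fit(chi2_kernel(X_train, X_train), labels)
```

At test time, a new clip's histogram is compared against the training histograms with the same kernel, e.g. clf.predict(chi2_kernel(bag_of_words(test_paths, codebook), X_train)).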

Module 3: Web Interface - The backend of the website is built with Python and Django. The web interface allows users of the NELS framework to visualize the relationships and attributes currently formulated by NELS. It also forms the basis for manual human evaluation of the relationships NELS has learnt. Our aim is to continuously reduce the human intervention NELS requires.
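As a rough illustration of how such an evaluation page can be served, here is a hypothetical Django view; the SoundSample model and its fields are assumptions made for the sketch, not the actual NELS schema.

```python
# Hypothetical Django view returning the classifier's current beliefs
# so that humans can verify them (model and field names are illustrative).
from django.http import JsonResponse
from .models import SoundSample  # assumed model with source_url, label, confidence

def learned_sounds(request):
    samples = SoundSample.objects.order_by("-confidence")[:50]
    data = [{"url": s.source_url, "label": s.label, "confidence": s.confidence}
            for s in samples]
    return JsonResponse({"samples": data})
```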

Tools Used:

1. OpenSmile
2. Octave
3. LibSVM (Linear, RBF and Precomputed Kernel)
4. Python
5. Django Web Framework
6. Front End Technologies: HTML, CSS, JavaScript, jQuery.
7. PortAudio

We will be moving the entire codebase to a single language, Python, and doing away with Octave, so that backend processing is faster and response times for user queries are shorter.

GitHub Repository and Detailed Description:

A detailed step-by-step description of the approach followed for the NELS web framework is available in the README.md. Please visit our GitHub repository for the project's source code. We welcome any suggestions.

Current Results:

The precision and recall on the guitar test data using the SVM classifiers were found to be 75% and 40%, respectively; the snippet below shows how such figures are computed. Our immediate aim is to improve these numbers so that the end-to-end system is ready and NELS can start learning from new sound samples.
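For reference, this is how precision and recall are computed from a set of test predictions with scikit-learn; the labels below are synthetic and were chosen only so the output matches the reported 75% / 40% figures.

```python
# Synthetic example: 15 guitar clips and 5 non-guitar clips in the test set.
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1] * 15 + [0] * 5)                      # ground-truth labels
y_pred = np.array([1] * 6 + [0] * 9 + [1] * 2 + [0] * 3)   # classifier output

print(precision_score(y_true, y_pred))  # 0.75 -> fraction of predicted guitars that are guitars
print(recall_score(y_true, y_pred))     # 0.40 -> fraction of true guitars that were found
```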

Future Work:

In order to improve the performance of the model, we plan to work on the following:

1. Try other classification methods (such as different kernels in LibSVM) and identify which provides the best precision and recall. Our aim is to achieve as high a precision as possible.

2. Identify features beyond MFCCs that capture more information about a sound sample and lead to better classification accuracy.

3. Use incremental clustering to identify new sound classes.

4. Implement Locality Sensitive Hashing (LSH) to identify near-duplicate samples and hence significantly improve the efficiency of the crawler; a minimal sketch follows this list.
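As a starting point for item 4, here is a minimal random-hyperplane LSH sketch over per-clip feature vectors; the number of hash bits and the assumption that clips are represented by fixed-length vectors (e.g. the bag-of-words histograms) are illustrative choices, not a final design.

```python
# Minimal random-hyperplane LSH index for near-duplicate candidate lookup.
import numpy as np

class LSHIndex:
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))  # random hyperplanes
        self.buckets = {}                                 # signature -> clip ids

    def _signature(self, vec):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(bool(b) for b in (self.planes @ vec) > 0)

    def add(self, clip_id, vec):
        self.buckets.setdefault(self._signature(vec), []).append(clip_id)

    def candidates(self, vec):
        """Clips sharing this signature are likely near-duplicates of `vec`."""
        return self.buckets.get(self._signature(vec), [])
```

Clips that land in the same bucket would then be compared directly before the crawler stores a new sample, so only a handful of comparisons are needed per clip.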

Project Mentors:

We would like to thank Prof Dr. Bhiksha Raj, Prof Dr. Rita Singh and Pulkit Agrawal for their constant guidance and overwhelming support. Their wisdom, clarity of thought and persistent encouragement motivated us to bring this project to its present state. Working with them has been a great learning experience.

1. Prof Dr. Bhiksha Raj, School of Computer Science, Carnegie Mellon University. Link to Official Page.

2. Prof Dr. Rita Singh, School of Computer Science, Carnegie Mellon University. Link to Official Page.

3. Pulkit Agrawal, Graduate Student, Department of Computer Science, University of California, Berkeley. Link to Official Page.

Project Contributors:

The following is the NELS team. If you have any questions or want to contribute, please contact us:

Rohan Badlani, 3rd year undergraduate student, Computer Science, BITS Pilani (rohan.badlani@gmail.com)

Aditi Bhatnagar, 3rd year undergraduate student, Information and Communication Technology, DA-IICT (aditi24.bhatnagar@gmail.com)

Amogh Hiremath, 1st year M.Tech Student, Electronics and Communication, NIT Surathkal (amogh3892@gmail.com)

Ankit Shah, 4th year undergraduate student, Electronics and Communication, NIT Surathkal (ankit.tronix@gmail.com)

Parnika Nevaskar, 2nd year undergraduate student, Computer Science, DAU Indore, (parnika.nevaskar@gmail.com)


Team: StrangersUnited 
at Microprocessor Lab, NIT Surathkal, India.
(Left-Right) Parnika Nevaskar, Aditi Bhatnagar, Ankit Shah, Rohan Badlani, Amogh Hiremath

Stay tuned for further updates. 

Friday, 12 December 2014

Teams, Ideas, Bottlenecks and Eureka Moments

So we were assigned our track (Multimedia analysis) and we formed a team (Heuristics). Now what?
As we sat and thought about all the things we wanted to build for the workshop, I was amazed at how three sleep-deprived minds can come up with stuff that is amazing and totally bizarre at the same time. Our ideas darted from social-help applications to education to some completely random stuff (which I'd rather not mention here). Coming up with ideas was fun, but deciding on one among them- now that was the tough part. And finally, we decided to let the others help us decide. So the presentation we put up the next day was full of unsure ideas and broken descriptions. Here's what they looked like:
  1. Automatically generating commentary for a sports match.
  2. Automatic generation of doodles.
  3. A gesture-based system that generates music depending upon the user actions.
  4. Aiding learning for dyslexic students.
Bhiksha sir and Rita ma'am helped us decide. We were going to build no.1!
Only- we had absolutely no idea how we'd go about it. Pulkit had to sit with us and brainstorm for hours before we stopped panicking and the unrealistic problem seemed possible. Well- it was toned down to something that was possible. The plan now is to be able to tell what objects are in the video and what actions are being performed. Then, maybe, we could move on to extracting key frames to obtain relevant data. And if, somehow, we managed all that, we can optimistically hope that we'll be able to teach a machine to narrate a tennis match in real time!

As of now- we're still scavenging for the dataset X) You can find our latest presentation here: https://drive.google.com/file/d/0B1d6SWTGqTXbQ1dpdWdRdno0MXc/view?usp=sharing

We are- team Heuristics. And we'll keep you posted about all our blunders and bond-moments from this blog!