Wednesday, 11 October 2017

The "Language Dilemma" in Natural Language Processing

Ola amigos :) Yes, I know what you're thinking, I missed you too. Nevertheless, I am back with a short and sweet adventure that had me preoccupied for a while.

So, let me start with the prequel. As part of a class assignment, we were asked to do some basic NLP exploration - tokenization, stemming, stop word removal, blah blah. As I skimmed through the instructions, nothing sparked my interest as much as the line "The assignment can be done in java or python".

Had I been one day away from the submission, or sleep-deprived, it would have been an obvious choice - Python. Why, you ask? Well, I am lazy. I have worked with NLP in Python before. It is easy to use and provides some excellent functionality. Need I say more? Fortunately for me, and unfortunately for you, I wanted to explore. If it were the last decision of my life, which would I choose?

So I decided to compare Java's Stanford CoreNLP with Python's NLTK. But wait, isn't that like comparing guavas to mangoes? Not really, as I intended to contrast the APIs, performance and developer-friendliness of the two libraries.

CoreNLP uses the Penn Treebank (PTB) tokenizer, with some cool flags to rewrite British spellings as American, normalize currency and so on. On the other hand, NLTK provides some useful tokenizers of its own, for tweets, punctuation and the like. If you are accustomed to having things custom-made, you can also avail yourself of the regular-expression tokenizer, which both libraries offer.
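To make the NLTK side concrete, here is a minimal sketch (it assumes the usual NLTK data packages, e.g. punkt, have been downloaded; the sentence is made up):

```python
from nltk.tokenize import RegexpTokenizer, TweetTokenizer, word_tokenize

text = "Mr. O'Neil can't wait for #NLProc at 9.30 am :)"

# The standard word tokenizer (Treebank-style rules under the hood)
print(word_tokenize(text))

# A tweet-aware tokenizer that keeps hashtags and emoticons intact
print(TweetTokenizer().tokenize(text))

# Custom-made: a regular-expression tokenizer that keeps only word characters
print(RegexpTokenizer(r"\w+").tokenize(text))
```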

As far as performance is concerned, I thought it wise to check how well the PTBTokenizer of CoreNLP and the word tokenizer of NLTK scale with the number of tokens. You are welcome to argue otherwise. While the former shows little deviation from its average tokenizing time, the time taken by the latter is erratic.

Going forward, it is worth noting that stopword removal is easier in Python, with NLTK's list of common English stopwords at your disposal. CoreNLP probably holds the opinion that stopwords are specific to the language in general and the text in particular; a fair argument. This can be a useful extension if you are hunting for the same.
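A minimal sketch of what that looks like in NLTK (assuming the stopwords and punkt corpora have been fetched with nltk.download):

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("this is a rather simple sentence about tokenizers")
# Drop the common English stopwords, keeping the informative tokens
print([tok for tok in tokens if tok.lower() not in stop_words])
```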

Evidently, there are still a ton of aspects waiting to be compared and contrasted, but that is for another time. Till then, happy coding in a parallel world! 😊

Saturday, 1 July 2017

Documents as Vectors?

All of us read - novels, magazines, newspapers, articles, so on and so forth. One of my recent adventures had me amazed at the competence of the human brain - how it manages to skim through large texts and formulate a précis, almost effortlessly. If I asked you to classify a bunch of articles into a bunch of random classes like 'teddy bears', 'coffee' and 'Rubik's cubes', it would probably be a piece of cake! Just exaggerating! But imagine getting a machine to do that - text classification.

So, my job was simple - given an article on the web, classify it (with a model of course).

One thing led to another and I met this really interesting guy called doc2vec. Don't you worry! This is not going to be the 10,001st article explaining how those vectors come into this world, or how they make themselves useful to mankind. In fact, I'm not really sure I understand it completely myself; but that is precisely the beauty of it. One can work with it safely without attending to all that cryptic deep learning stuff.

Having successfully trained a model on 100 web articles, I dared to test it. The results were absurd - astray en masse! Always bear in mind a golden rule - a model only learns what you teach it! Hence, I repeated the exercise with about 400 articles. It obviously did not work on the first try, but thanks to all those Google discussions on model parameters, I groomed it into something acceptable. Another golden rule - there is no definite set of parameters which yields an almighty model. You have to mess around till you find your magic numbers.
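For the record, here is roughly what the training looks like in gensim - a minimal sketch with a toy corpus standing in for the web articles, and placeholder parameter values (names follow recent gensim releases):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# A toy corpus standing in for the scraped web articles
raw_docs = ["the coffee was strong and bitter",
            "teddy bears make surprisingly good gifts",
            "solving a rubik cube needs practice and patience"]
documents = [TaggedDocument(words=doc.split(), tags=[i])
             for i, doc in enumerate(raw_docs)]

# The parameter values here are placeholders - find your own magic numbers
model = Doc2Vec(documents, vector_size=50, window=5, min_count=1, epochs=40)

# Infer a vector for an unseen snippet and ask for the most similar training doc
vec = model.infer_vector("a fresh cup of coffee".split())
print(model.dv.most_similar([vec], topn=1))
```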

While all this is well and good, in a parallel world, imagine employing this approach for regional-language text, say Kannada. Yes, I had the same reaction. Allow me to list some of the challenges I was confronted with.

  • The training articles available in Kannada were not as numerous as those in English.
  • If you were ever fortunate enough to study Kannada, you will agree with me that the language has a complex structure. For example, words like ಭಾರತ, ಭಾರತದಲ್ಲಿ, ಭಾರತಕ್ಕೆ and ಭಾರತದಿಂದ are treated as different words (vibhakti pratyayas and their cousins).
  • Despite the variety of APIs that NLTK has bestowed upon us, regional Indian languages find poor support in terms of stemming and lemmatization.
  • Adding to my dismay, my hunt for pre-trained Kannada doc2vec models also turned out to be futile.
In a nutshell, while doc2vec does look promising with all those fancy similarity queries, it might not quite serve your purpose. It involves a fair amount of dabbling and dawdling. Nevertheless, this adventure had me awestruck at the immense work that has been done in the field of NLP.
Happy coding in a parallel world! 😊

Sunday, 4 June 2017

My 'trailing' story

Yosh, I am back to blogging in the parallel world, with a whole new set of ventures to explore and bore you with. Okay, let's get down to brass tacks straight away.
I happened to stumble upon this exciting piece of software called TrailDB, so allow me to explain my trails through it.

 'Trails' in a database?
Before we find ourselves muddled in the technicality and perplexity of things, let us take a minute to note the primary objective of this innovative database - storing and querying time-based user events. If you were born and brought up with relational databases (like I was), you might be wondering how this is any different. This should be your sweet and simple answer. Having highlighted that, I am not going further into the nuts and bolts of things; the documentation serves that purpose quite brilliantly.

The must-have installation quandary!
I hate installing stuff, especially when they say it is 'simple'. As the documentation promises, the installation on Linux is pretty straightforward. However, if you ever want to test your patience, I recommend installing it on Windows (where using Bash is not an option). After fixing what seemed like a multitude of errors in the manual installation of dependencies, I realized, with tears of grief, that the required version of libJudy (an interesting dependency) is unavailable for Windows.

The biased documentation
When I finally decided to get my hands dirty with the actual API, all was well until I got to the exciting part of running queries. I was using the Python bindings, but looking at the C documentation! Nevertheless, it led me to several intriguing questions, such as: what could be the equivalent of a C struct in Python? A dictionary? A tuple? A list? Well, after staring at the source code for ample time, I deciphered the syntax to run queries ('event filters' in TrailDB terminology).

The final verdict
At the end of the day, the assessment of any database comes down to two things - storage and query capabilities. While TrailDB scores on the former (thanks to all the expensive stuff happening at the C level), the latter ain't all that impressive. For example, a simple query that involves a comparison has to be done in the application layer, as event filters can only check for equality! Not to mention that the query has to be in conjunctive normal form.
Some fancy compression!
In my opinion, while it isn't suitable for database-intensive applications, TrailDB is apt if you are looking for a lightweight database, say for archiving massive data (that can easily be represented as strings).


That brings us to the end of yet another exciting adventure in the parallel world. Relevant code will be available here. Adios and happy coding in a parallel world! 😊

Wednesday, 19 April 2017

Bacons and Pickles!

For my last adventure in the parallel world, I decided to do something special - get all of you hungry, and also curious, about The Oracle of Bacon. I happened to find this interesting Git repository, which calculates the Bacon number of actors picked from the IMDb database. Yes, I know; I could not resist the temptation to mess with the code.

As the author has very well stated, this problem can be reduced to that of doing a BFS on an undirected graph, with actors as vertices and common movies as edges. Unfortunately, the immense dataset adversely impacts performance. Putting on our profiler glasses gives us pretty much the same view: most of the time goes into loading the input data. But wait, why pickle?

cProfile output for the Python code

The Elephant coming into the Room

After reading almost every post containing 'pickle' on stackoverflow.com, I ventured to alter a few things.

  • Disabling the garbage collector: unreferenced objects invite the garbage collector, which might, in turn, slow down the loading of input. [Source]
  • JSON as input: after a night's struggle, I managed to convert the pickle input file into JSON. This would obviously cause problems when a new dataset is used, but the difference in performance is worth considering. (A sketch of both tweaks appears just below.)

Input files converted to JSON
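The gist of both tweaks, as a rough Python sketch (the file name is made up; disabling the GC around pickle.load and switching to json.load are the two changes discussed above):

```python
import gc
import json
import pickle
import time

def load_pickle(path):
    # Tweak 1: disable the GC while unpickling, so that the millions of
    # small objects being created don't trigger needless collection passes.
    gc.disable()
    try:
        with open(path, 'rb') as f:
            data = pickle.load(f)
    finally:
        gc.enable()
    return data

def load_json(path):
    # Tweak 2: load the converted JSON file instead of the pickle.
    with open(path) as f:
        return json.load(f)

start = time.time()
mapping = load_json('actor_to_movies.json')   # hypothetical converted input file
print('loaded %d keys in %.2f s' % (len(mapping), time.time() - start))
```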

A long-shot at parallelism

Having scored on the memory front, I asked myself whether there was some scope for parallelism. Intuitively, it seemed like a bad idea because the graph was implemented with huge, in-memory dictionaries that were perplexing to deal with. Nevertheless, I gave it a try with the following approach.

  1. Given an actor, find his/her co-stars using the actor-to-movies and movies-to-actors mappings.
  2. If you are lucky enough to find Mr. Kevin Bacon among the co-stars, return 1.
  3. Else, do a parallel BFS on his/her co-stars till you find a minimum link. (The actor's Bacon number will be that of the co-star, incremented by 1.)
Much to my disappointment, due to a number of factors - Python's GIL, the huge dictionaries being passed as arguments to processes, and the basic structure of the underlying code - I was not very successful. All the same, documentation of these attempts and others is available here.
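For reference, here is a minimal sketch of steps 1 and 2, assuming two hypothetical dictionaries, actor_to_movies and movie_to_actors, of the kind the repository builds:

```python
def costars(actor, actor_to_movies, movie_to_actors):
    """Return every actor who shares at least one movie with `actor`."""
    result = set()
    for movie in actor_to_movies.get(actor, []):
        result.update(movie_to_actors.get(movie, []))
    result.discard(actor)
    return result

def bacon_step(actor, actor_to_movies, movie_to_actors):
    # Step 2: a direct co-star of Kevin Bacon has Bacon number 1.
    if 'Kevin Bacon' in costars(actor, actor_to_movies, movie_to_actors):
        return 1
    return None  # otherwise, fall through to the (attempted) parallel BFS
```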

Happy coding in a parallel world 😊








Monday, 17 April 2017

Just rand()-only!

Life was throwing stones at me, so I turned to the parallel world for a sweet, simple and sexy adventure. Little did I know that I would get hit by another revolting, wrecking boulder. Let me elaborate!

We have all met prime numbers in life. My adventure started off with a simple attempt to write a code snippet that counts the prime numbers in a range. The vanilla approach involves finding factors of each number, counting primes as you iterate. But who likes vanilla? I wanted chocolate or strawberry!

There are plenty of fancy algorithms for primality testing - of which I chose Miller-Rabin. It is based on Fermat's little theorem, which states that if p is prime, then a^(p-1) ≡ 1 (mod p) for any integer a with 1 ≤ a < p.

Me: So, all I have to do is check if that condition is satisfied for every a.
The wiser me: But wait, how is that any better than my vanilla method?
Me: Then maybe I do it for just one such a.
The wiser me: But what if I am unlucky enough to select an a that satisfies the condition by coincidence?
Me: Oh! I get it - I have to choose a small number of such a's and live with the error.
The wiser me: *rolls eyes* Finally!

As you pretty much guessed, a raised to something huge is pretty knotty to compute. There are tricks, such as modular exponentiation and splitting the exponent into powers of 2, that our prudent mathematicians have discovered. Let us take a moment to say arigato! Exploiting all such clever techniques, I proudly came up with a working version of prime.c. It was no child's play, especially for someone who can only see, not look!
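To illustrate the core idea - in Python rather than the prime.c from this post, purely for brevity, since Python's built-in three-argument pow already does modular exponentiation - here is the Fermat-test portion (full Miller-Rabin adds the extra square-root checks):

```python
import random

def probably_prime(n, trials=20):
    """Fermat test: pick a few random a's and check a^(n-1) ≡ 1 (mod n)."""
    if n < 4:
        return n in (2, 3)
    for _ in range(trials):
        a = random.randrange(2, n - 1)
        if pow(a, n - 1, n) != 1:   # modular exponentiation, no huge a**(n-1)
            return False            # definitely composite
    return True                     # probably prime (Carmichael numbers can fool us)

print(sum(probably_prime(n) for n in range(2, 10000)))  # count of likely primes
```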

Get to the parallelism already!

Broadly speaking, I could come up with two ways to benefit from parallelism.

  • Parallelising the test across the numbers themselves
  • Given a number, parallelising the test across the trials (the a's we spoke about earlier)
I set out to explore the first. After the necessary #pragma mantras, I executed the parallel code. It was slower than the serial code. Yes, that was precisely the boulder I mentioned. I searched. I inspected. I scrutinized. I probed. I found nothing.

Finally, a Mr. Robert Reed opened my eyes with this. rand() was the culprit! I was using rand() to pick those a's and, as it turns out, rand() is not thread-safe. Hence, all those calls to rand() were being executed serially. Add the overhead of creating the threads, and my parallel code was bound to be slower.

Moral of the story: do not assume that you can parallelise anything and everything; and also, by the way, rand() is not always thread-safe.

All relevant code and logs will be found here. Happy coding in a parallel world! 😊









Thursday, 16 March 2017

Multi-(threading | processing) ?

My third adventure in the parallel world started off with my own code - a lexer for C, written in Python. For those of you who are new to the terminology of compiler design - lexical analysis is step 0 of the compilation process.
In a nutshell, a lexer performs the following functions:
  • Strip off comments from the source code
  • Generate tokens, i.e. segregate the input code into identifiers, operators, punctuation and so on
  • Populate the symbol table
So, how are the tokens generated?

The input code is matched against a bunch of predefined regular expressions and classified accordingly. Sounds simple enough? At this point, it is worth noting that this can be really expensive for large pieces of code, as the expressions themselves are not guaranteed to be simple. But wait, can't this be done in parallel? Yes, you hit the nail right on the head!

Revisiting our problem with the "parallel eye"

Ideally, we want some workers that scan the input in parallel. As soon as one worker finds the matching regular expression, we want all the workers to stop. It reminds me of the time when I made my whole family search for my spectacles at home while I'd actually left them in college! That's right, we also have to deal with input that matches nothing.

Parallelism in Python

Python is a language that has never disappointed me with its collection of libraries and, quite expectedly - voila! We are all familiar with the differences between processes and threads as stated by our dear old Python; nevertheless, I thought this would be a good opportunity to explore the same.

Multiprocessing vs Multithreading

Implementing the lexer with multi-(threading|processing) can be approached in two ways:
  • Make each worker return the match object or None
  • Make each worker update a global variable
Obviously, the second option is better, as we can let the other workers terminate if and when the global variable is set. This is precisely where I was accurately wrong, fundamentally! Processes don't share memory - hence, even if the different worker processes saw the same initial value of the global variable, they would be updating their own local copies of it, thus defeating the purpose successfully! Not to mention, sharing the variable among processes is not really worth the effort.
On the other hand, threads (even with Python's GIL) serve our purpose to a great extent.
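Here is a minimal sketch of the thread-based idea, with toy patterns standing in for the real token specification - the threads share the found dictionary and the done event, which is exactly what the processes could not do:

```python
import re
import threading

patterns = {                      # toy patterns standing in for the real token spec
    'NUMBER':     r'\d+',
    'IDENTIFIER': r'[A-Za-z_]\w*',
    'OPERATOR':   r'[+\-*/=]',
}

found = {}                        # shared between threads (same address space)
done = threading.Event()

def worker(name, pattern, text):
    match = re.match(pattern, text)
    if match and not done.is_set():
        found[name] = match.group()
        done.set()                # tell the other workers they can stop

text = 'count = 42'
threads = [threading.Thread(target=worker, args=(n, p, text))
           for n, p in patterns.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(found or 'no token matched')   # e.g. {'IDENTIFIER': 'count'}
```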

As you pretty much guessed, the red dots correspond to multiprocessing while the green ones to multithreading. The separation is strikingly large!

Hope this post made you a little curious about the parallel world of python! Adios till my next adventure - happy coding in a parallel world. 😊

P. S: All relevant code and readme files will be here.

Tuesday, 28 February 2017

A Simple Text Editor


A parallel world has many doors,
With lots to explore and lots to find.
Knock knock and thee shall ask for more,
With fork and join, to blow your mind.

Enough with the theatricals? Yes, I understand.

So, for my second adventure in the parallel world, I set out to inspect parallelism in Java. Don't worry - this is not going to be the nine-hundred-and-ninety-ninth post you've read explaining Java's incredible journey from threads to the executor service! On the contrary, this is about my experience in putting the same to use.

I found this cute, simple text editor written in Java on GitHub, which I thought had some scope for improvement. It had most of the features of a typical text editor, except for a scroll pane (which I thought was pretty necessary). After sufficient twiddling, I happened to notice that it had no "find all" feature, and there was my light-bulb moment - implementing a parallel search for a word in a document.


Implementing a parallel search for a word in a document

If you think about it, it is a piece of cake - you have some jumbo text, in which you have to find the occurrences of a word. All you Java lovers are probably wondering why all this hue and cry over a simple String.indexOf() call. Well, I was in agreement too, till I discovered that the indexOf function is, in reality, implemented as an O(mn) algorithm. (If you are fortunate enough to have not come across this notation, you can either refer here or take my word for "we can do better".)

A simple parallel approach to the problem is to divide the search among threads, i.e. break up the jumbo text into smaller chunks and make each thread return the occurrences of the search string in its chunk. But wait, what if the word is spread across chunk boundaries? Yes, we have to handle that case separately.
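The editor itself is in Java, but the chunking idea is language-agnostic; here is a rough sketch of it in Python, where extending each chunk by len(word) - 1 characters handles the boundary case:

```python
from concurrent.futures import ThreadPoolExecutor

def find_in_chunk(text, word, start, end):
    """Return absolute indices of `word` starting within text[start:end]."""
    overlap = len(word) - 1
    chunk = text[start:end + overlap]      # extend past the chunk boundary
    hits, i = [], chunk.find(word)
    while i != -1:
        hits.append(start + i)
        i = chunk.find(word, i + 1)
    return hits

def find_all_parallel(text, word, workers=4):
    size = max(1, len(text) // workers)
    bounds = [(i, min(i + size, len(text))) for i in range(0, len(text), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda b: find_in_chunk(text, word, *b), bounds))
    return sorted(set(i for hits in results for i in hits))

print(find_all_parallel("banana bandana banana", "banana"))  # [0, 15]
```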

A screenshot of the find all implementation
All said and done, with a few lines of code, this new feature was added to the text editor. It seemed to perform pretty well with a running time of 2 milliseconds for 237 occurrences in a text of about 20000 characters (using the same old O(mn) algorithm). However, there was no significant change in the performance when the number of threads was increased.

Hope you liked this short simple post on Java parallelism. Look forward to more in the upcoming posts. Till such time, happy coding in a parallel world.😊

P.S. All relevant code will be available here.


Sunday, 29 January 2017

The Page Rank Mystery

I kick-started my quest for some long, slow code, out there, lost in the GitHub world, waiting to be rescued by some parallel programmer who'd walk by, wave their magic OpenMP wand, and change the world. Eventually I discovered that hunting for "slow code" is a wild goose chase; it makes much more sense to look for "interesting code". Voila - algorithm implementations!

So I stumbled upon this repository implementing the infamous PageRank algorithm. Much to my disappointment, it was very well written - in Greek. The readme directed me to this useful link, How Google Finds Your Needle in the Web's Haystack, which explains the latent, ugly math of the algorithm. From there, I dived into the C++ code to get a clearer picture of what was happening.

The code was surprisingly fast for large inputs (about 1.2 seconds for 100,000 vertices). This is probably because of the extensive use of library functions, which are assuredly compiler-friendly. An extremely useful lesson: when you write code, exploit the language's library as much as you can! Nevertheless, when I looked at the 0.65-second response time on a Google search results page, I hoped that there might be some scope for improvement.


A quick examination of the code from the perspective of a profiler yields the obvious result: the PageRank computation is essentially the bottleneck. I suppressed my instinct to instantly use the #pragma mantra and played around with the serial code. For instance, it might have caught your eye that the current element of the H vector is recomputed in each and every iteration!

But wait! Aren't computations involving double data types expensive? Loss of precision? Does it depend on the representation? Either way, why would I want to compute it every single time? Can't I just compute it once and store it in an array? Isn't that more cache-friendly?

Yes. I had the same set of questions on my mind. As it turns out, I could not have been more wrong. The computation, though expensive, is conditional. For a sparse matrix, pre-computing values in an array would hardly make any difference; in fact, pray that it does not add more overhead. This led me to my second lesson: Don't always blame the expensive computation.

Parallelizing the computations over the rows using good old OpenMP reduced the running time by 0.2 seconds (a 16% speedup). Frankly, I was underwhelmed. Ouch! Expectations hurt. I welcome any comments speculating on the reason.

Moving on, my next move was to address the smaller elephant in the room - the file input. Reading a huge file line by line seems highly inefficient for several reasons: an increased number of system calls (and therefore more interrupts), more frequent movement of the file pointer, and so on.

So, what is the most efficient way to read a file? Perhaps read the entire file into a string and then parse the string. I tested this with a sample file and realized that it was indeed faster.
Hurrah! Our PageRank implementation got 0.15 s faster!


Summing up, I believe that the optimization is still "work in progress". But that is for another time in another post. Till such time, happy coding in a parallel world. 😊

P.S: All the relevant code and readme files will be available in my GitHub account.