Wednesday, 11 October 2017

The "Language Dilemma" in Natural Language Processing

Hola amigos :) Yes, I know what you're thinking; I missed you too. Nevertheless, I am back with a short and sweet adventure that kept me preoccupied for a while.

So, let me start with the prequel. As part of a class assignment, we were asked to do some basic NLP exploration: tokenization, stemming, stop word removal, blah blah. While I skimmed through the instructions, nothing sparked my interest as much as the line "The assignment can be done in Java or Python".

If I were one day away from the submission, or if I were sleep-deprived, the choice would have been evident: Python. Why, you ask? Well, I am lazy. I have worked with NLP in Python before; it is easy to use and provides some excellent functionality. Need I say more? Fortunately for me, and unfortunately for you, I wanted to explore. If it were the last decision of my life, which would I choose?

So I decided to compare Java's Stanford CoreNLP with Python's NLTK. But wait, isn't that like comparing guavas to mangoes? Not really, as I intended to contrast the APIs, performance, and developer-friendliness of the two libraries.

CoreNLP uses the Penn Treebank (PTB) tokenizer, with some cool flags to rewrite British spellings as American, normalize currency, and so on. NLTK, on the other hand, provides some useful special-purpose tokenizers for tweets, punctuation, etc. And if you are accustomed to having things custom-made, you can avail yourself of a regular-expression tokenizer, which both libraries offer.
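To make the contrast concrete, here is a minimal taste of the NLTK side (the CoreNLP equivalents live in Java, so I'll spare you the boilerplate). The sample tweet is made up, and the snippet assumes you have run nltk.download('punkt') for the word tokenizer's models.

```python
from nltk.tokenize import word_tokenize, TweetTokenizer, RegexpTokenizer

# A made-up sample with tweet-ish content.
text = "Loving the colour of this £5 guava!!! #fruitgoals @mango_fan"

# General-purpose, PTB-style tokenization.
print(word_tokenize(text))

# Tweet-aware tokenization: hashtags and handles survive intact.
print(TweetTokenizer().tokenize(text))

# Custom-made: keep only runs of word characters.
print(RegexpTokenizer(r"\w+").tokenize(text))
```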

As far as performance is concerned, I thought it wise to check how well CoreNLP's PTBTokenizer and NLTK's word tokenizer scale with the number of tokens. You are welcome to argue otherwise. While the former shows little deviation from its average tokenization time, the time taken by the latter is erratic.
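For the curious, here is a sketch of the kind of timing harness I mean, shown for the NLTK half only; the CoreNLP half would be the analogous Java loop around PTBTokenizer. The input text is synthetic, and again nltk.download('punkt') is assumed.

```python
import time
from nltk.tokenize import word_tokenize

sentence = "The quick brown fox jumps over the lazy dog. "

# Grow the input and watch how tokenization time scales with token count.
for repeats in (10, 100, 1000, 10000):
    text = sentence * repeats
    start = time.perf_counter()
    tokens = word_tokenize(text)
    elapsed = time.perf_counter() - start
    print(f"{len(tokens):>8} tokens: {elapsed:.4f} s")
```

One run is noisy, of course; averaging over several runs per input size gives a fairer picture.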

Going forward, it is worthwhile to note that stopword removal is easier in Python, with NLTK's list of common English stopwords at your disposal. CoreNLP ships no such list, probably holding the opinion that stopwords are specific to the language in general, and the text in particular; a fair argument. A third-party stopword annotator can be a useful extension if you are hunting for the same.
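In NLTK, the whole affair is a one-liner once you have the list. The sketch below assumes nltk.download('stopwords') and nltk.download('punkt') have been run; the sentence is just an illustration.

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# NLTK ships a ready-made list of common English stopwords.
stop_words = set(stopwords.words("english"))

text = "This is just a small example of how the filtering works."
tokens = word_tokenize(text)

# Keep only the tokens that are not stopwords.
content = [t for t in tokens if t.lower() not in stop_words]
print(content)  # something like ['small', 'example', 'filtering', 'works', '.']
```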

Evidently, there are still a ton of aspects waiting to be compared and contrasted, but that is for another time. Till then, happy coding in a parallel world! 😊