Saturday, 1 July 2017

Documents as Vectors?

All of us read - novels, magazines, newspapers, articles, and so on. One of my recent adventures left me amazed at the competence of the human brain - how it manages to skim through large chunks of text and formulate a précis almost effortlessly. If I asked you to sort a bunch of articles into a few random classes like 'teddy bears', 'coffee' and 'Rubik's cubes', it would probably be a piece of cake (well, almost!). But imagine getting a machine to do that - text classification.

So, my job was simple - given an article on the web, classify it (with a model of course).

One thing led to another and I met this really interesting guy called doc2vec. Don't you worry! This is not going to be the 10,001st article explaining how those vectors come into this world, or how they make themselves useful to mankind. In fact, I'm not sure I understand it completely myself; but that is precisely the beauty of it. One can work with it safely without attending to all that cryptic deep learning stuff.

Having successfully trained a model on 100 web articles, I dared to test it. The results were absurd - almost all of them astray! Always bear in mind a golden rule - a model only learns what you teach it! Hence, I repeated the exercise with about 400 articles. It obviously did not work on the first try, but thanks to all those Google discussions on model parameters, I groomed it into something acceptable. Another golden rule - there is no definite set of parameters that yields an almighty model. You have to mess around till you find your magic numbers.

While all this is well and good, in a parallel world, imagine employing this approach for regional language text, say Kannada. Yes, I had the same reaction. Allow me to list some of the challenges I was confronted with.

  • The training articles available in Kannada were not nearly as numerous as those in English.
  • If you were ever fortunate enough to study Kannada, you will agree with me that the language has a complex structure. For example, words like ಭಾರತ, ಭಾರತದಲ್ಲಿ, ಭಾರತಕ್ಕೆ and ಭಾರತದಿಂದ are treated as different words (vibhakti pratyayas and their cousins).
  • Despite the variety of APIs that NLTK has bestowed upon us, regional Indian languages enjoy poor support in terms of stemming and lemmatization.
  • Adding to my dismay, my hunt for pre-trained doc2vec Kannada models also turned out to be futile.
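To see why those inflected forms hurt, here is a tiny sketch: to a plain tokenizer, the four forms of ಭಾರತ above are four unrelated vocabulary items. A naive suffix-stripper collapses this particular example, but the hand-picked suffix list is purely illustrative - real Kannada morphology (vibhakti pratyayas, sandhi and friends) needs far more than this:

```python
# The four inflected forms of ಭಾರತ from the post, as a tokenizer
# sees them: four distinct strings.
words = ["ಭಾರತ", "ಭಾರತದಲ್ಲಿ", "ಭಾರತಕ್ಕೆ", "ಭಾರತದಿಂದ"]
assert len(set(words)) == 4  # without stemming, 4 vocabulary items

# A tiny, hand-picked sample of case suffixes - illustrative only.
SUFFIXES = ["ದಲ್ಲಿ", "ಕ್ಕೆ", "ದಿಂದ"]

def naive_stem(word):
    """Strip the first matching suffix, if any. A toy, not a stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

stems = {naive_stem(w) for w in words}
print(stems)  # all four forms collapse to the single stem ಭಾರತ
```

This is exactly the gap the bullet points describe: with no ready-made stemmer or lemmatizer for Kannada, every such suffix rule has to be discovered and maintained by hand.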
In a nutshell, while doc2vec does look promising with all those fancy similarity queries, it might not quite serve your purpose. It involves a fair amount of dabbling and dawdling. Nevertheless, this adventure had me awestruck at the immense work that has been done in the field of NLP.
Happy coding in a parallel world! 😊