Friday, 11 May 2018

Exploring NewMA!

Hola amigos :) I admit it has been quite a while since my last post. Thanks to a bunch of mundane chores that kept me occupied, I never realised how it got so late, so soon. Yes, that was a Dr. Seuss quote to get the mood going. Anyway, welcome to my next adventure in the parallel world, one that further explores the NUMA architecture, which was introduced in my previous post.

So here is the premise - we have the chummy, supersonic local memory in contrast to the surly, slothful remote memory, followed by the important question that is probably in your head - what is the big deal?

Imagine you are a waiter at a pizzeria and I told you that the cheesy pizzas are actually being prepared in the kitchen of the building next door (remote memory), rather than your own (local memory). The delay in delivering cheesy pizzas (memory accesses) to the hungry customers will adversely affect the business of the pizzeria.

Likewise, it is the application running on top of NUMA that will be afflicted by remote memory access latency. All said and done, we don't want our beloved applications to be unhappy at the end of the day (a.k.a. "the big deal").

So, we picked this really cool database application called Redis and analysed how its performance is impacted when its memory accesses are forced to come from local and remote memory respectively. The results, depicted in the following graphs, were indeed quite astonishing.

[Graphs: CPI of Redis when accessing local vs. remote memory]

CPI (cycles per instruction) is the number of CPU cycles consumed, on average, to execute one instruction. Evidently, the lower the CPI, the happier our application, which is the case when it is primarily accessing the local memory.
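For the curious, here is a minimal sketch of how such an experiment can be reproduced on a two-node Linux box (assuming numactl and perf are installed; the Redis benchmark command, node numbering and exact perf event names are placeholders that may vary on your machine). It pins the workload to node 0's CPUs while binding its memory to either node 0 (local) or node 1 (remote), and computes CPI from hardware counters:

```python
import subprocess

WORKLOAD = ["redis-benchmark", "-n", "100000"]  # placeholder workload

def cpi(mem_node):
    # pin CPUs to node 0; bind memory to mem_node (0 = local, 1 = remote)
    cmd = ["numactl", "--cpunodebind=0", f"--membind={mem_node}",
           "perf", "stat", "-x,", "-e", "cycles,instructions"] + WORKLOAD
    # perf stat writes its CSV counter lines to stderr
    out = subprocess.run(cmd, capture_output=True, text=True).stderr
    counts = {}
    for line in out.splitlines():
        fields = line.split(",")
        if len(fields) > 2 and fields[2] in ("cycles", "instructions"):
            counts[fields[2]] = int(fields[0])
    return counts["cycles"] / counts["instructions"]

print("local  CPI:", cpi(0))
print("remote CPI:", cpi(1))
```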

So how do we appease our application, flatter it perhaps? Stay tuned while I find a way to this guy's heart. Meanwhile, happy coding in a parallel world! 😊

Sunday, 4 February 2018

The NewMA architecture

Bonjour! Yes, it has been quite a while again; but now, gear up for a bunch of exciting adventures in the parallel world as I explore a world of processors and memory and how they pull the strings of performance. Consequently, the mundane task of writing hardware-friendly code is entrusted to us programmers (as if we didn't have enough already!). Yosh, let's get started.

NUMA and cheesy pizzas
I went for dinner with a group of processors, and we decided to order a cheesy pizza. Hungry as we were, we figured that we would spend more time waiting to reach the pizza than eating it (fetching data takes more cycles than processing it). So each of us got a slice of our own (local memory). Now toss around the following ideas from my head.

  • I love pizza, so I want my friend's slice as well (local access and remote access)
  • But eating from my friend's slice is easier said than done (remote access has higher latency)
  • Also, reaching for the slice of a friend seated farther from me is more laborious than reaching my immediate neighbour's (remote memory latency is a function of inter-processor distance)
Any thoughts on how I can get more of that cheesy pizza quickly? Furthermore, if you aren't a big fan of pizza, you can always read this.

Moving forward with the assumption that we can now talk in NUMA/pizza jargon, we have an upfront errand at hand - ensuring that our application takes advantage of this mysterious underlying architecture. If you're of the opinion that making a process access only its local memory is sufficient to score on your performance grade card, then I would like to quote Da Vinci in reply: "The greatest deception that men suffer, is from their own opinions". By the way, this idea is termed the "node-local" policy by the Linux kernel.
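As an aside, the node-local policy is something you can nudge from userspace. Here is a minimal Linux-only sketch (the CPU IDs belonging to node 0 are an assumption; check numactl --hardware on your box): pin the process to node 0's CPUs, and the pages it touches afterwards get placed on node 0.

```python
import os

NODE0_CPUS = {0, 1, 2, 3}  # assumption: node 0's CPUs; verify with `numactl --hardware`

os.sched_setaffinity(0, NODE0_CPUS)  # pin this process (pid 0 means "self") to node 0
buf = bytearray(512 * 1024 * 1024)   # zeroing the buffer touches every page while we
                                     # are pinned, so the kernel allocates them on node 0
```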

Summons to contest the "node-local" policy

1) One process, one core, one memory... always?
Consider a complex application that runs multiple complex processes on multiple core complexes and has a complex memory access pattern - which leads to a complex situation indeed.

2) Keep those threads on the move
Another approach involves migrating a thread to the processor whose memory it is accessing remotely. However, one has to pay close attention to the trade-off between the overhead of such migration and the remote memory access latency it saves.

3) Clash of caches
When the architecture has blessed us with independent processors, it seems unwise to cram all the threads of a process onto a single node, thereby increasing contention for shared resources like those prized last-level caches.

4) Asymmetric interconnect leads to asymmetric latencies
There might exist some out-of-this-world processors whose interconnects do not all have the same bandwidth, making efficient scheduling of threads a hard nut to crack!

I will be back to trouble you with more anecdotes about NUMA. Till such time, happy coding in a parallel world! 😊

Wednesday, 11 October 2017

The "Language Dilemma" in Natural Language Processing

Hola amigos :) Yes, I know what you're thinking, and I missed you too. Nevertheless, I am back with a short and sweet adventure that had me preoccupied for a while.

So, let me start with the prequel. As part of a class assignment, we were asked to do some basic NLP exploration - tokenization, stemming, stop word removal, blah blah. As I skimmed through the instructions, nothing sparked my interest as much as the line "The assignment can be done in java or python".

If I had been one day away from the submission, or sleep-deprived, the choice would have been evident - Python. Why, you ask? Well, I am lazy. I have worked with NLP in Python. It is easy to use and provides some excellent functionality. Need I say more? Fortunately for me, and unfortunately for you, I wanted to explore. If it were the last decision of my life, which one would I choose?

So I decided to compare Java's Stanford CoreNLP with Python's NLTK. But wait, isn't that like comparing guavas to mangoes? Not really, as I intended to contrast the API, performance and developer-friendliness of the two libraries.

CoreNLP uses the Penn Treebank (PTB) tokenizer, with some cool flags to rewrite British spellings to American, normalize currency and so on. On the other hand, NLTK provides some useful tokenizers for tweets, punctuation and more. If you are accustomed to having things custom-made, you can also avail yourself of the regular-expression tokenizer, which is part of both libraries.
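To give you a taste of the NLTK side, here is a minimal sketch (assuming nltk is installed and the 'punkt' data has been downloaded; the sample sentence is made up):

```python
from nltk.tokenize import TweetTokenizer, RegexpTokenizer, word_tokenize

text = "I loooove cheesy pizzas!!! #NLP @amigo :)"

print(word_tokenize(text))                     # the general-purpose tokenizer
print(TweetTokenizer().tokenize(text))         # keeps hashtags, mentions and emoticons intact
print(RegexpTokenizer(r"\w+").tokenize(text))  # custom-made: alphanumeric runs only
```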

As far as the performance is concerned, I thought it wise to check how well the PTBTokenizer of CoreNLP and the word tokenizer of NLTK scale with the number of tokens. You are welcome to argue otherwise. While the former shows little deviation from the average "tokenizing" time, the time taken by the latter is erratic.
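For the curious, the NLTK half of that scaling check can be sketched in a few lines (the corpus file and the scale factors are placeholders; the CoreNLP side was timed analogously in Java):

```python
import time
from nltk.tokenize import word_tokenize

base = open("article.txt").read()  # placeholder corpus

for scale in (1, 2, 4, 8, 16):
    text = base * scale
    start = time.perf_counter()
    tokens = word_tokenize(text)
    elapsed = time.perf_counter() - start
    print(f"{len(tokens):>9} tokens tokenized in {elapsed:.3f}s")
```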

Going forward, it is worth noting that stopword removal is easier in Python, with a list of common English stopwords at your disposal. CoreNLP probably holds the opinion that stopwords are specific to the language in general, and the text in particular - a fair argument. This can be a useful extension if you are hunting for one.
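In NLTK, the whole affair boils down to a set lookup (assuming the 'stopwords' and 'punkt' data have been downloaded):

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stops = set(stopwords.words("english"))   # the ready-made list of common English stopwords
tokens = word_tokenize("The pizza in the oven is not for the waiter")
print([t for t in tokens if t.lower() not in stops])
```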

Evidently, there are still a ton of aspects waiting to be compared and contrasted, but that is for another time. Till then, happy coding in a parallel world! 😊

Saturday, 1 July 2017

Documents as Vectors?

All of us read - novels, magazines, newspapers, articles, so on and so forth. One of my recent adventures had me amazed at the competence of the human brain - how it manages to skim through a large text, formulating a precis, quite effortlessly. If I asked you to classify a bunch of articles into a bunch of random classes like 'teddy bears', 'coffee' and 'Rubik's cubes', it would probably be a piece of cake! Just exaggerating! But imagine getting a machine to do that - text classification.

So, my job was simple - given an article on the web, classify it (with a model of course).

One thing led to another and I met this really interesting guy called doc2vec. Don't you worry! This is not going to be the 10001st article explaining how those vectors come into this world, or how they make themselves useful to mankind. In fact, I'm not really sure I understand it completely myself; but that is precisely the beauty of it. One can work with it safely without attending to all that cryptic deep learning stuff.

Having successfully trained a model on 100 web articles, I dared to test it. The results were absurd, en masse astray! Always bear in mind a golden rule - a model only learns what you teach it! Hence, I repeated the exercise with about 400 articles. It obviously did not work on the first try, but thanks to all those Google discussions on model parameters, I groomed it to be acceptable. Another golden rule - there is no definite set of parameters that yields an almighty model. You have to mess around till you find your magic numbers.
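For reference, training with gensim looks roughly like the sketch below (the corpus and the parameters are placeholders, not my magic numbers; also note that model.dv is the gensim 4.x spelling - older releases call it model.docvecs):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# placeholder corpus: one pre-tokenized article per tag
corpus = {"article_0": ["cheesy", "pizza", "recipes"],
          "article_1": ["solving", "rubiks", "cubes", "fast"]}

docs = [TaggedDocument(words, [tag]) for tag, words in corpus.items()]
model = Doc2Vec(docs, vector_size=100, window=5, min_count=1, epochs=40)

vec = model.infer_vector(["cheesy", "pizza"])   # embed an unseen article
print(model.dv.most_similar([vec], topn=1))     # nearest training article
```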

While all this is well and good, in a parallel world, imagine employing this approach for regional-language text, say Kannada. Yes, I had the same reaction. Allow me to list some of the challenges I was confronted with.

  • The training articles available in Kannada were not as numerous as those in English.
  • If you were ever fortunate enough to study Kannada, you will agree with me that the language has a complex structure. For example, words like ಭಾರತ, ಭಾರತದಲ್ಲಿ, ಭಾರತಕ್ಕೆ and ಭಾರತದಿಂದ are treated as different words (vibhakti pratyayas and its cousins).
  • Despite the variety of API that nltk has bestowed upon us, regional Indian languages find poor support in terms of stemming and lemmatization.
  • Adding to my dismay, my hunt for pre-trained doc2vec Kannada models also turned out to be futile.
In a nutshell, while doc2vec does look promising with all those fancy similarity queries, it might not quite serve your purpose. It involves a fair amount of dabbling and dawdling. Nevertheless, this adventure had me awestruck at the immense work that has been done in the field of NLP.
Happy coding in a parallel world! 😊

Sunday, 4 June 2017

My 'trailing' story

Yosh, I am back to blogging in the parallel world, with a whole new set of ventures to explore and bore you with. Okay, let's get down to brass tacks straightaway.
I happened to stumble upon this exciting piece of software called TrailDB, so allow me to explain my trails through it.

 'Trails' in a database?
Before we find ourselves muddled in the technicality and perplexity of things, let us take a minute to note the primary objective of this innovative database - storing and querying time-based user events. If you were born and brought up with relational databases (like I was), you might be wondering: how is this any different? This should be your sweet and simple answer. Having highlighted that, I am not going further into the nuts and bolts of things; the documentation serves that purpose quite brilliantly.
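To give you a taste, here is a minimal sketch along the lines of the Python binding's README (the field names and events are made up, and the API details may differ across versions):

```python
from datetime import datetime
from uuid import uuid4
from traildb import TrailDB, TrailDBConstructor

# a trail is a time-ordered series of events belonging to one UUID
cons = TrailDBConstructor("pizzeria", ["customer", "action"])
uuid = uuid4().hex
for day, action in enumerate(["order", "eat", "pay"]):
    cons.add(uuid, datetime(2017, 6, day + 1), ("alice", action))
tdb = cons.finalize()

for uuid, trail in tdb.trails():
    print(uuid, list(trail))   # each event carries its timestamp and field values
```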

The must-have installation quandary!
I hate installing stuff, especially when they say it is 'simple'. As the documentation promises, the installation on Linux is pretty straightforward. However, if you ever want to test your patience, I recommend installing the same on Windows (using bash was not an option). After fixing what seemed like a multitude of errors in the manual installation of dependencies, I realized, with tears of grief, that the required version of libJudy (an interesting dependency) is unavailable for Windows.

The biased documentation
When I finally decided to get my hands dirty with the actual API, all was well till I got to the exciting part of running queries. I was using the Python binding, but looking at the C documentation! Nevertheless, it led me to several intriguing questions, such as: what could be the equivalent of a C struct in Python? A dictionary? A tuple? A list? Well, after staring at the source code for ample time, I deciphered the syntax to run queries ('event filters' in TrailDB terminology).

The final verdict
At the end of the day, the assessment of any database comes down to two things - storage and query capabilities. While TrailDB scores on the former (thanks to all the expensive stuff happening at the C level), the latter ain't all that impressive. For example, a simple query that involves a comparison has to be done in the application, as event filters can check only for equality! Not to mention that the query has to be in conjunctive normal form only.
Some fancy compression!

In my opinion, while it isn't suitable for database-intensive applications, TrailDB is apt if you are looking for a light database, say for archiving massive data (that can be represented easily as strings).


That brings us to the end of yet another exciting adventure in the parallel world. Relevant code will be available here. Adios and happy coding in a parallel world! 😊

Wednesday, 19 April 2017

Bacons and Pickles!

For my last adventure in the parallel world, I decided to do something special - get all of you hungry, and also curious, about The Oracle of Bacon. I happened to find this interesting Git repository, which calculates the Bacon number of actors picked from the IMDb database. Yes, I know; I could not resist the temptation to mess with the code.

As the author has very well stated, this problem can be reduced to a BFS on an undirected graph, with actors as vertices and common movies as edges. Unfortunately, the immense dataset adversely impacts performance. Putting on profiler glasses gives us pretty much the same view: most of the time goes into loading the input data. But wait, why pickle?

cProfile output for Python

The Elephant coming into the Room

After reading almost every post containing 'pickle' on stackoverflow.com, I ventured to alter a few things (a code sketch follows the list below).

  • Disabling the garbage collector: unreferenced objects invite the garbage collector, which might in turn slow down the input loading. [Source]
  • JSON as input: after a night's struggle, I managed to convert the pickle input file into JSON. This would obviously cause problems when a new dataset is used, but the difference in performance is worth considering.

Input files converted to JSON
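Both tricks fit in a few lines (the file names below are placeholders):

```python
import gc
import json
import pickle

def load_pickle_fast(path):
    # gc.disable() keeps the collector away from the millions of objects
    # pickle creates while loading; switch it back on when we are done
    gc.disable()
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    finally:
        gc.enable()

def load_json(path):
    with open(path) as f:
        return json.load(f)

actor_to_movies = load_json("actor_to_movies.json")  # the converted input
```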

A long-shot at parallelism

Having scored with memory, I asked myself if there was some scope for parallelism. Intuitively, it seemed like a bad idea because the graph was implemented with huge, in-memory dictionaries that were perplexing to deal with. Nevertheless, I gave it a try with the following approach, sketched in code after the list.

  1. Given an actor, find his/her co-stars using the actor-to-movies and movies-to-actors mappings.
  2. If you are lucky enough to find Mr. Kevin Bacon among the co-stars, return 1.
  3. Else, do a parallel BFS on his/her co-stars till you find a minimum link. (The actor's bacon number will be that of your co-star's, incremented by 1)
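Roughly, the attempt looked like this (actor_to_movies and movie_to_actors stand in for the repository's dictionaries; this is a sketch of the idea, not the exact code):

```python
from multiprocessing import Pool

actor_to_movies = {}  # placeholder: loaded from the converted JSON
movie_to_actors = {}  # placeholder

def co_stars(actor):
    return {a for m in actor_to_movies.get(actor, ())
              for a in movie_to_actors.get(m, ()) if a != actor}

def bacon_number(start, target="Kevin Bacon"):
    frontier, seen, depth = {start}, {start}, 0
    with Pool() as pool:
        while frontier:
            depth += 1
            # expand the whole BFS frontier in parallel across worker processes
            expanded = set().union(*pool.map(co_stars, frontier))
            if target in expanded:
                return depth
            frontier = expanded - seen
            seen |= expanded
    return None
```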
Much to my disappointment, due to a number of factors - Python's GIL, the huge dictionaries being shipped to the worker processes, the basic structure of the underlying code - I was not very successful. All the same, documentation of these attempts and others is available here.

Happy coding in a parallel world! 😊

Monday, 17 April 2017

Just rand()-only!

Life was throwing stones at me, so I turned to the parallel world for a sweet, simple and sexy adventure. Little did I know that I would get hit by another revolting, wrecking boulder. Let me elaborate!

We have all met prime numbers in life. My adventure started off with a lucid attempt to write a code snippet that counts the prime numbers in a range. The vanilla approach involves finding factors of each number, counting primes as you iterate. But who likes vanilla? I wanted chocolate or strawberry!

There are plenty of fancy algorithms for primality testing, of which I chose Miller-Rabin. It is based on Fermat's little theorem, which states that if p is prime, then a^(p-1) ≡ 1 (mod p) for any integer a between 1 and p-1.

Me: So, all I have to do is check if that condition is satisfied for all a.
The wiser me: But wait, how is that better than my vanilla method?
Me: Then maybe I do it for just one such a.
The wiser me: But what if I am unlucky enough to select an a that satisfies it by coincidence?
Me: Oh! I get it, I have to choose a small number of such a's and live with the error.
The wiser me: *rolls eyes* Finally!

As you pretty much guessed, a^(something huge) is pretty knotty to compute. There are tricks such as modular exponentiation and splitting into powers of 2 (repeated squaring) that our prudent mathematicians have discovered. Let us take a moment to say arigato! Exploiting all such clever techniques, I proudly came up with a working version of prime.c. It was no child's play, especially for someone who can only see, not look!
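My prime.c is in C, but the Fermat-style trial is worth a quick sketch in Python, where pow(a, n-1, n) does the modular exponentiation for us (the trial count k = 20 is an arbitrary choice; the full Miller-Rabin test refines this check further):

```python
import random

def probably_prime(n, k=20):
    """Fermat test: declare n 'probably prime' if a^(n-1) ≡ 1 (mod n)
    holds for k randomly chosen values of a."""
    if n < 4:
        return n in (2, 3)
    for _ in range(k):
        a = random.randrange(2, n - 1)
        if pow(a, n - 1, n) != 1:     # fast modular exponentiation
            return False
    return True

print(sum(probably_prime(n) for n in range(2, 10_000)))  # count primes in a range
```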

Get to the parallelism already!

Broadly thinking, I could come up with two ways to benefit from parallelism.

  • Parallelising the test across the numbers themselves
  • Given a number, parallelising the test across the trials (the a's we spoke about earlier)
I set out to explore the first. After the necessary #pragma mantras, I executed the parallel code. It was slower than the serial code. Yes, that was precisely the boulder I mentioned. I searched. I inspected. I scrutinized. I probed. I found nothing.

Finally, a Mr. Robert Reed opened my eyes with this. rand() was the culprit! I was using rand() to pick those a's, and as it turns out, rand() is not thread-safe. Hence, all those calls to rand() were effectively being made in serial. Add the overhead of creating the threads, and my parallel code was bound to be slower.
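On the C side, the usual fix is a reentrant generator per thread (rand_r, for instance). The same trap exists in Python-land, by the way: the random module shares a single global generator across threads, so the analogous fix is a private generator per thread. A minimal sketch:

```python
import random
import threading

def worker(seed, out, idx):
    rng = random.Random(seed)  # a private generator: no shared state to fight over
    out[idx] = sum(rng.randrange(2, 1000) for _ in range(100_000))

results = [0] * 4
threads = [threading.Thread(target=worker, args=(s, results, i))
           for i, s in enumerate((11, 22, 33, 44))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```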

Moral of the story: do not assume that you can parallelise anything and everything; and also, by the way, rand() is not always thread-safe.

All relevant code and logs will be found here. Happy coding in a parallel world! 😊