Thursday, 16 March 2017

Multi-(threading | processing) ?

My third adventure in the parallel world started off with my own code - a lexer for C, written in Python. For those of you who are new to the terminology of compiler design: lexical analysis is step zero of the compilation process.
In a nutshell, a lexer performs the following functions:
  • Strip comments from the source code
  • Generate tokens, i.e. segregate the input code into identifiers, operators, punctuation, etc.
  • Populate the symbol table
So, how are the tokens generated?

The input code is matched against a bunch of predefined regular expressions and classified accordingly. Sounds simple enough, right? At this point, it is worth noting that this can be really expensive for large pieces of code, as the expressions themselves are not guaranteed to be simple. But wait, can't this be done in parallel? Yes, you've hit the nail right on the head!
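To make this concrete, here is a minimal, sequential sketch of regex-based tokenization. The token names and patterns below (KEYWORD, IDENTIFIER, and so on) are illustrative assumptions, not the actual tables from my lexer:

```python
import re

# Illustrative token categories; a real C lexer has many more patterns.
TOKEN_SPEC = [
    ("COMMENT",     r"/\*.*?\*/|//[^\n]*"),          # stripped, never emitted
    ("KEYWORD",     r"\b(?:int|return|if|else|while)\b"),
    ("IDENTIFIER",  r"[A-Za-z_]\w*"),
    ("NUMBER",      r"\d+"),
    ("OPERATOR",    r"[+\-*/=<>!]=?"),
    ("PUNCTUATION", r"[(){};,]"),
    ("WHITESPACE",  r"\s+"),                          # skipped
]
# One master regex; the named group that fired tells us the category.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC), re.DOTALL
)

def tokenize(code):
    """Yield (category, lexeme) pairs; comments and whitespace are dropped."""
    for match in MASTER.finditer(code):
        kind = match.lastgroup
        if kind not in ("COMMENT", "WHITESPACE"):
            yield kind, match.group()

print(list(tokenize("int x = 42; /* answer */")))
# → [('KEYWORD', 'int'), ('IDENTIFIER', 'x'), ('OPERATOR', '='),
#    ('NUMBER', '42'), ('PUNCTUATION', ';')]
```

Note that pattern order matters: KEYWORD must come before IDENTIFIER, or `int` would be classified as an identifier.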

Revisiting our problem with the "parallel eye"

Ideally, we want some workers that scan the input in parallel. As soon as one worker finds a matching regular expression, we want all the workers to stop. It reminds me of the time when I made my whole family search for my spectacles at home while I'd actually left them in college! That's right, we have to deal with input that matches nothing as well.
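Under the assumptions of one thread per candidate pattern and a `threading.Event` as the stop signal, the scheme above might be sketched like this; `parallel_match` is a hypothetical helper, not code from my lexer:

```python
import re
import threading

def parallel_match(patterns, text):
    """Try each (name, pattern) in its own thread; return the first hit or None."""
    found = threading.Event()   # set as soon as any worker matches
    lock = threading.Lock()
    result = {}

    def worker(name, pattern):
        if found.is_set():      # someone already matched: bail out early
            return
        m = re.match(pattern, text)
        if m:
            with lock:          # guard against two workers matching at once
                if not found.is_set():
                    result["name"], result["lexeme"] = name, m.group()
                    found.set()

    threads = [threading.Thread(target=worker, args=(n, p)) for n, p in patterns]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                # joining everyone also handles the no-match case
    return (result["name"], result["lexeme"]) if found.is_set() else None
```

For example, `parallel_match([("NUMBER", r"\d+"), ("IDENT", r"[a-z]+")], "42abc")` returns `("NUMBER", "42")`, while input that matches nothing returns `None`.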

Parallelism in Python

Python is a language that has never disappointed me with its collection of libraries, and quite expectedly - voila! We are all familiar with the differences between processes and threads as our dear old Python defines them; nevertheless, I thought this would be a good opportunity to explore them further.

Multiprocessing vs Multithreading

Implementing the lexer with multi-(threading|processing) can be approached in two ways:
  • Make each worker return the match object or None
  • Make each worker update a global variable
Obviously, the second option is better, as we can let the other workers terminate as soon as the global variable is set. This is precisely where I was fundamentally wrong! Processes don't share memory; even if the different worker processes saw the same initial value of the global variable, each would be updating its own local copy, thus defeating the purpose entirely! Not to mention, explicitly sharing the variable among processes is not really worth the effort.
On the other hand, threads (even with Python's GIL) serve our purpose to a great extent.
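Here is a tiny, hypothetical demonstration of why threads fit: every thread mutates the same module-level `match_found`, so one worker's update is immediately visible to the main thread. Run the same workers as separate processes, and the parent's copy would stay `None`:

```python
import threading

# All threads share this module-level variable; with multiprocessing,
# each worker process would mutate its own private copy instead.
match_found = None
lock = threading.Lock()

def worker(worker_id, succeeds):
    global match_found
    if succeeds:                      # pretend this worker's regex matched
        with lock:
            if match_found is None:   # first successful worker wins
                match_found = worker_id

# Only worker 2 "matches"; the others find nothing.
threads = [threading.Thread(target=worker, args=(i, i == 2)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(match_found)   # → 2: the update made inside a thread is visible here
```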

As you've probably guessed, the red dots correspond to multiprocessing and the green ones to multithreading. The separation is strikingly large!

Hope this post made you a little curious about the parallel world of python! Adios till my next adventure - happy coding in a parallel world. 😊

P.S.: All relevant code and readme files will be here.
