Wednesday, 19 April 2017

Bacons and Pickles!

For my last adventure in the parallel world, I decided to do something special - get all of you hungry, and also curious about the The Oracle of Bacon. I happened to find this interesting git repository, which calculates the bacon number of actors picked up from the IMDb database. Yes I know; I could not resist a temptation to mess with the code.

As the author has very well stated, this problem can be reduced to that of doing a BFS on an undirected graph, with actors as vertices and common movies as edges. Unfortunately, the immense dataset adversely impacts performance. Putting on profiler glasses, gives us pretty much the same view: maximum time to load input data. But wait, why pickle?

cProfiler for python

The Elephant coming into the Room

After reading almost every post containing 'pickle' on stackoverflow.com, I ventured to alter a few things.

  • Disabling garbage collector : Unreferenced objects invite the garbage collector which might in turn, slower the input. [Source
  • JSON as input : After a night's struggle, I managed to convert the pickle input file into a JSON. This would obviously cause problems when a new dataset is used, but the difference in performance is worth considering.

Input files converted to JSON

A long-shot at parallelism

Having scored with memory, I questioned myself if there was some scope for parallelism. Intuitively, it seemed like a bad idea because the graph was being implemented with huge, in-memory dictionaries, that were perplexing to deal with. Nevertheless, I gave it a try with the following approach.

  1. Given an actor, find his/her co-stars using the actor-to-movies and movies-to-actors mappings.
  2. If you are lucky enough to find Mr.Kevin Bacon in your co-stars, return 1.
  3. Else, do a parallel BFS on his/her co-stars till you find a minimum link. (The actor's bacon number will be that of your co-star's, incremented by 1)
Much to my disappointment, due to a number of factors like python's GIL, the huge dictionaries (being passed as arguments to processes), basic structure of underlying code, I was not very successful. All the same, documentation of these attempts and others are available here.

Happy coding in a parallel world 😊








No comments:

Post a Comment