Friday, 11 May 2018

Exploring NewMA!

Hola amigos :) I admit it has been quite a while since my last post. Thanks to a bunch of mundane chores that kept me occupied, I never realised how it got so late, so soon. Yes, that was a Dr. Seuss quote to get the mood going. Anyways, welcome to my next adventure in the parallel world, one that further explores our NUMA architecture, which got introduced to us in my previous post.

So here is the premise - we have the chummy, supersonic local memory in contrast to the surly, slothful remote memory; followed by the important question that is probably in your head: what is the big deal?

Imagine you are a waiter at a pizzeria and I tell you that the cheesy pizzas are actually being prepared in the kitchen of the building next door (remote memory) rather than in your own (local memory). The delay in delivering the cheesy pizzas (memory accesses) to the hungry customers will adversely affect the business of the pizzeria.

Likewise, it is the application running on top of NUMA that is afflicted by remote memory access latency. All said and done, we don't want our beloved applications to be unhappy at the end of the day (a.k.a. "the big deal").

So, we picked this really cool database application called Redis and analysed how its performance is impacted when its memory accesses are forced to come from local and remote memory respectively. The results, depicted in the following graphs, were indeed quite astonishing.



CPI (cycles per instruction) is the average number of CPU cycles consumed to execute one instruction. Evidently, the lower the CPI, the happier our application is, which is the case when it is primarily accessing local memory.
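
In case you want to stage a similar local-versus-remote tug-of-war yourself, here is a minimal sketch of the idea - not the exact harness behind the Redis graphs above. It assumes a Linux machine with the libnuma library installed and at least two nodes, numbered 0 and 1: pin yourself to node 0, then time a walk over a buffer living on node 0 versus one living on node 1.

    /* local_vs_remote.c -- gcc local_vs_remote.c -o lvr -lnuma
     * Pin ourselves to node 0, then time a walk over a buffer placed on
     * node 0 (local) versus node 1 (remote). Node numbers are assumptions. */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define BUF_SIZE (256UL * 1024 * 1024)   /* big enough to defeat the caches */

    static double walk(volatile char *buf)
    {
        struct timespec t0, t1;
        unsigned long sum = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < BUF_SIZE; i += 64)   /* one load per cache line */
            sum += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sum;
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        if (numa_available() < 0 || numa_max_node() < 1) {
            fprintf(stderr, "need a NUMA machine with at least two nodes\n");
            return 1;
        }
        numa_run_on_node(0);                             /* run on node 0's CPUs */

        char *local  = numa_alloc_onnode(BUF_SIZE, 0);   /* memory on node 0 */
        char *remote = numa_alloc_onnode(BUF_SIZE, 1);   /* memory on node 1 */
        if (!local || !remote) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }
        memset(local, 1, BUF_SIZE);                      /* fault the pages in */
        memset(remote, 1, BUF_SIZE);

        printf("local : %.3f s\n", walk(local));
        printf("remote: %.3f s\n", walk(remote));

        numa_free(local, BUF_SIZE);
        numa_free(remote, BUF_SIZE);
        return 0;
    }

On a two-node machine the remote walk should come out visibly slower - the very gap the graphs above are complaining about.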

So how do we appease our application, flatter it perhaps? Stay tuned while I find a way to this guy's heart. Meanwhile, happy coding in a parallel world! 😊

Sunday, 4 February 2018

The NewMA architecture

Bonjour! Yes, it has been quite a while again; but now, gear up for a bunch of exciting adventures in the parallel world as I explore a world of processors and memory, and how they pull the strings of performance. Consequently, the mundane task of writing hardware-friendly code is entrusted to us programmers ('cause we didn't have enough to do already!). Yosh, let's get started.

NUMA and cheesy pizzas
I went for dinner with a group of processors, and we decided to order a cheesy pizza. Hungry as we were, we figured that we would spend more time waiting to reach the pizza than eating it (fetching data takes more cycles than processing it). So each of us got a slice of our own (local memory). Now toss around the following ideas from my head.

  • I love pizza, so I want my friend's slice as well (local access and remote access)
  • But eating from my friend's slice is easier said than done (remote access has higher latency)
  • Also, reaching out to the slice of a friend seated farther from me is more laborious than reaching for my immediate neighbor's (remote memory latency is a function of processor distance - see the sketch below)
Any thoughts on how I can get more of that cheesy pizza quickly? Furthermore, if you aren't a big fan of pizza, you can always read this.
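
And if you'd rather see the seating plan as numbers than as pizza, libnuma can print the distance table the firmware reports - 10 means local, and bigger numbers mean a longer reach. A tiny sketch, again assuming a Linux box with libnuma installed:

    /* distances.c -- gcc distances.c -o distances -lnuma
     * Print the NUMA distance matrix: 10 means local, larger means farther. */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA here - just one big happy pizza\n");
            return 1;
        }
        int nodes = numa_num_configured_nodes();
        printf("      ");
        for (int j = 0; j < nodes; j++)
            printf("node%-3d", j);
        printf("\n");
        for (int i = 0; i < nodes; i++) {
            printf("node%-2d", i);
            for (int j = 0; j < nodes; j++)
                printf("%-7d", numa_distance(i, j));   /* distance from node i to node j */
            printf("\n");
        }
        return 0;
    }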

Moving forward with the assumption that we can now talk in NUMA/pizza jargon, we have an upfront errand at hand - ensuring that our application takes advantage of this mysterious underlying architecture. If you're of the opinion that making a process access only its local memory is sufficient to score on your performance grade card, then I would like to quote Da Vinci in reply: "The greatest deception that men suffer is from their own opinions". By the way, this idea is termed the "node-local" policy by the Linux kernel.
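
A quick aside on what "node-local" actually boils down to in practice: by default the kernel follows a first-touch rule, so a page is placed on the node of the CPU that first writes to it. Here is a little sketch that makes the rule visible (Linux + libnuma again; numa_move_pages is only being used as a "which node does this page live on?" query):

    /* first_touch.c -- gcc first_touch.c -o first_touch -lnuma
     * The default "node-local" policy in action: a freshly mapped page is
     * placed on the node of the CPU that first touches it.                 */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        if (numa_available() < 0) return 1;

        numa_run_on_node(0);                    /* park this thread on node 0 */

        long pagesz = sysconf(_SC_PAGESIZE);
        char *buf = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;
        buf[0] = 42;                            /* first touch: the page gets allocated now */

        void *pages[1] = { buf };
        int status[1];
        /* with a NULL target list, numa_move_pages only reports where pages live */
        numa_move_pages(0, 1, pages, NULL, status, 0);

        printf("touched from CPU %d -> page lives on node %d\n",
               sched_getcpu(), status[0]);
        munmap(buf, pagesz);
        return 0;
    }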

Summons to contest the "node-local" policy

1) One process, one core, one memory... always?
Consider a complex application that runs multiple complex processes on multiple core complexes and has a complex memory access pattern - leads to a complex situation indeed.

2) Keep those threads on the move
Another approach involves migrating a thread to the processor whose memory it keeps reaching into, so that its accesses become local again. However, one has to pay close attention to the trade-off between the overhead of such a context switch and the remote memory access latency it saves.
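
To make that concrete, here is a hedged sketch of the "chase the data" half of the trade-off: ask which node a buffer's page lives on, then reschedule the calling thread onto that node (Linux + libnuma; the migration itself is exactly the context-switch cost you would be weighing against the remote-access latency).

    /* follow_the_data.c -- gcc follow_the_data.c -o follow -lnuma
     * "Migrate the thread to the memory": find the node holding a buffer's
     * first page and reschedule the calling thread onto that node.          */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>

    /* Move the calling thread next to `addr`; returns the node, or -1. */
    static int follow_the_data(void *addr)
    {
        void *pages[1] = { addr };
        int node = -1;
        if (numa_move_pages(0, 1, pages, NULL, &node, 0) != 0 || node < 0)
            return -1;                  /* query failed, stay where we are */
        numa_run_on_node(node);         /* the context switch we pay for   */
        return node;
    }

    int main(void)
    {
        if (numa_available() < 0 || numa_max_node() < 1) return 1;

        numa_run_on_node(0);                        /* start life on node 0         */
        char *buf = numa_alloc_onnode(1 << 20, 1);  /* ...but the data is on node 1 */
        if (!buf) return 1;
        buf[0] = 1;                                 /* fault the page in            */

        int node = follow_the_data(buf);
        printf("thread now on CPU %d (node %d), next to its data\n",
               sched_getcpu(), node);

        numa_free(buf, 1 << 20);
        return 0;
    }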

3) Clash of caches
When the architecture has blessed us with independent processors, it seems unwise to cram all the threads of a process onto a single node, thereby increasing contention for shared resources like those prized last-level caches.
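
For completeness, here is one (very simplified) way to act on that instinct: deal worker threads out round-robin across the nodes instead of piling them onto one, so they stop elbowing each other in a single last-level cache. The worker body is just a stand-in; pthreads plus libnuma assumed.

    /* spread_workers.c -- gcc spread_workers.c -o spread -lnuma -lpthread
     * Place worker threads round-robin across NUMA nodes instead of piling
     * them all onto one node's last-level cache.                            */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NWORKERS 8

    static void *worker(void *arg)
    {
        int node = (int)(long)arg;
        numa_run_on_node(node);   /* this thread now runs on `node`'s CPUs */
        /* real work goes here, ideally on numa_alloc_local() memory       */
        printf("worker on node %d, cpu %d\n", node, sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        if (numa_available() < 0) return 1;
        int nodes = numa_num_configured_nodes();

        pthread_t tid[NWORKERS];
        for (long i = 0; i < NWORKERS; i++)          /* deal workers out round-robin */
            pthread_create(&tid[i], NULL, worker, (void *)(i % nodes));
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }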

4) Asymmetric interconnect leads to asymmetric latencies
There might exist some out-of-this-world processors whose interconnects are not all of the same bandwidth, making efficient scheduling of threads a hard nut to crack!

I will be back to trouble you with more anecdotes about NUMA. Till such time, happy coding in a parallel world! 😊