Bonjour! Yes, it has been quite a while again; but now, gear up for a bunch of exciting adventures in the parallel world as I explore a world of processors and memory and how they pull the strings of performance. Consequently, the mundane task of writing hardware-friendly code is entrusted to us programmers (as if we didn't have enough already!). Yosh, let's get started.
NUMA and cheesy pizzas
I went for dinner with a group of processors, and we decided to order a cheesy pizza. Hungry as we were, we figured that we would spend more time waiting to reach the pizza than eating it (fetching data takes more cycles than processing it). So each of us got a slice of our own (local memory). Now toss around the following ideas from my head:
- I love pizza, so I want my friend's slice as well (local access and remote access)
- But eating from my friend's slice is easier said than done (remote access has higher latency)
- Also, reaching out to the slice of a friend who is seated farther from me is more laborious than reaching my immediate neighbor's (remote memory latency is a function of processor distance)
Moving forward with the assumption that we can now talk in NUMA/pizza jargon, we have an upfront errand at hand - ensuring that our application takes advantage of this underlying mysterious architecture. If you're of the opinion that making a process access only its local memory is sufficient to score on your performance grade card, then I would like to quote Da Vinci in reply, "The greatest deception that men suffer, is from their own opinions". By the way, this idea is termed the "node-local" policy by the Linux kernel.
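To make the node-local idea concrete, here is a minimal sketch using Linux's libnuma (assuming the library is installed; the buffer size is arbitrary): pin the current thread to a node, then let the allocation land on that same node so every access is local.

```c
/* Minimal sketch of the node-local policy using libnuma.
 * Assumes libnuma is installed; compile with: gcc node_local.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return EXIT_FAILURE;
    }

    /* Pin the current thread to node 0 ... */
    numa_run_on_node(0);

    /* ... and allocate memory on the node we are running on,
     * so every access below is a local access */
    size_t size = 64 * 1024 * 1024;  /* arbitrary 64 MiB for illustration */
    char *buf = numa_alloc_local(size);
    if (!buf) {
        perror("numa_alloc_local");
        return EXIT_FAILURE;
    }

    for (size_t i = 0; i < size; i++)
        buf[i] = 1;  /* touches pages placed on our local node */

    numa_free(buf, size);
    return EXIT_SUCCESS;
}
```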
Summons to contest the "node-local" policy
1) One process, one core, one memory... always?
Consider a complex application that runs multiple complex processes on multiple core complexes and has a complex memory access pattern - a complex situation indeed. When threads spread across several nodes share the same data, that memory cannot be local to all of them at once, and the node-local policy has no good answer.
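One compromise for such shared data, sketched below with libnuma (again, a sketch under the assumption that libnuma is available, not a recipe): interleave the pages round-robin across all nodes, so every thread pays a fair mix of local and remote latencies instead of one node winning and the rest losing.

```c
/* Sketch: memory shared by threads on several nodes has no single
 * "local" node, so interleave its pages across all nodes instead.
 * Assumes libnuma; compile with -lnuma. */
#include <numa.h>

int main(void) {
    if (numa_available() < 0) return 1;

    size_t size = 64 * 1024 * 1024;  /* arbitrary size for illustration */

    /* Pages are distributed round-robin across all allowed nodes */
    void *shared = numa_alloc_interleaved(size);
    if (!shared) return 1;

    /* ... hand `shared` to worker threads spread over the nodes ... */

    numa_free(shared, size);
    return 0;
}
```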
2) Keep those threads on the move
Another approach involves migrating a thread to the node that holds the memory it keeps accessing remotely. However, one has to pay close attention to the trade-off between the overhead of migrating the thread and the remote memory access latency it saves.
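Here is a rough sketch of that idea with libnuma: ask the kernel which node a page actually lives on (numa_move_pages with a NULL node list is a query), then migrate the current thread to that node. The deliberate "far away" placement of the buffer is purely illustrative.

```c
/* Sketch of "move the thread to the data": query where a page lives,
 * then migrate the current thread there.
 * Assumes libnuma; compile with -lnuma. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) return 1;

    /* Some memory we suspect is remote (placed on the last node on purpose) */
    size_t size = 4096;
    char *data = numa_alloc_onnode(size, numa_max_node());
    if (!data) return 1;
    data[0] = 1;  /* touch it so the page is actually allocated */

    /* Query which node the page ended up on (nodes == NULL means "query") */
    void *pages[1] = { data };
    int status[1];
    if (numa_move_pages(0 /* this process */, 1, pages, NULL, status, 0) == 0
            && status[0] >= 0) {
        printf("page lives on node %d; migrating thread there\n", status[0]);
        /* Migration itself has a cost (cold caches, scheduler work), so it
         * only pays off if many accesses to this memory will follow. */
        numa_run_on_node(status[0]);
    }

    numa_free(data, size);
    return 0;
}
```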
3) Clash of caches
When the architecture has blessed us with independent processors, it seems unwise to cram all the threads of a process onto a single node, thereby increasing contention for shared resources like those prized last-level caches.
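A small sketch of the opposite tactic, assuming libnuma and pthreads (thread count and workload are placeholders): deal the worker threads out round-robin across the nodes, so each last-level cache serves only its share of them.

```c
/* Sketch: spread worker threads across NUMA nodes round-robin so they
 * do not all fight over one node's last-level cache.
 * Assumes libnuma; compile with: gcc spread.c -lnuma -lpthread */
#include <numa.h>
#include <pthread.h>

#define NTHREADS 8  /* placeholder thread count */

static void *worker(void *arg) {
    int node = (int)(long)arg;
    numa_run_on_node(node);   /* bind this thread to its assigned node */
    /* ... do cache-hungry work here; each node's LLC now serves only
     * the threads assigned to it ... */
    return NULL;
}

int main(void) {
    if (numa_available() < 0) return 1;

    int nodes = numa_max_node() + 1;
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(i % nodes));

    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```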
4) Asymmetric interconnect leads to asymmetric latencies
There might exist some out-of-this-world processors whose interconnects do not all offer the same bandwidth, making efficient scheduling of threads a hard nut to crack!
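You don't have to take the hardware's word for it: the kernel exposes relative inter-node distances, and libnuma's numa_distance() lets you print the whole matrix (10 means local; larger means farther and slower). A quick sketch, assuming libnuma:

```c
/* Sketch: print the inter-node distance matrix the kernel reports.
 * An asymmetric or uneven matrix hints at asymmetric interconnects.
 * Assumes libnuma; compile with -lnuma. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) return 1;

    int n = numa_max_node() + 1;
    printf("node distances (%d nodes):\n", n);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}
```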
I will be back to trouble you with more anecdotes about NUMA. Till such time, happy coding in a parallel world! 😊