Latency vs. throughput: in praise of chickens

So our story today starts off with a trip to city hall. Now, I can hear you thinking: “Daniel, where are you going with this? I thought you promised a wizard story this week?” Bear with me. I promise this post has something to do with AI, albeit tangentially, but I’ll get to that later. Let me offer you a question before I begin my story:

If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?

Seymour Cray

Most people would choose the oxen, but today I’m going to sing the praises of the chickens. By the end of it, I hope you’ll agree with me. Today I’ll be exploring the central idea behind parallel computing: latency vs. throughput.

A trip to city hall

Alright, imagine you need to renew your passport. You get all your documents ready, you make an appointment at city hall and you go there. As you arrive, an eerie suspicion creeps up on you: there is no parking spot to be found. Finally, already frustrated, you find a spot. As you walk in you are greeted by the sight of waaaay too many people who have been waiting for too long. “Damn it,” you say to yourself, “I knew it.” You go and get a number: 523. You look up at the screen displaying “Up Next”… 489. Well, that’s just great. Why does this always happen? Can’t they hire more people or something? Are they all really that incompetent? The problem is that city hall is simply optimising for something different (throughput) than you are (latency). Let me explain what I’m talking about.

Driving cross country

Let me first start off with some definitions, so we know what we’re talking about.

Definition (latency) 

The latency of a process is the time it takes for that process to complete. This is measured in units of time, in computing usually milliseconds (ms).

Definition (throughput) 

The throughput of a worker is the number of tasks that worker can complete per unit of time. This is measured in tasks per unit of time; in computing, often in FLoating-point Operations Per Second (FLOPS).

You can probably see that, although the two are closely related, they are not necessarily aligned. I’ll illustrate this with an example. Imagine you have to go to a concert with 40 people, but the concert is 4500 km from your home. Since flying is expensive, you decide to drive. You have two options. You can take a sports car, which can drive 130 km/h and seats two people, or you can take a bus, which seats 20 people but can only drive 50 km/h. Lucky for you, the venue of the concert has a teleporter that can send the vehicle back to your home, so we won’t have to take return trips into account. Let’s see how the two compare:

Vehicle      Latency (h)   Throughput (people/h)   Total time (h)
Sports car   34.6          0.0578                  692.3
Bus          90.0          0.222                   180.0

As you can see, even though an individual trip (the latency) takes more than twice as long by bus, the total time spent travelling is only about a quarter of the sports car’s: the car needs 20 trips to move 40 people, the bus only 2. This is the heart of the oxen versus chickens question. You sacrifice individual execution time, but because you have a lot of instances working together the overall time goes down. This is why your trip to city hall is always a pain. Institutions like these optimise everything for throughput instead of latency. They want their employees to always be working, which means they have to make sure there is always someone waiting for each employee. Doing this means that they can help more people per day, even if it means that individual people have to wait longer.
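If you want to check my arithmetic, here is a minimal Python sketch that recomputes latency, throughput and total time from the numbers in the example (4500 km, 40 people). The function name `trip_stats` and its parameters are my own invention for this illustration, and it assumes trips happen one after another with free return trips thanks to the teleporter.

```python
import math

def trip_stats(speed_kmh, seats, distance_km=4500, people=40):
    """Return (latency_h, throughput_people_per_h, total_time_h) for one vehicle.

    Hypothetical helper for the concert example; not from any library.
    """
    latency = distance_km / speed_kmh      # hours for a single one-way trip
    throughput = seats / latency           # people delivered per hour of driving
    trips = math.ceil(people / seats)      # return trips are free (teleporter)
    total_time = trips * latency           # assumption: trips run back to back
    return latency, throughput, total_time

for name, speed, seats in [("Sports car", 130, 2), ("Bus", 50, 20)]:
    latency, throughput, total = trip_stats(speed, seats)
    print(f"{name:<10} latency={latency:6.1f} h  "
          f"throughput={throughput:.4f} people/h  total={total:6.1f} h")
```

Play with the `seats` and `speed_kmh` arguments to see when the slower vehicle stops winning: the bus only pays off because there are enough people to fill it.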

How it works inside your computer

“Alright Daniel, all these metaphors and examples are very nice, but how does it really work?” Yes yes, I was getting to that. In your computer, you usually have two main components that can perform instructions: the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU). I’ll get to why it’s called a GPU in a moment. Going back to the question at the beginning, the CPU is your ox and the GPU is your collection of chickens. The CPU is optimised for latency. It is very good at executing instructions very quickly, but it works through them largely one after another. The GPU, on the other hand, is optimised for throughput: it has lots of components that can all perform instructions at the same time, but each of them is a lot less powerful.

Too good to be true

“Okay, that sounds great, so why don’t we use GPUs for everything?” I hear you ask. Because there’s a catch. There’s always a catch. This method only works if your job can be split into pieces that can be worked on independently. Imagine you have an ice cream cone with lots of scoops stacked on top of each other. Adding more people to help you eat it doesn’t help, because nobody can start eating their scoop before the person before them has finished theirs. In this scenario, you’re better off with one person who can eat fast than with a lot of slower eaters.

The same is true for computing. Working on a GPU only makes sense if your job involves simpler tasks that can be performed independently from each other. Subjects where parallel computing can benefit you include:

  • Image processing and animations. For example, if you want to convert a photo to grayscale, you have to perform a relatively simple computation on every pixel. An animation is actually just some geometric transformation applied to a sprite or model. Doing this requires a lot of calculations on all the points involved. This is also why it is called a GPU: it’s usually used to do all the work regarding graphical elements.
  • (Scientific) simulations. For example, if you want to simulate how a certain protein folds, you have to do a lot of calculations on billions of different points.
  • Data analysis. If you want to know how points relate to each other, you have to perform a lot of computation on those individual points.

In these cases, working on the GPU gives you a huge performance increase. Remember the FLOPS I mentioned in the definition? This is why throughput is usually measured in FLOPS: in many situations it’s all about how many calculations you can do on different points. In other scenarios, where each computation depends on the results of previous parts of the programme, you’re better off sticking to the CPU.
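To make the difference concrete, here is a small Python sketch of both kinds of job. The grayscale conversion is “chicken work”: every pixel can be handled without looking at any other pixel, so the image can be cut into chunks and handed out. The second function is the ice cream cone: every step needs the result of the previous one. All function names are made up for this post, the luminance weights (0.299, 0.587, 0.114) are one common choice rather than the only one, and the thread pool merely stands in for the many workers of a GPU. In standard Python, threads won’t actually make this arithmetic faster; the point is the structure of the work.

```python
from concurrent.futures import ThreadPoolExecutor

def to_gray(pixel):
    """Brightness of one (r, g, b) pixel; needs no other pixel."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def gray_chunk(chunk):
    """Convert one chunk of pixels; chunks do not depend on each other."""
    return [to_gray(p) for p in chunk]

def grayscale_parallel(pixels, workers=4):
    """Split the image into chunks and give each chunk to a worker."""
    size = max(1, -(-len(pixels) // workers))        # ceiling division
    chunks = [pixels[i:i + size] for i in range(0, len(pixels), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(gray_chunk, chunks)       # map preserves chunk order
        return [value for chunk in results for value in chunk]

def collatz_steps(n):
    """Count steps until n reaches 1; each step needs the previous result,
    so the work cannot be handed out to several workers."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255), (0, 0, 0)]
# The chunked version gives the same answer as doing it pixel by pixel.
assert grayscale_parallel(pixels) == [to_gray(p) for p in pixels]
print(grayscale_parallel(pixels))
print(collatz_steps(27))
```

On a real GPU you would write the per-pixel function once and let the hardware run it on thousands of pixels at the same time; the step-by-step loop gains nothing from that.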

Why am I telling you all of this?

Because I thought you wanted to learn something.

How does this relate to AI?

Often in AI you have to train your programmes, or at the very least process huge amounts of data. For example, when you are training a neural network, searching for a pattern within your data or working through a hedging algorithm, parallelisation can increase your performance significantly. I feel like optimisations like this are often a bit of an afterthought in AI. Maybe that’s because devs are expensive, and they feel that every hour they spend optimising is an hour they can’t spend training. I don’t know how those two things relate (especially cost-wise), but I thought you should at least know about the flip side.

I hope I have given you a good introduction to the general idea behind parallel computing, and that you’ve learned to embrace the chickens. Who knows, maybe you can consider a bit of parallelisation next time you have to train your AI. I’ll probably try to dive a bit more deeply into this subject at some point. If you have any questions left, let me know somewhere on the internet, and I’ll try to address them as best I can.

