A 'wait loss' system for data centres

By Larry Hardesty
Friday, 15 August, 2014

A system developed by MIT researchers could reduce data-transmission delays across server farms by 99.6%.

Big websites usually maintain their own data centres, banks of tens or even hundreds of thousands of servers, all passing data back and forth to field users’ requests. Like any big, decentralised network, data centres are prone to congestion: packets of data arriving at the same router at the same time are put in a queue, and if the queues get too long, packets can be delayed.

At the annual conference of the ACM Special Interest Group on Data Communication, to be held next week in Chicago, MIT researchers will present a research paper on a new network management system that, in experiments, reduced the average queue length of routers in a Facebook data centre by 99.6% - virtually doing away with queues.

When network traffic was heavy, the average latency - the delay between the request for an item of information and its arrival - shrank nearly as much, from 3.56 microseconds to 0.23 microseconds.

Like the internet, most data centres use decentralised communication protocols: each node in the network decides, based on its own limited observations, how rapidly to send data and which adjacent node to send it to.

Decentralised protocols have the advantage of an ability to handle communication over large networks with little administrative oversight.

The MIT system, dubbed Fastpass, instead relies on a central server called an ‘arbiter’ to decide which nodes in the network may send data to which others during which periods of time.

“It’s not obvious that this is a good idea,” says Hari Balakrishnan, the Fujitsu Professor in Electrical Engineering and Computer Science and one of the paper’s co-authors.

With Fastpass, a node that wishes to transmit data first issues a request to the arbiter and receives a routing assignment in return.

“If you have to pay these maybe 40 microseconds to go to the arbiter, can you really gain much from the whole scheme?” says Jonathan Perry, a graduate student in electrical engineering and computer science (EECS) and another of the paper’s authors. “Surprisingly, you can.”

Balakrishnan and Perry are joined on the paper by Amy Ousterhout, another graduate student in EECS; Devavrat Shah, the Jamieson Associate Professor of Electrical Engineering and Computer Science; and Hans Fugal of Facebook.

Division of labour

The researchers’ experiments indicate that an arbiter with eight cores can keep up with a network transmitting 2.2 terabits of data per second. That’s the equivalent of a 2000-server data centre with gigabit-per-second connections transmitting at full bore all the time.

“This paper is not intended to show that you can build this in the world’s largest data centres today,” Balakrishnan says. “But the question as to whether a more scalable centralised system can be built, we think the answer is yes.”

Moreover, “the fact that it’s two terabits per second on an eight-core machine is remarkable,” Balakrishnan says. “That could have been 200 gigabits per second without the cleverness of the engineering.”

Data farm concept diagram

This diagram of data transfer in a server farm demonstrates how Fastpass selects routes through the network so as to avoid congestion at any point between the sender and the receiver. Illustration: Christine Daniloff/MIT

The key to Fastpass’s efficiency is a technique for splitting up the task of assigning transmission times so that it can be performed in parallel on separate cores. The problem, Balakrishnan says, is one of matching source and destination servers for each time slot.

“If you were asked to parallelise the problem of constructing these matchings,” he says, “you would normally try to divide the source-destination pairs into different groups and put this group on one core, this group on another core and come up with these iterative rounds. This system doesn’t do any of that.”

Instead, Fastpass assigns each core its own time slot, and the core with the first slot scrolls through the complete list of pending transmission requests. Each time it comes across a pair of servers, neither of which has received an assignment, it schedules them for its slot. All other requests involving either the source or the destination are simply passed on to the next core, which repeats the process with the next time slot. Each core thus receives a slightly attenuated version of the list the previous core analysed.

Bottom line

Today, to avoid latencies in their networks, most data centre operators simply sink more money into them. Fastpass “would reduce the administrative cost and equipment costs and pain and suffering to provide good service to the users”, Balakrishnan says. “That allows you to satisfy many more users with the money you would have spent otherwise.”

Networks are typically evaluated according to two measures: latency, or the time a single packet of data takes to traverse the network, and throughput, or the total amount of data that can pass through the network in a given interval.

“It used to be that everybody was only bothered about throughput,” says George Varghese, a principal researcher and partner at Microsoft Research who studies networking. “But latency is the new frontier for everybody.”

Throughput is the relevant figure when data tends to traverse the network in large chunks, Varghese explains. But in data centres, that’s less and less common. “There are many parts to serving a web page, and they’re all generally pushed onto different machines,” he says.

“Latency matters because when these machines are conspiring to do work on your behalf, any time a machine has to wait for a while before getting something from a partner machine, that machine is idle.”

The crucial question is whether centralised network-management systems like Fastpass can handle the volume of requests generated by the largest of today’s server farms.

“It’s sufficiently surprising that it even scales to a pretty large network of 1000 switches,” Varghese says. “Nobody probably would have expected that you could have this complete control over both when you send and where you send. And they’re scaling by adding multiple cores, which is promising.

“So once the elephant is out of the tent, it is not inconceivable that there are some simple tricks that will let this scale.”

Reprinted with permission of MIT News.

A 'wait loss' system for data centres

Division of labour

Bottom line

AI in Australia must be built as a data system, not just a compute system

Why Australia's data-centre growth must be matched by cybersecurity investment

Sustainable AI infrastructure could become Australia's next great export

Content from other channels on our network