How data centres will quickly and economically harness AI
By Matias Peluffo, VP, Building and Data Centre Connectivity, APAC, CommScope
Monday, 16 October, 2023
When popular science fiction depicts the ‘rise of machine intelligence’, it usually comes with lasers, explosions or, in some of the gentler examples, a mild philosophical dread. But there can be no doubt that interest in the possibilities of artificial intelligence (AI) and machine learning (ML) in real-life applications is on the rise, and new applications are popping up daily.
Millions of users globally are already engaging with AI using ChatGPT, Bard and other AI interfaces. But most of these users don’t realise that their cosy desktop exchanges with a curious AI assistant are actually driven by massive data centres all over the world.
Enterprises are investing in AI clusters within their data centres, building, training and refining their AI models to suit their business strategies. These AI cores are composed of racks upon racks of GPUs (graphics processing units) that provide the incredible parallel processing power that AI models require for the exhaustive training of their algorithms.
Once the data sets are imported, inference AI analyses that data and makes sense of it. This is the process that determines whether an image contains a cat or a small dog, based on training about which characteristics are common to cats but not dogs. Generative AI can then process that data to create entirely new images or text.
It’s this ‘intelligent’ processing that has captured the imaginations of people, governments and enterprises everywhere — but creating a useful AI algorithm requires vast amounts of data for training purposes, and this is an expensive and power-intensive process.
Efficient training is where it begins
Data centres generally maintain discrete AI and compute clusters, which work together to process the data that trains the AI algorithm. The amount of heat generated by these power-hungry GPUs limits how many can be housed together in a given rack space, so optimising the physical layout is a must, in order to reduce heat and minimise link latency.
AI clusters require a new data centre architecture. The GPU servers require much more connectivity between servers, but there are fewer servers per rack due to power and heat constraints. The result is far more inter-rack cabling than in a traditional data centre, with links that require 100 to 400G at distances that cannot be supported by copper.
When training a large-scale AI, it’s generally held that about 30% of the required time is consumed by network latency, while the remaining 70% is spent on compute time. Since training a large model can cost up to $10 million, this networking time represents a significant cost. Even a latency saving of 50 nanoseconds, or 10 metres of fibre, is significant, and nearly all the links in AI clusters are limited to 100-metre reaches.
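The 50-nanosecond figure can be sanity-checked with a quick calculation, assuming light in glass fibre propagates at roughly two-thirds of its vacuum speed (refractive index of about 1.47, or close to 5 ns per metre):

```python
# Rough one-way propagation delay in optical fibre. Light travels at
# c/n in glass; with a refractive index of ~1.47 that is about 5 ns
# per metre, so a 10 m run costs roughly 50 ns.
C_VACUUM_M_PER_NS = 0.2998   # speed of light in vacuum, metres per nanosecond
REFRACTIVE_INDEX = 1.47      # typical for silica fibre

def fibre_delay_ns(length_m: float) -> float:
    """Approximate one-way propagation delay for a fibre run of length_m metres."""
    return length_m / (C_VACUUM_M_PER_NS / REFRACTIVE_INDEX)

print(f"10 m of fibre adds ~{fibre_delay_ns(10):.0f} ns of latency")
```

This is why trimming even a few metres from each link matters at the scale of an AI cluster.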
Trimming metres, nanoseconds and watts
Operators should carefully consider which optical transceivers and fibre cables they will use in their AI clusters to minimise cost and power consumption. Because fibre runs must be as short as possible, the optics cost will be dominated by the transceivers. Transceivers that use parallel fibre have an advantage in that they do not require the optical multiplexers and demultiplexers used for wavelength division multiplexing, which results in both lower cost and lower power. The transceiver savings more than offset the small increase in cost of a multifibre cable over a duplex fibre cable. For example, 400G-DR4 transceivers with eight-fibre cables are more cost effective than 400G-FR4 transceivers with duplex fibre cable.
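The trade-off described above can be sketched as a simple per-link cost model. The prices below are purely illustrative placeholders chosen to show the structure of the comparison; they are not real market or CommScope figures:

```python
# Illustrative per-link cost model for the parallel-fibre vs WDM
# trade-off. All prices are hypothetical placeholders, used only to
# show how cheaper parallel-fibre transceivers can outweigh the
# higher cost of a multifibre cable.
def link_cost(transceiver_cost: float, cable_cost: float) -> float:
    """Total optics cost for one link: two transceivers plus one cable."""
    return 2 * transceiver_cost + cable_cost

# Parallel fibre (e.g. 400G-DR4): cheaper transceiver, pricier 8-fibre MPO cable.
parallel = link_cost(transceiver_cost=600.0, cable_cost=120.0)

# WDM (e.g. 400G-FR4): mux/demux raises transceiver cost; duplex cable is cheap.
wdm = link_cost(transceiver_cost=800.0, cable_cost=60.0)

print(f"parallel-fibre link: {parallel:.0f}, WDM link: {wdm:.0f}")
```

Because each link carries two transceivers but only one cable, the transceiver price dominates, which is the point the paragraph above makes.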
Links up to 100 m can be supported by either singlemode or multimode fibre. Advances like silicon photonics have reduced the cost of singlemode transceivers, but for high-speed (400G+) transceivers, a singlemode transceiver still costs roughly double an equivalent multimode transceiver. And while multimode fibre itself costs slightly more than singlemode fibre, the difference in overall cable cost is small, since multifibre cable costs are dominated by the MPO connectors.
In addition, high-speed multimode transceivers use one to two watts less power than their singlemode counterparts. With up to 768 transceivers in a single AI cluster, a setup using multimode fibre will save up to 1.5 kW. This may seem small compared to the 10 kW that each GPU server consumes, but for AI clusters any opportunity to save power can deliver significant savings over the course of AI training and operation.
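The saving quoted above follows directly from the article's own figures, as a quick back-of-envelope check shows:

```python
# Back-of-envelope cluster-wide power saving for multimode optics,
# using the figures from the article: up to 768 transceivers in a
# single AI cluster, each drawing 1-2 W less than its singlemode
# counterpart. Taking the 2 W upper bound:
TRANSCEIVERS_PER_CLUSTER = 768
SAVING_PER_TRANSCEIVER_W = 2.0

saving_kw = TRANSCEIVERS_PER_CLUSTER * SAVING_PER_TRANSCEIVER_W / 1000
print(f"Cluster-wide saving: ~{saving_kw:.1f} kW")
```

Small per-device differences compound quickly at this scale, which is why the article treats optics power as worth optimising despite the much larger GPU draw.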
Transceivers vs AOCs
Many AI/ML clusters and HPCs use active optical cables (AOCs) to interconnect GPUs and switches. An active optical cable is a fibre cable with integrated optical transmitters and receivers on either end. Most AOCs are used for short reaches and typically use multimode fibre and VCSELs. High-speed (>40G) AOCs use the same OM3 or OM4 fibre found in the cables that connect optical transceivers. The transmitters and receivers in an AOC may be similar to those in analogous transceivers, but they can be lower-grade parts: they don't need to meet rigorous interoperability specs, only to operate with the specific unit attached to the other end of the cable. And since no optical connectors are accessible to the installer, the skills required to clean and inspect fibre connectors are not needed.
The downside of AOCs is that they lack the flexibility offered by transceivers. Installing AOCs is time-consuming because the cable must be routed with the transceivers attached, and correctly installing AOCs with breakouts is especially challenging. The failure rate for AOCs is double that of equivalent transceivers, and when an AOC fails, a new one must be routed through the network, cutting into compute time. Finally, when it is time to upgrade the network links, the AOCs must be removed and replaced with new ones. With transceivers, the fibre cabling is part of the infrastructure and may remain in place for several generations of data rates.
AI/ML has arrived, and it’s only going to become a more important and integrated part of the way people, enterprises and devices interact with one another. Interfacing with AI can literally happen in the palm of your hand; however, this is dependent on the large-scale data centre infrastructure that drives it. Enterprises that train AI quickly and efficiently will have a competitive advantage in our fast-changing, super-connected world. Investing today in the advanced fibre infrastructure that drives AI training and operation will deliver incredible results tomorrow.