July 8, 2024

Adapting Data Center Interconnect for the AI Data Deluge


Artificial intelligence (AI) is undoubtedly reshaping the world, from education and healthcare to finance and foreign policy. As enterprises and data center providers continue to adopt purpose-built data center infrastructure platforms (like the NVIDIA DGX SuperPOD) that enable the use of AI at scale, the pressure on data center optical networking engineers is mounting. As a result, optical transceiver technology is once again in the spotlight. Here’s how.

The Deluge of AI Data Creates Challenges for DCI and Intra-Data Center Architectures

Across numerous verticals, enterprises are increasingly adopting data-intensive AI workloads, with some of the largest generative AI applications relying on trillions of parameters. From large language modeling and AI training to high-performance computing (HPC) simulations, these workloads generate vast amounts of data that must be processed, analyzed, and routed both within and between today’s data centers. As a result, data center network engineers are under immense pressure to optimize both their data center interconnect (DCI) architectures and their intra-data center networks to accommodate the soaring demands of AI-driven applications today while planning for the future.

In a recent article from Data Center Frontier, Sameh Boujelbene, VP at Dell’Oro Group, said that the number of parameters handled by generative AI applications is increasing tenfold annually. As enterprise and hyperscale data centers, and even colocation providers, begin deploying thousands or even “hundreds of thousands of accelerated nodes,” per Boujelbene, data center networks will be hard-pressed to keep up with the ongoing deluge of data. As a case in point, Peter Jones, Chairman of the Ethernet Alliance, notes that hyperscale data center network boundaries are “crossing over the Terabits per second threshold.” Recent Omdia research notes that AI use will significantly increase the adoption of optical technologies in data centers of all sizes, with lower data rates like 100G and 200G declining as higher rates like 400G, 800G, and 1.6T increase.

What does this mean for DCI and the network fabrics within today’s data centers? The answer lies in the adoption of higher data rates like 800G and even 1.2T and 1.6T. From an intra-data center connectivity perspective, it also means increased focus on transceiver design to enhance scalability, reduce latency, and improve energy efficiency, thereby meeting the high-speed, low-latency, data-intensive demands of AI applications.

The Modern Face of DCI and Intra-Data Center Connectivity: 800G and New, Energy-Efficient Interfaces

With 400G transceiver variants already well-established in the market, data center operators are beginning to look at deploying 800G technologies to prepare for the vast volumes of data that generative AI and other HPC applications will yield. The OFC, in its 2024 Post-Show Report, notes that 800G variant installations will accelerate in 2024 and beyond “in order to support AI backend networks as well as the general Ethernet network that supports all data center workloads.”

When it comes to interoperable, coherent 800G standards, the OIF’s work on 800ZR and 800LR stands out. The OIF 800ZR standard is a specification for 800G Ethernet over a single wavelength using coherent optics. Intended for DCI applications, it can be used to connect data centers over distances up to 80km. The 800LR standard concentrates on 800G Ethernet transmission over a single wavelength for distances up to 10km. That makes it suitable for campus or intra-DC applications, which, as Dell’Oro notes, will be an important frontier in data center network evolution as AI’s growing bandwidth requirements “drive the need for optical 800G transceivers inside data centers.”
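To make the reach distinction concrete, here’s a minimal sketch of how an engineer might map link distance to the appropriate OIF variant. The helper function and its thresholds are illustrative assumptions based on the spec reaches cited above, not part of any real tool or library:

```python
# Illustrative only: maps link reach to the OIF 800G coherent variant
# discussed above. The helper and its thresholds are assumptions based on
# the spec reaches cited in this article, not drawn from any real library.

def select_800g_variant(link_km: float) -> str:
    """Suggest an OIF 800G coherent interface class for a given link length."""
    if link_km <= 10:
        return "800LR"  # campus / intra-data center reach (up to 10km)
    if link_km <= 80:
        return "800ZR"  # DCI reach (up to 80km)
    return "beyond 800ZR reach; amplified line-system transport may be needed"

for distance_km in (2, 40, 120):
    print(f"{distance_km}km -> {select_800g_variant(distance_km)}")
```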

From an intra-data center perspective, Linear Pluggable Optics (LPO) and Co-Packaged Optics (CPO) are both serious contenders for revolutionizing the typical power and latency equation for data centers. For its part, linear drive technology, which LPO links leverage, eliminates the need for complex digital signal processing (DSP) on the optics by relying on the SerDes DSP in the switch chip for digital formatting. The DSP on the switch ASIC drives an optical engine on the pluggable optic that includes only linear amplifiers. As a result, LPO power consumption is much lower than that of conventional pluggable variants, making it an attractive choice as data center operators employ more transceivers within their networks in response to the demands of AI.
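As a back-of-the-envelope illustration of why that matters at AI scale, the sketch below compares per-module power for a conventional DSP-based pluggable against an LPO module. All wattages and the port count are placeholder assumptions for illustration, not measured or vendor-quoted figures:

```python
# Back-of-the-envelope comparison of a conventional DSP-based 800G pluggable
# vs. an LPO module. All wattages and the port count are illustrative
# placeholder assumptions, not measured or vendor-quoted values.

PORTS = 512            # assumed port count for an AI back-end fabric

dsp_module_w = 16.0    # assumed: conventional pluggable with onboard DSP
lpo_module_w = 9.0     # assumed: LPO module (linear amplifiers only, no DSP)

savings_per_port_w = dsp_module_w - lpo_module_w
fabric_savings_kw = savings_per_port_w * PORTS / 1000

print(f"Per-port savings: {savings_per_port_w:.1f} W")
print(f"Across {PORTS} ports: {fabric_savings_kw:.2f} kW")
```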

CPO technology can also offer significant benefits from a power consumption and latency perspective. Unlike the LPO design, CPO optics integrate optical engines directly with switch ASICs – all within the same package. In this way, CPOs enable short, low-loss communication between the chip and the optical engine, allowing network operators to reduce the number of DSPs they rely on and thereby lower power consumption. After all, the DSP can drive up overall system power by as much as 25-30%. CPOs also reduce latency, both by removing the long copper traces between the ASIC and the optics and by enabling the use of fewer DSPs.
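Treating that 25-30% figure as the DSP’s share of a module’s power budget (a simplifying assumption, with a placeholder total), a quick calculation shows roughly how much power removing or consolidating DSPs could recover:

```python
# Rough estimate of the power attributable to the DSP, using the 25-30%
# share cited above. The total module power is a placeholder assumption,
# and treating the share as per-module is a simplification.

module_power_w = 16.0             # assumed total power of a DSP-based module

for dsp_share in (0.25, 0.30):    # DSP share of power, per the article
    dsp_w = module_power_w * dsp_share
    print(f"DSP share {dsp_share:.0%}: ~{dsp_w:.1f} W of {module_power_w:.1f} W")
```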

That said, both LPOs and CPOs are still in the early stages of development and demonstration. For example, concerns around switch chip support, system design, and interoperability still abound regarding LPOs. However, an LPO MSA has been established, which provides a degree of confidence that such issues will be ironed out in the not-too-distant future. For both LPO and CPO, the goal is the same: finding new solutions that make optics more efficient by lowering power consumption, latency, and cost.

As AI applications generate increasingly diverse and complex data streams, data center operators need optical solutions that can accommodate and efficiently transport these data payloads across their networks. In our whitepaper, titled Network Operator’s Guide: The Latest Advancements in 400G and 800G, we discuss the opportunities and challenges surrounding the adoption of 800G in modern networks, including DCI. The innovations around each technology add a host of systems integration considerations to the already complex demands of adapting a data center to leverage purpose-built AI computational infrastructure. Even so, it’s becoming necessary for data center network engineers to turn to 800G transceivers in their quest to future-proof their infrastructure and ensure scalability in the face of escalating data demands.

Adapting to AI in the Data Center Requires Deep-Seated Expertise and Partnerships

Whether you’re encountering challenges with 400G adoption or planning to deploy 800G optics, there’s a lot to consider. Transceiver power requirements, form factor popularity, multi-vendor interoperability, and network orchestration are just some of the issues forward-thinking data center operators encounter as they seek to adapt their networks for the demands of AI workloads. That’s where we can help. Our team of expert engineers has deep-seated expertise in all aspects of systems integration, network architectures, and the optical transceivers that go in them. With robust testing and our consultative approach, we have a proven track record of turning our customers’ visions into reality. Contact us today with your questions!