The GPU is the brain of the AI computer
Simply put, the graphics processing unit (GPU) acts as the brain of the AI computer.
As you may know, the central processing unit (CPU) is the brain of a conventional computer. A GPU, by contrast, is a specialized processor built to perform the massive parallel calculations that AI workloads require. The fastest way to handle these workloads is to have groups of GPUs work on a problem together, yet even then, training an AI model can take weeks or even months. Once the model is trained, it is deployed in a front-end computing system where users can ask it questions, a process called inference.
An AI computer containing multiple GPUs
The best architecture for solving AI problems is a cluster of GPUs in a rack, connected to a switch at the top of the rack. Multiple GPU racks can then be connected through a networking hierarchy. As problems grow more complex, GPU requirements increase, and some projects may need to deploy clusters of thousands of GPUs.
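As a rough illustration of how such a hierarchy scales, the following Python sketch sizes a two-tier (ToR plus backbone) GPU fabric. The per-server, per-rack, and uplink counts are illustrative assumptions, not figures from the article.

```python
# Rough sizing of a two-tier (ToR + backbone) GPU fabric.
# All parameter values below are illustrative assumptions.

GPUS_PER_SERVER = 8        # typical AI server (assumed)
SERVERS_PER_RACK = 4       # limited by power and cooling (assumed)
TOR_UPLINKS_PER_RACK = 8   # uplinks from each ToR switch to the backbone (assumed)

def fabric_size(total_gpus: int) -> dict:
    """Estimate how many racks, ToR switches, and backbone uplinks
    a cluster of `total_gpus` GPUs would need."""
    gpus_per_rack = GPUS_PER_SERVER * SERVERS_PER_RACK
    racks = -(-total_gpus // gpus_per_rack)   # ceiling division
    return {
        "racks": racks,
        "tor_switches": racks,                 # one ToR switch per rack
        "backbone_uplinks": racks * TOR_UPLINKS_PER_RACK,
    }

if __name__ == "__main__":
    # A "thousands of GPUs" deployment, as mentioned in the article.
    print(fabric_size(4096))
```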
Each AI cluster is a small network
Building an AI cluster requires setting up a small computer network that connects the GPUs so they can work together and share data efficiently.
The figure above illustrates an AI cluster in which the circles at the bottom represent the workloads running on GPUs. The GPUs connect to top-of-rack (ToR) switches, which in turn connect to the network backbone switches shown at the top of the diagram, demonstrating the clear network hierarchy required when multiple GPUs are involved.
Networks are a bottleneck in AI deployment
Last fall, at the Open Compute Project (OCP) Global Summit, where delegates worked together to build the next generation of AI infrastructure, Loi Nguyen of Marvell Technology made a key point: “networking is the new bottleneck.”
Technically, high packet latency or packet loss due to network congestion can cause packets to be re-sent, significantly increasing job completion time (JCT). As a result, millions or tens of millions of dollars’ worth of enterprise GPUs sit idle because of inefficient AI systems, costing enterprises both revenue and time to market.
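To see why even a small loss rate matters, here is a simplified back-of-envelope model of how packet loss stretches JCT. The communication share and retransmission penalty are assumptions chosen for illustration, not Keysight measurements.

```python
# Back-of-envelope model of how packet loss inflates job completion
# time (JCT). The retransmission penalty is a deliberately simple
# assumption, used only for illustration.

def estimated_jct(base_jct_hours: float,
                  comm_fraction: float,
                  loss_rate: float,
                  retransmit_penalty: float = 10.0) -> float:
    """base_jct_hours: JCT on a lossless network.
    comm_fraction:      share of the job spent in network communication.
    loss_rate:          fraction of packets lost and re-sent.
    retransmit_penalty: how much more expensive a re-sent packet is
                        once timeouts and stalled GPUs are included."""
    comm_time = base_jct_hours * comm_fraction
    extra = comm_time * loss_rate * retransmit_penalty
    return base_jct_hours + extra

# A 100-hour training job that spends 30% of its time communicating:
for loss in (0.0, 0.001, 0.01):
    print(f"loss={loss:.3f} -> JCT ~ {estimated_jct(100, 0.3, loss):.1f} h")
```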
Measurement is a key condition for successful operation of AI networks
To operate an AI cluster effectively, the GPUs must be able to run at full capacity so that training time is shortened and the trained model can be put into service, maximizing the return on investment. This makes it necessary to test and evaluate the performance of the AI cluster (Figure 2). The task is not easy, however, because at the system-architecture level there are many settings and interdependencies between the GPUs and the network fabric that must complement each other to solve the problem.
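One basic starting point for such evaluation is simply checking whether the GPUs are actually running at full capacity. The sketch below polls per-GPU utilization through the standard nvidia-smi query interface; it assumes NVIDIA GPUs with nvidia-smi available on the PATH.

```python
# Poll GPU utilization and memory use via nvidia-smi.
# Assumes NVIDIA GPUs with nvidia-smi available on the PATH.
import subprocess

def gpu_utilization() -> list[dict]:
    """Return per-GPU utilization (%) and memory used (MiB)."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    gpus = []
    for line in out.strip().splitlines():
        idx, util, mem = (field.strip() for field in line.split(","))
        gpus.append({"gpu": int(idx), "util_pct": int(util), "mem_mib": int(mem)})
    return gpus

if __name__ == "__main__":
    for gpu in gpu_utilization():
        print(gpu)
```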
Measuring AI networks creates many challenges:
- Difficulty in reproducing entire production networks in the lab because of limits on cost, equipment, space, power, and cooling, and a shortage of skilled AI network engineers.
- Measuring on the production system reduces the processing capacity available to the production system itself.
- Difficulty in accurately reproducing problems because of differences in their scale and scope.
- The complexity of how GPUs are collectively connected.
To address these challenges, enterprises can test a subset of the recommended setups in a lab environment to benchmark key metrics such as job completion time (JCT) and the bandwidth the AI team can achieve, and compare them with switch-platform utilization and cache utilization. This benchmarking helps find the right balance between the GPU/processing workload and the network design and setup. Once satisfied with the results, computer architects and network engineers can take these setups into production and measure the new results.
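In practice, such a lab benchmark can be as simple as recording JCT, achieved bandwidth, and switch utilization for each candidate setup and comparing them. The sketch below is hypothetical; the configuration names and numbers are placeholders, not measured results.

```python
# Compare benchmark runs across candidate network setups.
# Setup names and metric values are placeholders, not real measurements.
from dataclasses import dataclass

@dataclass
class BenchmarkRun:
    setup: str             # description of the network configuration
    jct_hours: float       # measured job completion time
    bandwidth_gbps: float  # achieved collective bandwidth
    switch_util_pct: float # switch-platform utilization

def best_by_jct(runs: list[BenchmarkRun]) -> BenchmarkRun:
    """Pick the configuration with the lowest job completion time."""
    return min(runs, key=lambda r: r.jct_hours)

runs = [
    BenchmarkRun("baseline load balancing",    jct_hours=26.0, bandwidth_gbps=310, switch_util_pct=62),
    BenchmarkRun("tuned congestion control",   jct_hours=21.5, bandwidth_gbps=365, switch_util_pct=71),
]
print(best_by_jct(runs).setup)
```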
Corporate research labs, academic institutions, and universities are analyzing every aspect of building and operating effective AI networks to address the challenges of working at large scale, especially as best practices continue to evolve. This collaborative, repeatable approach is the only way for companies to perform reproducible measurements and rapidly test the “what-if” scenarios that are the foundation for optimizing networks for AI.
(Source: Keysight Technologies)
Source: https://vietnamnet.vn/ket-noi-mang-ai-5-dieu-can-biet-2321288.html