Workgroup:
Operations and Management Area Working Group
Internet-Draft:
draft-liu-opsawg-cco-cm-requirement-02
Published:
October 2024
Intended Status:
Informational
Expires:
17 April 2025
Authors:
C. Liu
China Mobile
S. Xu
China Mobile

Requirements from Control and Management Viewpoint for Collective Communication Optimization

Abstract

Collective communication optimization is crucial to improving the performance of distributed applications, because communication has become a bottleneck that degrades application performance as distributed systems grow in scale. Industry and academia have been working on solutions to improve collective communication operations. However, these efforts lack unified guidelines.

This draft provides requirements for collective communication optimization from the control and management viewpoint.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 17 April 2025.

Table of Contents

1.  Introduction
2.  Requirements
  2.1.  Memory Management
  2.2.  Topology Management
  2.3.  Interfaces Management
  2.4.  Data Management
  2.5.  Computational Resources Management
3.  Security Considerations
4.  IANA Considerations
5.  References
  5.1.  Normative References
  5.2.  Informative References
Authors' Addresses

1. Introduction

In recent years, with the development and evolution of various applications and businesses, especially the rapid growth of AI applications, distributed computing performance has become increasingly important and has gradually become a key factor limiting the progress of these applications. Because distributed computing involves a large amount of collective communication, collective communication performance is crucial. However, many problems remain to be solved. On one hand, many collective communication operations implemented by message-level communication libraries such as MPI and NCCL mainly rely on unicast point-to-point communication, leading to redundant network traffic, underutilization of network resources, and waste of network capabilities. On the other hand, since the underlying network protocols and collective communication are not co-designed, there is a semantic gap between inter-process message transportation and packet forwarding. There is therefore large room for collective communication optimization, and industry and academia are actively promoting its development, implementation, and deployment.

The Computing in the Network (COIN) research group also focuses on this topic. Its goal is mainly to investigate how network data plane programmability can improve the Internet architecture, with a broad scope that includes network function offloading, machine learning acceleration, in-network caching, in-network control, and more. In addition to the collective operation offloading discussed in COIN, substituting multicast for unicast point-to-point communication, scheduling tasks and planning transport paths with topology awareness, and bridging the semantic gap between inter-process message transportation and packet forwarding can also play a significant role in optimizing collective communication.

This draft provides necessary requirements from the network control and management viewpoint, drawing on the optimization approaches of collective communication offloading, multicast mechanisms, topology awareness, and bridging the semantic gap between inter-process message transportation and packet forwarding, to guide the standardization work of collective communication optimization.

2. Requirements

2.1. Memory Management

Scarce memory resources provided by network devices for collective communication MUST be scheduled and controlled, e.g., by assigning a scheduling priority to collective communication offloading tasks. Compared to the volume of collective communication messages in applications such as AI and HPC, the memory resources that network devices such as programmable switches provide for collective communication are severely mismatched and extremely scarce.

Use Case [ESA]. The memory of a programmable switch is scarce compared to the amount of gradient data transmitted in distributed training. Existing work such as pool-based streaming and dynamic sharing addresses this problem, but is not yet sufficient. One way to make fuller use of the switch memory is for the control and management module of the switch to assign a priority to each aggregation task, so that aggregation tasks in the data plane are scheduled dynamically and preemptively and the memory organized as switch aggregators is used more fully, as sketched after Figure 1.

+----------+  +----------+           +----------+
|          |  |          |           |          |
| worker 1 |  | worker 2 |           | worker n |
|          |  |          |  ... ...  |          |
+----+-----+  +----+-----+           +-----+----+
     |             |                       |
     |             |                       |
     +------+------+-----------------------+
            |GB/TB-level gradients
+-----------+-----------+           +-----------+
|           |           |           |           |
|  +--------+--------+  |           |           |
|  |     Switch      |  |           |  Control  |
|  |   Aggregators   |  |  Manage   |     &     |
|  +-----------------<--+-----------+ Management|
|  |Memory for others|  | Schedule  |           |
|  +-----------------+  |           |           |
| Switch Memory = 10MB  |           |           |
+-----------------------+           +-----------+
Figure 1: The Mismatch between Device Memory and Communication Volume
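
The sketch below, in Python, is purely illustrative; the AggregationTask and AggregatorPool names are hypothetical and do not correspond to any real switch or controller API. It shows one possible form that priority-based, preemptive scheduling of aggregator memory could take in the control and management module.

   # Purely illustrative sketch: the AggregationTask and AggregatorPool
   # names are hypothetical and do not correspond to any real switch or
   # controller API.
   from dataclasses import dataclass

   @dataclass
   class AggregationTask:
       name: str
       priority: int          # lower value = more urgent
       slots_needed: int      # aggregator slots the task occupies

   class AggregatorPool:
       """A fixed pool of on-switch aggregator slots (e.g. ~10 MB of SRAM)."""
       def __init__(self, total_slots):
           self.free = total_slots
           self.running = []

       def schedule(self, task):
           # Preempt less urgent tasks (larger priority value) until the
           # new task fits, or report that it cannot be placed yet.
           self.running.sort(key=lambda t: t.priority)
           while self.free < task.slots_needed and self.running:
               victim = self.running[-1]
               if victim.priority <= task.priority:
                   return False       # nothing less urgent left to evict
               self.running.pop()
               self.free += victim.slots_needed
           if self.free < task.slots_needed:
               return False
           self.free -= task.slots_needed
           self.running.append(task)
           return True

   pool = AggregatorPool(total_slots=8)
   pool.schedule(AggregationTask("job-A", priority=2, slots_needed=6))
   pool.schedule(AggregationTask("job-B", priority=1, slots_needed=4))  # preempts job-A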

2.2. Topology Management

Topology awareness and mapping are REQUIRED in order to move some of the end-host computation onto network nodes for collective communication optimization. In many collective operation tasks, the logical relationship between nodes is usually described as a graph and then mapped onto the physical network. Therefore, collective communication offloading requires awareness of the network topology and efficient mappings onto it.

Use Case. In the parameter server architecture commonly used in distributed training, the parameter server can be reasonably mapped to the spine switches of a fat-tree physical network once the network topology is known. Under this mapping, the traffic paths are simpler and the traffic volume of the whole network is greatly reduced. Compared to the traditional collective communication mode, the optimized end-to-network or end-to-network-to-end mode with topology awareness and mapping brings the physical topology and the logical topology closer together and makes them more unified, as sketched after Figure 2.

                 Logical Topology
                +----------------+
                |Parameter Server|
                +--------+-------+
                         |
          +----------+---+------------+
          |          |                |
     +----+---+ +----+---+        +---+----+
     |Worker 1| |Worker 2| ...... |Worker n|
     +--------+ +--------+        +--------+
                         |Mapping
                         |
              +----------+---------+
              |Management & Control|
              |                    |
              | Topology Awareness |
              | Paths Planning     |
              +----------+---------+
                         |
                         |Mapping
                         v
                 Physical Topology
             +-----+         +-----+
             |Spine|         |Spine|
             +--+--+         +--+--+
                |               |
     +----------+-+------------++-----------+
     |            |            |            |
  +--+--+      +--+--+      +--+--+      +--+--+
  |Leaf |      |Leaf |      |Leaf |      |Leaf |
  +--+--+      +--+--+      +--+--+      +--+--+
     |            |            |            |
  +--+--+      +--+--+      +--+--+      +--+--+
  |     |      |     |      |     |      |     |
+-+-+ +-+-+  +-+-+ +-+-+  +-+-+ +-+-+  +-+-+ +-+-+
|GPU| |GPU|  |GPU| |GPU|  |GPU| |GPU|  |GPU| |GPU|
+---+ +---+  +---+ +---+  +---+ +---+  +---+ +---+
Figure 2: Topology Management and Topology Mapping
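
The following Python sketch is purely illustrative; the Fabric class and map_parameter_server routine are hypothetical and not part of any existing management interface. It shows one possible way a management and control module could use topology awareness to place the parameter-server role on the spine that connects the most worker-hosting leaves, so that aggregation traffic converges as early as possible.

   # Purely illustrative sketch: Fabric and map_parameter_server are
   # hypothetical names, not an existing management interface.
   from collections import defaultdict

   class Fabric:
       """Physical topology learned by the management and control module."""
       def __init__(self):
           self.links = defaultdict(set)   # node name -> set of neighbour names

       def add_link(self, a, b):
           self.links[a].add(b)
           self.links[b].add(a)

   def map_parameter_server(fabric, workers):
       # In this sketch each worker GPU hangs off exactly one leaf.
       leaves = {next(iter(fabric.links[w])) for w in workers}
       # Score each spine by how many worker-hosting leaves it connects to,
       # and place the parameter-server role on the best-connected spine.
       score = defaultdict(int)
       for leaf in leaves:
           for nbr in fabric.links[leaf]:
               if nbr.startswith("spine"):
                   score[nbr] += 1
       return max(score, key=score.get)

   fabric = Fabric()
   for leaf in ("leaf1", "leaf2"):
       fabric.add_link("spine1", leaf)
       fabric.add_link("spine2", leaf)
   for gpu, leaf in (("gpu1", "leaf1"), ("gpu2", "leaf1"),
                     ("gpu3", "leaf2"), ("gpu4", "leaf2")):
       fabric.add_link(leaf, gpu)

   print(map_parameter_server(fabric, ["gpu1", "gpu2", "gpu3", "gpu4"]))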

2.3. Interfaces Management

Collective communication interfaces MUST be defined and managed to shield application developers from tedious network engineering details such as flow control, packet organization, and chip-specific programming languages. Otherwise, application developers would need a great deal of arcane knowledge and expertise, which is beyond what they are willing to acquire and hinders the evolution of emerging applications.

Use case. Industry and academia have already proposed some abstractions of collective communication operations, such as the collective communication libraries MPI and NCCL, and NetRPC [NetRPC]. In the control plane, these interfaces need to be configured and instantiated to complete part of the collective communication functionality.
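
As a minimal example of the level of abstraction such interfaces provide, the following sketch uses mpi4py (assuming mpi4py and an MPI implementation are installed). The single Allreduce call exposes no flow control, packet organization, or chip-specific detail; whether it is realized by host-based algorithms or offloaded into the network is left to the control and management plane.

   # Minimal mpi4py example; assumes mpi4py and an MPI implementation are
   # installed and that the program is started under an MPI launcher.
   from mpi4py import MPI
   import numpy as np

   comm = MPI.COMM_WORLD
   local_grad = np.full(1024, comm.Get_rank(), dtype=np.float32)
   summed = np.empty_like(local_grad)

   # One call performs the whole collective; no flow control, packet
   # layout, or switch programming is visible to the application.
   comm.Allreduce(local_grad, summed, op=MPI.SUM)

   if comm.Get_rank() == 0:
       print(summed[0])   # sum of the ranks 0..n-1

Such a program would typically be launched with a command such as "mpirun -n 4 python allreduce_example.py"; everything below this interface is decided by the library and, potentially, the network control and management plane.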

2.4. Data Management

The semantic gap between the application data unit and the network data unit is REQUIRED to be bridged. This semantic mismatch poses a huge obstacle to supporting upper-layer applications in the underlying network.

Use Case. In the distributed training of LLMs, AllReduce, a common collective communication operation, involves transporting a large amount of message-level gradient or token data, which is orders of magnitude larger than a network packet. Because network packets were not originally designed with large data transmission in mind, there is a natural mismatch between the network and collective communication.
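
As a purely illustrative sketch of what bridging this gap involves, the following Python fragment tags each network data unit with enough collective-level metadata to relate it back to the application data unit it was carved from. The 16-byte header layout and its field names are invented for illustration and are not taken from any existing protocol.

   # Purely illustrative sketch: the header layout (job id, message id,
   # chunk index, total chunks) is invented, not an existing protocol.
   import struct

   MTU_PAYLOAD = 1024                  # gradient bytes carried per packet
   HEADER = struct.Struct("!IIII")     # job_id, msg_id, chunk_idx, total_chunks

   def fragment(job_id, msg_id, message):
       """Carve one application data unit into tagged network data units."""
       total = (len(message) + MTU_PAYLOAD - 1) // MTU_PAYLOAD
       for idx in range(total):
           chunk = message[idx * MTU_PAYLOAD:(idx + 1) * MTU_PAYLOAD]
           yield HEADER.pack(job_id, msg_id, idx, total) + chunk

   def reassemble(packets):
       """Rebuild the application data unit from its tagged packets."""
       chunks, total = {}, 0
       for pkt in packets:
           _job, _msg, idx, total = HEADER.unpack_from(pkt)
           chunks[idx] = pkt[HEADER.size:]
       return b"".join(chunks[i] for i in range(total))

   gradient = bytes(10_000)            # stands in for a message-level gradient
   assert reassemble(list(fragment(1, 7, gradient))) == gradient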

2.5. Computational Resources Management

Fusing communication and computation operators is now a popular, consensus-based way to optimize LLM training and HPC. This practice makes the scheduling of communication and computation more efficient. A universal and direct way is to take computation into account when designing communication operators in the communication libraries. In this process, unified management of computational resources, e.g., the AI cores of GPUs or CPU cores, is REQUIRED to optimize the scalability of the computing cluster.

Use Case. Training large language models (LLMs) efficiently in a distributed setting is a challenging task, primarily due to communication bottlenecks that arise when multiple nodes need to synchronize their data. Overlapping communication with computation, often combined with kernel fusion, is a key strategy to improve training efficiency. This approach reduces idle time by initiating communication operators while the CPU or GPU is still performing computations, e.g., fusing AllGather and batched matrix multiplication into a single larger kernel to reduce kernel launch overhead and better overlap communication with computation.
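
A minimal sketch of this overlap pattern is shown below, assuming a PyTorch process group has already been initialized with torch.distributed.init_process_group; the bucketing scheme is hypothetical and only illustrates launching asynchronous all-reduce operations that proceed while later gradients are still being produced.

   # Illustrative sketch; assumes torch.distributed is already initialized
   # (e.g. via init_process_group with the NCCL backend) and that gradient
   # buckets arrive one by one in reverse layer order.
   import torch.distributed as dist

   def overlap_allreduce(grad_buckets):
       handles = []
       for grad in grad_buckets:
           # async_op=True returns a handle immediately, so this bucket's
           # transfer proceeds while later buckets are still being produced.
           handles.append(dist.all_reduce(grad, op=dist.ReduceOp.SUM,
                                          async_op=True))
       for handle in handles:
           handle.wait()   # every gradient is globally summed before the
                           # optimizer step uses it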

3. Security Considerations

TBD.

4. IANA Considerations

TBD.

5. References

5.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

5.2. Informative References

[ESA]
Wang, H., "Efficient Data-Plane Memory Scheduling for In-Network Aggregation".
[NetRPC]
Zhao, B., "NetRPC: Enabling In-Network Computation in Remote Procedure Calls".

Authors' Addresses

Chang Liu
China Mobile
Beijing
100053
China

Shiping Xu
China Mobile
Beijing
100053
China