Network Working Group                                           Q. Xiong
Internet-Draft                                            ZTE Corporation
Intended status: Informational                                     K. Yao
Expires: 15 April 2025                                       China Mobile
                                                                 C. Huang
                                                            China Telecom
                                                                   Z. Han
                                                             China Unicom
                                                                  J. Zhao
                                                                    CAICT
                                                          12 October 2024


       Use Cases, Requirements and Problems for High Performance
                            Wide Area Network
                   draft-xiong-hpwan-uc-req-problem-00

Abstract

   A High Performance Wide Area Network (HP-WAN) is designed for
   scientific research, education, and other data-intensive
   applications that demand massive data transmission and need data
   integrity together with stable and efficient transmission
   services.  This document describes the use cases and requirements,
   and analyses the problems in HP-WANs.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on 15 April 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Revised BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Terminology
   3.  Use Cases
     3.1.  High Performance Computing (HPC)
     3.2.  AI Training
     3.3.  Backup and Disaster Recovery
     3.4.  Multimedia Content Production
   4.  Requirements
     4.1.  Service Requirements
     4.2.  Performance Requirements
   5.  Problem Statements
     5.1.  Challenges with Long-Distance Delay and Slow Feedback
     5.2.  Challenges with Low Bandwidth Utilization of Elephant Flows
     5.3.  Challenges with Large Bursts Causing Unmanageable Congestion
     5.4.  Challenges with Bottleneck Links Causing Packet Loss
   6.  Security Considerations
   7.  IANA Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
   Authors' Addresses
1.  Introduction

   Data is fundamental to many fields of scientific research,
   including biology, astronomy, and artificial intelligence (AI).
   Within these fields, many applications generate huge volumes of
   data by using advanced instruments and high-end computing devices.
   For data sharing and data backup, these applications usually
   require massive data transmission over long distances, for
   example, sharing data between research institutes separated by
   thousands of kilometers.  These applications include High
   Performance Computing (HPC) for scientific research, cloud storage
   and backup of industrial internet data, distributed training, and
   so on.  The Wide Area Networks (WANs) carrying this traffic need
   to ensure data integrity and provide stable and efficient
   transmission services, and they need to connect research
   institutions, universities, and data centers across large
   geographical areas.

   Traditional data migration solutions include manual transportation
   of hard copies, which not only incurs additional labor cost but
   also lacks safety, and high-speed dedicated connectivity (e.g.,
   direct optical connections), which is expensive.  Moreover, the
   applications may demand periodic and temporary migration, require
   task-based data transmission with low real-time requirements, and
   have variable transmission frequency, all of which lead to low
   network utilization and poor cost-effectiveness.

   The massive data may instead be transmitted over non-dedicated
   WANs, and the network requirements demand high performance, such
   as high-throughput data transmission, which depends on transport
   layer protocols such as the Transmission Control Protocol (TCP),
   Quick UDP Internet Connections (QUIC), Remote Direct Memory Access
   (RDMA), and so on.  However, the performance of TCP is impacted by
   its packet loss retransmission techniques.  RDMA has three main
   implementations: InfiniBand (IB), a high-performance dedicated
   network technology that requires specific InfiniBand hardware
   support; the Internet Wide Area RDMA Protocol (iWARP), which is
   based on the TCP/IP protocol stack but whose transmission
   performance may be affected by TCP congestion control and flow
   control; and RDMA over Converged Ethernet (RoCE), which allows
   RDMA to run over Ethernet but has applicability issues over WANs.

   Moreover, the long-distance connections and massive data
   transmission between two or more sites have become a key factor
   affecting performance.  For instance, long-distance networks have
   more uncertainties, such as routing changes, network congestion,
   packet loss, and link quality fluctuations, all of which may have
   a negative impact on performance.  The services are massive and
   concurrent, with multiple types and different traffic models, such
   as elephant flows with short inter-arrival times, high speed, and
   large data scale, which may occupy a large amount of network
   resources and degrade performance.
   High Performance Wide Area Network (HP-WAN) is designed
   specifically to meet the high-speed, low-latency, and high-
   capacity needs of applications with massive data sets, which puts
   forward higher performance requirements such as ultra-high
   goodput, high bandwidth utilization, ultra-low packet loss ratio,
   and resilience to ensure effective high-throughput transmission.
   This document describes the use cases and requirements, and
   analyses the problems in HP-WANs.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
   "MAY", and "OPTIONAL" in this document are to be interpreted as
   described in BCP 14 [RFC2119] [RFC8174] when, and only when, they
   appear in all capitals, as shown here.

2.  Terminology

   The following terminology is defined in this document.

   High Performance Wide Area Networks (HP-WANs): the networks
   designed specifically to meet the high-speed, low-latency, and
   high-capacity needs of scientific research, education, and data-
   intensive applications.  The primary goal of an HP-WAN is to
   achieve massive data transmission, which puts forward higher
   performance requirements such as ultra-high goodput, high
   bandwidth utilization, ultra-low packet loss ratio, and resilience
   to ensure effective high-throughput transmission.

   This document also makes use of the following abbreviations:

   DC: Data Center

   DCI: Data Center Interconnection

   HPC: High Performance Computing

   WAN: Wide Area Network

   MAN: Metropolitan Area Network

   PFC: Priority Flow Control

   ECN: Explicit Congestion Notification

   ECMP: Equal-Cost Multipath

   RTT: Round-Trip Time

   TCP: Transmission Control Protocol

   RDMA: Remote Direct Memory Access

   QUIC: Quick UDP Internet Connections

3.  Use Cases

   Several use cases are documented for scenarios requiring high-
   performance data transmission over WANs.

3.1.  High Performance Computing (HPC)

   High Performance Computing (HPC) uses computing clusters to
   perform complex scientific computing and data analysis tasks.  HPC
   is a critical component for solving complex problems in various
   fields such as scientific research, engineering, finance, and data
   analysis.  For example, the research data of large science and
   engineering projects carried out in cooperation with many research
   institutions requires long-term archiving of about 50~300 PB of
   data every year.  The PSII protein process generates 30 to 120
   high-resolution images per second during experiments, resulting in
   60~100 GB of data every five minutes that must be transmitted from
   one laboratory to another for analysis.  Another example is the
   Five-hundred-meter Aperture Spherical radio Telescope (FAST):
   astronomical data processing involves over 200 observations for
   each project, a single project generates observation data at the
   TB~PB level, and about 15 PB of data is produced per year.

   HPC requires a high-bandwidth, high-speed network to facilitate
   rapid data exchange between processing units.  It also requires
   high-capacity and high-throughput storage solutions to handle the
   vast amounts of data generated by simulations and computations.
   It is necessary to support large-scale parallel processing, high-
   speed data transmission, and low-latency communication to achieve
   effective collaboration between computing nodes.
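   As a rough illustration of the bandwidth implied by the figures
   above, the following Python sketch converts the quoted data
   volumes into sustained transfer rates.  The decimal units and the
   8-hour nightly transfer window for the archiving case are
   assumptions made for this calculation only and are not part of the
   use case.

      # Back-of-the-envelope bandwidth estimates for the HPC examples
      # above.  Assumes decimal units (1 GB = 1e9 bytes) and a fully
      # sustained transfer; real deployments need headroom for
      # protocol overhead and retransmissions.

      def required_gbps(data_bytes: float, seconds: float) -> float:
          """Average rate (Gbit/s) needed to move data_bytes in seconds."""
          return data_bytes * 8 / seconds / 1e9

      # PSII imaging: ~100 GB every five minutes.
      print(required_gbps(100e9, 5 * 60))           # ~2.7 Gbit/s

      # FAST archive: ~15 PB per year.
      print(required_gbps(15e15, 365 * 24 * 3600))  # ~3.8 Gbit/s

      # A 300 PB yearly archive moved only in an assumed 8-hour
      # nightly window.
      print(required_gbps(300e15, 365 * 8 * 3600))  # ~228 Gbit/s

   Sustaining even these average rates over a shared, non-dedicated
   WAN is what motivates the throughput and loss requirements given
   later in Section 4.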
3.2.  AI Training

   With the increasing demand for computing power in AI large-scale
   model training, the scale of a single data center is limited by
   factors such as power supply, so AI training clusters are
   expanding from a single data center to multiple DCs.
   Collaborative training across multiple DCs typically refers to
   distributed machine learning training across multiple data
   centers, which can improve computational efficiency, accelerate
   model training, and utilize more data resources.  Moreover, in
   some scenarios the data storage and compute resources for AI
   training need to be separated to achieve better resource
   management, data privacy, scalability, and performance
   optimization.

   Machine learning is commonly classified into batch learning
   (offline learning) and online learning.  In batch learning, the
   model is trained on the entire batch of data: all available data
   is fed into the learning algorithm at once, which requires the
   whole dataset to be transferred to the multiple DCs before
   training starts.  Online learning is completed in stages, where
   the learned model is updated as new data arrives; data is fed
   incrementally, one instance or one small batch at a time.  It also
   requires the data to be transferred to each data center and the
   model parameters to be updated synchronously.

   For example, the training data of a deep learning model has
   reached 3.05 TB.  Uploading a large model training template
   requires uploading TB- or PB-level data to the data center.  Each
   training session has fewer data flows, each with larger bandwidth,
   and 20% of the current network's services account for 80% of the
   traffic, resulting in elephant flows.  Compared with traditional
   DCI scenarios, parameter exchange significantly increases the
   amount of data transmitted across DCs, typically from tens to
   hundreds of TB.  The network should provide sufficient bandwidth,
   low latency, and high reliability for communication between data
   centers.

3.3.  Backup and Disaster Recovery

   With the development of the cloud computing industry, cloud data
   centers carry a large number of diverse enterprise IT services.
   The storage, transmission, and protection of this massively
   growing data bring new challenges.  For instance, disaster
   recovery of core application data is required to ensure enterprise
   data security and service continuity.

   In the disaster recovery scenario for an operator's traffic data,
   the daily backup volume of a single IT cloud resource pool is at
   the TB level.  The primary and backup data centers are normally
   built in different locations with long data transmission
   distances.  However, they do not have strict requirements on data
   transmission time.  By exploiting the tidal effect of the network,
   the idle bandwidth at night can be used for the transmission,
   improving data transmission efficiency and reducing data
   transmission cost.
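   To give a feel for the sizing involved, the following Python
   sketch estimates how a TB-level nightly backup fits into an off-
   peak window.  The 10 TB backup volume, 6-hour idle window, and
   30 Gbit/s of idle bandwidth are assumed figures chosen for
   illustration, not values taken from the use case.

      # Illustrative sizing for a nightly backup window.  All figures
      # below are assumptions, not measurements.

      def transfer_hours(data_bytes: float, rate_gbps: float) -> float:
          """Hours needed to move data_bytes at a sustained Gbit/s rate."""
          return data_bytes * 8 / (rate_gbps * 1e9) / 3600

      daily_backup = 10e12   # assumed 10 TB per day (TB-level backup)
      idle_rate    = 30.0    # assumed Gbit/s of idle night-time capacity
      window_h     = 6.0     # assumed off-peak window

      print(f"{transfer_hours(daily_backup, idle_rate):.2f} h")  # ~0.74 h

      # Minimum sustained rate to finish within the window:
      min_gbps = daily_backup * 8 / (window_h * 3600) / 1e9
      print(f"{min_gbps:.1f} Gbit/s")                            # ~3.7

   The point of the calculation is that such task-based transfers
   tolerate scheduling: as long as a few Gbit/s of otherwise idle
   capacity can be found during the night, the backup completes
   comfortably within the window.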
3.4.  Multimedia Content Production

   Multimedia content production refers to the process of creating
   and editing content that combines different media forms such as
   text, audio, images, animations, and video.  This field is
   characterized by the use of digital technology to produce engaging
   and dynamic content for various platforms, including film,
   television, the internet, and mobile devices.  It requires
   processing a large amount of data, including raw video material,
   special effects, and rendering results.

   For example, in film and video production, the raw material data
   of a large-scale variety show or film and television program is at
   the PB level, with a single transmission of data in the range of
   10 TB to 100 TB.  With the development of new media such as 4K/8K,
   5G, AI, VR/AR, and short video, large amounts of audio and video
   data need to be transmitted between data centers or different
   storage sites over long distances.  For AR/VR video, terminal
   output at 1080p image quality requires about 40 Mbit/s per user.
   This use case demands data transmission with traffic
   characteristics such as massive data scale and large bursts.

4.  Requirements

4.1.  Service Requirements

   The above use cases are characterized by massive data transmission
   with large bursts and multiple concurrent services co-existing
   with dynamic flows over long-distance links between sites or DCs.
   This document outlines the service requirements from users as
   follows.

   *  Massive data transmission, e.g., an elephant flow at a rate of
      10 Gbps~1 Tbps.

   *  Task-based data transmission with variable frequency, e.g., a
      periodic and temporary migration.

   *  Long-distance transmission between one or more sites or DCs,
      e.g., more than 1000 km.

   *  Instant transmission: the data needs to be transmitted
      immediately or at a specific time.

   *  Timely transmission: the data has a completion time but no
      real-time transmission requirements, e.g., seconds~milliseconds.

   *  Low cost.

   *  Data security and integrity.

   *  Compatibility and complementation with dedicated networks such
      as Research and Education Networks.  For example, it is
      required to provide switching with a fine-grained mapping
      between private networks and WANs to achieve optimal operating
      and consumption costs.

4.2.  Performance Requirements

   This document outlines the requirements for effective high-
   throughput data transmission in HP-WANs in terms of performance
   indicators such as ultra-low packet loss ratio, ultra-high
   bandwidth utilization, and low latency, as follows.

   *  Ultra-low Packet Loss Ratio: packet loss is negatively
      correlated with throughput; the lower the packet loss ratio,
      the higher the throughput.  It is required to achieve an ultra-
      low packet loss ratio of no more than 0.001% for high-
      throughput data transmission in HP-WANs.

   *  Ultra-high Bandwidth Utilization: refers to the efficient use
      of available network capacity to maximize data transfer rates
      and minimize latency.  It is required to improve bandwidth
      utilization to achieve high-throughput data transmission for
      multiple concurrent services in HP-WANs, with a bandwidth
      utilization rate exceeding 90% so that network resources are
      fully utilized.

   *  Low Latency: RTT is another performance indicator that is
      negatively correlated with throughput; the lower the RTT, the
      higher the throughput.  RTT consists of three types of delay:
      the link propagation delay, the processing delay of the end
      systems, and the queuing delay.  It is required to ensure low
      queuing latency (e.g., no more than 10 ms) to achieve high-
      throughput data transmission in HP-WANs, as illustrated by the
      sketch following this list.
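   The following Python sketch makes the interaction of these
   indicators concrete by computing the bandwidth-delay product (BDP)
   of an illustrative long-distance path.  The 100 Gbit/s bottleneck,
   2000 km distance, and 5 us/km propagation delay are assumptions
   for this calculation and are not requirements of this document.

      # Bandwidth-delay product for an illustrative HP-WAN path.
      # Assumed figures: 100 Gbit/s bottleneck, 2000 km fiber path,
      # 5 us/km propagation (roughly 2/3 of c in fiber), plus the
      # 10 ms queuing budget from Section 4.2.

      link_gbps   = 100.0
      distance_km = 2000.0
      prop_rtt_s  = 2 * distance_km * 5e-6   # ~20 ms round trip
      queuing_s   = 10e-3                    # queuing delay budget

      rtt_s = prop_rtt_s + queuing_s
      bdp_bytes = link_gbps * 1e9 / 8 * rtt_s

      print(f"RTT = {rtt_s * 1e3:.0f} ms")       # 30 ms
      print(f"BDP = {bdp_bytes / 1e6:.0f} MB")   # ~375 MB in flight

      # Every extra millisecond of queuing adds ~12.5 MB that must be
      # buffered or kept in flight before any feedback arrives.
      print(f"{link_gbps * 1e9 / 8 * 1e-3 / 1e6:.1f} MB per ms")

   A sender must keep a full BDP of data in flight to saturate such a
   path, which is why the loss, utilization, and queuing targets
   above need to hold simultaneously.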
5.  Problem Statements

   The challenges of effective high-performance transmission in HP-
   WANs come from massive concurrent services, long-distance delays,
   and packet loss.  Existing network technologies have various
   problems and cannot meet the requirements.  This document outlines
   these problems for HP-WANs as follows.

5.1.  Challenges with Long-Distance Delay and Slow Feedback

   Long-distance transmission over thousands of kilometers brings
   very long link propagation delays and a large RTT.  This delays
   network state feedback, making it impossible to adjust the
   transmission rate in a timely manner.  It is therefore challenging
   for congestion control in WANs to regulate the total amount of
   data entering the network so that traffic stays at an acceptable
   level.

   For example, as per [RFC3168], Explicit Congestion Notification
   (ECN) defines an end-to-end congestion notification mechanism at
   the IP and transport layers.  When congestion occurs, a network
   device marks packets and thereby conveys congestion information to
   the receiver, which echoes it back to notify the source to adjust
   its transmission rate and achieve congestion control.  The long
   distance delays the notification and slows the feedback, which
   results in untimely adjustment and buffer overflow, causing a
   decrease in network performance.  Rapid feedback is required for
   the source to adjust its rate.  Especially for incast congestion,
   where multiple sources target the same destination, the network
   needs rapid feedback based on the offered load, provided as near
   to the source as possible.

   Moreover, the slow feedback affects some congestion control
   algorithms.  For example, Bottleneck Bandwidth and Round-trip
   propagation time (BBR) is a congestion-based congestion control
   algorithm for TCP that actively measures the bottleneck bandwidth
   (BtlBw) and the round-trip propagation time (RTprop), uses this
   model to calculate the bandwidth-delay product (BDP), and then
   adjusts the transmission rate to maximize throughput and minimize
   latency.  However, BBR relies on real-time measurement of
   parameters that may vary greatly and are fed back slowly, which
   affects its control precision in long-distance networks.
   Furthermore, Data Center Quantized Congestion Notification (DCQCN)
   and High Precision Congestion Control (HPCC++) do not tolerate a
   long feedback loop.  The stability and adaptability of congestion
   control algorithms may therefore be challenging in HP-WAN
   scenarios, as the sketch below illustrates.
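   The following Python sketch quantifies the feedback problem for a
   BBR-like sender that paces traffic at a stale bandwidth estimate.
   The 30 ms RTT and the bandwidth figures are illustrative
   assumptions, not measurements of any particular algorithm or
   network.

      # Excess data injected before a one-RTT-old congestion signal
      # can take effect.  A sender pacing at a stale BtlBw estimate
      # keeps overshooting for at least one RTT after the bottleneck
      # shrinks.  All figures are illustrative assumptions.

      rtt_s          = 0.030   # ~2000 km path plus queuing
      estimated_gbps = 100.0   # sender's stale bottleneck estimate
      actual_gbps    = 60.0    # bottleneck after a routing change or
                               # newly arriving flows

      overshoot_gbps = estimated_gbps - actual_gbps
      excess_bytes   = overshoot_gbps * 1e9 / 8 * rtt_s
      print(f"{excess_bytes / 1e6:.0f} MB queued before feedback")  # 150 MB

      # With a 1 ms data-center RTT, the same mismatch builds only
      # ~5 MB of queue, which is why DC-tuned schemes such as DCQCN
      # and HPCC++ assume a much shorter feedback loop.
      print(f"{overshoot_gbps * 1e9 / 8 * 0.001 / 1e6:.0f} MB (DC case)")

   The roughly 30-fold difference between the WAN and data-center
   cases is the core of the slow-feedback problem described above.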
5.2.  Challenges with Low Bandwidth Utilization of Elephant Flows

   In HP-WAN applications, a large amount of data is transmitted; for
   example, the data volume of a single flow may range from 10 GB to
   1 TB.  The network needs to transfer elephant flows that last a
   long time, with short inter-arrival times, high speed, and large
   data scale.  It is challenging when such elephant flows occupy a
   large amount of network resources, resulting in low bandwidth
   utilization due to uneven resource allocation and instantaneous
   congestion.

   The existing congestion control mechanisms focus on rate
   adjustment, which controls the sending rate of data flows at the
   source of data transmission, thereby avoiding or reducing network
   congestion.  For example, TCP congestion control algorithms reduce
   the sending rate when packet loss is detected or when congestion
   is signalled via the ECN mechanism.  However, with ECN-based
   congestion feedback, frequent rate adjustments result in
   significant throughput fluctuations, which affect bandwidth
   utilization and transmission efficiency.  Therefore, the network
   is required to actively avoid congestion and reduce the
   probability of ECN marking.

   In addition, uneven network load leads to decreased network
   throughput and low link utilization.  Load balancing, i.e., the
   allocation of load (traffic) across multiple links for forwarding,
   is required.  For example, hash conflicts and poor balancing are
   challenging with massive elephant flows: flow-based ECMP may
   distribute several elephant flows onto the same link, resulting in
   congestion and packet loss, as illustrated by the sketch at the
   end of this section.

   Moreover, goodput bottlenecks combined with task-based
   transmission times and durations make traffic scheduling
   challenging.  The applications may have multiple concurrent
   services co-existing with dynamic flows.  Considering the multiple
   service types and their different traffic requirements, traffic is
   required to be scheduled onto multiple paths with fine-grained
   network resources to achieve high utilization and QoS guarantees.
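   The following Python sketch illustrates the flow-based ECMP
   problem described above.  The CRC-based hash, the four member
   links, and the random 5-tuples are illustrative assumptions and do
   not describe the hashing behaviour of any particular device.

      # Flow-based ECMP pins each 5-tuple to one member link.  With
      # only a handful of elephant flows, the chance that two of them
      # land on the same link is high, saturating that link while
      # others stay idle.  Hash and traffic model are assumptions.

      import random
      import zlib

      LINKS = 4          # equal-cost member links
      ELEPHANTS = 4      # concurrent elephant flows

      def ecmp_link(five_tuple):
          """Pick a member link from a CRC hash of the flow 5-tuple."""
          key = "|".join(map(str, five_tuple)).encode()
          return zlib.crc32(key) % LINKS

      def random_flow():
          """A random UDP 5-tuple standing in for one elephant flow."""
          return (f"10.0.0.{random.randint(1, 254)}",
                  f"10.1.0.{random.randint(1, 254)}",
                  17, random.randint(1024, 65535), 4791)

      trials = 10000
      collisions = 0
      for _ in range(trials):
          links = [ecmp_link(random_flow()) for _ in range(ELEPHANTS)]
          if len(set(links)) < ELEPHANTS:   # two elephants share a link
              collisions += 1

      # Analytically 1 - 4!/4**4 ~= 0.91, so roughly nine out of ten
      # placements overload at least one member link.
      print(collisions / trials)

   Because a single elephant flow can fill a member link on its own,
   one collision is enough to create a persistent bottleneck while
   other links remain under-utilized, which is the low-utilization
   effect described above.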
5.3.  Challenges with Large Bursts Causing Unmanageable Congestion

   Massive data flows with large bursts may cause instantaneous
   congestion, packet loss, and queuing delay within network devices
   in WANs.  There is more aggregation at the edge of WANs, and
   bursts may accumulate as the flows traverse, join, and separate
   over successive hops.  Controlling the congestion caused by such
   bursty traffic is challenging.  In HP-WANs, in order to ensure
   effective high throughput, it is required to improve the
   coordination of the end systems and the network to achieve
   congestion control based on collaborative rate negotiation.

   Initial rate negotiation is an important part of network
   communication, as it determines the starting rate of data
   transmission.  If the initial rate is set too low, it may lead to
   insufficient bandwidth utilization and fail to fully exploit the
   potential of the network.  If it is set too high, it may cause
   network congestion, resulting in packet loss and increased
   transmission delay.  For example, in order to balance bandwidth
   utilization and congestion avoidance, TCP adopts various
   congestion control algorithms, including mechanisms such as Slow
   Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery.
   These mechanisms work together to dynamically adjust the data
   transmission rate to adapt to changes in network conditions.  For
   HP-WANs, the initial rate negotiation needs to comprehensively
   consider factors such as network bandwidth, latency, and packet
   loss rate, and to balance bandwidth utilization against congestion
   avoidance in complex and dynamic network environments.

5.4.  Challenges with Bottleneck Links Causing Packet Loss

   An ultra-low packet loss ratio (e.g., no more than 0.001%) is
   needed to achieve high-throughput data transmission in HP-WAN
   scenarios.  This is challenging because long-distance networks
   have more uncertainties, such as multiple hops, routing changes,
   network congestion, and link quality fluctuations, all of which
   may have a negative impact on the packet loss ratio.

   The packet loss ratio increases at bandwidth-limited bottleneck
   links when elephant flows occupy a large amount of network
   resources.  An active packet loss avoidance mechanism is needed,
   which aims to prevent congestion from occurring and only sends
   data when the network has sufficient capacity.  It operates by
   actively reserving and allocating network bandwidth through a
   scheduler to match the bottleneck link bandwidth as closely as
   possible, thus fully utilizing the bandwidth while preventing
   packet loss.  Moreover, effective flow control can reduce
   congestion in the network and thereby reduce packet loss caused by
   buffer overflow.  Flow control refers to a method for ensuring
   that data is transmitted efficiently and reliably by controlling
   the rate of data transmission, preventing a fast sender from
   overwhelming a slow receiver and preventing packet loss in
   congested situations.  Fine-grained and high-precision flow
   control is required to reduce the interference between different
   traffic flows caused by the long-distance links and transmission
   delays in WANs.

   Packet loss also has a significant impact on some transport
   protocols.  For example, RDMA is designed for high performance and
   low latency, which gives it strict requirements on the network:
   the network should provide ultra-low packet loss, otherwise the
   performance degradation is significant, which poses greater
   challenges to the underlying network hardware and also limits the
   deployable scale of RDMA.  RDMA relies on a go-back-N
   retransmission mechanism, and its throughput dramatically
   decreases with packet loss rates greater than 0.1%; a 2% packet
   loss rate effectively reduces the throughput to zero.  For TCP and
   QUIC, CUBIC [RFC9438] is a widely deployed congestion control
   algorithm that uses a more aggressive (cubic) window increase
   function suitable for high-speed and long-distance networks.  When
   packet loss occurs, CUBIC reduces the congestion window by its
   multiplicative window decrease factor, which slows convergence, so
   it requires a low network packet loss rate.  As per Section 5.2 of
   [RFC9438], a packet loss rate of 2.9e-8 is required to achieve a
   throughput of 10 Gbps.  The throughput dramatically decreases when
   the packet loss ratio exceeds such a threshold.
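   The figure quoted from [RFC9438] can be reproduced from CUBIC's
   average-window response function for the deterministic loss model,
   using the constants beta_cubic = 0.7 and C = 0.4 from that
   document.  The following Python sketch solves the function for the
   tolerable loss rate at a target throughput; the 1500-byte segment
   size and 100 ms RTT are assumed parameters for this calculation,
   not values mandated by this document.

      # CUBIC's average window under a deterministic loss rate p:
      #   W_avg = (C*(3+beta)/(4*(1-beta)))**0.25 * (RTT/p)**0.75
      # Solving for p gives the loss rate a flow can tolerate while
      # sustaining a target average rate.  Segment size and RTT below
      # are assumptions for the calculation.

      BETA, C = 0.7, 0.4        # CUBIC constants from RFC 9438
      SEG_BITS = 1500 * 8       # assumed segment size
      RTT = 0.100               # assumed round-trip time in seconds
      A = C * (3 + BETA) / (4 * (1 - BETA))

      def tolerable_loss(rate_bps: float) -> float:
          """Loss probability CUBIC tolerates at this average rate."""
          w_avg = rate_bps * RTT / SEG_BITS      # window in segments
          return A ** (1 / 3) * RTT / w_avg ** (4 / 3)

      print(f"{tolerable_loss(10e9):.1e}")    # ~2.9e-08 (10 Gbit/s)
      print(f"{tolerable_loss(100e9):.1e}")   # ~1.4e-09 (100 Gbit/s)

   The 10 Gbit/s result matches the 2.9e-8 figure cited above, and
   the 100 Gbit/s case shows that at HP-WAN rates the tolerable loss
   ratio tightens by more than an order of magnitude, far below even
   the 0.001% packet loss target of Section 4.2.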
Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, September 2001, . [RFC7424] Krishnan, R., Yong, L., Ghanwani, A., So, N., and B. Khasnabish, "Mechanisms for Optimizing Link Aggregation Group (LAG) and Equal-Cost Multipath (ECMP) Component Link Utilization in Networks", RFC 7424, DOI 10.17487/RFC7424, January 2015, . [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, . [RFC8664] Sivabalan, S., Filsfils, C., Tantsura, J., Henderickx, W., and J. Hardwick, "Path Computation Element Communication Protocol (PCEP) Extensions for Segment Routing", RFC 8664, DOI 10.17487/RFC8664, December 2019, . [RFC9232] Song, H., Qin, F., Martinez-Julia, P., Ciavaglia, L., and A. Wang, "Network Telemetry Framework", RFC 9232, DOI 10.17487/RFC9232, May 2022, . [RFC9438] Xu, L., Ha, S., Rhee, I., Goel, V., and L. Eggert, Ed., "CUBIC for Fast and Long-Distance Networks", RFC 9438, DOI 10.17487/RFC9438, August 2023, . Authors' Addresses Quan Xiong ZTE Corporation China Email: xiong.quan@zte.com.cn Kehan Yao China Mobile China Email: yaokehan@chinamobile.com Cancan Huang China Telecom China Xiong, et al. Expires 15 April 2025 [Page 13] Internet-Draft Use Cases, Requirements and Problems for October 2024 Email: huangcanc@chinatelecom.cn Zhengxin Han China Unicom China Email: hanzx21@chinaunicom.cn Junfeng Zhao CAICT Beijing China Email: zhaojunfeng@caict.ac.cn Xiong, et al. Expires 15 April 2025 [Page 14]