Decoupled Architecture Data Center Technology Roadmap (Part I)

Mondo Technology Updated on 2024-02-01

The traditional data center architecture uses the server as the deployment unit for computing and for storage reads and writes, and connects servers to one another through the network; inside each server, a bus connects computing and storage resources such as the CPU, memory, GPU, and hard disks. At the 7th Future Network Development Conference, white papers including "Computing Network Operating System", "Customized WAN for Optoelectronic Convergence Services", and "Serverless Data Centers with Network IO as the Center" were released.


Limited by the locality and limited capacity of the server's internal bus, these resources can only exist in tightly coupled combinations, so traditional data centers must purchase many different types of servers to meet the computing and storage needs of different applications.

In general, server-based data centers suffer from limited hardware scalability, low resource utilization, inelastic resource use, and coarse fault-tolerance granularity, and thus cannot effectively meet the diverse needs of emerging applications such as serverless computing and distributed training.

The main form of the resource-decoupled data center architecture is to build heterogeneous computing and storage resource pools (CPU, GPU, FPGA, RAM, SSD, HDD, etc.) and to interconnect these hardware pools through the network.

This resource-decoupled architecture breaks the physical boundaries between traditional servers; because the network offers global accessibility and high scalability, it escapes the constraints of the traditional architecture that uses the server as the deployment unit.

With the diversification of applications' storage and computing requirements, advances in high-speed network technology, and efficient hardware management, data centers based on the resource-decoupling architecture are encountering their development opportunity.

1) Diversification of application requirements.

In terms of resource performance, applications in different domains involve different types of data operations that are best handled by specialized computing chips. For example, the matrix and vector multiply-add operations common in artificial intelligence are highly regular but extremely compute-intensive, and are poorly suited to general-purpose CPUs.
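As a rough illustration of why such workloads outgrow general-purpose CPUs, the Python sketch below counts the multiply-add operations in a single matrix multiplication; the layer sizes and throughput figures are hypothetical assumptions for scale, not numbers from this article.

```python
# Rough FLOP count for C = A @ B with A (M x K) and B (K x N):
# each output element needs K multiplies and K - 1 adds,
# i.e. about 2 * M * N * K floating-point operations in total.

def matmul_flops(m: int, k: int, n: int) -> int:
    """Approximate floating-point operations for an (m x k) @ (k x n) matmul."""
    return 2 * m * n * k

# Hypothetical AI layer: 4096 x 4096 weights applied to 512 inputs.
flops = matmul_flops(512, 4096, 4096)
print(f"{flops / 1e9:.1f} GFLOPs for one layer")  # ~17.2 GFLOPs

# At an assumed ~100 GFLOP/s for a general-purpose CPU core versus
# hundreds of TFLOP/s for a matrix accelerator, the gap spans several
# orders of magnitude, which is why such operations are offloaded.
```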

These diverse application demands on storage and computing resources, in both kind and performance, push data centers to evolve toward resource-decoupling architectures.

2) High-speed network connection.

Decoupling storage and computing units such as the CPU, GPU, RAM, and SSD means that communication which previously stayed inside one server must now cross the network, greatly increasing data-interaction latency between resources. Network technology therefore determines both the performance of upper-layer applications and how effectively hardware resources can be pooled.
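To make that latency penalty concrete, here is a back-of-the-envelope comparison in Python; all the latency figures are common order-of-magnitude assumptions, not measurements from this article.

```python
# Illustrative access latencies (order-of-magnitude assumptions).
LATENCY_NS = {
    "local DRAM over memory bus": 100,          # ~100 ns
    "remote DRAM over RDMA network": 3_000,     # a few microseconds round trip
    "remote DRAM over TCP/IP network": 50_000,  # tens of us with kernel stack
    "local NVMe SSD over PCIe": 100_000,        # ~100 us
}

base = LATENCY_NS["local DRAM over memory bus"]
for path, ns in LATENCY_NS.items():
    print(f"{path:35s} {ns:>9,} ns  ({ns / base:>6.0f}x local DRAM)")

# Once memory is pooled behind a network, every access pays the network
# round trip, so the fabric's latency directly bounds application speed.
```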

3) Efficient hardware management.

To address the low utilization of local server resources, pooling individual resource classes is one of the current mainstream development directions and has attracted many vendors; its key technology is the efficient management and use of remote resources.

With the rapid development of network and hardware technology, the resource decoupling architecture has become one of the main development directions of the future data center due to its high resource utilization and good hardware scalability.

1. CPU-centric.

Under the "CPU-centric" technology route, the CPU performs all computing and data-processing tasks while the other components provide support and services to it; this is also the design basis of operating systems for today's resource-coupled servers.

"CPU-centric + compute offloading" is one current technical route for building a resource-decoupled data center: memory and the CPU remain tightly coupled so as to minimize changes to the traditional "CPU-centric" operating system. Because the deployment scenarios it targets still contain large numbers of complete servers, only a handful of representative solutions exist so far, such as the Fungible DPU, Intel IPU, Alibaba Cloud CIPU, and CXL.

1.1 Fungible DPU

The Fungible F1 hardware architecture consists of three functional parts: data clusters, a control cluster, and a network unit. There are eight data clusters, each with six cores of four threads, which run the data plane and accelerate data-centric operations such as movement, lookup, analytics, and security. The control cluster is a four-core, two-thread cluster that runs a Linux control plane and is mainly responsible for security authentication and for accelerating encryption algorithms such as RSA and elliptic-curve cryptography.

The network unit supports a total of 800G of bandwidth; offload of TCP/UDP, RDMA over TCP, and TrueFabric endpoints; programming of the packet path in the P4 language; and the IEEE 1588 Precision Time Protocol (PTP).
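A packet path being "programmable in P4" means the hardware exposes a match-action pipeline that software configures with table entries. The toy Python model below mimics that abstraction with a hypothetical one-table routing stage; it is a conceptual sketch of the match-action idea, not Fungible's actual programming interface.

```python
# Toy match-action pipeline in the style that P4 programs describe.
# A table maps a match key extracted from the packet to an action.

from dataclasses import dataclass, field

@dataclass
class Packet:
    dst_ip: str
    port: int | None = None   # egress port, filled in by the pipeline
    ttl: int = 64

@dataclass
class Table:
    key_field: str                               # header field to match on
    entries: dict = field(default_factory=dict)  # match value -> action

    def apply(self, pkt: Packet) -> None:
        action = self.entries.get(getattr(pkt, self.key_field))
        if action:
            action(pkt)

def set_egress(port: int):
    def act(pkt: Packet) -> None:
        pkt.port = port
        pkt.ttl -= 1          # decrement TTL as a router would
    return act

# The control plane "programs" the pipeline by installing table entries.
routing = Table("dst_ip", {"10.0.0.2": set_egress(1), "10.0.0.3": set_egress(2)})

pkt = Packet(dst_ip="10.0.0.2")
routing.apply(pkt)
print(pkt)  # Packet(dst_ip='10.0.0.2', port=1, ttl=63)
```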

TrueFabric is a new standard proposed by Fungible for interconnecting large-scale data center networks, built on a new fabric control protocol that runs over standard UDP/IP Ethernet. The Fungible F1 DPU natively supports TrueFabric, so the F1 DPU can be used in large-scale TrueFabric data center networks, with different types of servers using the Fungible DPU as their network access point.

TrueFabric can scale from a small cluster of servers with 100GE interfaces to deployments of hundreds of thousands of servers using 200GE-400GE interfaces, and it can grow incrementally without taking the network down, for truly always-on operation. All deployments use the same interconnect topology: small and medium deployments use a single tier of spine switches, while large deployments use both spine and leaf tiers, as the sizing sketch below suggests.
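For a rough feel of why a second tier becomes necessary at scale, this Python sketch computes how many servers a single switch tier versus a multi-tier non-blocking Clos fabric can attach; the 64-port radix is a hypothetical example value, not a TrueFabric specification.

```python
# Back-of-the-envelope fabric sizing for non-blocking folded-Clos topologies.
# `radix` = ports per switch; 64 is an assumed example, not a TrueFabric spec.

def max_servers(radix: int, tiers: int) -> int:
    """Maximum attached servers for a non-blocking folded-Clos fabric."""
    if tiers == 1:
        return radix                      # one switch, all ports to servers
    # Each extra tier halves the downlinks per switch but multiplies fan-out.
    return (radix // 2) ** (tiers - 1) * radix

for tiers in (1, 2, 3):
    print(f"{tiers} tier(s): up to {max_servers(64, tiers):,} servers")
# 1 tier(s): up to 64 servers
# 2 tier(s): up to 2,048 servers
# 3 tier(s): up to 65,536 servers
```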

The diagram above is an abstract view of a data center deployment based on TrueFabric and F1 DPUs, with multiple instances of four server types: CPU servers, AI/data-analytics servers, SSD servers, and HDD servers. Each server instance contains a Fungible DPU connected to the network at a fixed bandwidth (e.g., 100GE); even at large scale, the fabric effectively provides a dedicated 100GE link between DPUs.

1.2 Intel IPU

In highly virtualized data centers, a significant share of server resources is consumed by tasks outside user applications, such as hypervisors, container engines, networking and storage functions, security, and large volumes of network traffic. To address this, Intel introduced the Infrastructure Processing Unit (IPU). As the diagram below illustrates, an IPU-based architecture allows cloud service providers (CSPs) to offload infrastructure tasks from the CPU to the IPU, freeing server CPU cycles for tenant workloads and increasing data center revenue.

By offloading infrastructure tasks to the IPU, CSPs can lease all of their server CPUs to customers. Intel currently offers two IPU architectures: FPGA-based IPUs and dedicated ASIC-based IPUs.
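The economics are easy to sketch. Assuming, purely hypothetically, that infrastructure tasks consume 30% of a host's cores, these few lines of Python show how much leasable capacity offloading recovers; both figures are illustrative assumptions, not Intel data.

```python
# Hypothetical host: 128 cores, with an assumed 30% consumed by
# infrastructure tasks (hypervisor, virtual switching, storage stack).
total_cores = 128
infra_fraction = 0.30

leasable_before = total_cores * (1 - infra_fraction)   # 89.6 cores
leasable_after = total_cores                           # all cores, post-offload

gain = leasable_after / leasable_before - 1
print(f"Leasable cores: {leasable_before:.0f} -> {leasable_after}, "
      f"a {gain:.0%} capacity gain per host")          # ~43% gain
```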

Currently, there are two FPGA-based IPUs: Oak Springs Canyon and Arrow Creek. Oak Springs Canyon is based on an Intel Agilex FPGA and a Xeon-D CPU, which together offload 2x 100G workloads and build on the rich software ecosystem around x86.

Oak Springs Canyon leverages the Intel Open FPGA Stack, a scalable, source-accessible software and hardware infrastructure stack that meets the deployment needs of a 100G CSP. Oak Springs Canyon also features hardened cryptographic blocks that secure all infrastructure, storage, and network traffic at line rate.

Arrow Creek is an acceleration development platform based on the Agilex FPGA and the E810 100G Ethernet controller. It builds on the Intel N3000 platform already deployed by a number of communication service providers around the world, and provides flexible workload acceleration for the likes of Juniper Contrail, OVS, and SRv6.

Mount Evans is Intel's first ASIC-based IPU; it can attach to up to four Xeon processors via PCIe and take their compute load onto the IPU for processing. Mount Evans has a packet-processing engine that supports a number of existing use cases such as vSwitch offload, firewalls, and virtual routing; it emulates NVMe devices with a controller extended from the Optane NVMe controller; it provides advanced encryption and compression acceleration through QuickAssist technology; and it can be programmed in software environments such as DPDK and SPDK, with its pipelines configured in the P4 language.

1.3 Alibaba Cloud CIPU

The Cloud Infrastructure Processing Unit (CIPU) is a cloud processor proposed by Alibaba, designed specifically to connect the hardware in the server with the virtualized resources on the cloud. The CIPU rapidly cloudifies and hardware-accelerates the data center's computing, storage, and network resources, and connects upward to the Feitian (Apsara) cloud operating system.

In computing, the CIPU supports collaborative computing, distributing tasks across multiple nodes for higher efficiency and reliability. In storage, it provides Feitian distributed storage, which spreads data across multiple nodes to improve reliability and scalability. In virtualization, it runs multiple virtual machines on the same physical server for higher resource utilization, and supports containerized management to quickly deploy, manage, and scale a variety of applications. For programming, Alibaba's CIPU architecture provides a complete set of AI frameworks, including TensorFlow and PyTorch, to support various AI application scenarios.

1.4 CXL

CXL (Compute Express Link), launched in 2019 by companies including Intel, Dell, and HPE, is an open PCIe-based interconnect technology standard that enables high-speed, efficient interconnection between CPUs and GPUs, FPGAs, or other accelerators to meet the requirements of high-performance heterogeneous computing, while maintaining coherence between the CPU memory space and the memory of attached devices.

CXL defines three protocols: cxl.io, cxl.cache, and cxl.mem. The cxl.io protocol is an enhanced version of the PCIe 5.0 protocol used for initialization, link-up, device discovery and enumeration, and register access, and it provides a non-coherent load/store interface for I/O devices.

The cxl.cache protocol defines the interaction between host and device, allowing an attached CXL device to cache host memory efficiently and with low latency using a request/response approach. The cxl.mem protocol enables the CPU to use external devices as main memory, allowing larger memory capacities. Combining these three protocols supports different device types: network cards such as PGAS NICs (Type 1), accelerators with their own memory such as GPUs and FPGAs (Type 2), and memory-expansion devices (Type 3) in high-performance computing.
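To keep the three protocols and three device types straight, here is a small Python summary; the protocol mapping follows the CXL specification's Type 1/2/3 definitions, while the example device names are merely illustrative.

```python
# Which CXL protocols each device type speaks (per the CXL spec's
# Type 1/2/3 definitions). Example devices are illustrative only.

DEVICE_TYPES = {
    "Type 1": {
        "example": "caching NIC (e.g. a PGAS NIC)",
        "protocols": ("cxl.io", "cxl.cache"),            # caches host memory
    },
    "Type 2": {
        "example": "accelerator with memory (GPU, FPGA)",
        "protocols": ("cxl.io", "cxl.cache", "cxl.mem"), # coherent both ways
    },
    "Type 3": {
        "example": "memory-expansion device",
        "protocols": ("cxl.io", "cxl.mem"),              # host-managed memory
    },
}

for dev_type, info in DEVICE_TYPES.items():
    print(f"{dev_type}: {', '.join(info['protocols']):30s} {info['example']}")
```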

Currently, the CXL standard has evolved to CXL 3.0. Compared with the traditional tree structure of PCIe and previous CXL generations, CXL 3.0 adds support for multiple layers of switches, enabling non-tree network topologies such as leaf-spine. A CXL network can support up to 4,096 nodes that communicate through a port-based routing mechanism, where a node can be a host CPU, a CXL accelerator, a PCIe device, or a GFAM (Global Fabric Attached Memory) device. A GFAM device is similar to a traditional CXL Type 3 device, except that it can be accessed flexibly by multiple nodes (up to 4,095) via port-based routing. CXL 3.0 can therefore not only pool and decouple computing and storage resources within one cabinet, but also build larger resource pools spanning multiple cabinets.
