Authors: Song Zehui (Xiaohongshu), Zhang Zuowei (Alibaba Cloud).
Editor's note:
Koordinator is an open-source project incubated from Alibaba's years of production experience in container scheduling and colocation (mixed deployment of online and offline workloads). It is the industry's first production-ready, open-source colocation system for large-scale scenarios, and it aims to improve application service quality and resource efficiency. Since it was open-sourced in April 2022, it has attracted contributions and discussion from many excellent engineers in the industry.
Xiaohongshu is an active member of the Koordinator community and has been deeply involved in the evolution of several important features since the early days of the project. This article is based on the transcript of the Koordinator talk at the 2023 Apsara Conference, where Koordinator community members Song Zehui (Xiaohongshu) and Zhang Zuowei (Alibaba Cloud) introduced Xiaohongshu's technical practice and Koordinator's near-term plans.
With the rapid development of Xiaohongshu's business, demand for computing resources from various offline services is also growing rapidly. At the same time, the average daily utilization of some online clusters remains low, for the following main reasons:
* Service resource usage follows a stable tidal pattern driven by end users' habits. CPU utilization at night is extremely low, which drags down the cluster's average CPU utilization.
* Businesses retain a large number of exclusive resource pools. This splitting of resource pools produces many resource fragments, which reduces CPU utilization.
* For the sake of stability, businesses hoard more resources than they need, which further reduces CPU utilization.
Against this background, to help businesses reduce resource costs and improve cluster CPU utilization, the Xiaohongshu container team has rolled out large-scale colocation since 2022, greatly improving cluster resource efficiency and reducing business resource costs.
The evolution of Xiaohongshu's colocation technology is divided into the following four stages:
Phase 1: Reuse of idle resources
In the early days, cluster resource management was coarse-grained. Clusters contained many business-exclusive resource pools, and resource fragmentation and other factors left many inefficient nodes with low allocation rates. These inefficient nodes, scattered across clusters, wasted a large amount of resources. At the same time, some nearline and offline transcoding workloads running on K8s needed a large amount of computing resources throughout the day. Against this background, the container platform collected idle resources in the clusters by technical means and allocated them to transcoding workloads.
We use virtual-kubelet to connect the metadata cluster with the physical clusters, aggregate idle resources, and allocate them in the metadata cluster to nearline and offline computing services for transcoding. In terms of policy, the secondary scheduler inspects all nodes in a cluster, then identifies and marks inefficient nodes. virtual-kubelet takes the available resources of those inefficient nodes in the physical cluster as the cluster's idle resources and reallocates them to offline transcoding. The secondary scheduler must also ensure that, as soon as an online service needs resources, the offline pods are evicted immediately and the resources are returned.
Phase 2: Time-sharing reuse of the whole machine
Against this background, the platform uses elastic scaling (HPA) to scale in online services during the early-morning off-peak period and vacate whole machines. Offline pods such as transcoding and training run on those machines during that window, which fills the utilization "valley".
To make this safe, our strategy has offline workloads exit early, and the scheduler preemption mechanism ensures that online services can be fully brought back up in time before the business peak.
Phase 3: Always-on colocation
To reduce the resource fragmentation rate and the cost to businesses of holding resources, the platform continues to promote large-scale business pooling, moving services from exclusive pools into a public colocation pool managed by the platform. However, in the complex colocation scenario that results from pooling, the whole-machine time-sharing strategy is difficult to keep applying to offline workloads. The platform therefore needs finer-grained resource management and scheduling capabilities to raise average utilization, including the following:
Scheduling side:
* Dynamic oversubscription determines the amount of resources that can be reallocated to offline workloads. An offline resource view is abstracted and exposed to the K8s scheduler, which schedules offline workloads to the corresponding nodes to fill the utilization valleys of those nodes.
* Load-aware scheduling avoids placing services on highly loaded machines wherever possible, so that node load across the cluster is more balanced.
* Secondary scheduling evicts high-utilization services from load-hotspot machines, keeping the cluster load in a dynamically balanced state.
Node side:
* QoS assurance policies provide differentiated runtime resource guarantees based on each service's QoS level.
* Interference detection and offline eviction: when offline workloads interfere with sensitive online services, the offline workloads are evicted as soon as possible.
These techniques effectively safeguard stability under colocation, so that offline workloads can run on the nodes as the normal state, maximizing utilization and the valley-filling effect.
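The load-aware scheduling and hotspot detection ideas in the list above can be illustrated with a minimal Python sketch. It is not Koordinator's or Xiaohongshu's implementation: the `Node` type, the function names, and the 65% and 80% thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpu_capacity: float  # total cores on the node
    cpu_used: float      # cores actually in use (measured, not requested)

    @property
    def utilization(self) -> float:
        return self.cpu_used / self.cpu_capacity


def pick_node(nodes: List[Node], pod_cpu: float,
              high_load_threshold: float = 0.65) -> Optional[Node]:
    """Load-aware placement: drop nodes that the pod would push past the
    (assumed) threshold, then prefer the least-loaded remaining node."""
    candidates = [
        n for n in nodes
        if (n.cpu_used + pod_cpu) / n.cpu_capacity <= high_load_threshold
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.utilization)


def hotspots(nodes: List[Node], hotspot_threshold: float = 0.8) -> List[str]:
    """Nodes whose measured utilization exceeds the (assumed) hotspot
    threshold; a secondary scheduler would evict workloads from these."""
    return [n.name for n in nodes if n.utilization > hotspot_threshold]


nodes = [Node("a", 32, 28), Node("b", 32, 8), Node("c", 32, 16)]
chosen = pick_node(nodes, pod_cpu=4)
hot = hotspots(nodes)
```

Here placement works from measured usage rather than requested resources, which is what distinguishes load-aware scheduling from request-based bin packing.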
The design of Xiaohongshu's container resource scheduling architecture is shown in the following figure.
Workloads from Xiaohongshu's business scenarios are submitted through various publishing platforms and task platforms, and are delivered to the unified scheduling system as pods through the upper-layer workload orchestration capabilities. Based on different scheduling requirements, the unified scheduling system provides strongly guaranteed resource delivery and differentiated QoS assurance for online services, and minimum resource guarantees and extreme elasticity for offline services.
Scheduling side:
* Offline scheduling: coscheduling
* Secondary scheduling: hotspot eviction, defragmentation
* Load scheduling: based on CPU level
* Resource view: simulated scheduling
Node side:
* Suppression strategy: BVT suppression, memory eviction
* QoS assurance: core binding, hyper-threading interference suppression, etc.
* Batch resource reporting: calculation and reporting of the resources available to Batch workloads
* Metrics collection (from the kernel): psi, sched info, etc.
* Interference detection: based on CPI, PSI, and business metrics
Offline scheduling resource view
Offline resource scheduling is built on dynamic oversubscription driven by awareness of service load. In practice, this means reallocating a node's idle resources to offline workloads.
Offline available resources are the idle resources on a node (the sum of unallocated resources and allocated but unused resources) minus the safety reserve. They are calculated with the following formula:
Offline available resources = Whole-machine resources - Reserved resources - Actual online service usage.
The figure shows how the calculated offline available resources vary over time (the green area in the figure).
In production, offline available resources should not swing widely as online service usage fluctuates, because that would degrade the quality of offline resources and the stability of offline services. The actual online service usage in the formula above is therefore processed further through resource profiling to remove noise, which finally yields a relatively stable amount of offline available resources (the green area in the figure), as shown in the figure.
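As a rough illustration of the formula and the denoising step, the sketch below computes offline available resources from a smoothed online usage value. The article does not say how the resource profile is built, so the percentile over a sliding window used here is purely an assumption, as are all function names and sample numbers.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def offline_available(total: float, reserved: float, online_usage: float) -> float:
    """Offline available = whole-machine - reserved - actual online usage,
    clamped at zero so the view never goes negative."""
    return max(0.0, total - reserved - online_usage)


def stable_offline_available(total: float, reserved: float,
                             usage_window: Sequence[float], q: float = 95) -> float:
    """Assumed smoothing: replace the instantaneous online usage with a high
    percentile of recent samples, so short spikes and dips in online usage
    do not make the offline view jump around."""
    return offline_available(total, reserved, percentile(usage_window, q))


# Hypothetical 64-core node with 8 reserved cores and ten usage samples (cores).
window = [18, 20, 22, 19, 40, 21, 20, 19, 18, 20]
instant = offline_available(64, 8, window[-1])
conservative = stable_offline_available(64, 8, window, q=95)
median_based = stable_offline_available(64, 8, window, q=50)
```

A higher percentile gives a smaller but steadier offline view. Which statistic and window length to use is a tuning decision that this sketch does not settle.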
Colocation QoS assurance policy
QoS levels
According to quality of service (QoS) requirements, we divide Xiaohongshu's business types into three QoS levels, as shown in the following table.
QoS assurance
According to each service's QoS requirements, the node side provides tiered resource assurance at pod granularity, implementing differentiated QoS policies for each resource dimension. The specific assurance parameters are as follows:
At the CPU core scheduling level, three core binding types are defined, together with a fine-grained CPU core orchestration strategy.
The three core binding types are:
* Exclusive (not recommended)
  Features: bound to a cpuset scheduling domain, CCD-aware, NUMA-bound, exclusive use of cores.
  Scenario: extremely latency-sensitive, large-scale search and recommendation services.
* Share
  Features: bound to a cpuset scheduling domain, CCD-aware, NUMA binding optional, mutually exclusive with Share and Exclusive services, cores can be shared with None-type services.
  Scenario: Java microservices, application gateways, and web services that tolerate some interference.
* Reclaimed
  Features: no cpuset binding, may share cores with services that are not exclusively bound, core allocation is left entirely to the kernel, and CPU resources are not 100% guaranteed.
  Scenario: batch offline services and some computing services with no latency requirements.
Offline eviction
In extreme scenarios, such as when whole-machine memory usage is high enough to risk triggering OOM, or when offline services cannot get the CPU they need for a long time, the node side computes a multi-dimensional ranking of offline services based on their priority configuration, resource usage, and running duration, and evicts them in that order.
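A multi-dimensional ranking of this kind can be sketched as a composite sort key. The article names the three factors but not how they are weighted or in which direction each one counts, so the ordering below (lowest priority first, then highest CPU usage, then shortest running time) is an assumption, and `OfflinePod` and both functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OfflinePod:
    name: str
    priority: int          # lower value = less important (assumed convention)
    cpu_usage: float       # cores currently used
    running_seconds: int   # how long the pod has been running


def eviction_order(pods: List[OfflinePod]) -> List[OfflinePod]:
    """Assumed ranking: evict low-priority pods first; among equals, evict
    the pod using the most CPU; among those, the one that has run the
    shortest time, so less completed work is thrown away."""
    return sorted(pods, key=lambda p: (p.priority, -p.cpu_usage, p.running_seconds))


def select_victims(pods: List[OfflinePod], cpu_to_release: float) -> List[str]:
    """Walk the ranking until enough CPU would be released."""
    victims, released = [], 0.0
    for pod in eviction_order(pods):
        if released >= cpu_to_release:
            break
        victims.append(pod.name)
        released += pod.cpu_usage
    return victims


pods = [
    OfflinePod("train", priority=5, cpu_usage=8.0, running_seconds=7200),
    OfflinePod("transcode-a", priority=1, cpu_usage=2.0, running_seconds=600),
    OfflinePod("transcode-b", priority=1, cpu_usage=4.0, running_seconds=300),
]
victims = select_victims(pods, cpu_to_release=5.0)
```

Stopping as soon as enough CPU is released keeps the number of evicted offline pods to the minimum the ranking allows.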
Offline business scenarios
As a content community with hundreds of millions of users, Xiaohongshu has a wide variety of offline business scenarios, including transcoding, search and recommendation, CV/NLP algorithm inference and training, algorithm feature production, and data warehouse queries. Specifically, they include the following business types:
* Nearline and offline transcoding (containerized)
* Flink streaming and batch computing (containerized)
* Spark batch computing (not containerized, on YARN)
* CV/NLP algorithm sweepback scenario (containerized)
* Training (containerized)
By providing unified offline scheduling capabilities based on K8s, these offline services are colocated with online services in a unified computing resource pool. This provides differentiated resource quality assurance for online services and massive low-cost computing power for offline services, improving resource efficiency.
K8s and YARN colocation solution
Xiaohongshu's commercialization and community search services run a large number of algorithm-related Spark jobs. These jobs could not be processed in time because offline cluster resources were scarce, while the online clusters had low resource utilization during off-peak hours. In addition, to reduce the cost of business migration, we chose to work with the Koordinator community on the YARN on K8s colocation solution, which let us quickly bring Spark offline workloads into colocation, as shown in the figure.
Containerized online and offline workloads are published to the cluster through the K8s path, while Spark jobs are scheduled to specific nodes by the YARN ResourceManager and launched by the NodeManager component on the node. NodeManager is deployed in the K8s cluster as containers. In addition, the following components are involved:
* Scheduling side, koord-yarn-operator: supports two-way synchronization between the K8s and YARN scheduler resource views.
* Node side, copilot: operates NodeManager and provides a YARN task management and control interface.
* Node side, neptune-agent / koordlet: offline resource reporting, management of offline pods and tasks on the node, conflict resolution, eviction, and suppression policies.
The core capabilities for K8s and YARN colocation have been developed in the community and will be released in Koordinator 1.4 in late November.
To share and allocate the total offline resources available on cluster nodes, the koord-yarn-operator component synchronizes and coordinates resources between the two schedulers in both directions through two synchronization links:
1. K8s -> YARN: synchronizes the total amount of offline resources as seen by YARN, which is calculated as follows:
Total YARN offline resources = Total offline resources - Resources already allocated on the K8s side of the node.
2. YARN -> K8s: synchronizes the amount of resources already allocated by YARN, and the total amount of K8s offline resources is calculated as follows:
Total K8s offline resources = Total offline resources - Resources already allocated on the YARN side of the node.
Based on their respective offline resource views of each node, the two schedulers make scheduling decisions independently, placing K8s offline pods and YARN tasks on the nodes.
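The two synchronization formulas can be written out as a small Python sketch. It only restates the arithmetic above and is not koord-yarn-operator's code; the function names, the clamp at zero, and the sample numbers are assumptions.

```python
from typing import Dict


def yarn_offline_total(node_offline_total: float, k8s_allocated: float) -> float:
    """K8s -> YARN link: what YARN may still hand out on this node."""
    return max(0, node_offline_total - k8s_allocated)


def k8s_offline_total(node_offline_total: float, yarn_allocated: float) -> float:
    """YARN -> K8s link: what the K8s scheduler may still hand out on this node."""
    return max(0, node_offline_total - yarn_allocated)


def sync_views(node_offline_total: float, k8s_allocated: float,
               yarn_allocated: float) -> Dict[str, float]:
    """Compute both views for one node in a single pass."""
    return {
        "yarn": yarn_offline_total(node_offline_total, k8s_allocated),
        "k8s": k8s_offline_total(node_offline_total, yarn_allocated),
    }


# Hypothetical node: 32 offline cores, 10 allocated by K8s, 6 allocated by YARN.
views = sync_views(32, 10, 6)

# The two schedulers decide independently, so between syncs their combined
# allocations can exceed the node's offline total (12 + 8 > 16 here).
overcommitted = sync_views(16, 12, 8)
```

Because each scheduler only subtracts the other side's allocations, neither view prevents the combined allocation from temporarily exceeding the node total.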
When the offline resources allocated on a node exceed the node's available offline resources for a long time and offline utilization stays high, offline services may be unable to obtain resources and starve. In that case, the node side computes a combined ranking from factors such as offline service priority, resource usage, and running time, and evicts offline services accordingly.
Alibaba Cloud EMR productization support
Meanwhile, the Alibaba Cloud EMR team provides product-level development support for the colocation feature. While remaining compatible with EMR's existing logging, monitoring, and O&M logic, it supports automatic scaling of NodeManager pods in the K8s cluster.
Up to now, Xiaohongshu's colocation capabilities cover hundreds of thousands of machines and millions of cores of computing power, and support resource scheduling for tens of thousands of offline scenario services. Through the continued rollout of large-scale container colocation, Xiaohongshu has achieved significant gains in resource cost efficiency, in the following two aspects:
* CPU utilization: While ensuring service quality, the average daily CPU utilization of colocated clusters has increased to 45% or more, and some clusters sustain a stable increase in daily average CPU utilization. Through offline colocation and other techniques, the CPU utilization of some storage clusters has improved by 20% or more.
* Resource costs: While ensuring the stability of offline services, colocation provides millions of core-hours of low-cost computing power for Xiaohongshu's offline scenarios. The CPU allocation rate of colocated clusters has increased to more than 125%, which is significantly lower than that of exclusive resource pools.
In April 2022, Koordinator was officially open-sourced. In June of the same year, Xiaohongshu launched its offline colocation project and began to participate in the design and submission of Koordinator proposals. In August 2022, Xiaohongshu and the community jointly built the runtime-proxy component and put it into use in internal scenarios. In April 2023, Xiaohongshu launched and led the YARN and K8s colocation project in the community, and in August 2023 the solution was rolled out at scale within Xiaohongshu.
Up to now, with the help of Koordinator, Xiaohongshu's colocation has covered tens of thousands of nodes and provided hundreds of thousands of offline resources. The utilization of the overall colocated cluster has increased to more than 45%, which is a good production result.
Over nearly a year of technical exploration, Xiaohongshu has accumulated substantial experience in improving resource efficiency and has achieved good results. As the company's business scale grows and scenarios become more complex, we will face many new technical challenges. In the next stage, we aim to build a unified resource scheduling capability for a hybrid cloud architecture, and the work will focus on the following three aspects:
* Mixed workload scheduling support: task-based workload scheduling capabilities, covering big data and AI, that meet the resource scheduling functionality and performance requirements of all Xiaohongshu business scenarios.
* Further resource efficiency improvements: for the hybrid cloud architecture, we will promote larger-scale resource pooling and quota-based resource delivery, and use more aggressive techniques such as elasticity, colocation, and overselling to further improve cluster resource utilization and significantly reduce resource costs.
* Stronger service quality assurance: with more aggressive CPU utilization targets, we will address the colocation interference problems that arise in deep-water areas by building QoS-aware scheduling and interference detection capabilities, and by relying on techniques such as secure containers.
Over the next few releases, Koordinator will focus on the following areas:
* Scheduler performance optimization: support equivalence-class scheduling, merging requests from identical pods to avoid repeating scheduling steps such as filter and score.
* Network QoS: container service quality in the network dimension, guaranteeing bandwidth for high-priority workloads, with a request/limit model designed to guarantee minimum bandwidth requirements.
* Big data workloads: gang scheduling with atomic preemption, preempting pods as a whole by group, and QoS policy adaptation for Hadoop YARN tasks.
* Resource interference detection: based on underlying metrics, detect contention for container resources, identify abnormal pods, eliminate the interference, and feed the results back to the scheduling path.
To join the Koordinator community DingTalk group, search for group ID 33383887 in DingTalk.