The architecture of a lakehouse-integrated data platform for financial information innovation (Xinchuang)


Reading guide: This article shares Shuxin Network's practical experience in building a lakehouse-integrated data platform architecture for the financial information innovation (Xinchuang) field.

This sharing is mainly divided into the following five parts:

1. Evolution of data platform architecture

2. Challenges of financial information innovation for the data platform

3. DataCyber, a cloud data intelligence platform

4. The implementation path of the financial information innovation data platform

5. Practical cases of the financial information innovation data platform

Sharing guest: Yuan Panfeng, CTO of Zhejiang Shuxin Network***

Edited by Ma Xinhong.

Content proofreading by Li Yao.

Produced by the DataFun community.

Evolution of data platform architecture.

The development of big data infrastructure has gone through four main phases, each with landmark technological advancements to address new application needs.

Phase 1: Data Warehousing. At this stage, the data platform was mainly used to support online analytical processing (OLAP) and business intelligence (BI) reporting. Representative technologies include Oracle's shared storage architecture and Teradata's massively parallel processing (MPP) architecture.

Phase 2: Data Platform. With the rise of big data, data platforms began to feature large-scale data storage and computing, mainly serving stream and batch computing scenarios. The representative technology of this phase is Hadoop, which evolved from the early single MapReduce compute engine into a 2.0 phase supporting multiple compute engines, in order to handle more complex data analysis needs.

Phase 3: Data Middle Platform. In terms of technology, the data middle platform continues the related technologies of the data platform, such as Hadoop, and integrates data organization and management practices to form more complete data service capabilities.

Phase 4: Cloud Data Platform. Current cloud data platforms are based on cloud-native architectures and offer innovative products such as cloud data warehouses. Representative products include Snowflake and Databricks, which support auto scaling and pay-as-you-go billing for multi-tenant resources on the cloud. Technically, advanced data architectures such as stream-batch integration, lakehouse integration, and storage-compute separation have emerged at this stage.

The first important trend in the current development of data platforms is the combination of cloud native and big data. This combination forms a new type of data platform architecture, which leverages cloud-native resource scheduling and unified storage and workload management to take full advantage of cloud native in resource utilization, elastic scheduling and computing, and standardized deployment and upgrades. This architecture not only improves the efficiency of data processing, but also enhances the flexibility and scalability of the data platform, providing enterprises with more efficient and reliable data services.

The second important trend is lakehouse integration. Although the traditional architecture of separate data lakes and data warehouses can handle multiple data types, its redundant data storage and reliance on ETL tasks for data movement lead to poor data timeliness and consistency, while also increasing development and O&M complexity.

The lakehouse architecture solves these problems by combining the advantages of data lakes and data warehouses to create an integrated and open data processing platform. This architecture allows the underlying data to be stored and managed in a unified manner, and enables efficient scheduling and management of data between the lake and the warehouse. In addition, it can provide unified query and analysis capabilities to the business layer, improve the timeliness and consistency of data, reduce development and O&M costs, and provide strong support for enterprise data analysis and decision-making.
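To make the idea concrete, here is a minimal, hedged PySpark sketch (not taken from DataCyber) that writes and reads one copy of data in an open lake table format, Apache Hudi; the package coordinate, paths, and table name are illustrative assumptions.

```python
# Minimal PySpark sketch: write and read an open lake-table format (Apache Hudi here)
# so that batch and interactive engines can share a single copy of the data.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-demo")
    # The Hudi bundle must match your Spark/Scala version; this coordinate is an example.
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.0")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.createDataFrame(
    [(1, "alice", "2024-03-01"), (2, "bob", "2024-03-02")],
    ["id", "name", "dt"],
)

# Upsert into a Hudi table stored on shared lake storage (HDFS or object storage).
(df.write.format("hudi")
   .option("hoodie.table.name", "demo_users")
   .option("hoodie.datasource.write.recordkey.field", "id")
   .option("hoodie.datasource.write.precombine.field", "dt")
   .mode("append")
   .save("hdfs:///warehouse/demo_users"))

# Any engine that understands the table format can query the same files.
spark.read.format("hudi").load("hdfs:///warehouse/demo_users").show()
```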

The third important trend is the separation of storage and computing. For a long time, Hadoop has been providing big data capabilities with an integrated storage and computing architecture, but with the rapid growth of internal network bandwidth, the expansion of data scale, and the development of data lake technology, big data infrastructure is evolving towards a storage and computing separation architecture.

The core of storage-compute separation is to split Hadoop's resource scheduling (YARN) from its storage clusters (HDFS), decoupling storage from resource management. Although this approach increases the O&M burden and sacrifices some data-locality read performance, practical experience shows that these losses are manageable, especially for cost-conscious customers and privatized deployment scenarios.

After entering the cloud-native era, the architecture of storage and computing separation has become more diversified. The underlying storage can be HDFS, S3 object storage, etc., while the resource scheduling framework fully embraces Kubernetes for resource scheduling and management. This architecture provides greater flexibility and scalability, helping to optimize resource usage and reduce costs while maintaining the high performance and reliability of the big data platform.
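As a hedged illustration of this architecture (not a DataCyber-specific configuration), the following PySpark sketch schedules compute on Kubernetes while keeping data in S3-compatible object storage; the API server address, image name, credentials, and bucket paths are placeholders.

```python
# Minimal sketch of storage-compute separation: compute runs as pods on Kubernetes,
# while data lives in external object storage (S3-compatible) or HDFS.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("compute-on-k8s")
    # Compute: Spark executors are scheduled as pods on a Kubernetes cluster.
    .master("k8s://https://kube-apiserver.example.com:6443")
    .config("spark.kubernetes.container.image", "example.com/spark:3.3.2")
    .config("spark.executor.instances", "4")
    # Storage: data is read from and written to S3-compatible object storage.
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .getOrCreate()
)

# The same job could point at HDFS instead; only the storage URI changes.
orders = spark.read.parquet("s3a://datalake/ods/orders/")
orders.groupBy("dt").count().write.mode("overwrite").parquet("s3a://datalake/dws/order_counts/")
```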

The fourth trend is hybrid cloud and data cloud. With the increasing popularity of enterprise data centers, it is becoming more common for both public and private clouds to exist. Enterprises need a platform to manage data centers on both types of clouds in a unified manner and ensure the secure flow of data across clouds.

In order to ensure the secure flow of data within and between enterprises, technologies such as data sandboxing and privacy computing need to be adopted. These technologies can help solve the problem of the secure flow of data between different enterprises. For a data platform, a solution that supports the secure flow of data between multiple tenants is essential.

The challenge of financial information innovation to the data platform

Next, in the second part, we will focus on the field of financial information innovation and analyze the challenges faced by data platforms.

"Xinchuang" is the abbreviation of China's information technology application innovation, and its goal is to promote the localization of the core technology of the IT industry chain and achieve security, autonomy and controllability. The adaptation of big data components is an important part of the information innovation strategy. In the financial industry, the promotion and implementation of information innovation is accelerating, and the adaptation of big data components is an important challenge at present.

The necessity of big data information innovation is mainly reflected in two aspects: first, the high licensing costs of overseas big data platform products such as CDH; second, these products cannot fully support the domestic software and hardware in China's information innovation environment. Therefore, adapting big data components has become an important task.

The adaptation process for information innovation is far more complex than a simple open-source technology migration. First, domestic CPU chips, operating systems, databases, and cloud platforms must be adapted one by one. After that, version conflicts, dependency package conflicts, and combination issues between different big data components need to be resolved. This requires a professional big data team to go through a series of processes such as compilation, assembly, packaging, deployment, and testing based on open-source technology, finally producing a production-grade deployment that can be delivered to customers.
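The combinatorial nature of this adaptation work can be shown with a toy Python sketch; the component versions, CPU architectures, and operating systems listed below are assumptions for illustration only.

```python
# Toy sketch of the adaptation matrix: each big data component must be compiled,
# packaged, deployed, and tested per CPU architecture and operating system.
from itertools import product

components = {"hadoop": "3.3.4", "hive": "3.1.3", "spark": "3.3.2", "flink": "1.16.1"}
cpu_archs = ["x86_64", "aarch64"]            # e.g. x86 servers vs. ARM-based domestic chips
operating_systems = ["kylin-v10", "uos-20"]  # example domestic operating systems

for (name, version), arch, os_name in product(components.items(), cpu_archs, operating_systems):
    # Each line corresponds to one compile -> assemble -> package -> deploy -> test pipeline.
    print(f"build {name}-{version} for {arch}/{os_name}")
```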

The second challenge is the stability, performance, and security of big data components. To ensure the high availability and stability of big data components in the information innovation environment, the following work is required:

Fully adapt mainstream big data computing, storage, and analytics components so that they run smoothly in a cloud-native environment.

Optimize the performance of the adapted components to close the performance gap between Xinchuang and non-Xinchuang environments.

Optimize performance for cloud-native environments and storage-compute separation architectures to meet the needs of different business scenarios.

Conduct large-scale performance testing and optimization to ensure components perform well in real-world deployments.

At the same time, the security of the big data platform cannot be ignored. Platform security needs to be ensured across multiple dimensions, such as user management, tenant management, permission management, and an audit center. This includes adapting security components such as Kerberos and OpenLDAP to the information innovation environment, as well as building the multi-tenant, permission, and audit systems. Through these comprehensive measures, the security of the big data platform in the information innovation environment is ensured.

The third challenge is the migration and hybrid deployment of big data clusters. This process is gradual and involves the parallel operation and transition of the old and new clusters. Therefore, it is necessary to develop tools to support cluster data migration in heterogeneous environments and maximize the reuse of server resources in existing old clusters.

In order to effectively reuse existing resources, the Xinchuang big data platform needs a variety of hybrid deployment capabilities, including support for hybrid deployments across different CPU architectures, hardware specifications, and operating systems. These requirements pose higher challenges for big data information innovation.
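As one example of how such hybrid deployment can be expressed, the sketch below shows a Kubernetes pod specification (written as a Python dict) pinned to ARM nodes via the standard kubernetes.io/arch label; the image name and pod details are placeholders, not a product configuration.

```python
# Sketch of hybrid deployment across CPU architectures on Kubernetes: one cluster can mix
# x86 and ARM nodes, and each workload is pinned to nodes with a compatible architecture.
datanode_arm = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "datanode-arm"},
    "spec": {
        # Standard node label set by the kubelet; schedules this pod only onto ARM nodes.
        "nodeSelector": {"kubernetes.io/arch": "arm64"},
        "containers": [{
            "name": "hdfs-datanode",
            "image": "example.com/hdfs-datanode:3.3.4-arm64",  # arch-specific image (placeholder)
        }],
    },
}

# An x86 variant would differ only in the nodeSelector value ("amd64") and image tag.
print(datanode_arm["spec"]["nodeSelector"])
```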

DataCyber, a cloud data intelligence platform

The third part details the architecture design and related practices of DataCyber, a cloud data intelligence platform independently developed by Shuxin Network, in the context of financial information innovation.

Design goals

Before we get into the DataCyber technology architecture, let's first clarify the design goals of the entire system. The primary design goal of the platform is to create a technologically independent and controllable big data platform in the domestic information innovation environment. At the same time, we do not pursue the development of big data engines from scratch, but hope to participate in the construction of new big data engine technologies through the open source community ecology to ensure the openness and compatibility of the platform.

In terms of technology selection, we adopt a cloud-native, lakehouse-integrated architecture. This architecture combines the technical advantages of cloud native and the lakehouse to deliver a next-generation cloud data intelligence platform. We also hope the platform can realize the integration of data + AI, that is, connecting the data platform and the AI platform, abstracting their common technical components, and unifying the account and tenant system across them.

In addition, the entire platform is designed based on a multi-tenant system. It is necessary to ensure isolation and security within tenants, and to support open data sharing across tenants. This is a key capability of the data platform, especially in the financial scenario, where customers need to analyze and mine the value of data through data circulation between different enterprise entities.

Architectural design

The diagram above illustrates the overall architecture of DataCyber, an open cloud data intelligence platform designed to support heterogeneous hardware environments, including traditional x86 servers as well as emerging CPU architectures such as ARM and MIPS. The underlying layer of the platform can adapt to scenarios such as private cloud and hybrid cloud of different cloud platform vendors.

DataCyber's product matrix is divided into several tiers from bottom to top:

CyberEngine: provides the lakehouse engine base and big data cluster management and O&M capabilities, and provides the foundation support for the data platform and AI platform.

CyberData: provides developers and users with one-stop product capabilities for the data platform.

CyberAI: provides developers and users with one-stop product capabilities for the AI platform.

CyberMarket: Responsible for the secure circulation of data models and algorithm applications across tenants to maximize the value of data.

The architecture of each DataCyber component is described in detail below, showing how these components enable intelligent data processing and analysis, and how they facilitate data sharing and circulation between tenants while ensuring security.

(1) CyberEngine

First of all, let's introduce CyberEngine, which is an advanced big data management platform that is designed to support both cloud-native environments and traditional data architectures. The platform is divided into four levels from bottom to top: resource scheduling, data storage, data engine, and management platform.

At the resource scheduling layer, CyberEngine provides unified resource management and supports both cloud-native K8s scheduling and traditional YARN scheduling, helping customers transition smoothly to a cloud-native architecture. The data storage layer supports traditional HDFS storage, object storage, and new data lake formats, and provides core services such as metadata services, data ingestion, data lake acceleration, and management. The data engine layer includes streaming, batch, and interactive analysis engines, built on open-source technology to form a high-performance, high-stability big data engine distribution for different scenarios; it also includes a unified data integration engine, a unified task scheduling engine, a unified metadata service engine, and a unified SQL engine to support the CyberData and CyberAI platforms. The management platform consists of the EngineManager product, which provides one-stop big data cluster planning, deployment, and O&M management, aiming to be an intelligent and efficient infrastructure management platform for enterprise big data administrators and operators.

CyberEngine's features include: a fully cloud-native design; multi-tenant and multi-cluster management with comprehensive publishing, configuration, management, operation, and auditing capabilities; support for mainstream versions of big data components, including compute and storage components, data lake engines, and analytical engines, with better stability and performance than the open-source baselines; and support for large-scale deployment and management.

(2) CyberData

CyberData is a one-stop big data intelligent R&D and governance platform. It is designed to be modular and pluggable and can be split into different sub-products to meet different needs, including data modeling, data integration, data development and operations, data asset governance, data security, data services, and more. The platform supports a variety of warehouse architectures, including offline data warehouse, real-time data warehouse, stream-batch integrated data warehouse, and lakehouse-integrated architecture, meeting the needs of data warehouse engineers, data analysis engineers, and data management personnel.

CyberData is built on a cloud-native technology architecture and can be deployed in multi-cloud environments, supporting large-scale enterprise applications across multiple environments, regions, and clusters. In the information innovation direction, CyberData not only supports a variety of Xinchuang software and hardware environments, but can also access localized databases and data sources, collecting business data onto the platform for processing and analysis.

CyberData's core capabilities include unified metadata management, data integration capabilities, and workflow scheduling for data development tasks, all of which are powered by the four core service engines of CyberEngine products.

(3) CyberAI

CyberAI is a one-stop machine learning platform designed for algorithm engineers, data scientists, and data analysts. The platform first emphasizes management of and access to the underlying infrastructure, built on the unified service engines provided by CyberEngine; in this way, CyberAI can effectively manage resources and data access.

In addition, the CyberAI platform works seamlessly with the CyberData data platform. This integration is reflected in a shared tenant and account system and in unified access and management of data sources.

In terms of productization capabilities, the CyberAI platform not only supports traditional interactive and visual modeling capabilities, but also provides algorithm sandbox and federated learning support for inter-enterprise data circulation scenarios to achieve cross-tenant data circulation and algorithm training. In addition, the platform also integrates the relevant capabilities of large model tools, and users can complete the construction process of large model applications in one stop on this platform.

(4) Product output form

DataCyber is a highly open cloud data intelligence platform. Its three main products, CyberEngine, CyberData, and CyberAI, can each be delivered independently while maintaining maximum compatibility with other open-source components and big data analytics databases. Through plug-in-based, standardized integration, these products support diverse scenarios including CDH/CDP localization replacement, big data foundation construction, data development and governance, and machine learning framework integration.

CyberEngine: As a big data base, it can plug in mainstream big data components and is suitable for building a big data platform from scratch, for example as a CDH replacement in banks.

CyberData: Uses standardized plug-ins to access the big data base, quickly supporting data development and governance on open-source, commercial, and cloud-native clusters; suitable for building various enterprise-level data platforms.

CyberAI: As a one-stop machine learning platform, it integrates frameworks such as TensorFlow and PyTorch in a plug-in manner, and is suitable for scenarios such as private cloud enterprise machine learning platforms.

Combined outputs between products also offer more possibilities, such as:

CyberEngine + CyberData: a combination of big data foundation + one-stop DataOps platform, such as a cloud-native lakehouse platform.

CyberEngine + CyberAI: a combination of big data/AI foundation + one-stop MLOps platform.

CyberData + CyberAI: an integrated data intelligence platform whose product capabilities can be benchmarked against Databricks and Snowflake, delivered as a lakehouse platform on the cloud.

CyberEngine + CyberData + CyberAI: a one-stop combination of big data/AI base, DataOps platform, and MLOps platform, delivering a cloud-native elastic lakehouse platform to meet customers' diverse application scenarios.

Core technology components

Let's take a look at some of DataCyber's core technology components.

The first is CyberLakehouse, a lakehouse engine that combines the information innovation environment with cloud-native technology. It has undergone full-stack adaptation in the information innovation environment, from basic hardware to operating systems to the surrounding industry ecosystem, ensuring the localization and adaptation of big data components. On this basis, containerized, standardized release processes and source-code-level adaptation of big data components are implemented, with support for cloud-native deployment. Above that sits the lakehouse architecture, which requires storage-compute separation, open storage, flexible expansion of computing components, and unified, standardized metadata services.

Based on these requirements, the CyberLakehouse architecture is divided into three layers:

Storage layer: Provides unified storage with HDFS and data lake format support, as well as data access layer acceleration. Access-layer acceleration includes caching systems such as Alluxio and shuffle services such as Celeborn, improving access stability and performance under the storage-compute separation architecture (a minimal cache-read sketch follows this overview).

Computing layer: Based on cloud-native resource scheduling (K8s), it supports batch processing (Hive, Spark), stream processing (Flink), and interactive analysis (StarRocks, Presto, Impala) engines.

Management layer: Provides a one-stop operation platform for cluster planning, management, implementation, deployment, and operation and maintenance monitoring.

Together, these technical components form DataCyber's lakehouse platform, which provides a one-stop operating platform for big data operation and maintenance managers, supporting multiple computing components and storage formats, ensuring high performance and flexibility.
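As a hedged illustration of access-layer acceleration (not the platform's actual configuration), the sketch below reads the same dataset directly from remote storage and through an Alluxio cache; the master address, port, and paths are assumptions, and the Alluxio client jar must be on the Spark classpath for the alluxio:// scheme to resolve.

```python
# Sketch of access-layer acceleration under storage-compute separation: the warehouse
# path is read through an Alluxio cache instead of hitting remote storage directly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-accelerated-read").getOrCreate()

# Cold path: read directly from remote HDFS / object storage.
direct_df = spark.read.parquet("hdfs://warehouse-ns/warehouse/dwd/orders/")

# Hot path: the same dataset mounted in Alluxio, served from memory/SSD cache near compute.
cached_df = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/dwd/orders/")
cached_df.groupBy("dt").count().show()
```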

The figure above illustrates the extensive secondary development and adaptation of open-source big data components carried out during the R&D of DataCyber's CyberLakehouse. It has been fully adapted to the batch engines Hive and Spark, the stream computing engine Flink, and the interactive analysis engines Impala and Presto. Key outcomes include:

Hive: Resolved the adaptation issue of Hive to higher versions of Hadoop, K8S, data lake components, and data caching systems.

Spark: Adapted Spark to earlier Hadoop versions (pre-Hadoop 3), along with Spark on Hive and Spark on K8s deployment capabilities, with support for elastic scaling (a configuration sketch appears below, after this list).

Flink: Supports elastic deployment of Flink on K8s, with secondary development of elastic resource management for Flink session clusters to achieve auto scaling of jobs.

Data lake integration: Data lake engines such as Hudi and Paimon are integrated.

Analysis engine: Customized secondary development and adaptation of Impala and Kudu solve the deployment problem of Impala in the cloud-native environment and support the smooth upgrade of old CDH users.

Security: Solves the integration issues of Kerberos, Ranger, and OpenLDAP, providing a solution to the security needs of financial-grade customers.

As a result of these adaptation and development efforts, the latest version, CyberEngine 23.0, already covers more components than CDH, ships newer component versions than legacy CDH, and is delivered as a productized offering.
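For reference, the elastic scaling mentioned for Spark on K8s above typically relies on Spark 3's dynamic allocation with shuffle tracking; the sketch below shows such a configuration with illustrative values and is not a CyberEngine-specific setting.

```python
# Hedged sketch of elastic scaling for Spark on K8s: dynamic allocation lets executor
# pods scale up and down with load; shuffle tracking replaces the external shuffle service.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("elastic-spark-on-k8s")
    .config("spark.dynamicAllocation.enabled", "true")
    # On K8s there is usually no external shuffle service, so shuffle tracking is used instead.
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "20")   # illustrative ceiling, not a recommendation
    .getOrCreate()
)
```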

CyberMeta is the core metadata component of the big data platform. It provides unified management of lakehouse metadata across the platform, active discovery of metadata from external data sources, and metadata interoperability across multiple compute engines. In addition, it supports unified data permission management and automated optimization and acceleration of data lakes across multi-engine computing scenarios.

To meet the metadata requirements of heterogeneous big data compute engines, the unified metadata service engine supports two modes (a minimal configuration sketch follows the two modes):

Integration with Hive Metastore: Provide metadata management and services for different analytical compute engines through Hive's standardized metadata capabilities.

Custom catalog extension mechanism based on Spark and Flink: Extends metadata management to a wider range of data sources, enabling the Spark and Flink engines to access relational databases and perform cross-source access between them and lakehouse data sources.
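A minimal PySpark sketch of these two modes is shown below; it uses a Hive Metastore URI plus Spark 3's built-in JDBC catalog as a stand-in for the custom catalog mechanism, and all URIs, credentials, and table names are assumptions.

```python
# Sketch of the two metadata modes: (1) Spark uses a shared Hive Metastore as its catalog;
# (2) a JDBC catalog is registered so the same session can also see a relational database,
# enabling cross-source queries. The MySQL JDBC driver jar must be on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("unified-metadata-demo")
    # Mode 1: standardized metadata through the Hive Metastore.
    .config("hive.metastore.uris", "thrift://metastore.example.com:9083")
    .enableHiveSupport()
    # Mode 2: a catalog extension (Spark 3 JDBC catalog) exposing an external MySQL database.
    .config("spark.sql.catalog.mysql_cat",
            "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    .config("spark.sql.catalog.mysql_cat.url", "jdbc:mysql://mysql.example.com:3306/ods")
    .config("spark.sql.catalog.mysql_cat.user", "reader")
    .config("spark.sql.catalog.mysql_cat.password", "******")
    .getOrCreate()
)

# Cross-source join: a lakehouse table from the Hive catalog with a table from the JDBC catalog.
spark.sql("""
    SELECT o.dt, count(*) AS cnt
    FROM dwd.orders o
    JOIN mysql_cat.ods.customers c ON o.customer_id = c.id
    GROUP BY o.dt
""").show()
```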

CyberScheduler, another core technical component of the big data platform, is responsible for workflow scheduling of data warehouse tasks, ensuring efficient task execution and stable data flows. Its architecture is divided into three layers:

Web Services and Scheduler API layer: Provides users with an interface for scheduling tasks.

Coordinator cluster: A distributed scheduling system that is responsible for generating job instances and scheduling them according to workflow dependencies, and provides API service-based interfaces. It emphasizes the stability of the service, high concurrency, and low latency.

Worker clusters: Perform different types of jobs, including tasks that are executed locally and submitted remotely to platforms such as Hadoop and K8S. It focuses on task scalability and resource isolation.

CyberScheduler covers a variety of job types and supports features such as periodic scheduling, flexible dependencies, data backfilling, and rerun from breakpoints. It adapts to lakehouse scenarios of different scales, supporting anywhere from 100,000 to more than 10 million scheduled tasks, and meets different customer needs through a unified architecture and different deployment modes to achieve stable scheduling and O&M.

In addition, CyberScheduler supports intelligent scheduling and monitoring, optimizes task scheduling resources based on historical data, and provides intelligent prompts for task output time and alarms, thereby improving scheduling efficiency and task success rate.
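The dependency-driven core of such a scheduler can be sketched in a few lines of plain Python; this illustrates the concept only and is not CyberScheduler's actual API, and the task names and dates are made up.

```python
# Illustrative sketch of workflow scheduling: jobs declare upstream dependencies and
# instances run in topological order once all upstream jobs have completed.
from graphlib import TopologicalSorter

# Workflow definition: task -> set of upstream tasks it depends on.
workflow = {
    "ods_ingest": set(),
    "dwd_clean": {"ods_ingest"},
    "dws_aggregate": {"dwd_clean"},
    "ads_report": {"dws_aggregate"},
}

def run_task(name: str, biz_date: str) -> None:
    # A real worker would submit the job to Hadoop/K8s and wait for completion.
    print(f"[{biz_date}] running {name}")

def run_workflow(biz_date: str) -> None:
    # static_order() yields tasks with all of their predecessors first.
    for task in TopologicalSorter(workflow).static_order():
        run_task(task, biz_date)

# Periodic scheduling and data backfilling both reduce to running the DAG per business date.
for date in ["2024-03-01", "2024-03-02"]:
    run_workflow(date)
```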

CyberIntegration, the unified data integration engine, is an all-in-one data synchronization platform. The platform supports three main data synchronization engines: DataX, Spark, and Flink. These engines can handle a variety of data synchronization requirements, including batch synchronization, streaming synchronization, full synchronization, incremental synchronization, and full database synchronization.

CyberIntegration's system architecture allows it to dynamically determine the required resources and synchronization strategy based on the scale of the data source, and supports horizontal scaling. In addition, since the platform must support public, private, and hybrid cloud architectures, it also addresses the technical challenges of transferring data across network segments during integration. This flexibility and processing capability make CyberIntegration an effective solution for a wide range of data integration needs.
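As one hedged example of an incremental synchronization strategy (watermark-based, not necessarily how CyberIntegration implements it), the sketch below pulls only new rows from a relational source via Spark JDBC and appends them to lake storage; connection details and column names are assumptions.

```python
# Sketch of watermark-based incremental sync: read only rows newer than the last sync
# point from the source database and append them to the lake. Requires the MySQL JDBC
# driver jar on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-sync").getOrCreate()

last_watermark = "2024-03-05 00:00:00"   # normally persisted by the integration platform

increment = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql.example.com:3306/ods")
    .option("user", "sync")
    .option("password", "******")
    # Push the filter down to the source database so only new rows are transferred.
    .option("dbtable", f"(SELECT * FROM orders WHERE updated_at > '{last_watermark}') t")
    .load()
)

increment.write.mode("append").parquet("hdfs:///warehouse/ods/orders/")
```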

CyberMarket, the data distribution hub, focuses on solving cross-tenant data circulation problems. CyberMarket supports a variety of open data sharing methods, including data APIs, data applications, and data and algorithm sandboxes; the sandboxes allow data to be "usable but not visible" across tenants, which is especially suitable for industries such as finance.

Data sandboxing ensures data security through physically isolated storage and isolation between tenants. At the computing level, SQL sandboxes and algorithm sandboxes provide a secure environment for data analysis and mining. Once data enters a computing sandbox, it can only be used securely within the sandbox, and computation results can be exported only with the approval of the data owner. After the sandbox is used, corresponding security mechanisms are applied.

Collaboration between CyberData and CyberAI is also key, for example, models trained in the CyberAI platform are published to the CyberData platform for workflow orchestration and scheduling of data development and model training tasks. This capability enables the entire platform to provide comprehensive support in data flow scenarios.

The implementation path of the financial information innovation data platform

The fourth part divides the typical implementation path of the financial information innovation data platform into six stages:

Construction of a unified management platform: First, build a unified management platform to unify the user experience and ensure smooth management and migration during the information innovation switchover.

Business scenario selection and pilot planning: According to the actual situation of the customer, select the appropriate business scenario for systematic pilot, and plan the information innovation cluster.

Resource planning for the lakehouse: Design and plan the lakehouse cluster, including computing, storage, network and other resources to meet business needs.

Data migration and verification: After the deployment of the new Xinchuang cluster is completed, the data of the new and old clusters is migrated, and the data is compared and verified.

Stress testing and optimization: Stress test and optimize the information innovation cluster according to the amount of data and business requirements.

Stepwise switchover and verification: After ensuring that the new cluster meets the performance and stability requirements, the new and old clusters run in parallel to complete the cluster switchover.

This path ensures the efficient implementation and smooth transition of the financial information innovation data platform to meet the business needs of customers.

In the implementation of the financial information innovation data platform, the unified management platform for big data clusters, whose functional and technical architecture is shown in the figure above, is key. The unified management platform covers both the old and the new clusters, which differ in resource scheduling and component usage. The middle layer is responsible for data migration, while the unified management layer connects to and manages the different types of underlying big data clusters through different driver packages. The top layer is the application layer, which is not discussed in detail in this article. This architecture ensures the efficient advancement of the financial information innovation data platform and achieves a steady replacement without affecting the stability of the customer's business.
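The "driver package" idea can be sketched as a small plugin interface; the following Python sketch is hypothetical and only illustrates how a unified management layer might dispatch to different cluster types during migration, with all names invented for illustration.

```python
# Hypothetical sketch of driver-based cluster management: each cluster type (legacy YARN
# cluster, new cloud-native cluster) implements one interface, so the platform above can
# manage both uniformly while they run in parallel.
from abc import ABC, abstractmethod

class ClusterDriver(ABC):
    @abstractmethod
    def submit_job(self, job_spec: dict) -> str: ...
    @abstractmethod
    def job_status(self, job_id: str) -> str: ...

class LegacyYarnDriver(ClusterDriver):
    def submit_job(self, job_spec: dict) -> str:
        return "yarn-app-0001"          # would call the YARN ResourceManager API
    def job_status(self, job_id: str) -> str:
        return "RUNNING"

class CloudNativeK8sDriver(ClusterDriver):
    def submit_job(self, job_spec: dict) -> str:
        return "spark-driver-pod-0001"  # would create a pod / custom resource on K8s
    def job_status(self, job_id: str) -> str:
        return "Running"

def pick_driver(cluster: dict) -> ClusterDriver:
    # The management layer selects a driver package based on the cluster's scheduler type.
    return CloudNativeK8sDriver() if cluster["scheduler"] == "k8s" else LegacyYarnDriver()
```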

Practical cases of the financial information innovation data platform

In the field of financial information innovation, Shuxin Network has achieved remarkable results in implementing its cloud data intelligence platform. This fifth part presents those practical results through two case studies.

The first case involves a joint-stock bank that was using Cloudera's CDH product and faced high subscription costs as well as non-compliance with information innovation requirements. The cloud-native big data management platform CyberEngine provided by Shuxin Network successfully helped the bank upgrade its multiple data clusters to an information innovation compliant lakehouse architecture. This not only improved the platform's autonomy and controllability, but also optimized component version upgrades and computing resource efficiency, bringing comprehensive platform improvements and value to the customer.

The second case is the construction and operation of a provincial financial comprehensive service platform. Shuxin Network's cloud data intelligence platform uses big data and cloud computing technology to unify, fuse, and analyze data from multiple lead departments and banks, solving data-usage problems in the financial field. Shuxin Network provides the CyberData data platform and the CyberAI intelligence platform to help build a financial theme library and a financial data warehouse, covering the whole process from data development and governance to AI task development. In addition, to ensure legal and compliant data sharing, Shuxin Network also provides data sandbox and algorithm sandbox capabilities. Finally, at the business level, Shuxin Network delivers data applications tailored to the financial field, providing a comprehensive data intelligence solution for the financial comprehensive service platform.

That's all for this sharing, thank you.

Sharing guest

Yuan Panfeng

Zhejiang Shuxin Network***

CTO

Master of Computer Science from Beihang University (Beijing University of Aeronautics and Astronautics), with more than 10 years of R&D experience in the big data and privacy computing industries and 15 invention patents at home and abroad.

Former senior technical expert on Alibaba Group's big data platform and a core member of the founding teams of Alibaba Yushanfang, Alibaba Cloud Shujia (DataPlus), and DataWorks; led the 0-to-1 development and commercialization of DataTrust, Alibaba Cloud's privacy-preserving computing platform.
