Author | Chu Xingjuan
"In all of China, the number of senior ETL engineers at P7 and above is very limited, perhaps four or five hundred in total," said Zhou Weilin, who has worked in big data for more than 20 years.
Zhou Weilin built Ant's data platform from scratch and is one of the principal founders of Ant Group's data technology. Is it possible to replicate the capabilities of these engineers through technological innovation, so that every industry can "afford" many such senior experts? With this new thinking about how data engineering should develop, Zhou Weilin and core members of Ant's original data platform department founded Aloudata in early 2021, focusing on eliminating technical bottlenecks in data management.
At the heart of their philosophy is replacing the traditional engineer-driven ETL operating model with an intelligent solution, known as "NoETL", in which an ETL agent drives data processing and management.
The concept of ETL (extract, transform, load) was first proposed by William H. Inmon, the "father of data warehousing". In the early days of data warehouse development, enterprises needed to consolidate data scattered across different systems and formats into a unified data warehouse to support business analysis and decision-making. ETL tools emerged to solve this problem of data consolidation and cleaning, and became an important means of processing large amounts of data.
ETL has benefited from the growth of data and the increased need for data-driven decision-making, but as data continues to grow in size and complexity, traditional ETL technologies are struggling to cope. Specifically, traditional ETL faces the following real-world challenges:
Technical challenges. As the number of data links grows, so does the work of moving and processing data. The data is updated every day, so each task must go through scheduling, development, testing, and release. Query performance requirements differ from case to case, and as requirements become more dynamic, performance optimization becomes urgent. Some requirements cannot wait for system optimization, which leads to disorderly development and undermines the management and governance of the entire link.
Costs are out of control. Balancing demand satisfaction against cost is a challenge that leaders of data platform departments face constantly. Because requirements are flexible, they often cannot be met immediately when first raised. The alternative is to precompute data that might be used before making it available to users, which means higher cost and lower marginal benefit. In reality, this leaves many user needs poorly met, while the business side is constantly challenged by the IT department over goals, benefits, and ROI.
ETL engineers have limited capacity. There is a limit to the number of tasks one person can manage and the system complexity one person can handle. A data warehouse is a comprehensive analysis system whose data only ever grows, so it becomes more and more complex to manage. In addition, because development changes so frequently, maintaining catalogs by hand has become extremely difficult, which leads to failures in data management.
Zhou Weilin believes that the existing ETL engineering system is not sustainable, and that a new way of thinking, a new architecture, and new technologies must be adopted to meet this challenge.
In fact, new data integration and processing methods keep emerging, such as ELT (extract, load, transform), stream processing, and real-time data integration. So how does Aloudata's "NoETL" solve the problems above?
If ETL is an enterprise IT activity, "NoETL" is an enterprise business capability: it seeks an approach that is no longer driven by traditional ETL engineers, so that data productivity can grow sustainably and at scale.
According to Zhou Weilin, the "NoETL" model has four characteristics: pipeline-free, with no need to care where the data is located; O&M-free, with no need to worry about task operations and maintenance; self-optimizing, with no need to worry about query performance; and active metadata, which moves from passive to active and enables the "autonomous driving" of data management.
To achieve these four characteristics, the key is to build three engine capabilities: a data semantic engine, a data virtualization engine, and an active metadata engine, which respectively provide a new data interaction interface, a new data integration method, and a new data management model.
The new data interface is not just about reports. Beyond reports, the business needs more granular data sets and well-defined metrics. In practice, users do not need to know the specific location of the data, but they must clearly understand how each metric is defined and the values behind those definitions. Overall, business people need to do two things: be clear about the definition of each metric, and make sure the metric is what they need.
The new data integration method combines logical data integration with automatic reconstruction of ETL links, so that global data can be connected and integrated without physical centralization. At the same time, adaptive acceleration technology makes data preparation and link orchestration more efficient. It can be compared to an online shopping model: the merchant lists the product first, the consumer completes the transaction by adding it to a cart and placing an order, the merchant ships only after the order is placed, and logistics then handles delivery. This is very different from the traditional model of stocking goods first and selling them afterward.
The new data management model is driven by active metadata. Aloudata has implemented an active metadata system that can perceive global information in real time. Using what the company describes as the world's most fine-grained lineage analysis and data semantic mining technology, it provides comprehensive, accurate metadata and high-confidence intelligent suggestions across data discovery, production, consumption, and management, so that complex data links can be seen, managed, and governed.
Aloudata's NoETL architecture provides a new interactive interface through the data semantic engine, achieves logical data integration and automatic construction of ETL links through the data virtualization engine, and delivers assisted driving (Copilot) for data governance through the active metadata engine.
According to Aloudata, with this new architecture the lead time for real business requirements can be cut by more than 50%, from weeks or months to days or hours.
Under this architecture, Aloudata has launched three main products: Aloudata AIR, a logical data platform; Aloudata BIG, an active metadata platform with operator-level lineage parsing; and Aloudata CAN, an automated metrics platform.
Built on a data fabric architecture, Aloudata AIR integrates multi-source heterogeneous data through virtualization, without physically moving the data, much like providing a centralized shopping platform. In addition, its automatic orchestration of materialized links and intelligent query push-down provide adaptive query acceleration and significantly improve processing efficiency.
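The idea of logical integration with query push-down can be illustrated with a minimal sketch. It is not Aloudata's implementation: the LogicalOrders class, the orders table, and the two in-memory SQLite databases are hypothetical stand-ins for independent physical sources. The point it shows is that the filter and the aggregation run inside each source, so only one partial result per source reaches the logical layer.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory SQLite database standing in for one physical source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn


class LogicalOrders:
    """Hypothetical logical view over several sources; no rows are copied."""

    def __init__(self, sources):
        self.sources = sources

    def total_amount(self, region):
        # Push the filter and the aggregation down to every source, then
        # combine the per-source partial sums in the logical layer.
        sql = "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE region = ?"
        partials = [src.execute(sql, (region,)).fetchone()[0] for src in self.sources]
        return sum(partials)


source_a = make_source([(1, "east", 10.0), (2, "west", 5.0)])
source_b = make_source([(3, "east", 7.5), (4, "north", 2.0)])
view = LogicalOrders([source_a, source_b])
east_total = view.total_amount("east")
```

A real virtualization engine would also have to handle heterogeneous SQL dialects, joins across sources, and cost-based decisions about what to push down, none of which this sketch attempts.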
As an active metadata platform, Aloudata BIG offers operator-level lineage parsing. It can accurately understand the logic of online SQL, enabling real-time, accurate data understanding and more efficient product applications. In addition, the platform can translate that logic into natural language so that it is easier for users to understand, which adds value in model governance, link assurance, comprehensive security and compliance inspection, and similar areas.
Aloudata CAN overturns the traditional metric management model. Once a user defines a metric on the platform, the system develops it automatically, so that defining a metric is the same as producing it, metric definitions stay consistent, and resources are taken offline automatically. This automated production process greatly reduces ETL workload and IT involvement.
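A minimal sketch of the "define once, produce automatically" idea follows. It is an illustration under stated assumptions, not the product's actual interface: the Metric dataclass, the compile_metric function, and the orders table with its columns are all hypothetical. A single declarative definition is rendered into SQL for whatever dimensions are requested, so every consumer shares the same metric definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """Hypothetical declarative metric definition."""
    name: str
    expression: str      # aggregate expression, e.g. "SUM(amount)"
    table: str
    filters: tuple = ()  # SQL predicates that are part of the definition


def compile_metric(metric, dimensions=()):
    """Render one metric definition as SQL for the requested dimensions."""
    select_cols = list(dimensions) + [f"{metric.expression} AS {metric.name}"]
    sql = f"SELECT {', '.join(select_cols)} FROM {metric.table}"
    if metric.filters:
        sql += " WHERE " + " AND ".join(metric.filters)
    if dimensions:
        sql += " GROUP BY " + ", ".join(dimensions)
    return sql


paid_gmv = Metric("paid_gmv", "SUM(amount)", "orders", ("status = 'paid'",))
by_region = compile_metric(paid_gmv, dimensions=("region",))
overall = compile_metric(paid_gmv)
```

Because the filter lives in the definition rather than in each hand-written query, the regional and overall figures cannot drift apart.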
Among Aloudata's existing customers, China Merchants Bank resembles Ant in that both have data management needs at the 10-trillion scale. Previously, China Merchants Bank used a data warehouse solution that physically aggregated and processed data to prepare it for analysis scenarios. This brought the high cost of repeated physical data movement and ETL engineering across scenarios, along with problems such as duplicated derived data, data security risks, poor timeliness, limited flexibility, and low usage efficiency. In response, China Merchants Bank introduced a complete set of services from Aloudata.
The solution first uses virtualization to integrate massive data held in engines such as ClickHouse, MySQL, and Postgres into a unified logical data asset layer, so that the BIX platform can offer users more flexible self-service data preparation, self-service data retrieval, and data services.
For adaptive materialization acceleration, SQL patterns are extracted from users' query history and data orchestration logic. Highly reusable, high-value templates are then identified by weighing factors such as reference-relationship statistics for operator templates, compute and storage costs, access counts, and compression ratios, and by generalizing the templates and applying relational projection to them. These templates drive the physical orchestration of data precomputation links and sustain query performance at daily data volumes in the billions.
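The template selection described above can be approximated with a deliberately simplified sketch. It is not the production algorithm: the to_template and rank_templates functions, the flat storage_penalty parameter, and the sample query history are assumptions made for illustration, and factors such as compression ratio and operator reference relationships are left out. Literals are replaced with placeholders so that similar queries collapse into one template, and each template is scored by the compute time it accounted for minus a storage cost.

```python
import re
from collections import Counter, defaultdict


def to_template(sql):
    """Replace literals with placeholders so similar queries share one template."""
    sql = re.sub(r"'[^']*'", "?", sql)          # string literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)  # standalone numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def rank_templates(history, storage_penalty=0.0):
    """Rank templates by observed compute seconds minus a flat storage cost.

    history is a list of (sql_text, compute_seconds) pairs.
    Returns (template, access_count, score) tuples, best candidate first.
    """
    hits = Counter()
    seconds_spent = defaultdict(float)
    for sql, seconds in history:
        template = to_template(sql)
        hits[template] += 1
        seconds_spent[template] += seconds
    scored = [(t, hits[t], seconds_spent[t] - storage_penalty) for t in hits]
    return sorted(scored, key=lambda item: item[2], reverse=True)


history = [
    ("SELECT region, SUM(amount) FROM orders WHERE dt = '2024-01-01' GROUP BY region", 30.0),
    ("SELECT region, SUM(amount) FROM orders WHERE dt = '2024-01-02' GROUP BY region", 28.0),
    ("SELECT * FROM users WHERE id = 42", 0.5),
]
ranking = rank_templates(history, storage_penalty=5.0)
```

In this sample the daily regional aggregate is the clear candidate for precomputation, while the cheap point lookup scores below zero and would be left alone.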
According to the reported figures, the data preparation cycle under the new solution fell from two weeks to one day, and overall storage and computing costs fell by more than 50%.
"Our mission isn't just to solve the talent shortage, but to change more fundamentally the way data is produced, so that it's always ready to use," Zhou Weilin said.