This article introduces how the Ctrip travel team built its real-time data warehouse, covering business pain points, business objectives, project architecture, and project construction.
1. Business pain points
As demand for real-time data grows, offline data warehouses expose more and more business pain points, such as:
Real-time requirements are developed in a siloed, chimney-style fashion.
Poor reusability of intermediate data.
Real-time and offline data development are disconnected from each other.
Long lead time from data production to serving.
Real-time tables and tasks are cluttered and hard to manage.
Real-time lineage, basic information, monitoring, etc. are missing.
No tooling for real-time data quality monitoring.
Real-time tasks have a high O&M barrier and a weak quality assurance system.
Such typical problems put great pressure on our efficiency, quality, and management, and urgently call for a systematic platform to solve them.
2. Business objectives
Focusing on the known business pain points, and relying on the company's existing computing resources, storage resources, and offline data warehouse standards and specifications, our goal is to build a system that addresses efficiency, quality, and management, as shown below:
1.Human efficiency level
Standardize data development solutions, such as standardized data processing, compatibility, and integrated computing power.
Minute-level data service deployment, with visual operations for data API registration, release, and debugging that BI analysts can perform on their own.
2.Quality level
Data content DQC, such as whether the content is correct, complete, and timely, and whether it is consistent with offline data.
Data task alerting, such as whether there is delay or backpressure, how throughput looks, and whether system resources are sufficient.
3.Management level
A visual management platform, covering end-to-end lineage, table and task information, quality coverage, and other basic information.
End-to-end standardization of the integrated data warehouse, such as data modeling specifications, data quality specifications, data governance specifications, and storage selection specifications.
3. Project architecture
The project architecture is shown below. The system mainly includes raw data, data development, data service, data quality, and data management modules, providing second-level real-time data processing and minute-level data service deployment for real-time data developers and back-end data service developers.
Data from different sources is first standardized by the ETL component and preprocessed by the traffic splitting tool; the stream-batch fusion tool and the business data processing module then build the layered subject domains; the produced data is deployed as data APIs by the data service module and finally consumed by business applications. The entire link is backed by a corresponding quality and O&M guarantee system.
4. Project construction
1.Data development
This module mainly includes data preprocessing tools and data development solution selection.
1) Traffic splitting tool
Because there are many traffic entry points and the volume is large, the main problems are as follows:
Data for the same dimension may need to be parsed in multiple different ways;
The tracking (buried-point) data actually used accounts for only about 20% of the total; consuming the full stream seriously wastes resources, and every downstream job repeats the work;
When a new tracking event is added, data processing requires development intervention (in extreme cases, every downstream consumer is affected);
As shown in the figure below, the traffic splitting tool dynamically accesses multiple data sources, performs light processing, and writes only the effective, standardized data downstream, which solves the above problems.
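To make the idea concrete, the sketch below shows the kind of Flink SQL a splitting job might run; the topic, field, and event-type names (raw_traffic, dwd_order_events, order_created) are hypothetical and not the team's actual configuration.

```sql
-- Hypothetical topics and fields; not the team's actual DDL.
CREATE TABLE raw_traffic (
  event_type STRING,
  user_id    STRING,
  event_time TIMESTAMP(3),
  payload    STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'raw_traffic',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json',
  'scan.startup.mode' = 'latest-offset'
);

CREATE TABLE dwd_order_events (
  user_id    STRING,
  event_time TIMESTAMP(3),
  payload    STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'dwd_order_events',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json'
);

-- Keep only the events downstream jobs actually need (roughly 20% of the
-- full stream), already standardized into one schema, so consumers no
-- longer each re-parse the raw traffic.
INSERT INTO dwd_order_events
SELECT user_id, event_time, payload
FROM raw_traffic
WHERE event_type = 'order_created';
```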
2) Evolution of business data processing solutions
Option 1 - Simple fusion of offline and real-time data
Background
At the beginning, business requirements were relatively simple, such as computing a user's historical order volume in real time, or aggregating the attractions a user has purchased historically. Such requirements can be abstracted as simple aggregations of offline and real-time data: numeric addition, subtraction, multiplication, and division, string appending, deduplication, and summarization.
Solution
As shown in the figure below: the data provider exposes standardized T+1 and real-time data access; data processing covers T+1 and real-time data fusion, consistency checks, dynamic rule engine processing, etc.; data storage covers horizontal expansion of aggregated data, tag mapping, etc.
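Conceptually, the fusion is just "yesterday's offline aggregate plus today's real-time increment". The sketch below expresses that idea in plain SQL with hypothetical table names (dw_user_order_t1, rt_user_order_delta); option 1 itself was implemented with a Java rule engine rather than SQL.

```sql
-- Hypothetical tables: dw_user_order_t1 holds the T+1 offline aggregate,
-- rt_user_order_delta holds today's real-time increment per user.
SELECT
  COALESCE(t1.user_id, rt.user_id)                      AS user_id,
  COALESCE(t1.order_cnt, 0) + COALESCE(rt.order_cnt, 0) AS total_order_cnt
FROM dw_user_order_t1 AS t1
FULL OUTER JOIN rt_user_order_delta AS rt
  ON t1.user_id = rt.user_id;
```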
Option 2 - SQL support
Background
Option 1 has the following advantages:
Layering is simple, with good timeliness.
Rule configuration is fast, and it can host a large number of complex UDFs, rule engines, etc.
Compatible with the entire Java ecosystem.
But there are also significant disadvantages:
BI SQL developers basically cannot participate and must rely on engineers.
In many SQL-style scenarios, developing in Java is costly and less stable.
There is no effective data layering.
Intermediate data is basically unavailable; persisting it requires repeating the computation, which wastes computing resources.
Solution
As shown in the following figure, Kafka carries the data layering, Flink SQL serves as the compute engine, and OLAP carries data storage and layered queries, completing a typical layered data warehouse.
However, since Kafka and the OLAP storage engine are two separate systems, data inconsistencies can arise: for example, Kafka is healthy while the database is not, so an intermediate layer is abnormal even though the final result looks normal. To mitigate this, as shown in the following figure, we borrow the binlog pattern of traditional databases: the Kafka data strongly depends on the data changes in the DB, so the final result strongly depends on the intermediate layered results. Inconsistencies caused by component bugs still cannot be fully avoided, but most scenarios are basically usable.
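A minimal sketch of this layering, with hypothetical topic and table names: the DWD layer lives in Kafka, and the same DWS aggregate is written both back to Kafka (for further layering) and to an OLAP store (for querying), which is exactly where the two copies can drift apart.

```sql
-- Hypothetical topics and tables. The DWD layer lives in Kafka; the same
-- DWS aggregate is written both back to Kafka and to an OLAP store.
CREATE TABLE dwd_orders (
  user_id    STRING,
  order_id   STRING,
  amount     DECIMAL(10, 2),
  event_time TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'dwd_orders',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json'
);

CREATE TABLE dws_user_order_kafka (
  user_id   STRING,
  order_cnt BIGINT,
  gmv       DECIMAL(18, 2),
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'dws_user_order',
  'properties.bootstrap.servers' = 'kafka:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);

CREATE TABLE dws_user_order_olap (
  user_id   STRING,
  order_cnt BIGINT,
  gmv       DECIMAL(18, 2),
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://olap-host:3306/dw',
  'table-name' = 'dws_user_order'
);

-- One job, two sinks: if either write path fails independently, the Kafka
-- copy and the OLAP copy can diverge, which is the inconsistency above.
EXECUTE STATEMENT SET
BEGIN
  INSERT INTO dws_user_order_kafka
  SELECT user_id, COUNT(*) AS order_cnt, CAST(SUM(amount) AS DECIMAL(18, 2)) AS gmv
  FROM dwd_orders
  GROUP BY user_id;

  INSERT INTO dws_user_order_olap
  SELECT user_id, COUNT(*) AS order_cnt, CAST(SUM(amount) AS DECIMAL(18, 2)) AS gmv
  FROM dwd_orders
  GROUP BY user_id;
END;
```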
Option 3
Background
Option 2 has the following advantages:
SQL-based development; natural, queryable layering.
But there are also significant disadvantages:
Data inconsistencies.
Binlog handles inserts fine, but updates and deletes are harder: updates require a lot of deduplication, which makes the SQL very unfriendly.
For long-window aggregations, operators such as max and min build up large Flink state and are prone to instability.
Out-of-order Kafka data can also cause data to be overwritten incorrectly.
Solution
As shown in the following figure, we borrow the computing power of the storage engine: the binlog data in Kafka is used only as the trigger for computation, and a Flink UDF queries the DB directly.
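A rough sketch of option 3 with hypothetical names: the binlog stream in Kafka only triggers the computation, while the aggregation over history is pushed down to the DB through a UDF (the Java class com.example.udf.QueryUserOrderCnt is assumed for illustration, not an actual implementation).

```sql
-- Hypothetical names. The binlog stream only triggers the calculation;
-- the aggregation over history is pushed down to the DB via a UDF.
CREATE TEMPORARY FUNCTION query_user_order_cnt
  AS 'com.example.udf.QueryUserOrderCnt';  -- assumed Java ScalarFunction that
                                           -- runs "SELECT COUNT(*) ..." on the DB

-- Order-table binlog, already shipped to Kafka.
CREATE TABLE order_binlog (
  user_id STRING,
  op_type STRING    -- insert / update / delete
) WITH (
  'connector' = 'kafka',
  'topic' = 'order_binlog',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json'
);

-- Result table in the OLAP store.
CREATE TABLE dws_user_order_cnt (
  user_id   STRING,
  order_cnt BIGINT,
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://olap-host:3306/dw',
  'table-name' = 'dws_user_order_cnt'
);

-- Each binlog record simply triggers a direct DB lookup, so Flink holds
-- almost no state and the result always reflects what is in the DB.
INSERT INTO dws_user_order_cnt
SELECT user_id, query_user_order_cnt(user_id) AS order_cnt
FROM order_binlog;
```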
Advantages:
SQL-based development; natural, queryable layering.
Data is consistent, and Flink state is small.
It can support long-term persistent data aggregation.
You don't need to worry about problems caused by out-of-order binlog and updates.
Disadvantages:
Throughput is limited and strongly depends on the performance of the OLAP engine, so we rate-limit with windows at the source or scale the DB horizontally.
The link between the sink and the retract stream is broken; for example, a group by effectively becomes a blind upsert, and UDF-based aggregation cannot fully replace Flink's native aggregation.
Each option has its own applicable scenarios, and the choice depends on the business scenario and latency requirements. At present, 86% of our scenarios can be handled with option 3, and with the various integration features introduced in Flink 1.16, it should eventually cover almost all scenarios.
2.Data services
This module provides data synchronization, storage, query, and service capabilities. In simple scenarios it achieves minute-level data service deployment, saving 90% of development man-hours. It enforces a strong dependency on offline data DQC, handles DQC exceptions on the engineering side, and provides resource isolation, rate limiting and circuit breaking at the client and interface level, plus full-link lineage management (client → server → table → Hive). Interfaces with different performance requirements can be deployed on demand, with corresponding O&M guarantees.
The architecture is as follows:
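For illustration only: on such a platform, deploying a data service can amount to little more than registering a parameterized query against the serving store. The template below uses hypothetical table and parameter names and is not the platform's actual registration format.

```sql
-- Hypothetical query template registered as a data API; :user_id is a
-- request parameter bound by the service layer at call time, with rate
-- limiting and circuit breaking applied around the generated interface.
SELECT user_id,
       order_cnt,
       gmv
FROM dws_user_order
WHERE user_id = :user_id;
```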
3.Data quality
This module is mainly divided into data content quality and data task quality.
1) Data content
Correctness, timeliness, stability
This part covers data operation changes, data content consistency, data read consistency, data correctness and timeliness, etc. As shown in the following figure, when data is abnormal, it is fed into the company's HickWall alarm center and alerts are generated according to the configured alerting rules. For data content, scheduled tasks execute user-defined SQL checks and write the results to the alarm platform, achieving second-level and minute-level alerting.
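As an illustration, a user-defined content check might look like the sketch below; the table name, time column, and 10-minute window are hypothetical, and the exact interval syntax varies slightly by storage engine.

```sql
-- Hypothetical timeliness/volume check run by the scheduler every few
-- minutes; the result is written to the alarm platform, which fires an
-- alert if the recent record count drops below the configured threshold.
SELECT COUNT(*) AS recent_cnt
FROM dws_user_order
WHERE update_time >= NOW() - INTERVAL '10' MINUTE;
```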
Read consistency
As shown in the following figure, when a read involves a join across tables and one of the tables has a problem, in most cases we do not show the wrong data; only the historically correct data is displayed, and the full data is shown again once the table recovers.
For example, if Table 2 is abnormal and has no data for the last 2 hours, then when the data is shown to users, the business displays only the data from more than 2 hours ago, and the front end shows a notice for the abnormal range.
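One way to express this fallback in SQL, with hypothetical tables t1/t2 that share an hourly time column: only read up to the latest hour that both tables have, so a lagging table never surfaces wrong or partial rows.

```sql
-- Hypothetical schema: t1 / t2 share an hourly stat_hour column. Join only
-- up to the common watermark of the two tables; rows beyond it stay hidden
-- until the lagging table catches up, then the full range is shown again.
WITH common_watermark AS (
  SELECT LEAST(
           (SELECT MAX(stat_hour) FROM t1),
           (SELECT MAX(stat_hour) FROM t2)
         ) AS max_hour
)
SELECT t1.stat_hour, t1.metric_a, t2.metric_b
FROM t1
JOIN t2 ON t1.stat_hour = t2.stat_hour
CROSS JOIN common_watermark cw
WHERE t1.stat_hour <= cw.max_hour;
```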
Consistency
This concerns consistency between offline and real-time data. As shown in the figure below, we take a simple approach: real-time data is synchronized directly into Hudi, and the Hudi data is compared with the offline data, with differences sent to the alarm platform.
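A sketch of the kind of comparison query that might run on top of Hudi, with hypothetical table names (rt_orders_hudi, ods_orders_hive) and an illustrative threshold; rows that breach it are pushed to the alarm platform.

```sql
-- Hypothetical hourly comparison between the real-time (Hudi) table and the
-- offline (Hive) table; rows breaching the threshold are pushed to alarms.
SELECT h.stat_hour,
       h.rt_cnt,
       o.off_cnt,
       ABS(h.rt_cnt - o.off_cnt) AS diff
FROM (SELECT stat_hour, COUNT(*) AS rt_cnt
      FROM rt_orders_hudi GROUP BY stat_hour) h
JOIN (SELECT stat_hour, COUNT(*) AS off_cnt
      FROM ods_orders_hive GROUP BY stat_hour) o
  ON h.stat_hour = o.stat_hour
WHERE ABS(h.rt_cnt - o.off_cnt) > 100;   -- threshold is illustrative
```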
2) Data tasks
Upstream tasks
Relying on the company's custom alerting instrumentation, alarm platform, computing platform, and other tools, key indicators such as whether the upstream message queue is delayed or its volume is abnormal can be monitored and alerted on.
Current task
Key indicators of data processing tasks, such as throughput, latency, backpressure, and resource usage, are monitored and alerted on to prevent tasks from staying abnormal for long periods.
4.Data management
This module links the data processing, quality, and other modules together and provides a visual management platform covering table lineage, DQC configuration, task status, monitoring, and other basic information.
The following figure shows the lineage between each data table and its upstream and downstream production tasks.
The following figure shows the quality details of a data table.
The following figure summarizes the basic information of various UDFs and tables.
5. Prospects
At present, the system can handle most of the team's data development needs. Going forward we will keep exploring reliability, stability, and ease of use, such as improving the overall data governance system, building automatic data recovery tools, adding intelligent components for O&M troubleshooting, and exploring the integration of serving and analytics.
Author: chengrui, Ctrip Technology (ID: CtripTech)