How we can help banks build an open big data source layer

Mondo Technology Updated on 2024-01-30

Background of the case

The client is a state-controlled city commercial bank in Hebei Province with independent legal-person status. Its branch network covers the city and surrounding counties, and with roughly 100 billion RMB in assets it ranks among the top city commercial banks in the province.

With the advance of "digital banking", mining the intrinsic value of data has become the driving force of financial business innovation, so most banks, including large state-owned banks, joint-stock banks, and leading city commercial banks, are researching, planning, and building big data systems. The bank launched its own big data platform in 2021, which greatly improved its data processing and data analysis capabilities.

With the launch of the customer's new-generation core and credit systems, demand for data processing and analysis on the big data platform kept expanding. The daily increment of source data exceeded 50 GB, putting ever-growing storage pressure on the platform, and the existing big data cluster's storage was nearly 80% occupied. With the storage and workload of the existing platform already stretched to the limit, the bank initiated a project to build a new big data source layer.

Requirements analysis

The bank planned to build on the architecture of its existing big data platform to improve the application and management of full-volume data at the source layer, establish a complete and systematic process for identifying, analyzing, and serving data demands at that layer, and supply data on demand to avoid the risk of overloading system services.

The new module is mainly designed to meet the bank's needs for archiving, storing, analyzing, and querying historical data. By building a distributed computing and storage system, the bank can better analyze and mine its full-volume historical source data (structured, semi-structured, and unstructured) while effectively reducing the cost of data storage and computing.

Solution design

Based on the requirements and the current situation, the solution architecture is logically divided into four layers: data sources, the data platform, data applications, and the front-end portal. Data is synchronized from the main business systems, such as the approval, collections, and core systems, to the big data platform. The platform integrates the data and supplies it to the data warehouse, relational data marts, and data mining marts; some special applications are served directly by the big data platform, while the data mining tools, cubes, management cockpit, and reporting platform are served by the warehouse and marts. The front-end portal supports data mining, multi-dimensional analysis, report queries, and ad hoc queries for users inside and outside the bank, as well as certain specialized analytics applications.

Based on the existing architecture of the big data platform, a complete and systematic big data source layer is built alongside it as a bypass. The source layer takes over the platform's full historical data (data stored for more than 5 years), as well as all data from systems taken offline in the future. It not only supports the data warehouse and marts, but also provides hot and cold detailed-data query and application on big data storage, such as voucher detail queries over more than 10 million records.

Given that the storage and compute of the existing big data platform were already at capacity, the new source layer adopts storage-compute separation and open data storage. This allows the source layer to be expanded on demand and lets it support the many applications of the Hadoop ecosystem, all without affecting the original platform.
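Open data storage is what lets other engines share the same files. As a minimal sketch of the idea, assuming the source layer keeps table data in an open columnar format such as Parquet on HDFS (the namenode host and table path below are hypothetical), a Hadoop-ecosystem tool can read the data directly without going through the database engine:

```python
# Read the source layer's open-format files directly from HDFS with pyarrow,
# bypassing the database engine entirely. Host and path are hypothetical.
import pyarrow.parquet as pq
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.bank.local", port=8020)

# Any Hadoop-ecosystem tool can scan the same files, e.g. for a feature
# export or an ad hoc analysis outside the database.
table = pq.read_table("/data/source_layer/core/vouchers/", filesystem=hdfs)
print(table.num_rows, table.schema)
```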

Testing and selection

Starting from the current situation and requirements, the customer thoroughly investigated and verified Oushu Technology's products and solutions, as evidenced by their successful passage of the bank's POC process in terms of functionality, performance, compatibility, and other aspects.

1. Functional test

The product passed 17 functional tests covering data types, view management, indexing, temporary table creation, DML, mainstream function coverage, multiple storage types, random data distribution, transactions, multi-language UDFs, cloud-native features, and more.

2. Performance test

In the TPC-H test, reads and writes of 100 GB of data saturated the gigabit bandwidth, and the product handled ad hoc queries over massive data as well as inserts, deletes, and updates across more than 600 million rows.

3. Compatibility test

The product is compatible with a variety of interfaces and tools, including JDBC and ODBC interfaces, third-party operations tools, BI tools, and scheduling tools.
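As an illustration of the kind of connectivity this test covers, here is a sketch assuming OushuDB's PostgreSQL-compatible interface (it descends from Apache HAWQ), with a standard driver standing in for the vendor JDBC/ODBC drivers; the host, database, and credentials are hypothetical:

```python
# Connectivity check over a PostgreSQL-compatible interface; the host,
# database, and credentials are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="oushudb-master.bank.local",
    port=5432,
    dbname="source_layer",
    user="bi_reader",
    password="secret",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```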

4. Scenario test

In the same scenarios as the existing big data platform and data warehouse, the product ran the same batch jobs, demonstrating extreme throughput and fast response on massive data.

Based on this test performance, Oushu Technology's products won unanimous recognition from the bank's leadership and became the bank's choice for building the big data source layer.

Under the premise of compatibility with the bank's existing technical architecture, the new source layer was built with OushuDB at its core. Data access supports both ETL and message queues, providing scalable and configurable access paths and supporting multiple data sources (including but not limited to text files and relational databases).
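A minimal sketch of the message-queue access path, assuming change records arrive on a Kafka topic; the topic, target table, hosts, and batch size below are hypothetical and would be tuned per system:

```python
# Consume change records from a Kafka topic and load them into the source
# layer in batches; topic, table, and hosts are hypothetical.
import json

import psycopg2
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "core-system-cdc",
    bootstrap_servers="kafka.bank.local:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
conn = psycopg2.connect(host="oushudb-master.bank.local", dbname="source_layer")

batch = []
for msg in consumer:
    record = msg.value
    batch.append((record["account_id"], record["amount"], record["ts"]))
    if len(batch) >= 1000:  # flush in batches, which suits an analytic store
        with conn, conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO ods_core_txn (account_id, amount, ts) "
                "VALUES (%s, %s, %s)",
                batch,
            )
        batch.clear()
```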

The layer supports dynamic expansion of resources and services; data interfaces can be quickly built to user needs for high-speed delivery; and horizontal linear scaling, cluster deployment, and system monitoring are all supported.

It provides external service interfaces, supports integration within the bank, and offers data query interfaces to upstream and downstream systems in single, batch, and asynchronous forms, for example concurrent transaction-flow queries from the counter system and concurrent report queries from the reporting system.
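A sketch of those three query forms, using a thread-safe connection pool so concurrent counter and report queries do not block one another; the pool settings, SQL, and function names are illustrative, not the bank's actual interface:

```python
# Single, batch, and asynchronous query forms over a connection pool;
# all names and SQL are illustrative.
from concurrent.futures import ThreadPoolExecutor

from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(
    2, 16, host="oushudb-master.bank.local", dbname="source_layer"
)
executor = ThreadPoolExecutor(max_workers=8)

def _run(sql, params):
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        pool.putconn(conn)

def query_one(voucher_id):
    """Single query, e.g. one voucher detail lookup from the counter system."""
    return _run("SELECT * FROM ods_vouchers WHERE voucher_id = %s", (voucher_id,))

def query_batch(voucher_ids):
    """Batch query: many ids in one round trip, e.g. for a report page."""
    return _run(
        "SELECT * FROM ods_vouchers WHERE voucher_id = ANY(%s)",
        (list(voucher_ids),),
    )

def query_async(voucher_id):
    """Asynchronous query: return a future so the caller is not blocked."""
    return executor.submit(query_one, voucher_id)
```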

Project implementation

The implementation of the project in the bank is divided into several parts: migration, access, application, scheduling, and analysis.

1. Migrate historical data of the offline system

The data of systems that had already been taken offline was migrated from the big data platform's Hive database to the new source layer for verification and archiving.

2. Business source system data access

For data storage and query, a storage strategy for data files was designed, and business system data was loaded into the database in layers to ensure data integrity.

A data lifecycle management solution was designed for the business systems' status tables and flow tables based on their data usage scenarios.

Based on the lifecycle management plan, an in-database data cleanup policy was formulated, while the underlying data files are stored permanently. A sketch of such a policy follows.
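This is a minimal sketch of a cleanup policy of that kind, assuming daily range partitions and Greenplum-style partition DDL (the lineage OushuDB descends from); the table names, retention windows, and partition naming are all hypothetical:

```python
# Drop partitions that have aged out of the online retention window; the
# underlying files remain permanently archived. Names are hypothetical.
from datetime import date, timedelta

import psycopg2

RETENTION_DAYS = {
    "ods_txn_flow": 5 * 365,     # flow table: five years online
    "ods_acct_status": 3 * 365,  # status table: shorter online window
}

conn = psycopg2.connect(host="oushudb-master.bank.local", dbname="source_layer")
with conn, conn.cursor() as cur:
    for table, days in RETENTION_DAYS.items():
        cutoff = date.today() - timedelta(days=days)
        # Assumes daily range partitions named pYYYYMMDD in the table DDL.
        cur.execute(f"ALTER TABLE {table} DROP PARTITION IF EXISTS p{cutoff:%Y%m%d}")
```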

In the first phase, data access was completed for 7 systems, including the core system, credit management system, Internet credit system, channel integration platform, wealth management and sales system, customer information management system (ECIF), and general ledger system, plus 2 newly built systems (corporate online banking and the new electronic ticket system), totaling thousands of tables. In the second phase, more than 50 business systems were connected, adding more than 900 tables.

3. Historical data query platform support

The new layer was integrated with the bank's existing historical data query platform, so that general ledger accounting voucher queries and detailed account page queries run through that platform.

4. Unified scheduling platform docking

Integration with the bank's existing scheduling platform was completed across batch running, monitoring, and statistics, achieving unified scheduling and standardized management; a sketch of the monitoring and statistics records follows the list below.

Task running batch

Batch tasks were added and modified on the scheduling platform, and their trigger modes were configured.

Task monitoring

Batch tasks are monitored, with the start time, end time, total number of tasks, number of successes, and number of failures recorded.

Task statistics

Statistics are collected on historical run times, such as the average run time.
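As a sketch of the monitoring and statistics records described above, each batch run can be logged with its start time, end time, and status, with averages derived from the history; the schema, table names, and SQL are illustrative rather than the scheduling platform's actual design:

```python
# Log one row per batch task run, then roll up totals, successes,
# failures, and average duration per task. Schema is illustrative.
import psycopg2

conn = psycopg2.connect(host="oushudb-master.bank.local", dbname="source_layer")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS sched_task_run (
            task_name  text,
            started_at timestamp,
            ended_at   timestamp,
            status     text           -- 'success' or 'failed'
        )
    """)
    cur.execute("""
        SELECT task_name,
               count(*)                                            AS total_runs,
               sum(CASE WHEN status = 'success' THEN 1 ELSE 0 END) AS successes,
               sum(CASE WHEN status = 'failed'  THEN 1 ELSE 0 END) AS failures,
               avg(ended_at - started_at)                          AS avg_duration
        FROM sched_task_run
        GROUP BY task_name
    """)
    for row in cur.fetchall():
        print(row)
```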

5. Data processing log analysis

Through analysis of data loading durations, batch run durations, and data query behavior, the entire migration, access, and scheduling process was tracked and verified.
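As an illustration of the loading-duration analysis, a short script of this kind can surface the slowest tables from a hypothetical per-table load log extract:

```python
# Aggregate per-table load durations from a hypothetical log extract
# (columns: table_name, load_date, seconds) and list the slowest tables.
import pandas as pd

logs = pd.read_csv("load_log.csv")
stats = (
    logs.groupby("table_name")["seconds"]
        .agg(["count", "mean", "max"])
        .sort_values("mean", ascending=False)
)
print(stats.head(10))  # the ten slowest tables on average
```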

In addition, we investigated the bank's important internal business systems, including anti-money laundering, post-event supervision, reconciliation, funds transfer pricing, accounting and bookkeeping, third-party reconciliation, and overseas transaction data reporting, and confirmed that none of them were affected at any point during project implementation.

Construction results

The source layer based on OushuDB fundamentally relieves the storage and service pressure facing the bank's big data platform, and the new source layer standardizes data storage and application flows, supplementing and improving the bank's data architecture. At the strategic level, the project formed and consolidated the following important capabilities:

1. The ability to apply open source-layer data

Based on open storage, the new source layer natively supports the many applications of the Hadoop ecosystem without depending on or affecting the original platform. It helps the bank build cutting-edge data applications such as data mining, real-time analysis, and machine learning on full historical source data.

2. Build foundational lakehouse capabilities

After the big data platform's historical source data was migrated, source data from key business systems is now managed in parallel with the big data platform, forming a data platform base that fully supports both current and historical data. It has begun to take over data applications from the data warehouse and big data platform, such as reporting applications, which shows that the OushuDB-based source layer can accommodate both lake and warehouse storage and applications, and can evolve into an integrated lakehouse architecture in the future.

3. Form integrated innovation capabilities on domestic Xinchuang (IT innovation) platforms

On top of its existing domestic big data platform, the bank introduced further domestic Xinchuang products in relational databases (such as OushuDB), data analysis software, and data application software, achieving shared technical achievements and complementary strengths and removing uncertainty from the bank's data foundation.

4. Prepare in advance for data assets as production factors

Using the new data software, the bank's source data resources were comprehensively inventoried and sorted, and standard data, basic data, integrated data, derived data, and data products were brought into the scope of data asset management, forming an enterprise-level unified data asset catalog and advancing the treatment of data assets as production factors.
