System storage architecture upgrade sharing

Mondo Technology Updated on 2024-01-31

System business functionsData processing and integration are carried out within the system, and the initialization (writing) of result data and query data result services are provided to external systems.

System network architecture

The impact of the deployment architecture on the on-line of the cut-in - The distributed cache can be expanded separately without any impact on the read business of other systems, which has nothing to do with the upgrade of storage and query functions, and through the isolation of the cache layer, the external system can remain unchanged during the system expansion, and only the internal management system is upgradedLegend of the overall implementation plan:

Full product channelization - cutting plan: (the total amount is 10 times the current one):

For now:

The current database has more commonly used tables5000w, some of the results express 6000W, which has reached the peak capacity of the MySQL database table, and cannot be supported for full cuts

Objectives:

Highest support900 million: According to the cutting plan, the system is about 6700 million, retaining the redundancy of 1 4, takebillion;Rounded up to 900 million, this value has a large amount of redundancy, which can meet the data support in the next five years.

Time Target: Early August Program Set, August 17 822 On-line and Verification, 824 cut volume plan begins.

Current deployment structure.

Data center distribution, mysql: 1 master and 4 slaves (computer room a 1 master, 3 slaves;Room B read-only from).

Data center distribution, doris: 32C, 63 nodes, 3 replicas.

The number of application containers (Docker) and the maximum number of DB connections.

Number of application containers: 62 (web group: 25, worker group: 31, mq group: 6).

The maximum number of DB connections is 100 (configured per container).

Whether the current service is read/write splitting and the read/write ratio.

No read/write splitting.

Can master-slave latency be tolerated in each business scenario?What is the tolerable delay.

At present, most of the modification operations of business personnel are synchronous operations, and the operation results are returned to the front-end after the modification is completed, and the delay cannot be tolerated from the perspective of business side operations + query results.

In the background task scenario, master-slave latency can be tolerated for intermediate data processing.

At the product level, when the bottleneck pressure occurs in the system, does it accept the current limit?Do you accept deferred data display?

The external service interface is not involved in the development of this time, and the service interface will not be affectedThe business page has a low number of visits and can accept a short period of delay.

Whether the team has experience in using ES.

Partially understood, not used in the project.

Use a generic inventory framework to comprehensively sort out the current state of the system.

Information such as the space in the table, business scenarios, etc. (partial).

System features: high concurrent write and complex read of a single table.

Conclusion: Internally distributed DB: Expand from single sharding to multi-sharding to solve massive data storage and simple queries.

ES: Newly introduced, implements complex queries (word segmentation queries) and global sorting.

redis: Reserved, needs to be expanded.

doris: reserved, capacity increased.

Complex query (reason: There is a multi-table association scenario in front-end business access (two tens of millions of table related queries), and as the table capacity increases, the performance of the associated query decreases, which can no longer meet the efficient requirements of the business).

Complex query decision factors:

Solution description: Use the DRC platform to configure the quasi-real-time data synchronization from distributed DB to ES (Note: DRC is a general data synchronization platform within the company, which can synchronize data between multiple data sources).

Advantages:: Simple and disordered ** developmentDisadvantages: There may be data inconsistencies in scenarios where services are checked immediately after they are written.

Solution description: Dual-write distributed DB and ES to ensure data consistency.

Advantages:: Ensure the consistency of data write-to-read scenariosDisadvantages:* * High development costs.

First, select A-quasi-real-time synchronization solution, > online verification to meet the service operation experience, >, and then select whether to implement the B-dual-write strong consistency solution.

Problem: In the scenario of joint query of two tables, you cannot directly use the DRC platform for synchronization, and you need to develop a corresponding synchronization module JAR package, embed the DRC task, or abandon the use of DRC and directly use ** synchronization, which has the problem of long development time.

The ES index occupies a large amount of space, the number of redundant records is large, and the query results need to be reloaded, making the query complex.

Difficulty: The flow table and the process detail node table involve joint query, and both tables have operations of adding, deleting, and modifying a single tableAs a result, the data model synchronized to ES is complex and difficult to synchronize.

Solution: Add redundant fields to the database table, and the redundant fields are dedicated to ES queries

Add the fields of people to be reviewed and those who have been reviewed in the flow table of DB, the values of the fields are separated by spaces, and the word segmentation function of ES is used, and ES can directly use the DRC tool to directly synchronize the data of this table, reducing the development time of synchronization.

Solution cost: Added Modify the new fields in the process table synchronously when modifying the process detailsDevelop a tool to refresh historical data.

1) Added a database sharding field to the business table.

Some business tables lack database sharding fields and cannot be directly sharded. Add SKU sharding fields to business tables and add SKU conditions to existing logical modifications to improve query efficiency

2) The redundant fields of ES-related queries have been added (brush data).

1) Complete the initialization of the distributed DB sharding library + ES

2) Configure DRC to synchronize the full + incremental data from the original single database to the distributed DB sharded database

3) Configure DRC to synchronize the full + incremental data from the distributed DB shard database to ES

4) Through the verification tool, regularly compare the data consistency between distributed DB monoliths, distributed DB shards and ES.

1) Added AOP slice, which uses DUCC configuration (ERP whitelist, full read, result comparison and other dimensions configuration) to gradually switch read requests to the new application cluster.

2) After the product and business side complete the verification, switch all read traffic to the new application cluster (Note: the new application cluster uses a database read-only account).

1) Inform the business party and upstream and downstream systems before going live, and inform the time period and estimated time of the launch to reduce business impact.

2) Add a static page to remind users that the system is unavailable during system upgrade, and switch the front-end domain name to the static page to avoid user operation.

3) Stop the original system grouping to ensure that the original single database no longer has write traffic, and coordinate with DBA to prohibit write to the original database (close the worker, suspend MQ consumption).

4) After waiting and ensuring that all the data in the original database is synchronized to the destination database, verify the data consistency of the old and new databases again by manual + automatic mode.

5) Switch the new system group to read/write account for deployment.

7) R&D and testing personnel use test products for functional verification of the new system grouping function, and hand it over to the business personnel for verification after there is no problem (switch the static operation and maintenance page).

8) Start the worker and connect to MQ

The system runs normally after going online, 823 Commodities have been carried forward to date 2600 million;At present, the system supports commodity field dimension data 31.6 billion;The maximum db table data is 28.4 billion;ES data 4356W;

Before and after: erp:xxx;This ERP account data is 29w 9s for the original query and 1s for the new query

A comprehensive and clear inventory of the current state of the system: reduce complexity and improve quality.

Clear rollout plan: guide personnel to divide labor reasonably, shorten the rollout time, and reduce the difficulty of rollout.

At present, the distributed DB distributed transaction support is relatively weak, and it is impossible to ensure the correctness of multiple records modified in a transaction when crossing databases.

When the product data under the name of the business person is million, the query time is still long, and the query performance will continue to be optimized.

Author: Jingdong Retail Wang Kai.

*:JD Cloud Developer Community **Please indicate**.

Related Pages