Centralized or distributed? Do we really need a distributed database?

Mondo Technology Updated on 2024-01-31

About the Author:

Wang Hui, database expert at a large joint-stock bank. He has many years of experience in the development, operation, and maintenance management of bank databases, is proficient in many types of databases, and is mainly responsible for the research, testing, selection, and promotion of domestic distributed databases. He has in-depth research and hands-on experience with distributed databases and has spoken on domestic distributed database topics at several technical summits.

Outline

1. Analysis of current usage

2. Is distributed really necessary?

3. When to use distributed

4. How to make good use of distributed

5. In-depth analysis: is distributed a database-level solution or an application-level solution?

6. Summary

Whether an OLTP business system should adopt a centralized or a distributed database is a question that comes up again and again in the domestic database transformation, whether for the evolution of the technical architecture or to support the long-term development of existing business. With distribution so much in vogue, it can seem as though every architecture needs to be empowered by it. Is that really the case? The following is a comprehensive analysis.

1. Analysis of current usage

By 2022 there were already more than 200 domestic database vendors. Traditional centralized databases are mainly Jincang and Dameng, alongside emerging databases such as PolarDB and distributed databases such as GaussDB, KingWow, and TDSQL. In fact, most of these products offer both centralized and distributed deployment modes; in other words, the license you pay for as a distributed database can usually also be deployed in centralized mode to cover different business needs.

It is worth noting that with some distributed databases, even in a centralized deployment the application must still connect to a compute node (CN), which in turn connects to the data nodes behind it. This is likely done to keep the architecture unified, and because the CN can detect an automatic primary/standby switchover and keep it transparent to the application. However, it quietly adds an extra parsing layer, which causes some loss of performance. Other vendors let the application connect to the database directly through their own JDBC/ODBC driver or a VIP, avoiding this problem.

From the perspective of technical architecture, the databases used in the financial industry are still mainly centralized, with distributed databases forming a strong supplement at medium and large financial institutions. According to survey data in the "Financial Industry Database **Chain Security Development Report (2022)", centralized databases still account for 89% of the financial industry overall, about 80% in banking and more than 90% in * and insurance, and they continue to play an important role in the digitalization of financial technology. Distributed databases account for 7% of the financial industry overall and more than 17% of banking, while the * and insurance industries remain relatively low. In other words, for most business a centralized database is entirely sufficient.

2. Is distributed really necessary?

Because there is only one primary data node, a centralized database has the advantages of a simple architecture, easy operation and maintenance, good compatibility, and high cost performance.

However, there are also problems such as the inability to break through the hardware limitations of a single machine, the inability to expand horizontally, and the bottlenecks of performance and capacity.

So when a centralized database cannot meet our performance or capacity requirements, distribution gives us a good technical means. Before choosing distributed to solve the problems of a centralized database, it is recommended to ask the following questions:

Can the problem be solved by optimizing the centralized database itself, without major architectural changes, for example by tuning parameters, SQL statements, or business logic?

Can the problem be solved by increasing host resources, for example by adding CPU and memory, or by replacing a virtual machine with a physical machine?

Can the problem be solved by separating storage and compute? If a single machine's capacity is insufficient, external storage or a storage-compute separation architecture can work around the limit of local disk capacity.

Can the problem be solved at the application layer? For example, by changing the business architecture to microservices or a unitized architecture, that is, implementing data splitting, distributed transactions, and horizontal scaling in the application while the database stays centralized. This approach demands a lot from developers and carries high business transformation costs, so it must be weighed carefully.

Are the advantages and disadvantages of a distributed architecture fully understood? Are you prepared for the O&M and backup of a distributed database, and have you fully considered whether your business really must be solved with one?

3. When to use distributed

In the early days, the rule of thumb, mainly for MySQL, was that tables over 20 million (2000w) rows should be split. The reasoning: once an OLTP table exceeds roughly 20 million rows, the B+Tree grows from three levels to four, adding an extra I/O per lookup. However, with hardware upgrades and caching, that extra I/O can now basically be ignored.
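
The "20 million rows" figure can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes typical InnoDB numbers (16 KB pages, a bigint primary key, roughly 1 KB per row); these assumptions are illustrative, not from the article, and the exact threshold varies with your schema.

```python
# Rough estimate of how many rows fit in a 3-level B+Tree.
# Assumptions: 16 KB pages, 8-byte bigint key + 6-byte child pointer
# per internal entry, ~1 KB average row length on leaf pages.
PAGE_SIZE = 16 * 1024
INTERNAL_ENTRY = 8 + 6            # key + child-page pointer
ROW_SIZE = 1024                   # assumed average row length

fanout = PAGE_SIZE // INTERNAL_ENTRY    # ~1170 children per internal page
rows_per_leaf = PAGE_SIZE // ROW_SIZE   # ~16 rows per leaf page

# Three levels: root -> internal -> leaf
rows_3_levels = fanout * fanout * rows_per_leaf
print(f"~{rows_3_levels:,} rows fit in 3 levels")   # ≈ 21.9 million

# Beyond that the tree needs a 4th level, i.e. one extra page read per lookup.
```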

Therefore, TPS or QPS metrics are more commonly used to decide whether a distributed transformation is needed, for example when the single-node TPS bottleneck reaches 4,000, QPS reaches 80,000 (8w), or data volume reaches 2 TB. In general it is reasonable to scale horizontally to relieve performance or capacity bottlenecks, but there is no fixed formula; the judgment has to be made against your own business scenario. Future business growth must also be considered, for example whether the system can absorb 3-5 years of growth: estimate peak loads, plan ahead, and avoid a second round of transformation. At the same time, refer back to the questions listed above and ask whether the problem really must be solved by a distributed database.
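
Before deciding, it can help to project the current load against those thresholds over the planning horizon. A minimal sketch follows; the starting values and the 30% annual growth rate are made-up assumptions purely for illustration.

```python
# Project peak TPS and data size 5 years out and compare against the
# single-node thresholds cited above. All inputs are illustrative.
current_peak_tps = 1500
current_size_tb = 0.6
annual_growth = 0.30

TPS_LIMIT = 4000       # single-node TPS threshold mentioned above
SIZE_LIMIT_TB = 2.0    # single-node capacity threshold mentioned above

for year in range(1, 6):
    tps = current_peak_tps * (1 + annual_growth) ** year
    size = current_size_tb * (1 + annual_growth) ** year
    flag = "exceeds limits" if tps > TPS_LIMIT or size > SIZE_LIMIT_TB else "ok"
    print(f"year {year}: ~{tps:,.0f} TPS, ~{size:.2f} TB -> {flag}")
```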

Experiment 1 (finding the inflection point).

The hardware is an ARM-based Kunpeng virtual machine environment, configured as 16C/64G, running the Kylin V10 operating system with ordinary SSD disks.

The following figure shows the test results of a domestic distributed database, with the data split into 4 shards (times in seconds).

For single-row queries on an index there is basically no difference. For full-table scans and two-table joins (the joined table is fixed at 2 million rows, and both sides join on the shard key), a significant difference of roughly 5x appears once the data volume reaches 5 million (500w) rows.

For data volumes below 5 million rows, you can test against your own business workload. Of course, the inflection point may appear at 3 million rows or even lower, and I hope readers will contribute more test results. The experimental data may be skewed by many factors, so corrections are welcome. Please post your own results in the comments so that together we can pin down the performance inflection point between distributed and centralized and provide a more accurate data basis for selection.

Experiment 2.

The following figure shows the results of a vendor's stress test based on the sysbench tool.

As the figure shows, with a medium-sized configuration, when the centralized database's resource utilization reaches 75%, the maximum TPS is 4,595 at a latency of 5 ms with 400 concurrent connections. This is a reference value, and it underpins the rough guideline above that a split should basically be considered once TPS exceeds 5,000. Of course, with more resources this value can be larger. The most accurate judgment, however, comes from verifying your own TPS figure through real stress tests.

4. How to make good use of distributed

As the name suggests, a distributed database has many nodes doing the work, which gives it high availability, high scalability, high performance, and elastic scaling.

However, as the number of data nodes and database components grows, the architecture and O&M inevitably become more complex and costs rise. At the same time, most distributed databases have weak or no support for special objects such as stored procedures and user-defined functions.

Distributed is a double-edged sword; what matters is how to use it well without getting hurt.

1. Selection of shard key

The choice of shard key is very important. The values of the field chosen as the shard key should be discrete, so that data is spread evenly across the data nodes. When a single field is not discrete enough, consider combining several fields into a composite shard key. In most cases the table's primary key can serve as the shard key, for example the ID number in a personnel information table. Note that most distributed databases do not support, or do not recommend, modifying the shard key.
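
To illustrate why discreteness matters, here is a minimal sketch of hash-based routing on an ID number. The hash function, shard count, and field names are assumptions for illustration, not any particular vendor's implementation.

```python
import hashlib

NUM_SHARDS = 4

def shard_of(shard_key: str) -> int:
    """Map a shard-key value to a data node using a stable hash."""
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# A discrete key (e.g. ID number) spreads rows across all shards...
ids = ["110101199001011234", "310101198512054321", "440301200007089876"]
print([shard_of(i) for i in ids])

# ...whereas a low-cardinality field (e.g. branch code) piles rows onto a
# few shards; combining fields into a composite key restores discreteness.
print(shard_of("BJ01" + "110101199001011234"))
```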

2. Selection of distribution mode

A common choice is hash distribution, which spreads data relatively evenly; range and list distribution are also available, and the final choice depends on the specific business scenario. In addition, commonly used configuration tables and small tables that are frequently joined should be defined as global tables (replicated to every node), so that queries can be satisfied on a single data node and cross-node data interaction is avoided.

3. Standardize the writing of SQL statements

Include the shard key in the query conditions, and for multi-table queries, join on the shard key. If the shard key is not used, cross-node data transmission occurs: some distributed databases pull all of the data back to the compute node to aggregate and join, and with large data volumes the compute node's resources are exhausted almost instantly, leaving the database unable to serve requests.
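
The difference between a query that carries the shard key and one that does not can be pictured as single-shard routing versus scatter-gather. A simplified sketch, assuming customer_id is the shard key; this is not any specific product's optimizer.

```python
import hashlib

NUM_SHARDS = 4

def shard_of(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

def route_query(where: dict):
    """Decide which data nodes a query must touch (simplified).

    'customer_id' stands in for the table's shard key; any query that
    omits it degenerates into a scatter-gather across every shard, with
    the merge and join done on the compute node.
    """
    if "customer_id" in where:
        return [shard_of(where["customer_id"])]   # single-shard execution
    return list(range(NUM_SHARDS))                # all shards + merge on CN

print(route_query({"customer_id": "110101199001011234"}))  # one node
print(route_query({"order_date": "2024-01-31"}))            # every node
```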

4. Avoid cross-node data transmission

As mentioned above, putting the shard key in the query conditions is exactly how cross-node transmission is minimized. Cross-node data transfer goes over the network, and network throughput and latency are far worse than local disk reads and writes, so performance drops significantly and some queries may simply never finish.

5. Avoid distributed transactions

Most distributed databases implement distributed transactions based on two-phase commit (2PC), so distributed transactions should be avoided as much as possible, generally kept within 10% of all transactions. Too many distributed transactions will certainly hurt performance and will also challenge the consistency of business data.
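
To see why distributed transactions are expensive, here is a bare-bones two-phase commit coordinator. Real implementations add write-ahead logging, timeouts, and crash recovery; treat this purely as an illustration of the extra round trips per participating node.

```python
# Minimal 2PC sketch: the coordinator asks every participating data node
# to prepare, and commits only if all of them vote yes. Each phase costs
# an extra network round trip per node, which is where the performance
# penalty of distributed transactions comes from.
class Participant:
    def __init__(self, name):
        self.name = name

    def prepare(self, txn) -> bool:
        # Write redo/undo info, hold locks, then vote. Always "yes" here.
        print(f"{self.name}: prepared {txn}")
        return True

    def commit(self, txn):
        print(f"{self.name}: committed {txn}")

    def rollback(self, txn):
        print(f"{self.name}: rolled back {txn}")

def two_phase_commit(txn, participants):
    # Phase 1: prepare (voting)
    if all(p.prepare(txn) for p in participants):
        # Phase 2: commit everywhere
        for p in participants:
            p.commit(txn)
        return True
    # Any "no" vote (or failure) means rollback everywhere
    for p in participants:
        p.rollback(txn)
    return False

two_phase_commit("txn-001", [Participant("dn1"), Participant("dn2")])
```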

5. In-depth analysis: is distributed a database-level solution or an application-level solution?

Distribution can be implemented either in the database (a distributed database) or in the application. Most development teams, especially in traditional industries and smaller financial institutions such as city commercial banks, have less development capacity and fewer people than large banks, so they want the database to do more, for example handle distributed transactions and data splitting, and stay as transparent to developers as possible. They therefore use a distributed database directly. Taking a unitized architecture as an example, see the figure below:

A unitized architecture with a distributed database underneath.

However, some important business systems, or teams with strong development capabilities, prefer to implement more at the application layer because they want more control. For example, if a distributed transaction fails and the handling lives in the database layer, it is a black box to developers: they can only rely on the database's distributed transaction processing and cannot intervene. If it is implemented at the business layer, they can handle it themselves using the captured log information, message queues, and data compensation mechanisms such as TCC and Saga. They therefore implement distribution at the application layer and keep the databases centralized, with each database storing part of the business data. Taking a unitized architecture as an example, see the figure below:

A unitized architecture with centralized databases underneath.

The differences between implementing distribution on centralized versus distributed databases can be summarized as follows:

With a centralized database, the application layer must implement the distributed characteristics itself, which places higher demands on the application, but less change is needed at the database level, because centralized databases offer better compatibility than distributed ones.

With a distributed database, the application does not need to implement distributed characteristics; they are transparent to the application. However, distributed databases have poor or no support for special objects such as stored procedures and functions, so the application has to adapt to the database.

6. Summary

At a roundtable forum on database innovation, a fellow practitioner said: a centralized database is like a sheep, docile and easy to manage, while a distributed database is like a wild horse, unbridled and hard to control. It reminds me of the line Song Dongye sings in "Miss Dong": "I fell in love with a wild horse, but there is no grassland at home, and that makes me despair." The wild horse that is a distributed database can be tamed, and then it will carry you galloping across the grassland; otherwise you will suffer and struggle. In fact, most developers still want the database to do more and the developer to change less, and want the database to be more transparent, simpler, even smarter.

Finally, I would like to say that domestic databases still have a long way to go. Compared with adding new features, customers care more about improving the basics. If we can get the core storage engine and the ecosystem right, then for OLTP databases we would not need to discuss this topic in such depth.

If there is anything inaccurate or unprofessional in this article, please point it out. Thank you.

This article is the author's original contribution and was not first published by the community. If you need **, please contact *** Basic Technology Research.

The DBAPLUS community welcomes contributions from technical personnel at editor@dbapluscn
