Keywords: data synchronization, data heterogeneity, data migration.

Introduction

In this day and age, data is at the heart of business operations. As your business expands and your user base grows, it is critical to ensure data consistency, timeliness, and reliability across different parts of the system. This article describes several common data synchronization scenarios, including database master-slave replication, data migration, and real-time data synchronization. By understanding the features, advantages, and limitations of each solution in depth, we can better select and customize a data synchronization strategy for a specific business scenario, laying the foundation for building efficient, stable, and scalable systems.
The main contents are as follows:
Scheme 1: Database master-slave replication
Database master-slave replication is a common data synchronization scenario in which the primary database propagates its change operations to one or more slave databases.
Configuration steps for master-slave replication of MySQL databases:
Make sure that the master and slave versions are consistent: the master and slave databases should use the same MySQL version to avoid compatibility issues.

Configure the primary database: Open the MySQL configuration file on the primary database (usually my.cnf or my.ini) and set the following parameters:
# Set the unique identifier of the primary server
server-id = 1
# Enable binary logging to record all changes on the primary database
log_bin = /var/log/mysql/mysql-bin.log
# Specify the database to be replicated
binlog_do_db = your_database_name
Create a replication user: Create a user for replication on the primary database, making sure that the user has the appropriate permissions:

-- Replace replication_user and replication_password with your own username and password
create user 'replication_user'@'%' identified by 'replication_password';
grant replication slave on *.* to 'replication_user'@'%';
-- Apply the privilege changes
flush privileges;
To obtain the binary log position of the primary database: Run the following command on the primary database and record the File and Position values from the output; they are used when configuring the slave database:

mysql> show master status;
+------------------+----------+--------------+------------------+-------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| mysql-bin.000001 |     6470 | your_name    |                  |                   |
+------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
Configure the slave database: Open the MySQL configuration file on the slave database and set the parameter:

server-id = 2
Save the configuration and restart the slave database.
Connect the slave database to the primary database: Run the following command on the slave database, replacing master_host, master_user, master_password, master_log_file, and master_log_pos with the values for your primary database:

-- Configure the connection from the slave database to the primary database
change master to
  master_host = 'master_host',
  master_user = 'replication_user',
  master_password = 'replication_password',
  master_log_file = 'master_log_file',  -- the File value from show master status;
  master_log_pos = master_log_pos;      -- the Position value from show master status;
Start the replication process on the slave database:

start slave;
Verify replication status: Make some data changes on the primary database and execute the following command on the slave database; check that Slave_IO_Running and Slave_SQL_Running are both Yes and that the changes appear on the slave:

show slave status\G
In short, the master-slave replication solution is suitable for read-heavy, write-light scenarios: read requests can be served by the slave databases, reducing the load on the primary database.

Advantages: improves read performance, offloads the primary database, and provides a disaster recovery and backup mechanism.

Limitations: replication delay can leave the slave data temporarily inconsistent with the primary; a single point of failure in the primary database can affect the entire system; not suitable for write-intensive applications.
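To make the read/write split concrete, here is a minimal sketch of how an application might route writes to the primary and reads to a slave over plain JDBC. The connection URLs, credentials, and class name are placeholders invented for this illustration, not part of any particular framework:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Minimal sketch of read/write splitting on top of master-slave replication.
// The JDBC URLs and credentials are placeholders.
public class ReadWriteRouter {
    private static final String PRIMARY_URL = "jdbc:mysql://master_host:3306/your_database_name";
    private static final String REPLICA_URL = "jdbc:mysql://slave_host:3306/your_database_name";
    private static final String USER = "app_user";
    private static final String PASSWORD = "app_password";

    // Writes always go to the primary so they are recorded in the binary log.
    public static int executeWrite(String sql) throws SQLException {
        try (Connection conn = DriverManager.getConnection(PRIMARY_URL, USER, PASSWORD);
             Statement stmt = conn.createStatement()) {
            return stmt.executeUpdate(sql);
        }
    }

    // Reads go to the slave to offload the primary; results may lag slightly
    // behind the primary because of replication delay.
    public static void executeRead(String sql) throws SQLException {
        try (Connection conn = DriverManager.getConnection(REPLICA_URL, USER, PASSWORD);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}

In practice this routing is usually handled by a data-access middleware or proxy rather than hand-written code, and reads that must see the latest write should still go to the primary because of the replication lag noted above.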
Scheme 2: Data migration with ETL tools
ETL (extract, transform, load) tools are widely used for data migration, integration, and synchronization between different data storage systems, especially in large-scale data migration, data warehouse construction, data cleansing, and transformation. Common ETL tools are:
Apache NiFi
Main features: provides an intuitive visual interface, supports real-time data flows, and emphasizes ease of use and manageability.
Applicable scenarios: building real-time data flows; an easy-to-use interface with strong management capabilities.

Talend Open Studio
Main features: powerful graphical interface and rich connectors; supports multiple data sources and targets; complex transformation and cleansing functions.
Applicable scenarios: complex data transformation, multi-source and multi-target data synchronization, large-scale data migration.

Apache Camel
Main features: based on enterprise integration patterns; supports a wide range of protocols and data formats.
Applicable scenarios: building flexible data integration solutions; enterprise-level data integration and message routing.

Kettle (Pentaho)
Main features: provides a graphical interface, supports powerful data manipulation and transformation capabilities, and integrates with other components of the Pentaho platform.
Applicable scenarios: full data integration, business intelligence, and data analysis.
Selection suggestions:

If you are looking for real-time data flow and ease of use, Apache NiFi is a good choice. For complex data transformations and large-scale migrations, Talend Open Studio offers a wealth of features and a wide range of connectors. If you need a high degree of flexibility and customizability, or already use other Apache components, consider Apache Camel. For comprehensive data integration and business intelligence, Pentaho Data Integration is an all-round solution. The specific choice depends on the needs of the business, the technology stack, and the skill level of the team.
Here we take Apache NiFi as an example to walk through the data migration process.

Official website: https://nifi.apache.org/
Installation and deployment are left to the reader; this section describes the main execution flow in terms of the ETL functions.
Stand-alone architecture:
Web Server: hosts NiFi's HTTP-based command and control API.

Flow Controller: the brains of the operation. It provides threads for extensions to run on and manages when extensions receive resources to execute.

Extensions: there are various types of NiFi extensions, described in other documentation. The key point here is that extensions operate and execute within the JVM.

FlowFile Repository: where NiFi keeps track of the state of each FlowFile that is currently active in the flow. The implementation of the repository is pluggable.

Content Repository: where the actual content bytes of a given FlowFile are stored.

Provenance Repository: where all provenance event data is stored.
Tool positioning and usage process:

Here we take querying data from MySQL and writing data back to MySQL as an example to demonstrate a simple flow.
If you are interested, you can dig deeper. The point here is that for large data volumes, including data extraction, data loading, and incremental data synchronization, these tools are a good fit: ETL tools provide visual components plus configurable connection types, which saves a great deal of manual effort and indirectly helps ensure data consistency. However, because new components are introduced, ETL tools inevitably add complexity to the system when multiple data sources are involved.
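To make the incremental-synchronization idea concrete, here is a minimal watermark-based extract-and-load sketch in plain JDBC, the kind of pattern these ETL tools wrap in visual components. The connection details are placeholder assumptions, only two columns are carried over for brevity, and real tools also persist the watermark and handle scheduling, retries, and batching for you:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

// Minimal watermark-based incremental extraction and load; connection details are placeholders.
public class IncrementalSync {
    public static void main(String[] args) throws Exception {
        // In a real job the watermark is persisted between runs; here it is hard-coded.
        Timestamp lastWatermark = Timestamp.valueOf("2024-01-01 00:00:00");

        try (Connection source = DriverManager.getConnection(
                 "jdbc:mysql://source_host:3306/source_db", "etl_user", "etl_password");
             Connection target = DriverManager.getConnection(
                 "jdbc:mysql://target_host:3306/target_db", "etl_user", "etl_password");
             // Extract only the rows changed since the last run.
             PreparedStatement extract = source.prepareStatement(
                 "select order_id, update_time from tb_order where update_time > ?");
             // Load into the target, relying on the primary key to upsert.
             PreparedStatement load = target.prepareStatement(
                 "insert into tb_order_his (order_id, update_time) values (?, ?) "
                 + "on duplicate key update update_time = values(update_time)")) {

            extract.setTimestamp(1, lastWatermark);
            try (ResultSet rs = extract.executeQuery()) {
                while (rs.next()) {
                    Timestamp updated = rs.getTimestamp("update_time");
                    load.setLong(1, rs.getLong("order_id"));
                    load.setTimestamp(2, updated);
                    load.addBatch();
                    // Advance the watermark to the newest update_time seen in this batch.
                    if (updated.after(lastWatermark)) {
                        lastWatermark = updated;
                    }
                }
                load.executeBatch();
            }
            System.out.println("Next watermark: " + lastWatermark);
        }
    }
}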
Scheme 3: Trigger-based incremental data synchronization
For example, the trigger below fires when new data is inserted into the tb_order table and synchronizes the new rows to the tb_order_his table (readers can adjust the trigger timing and logic as needed).

Initial state: tb_order has 3 records; tb_order_his has 0 records.
Trigger logic script:
delimiter //
-- Create the trigger
create trigger sync_order_to_history
after insert on tb_order
for each row
begin
  insert into tb_order_his (
    order_id, customer_id, order_date, product_id, quantity, total_price,
    status, shipping_address, payment_method, coupon_code, create_time, update_time
  ) values (
    new.order_id, new.customer_id, new.order_date, new.product_id, new.quantity, new.total_price,
    new.status, new.shipping_address, new.payment_method, new.coupon_code, new.create_time, new.update_time
  );
end;
//
delimiter ;
This trigger fires after an insert into the tb_order table and copies the newly inserted row into the tb_order_his table. Note that this assumes the structure of the tb_order_his table is the same as that of tb_order.
Test that the trigger works:
-- Insert a new order
insert into tb_order values (4, 4, '2024-01-15 12:00:00', 104, 4, 150.25, 'To be paid', '567 elm st, county', 'credit card', 'discount_15', '2024-01-15 12:00:00', '2024-01-15 12:00:00');

-- Query tb_order_his to ensure that the data has been synchronized successfully
select * from tb_order_his;
Check the result: synchronization succeeded (result screenshots of tb_order and tb_order_his omitted).
Advantages of trigger synchronization:
Real-time: Triggers can synchronize data in real time, and when a trigger event occurs, the synchronization operation is performed immediately to ensure that the data in the destination table is synchronized with the source table.
Simplified operations: Triggers automate synchronization at the database level, eliminating the need to write additional synchronization logic in the application, which simplifies development and maintenance and helps ensure data consistency between the source and destination tables.
Disadvantages of trigger synchronization:
Performance impact: The execution of triggers introduces additional performance overhead, especially when performing data operations at scale. Triggers that fire frequently can cause database performance to degrade.
Complexity: When trigger logic is complex or there are multiple triggers, it can be difficult to track and debug the behavior of triggers, especially when it comes to maintenance.
Concurrency control: In a high-concurrency environment, triggers can cause issues with concurrency control and need to be handled with care to ensure data consistency.
Scheme 4: Manual script synchronization (plain and simple)
This is plain SQL scripting, commonly used for data cutover, fixing erroneous data (including configuration data and business fields), and manually adjusting abnormal data during O&M. It is relatively simple and is included here only to keep the article structure complete. A simple example:
# insert into tb_target select * from tb_source
insert into tb_order_his (
  order_id, customer_id, order_date, product_id, quantity, total_price,
  status, shipping_address, payment_method, coupon_code, create_time, update_time
)
select
  order_id, customer_id, order_date, product_id, quantity, total_price,
  status, shipping_address, payment_method, coupon_code, create_time, update_time
from tb_order;
It's relatively simple, and there's nothing to sum up.
Scheme 5: Real-time data synchronization (using message queues)
In this scheme, MySQL data change events are captured and passed to downstream data stores through a message queue. For example, to synchronize data from MySQL to ClickHouse, you can use Debezium as the Change Data Capture (CDC) tool and Kafka as the message queue. Approximate steps:
Configure the MySQL database connection information:

# MySQL connection configuration
database.hostname=mysql-host
database.port=3306
database.user=mysql-user
database.password=mysql-password

# Debezium configuration
connector.class=io.debezium.connector.mysql.MySqlConnector
tasks.max=1
database.server.id=1
database.server.name=my-app-connector
database.whitelist=mydatabase
Start the Debezium Connector: Start the Debezium Connector from the command line or configuration file, for example:
debezium-connector-mysql my-connector.properties
Create the Kafka topic: Debezium sends change events to a Kafka topic, so make sure the topic has been created:
kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic my-topic
A pseudo example - consume the Kafka topic and write the data to ClickHouse:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

// Pseudo example: message parsing and the actual ClickHouse write are only sketched.
public class ClickHouseDataConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "clickhouse-sync");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    processKafkaMessage(record.value());
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void processKafkaMessage(String message) {
        // Parse the Debezium change event here, then write it out
        writeToClickHouse(message);
    }

    private static void writeToClickHouse(String message) {
        // Write the change to ClickHouse, e.g. via the ClickHouse JDBC driver (omitted)
    }
}
There are a few advantages and disadvantages to using Kafka to synchronize MySQL in real time:
Advantages: High real-time performance: Kafka is a high-throughput, low-latency message queuing system that provides near-real-time data synchronization, enabling applications to quickly obtain the latest data changes.
Message persistence: Kafka has the message persistence feature to ensure that even if consumers are offline for a period of time, they can still retrieve previously unprocessed messages and ensure that data is not lost.
Disadvantages: Consistency guarantee: Kafka guarantees message order within a partition, but not across partitions, so in some scenarios additional measures may be required to ensure global consistency; one common consumer-side measure is sketched below.
For small-scale applications, introducing Kafka may seem too cumbersome, and a lightweight solution may be more appropriate.
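As one example of the "additional measures" mentioned above, the consumer can make its apply step idempotent and order-tolerant. The sketch below is an illustrative assumption rather than part of the Debezium/Kafka setup described in this article: ChangeEvent is a hypothetical stand-in for a parsed change message, and the in-memory map would normally be replaced by state kept in the target store.

import java.sql.Timestamp;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an idempotent, order-tolerant apply step for the Kafka consumer above.
public class IdempotentApplier {

    // Hypothetical stand-in for a parsed change message (e.g. a Debezium event).
    public record ChangeEvent(long orderId, Timestamp updateTime, String payload) {}

    // Last applied update_time per order_id; in production this state would live
    // in the target store rather than in memory.
    private final Map<Long, Timestamp> lastApplied = new ConcurrentHashMap<>();

    public void apply(ChangeEvent event) {
        Timestamp previous = lastApplied.get(event.orderId());
        // Skip events that are not newer than what was already applied:
        // re-deliveries and out-of-order stragglers become harmless no-ops.
        if (previous != null && !event.updateTime().after(previous)) {
            return;
        }
        writeToTarget(event);
        lastApplied.put(event.orderId(), event.updateTime());
    }

    private void writeToTarget(ChangeEvent event) {
        // Upsert into the target table keyed by order_id; details depend on the target store.
        System.out.println("applying change for order " + event.orderId());
    }
}

With a guard like this, re-delivered messages and stragglers from other partitions simply become no-ops, which softens the global-ordering concern for row-level synchronization.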
Summary
1. Database master-slave replication
Description: uses the database's own master-slave replication feature to synchronize changes from the primary database to one or more slave databases.
Advantages: simple to implement and provides relatively real-time synchronization; well suited to read-heavy, write-light scenarios.
Limitations: requires a stable network connection between master and slave and comes with master-slave replication lag. Applicable to databases such as MySQL and PostgreSQL.

2. ETL tool data migration
Description: uses professional ETL tools, such as Apache NiFi and Talend, to periodically extract data from the source database, transform it, and load it into the target database.
Advantages: can perform complex data transformation and cleansing; suitable for synchronization between heterogeneous databases.
Limitations: an appropriate scheduling policy must be configured to handle incremental and full synchronization.

3. Database trigger-based synchronization
Description: triggers are set in the source database and fire when data changes, for example recording the change into a synchronization table; the target database periodically polls that table and applies the changes.
Advantages: achieves near-real-time synchronization; suitable for small data volumes.
Limitations: triggers must be carefully designed to avoid a disproportionate impact on source database performance.

4. Manual data scripting
Description: hand-written SQL scripts that insert data from one database into another.
Advantages: simple and straightforward; suitable for small-scale data synchronization.
Limitations: exceptional situations such as online configuration changes and data cutover must be handled manually and carefully.

5. Real-time data synchronization (using a message queue)
Description: change operations in the source database are published to a message queue; consumers subscribe to the messages and apply the changes to the target database.
Advantages: real-time synchronization; asynchronous processing has little impact on system performance.
Limitations: the reliability of the message queue and the idempotency of consumers must be considered.
Ending