About the Author:
Luo Ruixing, Data Architecture Expert at Yiguo Group, Leader of the TiDB User Group (TUG) in Shanghai.
Application of TiDB
Yiguo Group's real-time data warehouse has existed for a long time. When the business volume was still small, a single SQL Server instance met our needs: with little data, each stored procedure generally completed in 1-2 minutes, and consistency between the real-time and T+1 data was easy to guarantee.
Looking back, the benefits of this setup were obvious: a single machine is easy to maintain and works well while data volumes stay small. But whenever a big promotion or sales event hit, some stored procedures would easily hang, sometimes taking 30-40 minutes to finish.
As the business grew, Yiguo Group's offline workload was migrated from SQL Server to Hadoop, and the real-time side also needed a system that could keep up with future growth. Weighing both business and technical considerations, we finally selected the TiDB + TiSpark solution, which offers several obvious advantages:
Compared with the cost of migrating to **, rewriting the original stored procedures as plain SQL is very cheap, which greatly reduces the transformation cost.
Because the previous system relied on stored procedures, most of which were quite complex and full of update and delete operations, the TiDB solution can still guarantee consistency between real-time and offline data, saving a great deal of explanation cost.
Obviously, moving from SQL Server to TiDB, from a single machine to a distributed system, improved performance: a script taking 30 minutes is now rare.
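To illustrate the first advantage, a stored procedure that mixed update and delete logic can usually be rewritten as a short sequence of plain SQL statements driven from a script. Here is a minimal sketch of that shape, using SQLite purely so the demo is self-contained; in production the same statements would run against TiDB through a MySQL client, and the table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical order table for the demo.
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, amount REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "paid", 100.0), (2, "cancelled", 50.0), (3, "paid", 80.0)])

# What used to live inside a stored procedure becomes plain SQL run by the ETL script:
cur.execute("DELETE FROM orders WHERE status = 'cancelled'")       # purge bad rows
cur.execute("UPDATE orders SET amount = amount * 0.5 "             # apply an adjustment
            "WHERE status = 'paid'")
conn.commit()

print(cur.execute("SELECT id, amount FROM orders ORDER BY id").fetchall())
# → [(1, 50.0), (3, 40.0)]
```

Because the logic now lives in the script rather than inside the database engine, it can be versioned and migrated between systems with little change.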
It should be mentioned that one of the most important reasons for our selection was the TiSpark project. TiDB was still at a very early version at the time, far from the much-improved 3.0 of today, and TiSpark is what gave our analysts the ability to run complex analysis.
In addition, as mentioned before, our offline cluster is based on Hadoop, so with TiSpark we can unify on the Spark engine; once TiSpark supports write-back, we can basically run one set of scripts against both clusters.
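As a rough illustration of what "unifying on the Spark engine" means in practice, TiSpark is typically enabled through a few Spark configuration entries, after which TiDB tables can be queried as ordinary Spark SQL tables. The sketch below follows TiSpark 2.x-style settings; the PD address is a placeholder and exact keys may differ across TiSpark versions:

```
# spark-defaults.conf (illustrative; addresses are placeholders)
spark.sql.extensions        org.apache.spark.sql.TiExtensions
spark.tispark.pd.addresses  pd-host:2379
```

With this in place, the same Spark SQL scripts used on the Hadoop cluster can, in principle, target TiDB tables with little modification.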
The architecture of Yiguo Group's real-time data warehouse based on TiDB is shown in the figure below:
TiFlash and the data middle platform
Although this architecture is very convenient, it has its problems, the most obvious being that AP and TP workloads interfere with each other, an unavoidable issue for early HTAP systems. In 2018, TiDB proposed the TiFlash project; plenty of material about it is already available, so I won't introduce it at length here. TiFlash physically isolates AP and TP workloads, fundamentally resolving the conflict between them and bringing TiDB one step closer to true HTAP.
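For concreteness, in current TiDB versions a table is exposed to the TiFlash column store by adding a replica, and an analytical session can be pinned to that engine so it never touches the TiKV row store serving TP traffic (the table name below is illustrative):

```sql
-- Create one TiFlash (columnar) replica of an illustrative table.
ALTER TABLE orders SET TIFLASH REPLICA 1;

-- Restrict this analytical session to reading from TiFlash only,
-- keeping it physically isolated from TiKV (TP) traffic.
SET SESSION tidb_isolation_read_engines = 'tiflash';
```

This is the physical-isolation mechanism described above: the AP query load lands on separate replicas and separate hardware.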
We began performance and functional testing in 2018, initially choosing scenarios with large data volumes but relatively small traffic, and the overall results were satisfactory.
TiFlash resolves the AP/TP conflict at the physical level. Also in 2018, the concept of the data middle platform became very popular, and viewed from the middle-platform perspective, there are management measures that can alleviate AP/TP conflicts as well.
The following figure is a simple comparison of the Hadoop and TiDB ETL processes. As the figure shows, Hadoop ETL is mostly table-based, so its impact on resources is relatively small and its blast radius is limited.
Most ETL into TiDB, by contrast, is instance- or database-based. Synchronizing MySQL into TiDB through DM or Syncer is very convenient, but compared with Hadoop's table-level ETL, if most of the synchronized data is never used, or the data is of poor quality and changes frequently, it places a real load on cluster resources.
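One practical mitigation is to narrow the synchronization scope in the DM task definition so that only the databases actually needed are replicated. Below is a sketch of a DM task file; the field names follow recent DM releases (older versions used different filter-list names), and all names and addresses are placeholders:

```yaml
name: orders-to-tidb
task-mode: all                  # full dump + incremental replication

target-database:
  host: "tidb-host"
  port: 4000
  user: "dm_user"

mysql-instances:
  - source-id: "mysql-orders-01"
    block-allow-list: "only-orders"

block-allow-list:
  only-orders:
    do-dbs: ["orders"]          # replicate just this database, not the whole instance
```

Filtering at the task level keeps unused upstream data off the TiDB cluster in the first place, which directly reduces the resource impact described above.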
Based on this, whether the target is Hadoop or TiDB, every synchronization should be registered in a data catalog. The data catalog project is part of the data middle platform; in the early stage it is driven by the business middle platform or the DBAs, who make a preliminary assessment of data availability and maintain certain business attributes of the data. It also works together with OneData and the data-access process to track the correspondence between metrics, tables, and tasks, making resource control easier.
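A data catalog at this stage can start out very simple: a registry that records, for each synchronized table, its owner, business attributes, and the tasks and metrics that depend on it. The sketch below is only an illustration of that idea; every field name and entry is invented, not the actual catalog schema:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    table: str                    # physical table, e.g. "orders.order_detail"
    owner: str                    # responsible person or team
    source: str                   # upstream system the data is synced from
    business_attrs: dict = field(default_factory=dict)
    metrics: list = field(default_factory=list)   # metrics derived from this table
    tasks: list = field(default_factory=list)     # ETL tasks reading/writing it

catalog = {}

def register(entry):
    catalog[entry.table] = entry

def impacted_tasks(table):
    """Which ETL tasks must be checked if this table changes?"""
    entry = catalog.get(table)
    return entry.tasks if entry else []

register(CatalogEntry(
    table="orders.order_detail",
    owner="data-team",
    source="mysql-orders-01",
    business_attrs={"domain": "trade", "availability": "verified"},
    metrics=["gmv_daily"],
    tasks=["rt_order_agg", "t1_order_snapshot"],
))

print(impacted_tasks("orders.order_detail"))  # → ['rt_order_agg', 't1_order_snapshot']
```

Even a registry this small makes the metric-table-task correspondence queryable, which is the prerequisite for the resource control mentioned above.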
Finally, TiDB is also an important data export for OneService, the unified interface service that the data department provides within Yiguo; it currently exposes mainly RESTful interfaces. In the interface system we manage the business attributes and responsible person of each system, and the current version also manages interface versions. The business side only needs to follow the configuration steps on the page to generate a usable interface. In a follow-up plan, we also intend to add an interface de-duplication mechanism to avoid duplicate interfaces.
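Conceptually, an interface platform like this turns a page-level configuration into a callable endpoint with parameter validation and version metadata attached. The following is a toy sketch of that idea in pure Python, with no web framework; every name, table, and field here is hypothetical and not the actual OneService implementation:

```python
# A declarative interface config, as a business user might fill in on the page.
interface_config = {
    "name": "daily_gmv",
    "version": "v1",
    "owner": "analyst-a",
    "sql": "SELECT dt, gmv FROM dw.gmv_daily WHERE dt = :dt",  # hypothetical table
    "params": ["dt"],
}

def build_endpoint(config, run_query):
    """Turn a config dict into a function that validates params and runs the query."""
    def endpoint(**params):
        missing = [p for p in config["params"] if p not in params]
        if missing:
            return {"code": 400, "error": "missing params: %s" % missing}
        rows = run_query(config["sql"], params)
        return {"code": 200, "version": config["version"], "data": rows}
    return endpoint

# Fake query runner standing in for a real TiDB connection.
def fake_run_query(sql, params):
    return [{"dt": params["dt"], "gmv": 12345.6}]

daily_gmv = build_endpoint(interface_config, fake_run_query)
print(daily_gmv(dt="2020-01-01")["code"])  # → 200
print(daily_gmv()["code"])                 # → 400
```

The point of the design is that the business side only edits the declarative config; validation, versioning, and ownership bookkeeping are handled uniformly by the platform.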
With the introduction of the data-middle-platform concept, enterprises pay more and more attention to the value of data: although managing data consumes assets in the traditional sense, data itself is also part of an enterprise's assets. Data therefore needs increasingly refined management, and from access to use, and from use to full use, every step requires much exploration.
Future
The emergence of HTAP and NewSQL systems not only solves problems such as database and table sharding on the business side, but is also slowly influencing the big-data field.
As an HTAP system, TiDB will be maintained, managed, or used by people in many different roles, and each role's focus may differ.
For traditional DBAs, stability and performance matter most. Beyond that, big data engineers also care about task efficiency and the resource usage of each task;
modeling engineers adjust models based on how analysts use them and judge whether a model is good;
analysts, for their part, want ease of use and convenience.
Since each role's focus differs, beyond the usual performance monitoring, user-oriented monitoring will receive more and more attention, to say nothing of security management, automatic resource management, and so on. I believe that with the continued development of the middle platform and the steady progress of TiDB, every aspect of data work will keep improving.