Focusing on data: cutting-edge applications of distributed databases and lakehouses

Mondo Technology Updated on 2024-01-30

In the digital era, data is a key asset of the enterprise, and how efficiently it is managed and used has a major bearing on an enterprise's survival and development. According to the Database Development Research Report (2023), the global database market reached US$83.3 billion in 2022, of which China's database market accounted for US$5.97 billion (about 40.36 billion yuan), a 7.2% share. By 2027, China's database market is expected to total 128.68 billion yuan, a compound annual growth rate (CAGR) of 26.1%.

Recently, 51CTO held two live technical events on cutting-edge database trends and their practical application, themed "Distributed Database Applications and Challenges" and "Technical Practice of the Lakehouse". Gold-medal lecturers from 51CTO Academy and enterprise database experts were invited to share their experience and help users understand and master the latest technology trends and application methods in the database field.

A distributed database system is efficient, scalable, and reliable, making it well suited to large-scale data and complex business needs. As cloud computing and big data technology continue to develop, distributed database systems will see ever wider use.

Togo, a certified lecturer at 51CTO Academy, and Chen Qianlong, a senior architect at Transwarp, gave talks titled "Unraveling the Mystery of Distributed Databases" and "Transwarp's Distributed Analytical Database Practice", respectively.

Togo first described the characteristics of data in the big data era, the business needs of this new era, and the reasons for adopting distributed databases, then discussed the data governance problems enterprises face and common approaches to solving them.

Togo noted that in the era of big data, many old problems have resurfaced in new or larger forms, including computing-power challenges and changes in the character of Internet applications. On the one hand, surging data volumes keep pushing up storage and computing costs, and managers must watch how data growth reshapes overall computing and how iterative upgrades of data management strain the underlying architecture. On the other hand, Internet applications are gradually shifting toward the Internet of Things: business models are moving from transactional workloads (OLTP) to analytical workloads (OLAP), and data is becoming increasingly heterogeneous.

Togo believes that in the face of these challenges, data managers should adopt distributed thinking: moving from a single-machine programming mindset to a cluster mindset, from scale-up to scale-out, and embracing entirely new system stacks. He summarized three directions for tackling the problem: selecting and introducing a distributed database, scheduling resources flexibly and conveniently, and moving computation to the data rather than moving data to the computation, which is usually cheaper (see the sketch below).
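To make the last point concrete, here is a minimal PySpark sketch, with a hypothetical dataset path and column name, contrasting pulling raw rows to the driver with pushing the computation out to the nodes that hold the data:

```python
# Minimal sketch of "moving computation is cheaper than moving data",
# assuming PySpark; the dataset path and column name are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("move-compute-demo").getOrCreate()
events = spark.read.parquet("hdfs:///data/events")  # hypothetical dataset

# Moving data: collect() ships every row across the network to the driver,
# which then aggregates locally; cost grows with the full data volume.
rows = events.collect()
error_count = sum(1 for r in rows if r["status"] == "ERROR")

# Moving computation: the filter and count run on the executors that
# already hold the data blocks; only small partial results travel.
error_count = events.filter(F.col("status") == "ERROR").count()

spark.stop()
```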

In his talk, Chen Qianlong cited the nine directions and four goals for the future development of database technology set out by the China Academy of Information and Communications Technology, and argued that convergence is the development trend of data platform architecture: architecture convergence, replacing hybrid architectures with a unified one, and platform convergence, unifying data lakes, data warehouses, and data marts. He said that a distributed analytical database can replace the hybrid Hadoop + MPP architecture, supporting standard SQL syntax and providing advanced capabilities such as multi-model analysis, real-time data processing, storage-compute decoupling, mixed workloads, data federation, and hybrid deployment on heterogeneous servers.

Speaking about the key technologies of distributed analytical databases, Chen Qianlong highlighted the following points:

First, a unified SQL entry point improves business concurrency through load balancing. At the same time, different workloads, such as queries and batch jobs, are routed to different computing resources according to configurable rules, separating workloads and reducing their interference with one another.

Second, a unified SQL compilation engine simplifies SQL development and adaptation, lowers the development threshold, and improves migration efficiency.

Third, a unified SQL computing engine improves performance through vectorized execution (illustrated in the sketch after this list).

Fourth, unified storage management supports multi-modal data, integrates multi-source data efficiently, and further strengthens multi-model capabilities.

Fifth, hybrid workload management associates jobs with resource pools, keeping resource usage under control and maximizing the return on system resources.

Sixth, cluster expansion is transparent to running business, and performance grows linearly after scaling out.

Seventh, block-level disaster recovery breaks through geographical restrictions and safeguards data.

Eighth, intelligent O&M integrates cluster management, SQL development, SQL monitoring, and other capabilities into one-stop database operations.
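On the third point, a toy Python comparison, with NumPy standing in for a columnar database's vectorized operators rather than Transwarp's actual engine, shows why column-at-a-time execution outruns row-at-a-time interpretation:

```python
# Toy comparison of row-at-a-time vs. vectorized (columnar) execution;
# NumPy stands in for a database's vectorized operators.
import time
import numpy as np

n = 2_000_000
price = np.random.rand(n) * 100         # one column
quantity = np.random.randint(1, 10, n)  # another column

# Row-at-a-time: interpret each row individually, as a classic
# tuple-at-a-time executor would.
t0 = time.perf_counter()
total = 0.0
for i in range(n):
    if price[i] > 50.0:
        total += price[i] * quantity[i]
row_time = time.perf_counter() - t0

# Vectorized: apply the predicate and arithmetic to whole columns at
# once, amortizing interpretation overhead over thousands of values.
t0 = time.perf_counter()
mask = price > 50.0
total_vec = float(np.sum(price[mask] * quantity[mask]))
vec_time = time.perf_counter() - t0

print(f"row-at-a-time: {row_time:.2f}s  vectorized: {vec_time:.4f}s")
```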

Chen Qianlong said that when selecting, applying, and optimizing databases, enterprises should adapt to local conditions and start from concrete needs. Given fast-changing technology, shifting application characteristics, and pressing external demand, he advised database operation and maintenance staff to keep learning broadly, follow the development of distributed databases closely, and maintain enough technical sensitivity to keep pace with technology trends.

A lakehouse is an innovative data storage and processing architecture that combines powerful data processing and analysis capabilities with guarantees for data security and quality, and it is gradually becoming a mainstream way for enterprises to store data. With low O&M overhead, low cost, multi-format and multi-functional support, high value, high agility, security, and flexibility, lakehouse solutions have been deployed at scale in the financial and Internet industries. Zhao Yuqiang, a gold-medal certified lecturer at 51CTO Academy, and Gao Jingjun, CTO of Beijing Kejie Technology, gave talks titled "Technical Practice of the Lakehouse" and "Lakehouse Integration to Build a New Cornerstone of Data Intelligence", respectively.

Starting from data warehouse and big data technology, Zhao Yuqiang introduced the two data warehouse architectures built on big data technology, lambda and kappa, and the big data computing engines Flink and Spark, which led into the topics of data lake technology and building a data warehouse on a data lake.

Zhao Yuqiang believes that a data warehouse is essentially a database: what traditional relational databases such as Oracle and MySQL implement can also be implemented on the big data ecosystem. There are two main data warehouse architectures based on big data technology, lambda and kappa. The lambda architecture, the mainstream choice, splits the warehouse into an offline part and a real-time part, storing offline data in HDFS or HBase and real-time data in the Kafka messaging system. By encapsulating the file data and abstracting the extraction layer, it integrates easily with a data lake to read either offline or real-time data. By contrast, the kappa architecture reads only real-time data; it can treat offline data as a special case of real-time data, but performs poorly when it does.
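As a minimal sketch of the lambda idea, the serving layer answers queries by merging a precomputed batch view (recomputed offline, e.g. from HDFS) with recent increments from the speed layer (e.g. fed from Kafka); plain Python dictionaries stand in for both stores here:

```python
# Minimal sketch of a lambda-architecture serving layer: merge a batch
# view (offline layer, e.g. recomputed nightly from HDFS) with a speed
# view (real-time layer, e.g. incremented from a Kafka stream).
# Plain dicts stand in for the two stores; the user IDs are made up.

batch_view = {"user_1": 1200, "user_2": 87}  # counts up to the last batch run
speed_view = {"user_1": 3, "user_3": 5}      # counts since the last batch run

def merged_count(user_id: str) -> int:
    """Query-time merge: batch view plus real-time delta."""
    return batch_view.get(user_id, 0) + speed_view.get(user_id, 0)

assert merged_count("user_1") == 1203  # both layers contribute
assert merged_count("user_3") == 5     # only the speed layer has seen this user
```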

After introducing the big data computing engines Flink and Spark, Zhao Yuqiang turned to the concept of data lakes and their common technical frameworks. Put simply, a data lake holds both structured and unstructured data and is an organizational approach for large-scale, multi-purpose, highly diverse data. Common data lake frameworks include Hudi, Iceberg, and Delta Lake; note that these frameworks are table formats layered over existing storage (such as HDFS or object storage) rather than storage engines in their own right. At the end of his talk, Zhao Yuqiang also presented a data lake based, flow-batch integrated data warehouse architecture for reference.
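As an illustration of one such framework, the following sketch assumes Delta Lake on PySpark (the package coordinate, storage path, and sample data are assumptions): the table format adds transactional commits and time travel on top of ordinary files in existing storage.

```python
# Sketch of a data lake table format in use, assuming Delta Lake with
# PySpark; the package coordinate, path, and data are assumptions and
# the coordinate must match the local Spark version.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0")
    # Delta is a library over existing storage, enabled via Spark configs:
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "open"), (2, "closed")], ["id", "status"])

# A write becomes an atomic commit in a transaction log kept next to the
# Parquet files; the underlying storage (HDFS, S3, ...) is unchanged.
df.write.format("delta").mode("overwrite").save("/tmp/demo_delta_table")

# Readers always see a consistent snapshot; versionAsOf gives time travel.
latest = spark.read.format("delta").load("/tmp/demo_delta_table")
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/tmp/demo_delta_table"))

spark.stop()
```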

Next, Gao Jingjun, CTO of Beijing Kejie Technology, gave a talk titled "Lakehouse Integration to Build a New Cornerstone of Data Intelligence", covering three aspects: the exploration and construction of the lakehouse architecture, technical practice of the lakehouse, and the future development of lakehouse platforms.

Gao Jingjun said that the lakehouse is a new open architecture that combines the advantages of data lakes and data warehouses: built on the low-cost storage of a data lake, it inherits the data processing and management functions of a data warehouse and can fully meet the needs of BI, DI, and AI applications.

As for the core elements of building a lakehouse, Gao Jingjun identified three points:

First, reliable on-lake data management: an open, high-performance format for organizing data.

Second, support for machine learning and data science: an open, standard set of APIs (see the sketch after this list).

Third, advanced SQL performance: a highly optimized execution engine.
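As a minimal sketch of the first two elements, assuming pyarrow and pandas (the path and columns are hypothetical), data written once in an open format can be read back by any engine or ML library through standard APIs:

```python
# Minimal sketch of "open format + open API": data written once in the
# open Parquet format is readable by any engine or ML library.
# pyarrow and pandas are assumed; the path and columns are made up.
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table in the open Parquet format.
table = pa.table({"user_id": [1, 2, 3], "score": [0.9, 0.4, 0.7]})
pq.write_table(table, "/tmp/features.parquet")

# Any tool that speaks the format can read it back -- here, straight
# into a pandas DataFrame for data science work.
features = pq.read_table("/tmp/features.parquet").to_pandas()
print(features[features["score"] > 0.5])
```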

However, as lakehouse practice deepens, especially when a single pipeline must deliver data at minute-level freshness and daily volumes reach the trillions, enterprises need to pay special attention to lakehouse performance. For example: How do you balance streaming and batch access? How do you achieve high performance and efficiency while keeping costs low? And when latency is already close to the minute-level limit, how do you keep accelerating? Gao Jingjun believes that solving these problems requires continuously refining the technical architecture and improving the data lake computing engine, optimizing performance through storage-compute separation, a unified metadata service, and the query engine.
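One common way to ease the streaming-versus-batch tension is a unified engine in which the same transformation runs in either mode. The following PySpark Structured Streaming sketch (paths and schema are hypothetical, and it illustrates the general principle rather than Kejie's platform) expresses the logic once and reuses it:

```python
# Sketch of flow-batch unification with PySpark Structured Streaming:
# the business logic is written once and run in batch or streaming mode.
# Paths and the input schema are hypothetical.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flow-batch-demo").getOrCreate()

def error_counts(events: DataFrame) -> DataFrame:
    """One definition of the logic, reused by both modes."""
    return (events.filter(F.col("level") == "ERROR")
                  .groupBy("service")
                  .count())

# Batch mode: read a static directory of historical events.
batch = spark.read.json("/data/events/history")
error_counts(batch).show()

# Streaming mode: the identical logic over newly arriving events.
stream = spark.readStream.schema(batch.schema).json("/data/events/incoming")
query = (error_counts(stream).writeStream
         .outputMode("complete")   # full recomputed counts each trigger
         .format("console")
         .start())
query.awaitTermination()
```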

Gao Jingjun said that enterprise data architecture is trending from single architectures toward multi-architecture convergence, and data assets from physical unification toward logical unification. Building a lakehouse-integrated data foundation secures the base for an enterprise's multi-architecture convergence platform and helps enterprises lay a new cornerstone for data intelligence.

With the continuous progress of information technology, database technology has become the core of enterprise intelligent construction: it not only stores an enterprise's core data but also supports its business operations and decision analysis.

The future development of database technology will pay more attention to the efficiency and security of data processing. On the one hand, with the arrival of the big data era, enterprises must process ever more data, and database technology needs to keep improving processing efficiency to meet their needs. On the other hand, as network security problems grow, the security of database technology matters more and more. Future database technology will pay greater attention to data security and privacy protection, adopting more advanced encryption and access control techniques to ensure data security and integrity.

For more details about the live events, visit the [Database Live Zone] for the live replays and guest presentation slides.
