A roundup of open source OLAP
In recent years, many excellent products have emerged in the open-source field, such as StarRocks, Doris, data lakes and lake formats, Spark, and the earlier HBase and Presto. The wide variety of open-source tools brings convenience to users, but it also makes choosing among them difficult.
The diagram above provides a simple categorization of the various databases. For example, StarRocks, Doris, and ClickHouse (CK) are AP databases that traditionally integrated storage and compute. Presto, Trino, and Impala are classic Hadoop-based MPP engines. Kylin, HBase, and Druid are widely used for pre-aggregation. There is also a category of lake formats (lake storage) that has become popular in recent years, including Delta Lake, Hudi, Iceberg, and Apache Paimon, which entered incubation only a few months ago.
Thinking about OLAP scenarios
The OLAP scenario involves many technology stacks; how should you choose among them? To answer this question, let us first think at the scenario level. Typical business scenarios in OLAP include user-facing reports, business-oriented reports, user profiling, operational analysis, order analysis, and self-service analysis.
For advertisers, store managers, and ToB reporting services, these scenarios share a common feature: they need to quickly retrieve users by attributes such as user ID, which places high demands on query performance along with a certain amount of concurrency. Of course, the concurrency here differs from ToC scenarios.
Given these characteristics, an excellent OLAP engine should technically meet the following requirements. First, it should provide a prefix index, so that query performance improves significantly once the index is built. Second, vectorized engines are an important trend; popularized by ClickHouse and now widely pursued, vectorization can indeed improve query speed to a large extent. In addition, balanced data distribution and automatic rebalancing are key to avoiding issues such as data skew.
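As a minimal sketch of the prefix-index idea (table and column names are made up; StarRocks speaks the MySQL wire protocol on the FE's port 9030, and the leading sort-key columns form its prefix index):

```python
# Hedged sketch: create a table whose sort key doubles as a prefix index,
# then run a query that filters on the leading sort-key column.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9030,  # 9030 = FE MySQL port
                       user="root", password="", database="demo")

DDL = """
CREATE TABLE IF NOT EXISTS user_events (
    user_id    BIGINT,
    event_date DATE,
    channel    VARCHAR(32),
    amount     DECIMAL(10, 2)
)
DUPLICATE KEY(user_id, event_date)      -- leading sort-key columns back the prefix index
DISTRIBUTED BY HASH(user_id) BUCKETS 8  -- hash distribution spreads data evenly
"""

with conn.cursor() as cur:
    cur.execute(DDL)
    # This predicate can hit the prefix index because it filters on the
    # first sort-key column.
    cur.execute("SELECT event_date, SUM(amount) FROM user_events "
                "WHERE user_id = %s GROUP BY event_date", (42,))
    print(cur.fetchall())
conn.close()
```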
In business report scenarios, such as real-time dashboards, real-time risk control, real-time monitoring, and auditing, the core requirement is data freshness: once business data is written, it should be queryable as soon as possible. Freshness matters because it determines how quickly downstream policies can respond. At the same time, we want query performance to be good enough throughout the query process.
In addition, an important feature of these businesses is that they need to interface with commercial BI tools. This means the SQL we handle is highly diverse and must accommodate changing slicing and partitioning needs. On top of that, the data model must be refined enough to meet those diverse needs.
Next, consider operational analysis scenarios, such as broker performance calculation at companies like Lianjia, and group-leader reports in community grocery-shopping applications. A common feature of these businesses is that brokers change constantly and the organizational structure is adjusted frequently, resulting in frequent dimension-table changes.
In addition, these services have certain requirements for query performance and data visibility. Their most important characteristic is the complexity of the computational logic, i.e., the variety of join conditions. The OLAP engine therefore needs to support a flexible data model that is not limited to large, wide tables. In terms of join support, some products currently on the market are still inadequate. To improve performance, users also generally want some materialized-view capability.
In user profiling scenarios, the main requirement is handling large, wide tables. The ClickHouse engine has been widely used in this field. However, some scenarios require combining different tags, and the user profiling business also has high requirements for exact deduplication.
On the engine side, large, wide tables must be supported to meet business requirements. However, when updating a wide table you cannot rewrite two or three thousand columns at a time, so partial-update capability is particularly important. In addition, multi-stream join support and JOIN query optimization are key. On top of this, the engine must support bitmap-based exact deduplication queries to efficiently serve the user profiling business, as sketched below.
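As a hedged sketch of bitmap-based exact deduplication (table and column names are illustrative; `to_bitmap` and `bitmap_union_count` are StarRocks built-ins):

```python
# Sketch: exact UV counting with a pre-aggregated BITMAP column.
import pymysql

DDL = """
CREATE TABLE IF NOT EXISTS page_uv (
    page_id    INT,
    visit_date DATE,
    visitors   BITMAP BITMAP_UNION   -- bitmap of distinct user ids, merged on load
)
AGGREGATE KEY(page_id, visit_date)
DISTRIBUTED BY HASH(page_id) BUCKETS 8
"""

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root",
                       password="", database="demo")
with conn.cursor() as cur:
    cur.execute(DDL)
    # Each user id is folded into the bitmap at import time.
    cur.execute("INSERT INTO page_uv VALUES "
                "(1, '2023-01-01', to_bitmap(100)), "
                "(1, '2023-01-01', to_bitmap(101))")
    # Exact distinct count without scanning raw detail rows.
    cur.execute("SELECT page_id, bitmap_union_count(visitors) AS uv "
                "FROM page_uv GROUP BY page_id")
    print(cur.fetchall())  # expect [(1, 2)]
conn.close()
```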
In order analysis scenarios, real-time data and complex query logic are the two core points. In fact, looking back at the scenarios above, order analysis shares considerable commonality with them in both business characteristics and technical requirements.
The order analysis business demands high data freshness in order to respond quickly to business changes. At the same time, because order data is rich and diverse, the query logic is often complex. This calls for a high-performance, easy-to-use solution that supports complex queries in order analysis scenarios.
When building an OLAP engine product, you need to focus on the following basic aspects:
First, strengthen multi-table join capabilities, including syntax support at the functional level and optimization at the performance level. Multi-table joins are at the core of OLAP queries and essential for handling complex data scenarios.
Second are the capabilities any modern engine solution needs, such as cost-based optimization (CBO) and vectorized execution. These capabilities keep a product competitive in the market and help it solve problems across business scenarios.
In addition, concurrency capability is an important metric. In high-concurrency scenarios, the OLAP engine needs stable performance and scalability. On the data-writing side, performance also needs attention: efficient writing helps OLAP products better meet the requirements of business scenarios.
Other aspects include functional and architectural optimizations, such as development efficiency and UDF (user-defined function) support. Taking Java UDFs as an example: compared with C++ UDFs, Java is easier to use and improves development efficiency.
Finally, consider the ease of operations and maintenance of the architecture. A good OLAP product should have a simple O&M model that is easy to manage and maintain on the platform side.
An open-source data lake and streaming data warehouse solution
The following describes the architecture of open-source data warehouses and data lakes on the Alibaba Cloud E-MapReduce (EMR) platform. First, the overall architecture of EMR.
The lowest layer of the EMR infrastructure is cloud resources, mainly Elastic Compute Service (ECS) and Container Service for Kubernetes (ACK). On top of this, schedulers coordinate and control the data processing flow. In addition, we offer JindoFS, a Hadoop-compatible distributed file system that makes it easy for users to store and manage data.
Next, we will further discuss the diverse applications of computing engines on Alibaba Cloud's EMR platform, including offline batch processing, real-time Flink, and OLAP-related engines.
At present, the typical data warehouse architecture is still dominated by offline batch processing. In this architecture, real-time data is collected through CDC technology and transmitted to a real-time processing engine such as Flink via a message queue such as Kafka. The processed data lands directly into the OLAP engine to support rapid data analysis.
The offline part mainly includes layers such as ODS and DWD, using traditional Hive technology for data processing. However, real-time and offline processing are relatively independent in this architecture, so data alignment becomes a common problem.
To solve this problem, near-real-time data lake architectures have emerged in recent years, built on Delta Lake, Iceberg, Hudi, and similar formats. These new storage formats are designed to improve storage and processing performance while simplifying data alignment. The emerging Apache Paimon also provides effective support for solving data alignment problems.
The real-time data lake architecture is also a common data processing architecture on the EMR platform. In this architecture, real-time data is ingested via CDC or directly from Kafka and processed incrementally at each layer. Compared with the Lambda architecture, the real-time data lake architecture unifies the data pipeline, reducing the workload of data verification and similar steps.
In this architecture, a common OLAP query engine either queries the data lake directly or serves business units as the end of the ADS layer. With a real-time data lake architecture, organizations can process and analyze data more efficiently, improving the agility and accuracy of business decisions.
Let's describe a typical real-time data warehouse architecture. Here, Kafka serves as the message queue and Flink handles data processing at each layer, while the processed data is synchronized to an analytical database such as StarRocks to accelerate user analysis.
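As a hedged sketch of the last hop in that pipeline (host, credentials, and the demo.orders table are placeholders; Stream Load is StarRocks' HTTP bulk-import API, exposed on the FE's HTTP port):

```python
# Sketch: pushing one micro-batch of JSON rows into StarRocks via Stream Load.
import json
import uuid
import requests

rows = [{"order_id": 1, "amount": 9.9}, {"order_id": 2, "amount": 19.9}]

resp = requests.put(
    "http://fe-host:8030/api/demo/orders/_stream_load",  # 8030 = FE HTTP port
    data=json.dumps(rows),
    headers={
        "label": f"orders-{uuid.uuid4()}",  # unique label makes the load idempotent
        "format": "json",
        "strip_outer_array": "true",        # rows arrive as one JSON array
        "Expect": "100-continue",
    },
    auth=("root", ""),
)
print(resp.json())  # Status should be "Success" on a committed load
```

In practice this role is usually played by a Flink connector, but the HTTP surface is the same.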
Real-time data analysis based on StarRocks has both current applications and possible future evolutionary directions. In this architecture we adopt a materialized-view strategy: the underlying data is first synchronized to StarRocks, and then the batch scheduling capability of offline materialized views refreshes the data at each layer.
The main advantage of this architecture is that the entire data analysis process happens inside the StarRocks engine, reducing the need to introduce additional complex engines and components. From a maintenance perspective, this makes the platform more concise and easier to operate and manage.
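A minimal sketch of this strategy, assuming an asynchronous materialized view refreshed on a schedule (the names daily_gmv and dwd_orders are illustrative):

```python
# Sketch: an async materialized view that re-aggregates a DWD table hourly,
# so the ADS layer is maintained inside StarRocks itself.
import pymysql

MV_DDL = """
CREATE MATERIALIZED VIEW IF NOT EXISTS daily_gmv
DISTRIBUTED BY HASH(dt)
REFRESH ASYNC EVERY(INTERVAL 1 HOUR)   -- scheduled batch refresh
AS
SELECT dt, SUM(amount) AS gmv
FROM dwd_orders
GROUP BY dt
"""

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root",
                       password="", database="demo")
with conn.cursor() as cur:
    cur.execute(MV_DDL)
    # Downstream reports simply query the view.
    cur.execute("SELECT * FROM daily_gmv ORDER BY dt DESC LIMIT 7")
    print(cur.fetchall())
conn.close()
```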
An introduction to StarRocks
Let's take a closer look at the architecture and core features of StarRocks.
The core strength of StarRocks is that it handles all of the scenarios mentioned above effectively. It has the following four key features:
- High query performance: StarRocks stands out for superior query performance, returning results quickly to meet users' real-time data needs.
- Efficient data import: StarRocks excels at data import, with high throughput and low latency to ensure fast data import and synchronization.
- Good concurrency support: StarRocks has strong concurrency processing capabilities, supporting many concurrent tasks at once and improving system performance and utilization.
- Rich data models: StarRocks provides a variety of data models to facilitate multi-dimensional analysis; users can select the appropriate model for their actual needs.
In the overall layered architecture on the business side, StarRocks plays a key role in the analytics layer. It enables an extremely fast, unified solution covering the wide range of business scenarios mentioned earlier. With StarRocks' high performance, high throughput, and low latency, users can quickly obtain data and analyze it efficiently. On this basis, StarRocks' rich data models support a variety of processing and analysis methods, further meeting users' multi-dimensional analysis needs.
With StarRocks as the core, the entire ecosystem is complete, including data import, query, etc.
StarRocks has a clear, simple architecture. Overall, it is divided into two roles: FE and BE.
FE is mainly responsible for query parsing and optimization and for generating physical execution plans. FE is designed for high availability, ensuring automatic fault tolerance in the event of a failure: metadata is synchronized via an internally implemented consensus protocol, so the system remains stable even if an FE node goes down.
BE acts as the compute execution engine and, prior to storage-compute separation, also as the storage engine. BE typically employs a multi-replica strategy to keep data safe. When a BE node goes down, its data is automatically migrated without affecting query performance. The system also self-heals, automatically rebuilding missing replicas on other machines to ensure data integrity and consistency.
From a performance perspective, the fully vectorized engine is an important feature of StarRocks. "Fully" is emphasized because an efficient vectorized engine is only possible when there are no weak links anywhere in the processing chain. Many products on the market claim vectorization capabilities, but few engines truly achieve it end to end.
The advantages of StarRocks' fully vectorized engine show up in the following aspects:
- Avoiding performance bottlenecks: the fully vectorized engine processes data efficiently in shuffle and join alike, preventing any single stage from becoming a bottleneck.
- Higher query performance: with vectorization, StarRocks has a significant advantage over traditional engines in the core computation path; for example, virtual function calls and CPU scheduling overhead can be optimized away or amortized.
- Better resource utilization: the fully vectorized engine makes better use of system resources, further improving overall performance.
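As a toy illustration of the principle (plain Python vs. NumPy, not StarRocks code): processing a whole column per call amortizes the per-value dispatch overhead that a tuple-at-a-time engine pays on every row.

```python
# Toy: column-at-a-time (vectorized) vs. row-at-a-time execution.
import time
import numpy as np

amounts = np.random.rand(10_000_000)

t0 = time.perf_counter()
total_rowwise = 0.0
for x in amounts:            # row-at-a-time: one interpreter dispatch per value
    total_rowwise += x * 1.1
t1 = time.perf_counter()

total_vectorized = (amounts * 1.1).sum()  # one call over the whole column
t2 = time.perf_counter()

print(f"row-at-a-time: {t1 - t0:.2f}s  vectorized: {t2 - t1:.2f}s")
```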
The second major performance factor is StarRocks' adoption of a cost-based optimization (CBO) strategy. CBO mainly targets join scenarios: by computing the cost of each join operation, it dynamically adjusts the join order and optimizes the query plan. The diagram is a classic industry reference showing how a CBO engine works; CBO is also covered in CMU's database courses.
Through CBO, StarRocks can reorder and rewrite join operations, supporting a variety of join types and achieving superior performance in complex business scenarios. This is one of the core technologies that lets StarRocks handle diverse multi-table join scenarios.
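To make the idea concrete, here is a toy sketch of cost-based join reordering (not StarRocks internals; the table cardinalities and selectivities are made-up statistics):

```python
# Toy: pick the cheapest left-deep join order by estimated intermediate sizes.
from itertools import permutations

table_rows = {"orders": 10_000_000, "users": 1_000_000, "dim_city": 1_000}
# Assumed selectivity of the join predicate between each table pair.
selectivity = {
    frozenset({"orders", "users"}): 1e-6,
    frozenset({"orders", "dim_city"}): 1e-3,
    frozenset({"users", "dim_city"}): 1e-3,
}

def plan_cost(order):
    """Sum of estimated intermediate cardinalities for a left-deep plan."""
    rows, cost = table_rows[order[0]], 0.0
    joined = {order[0]}
    for t in order[1:]:
        sel = min(selectivity[frozenset({a, t})] for a in joined)
        rows = rows * table_rows[t] * sel   # rough output-size estimate
        cost += rows                        # pay for materializing each step
        joined.add(t)
    return cost

best = min(permutations(table_rows), key=plan_cost)
print("cheapest left-deep order:", " JOIN ".join(best))
```

A real CBO searches a much larger plan space with dynamic programming and real statistics, but the cost-then-compare loop is the same idea.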
StarRocks mainly supports two join modes: shuffle join and colocation join. Together, these two modes enable efficient data processing and analysis.
Shuffle Join: The shuffle join mode, which includes broadcast join, is mainly used for overall aggregation scenarios. In this mode, StarRocks joins tables by redistributing data across nodes, hash-partitioned on the join key (or by broadcasting the smaller table).
Colocation Join: For certain business scenarios, StarRocks recommends colocation join. In this mode, the data distribution of the two tables is kept fully consistent according to business requirements, so queries avoid the latency of remote data transfer and processing efficiency improves.
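A hedged sketch of colocation setup (table and group names are illustrative): both tables must share the same distribution key and bucket count and declare the same colocation group, so matching buckets live on the same node and the join runs locally.

```python
# Sketch: two colocated tables, then a shuffle-free join between them.
import pymysql

COLOCATED_DDL = [
    """
    CREATE TABLE IF NOT EXISTS orders (
        user_id BIGINT,
        amount  DECIMAL(10, 2)
    )
    DISTRIBUTED BY HASH(user_id) BUCKETS 16
    PROPERTIES ("colocate_with" = "grp_user")
    """,
    """
    CREATE TABLE IF NOT EXISTS user_profile (
        user_id BIGINT,
        city    VARCHAR(64)
    )
    -- same distribution key and bucket count: required for colocation
    DISTRIBUTED BY HASH(user_id) BUCKETS 16
    PROPERTIES ("colocate_with" = "grp_user")
    """,
]

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root",
                       password="", database="demo")
with conn.cursor() as cur:
    for stmt in COLOCATED_DDL:
        cur.execute(stmt)
    # This join can now be executed bucket-by-bucket with no data shuffle.
    cur.execute("SELECT o.user_id, SUM(o.amount) FROM orders o "
                "JOIN user_profile u ON o.user_id = u.user_id "
                "GROUP BY o.user_id")
conn.close()
```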
The key query-side performance optimizations of StarRocks were introduced above; next come the import-side features. As the real-time analysis pipeline diagram shows, StarRocks supports real-time import into the primary key model.
The primary key model improves by design on traditional update models, such as Doris's earlier update model, striking a balance between write and query performance. In a traditional update model, imports are fast, but queries may need to merge many small files, resulting in heavy memory operations.
The core advantages of the primary key model are:
- Primary key indexes: When importing data, StarRocks first builds a primary key index so it knows which historical file each key resides in. Based on this information, delete information can be updated, avoiding invalid reads at query time.
- Efficient implementation: Despite introducing primary key indexes, StarRocks keeps write performance largely unaffected. The index implementation is efficient, so overall speed differs little from the traditional import path.
- Query performance optimizations: Thanks to Delete Vector information, StarRocks does not need to sort-merge at read time, and predicates can be pushed down, further improving query performance.
- Materialized views: Since version 2.5, StarRocks' support for materialized views has been relatively complete. Materialized views can dramatically improve real-time analytics performance, especially over incremental data.
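A hedged sketch of the primary key model in action (table and column names are illustrative): loading a row whose key already exists upserts it in place instead of leaving multiple versions to merge at read time.

```python
# Sketch: a primary-key-model table where repeated keys are upserted.
import pymysql

PK_DDL = """
CREATE TABLE IF NOT EXISTS orders_pk (
    order_id BIGINT,
    status   VARCHAR(16),
    amount   DECIMAL(10, 2)
)
PRIMARY KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 8
"""

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root",
                       password="", database="demo")
with conn.cursor() as cur:
    cur.execute(PK_DDL)
    cur.execute("INSERT INTO orders_pk VALUES (1, 'created', 9.90)")
    # Same key again: the primary key index locates the old row, which is
    # marked dead via a delete vector, so reads never merge stale versions.
    cur.execute("INSERT INTO orders_pk VALUES (1, 'paid', 9.90)")
    cur.execute("SELECT * FROM orders_pk WHERE order_id = 1")
    print(cur.fetchall())  # a single row with status = 'paid'
conn.close()
```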
StarRocks is committed to giving users a better analytics experience, especially in query performance. To that end, StarRocks focuses on user analytics, aiming to attract users of products such as Presto and Impala so they can enjoy StarRocks' upper-layer query optimization capabilities without sacrificing performance.
StarRocks has achieved remarkable results here. As the chart below shows, StarRocks has delivered roughly a 3-5x performance improvement over competitors such as Trino and Presto in most benchmarks and real-world customer cases. This is the result of continuous optimization of the query engine and underlying architecture, giving users a more efficient and stable analysis solution.
The above is another performance report.
Starting with version 2.3, StarRocks introduced the Pipeline engine, which aims to further improve CPU utilization. In concurrent scenarios, StarRocks achieves better resource isolation on top of the pipeline engine, allowing it to allocate resources flexibly across large queries, small queries, and ETL tasks. For example, when heavy ETL tasks run without resource isolation, other queries can be severely impacted; StarRocks' resource isolation effectively reduces this impact and keeps the system running stably.
Resource isolation is one of StarRocks' core capabilities, with significant benefits in concurrent scenarios. By improving CPU utilization and refining the resource isolation mechanism, StarRocks provides users with a more efficient and stable analysis solution for complex scenarios.
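A hedged sketch of how this surfaces to users, via resource groups (syntax per our reading of StarRocks; the group name, user, and limits are illustrative, not a definitive configuration):

```python
# Sketch: cap the ETL account's resources so ad-hoc queries stay responsive.
import pymysql

RG_DDL = """
CREATE RESOURCE GROUP etl_rg
TO (user = 'etl_user')              -- classifier: which sessions it binds to
WITH (
    'cpu_core_limit'    = '8',      -- CPU share cap under contention
    'mem_limit'         = '40%',    -- share of query memory per BE
    'concurrency_limit' = '4'
)
"""

conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="")
with conn.cursor() as cur:
    cur.execute(RG_DDL)
conn.close()
```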
The last core capability is data balancing. Keeping data balanced across nodes, combined with storage-compute separation, allows StarRocks to scale elastically.
When a new node is added, StarRocks automatically redistributes data evenly onto it, keeping each node's storage balanced. As for replicas, even if one is lost, StarRocks recovers it automatically: as long as at least one of the replicas survives, StarRocks guarantees data integrity and reliability.
Planning for the future
Key points in the evolution of StarRocks 3.x include:
- Storage-compute separation: This is one of the core optimizations of the StarRocks 3.x release.
- Lakehouse: StarRocks 3.x will make it easier to achieve multi-warehouse and multi-workload capabilities on top of storage-compute separation. For ETL scenarios, StarRocks also continues to optimize and improve its own capabilities.
- Scenario optimization: Last year, StarRocks focused on the lakehouse scenario and its capabilities there are now fairly mature; many customers already use it, and users interested in this scenario are encouraged to try it.
- ETL capability optimization: StarRocks has been optimized for scenarios such as computing results and landing them into storage, and supports incremental materialized views. While materialized views refresh in real time, the import side is also being unified.
- Simplified user experience: StarRocks is committed to simplifying import methods and reducing the learning cost for users, providing an appropriate import method for each scenario. Snowflake does this very well, and StarRocks will draw on its experience to optimize the user experience.
- Semi-structured data type support: StarRocks 3.x adds support for semi-structured data types to meet the needs of users in the corresponding scenarios.

In short, StarRocks 3.x includes optimizations and upgrades across storage-compute separation, lakehouse, ETL capabilities, user experience, and semi-structured data type support. These improvements will help users handle various business scenarios more efficiently and improve the performance of big-data analysis.