Parameterized queries are queries that share the same template and differ only in the values bound to their predicate parameters; they are widely used in modern database applications. Their repetitive nature creates an opportunity to optimize their performance.
However, many current commercial databases handle parameterized queries by simply optimizing the first query instance (or a user-specified instance), caching its plan, and reusing that plan for all subsequent instances. Although this minimizes optimization time, executing the cached plan can be arbitrarily suboptimal, because different query instances may have different optimal plans, which makes the approach problematic in practical application scenarios.
Most traditional optimization methods rely on a number of assumptions about the query optimizer, and these assumptions often do not hold in real-world scenarios. Fortunately, with the rise of machine learning, these problems can be addressed effectively. This post focuses on two papers, published at VLDB 2022 and SIGMOD 2023:
1) "Leveraging Query Logs and Machine Learning for Parametric Query Optimization" (VLDB 2022)
2) "Kepler: Robust Learning for Faster Parametric Query Optimization" (SIGMOD 2023)
Paper 1 Highlights
This paper decouples parameterized query optimization into two subproblems:
1) PopulateCache: cache k plans for a query template;
2) GetPlan: for each query instance, select the best plan from the cached plans.
The algorithm architecture of this paper is shown in the figure below. There are two main modules: the PopulateCache module and the GetPlan module.
PopulateCache uses the information in the query log to cache k plans covering the query instances. The GetPlan module first collects cost information between the k plans and the query instances by interacting with the optimizer, and uses this information to train a machine learning model. The trained model is deployed in the DBMS; when a query instance arrives, the best plan for that instance can be determined quickly.
The PopulateCache module is responsible for identifying the set of plans to cache for a given parameterized query. Its search phase uses two optimizer APIs:
optimizer call: returns the plan the optimizer selects for a query instance;
recost call: returns the optimizer-estimated cost of a given candidate plan for a given query instance.
The algorithm proceeds in three phases:
plan-collection phase: issue an optimizer call for each of the n query instances in the query log to collect candidate plans;
plan-recost phase: issue a recost call for each (query instance, candidate plan) pair to form a plan-recost matrix;
k-set identification phase: run a greedy algorithm over the plan-recost matrix to select the k plans to cache, minimizing suboptimality.
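The k-set identification phase can be sketched as a greedy set selection over the plan-recost matrix. The following is a minimal illustration, not the paper's exact algorithm: `recost` is assumed to be an (instances × plans) matrix of optimizer-estimated costs, and each step adds the plan that most reduces the total workload cost.

```python
import numpy as np

def populate_cache(recost, k):
    """Greedily choose k plan indices (columns of the plan-recost
    matrix) so that the total cost, when every query instance uses its
    cheapest cached plan, is as small as possible."""
    n_instances, n_plans = recost.shape
    chosen = []
    best = np.full(n_instances, np.inf)  # per-instance cost under current cache
    for _ in range(min(k, n_plans)):
        # Total workload cost if each candidate plan were added next.
        totals = [np.minimum(best, recost[:, p]).sum()
                  if p not in chosen else np.inf
                  for p in range(n_plans)]
        p_star = int(np.argmin(totals))
        chosen.append(p_star)
        best = np.minimum(best, recost[:, p_star])
    return chosen, float(best.sum())
```

Greedy selection is a natural fit here because the objective (workload cost under a "pick the cheapest cached plan" rule) only improves as plans are added.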
The GetPlan module is responsible for selecting one of the k cached plans to execute for a given query instance. The GetPlan algorithm can pursue one of two objectives: minimize the optimizer-estimated cost, or minimize the actual execution cost among the k cached plans.
For objective 1, a supervised ML model is trained on the plan-recost matrix; both classification and regression formulations are considered.
For objective 2, a model is trained with reinforcement learning based on a multi-armed bandit.
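For objective 1, the supervised formulation can be sketched as follows: label each training instance with the index of its cheapest cached plan (from the plan-recost matrix), then learn a mapping from bind-parameter features to that label. For illustration this uses a simple 1-nearest-neighbor classifier; the paper considers richer classification and regression models, and the feature encoding here is an assumption.

```python
import numpy as np

class GetPlanModel:
    def fit(self, features, recost):
        self.X = np.asarray(features, dtype=float)
        # Label = index of the cheapest cached plan per instance.
        self.y = np.asarray(recost).argmin(axis=1)
        return self

    def get_plan(self, instance_features):
        # Predict the plan of the nearest training instance.
        q = np.asarray(instance_features, dtype=float)
        dists = np.linalg.norm(self.X - q, axis=1)
        return int(self.y[dists.argmin()])
```

At serving time, `get_plan` replaces a full optimizer call with a cheap model lookup, which is where the optimization-time savings come from.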
Paper 2 Highlights
This paper proposes Kepler, an end-to-end, learning-based parameterized query optimization method that aims to both reduce query optimization time and improve query execution performance.
Kepler likewise decouples the problem into two parts: plan generation and learning-based plan prediction. It has three main phases: a plan generation strategy, a training query execution phase, and a robust neural network model.
As shown in the figure above, query instances from the query log are fed into the Kepler Trainer, which first generates candidate plans, then collects execution statistics for those plans and uses them as training data for a machine learning model. Once trained, the model is deployed in the DBMS; when a query instance arrives, the Kepler Client predicts its best plan and executes it.
The paper proposes a candidate plan generation algorithm called Row Count Evolution (RCE), which generates candidate plans by perturbing the optimizer's cardinality estimates.
The idea behind the algorithm is that cardinality misestimation is the main cause of optimizer suboptimality, and that the candidate generation phase only needs to include each instance's optimal plan somewhere in the candidate set, rather than commit to a single optimal plan up front.
The RCE algorithm first obtains the optimizer's plan for a query instance, then perturbs the join cardinalities of its sub-plans within exponentially spaced ranges, repeating the perturbation over many samples and many generations, and finally keeps all generated plans as candidates. A concrete example follows:
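The evolution loop above can be sketched as follows, under simplifying assumptions: `plan_for` is a hypothetical stand-in for an optimizer call that returns a plan given a set of sub-plan cardinalities, and the perturbation factors, generation count, and branching factor are illustrative, not the paper's tuned values.

```python
import random

def row_count_evolution(base_cards, plan_for, generations=3,
                        children_per_gen=4, seed=0):
    """Collect candidate plans by repeatedly perturbing sub-plan
    cardinalities with exponentially spaced multiplicative factors."""
    rng = random.Random(seed)
    factors = [2 ** e for e in range(-3, 4)]  # 1/8x .. 8x perturbation
    candidates = {plan_for(base_cards)}
    parents = [base_cards]
    for _ in range(generations):
        children = []
        for cards in parents:
            for _ in range(children_per_gen):
                # Perturb each sub-plan cardinality by a random factor.
                child = {sub: max(1, int(c * rng.choice(factors)))
                         for sub, c in cards.items()}
                candidates.add(plan_for(child))
                children.append(child)
        parents = children  # next generation evolves from these
    return candidates
```

Because factors compound across generations, later generations explore cardinalities far from the original estimate while early generations stay close to it.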
Candidate plans generated by RCE may outperform those produced by the optimizer: since the optimizer's cardinality estimates can be wrong, repeatedly perturbing them gives RCE a chance to produce the plan that is optimal under the true cardinalities.
Once the candidate set is obtained, each plan is executed for each query instance in the workload, and the actual execution times are collected as training data for the supervised best-plan model. Because this process is expensive, the paper proposes mechanisms to accelerate training-data collection, such as parallel execution and an adaptive timeout mechanism.
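The adaptive timeout idea can be sketched as follows: once one candidate has finished, slower candidates need not run to completion, only long enough to show they are worse. This is an illustration of the general mechanism, not Kepler's exact policy; `run_plan` is a hypothetical executor that returns a plan's runtime, or None if the timeout fires, and the slack factor is an assumption.

```python
def collect_latencies(plans, run_plan, slack=1.1):
    """Measure each plan's latency, cancelling runs that exceed the
    best time seen so far (times a small slack factor)."""
    best = float("inf")
    latencies = {}
    for plan in plans:
        timeout = best * slack if best < float("inf") else None
        t = run_plan(plan, timeout=timeout)
        if t is not None:
            latencies[plan] = t
            best = min(best, t)
        else:
            # Timed out: record the bound as a lower estimate of its cost.
            latencies[plan] = timeout
    return latencies
```

Since the goal is only to rank plans (and find the best one), recording "at least this slow" for timed-out plans loses nothing that the training labels need.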
A neural network is then trained on the collected execution data to predict the best plan for each query instance. The network used is a Spectral-normalized Neural Gaussian Process (SNGP), which stabilizes the network, helps training converge, and provides uncertainty estimates for its predictions. When the uncertainty estimate exceeds a threshold, plan selection falls back to the optimizer, which avoids performance regressions to a certain extent.
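The uncertainty-gated fallback can be sketched in a few lines. Here `model_predict` and `optimizer_plan` are hypothetical stand-ins for the trained SNGP model and the DBMS optimizer, and the threshold value is illustrative.

```python
def choose_plan(model_predict, optimizer_plan, query, threshold=0.2):
    """Use the learned model's plan only when it is confident;
    otherwise defer to the optimizer's default plan."""
    plan, uncertainty = model_predict(query)
    if uncertainty > threshold:
        # Model is not confident: fall back to the optimizer's plan
        # to avoid a potential performance regression.
        return optimizer_plan(query)
    return plan
```

This gate is what makes the approach "robust": on out-of-distribution instances the system degrades to the optimizer's behavior rather than executing a badly mispredicted plan.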
Summary
Both papers decouple parameterized query optimization into two parts along the lines of PopulateCache and GetPlan. The comparison between the two is shown in the table below.
Algorithms based on machine learning models select plans well, but their training-data collection process is expensive, and the models are not easy to generalize or update. There is therefore still room for improvement in existing parameterized query optimization methods.
References:
1) Kapil Vaidya & Anshuman Dutt, "Leveraging Query Logs and Machine Learning for Parametric Query Optimization", VLDB 2022.
2) Lyric Doshi & Vincent Zhuang, "Kepler: Robust Learning for Faster Parametric Query Optimization", SIGMOD 2023.