Algebraic optimization is one of the key aspects of query optimization, which involves the transformation and improvement of relational algebraic expressions. These expressions are typically generated by query analysis and inspection steps during query processing. By applying a series of algebraic rules and heuristics, the goal of algebraic optimization is to find a lowest-cost execution plan.
Here are some basic rules for relational algebraic expression transformations, which are commonly used tools in the algebraic optimization process:
Commutativity: Join and Cartesian product operations are full of ** reversals, i.e., changing the order of operations does not affect the result.
Cartesian product commutative law: $e 1 times e2 = e2 times e1 $$
Commutative law of natural connections: $e 1 bowtie e2 = e2 bowtie e1 $$
Commutative law of natural connection: $e 1 bowtief e2 = e2 bowtief e1$$
Associativity: The connection and Cartesian product also satisfy the law of union, which allows us to reorganize the way operations are combined without changing the result.
Cartesian product associativity: $e 1 times e2) times e3 = e1 times (e2 times e3) $
Natural connection bonding law: $e 1 bowtie e2) bowtie e3 = e1 bowtie (e2 bowtie e3) $
Conditional connection associative law: $e 1 bowtie1} e2) bowtie2} e3 = e1 bowtie1} (e2 bowtie2} e3) $
Select the law of cascading of selection: The selection actions can be performed in tandem and in any order. The merge condition (and connection condition) can be broken down into multiple selection actions.
The law of commutation of the order of operations is chosen: $sigma( sigma(e)) = sigma( sigma(e)).
The law of conjunctional conditional decomposition: $sigma(e) = sigma(e) cap sigma(e) $
The law of projection operation concatenation: If a projection's property set is a subset of another projection property set, the two projection operations can be combined into one.
pi( pi(e)) = pi(e) when (ai subseteq b j)$$
Select the commutation with other operations: In some cases, the selection operation can be exchanged with the Cartesian product, projection, and join operations, so that small data sets can be filtered first, reducing the amount of data for subsequent operations.
Select and Cartesian product: $sigma(e1 times e2) = ( sigma(e1)) times e2 $$ if all the attributes involved in (f) are all properties in ($e 1$).
Selection and projection: $sigma( pi(e)) = pi( sigma(e)) if (f) only relates to the property of (a).
Distributivity: The selection operation is distributive for union and natural join operations, while the projection operation is distributive for Cartesian product sum operations.
Choice and parallel: $sigma(e1 cup e2) = sigma(e1) cup sigma(e2) $
Choice and difference arithmetic: $sigma(e1 - e2) = sigma(e1) -sigma(e2) $
Choose to connect with nature: $sigma(e1 bowtie e2) = ( sigma(e1)) bowtie ( sigma(e 2)) if (f) involves only common attributes.
The application of these transformation rules allows us to transform complex query expressions into a series of more efficient operational steps. For example, you can significantly reduce query costs by postponing join operations on large tables and performing selection and projection operations on small tables first.
The goal of algebraic optimization
The ultimate goal of algebraic optimization is to reduce disk operations during queries, as disk access is often the most time-consuming part of database operations. In addition, the optimizer also tries to reduce the size of intermediate results, as well as reduce the complexity of CPU calculations.
In a real DBMS, algebraic optimization is usually automated. The optimizer selects the best execution plan based on database statistics, such as the size of the table, the presence or absence of indexes, and the distribution characteristics of the data. However, database administrators and experienced users can assist or guide the selection of optimizers through query rewriting or by using specific query hints.
Overall, algebraic optimization is an integral part of query optimization, ensuring that the database system can execute the user's query requests in the most efficient manner.
Choose the action precedence principle: The basis of this rule is that filtering out irrelevant data early can significantly reduce the amount of data processed by subsequent operations. By applying selection (filtering) criteria early in the query, you can avoid complex operations on unwanted data.
Projection operation priority principle: Similar to the selection operation, the projection (determining which columns are needed) operation should also be done as early as possible to reduce the width of the processing data, especially before the join operation.
Cartesian product merge rule: Cartesian products produce a large number of intermediate results, often unnecessary. Therefore, Cartesian products should be combined with other operations such as selection and projection as much as possible to reduce the generation of intermediate results.
Extract common expression rules: If a computation is repeated in multiple places and the result set of the computation is small, it should be evaluated once and the results cached for subsequent operations.
Suppose we want to retrieve all the electives from the database with the numbers for'02'The name of the student for the course. The SQL query statement might look like this:
select s.sname from s, sc where s.sno = sc.sno and sc.cno = '02';In the absence of optimization, the database may perform a Cartesian product before applying the selection condition, which is extremely inefficient. After applying the heuristics, we can refine the query execution plan as follows:
Select Action PriorityFirst, we apply the selection criteriasc.cno='02'
to filtersc
table, which reduces the dataset size for subsequent operations.
Projection operations take precedence: Secondly, we only chooses.sname
withs.sno
, as well as those who meet the conditionssc.sno
to further reduce the width of the data.
Merge Cartesian products: We perform the join operation immediately after the selection operation, which avoids the unnecessary generation of a large amount of intermediate data.
Extract common expressions: Ifsc.cno='02'
is a common query condition whose results can be cached for quick use by subsequent queries.
The optimized query execution plan is as follows:
select s.sname from s join sc on s.sno = sc.sno where sc.cno = '02';In this optimized query plan, we start with the
sc
The table applies a selection criterion, and then performs join operations only on this filtered result set. This strategy will greatly improve the efficiency of the query.
From these examples, we can see that proper query optimization can lead to a huge performance improvement in a database system. This process is usually automated, and users don't have to worry about the underlying execution details, but understanding these optimization principles can help you write more efficient queries.
Physical optimization is an important part of query optimization, which involves the selection of underlying data access methods and execution algorithms. The aim is to passChoose the most efficient access path and operating algorithmto reduce resource consumption and improve query speed during query execution. Physical optimization typically includes heuristic-rule-based optimization and cost-estimation-based optimization.
In physics optimization, heuristic rules provide a set of rules of thumb that can quickly guide the choice of access path when there are no detailed statistics.
Select the heuristic rule for the action
For queries with small relationships, full table scans tend to be the fastest, even if an index is present.
For primary key equivalent queries, a primary key index is typically used because the result is only one tuple at most.
For equivalent queries that are not primary attributes, index scans should be considered if there is an index and the number of tuples expected to be returned is small (less than 10% of the total); Otherwise, it may be more efficient to use a full-table scan.
For non-equal or range queries, index scans should likewise be prioritized if the result set is small and has an index.
For multi-criteria queries with AND joins, if there is a composite index that involves multiple criteria, the composite index takes precedence.
For OR joined queries, a full table scan is typically used.
Heuristic rules for connection operations
If both tables are already sorted by join attributes, you should use sort-merge joins.
If one of the tables has an index on the join property, an index join is usually preferred.
If both tables are large and don't have an indexing advantage, consider using a hash join.
If none of the above rules apply, consider using nested loop joins, especially if one table is much smaller than the other.
Let's say we have a simple query: fromemployees
All employee records with a department of "Sales" are retrieved in the table. Our database contains the following statistics:
employees
Tables sharedb
blocks.
Each chunk can containr
records.
The department column has a non-clustered index.
Cost-based optimization calculates the execution cost of each operation algorithm, which requires knowing the state of the database, including:
The total number of tuples (n) for the table.
The average length of the tuple of the table (l).
The number of data blocks occupied by the table (b).
The number of different values for each field (m).
Field selection rate (s).
The number of layers of the index (l).
The selection cardinality (s) of the index.
Based on the above information, the query optimizer can employ the following algorithm to estimate the cost of executing the query:
Estimation of the cost of the full table scanning algorithm
Cost ( cost = b ).
If the selection criterion is Master Code = Value, the average search cost (cost = frac$).
Estimation of the cost of the index scan algorithm
If the selection condition is "code = value", then the main index of the table is used, if it is b+ tree and the number of layers is l, then the cost (cost = l + 1).
Estimation of the cost of nested loop-join algorithms
For nested loop joins of two tables, if one of the tables is smaller, you can put the smaller table in the inner loop at the cost ( cost = b + frac times b }$
Sorting-Merge Join Algorithm Cost Estimation
If the joined tables are already ordered by the join attributes, the cost ( cost = b + b + frac times n times n}}$
Where: $b$) and ($b$) are the number of data blocks occupied by the two join tables, respectively.
k ) is the amount of data blocks in memory that can be used to store data.
f$ is the join selectivity, which represents the proportion of the number of tuples of the join result.
m $ is the block factor that holds the result of the connection, indicating the number of result tuples that can be stored in each block.
Assumptionsemployees
The table has 1000 data blocks, and the non-clustered index of the department column has 3 layers, and we want to use this index to quickly locate employees in the "sales" department.
Estimate the cost of a full table scan
cost ( cost = 1000$ ).
Estimate the cost of an index scan
Cost (cost = 3 + 1 = 4$) assumes that the index scan directly targets the desired record).
If the percentage of employees in the "Sales" department is less than 10%, index scanning would be a more efficient method, so we chose it as our query strategy. In more complex queries, there will be more factors to consider, such as join operations, parallel execution, and so on.