About the Author:
Jiang Jian,Founder of Thin Ice Technology, Oracle ACE, 11G OCM, has many years of experience in Oracle design, management and implementation, and is proficient in database optimization, Oracle CBO and parallelism. Eagle Eye monitors the core design and developer, and is a senior python web developer.
Background
When I visited the customer's site recently, the customer reported that the database had encountered a "bug" a few days ago, and I hope we can help troubleshoot it.
The customer feedback phenomenon is as follows: there is a very core system database, suddenly one day after logging in with xxx schema, the running SQL is running in parallel state, and the DOP is very high, and in the case of multiple fault positioning failures, you can only temporarily use the parallel max servers parameter to reduce the concurrency to 32 to alleviate the pressure of excessive concurrency.
Fault acknowledgment
Let's take a look at the situation of the database in the first few days, and the failure period is indeed very high.
Switching to a single SQL operation ** situation, the DOP is very high, and then it is throttled, which is consistent with the customer description.
I opened the monitoring report next to it, and the DOP was significantly downgraded.
The initial parallelism of 192 is eye-catching, and the logical CPU is 96, which is very similar to the default concurrency of the database when automatic parallelism.
Preliminary investigation
Under the leadership of the customer, it was confirmed that the failure period was basically consistent with the customer's description. Listening to the description alone, it seems that there is a trigger for adjusting concurrency at the session level or something related to the Resource Manager.
However, this is the customer's core production system, and the adjustment of the conjecture should be theoretically impossible. Through a quick investigation, it was confirmed that the conjecture was incorrect.
Secondary checking
After the initial investigation, I couldn't help but wonder if there was a problem with the customer's description. The top-level activities are filtered by user, and suddenly it is found that there is not parallelism at the user level, and many SQL running under xxx users do not have parallelism, in fact, there is only one SQL with a high degree of concurrency. This also shows that the initial investigation direction has been deviated.
This SQL text is not specified in parallel with hints, combined with the observed information, a little skeptical, and want to find the application to confirm whether there are any recent changes. Another DBA of Party A said categorically that it must not have been changed, not only has it not been changed, but he also suspects that it is an Oracle bug.
When he said that it might be a bug, I wanted to apply for permission to do a trace on the machine, but before I said it, I asked him why he suspected it was a bug. He said something that made me determined to write an article to record this case: "Because of the previous SQL execution exception, he would use SQL Profile to fix the SQL execution plan, but after this fixation, it was completely ineffective."
The culprit of the shock
There is actually a sqlprofile? What surprised me was why I ignored the SQLprofile SQL Baseline troubleshooter.
The main reason is that they have optimized their system before, and they know that it is very difficult for them to make changes to the core, and it is often necessary to demonstrate in detail the benefits of adding indexes, as well as the possible harms, first in the test, quasi-production environment to verify the reduction ratio of logical reads, and then demonstrate what SQL is affected, and describe the impact on Redo.
Take these materials and then go to the development team (well, foreign-funded enterprises), make an appointment for a meeting, get n more email confirmation, and then schedule production (only one day a week can be produced in the early morning), the next day after production, you need to observe in the factory to ensure the safety of the production warehouse.
Based on what I know about their process, I almost forgive myself for this sqlprofile oversight. After looking at the execution plan corresponding to SQL, there is a sqlprofile, and it starts with sys, which is generally out of addm sql tune.
Confirm this,The problem is basically located,Verify it immediately,Sure enough, this sys's SQLprofile makes SQL use high parallelism,And let the customer's DBA newly generated SQLprofile not take effect,Combined with audit and other information,Later, it was also confirmed that it was indeed another DBA of Party A that ran the ADDM report at night,And saw the benefits of the recommendations given by the report,Very tempting,So quickly adopted the suggestion,And took a few days off。
Review and reflect
This case is not strong in terms of skills, and there are many points worth reflecting on, but what I feel most deeply is that the supervision of the operation and maintenance behavior of operation and maintenance personnel in the operation and maintenance of the system database is seriously insufficient. However, it is this kind of high-authority and unsupervised O&M that frequently brings failures, and they are relatively difficult to troubleshoot, at least often faults that are difficult for Party A's DBA to locate.
There are many such cases in historical articles, such as secretly collecting system statistics, omitting to add disk commands, changing parameters without restarting, and failing after restarting. The way to make up for this kind of failure is to become a rule and join the inspection, so is there any good practice to anticipate the enemy's opportunity and prevent it from happening? Friends who have mature practices or are concerned about this kind of issue are welcome to communicate in the comment area.