How To Approach Different Oracle Database Performance Problems
Jump Start Your Oracle Database Tuning Effort
Every Oracle Database Administrator will tell you no two performance problems are the same. But a seasoned Oracle DBA recognizes there are similarities...patterns. Fast problem pattern recognition allows us to minimize diagnosis time, so we can focus on developing amazing solutions.
I tend to group Oracle performance problems into four patterns. Quickly exploring these four patterns is what this article is all about.
You Can Not Possibly List Every Problem And Solution
When I teach, some Oracle Database Administrators want me to outline every conceivable problem along with the solution. Not only is the thought of this exhausting, it's not possible. Even my Stori product uses pattern matching. One of the keys to becoming a fantastic performance analyst is the ability quickly look at a problem and then decided which diagnosis approach is the best. For example, if you don't know the problem SQL (assuming there is one) tracing is not likely to be your best approach.
The Four Oracle Database Performance Patterns
Here are the four performance patterns I tend to group problems into.
The SQL Is Known
Many times there is a well know SQL statement that is responsible for the poor performance. While I will always do a quick Oracle Time Based Analysis (see below) and verify the accused SQL, I will directly attack this problem by tuning with SQL specific diagnostic and tuning tools.
But... I will also ask a senior application user, if the users are using the application correctly. Sometimes new applications users try and use a new application like their old application. It's like trying to drive a car with moving your feet as you are riding a bicycle... not going to work and it's dangerous!
Business Process Specific
I find that when the business is seriously affected by application performance issues, that's when the "limited budget" is suddenly not so limited. When managers and their business's are affected they want action.
When I'm approached to help solve a problem, I always ask how the business is being affected. If I keep hearing about a specific business process or application module I know two things.
First, there are many SQL statements involved. Second, the problem is bounded by a business process or application. This is when I start the diagnostic process with an Oracle Time Based Analysis approach which, will result in multiple solutions to the same problem.
As I teach in my online seminar, How To Tune Oracle With An AWR Report, user feel performance through time. So, if our analysis is time based we can create a quantitative link between our analysis and their experience. If our analysis creates solutions that reduce time, then we can expect the user experience to improve. This combined with my "3 Circle" approach yields spot-on solutions very quickly.
While an Oracle Time Based Analysis is amazing, because Oracle does not instrument CPU consumption we can't answer the question, "What's Oracle doing with all that CPU?" If you want to drill into this topic check out my online seminar, Detailing Oracle CPU Consumption: The Missing Link.
It's Just Slow
How many times have I experienced this... It's Just Slow!
If what the user is attempting to explain is true, the performance issue is affecting a wide range of business processes. The problem is probably not a single issue (but could be) and clearly the key SQL is not know. Again, this is a perfect problem scenario to apply an Oracle Time Based Analysis.
The reason I say this is because an OTBA will look at the problem from multiple perspectives, categorize Oracle time and develop solutions to reduce those big categories of time. If you also do Unit Of Work Time Based Analysis, you can an even anticipate the impact of your solutions! Do an OraPub website search HERE or search my blog for UOWTBA.
Random Incident That Quickly Appears And Vanishes
This is the most difficult problem to fix. Mainly because the problem "randomly" appears and can't be duplicated. (Don't even bother calling Oracle Support to help in this situation.) Furthermore, it's too quick for an AWR report to show it's activity and you don't want to impact the production system by gathering tons of detailed performance statistics.
Even a solid Oracle Time Based Analysis will likely not help in this situation. Again, the problem is performance data collection and retention. The instrumented AWR or Statpack data does not provide enough detail. What we need step-by-step activity...like a timeline.
Because this type of problem scares both DBAs and business managers, you will likely need to answer questions like this:
What is that blip all about?
Did this impact users?
Has it happened before?
Will it happen again?
What should we do about it?
The only way I know how to truly diagnose a problem like this is to do a session-level time-line analysis. Thankfully, this is possible using the Oracle Active Session History data. Both v$active_session_history and dba_hist_active_sess_history are absolutely key in solving problems like this.
ASH samples Oracle Database session activity once each second (by default). This is very different than measuring how long something takes, which is the data an AWR report is based upon. Because sampling is non-continuous, a lot of detail can be collected, stored and analyzed.
A time-line type of analysis is so important, I enhanced my ASH tools in my OraPub System Monitor (OSM) toolkit to provide this type of analysis. If you want to check them out, download the OSM toolkit HERE, install it and read the osm/interactive/ash-readme.txt file.
As an example, using these tools you can construct an incident time-line like this:
HH:MM:SS.FFF User/Process Notes ------------ ------------- ----------------- 15:18:28.796 suspect (837) started the massive update (see SQL below) 15:28:00.389 user (57) application hung (row lock on TM_SHEET_LINE_EXPLOR) 15:28:30.486 user (74) application hung (row lock on TM_SHEET_LINE_EXPLOR) 15:29:30.??? - row locks becomes the top wait event (16 locked users) 15:29:50.749 user (83) application hung (row lock on TM_SHEET_LINE_EXPLOR) 15:30:20.871 user (837) suspect broke out of update (implied) 15:30:20.871 user (57) application returned 15:30:20.871 user (74) application returned 15:30:20.871 user (83) application returned 15:30:30.905 smon (721) first smon action since before 15:25:00 (os thread startup) 15:30:50.974 user (837) first wait for undo - suspect broke out of update 15:30:50.974 - 225 active session, now top event (wait for a undo record) 15:33:41.636 smon (721) last PQ event (PX Deq: Test for msg) 15:33:41.636 user (837) application returned to suspect. Undo completed 15:33:51.670 smon (721) last related event (DFS lock handle)
Without ASH seemingly random problems would be a virtually impossible nightmare scenario for an Oracle DBA.
It's true. You need the right tool for the job. And the same is true when diagnosing Oracle Database performance. What I've done above is group probably 90% of the problems we face as Oracle DBAs into four categories. And each of these categories needs a special kind of tool and/or diagnosis method.
Once we recognize the problem pattern and get the best tool/method involved to diagnosis the problem, then we will know the time spent developing amazing solutions is time well spent.
Enjoy your work!
If you have any questions or comments, feel free to email me directly at craig at orapub.com.
|Top Five Holiday Gifts For The Oracle DBA||Focus On Oracle Database Training...Rest Of 2011||A Duck In And Out Of The Oracle Water|