How To Use AWR Data To Create Awesome Wait Event Based Histograms
In this post I will walk you through how to quickly and selectively pull dba_hist_event_histogram data and transform it so it can be quickly pasted into an R script creating two pages of histograms.
Many times I find myself solving Oracle performance problems that have already happened. Perhaps the incident occurred a few hours ago, yesterday, last week or even last month. But regardless, I have a nice juicy problem to solve.
If I need to understand a specific Oracle wait event timing situation in more depth than the average wait time, I'm in luck. That is, if either Statspack or AWR data is being collected and retained. That's because both AWR and Statspack store v$event_histogram data. In AWR, the table name is dba_hist_event_histogram.
In my previous post I walked you though how to interactively gather wait event timing details from v$event_histogram using my free OSM tool swEhVdGet.sql and creating a proper histograms using my R script swEhCharts.r. In this post, it's all about getting historical data from the AWR table dba_hist_event_histogram.
How to easily pull targeted dba_hist_event_histogram data, create some basic statistics and histograms, and then understand the information is what this post is all about!
I You Are Just Joining Us...
Over the past weeks I have blogged and made the focus of my Firefighting Friday "How To" Webinars about installing and using the free statistical package R primarily for understanding wait times for a specific wait event. We did this by examining some basic statistics and histograms. If you just joined this thread, you may want to review what I've already covered.
Here is a partial list of resources.
- YouTube Video. How To Install R for Oracle DBAs
- Blog. How To Quickly Install The Free Statistics Package R
- Blog. How To Create Histograms Using The Free Statistics Package R
- Tool. OraPub System Monitor (OSM) toolkit, readme file for event histogram related tools, osm/interactive/readmeEventHist.txt
- Blog. How To Create Awesome Wait Event Based Histograms
- Firefighting Friday "How To" Webinars (recorded). Installing R, using [v$,dba_hist_]event_histogram data with OSM tools to create proper histograms in R
- OraPub Online Video Seminar. Using Skewed Performance Data To Your Advantage
Gathering Wait Event Times To Create A Histogram
In my OSM Toolkit there are a number of tools to gather/report Oracle wait event times. I am going to focus on pulling the data out of the AWR table dba_hist_event_histogram using the swEhAwrGet.sql script and then copy/paste its output into the R script swEhCharts.r resulting in a magnificent two-page histogram montage. It's pretty cool.
The script has only two main parts. The first part helps you quickly pick the best AWR snap IDs. (If you want to develop your own set of AWR script, this first part of the script is a great place to start.) The second part, pulls the appropriate data from dba_hist_event_histogram and outputs it into a format so you can simply copy and paste the output directly in the swEhCharts.r script.
How To Use swEhAwrGet.sql
Suppose on 8 August between 2pm and 4pm overall, the Oracle system looked great. However, a couple very upset users called in saying the application all of a sudden froze solid (like a laptop incased in ice) for over a minute and then mysteriously released control. After using my BloodHound toolkit to analyze the historical AWR data stored in the dba_hist_active_sess_history table, I discovered the situation was exactly as the users described. That is, overall performance was great, but a few users where essentially "hung" waiting on a row level lock. The Oracle wait event was, enq: TX - row lock contention.
The objective now was to get a better understanding of the long but infrequent wait times and be able to explain the situation to others.
Select Your Snapshot ID Range
Below is the entire dialog from selecting the best AWR snap IDs to the final output ready to be pasted into the R script.
Notice first, the script shows me what AWR data is available. This happens very quickly as I pull directly from the indexed dba_hist_snapshot table.
SQL> @swEhAwrGet DBID INSTANCE_NUMBER FIRST_SNAP_ID BEGIN_INTERVAL_TIME LAST_SNAP_ID BEGIN_INTERVAL_TIME ---------- --------------- ------------- -------------------- ------------ -------------------- 4217380439 1 2464 28-Dec-2014 16:00:07 3369 04-Feb-2015 09:00:43 6243270432 1 31089 12-Jul-2014 02:00:10 32532 11-Aug-2014 09:30:13 6243270432 2 31089 12-Jul-2014 02:00:10 32532 11-Aug-2014 09:30:13 6243270432 3 31089 12-Jul-2014 02:00:10 32532 11-Aug-2014 09:30:13
Rarely, do I know the exact snapshot IDs I want, so this script prompts me for the day and hour (not minute and seconds) when I'm interested. Then it provides the snapshot ID details for around this range. If find this very useful. And the script is very fast because, like before, it's pulling from dba_hist_snapshot.
ENTER the database ID (DBID) : 6243270432 ENTER the instance number : 3 ENTER the approximate starting date/hour (DD-Mon-YYYY HH24) : 08-Aug-2014 12 ENTER the approximate ending date/hour (DD-Mon-YYYY HH24) : 08-Aug-2014 15 SNAP_ID TO_CHAR(BEGIN_INTERV TO_CHAR(END_INTERVAL ---------- -------------------- -------------------- 32391 08-Aug-2014 11:00:11 08-Aug-2014 11:30:28 32392 08-Aug-2014 11:30:28 08-Aug-2014 12:00:05 32393 08-Aug-2014 12:00:05 08-Aug-2014 12:30:23 32394 08-Aug-2014 12:30:23 08-Aug-2014 13:00:41 32395 08-Aug-2014 13:00:41 08-Aug-2014 13:30:24 32396 08-Aug-2014 13:30:24 08-Aug-2014 14:00:02 32397 08-Aug-2014 14:00:02 08-Aug-2014 14:30:22 32398 08-Aug-2014 14:30:22 08-Aug-2014 15:00:36 32399 08-Aug-2014 15:00:36 08-Aug-2014 15:30:07 9 rows selected.
Generate The Raw Histogram Data & Review
In my situation, the users called in around 2pm. Looking at the above output, this means the starting snapshot ID of 32395 with an ending snapshot ID of 32397 should be fine. With the actual snapshot IDs and the specific wait event I'm interested in, I reply to the scripts' prompts like this:
ENTER the beginning snap_id : 32395 ENTER the ending snap_id : 32397 ENTER the wait event (default: db%seq%read) : enq%TX%row%lock% Copy and paste the below R code directly into the swEhCharts.r It must be the last entry in the #2 area of swEhCharts.r
In a split second the below output is created. Regardless of how much AWR data is retained, the output is created extremely fast.
Let's look at the output for a bit. The output relates to the event histogram bins found in both v$event_histogram and dba_hist_event_histogram, which are in milliseconds. For example, within the snapshot ID interval, there were 4888 TX row level lock occurrences that lasted between 0ms and 1ms. Notice there are 7 wait occurrences that lasted over 1 second, with two of them over 65 seconds!
So perhaps our users are not such shlemiels after all?
### START ### eName = "enq%TX%row%lock%" count0to1 = 4888 count1to2 = 2354 count2to4 = 2052 count4to8 = 764 count8to16 = 347 count16to32 = 149 count32to64 = 5 count64to128 = 2 count128to256 = 3 count256to512 = 3 count512to1024 = 2 count1024to2048 = 1 count2048to4096 = 0 count4096to8192 = 0 count8192to16384 = 0 count16384to32768 = 3 count32768to65536 = 1 count65536to131072 = 2 count131072to262144 = 0 PL/SQL procedure successfully completed. ### END : Do NOT COPY and PASTE the "PL/SQL procedure..." TEXT INTO R ###
Copy & Paste The Output Into The OSM R Script
Next, in a text editor open the OSM R script, swEhCharts.r and scroll down to below the last example. Then copy the above output starting from the "### START ###" to just above "### END" line, being careful to NOT include the "PL/SQL..." line. Now paste that into the R script.
Next, copy and paste the entire R script directly into R. Now press ENTER and you will see two pages of carefully thought out histograms displayed before your eyes!
When you paste the R script into R and press ENTER, R will produce some basic statistics (i.e., mean and median) plus two histogram pages, detailing the wait event activity. Below are the statistics followed by the histograms.
> summary(binValues) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.00 0.54 1.17 32.38 2.67 129800.00
Now you can easily create proper and beautiful Oracle wait event histograms based on dba_hist_event_histogram
What Are The Statistics And Histograms Telling Me?
Now that you have some statistics and some histograms, it's time to learn about the underlying data. While I'll provide some guidance, my online video seminar entitled, Using Skewed Performance Data To Your Advantage will teach you to be an expert!
Here are a few observations about the data:
- The average TX Row Level Lock time is around 32.5ms. In my experience, lock wait time data is super skewed and therefore, the average lock time gives me bupkes! In fact, average wait times can mislead me and others. Instead of focusing on the average wait time, I focus on the median lock time.
- The typical (median) row level lock time is around 1.2ms. Much better! This can be seen by looking at the bottom-right histogram on the first page in conjunction with the top-right histogram. The bottom-right histogram zooms in on the quicker wait times and the top-right shows the complete range of wait events. Therefore, 50% of the wait occurrences where less than about 1.2ms. Of course this also implies, 50% of the row level lock wait times are greater than 1.2ms!
- 75% of the row level lock times were less than around 2.7ms. That's pretty darn good! So obviously, there are a few super long row level lock times.
- Most of the row level lock times where between 0.54ms to 2.7ms.
- Looking at the actual bin counts, we can see there were a 6 row level lock times greater than 15 seconds, 3 greater than 30 seconds and 2 greater than one minute! Almost always, a few very long row level lock wait times are acceptable. While no one likes them and it can make people nervous. But if we can demonstrate the long wait times are a rare (which we have done), most IT Managers and Business Managers will rarely invest the time and expense to eliminate them.
- As previously mentioned, showing people a histogram or two will help them to put a few long wait times into perspective. In fact, if you only showed them the histograms and NOT the numeric statistics, they would probably dismiss proceeding further as a waste of time!
My hope is you can calmly investigate shockingly large wait event times. To support that effort, with AWR or Statspack data, you are able to generate some basic statistics and create histograms. What I have found is the time required to do this is minimal, but the rewards are significant!
To keep this post from getting to long, I did not mention how to use an actual AWR or Statspack report to create proper wait event histograms. How to do that will be the topic of my next post.
Enjoy your work!
If you have any questions or comments, feel free to email me directly at craig at orapub.com.
|Watch Oracle DB Session Activity With The Real-Time Session Sampler||Creating And Understanding A Response Time Chart Based On Oracle Database Time Data||Parsing Performance: Going Beyond Cursor Sharing Using Bind Variables|