How To Use An AWR Report To Create Correct Wait Event Based Histograms
In this post I will step you through how to quickly create a correct wait event time histogram directly from an AWR report.
The data will come from a standard AWR report, not by extracting data from the AWR dba_hist_event_histogram table like I did in my previous post.
You start with an AWR or Statspack report and end with powerful statistics summarizing the wait event timing situation, plus two pages of carefully crafted histograms.
And don't count on OEM's wait event histograms. As I'll explain below, they can easily be misleading. So misleading, you can end up in an "IO Fight" situation like the one shown in this video.
The process to create proper histograms from an AWR is straightforward and only takes a couple of minutes. Never again will you be hamstrung with only the wait time average... which is a vulnerable place to be in any Oracle Database performance tuning analysis.
So let's get started!
Get AWR report
However you run your AWR or Statspack reports, run one now and view it. The report shown on the below is what I'll be using in this article. It's a serious production system... with some serious problems!
Find Event Histogram Section
With the AWR report open, scroll down a bit and you'll see something like this:
Notice the sections starting with Wait Event Histogram? Those are the sections we are interested in. Click on the first link and you'll see something like this:
What is unique about this report is each wait event occurrence is placed into a bucket or bin based on its duration. For example, suppose we timed a single occurrence of the wait event ADR block read and found its wait time to be 4.6ms. It would be placed in the <8ms bin because its wait time is greater than 4ms and less than or equal to 8ms.
But it gets better! If Oracle placed the raw occurrence counts in this table, it would be very difficult to read. So Oracle shows the percentage, not the actual count. If you want to know the total count, you're in luck because the Total Waits is shown for each event.
To conserve storage space, performance reasons (I suspect) and to allow easier viewing of a potentially massive wait time duration range, Oracle stores and displays the wait times in bins that grow by powers of two. This seems strange at first, but it is a great way to show lots of information in a small space. Notice the bin sizes increase in powers of two: 2^0, 2^1, 2^2,...
But even with the power of two bin sizing strategy, wait events can sometimes last many seconds or minutes... Remember that row lock situation last year? Oracle's strategy is to create multiple HTML tables, with each table showing a set begin and end wait time range. And, only the wait events that have an occurrence in one of the displayed bins are shown. This is why the last table has only one wait event shown. Here is an example.
Oracle Enterprise Manager Can Easily Mislead
While storing and displaying the wait event times in ever increasing bin sizes has its advantages, there is a disturbing usability problem.
Look at the histogram built from wait event histogram data (v$event_histogram and dba_hist_event_histogram) created by Oracle Enterprise Manager (OEM). Does the taller bin in the middle concern you? Normally this would concern me.
But because I understand how OEM creates the histogram, I am NOT concerned.
The problem is this; 99.9% of us are trained to expect histogram bin sizes to be the same. For example, if the first bin is 5ms then the last bin should also be 5ms. If the bin sizes are not uniform, we can be visually fooled into thinking there is a problem when there is not a problem!
Because of this, I would never ever use or reference an OEM wait event histogram. Instead, I would create a proper histogram. This post demonstrates one quick way to do this.
Find Wait Event Of Interest
In this very real situation, the wait event of concern is log file sync. The average log file sync wait time is 13ms and is a massive 59% of the Oracle database time (DB time). Since the DBAs know that log file sync time highly correlates with commit times, they are concerned about the average wait time but also very concerned about the possibility of some very large wait times. To understand the large wait times, we need look at the event histogram data.
In the below graphic I copied and pasted the related log file sync histogram times directly from the AWR report.
From the above event histogram data and other information available in the AWR report, we can see that over the 60 minute report duration, there were nearly 704K log file sync waits. And while the average time was 13ms, there were a few occurrences up to 250ms, that is 1/4 of a second! But wait, it gets even worse!
Do you see the .0 values in the <1/2s and <1s bins? This means there was at least one occurrence. But since Oracle chose to severely limit the displayed precision (ARGH!), we can't determine how many occurrences. This is one reason why I prefer to pull the data directly from dba_hist_event_histogram. This way, I know the actual wait counts.
Now it's time to create the histograms and generate some statistics. Read on!
Get OSM Script swEhCharts.r
Download my free OSM toolkit HERE and open the file osm/interactive/swEhCharts.r. It is an R script. If you are not familiar with the free statistics pack R, I have recently blogged and also did a few webinars on this topic.
Now it's time to edit the script, using our AWR log file sync event timing details.
Edit Script Based On AWR Report
With both the AWR report's log file sync event histogram data available and the R script open, you are ready to edit the R script. In the R script, scroll down to the last example and copy it. Making sure to edit the last example (because that is what R will reference), change the values to match the values shown in the AWR report. Following my example, here are the results.
There is only one gotcha to watch out for. On the second AWR event histogram "page," the first timing column sums the results from the previous page! I actually made this mistake myself. Find the 93.3 value. That is percentage total from the previous page. In other words, 93.3% of the waits are from 0ms to 32ms. Do not enter 93.3.
Now comes the easiest part and the most satisfying!
Copy, Paste And Gain Understanding!
Finally, it's time to copy the entire R script into R. Start the R console, paste in the script and press ENTER. This is what I saw in my R console.
You should quickly see some numeric statistics and two pages of histograms. Let's look at the numerics first.
> summary(binValues) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.000 1.746 11.300 14.720 19.550 1020.000
First, notice the average is 14.72 but the AWR shows the average is 13ms. The difference is because the AWR report does not provide precise percentages and my R script creates artificial sample values based on the percentages and their related bin ranges. So it's not going to be perfect.
But I'm not going for perfection, I'm going for understanding and solving real production problems. And usually there is a difference.
Next, notice the median value is less than the mean, indicating the data is bunched to the left of the mean. This is very typical of Oracle wait event times. In fact, if this was not the case, I would double check my numbers! Also, 75% of wait times are below 19.5ms. That's probably not acceptable.
By the way, I am usually not concerned with the maximum value because it usually happens only once! The histograms should validate this.
Important: If your goal is for most of the log file sync times to be less than 20ms, you are meeting your SLA!
Now let's look at the histograms. There are two pages because I don't know ahead of time which times you'll likely be interested in. It could be when the wait time is less than 10ms or perhaps less than 500ms. To help ensure there is at least one helpful histogram, I create a bunch of them. Shown below are both pages.
The histograms visually tell a compelling story. First, they tell us to not freak out about the large log file sync values because there are very few of them. Unless your SLA is extremely tight, a few large values will not likely affect the user experience... they may experience a "blip" in performance, but that's about it.
However, it is very evident (see the blue and green histograms) while there is a large percentage of the wait times less than 2ms (which is fantastic) there is a relatively large wait time spread. That is, the wait times are not very consistent. For a batch intensive system this may be fine.
But in an OLTP centric system we need to be careful about these inconsistent times being multiplied by complex architectures. The result could easily result in large and inconsistent user-facing commit times. The combination of slow and inconsistent response times drive users crazy!
With this detailed numeric and visual information, we are in a much stronger position to have a rational discussion about the problems and potential solutions. Armed with only the average wait time, we are in a weakened position.
Without some fact-based information, it's much easier for someone in authority (perceived or real) to rule the discussion. But when you have actual fact-based pictures and numbers, it helps keep people honest and focused on solving the real problems.
Thanks for reading and be thankful for your work,
If you have any questions or comments, feel free to email me directly at craig at orapub.com.
|Operating System Focus: Log File Switch Checkpoint Incomplete||Simple Way to Calculate OS CPU Utilization From Oracle Database V$ Views||Top 7 Reasons Why Oracle Database Conferences Are A Waste Of Time|