Which Is Better: Time Model Or ASH Data?

Posted on 12-Aug-2015 by Craig Shallahamer, craig@orapub.com

Question: Which provides better Oracle Database performance tuning data: Oracle's time model or its sampling model? In other words, which is better, the "db time" approach or the active session history (ASH) approach?

Aren't you just a little curious? It's a question I have wondered about for years, but I never took the time to gather session-level data and compare the results side by side.

What I did was first create a simple tool. The tool collects time-based and sample-based data for a given session, then displays the results in a way we can compare... looking for agreement. What you will notice is that the longer the experiment and the more active the session, the closer the two performance data strategies become.

In this post, I'll introduce and run the tool, present the results and then summarize them into a few key take-aways. Good fun!

Performance Data Collection Models

Since Oracle Database version 10g, Oracle has provided DBAs with two truly amazing data collection facilities. One approach is based on time and the other is based on sampling. I'll quickly introduce each.

Oracle Time-Based Model

If you are familiar with Oracle database time, you are already familiar with the Oracle time model. Since every Oracle session trying to do some work is either consuming CPU or waiting on something, there are just two main components to database time: CPU time and Oracle non-idle wait time.

The CPU time details are collected directly from the operating system and presented to us in the time model statistics views, v$sess_time_model and v$sys_time_model. The wait time views have been around since Oracle 7. Oracle wait time is based on instrumentation. That is, Oracle developers insert additional code to collect how long it takes to do a non-CPU task, such as an IO read.
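As a concrete (and hedged) illustration, here is a minimal sketch of pulling the two time components for one session. The SID value 172 is just an assumption for illustration, and v$sess_time_model reports values in microseconds.

-- Sketch: one session's time model components (sid 172 assumed)
select stat_name, value/1000000 as seconds
from   v$sess_time_model
where  sid = 172
and    stat_name in ('DB time', 'DB CPU');

-- Sketch: the same session's non-idle wait time, from instrumentation
select sum(time_waited_micro)/1000000 as non_idle_wait_seconds
from   v$session_event
where  sid = 172
and    wait_class <> 'Idle';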

The benefit of the time model for people like you and me is the ability to do quantitative performance analysis based on measurements. It allows us to answer questions in terms of time, like "How long did X take?"

Time is important because users feel performance in terms of time and throughput. So, if our performance analysis is quantitative and based on Oracle's time model, we can actually develop a quantitative link, at least in part, between a user experience and our performance analysis. That's powerful and that's why the Oracle time model rocks!

But along with all the quantitative detail comes a massive amount of performance data. In fact, there is so much performance data collected that Oracle doesn't store it all. That is why, when an Oracle session disconnects, we lose lots of detail about that session.

The session details are no longer in v$sess_time_model or v$session_event or v$session. So we can't answer the question, "How much CPU did Sam consume?" and we don't know how long Sam's single block reads took. Plus, the time model does not collect or store details about when Sam executed sql_id ABC123. It's just too much overhead.

If you have read my posts about the Oracle time model, you know that while it is truly amazing, it's not perfect.

These limitations are where Oracle's active session history comes into play.

Oracle Active Session History

Oracle's active session history, or ASH for short, is based on a sampling methodology. It's like this: suppose I wanted to know your temperature over the last 24 hours. One approach would be to ask you once every hour. I would have 24 samples and could then get a pretty good idea of your temperature.

Oracle does the same thing. Essentially, Oracle samples from v$session once every second, stores the data in memory in the shared pool and makes it available to us through the v$active_session_history view.
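To get a feel for the sampled data, here is a minimal sketch that counts recent ASH samples by session state. The SID 172 and the 30 second window are assumptions, not values from my script.

-- Sketch: how many recent ASH samples found the session on CPU vs waiting
select session_state, count(*) as samples
from   v$active_session_history
where  session_id = 172
and    sample_time > systimestamp - interval '30' second
group  by session_state;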

While the data is not complete, because it's not based on time, the breadth and depth of the data is amazing. With ASH data, I can tell pretty much when row locking started, who held the first lock, what SQL statement they were running and when they released the lock.
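As a hedged illustration of that kind of detective work, a query along these lines shows who was blocked, by whom, and on what SQL, sample by sample. The TX enqueue filter and the 10 minute window are assumptions, and this is nowhere near a complete locking analysis.

-- Sketch: reconstructing a row-lock incident from ASH samples
select sample_time, session_id, blocking_session, sql_id, event
from   v$active_session_history
where  event like 'enq: TX%'
and    sample_time > systimestamp - interval '10' minute
order  by sample_time;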

If you have read my posts about ASH, you know that while it is truly amazing, it's not perfect.

I Created A Script

The script I created is named tmcomp.sql and is now part of my OSM Toolkit. You can download the OSM Toolkit HERE.

I created the script to compare the time model and sample model approaches. A fair comparison method is the percentage of CPU and wait "time" over a time interval, such as 30 seconds.

For example, suppose session number 315 is active. I start my 30 second timer, get my initial time data and get the most recent ASH sample ID. When the 30 seconds are up, I get the final time data and calculate the deltas and the percentages. For the ASH data, I take all the session-related rows since the timing began and calculate the percentages.
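If you want the core idea without wading through tmcomp.sql, here is a minimal PL/SQL sketch of just the time-based side. It is not the actual script: the SID 172 and the 30 second sleep are assumptions, and dbms_lock.sleep may require an explicit grant (newer releases offer dbms_session.sleep).

declare
  cpu_0 number; cpu_1 number;
  dbt_0 number; dbt_1 number;
begin
  -- starting snapshot (v$sess_time_model values are in microseconds)
  select max(decode(stat_name, 'DB CPU',  value))/1e6,
         max(decode(stat_name, 'DB time', value))/1e6
    into cpu_0, dbt_0
    from v$sess_time_model where sid = 172;

  dbms_lock.sleep(30);  -- the sample interval

  -- ending snapshot, then the delta-based CPU percentage
  select max(decode(stat_name, 'DB CPU',  value))/1e6,
         max(decode(stat_name, 'DB time', value))/1e6
    into cpu_1, dbt_1
    from v$sess_time_model where sid = 172;

  dbms_output.put_line('CPU % = '||
    round(100*(cpu_1-cpu_0)/nullif(dbt_1-dbt_0,0), 1));
end;
/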

Running the real tmcomp.sql, the output could look something like this:


Sample      Time Based        ASH Based        
Time (s)    CPU %    Wait %   CPU %   Wait %   
----------  -------  -------  ------  ------  
      32.6      4.5     95.5    12.5    87.5  

Looking at the output above, you can see there is a clear difference. So the question becomes, are the differences enough to invalidate using ASH data? And, is there something I can easily do to increase the output agreement? I'll tackle that in the next section.

Seeing Data Agreement Before Your Eyes

As you might expect, the more ASH samples we have, the more agreement there will be between the time model and the sample/ASH model. There is a subtle assumption there, that the time model is the truth, but don't get stuck on that.

Report Columns

If you want to understand how I calculated the output, especially the details, please reference the actual script. I documented how I did the math. But I will say this: the only tricky part is the "cpu os Qt" column. This is the "left over" DB time that is commonly, and unfortunately, called CPU wait time. It's calculated as the DB time minus the sum of the DB CPU time and the non-idle wait time.
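As a quick sanity check of that arithmetic, take the 32.6 second sample in the output below: DB time is 32.0 seconds, DB CPU is 2.0 and non-idle wait is 30.6, so the left over time comes out slightly negative.

-- Worked example using the 32.6 second sample shown below
select 32.0 - (2.0 + 30.6) as cpu_os_qt_seconds from dual;  -- = -0.6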

If the CPU subsystem is really busy resulting in processes queuing for CPU resources, that can be a cause for the "cpu os Qt" time. Also remember that Oracle's time model is not perfect. I have lots of blog postings about this.

Report Output

Take a look at the output below, paying close attention to the sample interval and the CPU and Wait percentage agreement.


SQL> @tmcomp
Important: Make sure you have run osmsetvar.sql

ENTER inst_id           (default: 1) : 
ENTER sid               (default: 0) : 172

Sample      Time Based        ASH Based       |  Time Based Details (s)                ASH Samples
Time (s)    CPU %    Wait %   CPU %   Wait %  |  db time  db cpu   cpu os Qt  ni wait  cpu #    wait #
----------  -------  -------  ------  ------  |  -------  -------  ---------  -------  -------  -------
_     47.8      4.6     95.4    10.9    89.1  |     45.1      3.8       -1.7     43.0        5       41
Enter / to run again!
SQL> /
_      3.4     15.7     84.3    25.0    75.0  |      4.0      0.2        0.5      3.4        1        3
SQL> /
_      3.7      7.5     92.5     0.0   100.0  |      4.0      0.2        0.1      3.7        0        4
SQL> /
_     13.6      7.9     92.1     0.0   100.0  |     14.0      0.7        0.4     12.9        0       13
SQL> /
_     14.8      2.7     97.3     6.7    93.3  |     14.0      1.0       -0.6     13.6        1       14
SQL> /
_     26.6     13.7     86.3    22.2    77.8  |     27.4      2.7        1.0     23.7        6       21
SQL> /
_     32.6      4.5     95.5    12.5    87.5  |     32.0      2.0       -0.6     30.6        4       28
SQL> /
_    116.7      7.4     92.6     9.6    90.4  |    114.4      8.7       -0.2    106.0       11      103
SQL> /
_    133.6      8.3     91.7     7.8    92.2  |    129.3     10.7        0.0    118.6       10      119


SQL> @tmcomp
ENTER inst_id           (default: 1) : 
ENTER sid               (default: 0) : 684

Sample      Time Based        ASH Based       |  Time Based Details (s)                ASH Samples
Time (s)    CPU %    Wait %   CPU %   Wait %  |  db time  db cpu   cpu os Qt  ni wait  cpu #    wait #
----------  -------  -------  ------  ------  |  -------  -------  ---------  -------  -------  -------
_      9.9     41.4     58.6    58.3    41.7  |     10.0      2.9        1.2      5.9        7        5
Enter / to run again!
SQL> /
_     61.4     83.8     16.2   100.0     0.0  |      4.7      1.3        2.7      0.8        4        0
SQL> /
_    184.0     90.3      9.7    90.8     9.2  |     75.1     35.3       32.5      7.3       69        7


SQL> @tmcomp
ENTER inst_id           (default: 1) : 3
ENTER sid               (default: 0) : 2120

Sample      Time Based        ASH Based       |  Time Based Details (s)                ASH Samples
Time (s)    CPU %    Wait %   CPU %   Wait %  |  db time  db cpu   cpu os Qt  ni wait  cpu #    wait #
----------  -------  -------  ------  ------  |  -------  -------  ---------  -------  -------  -------
_     48.6     77.5     22.5    32.4    67.6  |  -4521.5  -2028.4    -1474.0  -1019.1       99      207
Enter / to run again!
SQL> /
_      3.8    -12.4    112.4    20.0    80.0  |      4.1      0.4       -0.9      4.6        1        4
SQL> /
_      5.1     17.5     82.5    33.3    66.7  |      7.0      0.6        0.6      5.8        2        4
SQL> /
_     12.0     -0.8    100.8    15.4    84.6  |     12.1      1.0       -1.1     12.2        2       11
SQL> /
_     18.5     16.1     83.9    20.0    80.0  |     20.1      2.4        0.8     16.8        4       16
SQL> /
_     35.0     19.9     80.1    33.3    66.7  |     36.1      5.4        1.8     29.0       12       24
SQL> /
_    205.8     58.9     41.1    55.6    44.4  |    206.9     76.5       45.3     85.0      115       92
SQL> /
_     63.3     84.6     15.4    87.5    12.5  |     66.3     34.1       22.0     10.2       56        8
SQL> /
_     43.2     84.4     15.6    84.1    15.9  |     44.2     22.6       14.7      6.9       37        7
SQL> /
_     13.7     51.7     48.3    35.7    64.3  |     14.1      4.8        2.5      6.8        5        9
SQL> /
_      3.7     64.8     35.2    50.0    50.0  |      6.0      2.0        1.9      2.1        2        2
SQL> /

That should be enough output to draw some general conclusions. Don't freak out about the negative numbers... even the time model data is not updated in a uniform, real-time fashion. Here's what I noticed, based solely on the data above.

So, Which Is Better?

Even my simple experiment showed that both methods reached acceptable agreement with only a handful of samples. So, it really depends on your objectives and what data is available.

For me, it boils down to two questions: is a sample percentage breakdown good enough, and do I need to know when a performance "incident" occurred?

For example, must I know that over a 30 minute interval 200 seconds of CPU were consumed and there were 800 seconds of non-idle wait time? Perhaps this is good enough: over the 30 minute interval, 20% of the ASH samples found the session consuming CPU and the other 80% found it waiting. In most cases, percentages are good enough.
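Here is a minimal sketch of that percentage-style answer straight from ASH. Again, the SID 172 and the 30 minute window are assumptions.

-- Sketch: ASH sample percentage breakdown over the last 30 minutes
select round(100 * sum(case when session_state = 'ON CPU'  then 1 else 0 end)
             / nullif(count(*), 0), 1) as cpu_pct,
       round(100 * sum(case when session_state = 'WAITING' then 1 else 0 end)
             / nullif(count(*), 0), 1) as wait_pct
from   v$active_session_history
where  session_id = 172
and    sample_time > systimestamp - interval '30' minute;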

But not in all cases. There are times when I must have instrumented numbers. Sometimes I need wait time details, such as the average and the statistical distribution. ASH samples won't provide this... ASH is not designed to provide this.
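For that kind of instrumented detail, the time-based views still deliver. Here is a hedged sketch against v$session_event; the SID 172 is an assumption, and note that average_wait is reported in centiseconds. (For the statistical distribution, v$event_histogram offers system-wide wait time buckets.)

-- Sketch: instrumented wait time detail the time model keeps per session
select event, total_waits,
       time_waited_micro/1000000 as time_waited_s,
       average_wait*10           as avg_wait_ms
from   v$session_event
where  sid = 172
and    wait_class <> 'Idle'
order  by time_waited_micro desc;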

And finally, if I need to know when a performance incident occurred, I go to ASH data. With ASH data and my BloodHound Toolkit, I can create a second-by-second timeline down to the individual session. I can even create a graphical representation of the situation just before and after the incident. I can't do that with time model data.

All the best to you in your work.

Craig.

If you have any questions or comments, feel free to email me directly at craig at orapub.com.
