How To Use The Oracle ASH time_waited column

Posted on 09-Sep-2015 by Craig Shallahamer, craig@orapub.com

Both the Oracle time model and the sampling model (ASH) are powerful Oracle performance tuning infrastructures. But combining sampling and time is dangerous. So, having the time_waited column in the ASH views presents us with both power and danger.

My previous post demonstrated the ASH time_waited column's unit of time and what all the zeros are about.

This post is about how to use the ASH time_waited values. But it's not as simple as it seems. It all centers around Oracle wait times being (hopefully and usually) much shorter than the ASH sample interval. So, how does an Oracle DBA appropriately and aggressively use the ASH time_waited column? That's what this post is all about.
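To see the tension, consider the two very different ways you could estimate wait time from ASH. Below is a minimal sketch (mine, not one of this post's scripts): the classic approach of counting samples and treating each one as roughly one second of wait time, next to simply summing the time_waited column.

-- A rough sketch comparing two ASH-based wait time estimates.
-- Each WAITING sample represents roughly 1 second of wait time (the default
-- ASH sample interval), while time_waited is the measured wait in microseconds.
select event,
       count(*)                          ash_samples,
       count(*) * 1                      count_based_wait_s,
       round(sum(time_waited)/1000000,1) time_waited_based_s
from   v$active_session_history
where  session_state = 'WAITING'
group  by event
order  by ash_samples desc;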

Note: There is a lot of script detail in this post. If you just want "the answers" then scroll down to the final "Super Summary" section of this post.

Questions That Need Answering

A couple of important questions come to mind:

1. Can I trust the ASH time_waited values when the actual wait times are much shorter than the one second ASH sample interval?
2. How closely do the ASH time_waited values match a session's actual wait times?

If these questions are important to you, then read on!

The Experiment: Setup Overview

Note: You can download all the experimental details (scripts, collection, reporting, spreadsheet, data) HERE.

To confidently answer the above questions, I needed to create an experiment. But it was really tricky to design because I needed to create and control sub-second wait times.

In my previous post, I had one session lock a table indefinitely and had another session attempt to acquire the lock but with a specific timeout. That allowed me to control the lock acquisition wait time.

However, the timeout parameter must be an integer. And what I need for this experiment is a fractional (real) number, and that number sometimes needs to be less than 1 second. So, another solution was needed.
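For reference, this is the style of integer-timeout lock attempt from the previous post (a sketch with a made-up table name). Since the WAIT clause only accepts whole seconds, the smallest non-zero wait I could control this way was a full second... not nearly fine-grained enough.

-- WAIT accepts only an integer number of seconds, so sub-second
-- lock wait times cannot be produced this way.
lock table some_blocked_table in exclusive mode wait 1;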

The solution I came up with was to create a situation where the blocker/holder session locked and then released a table in a very specific and controlled pattern. Then I had another session (the "waiter") attempt to acquire the lock.

But here's the cool part. For the "waiter" session only, I timed the lock acquisition (the wait time), so I know the true wait times... many of them. Then I also looked at the waiter session's ASH time_waited column. Now I can compare the true wait times and the ASH sampled wait times!

This means I have two different sets of data related to the exact same situation.

The first set is the actual lock acquisition times. I simply got the timestamp before the lock acquisition attempt and again after the session acquired the lock... then calculated the delta.

The second set of data was recorded for me by ASH. I get the wait times from the ASH time_waited values. It is important to understand that I only see the session waiting on the lock when the session is trying to acquire the lock and ASH happens to sample it at that moment.

Naturally, we would expect some difference between ASH and the actual locking situation. How much of a difference is where it really gets interesting!

The Experiment: The Lock Holder/Blocker Session

Below is the "blocker" code that I simply copy and pasted into SQL*Plus. I explain the code below the listing.

connect sys as sysdba
grant execute on dbms_lock to craig;
connect craig/craig

drop table bogusT;
create table bogusT as select * from dual;

drop table op_lock_times_blocker;
create table op_lock_times_blocker (lock_time_s number);
select distinct sid from v$mystat;

declare
  sample_value_var  number;
begin
   dbms_random.seed(13);
   loop
     ---- The below random statements are used to set the random number range
     sample_value_var := dbms_random.value(0.15,0.45);
     --sample_value_var := dbms_random.value(0.01,0.05);
     --sample_value_var := dbms_random.value(2.0,5.0);

     ---- The below insert/commit data is ONLY used to ensure the locks times are correct
     ---- They are not used in the statistical analysis
     insert into op_lock_times_blocker values (sample_value_var);
     commit write nowait;

     ---- Lock, wait, and release for a specific time; sample_value_var
     lock table bogusT in exclusive mode;
     ---- This is how I controlled the blocking session's lock time
     dbms_lock.sleep(sample_value_var);
     ---- now release the lock, using the rollback
     rollback;

    end loop;
end;
/

In the above code, the blocker should be holding the lock between 0.15 and 0.45 seconds. The waiter will then be blocked from 0 to a maximum of 0.45 seconds. But it's more interesting than this!

If you look closely at the dbms_random entries, you can see that I control the random value, which becomes the blocking session's actual lock time. Statistically speaking, the random real numbers are pulled from a uniform distribution between the given low and high values. So, if the range is 0.15 to 0.45, the average should be 0.30. I actually checked this! Thankfully, it was always correct!
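Here is the kind of sanity check I'm talking about, a simple aggregate over the blocker's op_lock_times_blocker table (my sketch; the downloadable kit contains the actual reporting scripts):

-- For the 0.15 to 0.45 range, the average should be very close to 0.30
-- and the min/max should stay inside the range.
select count(*)                  lock_count,
       round(avg(lock_time_s),3) avg_s,
       round(min(lock_time_s),3) min_s,
       round(max(lock_time_s),3) max_s
from   op_lock_times_blocker;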

The key to the session blocker code is the combination of locking the table, sleeping for a specific period of time and then releasing the lock.

So, the blocker should be holding the lock between 0.15 and 0.45 seconds, with an average lock time of 0.30 seconds. The waiter will then be blocked from 0 to a maximum of 0.45 seconds. Next I'll describe how I collected the Oracle waiting session details.

The Experiment: The Lock Acquisition/Waiting Session

The code for the session attempting to acquire the table lock is a bit more complicated. The complication is that I only want to use the ASH samples taken during the lock acquisition attempts. Below is the code that I copied and pasted directly into SQL*Plus.

To summarize the below code, I collect 200 lock acquisition samples. Every attempt/sample is timed.

To help ensure the locking situation is what I think it is, I also take a before and after snapshot of the session's lock related v$session_event data. Thankfully, Oracle's wait interface agreed with the waiting session's lock times and the number of locks (i.e., the count).

I also saved the current ASH sample_id just before collection begins. This allows me to report on the session's ASH data ONLY during the lock attempts.
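Reporting on just the experiment's ASH rows is then a simple filter on the session and the saved sample_id. Here is a sketch of that reporting query (the downloadable scripts contain the exact version I used):

-- Only the waiter session's ASH samples taken after collection started.
-- time_waited is in microseconds (see my previous post), so divide by 1,000,000.
select sample_id, event, time_waited, round(time_waited/1000000,3) time_waited_s
from   v$active_session_history
where  session_id = :sid
  and  sample_id  > :sample_id_start
  and  event like 'enq: TM%'
order  by sample_id;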

Go through the code listing below slowly. It's full of comments, so I'm hoping it will be clear.

connect craig/craig

set tab off
set serveroutput on
drop table op_lock_times_waiter_truth;
create table op_lock_times_waiter_truth (lock_time_s number);
variable sample_id_start number;
variable sid number;

begin
  select distinct sid into :sid from v$mystat;
end;
/
print :sid

declare
  max_samples_var number := 90; -- sometimes I took 200 samples
  max_sleep_var   number := 3.14159265359;
  cntr_var        number;
  timestamp_start_var  timestamp;
  wait_time_ms_var     number;
  total_waits_B_var    number;
  time_waited_s_B_var  number;
  total_waits_E_var    number;
  time_waited_s_E_var  number;
begin

  select max(sample_id)
  into   :sample_id_start
  from   v$active_session_history
  where  session_id = :sid;

  begin

    -- get BEGIN total wait time, BEGIN total waits
    select total_waits, time_waited/100
    into   total_waits_B_var, time_waited_s_B_var
    from   v$session_event
    where  sid = :sid
      and  event like 'enq: TM%';
    exception when others then
      total_waits_B_var   := 0;
      time_waited_s_B_var := 0;
  end;

  for cntr_var in 1..max_samples_var
  loop
    -- get the pre-lock timestamp, used below to calculate the lock acquisition time
    select current_timestamp 
    into   timestamp_start_var 
    from   dual;

    -- attempt to lock the table, which may already be locked
    -- if I acquire the lock, processing continues to next line of code
    -- the 999 second timeout should never happen in this controlled experiment
    lock table bogusT in exclusive mode wait 999;

    -- get acquisition clock time and calculate lock time to the millisecond
    select extract( day from diff )*24*60*60*1000 +
           extract( hour from diff )*60*60*1000 +
           extract( minute from diff )*60*1000 +
           extract( second from diff )*1000
    into   wait_time_ms_var
    from   (select current_timestamp - timestamp_start_var diff
            from   dual
           );

    -- release the lock by rolling back
    rollback;
      
    -- record the actual lock time, if any
    insert into op_lock_times_waiter_truth values (wait_time_ms_var/1000);

    -- commit the lock time... there is a rollback in this script, so 
    -- i need to commit the experimental results
    commit;

    -- to reduce the likelihood of a cyclical pattern, randomly sleep
    dbms_lock.sleep(dbms_random.value(0,max_sleep_var));

  end loop;

  -- get END total wait time, END total waits
  select total_waits, time_waited/100
  into   total_waits_E_var, time_waited_s_E_var
  from   v$session_event
  where  sid = :sid
    and  event like 'enq: TM%';


  -- get ending wait time and wait totals... calc avg wait time
  dbms_output.put_line('-------------------------------------------------------');
  dbms_output.put_line('Waiting session actual wait times based on v$session_event');
  dbms_output.put_line('.');
  dbms_output.put_line('Avg TM wait time (s) : '||
    round((time_waited_s_E_var-time_waited_s_B_var)/(total_waits_E_var-total_waits_B_var),3)||
    '  Total TM waits : '||(total_waits_E_var-total_waits_B_var) );
  dbms_output.put_line('-------------------------------------------------------');

end;
/

So... what happened? Read on!

The Experiment: The Results

The experimental design worked flawlessly. I was able to get all the wait time related samples I needed. Plus, I used three different blocking session sleep times. This gave me wait times less than and greater than ASH's default one second sample period. ...exactly what I wanted.

Below is a matrix summarizing the results.

The Measured Lock Times are the waiting session's actual lock times while trying to acquire the table lock. There are always 200 samples because a) each sample is a lock acquisition attempt and b) that is how many samples I wanted to gather using a PL/SQL FOR loop.

The ASH-based experimental sample times are taken from the time_waited column. Notice the number of ASH samples does not equal the number of FOR loop samples. I would NOT expect them to be the same because they are two independent activities. I'll comment more on this below.
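If you want to put the two data sets side by side yourself, something like the following works (my sketch, not one of the experiment's scripts). The time_waited = 0 rows are in-flight waits (see my previous post), so they are excluded from the average.

-- Measured lock acquisition times (the "truth") versus the ASH sampled times.
select 'measured (truth)' source,
       count(*)                  samples,
       round(avg(lock_time_s),3) avg_wait_s
from   op_lock_times_waiter_truth
union all
select 'ASH time_waited',
       count(*),
       round(avg(time_waited/1000000),3)
from   v$active_session_history
where  session_id = :sid
  and  sample_id  > :sample_id_start
  and  event like 'enq: TM%'
  and  time_waited > 0;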

There is a LOT to learn from this experiment. If you just want a summary, jump down to the next section. But, here are a few observations I want to highlight.

But be careful: making decisions based on a maximum value is problematic. All it takes is some bizarre, unlikely scenario to cause a one-off massive wait time. Just because a value is correct doesn't mean you should use it in your decision making. It would be better to do some math and take, for example, the 98th percentile value. That way you can "scrape" off the extremely unlikely and worthless maximum values. I'll detail how to do that in my next post.
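As a preview, pulling a percentile out of the ASH data is a single analytic aggregate. A sketch (the full treatment comes in the next post):

-- 98th percentile versus the maximum of the non-zero ASH wait times, in seconds.
select round(percentile_cont(0.98) within group (order by time_waited)/1000000,3) pct98_wait_s,
       round(max(time_waited)/1000000,3)                                          max_wait_s
from   v$active_session_history
where  session_id = :sid
  and  sample_id  > :sample_id_start
  and  event like 'enq: TM%'
  and  time_waited > 0;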

Wow... That was a lot to digest, so I have summarized it below.

Result (Super) Summary

Based on my experimental data,

All the best to you in your work,

Craig.

If you have any questions or comments, feel free to email me directly at craig at orapub.com.
