Inferring SQL Run Time Using ASH And PQ Program Filtering

Posted on 11-July-2019 by Craig Shallahamer / OraPub / craig@orapub.com

Parallel query can be awesome for performance because it can significantly reduce a SQL's run time.

With ASH data, it is very easy to infer SQL statement database time. However, it is not simple to infer SQL statement run time. The challenge occurs when parallel query is involved.

In this article, I'll take you on my journey to infer SQL statement run times using ASH data with an Oracle Parallel Query (PQ) program based filtration strategy.

As you might expect, what began as a simple exercise morphed into an entire project.

Enjoy!

Database Time

Oracle database time is roughly CPU consumption plus non idle wait time. This can be applied from the SQL statement level up to the entire instance and beyond.

One way to infer a SQL statement's database time is to count all the active sessions related to a specific execution. For example, if we have 12 ASH samples for a specific SQL execution and the sample frequency is one every second, then we can infer the database time is around 12 seconds.

Here's how to get a list of the inferred database times for a specific SQL_ID. Notice I am using a dba_hist_active_sess_history extract, op_dba_hist_actv_sess_hist.

select count(*), to_char(SQL_EXEC_START,'YYYY-MM-DD HH24:MI:SS'), sql_exec_id
from OP_DBA_HIST_ACTV_SESS_HIST
where  sql_id = '&mysqlid'
group by to_char(SQL_EXEC_START,'YYYY-MM-DD HH24:MI:SS'), sql_exec_id
order by 1

I ran the above SQL for three different SQL_IDs. You can see the results below. Keep in mind, I am referencing an extract of dba_hist_active_sess_history, so the sample frequency is once every 10 seconds.

What is the inferred SQL database time for each of these executions?

SQL> l
  1  select count(*), to_char(SQL_EXEC_START,'YYYY-MM-DD HH24:MI:SS'), sql_exec_id
  2  from OP_DBA_HIST_ACTV_SESS_HIST
  3  where  sql_id = '&mysqlid'
  4  group by to_char(SQL_EXEC_START,'YYYY-MM-DD HH24:MI:SS'), sql_exec_id
  5* order by 1
SQL> /
Enter value for mysqlid: 67k9d0nf4186q
old   3: where  sql_id = '&mysqlid'
new   3: where  sql_id = '67k9d0nf4186q'

  COUNT(*) TO_CHAR(SQL_EXEC_ST SQL_EXEC_ID
---------- ------------------- -----------
        15 2019-02-06 20:52:28    16777219

SQL> /
Enter value for mysqlid: 73abvkc03r860
old   3: where  sql_id = '&mysqlid'
new   3: where  sql_id = '73abvkc03r860'

  COUNT(*) TO_CHAR(SQL_EXEC_ST SQL_EXEC_ID
---------- ------------------- -----------
         1 2019-01-14 22:11:53    16777216
         1 2019-01-07 22:13:58    16777217
         1 2019-01-05 06:07:35    16777218
        14 2019-01-09 22:30:03    16777216
        14 2019-01-21 22:13:49    16777216

SQL> /
Enter value for mysqlid: 8tjmvqptxpv4v
old   3: where  sql_id = '&mysqlid'
new   3: where  sql_id = '8tjmvqptxpv4v'

  COUNT(*) TO_CHAR(SQL_EXEC_ST SQL_EXEC_ID
---------- ------------------- -----------
        15 2019-02-06 20:02:16    16777296

To infer the SQL execution's database time, simply multiple the sample count by the sample frequency.

For example, referencing SQL_ID 67k9d0nf4186q, there is one execution shown with a sample count of 15. Therefore, the inferred database time is around 150 seconds (15 samples X 10s). For SQL_ID 73abvkc03r860, the inferred database time is around either 10 seconds (1 sample X 10s) or 140 seconds (14 samples X 10s).

As you can see, inferring the database time is very simple. But what about inferring the SQL run time? Read on...

SQL Statement Run Time

While inferring database time using ASH samples is very straightforward, inferring run time is not.

It's because database time includes ALL the cpu and wait time for all the sessions involved, which includes any Oracle parallel query (PQ) session database time. When PQ is involved, database time and run times diverge.

Here's an example I made up.

DB Time   PQs   Run Time
100        0    100
100        2     50
100       10     10

In reality, there is some PQ overhead, so when PQ is involved the database time will be more than 100. This means the run time is not a simply the DB Time divided by the number of PQ sessions, but usually close enough for our purposes.

So, one strategy to determine a statement's run time is to exclude all PQ sessions from the ASH sample count. But this turned out to be much more complicated then I expected.

Complicated? Yes... Read on.

Filtering Parallel Query Sessions By Program

One would think it would be stupid simple to filter out PQ sessions. But it is not simple because Oracle is not consistent in how it populates various ASH columns.

When solving an "unconstrained PQ" incident where Oracle created thousands of PQ sessions and blew out the Oracle process limit, I noticed all the PQ processes had a similar pattern in ASH program column.

Here is an example of just some of the programs.

select distinct(program)
from   dba_hist_active_sess_history
/

PROGRAM
-----------------------------------
...
oracle@03.acme.com (P100)
oracle@03.acme.com (P520)
oracle@03.acme.com (PD47)
oracle@03.acme.com (LMS0)
oracle@03.acme.com (P211)
oracle@03.acme.com (P660)
oracle@03.acme.com (P261)
JDBC Thin Client
pmdtm.exe
oracle@03.acme.com (P099)
oracle@03.acme.com (P286)
oracle@03.acme.com (P767)
oracle@03.acme.com (PA78)
oracle@03.acme.com (PB25)
WinAutomation.ServiceAgent.exe
oracle@03.acme.com (P425)
...

It turns out, the above "P" programs are PQ slaves! I told you there were a lot of PQ slaves.

My strategy is to filter out all the PQ slaves samples, leaving only the main foreground process samples, with which we will infer the SQL run time. Below is SQL I used.

select count(*), sql_id, to_char(SQL_EXEC_START,'YYYY-MM-DD HH24:MI:SS'), sql_exec_id
from OP_DBA_HIST_ACTV_SESS_HIST
where (program is null or program not like '%(P%)%' ) and sql_id='&mysql_id'
group by sql_id, to_char(SQL_EXEC_START,'YYYY-MM-DD HH24:MI:SS'), sql_exec_id
order by 1

Here is an example for our three SQL_IDs.

SQL> l
  1  select count(*), sql_id, to_char(SQL_EXEC_START,'YYYY-MM-DD HH24:MI:SS'), sql_exec_id
  2  from OP_DBA_HIST_ACTV_SESS_HIST
  3  where (program is null or program not like '%(P%)%' ) and sql_id='&mysql_id'
  4  group by sql_id, to_char(SQL_EXEC_START,'YYYY-MM-DD HH24:MI:SS'), sql_exec_id
  5* order by 1
SQL> /
Enter value for mysql_id: 67k9d0nf4186q
old   3: where (program is null or program not like '%(P%)%' ) and sql_id='&mysql_id'
new   3: where (program is null or program not like '%(P%)%' ) and sql_id='67k9d0nf4186q'

  COUNT(*) SQL_ID        TO_CHAR(SQL_EXEC_ST SQL_EXEC_ID
---------- ------------- ------------------- -----------
         1 67k9d0nf4186q 2019-02-06 20:52:28    16777219

SQL> /
Enter value for mysql_id: 73abvkc03r860
old   3: where (program is null or program not like '%(P%)%' ) and sql_id='&mysql_id'
new   3: where (program is null or program not like '%(P%)%' ) and sql_id='73abvkc03r860'

  COUNT(*) SQL_ID        TO_CHAR(SQL_EXEC_ST SQL_EXEC_ID
---------- ------------- ------------------- -----------
         1 73abvkc03r860 2019-01-07 22:13:58    16777217
         1 73abvkc03r860 2019-01-05 06:07:35    16777218
         1 73abvkc03r860 2019-01-14 22:11:53    16777216

SQL> /
Enter value for mysql_id: 8tjmvqptxpv4v
old   3: where (program is null or program not like '%(P%)%' ) and sql_id='&mysql_id'
new   3: where (program is null or program not like '%(P%)%' ) and sql_id='8tjmvqptxpv4v'

  COUNT(*) SQL_ID        TO_CHAR(SQL_EXEC_ST SQL_EXEC_ID
---------- ------------- ------------------- -----------
         1 8tjmvqptxpv4v 2019-02-06 20:02:16    16777296

Based on the above results, all 3 SQL_IDs and all of their executions result in an inferred run time of around 10 seconds (1 sample X 10s).

So it looks like we have a solid solution, right? Wrong! Read on...

Where Did That SQL Execution Go?

If the program filtering solution looks good... take a closer look at the above results.

SQL_ID 73abvkc03r860 looks strange! When running the database time SQL, we showed five executions but above when we filtered out the PQ sessions we only show three executions!

What this means is, in two of our cases, the foreground process was not sampled, because it was not "active." But we know the SQL was running because "active" PQ sessions for this specific SQL_ID's execution were sampled.

So this PQ filtration strategy has some problems. It is very possible to miss executions when PQ is involved.

Program Column May Not Be Populated

There is yet another problem! And it's a big stink'n problem.

Not all Oracle versions populate the program column with PQ slave details. You can easily check this on your system by running the below SQL.

select distinct program
from   dba_hist_active_sess_history
where  program like '%(P%)%'

Don't be surprised if you receive "no rows returned"! But also, don't be surprised if rows are returned! Yep... a big problem.

This means we need another strategy. Read on...

What Now?

As I demonstrate in this article, inferring database time using ASH samples is very straightforward. However, inferring run time is not simple.

The challenge occurs when Oracle parallel query is involved.

My strategy was to filter out the PQ active sessions using the program column. While this may work, it will not always work.

I am looking for a solution that will likely always work and is simple. My solution failed in this quest.

Did I give up? Don't bet on that!

My second strategy, is the topic of my next post.

All the best in your Oracle tuning and machine learning work,

Craig.

If you have any questions or comments, feel free to email me directly at craig at orapub.com.

What is the real top Oracle Database SQL? Do Oracle Database LGWRs (10g, 11g, 12c) Always Sleep For The Full Three Seconds? A Real Story In A Real Life: Everything Started With A Simple Post