Watch Oracle Database Elapsed Time And Wall Time With Parallel Query
In my recent postings I wrote that when using the Oracle Database parallel query a SQL statement's wall time should be equal to its elapsed time divided by the number of parallel query slaves plus some overhead.
That may seem correct, but is it really true? To check I ran an experiment and posted the results here. The results are both obvious and illuminating.
If you don't want to read but just sit on the couch, have a beer and watch TV you're in luck! I took a clip from my Tuning Oracle Using An AWR Report online video seminar put it on youtube. Here's the clip.
The Math, For Review Purposes
In my previous recent postings I detailed the key time parameters; DB Time, DB CPU, non-idle wait time, elapsed time, parallelism and effective parallelism. To save you some clicking, the key parameters and their relationships are shown below.
DB Time = DB CPU + NIWT
Elapsed Time = Sum of DB Time
Wall Time = ( Elapsed Time / Parallelism ) + Parallelism Overhead
Wall Time = Elapsed Time / Effective Parallelism
Test Results: When Oracle Parallel Query was NOT involved.
If you want to see my notes, snippets, etc. they can be found in this text file HERE.
Here is the non-parallel SQL statement.
select /*+ FULL(big2) NOPARALLEL (big2) */ count(*) into i_var from big2 where rownum < 9000000
When the SQL statement was running, I was monitoring the session using my Realtime Session Sampler OSM tool, rss.sql. Since I knew the server process session ID and wanted to sample every second and wanted to see everything just for this session, this is the rss.sql syntax:
SQL>@rss.sql 16 16 827 827 % % 1
For details on any OSM tool syntax, run the OSM menu script, osmi.sql. You can download my OSM Toolkit HERE
The rss.sql tool output is written to a text file, which I was doing a "tail -f" on. Here is a very small snippet of the output. The columns are sample number, sample time, session SID, session serial#, Oracle username, CPU or WAIT, SQL_ID, OraPub wait category, wait event, [p1,p2,p3].
We can see the session is consuming CPU and waiting. When waiting, the wait event is "direct path read", which is asynchronous (we hope) block read requests to the IO subsystem that will NOT be buffered in the Oracle buffer cache.
Now for the timing results, which are shown in the below table. I took five samples. It's VERY important to know that the wait time (WAIT_TIME_S), DB CPU (DB_CPU_S), and DB Time (DB_TIME_S) values are related to ONLY server process SID 16. In blazing contrast, the wall time (WALL_S), elapsed time (EL_VSQL_S), and SQL statement CPU consumption (CPU_VSQL_S) is related the entire SQL_ID statement execution.
Here are the "no parallel" experimental results.
SQL> select * from op_results; SAMPLE_NO WALL_S EL_VSQL_S CPU_VSQL_S WAIT_TIME_S DB_CPU_S DB_TIME_S ---------- ---------- ---------- ---------- ----------- ---------- ---------- 1 35.480252 35.470015 9.764407 24.97 9.428506 34.152294 2 35.670021 35.659748 9.778554 25.15 9.774984 35.541861 3 35.749926 35.739473 9.774375 25.12 9.31266 34.126285 4 35.868076 35.857752 9.772321 25.32 9.345398 34.273479 5 36.193062 36.18378 9.712962 25.46 9.548465 35.499693
Let's check the math. For simplicity and clarity, please allow me to round and use only sample 5.
DB_TIME_S = DB_CPU_S + WAIT_TIME_S 35.5 = 9.5 + 25.5 = 35.0
The DB Time is pretty close (35.5 vs 35.0). Close enough to demonstrate the time statistic relationships.
Elapsed Time (EL_VSQL_S) = DB_TIME_S 35.5 = 34.2
The Elapsed Time is off by around 4% (35.5 vs 34.2), but still closely to demonstrate the time statistic relationships.
Wall Time (WALL_S) = Elapsed Time (EL_VSQL_S) / Effective Parallelism 35.5 = 35.5 / 1
Nice! The Wall Time results matched perfectly. (35.5 vs 35.5)
To summarize in a non parallel query (i.e., single server process) situation, the time math results are what we expected! (and hoped for)
Test Results: When Oracle Parallel Query WAS involved.
The only difference in the "non parallel" SQL statement above and the SQL statement below is the parallel hint. Below is the "parallel" SQL statement.
select /*+ FULL(big2) PARALLEL(big2,3) */ count(*) into i_var from big2 where rownum<9000000>
When the "parallel" SQL statement was running, because Oracle parallel query was involved resulting in multiple related Oracle sessions, when monitoring using my rss.sql tool, I need to open the session ID (and serial#) to include all sessions. I still sampled every second. Here is the rss.sql syntax:
SQL>@rss.sql 0 9999 0 9999 % % 1
The tool output is written to a text file, which I was doing a "tail -f" on. Here is a very small snippet of the output. I manually inserted the blank lines to make it easier to see the different sample periods.
There is only one SQL statement being run on this idle test system. And because there is no DML involved, we don't see much background process activity. If you look closely above, sessions 168 (see third column) must be a log write process because the wait event is "log file parallel write". I checked and session 6 is a background process as well.
It's no surprise to typically see only four session involved. One session is the parallel query coordinator and the three parallel query slaves! Interestingly, the main server process session that I executed the query from is session number 16. It never showed up in any of my samples! I suspect it was "waiting" on an idle wait event and I'm only showing processes consuming CPU or waiting on a non-idle wait event. Very cool.
Now for the timing results. I took five samples. Again, it's VERY important to know that the wait time (WAIT_TIME_S), DB CPU (DB_CPU_S), and DB Time (DB_TIME_S) values are related to ONLY calling server process, which in this case is session 16. In blazing contrast, the wall time (WALL_S), elapsed time (EL_VSQL_S), and SQL statement CPU consumption (CPU_VSQL_S) is related the entire SQL statement execution.
Here are the "parallel" experimental results.
SQL> select * from op_results; SAMPLE_NO WALL_S EL_VSQL_S CPU_VSQL_S WAIT_TIME_S DB_CPU_S DB_TIME_S ---------- ---------- ---------- ---------- ----------- ---------- ---------- 1 46.305951 132.174453 19.53818 .01 4.069579 4.664083 2 46.982111 132.797536 19.371063 .02 3.809439 4.959602 3 47.79761 134.338069 19.739735 .02 4.170921 4.555491 4 45.97324 131.809249 19.397557 .01 3.790226 4.159572 5 46.053922 131.765983 19.754143 .01 4.062703 4.461175
Let's check the math. So simplicity and clarity, please allow me to round and use sample 5.
DB_TIME_S = DB_CPU_S + WAIT_TIME_S 4.5 = 4.1 + 0
The DB Time shown above is kind of close... 10% off. (4.5 vs 4.1) But there is for sure timing error in my collection sript. I take the position, this is close enough to demonstrate the time statistic relationships. Now look below.
Elapsed Time (EL_VSQL_S) = DB_TIME_S 131.7 != 4.5
Woah! What happened here? (131.7 vs 4.5) Actually, everything is OK (so far aways) because the DB Time is related to the session (Session ID 16), whereas the elapsed time is ALL the DB Time for ALL the processes involved in the SQL statement. Since parallel query is involved, resulting in four additional sessions (1 coordinator, 3 slaves) we would expect the elapsed time to be greater than the DB Time. Now let's look at the wall time.
Wall Time (WALL_S) = ( Elapsed Time (EL_VSQL_S) / Parallelism ) + overhead 46.1 = ( 131.8 / 3 ) + 2.2
Nice! Clearly the effective parallelism is greater than 3 because there is some overhead (2.2). But the numbers makes sense because:
1. The wall time is less than the elapsed time because parallel query is involved.
2. The wall time is close to the elapsed time divided by the parallelism. And we can even see the parallelism overhead.
So it looks like our time math is correct!
Reality And The AWR Or Statspack Report
This is really important. In the SQL Statement section of any AWR or Statspack Report, you will see the total elapsed time over the snapshot interval and perhaps the average SQL ID elapsed time per execution. So what is the wall time? What are users experiencing? The short answer is, we do NOT have enough information.
To know the wall time, we need to know the parallelism situation. If you are NOT using parallel query, than based on the time math demonstrated above, the elapsed time per execution will be close to what the user experiencing (unless there is an issue outside of Oracle). However, if parallelism is involved, you can expect the wall time (i.e, user's experience) to be much less than the elapsed time per execution shown in the AWR or Statspack report.
Another way of looking at this is: If a user is reporting a query is taking 10 seconds, but the average elapsed time is showing as as 60 seconds, parallel query is probably involved. Also, as I mentioned above, never forget the average value is not always the typical value. (More? Check out my video seminar entitled, Using Skewed Data To Your Advantage.)
Thanks for reading!
If you have any questions or comments, feel free to email me directly at craig at orapub.com.
|IOUG 2016 Is Different: Here's How To Get Your Abstract Accepted||Oracle Database Buffer Cache Visualization And Tool||Why Tuning Oracle Database Works And Modeling It|