How To See Unseen Activity Using ASH And SQL*Net Message From Client

Posted on 15-Feb-2018 by Craig Shallahamer / OraPub / craig@orapub.com

Seeing into the unseen Oracle world is a skill performance focused Oracle DBAs need to master. When creatively used, the Oracle wait event, SQL*Net message from client gives us this perspective. This article shows you a very elegant diagnostic technique that will expand your existing ASH analysis skills.

While creatively using this wait event enhances your tuning prowess, it can also clutter and destroy your analysis. So, it is important to know what the event SQL*Net message from client means, how Oracle Corporation deals with it, how to know when to creatively use it and when to not use it.

Answering these questions and exploring this unseen world, is what this article is all about.

Frustrated Colleague

Yesterday I received an email from a super frustrated colleague/member/friend/"Sam." He was dealing with a 120 step COBOL batch process that was taking way too long. No big deal right?

To him it was a very big deal because he could not find a trace of the process in his AWR reports or his ASH analysis. We emailed back and forth a few times. He can now prove the root problem resides in the COBOL program not with the Oracle database or the operating system. Very cool!

How can he prove this? Read on.

Key Words

In Sam's initial email a few words immediately caught my attention; 120 step, batch, COBOL, too long, not showing in AWR, not showing in ASH.

Putting the key words togethers creates a plausible story. Let's start with COBOL.

Key Word: COBOL

COBOL was built for sequential row-by-row processing, not for set processing. This means there is a really good chance most of the data processing and number crunching is being done in the COBOL program, not in Oracle.

The COBOL program does not run as a Oracle foreground process. The COBOL program is a client process that probably does not reside on the Oracle database server and must therefore communicate with an Oracle foreground process via SQLNet.

Key Word: Client

This is fundamental Oracle stuff... but really good stuff! Looking below, sqlplus is the client process and oracleprod50 is the foreground process.

ps -eaf | grep oracle
...
oracle   50248 50247  0 10:24 pts/1    00:00:00 sqlplus               
oracle   50249 50248 80 10:24 ?        00:01:06 oracleprod50 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
...

Oracle's time model and ASH sampling model look at the world from a foreground process perspective, NOT a client process perspective. If the foreground process is idle yet the client process is crunching away solving how to beat lightspeed, we see no database time associated with the foreground process.

If a foreground process is not consuming CPU, it must be waiting. If it is waiting then it must have an associated wait event. The wait event could be an idle or non-idle event. But there will be for sure 100% be an associated wait event.

This means this mystery likely is associated with a non-idle wait event.

Key Word: Active

ASH is about analyzing "active" sessions, which are usually Oracle foreground sessions. Below is a snippet from my ASH "timeline" report contained within my ASH ScratchPad Toolkit.

The session IDs (SID) below are associated with "active" Oracle foreground sessions. They are deemed by Oracle Corporation to be "active" because they are either "ON CPU" or "WAITING" on a non-idle wait event.

SQL> /
 SAMPLE_ID DT              SID SESSION_STATE EVENT                     SQL_ID          MODULE
---------- ------------- ----- ------------- ------------------------- --------------- ---------------
  35917124 12 10:30:59      86 ON CPU                                  5zc9g0hy3gvxg   SQL*Plus
  35917124 12 10:30:59      88 ON CPU                                  9csay8aj87gyh   SQL*Plus
  35917124 12 10:30:59     247 ON CPU                                  dt28kdqab28xx   SQL*Plus
  35917124 12 10:30:59     330 ON CPU                                  asbs9fmbj6b6b   SQL*Plus
  35917124 12 10:30:59     405 WAITING       latch: shared pool                        SQL*Plus
  35917124 12 10:30:59     406 ON CPU                                  2ck2uhxkdtg0h   SQL*Plus
  35917125 12 10:31:00      86 ON CPU                                  az5pch9r972w3   SQL*Plus
  35917125 12 10:31:00      88 ON CPU                                  489xrz08h1s4x   SQL*Plus
  35917125 12 10:31:00     166 ON CPU                                  9yym9vyaz2mrm   SQL*Plus
  35917125 12 10:31:00     247 ON CPU                                  65cdyf380a8h0   SQL*Plus
  35917125 12 10:31:00     330 ON CPU                                  74mrpbk2mr2nn   SQL*Plus
  35917125 12 10:31:00     405 ON CPU                                  gpr2s8kdrf3nx   SQL*Plus
  35917126 12 10:31:01      86 WAITING       library cache: mutex X    4tm49gtqds7cp   SQL*Plus
  35917126 12 10:31:01      88 WAITING       library cache: mutex X    byjax9kg0q01g   SQL*Plus
  35917126 12 10:31:01     166 ON CPU                                  9gyv3mvj2tgv1   SQL*Plus
  35917126 12 10:31:01     247 ON CPU                                  66ywd7rdrt1xg   SQL*Plus
  35917126 12 10:31:01     330 ON CPU                                  fqymsj0z6351j   SQL*Plus

But what if you are interested in session 325? While we may see session 325 in v$session, if it is not "active" ASH will not sample it and therefore we will not see it. Yeah, that's a big bummer. But thankfully that's not the end of the story. Read on...

Key Word: AWR

An AWR report is about understanding performance from an instance perspective. I like to call this the, "ocean of activity." Analyzing performance based on an AWR report is about understanding the "ocean currents" and the "big fish" swimming in the ocean.... but I digress.

Like with ASH analysis, if the foreground process is not consuming CPU or waiting on a non-idle wait event, we won't see its activity in an AWR report. Even if the process is being impacted by other sessions or programs, we may not see the result.

Back to my friend's situation.

If the COBOL program is doing all the number crunching and data processing and not getting the foreground process involved that much, we will see very little foreground process activity... even though the process takes "forever" and is really irritating people!

What we need is eyes to see into the unseen. Read on.

How To See Into The Unseen

There is a little known ASH parameter that forces ASH to collect both active and non-active sessions. The parameter is _ash_sample_all. The default is FALSE.

Why FALSE? Because ASH is built to be crazy lightweight, like an airplane's black box, with no latches or locks and virtually undetectable from a resource consumption perspective. And, it should always be collecting data unless the instance is not running... or gulp... crashes.

One of the reasons ASH is so lightweight is because it does not sample all Oracle sessions... only the active sessions are sampled. And the reason AWR does not record session level details is because it's a management nightmare... but that's a topic for another time.

Run the below query on one of your production boxes and you'll see something like this:

SQL> select status, count(*) from v$session group by status;

STATUS     COUNT(*)
-------- ----------
ACTIVE          120
INACTIVE       1545

2 rows selected.

Based on the above results, ASH should be showing 120 rows. If ASH collected both active and non-active sessions, each sample period would contain not 120 rows but 1665 rows... each and every second! This is one reason why ASH does not collect inactive rows... the data management could become a nightmare.

Let me show you something very cool that you can do on one of your active but NON-PRODUCTION servers.

$ cat x.sql

set echo on feedback on

select count(*)
from   v$active_session_history 
where  sample_id = (select max(sample_id) from v$active_session_history);

alter system set "_ash_sample_all"=true;

exec dbms_lock.sleep(5);

select count(*)
from   v$active_session_history 
where  sample_id = (select max(sample_id) from v$active_session_history);

alter system set "_ash_sample_all"=false;

exec dbms_lock.sleep(5);

select count(*)
from   v$active_session_history 
where  sample_id = (select max(sample_id) from v$active_session_history);

Here's the output on my sandbox system.

SQL> @x
SQL> 
SQL> set echo on feedback on
SQL> 
SQL> select count(*)
  2  from   v$active_session_history
  3  where  sample_id = (select max(sample_id) from v$active_session_history);

  COUNT(*)
----------
	 5

1 row selected.

SQL> 
SQL> alter system set "_ash_sample_all"=true;

System altered.

SQL> 
SQL> exec dbms_lock.sleep(5);

PL/SQL procedure successfully completed.

SQL> 
SQL> select count(*)
  2  from   v$active_session_history
  3  where  sample_id = (select max(sample_id) from v$active_session_history);

  COUNT(*)
----------
	48

1 row selected.

SQL> 
SQL> alter system set "_ash_sample_all"=false;

System altered.

SQL> 
SQL> exec dbms_lock.sleep(5);

PL/SQL procedure successfully completed.

SQL> 
SQL> select count(*)
  2  from   v$active_session_history
  3  where  sample_id = (select max(sample_id) from v$active_session_history);

  COUNT(*)
----------
	 6

1 row selected.

As we expected, by enabling ASH to sample both active and inactive sessions, we see many more ASH samples being collected.

I suspect the sixth session is my session running the above script!

Now lets use this knowledge to our advantage.

Deeper Into The Unseen

ASH collects a lot of details about sessions, such as the client identifier, module or action. Let's use this to our advantage!

In another SQLPLUS session I simply connected, entered the below statement to set the client identifier to "orapub" and then did nothing!

EXEC DBMS_SESSION.set_identifier('orapub');

Now let's peer into the unseen world of the above "inactive" Oracle session that is doing "nothing." To do this, in another SQLPLUS session I do this:

SQL> l
  1  select event, count(*)
  2  from   v$active_session_history
  3  where  client_id = 'orapub'
  4   and   sample_time >= current_timestamp - interval '1' minute
  5* group by event
SQL> save t1 replace
Wrote file t1.sql
SQL> get t1
  1  select event, count(*)
  2  from   v$active_session_history
  3  where  client_id = 'orapub'
  4   and   sample_time >= current_timestamp - interval '1' minute
  5* group by event
SQL> /

no rows selected

No rows are returned because the tagged "orapub" session is not active. Remember, the "orapub" session is connected but at the SQLPLUS command line prompt. Now comes the secret sauce...

SQL> alter system set "_ash_sample_all"=true;

System altered.

SQL> get t1
  1  select event, count(*)
  2  from   v$active_session_history
  3  where  client_id = 'orapub'
  4   and   sample_time >= current_timestamp - interval '1' minute
  5* group by event
SQL> /

EVENT		                                COUNT(*)
------------------------------------------- ----------
SQL*Net message from client                         10

1 row selected.

SQL> /

EVENT		                                COUNT(*)
------------------------------------------- ----------
SQL*Net message from client                         11

1 row selected.

SQL> /

EVENT		                                COUNT(*)
------------------------------------------- ----------
SQL*Net message from client                         13

1 row selected.

What we see above is the wait time for our idle and in-active Oracle session. This happens a lot in active Oracle systems, but we don't bother to look. This is why most DBAs are not trained to use this event and other idle events to their advantage.

You may be wondering why most DBAs don't learn about this? Well, there are a couple of reasons. First, its because usually this idle time is user think time, lunch break time or whatever. So, it's not useful for most performance analysis situations. Second, most DBAs analyze performance using OEM, AWR and don't write scripts where they can flexibly adjust to unusual situations. Yep... not a great career move.

The bottom line in the above is, by enabling all sessions to be sampled by ASH, I can now see the unseeable activity of my non-active "orapub" session using standard ASH scripts.

I should probably flip ASH back to normal to not sample all Oracle sessions. But watch what happens.

SQL> alter system set "_ash_sample_all"=false;

System altered.

SQL> get t1
  1  select event, count(*)
  2  from   v$active_session_history
  3  where  client_id = 'orapub'
  4   and   sample_time >= current_timestamp - interval '1' minute
  5* group by event
SQL> /

EVENT		                                  COUNT(*)
------------------------------------------- ----------
SQL*Net message from client                         19

1 row selected.

SQL> /

EVENT		                                  COUNT(*)
------------------------------------------- ----------
SQL*Net message from client                         19

1 row selected.

SQL> /

EVENT		                                  COUNT(*)
------------------------------------------- ----------
SQL*Net message from client                         19

1 row selected.

This is important. Why the 19 SQL*Net message from client count? It's because ASH collected 19 samples for the inactive "orapub" session. Just because we instructed ASH to no longer sample all sessions, those 19 sample rows are not deleted! That is fantastic because we now do our analysis with NO concern that ASH is sampling all Oracle sessions.

In the above, you can see how by enabling all sessions to be sampled by ASH, I can now see the unseeable activity of my non-active "orapub" session!

Sifting Through The ASHes

Once we know we can see non-active Oracle sessions, we can continue with our standard ASH analysis as usual. The below SQL is from my ASH ScratchPad, which you can download HERE.

Below I am "time lining" all active and non-active Oracle sessions with a client_id of "orapub". Looking closely and you can see session 325 is "stuck" on wait event, SQL*Net message from client. In other words, session 325 is idle.

def datasource=v$active_session_history
def timingdetails="sample_time >= current_timestamp - interval '5' minute" 
def ashWhere="client_id='orapub'"
def dbahistdetails=' and 1=1'

set linesize 300
select sample_id, to_char(sample_time,'DD HH24:MI:SS') DT,
       session_id sid, session_state,
       &ashCols
from   &datasource
where  &timingdetails
       &dbahistdetails
  and  &ashWhere
order by 1,3
/

SQL>   2    3    4    5    6    7    8    9  old   3:        &ashCols
new   3:        event,sql_id,module
old   4: from   &datasource
new   4: from   v$active_session_history
old   5: where  &timingdetails
new   5: where  sample_time >= current_timestamp - interval '5' minute
old   6:        &dbahistdetails
new   6:         and 1=1
old   7:   and  &ashWhere
new   7:   and  client_id='orapub'

 SAMPLE_ID DT            SID SESSION_STATE EVENT                     SQL_ID          MODULE
---------- ----------- ----- ------------- ------------------------- --------------- ---------------
  35921630 12 11:46:06   325 WAITING       SQL*Net message from clie                 SQL*Plus
  35921631 12 11:46:07   325 WAITING       SQL*Net message from clie                 SQL*Plus
  35921632 12 11:46:08   325 WAITING       SQL*Net message from clie                 SQL*Plus
  35921633 12 11:46:09   325 WAITING       SQL*Net message from clie                 SQL*Plus
  35921634 12 11:46:10   325 WAITING       SQL*Net message from clie                 SQL*Plus
  35921635 12 11:46:11   325 WAITING       SQL*Net message from clie                 SQL*Plus
  35921636 12 11:46:12   325 WAITING       SQL*Net message from clie                 SQL*Plus
  35921637 12 11:46:13   325 WAITING       SQL*Net message from clie                 SQL*Plus
  35921638 12 11:46:14   325 WAITING       SQL*Net message from clie                 SQL*Plus
  35921639 12 11:46:15   325 WAITING       SQL*Net message from clie                 SQL*Plus
  35921640 12 11:46:16   325 WAITING       SQL*Net message from clie                 SQL*Plus
...

We know that session 325 is associated with the super slow batch process. But how can the Oracle foreground process be idle yet the batch process not completing?! The answer is, because the batch process work is being done by the COBOL Oracle client process, not the Oracle foreground process!

If the above paragraph does not make sense, please stop and review it.

Just for fun, let's summarize or profile all "orapub" sessions using my ASH ScratchPad "profile" script.

SQL> def ashcolkey=client_id
SQL> def ashcolval='orapub'


SQL> l
  1  select totAS,
  2         oncpu/(oncpu+waiting) cpuPct,
  3         waiting/(oncpu+waiting) waitingPct,
  4         (userIO+sysIO)/waiting ioPct,
  5         (waiting-userIO-sysIO)/waiting otherPct
  6  from
  7  (
  8    select
  9      count(*) totAS,
 10      sum(decode(session_state,'ON CPU',1,0)) oncpu,
 11      sum(decode(session_state,'WAITING',1,0)) waiting,
 12      sum(decode(session_state,'WAITING',decode(wait_class,'User I/O',1,0))) userIO,
 13      sum(decode(session_state,'WAITING',decode(wait_class,'System I/O',1,0))) sysIO
 14    from   &datasource
 15    where  &timingdetails
 16           &dbahistdetails
 17      and  &ashcolkey like '&ashcolval%'
 18* )
SQL> /

     TOTAS     CPUPCT WAITINGPCT      IOPCT   OTHERPCT
---------- ---------- ---------- ---------- ----------
       300          0          1          0          1

1 row selected.

SQL> select count(*), event
from   &datasource
where  &timingdetails
       &dbahistdetails
  and  session_state = 'WAITING'
  and  &ashcolkey like '&ashcolval%'
group by event
order by 1 desc
/  2    3    4    5    6    7    8    9  

  COUNT(*) EVENT
---------- -------------------------
       300 SQL*Net message from clie

1 row selected.

The top part of the above output is a summary of all the filtered sessions. That is, client_id "orapub" sessions. So, over the last five minutes ASH has collected 300 samples of "orapub" sessions. Of those 300 samples, 0 percent are "ON CPU" and 100% are "WAITING." And of those samples waiting, 0 percent are waiting on IO events and 100 percent of the samples are waiting on some non-IO (i.e., "other") event.

The lower part of the above output lists the event details. Think of this as a "Top 5 Events" report. As you can see, all the 300 ASH "WAITING" samples show the sessions were waiting on the event, SQL*Net message from client.

Sampling ALL Sessions Is Unusual

Normally, we would never ever see something like this... because the SQL*Net event is an idle event and therefore ASH would not sample the session. But we wanted to see into the unseen. To do this, we altered a hidden ASH parameter to sample both active and non-active sessions.

Be warned, that your ASH activity will increase. How much of an increase will require a little math. But as you can see in this situation, it was worth the effort.

Back To The Users

By enabling ASH to sample all foreground and background sessions, we saw the above profile and timeline scripts clearly showing the "orapub" foreground sessions are doing no work and waiting patiently for, perhaps a 30 year old COBOL process to do its work.

Meanwhile the users are upset. This means, the focus now needs to turn to the COBOL process and also looking at the OS to ensure there is not a raging OS bottleneck effecting the COBOL process.

So, when that application programer looks at you and says, "Why should we focus on the COBOL process and not Oracle?", you can say "Because the Oracle process is waiting for COBOL to give it something to do!"

And now you can prove it!

All the best in your Oracle tuning work,

Craig.

If you have any questions or comments, feel free to email me directly at craig at orapub.com.

Top Five Holiday Gifts For The Oracle DBA Oracle Database IO Read Wait Occurrence Mismath - Part 1 What I Do In My Spare Time...