Application/SQL Focus: Log File Switch Checkpoint Incomplete

Posted on 24-Jun-2016 by Craig Shallahamer, craig@orapub.com

This post is about how to find the most important SQL statements in a generally poor-performing Oracle Database system.

The situation: outraged users, with the top Oracle wait event being log file switch (checkpoint incomplete).

In my previous three posts, here, here, and here, we baselined performance, completed an Oracle time-based analysis (including implementing a solution), and analyzed the operating system looking for a bottleneck (there was none).

Our analysis is almost complete.

In this article I will focus on the application system, which in this case means finding the most important SQL statements. The result is usually a list of SQL statements that are either consuming resources that are in short supply or are not effectively using resources that are in abundance!

How I analyzed the application system (think: SQL) and how you can do the same on your system is the focus of this article.

Why Look At the Application System?

Many people in IT will tell you to start and to focus your analysis on the application and its SQL first. Their reasoning is that, in their experience, that is where you are likely to make the most dramatic improvement.

There is a lot of truth to that. Especially if your goal is good performance. But not if you want great performance! If you want to optimize your system then you need to, well... optimize the "system." And the "system" is more than the application and its SQL.

A lot of Oracle DBA performance situations are pretty straightforward. For example, if a user has a performance problem: identify the SQL, trace it, and tune it. Done. But many times, it's not that simple. Being an Oracle performance expert, I rarely find myself in these types of situations.

In my experience, those who focus on the application and its SQL first get really frustrated if they cannot find or tune "the" SQL. What I see happen next is they will typically blame the database or the operating system. Even worse is when I hear someone say the system is running efficiently and stop the analysis, even though the business is suffering.

I have learned that you CAN ALWAYS POINT a finger at Oracle, the operating system and even the application. You can (almost) always get a bigger box, tune Oracle a little more and either tune your SQL again or reduce the application workload.

When performance is generally bad, pervasive, and predictable, while I always talk with the user community first, I never start with the application or SQL analysis. I use my Oracle analysis and my operating system analysis to guide and direct me to the most important SQL.

For example, if the OS is out of CPU resources and there is massive cache buffers chains latch contention, I will find the top CPU-consuming and top buffer-get SQL. And, if a very specific, identified, and painful functional issue is involved, then I find that SQL as well.

So, how do I do this? Read on...

AWR/Statspack Reports Make It Easy

Every Oracle AWR and Statspack report will have a "SQL Statistics" report heading like this.

If you know what you are looking for, then simply click or scroll down. Really, it's usually that simple. (But of course, not always.) For example, if the IO subsystem is not responding as fast as you would like and you want to reduce the physical IO read load the application is placing on the system, then find the top physical IO read SQL. Below is an example.

In most cases, this works great. But not always.

Oracle stores a lot more information about SQL statements than is displayed in an AWR or Statspack report. Just do a describe on the AWR view dba_hist_sqlstat. By mining this data you can find information about modules, programs, executions, number of sorts, child cursors, IO related waits, direct writes, physical writes and their relationships with each other. It is amazing what is available to you.

If the type of SQL you are looking for is not shown in the AWR or Statspack report, you will need to mine the data yourself. Every DBA needs some basic AWR reporting SQL.

For example, let's say there is lots of sorting occurring. So much so that Oracle must create temporary segments and read and write them to/from disk, which puts pressure on the IO subsystem. If you notice in the AWR report SQL categories image above, there is no "SQL ordered by Sorts" section. However, in dba_hist_sqlstat there is the column SORTS_DELTA. There ya go!
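To make this concrete, here is a minimal sketch of the kind of query I mean: the top 10 SQL statements by total sorts across the retained AWR snapshots. The view and column names (dba_hist_sqlstat, SORTS_DELTA, EXECUTIONS_DELTA) are real, but the query itself is my illustration, not from an official report; querying AWR views requires the Diagnostics Pack license.

```sql
-- Illustrative sketch: top 10 SQL by sorts over all retained AWR snapshots.
-- Requires the Oracle Diagnostics Pack license to query dba_hist_* views.
-- Add a dba_hist_snapshot join to restrict this to a specific time window.
select *
from (
  select   sql_id,
           sum(sorts_delta)      total_sorts,
           sum(executions_delta) total_execs
  from     dba_hist_sqlstat
  group by sql_id
  order by sum(sorts_delta) desc
)
where rownum <= 10;
```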

Let The Operating System Lead You

Seriously. If the operating system is out of CPU, then you had better find the top CPU-consuming SQL. If the OS IO subsystem is struggling to respond quickly enough to read requests, then you had better find the top physical read SQL.

It comes down to this: Whatever the OS is struggling to provide, reduce that requirement. Ask less of the resource in short supply, and its utilization will decrease and its response time will improve (decrease).
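As a sketch of "let the OS lead you": if iostat shows the IO subsystem struggling with reads, a query like the one below surfaces the top physical read SQL from AWR. The DISK_READS_DELTA column (blocks read from disk) is a real dba_hist_sqlstat column; the query itself is my illustration and, again, assumes Diagnostics Pack licensing.

```sql
-- Illustrative sketch: top 10 physical-read SQL mined from AWR.
-- disk_reads_delta counts database blocks read from disk per snapshot.
select *
from (
  select   sql_id,
           sum(disk_reads_delta)  total_disk_reads,
           sum(executions_delta)  total_execs
  from     dba_hist_sqlstat
  group by sql_id
  order by sum(disk_reads_delta) desc
)
where rownum <= 10;
```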

It's actually pretty straightforward.

In my last post we focused on the operating system. In our case study we discovered there was plenty of CPU, memory and network capacity available. However, while the IO subsystem was responding OK, it was on the edge of poor responsiveness. When I ran the U/Linux command iostat, it was obviously all about writes, not reads and not a mix... all writes.

This means I'm looking for write-focused SQL statements: possibly DML and possibly big sorting SQL statements. But before I take action, I need to look at what SQL my Oracle time-based analysis is leading me toward. Read on...

Let The Oracle System Lead You

Two posts back I focused on analyzing Oracle by performing a quick Oracle time-based analysis. Referring back to that analysis: over the analysis snapshot interval, 46% of the elapsed time (think: db time) Oracle processes were consuming CPU, and 54% of the time Oracle processes were waiting on a non-idle wait event. About 57% of the non-idle wait time (the "big" chunk) was processes waiting on the log file switch (checkpoint incomplete) wait event.

What does this tell us about the SQL that most needs our attention? For sure the high CPU-consuming SQL, the DML SQL, and especially what happens when commits occur. Let's take these one at a time.

High CPU Consuming SQL

Users are complaining about DML activities and their complaints are not functionally specific. The issue for them is pervasive, that is, system-wide. Plus, there are plenty of CPU resources available on the database server. So, at this point in my analysis, I would not focus on the top CPU SQL.

Big DML SQL

I have never seen an AWR or Statspack report categorize SQL by the amount of block changes, direct writes, or their related physical write IO activity. However, this information is available in dba_hist_sqlstat. So, you need to mine that data yourself.
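A minimal sketch of that mining, ranking SQL by write activity: DIRECT_WRITES_DELTA is a real dba_hist_sqlstat column, and PHYSICAL_WRITE_REQUESTS_DELTA exists in 11.2 and later. The query itself is my illustration of the approach, not an official report section, and assumes Diagnostics Pack licensing.

```sql
-- Illustrative sketch: top 10 write-heavy SQL mined from AWR.
-- physical_write_requests_delta requires Oracle 11.2 or later.
select *
from (
  select   sql_id,
           sum(direct_writes_delta)           total_direct_writes,
           sum(physical_write_requests_delta) total_write_requests
  from     dba_hist_sqlstat
  group by sql_id
  order by sum(physical_write_requests_delta) desc
)
where rownum <= 10;
```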

However, in our particular case study, reconfiguring the redo logs made such a dramatic impact on block change throughput that I would expect performance to be fantastic. If not, then I had better find those big DML statements.

Commit SQL

An Oracle commit does not have a SQL_ID... it's blank. Strangely enough, nearly one year ago I blogged about blank, empty, or null SQL_IDs. But I will say this: if commits are taking a long time or there are rapid, repeated commits all over the application, then log file sync wait time will be one of the top wait events. How do I know that? I did a research project and blogged about it here.

Consolidating The Analysis

It's finally time to consolidate my 3-Circle Analysis: Oracle, operating system, and application/SQL. The result should be a nice list of solutions that not only make technical sense but also "work" in the IT environment and culture.

The Oracle time-based analysis put the spotlight on the misconfigured online redo logs. Then the operating system analysis showed the IO subsystem activity was very write intensive and responding just well enough, but probably on the edge of OK performance. Both the Oracle and the operating system analysis led me to look for DML SQL.

Notice that all three circles make sense as a system and support each other. There is no contradiction or mystery. This is exactly what I want to see.

The order of my solutions will be this: I will begin by reconfiguring the online redo logs, as there was clearly a misconfiguration, it is easy to do, and it requires no downtime.

If that does not return performance to an acceptable level, I will focus on the application DML SQL, but with a twist.

Reducing the number of blocks a SQL statement changes is both unlikely to succeed and problematic. In this situation, I'm hoping the key DML SQL statement is executed repeatedly and frequently. Perhaps I can talk the users into running it less often. You would be surprised how many times this has turned out to be "the solution." Another alternative, though likely very painful, is to redesign the "problem" part of the application.

Since an AWR or Statspack report never seems to categorize SQL by block changes, I'll need to mine dba_hist_sqlstat. However, there are two other highly accurate alternatives for finding "the" SQL.

If I have access to the AWR ASH data (dba_hist_active_sess_history), I can easily show the SQL most highly associated with the log file switch wait event. Personally, I would use my free BloodHound toolkit, which is all about ASH performance analysis.
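A minimal sketch of that ASH approach: count how often each SQL_ID was sampled waiting on the event. The view and its EVENT and SQL_ID columns are real; the query is my illustration (not taken from BloodHound) and assumes Diagnostics Pack licensing. Since ASH samples roughly once per second, the sample counts approximate wait time per statement.

```sql
-- Illustrative sketch: SQL most often sampled waiting on the
-- log file switch (checkpoint incomplete) event, per AWR ASH data.
select *
from (
  select   sql_id,
           count(*) sample_count
  from     dba_hist_active_sess_history
  where    event = 'log file switch (checkpoint incomplete)'
  and      sql_id is not null
  group by sql_id
  order by count(*) desc
)
where rownum <= 10;
```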

Also, if the situation is happening in real-time, besides referencing v$active_session_history, I can query v$session directly looking for SQL that always seems to be waiting on the log file switch wait event. I have a number of tools in my free OSM toolkit that can help.

Oh heck, let's just run this SQL statement to find the answer!

SQL> select sid, sql_id
  2  from   v$session
  3  where  state = 'WAITING'
  4  and    event = 'log file switch (checkpoint incomplete)';

 SID SQL_ID
---- --------------
 531 ajs17jprc8xc0

1 row selected.

Got it! I found the SQL. This SQL statement is the center of my application-focused solution, which both my Oracle and my operating system analyses led me to.

If both the redo log reconfiguration and dealing with the DML SQL do not return performance to the prior level, then I'll work with the IO team to increase IO subsystem write performance.

So, we have completed our analysis, resulting in a clear understanding of the situation and a nice list of solutions that are technically solid. Plus, I outlined the order in which the solutions should be implemented, in a way that is appropriate for the situation.

Thanks for reading and enjoy your work!

Craig.

If you have any questions or comments, feel free to email me directly at craig at orapub.com.
