Oracle Database Performance Analysis: Total Time vs Bits of Time
Our Experience Tells Us So...
We know from hard-fought Oracle Database tuning experience that the more work we push through a system, the more processing time is involved. That's not rocket science... but mathematically representing the relationship between work and time is rocket science.
Decades ago Operations Research queuing theory researchers quantitatively established the relationship between work and time. They put into math what we all have experienced. In our production Oracle Database Administrator world this means that as the system activity increases, then so does Oracle process CPU consumption and eventually Oracle process non-idle wait time.
Find The Big Bar
The classic way to graphically represent all the database related time is by a vertical bar. (Currently in vogue is relating the bar to the number of average active sessions.) We have all seen Oracle performance monitors (some much better than others) representing the database related time over a period of time (e.g., 10 minutes or one hour) as a big vertical bar. Many of us have been told when there is a performance issue, look for the "big bar." Why? Because "the database is busier." And usually this works. But there is a hidden assumption that is not accounted for in this line of thinking which can throw off our analysis.
Just because the bar gets bigger doesn't mean there is a performance problem. It can simply mean the system is processing more work, which requires more CPU resources and perhaps involves more wait time. So it's more than just the "big bar." It's about both the processed work and the associated time.
Little Bits of Time
For each little piece of work processed (think: buffer get) some CPU is consumed and perhaps there is some associated non-idle wait time. An operations research person would call this response time, that is the time it takes to process a single piece of work (think: buffer get). Upon closer inspection, which I won't go into here, the CPU consumed can be represented by what is called service time and the wait time can be represented by what is called queue time. Therefore, response time is queue time plus service time.
Rt = Qt + St = Qt (time/work) + St (time/work) = Qt (wait time/work) + St (cpu time/work)
Summing the Little Bits of Time
The big bar, that is the total time, is the sum of all the little bits of time over some interval, such as 10 minutes or perhaps an hour. This can be represented a number of ways, such as:
Total Time = Work Processed X Time to process one piece of work
Total Time = Work Processed X Response Time
Total Time = Work X Time/Work = Time to process all the work
When you think about it, the big bar is made up of little bits of time all stacked on top of each other. Each time a piece of work is processed the bar will get a little taller. Because each time a piece of work is processed a little CPU is consumed and there may be some associated wait time.
Suppose when each piece of work is processed 10 ms of CPU is consumed along with 5 ms of non-idle wait time. It is logical to assume/predict/anticipate that processing 20 pieces of work will take 300 ms. (300 ms = 20 work pieces * (10 ms/work piece + 5 ms/work piece)) I demonstrated this in my post entitled Anticipated SQL Elapsed Times.
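The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function and parameter names are made up for this example and do not come from any Oracle view or from the tool discussed later.

```python
def response_time_ms(service_ms, queue_ms):
    """Response time for one piece of work: Rt = St + Qt."""
    return service_ms + queue_ms


def total_time_ms(work_pieces, service_ms, queue_ms):
    """Total time: work processed multiplied by the response time."""
    return work_pieces * response_time_ms(service_ms, queue_ms)


# Values from the text: 10 ms CPU and 5 ms non-idle wait per piece, 20 pieces.
rt = response_time_ms(service_ms=10, queue_ms=5)
total = total_time_ms(work_pieces=20, service_ms=10, queue_ms=5)
print(f"Rt = {rt} ms/work, total time = {total} ms")
```

Keeping service time and queue time as separate inputs mirrors how the big bar is later split into a CPU part and a wait part.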
The workload intensity is a rate. It is the amount of work processed over a specific interval of time. For example, if 10500 pieces of work are processed over a 10 minute interval, the average workload intensity is 17.5 pieces of work per second. (17.5 work/sec = 10500 pieces of work / 600 seconds) In the land of queuing this is called the arrival rate. (Technically the arrival rate is what actually arrives into the system, but in an operationally stable system what arrives equals what departs.) By the way, the reason you see the arrival rate represented by the letter L is that it is conventionally shown as the Greek letter lambda, and L is the convenient plain-text stand-in.
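As a quick check of the unit conversion, here is a minimal sketch of the arrival rate calculation. The function name is hypothetical. In practice the work count would come from the difference between two snapshots of some workload statistic, which is an assumption not shown here.

```python
def arrival_rate_per_sec(work_pieces, interval_sec):
    """Average workload intensity (L, or lambda) in pieces of work per second."""
    if interval_sec <= 0:
        raise ValueError("interval must be positive")
    return work_pieces / interval_sec


# Values from the text: 10500 pieces of work over a 10 minute interval.
rate = arrival_rate_per_sec(10500, 10 * 60)
print(f"L = {rate} work/sec")
```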
What's In The Big Bar
Suppose little pieces of work arrive into the system at a rate of 50 per ms, that is 50 work/ms. It follows that after 1000 ms, 50000 pieces of work will have entered the system. Also, assume each piece of work takes 15 ms to process (its response time). Therefore the accumulated service time and non-idle wait time will be 750000 ms. (750000 ms = 1000 ms X 50 work/ms X 15 ms/work) So the "big bar" will be 750000 ms "tall."
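The same calculation as a small Python sketch, with illustrative names. The second call doubles the arrival rate while holding the response time fixed. That is an assumption made only to isolate the effect of more work, and it shows how the bar can grow without each piece of work getting any slower.

```python
def big_bar_ms(arrival_rate_per_ms, interval_ms, response_ms):
    """Total time = interval X arrival rate X response time."""
    work_processed = arrival_rate_per_ms * interval_ms
    return work_processed * response_ms


# Values from the text: 50 work/ms over 1000 ms at 15 ms/work.
bar = big_bar_ms(arrival_rate_per_ms=50, interval_ms=1000, response_ms=15)

# Hypothetical comparison: twice the work at the same 15 ms/work.
bar_doubled = big_bar_ms(arrival_rate_per_ms=100, interval_ms=1000, response_ms=15)

print(f"bar = {bar} ms, bar at double the arrival rate = {bar_doubled} ms")
```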
Relating Work with Time
Decades ago, the queuing theory guys mathematically described the relationship between work and time. When there is a single queue (think: CPU run queue, not run queues) sending requests to many "servers" (think: CPU cores) then response time is represented like this:
Rt = St / ( 1 - (( St * L )/M)^M )
St is service time
L is the workload intensity, called the arrival rate.
M is the number of "servers" processing, that is, servicing the work.
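Here is a minimal Python sketch of the formula above, so you can plug in your own numbers. The inputs used (a service time of 1 time unit and 4 servers) are hypothetical values chosen for illustration, not measurements from a real system.

```python
def response_time(service_time, arrival_rate, servers):
    """Rt = St / (1 - ((St * L) / M) ** M) for a single queue feeding M servers."""
    utilization = (service_time * arrival_rate) / servers
    if not 0 <= utilization < 1:
        # At or beyond 100% utilization the formula no longer applies.
        raise ValueError("arrival rate must keep utilization in [0, 1)")
    return service_time / (1 - utilization ** servers)


# Hypothetical inputs: St = 1 time unit per piece of work, M = 4 servers.
ST, M = 1.0, 4
curve = [(L, response_time(ST, L, M)) for L in (0.5, 1.0, 2.0, 3.0, 3.5, 3.9)]
for L, rt in curve:
    print(f"L={L:4.2f}  Rt={rt:7.3f}")
```

At low arrival rates the response time stays close to the service time. As utilization approaches 1 the denominator shrinks and the response time climbs steeply.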
Let's relate the time to process a single piece of work (Rt) to the total time (big bar). Suppose over a 10 minute snapshot interval 10500 pieces of work (e.g., logical IOs) were processed and it took 15 ms to process each piece of work. Therefore, the total time (big bar height) would be:
Total Time = Work Processed X Rt = 10500 lio X 15 ms/lio = 157500 ms = 157.500 sec
So the height of the "big bar" for this 10 minute interval would be about 158 seconds "tall."
Seeing is Believing
Math is great, but pictures are better! If what I have written is correct and the math correct, I should be able to show the Total Time changing as the Work Processed changes. And wouldn't it be great if there was an animated tool so I can visually see and experience this relationship? Well... that's what I did. I created a relatively simple tool to visually show the relationship between work and time.
You can download this free tool from the OraPub.com web site. Just do an OraPub search for "time tool". Below is a picture of the tool as it was initially released (it has since been updated).
Looking at the figure above, the chart on the left is a classic response time curve. The plotted point is the time it takes to process a single piece of work at the associated arrival rate. The curve represents the response time path the point will take as the arrival rate changes. We know in our gut that as the arrival rate increases (point moves from left to right) eventually the system will not be able to respond as quickly... and this is when the point enters into the "elbow" of the curve.
Technically what is occurring is while the service time is constant, the queue time begins to increase. Think of it like this: If it takes 3 minutes to cook a burger (i.e., service time), it will still take 3 minutes to cook a burger regardless of how many people are waiting (i.e., queuing) to get a burger.
Looking at the figure above, the chart on the right is the total time chart for a single time interval, which could be, for example, 10 ms or 1 hour. (However in the tool, I set the time interval to one unit of arrival rate time.) What is important to understand is that as the arrival rate increases more CPU is consumed (the green bar gets bigger)... even though the CPU required to service a single piece of work remains constant! (more below) Eventually the system can't process all the arrivals immediately and the pieces of work begin to wait, that is queue. This is when we start seeing the red appear in the bar. This is also when the response time begins to increase.
Looking at the "Big Bar"
What I think is really interesting is while the service time (think: time to cook a burger) does not change regardless of the workload, the total service time (think: time to cook lots of burgers) increases as the workload increases. This is why the green bar continues to increase as the arrival rate increases.
The three figures below show the situation at three different arrival rates: 3.25, 3.50, and 3.75. If you look closely as the arrival rate increases the green bar does in fact get bigger, just not that much.
Looking at and comparing the three above figures, notice once the point enters the elbow of the response time curve the big bar's red component begins to rapidly increase! While the green continues to increase, it does not increase nearly as much as the red bar! This is a graphical way to communicate to others why you do not want your systems operating in the elbow of the curve; relatively slow and highly variable response times mean potentially very slow and inconsistent SQL elapsed times and batch job durations.
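To put rough numbers on the green and red parts of the bar, here is a small Python sketch using the response time formula Rt = St / (1 - ((St * L)/M)^M) from earlier. The three arrival rates come from the figures, but the service time (1 time unit) and the number of servers (4) are assumptions made for this example. They are not necessarily the settings used in the tool, so the output will not match the figures exactly.

```python
def bar_components(service_time, arrival_rate, servers, interval=1.0):
    """Split the big bar into CPU (green) and wait (red) time for one interval."""
    utilization = (service_time * arrival_rate) / servers
    if not 0 <= utilization < 1:
        raise ValueError("arrival rate must keep utilization in [0, 1)")
    rt = service_time / (1 - utilization ** servers)
    work = arrival_rate * interval
    cpu_time = work * service_time            # green: total service time
    wait_time = work * (rt - service_time)    # red: total queue time
    return cpu_time, wait_time


# Assumed settings (hypothetical): St = 1 time unit, M = 4 servers.
ST, M = 1.0, 4
results = {L: bar_components(ST, L, M) for L in (3.25, 3.50, 3.75)}
for L, (cpu, wait) in results.items():
    print(f"L={L:4.2f}  green(cpu)={cpu:6.2f}  red(wait)={wait:6.2f}")
```

Under these assumptions the green part grows in direct proportion to the arrival rate, while the red part grows several times over across the same range, which is the elbow effect described above.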
Creating a number of pictures is great, but it's awesome to watch the change as it happens. If you download this tool (URL at top of this section) you can change the snapshot interval and the arrival rate, and watch the situation change before your eyes. I created a short video using the 1b version of the tool, adding some comments along the way, and then uploaded it to YouTube. I also posted the video in this blog below.
Pretty cool, eh? I love seeing what I have experienced and know in my gut is backed up with mathematics. That gives me so much satisfaction!
There are a number of take-aways that we can apply to our work and to reinforce what we have personally experienced.
- There is a relationship between work and time.
- As the workload increases, so does the total time to process the work.
- Operations Research queuing theory mathematically describes the relationship between work and time.
- Response time is the time to process a single piece of work.
- The response time curve is a graphical way to represent the relationship between workload intensity (arrival rate) and the time to process a single piece of work (response time).
- Total time is the time to process many pieces of work, over a specified interval.
- The "big bar" chart is a great way to visually see the total time and its components.
- When in the "elbow" of the response time curve, small changes in the arrival rate result in big changes in the response time, that is, erratic response time.
There are many, many more take-aways but that's good enough for now!
Thanks for reading!
Craig Shallahamer is a long time Oracle DBA who specializes in predictive analytics, machine learning and Oracle performance tuning. Craig is a performance researcher and blogger, consultant, author of two books, an enthusiastic conference speaker, a passionate teacher and an Oracle ACE Director.
If you have any questions or comments, feel free to email me directly at craig at orapub.com.