Does Table Column Order Affect UPDATE Performance?
This is being significantly updated... stay tuned.
Both you and I have heard this before, "When creating an Oracle table, place the most commonly accessed column at the beginning of the table CREATE statement because performance will be better."
My previous post addressed SELECT statements. It was pretty obvious, that starting with the second column performance began to immediately degrade. By the 100th column, SELECT statement elapsed time increased three times!
But what about UPDATE statements? The answer to that question is the focus of this article. In my next article, I will compare and contrast the SELECT and UPDATE results and also make some inferences regarding why the performance changes so dramatically and how to apply this in your DBA work.
How I Designed The Experiment
With an experiment like this, it is easy to get overwhelmed with all the possible experimental variations. I decided on one variation with the objective of learning if there is a statistically significant performance change. That's it. Not how much of a difference or not at what column number does performance significantly change. And I'm not trying to determine if the column type makes a difference or if the column length make a difference. I just want to know two things about my experiment:
- Does column order make a difference?
- How can I apply the results to my work?
That is it. Otherwise, I would never get the experiment completed and posted.
However, I have made everything available to you. This way, you can alter the experiment and do some research yourself.
This is what I chose for this experiment focused on UPDATE statements.
- 5 million row table
- 180 columns
- Alternating column types; number and varchar2
- Numeric columns contain a uniformly distributed random number with a length from 1 to 20 digits.
- Varchar2 columns contain uniformly distributed random printable character strings with lengths from 1 to 20 characters.
- Gathered 35 samples for each column, but did not use the first four to reduce the impact of any potential buffering.
- The posted results contain the final 31 samples averaged for each column accessed.
Below is the UPDATE PL/SQL that I dynamically generated and executed 34 times, for each of the 180 columns. The key is I wanted to ensure only the specific column (1 to 180) was specifically named in the SQL statement. The column number is the variable column_loop_v. For details, download the Analysis Pack from the link below.
the_sql_v := 'update colorder set c'|| column_loop_v ||' = '|| equals_v ||' where rownum between '|| row_low_v ||' and ' || row_high_v ;
I created an "analysis pack" you can download HERE containing the:
- Experiment script, so you can see it, run it, and modify and run it yourself.
- Plotting SQLPLUS script, which pulls the experimental results and formats them into an R script, which you can then simply copy and paste into R to create the chart you see below.
- Plot you see below.
- Table export (COLORDER_RESULTS_UPD) of the experimental results, so you can crunch the numbers yourself.
- Histograms I present below and the ones I do not discuss.
- Statistical analysis step-by-step, using R, that I summarize below.
Column Order Does Make A Difference
The picture says it all... it makes a difference. And you'll forgive me if I say, "A big difference." In my specific experiment, accessing the 100th column takes about 2.8 times (i.e., 6050/2128=2.84) longer compared to the first column.
Clearly and immediately, performance degrades as we UPDATE a column other than the first column. But to satisfy any naysayers, below you can find the statistical significance test between column one and column two, plus other sweet details.
Statistical Analysis Detail
For those of you who care about the underlying statistical analysis, here are the results in all their glory.
If you click HERE you will see a text file containing exactly what I typed into R (the free statistic package) and the R results. In summary, the column one elapsed data is normally distributed (p-value = 0.08394) but the column two data is not normally distributed (p-value = 0.02829). Therefore to test if the data sets are stistcically different I do not use a t-test but instead use a Wilcoxon rank sum test. The Wilcoxon resulting p-value is 0.00000 which means the data from c1 and c2 likely came from two different data sets and any differences are likely NOT due to random selection (that means, we did not just happen to pick convenient values).
If that did not make sense, the below histograms may help. Shown below are both the column one (red) and column two (blue) elapsed time data sets shown as "smooth histograms." There is quite a bit of overlap. Much more than I expected. Visually they do appear to be two very different data sets, but not as dramatic as I like to see!
Not impressed? Me either! So, I did the same analysis but comparing column one and five. Shown below are both the column one (red) and column five (blue) elapsed time data sets shown as "smooth histograms." There is no overlap at all. Even visually this appears to be two very different data sets.
What does this mean? It means the SQL UPDATE statement elapsed times when selecting from column one is statistically faster than from column two. And, the column two elapsed times are closer to column one elapsed times than ANY of the other columns, which are visually much different in the plot in the previous section above.
Looking at the above to graphics, it may seem a waste of time to go through the statistical analysis. Over the years working with lots of data sets, it is always good to have a visual representation, a numeric representation and a statistical representation. If all three line up, then we're good!
Should I Reorganize My Tables?
Like I wrote in my previous article, Please NO! Do NOT schedule table reorganizations! Before you seriously consider doing anything that extreme, run some simple performance tests.
For example, try running some tests using your real data and using real production SQL UPDATE statements. And of course, any expected benefit should directly improve the user and/or business experience.
It Seems Pretty Obvious To Me
For me, the chart is pretty convincing. Cleary table column order makes a difference in UPDATE statements. But the key is, is the effort involved worth the true benefit to your organization. That is something I can't answer. But if you have a chance to influence the order of table columns, then for sure remember this research project!
But best of all, now when someone starts talking about column ordering, you will have something definitive to say along with where anyone can go to check out the results for themselves. For me, that's valuable.
In my next article, I will compare the SELECT and UPDATE experimental results. But more important I think, I will infer WHY there is a performance decrease and bring this research project to a practical close.
Thanks for reading and all the best in your Oracle tuning work!
If you have any questions or comments, feel free to email me directly at craig at orapub.com.
|Does Table Column Order Affect SELECT Performance?||I Have Lots Of Oracle Database Server Power But Performance Is Slow/Bad||Inferring SQL Run Time Using ASH And PQ Program Filtering|