Does Table Column Order Affect SELECT Performance?
Both you and I have heard this before, "When creating an Oracle table, place the most commonly accessed column at the beginning of the table CREATE statement because performance will be better."
So, I asked three simple questions:
- How much better? The response was, "A lot better." Gee. Thanks a lot.
- Why is it better? The response was, "Probably because of the way Oracle accesses block data." Probably? Which prompted my next question.
- How do you probably know? The response was, "I don't know. It just makes sense." OK. I'm done talking with you.
It's time for a research project. And, this research is one I've wanted to do for many years, but never took the time... until today!
So, does the order in which you create an Oracle database table's column impact SELECT performance? The short answer is YES, but the details are much more interesting and encourage us to ask more questions.
How I Designed The Experiment
With an experiment like this, it is easy to get overwhelmed with all the possible experimental variations. I decided on one variation with the objective of learning if there is a statistically significant performance change. That's it. Not how much of a difference or not at what column number does performance significantly change. And I'm not trying to determine if the column type makes a difference or if the column length make a difference. I just want to know two things about my experiment:
- Does column order make a difference?
- How can I apply the results to my work?
That is it. Otherwise, I would never get the experiment completed and posted.
However, I have made everything available to you. This way, you can alter the experiment and do some research yourself.
This is what I chose for this experiment focused on SELECT statements.
- 5 million row table
- 180 columns
- Alternating column types; number and varchar2
- Numeric columns contain a uniformly distributed random number with a length from 1 to 20 digits.
- Varchar2 columns contain uniformly distributed random printable character strings with lengths from 1 to 20 characters.
- Gathered 35 samples for each column, but did not use the first four to reduce the impact of any potential buffering.
- The posted results contain the final 31 samples averaged for each column accessed.
I created an "analysis pack" you can download HERE containing the:
- Experiment script, so you can see it, run it, and modify and run it yourself.
- Plotting SQLPLUS script, which pulls the experimental results and formats them into an R script, which you can then simply copy and paste into R to create the chart you see below.
- Plot you see below.
- Table export (COLORDER_RESULTS_SEL) of the experimental results, so you can crunch the numbers yourself.
- Histogram I present below and the ones I do not discuss.
- Statistical analysis step-by-step, using R, that I summarize below.
Column Order Does Make A Difference
The picture says it all... it makes a difference. And you'll forgive me if I say, "A big difference." In my specific experiment, accessing the 100th column takes about three times as long as the first column.
Clearly and immediately, performance degrades as we SELECT a column other than the first column. But to satisfy any naysayers, below you can find the statistical significance test between column one and column two.
Statistical Analysis Detail
For those of you who care about the underlying statistical analysis, here are the results in all their glory.
If you click HERE you will see a text file containing exactly what I typed into R (the free statistic package) and the R results. In summary, the column one elapsed data is normally distributed (p-value = 0.07352) but the column two data is not normally distributed (p-value = 0.0006583). Therefore to test if the data sets are stistcically different I do not use a t-test but instead use a Wilcoxon rank sum test. The Wilcoxon resulting p-value is 0.00000 which means the data from c1 and c2 likely came from two different data sets and any differences are likely NOT due to random selection (that means, we did not just happen to pick convenient values).
If that did not make sense, the below histogram will! Shown below are both the column one (red) and column two (blue) elapsed time data sets shown as "smooth histograms." There is no overlap at all. Even visually this appears to be two very different data sets.
What does this mean? It means the SQL SELECT statement elapsed times when selecting from column one is statistically faster than from column two. And, the column two elapsed times are closer to column one elapsed times than ANY of the other columns, which are visually much different in the plot in the previous section above.
Looking at the above to graphics, it may seem a waste of time to go through the statistical analysis. Over the years working with lots of data sets, it is always good to have a visual representation, a numeric representation and a statistical representation. If all three line up, then we're good!
Should I Reorganize My Tables?
Please NO! Do NOT schedule table reorganizations! Before you seriously consider doing anything that extreme, run some simple performance tests.
For example, try running some tests using your real data and using real production SQL SELECT statements. And of course, any expected benefit should directly improve the user and/or business experience.
It Seems Pretty Obvious To Me
For me, the chart is pretty convincing. Cleary table column order makes a difference in SELECT statements. But the key is, is the effort involved worth the true benefit to your organization. That is something I can't answer. But if you have a chance to influence the order of table columns, then for sure remember this research project!
But best of all, now when someone starts talking about column ordering, you will have something definitive to say along with you someone can go to so they can do the experiment themselves. For me, that's valuable.
What About UPDATE Statements?
That's a good question. I'm kind of curious about this also. So, in my next article, I'll post the results and then compare the SELECT and UPDATE focused experiments. But until then, start re-organizing those tables! (Just joking.)
Thanks for reading and all the best in your Oracle tuning work!
If you have any questions or comments, feel free to email me directly at craig at orapub.com.
|Important Statistical Distributions For Oracle Database Tuning...Really||Important Statistical Distributions For Oracle Database Tuning...Really||Does An Oracle Commit Always Have A Blank, Empty Or Null SQL_ID?|