An Optimization Challenge: Training VS Testing

Posted on 13-Apr-2020 by Craig Shallahamer / OraPub / craig@orapub.com

In my previous article Why Our ML Models Must Be Tested With Unseen Data, I presented a simple model accuracy score. Yes, it is simple and useful. However, it provides limited insight and can actually mislead us into thinking our model is powerful when in fact, it's not.

Also, in my previous article, I demonstrated that one way to gain more insight into our model's predictive power is to "test" it with data it has never seen before.

I showed a somewhat manual way to do this. Having TRAIN and TEST datasets is so foundational to supervised machine learning that there is a library function that makes it easy to split our full dataset into TRAIN and TEST subsets.

But this function also provides two very special and important advantages! The first is the ability to set the TRAIN to TEST split percentage. The second is called "stratification." In this article the focus is on the split percentage.
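To preview both knobs (we will dig into stratification in the next article), here is a minimal sketch of the call. The tiny placeholder dataset and the 0.20 value are just for illustration, not the article's data:

import numpy as np
from sklearn.model_selection import train_test_split

# A tiny placeholder dataset (90 "good" 0's, 10 "bad" 1's) just to show the call
X = np.arange(200).reshape(100, 2)
y = np.array([0]*90 + [1]*10)

# test_size sets the TEST split percentage; stratify=y asks the split to
# preserve the label proportions in both subsets (the next article's topic)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=123)

print(X_train.shape, X_test.shape)   # (80, 2) (20, 2)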

Sound interesting? Read on...

A Better Way To Split, Train And Test

Here's the process summarized. First, split the full dataset into a TRAIN and TEST dataset. Second, train and evaluate the model using the TRAIN data. Third, evaluate the model using the TEST dataset.

The trick is the model has never seen the TEST data before. So, it's a much better test.

In the code below (download example1.py), notice I have a command line argument for the TEST dataset size percentage. That value is then passed to the test_size parameter of the Python function train_test_split.

$ cat example1.py
# author: Craig Shallahamer, craig@orapub.com
# blog  : 10-April-2020

import sys
test_split_pct = float(sys.argv[1])
print("Test to Train dataset split % :", test_split_pct)

import numpy  as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# define dataset: 100 features(X) and 1 label (y)
X, y = make_classification(n_samples=1000, n_features=100, n_redundant=0, n_clusters_per_class=1, weights=[0.90], flip_y=0.01, random_state=2)

# summarize data, class distribution
print("Unsplit dataset shapes        :", X.shape, y.shape)
print("Unsplit dataset label counts  :", Counter(y))

# split the full X/y dataset into train and test datasets
# based on test_split_pct variable
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_split_pct, random_state=123)

print("Split Train shape             :", X_train.shape, y_train.shape)
print("            label counts      :", Counter(y_train))
print("Split Test  shape             :", X_test.shape , y_test.shape)
print("            label counts      :", Counter(y_test))

# define model and initialize it
from sklearn import tree
Mymodel = tree.DecisionTreeClassifier(random_state=22)

# Using TRAIN dataset, fit model and make predictions 
Mymodel.fit(X_train, y_train)
Ypred = Mymodel.predict(X_train)

# calculate TRAIN dataset accuracy, based on model trained using TRAIN dataset
print("Model accuracy (training)     :", accuracy_score(y_train, Ypred))

# Make predictions, using TEST dataset and the model trained using TRAIN dataset
Y_test_pred = Mymodel.predict(X_test)

# calculate TEST dataset accuracy, based on model trained using TRAIN dataset
print("Model accuracy (testing)      :", accuracy_score(y_test, Y_test_pred))

I ran the above Python script with a TEST split percentage of 0.20. This means 20% of our full dataset samples will be split into the TEST dataset and the remaining 80% into the TRAIN dataset.

Below are the accuracy results. The model trained perfectly, achieving an accuracy of 100% or simply 1.0. To check if the model has simply memorized the TRAIN data, we asked it to make predictions with data it has never seen before (TEST data) and score the results. The TEST accuracy score is 92%. Does that seem pretty good?

$ python example1.py 0.20

Test to Train dataset split % : 0.2
Unsplit dataset shapes        : (1000, 100) (1000,)
Unsplit dataset label counts  : Counter({0: 896, 1: 104})
Split Train shape             : (800, 100) (800,)
            label counts      : Counter({0: 719, 1: 81})
Split Test  shape             : (200, 100) (200,)
            label counts      : Counter({0: 177, 1: 23})
Model accuracy (training)     : 1.0
Model accuracy (testing)      : 0.92

While a 92% TEST accuracy may seem good, remember the accuracy score we are using is extremely simple and can mislead us. And, because 90% of our data is labeled "good performance" (0's), simply guessing "zero" will likely result in around 90% accuracy! So, what seems like a pretty good accuracy score may not be anything special after all.
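To see that "just guess zero" baseline concretely, here is a small sketch (not part of example1.py) using scikit-learn's DummyClassifier, which always predicts the most frequent label, scored on the same kind of 80/20 split:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Same style of imbalanced dataset and split as example1.py
X, y = make_classification(n_samples=1000, n_features=100, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.90],
                           flip_y=0.01, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=123)

# "Always guess the most frequent label" -- the "It's a zero!" strategy
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

# Roughly the share of 0's in the TEST dataset (about 0.885 here)
print("Baseline TEST accuracy:", baseline.score(X_test, y_test))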

Perhaps we can achieve a TEST accuracy above 92% with a different data split percentage?

Tension: Increasing Training Data

There is always tension in choosing the TRAIN/TEST split percentage. During training, we want to expose our model to as much data as possible. However, we also want a responsible test.

In my Machine Learning For Oracle Professionals LVC, we repeatedly see that this tension is very real in Oracle performance datasets.

As I introduced in my Why "Rare Events" Can Be A Problem With Machine Learning article, Oracle performance problems should be a "rare event." And, the more rare the better!

For example, suppose we have a 5000 snap AWR dataset, we split 80% of the data into the TRAIN dataset, the remaining 20% into the TEST dataset, and 2% of our data is labeled as "poor performance." This means our TRAIN dataset will have around 4000 snaps with around 80 "poor performance" snaps. (5000*0.80*0.02)

But check this out!

Our TEST data set will have 1000 snaps but only 20 "poor performance" snaps. You should be uncomfortable with only 20 samples of anything!
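Here is that back-of-the-envelope arithmetic in a few lines of Python, using the 5000-snap, 2% "poor performance" figures above, so you can see how quickly the rare-event counts shrink as we move snaps between TRAIN and TEST:

# Expected rare-event counts for a 5000-snap AWR dataset with 2% "poor performance" snaps
total_snaps = 5000
rare_pct = 0.02

for test_pct in (0.20, 0.30, 0.50):
    train_rare = total_snaps * (1 - test_pct) * rare_pct
    test_rare = total_snaps * test_pct * rare_pct
    print(f"test_size={test_pct:.2f}  TRAIN rare events ~{train_rare:.0f}  TEST rare events ~{test_rare:.0f}")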

With this in mind, we are motivated to increase the TEST dataset size. So how about 25% or 30% or 50%?

Here comes the tension!

But if we increase the TEST dataset size, the TRAIN dataset size decreases. Can a smaller TRAIN dataset impact the model's ability to make good predictions?

Well... let's run some tests and check it out!

Tension: Finding The Sweet Spot Split

Many Data Scientists will start with a TRAIN/TEST split of 80/20 or 70/30. But if the accuracy scores (along with other accuracy statistics) are not satisfactory, they will likely experiment with different split percentages.

In my experiment below, I am setting my "rare event" percentage to 10%... which is really not all that rare for our type of work. But I want to make the point that even with a dataset containing 10% "bad performance" 1's and 90% "good performance" 0's, the results can be shocking.

Below is the shell loop that drives the Python script. Notice I created an OS shell variable for our TEST/TRAIN (not TRAIN/TEST but TEST/TRAIN) split percentage, called testpct. The first value is 0.01, which means only 1% of our initial dataset will become the TEST dataset and the remaining 99% will become our TRAIN dataset.

for testpct in 0.01 0.05 0.10 0.15 0.20 0.30 0.40 0.50 0.60 0.70 0.80
do
  echo "Testing percentage of $testpct ......................."
  python example1.py $testpct
done

Below are the results.

$ for testpct in 0.01 0.05 0.10 0.15 0.20 0.30 0.40 0.50 0.60 0.70 0.80
> do
>   echo "Testing percentage of $testpct ......................."
>   python example1.py $testpct
> done
Testing percentage of 0.01 .......................
Test to Train dataset split % : 0.01
Unsplit dataset shapes        : (1000, 100) (1000,)
Unsplit dataset label counts  : Counter({0: 896, 1: 104})
Split Train shape             : (990, 100) (990,)
            label counts      : Counter({0: 886, 1: 104})
Split Test  shape             : (10, 100) (10,)
            label counts      : Counter({0: 10})
Model accuracy (training)     : 1.0
Model accuracy (testing)      : 1.0
Testing percentage of 0.05 .......................
Test to Train dataset split % : 0.05
Unsplit dataset shapes        : (1000, 100) (1000,)
Unsplit dataset label counts  : Counter({0: 896, 1: 104})
Split Train shape             : (950, 100) (950,)
            label counts      : Counter({0: 851, 1: 99})
Split Test  shape             : (50, 100) (50,)
            label counts      : Counter({0: 45, 1: 5})
Model accuracy (training)     : 1.0
Model accuracy (testing)      : 0.96
Testing percentage of 0.10 .......................
Test to Train dataset split % : 0.1
Unsplit dataset shapes        : (1000, 100) (1000,)
Unsplit dataset label counts  : Counter({0: 896, 1: 104})
Split Train shape             : (900, 100) (900,)
            label counts      : Counter({0: 808, 1: 92})
Split Test  shape             : (100, 100) (100,)
            label counts      : Counter({0: 88, 1: 12})
Model accuracy (training)     : 1.0
Model accuracy (testing)      : 0.92
Testing percentage of 0.15 .......................
Test to Train dataset split % : 0.15
Unsplit dataset shapes        : (1000, 100) (1000,)
Unsplit dataset label counts  : Counter({0: 896, 1: 104})
Split Train shape             : (850, 100) (850,)
            label counts      : Counter({0: 764, 1: 86})
Split Test  shape             : (150, 100) (150,)
            label counts      : Counter({0: 132, 1: 18})
Model accuracy (training)     : 1.0
Model accuracy (testing)      : 0.9266666666666666
Testing percentage of 0.20 .......................
Test to Train dataset split % : 0.2
Unsplit dataset shapes        : (1000, 100) (1000,)
Unsplit dataset label counts  : Counter({0: 896, 1: 104})
Split Train shape             : (800, 100) (800,)
            label counts      : Counter({0: 719, 1: 81})
Split Test  shape             : (200, 100) (200,)
            label counts      : Counter({0: 177, 1: 23})
Model accuracy (training)     : 1.0
Model accuracy (testing)      : 0.92
Testing percentage of 0.30 .......................
Test to Train dataset split % : 0.3
Unsplit dataset shapes        : (1000, 100) (1000,)
Unsplit dataset label counts  : Counter({0: 896, 1: 104})
Split Train shape             : (700, 100) (700,)
            label counts      : Counter({0: 628, 1: 72})
Split Test  shape             : (300, 100) (300,)
            label counts      : Counter({0: 268, 1: 32})
Model accuracy (training)     : 1.0
Model accuracy (testing)      : 0.9233333333333333
Testing percentage of 0.40 .......................
Test to Train dataset split % : 0.4
Unsplit dataset shapes        : (1000, 100) (1000,)
Unsplit dataset label counts  : Counter({0: 896, 1: 104})
Split Train shape             : (600, 100) (600,)
            label counts      : Counter({0: 538, 1: 62})
Split Test  shape             : (400, 100) (400,)
            label counts      : Counter({0: 358, 1: 42})
Model accuracy (training)     : 1.0
Model accuracy (testing)      : 0.925
Testing percentage of 0.50 .......................
Test to Train dataset split % : 0.5
Unsplit dataset shapes        : (1000, 100) (1000,)
Unsplit dataset label counts  : Counter({0: 896, 1: 104})
Split Train shape             : (500, 100) (500,)
            label counts      : Counter({0: 446, 1: 54})
Split Test  shape             : (500, 100) (500,)
            label counts      : Counter({0: 450, 1: 50})
Model accuracy (training)     : 1.0
Model accuracy (testing)      : 0.914
Testing percentage of 0.60 .......................
Test to Train dataset split % : 0.6
Unsplit dataset shapes        : (1000, 100) (1000,)
Unsplit dataset label counts  : Counter({0: 896, 1: 104})
Split Train shape             : (400, 100) (400,)
            label counts      : Counter({0: 358, 1: 42})
Split Test  shape             : (600, 100) (600,)
            label counts      : Counter({0: 538, 1: 62})
Model accuracy (training)     : 1.0
Model accuracy (testing)      : 0.9033333333333333
Testing percentage of 0.70 .......................
Test to Train dataset split % : 0.7
Unsplit dataset shapes        : (1000, 100) (1000,)
Unsplit dataset label counts  : Counter({0: 896, 1: 104})
Split Train shape             : (300, 100) (300,)
            label counts      : Counter({0: 273, 1: 27})
Split Test  shape             : (700, 100) (700,)
            label counts      : Counter({0: 623, 1: 77})
Model accuracy (training)     : 1.0
Model accuracy (testing)      : 0.8814285714285715
Testing percentage of 0.80 .......................
Test to Train dataset split % : 0.8
Unsplit dataset shapes        : (1000, 100) (1000,)
Unsplit dataset label counts  : Counter({0: 896, 1: 104})
Split Train shape             : (200, 100) (200,)
            label counts      : Counter({0: 184, 1: 16})
Split Test  shape             : (800, 100) (800,)
            label counts      : Counter({0: 712, 1: 88})
Model accuracy (training)     : 1.0
Model accuracy (testing)      : 0.8625

The 1% TEST split resulted in a 100% accuracy score! Good luck with your deployment! If you look at the TEST dataset shape, there are only 10 rows in the TEST dataset! So, we got lucky!

But did we really get lucky?

Remember that roughly 90% of the rows are 0's and 10% are 1's. In fact, look at the label counts: our tiny 1% TEST dataset contains nothing but 0's, so a model that always predicts 0 scores a perfect 100% on it. More generally, if the model always predicted a 0, the accuracy would be around 90%. So what seems like a good score may not be so good after all!

What I Like To See

What I look for is at least 30 "rare event" samples. And, I would like more... a lot more if possible.

If you look closely above, at the 20% split the TEST dataset contains 200 rows: 177 are "good performance" 0's and only 23 are the rare "bad performance" 1's. While accuracy using the TEST dataset is 0.92, there are not quite 30 rare event samples in the TEST dataset.

At the 30% split, there are 32 rare event samples with an accuracy score of 92%. So, using my general "what I look for" method, the 20% split doesn't quite make it but the 30% does.
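If you want to apply that rule of thumb up front, here is a tiny helper (my own sketch; the function name is made up) that estimates the smallest TEST split expected to contain at least 30 rare-event samples, assuming the split roughly preserves the rare-event fraction:

def min_test_split(total_rows, rare_fraction, min_rare_in_test=30):
    # Smallest TEST fraction expected to contain min_rare_in_test rare rows,
    # assuming the rare-event fraction carries over into the TEST dataset
    return min_rare_in_test / (total_rows * rare_fraction)

# For the 1000-row, ~10% rare-event dataset above: about 0.30
print(min_test_split(1000, 0.10))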

Our Sweet Spot

Notice that, once the TEST dataset is large enough to be meaningful, accuracy holds at roughly 92% to 92.5% for splits of around 15% to 40%. Beyond that, it is common for the accuracy to degrade as the TEST dataset size increases and the TRAIN dataset shrinks.

With our rare event dataset, as we increase the TEST data percentage, the accuracy doesn't fall off much... but it is falling.

Again, we are led to believe our model is amazing, when in fact it barely beats (and at the larger splits fails to beat) the simple "It's a zero!" strategy.

What Is The Best Train/Test Dataset Split Percentage?

So what percentage should we use? I hope you can see that this is not a simple question. Based on the demonstrations above, we can see there are many factors that play a part in determining the optimal split percentage.

But there's even more to consider.

It centers around the fact that with the same dataset and the same data splits, "Algorithm A" can show a high TEST accuracy score (e.g., 95%) but actually be less powerful than "Algorithm B" with a lower accuracy score (e.g., 85%).

This can occur because a simple accuracy score can mask or hide important details. Going deeper into accuracy is a topic for another post.
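As a quick illustration of how one accuracy number can hide what matters (this is my own toy example, not taken from either algorithm above), compare two made-up prediction vectors that score identical accuracy but differ wildly on the rare class:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 10% rare events: 90 "good performance" 0's, then 10 "bad performance" 1's
y_true = np.array([0]*90 + [1]*10)

# Model A: always predicts 0 (the "It's a zero!" strategy)
pred_a = np.zeros(100, dtype=int)

# Model B: catches half the rare events, at the cost of 5 false alarms
pred_b = np.array([0]*85 + [1]*5 + [0]*5 + [1]*5)

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(name, "accuracy:", accuracy_score(y_true, pred),
          " rare-event recall:", recall_score(y_true, pred))
# Both score 0.90 accuracy, but A never detects a single rare event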

The Key Points In This Article

Below are the key points I want to get across in this article:

- A simple accuracy score is easy to compute but can mislead us, especially with imbalanced, "rare event" data.
- Testing with data the model has never seen (the TEST dataset) gives a far more honest view of its predictive power.
- There is a real tension between giving the model lots of TRAIN data and keeping a responsible TEST dataset.
- Make sure the TEST dataset contains enough rare-event samples; I look for at least 30.
- There is no single "best" split percentage; experiment, and always compare against a trivial baseline like "always predict 0."

As I mentioned at the beginning of this article, the data split function provides two very important capabilities. The first is setting the TEST to TRAIN split percentage. The second is called "stratification." In my next article we will explore this incredibly important and interesting topic!

All the best in your machine learning work,

Craig.

Start my FREE 18 lesson Machine Learning For Oracle Professionals E-Course here.


Craig Shallahamer is a long time Oracle DBA who specializes in predictive analytics, machine learning and Oracle performance tuning. Craig is a performance researcher and blogger, consultant, author of two books, an enthusiastic conference speaker, a passionate teacher and an Oracle ACE Director. More about Craig Shallahamer...


If you have any questions or comments, feel free to email me directly at craig at orapub.com.
