How To Proportionally And Randomly Split Your Full Dataset

Posted on 21-Jul-2020 by Craig Shallahamer / OraPub / craig@orapub.com

If you have been following my recent posts, you'll know that both the Top Down Split and the Random Shuffle Split strategies produce untrustworthy accuracy scores.

If you're like me, you're asking, "Gee thanks Craig! What do I do now?"

What to do now is what this article is all about!

The focus of this article is on proportionally and randomly splitting our FULL dataset into a TRAIN and TEST dataset. A way cooler way to say this is, we want to stratify our split!

The Big Problem: Performance Is Usually Good

In my previous post, I showed the picture below of a FULL dataset randomly split into the TRAIN (green) and TEST (yellow) datasets. Look at the "Perf Situation" column values closely.

The rows have been randomly sorted/shuffled. You can see this by looking at the Snap ID column. Look at the Perf Situation column. Do you notice anything strange?

The problem is all of the TEST data (yellow) have their performance situations labeled as "good." This means when our model accuracy is scored, it is really only a score of how well the model predicts "good" performance! There will be no model accuracy score for when performance is "bad."

Notice that in the FULL dataset, there are only two "bad" performance samples. This should not surprise us because performance is rarely "bad."

For us Oracle professionals, the dataset above is our reality.

When working with Oracle activity and performance datasets, we experience what is called a "rare event" situation. This results in two significant problems.

First, our model is not tested/scored with many "bad" performance samples. This means we have very limited knowledge of how well our model will detect/predict a "bad" performance situation.

Second, our model will be better trained and more developed in relation to "good" performance situations. Just as with our life experiences, if we rarely experience failure, when we do, our lack of failure experience can result in a very traumatic situation.

This rare event situation presents us with a big challenge. However, the good news is, there are creative ways to deal with this situation.

In this post I focus on a rare event remedy called stratification. Usually stratification is not enough to solve our problem, but it is a good start.

The Big Problem: A Demonstration

In my previous post the FULL dataset was split using the below code.

# Trained model is tested with data it has never seen before... good
# Train/Test split was performed randomly... good
# No attempt to split based on label proportions... bad
# So, the results could be better or worse than a simple top-down split. Therefore, weak test.

print("Split FULL Dataset Into TRAIN And TEST Datasets Using A Random Shuffle")
print()

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=None, random_state=6752)

ScoreAndStats(X, y, X_train, y_train, X_test, y_test)
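The ScoreAndStats helper itself isn't shown in this post. Just so the output below makes sense, here is a minimal sketch of what such a helper might do. The use of pandas and a DecisionTreeClassifier here are my assumptions for illustration, not necessarily what the original notebook uses; it also assumes X and y behave like NumPy arrays or pandas objects.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sketch of a ScoreAndStats-style helper: print dataset
# shapes, print label counts and percentages for FULL/TRAIN/TEST,
# then fit on TRAIN only and score on TEST only.
# The DecisionTreeClassifier is an assumption for illustration.
def ScoreAndStats(X, y, X_train, y_train, X_test, y_test):
    print("Shapes X(r,c) y(r,c)")
    print("Full  ", X.shape, y.shape)
    print("  Train ", X_train.shape, y_train.shape)
    print("  Test  ", X_test.shape, y_test.shape)

    print("\nLabels")
    for name, labels in (("Full ", y), ("Train", y_train), ("Test ", y_test)):
        counts = pd.Series(labels).value_counts().sort_index()
        pcts = (counts / counts.sum() * 100).round(1)
        print(f"\n{name} dataset")
        for label in counts.index:
            print(f"{label:<10} {counts[label]:>4} {pcts[label]:>6}")

    model = DecisionTreeClassifier(random_state=6752)
    model.fit(X_train, y_train)
    print("\nScore: ", model.score(X_test, y_test))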

Below is some analysis of the split results. I show everything I did in the previous post plus some additional data regarding the "Perf Situation" label.

Split FULL Dataset Into TRAIN And TEST Datasets Using A Random Shuffle

Shapes X(r,c) y(r,c)

Full   (1259, 3) (1259,)
  Train  (1007, 3) (1007,)
  Test   (252, 3) (252,)

Labels

Full  dataset
green      772   61.3
red         63    5.0
yellow     424   33.7

Train dataset
green      611   60.7
red         46    4.6
yellow     350   34.8

Test  dataset
green      161   63.9
red         17    6.7
yellow      74   29.4

Notice the dataset label percentages. For example, the percentage of "red" samples in the FULL Dataset is 5.0 but in the TEST Dataset it is 6.7.

Note: Green, yellow and red were set based on statistical analysis related to the rate of actual support system tickets created during the AWR snapshot interval. This is the kind of "next level" stuff I teach in my Machine Learning For Oracle Professionals LVC.

While this may be interesting, the super important part is that in the FULL dataset 5.0% of the rows are labeled "red," but in the TRAIN dataset the percentage is 4.6% and in the TEST dataset it is 6.7%.

Because the TRAIN and TEST dataset rows were simply randomly picked from the FULL dataset, the "red" TRAIN and TEST row percentages could be MUCH higher or MUCH lower. This could bring us directly back to a situation where we are testing our model with no "red" data and training our model with all of the "red" data... or the reverse!

It may not be that extreme. But the point is, there is no attempt to ensure the proportion of green, yellow and red labels in the TRAIN and TEST datasets match the proportion from the FULL dataset!

At this point, most of you reading this will be thinking, "So what! What's the big deal?" Read on...

Non-Stratified (Non-Proportional) Split: The Teacher Tricked Me!

For a moment, think back to your university days. Did you ever encounter a situation where the professor scheduled an exam covering the first three chapters of some book? Of course you did!

In fact, the professor was kind enough to give you some study questions to help you learn the material. Suppose the professor provided twenty questions for each chapter, resulting in a total of 60 questions.

When you arrive in class to take the exam, you notice that Chapter 1 has only five questions, Chapter 2 has only five questions, but Chapter 3 has twenty questions!

If you had known Chapter 3 was more important than Chapter 1 and Chapter 2, you would have studied (i.e., trained yourself) differently and more appropriately.

What you experienced was a NON-STRATIFIED examination!

If the exam had been stratified, the proportions of the study (train) and exam (test) questions would have been the same. Since one third (about 33%) of the study questions focused on Chapter 3, you would have expected about 33% of the exam questions to focus on Chapter 3. Instead, 67% (20 of 30) of the exam questions were related to Chapter 3.

In the same way, we would not want to train our model with light workload AWR snaps and test it with high workload AWR snaps. If 40% of the samples in our FULL dataset have an AAS of less than 20, then we want 40% of the samples in both the TRAIN and TEST datasets to have an AAS of less than 20.
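As an aside, AAS is a continuous value, so you cannot hand it to stratify directly. One common workaround is to bin the values first and stratify on the bins. Below is a minimal sketch; the Series name aas and the bin edges are illustrative assumptions, not from the original notebook.

import pandas as pd
from sklearn.model_selection import train_test_split

# Bin the continuous AAS values into workload buckets, then stratify
# the split on those buckets so TRAIN and TEST see similar workloads.
# The Series name "aas" and the bin edges are illustrative assumptions.
aas_bins = pd.cut(aas, bins=[0, 20, 40, float("inf")],
                  labels=["light", "medium", "heavy"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=aas_bins, random_state=6752)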

We Need More Data!

In rare event situations we need more data than you might think. For example, 500 samples with only 1% rare event samples means there are only 5 rare event (think: red) samples available to both train and test our model. So, even though the data is stratified, we still have the problem of limited data.
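Here is a quick back-of-the-envelope calculation showing just how few rare samples end up in the TEST dataset (the FULL dataset sizes below are purely illustrative):

# How many rare ("red") samples land in an 80/20 stratified TEST split
# for various FULL dataset sizes, assuming a 1% rare event rate.
# The dataset sizes below are illustrative.
rare_rate, test_size = 0.01, 0.20

for n_full in (500, 1000, 5000, 20000):
    n_rare = round(n_full * rare_rate)
    n_rare_test = round(n_rare * test_size)
    print(f"FULL={n_full:6d}  rare={n_rare:4d}  rare in TEST={n_rare_test:3d}")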

Stratification: Super Cool Algorithm!

If you think about this for a minute, the stratification algorithm is pretty cool. It has to split the FULL dataset at a specified percentage while also ensuring the TRAIN and TEST label percentages match the FULL dataset label percentages.

Thankfully, most machine learning platforms contain the ability to "stratify" the split datasets. But, with rare event situations, we still need an unusually large FULL dataset.
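In scikit-learn, for example, stratification is available both through the stratify parameter of train_test_split (used in the next section) and through a dedicated StratifiedShuffleSplit class. Here is a minimal sketch of the latter, assuming X and y are NumPy arrays (you would use .iloc indexing for pandas objects):

from sklearn.model_selection import StratifiedShuffleSplit

# An alternative way to get the same kind of proportional 80/20 split:
# StratifiedShuffleSplit yields index arrays we then use to slice X and y.
# Assumes X and y are NumPy arrays; use .iloc for pandas objects.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.20, random_state=6752)
train_idx, test_idx = next(sss.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]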

Good News: Stratified (Proportional) Split: Python

It's time to get back to the Python train_test_split function. This core line of code is exactly the same as in my previous post but with one exception.

Notice the stratify parameter is set to y. First, the y does NOT represent YES! It instructs the split function to proportionally split the X dataset based on the proportions of the label data, y. While our label data array is traditionally named y, it could be named, for example, myLabelData.

This is the most important paragraph in this article:

If y contains 33.7% yellow label samples, then both the split TRAIN and TEST subsets should also contain 33.7% yellow label samples. So, not only is the X dataset split with 20% of the samples going to the TEST dataset and 80% going to the TRAIN dataset, but the resulting rows are also split proportionally based on the label data, y.
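Using the FULL dataset counts shown earlier, here is a quick sanity check of that arithmetic; the expected counts should show up in the stratified output below.

# Sanity check of the expected stratified split counts, using the FULL
# dataset size (1259 rows) and the 33.7% yellow proportion shown above.
n_full, yellow_pct = 1259, 0.337

n_test = round(n_full * 0.20)       # 252 rows in TEST
n_train = n_full - n_test           # 1007 rows in TRAIN
print(round(n_train * yellow_pct))  # ~339 yellow TRAIN rows
print(round(n_test * yellow_pct))   # ~85 yellow TEST rows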

Before I present the code, you can download the entire Jupyter Notebook containing the Python code HERE. If you need to set up your machine learning sandbox environment, click HERE.

Below is the slightly modified code.

# Trained model is tested with data it has never seen before... good
# Train/Test split was performed randomly... good
# Label proportions are maintained... good
# So, the split is awesome!

print("Split FULL Dataset Into TRAIN And TEST Datasets Using A Stratified Split")
print()

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=6752)

ScoreAndStats(X, y, X_train, y_train, X_test, y_test)

Here is the output from the slightly modified code:

Split FULL Dataset Into TRAIN And TEST Datasets Using A Stratified Split

Shapes X(r,c) y(r,c)

Full   (1259, 3) (1259,)
  Train  (1007, 3) (1007,)
  Test   (252, 3) (252,)

Labels

Full  dataset
green      772   61.3
red         63    5.0
yellow     424   33.7

Train dataset
green      618   61.4
red         50    5.0
yellow     339   33.7

Test  dataset
green      154   61.1
red         13    5.2
yellow      85   33.7

Score:  0.7857142857142857

Pretty amazing, eh?

You may have noticed the TEST dataset has 5.2% reds, not 5.0%. Well... the algorithm tried its best, but with a small number of red samples, 5.2% is the closest it could get.
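The arithmetic behind that 5.2% is simple, using the counts from the output above:

# Why TEST shows 5.2% red instead of exactly 5.0%: a whole number of
# red rows must land in the 252-row TEST dataset. 63 * 0.20 = 12.6,
# which rounds to 13, and 13 / 252 is about 5.2%.
reds_full, n_test = 63, 252

reds_test = round(reds_full * 0.20)        # 13
print(round(reds_test / n_test * 100, 1))  # 5.2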

What About The Accuracy Score?

You may be wondering about the accuracy score of 0.786. For sure, that's nothing to celebrate. However, our objective is to create a valid accuracy score, and that is exactly what splitting the data in a stratified way has given us.

Once we are confident we have a valid score, then we can focus on improving it.

Are We Done With The Accuracy Score? No.

While we do have a valid accuracy score, because our dataset is so small and we have a rare event situation (e.g., only 5% reds), the score is not as strong as we would like.

So, the question now is: how can we better utilize our limited dataset to produce an accuracy score that is both valid AND as strong as an accuracy score from a much bigger dataset?

That is the topic of my next article!

All the best in your machine learning work,

Craig.

Start my FREE 18 lesson Machine Learning For Oracle Professionals E-Course here.


Craig Shallahamer is a long time Oracle DBA who specializes in predictive analytics, machine learning and Oracle performance tuning. Craig is a performance researcher and blogger, consultant, author of two books, an enthusiastic conference speaker, a passionate teacher and an Oracle ACE Director. More about Craig Shallahamer...


If you have any questions or comments, feel free to email me directly at craig at orapub.com.
