Denormalizing Normalized Data For Machine Learning

Thirty years ago I was passionately teaching Oracle DBAs about normalization and their impending disasters if they implemented denormalized tables.

However, now I'm teaching Oracle Professionals to denormalize Oracle data that has been thoughtfully normalized!

Why would I do something so sinister!?

The answer to this question and how to do the denormalization for machine learning purposes is what this post is all about.

Please, Everything In One Row!

In machine learning, we want all data related to a sample (think: AWR snap_id data) to reside in a single row.

So instead of id, name and value, we will have id, name1_value, name2_value, etc. Below is example of a normalized table definition.

create table perf_stats as (
   snap_id number, stat_name varchar2(100), stat_value number
)

If we did a simple SELECT statement, the result would be something like this.

snap_id  stat_name     stat_value
   1001  uc_psec             2500
   1001  aas                34.25
   1001  trx_psec            9.45
   1002  uc_psec             1200
   1002  aas                14.50
   1002  trx_psec            6.50

But machine learning algorithms want the data to look something like this:

snap_id  uc_psec    aas  trx_psec
   1001     2500  34.25      9.45
   1002     1200  14.50      6.50

That's great, but I'm still very uncomfortable...

Where Does This End!

You may be asking yourself, where does this end? I mean, how many columns could we possibly have?

The answer is not so much a technical limit, but more of a usefulness limit. The more table columns, that is the more "features" does not imply the ML model will perform better. This is one of the myths I try to destroy in my Machine Learning For Oracle Professionals LVC.

But for sure there could easily be tens or even hundreds of features. Part of our job is to minimize the number of features, but that's a topic for another post.

Perfect Denormalization Example

It turns out the AWR dba_hist_sysmetric_summary table contains metrics, that when used properly, can help create a powerful predictive model. That's the good news.

The bad news is, dba_hist_sysmetric_history is somewhat fully normalized. Check this out!

SQL> desc dba_hist_sysmetric_summary

 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 SNAP_ID                                   NOT NULL NUMBER
 DBID                                      NOT NULL NUMBER
 INSTANCE_NUMBER                           NOT NULL NUMBER
 BEGIN_TIME                                NOT NULL DATE
 END_TIME                                  NOT NULL DATE
 INTSIZE                                   NOT NULL NUMBER
 GROUP_ID                                  NOT NULL NUMBER
 METRIC_ID                                 NOT NULL NUMBER
 METRIC_NAME                               NOT NULL VARCHAR2(64)
 METRIC_UNIT                               NOT NULL VARCHAR2(64)
 NUM_INTERVAL                              NOT NULL NUMBER
 MINVAL                                    NOT NULL NUMBER
 MAXVAL                                    NOT NULL NUMBER
 AVERAGE                                   NOT NULL NUMBER
 STANDARD_DEVIATION                        NOT NULL NUMBER
 SUM_SQUARES                                        NUMBER
 CON_DBID                                           NUMBER
 CON_ID                                             NUMBER

SQL> select snap_id, metric_name, average
  2  from   dba_hist_sysmetric_history
  3  order by 1,2 
  4  /

   SNAP_ID METRIC_NAME                                 AVERAGE
---------- ---------------------------------------- ----------
     22721 Background Checkpoints Per Sec            .14362801
     22721 Branch Node Splits Per Sec               .223080287
     22721 Branch Node Splits Per Txn               3.67087764
     22721 Buffer Cache Hit Ratio                   99.5259521
     22721 CPU Usage Per Sec                        76.3042613
     22721 CPU Usage Per Txn                        1248.99557
     22721 CR Blocks Created Per Sec                .025542427
     22721 CR Blocks Created Per Txn                .254059829
     22721 CR Undo Records Applied Per Sec          .043018182
...

The dba_hist_sysmetric_summary table is a perfect example of a "normalized" table, that for machine learning purposes, we need to denormalize.

We can denormalize with SQL, but you will eventually need to know how to do this in Python. I actually think it's a lot easier using Python.

How To Denormalize In Python

While there are a bunch of Python code lines shown below, there are three key steps.

Load the data
Remove all but the required features/columns
Pivot/denormalize the data

If the below Python code is confusing, checkout my FREE Machine Learning For Oracle Professionals E-Course. If you need a Python machine learning sandbox check out my blog post, How To Setup Your Machine Learning Sandbox Environment.

Below is some Python code demonstrating how to denormalize normalized data.

import pandas as pd  # dataframe processing

# set the dataset
url = "http://filebank.orapub.com/DataSets/jj_dba_hist_sysmetric_summary.csv"

# Load the features/columns list
features = pd.read_csv(url, nrows=0).columns.tolist()
features = [x.strip(' ') for x in features] # remove leading/trailing spaces
print(features) 

# Load the data
sysmetricDF = pd.read_csv(url, names=features, skiprows=1) 
print(sysmetricDF.shape)
sysmetricDF.columns = map(str.lower, sysmetricDF.columns)
sysmetricDF['metric_name'] = [x.replace(" ", "") for x in sysmetricDF['metric_name']]

# set my interested features/columns
features = ['snap_id','metric_name','average']
sysmetricDF = sysmetricDF[features]
print(sysmetricDF.shape)
print(sysmetricDF.head())

# use the Python pivot function to denormalize
sysmetricDFpiv = sysmetricDF.pivot_table(index='snap_id', values='average', columns=['metric_name'])

# check the results
sysmetricDFpiv.shape
print(sysmetricDFpiv.head())
print(sysmetricDFpiv[['AverageActiveSessions', 'CPUUsagePerSec','DBBlockChangesPerTxn']].head())

As you can see it's very simple to denormalize normalized data in Python. You will find that nearly all of your machine learning projects using Oracle data will require using the pivot command.

All the best in your machine learning work,

Craig.

Denormalizing Normalized Data For Machine Learning

Please, Everything In One Row!

Where Does This End!

Perfect Denormalization Example

How To Denormalize In Python

What do you want to learn next?

End the Frustration in Tuning Oracle Databases

Become an OraPub Member

Please, Everything In One Row!

Where Does This End!

Perfect Denormalization Example

How To Denormalize In Python

About the Author

Related Posts

Do PLSQL Procedures And Anonymous Blocks Have A SQL_ID?

What Machine Learning Means For The Oracle DBA

What do you want to learn next?

End the Frustration in Tuning Oracle Databases

Become an OraPub Member