Denormalizing Normalized Data For Machine Learning
Thirty years ago I was passionately teaching Oracle DBAs about normalization and their impending disasters if they implemented denormalized tables.
However, now I'm teaching Oracle Professionals to denormalize Oracle data that has been thoughtfully normalized!
Why would I do something so sinister!?
The answer to this question and how to do the denormalization for machine learning purposes is what this post is all about.
Please, Everything In One Row!
In machine learning, we want all data related to a sample (think: AWR snap_id data) to reside in a single row.
So instead of id, name and value, we will have id, name1_value, name2_value, etc. Below is example of a normalized table definition.
create table perf_stats as ( snap_id number, stat_name varchar2(100), stat_value number )
If we did a simple SELECT statement, the result would be something like this.
snap_id stat_name stat_value 1001 uc_psec 2500 1001 aas 34.25 1001 trx_psec 9.45 1002 uc_psec 1200 1002 aas 14.50 1002 trx_psec 6.50
But machine learning algorithms want the data to look something like this:
snap_id uc_psec aas trx_psec 1001 2500 34.25 9.45 1002 1200 14.50 6.50
That's great, but I'm still very uncomfortable...
Where Does This End!
You may be asking yourself, where does this end? I mean, how many columns could we possibly have?
The answer is not so much a technical limit, but more of a usefulness limit. The more table columns, that is the more "features" does not imply the ML model will perform better. This is one of the myths I try to destroy in my Machine Learning For Oracle Professionals LVC.
But for sure there could easily be tens or even hundreds of features. Part of our job is to minimize the number of features, but that's a topic for another post.
Perfect Denormalization Example
It turns out the AWR dba_hist_sysmetric_summary table contains metrics, that when used properly, can help create a powerful predictive model. That's the good news.
The bad news is, dba_hist_sysmetric_history is somewhat fully normalized. Check this out!
SQL> desc dba_hist_sysmetric_summary Name Null? Type ----------------------------------------- -------- ---------------------------- SNAP_ID NOT NULL NUMBER DBID NOT NULL NUMBER INSTANCE_NUMBER NOT NULL NUMBER BEGIN_TIME NOT NULL DATE END_TIME NOT NULL DATE INTSIZE NOT NULL NUMBER GROUP_ID NOT NULL NUMBER METRIC_ID NOT NULL NUMBER METRIC_NAME NOT NULL VARCHAR2(64) METRIC_UNIT NOT NULL VARCHAR2(64) NUM_INTERVAL NOT NULL NUMBER MINVAL NOT NULL NUMBER MAXVAL NOT NULL NUMBER AVERAGE NOT NULL NUMBER STANDARD_DEVIATION NOT NULL NUMBER SUM_SQUARES NUMBER CON_DBID NUMBER CON_ID NUMBER SQL> select snap_id, metric_name, average 2 from dba_hist_sysmetric_history 3 order by 1,2 4 / SNAP_ID METRIC_NAME AVERAGE ---------- ---------------------------------------- ---------- 22721 Background Checkpoints Per Sec .14362801 22721 Branch Node Splits Per Sec .223080287 22721 Branch Node Splits Per Txn 3.67087764 22721 Buffer Cache Hit Ratio 99.5259521 22721 CPU Usage Per Sec 76.3042613 22721 CPU Usage Per Txn 1248.99557 22721 CR Blocks Created Per Sec .025542427 22721 CR Blocks Created Per Txn .254059829 22721 CR Undo Records Applied Per Sec .043018182 ...
The dba_hist_sysmetric_summary table is a perfect example of a "normalized" table, that for machine learning purposes, we need to denormalize.
We can denormalize with SQL, but you will eventually need to know how to do this in Python. I actually think it's a lot easier using Python.
How To Denormalize In Python
While there are a bunch of Python code lines shown below, there are three key steps.
- Load the data
- Remove all but the required features/columns
- Pivot/denormalize the data
If the below Python code is confusing, checkout my FREE Machine Learning For Oracle Professionals E-Course. If you need a Python machine learning sandbox check out my blog post, How To Setup Your Machine Learning Sandbox Environment.
Below is some Python code demonstrating how to denormalize normalized data.
import pandas as pd # dataframe processing # set the dataset url = "http://filebank.orapub.com/DataSets/jj_dba_hist_sysmetric_summary.csv" # Load the features/columns list features = pd.read_csv(url, nrows=0).columns.tolist() features = [x.strip(' ') for x in features] # remove leading/trailing spaces print(features) # Load the data sysmetricDF = pd.read_csv(url, names=features, skiprows=1) print(sysmetricDF.shape) sysmetricDF.columns = map(str.lower, sysmetricDF.columns) sysmetricDF['metric_name'] = [x.replace(" ", "") for x in sysmetricDF['metric_name']] # set my interested features/columns features = ['snap_id','metric_name','average'] sysmetricDF = sysmetricDF[features] print(sysmetricDF.shape) print(sysmetricDF.head()) # use the Python pivot function to denormalize sysmetricDFpiv = sysmetricDF.pivot_table(index='snap_id', values='average', columns=['metric_name']) # check the results sysmetricDFpiv.shape print(sysmetricDFpiv.head()) print(sysmetricDFpiv[['AverageActiveSessions', 'CPUUsagePerSec','DBBlockChangesPerTxn']].head())
As you can see it's very simple to denormalize normalized data in Python. You will find that nearly all of your machine learning projects using Oracle data will require using the pivot command.
All the best in your machine learning work,
If you are ready for machine learning cross-training, check out my Machine Learning For Oracle Professions LVC. I also offer this training on-site in a workshop format. Email me for details.
If you have any questions or comments, feel free to email me directly at craig at orapub.com.
|Comparing Oracle Database SQL Execution Times From Different Systems||Do PLSQL Procedures And Anonymous Blocks Have A SQL_ID?||A Real Story In A Real Life: Everything Started With A Simple Post|