Denormalizing Normalized Data For Machine Learning

Thirty years ago I was passionately teaching Oracle DBAs about normalization and their impending disasters if they implemented denormalized tables.
However, now I'm teaching Oracle Professionals to denormalize Oracle data that has been thoughtfully normalized!
Why would I do something so sinister!?
The answer to this question and how to do the denormalization for machine learning purposes is what this post is all about.
Please, Everything In One Row!

In machine learning, we want all data related to a sample (think: AWR snap_id data) to reside in a single row.
So instead of id, name and value, we will have id, name1_value, name2_value, etc. Below is example of a normalized table definition.
create table perf_stats as (
snap_id number, stat_name varchar2(100), stat_value number
)
If we did a simple SELECT statement, the result would be something like this.
snap_id stat_name stat_value
1001 uc_psec 2500
1001 aas 34.25
1001 trx_psec 9.45
1002 uc_psec 1200
1002 aas 14.50
1002 trx_psec 6.50
But machine learning algorithms want the data to look something like this:
snap_id uc_psec aas trx_psec
1001 2500 34.25 9.45
1002 1200 14.50 6.50
That's great, but I'm still very uncomfortable...
Where Does This End!
You may be asking yourself, where does this end? I mean, how many columns could we possibly have?
The answer is not so much a technical limit, but more of a usefulness limit. The more table columns, that is the more "features" does not imply the ML model will perform better. This is one of the myths I try to destroy in my Machine Learning For Oracle Professionals LVC.
But for sure there could easily be tens or even hundreds of features. Part of our job is to minimize the number of features, but that's a topic for another post.
Perfect Denormalization Example
It turns out the AWR dba_hist_sysmetric_summary table contains metrics, that when used properly, can help create a powerful predictive model. That's the good news.
The bad news is, dba_hist_sysmetric_history is somewhat fully normalized. Check this out!
SQL> desc dba_hist_sysmetric_summary
Name Null? Type
----------------------------------------- -------- ----------------------------
SNAP_ID NOT NULL NUMBER
DBID NOT NULL NUMBER
INSTANCE_NUMBER NOT NULL NUMBER
BEGIN_TIME NOT NULL DATE
END_TIME NOT NULL DATE
INTSIZE NOT NULL NUMBER
GROUP_ID NOT NULL NUMBER
METRIC_ID NOT NULL NUMBER
METRIC_NAME NOT NULL VARCHAR2(64)
METRIC_UNIT NOT NULL VARCHAR2(64)
NUM_INTERVAL NOT NULL NUMBER
MINVAL NOT NULL NUMBER
MAXVAL NOT NULL NUMBER
AVERAGE NOT NULL NUMBER
STANDARD_DEVIATION NOT NULL NUMBER
SUM_SQUARES NUMBER
CON_DBID NUMBER
CON_ID NUMBER
SQL> select snap_id, metric_name, average
2 from dba_hist_sysmetric_history
3 order by 1,2
4 /
SNAP_ID METRIC_NAME AVERAGE
---------- ---------------------------------------- ----------
22721 Background Checkpoints Per Sec .14362801
22721 Branch Node Splits Per Sec .223080287
22721 Branch Node Splits Per Txn 3.67087764
22721 Buffer Cache Hit Ratio 99.5259521
22721 CPU Usage Per Sec 76.3042613
22721 CPU Usage Per Txn 1248.99557
22721 CR Blocks Created Per Sec .025542427
22721 CR Blocks Created Per Txn .254059829
22721 CR Undo Records Applied Per Sec .043018182
...
The dba_hist_sysmetric_summary table is a perfect example of a "normalized" table, that for machine learning purposes, we need to denormalize.
We can denormalize with SQL, but you will eventually need to know how to do this in Python. I actually think it's a lot easier using Python.
How To Denormalize In Python

While there are a bunch of Python code lines shown below, there are three key steps.
- Load the data
- Remove all but the required features/columns
- Pivot/denormalize the data
If the below Python code is confusing, checkout my FREE Machine Learning For Oracle Professionals E-Course. If you need a Python machine learning sandbox check out my blog post, How To Setup Your Machine Learning Sandbox Environment.
Below is some Python code demonstrating how to denormalize normalized data.
import pandas as pd # dataframe processing
# set the dataset
url = "http://filebank.orapub.com/DataSets/jj_dba_hist_sysmetric_summary.csv"
# Load the features/columns list
features = pd.read_csv(url, nrows=0).columns.tolist()
features = [x.strip(' ') for x in features] # remove leading/trailing spaces
print(features)
# Load the data
sysmetricDF = pd.read_csv(url, names=features, skiprows=1)
print(sysmetricDF.shape)
sysmetricDF.columns = map(str.lower, sysmetricDF.columns)
sysmetricDF['metric_name'] = [x.replace(" ", "") for x in sysmetricDF['metric_name']]
# set my interested features/columns
features = ['snap_id','metric_name','average']
sysmetricDF = sysmetricDF[features]
print(sysmetricDF.shape)
print(sysmetricDF.head())
# use the Python pivot function to denormalize
sysmetricDFpiv = sysmetricDF.pivot_table(index='snap_id', values='average', columns=['metric_name'])
# check the results
sysmetricDFpiv.shape
print(sysmetricDFpiv.head())
print(sysmetricDFpiv[['AverageActiveSessions', 'CPUUsagePerSec','DBBlockChangesPerTxn']].head())
As you can see it's very simple to denormalize normalized data in Python. You will find that nearly all of your machine learning projects using Oracle data will require using the pivot command.
All the best in your machine learning work,
Craig.
Start my FREE 18 lesson Machine Learning For Oracle Professionals E-Course here.

Craig Shallahamer is a long time Oracle DBA who specializes in predictive analytics, machine learning and Oracle performance tuning. Craig is a performance researcher and blogger, consultant, author of two books, an enthusiastic conference speaker a passionate teacher and an Oracle ACE Director. More about Craig Shallahamer...
If you have any questions or comments, feel free to email me directly at craig at orapub.com.
![]() |
![]() |
![]() |
|||
Fixed In Oracle Database 12c? Session CPU Consumption Statistics | Hmm... Oracle Database Cloud Vendors Want Efficient Systems | IOUG 2016 Is Different: Here's How To Get Your Abstract Accepted |