* On p.191 of the book "The Data Warehouse Lifecycle Toolkit" it says
"All data warehouse keys must be meaningless surrogate keys.
You must not use the original production keys."
There is a data warehouse schema, sh, that comes with ORACLE 9i. The schema contains the followings:
Fact tables:
SALES, COSTS
Dimension tables
CUSTOMERS: 50000 rows
PRODUCTS: 10000 rows
TIMES: 365*3 + 366 = 1461 rows
CHANNELS: 5 rows
PROMOTIONS: 501 rows
COUNTRIES: 19 rows
SALES and the 6 dimension tables form a snow flake schema, with CUSTOMERS referencing COUNTRIES.
- There is NO primary key on the fact tables.
- Fact tables are partitioned on the TIME_ID.
- CUSTOMERS, PRODUCTS, PROMOTIONS have surrogate primary key.
- TIMES has natural key.
- COUNTRIES, CHANNELS have intelligent key.
Let's take a closer look...
(One must be aware of the impact data partitioning has on this issue)
* Surrogate key for the dimension table
There is little doubt that all dimension tables must have surrogate primary keys.
Refer to Kimball's book for the reasons.
* Surrogate key for the Time dimension table
Should surrogate key be used for the TIMES dimension? The ORACLE sample schema does not use
surrogate key, but uses the natural key instead. Kimball's book is against this.
Reasons for NOT using surrogate date key:
The only reason seems to be that, by denormalizing date key into the fact table, you can
directly query the fact table with date constraint without joining to the TIMES dimension.
All other situations when other attributes in the TIMES dimension are needed, you still need
to do a join.
Reasons for using surrogate date key:
Keeping historical dimension records (or multi-version dimension records) still applies here.
Each record in the TIMES dimension has calendar attirbutes which can change. Different companies
may have different fiscal calendar system. If companies merge, 1 natural date key can have
several different interpretations. So, TIMES dimension is really no different from any other
dimensions.
So, TIMES dimension with "date" (not time) key is really no different than any other dimensions.
Surrogate key should be created. Why was surrogate time key not created in this ORACLE sample
DW schema? I believe the reason is the partitioning factor discussed below.
It is critically important to note that "date" is different from "time". "Time of day" should be
split from the "date" key into its own fact attribute. "AM", "PM", exact hour of the day should be
treated as "facts" and stored as attributes in the fact table.
* Surrogate key for the COUNTRY dimension table
The COUNTRY dimension table has a sort of 'intelligent key', meaning that you can actually infer
country name from the country code. If there is an international standard for country codes, this
might be OK. The chance of changing an international standard on country code is fairly remote.
Hard to imagine there will be more than 200-300 countries in the world. So JP probably will always
be referring to Japan. If companies use proprietory country codes and the companies merge, data
transformation during ETL can be implemented to conform the dimension attributes.
* Surrogate key, primary key for the fact table?
Should there be a surrogate, primary key in the fact table?
No. No versioning issue here. The attributes are "facts" and by nature will never change.
Should there be a primary key in the fact table?
From performance point of view, no. You will not be querying against the fact table with the primary
key in the where clause, that's for sure. "select ... where PK=1" just does not make sense. You will
not have "natural" primary key in the fact table. So you will NOT have queries like
"select sum(fact) ... where cutomers=... and products=...".
From constraint enforcement point of view, PK has some merits. You would create PK or UK on the
combination of all the FKs. This prevents the same fact record from being inserted twice into the
fact table. But if the data loading process is already doing the cehcking and you have confidence
in the loading process, PK or UK do not nee to be created, since these constraints and their
associated indexes have negative impact on the data loading performance.
* The data partitioning factor
Fact table
----------
The fact table should be partitioned. What should be used as the partitioning key?
If the fact table contains only facts and meaningless surrogate foreign keys, which key should
be used as the partitioning key? In the ORACLE sample schema, TIME_ID is used as the partitioning
key. This may be why ORACLE did not use surrogate key for the ITMES dimension. It is difficult
to partition on the surrogate key in a meaningful way, since the surrogate key itself does not
have any meaning. What to do then?
The most important benefits of partitioing are:
Partition pruning during query.
Partition independence during partition maintainence (data loading).
Types of partitioing:
Range partitioning
Hash Partitioing
List partitioing
Range/hash composite partitioing
Range/list composite partitioing
How does partitoning affect these 2 types of queries?
Queries involve just the fact table.
Queries require joining fact table with dimension tables.
To take advantage of star transformation, all FK columns in the fact table must have bitmap indexes.
All dimension table must have primary keys.
Consider the above aspects of partitioning, let's take a closer look:
Assuming that there is no primary key in the fact table, which is a reasonable assumption
1. If all the FKs are surrogate keys, how do you partition the fact table? List partitioing is a
rather special type of partitioing and will not be considered here. Hash partitioning has some
serious drawbacks, for example you will not be able to merge, split partitions. And you can not
exchange a hash partition with a table. Most of the time, you will be dealing with range partitioning.
Same argument can be applied to range/hash and range/list partiitoing. If HASH partitioning is to be
used, this surrogate key issue really does not exist. You have no control over the ORACLE hashing
routine, so it does not really matter if partitioning column constains surrogate key or natural key.
For the most oftenly used range partitioning, what happens if you partition on one of the surrogate
FKs? First of all, it can not be partition in a meaningful way. And, as long as this is true, you
will have problems creating a temp table and try to exchange it with a partition in the table.
Second, you will not be able to run single table query on the fact table - all meaning less FKs here.
When you join to the diemnsion table, partition pruning may not even happen (example to come...).
2. If one of the natural keys is used in the fact table and it is used as the partitioning key,
all the problems listed above disappear. The partitioning is meaningful, partition DDL can be
carried out and partition pruning during query is more likely to happen.
Conclusions:
It is necessary to exame this surrogate key and partitioning issue on a case by case basis.
For the most common scenario, when the fact table is range partitioned on the date key, there is
trade-off between using surrogate and natural date key. The pros of using surrogate key is the
ability to handle the possility of multi-version date key in the future. The cons come from the
data partitioning factor. Unless HASH partitioning, a not very desiable partitioning method, is used,
partitioning on surrogate key is problemetic at the very least.
Dimension table
----------------
How should the dimension table be partitioned, if it gets too big? Take the CUSTOMERS table for
example. For a big company, this dimension can be huge.
The CUSTOMERS dimension table in the sample DW schema is not partitioned.
This is a snow flake schema.
Partitioning it on the country code seems to be reasonable, when partitioning is needed.
The argument against partitioning on the surrogate key in the fact table applies here also.
Partitioning on the 'intelligent key' of country code makes more sense.
The PK index will be a global index or local non-prefixed index. For definition of these indexes
refer to "Properties of an index".
Refer to "Partitioned PK index" on how to create local prefixed index.
* What happen to all the indexes?
From the above discussions, we can draw conclusions on the indexes in a data warehouse environment.
1. For star tranformation to happen, bitmap indexes must be created on the FKs in the fact table.
If the fact table is partitioned, most of these bitmap indexes will be local non-prefixed.
In the ORACLE sample DW schema, the bitmap index on the time_id would have been the only index
that is local prefixed.
2. The PK on the dimension table would have been either a non-partitioned global index (a bad choice)
or a local non-prefixed index - if the dimension table is partitioned.