* the Data Warehouse Lifecycle Toolkit By: Ralph Kimball, Laura Reeves, Margy Ross, Warren Thornthwaite This is a good book on Data Warehousing and is frequently mentioned in the data warehousing community. The following are some notes and thoughts as I was reading the book... ----------------- Always ask: 1. Does this DW give you the data you need -> correct fact and dimension tables, meta data 2. Can it give you what you want quickly? -> dimensional modeling, star schema, bit map index (1,2 is at least 50% of your DW work) 3. Can the data be loaded easily? -> ETL process 4. Data storage -> would be nice if we can save tons of data without sacificing too much on other areas. ----------------- p.163 Every data mart is a family of similar tables sharing conformed dimensions. Data warehouse is a collection of of separated implemented data marts bound together with a powerful architecture based on conformed facts and dimensions. p.173,p.185, Fig.5.6, Fig5.8 When to snow flake, when to create a separate dimension? Snow flake when dimension attribute has low cardinality data: 1. These data has different grain, is managed and loaded at different times than rest. 2. Many attributes, so we indeed save a lot of data. 3. Basically it is a separate entity, and we often want to browse its data. Create a separate dimension when data in a monster dimension is changing rapidly, and you want to avoid continuously adding records to the dimension table (why? because these dimension tables are big enough as is. You don't want them to grow further) So, you band some of the attibute data up front and try to capture all possible data for these attributes. p.200 A single data mart is a cooredinated set of fact tables. p.205, Fig 5.11 Looks like a snow flake schema. Maybe it is. By snow falking you separate low cardinality data into a sun-dimension. Here you create a sub-dimension in order to get rid of all the NULLs, since many attributes of the dimension are not shared by different lines of business. So for each line of business you create 1 sub-dimension and 1 sub-fact. Note that both custom fact join key and custom dimension join key will have 1-1 relationship with the fact and dimension table. Think about what 1-1 means. Imagine that both dimension and fact tables are "partitioned" by the join key. "Partitioning prunning" will eliminate the unnecessary partitioins during queries. When running query for checking, one would join the (custom d)-(core d)-(core fact)-(custom d). (Are the keys in the core tables primary key? Are the keys in the custom tables foreign keys? Probably. The point here is to join the core tables to the right custom tables - since there will be many customer table. Many-to-many dimensions: This does not mean m-to-m is a problem. Actually m-to-m dimensions is the solution we come up with to solve the problem of the granularity of fact record higher than that of the dimension record and wieghting factor is needed to lower the granularity of the fact table. p.219 What are the attributes in the diagnosis dimensions? Does the fact table contain all the diagnosis records of a patient during one visit to the hospital? What's wrong with Fig6.1? A fact record is really made up of more than 1 diagnosis, so there is no way to references 1 single Diag. dimension. You can not attribute the hospital visit cost of $300 to a single diagnosis. What if we add a weighting-factor attribute to the fact table in stead of creating the bridge table? This will reduce 1 join condition from the query. So you go from select d.diag-name, sum(f.measure*b.weight) from fact f, bridge b, diag d where f.bridge-grp-key=b.bridge-grp-key and b.diag-key=d.diag-key group by d.diag-name to select d.diag-name, sum(f.measure*f.weight) from fact f, diag d where f.diag-key=d.diag-key group by d.diag-name and this approach should make the data loading easier. All additional attributes in the bridge table can be added to the fact table. Looks to be a case of vertical partitioning as presented in the book. Not normalization, since you are not removing any duplicates. P.220 Is there a separate diagnosis dimension in addition to the bridge dimension? Otherwise you can not have a diagnosis-group-key FK in the fact table. What would be the aatributes in the dimension then? Or is the FK simply a logical concept that says all the diagnosis-group-key in the fact table must be in the bridge table, even though diagnosis-group-key is not unique in the bridge table? p.221, fig 6.3 The PK in the bridge table has 2 columns. Remember there is also a "account" dimension that is not shown. The account_key FK in the fact table references this "account" dimension, not the bridge table. The customer_key in the bridge table is a FK that references the customer dimension. One account fact record can map to many records in the bridge, since the account can have many customers. 1 acccount record in the bridge can map to many fact records - from several months for example. If 2 tables are not tied with PK, FK (as is the cas ehere), does it have any meaning to describe their relationship as 1-1 or 1-m? p.222 Don't worry. If you know your tables and if you know what exactly you want out of your queries, you will not get the "wrong" results. p.224 Several keys in the fact table maps to 1 key in the dimension. The opposite of p.219. You can create 7 views and join the fact with 7 views, or you can create 7 sub fact tables, and join the main fact, 7 sub facts with the time dimension. Obviously Creating views is the better choice. p.229, Fig6.7 Huge bridge table. 1 record for every possible relation in the hierachy tree. (13)(0 level)+(12)(under top node)+(4+6)(under 2nd level nodes) +(2+4)+(2)=43 (as mentioned on p.231) Top node does not have parent_customer_key Q: sales to all subsidiaries of ID=100, group by industry_group select sum(f.sales), c.industry_group from fact_table f, bridge b, customer c where f.customer_key=b.subsidiary_customer_key and b.parent_customer_key=c.customer_key and number_of_levels >0 and f.customer_ID=100 group by c.industry_group; Drop bridge, add parent_key to the customer table: select sum(f.sales), v.industry_group from fact_table, (select industry_group, customer_key from customer start with customer_id=100 connect by prior customer_key=parent_key) v where f.customer_key=v.customer_key group by v.industry_group; Denormalize parent_key into the fact table? select sum(v.sales), c.industry_group from (select customer_key, sales from customer start with customer_key= (select customer_key from customer where customer_id=100) connect by prior customer_key=parent_key ) v , customer c where c.customer_key=v.customer_key group by c.industry_group; No point in denormalizing. How will the front end tool handle this SQL? The Object VIew appraoch, see HQExample.sql -------------------------------------------- create or replace view customer_view (, hier_key) as select c.*, p.* from customer c, TABLE(CAST(ChildrenOf(c.customer_key,3) as hkeys)) p; select sum(f.sales), c.industry_group from fact_tabel f, customer_ov c where f.customer_key=c.hier_key ; and c.customer_ID=100; -- note, map f.customer_key to c.hier_key -- for each customer_key there will be everal hier_key (or child nodes) -- in the object view -- If performance is poor, stored all linked nodes as a nested table column -- and then create an object view over it as with the virtual nested table. -- What about level? How do you parameterized the object view? -- Solved using pipelined function -> see HQExample.sql 1 step down, Sum for all immediate subsidiaries -------------------------------------- select sum(f.fact) from fact_table f, bridge_table b, customer c where f.customer_key=b.subsidiary_customer_key and b.parent_customer_key=c.customer_key and c.customer_name= and b.levels=1 1 step up, sum for the ONE parent ------------------------ select sum(f.fact) from fact_table f, bridge_table b, customer c where f.customer_key=b.parent_customer_key and b.customer_customer_key=c.customer_key and c.customer_name= and b.levels=1 Go up the tree, sum for all parents ------------------ select sum(f.fact) from fact_table f, bridge_table b, customer c where f.customer_key=b.parent_customer_key and b.customer_customer_key=c.customer_key and c.customer_name= and b.levels>=1 Note the differences between Fig6.7 and 6.8. Link key to key. It is a logical representation. p.235 The emp transaction table looks like a fact table, but its lacks measurable facts. Its has all textual values. This would qualify as a rapidly changing large dimension. How do you come up with a smart design like this? Start with your query requirements and ask what makes sense, what is possible. p.241 Allocating is similar to what's discussed on p.219. It is all a matter of granularity. p.245 Does GMT_date_key reference a view built on top of the time dimension - same one referenced by date_key? p.248 Note that these factors are more like facts than dimension attributes. A lot has to do with how the front end tool present the schema to the end users. If I want to create several views for different groups of users, is it really necessary to DENROMALIZE these factors from dimension into the fact table? The views could join the two tables and therefore the end users see only the UOM converted facts. Of course the users could query the fact table alone - by selecting from the drop down list provided by the tool. In this case, end users may not be aware of the UOM issue - if the conversion factors are not in the fact table. How does the tool build the query from the list items selected by the users? How does it know to multiply or divide the factor? What if we deny user access the the base fact table and just let them access the views created by developers who know all the UOM and factors issues? They may still need to join to the product table to see the product info.. The factors in the table may cause confusion in this case, we can assume the users will not be aware that the fact table they are viewing has already been UOM factored. p.259 From SQL point of view, a special behavior dimension is like a temporary table created to reduce the complexity of a query by pre-storing some of the itermediate results. You then join this temp table with the fact table to get the facts. Of course, any slightest change in your behavior description would render the table useless and you will have to create another temp table. Might be easier to just create a script with dynamic parameters and let the tool run it upon request. p.549 Each (x,y) coordinate in Fig.14.1, 14.2 represent a fact table. The y value represents number of rows in that fact table. This has nothing to do with the nice bar chars you get in a OLAP tool. Don't get confused. p.561 Aggregate navigation system is called "query rewrite" in ORACLE. Materialized views will be your aggregates. p.633 Surrogate key lookup: basically to preserve referencial integrity. 1. Don't delete records in dimension tables without also deleting child records in the fact table. 2. For each new fact records, look up dimension table for the right surrogate key value. There is decision to be made whether to maintain a "current" lookup table for performance - since dimension tables can have many verison of the same records. Or add a "current" flag to the record. How is this handled by the ETL tool? p.635 What happens when a fact table is reloaded? You can not get your surrogate key value from the "current" record in the dimension table. You will have to find out the surrogate key value at the time when the historical record is generated. Is effective date column in the dimension table the only way to go? Seems easier with reloading a dimension table. Why reloading? What could be the reasons? Must we keep the old surrogate keys if we ever need to reload a dimension table? If you also reload fact table, then basically you refresh the whole DW and nobody is referencing the old keys anymore, and there is no reason of keeping them. ------------------------------ ORACLE OLAP Whether or not you should learn this depends on what front end tool you will use. For developers building a tool that works with ORACLE OLAP, they have to learn how to use the OLAP package. DBA probabaly needs to learn how to set up the cube, dimensions, etc.. If commercial tools will be used, it is best to ask them directly how the back end needs to be set up. At the moment, seems no major vendor is using ORACLE OLAP. Unless you are using EXPRESS, there is probably no reason to learn ORACLE OLAP. But playing with ORACLE OLAP will give you a sense of how dimension, cube, attributes work.