-
Marketplace
-
Channel Resources
Articles from this Site
Teradata Launches Data Warehouse Packages
Navy Exchange Service Command Selects Netezza
Netezza Enters Location Intelligence Market
Data Warehousing Meets Data Archiving in Information Lifecycle Management
Data Warehousing Appliances Fly into a Storm of Controversy
White Papers
Data Warehouses: What are they and how will they benefit your organization?
Data Warehousing Buyer's Guide
Books
One brand, one Web site! DM Review is now the home of all the content you're used to at BIReview.com and much more. If you are registered at BIReview.com, you're already registered at DM Review. If not, take just a moment to sign up for all the free services we have for you at the new DMReview.com.
A Generalized Approach to the Creation of Bridge Records
As applied to data warehousing, a bridge table relates a fact record to one or more dimension records. If a given fact only ever relates to one dimension record, you dont need a bridge table. Where a fact has a one-to-many relationship with a dimension, or many-to-many, you may choose to handle that with a bridge table.
The bridge table approach described in this article identifies and numbers unique sets of dimensions and then reuses those sets, treating the set identifier as itself a sort of dimension key, which is put directly in the fact table. Its elegant and is particularly applicable to situations where you dont need a weighting factor but, rather, all the analysis is impact analysis (i.e., the full amount of a metric is counted against all the applicable dimensions). However, the extract, transform and load (ETL) is a bit complex.
An alternative approach that is better in most situations is to use a traditional associative table, just like you normally would in a transactional system. Assuming you have a fact table with a primary key FK and a dimension table with key DK, you simply create records with the pairs of keys as shown in Figure 1.

If the fact tables primary key contains multiple columns, put all the columns in there. You can add a weighting column to this, as in Figure 1, to avoid double-counting (unless you always want to double-count); when joining the fact through the bridge to the dimension, duplexing your fact records, you generally multiply the metrics by the weighting factor.
The admittedly minor issues with the associative table approach are:
- It requires that the primary key of the fact be put in the associative table. Unless you put a surrogate key on your fact tables, like you do your dimension tables, a fact table primary key may be a large collection of columns, perhaps even the collection of all foreign keys. If your fact table primary key is large, then your associative table will be large. Again, this particular issue goes away with surrogate keys on your fact table.
- The associative table will contain at least one record for each fact. Specifically, it will hold a number of records equal to the number of fact records times the average number of dimensions per fact. If you have 10,000 facts each related to both red and orange, then youll have 20,000 associative table entries. If space is a problem, this might also be a problem. The approach described in this article reuses sets of dimensions and will result in much smaller bridge tables (at the cost of more complex ETL).
- The traditional associative table does not easily allow you to predefine acceptable combinations of dimensions. In a situation when you want to predefine such combinations, the model described in this article may be better.
Another way to handle a one-to-many relationship between a fact and a dimensional attribute is to make a delimited string of all the applicable dimensional attributes (assuming they are, or can be converted to, strings) in one fact table column. For instance, if one of your dimensional attributes is color, then you could make a Colors attribute as some longish VARCHAR() data type directly in your fact table, and populate it something like Figure 2:

Notice the need to delimit the beginning and end of the entire string with whatever delimiter character you like. The queries you write against this sort of structure are easy enough:
select *
from fact
where colors like %/seafoam green/% or colors like %/red/%
to get the facts of either color, or
select *
from fact
where colors = /seafoam green/
to get all those where the only color is seafoam green.
But easy queries do not always a happy warehouse make. With a concatenated string approach:
-
You cant bake data quality into the warehouse database itself. The domain of acceptable values (red, seafoam green, violet, etc.) has to be managed inside ETL, perhaps by checking incoming color values against a Colors lookup table inside the staging area, but certainly not as a foreign key constraint to a real Colors dimension nor as a check constraint on the column (unless you wanted to write a constraint of all the different permutations of the dimension values, which is only practical if youve got just a few values).
-
Dealing with situations where your delimiter of choice shows up inside a dimensional value can be a drag (e.g., sort of red/orange as an incoming color attribute.)
-
You cant easily add attributes about the attributes. You cant, for instance, say whether a color is warm, cool or cold, track color names in different languages or divide them into standard and oddball bins.
-
You cant flag that a particular color is the primary color for a fact or assign weighting factors.
-
Compared with a true bridge table approach, it wastes database space. Seafoam green in UNICODE repeated a million times will burn a little volt 200MB.
-
How you actually group a metric by a given color, or all the various colors, isnt obvious. How does one do a count of the number of records by color? Do I count the records with /red/green/ and /red/ as two for red and one for green? Or one for /red/green/ and one for /red/?
In short: I wouldnt do this. I wouldn't make a concatenated string of dimension keys, either, for many of the same reasons.
Let's move on to our set-based approach. We have a many-to-many relationship between a fact and a dimension. Our dimension table, for as an example colors, would look something like Figure 3.

A set-based bridge table would look, then, like Figure 4. Well add a NUMBER_IN_SET column to track the number of different rows contained in a given set.

Color set 1 is for those facts which are red and white. Color set 2 is white only, and set 3 is for red, yellow and white. To relate a fact to a color set, you simply put COLOR_SET_K in the fact table as a foreign key to the bridge. The primary key of the bridge table is the foreign key to the dimension table plus the set key.
Note: Depending on your modeling tool and/or database management system (DBMS), you may not be able to model the oddball foreign key relationship from the fact to the bridge. Some systems and tools will insist that you make the set key alone (COLOR_SET_K in the above example) the primary key to be able to make a proper foreign key relationship. If so, you just need to work around it. If youre typically keen on declaring referential integrity (RI) in your database and the nasty thing wont let you, you can always implement RI with triggers. If you have a religious problem with triggers, youll need to trust to your ETL code for data integrity.
In our bridge table, well create a unique group or set key for every distinct combination of colors. You could do this on the fly, only creating those sets which appear in your data, or you could prepopulate your bridge table with all the permutations of your dimensions. The latter approach is nice all your bridge table sets are there as you load your fact table - but if youve got lots of dimensional values (for example, medical diagnoses attached to a patient visit fact record), the set of permutations can get out of hand for a given maximum number of dimensions. Plus, it's likely that many of the sets will never be used; its uncommon to see nail through the head and strep throat diagnosed at the same time. On the other hand, if you want to enforce that only certain dimension combinations are acceptable, then prepopulating your bridge table with those sets is a good idea. This article explains how to create bridge entries on the fly for only those combinations actually appearing in the incoming data.
Finding Unique Sets
Lets create our bridge table entries. Assume that you are already to the point in your ETL of having identified for any given natural (source system) key the set of dimension records that apply, and you have the surrogate keys to the dimension table in hand. In Figure 5, NK = natural key and DK = dimension key. Natural key can be a source system or natural key, or a surrogate key on your fact table. In either case, it represents the handle to the fact which has a relationship to multiple dimensions. What you have, so far is shown in Figure 5.

In the formation of bridge records, we want to identify and number the unique groups of dimensions. Reordering Figure 5 by both columns makes it a little easier to see that we actually have only five distinct sets of DK values, as shown in Figure 6.

NKs 2 and 98 share the DK set of [83]. NKs 4 and 53 share the set [12, 23]. NKs 17 and 102 use the set [3, 24, 192]. NK 11 does not share its set of DK values with any other NK. NK 25 shares the first two of its set (3 and 24) with others (NKs 17 and 102), but the third in its set is different, which means that 25 has its own unique set [3, 24, 988].
The question is how you get from the initial set of pairs of natural keys and dimension keys to the identification of unique sets of dimensions, the creation of bridge keys for those sets and then population of the bridge table. Brute force approaches involving cursors or the declaration of a maximum number of dimensions per natural key and reverse pivot operations are inflexible, crude and slow, and are not considered here. We'll use set logic. The approach accommodates any sort of natural key, does not use smart or concatenated keys, is not concerned with the number of dimensions that may ultimately appear for any given natural key, creates new sets as needed on the fly and does not duplicate sets. Again, environments with small numbers of dimensions, or where the number of dimensions attached to a given natural key remains reliably small, may choose to prepopulate the bridge table with all the permutations of the dimension keys or may choose to take the easier traditional associative table approach.
Again, weve identified the dimension keys for our source records. Well name the various work or staging tables we create with little mnemonic acronyms to try to make things easier to follow.
Step one is to sort your records by natural key and then dimension key, as shown in Figure 7.

Now we need to identify the unique sets. How can we tell that the combo of 3, 24 and 192 for NK 17 is the same thing as the set attached to NK 102?
First, order these rows in ascending order by NK and then DK and add a row numbering or identity column. Depending on your ETL tool and DBMS, youll have a number of choices for how to do this. (In Business Objects Data Integrator, for instance, you can use a Key Generation transform. In Microsoft SQL Server, you could add an identity column, either to a table variable in memory, a temp table or a real working table.) However you like, add a new column, which we'll call K1, to uniquely identify each row in order, as shown in Figure 8.

Next, make a table of the minimum K1 value and the count of records for each NK. Again, depending on your ETL tool (even if your ETL tool is just plain SQL), there will be various ways to do this. In SQL:
select
nk
,min(k1) as min_k1
,count(*) as ct
from
s2d
group by
nk
That will generate the table shown in Figure 9. This table might be a subquery in SQL code, an in-memory table in your ETL tool or a persistent table or file - logically, it doesnt matter.

What wed like is for each set of dimension keys to have similar attributes, so that we can do a SELECT DISTINCT to get the sets, apart from their appearance multiple times attached to various NK values.
Lets join S2D to AG1 and do some math to get a set of attributes for any given set of dimension keys to be the same:
select
s2d.NK
,s2d.dk
,s2d.k1
,((s2d.k1 + 1) ag1.min_k1) as SET_ORD
,ag1.min_k1
,ag1.ct
from
s2d inner join ag1
on s2d.nk = ag1.nk
You get the following table shown in Figure 10.

SET_ORD is the ordinal ranking of the dimension value within a given set. Itll be handy in a second. We can now pull out our unique sets of dimension keys by doing a self-join on the common attributes and making sure things match up (or dont match up!) in a certain way.
select
t1.min_k1
from
S2DAG1 t1 left outer join s2dag1 t2
on t1.ct = t2.ct
and t1.set_ord = t2.set_ord
and t1.dk = t2.dk
and t1.min_k1 < t2.min_k2
group by
t1.min_k1
having
sum(IsNull(t2.dk,1) > 0
We want the set markers (the MIN_K1 values) for all the sets where when we try to join on a higher numbered set, matching everything up dimension key, ordinal ranking and number in set - we fails to get a complete match. (If we do get a complete match, then by definition of complete match, its a duplicate set that we dont want.)
The above query generates the table of unique set markers (USM) shown in Figure 11.

Joining USM back to S2DAG1 results in a nice table of unique dimension sets:
select
usm.min_k1
s2dag1.dk
,s2dag1.set_ord
,s2dag1.ct
from
s2dag1 inner join usm
on s2dag1.min_k1 = usm.min_k1

We see that we have five sets of dimension keys. These are the unique sets we see coming in, in our source data.
New versus Existing Sets
Next we need to figure out which of these unique sets we already have in our bridge table and which we need to add. Those we already have well toss for purposes of this article.
To keep things simple, lets say our existing bridge table consists entirely of shown in Figure 13.

So far, then, we have these sets:
- [83]
- [3, 24, 192]
- [3, 24]
- [3, 24, 889]
Step one: do a left outer join from the incoming distinct set table, UDS, to the existing, BRIDGE:
New Left Joined to Existing (NLJE)
select
uds.min_k1
,uds.dk new_dk
,uds.ct
,IsNull(bridge.bk,-100) bk
from
uds left outer join bridge
on uds.dk = bridge.dk and uds.ct = bridge.ct
You set the value of the bridges bridge key (BRIDGE.BK) to some crazy value that you know, whatever your processes and standards, wont ever be a real group key value. You dont want these to be NULL at this point, but you want some sort of placeholder for no match. The results are shown in Figure 14.

Next, lets make a little table of the existing group keys, the keys of the incoming unique sets and the count of records by that level of aggregation:
Existing Count (EC)
select
bk
,min_k1
,count(*) ct
from
nlje
group by
bk

By joining the NLJE table back to EC, we can figure out the sets we already have versus the new ones we need to create:
Create versus Have (CVSH)
select distinct
nlje.min_k1
,nlje.new_dk
,nlje.ct
,ec.bk
from
nlje left outer join ec
on nlje.bk = ec.bk and nlje.ct = ec.ct and nlje.min_k1 = ec.min_k1

And here weve identified our new sets theyre simply the records with NULL in the BK column. (As a nice side benefit, weve also found the existing set keys for our incoming records, which we can use as part of the fact table load.)
The final piece of the puzzle is numbering our new bridge records, and then theyre ready to insert into the bridge table.
Numbering Our New Sets
We pull out the MIN_K1 values from above for the new sets:
New sets, MIN_K1 (NSMIN_K1)
select distinct
min_k1
from
cvsh
where
bk is null

We need to number these with new surrogate keys for our bridge table. Again, depending on tools, there are different ways, but the basic idea is to find the maximum existing bridge key value, add 1, and then use that as the seed value to number the new sets. To find the new seed value:
select
max(bk) + 1 as next_bk
from
bridge
In our example, the next BK value is 5. Using that as the seed, we add a BK (bridge key) column to table NSMIN_K1:

To create the records to be inserted into the bridge table, just join CVSH to NEW_BK, as follows:
insert bridge (dk, bk, ct)
select
cvsh.new_dk dk
,new_bk.bk
,cvsh.ct
from
cvsh inner join new_bk
on cvsh.min_k1 = new_bk.min_k1
where
cvsh.bk is null
Now you have your new bridge table entries. As mentioned, you've also identified the bridge set key to use for the fact table load (for the existing sets, the newly inserted sets are identified in the same way).
Jeff Prenevost is a business intelligence consultant.
For more information on related topics, visit the following channels:


