-
Marketplace
-
Channel Resources
Articles from this Site
Teradata Launches Data Warehouse Packages
Meijer Selects QuantiSense for Retail Business Intelligence
Navy Exchange Service Command Selects Netezza
Netezza Enters Location Intelligence Market
St.George Bank Upgrades Enterprise Data Warehousing with Teradata
White Papers
Databasing in the 90s: Data and What We're Doing with It!
Spend Data Warehouse on Steroids
Debugging PL/SQL with AgileInfoSoftware OraDebug
Tune Oracle SQL Performance with AgileInfoSoftware
Data Warehouses: What are they and how will they benefit your organization?
Books
One brand, one Web site! DM Review is now the home of all the content you're used to at BIReview.com and much more. If you are registered at BIReview.com, you're already registered at DM Review. If not, take just a moment to sign up for all the free services we have for you at the new DMReview.com.
Real Time Data Warehousing, Part 3
Trouble-Shooting in Real Time
The previous article addressed the question How is real time achieved? Beginning with the architecture, it provided a general design for a real-time application. This third and final article of the series addresses the questions What causes the biggest headaches? and What can do you do about it?
It can be frustrating following the implementation of an application that delivers real-time data when the source system refuses to cooperate. This article addresses some of the more common lapses in data quality that can shoot a real-time application in the foot.
Incomplete Within Itself
Problem: Data extracted from an operational source system may lack internal elements that are required to consider the data complete. For example, a record may lack:
- Persons name,
- Part description and/or
- Creation date.
The business meaning of that record may be compromised as a result.
Solution: This is a data quality problem, and it requires a data quality solution. Typically, if the meaning of a record is significantly compromised or rendered unusable, the occurrence of poor data quality is reported and the record rejected (hard reject). However, if some of the records business meaning can be salvaged by filling in the gaps with default values, then the record is allowed to pass to the data warehouse with default values inserted into the gaps. In this case, the occurrence of poor data quality is reported, including the data quality exception encountered and the action that was taken (soft reject).
Incomplete Beyond Itself
Problem: Data extracted from an operational source system may only have a business meaning in the context of other data records. If those other records do not arrive the business meaning can be compromised or removed completely. For example, a record may lack:
- A header record to provide the context for a detail line-item record,
- An original transaction that is updated by the data in a subsequent record and/or
- A dimensional data value that will join to the fact data in a record.
As a result, the business context of that record may be compromised.
Solution: This is also a data quality problem, and it requires a data quality solution. Again, if the context of a record is compromised or rendered unusable, the occurrence is reported and the record is rejected (hard reject). However, the context of the record can be salvaged by filling in the gaps with default values, the record is allowed to pass to the data warehouse with default values. The occurrence of poor data quality and soft reject are reported in this circumstance as well.
Repeat Data
Problem: Sometimes a source system will repeat its data. A receiving agent or application may request that data be played again. An interruption in an asynchronous environment may cause data to be repeated when connectivity is restored.
Solution: If a record should never be repeated, then each record must be keyed to allow a staging application to recognize it when repeated. This may not be as straightforward as it sounds. The repeat record may have a timestamp value, which is different from the original data record. In that case, the timestamp is not part of the key that uniquely identifies a data record. The key that uniquely identifies each data record is the key to catching repeated data. It is important to determine what the key is.
Repeat Data with Modifications
Problem: Sometimes a source system will realize it has made a mistake and try to resolve the error by restating a data record. This can be caused by internal completeness or external relational integrity issues that have been resolved. Now that the data is complete, the source system replaces the incomplete record with a complete record.
Solution: Each record must be keyed to allow a staging application to recognize it when repeated, even if it is repeated with modifications. The key which uniquely identifies each data record is the solution to catching modified data. If one of the values in the key of the repeated modified record has been modified, the subsequent record is a different record by key definition, which renders the extract, transform and load (ETL) application blind to the repeat. If, however, it was an attribute that was modified, then the ETL application must be able to recognize the repeated record as an update to a previous record.
Recycle Wheel
A popular design consideration is a recycle wheel. Data records with a data quality problem are held inactively in a recycle wheel until they can be resolved and corrected via the source system. The problem with a recycle wheel is that data quality problems can quickly become forgotten and ignored. Such inattention to data quality is caused by significant data volumes. For example, if 20 million data records occur in a day and 10 of them have a data quality problem, a subject matter expert from the source is not likely to be concerned with those 10 data records. When this situation is repeated daily for years, the recycle wheel will most likely exceed it storage allocation; meanwhile, its not delivering data or value to the data warehouse.
In this series, I have addressed questions that must be answered as the decision-makers in an organization or enterprise consider the possibility of real-time data warehousing:
- What is real time?
- When and where is real time valuable?
- How is real time achieved?
- How do you trouble shooting in real time?
This information is, of course, generic. No two real-time applications will implement these concepts and methods in exactly the same way. But, this series provides an approach and set of concepts that will enhance a real-time data warehouse.
Fon Silvers graduated from the University of South Florida with an MBA concentrating in Information Systems. He is currently an ETL analyst for a Fortune 500 retail corporation, developing ETL applications for a data warehouse program. In March 2008 Silvers published his first book, Building and Maintaining a Data Warehouse. He may be reached at fon.silvers@verizon.net.
For more information on related topics, visit the following channels:


