Earlier this week I was having yet-another-end-of-year-lunch1, this time with a bunch of friends who are all data warehouse developers. Now we could have spent the time enjoying the food and wine and talking about what we are all going to do over the holidays.
I say could because of course we didn't. Instead we talked about when a data warehouse is not a data warehouse. Go figure.
If you are still reading this, then you too must be a data warehouse developer so I will dispense with any attempt to include non-experts in the conversation. If this isn't you, just skip to the end of the article and then go to have lunch with friends who are happy to talk about the real world!
We all know that data warehousing is a term that is widely used to describe a database - any database - that somebody (dare I say a vendor?) wants you to think is somehow endowed with mystical powers that normal databases don't have.
To me, a data warehouse exists if the following three conditions are met:
- A data warehouse is a repository of an organisation's electronically stored data.
- It is designed to facilitate reporting and analysis.
- Is optimised for fast data retrieval (as opposed to update) across large data sets.
There is often confusion between a data warehouse and databases used in operational systems. Here is how I think of the differences.
Operational databases are optimised to preserve data integrity while also handling the insertion of transactions into normalised databases. Operational system designers generally follow the Codd rules2of database normalisation in order to ensure data integrity.
Fully normalized database designs (that is, those satisfying the five (of the 12 in total) Codd rules related to normalisation, result in transaction data being stored in many different tables. Relational databases are efficient at managing the relationships between these tables and, in order to improve performance, older data are usually regularly purged from operational systems.
Data warehouses are optimised for data retrieval - by placing all relevant information together in a single table - and analysis - by holding historical information that shows you how the data has changed over time.
So how does this work in practise? Even more importantly, how does this relate to my recent lunch?
Well, somebody mentioned that they had read about a new cloud-based data warehouse built by Taser (yes, that fun-loving bunch that zap criminals for a living). The data warehouse captures tamper-proof video recordings of police 'incidents' and stores them for use as evidence in subsequent criminal trials.
I argued that this isn't a data warehouse because it wasn't designed to facilitate analysis, reporting or even fast retrieval. Others argued that it supported analysis by providing the raw video footage of incidents in the same way that other warehouses provide data to business intelligence applications. Their logic was that the warehouse doesn't do the analysis but it is the analytic software that delivers the insight.
I still don't buy it and we (amicably) agreed to disagree.
On reflection, perhaps it comes down to our different worlds. I'm only interested in data warehouses because of the opportunities they give me to drive greater insight and better decisions in the business.
I think my data warehouse friends have the more modest aim of giving the business the data that makes insight possible - but it's up to the business to make smart use of this data. Isn't this just like a scientist inventing a cure and the drug company making it available to people? Sounds like a big enough challenge to me.
But hold on, doesn't that make me a drug pusher?
1 That's what happens to you when you stop doing actual business intelligence work and become a manager. It's a tough life.
2 Ted Codd was the IBM scientist who invented the relational model for database management. Interestingly, Codd also invented OLAP back in 1993 when he published an article in Computerworld. The article was then retracted by the publisher when they found out that Codd wrote it while working as a consultant for Arbor Software, the inventors of Essbase (which later merged with Hyperion and is now Oracle).