Following-up to my blog 'Just Tell Me What I'm Doing', this is the second in a series of posts that define the key concepts and terms that make up my analytic world. Everything I do is coloured by my experience actually doing analytics in commercial organisations.
So while I believe these posts will present practical definitions that will be actionable in the business world, I know that there are other worlds in academia and science where they are less relevant. At the very least, people in these areas will gain a better understanding of how business regards analytics. Data Warehouse puritans should perhaps stop reading, take a pill and lie down instead.
Bennett's Analytica
A Practitioner's Guide To Analytics
Data Warehousing
The overall process of providing Information to support business decision making is referred to as either data warehousing or business intelligence. The terms are used interchangeably and often as acronyms like DW/BI, DWH, EDW, ODW and BI.
EDW stands for enterprise data warehouse. This is a warehouse that stores data about activities in many areas of an organisation. An EDW or DWH is typically used to support the strategic business requirements of an organisation.
ODW stands for operational data warehouse. This is a warehouse whose primary purpose is to analyse and report on operational (day-to-day) activities.
Data warehousing refers to not only the data warehouse itself (see below) but also the technologies and activities to put data into it, extract data from it, analyse the data, and inform users through reports, dashboards, alerts and data extracts.
(image copyright unknown)
ETL
Is the acronym of Extract, Transform and Load data into and out of a data warehouse. ETL, also referred to as data integration is an important area of analytics and deserves it's own section in any Analytica. Stay tuned for a fuller discussion in another post.
Briefly, ETL is the activity of:
- Getting data into a data warehouse:
- Extract data from source system files. Source systems are usually the transactional systems of the organisation or are files that contain metadata required to assist in the analysis of the transactional data.
- Transform the data ready to place it into a warehouse. Often multiple sources of data are involved and similar data from different sources must be modified (i.e. transformed) so that they have the same format and/or meaning. When this is completed the data is said to be conformed.
- Load the conformed data into the tables of the data warehouse
- Getting data out of the data warehouse:
- Extract data from the tables of the data warehouse.
- Transform the data so that it is better organised to support the analytic purpose that the business requires to be performed.
- Load the transformed data into another database (called a data mart) or file ready for analytic or reporting use. It's called a data mart because it conjures-up comparisons with a supermarket - a place where you can easily pick what you want to consume.
Just to confuse you, ETL is now a popular term for loading and extracting data from any data store or application - especially in IT middleware buses where data is moved between applications.
Data Warehouse
A data warehouse on the other hand is generally defined as a database where the data is stored in such a way that it facilitates reporting and analysis. Other databases in business tend to be places to store data for transactional use. This difference in purpose has a surprisingly fundamental impact on the way that data is stored and the tools used to get data into and out of the database.
Another way to think about the difference is to think of transactional use to be operational where the speed of recording a transaction is important. Data warehouses on the other hand are used to support decisions where speed of recording is less important than our ability to analyse. This is useful but just to confuse you, data warehouses can also be used for operational reporting and analysis.
Most data warehouses store data in rows within tables but a growing number use columns (see Columnar Data Warehouses below).
Data Mart
Data marts are data stores whose data covers specific business functions for a specific community within an organisation. For example a finance or sales data mart. Data marts are often subsets of data in a data warehouse. Data marts simplify analytics as they:
- Remove data that is not relevant to the intended type of analytic insight required by the business.
- Pre-calculate measures that are frequently needed. This means that the measures can be standardised, reports are simpler to design, and run more quickly as they don't have to do as much calculation.
Data Warehouse Appliances
A DWH Appliance is an integrated set of servers, storage, operating system, database and software that is-optimised for data warehousing. Most appliances today use massively parallel processing (MPP) architectures to provide high query performance and platform scalability. In this respect appliances use the same architecture of the traditional very large scale data warehouses from Teradata. Appliance vendors include Netezza and Greenplum.
Normalised versus dimensional Storage Models
A storage model is the way the data is stored in a database.
A normalised model is one where the tables are grouped together by subject area. Each area containing a data category. Examples are data on customers, products, and the like. The advantages of normalisation is that it is easy to load information into the database. The disadvantages are that the database can end up with a large and confusing number of tables. This often makes it difficult for users understand the data and it also slows down the retrieval of data.
A dimensional model is one where transaction data are separated into 'facts' and 'dimensions'. Facts are typically numeric and transaction data. Dimensions are the reference information that gives context to the facts.
An example (from Wikipedia) helps understand the difference:
"A sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order."
The advantages of the dimensional approach are very big advantages. The first is that the data in the warehouse is easier to understand and use. Secondly, the retrieval of data from the warehouse is fast.
The main disadvantages of the dimensional approach are that making the data easy to understand comes at the cost of a more complex ETL load into the warehouse. This is because maintaining the integrity of the facts and dimensions is complex. The second disadvantage it that it can be difficult to change the data warehouse structure if the organisation changes the way it does business.
Another tricky aspect of the dimensional approach are Slowly Changing Dimensions (SCD). These are dimensions that have data that slowly changes (in case you didn't guess). An example best illustrates the problem:
A common dimension in a data warehouse is one that tracks the sales records of your company's salesforce. Reporting and analysing sales performance should be straightforward but what do you do when a sales person is transferred from one sales team to another. How do you record such a change in your sales dimension so that you remember the past and the present composition of your salesforce? This is important when you do all sorts of performance analytics.
Solving these (and a range of similar) issues involves methodologies referred to as Type 0, 1, 2, 3, 4, and 6 SCDs. The solutions to handling these types are outside of the scope of this post. Stay tuned for a fuller discussion in another post.
Alternative and Emerging Ideas
Columnar Data Warehouses
Wikipedia says it best:
"A column-oriented DBMS is a database management system (DBMS) which stores its content by column rather than by row. This has advantages for databases such as data warehouses and library catalogues, where aggregates are computed over large numbers of similar data items."
The Bus Architecture
A bus in computing is a subsystem that transfers data between computer components or between computers. It has a more specific meaning in data warehousing. Ralph Kimball, one of the founders of data warehousing explains it best. Here is my mashup of his ideas:
You can think of the conformed dimensions and facts of an enterprise as a standard set of connection points for applications — in other words, as a data warehouse bus architecture.
The corresponding bus matrix identifies the key business processes of an organisation, along with their associated dimensions. Business processes (typically corresponding to major source systems) are listed as matrix rows, while dimensions appear as matrix columns. The cells of the matrix are then marked to indicate which dimensions apply to which processes.
In a single document, the data warehouse team has a tool for planning the overall data warehouse, identifying the shared dimensions across the enterprise, coordinating the efforts of separate implementation teams, and communicating the importance of shared dimensions throughout the organisation.
Data warehouse bus architecture. The bus architecture is based on standardised dimensions and facts that let separate data marts, fact tables or OLAP cubes coexist and integrate. In this context, the term bus is not a large motor vehicle. Instead, it refers to the term's early meaning in the electrical power industry (a conductor for collecting electrical currents and distributing them) and commonly used in the computer industry to describe the standard interface specification that lets peripheral devices usefully coexist.
Ralph Kimball is credited with associating these fundamental bus concepts to the data warehouse delivery and presentation environment.
Related Terms and Concepts
Refer also to ODS, Operational Data Store
Refer also to Metadata
Comments? Via form below or send feedback to
[email protected] version 0.1 201002