A Step-by-Step Approach to Building an Enterprise Data Warehouse
Many enterprises now benefit from collecting their transactional data and using the insights derived from it to make better decisions. These days, it is hard to find a company that does not have a database system in place.
It's important to make sure that the data residing in your systems is in a usable form, and that is what a data warehouse is for. A data warehouse is essentially a database, or collection of databases, that centralizes a business's information from multiple sources and applications and makes it available for analytics and use across the organization.
For IT managers in an enterprise, the dilemma is how to use the historical data collected over many years. The answer is simple: store everything in one place, manipulate it, and run reports against the database (DB).
A data warehouse's goal is to offer your company a quick and easy way to look at your historical data. Advanced online analytical processing (OLAP) tools let data warehouse users generate easily interpretable reports with one click and track company performance from different angles.
Suppose you run a manufacturing plant that makes thousands of units every day. Here, the information you are interested in may be things like the number of defective products produced per hour. You could compare the number of defective parts produced over a period against the same figure from last year or two years ago, but such a comparison may not give the best picture of your performance.
However, if you run a car rental business, the number of customers who paid for your service this month compared with last month may be of great value. Which historical comparisons matter depends on your business, and if you want to draw such insights from the data residing in your systems, you have to build a data warehouse.
Steps to Build a Data Warehouse
Building a data warehouse basically includes the following steps:
- Extract the transactional data from various sources into a staging area.
- Transform the transactional data.
- Create a dimensional database.
- Load the data into the dimensional database.
- Generate pre-calculated summary values to expedite report generation.
- Get a front-end tool for reporting.
Let us explore the steps in more detail…
#1. Extraction of transactional data
A major part of a data warehouse's construction is pulling data from various sources and putting it all into a centralized storage location. This can be the most complicated step to accomplish, partly because many of the people who built the source systems may have since left the organization.
Identify which database system you will use for the staging area and how you will pull data from the various sources into it. One tool for this is Data Transformation Services (DTS) from Microsoft, which lets you import and export data and ships with MS SQL Server (in later SQL Server releases it was succeeded by SQL Server Integration Services).
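The extraction step can be sketched in a few lines of Python. This is only a minimal illustration, not a replacement for a dedicated tool: it assumes two hypothetical source systems ("sales" and "web") that each expose an `orders` table, uses in-memory SQLite databases as stand-ins for the real sources, and copies their rows into a hypothetical `stg_orders` staging table tagged with the system they came from.

```python
import sqlite3

def make_source(rows):
    """Build a stand-in source database (hypothetical 'orders' table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn

def extract_to_staging(source, staging, source_name):
    """Copy every row of a source 'orders' table into the staging table."""
    staging.execute(
        "CREATE TABLE IF NOT EXISTS stg_orders ("
        "source_system TEXT, order_id INTEGER, customer TEXT, amount REAL)"
    )
    rows = source.execute("SELECT order_id, customer, amount FROM orders").fetchall()
    # Tag each row with its origin so overlapping IDs stay distinguishable.
    staging.executemany(
        "INSERT INTO stg_orders VALUES (?, ?, ?, ?)",
        [(source_name, *row) for row in rows],
    )
    staging.commit()
    return len(rows)

sales_db = make_source([(1, "Acme", 120.0), (2, "Globex", 75.5)])
web_db = make_source([(1, "Initech", 40.0)])
staging_db = sqlite3.connect(":memory:")

copied = extract_to_staging(sales_db, staging_db, "sales")
copied += extract_to_staging(web_db, staging_db, "web")
staged_count = staging_db.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0]
```

Keeping the `source_system` column in staging is a design choice: both sources here use `order_id` 1, and the tag is what keeps those rows apart until the transformation step reconciles them.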
#2. Transforming the transactional data
The next important step is transforming the data extracted from various sources. What makes this complicated is that many companies have data spread across different database management systems (DBMS) such as MS SQL Server, MS Access, Sybase, and Oracle. Other companies may keep data in files, spreadsheets, and even their mail systems.
While constructing a data warehouse, you have to transform data from all these sources by bringing it into the staging area. Before transforming this data, you have to figure out a reliable way to relate the tables and columns of one system to their counterparts in the other systems.
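One simple way to relate columns across systems is an explicit mapping table per source. The sketch below assumes two hypothetical systems, "crm" and "billing", that name the same customer fields differently; the system names, column names, and cleanup rules are all illustrative assumptions.

```python
# Hypothetical mappings: source column name -> common staging column name.
COLUMN_MAPS = {
    "crm": {"cust_nm": "customer_name", "cust_no": "customer_id"},
    "billing": {"CustomerName": "customer_name", "CustID": "customer_id"},
}

def to_staging_row(source_system, record):
    """Rename a source record's columns and normalize its values."""
    mapping = COLUMN_MAPS[source_system]
    row = {target: record[src] for src, target in mapping.items()}
    # Normalize formatting so the same customer compares equal across systems.
    row["customer_name"] = row["customer_name"].strip().title()
    row["customer_id"] = int(row["customer_id"])
    row["source_system"] = source_system
    return row

crm_row = to_staging_row("crm", {"cust_nm": "  acme corp ", "cust_no": "42"})
billing_row = to_staging_row("billing", {"CustomerName": "ACME CORP", "CustID": 42})
```

After the mapping, both records carry the same `customer_name` and `customer_id`, so they can be matched even though the source systems disagreed on column names, casing, and types.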
#3. Creation of a dimensional database
The next step is to create a dimensional model, which is a database modeling approach optimized for reading, summarizing, and analyzing numeric information such as values, counts, balances, and weights.
Most transactional systems are built on the conventional relational model, which is a good option for capturing data. However, relational databases are highly normalized to minimize redundancy and duplicate data.
When designing such a database, you try to get rid of repeating data columns and make every column in a table depend on that table's primary key.
Relational DB systems perform well in an OLTP (On-Line Transaction Processing) environment, but they may perform poorly for reporting and data warehousing, where a single report can require joining many huge tables.
So, the relational format is not that efficient for building reports and aggregating values. The dimensional model provides a better way to improve query performance without hampering data integrity.
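A common form of dimensional model is a star schema: one fact table holding the numeric measures, surrounded by small dimension tables that describe them. The following is a minimal sketch using SQLite; the table and column names (`fact_sales`, `dim_date`, `dim_product`) and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes used to slice the facts.
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- Fact table: numeric measures plus one key per dimension.
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units INTEGER,
    revenue REAL
);

INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO fact_sales VALUES
    (20240101, 1, 10, 100.0),
    (20240101, 2, 5, 75.0),
    (20240201, 1, 4, 40.0);
""")

# A typical report needs only one join per dimension it slices by.
revenue_by_month = conn.execute("""
    SELECT d.year, d.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY d.year, d.month
    ORDER BY d.year, d.month
""").fetchall()
```

Here `revenue_by_month` holds one row per month. The point of the layout is that every report follows the same shape: start from the fact table and join only the dimensions the report groups or filters by.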
#4. Loading data
After building the dimensional model, you need to populate it with the data from the staging DB. This step might involve combining several columns into one or splitting a field into separate columns. You may also need to perform lookups and calculate values required by the dimensional model.
These transformations can be performed at either of two stages: while extracting data from its origin or while loading data into the dimensional model. Which stage to use is a decision you make based on your project.
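The sketch below illustrates the two loading chores mentioned above, splitting a field into columns and performing a lookup, at load time. It assumes a hypothetical staging table `stg_sales` with a single full-name field, and a customer dimension keyed by a generated `customer_key`; all names and rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg_sales (customer_full_name TEXT, amount REAL);
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
    first_name TEXT,
    last_name TEXT,
    UNIQUE (first_name, last_name)
);
CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);
INSERT INTO stg_sales VALUES
    ('Ada Lovelace', 50.0), ('Alan Turing', 20.0), ('Ada Lovelace', 30.0);
""")

def lookup_customer_key(conn, full_name):
    """Split the name field, add the customer if new, return its key."""
    first, _, last = full_name.partition(" ")
    conn.execute(
        "INSERT OR IGNORE INTO dim_customer (first_name, last_name) VALUES (?, ?)",
        (first, last),
    )
    return conn.execute(
        "SELECT customer_key FROM dim_customer WHERE first_name = ? AND last_name = ?",
        (first, last),
    ).fetchone()[0]

staged = conn.execute("SELECT customer_full_name, amount FROM stg_sales").fetchall()
for full_name, amount in staged:
    conn.execute(
        "INSERT INTO fact_sales VALUES (?, ?)",
        (lookup_customer_key(conn, full_name), amount),
    )
conn.commit()

dim_rows = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
fact_rows = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

The three staged sales end up as three fact rows pointing at two customer rows, because the repeated customer is looked up rather than inserted again.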
#5. Generation of pre-calculated summary values
Once the data is loaded, the next step is generating pre-calculated summary values, known as aggregations.
After the dimensional database is populated, tools like SQL Server Analysis Services can generate the aggregations. The more dimensions you have, the more time it may take to generate them.
Whichever dimensional model you choose, though, make sure that the SQL Server machine has as much memory as possible. Building aggregations can be very memory-intensive, and the more memory you make available, the less time it will take to generate the aggregate values.
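To make the idea of an aggregation concrete, here is a hand-rolled sketch in SQLite. It is not how SQL Server Analysis Services works internally; it only shows the principle of computing summaries once and letting reports read the small summary table. The table names and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (year INTEGER, month INTEGER, category TEXT, revenue REAL);
INSERT INTO fact_sales VALUES
    (2024, 1, 'Hardware', 100.0),
    (2024, 1, 'Hardware', 50.0),
    (2024, 1, 'Software', 30.0),
    (2024, 2, 'Hardware', 20.0);

-- Pre-calculate one summary row per (year, month, category) combination.
CREATE TABLE agg_sales_month_category AS
    SELECT year, month, category,
           SUM(revenue) AS revenue,
           COUNT(*) AS row_count
    FROM fact_sales
    GROUP BY year, month, category;
""")

# Reports read the summary table instead of rescanning the fact table.
january_hardware = conn.execute(
    "SELECT revenue FROM agg_sales_month_category "
    "WHERE year = 2024 AND month = 1 AND category = 'Hardware'"
).fetchone()[0]
agg_rows = conn.execute("SELECT COUNT(*) FROM agg_sales_month_category").fetchone()[0]
```

Each dimension added to the `GROUP BY` multiplies the number of combinations to summarize, which is one way to see why more dimensions mean longer aggregation times.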
#6. Getting a front-end reporting tool
Once you have the dimensional DB and aggregations in place, you can build or purchase a reporting tool. Depending on your requirements, you may consider a data drill-down tool like the Pivot Table Service of Microsoft Excel.
However, if your reporting needs exceed what Excel can offer, you may have to spend more resources building or buying a custom reporting tool. Luckily, many vendors today offer such analytical tools at reasonable prices.
Microsoft, for example, recently released a Data Analyzer tool, which can be a very cost-effective option. Consider buying such premium tools before developing your own internal software. Reinventing the wheel may not always be cheap or worth it in the end.