Queries from Hell: Data warehouse size issues

Does your organization have a multi-terabyte data warehouse? If you said yes, it is a very good bet that it contains massive amounts of redundant data. Only a few very large organizations can generate terabytes of unique structured data. Certainly, you could grow into hundreds of terabytes if you store full length movies in the data warehouse but with the technology that is available today there is no value in storing massive blobs of unstructured material.

What goes into the database?

Many companies talk about having data warehouses with five, ten, one hundred or even more terabytes of data in them. Frequently, this is the measurement of the incoming data, not the final data structure with all indexing, star schemas, OLAP cubes and other performance optimization features. Where does this incredible amount of data come from?
If we look at typical counts of business related objects that are in a data warehouse like customers, orders, products, etc. we can see that it is very difficult to reach these massive sizes if information is stored only one time. For example, assume we are looking at a retail company that has one million customers with an average of one hundred orders with ten detail lines each from each customer each year. The company has three years of history in the data warehouse. Converting these business object counts to database objects and space requirements, we get the table below.

-------------------------------------------------------------------------------------------
Object                Count                      Size (bytes)         Total size (Gbytes)
-------------------------------------------------------------------------------------------
Customer            1,000,000                 2,000                 2.0
Order                 300,000,000              300                    90.0
Order detail    3,000,000,000           100                    300.0
Miscellaneous     20,000,000                500                   10.0
-------------------------------------------------------------------------------------------
   Total 400.0

If this is an example of a retail business, the customer, order and order detail tables will consume most of the required space. The miscellaneous group would cover all other objects. They might stock twenty or thirty thousand items so the product table size is trivial. Inventory records for that number of items would be small even with a number of locations for every item. If they produce an average of ten purchase orders per year per item, the purchase order file is also small. Promotions generally are one or two per week but even with many more than that and a large amount of data for each promotion, it is a relatively small amount of data. All of the other files, personnel, GL, and others are of insignificant size.
This table shows a total of about 400 Gigabytes of raw data – less than half a terabyte. So how does an organization get fifty or one hundred terabytes of real data? Perhaps Wal-Mart or the government of China might really have that much non-duplicated data. In most other cases it comes from redundancy.
The typical data warehouse contains tables that have many versions and subsets of the same data structured for different purposes. This data is thought of by the business as different data from different sources and used for different purposes. In reality, it is essentially the same data, structured differently for different purposes and it must be retained in these diverse structures because RDBMS technology is not capable of using data in a single physical structure for a multitude of different purposes.
Even if the counts in the table above were understated by a factor of ten, there is less than five terabytes of really unique data. Even if the business is a telco with one hundred billion call detail records it is hard to come up with more than ten terabytes of unique data. The conclusion then is obvious; most data warehouses receive and store massive amounts of redundant data.

What happens after it is loaded?

If a data warehouse is expected to function as anything more than a big bucket of raw data, it will need indexing, summary tables, OLAP cubes and other additions to optimize the value and accessibility of the data it holds. As a result, the database size can easily be five to ten times the size of the raw data that was loaded. If the warehouse will be expected to support analytics or information discovery functions, it may grow to twenty times the size of the raw data or even more. It is easy to see how less than one terabyte of unique raw data can be the basis of many terabytes of data warehouse space. Even a moderate sized enterprise will need a very substantial computer with many terabytes of disk to manage a routine data warehouse.
All of this massive duplication and space consumption is required to make the common record structure in a relational format efficient enough for use in an organization. But if the database is capable of using a single physical structure to support any required type of access and it does not require star schemas, snowflake schemas and massive indexing, then the organization will not need massively parallel servers with rooms full of disks to support an enterprise warehouse.
So what? All databases have the same limitations and require essentially the same tuning, right? Wrong!
The value based database technology eliminates the need for storing data in many slightly different forms to support many different uses. A single physical structure supports all forms of applications. In addition, it stores each unique value only once, further reducing redundancy and database size.
Starting with five hundred gigabytes of raw data, the Illuminate database will be less than one terabyte total including all internal structures. This means the data warehouse for a substantial enterprise can be effectively stored and accessed on a low-cost commodity computer.
It also means that the response times will be substantially faster. Often, large disk requirements are considered unimportant since disk space is so cheap. However, the real cost of massive data sets is that the I/O systems, the slowest feature of the computer architecture, are choked by massive data transfers. The fastest computers in the world can not provide information any faster than they can find and move the required data to and from a disk.

Queries from Hell

Tuesday, July 13, 2010

Data warehouse size issues

No comments:

Post a Comment

My Mission

Blog Archive