Queries from Hell: July 2010

Tuesday, July 20, 2010

Can you speak SQL?

Imagine trying to carry on a conversation with another person where you can only speak in complete and correct directives. Try to order a turkey sandwich with no mayo at a restaurant -

Customer: Select menu from Restaurant
Server: Select choice from customer
Customer: 0 records found
Customer: Select contents of turkey sandwich from menu.
Server: 2 slices of bread, 2 ounces of sliced turkey, lettuce, tomato, mayonnaise, salt and pepper.
Customer: Select turkey sandwich from menu where ingredient not equal to mayonnaise
Server: 0 records found
Customer: Select “2 slices of bread”, “2 ounces of sliced turkey”, “lettuce”, “tomato”, ”salt”, “pepper” from menu where choice = “turkey sandwich”.
Server: 1 record found. Select choice from menu
Customer: Select “Coffee with cream” from menu
Server: 0 records found.
Customer: Select “Coffee” from menu where condiments are cream
Etc., etc, etc.
Now try to imagine how you ask the server to bring a clean spoon.

SQL is a good structured language for applying set based logic to data stores. It is categorically bad for communicating. The worst problem with SQL is that there is no way to ask a question. You can write a SQL “query” which is not a query at all. It is a directive to the database system on how to process data sets and select records to be returned to you.
When a person asks a question it is usually because they don’t know the answer. With SQL based queries you must know exactly the answer you need before you can properly prepare the query.
If you ran out of stock for an item yesterday, a natural human question is “What happened yesterday?” Sorry, that can’t be written in SQL. You can select all orders from yesterday. You can sum them by product and compare the sales to last year. You can review all sorts of facts when you know which facts you need but you can’t ask what happened yesterday. But if the answer to your real question is not in the orders, none of the SQL queries will provide the needed answer. The important information could be in orders but it could be in inventory, receipts or some other table.
With Illuminate you can have partially complete questions where some or even all of the qualifications are not yet known. For example “find 7/24/2010”. You can see the result of the partial question and formulate the next step in the investigation. This is the train of though problem solving process that is completely natural to people but simply not possible with SQL.

Thursday, July 15, 2010

Queries from hell returns

Hello all! After technical problems, vacations, busy schedules and every other possible interruption, I am back to complaining about bad data base designs. I copied some of my favorite posts from the old blog. New ones will be following.

Chaos

If you have never read the original Chaos report from the Standish Group you should stop looking at this screen and immediately read it. When you have finished, come back here and read on.

For those of you who read the report sixteen years ago and have forgotten it, this report, among other things, identifies reasons for failure of software projects. The original report in 1994 showed 16% of software projects succeeded, 53% were completed but with cost and time overruns and reduced deliverables and 31% were complete failures. In 2004 the Standish Group reported that this had improved to 29% succeeded and only 18% failed.

Yikes!!! In 10 years, as an industry, we have gone from 31% total failure to 18% total failure. I am really happy that we are not building airplanes!

One of the primary reasons for the limited improvement of the failure rate that we have been able to achieve is the continued long project timelines. The Chaos report states, “Research at The Standish Group also indicates that smaller time frames… will increase the success rate.” On my scale, a project with a timeline of more than a year has almost no chance of success, while a project with a timeline of a few weeks has a very good chance of success.

What about a data warehouse project? Usually these are measured in months, often exceeding a year. How is the success rate? According to Bill Inmon’s estimate there is a failure rate of 70% – 80%. Other studies have reported failure rates as high as 90%. Given the overall failure rate of software projects from the Standish Group, it seems like data warehouse projects are the primary source of project failures in the IT industry.

Of course, one can respond that building a data warehouse is a big job and requires a lot of time. Where is most of the time spent? Requirements definition, schema design, data normalization, optimizing indexing strategy and designing all of the secondary structure needed for access to information.

Now, I would like to challenge the other database vendors to participate in benchmarks where these factors are evaluated rather than the essentially meaningless TPC-H tests. I think most people would agree that any success is better than a large failure and these are the factors that will determine whether your data warehouse project will need sixty days to succeed or sixty weeks to fail.

The Information Asset

“Information is a corporate asset” is a mantra that many IT people have been chanting for years. Recognition of information as a real corporate asset would give the IT industry a new status they have longingly sought for decades. However, what kind of asset should it be?

A drill press is an asset and so is a pile of cash. A drill press can be used to efficiently perform one specific type of a manufacturing process and a pile of cash can be used to accomplish almost any business objective in any line of business. If information is rigidly structured, it looks a lot like the drill press. If it can be accessed in any way, at any time for any purpose, then it is more like cash.

Obviously, information should be an asset like cash but in most organizations it is much more like the drill press. An EDW that has been built on standard RDBMS structures is optimized for one set of functions and generally performs them well. However, these structures handle different information functions about as badly as a drill press would perform injection molding.

The value of the information asset is directly related to the flexibility of access the business has to that asset. The more flexibility the organization has to using the information, the higher its value will be. If a business person can see a fixed report that is produced each day and all other information comes from phone calls, the information asset has very limited value. If that user can see a dashboard that indicates a problem or opportunity and then immediately begin analyzing the detailed information behind the dashboard, the information has much more value.

However, if a new business opportunity comes up and the decision makers can find answers to all of the new questions associated with that opportunity in time to make an informed decision, then the information has almost unlimited value. This means providing answers to questions that no one thought of just one day earlier much less at a requirements definition meeting eighteen months ago. It means accessing data through paths that no one thought of in any design session. And it means getting answers to these unplanned questions at the speed of thought.

When is data mining not data mining

Product affinity is one of those processes that used to be relegated to the statisticians in the back room. The process was so time consuming and expensive that it wasn’t done except for very high profile situations. Also, the results frequently proved to be of little value. On one discovery project I found an amazing affinity between bananas and dog food. When I told the client, he laughed and said, “Check for bananas and toilet paper.” Sure enough there was a strong affinity there as well. It seems that bananas have an affinity with almost everything in the store.

With high costs and results like bananas and everything, affinity was one of the more obscure data mining processes. Now though, we can do product affinity as a simple query. To find the products that sell with a selected product, we select the product, ask for the baskets that contained the product and then select all of the products sold in those baskets. Ordering the results by units sold gives us a list of the top items sold with our selected product.

Using the incremental query feature of the iCorrelate query screen, the real value of affinity can be extracted from tons of raw data. For example, when we get the baskets that contain our selected product, we can select only those baskets from a specific time period or a store or region or only the baskets from weekends or mornings. Whatever behavior characteristic we are interested can be used to get the affinity of a selected subset of all baskets.

We can also extend the affinity beyond baskets if customer information is available. When we have the desired baskets, we ask for the customers who purchased those baskets. Then asking for the baskets related to those customers, then the items in those baskets, we get product affinity at the customer level. Rather than market basket analysis, we are doing customer purchase analysis. Selecting only the baskets from a promotion provides another analysis of promotion effectiveness.

The old data mining process for product affinity had limited value and high cost. The incremental query method however, has low cost and high value making it an excellent tool for product managers, promotion planners and other business people who need to analyze shopping behavior.

The Airplane and the Data Warehouse

I have been flying more than a pilot lately and while idly passing time waiting until I could get my computer back out I thought about how similar the progression of flying is to the progression of data warehouses. A hundred years ago a few skilled pilots could use a plane to go places where there were no direct roads. He could fly faster than a car could drive or a horse could run. Fifty years ago a technician could run a program in a computer that would store and retrieve information faster and more accurately than any person.

Airplanes got faster, better and cheaper. Soon ordinary people like me routinely flew around the world. Computers got faster, better and cheaper. Soon every person in their business and personal life was using them every day.

However, I had to travel from Barcelona Spain to San Diego California. To accomplish this simple task, I flew from Barcelona to Amsterdam. There I waited in an airport for a couple of hours before flying to Minneapolis. After a few more hours there, I flew to San Diego. The return to Barcelona was a similar trip.

If I want to see information about sales for a day, I have to start by accumulating all of the daily transactions. Then I group them and sum the statistics like volume and net prices. Then I need to build my star schema. Now I can use my BI tool to finally see the sales. If I want to see all of the sales of non food products to female customers that were made on weekday mornings it would be more difficult than a trip from an Inuit village in Alaska to a jungle village deep in the Congo basin.

Airplanes meet 80% of the travel needs very well and the computers meet 80% of the information needs well. And better, faster, cheaper technology will not improve that ratio much for either technology. In fact, as people become more used to the technologies, the 80% portion is taken for granted and there is increased demand for the 20% part. Soon the ratio changes to 70/30 or 60/40.

A new technology like a personal transporter or a correlation database is needed to change the old cast-in-concrete methods. I don’t know where to find the personal transporter, but with the Illuminate database you can ask any question directly from the raw data as soon as it is loaded. You can change the direction of the questioning at any time. You can move from one question to the next without starting over and tracing out a new path. It gives you the ability to navigate through your information as you need to rather than through the “airports” that were built to designate paths for you.

Wednesday, July 14, 2010

So, how fast can you load data

When I talk to people about the Illuminate CDBMS, the first reaction I usually get from them is, “That sounds great, but how fast can you load data?”

My usual answer is “How fast is fast enough?”

This is not a flippant answer but an honest question that is intended to start a discussion on the meaning of load speed. The real measurement of load speed should be the time needed from starting the transfer of source data from the operational data stores, to the availability of the business information for the end user. In a multi-layer information infrastructure, that will include time for:

1. Importing raw data
2. Indexing the imported data
3. Running all aggregation processes and building intermediate layer access structures as needed
4. Further aggregations, rebuilding and reindexing star schemas, OLAP cubes and any other customized, end use information delivery structures.

In an RDBMS environment, steps 1 and 2 are very fast since there is very little indexing on the foundation data layer. Depending on the hardware and communications methods, this could be at a rate of perhaps 20, 30 or even more gigabytes per hour in extreme cases with extreme hardware.

Steps 3 and 4 are the most unpredictable since every organization will have differing amounts of work in each layer. However, it is safe to say that this will be the longest part of the load process. A factor of ten times the duration of steps 1 and 2 would not be unusual. An overnight load process that loads for thirty minutes would then need five more hours to complete all of the preparation steps.

In the Illuminate environment, the load is a single step with some aggregations needed in limited cases. When data is loaded into iIuminate, the load process indexes everything. This means that step 2 in this case is zero.

Because everything is indexed, there is no star schema or OLAP cube building required. The foundation layer with 100% indexing already provides the full functionality of those structures. Also, due to the indexing, aggregations will not be needed as soon as they are with an RDBMS. The iLuminate database will minimize the I/O needed for the aggregation so the computational speed of the computer is the determining factor for aggregations. Small aggregations can be done as a part of the query process without impacting the user response time.

While an exact comparison can not be made because of variability of hardware, database design and the number of information layers, an example comparison can demonstrate the conceptual difference between loading with a conventional system and loading with iLuminate.

                                                                    RDBMS                        iLuminate

Step 1 – Import 2 gigabytes of raw data –     15 minutes                       30 minutes

Step 2 – index foundation layer –                 15 minutes                             0

Step 3 – build intermediate layer –               60 minutes                       30 minutes
              and aggregations
Step 4 – build access layers –                     60 minutes                             0

Total load time –                                           2.5 hours                          1 hour

Value Based Storage

Just what is value based storage? This is a common question I often need to answer. It sounds like some kind of marketing jingle but, in fact, it is the storage structure technology that we are talking about.

The Illuminate internal storage looks dramatically different from either row based storage (Oracle, SQL Server, etc) or column based storage (Sybase IQ, Vertica, etc). Value Based Storage (VBS) stores the data values in sets (value pool) based on data type and each distinct value is stored only once. The original record definition is maintained by an indexing system that links to values the value pool. These two objects are enhanced by the extended metadata giving the iLuminate database its complete functionality.

This storage structure provides several benefits. The database is very small, in some cases even smaller than the raw data that went into it. In addition to simply being small, it concentrates the most active elements of the data into very compact sets. The value pool is often small enough to fit in memory. As a result, the slowest part of database processing, transfer of data from disk to memory, is dramatically reduced.

The VBS also provides some data access features that can not be duplicated in any other structure. A search can be executed starting with a data value in the pool. By going from the value pool back to the index, it is possible to quickly locate every use of the value wherever is may be used in the logical record structures.

This structure also enables our incremental query capability. As the result of a query, the database returns a set of instance identifiers rather than a set of records. This is because there are no records, only pointers and values. With the response being a set of pointers, it is a simple matter to perform the next query step and then get the union or difference between the two sets of pointers for the result of the second query step. This process can be continued indefinitely with the result set shrinking or growing as the new results are merged with the old.

The extended metadata and the indexing structure and logic provide more features but that can be the subject of another discussion.

Incremental Analytical Queries

Analysis: "A breaking up of a whole into its parts to find out their nature" (Websters New World Dictionary)

It seems strange to talk about analytical queries when the entire analysis must first be defined in order to create the SQL query. The analysis is really in the preparation and the actual query becomes just another report generation process. If the analysis is not exactly what is needed, there is no way to know that until the report is done. Then the analysis/report process starts again or, all too frequently, the process is abandoned without a result.

The incremental query moves the analysis into the query process where it really belongs. The person who needs the result of the analysis can ask questions until the analysis is complete and the answer has been found.

For example, a retailer wants to prepare a list of customers for a promotion. They want the usual criteria like geographic and demographic selections and they want at least 5,000 good names and not more than 6,000.

The initial geographic and demographic questions are asked and there are 25,000 customers qualified. Now, the retailer can just take the first 5000 names or the quality of the list can be improved through some analysis, narrowing the list to the best 5000 not simply the first 5000.

With the incremental query process provided by Illuminate, they continue by asking for only customers who had previously purchased a specific type of product. Now the 25,000 names have been reduced to 10,000 better qualified customers. Add another query step to select only those who shop on weekends. Now they have the 5000 customers who precisely fit the profile that they want.

With the incremental approach, you ask for one qualification at a time and you can review the results between each step. The process just described would take a minute or two. In an RDBMS environment, reaching the same result would require either an extensive analytical process to define the final query or multiple sessions of: query design, examination of results and query redesign until the final result is reached.

Due to the cost of the old approach, the typical process will be to stop with the first query step and just take 5000 off the list. The quality of the promotion will be compromised but if the promotion planning is done on a weekly basis, there is no time for the alternative.

The business suffers due to the limits of technology.

Which tool do I use?

I monitor a few BI web sites and I talk with many IT folks working in DW and BI. In spite of how many times I see and hear it, I am still astonished when people discuss which is the best BI tool.

Many years ago I heard a smart person say the whole world looks like a nail when the only tool you have is a hammer. It seems that rule is as constant as the sunrise.

Comparing SQL Server, Cognos, SAS and Qliktech is like comparing hammers, saws, drills and wrenches.   These are all very different products with different strengths and weaknesses. Each of them may be the best tool for something but none is best for everything in BI any more than one of the builders’ tools is the best for building a house.

My interest is in databases and I am frequently told something like “I can do that with (SAS, Business Objects, Pentaho,…) when I describe a feature of our database.

Yes, perhaps, but I guess you could use a hammer to smash the end off a board if you didn’t have a saw. Does that make sense? Of course not, but many people do not see the obvious parallel.

For example, the Illuminate database indexes everything, always. This makes it an excellent foundation for analytics and information discovery.    Because of the indexing, any column of any table can be either a dimension or a fact depending on the nature of the business question. Any column or columns of any table or tables can be used as selection criteria and there will never be a full table scan.

Can you duplicate these features with SQL Server? Theoretically yes, but in the real world, it can never be done. This would require building as many star schemas and indices as there are columns in the database.   You would need more disk than can be attached to a computer and most of the stars would never be used. Unfortunately, you can’t predict which of them would never be used. With SQL Server you can resolve some of the business issues but never all of them.
Can you duplicate these features with an OLAP product? Even theoretically this is not possible. How about with SAS or Business Objects or XYZ? No. But can it be done with the right suite of products?

Again, no. The only product in a complete suite that could attempt to provide the indexing power is the database and the relational databases can not do it. Front end systems and mathematical algorithms can not even address the issue. There is no product or combination of products that can duplicate the benefits of the indexing function.

The real statement that people are making is that they know how to work well with a given tool and they would rather focus on the things they can do with that tool than solve business problems. As a result, the query from hell seems to have a long and healthy future.

How many DBA's does it take to change a light bulb?

Come on! They have more important things to do.

How many DBA’s does it take to optimize performance of a data warehouse? The answer should be the same.

A DBA should be focused on making the information owned by the enterprise as accessible and useful to the enterprise as possible. Unfortunately, when the data warehouse has been built on a relational database, performance tuning issues will consume most of their time. Precious little time is left for tasks like identifying user needs, coaching designers and developers, developing and managing database access policy and the myriad of other tasks that directly benefit the business.

Imagine a database that never needs reorganization, never needs a new index and can take in new data sources as if they had been planned for years. Imagine a database that looks like a star schema designed for every possible query; every column of every table can be either a fact or dimension based on the query.

The value based structure provides all of these benefits and more. In this system, each value is stored once regardless of where or how frequently it may be used. The core of the database is an index and linking system that links all of those values into their proper place in records. Because of this structure, every column of every table is always indexed. There is no way to access anything in this database except through the index. Inserts, updates and deletes never result in overflow areas or empty spaces in the database – no reorganizing ever. The size of the database is dramatically less than it would be with the same data in an RDBMS based data warehouse, nearly eliminating physical design issues.

When the database system does all of that for the DBAs, they can spend a lot more time with users and developers to maximize the benefit that the organization gets from its information. The organizations information is an asset and it is the job of the DBA to ensure that the asset is used as effectively as possible.

Spending time on ensuring that the database is designed to take advantage of every possible processor cycle does very little to help the business.

Flexibility is Cash

How can the information in an RDBMS be converted from a drill press into a pile of cash?

Add flexibility!

A drill press does one thing very well, while cash can be used to accomplish many objectives. The structure of data in an RDBMS defines the boundaries of its use just as the motors, gears and levers of a drill press define its use.

A data structure that allows complete flexibility of use frees the corporate information asset to be used more like cash. As long as data is stored in records and indexed for access, it will do one thing very well and it will require armies of smart people to make it deliver minimal results for all other information functions.

There are some relatively new options for supporting information discovery and analytics in the market today that were not available when most data warehouses were being planned and implemented.

Column structure was introduced by Sybase and other companies have joined in with their new products. Value based structure is available only from Illuminate today.

Column structured database systems store data in columns rather than records. Adding a new record in this structure requires splitting the new record into individual values and then updating the column sets individually. Retrieving information involves finding the desired value in a column and the retrieving all of the values from the same relative position in all other columns. While this may sound complicated and difficult, it actually takes advantage of the fact that processor cycles are cheap and plentiful while disk bandwidth is limited and expensive. This structure reduces the I/O load and disk usage when compared to record structures.

The value based structure stores unique data values once in a data pool and maintains an indexing system that reconstructs the records as needed. Like the column structure, this structure uses processor cycles to reduce I/O. However, the value based structure reduces I/O even more than the column structure and makes possible new analytical and discovery functions that are unique to this structure. This is as close as information gets to cash today.

Tuesday, July 13, 2010

Data warehouse size issues

Does your organization have a multi-terabyte data warehouse? If you said yes, it is a very good bet that it contains massive amounts of redundant data. Only a few very large organizations can generate terabytes of unique structured data. Certainly, you could grow into hundreds of terabytes if you store full length movies in the data warehouse but with the technology that is available today there is no value in storing massive blobs of unstructured material.

What goes into the database?

Many companies talk about having data warehouses with five, ten, one hundred or even more terabytes of data in them. Frequently, this is the measurement of the incoming data, not the final data structure with all indexing, star schemas, OLAP cubes and other performance optimization features. Where does this incredible amount of data come from?
If we look at typical counts of business related objects that are in a data warehouse like customers, orders, products, etc. we can see that it is very difficult to reach these massive sizes if information is stored only one time. For example, assume we are looking at a retail company that has one million customers with an average of one hundred orders with ten detail lines each from each customer each year. The company has three years of history in the data warehouse. Converting these business object counts to database objects and space requirements, we get the table below.

-------------------------------------------------------------------------------------------
Object                Count                      Size (bytes)         Total size (Gbytes)
-------------------------------------------------------------------------------------------
Customer            1,000,000                 2,000                 2.0
Order                 300,000,000              300                    90.0
Order detail    3,000,000,000           100                    300.0
Miscellaneous     20,000,000                500                   10.0
-------------------------------------------------------------------------------------------
   Total 400.0

If this is an example of a retail business, the customer, order and order detail tables will consume most of the required space. The miscellaneous group would cover all other objects. They might stock twenty or thirty thousand items so the product table size is trivial. Inventory records for that number of items would be small even with a number of locations for every item. If they produce an average of ten purchase orders per year per item, the purchase order file is also small. Promotions generally are one or two per week but even with many more than that and a large amount of data for each promotion, it is a relatively small amount of data. All of the other files, personnel, GL, and others are of insignificant size.
This table shows a total of about 400 Gigabytes of raw data – less than half a terabyte. So how does an organization get fifty or one hundred terabytes of real data? Perhaps Wal-Mart or the government of China might really have that much non-duplicated data. In most other cases it comes from redundancy.
The typical data warehouse contains tables that have many versions and subsets of the same data structured for different purposes. This data is thought of by the business as different data from different sources and used for different purposes. In reality, it is essentially the same data, structured differently for different purposes and it must be retained in these diverse structures because RDBMS technology is not capable of using data in a single physical structure for a multitude of different purposes.
Even if the counts in the table above were understated by a factor of ten, there is less than five terabytes of really unique data. Even if the business is a telco with one hundred billion call detail records it is hard to come up with more than ten terabytes of unique data. The conclusion then is obvious; most data warehouses receive and store massive amounts of redundant data.

What happens after it is loaded?

If a data warehouse is expected to function as anything more than a big bucket of raw data, it will need indexing, summary tables, OLAP cubes and other additions to optimize the value and accessibility of the data it holds. As a result, the database size can easily be five to ten times the size of the raw data that was loaded. If the warehouse will be expected to support analytics or information discovery functions, it may grow to twenty times the size of the raw data or even more. It is easy to see how less than one terabyte of unique raw data can be the basis of many terabytes of data warehouse space. Even a moderate sized enterprise will need a very substantial computer with many terabytes of disk to manage a routine data warehouse.
All of this massive duplication and space consumption is required to make the common record structure in a relational format efficient enough for use in an organization. But if the database is capable of using a single physical structure to support any required type of access and it does not require star schemas, snowflake schemas and massive indexing, then the organization will not need massively parallel servers with rooms full of disks to support an enterprise warehouse.
So what? All databases have the same limitations and require essentially the same tuning, right? Wrong!
The value based database technology eliminates the need for storing data in many slightly different forms to support many different uses. A single physical structure supports all forms of applications. In addition, it stores each unique value only once, further reducing redundancy and database size.
Starting with five hundred gigabytes of raw data, the Illuminate database will be less than one terabyte total including all internal structures. This means the data warehouse for a substantial enterprise can be effectively stored and accessed on a low-cost commodity computer.
It also means that the response times will be substantially faster. Often, large disk requirements are considered unimportant since disk space is so cheap. However, the real cost of massive data sets is that the I/O systems, the slowest feature of the computer architecture, are choked by massive data transfers. The fastest computers in the world can not provide information any faster than they can find and move the required data to and from a disk.

Queries from Hell