Imagine trying to carry on a conversation with another person where you can only speak in complete and correct directives. Try to order a turkey sandwich with no mayo at a restaurant -
Customer: Select menu from Restaurant
Server: Select choice from customer
Customer: 0 records found
Customer: Select contents of turkey sandwich from menu.
Server: 2 slices of bread, 2 ounces of sliced turkey, lettuce, tomato, mayonnaise, salt and pepper.
Customer: Select turkey sandwich from menu where ingredient not equal to mayonnaise
Server: 0 records found
Customer: Select “2 slices of bread”, “2 ounces of sliced turkey”, “lettuce”, “tomato”, ”salt”, “pepper” from menu where choice = “turkey sandwich”.
Server: 1 record found. Select choice from menu
Customer: Select “Coffee with cream” from menu
Server: 0 records found.
Customer: Select “Coffee” from menu where condiments are cream
Etc., etc, etc.
Now try to imagine how you ask the server to bring a clean spoon.
SQL is a good structured language for applying set based logic to data stores. It is categorically bad for communicating. The worst problem with SQL is that there is no way to ask a question. You can write a SQL “query” which is not a query at all. It is a directive to the database system on how to process data sets and select records to be returned to you.
When a person asks a question it is usually because they don’t know the answer. With SQL based queries you must know exactly the answer you need before you can properly prepare the query.
If you ran out of stock for an item yesterday, a natural human question is “What happened yesterday?” Sorry, that can’t be written in SQL. You can select all orders from yesterday. You can sum them by product and compare the sales to last year. You can review all sorts of facts when you know which facts you need but you can’t ask what happened yesterday. But if the answer to your real question is not in the orders, none of the SQL queries will provide the needed answer. The important information could be in orders but it could be in inventory, receipts or some other table.
With Illuminate you can have partially complete questions where some or even all of the qualifications are not yet known. For example “find 7/24/2010”. You can see the result of the partial question and formulate the next step in the investigation. This is the train of though problem solving process that is completely natural to people but simply not possible with SQL.
Queries from Hell
Thoughts about data warehousing, correlation databases, incremental and associative queries, value based storage (VBS), on the fly indexing, BI, data mining, ETL, SQL limitations, pattern recognition and more about what you don't know that you don't know.
Tuesday, July 20, 2010
Thursday, July 15, 2010
Queries from hell returns
Hello all! After technical problems, vacations, busy schedules and every other possible interruption, I am back to complaining about bad data base designs. I copied some of my favorite posts from the old blog. New ones will be following.
Chaos
If you have never read the original Chaos report from the Standish Group you should stop looking at this screen and immediately read it. When you have finished, come back here and read on.
For those of you who read the report sixteen years ago and have forgotten it, this report, among other things, identifies reasons for failure of software projects. The original report in 1994 showed 16% of software projects succeeded, 53% were completed but with cost and time overruns and reduced deliverables and 31% were complete failures. In 2004 the Standish Group reported that this had improved to 29% succeeded and only 18% failed.
Yikes!!! In 10 years, as an industry, we have gone from 31% total failure to 18% total failure. I am really happy that we are not building airplanes!
One of the primary reasons for the limited improvement of the failure rate that we have been able to achieve is the continued long project timelines. The Chaos report states, “Research at The Standish Group also indicates that smaller time frames… will increase the success rate.” On my scale, a project with a timeline of more than a year has almost no chance of success, while a project with a timeline of a few weeks has a very good chance of success.
What about a data warehouse project? Usually these are measured in months, often exceeding a year. How is the success rate? According to Bill Inmon’s estimate there is a failure rate of 70% – 80%. Other studies have reported failure rates as high as 90%. Given the overall failure rate of software projects from the Standish Group, it seems like data warehouse projects are the primary source of project failures in the IT industry.
Of course, one can respond that building a data warehouse is a big job and requires a lot of time. Where is most of the time spent? Requirements definition, schema design, data normalization, optimizing indexing strategy and designing all of the secondary structure needed for access to information.
Now, I would like to challenge the other database vendors to participate in benchmarks where these factors are evaluated rather than the essentially meaningless TPC-H tests. I think most people would agree that any success is better than a large failure and these are the factors that will determine whether your data warehouse project will need sixty days to succeed or sixty weeks to fail.
For those of you who read the report sixteen years ago and have forgotten it, this report, among other things, identifies reasons for failure of software projects. The original report in 1994 showed 16% of software projects succeeded, 53% were completed but with cost and time overruns and reduced deliverables and 31% were complete failures. In 2004 the Standish Group reported that this had improved to 29% succeeded and only 18% failed.
Yikes!!! In 10 years, as an industry, we have gone from 31% total failure to 18% total failure. I am really happy that we are not building airplanes!
One of the primary reasons for the limited improvement of the failure rate that we have been able to achieve is the continued long project timelines. The Chaos report states, “Research at The Standish Group also indicates that smaller time frames… will increase the success rate.” On my scale, a project with a timeline of more than a year has almost no chance of success, while a project with a timeline of a few weeks has a very good chance of success.
What about a data warehouse project? Usually these are measured in months, often exceeding a year. How is the success rate? According to Bill Inmon’s estimate there is a failure rate of 70% – 80%. Other studies have reported failure rates as high as 90%. Given the overall failure rate of software projects from the Standish Group, it seems like data warehouse projects are the primary source of project failures in the IT industry.
Of course, one can respond that building a data warehouse is a big job and requires a lot of time. Where is most of the time spent? Requirements definition, schema design, data normalization, optimizing indexing strategy and designing all of the secondary structure needed for access to information.
Now, I would like to challenge the other database vendors to participate in benchmarks where these factors are evaluated rather than the essentially meaningless TPC-H tests. I think most people would agree that any success is better than a large failure and these are the factors that will determine whether your data warehouse project will need sixty days to succeed or sixty weeks to fail.
The Information Asset
“Information is a corporate asset” is a mantra that many IT people have been chanting for years. Recognition of information as a real corporate asset would give the IT industry a new status they have longingly sought for decades. However, what kind of asset should it be?
A drill press is an asset and so is a pile of cash. A drill press can be used to efficiently perform one specific type of a manufacturing process and a pile of cash can be used to accomplish almost any business objective in any line of business. If information is rigidly structured, it looks a lot like the drill press. If it can be accessed in any way, at any time for any purpose, then it is more like cash.
Obviously, information should be an asset like cash but in most organizations it is much more like the drill press. An EDW that has been built on standard RDBMS structures is optimized for one set of functions and generally performs them well. However, these structures handle different information functions about as badly as a drill press would perform injection molding.
The value of the information asset is directly related to the flexibility of access the business has to that asset. The more flexibility the organization has to using the information, the higher its value will be. If a business person can see a fixed report that is produced each day and all other information comes from phone calls, the information asset has very limited value. If that user can see a dashboard that indicates a problem or opportunity and then immediately begin analyzing the detailed information behind the dashboard, the information has much more value.
However, if a new business opportunity comes up and the decision makers can find answers to all of the new questions associated with that opportunity in time to make an informed decision, then the information has almost unlimited value. This means providing answers to questions that no one thought of just one day earlier much less at a requirements definition meeting eighteen months ago. It means accessing data through paths that no one thought of in any design session. And it means getting answers to these unplanned questions at the speed of thought.
A drill press is an asset and so is a pile of cash. A drill press can be used to efficiently perform one specific type of a manufacturing process and a pile of cash can be used to accomplish almost any business objective in any line of business. If information is rigidly structured, it looks a lot like the drill press. If it can be accessed in any way, at any time for any purpose, then it is more like cash.
Obviously, information should be an asset like cash but in most organizations it is much more like the drill press. An EDW that has been built on standard RDBMS structures is optimized for one set of functions and generally performs them well. However, these structures handle different information functions about as badly as a drill press would perform injection molding.
The value of the information asset is directly related to the flexibility of access the business has to that asset. The more flexibility the organization has to using the information, the higher its value will be. If a business person can see a fixed report that is produced each day and all other information comes from phone calls, the information asset has very limited value. If that user can see a dashboard that indicates a problem or opportunity and then immediately begin analyzing the detailed information behind the dashboard, the information has much more value.
However, if a new business opportunity comes up and the decision makers can find answers to all of the new questions associated with that opportunity in time to make an informed decision, then the information has almost unlimited value. This means providing answers to questions that no one thought of just one day earlier much less at a requirements definition meeting eighteen months ago. It means accessing data through paths that no one thought of in any design session. And it means getting answers to these unplanned questions at the speed of thought.
When is data mining not data mining
Product affinity is one of those processes that used to be relegated to the statisticians in the back room. The process was so time consuming and expensive that it wasn’t done except for very high profile situations. Also, the results frequently proved to be of little value. On one discovery project I found an amazing affinity between bananas and dog food. When I told the client, he laughed and said, “Check for bananas and toilet paper.” Sure enough there was a strong affinity there as well. It seems that bananas have an affinity with almost everything in the store.
With high costs and results like bananas and everything, affinity was one of the more obscure data mining processes. Now though, we can do product affinity as a simple query. To find the products that sell with a selected product, we select the product, ask for the baskets that contained the product and then select all of the products sold in those baskets. Ordering the results by units sold gives us a list of the top items sold with our selected product.
Using the incremental query feature of the iCorrelate query screen, the real value of affinity can be extracted from tons of raw data. For example, when we get the baskets that contain our selected product, we can select only those baskets from a specific time period or a store or region or only the baskets from weekends or mornings. Whatever behavior characteristic we are interested can be used to get the affinity of a selected subset of all baskets.
We can also extend the affinity beyond baskets if customer information is available. When we have the desired baskets, we ask for the customers who purchased those baskets. Then asking for the baskets related to those customers, then the items in those baskets, we get product affinity at the customer level. Rather than market basket analysis, we are doing customer purchase analysis. Selecting only the baskets from a promotion provides another analysis of promotion effectiveness.
The old data mining process for product affinity had limited value and high cost. The incremental query method however, has low cost and high value making it an excellent tool for product managers, promotion planners and other business people who need to analyze shopping behavior.
With high costs and results like bananas and everything, affinity was one of the more obscure data mining processes. Now though, we can do product affinity as a simple query. To find the products that sell with a selected product, we select the product, ask for the baskets that contained the product and then select all of the products sold in those baskets. Ordering the results by units sold gives us a list of the top items sold with our selected product.
Using the incremental query feature of the iCorrelate query screen, the real value of affinity can be extracted from tons of raw data. For example, when we get the baskets that contain our selected product, we can select only those baskets from a specific time period or a store or region or only the baskets from weekends or mornings. Whatever behavior characteristic we are interested can be used to get the affinity of a selected subset of all baskets.
We can also extend the affinity beyond baskets if customer information is available. When we have the desired baskets, we ask for the customers who purchased those baskets. Then asking for the baskets related to those customers, then the items in those baskets, we get product affinity at the customer level. Rather than market basket analysis, we are doing customer purchase analysis. Selecting only the baskets from a promotion provides another analysis of promotion effectiveness.
The old data mining process for product affinity had limited value and high cost. The incremental query method however, has low cost and high value making it an excellent tool for product managers, promotion planners and other business people who need to analyze shopping behavior.
The Airplane and the Data Warehouse
I have been flying more than a pilot lately and while idly passing time waiting until I could get my computer back out I thought about how similar the progression of flying is to the progression of data warehouses. A hundred years ago a few skilled pilots could use a plane to go places where there were no direct roads. He could fly faster than a car could drive or a horse could run. Fifty years ago a technician could run a program in a computer that would store and retrieve information faster and more accurately than any person.
Airplanes got faster, better and cheaper. Soon ordinary people like me routinely flew around the world. Computers got faster, better and cheaper. Soon every person in their business and personal life was using them every day.
However, I had to travel from Barcelona Spain to San Diego California. To accomplish this simple task, I flew from Barcelona to Amsterdam. There I waited in an airport for a couple of hours before flying to Minneapolis. After a few more hours there, I flew to San Diego. The return to Barcelona was a similar trip.
If I want to see information about sales for a day, I have to start by accumulating all of the daily transactions. Then I group them and sum the statistics like volume and net prices. Then I need to build my star schema. Now I can use my BI tool to finally see the sales. If I want to see all of the sales of non food products to female customers that were made on weekday mornings it would be more difficult than a trip from an Inuit village in Alaska to a jungle village deep in the Congo basin.
Airplanes meet 80% of the travel needs very well and the computers meet 80% of the information needs well. And better, faster, cheaper technology will not improve that ratio much for either technology. In fact, as people become more used to the technologies, the 80% portion is taken for granted and there is increased demand for the 20% part. Soon the ratio changes to 70/30 or 60/40.
A new technology like a personal transporter or a correlation database is needed to change the old cast-in-concrete methods. I don’t know where to find the personal transporter, but with the Illuminate database you can ask any question directly from the raw data as soon as it is loaded. You can change the direction of the questioning at any time. You can move from one question to the next without starting over and tracing out a new path. It gives you the ability to navigate through your information as you need to rather than through the “airports” that were built to designate paths for you.
Airplanes got faster, better and cheaper. Soon ordinary people like me routinely flew around the world. Computers got faster, better and cheaper. Soon every person in their business and personal life was using them every day.
However, I had to travel from Barcelona Spain to San Diego California. To accomplish this simple task, I flew from Barcelona to Amsterdam. There I waited in an airport for a couple of hours before flying to Minneapolis. After a few more hours there, I flew to San Diego. The return to Barcelona was a similar trip.
If I want to see information about sales for a day, I have to start by accumulating all of the daily transactions. Then I group them and sum the statistics like volume and net prices. Then I need to build my star schema. Now I can use my BI tool to finally see the sales. If I want to see all of the sales of non food products to female customers that were made on weekday mornings it would be more difficult than a trip from an Inuit village in Alaska to a jungle village deep in the Congo basin.
Airplanes meet 80% of the travel needs very well and the computers meet 80% of the information needs well. And better, faster, cheaper technology will not improve that ratio much for either technology. In fact, as people become more used to the technologies, the 80% portion is taken for granted and there is increased demand for the 20% part. Soon the ratio changes to 70/30 or 60/40.
A new technology like a personal transporter or a correlation database is needed to change the old cast-in-concrete methods. I don’t know where to find the personal transporter, but with the Illuminate database you can ask any question directly from the raw data as soon as it is loaded. You can change the direction of the questioning at any time. You can move from one question to the next without starting over and tracing out a new path. It gives you the ability to navigate through your information as you need to rather than through the “airports” that were built to designate paths for you.
Wednesday, July 14, 2010
So, how fast can you load data
When I talk to people about the Illuminate CDBMS, the first reaction I usually get from them is, “That sounds great, but how fast can you load data?”
My usual answer is “How fast is fast enough?”
This is not a flippant answer but an honest question that is intended to start a discussion on the meaning of load speed. The real measurement of load speed should be the time needed from starting the transfer of source data from the operational data stores, to the availability of the business information for the end user. In a multi-layer information infrastructure, that will include time for:
1. Importing raw data
2. Indexing the imported data
3. Running all aggregation processes and building intermediate layer access structures as needed
4. Further aggregations, rebuilding and reindexing star schemas, OLAP cubes and any other customized, end use information delivery structures.
In an RDBMS environment, steps 1 and 2 are very fast since there is very little indexing on the foundation data layer. Depending on the hardware and communications methods, this could be at a rate of perhaps 20, 30 or even more gigabytes per hour in extreme cases with extreme hardware.
Steps 3 and 4 are the most unpredictable since every organization will have differing amounts of work in each layer. However, it is safe to say that this will be the longest part of the load process. A factor of ten times the duration of steps 1 and 2 would not be unusual. An overnight load process that loads for thirty minutes would then need five more hours to complete all of the preparation steps.
In the Illuminate environment, the load is a single step with some aggregations needed in limited cases. When data is loaded into iIuminate, the load process indexes everything. This means that step 2 in this case is zero.
Because everything is indexed, there is no star schema or OLAP cube building required. The foundation layer with 100% indexing already provides the full functionality of those structures. Also, due to the indexing, aggregations will not be needed as soon as they are with an RDBMS. The iLuminate database will minimize the I/O needed for the aggregation so the computational speed of the computer is the determining factor for aggregations. Small aggregations can be done as a part of the query process without impacting the user response time.
While an exact comparison can not be made because of variability of hardware, database design and the number of information layers, an example comparison can demonstrate the conceptual difference between loading with a conventional system and loading with iLuminate.
RDBMS iLuminate
Step 1 – Import 2 gigabytes of raw data – 15 minutes 30 minutes
Step 2 – index foundation layer – 15 minutes 0
Step 3 – build intermediate layer – 60 minutes 30 minutes
and aggregations
Step 4 – build access layers – 60 minutes 0
Total load time – 2.5 hours 1 hour
My usual answer is “How fast is fast enough?”
This is not a flippant answer but an honest question that is intended to start a discussion on the meaning of load speed. The real measurement of load speed should be the time needed from starting the transfer of source data from the operational data stores, to the availability of the business information for the end user. In a multi-layer information infrastructure, that will include time for:
1. Importing raw data
2. Indexing the imported data
3. Running all aggregation processes and building intermediate layer access structures as needed
4. Further aggregations, rebuilding and reindexing star schemas, OLAP cubes and any other customized, end use information delivery structures.
In an RDBMS environment, steps 1 and 2 are very fast since there is very little indexing on the foundation data layer. Depending on the hardware and communications methods, this could be at a rate of perhaps 20, 30 or even more gigabytes per hour in extreme cases with extreme hardware.
Steps 3 and 4 are the most unpredictable since every organization will have differing amounts of work in each layer. However, it is safe to say that this will be the longest part of the load process. A factor of ten times the duration of steps 1 and 2 would not be unusual. An overnight load process that loads for thirty minutes would then need five more hours to complete all of the preparation steps.
In the Illuminate environment, the load is a single step with some aggregations needed in limited cases. When data is loaded into iIuminate, the load process indexes everything. This means that step 2 in this case is zero.
Because everything is indexed, there is no star schema or OLAP cube building required. The foundation layer with 100% indexing already provides the full functionality of those structures. Also, due to the indexing, aggregations will not be needed as soon as they are with an RDBMS. The iLuminate database will minimize the I/O needed for the aggregation so the computational speed of the computer is the determining factor for aggregations. Small aggregations can be done as a part of the query process without impacting the user response time.
While an exact comparison can not be made because of variability of hardware, database design and the number of information layers, an example comparison can demonstrate the conceptual difference between loading with a conventional system and loading with iLuminate.
RDBMS iLuminate
Step 1 – Import 2 gigabytes of raw data – 15 minutes 30 minutes
Step 2 – index foundation layer – 15 minutes 0
Step 3 – build intermediate layer – 60 minutes 30 minutes
and aggregations
Step 4 – build access layers – 60 minutes 0
Total load time – 2.5 hours 1 hour
Subscribe to:
Posts (Atom)