Wednesday, July 14, 2010

So, how fast can you load data

When I talk to people about the Illuminate CDBMS, the first reaction I usually get from them is, “That sounds great, but how fast can you load data?”


My usual answer is “How fast is fast enough?”

This is not a flippant answer but an honest question that is intended to start a discussion on the meaning of load speed. The real measurement of load speed should be the time needed from starting the transfer of source data from the operational data stores, to the availability of the business information for the end user. In a multi-layer information infrastructure, that will include time for:

1. Importing raw data
2. Indexing the imported data
3. Running all aggregation processes and building intermediate layer access structures as needed
4. Further aggregations, rebuilding and reindexing star schemas, OLAP cubes and any other customized, end use information delivery structures.

In an RDBMS environment, steps 1 and 2 are very fast since there is very little indexing on the foundation data layer. Depending on the hardware and communications methods, this could be at a rate of perhaps 20, 30 or even more gigabytes per hour in extreme cases with extreme hardware.

Steps 3 and 4 are the most unpredictable since every organization will have differing amounts of work in each layer. However, it is safe to say that this will be the longest part of the load process. A factor of ten times the duration of steps 1 and 2 would not be unusual. An overnight load process that loads for thirty minutes would then need five more hours to complete all of the preparation steps.

In the Illuminate environment, the load is a single step with some aggregations needed in limited cases. When data is loaded into iIuminate, the load process indexes everything. This means that step 2 in this case is zero.

Because everything is indexed, there is no star schema or OLAP cube building required. The foundation layer with 100% indexing already provides the full functionality of those structures. Also, due to the indexing, aggregations will not be needed as soon as they are with an RDBMS. The iLuminate database will minimize the I/O needed for the aggregation so the computational speed of the computer is the determining factor for aggregations. Small aggregations can be done as a part of the query process without impacting the user response time.

While an exact comparison can not be made because of variability of hardware, database design and the number of information layers, an example comparison can demonstrate the conceptual difference between loading with a conventional system and loading with iLuminate.

                                                                    RDBMS                        iLuminate

Step 1 – Import 2 gigabytes of raw data –     15 minutes                       30 minutes

Step 2 – index foundation layer –                 15 minutes                             0

Step 3 – build intermediate layer –               60 minutes                       30 minutes
              and aggregations
Step 4 – build access layers –                     60 minutes                             0

Total load time –                                           2.5 hours                          1 hour

No comments:

Post a Comment