Clustered Columnstore Indexes – part 25 (“Faster Smaller Better Stronger”)

Continuation from the previous 24 parts, starting from http://www.nikoport.com/2013/07/05/clustered-columnstore-indexes-part-1-intro/

I decided to take a simple OLTP table, make it bigger, and tear it apart with a Clustered Columnstore Index – this blog post is about that experiment.
I picked up the always lovely AdventureWorks in its 2012 version and decided to apply the awesome MakeBigAdventure script, created by Adam Machanic.

I kicked off by executing the following script, which will create 31 Million, 62 Million & 125 Million rows respectively. Why? Because I have seen enough cases where OLTP tables grow beyond anything reasonable, even while using the Enterprise Edition of SQL Server and not knowing what Partitioning or Compression is.
Notice that by varying the part “AND a.number BETWEEN 1 AND 50” – changing the last number to 100 and then to 200 – I achieved the respective growth for the 2nd & 3rd test cases.
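
The actual script is Adam Machanic’s MakeBigAdventure; the following is only a rough sketch of the row-multiplication idea behind bigProduct (the column expressions are my assumptions), just to show where that BETWEEN filter sits:

-- Rough sketch only – the real script is MakeBigAdventure by Adam Machanic.
-- Production.Product is multiplied through a cross join with a numbers table;
-- changing the upper bound (50 / 100 / 200) scales the later transaction row counts.
SELECT p.ProductID + (a.number * 1000) AS ProductID,
       p.Name + CONVERT(VARCHAR(10), a.number) AS Name,
       p.ListPrice
INTO   dbo.bigProduct
FROM   Production.Product AS p
       CROSS JOIN master..spt_values AS a
WHERE  a.type = 'P'
       AND a.number BETWEEN 1 AND 50;  -- change 50 to 100 or 200 for the bigger test cases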

Now create a Clustered Columnstore Index on this bigProduct table – it will help when loading the bigTransaction table, which will generate the 30+ Million rows:
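
A minimal sketch of that step (the index name is my assumption):

-- Create the Clustered Columnstore Index on the freshly built dimension table
CREATE CLUSTERED COLUMNSTORE INDEX CCI_bigProduct
    ON dbo.bigProduct;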

Warning: the next script will run for a significant amount of time, so make sure you have enough space (I gave it 7GB), and make sure that your Transaction Log is big enough to support some serious storage. Even though my test VM has just 2 cores, I have a reasonable SSD, which is kind of important when working with these numbers of rows.
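
The real load also comes from the MakeBigAdventure script; the sketch below only illustrates the general pattern of multiplying bigProduct against a numbers table into a HEAP (all expressions and names are my assumptions, not Adam’s actual statements):

-- Rough sketch of the HEAP load pattern; the real statement is in MakeBigAdventure
SELECT ROW_NUMBER() OVER (ORDER BY p.ProductID, a.number)    AS TransactionID,
       p.ProductID,
       DATEADD(DAY, a.number % 1000, '20050101')             AS TransactionDate,
       (a.number % 100) + 1                                  AS Quantity,
       CONVERT(MONEY, p.ListPrice * RAND(CHECKSUM(NEWID()))) AS ActualCost
INTO   dbo.bigTransaction
FROM   dbo.bigProduct AS p
       CROSS JOIN master..spt_values AS a
WHERE  a.type = 'P';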

And now let’s create a Clustered Columnstore Index on our table:
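
Again, a single statement does the job (the index name is my assumption):

-- Create the Clustered Columnstore Index on the big fact table
CREATE CLUSTERED COLUMNSTORE INDEX CCI_bigTransaction
    ON dbo.bigTransaction;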

What I love about this table is that it has a structure and the type of data that is typically found in a lot of applications.
For a better understanding of the type of table we are dealing with, execute the following query, which will return the counts of distinct values stored in our OLTP-style table:
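
Something along these lines (a sketch, assuming the bigTransaction column names used throughout this post):

-- Distinct value counts per column of the OLTP-style fact table
SELECT COUNT(*)                        AS TotalRows,
       COUNT(DISTINCT ProductID)       AS DistinctProducts,
       COUNT(DISTINCT TransactionDate) AS DistinctDates,
       COUNT(DISTINCT Quantity)        AS DistinctQuantities,
       COUNT(DISTINCT ActualCost)      AS DistinctCosts
FROM   dbo.bigTransaction;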

[Screenshot: distinct value counts per column]
You can clearly see that our [Quantity] and [TransactionDate] are quite sparse in distinct values, which is perfectly normal, while [ActualCost] is extremely diverse, which probably means that almost no compression can be applied on top of it.

So let’s see how good our compression actually was:
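
I checked it with a query along these lines against the Row Group DMV (a sketch; the MB conversion is my own formatting):

-- Total size and number of compressed Row Groups for the table
SELECT COUNT(*)           AS RowGroups,
       SUM(rg.total_rows) AS TotalRows,
       CAST(SUM(rg.size_in_bytes) / 1024. / 1024 AS DECIMAL(10,2)) AS SizeInMB
FROM   sys.column_store_row_groups AS rg
WHERE  rg.object_id = OBJECT_ID('dbo.bigTransaction');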

[Screenshot: size and number of Row Groups of the Columnstore table]
We have 312 MB spent on 32 Million rows. Interesting number. :) Without having looked at the data I would have expected a little bit more, but let us carry on.
The number of Row Groups is perfectly fine, as far as I am concerned.

I decided to do a reality check and created a page-compressed copy of the table. I copied all the data into it (a rough sketch of the statements follows below the numbers), and here is the space it occupies:
[Screenshot: space used by the page-compressed copy]
Yes indeed – 462 MB (the value on the screenshot divided by 1024), which is a good third bigger than the Columnstore compressed table.
This means that from the storage point of view, even when comparing occupied sizes against a page-compressed table, Columnstore compression wins easily, but once you start thinking about Columnstore Locking & Blocking, I guess you will not want to use it on a system with regular transactional modifications. :)
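
The page-compressed copy can be built roughly like this (the copy’s table name is my assumption):

-- Copy the data and rebuild the new table with PAGE compression
SELECT *
INTO   dbo.bigTransaction_Page
FROM   dbo.bigTransaction;

ALTER TABLE dbo.bigTransaction_Page
    REBUILD WITH (DATA_COMPRESSION = PAGE);

EXEC sp_spaceused 'dbo.bigTransaction_Page';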

The next step is to check the situation with the dictionaries, by using the following query:
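
A sketch of such a query over sys.column_store_dictionaries (the join and the MB conversion are my own formulation):

-- Number of dictionaries and their total size for the Columnstore table
SELECT COUNT(*) AS Dictionaries,
       CAST(SUM(d.on_disk_size) / 1024. / 1024 AS DECIMAL(9,3)) AS TotalSizeInMB
FROM   sys.column_store_dictionaries AS d
       JOIN sys.partitions AS p
            ON p.hobt_id = d.hobt_id
WHERE  p.object_id = OBJECT_ID('dbo.bigTransaction');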

We have got just 3 dictionaries here, and they are extremely small – 0.111 MB:
[Screenshot: dictionary count and total size]

So let’s take a more detailed look at those dictionaries:
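
For example, something like this, breaking the dictionaries down per column (a sketch, joining back to sys.columns):

-- Dictionary details per column: type, entry count and size
SELECT c.name AS ColumnName,
       d.dictionary_id,
       d.type,
       d.entry_count,
       CAST(d.on_disk_size / 1024. / 1024 AS DECIMAL(9,3)) AS SizeInMB
FROM   sys.column_store_dictionaries AS d
       JOIN sys.partitions AS p
            ON p.hobt_id = d.hobt_id
       JOIN sys.columns AS c
            ON c.object_id = p.object_id
           AND c.column_id = d.column_id
WHERE  p.object_id = OBJECT_ID('dbo.bigTransaction')
ORDER BY c.column_id, d.dictionary_id;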

[Screenshot: dictionary details per column]
Very clear picture here: columns 2 to 4 got dictionaries built for them, while column 1 is an identity and is not compatible with dictionaries in the current version, and the last, 5th column [ActualCost] has way too many distinct values.

Results

Now let us compare all this data with even bigger tables – I executed the same queries for the 65 Million & 121 Million row cases, as described at the beginning of the post, and here are my results:

Time

[Screenshot: data load and index creation times for the three table sizes]
I measured the time to load the data into a HEAP plus the time to create a Clustered Columnstore Index on it, and to me it looks like, when using this tactic, the Columnstore Index creation is less of a problem than the data load itself. You can clearly see from the trend line that the times spent on index creation grow quite proportionally, while loading ever more data directly into a HEAP on my test VM puts increasingly more stress on the system.

Row Groups & Sizes

[Screenshot: Row Group counts and average sizes for the three table sizes]
A very balanced picture here, with the only exception being the 65 Million row case, where I was lucky to get just 61 Row Groups; the average size per Row Group was 10.07, 9.67 and 10.47 MB for the 32/65/121 Million row cases respectively.

Dictionaries

[Screenshot: dictionary counts and sizes for the three table sizes]
The only exciting thing about the dictionaries is that the bigger tables have more (local) dictionaries, but the total size seems to be unaffected by the number of rows, which makes sense – the content of the tables is not that distinct.

Heap & Clustered Columnstore vs Clustered Columnstore Load

At this point I decided to compare those times with loading the data directly into a table that already has a Clustered Columnstore Index. I will spare you the full T-SQL statements (a rough sketch of the pattern follows after the results), so here are the numbers:

[Screenshots: comparison of HEAP & CCI loading vs direct loading into a Clustered Columnstore table – times and resulting sizes]

Loading into Clustered Columnstore tables directly is definitely Faster, and it makes the tables look Smaller (also 1 Row Group less compared to the HEAP & CCI loading method), so the whole situation is much Better and the database performance is much Stronger. :)
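
For completeness, the direct-load pattern looks roughly like this (a sketch only – the table and column definitions are assumptions, not the exact statements used in the tests):

-- Sketch: create the table with a Clustered Columnstore Index first, then load into it
-- (in the actual test the data was generated the same way as for the HEAP load)
CREATE TABLE dbo.bigTransaction_CCI
(
    TransactionID   INT      NOT NULL,
    ProductID       INT      NOT NULL,
    TransactionDate DATETIME NOT NULL,
    Quantity        INT      NOT NULL,
    ActualCost      MONEY    NOT NULL
);

CREATE CLUSTERED COLUMNSTORE INDEX CCI_bigTransaction_CCI
    ON dbo.bigTransaction_CCI;

INSERT INTO dbo.bigTransaction_CCI
       (TransactionID, ProductID, TransactionDate, Quantity, ActualCost)
SELECT TransactionID, ProductID, TransactionDate, Quantity, ActualCost
FROM   dbo.bigTransaction;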

to be continued with Clustered Columnstore Indexes – Part 26: Backup & Restore
