Clustered Columnstore Indexes – part 13 (“Dictionaries Analyzed”)

Continuation from the previous 12 parts, starting from Clustered Columnstore Indexes – part 1 (“Intro”).

This time I decided to play a bit more with Clustered Columnstore dictionaries in order to understand them better.

I am continuing to work with the Contoso BI database, but this time I am kicking off in a different way – by executing the following script, which helps me find the tables that might be suitable for Clustered Columnstore Indexes (those with over 1 million rows) and provides some useful information about the areas where Clustered Columnstore Indexes impose limits:
Warning: this code is in development and is based on the current CTP1 version of SQL Server 2014.
Warning 2: you can indeed convert any table with fewer than 1 million rows, and some such tables might greatly benefit from it, but this particular script and blog post focus only on the big tables.
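The original script is not reproduced here, so the following is a minimal sketch of such a candidate-finding query against the system catalog. All names are my own; it also applies the `p.index_id IN (0,1)` filter suggested in the comments below, so that rows are not double-counted across nonclustered indexes:

```sql
-- Sketch: list user tables with over 1 million rows, together with their
-- column, primary key and foreign key counts.
SELECT  t.name AS TableName,
        SUM(p.rows) AS [RowCount],
        (SELECT COUNT(*) FROM sys.columns c
          WHERE c.object_id = t.object_id) AS ColumnCount,
        (SELECT COUNT(*) FROM sys.key_constraints kc
          WHERE kc.parent_object_id = t.object_id
            AND kc.type = 'PK') AS PrimaryKeyCount,
        (SELECT COUNT(*) FROM sys.foreign_keys fk
          WHERE fk.parent_object_id = t.object_id) AS ForeignKeyCount
FROM sys.tables t
    INNER JOIN sys.partitions p
        ON p.object_id = t.object_id
WHERE p.index_id IN (0, 1)     -- Heap or Clustered only, to avoid inflating the row count
GROUP BY t.object_id, t.name
HAVING SUM(p.rows) > 1000000
ORDER BY SUM(p.rows) DESC;
```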

These are the results that I have received when executing this script on my Contoso BI Database:
(Screenshot: results of the candidate-table script on the Contoso BI database)
Basically, the results are telling me that I have 5 tables which are candidates for conversion to a Clustered Columnstore Index: FactOnlineSales, FactInventory, FactSalesQuota, FactSales, FactStrategyPlan. Besides that, we can see the number of columns each table has, the number of its Primary and Foreign Keys, and some other columns which are all fine for our exercise.

Let’s go forward and drop all the respective Primary as well as the Foreign Keys:
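The drop script itself is not shown in this post; one possible sketch is to generate the `ALTER TABLE ... DROP CONSTRAINT` statements from the catalog views and execute them after review. Foreign keys have to go first, since a primary key cannot be dropped while it is still referenced:

```sql
-- Sketch: generate DROP statements for all foreign keys, then all primary keys.
-- Copy the generated statements into a new query window and run them.
SELECT 'ALTER TABLE ' + QUOTENAME(OBJECT_SCHEMA_NAME(fk.parent_object_id))
     + '.' + QUOTENAME(OBJECT_NAME(fk.parent_object_id))
     + ' DROP CONSTRAINT ' + QUOTENAME(fk.name) + ';'
FROM sys.foreign_keys fk;

SELECT 'ALTER TABLE ' + QUOTENAME(OBJECT_SCHEMA_NAME(kc.parent_object_id))
     + '.' + QUOTENAME(OBJECT_NAME(kc.parent_object_id))
     + ' DROP CONSTRAINT ' + QUOTENAME(kc.name) + ';'
FROM sys.key_constraints kc
WHERE kc.type = 'PK';
```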

Now let’s create the respective Clustered Columnstore Indexes for each of those tables:
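For the five candidate tables identified above, the creation statements would look like this (the index names here are my own choice, not necessarily the ones used in the original script):

```sql
-- Create a Clustered Columnstore Index on each of the five fact tables.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactOnlineSales  ON dbo.FactOnlineSales;
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactInventory    ON dbo.FactInventory;
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSalesQuota   ON dbo.FactSalesQuota;
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales        ON dbo.FactSales;
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactStrategyPlan ON dbo.FactStrategyPlan;
```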

To analyze the situation with the dictionaries I decided to run the following query, which gives me back information about the dictionary sizes for each of those tables:
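The exact query is not reproduced here; a sketch of one way to get this information is to join the `sys.column_store_dictionaries` DMV (available in the CTP1 build) to the partition and column metadata:

```sql
-- Sketch: dictionary_id, entry count and on-disk size per column dictionary.
SELECT  OBJECT_NAME(p.object_id) AS TableName,
        c.name AS ColumnName,
        d.dictionary_id,
        d.type,                 -- dictionary type
        d.entry_count,          -- the "Entry Count" discussed below
        d.on_disk_size          -- size in bytes
FROM sys.column_store_dictionaries d
    INNER JOIN sys.partitions p
        ON d.hobt_id = p.hobt_id
    INNER JOIN sys.columns c
        ON c.object_id = p.object_id
       AND c.column_id = d.column_id
ORDER BY TableName, ColumnName, d.dictionary_id;
```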

The results are here (already inserted into an Excel file):
(Screenshot: dictionary size results, exported to Excel)

According to the dictionary_id, here is an image of how the number of available dictionaries develops:
(Screenshot: chart of the number of dictionaries per dictionary_id)

Maybe it is just me, but I see a very clear tendency for the number of dictionaries to decrease as the dictionary_id progresses. Since there is no information at all about what should be considered a Primary (Global) dictionary and what a Secondary one, my guess, based on the number of dictionaries available, is that the first available dictionary_id is the Primary dictionary.
Note: of course this is speculation, and just one man’s in-the-middle-of-other-things research 🙂

Warning: FactInventory & FactSales are represented on the right scale.
(Screenshot: chart of dictionary Entry Counts per dictionary_id)

If we look at the number of entries inside a dictionary, aka the Entry Count results, we might notice the following tendency: the very first dictionary looks somewhat displaced, but starting with dictionary_id = 1 the number of entries definitely climbs before diving down again. This also points to the location of the Primary dictionary – the first one, which contains the smallest number of entries: dictionary_id = 0.

Other thoughts:
Across the tables I have analyzed, the different types of dictionaries occupied different amounts of space. From this I believe that the absolute minimum size for the very first dictionary_id is 60 bytes. What makes things even more interesting is that for the second and third levels of dictionary_id (= 1 and = 2 respectively), the minimum size goes down to 56 bytes – a 4-byte difference between the first level of dictionary_id and the subsequent ones. I wish I had an explanation for all of this; let’s hope that someone discovers it soon, or that Microsoft publishes more internal information about Clustered Columnstore Indexes.

to be continued with Clustered Columnstore Indexes – part 14 (“Partitioning”)

3 thoughts on “Clustered Columnstore Indexes – part 13 (“Dictionaries Analyzed”)”

  1. Pingback: Clustered Columnstore Indexes – part 1 (“Intro”) | Nikoport

  2. Pittsburgh DBA

    I believe you should filter sys.partitions like so:
    AND p.index_id IN (0, 1) -- Heap, Clustered

    Otherwise, the query returns partitions for all indexes, and this inflates the row count (SUM(p.rows)).
