Clustered Columnstore Indexes – part 45 (“Multi-Dimensional Clustering”)

Continuation from the previous 44 parts, starting from http://www.nikoport.com/2013/07/05/clustered-columnstore-indexes-part-1-intro/

I have blogged a couple of times (part 29 – “Data Loading for Better Segment Elimination”, part 34 – “Deleted Segments Elimination”) about the importance of Segment Elimination, and I have decided to show you something that I personally consider to be a beyond-awesome feature of Columnstore Indexes – I call this thing Multi-Dimensional Clustering (Ordering). I have spent some time delaying the post until PASSCamp 2014, but now that it has officially started, I am dedicating it to this amazing event.

Notice that this technique is valid for Nonclustered Columnstore Indexes as well as for Clustered Columnstore Indexes.
Let us consider the following scenario: we have a SQL Server 2012+ instance with a partitioned table that is being hammered by a number of different queries.
To improve the performance we have identified the most common predicate in our SELECT queries and decided to implement Segment Clustering.

This has helped us a lot, but we need a further improvement, because, to be honest, there is almost never a situation where just 1 query needs to be tuned. 🙂

In this example we have found 2 queries that need to be optimised and – surprise – even though they share 1 common predicate (DateKey), each of them includes a second predicate that is different. We shall see both queries below, when establishing the baseline.

But let’s not hurry too much, and start with our Segment Clustering implementation:
As so often, I will play with ContosoRetailDW, my favourite free database.
Right after restoring the database, I will drop the foreign & primary keys on the FactOnlineSales table, on which we shall run our experiments:
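A minimal sketch of that cleanup – the constraint names below are the ones shipped with the standard ContosoRetailDW backup, so verify them in sys.foreign_keys & sys.key_constraints on your copy:

USE ContosoRetailDW;
GO

-- Foreign keys referencing the dimension tables (names as shipped with ContosoRetailDW)
ALTER TABLE dbo.FactOnlineSales DROP CONSTRAINT FK_FactOnlineSales_DimCurrency;
ALTER TABLE dbo.FactOnlineSales DROP CONSTRAINT FK_FactOnlineSales_DimCustomer;
ALTER TABLE dbo.FactOnlineSales DROP CONSTRAINT FK_FactOnlineSales_DimDate;
ALTER TABLE dbo.FactOnlineSales DROP CONSTRAINT FK_FactOnlineSales_DimProduct;
ALTER TABLE dbo.FactOnlineSales DROP CONSTRAINT FK_FactOnlineSales_DimPromotion;
ALTER TABLE dbo.FactOnlineSales DROP CONSTRAINT FK_FactOnlineSales_DimStore;

-- Primary key (this also drops its underlying index)
ALTER TABLE dbo.FactOnlineSales DROP CONSTRAINT PK_FactOnlineSales_SalesKey;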

I have already studied the data inside our table and decided to create the following partitions & filegroups: 1 for the archived old data, and 1 partition for each of the years with data – 2007, 2008, 2009 & 2010:
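A sketch of those definitions – the filegroup names below are my own choice for this walk-through:

ALTER DATABASE ContosoRetailDW ADD FILEGROUP OldContosoData;
ALTER DATABASE ContosoRetailDW ADD FILEGROUP Contoso2007;
ALTER DATABASE ContosoRetailDW ADD FILEGROUP Contoso2008;
ALTER DATABASE ContosoRetailDW ADD FILEGROUP Contoso2009;
ALTER DATABASE ContosoRetailDW ADD FILEGROUP Contoso2010;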

The following script shall add 1 physical file to each of the above-defined filegroups:
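Along these lines – file names, paths & sizes are illustrative, so adjust them to your environment:

ALTER DATABASE ContosoRetailDW ADD FILE
    (NAME = N'old_data', FILENAME = N'C:\Data\old_data.ndf', SIZE = 100MB, FILEGROWTH = 50MB)
    TO FILEGROUP OldContosoData;
ALTER DATABASE ContosoRetailDW ADD FILE
    (NAME = N'data_2007', FILENAME = N'C:\Data\data_2007.ndf', SIZE = 500MB, FILEGROWTH = 100MB)
    TO FILEGROUP Contoso2007;
ALTER DATABASE ContosoRetailDW ADD FILE
    (NAME = N'data_2008', FILENAME = N'C:\Data\data_2008.ndf', SIZE = 500MB, FILEGROWTH = 100MB)
    TO FILEGROUP Contoso2008;
ALTER DATABASE ContosoRetailDW ADD FILE
    (NAME = N'data_2009', FILENAME = N'C:\Data\data_2009.ndf', SIZE = 500MB, FILEGROWTH = 100MB)
    TO FILEGROUP Contoso2009;
ALTER DATABASE ContosoRetailDW ADD FILE
    (NAME = N'data_2010', FILENAME = N'C:\Data\data_2010.ndf', SIZE = 500MB, FILEGROWTH = 100MB)
    TO FILEGROUP Contoso2010;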

Now that we have our file structure defined, let us create partition function and partition scheme:
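Something like this – DateKey in FactOnlineSales is a datetime, and the function & scheme names are my assumption. With RANGE RIGHT, partition 1 holds everything before 2007 and each year lands in its own partition (2008 becomes partition 3, 2009 becomes partition 4 – we will rely on these numbers later):

-- Boundary points at the start of each year; with RANGE RIGHT a boundary
-- value belongs to the partition on its right
CREATE PARTITION FUNCTION pfOnlineSalesDate (datetime)
    AS RANGE RIGHT FOR VALUES ('2007-01-01', '2008-01-01', '2009-01-01', '2010-01-01');

CREATE PARTITION SCHEME psOnlineSalesDate
    AS PARTITION pfOnlineSalesDate
    TO (OldContosoData, Contoso2007, Contoso2008, Contoso2009, Contoso2010);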

Now it is the right time to create our Columnstore Index, which first requires the presence of a traditional rowstore Clustered Index built on the partition scheme – that is the step which physically moves the data into the right partitions:
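A sketch, with the index name being my assumption:

-- The rowstore clustered index built on the partition scheme distributes
-- the data across the partitions
CREATE CLUSTERED INDEX PK_FactOnlineSales
    ON dbo.FactOnlineSales (DateKey)
    ON psOnlineSalesDate (DateKey);

-- Then we replace it with the Clustered Columnstore Index, keeping the partitioning
CREATE CLUSTERED COLUMNSTORE INDEX PK_FactOnlineSales
    ON dbo.FactOnlineSales
    WITH (DROP_EXISTING = ON)
    ON psOnlineSalesDate (DateKey);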

At this point we should check the situation with the Segment Clustering on the DateKey column (column_id = 2), on which we have partitioned our table:
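The check goes against the sys.column_store_segments catalog view, where min_data_id & max_data_id expose the range of values each segment covers:

SELECT part.partition_number,
       seg.segment_id,
       seg.row_count,
       seg.min_data_id,
       seg.max_data_id
    FROM sys.column_store_segments seg
        INNER JOIN sys.partitions part
            ON seg.hobt_id = part.hobt_id
    WHERE part.object_id = OBJECT_ID('dbo.FactOnlineSales')
        AND seg.column_id = 2   -- DateKey
    ORDER BY part.partition_number, seg.segment_id;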

[Figure: Basic Segment Clustering]

It looks quite fine – we have achieved a solid optimisation of our system at this point.

It is perfectly aligned for Column Number 2 (DateKey), on which we have partitioned our Clustered Columnstore Index, but what about the 2 other columns, StoreKey (Column 3) & SalesAmount (Column 11):

[Figure: StoreKey Basic Clustering]

[Figure: SalesAmount Basic Clustering]
It is very clear that there is no alignment whatsoever involved, because we have partitioned our table on a different column (DateKey).

Let us establish a baseline for the IO performance of our queries:
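For illustration, assume 2 queries of this shape – the DateKey ranges are the ones this walk-through needs, while the StoreKey value & the SalesAmount threshold are hypothetical:

SET STATISTICS IO ON;

-- Query 1: common predicate on DateKey (year 2008) plus a StoreKey predicate
SELECT SUM(sales.SalesAmount)
    FROM dbo.FactOnlineSales sales
    WHERE sales.DateKey >= '2008-01-01' AND sales.DateKey < '2009-01-01'
        AND sales.StoreKey = 306;          -- hypothetical value

-- Query 2: common predicate on DateKey (year 2009) plus a SalesAmount predicate
SELECT COUNT(*)
    FROM dbo.FactOnlineSales sales
    WHERE sales.DateKey >= '2009-01-01' AND sales.DateKey < '2010-01-01'
        AND sales.SalesAmount > 500;       -- hypothetical threshold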

This means 1209 logical reads for the first query & 1325 logical reads for the second. Is this the best that we can do?

Partitioning information:

Let’s see how the data is spread around the different partitions:
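A quick look at sys.partitions is enough for that:

SELECT partition_number,
       rows
    FROM sys.partitions
    WHERE object_id = OBJECT_ID('dbo.FactOnlineSales')
    ORDER BY partition_number;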

[Figure: Partitioning Information]

It is clear now that we will actually be concentrating our work on partitions 3 & 4 respectively, since they contain the data that interests us for the optimisation of those 2 queries.

Query 1 (Partition 3):

We should concentrate on the first query, which looks at the data of the year 2008, and thus let us try to optimise it first. (In practice the order of optimisation does not matter, as long as different partitions are involved.)
Actually, what happens if at this point we reload our partition?
Clustered Columnstore Indexes should allow us to do that without any problem:
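Here is a sketch of that reload; it assumes a staging table dbo.FactOnlineSales2008 that mirrors the structure of dbo.FactOnlineSales & was created ON the Contoso2008 filegroup (full column list omitted here):

-- The staging table must match dbo.FactOnlineSales exactly & live on the
-- Contoso2008 filegroup, otherwise the partition SWITCH further below will fail
INSERT INTO dbo.FactOnlineSales2008 WITH (TABLOCK)
    SELECT *
        FROM dbo.FactOnlineSales
        WHERE DateKey >= '2008-01-01' AND DateKey < '2009-01-01';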

And what if we build a traditional rowstore clustered index on that table before building a Clustered Columnstore Index and switching it back in:
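A sketch, with the index name being an assumption of mine:

-- The rowstore clustered index on the second predicate column (StoreKey)
-- physically sorts the data on disk
CREATE CLUSTERED INDEX CI_FactOnlineSales2008
    ON dbo.FactOnlineSales2008 (StoreKey)
    WITH (MAXDOP = 1);

-- The Clustered Columnstore Index is then built on top of the sorted data
CREATE CLUSTERED COLUMNSTORE INDEX CI_FactOnlineSales2008
    ON dbo.FactOnlineSales2008
    WITH (DROP_EXISTING = ON, MAXDOP = 1);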

Notice that we are building the Clustered Columnstore Index right after the rowstore clustered index has sorted the data on disk, and that MAXDOP = 1 is extremely important for as long as Microsoft has not fixed the bug with Segment Clustering (a parallel build may destroy the sort order).

Now we are ready to switch the data in, but before that we shall remove the copied year 2008 data from the main table:
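As a sketch, expressed with a plain DELETE (on a big partition a SWITCH OUT to a throw-away table would avoid the deleted bitmap, but the DELETE keeps the story simple):

-- Remove the year 2008 rows from the main table - they now live in the
-- staging table & will come back via the partition switch
DELETE FROM dbo.FactOnlineSales
    WHERE DateKey >= '2008-01-01' AND DateKey < '2009-01-01';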

Before switching the data from the table dbo.FactOnlineSales2008 into the main one, we need to add a constraint that will guarantee the boundaries of the data contained inside the table:
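The constraint simply mirrors the boundaries of partition 3 (the constraint name is mine):

-- Proves to the engine that every row of the staging table fits into
-- partition 3 (year 2008), which makes the SWITCH possible
ALTER TABLE dbo.FactOnlineSales2008
    ADD CONSTRAINT CK_FactOnlineSales2008_DateKey
        CHECK (DateKey >= '2008-01-01' AND DateKey < '2009-01-01');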

Time to switch in the data:
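In our naming, this is a single statement:

ALTER TABLE dbo.FactOnlineSales2008
    SWITCH TO dbo.FactOnlineSales PARTITION 3;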

Surprise, it worked! 🙂

Let’s check our structure by re-running the segment metadata query from above for the StoreKey column (column_id = 3), which should show us the Segment Clustering of the 3rd partition (we have data only in partitions 2, 3 & 4):

[Figure: StoreKey 3rd Partition Clustering]

Yes, without any denial we have our Segment Clustering working for the StoreKey column, which has just 3 distinct values in the specified partition – but it still works perfectly!

Let’s run our first query again:

We are down from 1209 logical reads to 696 logical reads.
This means that without changing the data itself – only its physical order inside the partition – we have managed to improve the IO by almost 2 times!

Do not forget that IO in Columnstore Indexes has a strong relationship to the Memory Bandwidth as well as the number of rows processed in a Batch at once, concepts explained in Clustered Columnstore Indexes – part 42 (“Materialisation”).

Query 2 (Partition 4):

Our first query is optimised, and so we can turn our attention to the tuning of the second query, which has a predicate on the SalesAmount column:
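The recipe is the same as for partition 3, only this time the rowstore index sorts on SalesAmount – a condensed sketch, again with the staging table, index & constraint names being my assumptions (dbo.FactOnlineSales2009 mirrors the main table & lives ON the Contoso2009 filegroup):

INSERT INTO dbo.FactOnlineSales2009 WITH (TABLOCK)
    SELECT *
        FROM dbo.FactOnlineSales
        WHERE DateKey >= '2009-01-01' AND DateKey < '2010-01-01';

-- Sort the staging table on the second predicate column of Query 2
CREATE CLUSTERED INDEX CI_FactOnlineSales2009
    ON dbo.FactOnlineSales2009 (SalesAmount)
    WITH (MAXDOP = 1);

CREATE CLUSTERED COLUMNSTORE INDEX CI_FactOnlineSales2009
    ON dbo.FactOnlineSales2009
    WITH (DROP_EXISTING = ON, MAXDOP = 1);

-- Empty partition 4 of the main table & switch the ordered data back in
DELETE FROM dbo.FactOnlineSales
    WHERE DateKey >= '2009-01-01' AND DateKey < '2010-01-01';

ALTER TABLE dbo.FactOnlineSales2009
    ADD CONSTRAINT CK_FactOnlineSales2009_DateKey
        CHECK (DateKey >= '2009-01-01' AND DateKey < '2010-01-01');

ALTER TABLE dbo.FactOnlineSales2009
    SWITCH TO dbo.FactOnlineSales PARTITION 4;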

What about the internal structure – let’s check on it, re-running the segment metadata query for the SalesAmount column (column_id = 11):

[Figure: SalesAmount 4th Partition Clustering]

You can clearly see that starting from the 9th Segment the alignment is perfect – we have got a different Segment Clustering for the 4th partition here. Notice that since we did not touch the data in the other partitions, that data is still perfectly aligned on the StoreKey column for the 1st query.

That’s enough talking, let’s run our second query again:

Our results here are:

We are down from 1325 logical reads to 344 logical reads.
In the case of the 2nd query we have achieved a 4 times improvement!!!

Final Thoughts

I have just shown you how to do multi-dimensional clustering (ordering) of a table with a Columnstore Index, where you can practically control the granularity of the first clustering dimension by controlling the size of the partitions – not always, and of course only up to the maximum supported 15,000 partitions.

The technique implies switching out a partition, ordering it by applying rowstore compression with an index on the desired column, and then creating a Columnstore Index before switching it back in.

Your results will naturally vary depending on the data distribution, and you might not get any improvement if your predicate column has no variation whatsoever inside the partition.

to be continued with Clustered Columnstore Indexes – part 46 (“DateTime compression and performance”)

2 thoughts on “Clustered Columnstore Indexes – part 45 (“Multi-Dimensional Clustering”)”

  1. tobi

Multidimensional clustering is all nice – but it is so hard to control. For example, rebuilding the table will likely “smear” adjacent segments into each other.

    What do you think about the following? Create a composite partitioning key (e.g. “DateKey=2014-12-XX;StoreKey=8XX”, or make this an integer). That way we can reliably shove data into partitions, each of which will have uniform segments. This works for arbitrary dimensionality.

    1. Niko Neugebauer Post author

      Hi Tobi,

      thank you for sharing your thoughts 🙂
I agree that it is hard to control, but there will be enough environments where a composite partitioning key will not be an allowed option.

Also, notice that I have sorted different partitions based on different columns – which allows me to adapt each partition to the most common predicates running against it, and this can’t be done by a composite partitioning key alone.

As time goes by, the queries extracting the data can change as well – and in order to get the best performance you do not have to change your data, just adapt your maintenance processes.

      Best regards,
      Niko
