Columnstore Indexes – part 62 (“Row Groups Trimming”)

Continuation from the previous 61 parts, the whole series can be found at http://www.nikoport.com/columnstore/.

I have spoken a lot of times on the aspect of the Columnstore Indexes maintenance during my presentations, and I have even written 2 blogposts
Columnstore Indexes – part 36 (“Maintenance Solutions for Columnstore”) & Columnstore Indexes – part 57 (“Segment Alignment Maintenance”) plus I have created a Connect Item on another important aspect of it – Columnstore Segments Maintenance – Remove & Merge, but one thing I kept talking and even writing about it’s analysis improvement in SQL Server 2016 (Columnstore Indexes – part 56 (“New DMV’s in SQL Server 2016″)) I have never dedicated a whole post.
It’s name is Row Groups Trimming and this is the event that happens when your Row Groups do not reach the maximum allowed number of rows, but getting closed & compressed before.
It’s time to write about it now. 🙂

For setup, I will run the following script on the free ContosoRetailDW database, it will restore the database, drop the foreign key constraints, primary key and create a clustered columnstore index on my test table FactOnlineSales.

I will run the following exercise on the SQL Server 2014 instance, and later will switch to SQL Server 2016 to show what was improved in the upcoming version.

We expect everything to be perfect or as close to perfect as possible, hoping to have 1048576 rows in each and every Row Group. To have Row Groups close to the maximum allowed value (1048576 rows) will allow us to have the fastest read-ahead operations possible (should those Segments MB-sized) and Segment Elimination (we can eliminate vast amounts of data a number of times).
Let’s check the results with the query against sys.column_store_row_groups DMV:

Row Groups ListYou can see that from 13 Row Group that I have, only the last 2 did not reached the maximum allowed number of rows – they have 435710 & 657562 rows respectively. We can make an educated guess, that because the sum of those rows (1093272) is superior to the maximum allowed number of rows in a Row Group (1048576), the engine decided to split them between 2 Row Groups, while achieving a good compression trying to balance them right.
In SQL Server 2014 we have no possibility to do a correct post-mortem analysis, we can run the respective Extended Events while rebuilding the index, which would help us to determine the cause of the Row Groups trimming, but should we arrive on the scene after the rebuild we can only determine the cause in SQL Server 2016 with the introduction of the sys.dm_db_column_store_row_group_physical_stats DMV.

We have a number of ways to get smaller then the maximum number of rows inside a Row Group:
– Upgrade from a previous version
– Bulk Load (loading more then 102400 Rows and not dividable amount on 1045678 Rows will produce smaller Row Groups)
– Alter Index Reorganize operation invocation. Especially the one with the COMPRESS_ALL_ROW_GROUPS = ON hint 🙂
– Memory Pressure.
– Dictionary Pressure.
– Not having exact number of rows that is dividable through 1045678.

Note that also in practice running Rebuilds with DOP > 1 will produce unevenly distributed Row Groups.

To exemplify what I am writing here about, imagine that we are loading data from 2 files, each with 110.000 Rows. For that, I will export this amount of data from the FactOnlineSales into a text file on the disk, enabling xp_cmdshell before:

Now I will import that amount of information into FactOnlineSales twice:

Now, if we run the same analysis query agains sys.column_store_row_groups DMV, we shall see 2 extra Row Groups, each one with 110K Rows:

Row Groups List with 2 Bulk LoadsNow with 2 more not full Row Groups the situation has got a little bit more different – now we might have some percentage of the not full Row Groups that actually might require a rebuild of the table or respective partition!

Let’s run a simply analytical query, which will count the number of non-full (aka Trimmed) Row Groups:

Trimmed Row Groups AnalysisThis result shows us an important piece of analysis for our table on partition basis: the number of Trimmed Row Groups, the total Percentage of the Trimmed Row Groups, Average Number of Rows in a Row Group and the total number of Row Groups.

We have 4 Trimmed Row Groups – This is nice, but what does it exactly mean ?
Let us abstract from the number of trimmed Row Groups and their total percentage, and simply take the average number of rows and by calculating the total size of the table, to see if we can potentially optimize the number of Row Groups by making it smaller:

This means that if we Rebuild this table, in the best case we should be able to lower the total number of Row Groups by 2 (from 15), which represents a very good percentage ~13%!

Applying this logic, here is an updatable version of the very same query, which will provide me with the optimised numbers that I do not have to calculate manually any more:

Trimmed Row Groups Analysis OptimizedNow this is much better & interesting view.
But wait a second, is this really a 3rd perspective ? Yes, because we have no deleted rows at the moment, and the Segment Alignment can be tuned externally (by creating a )

Trimmed Row Groups Analysis Optimized Bigger PictureFrom the picture above, one can see that there is no deleted fragmentation, but there is a great opportunity to improve our table Trimmed Groups fragmentation, and by doing it to lower the overall number of Row Groups by ~13.33%.
I will invoke the rebuild process for the Clustered Columnstore Index, by issuing ALTER INDEX REBUILD command:

Consulting the last analytical command shows the following result:
Trimmed Row Groups Analysis Optimized FinalNow I can see that our total number of Row Groups has lowered to 13 as expected, and even though we have 2 Trimmed Row Groups, the average number of rows per Row Group is so high, that we can theoretically further improve only 1 Row Group by moving more data into it, though it might provoke a certain misbalance in overall performance (Imagine synchronising Row Groups with 10 Rows & 1 Million Rows).

You will need to very careful when using this view results, because the number of trimmed Row Groups might actually say nothing – If we have 10.000 Row Groups with 1045677 rows (1 row less than maximum allowed value), then there is no chance we can optimize anything. From a different point of view, if we have just incomplete 1 Row Group (once again imagine it has 1045677 rows), then we have 100% of trimmed Row Groups, but we can’t optimize it any further.
When considering Row Groups Trimming (3rd perspective besides Deleted Rows and Segment Alignment), focus on the number of optimisable Row Groups, it is the real number that truly matters.

Note that an enhanced version of this script is a part of my upcoming free Columnstore Indexes Scripts Library.

to be continued with Columnstore Indexes – part 63 (“Parallel Data Insertion”)

Leave a Reply

Your email address will not be published. Required fields are marked *