Columnstore Indexes – part 124 (“Estimate Columnstore Compression”)

Continuation from the previous 123 parts, the whole series can be found at https://www.nikoport.com/columnstore/.

For anyone working with SQL Server since version 2005 (with Service Pack 2 to be precise) there is a very common task when thinking or research the system optimisation – the usage of the compression, and before advancing with this step the question that one usually receives is – “can you estimate how much improvement we shall get?”
For this purpose since SQL Server 2008, we have a very useful stored procedure that is called sp_estimate_data_compression_savings, that is capable of providing us with the estimation of how much storage we can save by enabling or moving to a more effective compression method (as in NONE | ROW | PAGE types).
Columnstore Indexes have appeared in SQL Server 2012 (that is well over 6 years ago) and even though from time to time I would ask and suggest to enable the stored procedure sp_estimate_data_compression_savings to start supporting Columnstore Indexes, until now there is no such support.

Until now – I am introducing a conjunction of my scripts in a comprehensive and reasonably capable stored procedure that is called “cstore_sp_estimate_columnstore_compression_savings” and that is a part of my free & open-sourced Columnstore Indexes Script Library, freely available on GitHub.
Supporting all publicly available versions of SQL Server ( 2012, 2014, 2016, 2017 & Azure SQL Database), this stored procedure (and of course there is a stand-alone script as always) will provide you with good number of options, besides trivial copying data into a temp table and estimating the overall result, but while supporting all respective Columnstore Compressions it will give you a chance to keep the data in the destination table by making it persistent, cancel the operation if the table would exist, specify the MAXDOP to be used for the operations and provide the basic insights over the trimming reasons on the sampled destination table, between others.
I am not pretending on creating something perfect and capable as the original source code for the sp_estimate_data_compression_savings, but I simply wanted to share something that I am using and that I believe can be useful for others.

cstore_sp_estimate_columnstore_compression_savings

While the variety of the parameters for the stored procedure cstore_sp_estimate_columnstore_compression_savings is big, the bare minimum of the parameters that are required for the usage currently are just 1: @tableName – this is the name of the source

The following minimal invocation of the compression estimation stored procedure within the ContosoRetailDW sample database for the table FactSalesQuota:

execute dbo.cstore_sp_estimate_columnstore_compression_savings 
	@tableName = 'FactSalesQuota';

will take the dbo.FactSales table, create a temporary table #SampledTable in the tempdb, by copying the data relative the default 4 Row Groups (4 * 1048576 rows to be precise and if there are less rows, than it will copy as much as it can), create a default Clustered Columnstore Index on the destination table with the default Columnstore Compression, calculate the estimated total size of the table and will show the reasons for the Row Group trimming:

In this particular case the estimation using the sample of 56.18% would be an expected 74.88% improvement over the currently used page compression. You can see all the details about the source table, the overall number of rows (approximate, because it is taken from sys.partitions), the location of the sampled table (which is temporary by default and it is located in TempDB).
This stored procedure has a good number of options and you can try, for example a different compression method – COLUMNSTORE_ARCHIVE:

execute master.dbo.cstore_sp_estimate_columnstore_compression_savings 
	@tableName = 'FactSalesQuota',
	@data_compression = 'COLUMNSTORE_ARCHIVE';

And here you go right away, you will notice that compression improvement estimation has risen to incredible 158%, we have moved from estimated 110 MB to just 75 MB for our table if we start applying columnstore indexes.

Besides obvious (leaving sample for later usage)

This function will allow you to leave the existing sampled table for later usage (as for querying for example) – for that purpose you should not use the default temp table but specify the database, the schema and the name of the destination table, besides specifying that parameter @deleteAfterSampling should be disabled:

execute master.dbo.cstore_sp_estimate_columnstore_compression_savings 
	@dbName = 'ContosoRetailDW',
	@schemaName = 'dbo', 
	@tableName = 'FactSalesQuota',
	@data_compression = 'COLUMNSTORE_ARCHIVE',
	@destinationDbName = 'ContosoRetailDW',
	@destinationSchema = 'dbo',
	@destinationTable = 'FactSalesQuota_Test',
	@deleteAfterSampling = 0;

will deliver you all the same information about the sample percentages, but most importantly will deliver you a final table [ContosoRetailDW].[dbo].[FactSalesQuota_Test] that will be usable for whatever you need it to.

This will be a real persisted table with a Clustered Columnstore Index on it, as specified in the results.

Important to notice that running the very same query for the second time will produce an error, because the script will protect any existing table by default:

Msg 55001, Level 16, State 1, Line 3
The Destination Table already exists!

and so for such situations there is another parameter that will drop any existing destination table (YOU HAVE GOT TO BE CAREFUL ABOUT IT):

execute master.dbo.cstore_sp_estimate_columnstore_compression_savings 
	@dbName = 'ContosoRetailDW',
	@schemaName = 'dbo', 
	@tableName = 'FactSalesQuota',
	@data_compression = 'COLUMNSTORE_ARCHIVE',
	@destinationDbName = 'ContosoRetailDW',
	@destinationSchema = 'dbo',
	@destinationTable = 'FactSalesQuota_Test',
	@deleteAfterSampling = 0,
	@cancelIfDestTableExists = 0;

will execute the process successfully by removing the destination table if it exists.

Making sure that it works with a Clustered Columnstore Index was easy, but what about the Nonclustered Indexes when we want to compare their space impact with the Nonclustered Columnstore Indexes ? Let’s create a simple Nonclustered Rowstore Index on our test table with all the columns that participate in the Primary & Foreign Key relationships:

CREATE NONCLUSTERED INDEX NC_FactSalesQuota
	ON dbo.FactSalesQuota 
		(SalesQuotaKey, ChannelKey, StoreKey, ProductKey, DateKey, CurrencyKey, ScenarioKey ) 
			WITH (DATA_COMPRESSION = PAGE);

And to advance any further we shall need to determine the IndexId of this index to invoke the stored procedure master.dbo.cstore_sp_estimate_columnstore_compression_savings, just like the good old sp_estimate_data_compression_savings requires,
and for that purpose let’s query up the sys.indexes DMV:

SELECT index_id, *
	FROM [ContosoRetailDW].sys.indexes o
	WHERE o.object_id = OBJECT_ID('[ContosoRetailDW].[dbo].[FactSalesQuota]');

You should be able to see from the screenshot above that our newest index NC_FactSalesQuota is nonclustered rowstore index with an id = 8, which we shall use as the parameter while invoking cstore_sp_estimate_columnstore_compression_savings while storing result in the tempdb and then consequently removing it:

execute master.dbo.cstore_sp_estimate_columnstore_compression_savings 
	@dbName = 'ContosoRetailDW',
	@schemaName = 'dbo', 
	@tableName = 'FactSalesQuota',
	@data_compression = 'COLUMNSTORE_ARCHIVE',
	@indexId = 8;

You can see that we have got a Nonclustered Columnstore Index on our temp table and its size (55MB) with the Columnstore Archival compression is clearly below the the 75 MB that we have achieved with the same compression on the table with Clustered Columnstore Index. Here the estimated improvement is almost 100%, which is still incredible number.
Leaving IndexId on NULL will result in creation of the Clustered Columnstore Index and that is the default behaviour.

Sampling Amount

After some considerations at the moment, I decided to base the sampling on the number of the Row Groups with 4 being the default number, but feel free to change it according to the real size of your table, and on the script below you will see :

execute master.dbo.cstore_sp_estimate_columnstore_compression_savings 
	@dbName = 'ContosoRetailDW',
	@schemaName = 'dbo', 
	@tableName = 'FactSalesQuota',
	@data_compression = 'COLUMNSTORE',
	@indexId = 8,
	@rowGroupsNr = 2;

You can clearly notice that now we are sampling just 28.09% of the total size of the table (which is pretty incredible in the real world for a fast test) and we are estimating the improvement to be 91.36% for when using the standard COLUMNSTORE compression.

Degree of parallelism

If we would need to execute a better control over the degree of parallelism, then a MAXDOP (Maximum Degree Of Parallelism) would be a great parameter to do it, so as in the example below:

execute master.dbo.cstore_sp_estimate_columnstore_compression_savings 
	@dbName = 'ContosoRetailDW',
	@schemaName = 'dbo', 
	@tableName = 'FactSalesQuota',
	@data_compression = 'COLUMNSTORE_ARCHIVE',
	@indexId = 8,
	@maxDOP = 1;

Trimming Information

While I personally consider trimming to be a very important pointer of the Columnstore Indexes performance, not everyone will agree with me and so the relevance of those trimmed Row Groups can be controlled with the help from the @showTrimmingInfo parameter. To remove the result sets with the Row Groups trim reasons, simply enable it like on the example below, which will deliver just 1 result set:

execute master.dbo.cstore_sp_estimate_columnstore_compression_savings 
	@dbName = 'ContosoRetailDW',
	@schemaName = 'dbo', 
	@tableName = 'FactSalesQuota',
	@data_compression = 'COLUMNSTORE_ARCHIVE',
	@indexId = 8,
	@showTrimmingInfo = 0;

This is not the end, but there is more functionality that will be coming into this function in the weeks/months to come – support for the filtered Noncustered Columnstore, InMemory Columnstore, Partitioning, and of course I expect to test and improve this stored procedure for the Azure SQL Database Managed Instance – but for now the release is public and I am looking forward for the feedback and improvement suggestions.

Go on, try to download and experiment with cstore_sp_estimate_columnstore_compression_savings!

to be continued with Columnstore Indexes – part 125 (“Estimate Columnstore Compression as a System Stored Proc”)

Niko Neugebauer

SQL Server, Columnstore, Data Platform & Community