Columnstore Indexes – part 84 (“Practical Dictionary Cases”)

Continuation from the previous 83 parts, the whole series can be found at https://www.nikoport.com/columnstore/

In this blog post I wanted to focus on 2 practical aspects of Columnstore Dictionaries compression – the ones I have seen in the past months and years of Columnstore Indexes usage.
The information found in this blog about dictionaries behaviour applies to SQL Server 2012-2016, where applicable to both clustered and nonclustered columnstore indexes. Naturally, before SQL Server 2016 you can’t update nonclustered columnstore index and for getting the results in SQL Server 2012 you would need to copy & reinsert the data in the same order as before.

The Sequences

In mathematics, a sequence is an enumerated collection of objects in which repetitions are allowed. Some of the examples of the sequences are: Prime numbers, Catalan numbers of Fibonacci numbers.
The sequence does not have to have a constant value for delta – a good example of a sequence is a Fibonacci sequence, where previous 2 numbers sum creates the next number in the sequence.

In a the relational databases the sequences are represented in a number of different ways – they can be generated manually or with some internal math functions. In SQL Server we can generate simple sequences with the help of 2 functionalities – IDENTITY and SEQUENCE.

Consider the following script below, where a table with a single integer column is created. That single column is an integer field with an ever-increasing sequence of values, thanks to the identity functionality specified in the definition.
After filling out 2 full Row Groups, the script finishes off with a creation of a Clustered Columnstore Index (you would need to create a Nonclustered Columnstore on SQL Server 2012):

drop table if exists dbo.BatchTest;

create table dbo.BatchTest(
	c1 int identity(1,1) not null
	);

set nocount on
declare @i as int;
declare @max as int;
select @max = isnull(max(C1),0) from dbo.BatchTest;
set @i = 1;

begin tran
while @i <= 1048576 * 2
begin
	insert into dbo.BatchTest
		default values;
		 
	set @i = @i + 1;
end;
commit;

create clustered columnstore index ncci_BatchTest
	on dbo.BatchTest with (maxdop = 1);

For seeing the results, produced by the script, I will use the upcoming CISL (Columnstore Indexes Scripts Library) version, which you can download for free.

exec cstore_GetDictionaries @tableName = 'BatchTest';
exec dbo.cstore_GetRowGroupsDetails @tableName = 'BatchTest';

As you can see in the results, the engine does not recognise nor apply any special compression for the sequences. Unlike some other engines, including Hive ORC there is nothing that allows to compress the simple integer sequences, which is a pity. Even more, the RLE compression that is applied by the engine is not very effective as the script below will show.

Let's create the same sequence of values for 2 Row Groups, this time using manual process and setting the date type to beignet to see if it affects the compression in any way.

dropt table if exists dbo.BatchTest2;

create table dbo.BatchTest2(
	c1 bigint not null
	);

set nocount on
declare @i as int;
declare @max as int;
select @max = isnull(max(C1),0) from dbo.BatchTest2;
set @i = 1;

begin tran
while @i <= 1048576 * 2 
begin
	insert into dbo.BatchTest2
		(c1) values (@i);
		 
	set @i = @i + 1;
end;
commit;

create clustered columnstore index cci_BatchTest2
	on dbo.BatchTest2 with (maxdop = 1);

Let's run the test queries against this table:

exec cstore_GetDictionaries @tableName = 'BatchTest2';
exec dbo.cstore_GetRowGroupsDetails @tableName = 'BatchTest2';

As you can see below, the results are absolutely equal to the automated identity:

No matter if we are using integer or bigint the engine is doing a smart base compression (subtracting the base value from the beginning of the sequence and storing & compression the difference from it, you can find the base value exposed in the sys.column_store_segments DMV)

Even though we are talking about just 2.6 MB of occupied space in this case, I think that the opportunity to compress this values should not be lost easily.
Let's see how the engine behaves if we convert those values to a variable character field directly (10 chars will cover the maximum value for the integer)

alter table dbo.BatchTest2
	alter column c1 varchar(10) not null;

It takes a couple of seconds to run the above query on my VM, because the engine will remove all the Segments from all the Row Groups and create new compressed Segments for all of them. This means that any operation on a big Columnstore Table, affecting data type of a number of columns will take a significant time to execute.

Let's see if there are any effects of this change from bigint to varchar

exec cstore_GetDictionaries @tableName = 'BatchTest2';
exec dbo.cstore_GetRowGroupsDetails @tableName = 'BatchTest2';

There are still no dictionaries - and trying to rebuild this table will not bring any effect at all, but take a look at the size of the segments - their size was lowered for almost 40% to ~1.6 MB!

This technic is very effective if you are compressing the columns that you do access rarely - it should be considered for the log tables for example.
Also notice that Columnstore Archival compression will not bring any significant changes - the original 2.6 MB will lower to 2.42 while the variable char column will not get any further improvements, making the improvement difference around 32%.

Warning: Do NOT use this technic without understanding the consequences - the processing of such columns will lower their effectiveness, since Predicate Pushdown will work in a very limited way, plus the Segment Elimination will not work at all.

I have had previously expressed my opinion on Twitter about sequences and their compression and I do still believe it is an interesting functionality to include into the engine.

Default Values

For a long time, SQL Server allows us to add new NULL and NOT NULL values to existing tables, and I thought that it was important to share an example of what I have observed.

We can add a new column with a default value with a following statement:

alter table dbo.BatchTest2
	add c2 varchar(100) NOT NULL default ('Once Upon a Time')

Let's see how the adding of new column with a constraint functions:

exec cstore_GetDictionaries @tableName = 'BatchTest2';
exec dbo.cstore_GetRowGroupsDetails @tableName = 'BatchTest2';

What happens under the hood is that the SQL Server engine adds a new global dictionary for the constraint that we have specified for the new column and depending if the column is NULL or is NOT NULL, it will initialise the compressed segments with the dictionary values or not.

To prove that point here are the details of the segments for our table:

select object_name(part.object_id), seg.on_disk_size, seg.*
	from sys.column_store_segments seg
	inner join sys.partitions part
		on seg.hobt_id = part.hobt_id and seg.partition_id = part.partition_id
	inner join sys.columns icol
		on icol.object_id = part.object_id and icol.column_id = seg.column_id
	where part.object_id = object_id('batchtest2');

You can see that the new addition of the new column has created new segments for each of the existing row groups, and so if you are adding new columns to a huge columnstore tables you will need to plan accordingly.

Another interesting aspect or technic is to consider initialising the column with the vast majority of the same value at the end of the ETL process, ensuring then that the restart values are correctly set after. This way the data loading process goes significantly faster without going into transaction log. Of course you will pay with moving all of your data into the Delta-Store, so you need to be extremely careful when doing the calculations.

to be continued with Columnstore Indexes – part 85 ("Important Batch Mode Changes in SQL Server 2016")

Niko Neugebauer

SQL Server, Columnstore, Data Platform & Community

Columnstore Indexes – part 84 (“Practical Dictionary Cases”)

The Sequences

Default Values

Leave a Reply Cancel reply