Columnstore Indexes – part 92 (“Lobs”)

Continuation from the previous 91 parts, the whole series can be found at https://www.nikoport.com/columnstore/.

Since the very first version of the Columnstore Indexes in 2012, there has been no support for LOB (large objects binaries). The reason behind not focusing on implementing their support was that a good Data Warehouse avoids storing such data within database, and as you should know – the first 2 implementations (Non-Updatable Nonclustered Columnstore Indexes in SQL Server 2012 & updatable Clustered Columnstore Indexes in SQL Server 2014) were focused on BI/DWH/Reporting scenarios.
The world has changed quite a bit since that time and Nonclustered Columnstore Indexes are well updatable since SQL Server 2016 and supporting Hybrid OLTP Analytical Systems (aka HTAP / Operational Analytics).
The truth is that
I remember a couple of years ago insisting with a couple of the Columnstore developers that support for the LOBs for Columnstore Indexes is badly needed, and that in the real world developers unfortunately do not follow the best practices and everyone working in the field have seen a number of very creative and scary usages of the SQL Server capabilities.

In the upcoming version of SQL Server (for the moment known as SQL Server vNext), Microsoft has finally announced the upcoming support for the LOBs within Columnstore Indexes – thus enabling the usage of the NVARCHAR(MAX), VARCHAR(MAX), and VARBINARY(MAX) data types on the tables with Columnstore Indexes that include those columns.

For the tests, I have decided to spin a Virtual Machine in Azure with an installation of the currently available CTP1 of the SQL Server vNext, which has a version 14.0.1.126.

UPDATED on 28th of December 2016:
After exchanging some emails with the Program managers from Microsoft responsible for this feature, I can share the following information:

For Clustered Columnstore Indexes there are 3 categories of compression:
1. Inline, within the Row Group – compressing all strings with the total length below 8000 bytes. All values in this case are stored within the dictionaries.
2. Compressed Off-Row – compressing values between 8000 bytes and 16 Megabytes. The strings are individually block-compressed. Only the headers/pointers to these off-row LOBs are stored in dictionaries.
3. Pass-through – containing values above 16 MB, these are simply uncompressed LOBs with headers/pointers stored in the dictionaries.

Currently the complete information about off-row storage of the lobs is not available through the DMVs. (Please insert 10.000 :( here – I hope that this will be fixed before RTM).

I decided to do multiple tests with different columnstore indexes (disk-based Clustered, Nonclustered & Memory-Optimised Clustered Columnstore) to cover different scenarios.
First of all, my favourite ones – the Clustered Columnstore Indexes were taken to the test, where I have taken a download from the Bureau of Labor Statistics free data source and using the data on the urban consumers (which is occupying around 33MB) inserted into my test table with a Clustered Columnstore Index:
urban_consumers-lob

EXEC master.dbo.sp_configure 'show advanced options', 1
RECONFIGURE
EXEC master.dbo.sp_configure 'xp_cmdshell', 1
RECONFIGURE 

DROP TABLE IF EXISTS dbo.LobTest;

CREATE TABLE dbo.LobTest(
	c1 int identity(1,1),
	c2 varchar(max),
	index CCI_LobTest clustered columnstore );

After issuing the CREATE TABLE statement, the following warning was shown to me, notifying that the capacity is still in the preview and it should not be used on the production yet (SQL Server vNext CTP1 stand)
Warning: Using Lob types (NVARCHAR(MAX), VARCHAR(MAX), and VARBINARY(MAX)) with Clustered Columnstore Index is in public preview. Do not use it on production data without backup during public preview period.
It is still breath-taking to see that the LOBs are finally finding their way into the Columnstore Engine, since their support will allow a lot of applications with a lot less hustle in the first place.

For loading the I have used OPENROWSET command, specifying that underlying data should be used a single BLOB

-- Bulk Load the data
INSERT INTO LobTest(c2)  
SELECT * FROM OPENROWSET(  
   BULK 'C:\Install\urban_consumers.txt',  
   SINGLE_BLOB) AS x;   

After a couple of seconds – voilá, we have inserted 1 row that contains around 33 MB.
Taking a more detailed look at it regarding to the occupied space:

exec sp_spaceused 'dbo.LobTest'

lobtest-occupied-space
We can see that our 1 row was not compressed at all – it occupies the same space it was occupying on the disk.
With the help of CISL, we can take an easy look at the internals of the Row Groups, seeing that even though we are dealing with the compressed Row Group (no Delta-Stores here) and even though the DMVs are reporting that Vertipaq optimisations are in – the engine is not showing any occupied space.

exec dbo.cstore_GetRowGroupsDetails;

cisl-get-row-groups-details
Going one step deeper, we can uncover the current information on the compressed Row Group is showing that only 172 bytes were occupied:

select object_name(object_id), *
	from sys.column_store_row_groups;

column_store_row_groups-for-lobs
For sure, the real correct information can be obtained through other ways and means :):

select object_name(rg.object_id), 
	   rg.size_in_bytes as RowGroupSpaceInBytes,
	   cast(1.0*pstat.reserved_page_count*8/1024 as Decimal(9,2)) as ReservedMB
	from sys.column_store_row_groups rg
		inner join sys.partitions part
			on rg.object_id = part.object_id
		inner join sys.dm_db_partition_stats AS pstat
			on pstat.partition_id = part.partition_id;

table-space-through-partitionstats
You can clearly see that through the good old friend sys.dm_db_partition_stats we can have concrete information about the size of the Row Group, even though the sys.column_store_row_groups contain a small bug right now.

Maybe the data is stored directly in the dictionary ?
Let’s try to uncover it, though I seriously doubt that the maximum limit of 16MB was lifted so easily:

exec dbo.cstore_GetDictionaries;

lob-dictionaries
There is nothing significant to see, meaning its the rough size of the Row Group that is occupying the space.
To confirm that, let’s query direct the DMV sys.column_store_dictionaries:

select object_name(part.object_id) as TableName, on_disk_size as SizeInBytes
	from sys.column_store_dictionaries dict
	inner join sys.partitions part
			on dict.partition_id = part.partition_id and dict.hobt_id = part.hobt_id;

column_store_dictionaries-for-lobs
There is truly nothing to be found, so let us move on.

Let’s do a different test and loading a couple (1048576) of rows that are 9000 bytes wide:

drop table if exists dbo.ManyLobs;

create table dbo.ManyLobs
(
    id int not null,
    LobData varbinary(max) null,
	Index CCI_ManyLobs CLUSTERED COLUMNSTORE
);

;WITH N1(C) AS (SELECT 0 UNION ALL SELECT 0)
,N2(C) AS (SELECT 0 FROM N1 AS T1 CROSS JOIN N1 AS T2)  
,N3(C) AS (SELECT 0 FROM N2 AS T1 CROSS JOIN N2 AS T2)  
,N4(C) AS (SELECT 0 FROM N3 AS T1 CROSS JOIN N3 AS T2)  
,N5(C) AS (SELECT 0 FROM N4 AS T1 CROSS JOIN N4 AS T2) 
,N6(C) AS (SELECT 0 FROM N5 AS T1 CROSS JOIN N3 AS T2) 
,IDs(ID) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM N6)
insert into dbo.ManyLobs(id, LobData)
    select   
        ID
        ,convert( varbinary(max), replicate(convert(varchar(max),'a'),9000) )
    from Ids;

Analysing it straightly with the space usage and the CISL function for the Row Group Details:

exec sp_spaceused 'dbo.ManyLobs';
exec dbo.cstore_GetRowGroupsDetails @tableName = 'ManyLobs';

space-used-for-the-1048576-lobs
row-groups-for-the-1048576-lobs
First of all we are occupying over 400 MB for this table, this is visible from the sp_spaceused function, but at the Row Group level we see just a couple of the MB occupied (around 5,1 MB to be more precise) – and the essential LOB information is still missing. I am starting to suspect that it is being hidden in some kind of a new element or a new compression. Looking at the Row Groups information shows that they were clearly cut of because of the dictionary pressure – maybe the dictionaries are simply disproportionally big ?
Making sure that the size is not hidden in the dictionaries, let’s run another CISL function:

exec dbo.cstore_GetDictionaries @tableName = 'ManyLobs';

dictionaries-for-the-1048576-lobs
There is nothing special to be found here – no global dictionaries, just 3 local ones with each size being well under 16MB limit and well under 7.5 MB where the dictionary pressure typically start to appear.
I am waiting for the explanations from someone from Microsoft if this is a small bug (work in progress) or there are some new elements for the OFF-ROW storage that are not exposed.

There is one more thing that I have found and I wanted to share it before we advance to the Nonclustered Columsntore Indexes – the limitation on the secondary rowstore indexes built on the Clustered Columnstore.
Let’s recreate our table with a clustered columnstore index:

drop table if exists dbo.ManyLobs;

create table dbo.ManyLobs
(
    id int not null PRIMARY KEY,
    LobData varchar(max) null,
	Index CCI_ManyLobs CLUSTERED COLUMNSTORE
);

Now we can try to add a nonclustered rowstore index that includes our LOB column:

create nonclustered index IX_ManyLobs_Lob on dbo.ManyLobs (id) include(lobdata);

Msg 35384, Level 16, State 1, Line 10
The statement failed because column ‘LobData’ on table ‘dbo.ManyLobs’ is a lob column.
Non clustered index with lob column as included column cannot co-exist with clustered columnstore index.

Well, that’s a bummer. This points to some kind of the off-row storage that is being used for the clustered columnstore index specifically. One needs to be aware of this limitation, and I am anxious to find more about this feature and limitation in the weeks & months to come.

Nonclustered Columnstore

Let’s re-run the creation of the ManyLobs table with the Nonclustered Columnstore Index this time:

drop table if exists dbo.ManyLobs;

create table dbo.ManyLobs
(
    id int not null PRIMARY KEY,
    LobData varchar(max) null,
    Index CCI_ManyLobs NONCLUSTERED COLUMNSTORE (id,LobData)
);

:(
Msg 1919, Level 16, State 1, Line 3
Column ‘LobData’ in table ‘ManyLobs’ is of a type that is invalid for use as a key column in an index.
Msg 1750, Level 16, State 0, Line 3
Could not create constraint or index. See previous errors.

Well, it seems like this functionality is not implemented yet. :( In the case of the Nonclustered Columnstore Index – there are not many arguments of supporting the LOBs, when compared to the Clustered Columnstore Index, where table contains all columns without exception. Should this support not be implemented at all, the error message should be clearly fixed, since creating a nonclustered Columnstore Index outside of the table definition gives a more correct information about lack of the support of the LOBs:

create nonclustered columnstore index NCCI_ManyLobs on dbo.ManyLobs (LobData);

Msg 35343, Level 16, State 1, Line 10
The statement failed. Column ‘LobData’ has a data type that cannot participate in a columnstore index. Omit column ‘LobData’.

In-Memory Columnstore

We have one more member of the Columnstore family to the test – the In-Memory Columnstore. Let’s add the memory-optimised File Group to our database:

ALTER DATABASE Lobs 
	ADD FILEGROUP [Lobs_Hekaton] 
	CONTAINS MEMORY_OPTIMIZED_DATA;

ALTER DATABASE Lobs
	ADD FILE(NAME = Lobs_HekatonDir, 
	FILENAME = 'C:\Data\LobsXtp') 
	TO FILEGROUP [Lobs_Hekaton];

Now, let’s add to our table with a LOB in the definition with a creation of a clustered columnstore index:

drop table if exists dbo.ManyLobs;

create table dbo.ManyLobs
(
    id int not null PRIMARY KEY NONCLUSTERED,
    LobData varchar(max) null,
	Index CCI_ManyLobs CLUSTERED COLUMNSTORE --(id,LobData)
)WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

The error message is very clear – not supported so far, and this is a huge miss as long as we can’t have a Nonclustered Columnstore Index on the In-Memory tables:
Msg 35343, Level 16, State 1, Line 3
The statement failed. Column ‘LobData’ has a data type that cannot participate in a columnstore index. Omit column ‘LobData’.
Msg 1750, Level 16, State 1, Line 3
Could not create constraint or index. See previous errors.

You see, for the disk-based HTAP scenarios we can simply avoid the column by not putting it into the Nonclustered Columnstore Index, but for the In-Memory Tables – because we can have only the Clustered Columnstore: there is no such options as to avoid the LOB. :(
Taking a different approach for getting a different message here does not work – if adding a clustered columnstore index to already existing table, the same error message persists:

drop table if exists dbo.ManyLobs;

create table dbo.ManyLobs
(
    id int not null PRIMARY KEY NONCLUSTERED,
    LobData varchar(max) null,
)WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
 
alter table dbo.ManyLobs
	add INDEX NCCI_ManyLobs
		CLUSTERED COLUMNSTORE;

Msg 35343, Level 16, State 1, Line 14
The statement failed. Column ‘LobData’ has a data type that cannot participate in a columnstore index. Omit column ‘LobData’.
Msg 1750, Level 16, State 1, Line 14
Could not create constraint or index. See previous errors.

Final Thoughts

I am very happy with the addition of the LOBs to the Clustered Columnstore Indexes – I see it solving a couple of real scenarios in the world. I am a little bit surprised by the fact that the secondary indexes are not allowed to include the column – this might have to do with the kind of storage used for the solution. This is definitely nothing unsolvable, but still one needs to be aware of this.

The lack of the support of the Nonclustered Columnstore Indexs for the LOBs is more than ok with me, the only thing to get fixed there are the error messages on the creation.

In-Memory support for LOBs on the other side is quite critical – since we do not have Nonclustered Columnstore Index yet, and because SQL Server 2016 already brought us the support for the LOBs on the In-Memory tables, I am expecting this item to be improved for the SQL Server vNext.

Let’s see how this feature will progress for the final release of the SQL Server vNext – I will be updating this article as the feature evolves.

to be continued with Columnstore Indexes – part 93 (“Batch Mode Adaptive Memory Grant Feedback”)

4 thoughts on “Columnstore Indexes – part 92 (“Lobs”)

  1. Michael Smith

    Hi Niko-

    Is there any additional information about the space used accounting available? While sp_spaceused and your scripts show I’m getting amazing compression on LOB data in CCI, I’m still showing the uncompressed levels of file space used/allocated ( FILEPROPERTY(‘COLSTR_01’, ‘SpaceUsed’) ). Might there be a bug that is not releasing original LOB pages after they’ve been compressed? My LOB data range from 33 bytes to approx 34kb.

    1. Niko Neugebauer Post author

      Hi Michael,

      Did you check the Row Group sizes? … For your LBO sizes everything should be in-row and the sys.column_store_row_groups should be giving you the correct informations.

      Best regards,
      Niko Neugebauer

Leave a Reply to Rob Nicholson Cancel reply

Your email address will not be published. Required fields are marked *