This is the third in the series dedicated to the Clustered Columnstore Indexes.
The first part can be found here – https://www.nikoport.com/2013/07/05/clustered-columnstore-indexes-part-1-intro/, while the whole series can be found at https://www.nikoport.com/columnstore/
This blogpost is continuing focusing on the part of the internals of the Clustered Columnstore Indexes.
If you have ever read about NonClustered Columnstore Indexes for SQL Server 2012, then you surely know that the process of the creation and rebuild is extremely resource intensive.
According to Microsoft, they have implemented an improvement over memory consumption for SQL Server 2014, and so besides traditional control wit the use of DOP (Degree of Parallelism), we can expect that internally the actual DOP will be varying as the SQL Server might be changing the memory consumption based on the currently available resources. This means that some of the threads might even be put on hold, in order to keep the system stable.
A lot of times in the DatawareHousing environments there are subsets of data which are rarely accessed (historic data), but which keeps occupying rather significant amount of space.
SQL Server 2014 will introduce a new compression mode for the Columnstore Indexes to address this situation, which is called Archival Compression. This mode introduces an additional layer compression layer, on the top of the compression already made. It works transparently by compressing the data blobs before storing on the disk.
The algorithm used for compression is a XPress8 (which is a Microsoft internal variant of LZ77 compression created in the 1970s). Designed to work with multiple threads in the same time, it uses data streams up to 64KB in size.
This should add around 30% improvement over the traditional Columnstore Compression.
Note: Compression can be applied on a partition level and thus can guarantee that we apply this method only to the most-rarely accessed data. The compression type can be easily changed at the Rebuild time.
Some of the columns inside of the Columnstore Index might be using Dictionaries encoding in order to improve compression. There are 2 types of the dictionaries: Local & Global
– Local dictionary is used inside of the same Segment.
– Global which is applied across all Segments. The more relevant data is stored in Global dictionary, the more compression effect can be achieved across the Columnstore Index.
Note: If the Global Dictionary is 100% relevant it might eliminate the need for any potential Local dictionary.
Note: Data is never sorted inside of a Segment.
After encoding and compression phases are finished, dictionaries are stored as Blobs on the disk by using good old 8k pages mechanism. The ids which are being used for the mapping of the dictionary values are 32 bits.
In SQL Server 2014 the whole process of building Global Dictionary is revamped. The data is first sample across all Segments for all the columns that requires a dictionary and based on that analysis that the Global Dictionary is created.
SQL Server Query Optimizer uses statistics in order to create a query execution plan and the basics about it are to be found everywhere in the internet. For NonClustered Columnstore Indexes there were no implementation for the Sampling of the statistics – it makes a kind of sense, since there were no updates allowed.
Clustered Columnstore Indexes are different, very different – they have 2 different sampling algorithms:
– the first one is focused on the accuracy, while spending more IO & CPU resources. This algorithm will scan each and every Segment as well as Delta Store and then randomly selecting some subset of the rows. This algorithm is supposed to produce a more accurate histogram when comparing to Page sampling.
– the second one is targeting performance while relaxing some of the accuracy. This one is using cluster sampling, by selecting some random rows within each of the cluster group (the number is controlled by the sampling percentage). This algorithm might scan just some of the Segments and Delta Stores, while completely ignoring the others. This algorithm is used to help with dictionary creation and never for the build of the histogram.
Clustered Columnstore Indexes will be introducing support for the Bookmarks inside of the Columnstore Index. It is important to remember that there is no unique key inside of a Columnstore Index, that identifies a row, the new Bookmarks is actually composed out of following combination:
So basically the Segment_Id is not stored, since it is implicit when accessing the data inside of the Segment, and then the Tuple_Id inside of the Segment is simply is the locator for finding the tuple in question.
For the cases when we are talking about the b-tree storage (such as for Delta Stores and Deleted Bitmaps), the heap bookmark is represented by the following combination: file_id:page_id:slot_id.
I see a lot of important uses for a Bookmark – Lookups (aka Seeks), deletes (using Deleted Bitmap), updates, etc, and so this is one of the functionalities that could not be missed.
to be continued with Clustered Columnstore Indexes – part 4 (“Basic T-SQL”)