This is a first blog post in a very large series of posts about the CLustered Columnstore Indexes:
Clustered Columnstore Indexes – part 2 (“Internals”)
Clustered Columnstore Indexes – part 3 (“More Internals”)
Clustered Columnstore Indexes – part 4 (“Basic T-SQL”)
Clustered Columnstore Indexes – part 5 (“New Meta-Information and System Stored Procedure”)
Clustered Columnstore Indexes – part 6 (“Observing the behavior”)
Clustered Columnstore Indexes – part 7 (“Transaction Isolation”)
Clustered Columnstore Indexes – part 8 (“Locking”)
Clustered Columnstore Indexes – part 9 (“CTP1 Observations”)
Clustered Columnstore Indexes – part 10 (“Compression basics”)
Clustered Columnstore Indexes – part 11 (“Clustered vs Nonclustered compression basics”)
Clustered Columnstore Indexes – part 12 (“Compression Dive”)
Clustered Columnstore Indexes – part 13 (“Dictionaries Analyzed”)
Clustered Columnstore Indexes – part 14 (“Partitioning”)
Clustered Columnstore Indexes – part 15 (“Partitioning Advanced”)
Clustered Columnstore Indexes – part 16 (“Index Builds”)
Clustered Columnstore Indexes – part 17 (“Resources 2012 vs 2014”)
Clustered Columnstore Indexes – part 18 (“Basic Batch Mode Improvements”)
Clustered Columnstore Indexes – part 19 (“Batch Mode 2012 Limitations … Updated!”)
Clustered Columnstore Indexes – part 20 (“TempDB Spills – when memory is not enough”)
Clustered Columnstore Indexes – part 21 (“DBCC CSIndex”)
Clustered Columnstore Indexes – part 22 (“Invisible Row Groups”)
Clustered Columnstore Indexes – part 23 (“Data Loading”)
Clustered Columnstore Indexes – part 24 (“Data Loading continued”)
Clustered Columnstore Indexes – part 25 (“Faster Smaller Better Stronger”)
Clustered Columnstore Indexes – part 26 (“Backup & Restore”)
Clustered Columnstore Indexes – part 27 (“Load with Delta-Stores”)
Clustered Columnstore Indexes – part 28 (“Update vs Delete + Insert”)
Clustered Columnstore Indexes – part 29 (“Data Loading for Better Segment Elimination”)
After the Microsoft’s first attempt to improve DataWarehouse performance drastically with introduction of the Nonclustered Columnstore Indexes, which are extremely limited (non-updatable, schema modification prohibited, limited number of data types supported, cannot be unique, batch execution mode is also very limited in SQL Server 2012, etc),
this time for SQL Server 2014 we shall have a feature that really promises to change this game around – CLUSTERED columnstore indexes.
As far as I know, not many organizations around the world are using Nonclusted NonUpdateable Columnstore Indexes mainly because of it is being a very new feature and the other reasons is its limitations.
Clustered Columnstore Indexes in SQL Server 2014 are promising to solve a lot of the problems, such as:
- making tables updatable
- schema modification is available
- more datatypes included
- mixed execution modes support (batch & row)
- more operations support for the batch mode ( outer join, union all, etc )
- improved global dictionaries for segments compression
- support for even better data compression (aka Archival)
- seek operation support
- spill - bulk insert operation support and associated improvements
Since CTP1 is already available I decide to give it a try and to play with Clustered Columnstore Indexes a bit
I will start with introduction of some of the architectural concepts and then in the following posts I will try to explore the actual improvements:
Columnstore Indexes are using the X-Velocity In-Memory Compression Engine which is already being used by a number of Microsoft Technologies, such as PowerPivot, Tabular Mode for Analysis Services and
PowerView (Thanks to Jamie Thompson and to Riccardo Mutti for clarifications.) for example.
The basic technic of this engine consist in dividing available data into different row groups, also known as segments. Each of those segments should consist of around 1.000.000 rows (this number is not controlled by a parameter in the current versions) and the rows are encoded & compressed by using some internal algorithm. For some of the columns which are requiring dictionaries (character types for example) an additional dictionary encoding conversion is being used.
Note: each column is separated into its own groups of the Segments – and so this where the ColumnStore idea comes from.
After the process of encoding and compression is done, then the segments and the dictionaries are converted to Blobs and stored inside of the SQL Server. The underlying storage mechanism is still the good old 8K pages. Should one of the Blobs span more then 1 page, then the usual blob storage mechanisms are invoked.
Note: a very important part of understanding and optimization of Columnstore Indexes is that data is not sorted inside of the Segment, which will lead us to a number of technics to improve their compression and performance in some of the next posts.
There is a principal meta-information storage, which is called a Directory. Inside of this Directory, the information about the segment & dictionaries allocation status is stored. It is also contains additional metadata about number of rows, size, min and max values inside of each of the segments. This information is available inside of the sys.column_store_segments DMV.
There is one more very important element for the meta-data storage inside of the Clustered Columnstore Indexes – Delete Bitmap. Delete Bitmap is basically a storage which contains information about the deleted rows inside of the Segments. Its representation in-memory is a bitmap, while on the disk it is stored as a b-tree, where ids of the deleted rows are stored. Delete Bitmap is consulted on a regular basis in order to avoid returning the rows which were already deleted.
The Clustered Columnstore Index is the primary source for the data storage, as the name Clustered typically referrers to the original source of the table data.
Note 1: Upcoming SQL Server 2014 has the limitation that no other indexes are allowed to be created on the table with a Clustered Columnstore Index.
Note 2: The data is not really clustered inside of the CCI, since it is not really being ordered in the upcoming SQL Server version.
The whole architecture of the updateable Clustered Columnstore Indexes is basically divided into 2 parts: Row Groups (aka Segments) and Delta Stores. While Segments are encoded & compressed, as well as unupdateable directly, the Delta Stores are a “normal” good old b-tree row stores, which have all the characteristics of an habitual Sql Server table.
Same as with Segments, every Clustered Columnstore Index can have multiple Delta Stores.
Microsoft has introduced a brand new DMV sys.column_store_row_groups for the SQL Server 2014 CTP1. This view contains the information about all segments and delta stores for the Columnstore Indexes (Clustered & NonClustered).
It seems like internally they are represented equally, so you will see them listed all together. The difference for distinction is that Segments are actually having their status “state” column set on 3, with the description of “Compressed” while the Delta Stores can have status of 1 & 2 as a state with corresponding description of “Open” / “Close”. This status for the Delta Store is delivering the information if the Delta Store still accepting new information or it is just waiting to become a Segment (encoded & compressed).
Some words about some of the basic operations:
Inserts are simply added to one of the currently open Delta Stores.
If the deleted row is found inside of a Segment, then the Deleted Bitmap information is updated with the row id of the respective row.
If the deleted row is actually inside of a Delta Store, then the direct process of removal is executed on the b-tree.
They are basically represented as deletes and inserts.
To be continued with Clustered Columnstore Indexes – part 2 (“Internals”)