Columnstore

I am deeply fascinated by Columnstore Indexes, and I have opened several Connect Items suggesting important improvements to them:
– Implement Batch Mode Support for Row Store
– Multi-threaded rebuilds of Clustered Columnstore Indexes break the sequence of pre-sorted segment ordering (Order Clustering)
– Columnstore Segments Maintenance – Remove & Merge
– Implement Computed Columns for Clustered Columnstore Indexes

Scripts Library:
I am publishing CISL – Columnstore Indexes Scripts Library, with the first release targeting the 1st of September 2015.
Sign up for notifications, if you are interested!

Here is the series of blog posts that I have written about them:

Azure:
Azure Columnstore, part 1 – The initial Preview offering
Azure Columnstore, part 2 – Snapshot Isolation & Batch Mode DOP
Azure Columnstore, part 3 – Modern Segment Elimination and Set Statistics IO

28 thoughts on “Columnstore”

Christopher Grace March 13, 2017 at 7:36 am

Wow, thanks for all your work on this. It is very informative.

1. Niko Neugebauer (Post author) March 15, 2017 at 11:21 pm
  
  You are very welcome, Christopher!
  
  Best regards,
  Niko
  
Konstantin April 3, 2017 at 10:42 am

Awesome work!

1. Niko Neugebauer (Post author) April 4, 2017 at 4:36 pm
  
  Thank you for the kind words, Konstantin!
  
  Best regards,
  Niko
  
Roman April 6, 2017 at 8:28 am

So impressed by your investigation.
Thanks a lot, Niko.

1. Niko Neugebauer (Post author) April 20, 2017 at 12:38 am
  
  Thank you for the kind words, Roman!
  
  Best regards,
  Niko
  
Anuj Saboo May 24, 2017 at 8:38 am

Hello,

I have heard your blog praised on Brent Ozar's podcasts, and as a DBA I would like to ask you a question about Columnstore Indexes. I use SQL Server 2014, and with the traditional DMV, sys.dm_db_index_physical_stats, I am not able to find any fragmentation on a Clustered Columnstore Index. When I check manually through Index Properties, the fragmentation shows as 0%, which is quite surprising given that I do a lot of data inserts and deletes in my data warehouse.

Does fragmentation work in some other way here? Is there another method to see fragmentation on Columnstore Indexes?

1. Niko Neugebauer (Post author) May 25, 2017 at 9:24 am
  
  Hi Anuj,
  
  Columnstore Indexes do not have physical fragmentation in the same sense as traditional Rowstore indexes. The columnstore segments are stored as LOBs continuously.
  What you do have is logical fragmentation, caused by the deleted rows. For more information, check out these posts:
  http://www.nikoport.com/2014/07/29/clustered-columnstore-indexes-part-36-maintenance-solutions-for-columnstore/
  http://www.nikoport.com/2015/06/28/columnstore-indexes-part-57-segment-alignment-maintenance/
  http://www.nikoport.com/2014/07/20/clustered-columnstore-indexes-part-34-deleted-segments-elimination/
  
  Additionally, check out the following script in the CISL library (SQL Server 2016 version):
  https://github.com/NikoNeugebauer/CISL/blob/master/SQL-2016/fragmentation.sql
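
  A minimal sketch of how the share of deleted rows could be measured on SQL Server 2014, assuming the sys.column_store_row_groups catalog view. It is not the CISL script itself, and the column aliases are only illustrative:

  ```sql
  -- Logical fragmentation: percentage of deleted rows in the compressed
  -- row groups of every columnstore index in the current database.
  SELECT
      OBJECT_SCHEMA_NAME(rg.object_id) AS schema_name,
      OBJECT_NAME(rg.object_id)        AS table_name,
      i.name                           AS index_name,
      rg.partition_number,
      COUNT(*)                         AS compressed_row_groups,
      SUM(rg.total_rows)               AS total_rows,
      SUM(ISNULL(rg.deleted_rows, 0))  AS deleted_rows,
      CAST(100.0 * SUM(ISNULL(rg.deleted_rows, 0))
           / NULLIF(SUM(rg.total_rows), 0) AS decimal(5, 2)) AS deleted_rows_pct
  FROM sys.column_store_row_groups AS rg
  JOIN sys.indexes AS i
      ON  i.object_id = rg.object_id
      AND i.index_id  = rg.index_id
  WHERE rg.state_description = 'COMPRESSED'  -- delta stores are excluded
  GROUP BY rg.object_id, i.name, rg.partition_number
  ORDER BY deleted_rows_pct DESC;
  ```

  A high deleted_rows_pct is a hint that a rebuild of the affected partition may be worth considering. Which percentage counts as "high" depends on the workload.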
  
  Best regards,
  Niko
  
Ovidiu Sorin Berca September 10, 2017 at 12:18 am

Hi Niko!
I am a columnstore index beginner, living and working in the United States. I am working on a presentation and a demo for my company, and I have a related question for you:
I have to admit I am very confused. The Microsoft page says that for columnstore bulk inserts the optimal number of rows is 102,400 for the data to be compressed directly, but when I load that many rows using an INSERT ... SELECT in SQL Server 2016 I still get delta-stores. I do not see any compressed data until I hit the other number, 1,048,576 (2^20).
The Microsoft article is at
https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview

I will try bcp and other methods of bulk insert, but what am I doing wrong here?
If I create the columnstore index on a heap that already holds the data, then yes, I get the row groups compressed directly, but not with the INSERT ... SELECT.
If you have already answered this somewhere on your blog, please just point me there.

Thank you !
Sorin

1. Niko Neugebauer (Post author) September 14, 2017 at 2:16 pm
  
  Hi Sorin,
  
  The number of 102,400 rows is correct: it is the threshold at which the load switches to writing into a compressed Row Group without touching the Delta-Stores.
  Are you using the TABLOCK hint?
  Are you using SSIS? Can you share an example of the statement you are invoking?
  Did you take a look at these articles?
  http://www.nikoport.com/2014/06/20/clustered-columnstore-indexes-part-30-bulk-load-api-magic-number/
  http://www.nikoport.com/2015/08/19/columnstore-indexes-part-62-parallel-data-insertion/
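
  A hedged sketch of what such a load and the follow-up check might look like on SQL Server 2016. The table and column names (dbo.FactSales, dbo.FactSales_Staging, SaleId, Amount) are hypothetical, and the staging table is assumed to hold at least 102,400 rows:

  ```sql
  -- Load into a table with a clustered columnstore index, using TABLOCK.
  INSERT INTO dbo.FactSales WITH (TABLOCK) (SaleId, Amount)
  SELECT SaleId, Amount
  FROM dbo.FactSales_Staging;

  -- Check where the rows landed (DMV available from SQL Server 2016).
  SELECT
      partition_number,
      row_group_id,
      state_desc,        -- COMPRESSED, or OPEN/CLOSED for delta stores
      total_rows,
      trim_reason_desc   -- why a compressed row group is not full
  FROM sys.dm_db_column_store_row_group_physical_stats
  WHERE object_id = OBJECT_ID(N'dbo.FactSales')
  ORDER BY partition_number, row_group_id;
  ```

  If the row groups still show up as OPEN or CLOSED, the state_desc and trim_reason_desc columns are a reasonable first place to look before changing the load method.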
  
  Best regards,
  Niko
  
Thorsten October 17, 2017 at 4:26 pm

Hello Niko,
great job!

One remark for VLDBs:
In Suggested Tables.sql, [Min RowGroups] should be int, not smallint.

Best regards,
Thorsten

1. Niko Neugebauer (Post author) November 14, 2017 at 12:15 am
  
  Hi Thorsten,
  
  huge thanks for the feedback – getting this one corrected in the next release!
  
  Best regards,
  Niko
  
Sagar Bathe February 6, 2019 at 7:28 pm

Hi Niko – First of all, let me say this is a remarkable collection of information on CCIs. Awesome!!
I do have an issue with CCIs that I am hoping you may be able to assist with. I am creating a CCI with partitions on a Fact Table (1.5B records), but when I look at the plan I see a big difference between the estimated and actual row counts. I think this is leading to tempdb spills, which are slowing down our reporting queries (most of our queries have GROUP BY / ORDER BY). I ran DBCC Stats on the CCI and saw that it did not return any records (which I believe is expected behavior).
My question: is the CCI built properly? Is there a way to build statistics on CCIs that I am missing, and whose absence is causing the actual vs. estimated mismatch?

This is how I am building the CCI (based on suggestions from Microsoft). AuditWebsite is our partitioning column:
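-- Step 1: rowstore clustered index, aligned to the partition scheme
CREATE CLUSTERED INDEX TableName_cci ON TableName (AuditWebsite)
WITH (MAXDOP = 0, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
      ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
ON PS_FactEligibility (AuditWebsite);

-- Step 2: convert it in place to a clustered columnstore index
-- (DROP_EXISTING = ON requires the name of the existing clustered index)
CREATE CLUSTERED COLUMNSTORE INDEX TableName_cci ON TableName
WITH (MAXDOP = 0, DROP_EXISTING = ON);

Any pointers are appreciated.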

1. Niko Neugebauer (Post author) February 10, 2019 at 7:03 pm
  
  Hi Sagar,
  
  I suggest you update the statistics on your CCI table manually before building the index.
  Otherwise, note that the statistics object is populated on the fly when a query is executed against the columnstore index, or when DBCC SHOW_STATISTICS is run against it; the columnstore index statistics are not persisted in storage.
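
  As a rough sketch of that manual step, assuming the placeholder name dbo.TableName from the question and assuming a full scan is affordable on a table of this size (a sampled update may be the more realistic choice):

  ```sql
  -- Refresh all statistics on the table.
  UPDATE STATISTICS dbo.TableName WITH FULLSCAN;

  -- List the statistics objects on the table and when each was last updated.
  -- OUTER APPLY keeps statistics for which no properties are returned.
  SELECT
      s.name AS stats_name,
      s.auto_created,
      sp.last_updated,
      sp.rows,
      sp.rows_sampled,
      sp.modification_counter
  FROM sys.stats AS s
  OUTER APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
  WHERE s.object_id = OBJECT_ID(N'dbo.TableName');
  ```

  Comparing last_updated and modification_counter against the load schedule can show whether stale column statistics are a plausible cause of the estimate mismatch.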
  
  Best regards,
  Niko
  
Fred March 2, 2019 at 10:57 am

Hi Niko,
Very great job on CCI & co.
A small question:
Do you have a PDF document compiling all the posts on CCI/CI?
Thanks,
Fred

1. Niko Neugebauer (Post author) March 5, 2019 at 11:04 pm
  
  Hi Fred,
  
  thank you very much. There is no PDF, but I know that some people simply convert web pages into PDFs for reading.
  Later this year, there will be a PDF in the form of a book.
  
  Best regards,
  Niko Neugebauer
  
Andrey October 16, 2019 at 11:23 am

Hi Niko!

Do you see any reasons not to make all tables clustered columnstore, even small ones (<100 records)?

Our developers prefer to have all tables unified (all clustered columnstore) regardless of their size.
I have a feeling that it's not a good approach, but I have no solid arguments yet, apart from one query that fails when a small table is clustered columnstore and runs fine when the same table is a classic table with a clustered index.

Thanks in advance,
Andrey.

1. Niko Neugebauer (Post author) October 16, 2019 at 1:57 pm
  
  Hi Andrey,
  
  an unnecessary number of Hash Joins might punish your applications, and the forced preference for Hash Joins over Nested Loop Joins will definitely have effects, even though they might be small.
  One day the situation might change and the penalty will be too big, because a different kind of testing and different kinds of artefacts will appear.
  I suggest being EXTREMELY careful when building a CCI on such small tables.
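
  To see which tables would be affected, a query along these lines could list the clustered columnstore tables below a given size. This is only a sketch: the @row_threshold value is a hypothetical cut-off, and the row counts from sys.dm_db_partition_stats are approximate:

  ```sql
  DECLARE @row_threshold bigint = 100000;  -- hypothetical cut-off

  SELECT
      OBJECT_SCHEMA_NAME(i.object_id) AS schema_name,
      OBJECT_NAME(i.object_id)        AS table_name,
      i.name                          AS index_name,
      SUM(ps.row_count)               AS total_row_count
  FROM sys.indexes AS i
  JOIN sys.dm_db_partition_stats AS ps
      ON  ps.object_id = i.object_id
      AND ps.index_id  = i.index_id
  WHERE i.type = 5  -- clustered columnstore
  GROUP BY i.object_id, i.name
  HAVING SUM(ps.row_count) < @row_threshold
  ORDER BY total_row_count;
  ```

  The tables it returns are candidates for a closer look at their query plans, not an automatic list of indexes to convert.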
  
  Best regards,
  Niko Neugebauer
  
  1. Andrey October 16, 2019 at 4:41 pm
    
    Niko, thanks for the reply!
    
    I didn’t mention that the database is a DWH and that analytical queries are the most frequent ones.
    In this case Hash Joins are more typical than Nested Loops, if I’m not mistaken. What do you think?
    
    Anyway, I will share your opinion with our developers; thanks again for that.
    
    Regards,
    Andrey.
    
    1. Niko Neugebauer (Post author) October 16, 2019 at 10:49 pm
      
      Hi Andrey,
      
      Regarding the Joins: you write that they are more typical, but not exclusive. :)
      I would give the Query Optimiser the opportunity to make the hard choice, and unless it gets it badly wrong, I love being able to get better plans according to the current scenario.
      Sounds like your developers are looking for a hammer … As long as they only have nails to hit, all is fine. ;)
      
      Best regards,
      Niko Neugebauer
      
      1. Andrey October 17, 2019 at 11:21 am
        
        Thanks again, Niko!
        
        All the best to you, I appreciate your support of SQL community :)
        
        Regards,
        Andrey.
Gw van Olderen October 21, 2019 at 12:12 pm

Hello,

Very informative blog series on columnstore indexes.
Do you have any tips or insights on how to use Visual Studio to automate the deployment of these indexes?
If I add a column to a table with a columnstore index and deploy that to a production environment, the publish script first drops the index, adds the column, adds a normal clustered index and then recreates the columnstore index.

Regards,
Gerwin

1. Niko Neugebauer (Post author) October 21, 2019 at 6:44 pm
  
  Hi Gerwin,
  
  Yep, that is a known behaviour of the beloved VS …
  A manual script is the solution, as far as I know …
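
  A minimal sketch of such a manual script, with hypothetical names (dbo.FactSales for the clustered columnstore table, DiscountCode for the new column). It adds a nullable column in place and is written so that it can be rerun safely:

  ```sql
  -- Add the column only if it does not exist yet, leaving the
  -- clustered columnstore index untouched.
  IF COL_LENGTH(N'dbo.FactSales', N'DiscountCode') IS NULL
  BEGIN
      ALTER TABLE dbo.FactSales
          ADD DiscountCode varchar(20) NULL;
  END;
  ```

  How this script is wired into a deployment pipeline is left open here; it only illustrates the column change itself.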
  
  Best regards,
  Niko Neugebauer
  
  1. Gerwin October 22, 2019 at 9:39 am
    
    Hi Niko,
    
    Thanks for your quick reply. For now it's usually possible to add the column manually, but with continuous integration and automated deployment it would be nice not to have to intervene manually.
    
    I reported the problem through Visual Studio (2019), hoping that it will be picked up and fixed.
    If people reading this add comments there, that will hopefully help it get picked up.
    
    https://developercommunity.visualstudio.com/content/problem/787825/when-publishing-a-datbase-with-a-new-column-on-a-t.html
    
ManishA January 29, 2020 at 8:14 pm

We are thankful to you Niko for this precious series. :)

1. Niko Neugebauer (Post author) February 8, 2020 at 1:54 am
  
  Hi Manish,
  
  thank you for the kind words.
  I am grateful to you for them.
  
  Best regards,
  Niko Neugebauer
  
Sumrin May 10, 2020 at 2:09 pm

Hi Niko,

I am inserting 1M rows into a table that has a Columnstore index on it, with MAXDOP = 0, but the insertion of the records is taking more than 2 hours. Do you have any tips, or how can I overcome this?

Rif February 23, 2025 at 1:20 pm

Hello Niko,

Firstly, I want to express my gratitude for all your insightful blog posts and your contributions to the community. They have been incredibly helpful.

I am referencing a specific Microsoft document that discusses memory optimizations for columnstore compression in Azure Synapse, which can be found here: Memory Optimizations for Columnstore Compression. https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-memory-optimizations-for-columnstore-compression#how-to-estimate-memory-requirements

I am working on optimizing rowgroup quality for columnstore indexes in an Azure Synapse dedicated SQL pool and comparing the behaviors with other platforms like SQL Server, SSAS & PowerBI. My focus is primarily on understanding the differences in dictionary usage and its impact on compression efficiency across different systems. Based on the documentation, it appears that dictionaries are only utilized for string columns where the string data type exceeds 32 bytes. I have a few questions regarding this:

1) Could you confirm my understanding that strings of less than 32 bytes do not use dictionary compression and are instead compressed using Run-Length Encoding (RLE)? Is this interpretation accurate?

2) How does the “32 bytes” behaviour of dictionary compression in Azure Synapse’s dedicated SQL pool compare to SQL Server’s columnstore indexes? I can’t find this rule documented elsewhere.

3) Should my objective be to convert string columns of 32 bytes or less to a data type that uses 8 bytes or less, primarily to benefit from aggregate pushdown and batch mode? Is this still the guiding principle?

Thank you for considering my question,

Rif


Niko Neugebauer

SQL Server, Columnstore, Data Platform & Community
