Columnstore Indexes – part 126 (“Extracting Columnstore Statistics to Cloned Database”)

Continuation from the previous 125 parts, the whole series can be found at http://www.nikoport.com/columnstore/.

I have a huge love for the DBCC CLONEDATABASE command – it has been made available (backported) to every SQL Server version starting with SQL Server 2012, since the original release in SQL Server 2014, while being constantly improved in the Service Packs and Cumulative Updates.

This blog post is focusing on the Database Cloning improvement in the SQL Server 2019 that is already available in the public CTP 2.0 – the possibility of the automated statistics extraction for the Columnstore Indexes.
WHY ?
Well, there was quite a significant problem with the Columnstore Indexes previously – the statistics for them were not extracted into the cloned database, unless you did created the statistics in the most recent step before Database cloning.

Let’s view a simple example of the situation on the SQL Server 2019 and for the purpose of testing I will be using a 10GB copy of the TPCH sample database, that can be easily generated with the help of the free software, such as the HammerDB.

Let us create the lineitem_cc_without_stats table, create a Clustered Columnstore Index on it and finally load 1 Million Rows into it:

DROP TABLE IF EXISTS dbo.lineitem_cc_without_stats;

SELECT TOP 1000000 *
	INTO dbo.lineitem_cc_without_stats
	FROM dbo.lineitem
	WHERE 1 = 0;

CREATE CLUSTERED COLUMNSTORE INDEX CCI_lineitem_cc_without_stats
	ON dbo.lineitem_cc_without_stats;

INSERT INTO dbo.lineitem_cc_without_stats
SELECT TOP 1000000 *	
	FROM dbo.lineitem;

Before advancing any step further and cloning the source database, we need to ensure that a new statistics object has been created for the column l_shipdate and for that purpose I shall ran a rather trivial query against the created table dbo.lineitem_cc_without_stats where I shall just pinpoint the data from the 1st of January of 2008:

DECLARE @i DATE;

SELECT @i = l_shipdate
	FROM tpch.dbo.lineitem_cc_without_stats
	where l_shipdate = '2008-01-06'

There is a small detail, a significant difference with the Rowstore Indexes – while for the Columnstore Indexes the statistics objects are being created, and while they are populated with data – they are not persisted. This is not a big thing for your live production system, unless you are trying to extract/stream out the statistics object.
Here is the query that will list me both of my statistics object – the Columnstore Index & the Automatically Created Statistics on my test table lineitem_cc_without_stats:

select * from tpch.sys.stats where object_id=OBJECT_ID('dbo.lineitem_cc_without_stats');


Both objects are available and none of them are user created or temporary.

Lets see the execution plan of our test query and the cost of it:

DECLARE @i DATE;

SELECT @i = l_shipdate
	FROM tpch.dbo.lineitem_cc_without_stats
	where l_shipdate = '2008-01-06'


The overall estimated cost on the image is 0.164993 with estimated 1002.25 rows estimated – and there are no quirks, as we are working directly with a real production database that has a concrete data.

Now, let us create a new cloned database for the TPCH, which we shall call TPCH_CLONE:

DROP DATABASE IF EXISTS TPCH_CLONE;

DBCC CLONEDATABASE (TPCH, TPCH_CLONE);

After a couple of seconds the clone is ready and let’s try the very same operation against the TPCH_CLONE database:

DECLARE @i DATE;

SELECT @i = l_shipdate
	FROM tpch_clone.dbo.lineitem_cc_without_stats
	where l_shipdate = '2008-01-06';


The actual execution plan looks pretty much the same with one important detail – there is a warning sign on the Columnstore Index Scan iterator!
Investigating it a little bit – by hovering the mouse cursor over the iterator with the warning sign we can see that the Query Optimiser is advising us that we have columns with no statistics! It is also to notice that the expected number of rows to be read from our Columnstore Index is different – instead of the 1.000.000 rows as in the original execution plan, we have 929059 rows ! This is not just around 8% of the difference, this is huge when you are trying to emulate and generate plans in the clone database that are expected to be the same as in the production database – in the complex execution plan the skewing will raise into 100s & 1000s of % within just a couple of joins, while we would expect to see the very same plans!
We need to see the overall expected execution plan cost – and here it on the right side, showing us that we just have 0.105337 instead of the original 0.164993 representing an unhealthy difference of the whooping 63%!
This was not what one would expect, right!?

But how could it be? We have had the statistics in the original database, let us verify it:

select * from tpch_clone.sys.stats where object_id=OBJECT_ID('lineitem_cc_without_stats');


As you can see on the picture above, we still have the very same 2 statistics objects – they were copied from the original, but unfortunately they contain no data and our Auto-Created statistics is now user-created and temporary, while containing an empty filter definition! Indeed, the Columnstore Index statistics are not persisted…

SQL Server 2019

Testing a freshly installed SQL Server 2019 CTP2.0 I replayed the original setup script against the TPCH database with 10GB:

DROP TABLE IF EXISTS dbo.lineitem_cc_with_stats;

SELECT TOP 1000000 *
	INTO dbo.lineitem_cc_with_stats
	FROM dbo.lineitem

CREATE CLUSTERED COLUMNSTORE INDEX CCI_lineitem_cc_with_stats
	ON dbo.lineitem_cc_with_stats;
GO

DECLARE @i DATE;

SELECT @i = l_shipdate
	FROM dbo.lineitem_cc_without_stats
	where l_shipdate = '2008-01-06';
GO

DROP DATABASE IF EXISTS TPCH_CLONE;
DBCC CLONEDATABASE (TPCH, TPCH_CLONE);
GO

thus advancing to the test queries agains the both databases:

DECLARE @i DATE;

SELECT @i = l_shipdate
	FROM tpch_10.dbo.lineitem_cc_without_stats
	where l_shipdate = '2008-01-06'

SELECT @i = l_shipdate
	FROM TPCH_10_Clone.dbo.lineitem_cc_without_stats
	where l_shipdate = '2008-01-06';


On the pictures above you can see both execution plans being similar in the form and shape and without any warnings whatsoever!
Their cost is equal (and it is 0.0031408 being quite different from the original situation in the SQL Server 2016/2017, but more on this will be available hopefully in the nearest future)


Fair Warning – on my test instance I have called the database TPCH_10 & TPCH_10_Clone, since I have a good variation of the different sizes there and the default names area already taken :)

I won’t be able to show you the stats on the SQL Server 2019, because they will provoke more thoughts and questions then I am ready to invest in this current blog post, and I guess that CTP 2.0 still far from being ready and some of the features are still buggy – but we can see that the results and the queries do not suffer from the differences on SQL Server 2019, while they do on SQL Servers 2017, 2016, 2014.

Fixing the statistics on the previous SQL Server versions

If you are unsure of how to fix the Columnstore Indexes in the Cloned Database problem on the previous editions of SQL Server, then please consider reading the following article fromParikshit Savjani (SQL Tiger Team) and use the update script from the Tiger Team to solve it.

Final Thoughts

I love the fact this rather simple but important improvement finally made into the engine. No more streaming of the Columnstore Tables tricks will be needed, but I truly hope that this improvement will be backported into the SQL Server 2017 & SQL Server 2016 … Should not be that much of a game changer that someone will decide not to upgrade to SQL Server 2019, but it is a huge pleaser and promoter of the DBCC CLONEDATABASE command, which in my eyes is one of the coolest and underrated features of the late SQL Server releases.

to be continued …

Leave a Reply

Your email address will not be published. Required fields are marked *