Columnstore Indexes – part 61 (“Window aggregate functions”)

Continuation from the previous 60 parts, the whole series can be found at

A lot of financial calculations in the SQL world involves a number of analytical & statistical functions, such as CUME_DIST, LEAD, FIRST_VALUE, PERCENTILE_CONT, LAG, PERCENTILE_DISC, LAST_VALUE, PERCENT_RANK (Their description and functionality can be found at the MSDN page for Analytic Functions) & VAR, VARP, STDEV, STDEVP and others (check out the Aggregate Functions page at the MSDN) plus window aggregate functions: COUNT, COUNT_BIG, SUM, AVG, MIN, MAX & CLR implementations.
Their support and implementation in SQL Server has been a part of the mission that some well-known specialists, such as Itzik Ben-Gan, have been fighting for years.
Microsoft has greatly improved support for window functions in SQL Server 2012 by implementing a number of features, and now, the support for the Batch Execution Mode for Window functions has been introduced in SQL Server 2016.

To all my surprise officially the most commonly used ranking functions (RANK,NTILE,DENSE_RANK,ROW_NUMBER) are not listed at the moment between those functions that support Batch Execution Mode, though my tests clearly have shown that it already exists.

Windows AggregateLadies and gentlemen, let me introduce you to the newest member of the execution plan iterators – Window Aggregate. In all my tests during last weeks, I have only managed to see it performing in Batch Execution Mode, but I do not exclude the possibility that it might have an implementation in old traditional Row Execution Mode as a backup plan. It would make few sense to me now that there is a Batch Execution Mode support for a single core execution plan, but you know, just in case 😉

Important Note:
My tests have been executed on SQL Server 2016 CTP 2.1, where the first bits of the support have been implemented. The commitment to support much bigger range of different functions is made by Microsoft for the final release of SQL Server 2016, and so I am definitely expecting the supported operation list to grow over the next month before the final release.

For testing this new iterator I decide to use the very same VM with SQL Serve 2016 CTP 2.1 simply changing the compatibility level of my tested database to 120 whenever emulating results of SQL Server 2014. The importance of the compatibility level in SQL Server 2016 is described in a lot of precious details on the MSDN documentation page, and notice that the new features related to the batch mode are only available for the 130 compatibility level.

Once again, the usual suspect – the free database ContosoRetailDW have been restored and a Clustered Columnstore Index was created on the FactOnlineSales table:

Let’s kick off tests with a simple window aggregate, based on the MAX function:

For testing in SQL Server 2014 compatibility level, you will need to add the following statement, but do not forget to reset it back to 130 once you want to get back to the SQL Server 2016 CTP 2.1 compatibility level.

Basic execution plan for Windows AggregateOn the top of this text, you will see a very simple and clear execution plan which runs in the Batch Execution Mode for SQL Server 2016 CTP 2.1.
Now compare it to the execution plan below, which runs in SQL Server 2014 compatibility level.
SQL Server 2014 execution plan for Windows AggregateThere is one important detail over the execution plan differences – the SORT iterator runs in the Batch Execution Mode for SQL Server 2016 while for SQL Server 2014 it is only capable of running in a Row Execution Mode, and hence we can only estimate the true difference between having Window Aggregate iterator and not having it in our execution plan.
Anyway, comparing 2 plans I can see that Window Aggregate iterator is serving as a substitution for the following combination: Segment + Segment + Windows Spool + Aggregate.
I really hope that someone way smarter than I am, such as Paul White will dive deeply into this iterator in the following months, explaining its functionalities and limitaitons.

Execution times for MAX window functionRegarding the execution time, there is not much to say except that SQL Server 2016 Window Aggregate function runs around 13 times faster than the SQL Server 2014 version, which is incredibly impressive for a functionality that demands a whole lot of CPU.

From another side, the memory grant for the SQL Server 2016 execution plan is 722 MB vs 974MB of the SQL Server 2014 execution plan, which again gives a significant (25%) improvement for the resource usage and with Columnstore Indexes we all know how hungry they are for all available resources.

Let’s advance and run a query, doing 3 different calculations inside the same query, where all of the used window functions support the Batch Execution Mode within Window Aggregate iterator:

This time it took a significant amount of resources to process the requested information – 1 Minute and 28 seconds for the SQL Server 2016 CTP 2.1. For SQL Server 2014 compatibility level, I have managed to receive 116 seconds, which corresponds to almost 2 Minutes, a good difference to the original but somehow it does not get to the same scale of improvement as the single window function… Where is the problem ?
Multiple Window Functions subpart of the execution planThe thing is that the first part of the query (the first window aggregate) is the same as for the single window function, but then for both of the queries we are falling back to the original SQL Server 2014 execution strategy: Segment + Segment + Window Spool + Stream Aggregate + Compute Scalar, which is exactly explains that the difference between the execution plans is based on the difference between the first window functions – around 30 seconds, since the 2nd and the 3rd window functions are running in exactly the same way.

I could add here that running 2 window aggregate functions works just fine and that in the execution plan I see 2 Window Aggregate operators and in the terms of the performance the execution time increases quite linearly, but I do not consider the limitation of using 2 window functions at maximum as a reasonable limitation and so this is definitely an area where I expect Microsoft to improve before the final release.

As I have mentioned in the begin of this article, there is whole category of functions – ranking functions (RANK,NTILE,DENSE_RANK,ROW_NUMBER) that are not listed as supporting Batch Execution Mode, but I decided to take them for a couple of tests if there are any improvements.
Between those ranking functions, ROW_NUMBER is by far the most popular one, being used for many different goals and can be found very frequently in OLTP systems as well.
Let’s run a simple test query to find it’s performance and compare with SQL Server 2014:

ROW_NUMBER in SQL Server 2016 Execution Plan
This execution plan is very similar to the one, that can be found for the first query in this article, with the difference of Compute Scalar iterator being absent. The Window Aggregate iterator is well present and running in the Batch Execution mode, delivering great performance, but overall I suspect that having SORT iterator running in the Batch Mode is what truly delivers great performance for this execution plan.
ROW_NUMBER in SQL Server 2014 Execution PlanComparing with SQL Server 2014 execution plan shows us the difference that lies in the presence of the Segment + Sequence Project (Compute Scalar) iterators vs Window Aggregate iterator in SQL Server 2016 execution plan. Once again do not forget about the Batch Mode for the Sort iterator that should make the huge part of the performance difference.

Execution times for ROW_NUMBERComparison of the execution times gives a good overview that the in SQL Server 2016 for this function (ROW_NUMBER), we receive 2.5 times improvement on rather small dataset (7.6 Million Rows).

Notice that for different mathematical functions the different might be more or less significant, it will all depend on the executed operations and on the used datatypes. In what I have seen until now – going well over 8 bytes will always bring a performance penalty and sometimes it might be very significant.

Now it’s time to take a look at another analytical window function – LEAD, to check how it’s performance is getting updated with SQL Server 2016 CTP 2.1:

LEAD in SQL Server 2016 CTP 2.1 Execution PlanLooking at the execution plan from SQL Server 2016 CTP 2.1, you will find no Window Aggregate iterator, apparently because it is not implemented yet.
The difference in performance between SQL Server 2014 and SQL Server 2016 for the LEAD function is still noticeable, we are spending much more CPU (~17 seconds vs 11 seconds for SQL Server 2016 CTP 2.1), while overall time performance lies in the domain of SQL Server 2014 : 10 Seconds vs 11.5 Seconds for SQL Server 2016 CTP 2.1. I do not even try to make any judgement at this point, since we are dealing with a very early preview version.

I have done number of tests on the improvements in SQL Server 2016 CTP 2.1 and here is the table that I have compiled after my tests of the window functions:

Function Support in 2016 CTP 2.4
LAG yes
LEAD yes
CUME_DIST yes (partial)
PERCENT_RANK yes (partial)
VAR yes
VARP yes
MIN yes
MAX yes
AVG yes
SUM yes
RANK yes
NTILE yes (partial)

I will update this table in the near future, whenever Microsoft shall release updates for SQL Server 2016 CTPs and definitely there will be a detailed information on RTM.

to be continued with Columnstore Indexes – part 62 (“Row Groups Trimming”)

Leave a Reply

Your email address will not be published. Required fields are marked *