Azure VM Availability Groups Shakes

I have been writing quite a lot about IO and Backup in Azure VMs, but there is one topic I have wanted to share for quite some time (around a year) and never got enough time to do so – until now.

I am quite tired of the comparisons of Cloud offerings to Lego blocks, which are incredibly false and shortsighted. Well, unless you are as rich as Uncle Scrooge and ready to pay the expenses for every single desire that passes through your head … until you finally hit the wall with your head.
Unless you consider it reasonable, when building something with Lego, to buy a double or triple-sized box every time you need one extra block – yeah, that’s how reasonable Cloud offerings (ANY cloud; I am not focusing my writing on other clouds yet, but hey, the situation is the same, so do not get overly excited by any promises) look right now. They are indeed balanced for an average consumer (pro tip: the average number of legs for a human being is 1.92, and good luck finding that human being with those 1.92 legs, by the way), and that means they are most probably not perfectly suited and balanced for you, dear reader.

If you have just configured an Availability Group in Azure VMs (IaaS) for the first time, or if you are running hugely oversized VMs, you might not even have a clue about the adventures some people are facing. Let me share a couple of them with you, to make you appreciate your luck – and maybe spare you the inevitable calls with support, which will sometimes point-blank suggest that you upgrade your VMs to the next available number of cores.

A Shake

Disclaimer: I am using the word shake on my own initiative – to my knowledge, no Microsoft documentation has ever mentioned this situation. Those shakes are represented most of the time as health events to the cluster, such as a Lease Timeout resulting in a sudden failover attempt.
Why did I choose that word? I don’t know. Honestly. :)

A Host Shake

This is contention taking place on the host server in a virtualised environment.
On Azure this is practically unheard of, and I have only seen it in real life in over-virtualised environments.

A CPU Shake

This is one that a lot of people running mixed workloads (such as sharing the VM between the relational database engine and SQL Server Analysis Services, for example) will instantly recognize:
your CPU was at 100% occupation, and without digging into the internals of Windows & SQL Server you understood that your Virtual Machine became unresponsive, and hence the failover took place.

An IO Shake

Now … we have finally come to something that I am indeed eager to share.
Imagine that you set up your wonderful Azure VM with SQL Server in IaaS with a couple of terabytes of space for your data …
and after setting up a Cluster with an Availability Group and running a test workload for some time, you are ready to go into production.
A couple of weeks later you suddenly notice some random and inexplicable failovers, 5 miles away from the middle of nowhere …

Boom!

Boom!

Booya!

It turns out that OVER-PROVISIONING the IO subsystem (as in adding disk IO capacity that exceeds the limits of the VM) may result in this situation.
Imagine that we have a wonderful E16s_v3 Azure VM (avoid using this type of VM in Availability Groups at all costs) on each of our 2 nodes. This VM type will provide you with a maximum of 256 MB/s of cached throughput (take a good read here on the cached and uncached throughput).
For this VM we shall pick 2 P40 (250 MB/s each) managed disks for the data, 1 P40 (250 MB/s) disk for the Log and just 1 last P40 (250 MB/s) disk for TempDB – in other words, around 1,000 MB/s of provisioned disk throughput sitting behind a VM that caps cached throughput at 256 MB/s.
Note that I totally understand that we could choose 1 P50 (same 250 MB/s) disk for the data, keeping access to the same total of 4 TB, but that would not change the situation one bit.
I am consciously ignoring the additional impact of the IO operations on the system drive (disk C) and all the caching weirdness on the D drive that will help put our Availability Group on its knees.

In order to provoke this situation we just need a couple of TB (3, for example) of data on the data drive and 512 GB of Transaction Log.
Let’s start 3 operations (a rough T-SQL sketch follows after the list):
1. Full Backup of the Database
2. A big ETL transaction that will write heavily into the transaction log
3. Some query that will spill its sorts heavily into TempDB

If this does not work out, consider adding a consistency check with the help of DBCC CHECKDB.
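For the curious, here is a minimal and purely illustrative T-SQL sketch of such a parallel workload – the database, table and path names (AgFailureDB, dbo.BigStagingTable, Z:\Backup) are my own assumptions, each piece is meant to run in its own session, and obviously nowhere near a real production system:

-- session 1: full backup of the database, fighting for the same throughput budget
BACKUP DATABASE AgFailureDB
    TO DISK = N'Z:\Backup\AgFailureDB.bak'
    WITH COMPRESSION, STATS = 5;

-- session 2: a big ETL-style transaction hammering the transaction log
-- (assumes a staging table dbo.BigStagingTable (Payload varchar(max)) already exists)
BEGIN TRANSACTION;
INSERT INTO dbo.BigStagingTable (Payload)
SELECT REPLICATE(CONVERT(varchar(max), NEWID()), 100)
FROM sys.all_objects AS a
     CROSS JOIN sys.all_objects AS b;
COMMIT;

-- session 3: a sort that will not fit into its (deliberately tiny) memory grant and will spill into TempDB
SELECT Payload
FROM dbo.BigStagingTable
ORDER BY Payload
OPTION (MAX_GRANT_PERCENT = 1);

-- optional extra pressure: the consistency check mentioned above
DBCC CHECKDB (AgFailureDB) WITH NO_INFOMSGS;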

If you think that having those operations running in parallel in production does not happen in real life, I guess we have different experiences of production environments. :)

What will the throttling look like?

First of all, you will notice the long checkpoints in the error log:

2021-01-21 16:27:31.38 spid115s    FlushCache: cleaned up 1571 bufs with 1290 writes in 91394 ms (avoided 71 new dirty bufs) for db X:0
2021-01-21 16:27:37.57 spid29s     FlushCache: cleaned up 2502 bufs with 2073 writes in 138518 ms (avoided 140 new dirty bufs) for db X:0
2021-01-21 16:31:47.39 spid117s    FlushCache: cleaned up 7683 bufs with 5657 writes in 330464 ms (avoided 331 new dirty bufs) for db X:0
2021-01-21 16:33:02.13 spid29s     FlushCache: cleaned up 2562 bufs with 1886 writes in 103041 ms (avoided 169 new dirty bufs) for db X:0

In the error log you will find error messages such as:

2021-01-21 16:33:01.47 Server      SQL Server hosting availability group 'TestAG' did not receive a process event signal from the Windows Server Failover Cluster within the lease timeout period.
2021-01-21 16:33:05.47 Server      Error: 19407, Severity: 16, State: 1.
2021-01-21 16:33:05.47 Server      The lease between availability group 'TestAG' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.
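By the way, both kinds of messages can be fished out of the error log without scrolling through it manually – a small convenience sketch using the undocumented xp_readerrorlog procedure (so take it as a helper, not as an official interface):

-- current error log (0), SQL Server log (1), filtered on a search string
EXEC master.dbo.xp_readerrorlog 0, 1, N'FlushCache';
EXEC master.dbo.xp_readerrorlog 0, 1, N'lease', N'expired';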

In the cluster log (use the PowerShell command Get-ClusterLog to extract this information to a file for easier analysis), you will find entries like these:

00001c6c.0001eccc::2021/01/21-16:33:05.470 ERR   [RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost
00001c6c.0001eccc::2021/01/21-16:33:05.470 ERR   [RES] SQL Server Availability Group : [hadrag] Availability Group is not healthy with given HealthCheckTimeout and FailureConditionLevel
00001c6c.0001eccc::2021/01/21-16:33:05.470 ERR   [RES] SQL Server Availability Group : [hadrag] Resource Alive result 0.
00001c6c.0001eccc::2021/01/21-16:33:05.470 ERR   [RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost

This will also result in messages like the following in the Extended Events of the AG, pointing out that the connection between the replicas has failed:

01/21/2021 16:33:05 PM   Information   ServerName.vo 35206   MSSQLSERVER                         Server             N/A                               A connection timeout has occurred on a previously established connection to availability replica 'ServerName' with id [GuidId].  Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.

If you wish to analyse it from a different angle, you can start by diving into the diagnostic data provided by the sp_server_diagnostics procedure and stored in files in the Extended Events format.
In order to have a kind of snapshot – since this information is volatile and eventually gets replaced – I like to store it in a table with a query like this one (note that this specific path is targeting SQL Server 2016):

-- persist the sp_server_diagnostics component results from the Extended Events files into a table
SELECT
      xml_data.value('(/event/@name)[1]', 'varchar(max)') AS Name
    , xml_data.value('(/event/@package)[1]', 'varchar(max)') AS Package
    , xml_data.value('(/event/@timestamp)[1]', 'datetime') AS [Time]
    , xml_data.value('(/event/data[@name=''component_type'']/value)[1]', 'sysname') AS Sysname
    , xml_data.value('(/event/data[@name=''component_name'']/value)[1]', 'sysname') AS Component
    , xml_data.value('(/event/data[@name=''state'']/value)[1]', 'int') AS State
    , xml_data.value('(/event/data[@name=''state_desc'']/value)[1]', 'sysname') AS State_desc
    , xml_data.query('(/event/data[@name="data"]/value/*)') AS Data
INTO dbo.Diagnostics
FROM
(
    SELECT object_name AS event
         , CONVERT(xml, event_data) AS xml_data
    FROM sys.fn_xe_file_target_read_file('C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Log\*.xel', NULL, NULL, NULL)
) AS XEventData

After that I can calmly query this information with all the ease and time in the world, focusing just on the warning states of the instance (state >= 2) and on the IO Subsystem Latch Timeouts:

SELECT TOP 100 *
FROM dbo.Diagnostics
WHERE [Time] BETWEEN '2021-01-21 16:00:00.000' AND '2021-01-21 16:40:00.000'
    AND State >= 2
    AND CAST(Data AS nvarchar(MAX)) LIKE N'<ioSubsystem ioLatchTimeouts=%'
ORDER BY [Time] ASC


The inner side of that XML looks like this:

<ioSubsystem ioLatchTimeouts="11" intervalLongIos="0" totalLongIos="1">
  <longestPendingRequests>
    <pendingRequest duration="30354" filePath="\\?\Z:\Data\MySlowDB.mdf" offset="720896" handle="0x11f8" />
    <pendingRequest duration="341" filePath="\\?\Z:\Data\AgFailureDB_13.ndf" offset="3980984320" handle="0x18b8" />
    <pendingRequest duration="341" filePath="\\?\Z:\Data\AgFailureDB_13.ndf" offset="3981246464" handle="0x18b8" />
    <pendingRequest duration="341" filePath="\\?\Z:\Data\AgFailureDB_13.ndf" offset="3981377536" handle="0x18b8" />
    <pendingRequest duration="313" filePath="\\?\Z:\Data\AgFailureDB_14.ndf" offset="95477760" handle="0x18c0" />
  </longestPendingRequests>
</ioSubsystem>

It points to the fact that, as I hit my databases (SlowDB & AgFailureDB) hard on the IO level, even though AgFailureDB is suffering IO delays in the range of 340 milliseconds, it is my SlowDB database which is getting seriously hammered (a pending request of over 30 seconds), eventually provoking the “choke” of the Virtual Machine and resulting in the forced failover – the situation I call a “shake”.
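If you prefer a live view instead of a post-mortem, the sys.dm_io_virtual_file_stats DMV tells a similar story – here is a minimal sketch (the counters are cumulative since instance start, so look at how they evolve rather than at the absolute values):

-- rough average latency per database file, cumulative since the instance started
SELECT DB_NAME(vfs.database_id) AS database_name
     , mf.physical_name
     , vfs.num_of_reads
     , vfs.num_of_writes
     , vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms
     , vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
     INNER JOIN sys.master_files AS mf
         ON mf.database_id = vfs.database_id
        AND mf.file_id = vfs.file_id
ORDER BY avg_write_latency_ms DESC;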

Face the Shake

You can (and a VERY BIG QUESTION IS WHETHER YOU SHOULD!) change the cluster Health Check & Lease Timeout settings in order to … (what would be the correct word for that?) skip/ignore/closeYourEyesAndCry the unavailability of your replica – and I am laughing my bottom off, imagining someone making their Primary Replica unavailable and claiming that the client SHOULD not notice it.
You can (this one is the same as in the song “Livin’ On A Prayer” by Bon Jovi) change the failure condition level, but you definitely should talk to Microsoft Support or a very good specialist before doing it.
You can (note: no, no, no! – and only if you REALLY understand what you are affecting) change the SameSubnetThreshold/CrossSubnetThreshold for the cluster, but I am not going to lead you into THAT kind of trouble.
Here is the script for ya, setting all the most dangerous and irresponsible values that you can think of:

$AGNAME = "TestAG";
$YourCluster = Get-Cluster;
# FailureConditionLevel 1 = restart or fail over only when the SQL Server service is down
Get-ClusterResource -Name $AGNAME | Set-ClusterParameter -Name FailureConditionLevel -Value "1";
# both timeouts are specified in milliseconds
Get-ClusterResource -Name $AGNAME | Set-ClusterParameter -Name HealthCheckTimeout -Value "60000";
Get-ClusterResource -Name $AGNAME | Set-ClusterParameter -Name LeaseTimeout -Value "40000";
# number of missed heartbeats before a node is considered down
$YourCluster.SameSubnetThreshold = 30
$YourCluster.CrossSubnetThreshold = 30

I would also like to point you to a very useful article regarding Timeouts, Delays and Networks for the Availability Groups in the Microsoft Documentation.

You can scale up the virtual machine to get more IO, but … oh well … there is a tight connection between the number of CPU cores (think SQL Server Enterprise Edition licensing cost) and the IO limits, meaning that sometimes the only way will be to go for more CPU cores – like from 32 to 64 in a 2 or 3 Availability Group Replica configuration, effectively splashing out on 96 CPU cores’ worth of licensing, just to solve those unwanted problems.

Final Thoughts

Maybe Resource Governor would be able to help but, as I blogged in Resource Governor IOPS_PER_VOLUME is not enough and
SQL Server Backup Throttling in Azure VMs (and in other virtualised environments), there are so many cases, such as high-performing OLTP environments, where you would rather avoid messing with Resource Governor.
If you are able to isolate the operation that causes the IO overload, such as a Backup operation, you can configure Resource Governor for it – and then pray that nobody starts heavy OLTP or any other operation with a different block size.
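If you do go down that road, a minimal sketch of such an isolation could look like the one below – the pool, the workload group, the BackupUser login and the 1,000 IOPS cap are all my own assumptions, and (as the posts above explain) an IOPS cap is not the same thing as a throughput cap:

USE master;
GO
-- a pool that caps the backup workload at an assumed 1000 IOPS per volume
CREATE RESOURCE POOL BackupPool WITH (MAX_IOPS_PER_VOLUME = 1000);
CREATE WORKLOAD GROUP BackupGroup USING BackupPool;
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO
-- classify sessions of a hypothetical dedicated backup login into that group
CREATE FUNCTION dbo.fnBackupClassifier()
RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @WorkloadGroup sysname = N'default';
    IF SUSER_SNAME() = N'BackupUser'
        SET @WorkloadGroup = N'BackupGroup';
    RETURN @WorkloadGroup;
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnBackupClassifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;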

One interesting solution to this problem, suggested by Microsoft Support, would be – in the case of Backup and CHECKDB operations, and if there is still such a possibility – to add a new replica and run those operations there; but I have unfortunately seen this shake happening on Availability Groups with just a complex and heavy IO workload, where that would not help.

About 1.5 years ago, when I started facing such situations, I could not understand how it could be true that a good solution with a well-calculated IO subsystem would fall apart after some time, just because data growth and a couple of new disks had put the whole system under sudden failovers and unexpected risks.

You would definitely love to know more about the limits of the VMs and how many TB you can add per current-generation VM, right?
A blog post on that matter will follow soon. :)
