Unknown Database Performance Challenge

Client Objectives:
The Client was suffering from unknown database-level performance drains that dramatically slowed critical operations, as well as random SQL Server Availability Group failovers and outright failures.

Solution:
Heraflux reviewed their entire IT architecture, including flash-hybrid storage, virtualization, and SQL Server databases, and determined that both the SAN controller network interconnects and the controller cache were being periodically overwhelmed, causing downstream issues with SQL Server performance and availability.

Results:
The Client upgraded the storage controllers to 10GbE models with larger caches, which improved database availability and performance because the storage interconnects no longer burst to their maximum possible throughput.

Thank you to you and all your guys. This health check really helped us get to the next level and over some big bumps in our organization.


Aaron U.

Situation

Client is a member of a non-insurance health care group that allows its members to share the burden of the group’s health care needs as part of a larger customer-backed pool.

A number of the business-critical databases that empower daily business operations were experiencing significant performance challenges randomly throughout each day. Additionally, the SQL Server Availability Groups in use for improved high availability were failing over or outright stopping randomly throughout each day, causing major disruptions to the business each time.

The database administration team spent many hours attempting to track down the root cause of the performance and availability issues, including measuring and adjusting the SQL Server and Windows layers to better distribute the database workloads across the presented storage. However, these improvements were ineffective, and the overall trend was that the volume of failovers and failures was growing.

The Client needed an outside specialist to review the architecture and inner workings of the datacenter to determine where in the architecture the database outages originated. The entire datacenter architecture underneath the database servers needed to be reviewed to locate the performance bottlenecks and remediate the problem.

Solution

Heraflux learned quickly that the datacenter infrastructure and applications had some additional layers in the architecture that were previously unreported.

Heraflux performed a deep-dive systems review of the VMware virtualization platform powering the database and application servers, as well as the flash-hybrid storage array. First, Heraflux performed a performance analysis of the current platform to establish a performance baseline for the SQL Servers. Windows Perfmon and SQL Server dynamic management views (DMVs), both included with their respective layers, were used to establish this baseline, so no additional tooling costs were incurred.
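As an illustrative sketch (not the Client's actual scripts), a DMV-based baseline often starts with the server's accumulated wait statistics, which show where SQL Server spends its time waiting. A query along these lines, using the built-in sys.dm_os_wait_stats view, is a common starting point:

```sql
-- Illustrative only: top wait types since the last service restart.
-- High PAGEIOLATCH_* or WRITELOG waits typically point at storage.
SELECT TOP (10)
    wait_type,
    wait_time_ms,
    waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;
```

Because these counters accumulate from instance startup, a baseline is usually built by sampling them at intervals and comparing the deltas rather than reading the raw totals once.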

Heraflux subsequently found that the virtual disks were exhibiting extremely high bursts of storage latency at the approximate times of the Availability Group failovers. The errors reported by both the SQL Server and Windows layers showed symptoms of the underlying storage going unresponsive for longer periods of time than the architecture could withstand, and the SQL Server and WSFC components were failing as a result.
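Latency spikes of the kind described above can be confirmed from inside SQL Server itself. As a hedged example (again, not the Client's actual tooling), the built-in sys.dm_io_virtual_file_stats function reports cumulative I/O stall times per database file:

```sql
-- Illustrative only: average read/write latency per database file,
-- useful for spotting storage stalls like those described above.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name,
    vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
    vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON vfs.database_id = mf.database_id
 AND vfs.file_id = mf.file_id
ORDER BY avg_read_latency_ms DESC;
```

Correlating spikes in these figures with the timestamps of the Availability Group failovers is one way to tie the symptoms back to the storage layer.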

Heraflux then reviewed the storage architecture and found that it was being overwhelmed by the sheer volume of database requests. The storage controllers had four 1GbE network adapters per controller, but even with the active multipathing configuration at the switch and VMware layers configured properly according to the storage manufacturer's guidance, only two 1GbE network adapters were being used for a given workload stream. When various SQL Server operations were taking place, such as index and statistics maintenance, database backups, or even long-running reports triggered by end users, the network adapters on the SAN controllers were hitting maximum throughput capacity, and storage latency would spike at the VMware and in-guest Windows layers. The storage controller cache was also insufficient to accommodate the workload demands, with the controller cache hit ratios falling dramatically during these operations.

To remediate this, the vendor was contacted to determine why only two of the four storage controller NICs were in use for active multipathing, but the vendor stated that this was an underlying design principle and could not be changed. The Client then replaced these controllers with new controllers containing 2x10GbE network adapters and greater amounts of cache memory.

The end result was that the SQL Server databases could now access the storage subsystem without bottlenecking at the network interconnect layer, and the performance of every database transaction increased substantially. The databases were performing better than ever, and SQL Server Availability Group stability returned to normal.

Results

The SQL Server Availability Group layer maintained normal availability functions without exhibiting the random failovers.

The WSFC-layer in Windows reported no errors after the controllers were replaced, and resumed normal availability operations afterwards.

The SQL Server databases now performed better than before, thanks to the improved network throughput performance and increased controller cache amounts.