Highly Available & Geo-Redundant Databases

Client Objectives:
Client’s mission-critical SQL Server databases underneath a global perimeter security platform required an upgraded architecture to improve the availability and patching strategy for these systems, and also wanted geographically distributed replication to improve the resiliency of the platform to disasters.

Solution:
Heraflux reviewed their availability requirements, disaster recovery goals, and assessed the constraints placed around the SQL Server connectivity from the third-party application vendor. A highly available SQL Server 2016 Availability Group architecture would fulfill the availability requirements at the primary datacenters, and distribueted Availability Groups would extend the architecture to remote datacenters.

Results:
Client achieved much improved database availability and flexibility from the revised SQL Server architecture, and the platform is now redundant among regional datacenters.

Heraflux designed a highly available and disaster recovery strategy for this security platform’s SQL Servers so robust that the application consultants for this platform are preparing to implement this architecture at other organizations. This platform is so much more flexible than before, and we can perform our maintenance during the day!

 

Situation

Orion Advisor Services, Inc. is a leading provider of financial advisor and wealth management software that helps advisors manage their clients’ extensive financial portfolios. Orion provides a fully-integrated software platform to complement their RIA services, including reporting, trade order management, cost basis and tax reporting, and compliance.

A number of the business-critical databases that empower daily business operations were recording messages in the error logs about significantly higher than expected disk latencies. SQL Server “long I/O alerts,” which are indications that a single I/O operation took longer than 15 seconds, were growing in frequency during routine database operations, and during these periods of time, the databases were performing well below expectations.

The database administration team spent many hours attempting to track down the root cause of the storage latency issues, including measure and adjusting the SQL Server and Windows-layers to better distribute the database workloads across the presented storage. However, the improvements were incremental, but the overall trend was that the volume of long I/O alerts were growing. The storage team, on the other hand, reviewed the storage performance as reported by the all-flash SAN, and determined that the SAN was performing quite well, with latencies never exceeding one millisecond.

The Client needed an outside specialist to review the architecture and inner workings of the datacenter to determine where in the architecture the source of the database latency was originating. The entire datacenter architecture for the manner that storage was presented to the database server VMware-based virtual machines needed to be reviewed to determine where the storage performance bottleneck was at, and remediate the problem.

Solution

Heraflux learned quickly that the datacenter infrastructure and applications had some additional layers in the architecture that were previously unreported.

Heraflux performed a deep-dive systems review of the VMware virtualization systems platform powering the database and application servers, as well as the all-flash storage array. First, Heraflux performed a performance analysis of the current platform to establish a performance baseline for the SQL Servers. Windows Perfmon and SQL Server dynamic management views (DMVs), both included with the respective layers, were used to establish this baseline, so no additional costs were incurred to establish this baseline.

Heraflux subsequently found that the virtual disk configuration on the database servers were suboptimal, with I/O bottlenecks originating in the Windows layer because of how Windows performs I/O coalescing by virtual disk controller and not by virtual disk. The storage requests were stacking up because Windows only had one available route to get storage requests in and out of the operating system. Three additional controllers were added to the virtual machine, and the workload balanced among all available controllers, and the storage performance significantly improved. However, it was nowhere near the low latencies that the SAN was reporting.

Heraflux then reviewed the storage architecture, and found that a storage abstraction device, called a SAN volume controller, was in-line with the storage device. These devices are designed to make SAN administration less time consuming. All storage requests were funneled through this device. An extensive analysis of this device found that the controllers and cache memory inside these devices were being overwhelmed during times of elevated I/O requests, and as a result, the device was slowing down inbound I/O while it processed the requests. Additional storage abstraction devices were added to this architecture to better scale out the workload, and the storage performance then improved across the entire datacenter to match the very low storage performance latencies that the SAN was reporting.

The end result was that the SQL Server databases were now able to access the storage subsystem faster than ever, and every single database transaction’s performance increased substantially. The databases were performing better than ever, the storage abstraction layer performing as designed, and the long I/O alerts disappeared.

Results

The Availability Group architecture now allows the databases, instances, and operating system to be patched on a routine basis with no application interruption.

The upgraded SQL Server platform to version 2016 improved performance of the security platform.

The Distributed SQL Server Availability Group architecture allowed the SQL Server environment to be geographically separated but exceed expected SLAs for the DR requirements.