Demanding 24×7 Performance & Scalability

Client Objectives:
The client needed to ensure that its IT infrastructure delivered the maximum possible performance for its continuously available 24×7 operations, and that the systems could support the organization’s explosive growth.

Solution:
Heraflux reviewed their entire IT architecture, including storage, virtualization, databases, and business continuity strategy, and implemented a revised architecture to ensure that the IT systems met their availability and performance requirements.

Results:
The client achieved uptime and performance well above the requirements, and the enhanced platform is now facilitating the company’s growth.

Heraflux helped to provide peace of mind by reviewing the availability and scalability of our IT infrastructure, and uncovered significant gaps in our design. A few maintenance windows later, I can now sleep soundly at night knowing our environment is built to handle even the most unusual crisis scenarios.

Situation

The client is a leading provider of energy metering and telemetry services for numerous power sources. It provides equipment and services for energy delivery management, real-time operational control, and historical analytics, all while complying with the regulatory requirements of the various power governance bodies.

A number of the business-critical applications behind daily operations were running well below expected performance, and a flurry of unexpected system outages was causing the business concern. The datacenter infrastructure was approaching the end of its useful life, and IT management was not sure whether the root cause of the outages was simply aging, unreliable equipment or some other problem lurking in the environment. The business was also expanding, and the additional load on the environment worried upper management.

The IT team had spent months carefully reviewing the datacenter systems without finding any significant faults in the design, yet the random system outages continued. Application servers would just freeze. Database servers would refuse connections. Occasionally a database would become corrupt.

The client needed an outside specialist to review the architecture and inner workings of the datacenter and determine where the instability was coming from. The review also had to cover system performance and the overall capacity and utilization of the environment, both to confirm that there was enough headroom for the immediate future and to establish the consumption trends that would inform the next major infrastructure purchase.

Solution

Heraflux learned quickly that the datacenter infrastructure and applications had some unique challenges.

Heraflux performed a deep-dive review of the Hyper-V virtualization platform powering the database and application servers, and found that many of the systems contained small misconfigurations that prevented them from notifying the IT staff about routine warnings and faults. The storage array had a failed RAID controller cache battery, a silent fault on this particular array, which left it running at roughly one-fifth of its normal performance. The SQL Server database was configured as a Failover Cluster Instance (FCI), but the hypervisor layer was sharing the in-guest storage connection paths with the Hyper-V Live Migration and backup network, so storage connectivity timed out at random whenever it was squeezed by Live Migrations or VM-level backups. Routine maintenance on the SQL Server instances was also not being performed up to Heraflux standards, thanks to a bad recommendation from third-party application vendors.

Heraflux reconfigured the alerting and set up dedicated paths for storage presentation to both the Hyper-V virtualization environment and the SQL Server FCI. The SQL Server database maintenance was reconfigured, and the database integrity checks caught minor corruption caused by the storage layer repeatedly dropping its connections. The corruption was repaired without data loss. The improved database maintenance also helped the application servers run faster than ever, and the end users noticed the speed improvement.

The end result was that the IT infrastructure underneath a 24×7 business operation could now fully drive business around the clock, all while helping the IT staff rest assured that the datacenter platforms were stable and highly redundant. Systems were performing better than ever, and the capacity analysis showed that the platform could handle at least another year of sustainable business growth.
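As a rough illustration of the kind of trend projection a capacity analysis relies on, the sketch below fits a straight line to monthly utilization samples and estimates how many months remain before a chosen ceiling is reached. It is not the method Heraflux used; the function name, the 80% ceiling, and the sample figures are all hypothetical, and a real analysis would weigh seasonality and planned growth as well.

```python
from typing import Optional, Sequence


def months_until_ceiling(
    utilization_pct: Sequence[float], ceiling_pct: float = 80.0
) -> Optional[float]:
    """Estimate months (counted from the last sample) until utilization
    reaches ceiling_pct, using an ordinary least-squares line through
    evenly spaced monthly samples. Returns None if the trend is flat
    or declining. The 80% default ceiling is an assumed planning value."""
    n = len(utilization_pct)
    if n < 2:
        raise ValueError("need at least two monthly samples")

    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(utilization_pct) / n

    # Least-squares slope and intercept for y = intercept + slope * x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, utilization_pct))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x

    # Month index at which the fitted line crosses the ceiling.
    crossing = (ceiling_pct - intercept) / slope
    return max(0.0, crossing - (n - 1))


if __name__ == "__main__":
    # Hypothetical storage utilization, one sample per month.
    samples = [40.0, 42.0, 44.0, 46.0, 48.0, 50.0]
    print(months_until_ceiling(samples))  # 15.0
```

With these invented samples the fitted growth is two percentage points per month, so the 80% ceiling is about fifteen months out, which is the sort of figure that turns a capacity review into a purchasing timeline.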

Results

The revised datacenter architecture has proven to exceed the business’s availability requirements.

The systems review helped to improve performance while reducing the workload on the IT infrastructure, providing greater long-term capacity.

Each component in the IT infrastructure can now tolerate multiple types of failure with no disruption to the business.