SQL Server File-Level Latency Collector

sql-serverSQL Server records ongoing disk latency metrics in the form of ‘disk stall’ values, as collected by SQL Server on an ongoing basis. The challenge is that these stall metrics are cumulative from the last time that the SQL Server service was started, instead of an average value on a sample window. Therefore, an ongoing collector can be set up to store these cumulative values on a fixed time interval, and then calculate the latency averages on a small granular window of time. These metrics provide a deeper look into the latency values, and can be used to analyze the ongoing latency state of the server’s disks. Overlay this data with the drive-level data collected through Windows Perfmon counters to see if you have any hot-spots on one or more disks.

First, either create a new database or select an existing DBA utility database. This example uses a new database called CollectorTemp.

USE [CollectorTemp]
GO

CREATE TABLE [dbo].[DiskStall] (
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[ServerName] [nvarchar](128) NULL,
[InstanceName] [nvarchar](128) NULL,
[DatabaseID] [smallint] NOT NULL,
[DatabaseName] [sysname] NOT NULL,
[PhysicalFileName] [nvarchar](260) NOT NULL,
[DatabaseFileID] [int] NOT NULL,
[AvgReadStallms] [numeric](10, 1) NULL,
[AvgWriteStallms] [numeric](10, 1) NULL,
[IOStallReadms] [bigint] NULL,
[NumOfReads] [bigint] NULL,
[IOStallWritems] [bigint] NULL,
[NumOfWrites] [bigint] NULL,
[TotalNumOfBytesRead] [bigint] NULL,
[TotalNumOfBytesWritten] [bigint] NULL,
[CollectionDateTime] [datetime] NOT NULL,
CONSTRAINT [PK_DiskStall] PRIMARY KEY CLUSTERED
( [ID] ASC )
) ON [PRIMARY]

 Next, create a stored procedure in this database that performs our metric collection and previous sample differential calculation.

 

CREATE PROCEDURE dbo.usp_DiskStallCollection AS

SET NOCOUNT ON;

BEGIN TRY

— Calculates average stalls per read, per write, and
— per total input/output for each database file.
;WITH PreviousSample_CTE( ServerName, InstanceName, DatabaseID, DatabaseFileID,
IOStallReadms, NumOfReads, IOStallWritems, NumOfWrites) AS (
SELECT
ServerName, InstanceName, DatabaseID, DatabaseFileID,
IOStallReadms, NumOfReads, IOStallWritems, NumOfWrites
FROM
dbo.DiskStall
WHERE
CollectionDateTime =
(SELECT MAX(CollectionDateTime) FROM dbo.DiskStall))
INSERT INTO dbo.DiskStall
([ServerName]
,[InstanceName]
,[DatabaseID]
,[DatabaseName]
,[PhysicalFileName]
,[DatabaseFileID]
,[AvgReadStallms]
,[AvgWriteStallms]
,[IOStallReadms]
,[NumOfReads]
,[IOStallWritems]
,[NumOfWrites]
,[TotalNumOfBytesRead]
,[TotalNumOfBytesWritten]
,[CollectionDateTime])
SELECT
@@SERVERNAME AS ServerName,
@@SERVICENAME AS InstanceName,
fs.database_id AS DatabaseID,
d.name AS DatabaseName,
mf.physical_name AS PhysicalFileName,
mf.file_id AS DatabaseFileID,
CASE WHEN 1.0 +(num_of_reads – ISNULL(cte.NumOfReads,0.0)) = 0.0 THEN 0.0 ELSE
CAST((io_stall_read_ms -(ISNULL(cte.IOStallReadms,0.0)))/
(1.0 +(num_of_reads – ISNULL(cte.NumOfReads,0.0))) AS NUMERIC(10,1)) END AS [AvgReadStallms],
CASE WHEN 1.0 +(num_of_writes-ISNULL(cte.NumOfWrites,0.0)) = 0.0 THEN 0.0 ELSE
CAST((io_stall_write_ms -(ISNULL(cte.IOStallWritems,0.0)))/
(1.0 +(num_of_writes-ISNULL(cte.NumOfWrites,0.0))) AS NUMERIC(10,1)) END AS [AvgWriteStallms],
io_stall_read_ms AS IOStallReadms,
num_of_reads AS NumOfReads,
io_stall_write_ms AS IOStallWritems,
num_of_writes AS NumOfWrites,
num_of_bytes_read AS TotalNumOfBytesRead,
num_of_bytes_written AS TotalNumOfBytesWritten,
GETDATE() AS CollectionDateTime
FROM
sys.dm_io_virtual_file_stats(null,null) AS fs
INNER JOIN sys.master_files AS mf ON fs.database_id = mf.database_id
AND fs.[file_id] = mf.[file_id]
INNER JOIN sys.databases as d ON d.database_id = fs.database_id
LEFT JOIN PreviousSample_CTE cte ON cte.ServerName = ServerName
AND cte.InstanceName = InstanceName
AND cte.DatabaseID = fs.database_id
AND cte.DatabaseFileID = mf.file_id
ORDER BY
fs.database_id ASC OPTION (RECOMPILE);
END TRY

BEGIN CATCH
THROW
END CATCH;

Create a SQL Server Agent scheduled job that executes this stored procedure on an ongoing basis. A recommended value is every 30 seconds if a latency problem is suspected, or one minute if no problem is currently under review.

To view the raw values per database data and log file, execute the following query.

 

SELECT * FROM dbo.DiskStall ORDER BY CollectionDateTime, ServerName, DatabaseName

This query extracts the raw data, along with the calculated values for disk read and write latency over that sample period. However, analyzing these values over a longer window of time are necessary for determining the running state health of the environment. Performing a percentile analysis provides us a much more granular and significant view of the performance of the environment. To perform a percentile analysis, this SQL Server 2012 and above query will determine the running percentile metrics for the sample window. Adjust the time window being queried to match your specific requirements.

 

SELECT DISTINCT databaseid,

servername,

databasename,

physicalfilename,

databasefileid,

Round(Percentile_cont(0)

within GROUP (ORDER BY avgreadstallms) over (

PARTITION BY physicalfilename), 2) AS ReadStallms00,

Round(Percentile_cont(0.25)

within GROUP (ORDER BY avgreadstallms) over (

PARTITION BY physicalfilename), 2) AS ReadStallms25,

Round(Percentile_cont(0.5)

within GROUP (ORDER BY avgreadstallms) over (

PARTITION BY physicalfilename), 2) AS ReadStallms50,

Round(Percentile_cont(0.75)

within GROUP (ORDER BY avgreadstallms) over (

PARTITION BY physicalfilename), 2) AS ReadStallms75,

Round(Percentile_cont(0.9)

within GROUP (ORDER BY avgreadstallms) over (

PARTITION BY physicalfilename), 2) AS ReadStallms90,

Round(Percentile_cont(0.95)

within GROUP (ORDER BY avgreadstallms) over (

PARTITION BY physicalfilename), 2) AS ReadStallms95,

Round(Percentile_cont(0.96)

within GROUP (ORDER BY avgreadstallms) over (

PARTITION BY physicalfilename), 2) AS ReadStallms96,

Round(Percentile_cont(0.97)

within GROUP (ORDER BY avgreadstallms) over (

PARTITION BY physicalfilename), 2) AS ReadStallms97,

Round(Percentile_cont(0.98)

within GROUP (ORDER BY avgreadstallms) over (

PARTITION BY physicalfilename), 2) AS ReadStallms98,

Round(Percentile_cont(0.99)

within GROUP (ORDER BY avgreadstallms) over (

PARTITION BY physicalfilename), 2) AS ReadStallms99,

Round(Percentile_cont(0.999)

within GROUP (ORDER BY avgreadstallms) over (

PARTITION BY physicalfilename), 2) AS ReadStallms99d9,

Round(Percentile_cont(1)

within GROUP (ORDER BY avgreadstallms) over (

PARTITION BY physicalfilename), 2) AS ReadStallms100,

Round(Percentile_cont(0)

within GROUP (ORDER BY avgwritestallms) over (

PARTITION BY physicalfilename), 2) AS WriteStallms00,

Round(Percentile_cont(0.25)

within GROUP (ORDER BY avgwritestallms) over (

PARTITION BY physicalfilename), 2) AS WriteStallms25,

Round(Percentile_cont(0.5)

within GROUP (ORDER BY avgwritestallms) over (

PARTITION BY physicalfilename), 2) AS WriteStallms50,

Round(Percentile_cont(0.75)

within GROUP (ORDER BY avgwritestallms) over (

PARTITION BY physicalfilename), 2) AS WriteStallms75,

Round(Percentile_cont(0.9)

within GROUP (ORDER BY avgwritestallms) over (

PARTITION BY physicalfilename), 2) AS WriteStallms90,

Round(Percentile_cont(0.95)

within GROUP (ORDER BY avgwritestallms) over (

PARTITION BY physicalfilename), 2) AS WriteStallms95,

Round(Percentile_cont(0.96)

within GROUP (ORDER BY avgwritestallms) over (

PARTITION BY physicalfilename), 2) AS WriteStallms96,

Round(Percentile_cont(0.97)

within GROUP (ORDER BY avgwritestallms) over (

PARTITION BY physicalfilename), 2) AS WriteStallms97,

Round(Percentile_cont(0.98)

within GROUP (ORDER BY avgwritestallms) over (

PARTITION BY physicalfilename), 2) AS WriteStallms98,

Round(Percentile_cont(0.99)

within GROUP (ORDER BY avgwritestallms) over (

PARTITION BY physicalfilename), 2) AS WriteStallms99,

Round(Percentile_cont(0.999)

within GROUP (ORDER BY avgwritestallms) over (

PARTITION BY physicalfilename), 2) AS WriteStallms99d9,

Round(Percentile_cont(1)

within GROUP (ORDER BY avgwritestallms) over (

PARTITION BY physicalfilename), 2) AS WriteStallms100

FROM dbo.diskstall

ORDER BY

databasename,

databasefileid

 You can also create a SQL Agent job that can purge the stats over some window of time so that the database does not grow out of control. The code to perform the cleanup can look like the following block for a 60-day retention window.

–clean up after 60 days

CREATE PROCEDURE dbo.usp_DiskStall_Cleanup AS

SET NOCOUNT ON;

DELETE FROM dbo.diskstall WHERE CollectionDateTime < DATEADD(day, -60, GETDATE())