Monitoring Hard Drive Failure Predictors

Most server storage will be attached via a RAID controller (and will typically need to be monitored using the vendor's own software [1] [2]), but there may be occasions where it's desirable to check the health of an IDE or USB-connected hard drive running under the generic storage drivers.

This guide describes testing for issues reported by an ATA drive's S.M.A.R.T. implementation using the Check a Console Command task.

DISCLAIMER: The information provided herein is intended as an example only. No representation is made as to its accuracy, completeness or suitability for a particular purpose or platform (see also the Winserver Wingman EULA). Please test your own implementation thoroughly in the specific environment in which it needs to function and ensure that it operates as required.

Using Windows Management Instrumentation (WMI)

WMI is built into every supported version of Windows and facilitates reporting on a range of hardware and system data, including disk drives via the Win32_DiskDrive class. A command-line query utility (wmic.exe) prints tabular results to STDOUT. To see all available data for all supported drives, open a Command Prompt 'as Administrator' and redirect the output of the following command to a file:

wmic DiskDrive > C:\WMIDisks.txt

You'll see from the resulting file that there's more information available than we need for this purpose, so we'll refine the command line to return only a useful subset:

wmic DiskDrive GET Caption, CreationClassName, Status

This might output something like the following:

Caption                      CreationClassName  Status  
WD Elements 10B8 USB Device  Win32_DiskDrive    OK
SAMSUNG HD103SJ              Win32_DiskDrive    OK
SanDisk Ultra II 240GB       Win32_DiskDrive    OK
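If the status of each drive is to be inspected individually, the table can be split into rows. The following Python sketch is illustrative only: the helper name parse_wmic_table is hypothetical, and it assumes that columns are separated by two or more spaces, that no value is empty and that no caption contains a double space.

```python
import re

def parse_wmic_table(text):
    """Split wmic's tabular output into one dict per drive (hypothetical helper)."""
    # Drop blank lines; wmic pads each row with trailing spaces.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    # Assumption: columns are separated by two or more spaces.
    headers = re.split(r"\s{2,}", lines[0])
    return [dict(zip(headers, re.split(r"\s{2,}", ln))) for ln in lines[1:]]

sample = """\
Caption                      CreationClassName  Status
WD Elements 10B8 USB Device  Win32_DiskDrive    OK
SAMSUNG HD103SJ              Win32_DiskDrive    OK
"""

for drive in parse_wmic_table(sample):
    print(drive["Caption"], "->", drive["Status"])
```

A fixed-width parse based on the header offsets would be more robust against unusual captions; the whitespace split is used here only to keep the sketch short.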

Given just one drive, it would be sufficient to configure a task that simply checks for the presence of the fragment OK. Alternatively, we might refine the query to return data for one drive at a time and run a separate task to test for this fragment for each disk. It is more flexible, though, to use regular expressions to check all connected drives at once, confirming that:

a) At least one row contains the text Win32_DiskDrive OK, and

b) Nowhere is Win32_DiskDrive followed by anything other than OK.

Check ATA drive health via WMI each hour.

Task Parameters

Task Type: Check a Console Command

Frequency: 1 Hour

Execute: wmic.exe

Arguments: DiskDrive GET Caption, CreationClassName, Status

CAUTION On: (command) ...does not run or its output; does not match the expression; Win32_DiskDrive\s{4}OK

FAILURE On: (command) ...runs and its output; matches the expression; Win32_DiskDrive\s{4}(?!OK)
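The logic of these two expressions can be trialled outside the monitoring tool. The Python sketch below applies them to captured output. The function name is hypothetical, and the assumption that a FAILURE match takes precedence over a CAUTION condition is ours rather than documented product behaviour.

```python
import re

# The two expressions from the task parameters above.
HEALTHY = re.compile(r"Win32_DiskDrive\s{4}OK")
UNHEALTHY = re.compile(r"Win32_DiskDrive\s{4}(?!OK)")

def evaluate_wmic_output(output):
    """Return 'FAILURE', 'CAUTION' or 'OK' (hypothetical helper)."""
    # Assumption: a FAILURE match outranks a CAUTION condition.
    if UNHEALTHY.search(output):
        return "FAILURE"
    # No healthy row at all (or no output) is treated as CAUTION.
    if not HEALTHY.search(output):
        return "CAUTION"
    return "OK"

pad = " " * 4  # spaces between CreationClassName and Status in the sample
print(evaluate_wmic_output("SAMSUNG HD103SJ  Win32_DiskDrive" + pad + "OK"))         # OK
print(evaluate_wmic_output("SAMSUNG HD103SJ  Win32_DiskDrive" + pad + "Pred Fail"))  # FAILURE
print(evaluate_wmic_output(""))                                                      # CAUTION
```

Note that \s{4} depends on the column widths shown in the sample output, so the expressions only hold while the query returns the same columns in the same order.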

Using S.M.A.R.T. Monitoring Tools (Smartmontools)

While convenient by virtue of always being available, WMI is less than thorough, relying predominantly on manufacturer-defined thresholds for acceptable error rates. To check these predictors more rigorously we need a tool that exposes the underlying values in a readily interpreted form. For this we recommend Smartmontools.

Once installed, open a Command Prompt 'as Administrator', change to the installation folder's \bin directory and enter the following command to scan for supported drives:

smartctl --scan

The output should include something like the following:

/dev/sda -d ata # /dev/sda, ATA device
/dev/sdb -d ata # /dev/sdb, ATA device
/dev/sdc -d sat # /dev/sdc [SAT], ATA device

The path shown in the first column is the label by which each drive must be addressed (a separate task will be required for each drive monitored). If it's not immediately apparent which drive is which, you can request more detail for each with:

smartctl --info /dev/sda
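Because each drive needs its own task, it can be handy to turn the scan output into a list of device paths. The Python sketch below does this for captured text; the helper name parse_scan is hypothetical, and it assumes every line has the shape shown above (device path, -d, device type, then a # comment).

```python
def parse_scan(text):
    """Return (device, type) pairs from `smartctl --scan` output (hypothetical helper)."""
    drives = []
    for line in text.splitlines():
        # Everything after '#' is a human-readable comment.
        fields = line.split("#", 1)[0].split()
        # Expected shape: <device> -d <type>
        if len(fields) >= 3 and fields[1] == "-d":
            drives.append((fields[0], fields[2]))
    return drives

sample = """\
/dev/sda -d ata # /dev/sda, ATA device
/dev/sdb -d ata # /dev/sdb, ATA device
/dev/sdc -d sat # /dev/sdc [SAT], ATA device
"""

for device, dev_type in parse_scan(sample):
    # One set of task arguments per drive found.
    print(f"{device} ({dev_type}): -H -A {device}")
```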

To check the condition of a drive we can request a health assessment and enumeration of available S.M.A.R.T. values by combining the --health (-H) and --attributes (-A) parameters:

smartctl -H -A /dev/sda

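Note that one column (WHEN_FAILED) and the header row have been omitted from the following output for display purposes. The columns shown are the attribute ID, name, flag, normalized value, worst value, threshold, type, update mode and raw value: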

SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always  0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always  0
  3 Spin_Up_Time            0x0023   069   067   025    Pre-fail  Always  9618
  4 Start_Stop_Count        0x0032   092   092   000    Old_age   Always  9009
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always  0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always  0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline 0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always  25460
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always  0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always  0
 12 Power_Cycle_Count       0x0032   095   095   000    Old_age   Always  5257
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always  1072
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always  0
194 Temperature_Celsius     0x0002   064   046   000    Old_age   Always  31
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always  0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always  0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always  0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline 0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always  0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always  74
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always  0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always  9032

This drive is apparently in good health, with zero reported as the raw value for the most significant indicators of imminent failure (see ATA S.M.A.R.T. Attributes).
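To make the reading of the table concrete, the following Python sketch pulls the raw value (the last column) out of each attribute row and reports any watched attribute that is not zero. It is illustrative only: the function names are hypothetical, the watched IDs are simply those used in this guide's task, the value 8 for attribute 197 is invented for the demonstration, and the parser assumes the raw value is a plain integer (some drives append extra text to it).

```python
# Attribute IDs treated as significant in this guide's task.
WATCHED_IDS = (1, 5, 7, 10, 196, 197, 198)

def raw_values(text):
    """Map attribute ID -> (name, raw value) (hypothetical helper)."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # An attribute row starts with a numeric ID and has a hex flag third.
        if len(fields) >= 9 and fields[0].isdigit() and fields[2].startswith("0x"):
            # Assumption: the raw value is a plain integer in the last column.
            if fields[-1].isdigit():
                found[int(fields[0])] = (fields[1], int(fields[-1]))
    return found

def nonzero_watched(text, watched=WATCHED_IDS):
    """Return the watched attributes whose raw value is above zero."""
    values = raw_values(text)
    return {i: values[i] for i in watched if i in values and values[i][1] > 0}

sample = "\n".join([
    "SMART overall-health self-assessment test result: PASSED",
    "  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always  0",
    "  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always  25460",
    "197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always  8",
])

print(nonzero_watched(sample))  # {197: ('Current_Pending_Sector', 8)}
```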

Obviously our evaluation needs to test for the fragment PASSED, but we may also craft a Numeric Comparison that will alert if at any stage the raw value for an important predictor exceeds a specified threshold (10 in the example that follows):

Check IDE disk 1 for ATA drive failure predictors each hour.

Task Parameters

Task Type: Check a Console Command

Frequency: 1 Hour

Execute: C:\Program Files\Smartmontools\bin\smartctl.exe

Arguments: -H -A /dev/sda

CAUTION On: (command) ...runs and its output; is numerically more than; ^(\s\s1|\s\s5|\s\s7|\s10|196|197|198).*?(\d+)$/[Max:1st]10

FAILURE On: (command) ...does not run or its output; does not contain the fragment; PASSED

To better understand the numeric comparison used here, you might trial it in the Evaluate Text utility: first with the original output, and then again with any matching (raw) value set higher than 10.
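The same trial can be approximated outside the utility. The Python sketch below reuses the row-matching part of the expression and compares the largest captured raw value against 10. The /[Max:1st]10 suffix is tool-specific syntax that Python does not understand; reading it as "take the largest raw value among the matching rows and compare it with 10" is our assumption, and the function names are hypothetical.

```python
import re

# Row-matching part of the expression from the task parameters, applied
# per line. The last capture group is the raw value at the end of the row.
PATTERN = re.compile(
    r"^(\s\s1|\s\s5|\s\s7|\s10|196|197|198).*?(\d+)$", re.MULTILINE
)

def max_raw(output):
    """Largest raw value among the matching rows, or None if none match."""
    raws = [int(m.group(2)) for m in PATTERN.finditer(output)]
    return max(raws) if raws else None

def is_caution(output, limit=10):
    """True when any matching raw value is numerically more than `limit`."""
    worst = max_raw(output)
    return worst is not None and worst > limit

def sample(pending):
    # Three rows in the layout shown earlier; `pending` is the raw value
    # substituted for attribute 197.
    return "\n".join([
        "  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always  0",
        "  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always  25460",
        f"197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always  {pending}",
    ])

print(is_caution(sample(0)))   # False: no watched raw value exceeds 10
print(is_caution(sample(24)))  # True: attribute 197 reports 24
```

Power_On_Hours is ignored in both cases because ID 9 is not among the alternatives at the start of the expression.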

Finally, note that for the depicted Offline_Uncorrectable attribute (whose update mode is shown as Offline) to return up-to-date information, an offline test should be scheduled on the drive at regular intervals using the command:

smartctl --test=offline /dev/sda

2 October 2016