Monday, October 5, 2009

zpool monitoring

The second script checks the current state of the zpools, looking for degraded arrays (caused by failed drives), unavailable spares, and unrecovered errors. Because it keeps a state file in /etc/zfs, it needs to run as root. I run it hourly. It should be possible to extend the script to also check for ZFS checksum errors, but I haven't taken the time to do that. The reminder code hasn't been tested, as I haven't had a failure since it was put in place.

#! /bin/sh

STATEFILE="/etc/zfs/chk.state"
ALARMUSER="root@localhost"

zpool status 2>&1 | \
egrep -i '(degraded|unavail|unrecover)' > /dev/null

STATE=$?

if [ -f $STATEFILE ]
then
LASTSTATE=`cat $STATEFILE`
else
LASTSTATE=1
echo $STATE > $STATEFILE
fi

#
# Error is currently set.
#
if [ $STATE = 0 ]
then

#
# Error wasn't set previously. Send out the error message.
#
if [ $LASTSTATE = 1 ]
then
HOSTNAME=`uname -n`
zpool status -x | \
mailx -s "ZFS.error.on.$HOSTNAME" $ALARMUSER
echo $STATE > $STATEFILE
exit
fi

#
# Send out a reminder every other day.
#
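# find prints the state file only if it was modified less than two
# days ago; in that case it is too early for a reminder.
FOUND=`find $STATEFILE -mtime -2`
if [ -n "$FOUND" ]
then
exit
fi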
HOSTNAME=`uname -n`
zpool status -x | \
mailx -s "ZFS.error.reminder.on.$HOSTNAME" $ALARMUSER
echo $STATE > $STATEFILE
exit
fi

#
# Error was set, but is no longer. Send out the fixed message.
#
if [ $STATE = 1 -a $LASTSTATE = 0 ]
then
HOSTNAME=`uname -n`
zpool status -x | \
mailx -s "ZFS.error.fixed.on.$HOSTNAME" $ALARMUSER
echo $STATE > $STATEFILE
exit
fi
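
Since the reminder branch has never fired in practice, the age test it depends on can be tried on its own, without waiting for a pool to fail. The following is only a sketch: reminder_due is a hypothetical helper name that is not part of the script above, and it uses "-mtime +1" (two or more whole days old), which is the complement of the "-mtime -2" test.

```shell
#!/bin/sh
# Hypothetical helper, not part of the script above.
# Succeeds (exit status 0) when the file named by $1 was last modified
# two or more days ago, i.e. when a reminder would be due.
reminder_due() {
    # find prints the path only if the file's age, counted in whole
    # 24-hour periods, is greater than 1 (48 hours or more).
    found=`find "$1" -mtime +1 2>/dev/null`
    [ -n "$found" ]
}

# Example use against a scratch file rather than the real state file:
#   touch -t 200910010000 /tmp/chk.state.test
#   reminder_due /tmp/chk.state.test && echo "reminder due"
```

Backdating a scratch file with touch -t, as in the commented example, exercises the reminder path without touching /etc/zfs/chk.state.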


EDIT: Updated the script above to look for unrecovered errors, thanks to information in this post by nhamilto40. To reset the error counts, use the "zpool clear pool" command, where pool is the name of the pool.
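
The checksum check mentioned at the top could be approached by reading the CKSUM column of the "zpool status" output. This is an untested-on-real-failures sketch, not part of the script above: cksum_errors is a hypothetical helper name, and it assumes the usual config table layout with the header "NAME STATE READ WRITE CKSUM".

```shell
#!/bin/sh
# Hypothetical helper, not part of the script above.
# Reads `zpool status` output on stdin and prints "device count" for
# every row of the config table whose CKSUM column is not 0.
# Assumes the header row is: NAME STATE READ WRITE CKSUM
cksum_errors() {
    awk '
        # The column header marks the start of a config table.
        $1 == "NAME" && $5 == "CKSUM" { in_cfg = 1; next }
        # A new pool or the errors line ends the table.
        $1 == "errors:" || $1 == "pool:" { in_cfg = 0 }
        # Rows with fewer than five fields (e.g. spares) carry no counters.
        in_cfg && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}

# Example use:
#   zpool status | cksum_errors
```

If the function prints anything, the output could be mailed in the same way as the other alerts; no output means every CKSUM counter was 0.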

I scanned this thread and saw no scripts. Perhaps this will be more useful than I thought.