Raymond's blog: zpool monitoring

The second script checks the current state of the zpools, looking for degraded arrays (caused by failed drives), unavailable spares, and unrecovered errors. Because it keeps a state file in /etc/zfs, it needs to run as root. I run it hourly. It should be possible to extend the script to also check for ZFS checksum errors, but I haven't taken the time to do it. The reminder code hasn't been tested, as I haven't had a failure since the code was put in place.

#! /bin/sh

STATEFILE="/etc/zfs/chk.state"
ALARMUSER="root@localhost"

zpool status 2>&1 | \
egrep -i '(degraded|unavail|unrecover)' > /dev/null

STATE=$?

if [ -f "$STATEFILE" ]
then
    LASTSTATE=`cat "$STATEFILE"`
else
    LASTSTATE=1
    echo $STATE > "$STATEFILE"
fi

#
# Error is currently set.
#
if [ "$STATE" = 0 ]
then

    #
    # Error wasn't set previously. Send out the error message.
    #
    if [ "$LASTSTATE" = 1 ]
    then
        HOSTNAME=`uname -n`
        zpool status -x | \
            mailx -s "ZFS.error.on.$HOSTNAME" $ALARMUSER
        echo $STATE > $STATEFILE
        exit
    fi

    #
    # Send out a reminder every other day. The state file is rewritten
    # whenever a message is sent, so its modification time marks the last
    # message. If it changed within the last two days, stay quiet.
    #
    FOUND=`find $STATEFILE -mtime -2`
    if [ -n "$FOUND" ]
    then
        exit
    fi
    HOSTNAME=`uname -n`
    zpool status -x | \
        mailx -s "ZFS.error.reminder.on.$HOSTNAME" $ALARMUSER
    echo $STATE > $STATEFILE
    exit
fi

#
# Error was set, but is no longer. Send out the fixed message.
#
if [ "$STATE" = 1 -a "$LASTSTATE" = 0 ]
then
    HOSTNAME=`uname -n`
    zpool status -x | \
        mailx -s "ZFS.error.fixed.on.$HOSTNAME" $ALARMUSER
    echo $STATE > $STATEFILE
    exit
fi

EDIT: Updated the above script to look for unrecovered errors, thanks to information in this post by nhamilto40. To reset the error counts, use the "zpool clear pool" command, where pool is the name of the pool.
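
The checksum check mentioned at the top could be sketched roughly as follows. This is not part of the script above, only an untested-on-real-hardware illustration: the function name cksum_errors is made up, and it assumes the usual "zpool status" device table with the columns NAME STATE READ WRITE CKSUM, which may differ between ZFS releases. The sample pool and device names are invented.

```shell
#!/bin/sh

# Hypothetical helper: reads "zpool status" output on stdin and prints
# "<device> <count>" for every row whose CKSUM column is not 0.
# Assumes the device table starts with a "NAME STATE READ WRITE CKSUM"
# header line and ends at the next blank line.
cksum_errors() {
    awk '
        $1 == "NAME" && $5 == "CKSUM" { in_config = 1; next }
        NF == 0                        { in_config = 0 }
        in_config && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}

# Demonstration with made-up sample output; on a real system this would be
#   zpool status | cksum_errors
cksum_errors <<'EOF'
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     3

errors: No known data errors
EOF
# prints: c1t1d0 3
```

If the function prints anything, the output could be mailed in the same way as the other alerts. After "zpool clear" resets the counts, it should print nothing again.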

I scanned this thread and saw no scripts. Perhaps this will be more useful than I thought.
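
Since the reminder branch has never fired in practice, one way to exercise its timing test without waiting for a real failure is to backdate a scratch file and see what "find -mtime -2" reports. The sketch below is only an illustration: the function name state_age and the scratch file path are made up, and it works on a temporary file rather than the real state file in /etc/zfs.

```shell
#!/bin/sh

# Hypothetical helper: prints "recent" if the file was modified within the
# last two days (as judged by find -mtime -2), otherwise "stale".
state_age() {
    found=`find "$1" -mtime -2`
    if [ -n "$found" ]
    then
        echo recent
    else
        echo stale
    fi
}

# Scratch file standing in for the state file (path is an assumption).
tmp=/tmp/chk.state.test.$$
echo 0 > "$tmp"

state_age "$tmp"                 # prints: recent

# Backdate the file to 1 January 2000 so it looks long overdue.
touch -t 200001010000 "$tmp"
state_age "$tmp"                 # prints: stale

rm -f "$tmp"
```

A reminder should go out only in the "stale" case, so the quoting around the variable in the test matters: an unquoted empty value changes what the test command sees.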

Monday, October 5, 2009
