Saturday, 20 September 2014

Graphing HDD health with smartctl

I proudly built myself a front door for my TV cabinet recently - the very same TV cabinet that houses my NAS. Two weeks and two NAS crashes later (the box, coincidentally, uses those drives for swap), I discovered that 2 of my 4 HDDs had started giving errors, and one had died completely. Turns out this thing called "ventilation" is important after all! *shrugs*

Being a reasonably diligent fellow, I had backups. I didn't have to use them though. ZFS to the rescue and I replaced both drives, one at a time. In honour of this auspicious restoration of my data redundancy, I hereby present my latest in ghetto monitoring scripts:

smartctl_log.sh
#!/bin/bash
# Run from a cronjob
TIMESTAMP=$(date +"%s")
# Appends one CSV row per SMART attribute: timestamp,device,attribute name,normalised value
for i in `seq 0 3`; do
  DEV=ada${i}
  smartctl --attributes /dev/${DEV} | grep "^[ 0-9]" | awk '{ print "'${TIMESTAMP},${DEV}',"$2","$4 }' >> /mnt/tank/logs/smartd.log
done

gen_index.sh
#!/bin/bash
#
# Quick and dirty HTML generator for displaying smartctl stats.
#
FILENAME=$1
cat > index.html << EOF
<html><head><title>Smartctl Graphs</title></head><body><h1>Smartctl Graphs</h1>
EOF
DRIVES=`cat ${FILENAME} | awk -F, '{print $2}' | sort | uniq`
TYPES=`cat ${FILENAME} | awk -F, '{print $3}' | sort | uniq`
# Note: --date="last year" is GNU date syntax; FreeBSD's date would need: date -v-1y +"%s"
MIN_TIMESTAMP=`date --date="last year" +"%s"`
for t in ${TYPES}; do
  rm -f /tmp/gnuplot.data.*
  for d in ${DRIVES}; do
    cat ${FILENAME} | grep ",$d,$t," | awk -F, '{ if(int($1) > int('${MIN_TIMESTAMP}')) print $1" "$2" "$4 }' | sort -n > /tmp/gnuplot.data.${d}
  done
  cat > /tmp/gnuplot.cmd << EOF
set term png
set output "gen_${t}.png"
#set size 17,17
set title "${t}"
set style data fsteps
set timefmt "%s"
set format x "%Y/%m/%d %H:%M"
set yrange [0:]
set xdata time
set xtics rotate
set grid
set key bottom left
EOF
  echo -ne "plot " >> /tmp/gnuplot.cmd
  for d in ${DRIVES}; do
    echo -ne "'/tmp/gnuplot.data.${d}' using 1:3 title columnheader(2) with lines," >> /tmp/gnuplot.cmd
  done
  cat /tmp/gnuplot.cmd | gnuplot
  echo "<div style=\"float:left;width:340px;\"><img width=\"320\" src=\"gen_${t}.png\"></div>" >> index.html
done
echo "<div style=\"clear:both\"><center>Generated at `date` on `hostname`</center></div>" >> index.html
echo "</body></html>" >> index.html

This monstrosity produces glorious graphs like these:

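Now, these scripts are not exactly shining examples of what you should do. They're probably more like counter-examples. For starters, gen_index.sh re-reads the entire log once for every drive and attribute combination, and the log only ever grows, so each run gets slower roughly in proportion to how long you've been logging.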
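
If the slowdown ever bothers you, one way around it is to split the log in a single awk pass instead of grepping it once per drive and attribute. This is only a rough sketch, not what I actually run: the function name split_log, the output directory and the ATTRIBUTE.DEVICE.dat file naming are all made up for illustration, and it assumes the log keeps the timestamp,device,attribute,value layout written by smartctl_log.sh and that attribute names contain no slashes.

```shell
# Hypothetical helper (not part of the original scripts).
# Splits a smartd.log-style CSV into one file per attribute/device pair,
# reading the log exactly once.
#   $1 = log file, $2 = output directory, $3 = minimum timestamp (default 0)
split_log() {
  local logfile=$1 outdir=$2 min_ts=${3:-0}
  mkdir -p "${outdir}"
  # Start from a clean slate, because the awk below appends.
  rm -f "${outdir}"/*.dat
  awk -F, -v out="${outdir}" -v min="${min_ts}" '
    int($1) > int(min) {
      file = out "/" $3 "." $2 ".dat"
      # Same three columns gen_index.sh feeds to gnuplot.
      print $1, $2, $4 >> file
      # Close after each write so awk never hits its open-file limit.
      close(file)
    }
  ' "${logfile}"
}
```

Since the cron job appends rows in time order, the per-pair files come out already sorted, so the plot loop could read them directly instead of rebuilding /tmp/gnuplot.data.* every time.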

In any case, for fellow lazy fellows, this may be enough for you, as it was for me. I am mainly interested in running this after the fact, when I notice issues, so I can look for a correlated downward trend in a graph and more confidently predict my drives' pending demise. YMMV.
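
If squinting at graphs is too much effort, the same "is it trending down?" question can be roughed out straight from the log. The sketch below is an assumption-laden shortcut rather than anything I run: report_drops is a made-up name, and it only compares the first and last logged value of each drive/attribute pair, so it will miss dips that later recover. It relies on column 4 being smartctl's normalised VALUE column, where lower generally means worse.

```shell
# Hypothetical helper (not part of the original scripts).
# Prints device,attribute,first,last for every pair whose normalised
# value is lower in its last log row than in its first.
#   $1 = log file in timestamp,device,attribute,value format
report_drops() {
  awk -F, '
    { key = $2 "," $3 }
    # Remember the first value seen for each device/attribute pair.
    !(key in first) { first[key] = $4 + 0 }
    # Keep overwriting so we end up with the most recent value.
    { last[key] = $4 + 0 }
    END {
      for (key in first)
        if (last[key] < first[key])
          print key "," first[key] "," last[key]
    }
  ' "$1" | sort
}
```

Usage would be something like report_drops /mnt/tank/logs/smartd.log, which prints nothing at all when no attribute has dropped.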