From time to time there is simply no plugin for what you need, or you can't find one. In that case, consider writing your own; it is an easy task, really. I will show how to write a simple plugin in bash that reports the age of the oldest file in a directory. Some time ago I needed to check whether data in a temporary directory was being processed, and that is the example used here.

Introduction

Of course, before you start writing your own plugin, spend 5 minutes with Google: there is a good chance that someone has already written what you need. Maybe not exactly the thing you need, but close enough.

When it comes to the language you write a plugin in, you can use almost anything. The only requirement is that the dependencies of your script or program are met on the host that runs it. For example, bash is present on almost every *nix machine out there and you can expect Python or Perl, but seeing COBOL would be kind of a surprise these days. It is worth mentioning that a plugin is meant to be run frequently and for as short a time as possible, so writing it in Java adds some overhead for starting and stopping the JVM on every run. But in the end you choose the language.
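Since the one hard requirement is that dependencies are present on the host, a plugin can check for them up front and report a clear message instead of failing halfway through. Below is a minimal sketch of such a check. The helper name require_commands is hypothetical and is not part of the script shown later; the return value 3 is the UNKNOWN exit code described further down.

```shell
# Hypothetical helper (not in the original script): verify that every
# external command the plugin relies on can be found in PATH.
require_commands() {
    local cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "UNKNOWN - required command '$cmd' not found"
            return 3   # 3 is the UNKNOWN exit code
        fi
    done
    return 0
}
```

In a plugin it could be called before anything else, for example: require_commands find sort head cut date || exit 3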

Code

I’ve written a very simple script. We will walk through it below.

#!/bin/bash

OK=0
WARNING=1
CRITICAL=2
UNKNOWN=3

MONITORED_DIRECTORY="none"
WARNING_LEVEL=900
CRITICAL_LEVEL=3600
LEVEL="none"

output() {
    echo -e "$1"
}

usage() {
    output "
Check age of the oldest file in given directory (in seconds).

    Options:
      -d <directory>    Path to directory that should be scanned for files.
      -w <number>       Warning level.
      -c <number>       Critical level.

Usage: $0 -d /tmp/workdir -w 500 -c 1000
"
}

check_required_parameters() {
    # "$@" - names of the variables that must be set
    for param in "$@" ; do
        if [ "${!param}" == "none" ]; then
            output "Parameter $param is not set. Aborting!"
            usage
            exit $UNKNOWN
        fi
    done
}

check_treshhold_levels() {
    # Quote the values so an empty argument gives a test error, not a syntax error
    if [ "$WARNING_LEVEL" -gt "$CRITICAL_LEVEL" ]; then
        output "Warning level (${WARNING_LEVEL}) is set above critical level (${CRITICAL_LEVEL}). Aborting."
        exit $UNKNOWN
    fi
}

gather_data() {
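    # A missing directory must not be mistaken for an empty (healthy) one
    if [ ! -d "$MONITORED_DIRECTORY" ]; then
        output "UNKNOWN - ${MONITORED_DIRECTORY} is not a directory"
        exit $UNKNOWN
    fi
    # Print the modification time (seconds since epoch) of every regular file,
    # sort oldest first and keep the integer part of the first timestamp
    timestamp=$(find "$MONITORED_DIRECTORY" -maxdepth 1 -type f -printf "%T@\n" 2>/dev/null | sort -n | head -n 1 | cut -f1 -d".")
    if [ -z "$timestamp" ]; then
        # No files found: an empty directory is the best case
        LEVEL=0
        return
    fi
    now=$(date "+%s")
    LEVEL=$((now - timestamp))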
}

compare_tresholds() {
    if [ "$LEVEL" == "none" ]; then
        output "Level is not set. Something went wrong. Aborting."
        exit $UNKNOWN
    fi

    message="oldest file is ${LEVEL} seconds old"
    perfdata="AGE=${LEVEL}s;$WARNING_LEVEL;$CRITICAL_LEVEL"

    if [ "$LEVEL" -gt "$CRITICAL_LEVEL" ]; then
        output "CRITICAL - ${message} |${perfdata}"
        exit $CRITICAL
    fi

    if [ "$LEVEL" -gt "$WARNING_LEVEL" ]; then
        output "WARNING - ${message} |${perfdata}"
        exit $WARNING
    fi

    output "OK - ${message} |${perfdata}"
    exit $OK

}

while getopts "d:w:c:h" OPTION
do
    case $OPTION in
        h)
            usage
            exit $UNKNOWN
            ;;
        d)
            MONITORED_DIRECTORY="$OPTARG"
            ;;
        w)
            WARNING_LEVEL="$OPTARG"
            ;;
        c)
            CRITICAL_LEVEL="$OPTARG"
            ;;
        *)
            usage
            exit $UNKNOWN
            ;;
    esac
done

check_required_parameters MONITORED_DIRECTORY WARNING_LEVEL CRITICAL_LEVEL
check_treshhold_levels
gather_data
compare_tresholds

# vim: expandtab:ts=4:sw=4:sts=4

Plugin parts

In general, each plugin consists of a few parts. Let’s discuss them here.

Initializing

Every script needs to deal with some variables or arguments. Here I parse the arguments and check that the warning and critical levels are set correctly. You don’t want a critical level set below the warning level; it simply doesn’t make sense. I don’t think there is a need to get into details here.
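One detail the script takes on trust is that both thresholds are numbers; with something like -w abc the numeric comparison only prints a shell error. A sketch of an extra guard follows. The helper names is_non_negative_integer and validate_levels are hypothetical and not part of the original script.

```shell
# Hypothetical helper: succeed only if $1 consists solely of digits.
is_non_negative_integer() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical guard: returns 3 (the UNKNOWN exit code) when the
# thresholds are not numbers or are in the wrong order.
validate_levels() {
    # $1 - warning level, $2 - critical level
    if ! is_non_negative_integer "$1" || ! is_non_negative_integer "$2"; then
        echo "UNKNOWN - thresholds must be non-negative integers"
        return 3
    fi
    if [ "$1" -gt "$2" ]; then
        echo "UNKNOWN - warning level ($1) is above critical level ($2)"
        return 3
    fi
    return 0
}
```

In the script above it could run right after argument parsing, for example: validate_levels "$WARNING_LEVEL" "$CRITICAL_LEVEL" || exit $UNKNOWN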

Gathering and evaluating data

At some point you will need to gather some data, which you will use to determine the health of the service. It can be the output of the df command, a page fetched from your application's health endpoint, you name it. In my case I wanted to know how old the oldest file in a temporary directory is. My application consumes those files, so the age of the oldest one gives me an idea of the delay in data processing.

gather_data() {
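    # A missing directory must not be mistaken for an empty (healthy) one
    if [ ! -d "$MONITORED_DIRECTORY" ]; then
        output "UNKNOWN - ${MONITORED_DIRECTORY} is not a directory"
        exit $UNKNOWN
    fi
    # Print the modification time (seconds since epoch) of every regular file,
    # sort oldest first and keep the integer part of the first timestamp
    timestamp=$(find "$MONITORED_DIRECTORY" -maxdepth 1 -type f -printf "%T@\n" 2>/dev/null | sort -n | head -n 1 | cut -f1 -d".")
    if [ -z "$timestamp" ]; then
        # No files found: an empty directory is the best case
        LEVEL=0
        return
    fi
    now=$(date "+%s")
    LEVEL=$((now - timestamp))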
}

As you can see, I used the find command to list the files in the directory, sorted them by modification time, took the first one (the oldest) and extracted its timestamp.
The last two lines of the function calculate how old that file is in seconds. I consider an empty directory the best situation, so I set the delay to 0 if no file was found.
That’s it! With those simple steps I have a value (in the LEVEL variable) which I can compare to the thresholds to determine service health.

I’d like to make a distinction between gathering and evaluating data. In this example, and probably in most other cases, you gather data and process it in one step. But that is not always the case. I have worked with a Nagios-based monitoring system in which the plugins didn’t gather any data themselves. Data was fetched by a bunch of bash scripts and stored in files; the Nagios plugins then picked the file they were interested in and processed the data stored there. Consider checking disk space on several file systems: you could gather df output once and process it several times, once per file system.
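The df case could look roughly like the sketch below. It rests on assumptions: the cache file path and the function name usage_for_mount are hypothetical, the collector is assumed to store the output of df -P (the POSIX format, one file system per line), and mount points are assumed to contain no spaces.

```shell
# Hypothetical cache file, written once by a collector (for example from cron):
#   df -P > /var/tmp/df.cache
DF_CACHE="${DF_CACHE:-/var/tmp/df.cache}"

# Print the used capacity in percent (without the % sign) for one mount
# point, reading the cached df output instead of running df again.
usage_for_mount() {
    # $1 - mount point, $2 - cache file (defaults to $DF_CACHE)
    awk -v mount="$1" '
        NR > 1 && $NF == mount {       # skip the header, match the last column
            sub(/%/, "", $(NF - 1))    # the Capacity column looks like "42%"
            print $(NF - 1)
        }' "${2:-$DF_CACHE}"
}
```

Each file system then gets its own check that calls usage_for_mount with a different mount point and compares the result with its own thresholds, while df itself runs only once.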

Deciding service health and communicating it

Nagios determines service status from the plugin’s exit code. That’s why I defined variables for the exit codes at the top of the script: it is more readable and protects you from silly mistakes. The exit codes are as follows:

OK=0
WARNING=1
CRITICAL=2
UNKNOWN=3

Besides the service status, Nagios also records status information. It is meant for humans and should fit on one line. Nagios should be able to pick up a multi-line description, but not all CGIs will display it. You simply print a line to standard output, as the output calls in compare_tresholds below do.
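To see what a plugin hands back, you can run it by hand and look at both pieces: the exit code and the first line of output. The wrapper below is only an illustration of that contract, not how Nagios itself is implemented; the names state_name and run_check are hypothetical.

```shell
# Hypothetical helper: translate a plugin exit code into a state name.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "INVALID" ;;   # codes outside 0-3 are not part of the convention
    esac
}

# Hypothetical wrapper: run a check and show its state and status text.
run_check() {
    # "$@" - the plugin command line to execute
    local out rc
    out=$("$@" 2>/dev/null)
    rc=$?
    # The status text is the first line, up to the pipe character
    printf 'exit code %s (%s), status text: %s\n' \
        "$rc" "$(state_name "$rc")" \
        "$(printf '%s\n' "$out" | head -n 1 | cut -d'|' -f1)"
    return "$rc"
}
```

It takes the plugin command line as its arguments, for example run_check followed by the path to the script and its -d, -w and -c options.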

There is one more thing to cover: graphs. There are plenty of applications that can draw a neat graph for you, but you have to tell them what to draw. That is what performance data is for. Everything after the pipe character (|) in the status information is treated by Nagios as performance data (the perfdata variable in compare_tresholds below shows how to compose one). Performance data has the following structure:

'label'=value[UOM];[warn];[crit];[min];[max]
UOM: unit of measurement (octets, bits/s, volts, …)
warn: WARNING threshold
crit: CRITICAL threshold
min: minimum possible value
max: maximum possible value

Notice that not all fields are required. You will be fine with only a label and a value, but the other fields can make your graph more informative. Don’t underestimate the min and max values: they help scale graphs, so if you know them, for example when the value is a percentage, don’t hesitate to use them.
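If a plugin reports more than one metric, composing these strings by hand gets error-prone. Here is a small sketch of a helper that fills in the structure above; the name build_perfdata is hypothetical and is not used by the script in this article. Empty arguments simply leave the optional field blank.

```shell
# Hypothetical helper: compose one performance data item.
build_perfdata() {
    # $1 label, $2 value, $3 unit, $4 warn, $5 crit, $6 min, $7 max
    printf "'%s'=%s%s;%s;%s;%s;%s\n" "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}

build_perfdata "age" 1234 s 900 3600 0 ""
# prints: 'age'=1234s;900;3600;0;
build_perfdata "disk usage" 42 % 80 90 0 100
# prints: 'disk usage'=42%;80;90;0;100
```

The second call shows the percentage case mentioned above, where min and max are known to be 0 and 100.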

compare_tresholds() {
    if [ "$LEVEL" == "none" ]; then
        output "Level is not set. Something went wrong. Aborting."
        exit $UNKNOWN
    fi

    message="oldest file is ${LEVEL} seconds old"
    perfdata="AGE=${LEVEL}s;$WARNING_LEVEL;$CRITICAL_LEVEL"

    if [ "$LEVEL" -gt "$CRITICAL_LEVEL" ]; then
        output "CRITICAL - ${message} |${perfdata}"
        exit $CRITICAL
    fi

    if [ "$LEVEL" -gt "$WARNING_LEVEL" ]; then
        output "WARNING - ${message} |${perfdata}"
        exit $WARNING
    fi

    output "OK - ${message} |${perfdata}"
    exit $OK

}

That’s it. This is a plugin that could be successfully used. Have fun!
