Nagios, "Nagios Ain't Gonna Insist On Sainthood"... What an awesome tool for systems administration. I have been using Nagios since the days of NetSaint, back when it was first released in 1999. In ten years the program has matured and become a de facto standard in network and systems monitoring. In recent years there have been many spin-offs of Nagios, but I have chosen to stick with the core package.
One of the most overlooked, but very powerful, features of Nagios is its ability to use event handlers (Nagios Docs). An event handler allows a script to be executed when a service or host changes state. The possibilities are endless. Usually when Nagios alerts on a changed state (warning/critical), an administrator is emailed/paged/tweeted in response. That administrator then logs in, restarts services, checks the network connection, finds out why disk space went critical, and so on. Event handlers can do this and so much more!
I have written numerous event-handler scripts that create a 'self-healing' environment for different servers using event handlers and NRPE. Some of the event handlers that I use in Nagios are the following:
*Disk Space Critical - When Nagios alerts on warning/critical disk space, NRPE executes a custom du-scan.sh script that sorts the data on the mount point by amount used (largest first), writes it to a log file in the /tmp directory, and emails the location of the log to administrators
*CPU Load Critical - When Nagios alerts on a warning/critical CPU load, whether on Linux or Windows, a script is executed (Bash on Linux, VB on Windows) that emails the administrators the top 5 CPU-consuming processes on the server
*NTP (Network Time Protocol) time sync Critical - Sometimes when CPU load goes critical, the NTP service running on a Linux machine goes WAY out of sync (over 1000 seconds), causing the NTP daemon to crash. When the time is off on the server, the various services we use report different times, causing even more issues. The fix is to have Nagios restart the ntp service on the remote server via NRPE.
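To give a rough idea of what a disk-usage scan like the one in the first bullet might look like, here is a minimal sketch. It is not the actual du-scan.sh: the function name du_scan and the default log path are made up for illustration.

```shell
#!/bin/bash
# Hypothetical sketch of a disk-usage scan; NOT the author's du-scan.sh.
# Usage: du_scan MOUNT_POINT [LOG_FILE]
# Writes per-directory usage in KiB, largest first, to a log file
# and prints the path of that log file.
du_scan() {
    local mount_point="$1"
    local log_file="${2:-/tmp/du-scan.$$.log}"   # assumed default location
    # -x stays on one filesystem, -k reports sizes in KiB
    du -xk "$mount_point" 2>/dev/null | sort -rn > "$log_file"
    echo "$log_file"
}
```

The printed path could then be mailed the same way the NTP handler below sends its report, for example: echo "Disk usage log: $(du_scan /var)" | mailx -s "HOSTNAME - Disk Space" adminacct@example.com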
This little blog entry details how to set up an NTP event handler for Nagios. You can use it as a base for just about anything else event-handler related.
This is written on the assumption that you are already familiar with Nagios and have a Nagios server using NRPE for Linux clients, with nagios-plugins installed on the client. This isn't a basic 'Nagios howto' at all.
First things first: set up and configure the remote client to allow command arguments.
1. Navigate to your nrpe source directory (in this case: /root/Download/nrpe2-12/)
2. Reconfigure nrpe for command arguments
./configure --enable-command-args && make && make install
3. Modify the nrpe.cfg file
a) change: dont_blame_nrpe=0 to: dont_blame_nrpe=1
b) add the following command definitions:
## /usr/local/.... is the path to the check_ntp plugin that comes with nagios-plugins. Change 0.pool.ntp.org to the NTP server that your organization uses. Warning at 10 seconds of offset, critical at 20 seconds.
command[check_ntp]=/usr/local/nagios/libexec/check_ntp -H 0.pool.ntp.org -w 10 -c 20
## /usr/local/.... is the path to the event-ntp handler, as seen below.
command[event-ntp]=/usr/local/nagios/libexec/event-ntp $ARG1$ $ARG2$ $ARG3$
4. Create a file called event-ntp in /usr/local/nagios/libexec/ (the path used in nrpe.cfg above), owned by nagios:nagios, and make it executable.
5. Drop this code into the event-ntp file:
#!/bin/bash
## This is an event handler that will be executed on Warning and Critical alerts.
## Arguments: $1 = service state, $2 = state type (SOFT/HARD), $3 = check attempt number
## On Warning, an ntp query will be issued and emailed to the specified admin
## On Critical (third SOFT attempt, or HARD), an ntp query will be emailed and the ntpd service restarted to re-sync the clocks
case "$1" in
    OK)
        ;;
    WARNING)
        echo -e "Running NTP Query" "\n"
        ntpq -p | mailx -s "HOSTNAME - NTP Query" adminacct@example.com
        ;;
    UNKNOWN)
        ;;
    CRITICAL)
        case "$2" in
            SOFT)
                case "$3" in
                    3)
                        echo -e "Running NTP Query & Restarting NTP Service" "\n"
                        ntpq -p | mailx -s "HOSTNAME - NTP Query - Restarted NTPD" adminacct@example.com
                        ## Restart even if the mail could not be sent
                        /usr/bin/sudo /sbin/service ntpd restart
                        ;;
                esac
                ;;
            HARD)
                echo -e "Running NTP Query & Restarting NTP Service" "\n"
                ntpq -p | mailx -s "HOSTNAME - NTP Query - Restarted NTPD" adminacct@example.com
                ## Restart even if the mail could not be sent
                /usr/bin/sudo /sbin/service ntpd restart
                ;;
        esac
        ;;
esac
exit 0
Be sure to change HOSTNAME to the client's hostname and adminacct@example.com to the admin account that you want the email to go to.
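If editing HOSTNAME by hand on every client gets tedious, one option is to build the mail subject at run time instead. This is only a sketch: the helper name ntp_subject is hypothetical and not part of the script above, and it assumes the node name reported by uname -n is what you want in the subject.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): prefix a mail
# subject with this machine's node name so HOSTNAME need not be hard-coded.
ntp_subject() {
    printf '%s - %s\n' "$(uname -n)" "$1"
}

# Example of use inside the handler (commented out so nothing is mailed here):
# ntpq -p | mailx -s "$(ntp_subject 'NTP Query')" adminacct@example.com
```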
Now the tricky part, and probably not the most secure way to do this: modify the sudoers file to allow the nagios user to execute system commands. I'm sure there is a more 'secure' way of doing this, but this works for me.
1. visudo
2. add the following:
User_Alias NAGIOS = nagios,nagcmd
Cmnd_Alias NAGIOSCOMMANDS = /sbin/service
Defaults:NAGIOS !requiretty
NAGIOS ALL=(ALL) NOPASSWD: NAGIOSCOMMANDS
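One possible way to tighten this, offered as a sketch rather than a tested recommendation: sudoers allows a command alias to include arguments, in which case only that exact invocation is permitted. Assuming the handler only ever needs to run the restart shown in the script above, the Cmnd_Alias line could be narrowed to:

```
Cmnd_Alias NAGIOSCOMMANDS = /sbin/service ntpd restart
```

With this in place the nagios user could no longer use sudo to start or stop any other service.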
Be sure to restart the nrpe daemon on the client after all this has been done. Now on to the server end of things.
First thing you need to do is create an event-ntp command in the commands.cfg file:
define command{
command_name event-ntp
command_line /usr/local/nagios/libexec/check_nrpe -H $HOSTNAME$ -c event-ntp -a $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
This will be called by the event_handler directive in your service definition.
Now modify your service definition wherever you configure your services/hosts. In my case I have a separate configuration file called linux.cfg.
define service{
use YOUR-service-TEMPLATE
host_name HOSTNAME-HERE
service_description Time Sync Check
event_handler event-ntp
check_command check_nrpe!check_ntp
}
Now restart Nagios (service nagios restart), and test the handler from the server end:
/usr/local/nagios/libexec/check_nrpe -H REMOTEHOSTNAME -c event-ntp -a CRITICAL HARD
If all goes well, you should receive an email from your client with the output of ntpq -p, and the ntpd service should be restarted.
If you have any problems (no email received, or the script not executing), set debug=1 in nrpe.cfg, restart nrpe, run the event-ntp test above, and check your logs.
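When debugging, it can also help to check the branching logic on its own, without NRPE, mail or sudo in the way. The following is a rough sketch: the function name ntp_action is hypothetical, and it only prints which action the handler's case statement would take for a given state, state type and attempt number.

```shell
#!/bin/bash
# Hypothetical dry-run helper: mirrors the handler's case statement but
# only prints the action (none / report / restart) instead of performing it.
ntp_action() {
    local state="$1" type="$2" attempt="$3"
    case "$state" in
        WARNING)
            echo "report" ;;
        CRITICAL)
            if [ "$type" = "HARD" ]; then
                echo "restart"
            elif [ "$type" = "SOFT" ] && [ "$attempt" = "3" ]; then
                echo "restart"
            else
                echo "none"
            fi
            ;;
        *)
            # OK, UNKNOWN and anything unexpected: do nothing
            echo "none" ;;
    esac
}

# Walk through a few state combinations and show the resulting action
for args in "OK HARD 1" "WARNING SOFT 1" "CRITICAL SOFT 1" "CRITICAL SOFT 3" "CRITICAL HARD 3"; do
    # $args is deliberately unquoted so it splits into three arguments
    echo "$args -> $(ntp_action $args)"
done
```

Note that the CRITICAL HARD test shown earlier passes no attempt number, which is fine because the HARD branch never looks at the third argument.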
As you can see, it's not too difficult to set up event-handler scripts, and it saves administrators time when Nagios can do the legwork on service/host critical alerts. This example 'self-heals' the ntpd service, but it can be modified to just report data when there is a problem. Any time Nagios can run automated fixes or diagnostics before administrators get to the machine, it shaves time off troubleshooting.
Special thanks to keith4 on freenode.net's #nagios channel for catching a syntax error in the $HOSTNAME$ argument in my commands.cfg file, thus saving me many hours of hair pulling and name calling.