Friday, September 25, 2009
NTP event-handler in Nagios
One of the most overlooked, but very powerful, features of Nagios is its ability to use event handlers (Nagios Docs). An event handler allows a script to be executed based on changes to the service/host state. The possibilities for this feature are endless. Usually when Nagios alerts on a changed state (warning/critical), an administrator is emailed/paged/tweeted in response. That administrator then logs in, restarts services, checks the network connection, finds out why disk space went critical, and so on. Event handlers can do this and so much more!
I have written numerous event-handler scripts that create a 'self-healing' environment for different servers using event handlers and NRPE. Some of the event handlers that I use in Nagios are the following:
*Disk Space Critical - When Nagios alerts on warning/critical disk space, NRPE executes a custom du-scan.sh script that sorts all of the data on the mount point by amount used (largest first), writes it to a log file in the /tmp directory, and emails the location of the log to administrators.
*CPU Load Critical - When Nagios alerts on a warning/critical CPU load, whether on Linux or Windows, a script is executed (Bash on Linux, VB on Windows) that emails administrators the top 5 CPU-consuming processes on the server.
*NTP (Network Time Protocol) time sync Critical - Sometimes when CPU load goes critical, the clock on a Linux machine goes WAY out of sync (over 1000 seconds), causing the NTP daemon to crash. When the time is off on the server, the various services we use report different times, causing even more issues. The fix is to have Nagios restart the ntp service on the remote server via NRPE.
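To give a feel for the disk-space handler, here is a minimal sketch of what a du-scan style helper could look like. This is not the actual du-scan.sh; the function name `du_scan`, the default of 20 entries, and the log path in the usage comment are all assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical sketch of a du-scan style helper (not the original du-scan.sh).
# Lists the largest directories under a mount point, biggest first.
du_scan() {
    local mount_point="$1"      # directory to scan, e.g. /var
    local top_n="${2:-20}"      # how many entries to keep (assumed default)
    # -x stays on one filesystem, -k reports sizes in kilobytes
    du -xk "$mount_point" 2>/dev/null | sort -rn | head -n "$top_n"
}

# Example use (log file name is an assumption):
# du_scan /var 20 > /tmp/du-scan.log
```

The report can then be piped to mailx the same way the NTP handler further down mails its ntpq output.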
This little blog entry will detail how to setup an NTP event-handler for Nagios. You can use this as a base for just about anything else event-handler related.
This is written with the assumption that the reader is already familiar with Nagios and has a Nagios server using NRPE for Linux clients, with nagios-plugins installed on the client. This isn't a basic 'nagios howto' at all.
First things first, setting up and configuring the remote client for allowing command arguments.
1. Navigate to your nrpe source directory (in this case: /root/Download/nrpe2-12/)
2. Reconfigure nrpe for command arguments
./configure --enable-command-args && make && make install
3. Modify the nrpe.cfg file
a) change: dont_blame_nrpe=0 to: dont_blame_nrpe=1
b) add the following command definitions:
## /usr/local/.... is the path to the check_ntp plugin that comes with nagios plugins, change 0.pool.ntp.org to the ntp server that your organization uses to get ntp data. Warning at 10 seconds, critical at 20 seconds.
command[check_ntp]=/usr/local/nagios/libexec/check_ntp -H 0.pool.ntp.org -w 10 -c 20
## /usr/local/.... is the path to the event-ntp handler, as seen below.
command[event-ntp]=/usr/local/nagios/libexec/event-ntp $ARG1$ $ARG2$ $ARG3$
4. Create a file called event-ntp in /usr/local/nagios/libexec/, owned by nagios:nagios, and set it executable.
5. Drop this code into the event-ntp file:
#!/bin/bash
## This is an event handler that will be executed on Warnings and Critical alerts.
## On Warnings, an ntp query will be issued, and the email will be sent to the specified admin
## On Critical, an ntp query will be issued, and the ntpd service will be restarted to re-sync the clocks
case "$1" in
    OK)
        ;;
    WARNING)
        echo -e "Running NTP Query" "\n"
        ntpq -p | mailx -s "HOSTNAME - NTP Query" adminacct@example.com
        ;;
    UNKNOWN)
        ;;
    CRITICAL)
        case "$2" in
            SOFT)
                case "$3" in
                    3)
                        echo -e "Running NTP Query & Restarting NTP Service" "\n"
                        ntpq -p | mailx -s "HOSTNAME - NTP Query - Restarted NTPD" adminacct@example.com && /usr/bin/sudo /sbin/service ntpd restart
                        ;;
                esac
                ;;
            HARD)
                echo -e "Running NTP Query & Restarting NTP Service" "\n"
                ntpq -p | mailx -s "HOSTNAME - NTP Query - Restarted NTPD" adminacct@example.com && /usr/bin/sudo /sbin/service ntpd restart
                ;;
        esac
        ;;
esac
exit 0
Be sure to change HOSTNAME to the client's hostname and adminacct@example.com to the admin account that you want the email to go to.
Now, the tricky part, and probably not the most secure way to do this, is to modify the sudoers file to allow the nagios user to execute system commands. I'm sure there is a more 'secure' way of doing this, but this works for me.
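The nested case statements can be hard to follow at a glance, so here is a small sketch that only reports which branch event-ntp would take for a given state, state type, and attempt number. The function name `ntp_action` and the words it prints are made up for illustration; it sends no mail and restarts nothing.

```shell
#!/bin/bash
# Sketch: print the action event-ntp would take, without doing anything.
# Arguments mirror the handler: $1 = state, $2 = state type, $3 = attempt.
ntp_action() {
    case "$1" in
        WARNING)
            echo "query" ;;                 # mail ntpq output only
        CRITICAL)
            case "$2" in
                SOFT)
                    # only the third soft attempt triggers a restart
                    if [ "$3" = "3" ]; then echo "restart"; else echo "none"; fi ;;
                HARD)
                    echo "restart" ;;
                *)
                    echo "none" ;;
            esac ;;
        *)
            echo "none" ;;                  # OK and UNKNOWN do nothing
    esac
}

ntp_action WARNING SOFT 1
ntp_action CRITICAL SOFT 2
ntp_action CRITICAL SOFT 3
ntp_action CRITICAL HARD 4
```

The four calls print query, none, restart, and restart, one per line.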
1. visudo
2. add the following:
User_Alias NAGIOS = nagios,nagcmd
Cmnd_Alias NAGIOSCOMMANDS = /sbin/service
Defaults:NAGIOS !requiretty
NAGIOS ALL=(ALL) NOPASSWD: NAGIOSCOMMANDS
Be sure to restart the nrpe client after all this has been accomplished. Now to move onto the server end of things.
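Before restarting NRPE, the nrpe.cfg changes from step 3 can be sanity-checked with a few greps. This is only a sketch: the function name `check_nrpe_cfg` is made up, and the config path in the usage comment is an assumption that depends on how NRPE was installed.

```shell
#!/bin/bash
# Sketch: confirm that an nrpe.cfg enables command arguments and defines
# the two commands used in this post. The config path is passed in.
check_nrpe_cfg() {
    local cfg="$1" name
    grep -Eq '^[[:space:]]*dont_blame_nrpe[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$cfg" \
        || { echo "dont_blame_nrpe is not set to 1"; return 1; }
    for name in check_ntp event-ntp; do
        # -F matches the brackets literally instead of as a regex class
        grep -Fq "command[$name]=" "$cfg" \
            || { echo "missing command[$name]"; return 1; }
    done
    echo "nrpe.cfg looks ready"
}

# Example use (path is an assumption):
# check_nrpe_cfg /usr/local/nagios/etc/nrpe.cfg
```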
First thing you need to do is create an event-ntp command in the commands.cfg file:
define command{
command_name event-ntp
command_line /usr/local/nagios/libexec/check_nrpe -H $HOSTNAME$ -c event-ntp -a $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
This will be called by the event-handler object in your configuration file.
Now, modify your service definition wherever you configure your service/host definitions. In my case I have a separate configuration file called linux.cfg.
define service{
use YOUR-service-TEMPLATE
host_name HOSTNAME-HERE
service_description Time Sync Check
event_handler event-ntp
check_command check_nrpe!check_ntp
}
Now restart nagios (service nagios restart), and test the configuration from the server end:
/usr/local/nagios/libexec/check_nrpe -H REMOTEHOSTNAME -c event-ntp -a CRITICAL HARD
If all goes well, you should receive an email from your client with an output from ntpq -p, and an ntpd service restart.
If you have any problems, such as not receiving email or the script not executing, set debug=1 in nrpe.cfg, restart nrpe, execute the above event-ntp test, and check your logs.
As you can see, it's not too difficult to execute event-handler scripts, and it saves administrators time when Nagios can do the leg-work on service/host critical alerts. This example 'self-heals' the ntpd service, but it can be modified to just report data when there is a problem. Any time Nagios can automate a fix or a test before administrators get to the machine, it shaves time off troubleshooting a problem.
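The single test above only exercises the CRITICAL HARD branch. A rough sketch for walking through the other branches from the server is below. The `DRY_RUN` switch and the function name `test_event_ntp` are made up for this example; by default it only prints the commands, because running them for real will mail the admin account and restart ntpd on the client.

```shell
#!/bin/bash
# Sketch: exercise each event-handler branch from the Nagios server.
CHECK_NRPE="${CHECK_NRPE:-/usr/local/nagios/libexec/check_nrpe}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 prints the commands, 0 runs them

test_event_ntp() {
    local host="$1" args
    for args in "WARNING SOFT 1" "CRITICAL SOFT 3" "CRITICAL HARD 4"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "$CHECK_NRPE -H $host -c event-ntp -a $args"
        else
            # $args is left unquoted on purpose so it splits into three arguments
            "$CHECK_NRPE" -H "$host" -c event-ntp -a $args
        fi
    done
}

test_event_ntp REMOTEHOSTNAME
```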
Special Thanks to keith4 on freenode.net's #nagios channel for catching a syntax error for my $HOSTNAME$ argument in my commands.cfg file. Thus saving me many hours of hair pulling and name calling.
Monday, September 21, 2009
OpenSimulator
It looks like virtual world environments are the next big thing (and they still are, 8+ years after SecondLife came out). More and more I read articles online about how big companies such as IBM, Sun and HP are using SecondLife (and their own internal virtual world servers) to host conferences, webcasts, and training sessions for their staff and the public.
As a matter of fact, companies are not only using virtual worlds to host various public functions, but also creating 'Virtual Data Centers'. Case in point, IBM's Virtual Data Center. IBM has special software that allows systems administrators to monitor and administer their data centers in a virtual environment.
So, while surfing around the internet for VirtualWorld servers, I came across OpenSimulator. You can create a stand-alone server, or attach it to other virtualworlds through osgrid.org. As the developers do their testing in Ubuntu, I downloaded and installed the latest Ubuntu Server and started up OpenSimulator. The wiki page located on their website is pretty self-explanatory, walking you through step-by-step on setting up your own virtual world.
The one problem that I did come across was using either the Hippo Viewer or the SecondLife client to connect to the instance. Both had my avatar showing as a 'ghost'. I found that other people had the same problem, and the solution was to create hair for your avatar, and the ghost would go away. Strange quirk, but it worked!
I wouldn't have been able to get this setup and troubleshooted without the excellent help from freenode.net's #opensim channel. They were a big help, even at 2200hrs MST.
Anyway, now that I have my opensim server running in Ubuntu, on a virtualbox; it's time to create some land!
Until next time.....
Tuesday, September 15, 2009
Linux Screen template
This time, the post is about Linux Screen templates. With more and more Linux distros going GUI and trying to get away from the CLI, the screen program has been mostly overlooked. As I have been working in and around Linux for the last 12 years or so, I have become quite fond of the screen application.
First things first: man screen
Everything you need to know about custom .screenrc files is in the man page, though many people don't realize this.
To create the .screenrc that I use every day:
touch /home/username/.screenrc (or /root/.screenrc for root's screen)
vi .screenrc
hardstatus alwayslastline "%{=b}%{G} Screen(s): %{b}%w %=%{kG}%C%A %D, %M/%d/%Y"
startup_message off
msgwait 1
Save and quit the editor, and fire up screen. This will show all of the screens that you have open (Name each one you create with: Ctrl-a A), and puts in a date/time stamp in the lower right hand corner. I put this in because I get so busy that I hardly look up at the clock and end up missing lunch. This way I can easily look over to the right to check my time :-)
As I am using a GIANT monitor, my screen capture of what my screen session looks like here in Blogger, seems to be all wacky. You should be able to click on the image to view it.

As I said, I live by screen for server work. I can create multiple screens, split them horizontally or vertically within one main screen, detach a screen when I go home, VPN in, and re-attach that screen from another location. And because I spend most of my time in a console, having a custom screenrc file just makes sense.
Tuesday, February 10, 2009
a free alternative to wireshark's pilot
I am using the tcptrace program to read the capture files from wireshark, and Ploticus to turn the data into a graph. tcptrace can create graphs, but not of TCP ports/percentages. So I'm in the process of whipping up a bash/awk script that takes the output from tcptrace's port information dump, cleans it up, and drops it into a file that Ploticus can read, in megabytes.
The really quick and dirty way to get the top 10 TCP port usage in bytes is as follows:
tcptrace -xtraffic <.cap file>
# this outputs a file called traffic_byport.dat
sort -nr -k 4 traffic_byport.dat | awk 'NR==2,NR==11' > TCPtop10
#this numerically & reverse sorts column 4 of traffic_byport.dat (the bytes data),
# then it prints out lines 2-11 (line 1 has title data, don't need it for the script)
# after awking, it prints out the top 10 TCP port usage, in bytes.
I have also spit out a very rough and dirty way to transform the bytes to megabytes with two decimal places in order to graph the data properly. But I'm going to look into a better way to merge columns in awk before I post anything.
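For what it's worth, one possible way to do the bytes-to-megabytes step is awk's printf. The sketch below is not the script mentioned above. It assumes whitespace-separated columns, a one-line header, a port label in column 1 and bytes in column 4, and the sample data is invented purely for the demo. It also strips the header before sorting, so a title line cannot end up in the wrong place after the reverse sort.

```shell
#!/bin/bash
# Sketch: top N rows by bytes (column 4), with bytes converted to megabytes.
# Assumes: whitespace-separated columns, one header line, bytes in column 4.
top_ports_mb() {
    local file="$1" top_n="${2:-10}"
    tail -n +2 "$file" \
        | sort -k4,4nr \
        | head -n "$top_n" \
        | awk '{ printf "%s %.2f\n", $1, $4 / (1024 * 1024) }'
}

# Demo with invented sample data (hypothetical layout, not real tcptrace output)
sample=$(mktemp)
cat > "$sample" <<'EOF'
port conns pkts bytes
80 12 3400 5242880
22 3 900 1048576
443 7 2100 3145728
EOF
top_ports_mb "$sample" 10
rm -f "$sample"
```

With the sample data this prints port 80 at 5.00, port 443 at 3.00, and port 22 at 1.00, in that order.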
Debugging a shell script
sh -x script.sh
The -x switch echoes each command in the script as it is executed, so any error message is printed right after the line that caused it. VERY VERY useful! Instead of writing 50+ lines of script and spending all day debugging it, -x does the work for me. I have used strace
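A quick way to see what the trace looks like is to run it on a throwaway script. The two-line demo script below is made up for illustration.

```shell
#!/bin/bash
# Sketch: run a tiny throwaway script under sh -x and capture the trace.
demo=$(mktemp)
cat > "$demo" <<'EOF'
name="world"
echo "hello $name"
EOF

# The trace goes to stderr, so merge it into stdout to capture both
trace=$(sh -x "$demo" 2>&1)
printf '%s\n' "$trace"
rm -f "$demo"
```

Each traced command is prefixed with a plus sign, followed by the command's normal output. For a long script, putting set -x and set +x around just the suspicious section keeps the trace short.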
Monday, February 9, 2009
Print queue check script
Needless to say, I wrote the script anyway. I'm sure I can pilfer bits and pieces of it for another quick script later.
#!/bin/bash
### Removes Print jobs from the print queue
### Command:
# Query printer for print queue, then drop the numeric job IDs into the jobid file
# (the numeric test skips the status and header lines)
lpq -P "$1" | awk '$3 ~ /^[0-9]+$/ {print $3}' > jobid
# remove all print jobs from specified printer
cat jobid | xargs lprm -P "$1"
# Query printer queue again and let admin know that the jobs have been removed
lpq -P "$1"
echo 'Print jobs removed'
# remove temp file
rm -f jobid
Thursday, February 5, 2009
nagiosgraph & windows clients
So, after 3 days of searching just about everything on the web for nagiosgraph and Windows server map files, I finally found a website that pointed me in the right direction.
http://nerhood.wordpress.com/2004/09/22/nagiosgraph-with-windows-support/
As you can see, the article is over 4 years old, yet I couldn't find anything else on the web covering nagiosgraph and nsclient++. So I will post parts of my nagiosgraph map file in case someone else comes across this blog looking for nagiosgraph and nsclient++ integration.
By the way, it's AWESOME! Nagiosgraph has already caught a few problems that we had suspected, and provides a visual tool for sys admins looking back at historical data.
/nagiosgraph/map
# Service type: memory
# check command: check_nt -H Address -v MEMUSE -w 50 -c 90
# output: Memory usage: total:2467.75 Mb - used: 510.38 Mb (21%) - free: 1957.37 Mb (79%)
/perfdata:Memory usage=([.0-9]+)Mb;([.0-9]+);([.0-9]+);([.0-9]+);([.0-9]+)/
and push @s, [ntmem,
[memused, GAUGE, $1*1024**2 ]
];
# Service type: ntload
# Check command: check_nt -H Address -v CPULOAD -l1,70,90,5,70,90,30,70,90
# output: CPU Load 9% (5 min average) 11% (30 min average)
#perfdata: '5 min avg Load'=9%;70;80;0;100 '30 min avg Load'=11%;70;90;0;100
/output:.*?(\d+)% .*?(\d+)% /
and push @s, [ ntload,
[ avg05min, GAUGE, $1 ],
[avg30min, GAUGE, $2 ] ];
# Service type: ntdisk
# check command: check_nt -H Address -v USEDDISKSPACE -lc -w 75 -c 90
# output: c:\ - total: 25.87 Gb - used: 4.10 Gb (16%) - free 21.77 Gb (84%)
# perfdata: c:\ Used Space=4.10Gb;19.40;23.28;0.00;25.87
/perfdata:.*Space=([.0-9]+)Gb;([.0-9]+);([.0-9]+);([.0-9]+);([.0-9]+)/
and push @s, [ ntdisk,
[ diskused, GAUGE, $1*1024**3 ],
[ diskwarn, GAUGE, $2*1024**3 ],
[ diskcrit, GAUGE, $3*1024**3 ],
[ diskmaxi, GAUGE, $5*1024**3 ] ];
Alas! Blogger seems to put a .5 space between the lines of code. Oh well, at least one can tell where the code begins and ends. Once the map file has been populated, you can check your syntax with:
perl -c map
The output should be: map syntax OK. From there, .rrd files should start generating in each host's directory under /rrd (or wherever one has set up their rrd directory).
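perl -c only checks syntax, so it won't tell you whether a pattern actually matches your plugin output. One rough way to try a pattern from the shell before putting it into the map file is sed with extended regular expressions, which handle these simple character classes the same way Perl does. The sketch below uses the ntdisk perfdata sample from the comments above.

```shell
#!/bin/bash
# Sketch: try the ntdisk pattern against the sample perfdata line.
# Single quotes keep the backslash in c:\ literal.
line='perfdata:c:\ Used Space=4.10Gb;19.40;23.28;0.00;25.87'

# Print capture groups 1, 2, 3 and 5 (used, warn, crit, max), as the map entry does
fields=$(printf '%s\n' "$line" \
    | sed -nE 's/.*Space=([.0-9]+)Gb;([.0-9]+);([.0-9]+);([.0-9]+);([.0-9]+).*/\1 \2 \3 \5/p')

echo "used warn crit max (Gb): $fields"
```

If the pattern does not match, sed prints nothing and the field list comes back empty, which is a quick hint that the regex needs another look.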