Ramblings of a Techie

Friday, September 25, 2009

NTP event-handler in Nagios

Nagios, "Nagios Ain't Gonna Insist On Sainthood"... What an awesome tool for systems administration. I have been using Nagios since the days of NetSaint, back when it was first released in 1999. In ten years, the program has matured and become a de facto standard in network and systems monitoring. In recent years there have been many spin-offs of Nagios, but I have chosen to stick with the core package.

One of the most overlooked, but very powerful, features of Nagios is its ability to use event handlers (Nagios Docs). An event handler allows a script to be executed when the state of a service or host changes. The possibilities are endless. Usually when Nagios alerts on a changed system state (warning/critical), an administrator is emailed/paged/tweeted in response. That administrator then logs in, restarts services, checks the network connection, finds out why disk space went critical, and so on. Event handlers can do this and so much more!

I have written numerous event-handler scripts that create a 'self-healing' environment for different servers using event handlers and NRPE. Some of the event handlers that I use in Nagios are the following:

*Disk Space Critical - When Nagios alerts on warning/critical disk space, NRPE executes a custom du-scan.sh script that sorts everything on the mount point by the amount of space used (largest first), writes the result to a log file in the /tmp directory and emails the location of the log to administrators

*CPU Load Critical - When Nagios alerts on warning/critical CPU load, whether it's on Linux or Windows, a script is executed (Bash on Linux, VB on Windows) that emails administrators the top 5 CPU-consuming processes on the server

*NTP (Network Time Protocol) time sync Critical - Sometimes when CPU load goes critical, the NTP service running on a Linux machine goes WAY out of sync (over 1000 seconds), causing the NTP daemon to crash. When the time is off on the server, the various services we use report different times, causing even more issues. The fix is to have Nagios restart the ntp service on the remote server via NRPE.
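
The du-scan.sh script itself is not shown in this post. As a rough idea of what the disk space handler in the list above does, here is a minimal sketch. The function name largest_dirs and its arguments are made up for illustration, and the email step is left out.

```shell
#!/bin/bash
# Hypothetical sketch of a du-scan style report; this is NOT the author's du-scan.sh.
# largest_dirs MOUNT [COUNT]: print the COUNT largest first-level entries
# under MOUNT, biggest first, with sizes in kilobytes.
largest_dirs() {
    local mount_point="$1" count="${2:-10}"
    # -k: sizes in KB, -x: stay on this filesystem, -s: one total per argument
    du -kxs "$mount_point"/* 2>/dev/null | sort -nr | head -n "$count"
}

# Example: write the report to a log under /tmp, as described above.
# largest_dirs /var 20 > /tmp/du-scan.log
```

Sorting on the KB column keeps the comparison purely numeric, which is why the sketch uses -k rather than human-readable sizes.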

This little blog entry will detail how to set up an NTP event handler for Nagios. You can use this as a base for just about anything else event-handler related.

This is written with the assumption that the reader is already familiar with Nagios and has a Nagios server using NRPE for Linux clients, with nagios-plugins installed on the client. This isn't a basic 'nagios howto' at all.

First things first, setting up and configuring the remote client for allowing command arguments.

1. Navigate to your nrpe source directory (in this case: /root/Download/nrpe2-12/)
2. Reconfigure nrpe for command arguments
./configure --enable-command-args && make && make install
3. Modify the nrpe.cfg file
a) change: dont_blame_nrpe=0 to: dont_blame_nrpe=1
b) add the following command definitions:
## /usr/local/.... is the path to the check_ntp plugin that comes with nagios-plugins. Change 0.pool.ntp.org to the NTP server that your organization uses. Warning at 10 seconds, critical at 20 seconds.

command[check_ntp]=/usr/local/nagios/libexec/check_ntp -H 0.pool.ntp.org -w 10 -c 20

## /usr/local/.... is the path to the event-ntp handler, as seen below.

command[event-ntp]=/usr/local/nagios/libexec/event-ntp $ARG1$ $ARG2$ $ARG3$

4. Create a file called event-ntp in /usr/local/nagios/libexec, owned by nagios:nagios (user:group), and make it executable.
5. Drop this code into the event-ntp file:

#!/bin/bash

## This is an event handler that will be executed on Warning and Critical alerts.
## On Warning, an ntp query is issued and the output is emailed to the specified admin.
## On Critical (third SOFT attempt, or HARD state), an ntp query is issued and emailed,
## and the ntpd service is restarted to re-sync the clock.
## $1 = service state, $2 = state type (SOFT/HARD), $3 = check attempt number

case "$1" in
OK)
    ;;
WARNING)
    echo -e "Running NTP Query" "\n"
    ntpq -p | mailx -s "HOSTNAME - NTP Query" adminacct@example.com
    ;;
UNKNOWN)
    ;;
CRITICAL)
    case "$2" in
    SOFT)
        case "$3" in
        3)
            echo -e "Running NTP Query & Restarting NTP Service" "\n"
            ntpq -p | mailx -s "HOSTNAME - NTP Query - Restarted NTPD" adminacct@example.com
            ## Restart on its own line so it still runs if the mail could not be sent
            /usr/bin/sudo /sbin/service ntpd restart
            ;;
        esac
        ;;
    HARD)
        echo -e "Running NTP Query & Restarting NTP Service" "\n"
        ntpq -p | mailx -s "HOSTNAME - NTP Query - Restarted NTPD" adminacct@example.com
        /usr/bin/sudo /sbin/service ntpd restart
        ;;
    esac
    ;;
esac
exit 0

Be sure to change HOSTNAME and adminacct@example.com to the client's hostname and the admin account that you want the email to go to.
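
Before wiring the handler into Nagios, it can help to check which state combinations trigger which action. The following is only a sketch: ntp_action is a made-up helper that mirrors the case logic of the event-ntp script and prints the decision instead of mailing or restarting anything.

```shell
#!/bin/bash
# Sketch only: mirrors the decision logic of event-ntp without side effects.
# ntp_action STATE STATETYPE ATTEMPT -> prints none, notify, or restart
ntp_action() {
    local state="$1" state_type="$2" attempt="$3"
    case "$state" in
        WARNING)
            echo notify ;;
        CRITICAL)
            # Restart on a HARD state, or on the third SOFT attempt
            if [ "$state_type" = HARD ] || { [ "$state_type" = SOFT ] && [ "$attempt" = 3 ]; }; then
                echo restart
            else
                echo none
            fi ;;
        *)
            echo none ;;
    esac
}

ntp_action CRITICAL SOFT 1   # prints: none
ntp_action CRITICAL SOFT 3   # prints: restart
ntp_action WARNING HARD 1    # prints: notify
```

Acting on the third SOFT attempt gives the handler a chance to fix the problem before the state goes HARD and notifications are sent.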

Now for the tricky part, which is probably not the most secure way to do this: modify the sudoers file to allow the nagios user to execute system commands. I'm sure there is a more 'secure' way of doing this (for example, listing only the exact command, such as /sbin/service ntpd restart, in the Cmnd_Alias), but this works for me.

1. visudo
2. add the following:
User_Alias NAGIOS = nagios,nagcmd
Cmnd_Alias NAGIOSCOMMANDS = /sbin/service
Defaults:NAGIOS !requiretty
NAGIOS ALL=(ALL) NOPASSWD: NAGIOSCOMMANDS

Be sure to restart the nrpe client after all this has been accomplished. Now to move onto the server end of things.

First thing you need to do is create an event-ntp command in the commands.cfg file:

define command{
command_name event-ntp
command_line /usr/local/nagios/libexec/check_nrpe -H $HOSTNAME$ -c event-ntp -a $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}

This will be called by the event_handler directive in your service definition.

Now modify your service definition wherever you configure your service/host definitions. In my case I have a separate configuration file called linux.cfg.

define service{
use YOUR-service-TEMPLATE
host_name HOSTNAME-HERE
service_description Time Sync Check
event_handler event-ntp
check_command check_nrpe!check_ntp
}

Now restart nagios (service nagios restart), and test the configuration from the server end:
/usr/local/nagios/libexec/check_nrpe -H REMOTEHOSTNAME -c event-ntp -a CRITICAL HARD

If all goes well, you should receive an email from your client with the output of ntpq -p, and the ntpd service should be restarted.

If you have any problems, such as not receiving the email or the script not executing, set debug=1 in nrpe.cfg, restart nrpe, run the event-ntp test above, and check your logs.

As you can see, it's not too difficult to execute event-handler scripts, and it saves administrators time when Nagios can do the legwork on service/host critical alerts. This example 'self-heals' the ntpd service, but it can be modified to just report data when there is a problem. Any time Nagios can do some automated testing before administrators get to the machine, it shaves time off troubleshooting the problem.

Special thanks to keith4 on freenode.net's #nagios channel for catching a syntax error in the $HOSTNAME$ argument in my commands.cfg file, thus saving me many hours of hair pulling and name calling.

Monday, September 21, 2009

OpenSimulator

Last week I was tasked with attempting to create a 'virtual world' for a training environment: full-on PowerPoint slides, VoIP, etc. in a training-room type of environment. Well, as I have never really played with SecondLife, or any other type of virtual world, I did some digging.

It looks like virtual world environments are the next big thing (and they still are, 8+ years after SecondLife came out). More and more, I read articles online about how big companies such as IBM, Sun and HP are using SecondLife (and their own internal virtual world servers) to host conferences, webcasts, and training sessions for their staff and the public.

As a matter of fact, companies are not only using virtual worlds to host various public functions, but also creating 'Virtual Data Centers'. Case in point: IBM's Virtual Data Center. IBM has special software that allows systems administrators to monitor and administer their data centers in a virtual environment.

So, while surfing around the internet for VirtualWorld servers, I came across OpenSimulator. You can create a stand-alone server, or attach it to other virtualworlds through osgrid.org. As the developers do their testing in Ubuntu, I downloaded and installed the latest Ubuntu Server and started up OpenSimulator. The wiki page located on their website is pretty self-explanatory, walking you through step-by-step on setting up your own virtual world.

The one problem that I did come across was using either the Hippo Viewer or the SecondLife client to connect to the instance. Both showed my avatar as a 'ghost'. I found that other people had this problem, and the solution was to create hair for your avatar, and the ghost would go away. Strange quirk, but it worked!

I wouldn't have been able to get this set up and troubleshot without the excellent help from freenode.net's #opensim channel. They were a big help, even at 2200hrs MST.

Anyway, now that I have my opensim server running in Ubuntu on VirtualBox, it's time to create some land!

Until next time.....

Tuesday, September 15, 2009

Linux Screen template

It's been a while since I have posted anything, but hopefully I will start posting little tidbits of helpful *nix stuff again.

This time, the post is about Linux screen templates. With more and more Linux distros going GUI and trying to get away from the CLI, the screen program has been mostly overlooked. As I have been working in and around Linux for the last 12 years or so, I have become quite fond of the screen application.

First things first: man screen
Everything you need to know about custom .screenrc files is in the man page. Lo and behold, many people don't realize this.

To create the .screenrc that I use every day:

touch /home/username/.screenrc (or /root/.screenrc for root's screen)
vi /home/username/.screenrc

Then add these lines:
hardstatus alwayslastline "%{=b}%{G} Screen(s): %{b}%w %=%{kG}%C%A %D, %M/%d/%Y"
startup_message off
msgwait 1

Save and quit the editor, and fire up screen. This will show all of the screens that you have open (name each one you create with: Ctrl-a A), and puts a date/time stamp in the lower right-hand corner. I put this in because I get so busy that I hardly look up at the clock and end up missing lunch. This way I can easily look over to the right to check my time :-)

As I am using a GIANT monitor, the screen capture of my screen session looks all wacky here in Blogger. You should be able to click on the image to view it.

As I said, I live by screen for server work. I can create multiple screens, split them horizontally or vertically within one main screen, detach a screen when I go home, VPN in and re-attach that screen from another location. And because I spend most of my time in a console, having a custom screenrc file just makes sense.

Tuesday, February 10, 2009

a free alternative to wireshark's pilot

So I have been put in charge of using the wireshark program called pilot in order to mimic results of a network test that we did. But alas! pilot wasn't working for me, and while waiting for tech support to get back to me, I figured I would take matters into my own hands and come up with a freeware alternative.

I am using the tcptrace program to read the log files from wireshark, and Ploticus to pipe the data to a graph. tcptrace can create graphs, but not of TCP ports/percentages. So I'm in the process of whipping up a bash/awk script that takes the output from tcptrace's port information dump, cleans it up and drops it into a file that ploticus can read, in megabytes.

The really quick and dirty way to get the top 10 TCP port usage in bytes is as follows:

tcptrace -xtraffic <.cap file>
# this outputs a file called traffic_byport.dat

sort -nr -k 4 traffic_byport.dat | awk 'NR==2,NR==11' > TCPtop10
#this numerically & reverse sorts column 4 of traffic_byport.dat (the bytes data),
# then it prints out lines 2-11 (line 1 has title data, don't need it for the script)
# after awking, it prints out the top 10 TCP port usage, in bytes.

I have also hacked together a very rough and dirty way to convert the bytes to megabytes with two decimal places in order to graph the data properly. But I'm going to look into a better way to merge columns in awk before I post anything.
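
For the bytes-to-megabytes step, awk's printf can do the division and the rounding in one go. This is only a sketch of the idea, not the script mentioned above: bytes_to_mb is a made-up name, and the two-column "port bytes" input is invented for illustration, so the column numbers will need adjusting for the real traffic_byport.dat.

```shell
#!/bin/bash
# Sketch: convert a byte count to megabytes with two decimal places.
# Assumes a hypothetical two-column input: port number, then bytes.
bytes_to_mb() {
    awk '{ printf "%s %.2f\n", $1, $2 / (1024 * 1024) }'
}

printf '80 5242880\n443 1572864\n' | bytes_to_mb
# prints:
# 80 5.00
# 443 1.50
```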

Debugging a shell script

I have been scripting various 'programs' in bash for nearly 10 years now, so there is not much that I don't know when it comes to bash shell programming. But alas! While looking for some code snippets on how to use calc to add a decimal point to a number, I came across a very nifty tidbit on shell debugging.

sh -x script.sh

The -x switch echoes each command in the script, line by line, as it is executed, so when something fails the error shows up right next to the command that caused it. VERY VERY useful! Instead of writing 50+ lines of script and spending all day debugging it, -x does all my work for me. I have used strace to debug various applications, but I couldn't find anything for actual shell scripts. Freaking AWESOME!
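
To see what the trace looks like without writing a script file first, the same switch works with an inline command. A small sketch follows; trace_demo is just a made-up wrapper name. Trace lines are written to stderr with a leading "+", so the example redirects stderr to stdout to show everything together.

```shell
#!/bin/bash
# Sketch: trace a tiny inline script the same way `sh -x script.sh` would.
# Trace lines go to stderr, prefixed with "+"; normal output stays on stdout.
trace_demo() {
    sh -x -c 'name=world; echo "hello $name"' 2>&1
}

trace_demo
```

Inside a longer script, set -x and set +x turn tracing on and off around just the part you are debugging.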

Monday, February 9, 2009

Print queue check script

So on Thursday I was manually cleaning out the print queue on a CUPS print server, 40+ jobs one at a time, and it came to me! Just whip up a quick sys-admin script that polls the data from a column and just deletes the jobs from there. What I didn't think to do was READ the man page on lprm. Had I read the man page instead of reinventing the wheel, I would have realized that to remove all print jobs from a CUPS queue for a specific printer, you just add a - to the command line for the printer.

Needless to say, I wrote the script anyway. I'm sure I can pilfer bits and pieces of it for another quick script later.

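#!/bin/bash
### Removes print jobs from the print queue
### Usage: pass the printer name as the first argument

# Query printer for print queue, keep only the numeric job IDs (column 3),
# then drop them into the jobid file
lpq -P "$1" | awk '$3 ~ /^[0-9]+$/ {print $3}' > jobid

# remove all listed print jobs from the specified printer
# (-r: do not run lprm at all if the jobid file is empty)
xargs -r lprm -P "$1" < jobid

# Query printer queue again and let admin know that the jobs have been removed
lpq -P "$1"
echo 'Print jobs removed'

# remove temp file
rm -f jobid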

Thursday, February 5, 2009

nagiosgraph & windows clients

About 6 months ago I started using Nagios to monitor 26 servers (mixed OS) with 144 services. I must say, Nagios has saved my butt many times over. Not only do I have it set up for email, but it will also SMS staff if the central network goes down. Well, the other day I came across nagiosgraph (http://sourceforge.net/projects/nagiosgraph/). Nagiosgraph will take the perf-data from Nagios and put it into a graph with rrdtool. Setting up graphs for pings and Linux/Unix servers was pretty straightforward, as those were already in the map file that ships with nagiosgraph. The problem that I had was that I use nsclient++ to monitor the Windows servers, and even though I could get perf-data from the Windows servers, there was no way to get graph data.

So, 3 days into searching just about everything on the web for nagiosgraph and Windows server map files, I finally found a website that guided me in the right direction.

http://nerhood.wordpress.com/2004/09/22/nagiosgraph-with-windows-support/

As you can see, the article is over 4 years old, and yet I couldn't find anything else on the web about nagiosgraph and nsclient++. So I will post parts of my nagiosgraph/map file in case someone else comes across this blog looking for nagiosgraph and nsclient++ integration.

By the way, it's AWESOME! Nagiosgraph has already caught a few problems that we had suspected, and provides a visual tool for sys admins looking back at historical data.

/nagiosgraph/map

# Service type: memory

# check command: check_nt -H Address -v MEMUSE -w 50 -c 90

# output: Memory usage: total:2467.75 Mb - used: 510.38 Mb (21%) - free: 1957.37 Mb (79%)

/perfdata:Memory usage=([.0-9]+)Mb;([.0-9]+);([.0-9]+);([.0-9]+);([.0-9]+)/

and push @s, [ntmem,

[memused, GAUGE, $1*1024**2 ]

];

# Service type: ntload

# Check command: check_nt -H Address -v CPULOAD -l1,70,90,5,70,90,30,70,90

# output: CPU Load 9% (5 min average) 11% (30 min average)

#perfdata: '5 min avg Load'=9%;70;80;0;100 '30 min avg Load'=11%;70;90;0;100

/output:.*?(\d+)% .*?(\d+)% /

and push @s, [ ntload,

[ avg05min, GAUGE, $1 ],

[avg30min, GAUGE, $2 ] ];

# Service type: ntdisk

# check command: check_nt -H Address -v USEDDISKSPACE -lc -w 75 -c 90

# output: c:\ - total: 25.87 Gb - used: 4.10 Gb (16%) - free 21.77 Gb (84%)

# perfdata: c:\ Used Space=4.10Gb;19.40;23.28;0.00;25.87

/perfdata:.*Space=([.0-9]+)Gb;([.0-9]+);([.0-9]+);([.0-9]+);([.0-9]+)/

and push @s, [ ntdisk,

[ diskused, GAUGE, $1*1024**3 ],

[ diskwarn, GAUGE, $2*1024**3 ],

[ diskcrit, GAUGE, $3*1024**3 ],

[ diskmaxi, GAUGE, $5*1024**3 ] ];

Alas! Blogger seems to put a .5 space between the lines of code. Oh well, at least one can tell where the code begins and ends. Once the map file has been populated, you can check your syntax with:

perl -c map

The output should be: map syntax OK. From there, .rrd files should start generating for each host under /rrd (or wherever you have set up your rrd directory).
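
perl -c only checks the syntax of the map file, not whether a pattern actually matches your perfdata. As a quick shell-side sanity check, the ntdisk pattern can be tried against the sample perfdata line with sed -E. This is only a sketch: parse_ntdisk is a made-up helper, the "perfdata:" prefix on the test line is assumed from the way the pattern is written, and nagiosgraph itself evaluates the map as Perl, so other patterns (such as the ntload one with \d) will not carry over to sed unchanged.

```shell
#!/bin/bash
# Sketch: try the ntdisk pattern from the map file against a sample line.
# Prints the captured groups that the map rule would use ($1, $2, $3, $5).
parse_ntdisk() {
    sed -nE 's/.*perfdata:.*Space=([.0-9]+)Gb;([.0-9]+);([.0-9]+);([.0-9]+);([.0-9]+).*/used=\1 warn=\2 crit=\3 max=\5/p'
}

printf '%s\n' 'perfdata:c:\ Used Space=4.10Gb;19.40;23.28;0.00;25.87' | parse_ntdisk
# prints: used=4.10 warn=19.40 crit=23.28 max=25.87
```

If nothing is printed, the pattern did not match the line.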