SNMP Trap Handling for Nagios with snmptt and Shell Scripts
Martin Fürstenau
mf@maerber.de

The Nagios method

Prerequisite:
- snmptrapd must be installed

For every trap, a handler has to be defined in /etc/snmptrapd.conf with the following syntax:

    traphandle OID eventhandler/program

- The handlers pass their result to NSCA
- A service check has to be provided for every handler

Conclusion:
- Laborious
- Hard to maintain
- Inflexible

Better with snmptt (http://www.snmptt.org)

Prerequisites:
- snmptrapd must be installed
- snmptt must be installed
- The configuration files for snmptt are generated from MIBs with snmpttconvertmib
- The event handler submit_check_result should be installed

Message flow:
- snmptrapd hands the trap over to snmptt
- snmptt processes the trap and calls either an event handler (submit_check_result) or a script that in turn calls an event handler (submit_check_result)
- The event handler submits the result to Nagios as a passive check

submit_check_result (1)

    #!/bin/sh

    # SUBMIT_CHECK_RESULT
    # Written by Ethan Galstad (nagios@nagios.org)
    # Last Modified: 02-18-2002
    #
    # This script will write a command to the Nagios command
    # file to cause Nagios to process a passive service check
    # result. Note: This script is intended to be run on the
    # same host that is running Nagios. If you want to
    # submit passive check results from a remote machine, look
    # at using the nsca addon.
    #
    # Arguments:
    # $1 = host_name (Short name of host that the service is
    #      associated with)
    # $2 = svc_description (Description of the service)
    # $3 = return_code (An integer that determines the state
    #      of the service check, 0=OK, 1=WARNING, 2=CRITICAL,
    #      3=UNKNOWN).
    # $4 = plugin_output (A text string that should be used
    #      as the plugin output for the service check)

submit_check_result (2)

    echocmd="/bin/echo"

    CommandFile="/var/spool/nagios/nagios.cmd"

    # get the current date/time in seconds since UNIX epoch
    datetime=`date +%s`

    # create the command line to add to the command file
    cmdline="[$datetime] PROCESS_SERVICE_CHECK_RESULT;$1;$2;$3;$4"

    # append the command to the end of the command file
    `$echocmd $cmdline >> $CommandFile`

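To see what the script writes without touching a live Nagios installation, the following sketch builds the same command line and appends it to a scratch file instead of the command file. The function name format_passive_result is hypothetical, and the host, service and text are sample values; only the PROCESS_SERVICE_CHECK_RESULT layout is taken from the script above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios): prints the external command line
# that submit_check_result would write for a passive service check.
format_passive_result() {
    # $1 = host_name, $2 = svc_description, $3 = return_code, $4 = plugin_output
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4"
}

# Demo: append to a scratch file rather than /var/spool/nagios/nagios.cmd.
scratch=$(mktemp)
format_passive_result ops-apx01 APX_Heartbeat 0 "APX is running" >> "$scratch"
cat "$scratch"
rm -f "$scratch"
```

The printed line starts with the current epoch time in square brackets, followed by the four semicolon-separated fields in the order host, service, return code, plugin output.
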
Configuration

snmptrapd.conf:

    traphandle default /usr/sbin/snmptthandler

snmptt.ini: set up as described in the manual. The essential part is the last section:

    [TrapFiles]
    # A list of snmptt.conf files (this is NOT the snmptrapd.conf file). The COMPLETE path
    # and filename. Ex: '/etc/snmp/snmptt.conf'
    snmptt_conf_files = <<END
    /etc/snmptt/snmptt.conf.apx
    END

Example scenario: APX

- 1 W2K server (1 x production, 1 x ice-cold standby) in Poing
- 1 test server in Venlo, NL
- 4 traps:
  - Start
  - Stop
  - Heartbeat
  - All cancelled jobs and their problems
- Only traps from the production server are to be accepted
- Details such as abort code, severity etc. are carried in the message string of the trap

Converting the required MIBs with snmpttconvertmib
Example result, /etc/snmptt/snmptt.conf.apx (1):

    #
    # MIB: APX-MIB (file:./apmapx.mib) converted on Fri Nov 4 16:00:13 2005 using snmpttconvertmib v1.0
    #
    EVENT apxcontrolstart .1.3.6.1.4.1.1024.2.24.11.10.0.1081 "Status Events" Normal
    FORMAT Control started $*
    EXEC /usr/lib/nagios/ops_plugins/apx_start_trap_wrapper $r
    SDESC
    Control started
    Variables:
      1: apxcontrolstate
      2: apxtrapid
    EDESC
    #

Converting the required MIBs with snmpttconvertmib
Example result, /etc/snmptt/snmptt.conf.apx (2):

    #
    EVENT apxcontrolshutdown .1.3.6.1.4.1.1024.2.24.11.10.0.1082 "Status Events" Normal
    FORMAT Control shutdown $*
    EXEC /usr/lib/nagios/ops_plugins/apx_stop_trap_wrapper $r
    SDESC
    Control shutdown
    Variables:
      1: apxcontrolstate
      2: apxtrapid
    EDESC
    #

Converting the required MIBs with snmpttconvertmib
Example result, /etc/snmptt/snmptt.conf.apx (3):

    #
    EVENT apxheartbeat .1.3.6.1.4.1.1024.2.24.11.10.0.1088 "Status Events" Normal
    FORMAT Control heartbeat $*
    EXEC /usr/lib/nagios/ops_plugins/apx_heartbeat_trap_wrapper $r
    SDESC
    Control heartbeat
    Variables:
      1: apxcontrolstate
      2: apxlastaction
      3: apxtrapid
    EDESC
    #

Converting the required MIBs with snmpttconvertmib
Example result, /etc/snmptt/snmptt.conf.apx (4):

    #
    EVENT apxjobcancelled .1.3.6.1.4.1.1024.2.24.11.11.0.2004 "Status Events" Normal
    FORMAT Job cancelled $*
    EXEC /usr/lib/nagios/ops_plugins/apx_service_trap_wrapper $r $3 $5 $2 $7 $1
    SDESC
    Job cancelled
    Variables:
      1: apxtrapid
      2: apxjobkey
      3: apxjobname
      4: apxjoboid
      5: apxcondcode
      6: apxjobstate
      7: apxgroup
    EDESC

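The EXEC line for apxjobcancelled reorders the trap variables before they reach the wrapper. The sketch below mimics that substitution in plain shell to show which variable ends up in which wrapper argument. The function name exec_args is hypothetical and the values are sample data; snmptt performs this substitution itself, so nothing like this function exists in a real setup.

```shell
#!/bin/sh
# Hypothetical illustration of "EXEC ... $r $3 $5 $2 $7 $1".
# The first argument stands in for snmptt's $r (the sending host),
# the following seven arguments are trap variables 1..7.
exec_args() {
    host=$1
    shift
    # After the shift, $1..$7 are the trap variables in MIB order:
    #   1 apxtrapid   2 apxjobkey    3 apxjobname  4 apxjoboid
    #   5 apxcondcode 6 apxjobstate  7 apxgroup
    echo "$host $3 $5 $2 $7 $1"
}

exec_args ops-apx01 2004 1858194 RP651700_1 \
    0x081190973FFC4642B68B026E2402423D ABAP '?' MNLsev1
```

With these sample values the function prints `ops-apx01 RP651700_1 ABAP 1858194 MNLsev1 2004`, that is: host, job name, condition code, job key, group, trap ID. The job OID and job state (variables 4 and 6) are not passed on.
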
The log files (1)

snmptrapd.log

    2006-09-03 08:32:50 ops-apx01.ops.oce.net [10.53.11.121] (via UDP: [10.53.11.121]:1048) TRAP, SNMP v1, community OCE_PO .1.3.6.1.4.1.1024.2.24.11.10 Enterprise Specific Trap (1081) Uptime: 450 days, 4:48:25.32 .1.3.6.1.4.1.1024.2.24.11.1.1 = STRING: "1081" .1.3.6.1.4.1.1024.2.24.11.1.201 = STRING: "1,5,0,141" .1.3.6.1.4.1.1024.2.24.11.1.202 = STRING: "RUN" .1.3.6.1.4.1.1024.2.24.11.1.203 = STRING: "03.09.2006 08:32:52"

snmptt.log

    Sun Sep 3 08:32:50 2006 .1.3.6.1.4.1.1024.2.24.11.10.0.1081 Normal "Status Events" ops-apx01 - Control started 1081 1,5,0,141 RUN 03.09.2006 08:32:52

The log files (2)

snmptrapd.log

    2006-09-03 08:32:57 ops-apx01.ops.oce.net [10.53.11.121] (via UDP: [10.53.11.121]:1067) TRAP, SNMP v1, community OCE_PO .1.3.6.1.4.1.1024.2.24.11.10 Enterprise Specific Trap (1088) Uptime: 450 days, 4:48:32.93 .1.3.6.1.4.1.1024.2.24.11.1.1 = STRING: "1088" .1.3.6.1.4.1.1024.2.24.11.1.201 = STRING: "1,5,0,141" .1.3.6.1.4.1.1024.2.24.11.1.202 = STRING: "RUN" .1.3.6.1.4.1.1024.2.24.11.1.203 = STRING: "03.09.2006 08:32:59"

snmptt.log

    Sun Sep 3 08:32:58 2006 .1.3.6.1.4.1.1024.2.24.11.10.0.1088 Normal "Status Events" ops-apx01 - Control heartbeat 1088 1,5,0,141 RUN 03.09.2006 08:32:59

The log files (3)

snmptrapd.log

    2006-09-06 12:15:33 ops-apx01.ops.oce.net [10.53.11.121] (via UDP: [10.53.11.121]:3344) TRAP, SNMP v1, community OCE_PO .1.3.6.1.4.1.1024.2.24.11.11 Enterprise Specific Trap (2004) Uptime: 453 days, 8:31:06.25 .1.3.6.1.4.1.1024.2.24.11.1.1 = STRING: "2004" .1.3.6.1.4.1.1024.2.24.11.1.101 = STRING: "1858194" .1.3.6.1.4.1.1024.2.24.11.1.102 = STRING: "RP651700_1" .1.3.6.1.4.1.1024.2.24.11.1.103 = STRING: "0x081190973FFC4642B68B026E2402423D" .1.3.6.1.4.1.1024.2.24.11.1.104 = STRING: "ABAP" .1.3.6.1.4.1.1024.2.24.11.1.106 = STRING: "?" .1.3.6.1.4.1.1024.2.24.11.1.107 = STRING: "MNLsev1"

snmptt.log

    Wed Sep 6 12:15:33 2006 .1.3.6.1.4.1.1024.2.24.11.11.0.2004 Normal "Status Events" ops-apx01 - Job cancelled 2004 1858194 RP651700_1 0x081190973FFC4642B68B026E2402423D ABAP ? MNLsev1

Define service (see the snmptt documentation)

    define service{
        host_name               server01          ; Name of host
        service_description     TRAP              ; Name of service. What you use here must match the value used for the submit_check_result script
        is_volatile             1                 ; Enables volatile services
        check_command           check-host-alive  ; Used to reset the status to OK when "Schedule an immediate check of this service" is selected
        max_check_attempts      1                 ; Leave as 1
        normal_check_interval   1                 ; Leave as 1
        retry_check_interval    1                 ; Leave as 1
        passive_checks_enabled  1                 ; Enables passive checks
        check_period            none              ; When this service can be checked. Because it is a passive service, it never needs to be checked automatically
        notification_interval   31536000          ; Notification interval. Set to a very high number to prevent you from getting pages of previously received traps (1 year - restart Nagios at least once a year! - do not set to 0!)
        notification_period     24x7              ; When you can be notified. Can be changed
        notification_options    w,u,c,r           ; Notify on warning, unknown, critical and recovery
        notifications_enabled   1                 ; Enable notifications
        contact_groups          cg_core           ; Name of contact group to notify
    }

Service template definition (for APX in this example)

    define service{
        name                     APX-SERVER
        register                 0
        host_name                ops-apx01
        passive_checks_enabled   1
        notifications_enabled    1
        is_volatile              1
        check_period             none
        check_freshness          1
        max_check_attempts       1
        normal_check_interval    1
        flap_detection_enabled   0
        retry_check_interval     1
        contact_groups           apx-admins,apx-admins_sms
        notification_interval    31536000
        notification_period      24x7
    }

    define service{
        name                     APX-TRAP
        use                      APX-SERVER
        register                 0
        # After 5 hours, traps are reset by the service check
        freshness_threshold      18000
        check_command            apx_service_trap_reset
        notification_options     c,w
    }

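Both time values in the template are given in seconds. As a quick check of the arithmetic, the sketch below converts the 5 hours of the freshness threshold and the one year of the notification interval; the two function names are hypothetical helpers for illustration only.

```shell
#!/bin/sh
# Hypothetical helpers for converting intervals to seconds.
hours_to_seconds() { echo $(( $1 * 3600 )); }
days_to_seconds()  { echo $(( $1 * 86400 )); }

hours_to_seconds 5      # freshness_threshold: 5 hours
days_to_seconds 365     # notification_interval: 1 year
```

The two calls print 18000 and 31536000, matching freshness_threshold and notification_interval in the template.
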
Service definition (for APX in this example)

    define service{
        use                      APX-TRAP
        service_description      RP818100
    }

Example wrapper: apx_heartbeat_trap_wrapper (1)

    #!/bin/bash
    # Author: M.Fuerstenau, OPS, Poing
    # Purpose:
    # Is started by snmptt (config file /etc/snmptt/snmptt.conf.apx).
    # Instead of being started directly from the mentioned config file
    # the use of this wrapper was necessary due to the fact that the
    # service name for the Nagios service has to be generated and
    # that strings like "APXsev1" must be mapped to the appropriate
    # Nagios return code

    HOST=$1
    NAGIOS_DIR=/etc/nagios

    # Check whether the sending host exists at all. Otherwise exit.
    grep "^$HOST$" $NAGIOS_DIR/hosts/* > /dev/null 2>&1
    RETCO=$?
    if [ $RETCO -ne 0 ]
    then
        exit 1
    fi

Example wrapper: apx_heartbeat_trap_wrapper (2)

    # Determine the IP of the production host
    PROD_HOST=$(grep host_name $NAGIOS_DIR/services/APX-TRAP-template.cfg | awk '{print $2}')
    HOST_PROD_CFG=$(grep "^$PROD_HOST$" $NAGIOS_DIR/hosts/* | grep host_name | awk '{print $1}' | sed 's/:$//')
    PROD_IP=$(grep address $HOST_PROD_CFG | grep -v '#' | awk '{print $2}')

    # Determine the IP of the sending host (it sometimes arrives as a
    # hostname (Poing) and sometimes as an IP (Venlo))
    STR_2_GREP=$(grep "^$HOST$" $NAGIOS_DIR/hosts/* | grep -v alias | awk '{print $2}')
    HOST_ASK_CFG=$(grep "^$HOST$" $NAGIOS_DIR/hosts/* | grep $STR_2_GREP | awk '{print $1}' | sed 's/:$//')
    ASK_IP=$(grep address $HOST_ASK_CFG | grep -v '#' | awk '{print $2}')

    if [ "$PROD_IP" = "$ASK_IP" ]
    then
        /usr/lib/nagios/eventhandlers/submit_check_result $HOST APX_Heartbeat 0 "APX is running"
        exit 0
    else
        exit 1
    fi

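The grep chains above depend on how the Nagios host files are laid out. As a sketch, assuming each host is defined in a `define host{ ... }` block that contains `host_name` and `address` directives, the address lookup can also be done in a single awk pass. The function name host_address and the second host in the sample file are hypothetical; this is an alternative illustration, not the wrapper's own method.

```shell
#!/bin/sh
# Hypothetical helper: print the address of the host definition whose
# host_name equals $1, reading the Nagios object files given after it.
host_address() {
    wanted=$1
    shift
    awk -v wanted="$wanted" '
        /^[ \t]*define[ \t]+host/ { name = ""; addr = "" }  # a new block starts
        $1 == "host_name"         { name = $2 }
        $1 == "address"           { addr = $2 }
        /}/ { if (name == wanted && addr != "") { print addr; exit } }
    ' "$@"
}

# Demo with a throwaway host file (sample data, not a real configuration).
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
define host{
    host_name   ops-apx01
    alias       APX production
    address     10.53.11.121
}
define host{
    host_name   apx-test
    address     192.0.2.10
}
EOF
host_address ops-apx01 "$cfg"
rm -f "$cfg"
```

The demo prints 10.53.11.121. Comparing the result for the production host with the result for the sending host would then replace the PROD_IP and ASK_IP lookups.
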
Example (re)set: apx_heartbeat_alarm_set

    #!/bin/sh
    HOST=$1
    /usr/lib/nagios/eventhandlers/submit_check_result $HOST APX_Heartbeat 2 "Heartbeat signal from APX is missing. Maybe system is not running!"
    exit 2

Example wrapper: apx_service_trap_wrapper (1)

    #!/bin/bash
    # Author: M.Fuerstenau, OPS, Poing
    # 18.11.2005
    #
    # Purpose:
    # Is started by snmptt (config file /etc/snmptt/snmptt.conf.apx).
    # Instead of being started directly from the mentioned config file
    # the use of this wrapper was necessary due to the fact that the
    # service name for the Nagios service has to be generated and
    # that strings like "APXsev1" must be mapped to the appropriate
    # Nagios return code

    # Check the number of arguments passed
    if [ $# -ne 6 ]
    then
        echo $*
        echo "Wrong number of arguments - exiting"
        exit 1
    fi

Example wrapper: apx_service_trap_wrapper (2)

    # Assign the passed arguments to variables
    HOST=$1
    APX_JOB_NAME=$2
    APX_COND_CODE=$3
    APX_JOB_KEY=$4
    APX_GROUP=$5
    APX_TRAP=$6

    # Set global variables
    NAGIOS_DIR=/etc/nagios
    HAMLET=0

    # Check whether the sending host exists at all. Otherwise exit.
    grep "^$HOST$" $NAGIOS_DIR/hosts/* > /dev/null 2>&1
    RETCO=$?

Example wrapper: apx_service_trap_wrapper (3)

    if [ $RETCO -ne 0 ]
    then
        exit 1
    fi

    # Determine the IP of the production host
    PROD_HOST=$(grep host_name $NAGIOS_DIR/services/APX-TRAP-template.cfg | awk '{print $2}')
    HOST_PROD_CFG=$(grep "^$PROD_HOST$" $NAGIOS_DIR/hosts/* | grep host_name | awk '{print $1}' | sed 's/:$//')
    PROD_IP=$(grep address $HOST_PROD_CFG | grep -v '#' | awk '{print $2}')

    # Determine the IP of the sending host (it sometimes arrives as a
    # hostname (Poing) and sometimes as an IP (Venlo))
    STR_2_GREP=$(grep "^$HOST$" $NAGIOS_DIR/hosts/* | grep -v alias | awk '{print $2}')
    HOST_ASK_CFG=$(grep "^$HOST$" $NAGIOS_DIR/hosts/* | grep $STR_2_GREP | awk '{print $1}' | sed 's/:$//')
    ASK_IP=$(grep address $HOST_ASK_CFG | grep -v '#' | awk '{print $2}')

Example wrapper: apx_service_trap_wrapper (5)

    # Catastrophe or warning - that is the question
    case "$APX_GROUP" in
        APXsev1)
            HAMLET=2
            ;;
        APXsev2)
            HAMLET=1
            ;;
        MNLsev1)
            HAMLET=2
            ;;
        MNLsev2)
            HAMLET=1
            ;;
        EUROCEsev1)
            HAMLET=2
            ;;
        EUROCEsev2)
            HAMLET=1
            ;;
        SMSTESTsev1)
            HAMLET=2
            ;;
        SMSTESTsev2)
            HAMLET=1
            ;;
        *)
            echo "Unknown groupcode - babe, it is impossible"
            ;;
    esac

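Every group in the case statement above follows the same rule: a name ending in sev1 maps to 2 (CRITICAL), a name ending in sev2 maps to 1 (WARNING). A minimal sketch of that rule as a function that matches on the suffix instead of listing each group follows. The function name group_to_state is hypothetical, and returning 3 (UNKNOWN) for unrecognized groups is an assumption that differs from the wrapper, which leaves the state at 0 in that case.

```shell
#!/bin/sh
# Hypothetical helper: map an APX group name to a Nagios return code.
group_to_state() {
    case "$1" in
        *sev1) echo 2 ;;   # CRITICAL
        *sev2) echo 1 ;;   # WARNING
        *)     echo 3 ;;   # UNKNOWN (assumption, not the wrapper's behavior)
    esac
}

for group in APXsev1 MNLsev2 SMSTESTsev1 bogus; do
    echo "$group -> $(group_to_state "$group")"
done
```

The trade-off of matching on the suffix is that a new group is accepted without touching the script, while the explicit list in the wrapper rejects anything that has not been registered.
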
Example wrapper: apx_service_trap_wrapper (6)

    #
    # Check whether the host sending the trap is the monitored one.
    # If so, continue; otherwise exit.
    if [ "$PROD_IP" = "$ASK_IP" ]
    then
        /usr/lib/nagios/eventhandlers/submit_check_result $HOST $APX_JOB_NAME $HAMLET "Job $APX_JOB_NAME cancelled - Condition Code: $APX_COND_CODE - APX JobName $APX_JOB_KEY - APX Group $APX_GROUP - APX Trap ID $APX_TRAP"
        exit 0
    else
        exit 1
    fi

Example reset: apx_service_trap_reset

    #!/bin/sh
    echo "Ok. No error. No warning."
    exit 0

Example: start and stop

Excerpt from apx_start_trap_wrapper:

    if [ "$PROD_IP" = "$ASK_IP" ]
    then
        /usr/lib/nagios/eventhandlers/submit_check_result $HOST APX_Start_Stop 0 "APX started"
        exit 0
    else
        exit 1
    fi

Excerpt from apx_stop_trap_wrapper:

    if [ "$PROD_IP" = "$ASK_IP" ]
    then
        /usr/lib/nagios/eventhandlers/submit_check_result $HOST APX_Start_Stop 2 "APX shutdown"
        exit 0
    else
        exit 1
    fi