Nagios Monitoring Tools Setup Explained

May 16, 2017

Nagios (a recursive acronym for “Nagios Ain’t Gonna Insist On Sainthood”) has long been one of the most favoured “prefects” of the data centre. It monitors system status (whether a system is up and running, plus CPU, memory and disk usage), service status (whether a service such as DNS, a Web server or a mail server is up and running), and many other factors, including room temperature and even humidity. It can generate alerts (through email or SMS) when the monitored parameters exceed preset thresholds.

How does Nagios work?

Nagios is installed on a central monitoring server.
We tell Nagios which hosts, and which services on those hosts, we want to monitor.
Nagios polls the hosts and services periodically, checking for “alive-ness”. For a host, this means it must respond to ICMP ping requests. For a service, such as an HTTP server, Nagios checks that it can make a successful connection. The frequency of these checks is configurable.
If a host or service fails to respond, or returns a bad reply, Nagios marks it as being in a problem state (Down for a host, Critical for a service) and alerts the configured contact by email or SMS.
Whoever responds to the alert acknowledges it via the web interface. This lets other system administrators know that someone is working on it.
Once the problem is fixed, Nagios detects this and returns the host to an Up state or the service to an OK state.
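
The check cycle above rests on a simple convention: every check is carried out by a plugin, and the plugin's exit status tells Nagios the result (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The following is a minimal sketch of that mapping. The function name state_from_exit_code is made up for illustration and is not part of Nagios, and treating every other exit status as UNKNOWN is a simplification.

```shell
#!/bin/sh
# Map a plugin exit status to the state name Nagios reports for a service.
# state_from_exit_code is a hypothetical helper, not a Nagios command.
state_from_exit_code() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;   # simplification: anything else is UNKNOWN here
    esac
}

# Stand-in for a real plugin: a command that exits with status 2.
rc=0
sh -c 'exit 2' || rc=$?
state_from_exit_code "$rc"
```

Run as written, the stand-in command exits with status 2, so the script prints CRITICAL.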

Installation

On Red Hat Enterprise Linux, Nagios can be easily installed using the EPEL Repository. To the uninitiated, EPEL is: “Extra Packages for Enterprise Linux (or EPEL), a Fedora Special Interest Group that creates, maintains, and manages a high-quality set of additional packages for Enterprise Linux, including, but not limited to, Red Hat Enterprise Linux (RHEL), CentOS and Scientific Linux (SL).”

To ensure that Nagios is available in the EPEL repository, let’s browse the relevant repository (since ours is a 64-bit host, we’re looking at the x86_64 EPEL repository). On jumping to packages whose names begin with “N”, we can see that (as of this writing) there are 65 Nagios packages (RPMs) available for 64-bit RHEL 5 & 6. We can check this using the following command (on the URL for the group of packages we just mentioned):

To install Nagios from EPEL, add the EPEL repository to yum, and then install the RPMs. The instructions to add the EPEL repository (clearly mentioned on the EPEL site) are as follows:

Download the relevant RPM to set up the repo:
RHEL 5:
# wget -c http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
RHEL 6:
# wget -c https://epel.mirror.angkasa.id/pub/epel/6/i386/epel-release-6-8.noarch.rpm

Install it:

# rpm -Uvh /home/vbg/downloads/epel-release-5-4.noarch.rpm
# yum install nagios nagios-common nagios-plugins nagios-plugins-http \
nagios-plugins-disk nagios-plugins-ping -y

These commands should install Nagios without any hiccups.

Let’s have a look at the various configuration files (each with a specific purpose), and understand how Nagios uses them. They are:

The main configuration file — /etc/nagios/nagios.cfg
Object definition files — /etc/nagios/commands.cfg and /etc/nagios/localhost.cfg
Resource configuration file — /etc/nagios/private/resource.cfg
CGI configuration file — /etc/nagios/cgi.cfg

The Apache configuration file (/etc/httpd/conf.d/nagios.conf) contains the directive for the URLs http://<nagios-host>/nagios/, and http://<nagios-host>/nagios/cgi-bin/, whereas the /etc/logrotate.d/nagios file is a log rotation configuration file.

The main configuration file

The /etc/nagios/nagios.cfg file controls the behavior of the Nagios process and also the CGIs. There are many configuration directives in this file, and all of them are well documented. Let us look at some of the more important ones to get our basic configuration going:

Log file: This should be the first directive — the log file where host and service events are logged. Be careful that the file is accessible and writeable by the nagios user:

log_file=<path-of-log-file>

Nagios user and group: These are the user and group names under which the nagios process runs. The yum installation, as above, creates both a user and a group named nagios, which we will use:

nagios_user=nagios
nagios_group=nagios

Object definition file(s): This parameter can be specified multiple times. These files contain definitions for each host and service, as well as groups of hosts and services. As an example, the yum installation creates two object configuration files: commands.cfg and localhost.cfg. We will look at these a little later. The parameter syntax is as follows:

cfg_file=<path-of-object-definition-file_1>
cfg_file=<path-of-object-definition-file_2>
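
Because cfg_file may appear several times, it can be handy to list every object definition file a given main configuration pulls in. The sketch below does this with grep and sed. The helper name cfg_values is hypothetical, and the demo runs against a throwaway file rather than the real /etc/nagios/nagios.cfg.

```shell
#!/bin/sh
# cfg_values FILE KEY: print every value assigned to KEY in a key=value
# file such as nagios.cfg, ignoring comment lines.
# (cfg_values is an illustrative helper, not a Nagios tool.)
cfg_values() {
    grep -v '^[[:space:]]*#' "$1" | sed -n "s/^[[:space:]]*$2=//p"
}

# Demo on a temporary file; on a real system, pass /etc/nagios/nagios.cfg.
demo_cfg=$(mktemp)
cat > "$demo_cfg" <<'EOF'
# sample main configuration
log_file=/var/log/nagios/nagios.log
cfg_file=/etc/nagios/commands.cfg
cfg_file=/etc/nagios/localhost.cfg
nagios_user=nagios
EOF

cfg_values "$demo_cfg" cfg_file      # one line per object definition file
cfg_values "$demo_cfg" nagios_user
rm -f "$demo_cfg"
```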

Object cache file (object_cache_file): To speed up operations, the nagios service stores the object definitions it has read in a cache file, which is then read by the CGIs. This also prevents inconsistencies, such as when an object file is being modified and is saved before all changes are completed.

Status file (status_file): This file is where the status of all monitored hosts and services is stored by Nagios, to be processed by the CGI scripts.

Resource file (resource_file): The CGIs do not read these files, and they can contain sensitive information such as user names and passwords. Therefore, restrictive permissions such as 600 (only the owner can read/write) should be placed on these files. This parameter too can be specified multiple times. Resource files contain macros that are expanded by Nagios when executing a command found in the commands file. As you can see, the Nagios RPMs install these files in a separate directory, /etc/nagios/private, which is owned by the root user and readable by the nagios group:

# ls -ld /etc/nagios/
drwxrwxr-x 5 root root 4096 May 15 13:50 /etc/nagios/
# ls -ld /etc/nagios/private/
drwxr-x--- 2 root nagios 4096 May 15 13:36 /etc/nagios/private/
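
The advice about restrictive permissions on resource files can be checked mechanically. The sketch below compares a file's octal mode with the expected value. It assumes GNU stat (the -c option), as shipped on RHEL, and has_mode is a made-up helper name. The demo uses a temporary file as a stand-in for a resource file.

```shell
#!/bin/sh
# has_mode FILE MODE: succeed if FILE's octal permission bits equal MODE.
# Assumes GNU stat (-c '%a'); has_mode is an illustrative name.
has_mode() {
    [ "$(stat -c '%a' "$1")" = "$2" ]
}

# Demo on a temporary stand-in for a resource file.
demo_res=$(mktemp)
chmod 600 "$demo_res"
if has_mode "$demo_res" 600; then
    echo "permissions ok"
else
    echo "permissions too open"
fi
rm -f "$demo_res"
```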

Object and Resource definition files

Objects are entities that need to be monitored, or are used for monitoring. Some examples are commands, hosts, groups, services and contacts. Let us explore a host object and a command object in this article.

Host object definitions are used to define a particular host that is being monitored; the mandatory directives are:

host_name: a short name for the host. Multiple services can be monitored on a single host. Normally, the FQDN is used.
alias: a longer description.
address: the IP address of the host being monitored.
max_check_attempts: the number of attempts to check the host, if a non-OK state is returned.
check_period: the period name (which is also defined), during which checks should be made.
contact_groups: the contact groups (people to be contacted) in case of problems (or recoveries) with this host.
notification_interval: the time interval (by default, in minutes) after which notifications will be sent, in case the host is still down.
notification_period: the time period in which notifications should be sent. If the host is down at a time outside this period, no notifications are sent.

notification_options: This directive can have the following values:
d: send notifications when the host is down
u: send notifications if the host is unreachable
r: send notifications on recoveries
f: when the host starts and stops flapping (flapping is usually used to determine whether a service/host is stable. Flapping occurs when a service/host changes states too frequently.)
n: no notifications will be sent
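
As a quick aid for reading a notification_options value, the sketch below expands a comma-separated list into the meanings listed above. The function name describe_host_options is hypothetical, and only the letters covered in this article are handled.

```shell
#!/bin/sh
# describe_host_options "d,u,r": print one line per option letter.
# (describe_host_options is an illustrative helper, not part of Nagios.)
describe_host_options() {
    old_ifs=$IFS
    IFS=','
    for opt in $1; do
        case "$opt" in
            d) echo "d: host down" ;;
            u) echo "u: host unreachable" ;;
            r) echo "r: host recovery" ;;
            f) echo "f: flapping start/stop" ;;
            n) echo "n: no notifications" ;;
            *) echo "$opt: not covered in this article" ;;
        esac
    done
    IFS=$old_ifs
}

describe_host_options "d,u,r"
```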

A more efficient way to use host definitions is to define templates and use them. A snippet from the file /etc/nagios/localhost.cfg that defines a template, and then uses it for a host object definition, is shown below:

define host{
use linux-server            ; Name of host template to use
; This host definition will inherit all variables that are defined
; in (or inherited by) the linux-server host template definition.
host_name localhost
alias           localhost
address         127.0.0.1
}

The use statement above specifies that this host definition uses a template called linux-server. It is defined in the same file, as follows:

define host{
name linux-server     ; The name of this host template
use  generic-host     ; This template inherits other values from the generic-host template
check_period 24x7     ; By default, Linux hosts are checked round the clock
max_check_attempts 10 ; Check each Linux host 10 times (max)
check_command check-host-alive   ; Default command to check Linux hosts
notification_period workhours    ; Linux admins hate to be woken; only notify in the day
; Note that notification_period overrides the value
; inherited from the generic-host template!
notification_interval  120           ; Resend notification every 2 hours
notification_options   d,u,r         ; Only send notifications for specific host states
contact_groups         ops           ; Notifications get sent to the ops group by default
register               0             ; DONT REGISTER THIS TEMPLATE DEFINITION!
}

This template further uses a template called generic-host, which is also defined in the same file, as:

define host{
name generic-host            ; The name of this host template
notifications_enabled 1      ; Host notifications are enabled
event_handler_enabled 1      ; Host event handler is enabled
flap_detection_enabled 1     ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1          ; Process performance data
retain_status_information 1  ; Retain status information across program restarts
retain_nonstatus_information  1 ; Retain non-status information across program restarts
notification_period 24x7     ; Send host notifications at any time
register 0                   ; DONT REGISTER THIS TEMPLATE DEFINITION!
}
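
To make the inheritance chain concrete, here is a rough sketch that resolves the effective value of a directive by looking at the host first and then walking up its use chain. It is a simplified model, not how Nagios itself is implemented: it assumes a single template per use line, assumes there are no cycles, and handles only define host blocks. The names flatten_hosts and effective_value are invented for this example, and the demo definitions are a trimmed copy of the ones above.

```shell
#!/bin/sh
# flatten_hosts FILE: turn "define host{...}" blocks into rows of
# "object<TAB>directive<TAB>value". An object is keyed by its host_name,
# or by its name if it is a template. (Illustrative helper.)
flatten_hosts() {
    awk '
        $1 == "define" { n = 0; id = ""; next }
        $1 == "}" {
            for (i = 1; i <= n; i++) print id "\t" key[i] "\t" val[i]
            next
        }
        {
            sub(/;.*/, "")                 # strip trailing comments
            if (NF < 2) next
            k = $1; $1 = ""
            sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "")
            if (k == "name" || k == "host_name") id = $0
            n++; key[n] = k; val[n] = $0
        }
    ' "$1"
}

# effective_value FLATFILE OBJECT DIRECTIVE: look in OBJECT first,
# then follow its "use" parent, and so on up the chain.
effective_value() {
    obj=$2
    while [ -n "$obj" ]; do
        v=$(awk -F '\t' -v o="$obj" -v d="$3" \
            '$1 == o && $2 == d { print $3; exit }' "$1")
        if [ -n "$v" ]; then
            echo "$v"
            return 0
        fi
        obj=$(awk -F '\t' -v o="$obj" \
            '$1 == o && $2 == "use" { print $3; exit }' "$1")
    done
    return 1
}

# Demo: a trimmed copy of the three definitions discussed above.
demo_objs=$(mktemp)
flat=$(mktemp)
cat > "$demo_objs" <<'EOF'
define host{
    name                    generic-host
    notification_period     24x7
    flap_detection_enabled  1
    register                0
}
define host{
    name                    linux-server
    use                     generic-host
    notification_period     workhours   ; overrides generic-host
    register                0
}
define host{
    use                     linux-server
    host_name               localhost
    address                 127.0.0.1
}
EOF
flatten_hosts "$demo_objs" > "$flat"
effective_value "$flat" localhost notification_period     # found in linux-server
effective_value "$flat" localhost flap_detection_enabled  # found in generic-host
rm -f "$demo_objs" "$flat"
```

The first lookup stops at linux-server, which is why its workhours value wins over the 24x7 value in generic-host.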

Other objects referenced in the above snippets are:

contact_groups called “ops”
notification_period called “workhours” and “24x7”
check_command called “check-host-alive”

The contact_group object called “ops” is also defined in the same file:

define contactgroup{
contactgroup_name       ops
alias                   Nagios Administrators
members                 nagiosadmin
}

The member nagiosadmin of the above contact group is defined as:

define contact{
contact_name                    nagiosadmin
alias                           Nagios Admin
service_notification_period     24x7
host_notification_period        24x7
service_notification_options    w,u,c,r
host_notification_options       d,r
service_notification_commands   notify-by-email
host_notification_commands      host-notify-by-email
email                           nagiosadmin@localhost
}

The time period “workhours” is defined as:

define timeperiod{
timeperiod_name workhours
alias           "Normal" Working Hours
monday          09:00-17:00
tuesday         09:00-17:00
wednesday       09:00-17:00
thursday        09:00-17:00
friday          09:00-17:00
}

The time period “24x7” is defined as:

define timeperiod{
timeperiod_name 24x7
alias           24 Hours A Day, 7 Days A Week
sunday          00:00-24:00
monday          00:00-24:00
tuesday         00:00-24:00
wednesday       00:00-24:00
thursday        00:00-24:00
friday          00:00-24:00
saturday        00:00-24:00
}
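
The effect of the workhours period can be illustrated with a small membership test. This is only a sketch of the idea, not Nagios logic: in_workhours is a hypothetical function, times are passed as HHMM, and the end of the range (17:00) is treated as exclusive, which is an assumption.

```shell
#!/bin/sh
# in_workhours DAY HHMM: succeed if the day and time fall inside the
# "workhours" period (Monday to Friday, 09:00-17:00).
# The 17:00 boundary is treated as exclusive here (an assumption).
in_workhours() {
    case "$1" in
        monday|tuesday|wednesday|thursday|friday) ;;
        *) return 1 ;;
    esac
    [ "$2" -ge 900 ] && [ "$2" -lt 1700 ]
}

for when in "monday 0930" "saturday 1000" "friday 1830"; do
    set -- $when
    if in_workhours "$1" "$2"; then
        echo "$when: notify"
    else
        echo "$when: hold"
    fi
done
```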

Command definitions are used to define commands that Nagios will use. They can include macros from resource definition files. The command used in the localhost.cfg file for localhost is defined in /etc/nagios/commands.cfg as:

define command {
command_name    check-host-alive
command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
}

$USER1$ is a macro defined in /etc/nagios/private/resource.cfg as a file system path:

$USER1$=/usr/lib64/nagios/plugins
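
To see what Nagios ends up running for check-host-alive, the macros in the command line can be substituted by hand. The sketch below does this with sed for just the two macros used here. The name expand_macros is hypothetical, and real Nagios expands many more macros than this.

```shell
#!/bin/sh
# expand_macros TEMPLATE USER1 HOSTADDRESS: substitute the two macros
# used by check-host-alive. (Illustrative helper; Nagios itself handles
# many more macros.)
expand_macros() {
    printf '%s\n' "$1" |
        sed -e "s|\\\$USER1\\\$|$2|g" -e "s|\\\$HOSTADDRESS\\\$|$3|g"
}

expand_macros '$USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1' \
    /usr/lib64/nagios/plugins 127.0.0.1
```

For localhost this prints /usr/lib64/nagios/plugins/check_ping -H 127.0.0.1 -w 3000.0,80% -c 5000.0,100% -p 1, which can be run by hand to reproduce the host check.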

Once the host, host-groups, commands and time periods have been defined, it is time to define services. For the purpose of this introductory article, we will use only the ping service. Again, the service definition sections in the sample configuration file listed below are self-explanatory.

define service {
use local-service         ; Name of service template to use
host_name localhost
service_description PING
check_command check_ping!100.0,20%!500.0,60%
}

This definition uses a template, local-service, defined as:

define service {
name local-service   ; The name of this service template
use generic-service  ; Inherit default values from the generic-service definition
check_period 24x7    ; The service can be checked at any time of the day
max_check_attempts 4 ; Re-check up to 4 times to determine final (hard) state
normal_check_interval 5 ; Check service every 5 minutes normally
retry_check_interval 1  ; Re-check every minute until a hard state can be determined
contact_groups ops   ; Send notifications to all in the 'ops' group
notification_options w,u,c,r ; Send warning, unknown, critical, and recovery notifications
notification_interval 60     ; Re-notify about problems every hour
notification_period 24x7     ; Notifications can be sent out at any time
register 0                   ; DONT REGISTER THIS TEMPLATE DEFINITION!
}

The local-service template further uses a template, generic-service. For our use-case scenario, please ensure that you comment out all other service definitions in this configuration file.
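
In the PING service definition above, the check_command value packs a command name and its arguments into one string, separated by exclamation marks. Nagios binds the arguments to the macros $ARG1$, $ARG2$ and so on. The sketch below splits such a string; split_check_command is a made-up name. How the arguments are then used depends on the check_ping command definition, which this article does not show; in the stock sample configuration they are the warning and critical thresholds.

```shell
#!/bin/sh
# split_check_command "name!arg1!arg2": print the command name and each
# argument next to the $ARGn$ macro it is bound to. (Illustrative helper.)
split_check_command() {
    old_ifs=$IFS
    IFS='!'
    set -- $1
    IFS=$old_ifs
    echo "command: $1"
    shift
    i=1
    for arg in "$@"; do
        echo "\$ARG$i\$ = $arg"
        i=$((i + 1))
    done
}

split_check_command 'check_ping!100.0,20%!500.0,60%'
```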

Therefore, to sum up the various files used, based on the default configuration, our Nagios instance is set up as follows:

It monitors localhost (IP address 127.0.0.1).
The host is checked round the clock (24x7).
The host is checked using the command /usr/lib64/nagios/plugins/check_ping.
Notifications are sent if the host is down, is unreachable or has recovered.
Notifications go to nagiosadmin@localhost, but are sent only during workhours, and are resent every two hours if the host is still down or unreachable.

The CGI configuration file

The CGI configuration file (/etc/nagios/cgi.cfg) configures the CGI scripts and the Web GUI of Nagios. The significant parameters are:

main_config_file: The path of the main Nagios configuration file, which tells the CGI scripts where to find it.

physical_html_path: The filesystem path for Nagios HTML files.

url_html_path: The URL portion appended to the base URL, that will access the Nagios HTML files.

refresh_rate: Specifies the refresh rate for various CGIs such as status.

use_authentication: Specifies that the CGI scripts should use authentication.

Once Nagios has been configured, you will need to add an authentication file to be able to access Nagios pages. By default, the Apache configuration directives (specified in /etc/httpd/conf.d/nagios.conf) rely on basic authentication, and allow access only from localhost. The user authentication file /etc/nagios/passwd needs to be created. You can do this using the htpasswd command:

# htpasswd -bc /etc/nagios/passwd nagiosadmin xxxxxxxxxx

This creates the nagiosadmin user with the password given on the command line (shown here as xxxxxxxxxx), and stores the details in the file /etc/nagios/passwd.

The setup is now ready. Restart the Apache service, start (or restart) the Nagios service, and watch the Nagios log:

# /etc/init.d/httpd restart
# /etc/init.d/nagios start
# tail -f /var/log/nagios/nagios.log
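
Before starting or restarting the service, it can help to let Nagios verify the configuration with its -v option. The sketch below wraps that check. The function name preflight and the NAGIOS_BIN override are assumptions of this example, and the script simply skips the check when no nagios binary is on the PATH.

```shell
#!/bin/sh
# preflight CONFIG: run the Nagios configuration check on CONFIG.
# NAGIOS_BIN is an assumption of this sketch, so that the binary
# can be overridden; it defaults to "nagios".
preflight() {
    "${NAGIOS_BIN:-nagios}" -v "$1"
}

if command -v "${NAGIOS_BIN:-nagios}" >/dev/null 2>&1; then
    preflight /etc/nagios/nagios.cfg && echo "configuration looks valid"
else
    echo "nagios binary not found; skipping pre-flight check"
fi
```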

If the Nagios logs are fine, open your browser and connect to http://localhost/nagios/, authenticate as nagiosadmin and check the Host summary. The configured host, localhost, should be up. You can also see the service status and explore the other features.
