NetWatcher Architecture

Data Model

The universe as seen by NetWatcher consists of "objects". An object is a thing that operates (or fails to operate) within several of a set of "categories". Examples of objects are physical computers, mail systems serving particular domains, and web servers.

"Categories" are functions or characteristics that can be checked or measured by monitors. Typically, one monitor checks one category on a set of objects. Some examples of categories are responsiveness to ICMP ping, load average, and functioning of the mail delivery system. The field "sleeptime" in the categories table, if filled, tells the monitor to sleep that many seconds between checks. Note that this is not the period of checking, because the check cycle itself may take a considerable amount of time. Also, the sleep time is randomized to avoid possible burst patterns.
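
The difference between "sleeptime" and the checking period can be sketched as follows. This is an illustrative Python model, not NetWatcher's Perl code. The source does not say how the sleep time is randomized, so the uniform distribution and the `spread` parameter are assumptions, and the names `next_sleep` and `run` are hypothetical.

```python
import random
import time


def next_sleep(sleeptime, spread=0.2, rng=random):
    """Return a randomized delay in seconds around the configured sleeptime.

    Assumption: the delay is drawn uniformly from sleeptime +/- spread * sleeptime.
    """
    low = sleeptime * (1.0 - spread)
    high = sleeptime * (1.0 + spread)
    return rng.uniform(low, high)


def run(check_cycle, sleeptime, cycles, sleep=time.sleep, rng=random):
    """Run a check cycle, then sleep, a fixed number of times.

    The effective period is the duration of check_cycle() plus the
    randomized sleep, which is why sleeptime is not the checking period.
    """
    for _ in range(cycles):
        check_cycle()  # may itself take a considerable amount of time
        sleep(next_sleep(sleeptime, rng=rng))
```

With a sleeptime of 60 and the assumed spread of 0.2, each pause falls between 48 and 72 seconds, so monitors started together drift apart instead of probing in bursts.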

Objects and categories are listed in two NetWatcher tables.

The "checks" table carries records stating that a particular category needs to be checked on a particular object, optionally with data specific to that check (e.g. the SNMP credentials needed to establish a session). A typical monitor probes all objects that are listed in the checks table in conjunction with the category served by this monitor. If the "stepping" field is filled with N, that particular check is performed only every Nth cycle. This allows the same category to be checked with different frequency on different objects.
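
The effect of "stepping" can be illustrated with a small Python model (not NetWatcher's Perl code). The source only says "every Nth cycle", so the 1-based cycle counter and the choice of firing on cycles N, 2N, 3N are assumptions. The names `is_due` and `checks_for_cycle`, and the sample objects, are hypothetical.

```python
def is_due(cycle, stepping=None):
    """Decide whether a check runs in the given cycle.

    cycle is a 1-based counter of monitor cycles. An empty stepping
    field means the check runs in every cycle.
    """
    if not stepping:
        return True
    return cycle % stepping == 0  # assumed phase: cycles N, 2N, 3N, ...


def checks_for_cycle(cycle, checks):
    """Select the objects to probe in this cycle.

    checks is a list of (object_name, stepping) pairs for one category.
    """
    return [obj for obj, stepping in checks if is_due(cycle, stepping)]


# Hypothetical rows for one category: router1 every cycle, printer7 every 3rd.
checks = [("router1", None), ("printer7", 3)]
```

Here `checks_for_cycle(1, checks)` selects only `router1`, while `checks_for_cycle(3, checks)` selects both objects.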

"Subjects" are recipients of notifications. They may be human or otherwise. The "subjects" table carries the name and the method of delivery, e.g. the name of an SMS sender program plus a telephone number. The program specified in this table receives the message on standard input and tries to deliver it to the destination. The most common example of such a program is "mailx". For example, you may have an entry in this table named "operator" with the data field filled with

/bin/mailx -s "NetWatcher alert" noc@acme.com

Note that these programs are executed synchronously by the monitors so they should complete quickly, possibly just queueing the message for later delivery in the background.
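
The delivery contract (message on standard input, synchronous execution) can be sketched in Python as below. This is not NetWatcher's Perl code. The function name `deliver`, the timeout, and the use of the exit status as a success indicator are assumptions, and the sketch splits the command line itself rather than passing it to a shell.

```python
import shlex
import subprocess


def deliver(command, message, timeout=30):
    """Run a delivery program synchronously, feeding the message on stdin.

    command is the text from the subject's data field. The timeout guards
    the caller against a delivery program that does not complete quickly.
    Returns True if the program exited with status 0.
    """
    argv = shlex.split(command)  # keeps "NetWatcher alert" as one argument
    result = subprocess.run(argv, input=message, text=True, timeout=timeout)
    return result.returncode == 0
```

With the entry above, a call would look like `deliver('/bin/mailx -s "NetWatcher alert" noc@acme.com', "host1 ping: bad\n")`. Because the call blocks until the program exits, a slow delivery program delays the monitor's next check.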

The "responsibilities" table tells which subjects need to be notified when a status change is noticed for which checks. A field in this table, "elevation", is used to facilitate a simple escalation procedure. If it is equal to N, the subject is notified about a "good to bad" transition only after the "bad" status repeats N+1 times in succession. Note that the escalation delay is measured not in minutes but in successive check attempts, which may be performed more or less often depending on the configuration and the way of operation of a particular monitor.
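
The escalation rule can be made concrete with a short Python model (not NetWatcher's Perl code). It assumes a subject is notified exactly once, on the attempt where the count of consecutive "bad" results reaches N+1. The function name and the sample subjects are hypothetical.

```python
def subjects_to_notify(consecutive_bad, responsibilities):
    """Return the subjects whose escalation threshold is reached right now.

    consecutive_bad: number of "bad" results in a row, including the current one.
    responsibilities: iterable of (subject, elevation) pairs for one check.
    """
    return [subject for subject, elevation in responsibilities
            if consecutive_bad == elevation + 1]


# Hypothetical rows: the operator hears at once, the manager on the 3rd failure.
responsibilities = [("operator", 0), ("manager", 2)]
for attempt in range(1, 5):
    print(attempt, subjects_to_notify(attempt, responsibilities))
```

In this model the operator is selected on the first bad attempt, nobody on the second, the manager on the third, and nobody on the fourth. How long that takes in wall-clock time depends on how often the monitor runs the check.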

And finally, the "status" table holds the current status of all previously run checks. Once the result of a new check differs from the result stored in this table, the responsible subjects are notified about the change (modulo escalation delay). Then the new status is stored in the table, replacing the previous one, and is additionally logged in the "statuslog" table.
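
The compare, notify, store, and log sequence can be modelled in a few lines of Python. This is not NetWatcher's Perl code: a dictionary and a list stand in for the "status" and "statuslog" tables, and escalation is left out. The function signature is hypothetical. The source does not say what happens on the very first result for a check, so the sketch assumes it is stored and logged without a notification.

```python
def report(status, statuslog, key, new_status, notify):
    """Record a check result and notify on change.

    status: dict mapping (object, category) to the last known status.
    statuslog: list receiving one (key, status) entry per change.
    notify: callable invoked as notify(key, old_status, new_status).
    Returns True if the stored status changed.
    """
    old_status = status.get(key)
    if old_status == new_status:
        return False  # nothing changed, nothing to store or log
    if old_status is not None:
        notify(key, old_status, new_status)  # assumed: no alert on first result
    status[key] = new_status  # replaces the previous status
    statuslog.append((key, new_status))  # history of changes
    return True
```

Repeated identical results leave both tables untouched in this model, so the log grows only when a status actually changes.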

The following figure depicts the relationships between the main NetWatcher tables:

Program Design

NetWatcher programs are written in object-oriented Perl 5.

Base Class - NetWatcher

This is the parent class of all other classes; it defines methods for debug output and error logging. Inside its constructor, it initialises the database handle.

NetWatcher::Config

The source code of this module is created by Makefile.PL; it contains the definitions of the DBI connection string and the location of the log file.

NetWatcher::Report

This module is the heart of the system. Monitoring processes use the `report' method provided by this module to tell everybody about the status of a monitored object. This method compares the reported status with the old status from the database, updates the old status, and if necessary sends notifications to the responsible subjects. In principle, a monitor can do its job within NetWatcher without using any other module of the package.

NetWatcher::Monitor

Although the list of objects to check may be hardcoded in a monitor, usually the monitor wants to get the list from the configuration table in the database. To facilitate this, and to provide other convenience tools, there is the Monitor module. It is designed to be used as a parent class for actual monitors. It provides the methods `run_once' and `run', which run the monitor code once or periodically, respectively. A child class must provide a `category' method that says which category is served by this monitor, and one of the `check' and `check_all' methods. If the `check' method is provided, `run' invokes it for every object that needs to be checked against this category. If the `check_all' method is provided, it is invoked periodically and is passed the list of objects that must be checked. If checking each object takes significant time, it may be advisable to run the checks in parallel, using the `check_all' model. On the other hand, checking one object at a time may be easier to implement.
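
The choice between the two models can be illustrated with a toy Python class hierarchy. It only mirrors the dispatch described above and is not the NetWatcher::Monitor API, which is Perl. The method `objects_to_check` is a hypothetical stand-in for the lookup in the checks table, and preferring `check_all` when both methods exist is an assumption.

```python
class Monitor:
    """Toy model of the parent class, showing only the dispatch logic."""

    def category(self):
        raise NotImplementedError("child class must name its category")

    def objects_to_check(self):
        # Hypothetical stand-in for reading the checks table.
        raise NotImplementedError

    def run_once(self):
        objects = self.objects_to_check()
        check_all = getattr(self, "check_all", None)
        if callable(check_all):
            check_all(objects)  # the child handles the whole list itself
        else:
            for obj in objects:
                self.check(obj)  # one object at a time


class PingMonitor(Monitor):
    """Uses the simpler one-object-at-a-time model."""

    def __init__(self):
        self.seen = []

    def category(self):
        return "ping"

    def objects_to_check(self):
        return ["host1", "host2"]

    def check(self, obj):
        self.seen.append(obj)


class LoadMonitor(PingMonitor):
    """Uses the batch model, for example to probe objects in parallel."""

    def __init__(self):
        super().__init__()
        self.batches = []

    def category(self):
        return "load"

    def check_all(self, objects):
        self.batches.append(list(objects))
```

Calling `run_once()` on a `PingMonitor` visits `host1` and then `host2`, while a `LoadMonitor` receives both names in a single `check_all` call and never has `check` invoked.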

Another convenience provided by the Monitor module is the `supervise' method. If all your monitors are implemented as subclasses of the Monitor class, you can invoke `supervise', passing it the list of subclass names. It will start one process per subclass, restart a process if it aborts, and also run a special `Housekeeper' pseudo-monitor that takes care of periodically renaming and reopening the log file and cleaning up old entries in the "statuslog" table in the database.

NetWatcher::Console

This module provides the HTTP/CGI interface to the system. Subclasses of this class display the status of objects and present the configuration tables for editing, as HTML documents with forms.
