The Low Down on a Node Down:

Improving on NNM’s Default Status Event Configuration Settings

 

White paper by Mike Peckar

 

Seasoned users of HP OpenView Network Node Manager (NNM) know that interpreting NNM’s status events can be problematic. Default event logging behaviors don’t provide complete information.  The node down and node up events in particular are often misinterpreted, and to the uninitiated they can seem to act in unpredictable and inconsistent ways.  This paper focuses on understanding these events and remedying a couple of the most common pitfalls encountered with them.  Specific suggestions for event configuration customizations, along with a useful Perl script, are provided. These will reduce confusion for NNM users, improve intelligence about the status of the network, and demonstrate the power and features of basic event configuration in NNM.  This paper is written for NNM users who are new to the product or uncomfortable with customizing NNM.

 

NNM’s default event settings are to blame for some of the problems encountered.  NNM ships with configuration settings designed to work best for the widest base of HP’s customers: in essence, the settings favor LAN management, that is, networks comprised primarily of computer systems. Management of network connector devices is de-emphasized in the event settings. Also, the text displayed by the events themselves, for example “node down”, can be a very poor choice of words to describe the true meaning of the event.  A closer and more detailed look at these events and where they come from reveals these problems more clearly.

 

In essence, NNM discovers interfaces and then polls them regularly for status via ICMP.  Note that because devices, whether they have SNMP agents or not, typically don’t send events when someone pulls the plug, it becomes NNM’s job to determine a device’s status and report it. When a device becomes unreachable, NNM generates internal events that update icon status colors on the NNM maps. The node down status event is just one of these internally generated events.  It is displayed in the event browser along with SNMP traps from all sources. The node down and node up events are thus generated by NNM itself.
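
To make the polling model concrete, below is a minimal, illustrative Perl sketch of an ICMP-style reachability check against a single interface address. This is not NNM’s netmon implementation, and the address shown is hypothetical; it simply shows the kind of per-interface yes/no answer that drives the internal status events.

#!/opt/OV/bin/Perl/bin/perl
# Illustration only: a per-interface ICMP reachability check, not NNM's poller.
use Net::Ping;
my $addr = "192.0.2.1";              # hypothetical interface address
my $p = Net::Ping->new("icmp");      # raw ICMP usually requires privileges
printf("%s %s\n", $addr,
       $p->ping($addr, 2) ? "responds to status poll" : "is unreachable");
$p->close();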

 

Confusion over the difference between internally generated events and SNMP traps often arises because the source of an NNM status event is set to the device to which it refers, not to NNM itself, making it appear that the node was the source of the event. Also, NNM events and SNMP traps are displayed together in the same browser and are indistinguishable from each other in appearance.  In fact, SNMP traps are reformatted by NNM into NNM events with attributes not defined under the SNMP protocol, such as severity, automatic action, and logging behavior.

 

Further confusion comes from the LAN management bias mentioned above. The internal node down and node up events are not triggered directly by the results of status polls. They are, in fact, internal events triggered by other NNM internal events: the interface down and interface up events.  This is because NNM’s ICMP status polls are issued to individual interfaces, not to whole nodes.  Unfortunately, though, the events displayed in the event browser relate to whole nodes. This is because the nature of SNMP is such that there is a one-to-one correspondence between an SNMP agent and an individual node.

 

NNM’s internal interface down and interface up events are by default configured as “Log-Only” events. This means they are generated and placed in the event database, but never displayed in the event browser where NNM users can see them. Events that trigger the severity color changes on the map are similarly configured as “Log-Only” events by default. The net effect of this behavior is that the colors on the map change to reflect the different states of interfaces, devices, connectors, network symbols, and so on, but only those events that relate to whole nodes are passed to the event browser.
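
One way to confirm that these Log-Only events really are being generated is to dump the event database to plain text and look for them; in most NNM releases the ovdumpevents utility can do this (the exact invocation and output format vary by release):

ovdumpevents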

 

For those managing mostly single-interface nodes, this default status event logging behavior is a good thing: one node, one event. But for those who are more interested in the status of their network’s connector devices, it is not a good thing at all: individual interfaces may go down and come up, and an associated event may or may not appear in the NNM event browser, depending on the circumstances. A node down event is only generated when all the interfaces known in the NNM topology database are of critical status. A node up event is similarly only generated when all the interfaces in the NNM topology database are of normal status. The node down and node up events are the only status events displayed in the event browser by default.
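
The derivation can be thought of as a simple aggregation over per-interface status. The Perl fragment below is only a sketch of that logic, with made-up interface names; it is not NNM code:

# Illustrative only: node-level status derived from per-interface status.
my %if_status = ( 'hme0' => 'Critical', 'hme1' => 'Normal' );   # hypothetical
my $down = grep { $if_status{$_} eq 'Critical' } keys %if_status;
if    ( $down == keys %if_status ) { print "Node down\n"; }   # all interfaces critical
elsif ( $down == 0 )               { print "Node up\n"; }     # all interfaces normal
else                               { print "No node-level event\n"; }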

 

Clearly, “node down” is a poor choice of words given the actual behavior described above. A more accurate text for node down would be: “all discovered, pollable interfaces for this node do not respond to status polls”.  More concise, perhaps, is “All node’s interfaces unreachable”.  Note how the change in wording for this event reveals the bias towards LAN management mentioned above. Where the events come predominantly from nodes that are singly-homed, the original wording makes more sense. The suggested wording above shifts the bias 180 degrees, from single-interface nodes to multi-homed nodes. This may seem a simple matter of semantics, but this one event’s text is very often a big source of confusion and misinterpretation.

 

The closer one looks at NNM status events as they relate to nodes with multiple interfaces, the more the importance of the node down and node up events diminishes. This is good, since NNM users typically over-inflate the importance of these two events. As seen above, the node up and node down events are derived from other internal events, whereas the interface down and interface up events come directly from the results of status polls.  The section above dealt with the internal behavior of the status events. In this section, external factors, which also add further confusion and opportunities for misinterpretation, are discussed.

 

If a multi-homed node has an interface go down, a node down event may be generated even if other interfaces are operational. This will be the case if the interface that went down is the one connecting toward the NNM server and represents the only route between the NNM server and the node’s other interfaces. Here, the topology of the device and its placement in the network affect the behavior of NNM status events.

 

Consider this example: a remote ISDN router has a backup interface. The backup interface is always down unless the primary interface goes down. NNM knows about the backup interface because the SNMP MIB-II interface table showed it was there upon discovery, and thus placed it on the NNM map with at least one interface at a critical status.  Suppose the primary interface goes down and the backup interface comes up. The resulting NNM events can be unpredictable, and depend on a combination of internal and external factors.

 

In this example, the node down event will be generated only if NNM thinks both interfaces are down. Timing comes into play here: the timing (and order) of the status polls to each interface, combined with the way the router responds, could result in NNM detecting either both interfaces down, or one interface down and one interface up. In the former case, a node down event will be generated; in the latter case it will not.  The same situation applies to the node up event because, once again, a node up is only generated when all interfaces are flagged as reachable.

 

The most commonly seen scenario in the real world is that the node down is reported, but a node up event is never generated when the router recovers. Backup interfaces, as in this example, throw a monkey wrench into the normal operational behavior of NNM’s status polling engine. There are several other such monkey wrenches, for example DHCP.  NNM’s workaround for these is to place the IP addresses of interfaces to exclude from discovery in the $OV_CONF/netmon.nodiscover file. Other problems, such as those caused by HSRP, are handled in different ways, and newer versions of NNM are becoming more flexible in their handling of anomalous behaviors.
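
The netmon.nodiscover file itself is simply a list of addresses that netmon should leave alone, typically one IP address per line. A hypothetical example (the addresses here are invented for illustration) might look like:

192.0.2.33
192.0.2.34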

 

In the example above, and indeed in all cases for multi-homed nodes, logging of the hidden interface down and interface up events would clear up much of the confusion created by the nature of the node up and node down events with respect to multi-homed nodes.  As was the case with changing the text of the node status events above, changing the logging behavior for events is also very easy.  Unfortunately, however, an immediate problem arises if interface events are logged: NNM users are now faced with a shadow interface down or up status event for every node down or up status event relating to single-interface nodes.   One could stop logging the node down and node up events, but then users would not be able to tell whether all of a node’s interfaces were reachable or not.

 

In general, NNM’s default behavior of not reporting interface down events for important multi-homed devices comes as a surprise and a disappointment to most network managers.  The problem of duplicate events after making interface status events visible in the event browser can be addressed with the pairwise circuit of NNM’s embedded event correlation engine, but this approach has its own drawback: the last interface down event and interface up event will always be suppressed, which is less desirable.

 

As an aside, it is important to note that interface status can also be derived by NNM from SNMP MIB-II interface tables. Non-IP interface statuses such as administratively down or operationally up affect NNM status propagation and the generation of the node down and node up events, per settings defined in the $OV_CONF/netmon.statusMapping configuration file.

 

A simpler and more manageable customization to the NNM event configuration is to prevent logging of both the node event and the interface event only for single-interface nodes. This is achieved by first setting the default interface down and interface up events to log to the event browser, then making a copy of those events and setting the copies to not log if the source is a singly-homed node. This may sound complicated and difficult to maintain, but it is not and, as explained below, can be completely automated.

 

The manual method for this event customization takes advantage of the node sources field in the modify event window under event configuration.  The node names can be entered in three ways. First, they can be typed in by hand, with spaces separating each entry. Second, NNM’s find option can be used to search the NNM topology by node capability for nodes with single interfaces; the option within the find dialog to “select highlighted nodes” will then allow those nodes to be placed in the node sources list in the modify event window with the “add from map” button.

 

The third and most powerful method of specifying sources is to use an external file containing a list of node sources. This requires only that the file’s path and name be entered in the node source field, for example on NT:

 

c:\openview\conf\OneIfHosts
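
The file itself is nothing more than a flat list of node names, one per line, as NNM knows them (that is, their selection names in the topology). The names below are purely hypothetical:

printserver1.example.com
pc-finance-12.example.com
webserver3.example.com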

 

This method is more flexible because it is easier to maintain, dynamically updateable, and more scalable. As above, the file can be populated from the results of a capabilities-based find from the menu (version 6+ allows saving of search results to a file). Command-line tools like ovobjprint can also be used to populate this file.  An example Perl script to do this is listed at the end of this paper.
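
For instance, the same object database query the script performs can be run by hand to seed the file. The invocation below is the one used by the script at the end of this paper; note that its raw output (object IDs plus quoted selection names) still needs to be trimmed down to bare node names, which is what the script’s pattern match does:

ovobjprint -a "Selection Name" "TopM Interface Count"=1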

 

To populate this file automatically, use the supplied Perl script as an automated action on the node added event. The script calls the object database perusal tool ovobjprint to check whether the node is singly-homed, then writes the node name to an external file that can be used as the set of sources for the log-only copies of the interface down and interface up events. Note the script has two “modes” of operation: one designed to build the external node source file from scratch, and one for updating the external file automatically as the result of the node added event.

 

For the script to be added as an action to the node added event, the host name is passed into the script as an argument, which happens to be the second variable binding of the node added event. Use one of the examples below as the action for the node added event, depending on the platform of the NNM installation:

 

On NT:    OVHIDESHELL C:\\OpenView\\bin\\OneIfHosts.ovpl $2

On UNIX:  /opt/OV/bin/OneIfHosts.ovpl $2

 

 

With the Perl script in place, interface down and interface up events will now be logged, but only for multi-homed nodes.  Combine this with a few simple changes to the text of the node down and node up events, and the understanding and interpretation of NNM’s status events will be greatly improved. Below is a summary of all the customizations suggested above. Note that the “$7” referred to in the interface event text below is the name of the interface, for example hme0. Please remember to properly back up your NNM installation, and particularly the $OV_CONF/C/trapd.conf file, before and after making customizations.
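
For example, a simple copy of the event configuration file before and after editing is usually sufficient; adjust the paths to match your installation:

On UNIX:  cp $OV_CONF/C/trapd.conf $OV_CONF/C/trapd.conf.pre-edit
On NT:    copy C:\openview\conf\C\trapd.conf C:\openview\conf\C\trapd.conf.pre-edit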

 

 

Default Status Events

 

Event Name        Event text      Default Logging Behavior    Sources
----------------------------------------------------------------------
OV_Node_Down      Node down       Status Alarm                All
OV_Node_Up        Node up         Status Alarm                All
OV_IF_Down        If $7 down      Log-Only                    All
OV_IF_Up          If $7 up        Log-Only                    All
OV_Node_Added     Node added      Configuration Alarm         All

 

 

Suggested Changes to Default Status Events

 

Event Name          Event text           Logging Behavior      Sources       Action
--------------------------------------------------------------------------------------
OV_Node_Down        Node Unreachable     Status Alarm          All
OV_Node_Up          Node Reachable       Status Alarm          All
OV_IF_Down          If $7 Unreachable    Status Event          All
OV_IF_Down_OneIf    If $7 Unreachable    Log-Only              OneIfHosts
OV_IF_Up            If $7 Reachable      Status Event          All
OV_IF_Up_OneIf      If $7 Reachable      Log-Only              OneIfHosts
OV_Node_Added       Node added           Configuration Alarm   All           OneIfHosts.ovpl

 

 

 

 

OneIfHosts.ovpl

 

 

Cut here

#!/opt/OV/bin/Perl/bin/perl
#
# Written by Mike Peckar, Fognet Consulting ov@fognet.com, unsupported
#
# 1. Creates or recreates an external list of all NNM nodes that have
#    only one interface. Use this list as the set of node sources for a
#    copy of the OpenView interface down/up events. Invoke this script
#    with the single argument "ALL" for this behavior.
# 2. Appends a single node to the above-mentioned list. Intended to be
#    used as an automatic action with the node added event to keep the
#    list updated when new nodes are discovered. For example, on
#    NT/2000, the syntax for the automatic action is:
#      OVHIDESHELL C:\\OpenView\\bin\\OneIfHosts.ovpl $2
#    Note this requires proper ovactiond trust. Create a trusted command
#    under the $OV_CONF/TrustedCmds.conf directory or touch ALLOW_ALL.
#
# Set the name of the file holding names of singly-homed hosts here:
$hfile="$OV_CONF/OneIfHosts";

# Works in NNM 6.2 and after; for NNM versions prior to 6.2,
# modify the script to define the OV environment variables directly:
use OVvars;

# Usage message:
$NAME=$0; $NAME =~ s/.*\///; $NAME =~ s/.*\\//;
if ( scalar(@ARGV) != 1 ) {
    printf ("\nUsage 1: $NAME ALL\n");
    printf ("         Dump all singly-homed node names to $hfile\n");
    printf ("\nUsage 2: $NAME <Node Name>\n");
    printf ("         If singly-homed, append node name to $hfile\n");
    exit 10;
}

if ( $ARGV[0] =~ /^[aA][lL][lL]$/ ) {
   # List all nodes with just one interface:
   $cmd="$OV_BIN/ovobjprint -a \"Selection Name\" \"TopM Interface Count\"=1";
   open( IN, "$cmd |");
   while (<IN>) {
      # Keep only the quoted selection name from each matching output line:
      $matches .= "$1\n" if m/^\s*\d*\s+["](.*)["]$/;
   }
   close( IN );
   # Strip the final newline so the file does not end with a blank line:
   $matches =~ s/^(.*).$/$1/s;
   open ( OUT, ">$hfile");
   print OUT $matches;
   close( OUT );
} else {
   # Determine the number of interfaces for the named node:
   $node = $ARGV[0];
   $cmd="$OV_BIN/ovobjprint -a \"TopM Interface Count\" \"Selection Name\"=$node";
   open( IN, "$cmd |");
   while (<IN>) {
      ($f1,$f2) = split(' ', $_);
      # Append the node only if its interface count is exactly 1:
      if ( $f1 =~ /^\d+$/ && $f2 eq "1" ) {
         open ( OUT, ">>$hfile");
         print OUT "\n$node";
         close( OUT );
      }
   }
   close( IN );
}
# End of file