High Availability Choices for OpenView
Mike Peckar, Fognet Consulting
Deploying OpenView to manage highly available
services is one thing. Deploying OpenView as a highly
available service is an entirely different
matter.
Most of today's high-availability solutions are, in
reality, a disappointing set of hacks and workarounds. Under
their covers lurk legacy hardware and operating systems whose
core architectures were never intended for mission-critical
environments. UNIX, NT and most other computing platforms were
all designed only to please some of the people, some of the
time.
OpenView and other enterprise management software are often
purchased to manage highly available enterprise applications.
For better or for worse, the management tools themselves are
also expected to be highly available in support of
increasingly mission-critical IT functions. Sometimes there
are good justifications for this; sometimes there are not (see
"The Availability Management Paradox" at the end of this
article). This article will focus on
deploying OpenView Operations (OVO) and Network Node Manager
(NNM) as highly available applications.
NNM and OVO provide different capabilities for managing
highly available applications and services. OVO ships with
templates to help manage many high-availability software and
hardware elements, and meshes very well into highly available
enterprises. NNM by itself can be problematic in its support
for managing highly available networks.
To give a brief example, NNM has difficulty handling
routers that are configured for the Hot Standby Router
Protocol (HSRP). By default, NNM will continually discover and
then remove the backup HSRP interface from the NNM topology.
NNM provides the -D switch for netmon to address this, but
setting the switch blocks discovery and mapping of the
offending backup interfaces entirely.
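For reference, netmon's startup options are normally applied
by editing its local registration file and re-registering it
with ovaddobj. This is only a sketch, assuming a default NNM
install on UNIX; paths and the exact .lrf field layout vary by
version:

    # Sketch: add -D to netmon's startup options (NNM on UNIX).
    # Paths assume a default install; check your version's docs.
    ovstop netmon                # halt the discovery/status daemon
    cd /etc/opt/OV/share/lrf     # local registration file location
    vi netmon.lrf                # append -D to the options field
    ovaddobj netmon.lrf          # re-register netmon with new options
    ovstart netmon               # restart netmon with -D in effect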
There
is no holy grail when it comes to managing highly available
enterprises. Still, OpenView can help.
Basic NNM provides a
very solid and reliable internal SNMP trap transport, and good
support for more reliable network management with SNMPv3 and
SNMPv2c. But NNM was not designed to manage non-SNMP
mission-critical networks.
Fortunately, the hundreds of available HP and third-party
add-ons to NNM can fill in these gaps. Network Node Manager
Extended Topology (NNM ET), for example, provides excellent
handling and mapping of HSRP routers.
Deploying OpenView as a highly available service requires
understanding OpenView’s capabilities and mapping them to
carefully established organizational needs. When implementing
an OpenView system to increase availability, three
technologies should be considered: clustering, OpenView’s
built-in distributed features and highly available hardware.
It is important to note that high-availability
implementations can and do fail. Sometimes this is because the
technology is immature or too complex. Sometimes it is because
the administration and deployment resources required are
grossly underestimated. In most cases, though,
high-availability deployments of OpenView fail because the
specific goals are defined too narrowly and without
consideration of broader technology, integration and
organizational issues.
These
days, the road to making any application a highly available
service is fraught with potholes, unmarked forks and soft
shoulders.
Clustering
Clustering, be it
Windows NT or UNIX-based, is the most popular choice for
assuring high-availability for server-based applications
running under those operating systems. However, clustering's
core technology is an awkward approach to attaining
high-availability in operating systems that were never
intended to be mission-critical.
Both OVO and NNM are fully supported in HP-UX clusters
based on MC/ServiceGuard, HP's clustering product. OVO Version
7 is also supported on Sun's Sun Cluster solution and with
Veritas' VCS for Sun-based systems. NNM is not explicitly
supported in Sun-based clusters, but will work. All of these
clustering products are difficult to set up, easy to break and
require considerable ongoing administrative overhead.
It is very important to confirm support for the desired
flavor of OS, platform and OpenView software. (For example,
NNM can be made to work under Windows 2000 Datacenter-based
clusters, but is not supported there.) One must also carefully
consider third-party add-on products. Many are not supported
in clustered environments, and some of the more tightly
integrated add-ons may actually cause unrecoverable cluster
fail-over conditions when installed.
In a clustered environment, OpenView administrative
commands such as ovstart must be used differently.
Issuing normal control commands to a clustered installation of
OpenView could also inadvertently force a cluster fail-over,
often into an inoperable state. Very knowledgeable and highly
competent system administration is critical for any
cluster-based highly available service, and OpenView is no
exception.
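As an illustration, under MC/ServiceGuard the safer pattern is
to drive a clustered OpenView instance through the cluster's
package commands rather than calling ovstop and ovstart
directly. The package name below is hypothetical:

    # Sketch: controlling clustered OpenView via ServiceGuard.
    # "ovpkg" is a hypothetical package name; substitute your own.
    cmviewcl -v                # see which node runs the package
    cmhaltpkg ovpkg            # halt OpenView through the cluster,
                               # so the stop is not seen as a failure
    # ...perform maintenance (a bare ovstop here could otherwise
    # trigger an unwanted fail-over)...
    cmrunpkg -n nodeA ovpkg    # restart the package on a chosen node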
The most compelling argument for deploying OpenView in a
clustered environment is the potential increase in server
availability. If server-based features such as operator
access, data collection and server-based polling for status
are important, this solution makes sense. Another very
compelling argument for clustering is that certain software
element failures can be trapped and recovered from. A database
service failure, for example, can be detected and potentially
recovered within a clustered environment.
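A minimal sketch of how such detection is typically wired up:
a ServiceGuard package can define a monitor service whose exit
triggers fail-over. The script below is hypothetical and
assumes the database listener process is named tnslsnr:

    #!/bin/sh
    # Hypothetical ServiceGuard service monitor: while this loop
    # runs, the package is healthy; exiting triggers fail-over.
    while true
    do
        # "[t]nslsnr" keeps grep from matching its own entry
        if ! ps -ef | grep '[t]nslsnr' > /dev/null
        then
            exit 1    # listener gone: signal failure to the cluster
        fi
        sleep 30      # poll interval in seconds
    done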
A plane with two engines can still crash. Clustering adds
layers of complexity that could actually be a source of
service level degradation if not very tightly administered.
This added benefit of software element failure-detection is
limited to server-based application elements, however. The
integrity of data derived off-server, such as NNM status
polling data, is not reliably assured with clustering. If the
goal is timely and accurate status of managed objects, placing
NNM into a cluster will only assure the availability of the
status-polling daemon, not the integrity of the data gathered
by the status-polling daemon.
Distributed Computing (DIDM)
Both NNM and OVO provide separate distributed manager-to-
manager architectures that are quite flexible and robust.
Certain DIDM configurations can provide some, but not all, of
the benefits of clustering, and can help assure server, agent
and network-based application availability.
In addition, with DIDM the systems can be geographically
separated. The systems need not run the same versions of
operating system or the same versions of OpenView (check
support matrices). While NNM distribution features are a bit
more limited, distributed OVO servers can be separated across
NAT and firewall boundaries.
There are important differences between DIDM and clustering
in how the OpenView servers are addressed. With clustering,
floating IP addresses are aliased to the same "package" name.
This allows operators to connect to the same management server
by the same name regardless of which hardware the server
software runs on. This is not so under DIDM. Operators must
have processes or special scripts in place to determine which
server to attach to and when. With NNM, SNMP agents may need
to be configured to report to separate management servers.
With OVO, agents need to be configured to switch management
servers.
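Under OVO's manager-of-managers features, for instance, the
switch is typically driven from the server with opcragt. This
sketch assumes a responsible-managers template is already
distributed to the node, and "webnode1" is a hypothetical
managed node:

    # Sketch: making this OVO server a node's primary manager.
    opcragt -primmgr webnode1   # redirect the agent to this server
    opcragt -status webnode1    # verify the agent took the change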
Spread the wealth – distributing OpenView provides a unique
set of options and high-availability scenarios that can serve
as a viable alternative to clustering.
DIDM provides additional scalability and the potential for
some very creative applications of highly available OpenView.
One such scenario could be a dual-role production/primary
system and development/backup system. Using DIDM to establish
a "hot standby" is a fairly common practice. In this specific
case it may be possible to negotiate the additional OpenView
license required under the same highly discounted terms as
those for obtaining additional OpenView licenses to support
clustering.
As mentioned above, DIDM provides more opportunity to
assure non-server-based application data reliability. SNMP
events, for example, can be collected by multiple managers and
cross-correlated to decrease the incidence of dropped events.
Enablement of such high-availability features, however,
typically requires significant implementation effort.
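As one illustration, NNM forwards selected events between
collection stations and central managers via per-event FORWARD
entries in trapd.conf, normally maintained with the xnmtrap
GUI rather than by hand. A hedged sketch of such a fragment,
with the event OID elided since syntax varies by NNM version:

    # Fragment of $OV_CONF/C/trapd.conf (edit via xnmtrap).
    # FORWARD sends matching events to listed remote managers.
    EVENT OV_Node_Down <event-oid> "Status Events" Warning
    FORWARD %REMOTE_MANAGERS_LIST%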
Thus, the biggest drawback of DIDM is that a higher level
of OpenView competence is required to design, implement and
administer it. On the other hand, DIDM often facilitates
better organizational integration, since its flexibility
allows real IT needs to dictate an architecture that maximizes
specific high-availability requirements.
Highly Available Hardware
The biggest advances in highly available computing of late
have come in hardware options. Vendor offerings are getting
sufficiently sophisticated (and reasonably priced) so as to
nearly eliminate the need to consider clustering at all. The
catch, of course, is that clustering and DIDM solutions can
provide recovery paths for software-element, off-server and
transport-related failures, while pure hardware
high-availability solutions cannot.
Still, hardware-based high-availability should be carefully
considered. A highly available server may cost many times more
than standard server offerings, but the ROI is excellent,
particularly when administrative overhead is considered in
comparison to clustering.
Several vendors offer fully fault-tolerant hardware that
runs native operating systems. Some offer combined
hardware-software solutions that are not based on clustering,
such as Marathon's products or Sun's Netra High Availability
Suite. OpenView may be fully supported under such
configurations, but it is very important to confirm that the
desired combination of software, OS and hardware is supported
in concert.
A compelling offering is Stratus' Continuum line of
servers, which run HP-UX natively. HP OEMs the PA-RISC chip
to Stratus, and Stratus sells this hardware with support for
its proprietary VOS operating system as well as for HP-UX.
OpenView installs and runs on a Continuum server the same way
it installs on any HP 9000 box. This is an excellent way to
achieve server-based availability without the administrative
overhead and application hooks that clustering requires.
Base
the high-availability technologies you choose on the specific
core requirements you have. Don’t let technology drive your
high-availability solution choices.
What to do?
There are compelling
reasons to deploy a combination of DIDM with some form of
clustering or hardware high-availability. There are also
compelling reasons to hold off for more integration maturity.
Indeed, the HP-Compaq merger will play an important
role in HP’s future direction with highly available solutions,
particularly when OpenView is added to the mix.
Much technology lies in the wings. (For example, Compaq
owned Tandem, a full-service enterprise high-availability
provider. It is unclear whether HP will incorporate hardware,
OS and software design elements from Tandem into HP-UX, its
hardware offerings, or OpenView.) HP has announced it will
integrate high-availability features from Compaq's Tru64 and
AlphaServer technology, acquired with Digital Equipment, into
HP-UX and HP's hardware lines. These should provide additional
high-availability choices in 2003 and beyond.
Meanwhile, HP’s competition is also integrating
high-availability features into their products at every level.
Ultimately, it will become easier for applications like
OpenView to meet the high-availability needs of today’s
increasingly mission-critical enterprise environments.
For the time being, the choices remain complex and the
returns on investment questionable. Deploying OpenView as a
highly available service can increase the reliability of
management data, but only when the goals are carefully defined
and mapped to the appropriate technologies, and all pieces of
the enterprise management puzzle are properly fitted together.
The Availability Management Paradox
Often, IT managers
insist their enterprise management software must be made
highly available when its task is to manage highly available
services. They ask, "How can I accurately measure service
levels in the 99.999% range with IT management tools that
themselves may only be available 99.0% of the time?" Fair
question, if not a little naive about how enterprise
management really works.
If server-based polls were the only source of management
data, then yes, the availability of the enterprise manager
would have to be greater than the availability of the managed
objects to accurately measure them. The reality, of course, is
that the best management data comes from polled data in
combination with data stored in remote agents or remote
managers.
The management server’s proper role should be to gather and
correlate management data from multiple sources. Too heavy a
reliance on a monolithic management server’s status polls is a
very dangerous practice. No level of management server
availability can make a tool that relies on the network for
data more accurate.
Indeed, status polling via ICMP, for example, may or may
not return accurate results since many routers and switches
are configured by default to drop ICMP polls if they get busy.
Firewalls and other policy management devices can also block
polls. Desktop users may download and install port-blocking
tools. ICMP is a connectionless protocol – packets may simply
drop under normal loads. The list goes on.
There is a difference between the status of a need-to-know
device and the availability of that device. The former is
exception management and may require management-server
high-availability; the latter is service-level monitoring, and
does not typically require it.
To illustrate, imagine a tool that gathers remote agent
data, then correlates this with availability metrics gathered
from multiple distributed pollers, and then builds its
reports. Hypothetically, this tool could accurately report the
availability of a device that is down less than five minutes
per year and itself only be up five minutes a year!
High-availability improves the ability to perform exception
management. Accurate availability and service level
management, however, requires correlation of data. Correlation
need not be fancy. An action callback script on NNM’s "node
down" event that performs a re-poll, for example, is event
correlation.
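A minimal sketch of such a callback, assuming NNM passes the
node name as the script's first argument (event variable
mappings vary by configuration) and HP-UX ping syntax:

    #!/bin/sh
    # Hypothetical automatic action for NNM's "node down" event.
    NODE=$1
    sleep 10                               # wait out a transient blip
    if ping $NODE 64 3 > /dev/null 2>&1    # HP-UX ping: size, count
    then
        # Node answers on re-poll: log the false alarm
        echo "`date`: $NODE re-poll OK" >> /var/tmp/repoll.log
    else
        # Still down: raise the confirmed alert from here
        echo "`date`: $NODE confirmed down" >> /var/tmp/repoll.log
    fi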
No matter what the tool being used to manage availability
or status, the investment in event correlation is directly
proportional to the reliability of the management data. As
this investment increases, the need for the management tools
to be highly available should decrease. Considering the
expense of placing an application like OpenView NNM into a
high-availability cluster, an investment in event correlation
can save a lot of time, effort, and money.