High Availability Choices for OpenView
Mike Peckar, Fognet Consulting
Deploying OpenView to manage highly available
services is one thing. Deploying OpenView as a highly
available service is an entirely different
matter.
Most of today's high-availability solutions are, in
reality, a disappointing set of hacks and workarounds. Under
their covers lurk legacy hardware and operating systems whose
core architectures were never intended for mission-critical
environments. UNIX, NT and most other computing platforms were
all designed only to please some of the people, some of the
time.
OpenView and other enterprise management software are often
purchased to manage highly available enterprise applications.
For better or for worse, the management tools themselves are
also expected to be highly available in support of
increasingly mission-critical IT functions. Sometimes there
are good justifications for this; sometimes there are not (see
"The Availability Management Paradox" at the end of this
article). This article will focus on
deploying OpenView Operations (OVO) and Network Node Manager
(NNM) as highly available applications.
NNM and OVO provide different capabilities for managing
highly available applications and services. OVO ships with
templates to help manage many high-availability software and
hardware elements, and meshes very well into highly available
enterprises. NNM by itself can be problematic in its support
for managing highly available networks.
To give a brief example, NNM has difficulty handling
routers that are configured for the Hot Standby Router
Protocol (HSRP). By default, NNM will continually discover and
then remove the backup HSRP interface from the NNM topology.
NNM provides the -D switch for netmon to address this, but
setting the switch blocks discovery and mapping of the
offending backup interfaces entirely.
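For reference, netmon's startup options are normally applied
by editing its local registration file and re-registering it
with ovaddobj. This is only a sketch, assuming a default NNM
install on UNIX; paths and the exact .lrf field layout vary by
version:

    # Sketch: add -D to netmon's startup options (NNM on UNIX).
    # Paths assume a default install; check your version's docs.
    ovstop netmon                # halt the discovery/status daemon
    cd /etc/opt/OV/share/lrf     # local registration file location
    vi netmon.lrf                # append -D to the options field
    ovaddobj netmon.lrf          # re-register netmon with new options
    ovstart netmon               # restart netmon with -D in effect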
There
is no holy grail when it comes to managing highly available
enterprises. Still, OpenView can help.
Basic NNM provides a
very solid and reliable internal SNMP trap transport, and good
support for more reliable network management with SNMPv3 and
SNMPv2c. But NNM was not designed to manage non-SNMP
mission-critical networks.
Fortunately, the hundreds of available HP and third-party
add-ons to NNM can fill in these gaps. Network Node Manager
Extended Topology (NNM ET), for example, provides excellent
handling and mapping of HSRP routers.
Deploying OpenView as a highly available service requires
understanding OpenView’s capabilities and mapping them to
carefully established organizational needs. When implementing
an OpenView system to increase availability, three
technologies should be considered: clustering, OpenView’s
built-in distributed features and highly available hardware.
It is important to note that high-availability
implementations can and do fail. Sometimes this is because the
technology is immature or too complex. Sometimes it is because
the administration and deployment resources required are
grossly underestimated. In most cases, though,
high-availability deployments of OpenView fail because the
specific goals are defined too narrowly and without
consideration of broader technology, integration and
organizational issues.
These
days, the road to making any application a highly available
service is fraught with potholes, unmarked forks and soft
shoulders.
Clustering
Clustering, be it
Windows NT or UNIX-based, is the most popular choice for
assuring high-availability for server-based applications
running under those operating systems. However, clustering's
core technology is an awkward approach to attaining
high-availability in operating systems that were never
intended to be mission-critical.
Both OVO and NNM are fully supported in HP-UX clusters
based on MC/ServiceGuard, HP's clustering product. OVO Version
7 is also supported on Sun's Sun Cluster solution and with
Veritas' VCS for Sun-based systems. NNM is not explicitly
supported in Sun-based clusters, but will work. All of these
clustering products are difficult to set up, easy to break and
require considerable ongoing administrative overhead.
It is very important to confirm support for the desired
flavor of OS, platform and OpenView software. (For example,
NNM can be made to work under Windows 2000 Datacenter-based
clusters, but is not supported there.) One must also carefully
consider third-party add-on products. Many are not supported
in clustered environments, and some of the more tightly
integrated add-ons may actually cause unrecoverable cluster
fail-over conditions when installed.
In a clustered environment, OpenView administrative
commands such as ovstart must be used differently.
Issuing normal control commands to a clustered installation of
OpenView could also inadvertently force a cluster fail-over,
often into an inoperable state. Very knowledgeable and highly
competent system administration is critical for any
cluster-based highly available service, and OpenView is no
exception.
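As an illustration, under MC/ServiceGuard the safer pattern is
to drive a clustered OpenView instance through the cluster's
package commands rather than calling ovstop and ovstart
directly. The package name below is hypothetical:

    # Sketch: controlling clustered OpenView via ServiceGuard.
    # "ovpkg" is a hypothetical package name; substitute your own.
    cmviewcl -v                # see which node runs the package
    cmhaltpkg ovpkg            # halt OpenView through the cluster,
                               # so the stop is not seen as a failure
    # ...perform maintenance (a bare ovstop here could otherwise
    # trigger an unwanted fail-over)...
    cmrunpkg -n nodeA ovpkg    # restart the package on a chosen node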
The most compelling argument for deploying OpenView in a
clustered environment is the potential increase in server
availability. If server-based features such as operator
access, data collection and server-based polling for status
are important, this solution makes sense. Another very
compelling argument for clustering is that certain software
element failures can be trapped and recovered from. A database
service failure, for example, can be detected and potentially
recovered within a clustered environment.
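A minimal sketch of how such detection is typically wired up:
a ServiceGuard package can define a monitor service whose exit
triggers fail-over. The script below is hypothetical and
assumes the database listener process is named tnslsnr:

    #!/bin/sh
    # Hypothetical ServiceGuard service monitor: while this loop
    # runs, the package is healthy; exiting triggers fail-over.
    while true
    do
        # "[t]nslsnr" keeps grep from matching its own entry
        if ! ps -ef | grep '[t]nslsnr' > /dev/null
        then
            exit 1    # listener gone: signal failure to the cluster
        fi
        sleep 30      # poll interval in seconds
    done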
A plane with two engines can still crash. Clustering adds
layers of complexity that could actually be a source of
service level degradation if not very tightly administered.
This added benefit of software element failure-detection is
limited to server-based application elements, however. The
integrity of data derived off-server, such as NNM status
polling data, is not reliably assured with clustering. If the
goal is timely and accurate status of managed objects, placing
NNM into a cluster will only assure the availability of the
status-polling daemon, not the integrity of the data gathered
by the status-polling daemon.
Distributed Computing (DIDM)
Both NNM and OVO provide separate distributed manager-to-
manager architectures that are quite flexible and robust.
Certain DIDM configurations can provide some, but not all, of
the benefits of clustering, and can help assure server, agent
and network-based application availability.
In addition, with DIDM the systems can be geographically
separated. The systems need not run the same versions of
operating system or the same versions of OpenView (check
support matrices). While NNM distribution features are a bit
more limited, distributed OVO servers can be separated across
NAT and firewall boundaries.
There are important differences between DIDM and clustering
in how the OpenView servers are addressed. With clustering,
floating IP addresses are aliased to the same "package" name.
This allows operators to connect to the same management server
by the same name regardless of which hardware the server
software runs on. This is not so under DIDM. Operators must
have processes or special scripts in place to determine which
server to attach to and when. With NNM, SNMP agents may need
to be configured to report to separate management servers.
With OVO, agents need to be configured to switch management
servers.
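Under OVO's manager-of-managers features, for instance, the
switch is typically driven from the server with opcragt. This
sketch assumes a responsible-managers template is already
distributed to the node, and "webnode1" is a hypothetical
managed node:

    # Sketch: making this OVO server a node's primary manager.
    opcragt -primmgr webnode1   # redirect the agent to this server
    opcragt -status webnode1    # verify the agent took the change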
Spread the wealth – distributing OpenView provides a unique
set of options and high-availability scenarios that can serve
as a viable alternative to clustering.
DIDM provides additional scalability and the potential for
some very creative applications of highly available OpenView.
One such scenario could be a dual-role production/primary
system and development/backup system. Using DIDM to establish
a "hot standby" is a fairly common practice. In this specific
case it may be possible to negotiate the additional OpenView
license required under the same highly discounted terms as
those for obtaining additional OpenView licenses to support
clustering.
As mentioned above, DIDM provides more opportunity to
assure non-server-based application data reliability. SNMP
events, for example, can be collected by multiple managers and
cross-correlated to decrease the incidence of dropped events.
Enablement of such high-availability features, however,
typically requires significant implementation effort.
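As one illustration, NNM forwards selected events between
collection stations and central managers via per-event FORWARD
entries in trapd.conf, normally maintained with the xnmtrap
GUI rather than by hand. A hedged sketch of such a fragment,
with the event OID elided since syntax varies by NNM version:

    # Fragment of $OV_CONF/C/trapd.conf (edit via xnmtrap).
    # FORWARD sends matching events to listed remote managers.
    EVENT OV_Node_Down <event-oid> "Status Events" Warning
    FORWARD %REMOTE_MANAGERS_LIST%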
Thus, the biggest drawback of DIDM is that a higher level
of OpenView competence is required to design, implement and
administer it. On the other hand, DIDM often facilitates
better organizational integration, since its flexibility
allows real IT needs to dictate an architecture that maximizes
specific high-availability requirements.
Highly Available Hardware
The biggest advances in highly available computing of late
have come in hardware options. Vendor offerings are getting
sufficiently sophisticated (and reasonably priced) so as to
nearly eliminate the need to consider clustering at all. The
catch, of course, is that clustering and DIDM solutions can
provide recovery paths for software-element, off-server and
transport-related failures, while pure hardware
high-availability solutions cannot.
Still, hardware-based high-availability should be carefully
considered. A highly available server may cost many times more
than standard server offerings, but the ROI is excellent,
particularly when administrative overhead is considered in
comparison to clustering.
Several vendors offer fully fault-tolerant hardware that
runs native operating systems. Some offer combined
hardware-software solutions that are not based on clustering,
such as Marathon's products or Sun's Netra High Availability
Suite. OpenView may be fully supported under such
configurations, but it is very important to confirm that the
desired combination of software, OS and hardware is supported
in concert.
A compelling offering is Stratus' Continuum line of
servers, which run HP-UX natively. HP OEMs the PA-RISC chip
to Stratus, and Stratus sells this hardware with support for
its proprietary VOS operating system as well as for HP-UX.
OpenView installs and runs on a Continuum server the same way
it installs on any HP 9000 box. This is an excellent way to
achieve server-based availability without the administrative
overhead and application hooks that clustering requires.
Base
the high-availability technologies you choose on the specific
core requirements you have. Don’t let technology drive your
high-availability solution choices.
What to do?
There are compelling
reasons to deploy a combination of DIDM with some form of
clustering or hardware high-availability. There are also
compelling reasons to hold off for more integration maturity.
Indeed, the HP-Compaq merger will play an important
role in HP’s future direction with highly available solutions,
particularly when OpenView is added to the mix.
Much technology lies in the wings. (For example, Compaq
owned Tandem, a full-service enterprise high-availability
provider. It is unclear whether HP will incorporate hardware,
OS and software design elements from Tandem into HP-UX, its
hardware offerings, or OpenView.) HP has announced it will
integrate high-availability features from Compaq's Tru64 and
AlphaServer technology, acquired with Digital Equipment, into
HP-UX and HP's hardware lines. These should provide additional
high-availability choices in 2003 and beyond.
Meanwhile, HP’s competition is also integrating
high-availability features into their products at every level.
Ultimately, it will become easier for applications like
OpenView to meet the high-availability needs of today’s
increasingly mission-critical enterprise environments.
For the time being, the choices remain complex and the
returns on investment questionable. Deploying OpenView as a
highly available service can increase the reliability of
management data, but only when the goals are carefully defined
and mapped to the appropriate technologies, and all pieces of
the enterprise management puzzle are properly fitted together.
The Availability Management Paradox
Often, IT managers
insist their enterprise management software must be made
highly available when its task is to manage highly available
services. They ask, "How can I accurately measure service
levels in the 99.999% range with IT management tools that
themselves may only be available 99.0% of the time?" Fair
question, if not a little naive about how enterprise
management really works.
If server-based polls were the only source of management
data, then yes, the availability of the enterprise manager
would have to be greater than the availability of the managed
objects to accurately measure them. The reality, of course, is
that the best management data comes from polled data in
combination with data stored in remote agents or remote
managers.
The management server’s proper role should be to gather and
correlate management data from multiple sources. Too heavy a
reliance on a monolithic management server’s status polls is a
very dangerous practice. No level of management server
availability can make a tool that relies on the network for
data more accurate.
Indeed, status polling via ICMP, for example, may or may
not return accurate results since many routers and switches
are configured by default to drop ICMP polls if they get busy.
Firewalls and other policy management devices can also block
polls. Desktop users may download and install port-blocking
tools. ICMP is a connectionless protocol – packets may simply
drop under normal loads. The list goes on.
There is a difference between the status of a need-to-know
device and the availability of that device. The former is
exception management and may require management-server
high-availability; the latter is service-level monitoring, and
does not typically require it.
To illustrate, imagine a tool that gathers remote agent
data, then correlates this with availability metrics gathered
from multiple distributed pollers, and then builds its
reports. Hypothetically, this tool could accurately report the
availability of a device that is down less than five minutes
per year and itself only be up five minutes a year!
High-availability improves the ability to perform exception
management. Accurate availability and service level
management, however, requires correlation of data. Correlation
need not be fancy. An action callback script on NNM’s "node
down" event that performs a re-poll, for example, is event
correlation.
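A minimal sketch of such a callback, assuming NNM passes the
node name as the script's first argument (event variable
mappings vary by configuration) and HP-UX ping syntax:

    #!/bin/sh
    # Hypothetical automatic action for NNM's "node down" event.
    NODE=$1
    sleep 10                               # wait out a transient blip
    if ping $NODE 64 3 > /dev/null 2>&1    # HP-UX ping: size, count
    then
        # Node answers on re-poll: log the false alarm
        echo "`date`: $NODE re-poll OK" >> /var/tmp/repoll.log
    else
        # Still down: raise the confirmed alert from here
        echo "`date`: $NODE confirmed down" >> /var/tmp/repoll.log
    fi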
No matter what the tool being used to manage availability
or status, the investment in event correlation is directly
proportional to the reliability of the management data. As
this investment increases, the need for the management tools
to be highly available should decrease. Considering the
expense of placing an application like OpenView NNM into a
high-availability cluster, an investment in event correlation
can save a lot of time, effort, and money.