XenServer – Octoblu – SNMP Integration: The Sum of the Parts

One of the key themes of Citrix Synergy 2017 was integration. Citrix products are designed to work better together, and Octoblu is no exception. The Internet of Things (IoT) offers much more than that, though: it can serve not only as the glue among Citrix products, but also as the stepping stone to practically anything else one might want to work together with these products.

The mechanism I put together goes beyond just XenServer, which serves as one component to demonstrate how a Citrix product can be incorporated into automation. It is extensible to many other Citrix products and can pull in external metrics, such as SNMP and environmental data, and just about anything else one might imagine. The goals here were to:

Integrate Citrix XenServer with Octoblu
Provide the means to interactively control XenServer
Allow for batch jobs to be defined and executed
Enable server HW monitoring for proactive actions/DR
Integrate control over XenServer functions with external input
Define smart actions and informative reporting/logging

The implementation was done by integrating XenServer “xe” commands into the Octoblu flow, which in turn mostly makes use of the Connector Shell, f(x), and the Compose Tool. Others, such as Eric Haavarstein and Dave Brett (using PowerShell connectivity) and James Bulpin (using APIs), have also tied XenServer to Octoblu. I took a different approach, relying on standard and long-supported “xe” commands, which many system administrators are already familiar with and use as a basis for scripts. This way, I only needed a connector, which could be created by leveraging the existing Shell Tool and the standard open utilities provided by the PuTTY suite: putty, plink and pageant. This, in turn, also provided a secure, encrypted, ssh-based connection using ssh keys.

The result is that anything that can be done via the CLI using “xe” or, in fact, any bash batch job can be run as a script. The area where such scripts can be run is severely restricted, of course, and secure connections are put in place only on those servers that need them. Control and access are via one server with access restricted to specific NFS shares.
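As a rough illustration of the connector idea, the sketch below builds a plink command line that runs a single “xe” command on a XenServer host. It is written in Python rather than taken from the actual flow, and the function name, the "root" login and the optional key file are assumptions made for the example. With Pageant already holding the key, no key file needs to be passed.

```python
from typing import List, Optional


def build_plink_command(host: str, remote_command: str, user: str = "root",
                        key_file: Optional[str] = None) -> List[str]:
    """Build an argv list that runs one remote command through plink.

    Hypothetical helper: the default user and the key handling are
    assumptions, not details taken from the original setup.
    """
    # -ssh forces the SSH protocol, -batch disables interactive prompts
    argv = ["plink", "-ssh", "-batch"]
    if key_file is not None:
        # Use an explicit private key instead of relying on Pageant
        argv += ["-i", key_file]
    argv += [f"{user}@{host}", remote_command]
    return argv


# The resulting list can be handed to subprocess.run() on a machine with plink
command = build_plink_command("xstest1.myorg.com",
                              "xe vm-list params=name-label,power-state")
print(command)
```

Passing the command as a list rather than a single string avoids shell quoting problems on the machine that launches plink.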

For the SNMP integration with Dell servers, after some experimentation, I settled on the open source snmpget and snmpwalk utilities instead of Dell’s racadm and the internal SNMPv3 options. Note that this works only with iDRAC 7 or newer controllers. Using racadm, it was possible to identify the tokens used to extract various metrics.

A key suggestion came from Citrix’ Director of IoT, Chris Matthieu, who gave the sound advice to do as much processing as possible external to Octoblu and only pass it data when needed. An external Linux box, therefore, was set up to do the SNMP monitoring of multiple XenServer hosts and also to do the analysis and perform the decision-making processes as to what events require actions and hence need to be relayed to Octoblu. This was done with bash and perl scripts.

Finally, when certain action items are identified, they are paired with specific commands that execute pre-determined actions, such as evacuating a host and shutting it down, creating logs, and informing the necessary recipients as to what actions were taken.

Communication

The communication among the various components is accomplished through remote shell operations over ssh, using the open-source PuTTY utilities plink, puttygen and pageant. Slack is also integrated. Octoblu is the central agent responsible for data exchange and for launching action items. The diagram below shows the general interoperation among the various components:

The script syntax consists of the batch token “XESCRIPT”, followed by a XenServer host name or IP address, the name of the script, and one or more optional arguments. The fields up to and including the script name are separated by “#” signs:

#XESCRIPT #XS-hostname|IP-address #script-name <ARG1> <ARG2> …

Below is an example:

#XESCRIPT #xstest1.myorg.com #vm-list.sh name-label=TST-ubuntu12-2 params=memory-actual,start-time

Scripts can only be run under a specific subdirectory, hence no need for a path. These scripts can be launched from within an Octoblu flow using a Connector Shell or via a secured Slack channel thread.
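To make the syntax concrete, here is a minimal Python sketch of how such a line could be split into its parts. It is not the parser used in the actual flow. The names XeScriptRequest and parse_xescript are invented for the example, and rejecting any script name that contains a “/” is an assumption based on the rule that scripts live in one fixed subdirectory. The sketch tolerates an optional leading “#” and whitespace around the separators.

```python
from typing import List, NamedTuple


class XeScriptRequest(NamedTuple):
    host: str
    script: str
    args: List[str]


def parse_xescript(line: str) -> XeScriptRequest:
    """Split an XESCRIPT line into host, script name and arguments."""
    # Drop an optional leading '#', then split into at most three fields
    fields = [f.strip() for f in line.strip().lstrip("#").split("#", 2)]
    if len(fields) != 3 or fields[0] != "XESCRIPT":
        raise ValueError("not an XESCRIPT line")
    host, remainder = fields[1], fields[2].split()
    if not host or not remainder:
        raise ValueError("host and script name are required")
    script, args = remainder[0], remainder[1:]
    # Assumption: no path component is ever allowed in the script name
    if "/" in script:
        raise ValueError("script name must not contain a path")
    return XeScriptRequest(host, script, args)


request = parse_xescript(
    "#XESCRIPT #xstest1.myorg.com #vm-list.sh "
    "name-label=TST-ubuntu12-2 params=memory-actual,start-time"
)
print(request.host, request.script, request.args)
```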

SNMP Monitoring – Why?

There are a number of reasons to gather and process SNMP information, primarily because it can all be used in deciding whether a system is experiencing problems. If a condition is getting worse, there is a chance of catching it before a failure occurs.

Things just fail, and HA/DR isn’t always proactive
Better to detect issues before they become big problems
Many servers already have built-in SNMP capabilities
SNMP can collect hundreds of useful metrics – you can choose which
Possible to monitor numerous servers centrally, easily and fast
Octoblu integrates with many devices (temp, water, fire, break-in, etc.)
Many cases can be handled automatically – no sys admin needed
Contact only those who need to know with the right level of information!
Groups of contacts can include sys admins, directors, security officers, building managers, etc.
Means of contact can include Email, Slack, SMS, alarms, carrier pigeon (*), etc.

Keeping such options and possibilities in mind, the goals then become:

Catch issues before they become major problems
Let people know ahead of time what’s going on
When possible, create pre-designed action items that can take care of things on their own, such as evacuating virtual machines (VMs) from a server and shutting it down, custom load balancing, bringing up services at an alternate location, etc. If unclear what to do, leave it up to people to decide!

You can then customize scripts to perform specific actions:

For emergency issues, take action and notify at a priority level
When less urgent events take place, notify a subgroup with less urgency (e.g., Email instead of a text)
Avoid the “3 AM wake-up call” by deciding what can be done and how urgent informing people is
If the data center can take care of things all by itself, why wake anyone up?
Logging of events will allow better identification of problem areas and better ways of dealing with them
With data synchronization and multiple servers, easy to create duplicate services
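The idea of matching the means of contact to the urgency can be sketched as a small routing table. The Python below is only an illustration: the numeric cut-offs, the channel names and the function channels_for are assumptions, not the configuration used in the actual scripts.

```python
from typing import List, Tuple

# Hypothetical routing table: (minimum severity level, channels to use).
# The cut-off values are assumptions chosen for the example.
ROUTES: List[Tuple[int, List[str]]] = [
    (100, ["sms", "slack", "email"]),  # emergencies: wake someone up
    (50, ["slack", "email"]),          # less urgent: no text message
    (0, ["email"]),                    # informational only
]


def channels_for(level: int) -> List[str]:
    """Return the notification channels for a given severity level."""
    for minimum, channels in ROUTES:
        if level >= minimum:
            return channels
    return []


print(channels_for(100))
print(channels_for(20))
```

Keeping the table as data makes it easy to adjust who is contacted, and how, without touching the logic.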

Here is an example of an snmpget command that extracts the temperature of CPU1. The value is reported multiplied by 10, so the actual temperature retrieved is 73.0 °C, as seen in the output:

It is possible to use basic “snmpget” commands to gather metrics from one or more hosts and even use the internal warning and limit thresholds to compare against the actual measured values.

These in turn can be used to evaluate the severity and what should be undertaken (informational as well as action items).
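A minimal Python sketch of that comparison follows. The function names and the warning threshold of 420 are assumptions made for illustration. The values 520 and 470 are those of the inlet temperature example used later in this article, and all readings are in tenths of a degree, as SNMP reports them.

```python
def tenths_to_celsius(raw: int) -> float:
    """Convert a reading reported in tenths of a degree to degrees Celsius."""
    return raw / 10.0


def classify_reading(measured: int, warning: int, critical: int) -> str:
    """Compare a raw reading against warning and critical thresholds."""
    if measured >= critical:
        return "critical"
    if measured >= warning:
        return "warning"
    return "ok"


# CPU1 example from the text: a raw value of 730 means 73.0 degrees
print(tenths_to_celsius(730))

# Inlet temperature of 520 against a limit of 470 (warning value is made up)
print(classify_reading(measured=520, warning=420, critical=470))
```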

Actions are determined by a separate routine that parses issue files for content.

Here is a Perl code snippet showing how SNMP codes are mapped to the human-readable variable names used in the Perl script:

This list can be extended as desired to contain any of the metrics one wishes to monitor and check.
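The same mapping idea carries over to other languages. The Python sketch below is not the Perl code referred to above. The metric names are invented, and the OIDs are placeholders under 1.3.6.1.4.1.32473, the enterprise number reserved for documentation, so they must be replaced with the real OIDs of the hardware being monitored. The parser assumes numeric OID output of the kind snmpget prints with the -On option.

```python
import re
from typing import Dict, Tuple

# Placeholder OIDs (documentation enterprise number 32473), NOT real Dell OIDs
METRIC_OIDS: Dict[str, str] = {
    "cpu1_temp": "1.3.6.1.4.1.32473.1.1",
    "inlet_temp": "1.3.6.1.4.1.32473.1.2",
    "fan1_rpm": "1.3.6.1.4.1.32473.2.1",
}
# Reverse lookup: OID -> human-readable name
OID_TO_METRIC: Dict[str, str] = {oid: name for name, oid in METRIC_OIDS.items()}

# Matches lines such as ".1.3.6.1.4.1.32473.1.1 = INTEGER: 730"
_LINE = re.compile(r"^\.?([\d.]+)\s*=\s*INTEGER:\s*(-?\d+)")


def parse_snmpget_line(line: str) -> Tuple[str, int]:
    """Turn one numeric-OID output line into (metric name, integer value)."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized snmpget output: {line!r}")
    oid, value = match.group(1), int(match.group(2))
    if oid not in OID_TO_METRIC:
        raise ValueError(f"no metric name defined for OID {oid}")
    return OID_TO_METRIC[oid], value


print(parse_snmpget_line(".1.3.6.1.4.1.32473.1.1 = INTEGER: 730"))
```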

Processing the Data

As these values are collected from multiple hosts, another script monitors for discrepancies, i.e., user-defined instances where a metric has either shifted from a normal to an abnormal state or lies outside a pre-determined acceptable value or range of values. One can then assign a specific “action item” that corresponds to one or more such conditions. For example, the failure of a fan or of one of two redundant power supplies may not be catastrophic if enough remain to do the job, but it might be good to notify the sys admins via email. On the other hand, an overheated CPU or a high inlet temperature may be cause for major concern and be considered sufficiently serious to trigger XenMotion (actively relocating running VMs) to a different host, followed by shutting down the affected host. In turn, this might mandate a text alert to one or more system administrators in addition to email notifications to a subgroup of those concerned with operations. All of that can be defined and customized.
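The two kinds of discrepancy described here, a state change and an out-of-range value, can be sketched in a few lines of Python. This is an illustration under assumed names: find_discrepancies, the state strings and the issue labels are all hypothetical, as are the acceptable ranges in the example.

```python
from typing import Dict, List, Tuple


def find_discrepancies(prev_states: Dict[str, str],
                       curr_states: Dict[str, str],
                       readings: Dict[str, int],
                       ranges: Dict[str, Tuple[int, int]]) -> List[str]:
    """Report components that left the 'ok' state and readings out of range."""
    issues: List[str] = []
    # 1) State shifted from normal to abnormal since the previous check
    for name, state in sorted(curr_states.items()):
        if prev_states.get(name, "ok") == "ok" and state != "ok":
            issues.append(f"{name}_state_{state}")
    # 2) Data point outside its acceptable range
    for name, value in sorted(readings.items()):
        if name not in ranges:
            continue  # no range defined, so nothing to check
        low, high = ranges[name]
        if not low <= value <= high:
            issues.append(f"{name}_out_of_range")
    return issues


issues = find_discrepancies(
    prev_states={"fan1": "ok", "psu2": "ok"},
    curr_states={"fan1": "failed", "psu2": "ok"},
    readings={"inlet_temp": 520, "cpu1_temp": 730},
    ranges={"inlet_temp": (100, 470), "cpu1_temp": (100, 850)},
)
print(issues)
```

Each reported issue name could then be looked up in a table of action items, such as "email only" or "evacuate and shut down".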

Here is what the Octoblu flow currently looks like:

The initial Connector Shell queries the server that collects and evaluates the SNMP information to check for an issue. If one is found, the relevant parameters are pre-processed and sent to a second Connector Shell, which determines the XenServer pool master associated with the host on which the issue was identified. If the issue is on the pool master itself and the action involves incapacitating that host, the flow first switches the pool master to another host. The in-process flag is then cleared, so that at the next timer-driven check the action can be triggered on the host that is no longer the pool master. Because only the first issue encountered is handled during each cycle, this prevents the conflicts that could arise from dealing with multiple XenServer hosts concurrently. Augmented with the pool master information, the third Connector Shell launches the appropriate batch script on the relevant server, passes on the resulting information to be parsed, and sends notifications according to the tagged, user-defined severity level.
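The per-cycle decision described in this paragraph can be summarized in a short Python sketch. It only models the logic and does not reproduce the Octoblu flow itself. The function plan_cycle, the dictionary keys and the step descriptions are assumptions made for the example.

```python
from typing import Dict, List, Set


def plan_cycle(issues: List[Dict[str, str]], pool_master: str,
               incapacitating: Set[str]) -> List[str]:
    """Return the steps for one timer cycle; only the first issue is handled."""
    if not issues:
        return []
    issue = issues[0]  # later issues wait for a subsequent cycle
    host, name = issue["host"], issue["name"]
    if name in incapacitating and host == pool_master:
        # Move the pool master first; the action itself runs next cycle
        return [f"switch pool master away from {host}",
                "clear in-process flag"]
    return [f"run batch script for {name} on {host}",
            "send notifications"]


pending = [{"name": "inlet_overheat_critical", "host": "xstest2"}]
incapacitating = {"inlet_overheat_critical"}

# First cycle: xstest2 is still the pool master
print(plan_cycle(pending, "xstest2", incapacitating))
# Next cycle: the master role has moved to xstest1
print(plan_cycle(pending, "xstest1", incapacitating))
```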

An example of an issue file created when an error condition is identified is shown here:

:ISSUE:inlet_overheat_critical
:TIMESTAMP:2017-07-18 20:45:11
:MEASURED:520
:LIMIT:470
:HOSTNAM:xstest2
:HOSTIP:10.15.9.45
:LEVEL:100

These are parsed and processed by the Connector Shell that launches the batch script.
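Parsing such a file is straightforward. The Python sketch below is an illustration rather than the code run by the Connector Shell, and the function name parse_issue_file is an assumption. The one detail worth noting is that the timestamp itself contains colons, so each line is split only at the first colon after the key.

```python
from typing import Dict

SAMPLE = """\
:ISSUE:inlet_overheat_critical
:TIMESTAMP:2017-07-18 20:45:11
:MEASURED:520
:LIMIT:470
:HOSTNAM:xstest2
:HOSTIP:10.15.9.45
:LEVEL:100
"""


def parse_issue_file(text: str) -> Dict[str, str]:
    """Parse ':KEY:value' lines into a dictionary, ignoring other lines."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(":"):
            continue  # skip blank or unrelated lines
        # Split at the first colon after the key only
        key, separator, value = line[1:].partition(":")
        if separator:
            fields[key] = value
    return fields


issue = parse_issue_file(SAMPLE)
print(issue["ISSUE"], issue["HOSTNAM"], int(issue["MEASURED"]) / 10.0)
```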

Before that takes place, this is the appearance of the pool in XenCenter. Note that the host xstest2 has two VMs running on it while xstest1 has one:

After the process completes, here’s what XenCenter shows:

The Slack bot shows this report:

And the following email was also generated, containing the same information:

The ability to select the recipients and means of notification depending on the nature and level of the issue is a big plus.

This video shows this process in action, where a simulated overheating of a XenServer triggers the evacuation of the VMs running on that server and their live migration to another XenServer within the pool, followed by the server experiencing the issue being put into maintenance mode.

Future Work

Plenty can and should be done to build upon this model:

Add more monitoring functions (focus on the most critical first)
Gather logs on metrics and analyze for what is “normal”
Leverage Splunk and Splunk ML (Machine Learning) to look for outliers and anomalies
Integrate other machine room metrics (temperature, water, humidity, fire, etc.)
Include internal metrics such as load and memory: potential for Super WLB
Incorporate other devices: NetScaler, XenDesktop, etc.

This could be tied into future features offered by Octoblu, backup/restore solutions, etc. as well as into other Citrix products, such as NetScaler, XenApp, etc.

Again, there are great benefits to doing as much processing as possible outside of Octoblu, both for the sake of efficiency and to keep data from traversing any network more than necessary.
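As one possible starting point for working out what is “normal”, logged values could be checked on the external monitoring box with a simple statistical test before anything more elaborate is brought in. The Python sketch below is an assumption about how that might look, not part of the current implementation. The function is_outlier, the three-sigma cut-off and the sample history are all made up for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def is_outlier(history: Sequence[float], value: float,
               max_sigma: float = 3.0) -> bool:
    """Flag a value lying more than max_sigma deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to define "normal"
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > max_sigma


# Made-up inlet temperature log, in tenths of a degree
inlet_history = [400, 410, 405, 395, 390]
print(is_outlier(inlet_history, 520))
print(is_outlier(inlet_history, 405))
```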

Conclusions

The possibilities here are extensive, since once the basic mechanisms have been worked out and the modules created to support them, there is plenty of room for expansion and diversification of what is integrated and how. The means to determine “normal” values via the collection and analysis of data could lead to even greater efficiencies as well as better means to identify and react to error conditions. Exciting prospects definitely lie ahead.

Follow @tkreidl

Footnotes:

(*) Cf. RFC 1149 (https://tools.ietf.org/html/rfc1149), RFC 2549 (https://tools.ietf.org/html/rfc2549)
