zmiku: An automation daemon (by alaric)
A few years ago, I wrote my own service monitoring system for my servers and networks; I did this because Nagios, the common choice, was just too complicated for my tastes and didn't cleanly fit my needs. And so, The Eye of Horus was born, and has been monitoring my servers ever since. I open-sourced it, but I've not migrated it to the new Kitten Technologies infrastructure yet, so I don't have a link.
A design goal for Horus was to limit what needed to be installed on the monitored servers; it's a Python script that you run from cron which sshes into the servers and runs shell commands directly; the results are sucked back from standard output. The configuration file format is easy to work with, and it's modular - the python script spits out a new status file listing the status of all the services that a set of CGIs uses to produce HTML reports on demand and to update rrdtool logs of measurables, and produces a list of changes to be fed to a notification system.
However, it has some rough edges - I decided to make the shell
commands run on the remote servers all output a standard format of
report, which means mangling the output of commands such as pidof
with sed and awk in order to produce them, which is a pain to do
portably. In general, support for generating different commands to get
the same report on different platforms is poor, too. I never got
around to implementing hysterisis in the change detectors to put a
service that's rapidly failing and recovering into an "unstable"
state. And it's written in Python, when I've migrated all of my new
development into Scheme.
I was tinkering with the idea of a straight rewrite in Scheme, with the rough edges fixed up, when I noticed a convergence with some of my other projects beginning to form.
I've long wanted to have a system where some small lightweight computer (perhaps a Raspberry Pi), attached to the home LAN, drives speakers as a music player, streaming music from the home file server. There's off the shelf stuff to do that, but I wanted to go a little further and also provide a text-to-speech notification system; the box would also have a queue of messages. If the queue was not empty, it would pause the music (perhaps with a nice fade), emit an announcement ding sound, then play the messages in turn via a text-to-speech engine. I had previously had some success in helping my wife manage her adult ADHD by putting a cronjob on her Mac that used the "say" command to remind her when it was time to have lunch and the like, as she easily gets too absorbed in something on her laptop and forgets everything else; I thought it would be good to extend that so it worked if she wasn't near her laptop, by making it part of a house-wide music system composed of music streamers in many rooms. And it would be a good place to route notifications from systems like Horus, too. And as the house we lived in then had a very long driveway, we could have a sensor at the end of the drive speak a notification if a car entered the driveway (in the new house, we have a similar requirement for a doorbell that can be heard in distant rooms...). And so on.
But that started to lead to similar design issues as the notification system in Horus; sometimes a single event causes a lot of notifications to be generated, which spam the user when you really just want a single notification that tells them all they need to know. Horus has some domain-specific knowledge about what services depend on what listed in the configuration file, and it automatically suppresses failures that "are to be expected" given root failures, but it could be smarter (for instance, if the failure occurs after the root service has been checked and is fine but before the child services have been checked, then it will notify of the failure of all the child services, rather than noticing the suspicious trend).
And when multiple events occur in the same time span, yet are unrelated so such tricks can't be applied, some notion of priority and rate limiting need to be applied. If ten thousand notifications suddenly appear in the queue in a single second, what's the system to do? Clearly, it will start fading the music down the very instant a notificatoin arrives, but by the time it then gets to start talking a second later, it may have recevied a lot of messages; now it needs to decide what to do. Repeated messages of the same "type" should be summarised somehow. A single high-priority message should be able to cut through a slew of boring ones. And so on.
At the same time, I was looking into home automation and security systems. There you have a bunch of sensors, and actions you want to trigger (often involving yet more notifications...) in response to events. And similarly I wanted to try and automate failover actions; host failure notifications in Horus should trigger certain recovery activities - but only if the failure state lasts for more than a threshold period, to make sure expensive operations are not triggered based on transient failures.
Programming these complex "rules", be they for automation, analysing the root cause of failures from a wide range of inter-dependent service statuses, or deciding how best to summarise a slew of messages, is often complex as they deal with asynchronous inputs and the timing relationships between them, too; specialist programming models, generally based around state machines, help a great deal.
Also, having a common infrastructure for hosting such "reactive behaviour" would make it possible to build a distributed fault-tolerant implementation, which would be very useful for many of the above problems...
So, I have decided, it would be a good idea to design and build an automation daemon. It'll be a bit of software that is started (with reference to a configuration file specifying a bunch of state machines), and then sits there waiting for events. Events can be timers expiring, or external events that come from sensors; and the actions of the state machines might be to trigger events themselves, or to activate external actuators (such as the text-to-speech engine or a server reboot). And a bunch of daemons configured to cooperate would all synchronise to the same state in lock-step; if daemons drop out of the cluster, then all that will happen is that sensors and external actions attached to that daemon will become unavailable, and state machines which depend on them will be notified. In the event of a network partition separating groups of daemons, the states can diverge; a resolution mechanism will need to be specified for when they re-merge.
Having that in place would mean that building a service monitoring system would merely involve writing a sensor that ran check commands, and a standard state machine for monitoring a service (with reference to the state machines of services it depends on), generating suitable events for other consumers of service state machine to use - and human-level notification events in a standard format recognised by a general human notification handler running in the same automation daemon cluster.
The shared infrastructure would make it easy to integrated automation systems.
Now, this is a medium-term project as what I have is working OK for now and I'm focussing on Ugarit development at the moment, but I am now researching complex event processing systems to start designing a reliable distributed processing model for it. And I've chosen a name: "zmiku", the Lojban word for "automatic" or "automaton"; its goal is, in general, to automate complex systems. As I apply it to more problems, I'd like to bring in tools from the artificial intelligence toolbox to make it able to automate things in "smarter" ways; I feel that many of these techniques are currently difficult to employ in many cases where automation is required, so it would be good to make them more available.