A few years ago, I wrote my own service monitoring system for my
servers and networks; I did this because Nagios, the common choice,
was just too complicated for my tastes and didn't cleanly fit my
needs. And so, The Eye of Horus was born, and has been monitoring my
servers ever since. I open-sourced it, but I've not migrated it to the
new Kitten Technologies
infrastructure yet, so I don't have a link.
A design goal for Horus was to limit what needed to be installed on
the monitored servers; it's a Python script that you run from cron
which sshes into the servers and runs shell commands directly; the
results are sucked back from standard output. The configuration file
format is easy to work with, and the system is modular: the Python
script spits out a status file listing the state of every service,
which a set of CGIs uses to produce HTML reports on demand and to
update rrdtool logs of measurables, and it produces a list of changes
to be fed to a notification system.
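As a minimal sketch of that check-over-ssh model (the function name, the runner tuple, and the report shape are my illustrative assumptions here, not Horus's actual interface):

```python
import subprocess

def run_check(command, runner=("sh", "-c")):
    """Run a check command and capture its standard output.
    In the Horus model the runner would be ("ssh", host), so the
    command executes on the monitored server; "sh -c" stands in
    for local use.  Returns (exit_status, stripped_stdout)."""
    result = subprocess.run(
        list(runner) + [command],
        capture_output=True, text=True, timeout=30,
    )
    return result.returncode, result.stdout.strip()
```

The appeal of this design is that the monitored servers need nothing installed beyond an ssh daemon and a shell.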
However, it has some rough edges. I decided that the shell commands
run on the remote servers should all emit a standard report format,
which means mangling the output of commands such as pidof with sed
and awk, and that is a pain to do portably. In general, support for
generating different commands to get
the same report on different platforms is poor, too. I never got
around to implementing hysteresis in the change detectors to put a
service that's rapidly failing and recovering into an "unstable"
state. And it's written in Python, when I've migrated all of my new
development into Scheme.
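The missing hysteresis could look something like this sketch (the class name and thresholds are hypothetical; nothing like this exists in Horus):

```python
from collections import deque
import time

class ChangeDetector:
    """Hysteresis sketch: a service that changes state more than
    max_flips times within a window of `window` seconds is reported
    as "unstable" rather than generating a notification per flip."""

    def __init__(self, max_flips=3, window=300.0):
        self.max_flips = max_flips
        self.window = window
        self.state = None
        self.flips = deque()  # timestamps of recent state changes

    def observe(self, state, now=None):
        now = time.time() if now is None else now
        if state == self.state:
            return None  # no change, nothing to report
        self.state = state
        self.flips.append(now)
        # forget state changes older than the window
        while self.flips and now - self.flips[0] > self.window:
            self.flips.popleft()
        if len(self.flips) > self.max_flips:
            return "unstable"
        return state
```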
I was tinkering with the idea of a straight rewrite in Scheme, with
the rough edges fixed up, when I noticed a convergence with some of my
other projects beginning to form.
I've long wanted to have a system where some small lightweight
computer (perhaps a Raspberry Pi), attached to the home LAN, drives
speakers as a music player, streaming music from the home file
server. There's off the shelf stuff to do that, but I wanted to go a
little further and also provide a text-to-speech notification system;
the box would also have a queue of messages. If the queue was not
empty, it would pause the music (perhaps with a nice fade), emit an
announcement ding sound, then play the messages in turn via a
text-to-speech engine. I had previously had some success in helping my
wife manage her adult ADHD by putting a cronjob on her Mac that used the
"say" command to remind her when it was time to have lunch and the
like, as she easily gets too absorbed in something on her laptop and
forgets everything else; I thought it would be good to extend that so
it worked if she wasn't near her laptop, by making it part of a
house-wide music system composed of music streamers in many rooms. And
it would be a good place to route notifications from systems like
Horus, too. And as the house we lived in then had a very long
driveway, we could have a sensor at the end of the drive speak a
notification if a car entered the driveway (in the new house, we have
a similar requirement for a doorbell that can be heard in distant
rooms...). And so on.
But that started to lead to similar design issues as the notification
system in Horus; sometimes a single event causes a lot of
notifications to be generated, which spam the user when you really
just want a single notification that tells them all they need to
know. Horus has some domain-specific knowledge about what services
depend on what listed in the configuration file, and it automatically
suppresses failures that "are to be expected" given root failures, but
it could be smarter (for instance, if the failure occurs after the
root service has been checked and is fine but before the child
services have been checked, then it will notify of the failure of
all the child services, rather than noticing the suspicious trend).
And when multiple events occur in the same time span, yet are
unrelated so such tricks can't be applied, some notion of priority
and rate limiting is needed. If ten thousand notifications
suddenly appear in the queue in a single second, what's the system to
do? Clearly, it will start fading the music down the very instant a
notification arrives, but by the time it gets to start talking a
second later, it may have received a lot of messages; now it needs
to decide what to do. Repeated messages of the same "type" should be
summarised somehow. A single high-priority message should be able to
cut through a slew of boring ones. And so on.
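The queueing policy just described might be sketched like this (the (type, priority, text) message shape is an assumption for illustration):

```python
from collections import Counter

def summarise(queue):
    """Summarise a burst of queued notifications before speaking
    them: high-priority messages are spoken verbatim, while repeated
    routine messages of the same type collapse into a single count."""
    spoken = []
    counts = Counter()
    first_text = {}
    for kind, priority, text in queue:
        if priority == "high":
            spoken.append(text)  # cuts through the boring ones
        else:
            counts[kind] += 1
            first_text.setdefault(kind, text)
    for kind, n in counts.items():
        if n == 1:
            spoken.append(first_text[kind])  # lone message: say it as-is
        else:
            spoken.append(f"{n} {kind} notifications")
    return spoken
```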
At the same time, I was looking into home automation and security
systems. There you have a bunch of sensors, and actions you want to
trigger (often involving yet more notifications...) in response to
events. And similarly I wanted to try and automate failover actions;
host failure notifications in Horus should trigger certain recovery
activities - but only if the failure state lasts for more than a
threshold period, to make sure expensive operations are not triggered
based on transient failures.
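That threshold rule is simple enough to sketch directly (all the names here are illustrative, not taken from Horus):

```python
import time

class FailoverTrigger:
    """Fire a recovery action only when a host has stayed failed
    for longer than `hold` seconds, so transient failures never
    trigger expensive operations."""

    def __init__(self, hold=120.0):
        self.hold = hold
        self.failed_since = None

    def observe(self, failed, now=None):
        """Return True when the failure has persisted past the
        threshold and the recovery action should run."""
        now = time.time() if now is None else now
        if not failed:
            self.failed_since = None  # recovered: cancel the timer
            return False
        if self.failed_since is None:
            self.failed_since = now   # failure just began
        return now - self.failed_since >= self.hold
```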
Programming these complex "rules", be they for automation, analysing
the root cause of failures from a wide range of inter-dependent
service statuses, or deciding how best to summarise a slew of
messages, is often complex as they deal with asynchronous inputs and
the timing relationships between them, too; specialist programming
models, generally based around state machines, help a great deal.
Also, having a common infrastructure for hosting such "reactive
behaviour" would make it possible to build a distributed
fault-tolerant implementation, which would be very useful for many of
the above problems...
So, I have decided, it would be a good idea to design and build an
automation daemon. It'll be a bit of software that is started (with
reference to a configuration file specifying a bunch of state
machines), and then sits there waiting for events. Events can be
timers expiring, or external events that come from sensors; and the
actions of the state machines might be to trigger events themselves,
or to activate external actuators (such as the text-to-speech engine
or a server reboot). And a bunch of daemons configured to cooperate
would all synchronise to the same state in lock-step; if daemons drop
out of the cluster, then all that will happen is that sensors and
external actions attached to that daemon will become unavailable,
and state machines which depend on them will be notified. In the event
of a network partition separating groups of daemons, the states can
diverge; a resolution mechanism will need to be specified for when
they re-merge.
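Since zmiku is only a design at this point, here is a toy sketch of what the core of such a daemon might look like: state machines registered as transition tables, driven by incoming events, with transitions yielding actions for actuators. Every name here is hypothetical.

```python
class AutomationDaemon:
    """Toy event-dispatch core: each machine is a current state plus
    a table mapping (state, event) to (new_state, action)."""

    def __init__(self):
        self.machines = {}  # name -> [state, transition_table]

    def add_machine(self, name, initial, transitions):
        self.machines[name] = [initial, transitions]

    def dispatch(self, event):
        """Feed one event (timer expiry or sensor input) to every
        machine; return the actions triggered by transitions."""
        actions = []
        for name, machine in self.machines.items():
            state, table = machine
            if (state, event) in table:
                new_state, action = table[(state, event)]
                machine[0] = new_state
                if action is not None:
                    actions.append(action)
        return actions
```

A doorbell machine, say, would transition from idle to ringing on a sensor event, emitting an action for the text-to-speech actuator, and fall back to idle on a timer event.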
Having that in place would mean that building a service monitoring
system would merely involve writing a sensor that ran check commands,
and a standard state machine for monitoring a service (with reference
to the state machines of services it depends on), generating suitable
events for other consumers of service state machines to use - and
human-level notification events in a standard format recognised by a
general human notification handler running in the same automation
daemon cluster.
The shared infrastructure would make it easy to integrate such
automation systems.
Now, this is a medium-term project as what I have is working OK for
now and I'm focussing on Ugarit development at the moment, but I am
now researching complex event processing systems to start designing
a reliable distributed processing model for it. And I've chosen a
name: "zmiku", the Lojban word for "automatic" or "automaton"; its
goal is, in general, to automate complex systems. As I apply it to
more problems, I'd like to bring in tools from the artificial
intelligence toolbox to make it able to automate things in "smarter"
ways; I feel that many of these techniques are currently difficult to
employ in many cases where automation is required, so it would be good
to make them more available.