The following notes are the result of a conversation with Ivan Merill that occurred after a session of the Sheffield Programmer’s Study Group on the 22nd of March 2017. Take the following with a grain of salt, as it is mostly an unformed stream of consciousness.


The Systems Engineering of Reliability

Dependency between systems is transitive: if system A depends on system B, which depends on C, then A also depends on C.

This matters greatly in the world of microservices, and with the current trend for fragmented, cloud-based architectures for large systems, because it implies that a service’s uptime will actually be lower than that of the weakest link in its chain. For example, consider the simple system below, where the arrows represent dependencies and both B and C have an intrinsic reliability of 0.99 (they’ll be available 99% of the time).

A→B→C

The expected uptime of A is not 0.99 but 0.99 × 0.99 ≈ 0.98 (98%), even before counting A’s own failures.
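To make the arithmetic concrete, here is a minimal Python sketch (illustrative only, and assuming the services fail independently of one another) that composes availabilities along a dependency chain:

    # Availability composes multiplicatively along a chain: a service is only
    # up when everything it (transitively) depends on is also up.
    def chain_availability(availabilities):
        """Expected availability of the whole chain, assuming independent failures."""
        result = 1.0
        for a in availabilities:
            result *= a
        return result

    # A -> B -> C, where B and C are each available 99% of the time.
    print(chain_availability([0.99, 0.99]))  # ~0.9801, i.e. roughly 98%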

In a real system, the number of dependencies is large, and we are in fact dealing with a tree of dependencies, where each subsystem depends on more than one other:

A→[B, E]
B→[C, D]
E→[F, I]
F→[G, H]

Given the above (which we can think of as a tree of dependencies), the final reliability of A will be roughly 0.92 (92%) if all subsystems have an intrinsic reliability of 0.99, since A sits on top of eight transitive dependencies and 0.99^8 ≈ 0.92. Of course, real systems have a great many more dependencies than this, so we can expect their expected reliability to be quite low. This is a fundamental effect: as a product gets more complex, its reliability goes down. Not exactly a surprise to anyone who’s ever worked with software.
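The same calculation extends to the tree. A rough sketch, again in Python and again assuming independent failures, using the hypothetical tree above:

    # Dependency tree from the example above (arrows point at dependencies).
    deps = {
        "A": ["B", "E"],
        "B": ["C", "D"],
        "E": ["F", "I"],
        "F": ["G", "H"],
    }

    def transitive_deps(node, seen=None):
        """All subsystems reachable from `node` through the dependency tree."""
        seen = set() if seen is None else seen
        for child in deps.get(node, []):
            if child not in seen:
                seen.add(child)
                transitive_deps(child, seen)
        return seen

    # A has 8 transitive dependencies, each available 99% of the time.
    n = len(transitive_deps("A"))
    print(n, 0.99 ** n)  # 8 0.9227... -- the roughly 92% quoted above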

This problem is compounded by dependencies outside of our control. For example, a third-party vendor could go out of business, and when one of our sub-dependencies fails, we suddenly face the unpleasant prospect of a whole chain of dependencies taking our service down. This has been well illustrated recently by AWS S3 outages that have affected a significant portion of the internet, and by a tiny NPM library being removed, which broke some of the most popular packages on the platform badly, affecting millions of people in both cases.

In my mind, this highlights the importance of open source software, and has been a revelation regarding the infectious property of the GPL license. I now understand for the first time why the GPL should “infect” other pieces of software it comes into contact with. What I originally thought to be a dogged ideological position is actually coldly calculated systems reliability engineering. Open source is not an ideological issue, it’s an architecture and reliability issue.

It also highlights a tragicomic vision of the future, where the best site reliability practice is to have a fun and engaging failure page for your user facing applications.

Beyond Availability

Availability is not the end of the story either. Responsiveness (latency/lag) of a service also cascades quickly to other services, in an additive manner instead of a multiplicative one. Relying on third-party services could see your user experience degrade drastically due to circumstances completely outside your control.
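A small sketch of the difference, with made-up numbers, for a request that must pass through each hop of the A→B→C chain in turn:

    # Illustrative per-hop latencies (ms); the user sees the sum of the chain,
    # while availability is the product of the individual availabilities.
    latencies_ms = {"A": 20, "B": 50, "C": 120}
    availabilities = {"A": 0.99, "B": 0.99, "C": 0.99}

    total_latency = sum(latencies_ms.values())  # 190 ms end to end
    total_availability = 1.0
    for a in availabilities.values():
        total_availability *= a                 # ~0.970

    print(f"latency: {total_latency} ms, availability: {total_availability:.3f}")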

This is why we should test the production system in place; no alternative will do. We should proactively test load, performance, and service degradation on the real production system, attacking it with unplanned downtimes, artificial latencies, and extreme loads while it is under real use. Some companies have already begun doing this, most famously Netflix with its Chaos Monkey and ever-growing Simian Army that proactively attack production systems.
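As a flavour of what such an attack might look like, here is a minimal, hypothetical fault-injection sketch in the spirit of Chaos Monkey (the instance list and the termination call are stand-ins, not Netflix’s actual tooling):

    import random

    # Hypothetical inventory of production instances; in reality this would
    # come from your orchestrator or cloud provider's API.
    instances = ["web-1", "web-2", "api-1", "api-2", "worker-1"]

    def terminate(instance):
        """Stand-in for a real termination call against your infrastructure."""
        print(f"terminating {instance} -- the system should recover on its own")

    def chaos_round(instances, kill_probability=0.2):
        """Randomly kill a small fraction of instances during normal operation."""
        for instance in instances:
            if random.random() < kill_probability:
                terminate(instance)

    chaos_round(instances)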

We should all assume that things will break, and build our systems in accordance with Murphy’s laws.

Everything that can go wrong will go wrong

Our individual system components should also be treated this way. A good analogy would be to consider all instances of system modules as disposable cattle, instead of as pets we need to look after.

It all reminds me of the old adage that “it’s not a backup until you’ve performed a restore”. It’s the same with performance and availability under duress: data from normal operations or from test systems won’t help you build an understanding of how your system behaves when things go wrong.

A Higher Plane of Monitoring

The true quest of monitoring is not the collection of data, but the improvement of the organisation-as-a-system.

This is a significant endeavour, as what we begin with in the world of monitoring is simply data. Data must first be transformed into information through aggregation and correlation, then converted into knowledge through analysis, and finally turned into action.

DATA→INFORMATION→KNOWLEDGE→IMPROVEMENT

Where:
  • Collection gives data.
  • Aggregation and correlation turn data into information.
  • Analysis turns information into knowledge.
  • Decision and action turn knowledge into improvement.
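A toy illustration of the first rungs of this pipeline, assuming raw log lines as the data and an error count per endpoint as the derived information (everything below is illustrative):

    from collections import Counter

    # DATA: raw log lines collected from a service (made-up examples).
    logs = [
        "2017-03-22T10:00:01 api-1 200 /checkout",
        "2017-03-22T10:00:02 api-1 500 /checkout",
        "2017-03-22T10:00:03 api-2 200 /search",
        "2017-03-22T10:00:04 api-2 500 /checkout",
    ]

    # INFORMATION: aggregate and correlate -- error counts per endpoint.
    errors = Counter(line.split()[3] for line in logs if " 500 " in line)

    # KNOWLEDGE: analysis -- which endpoint is failing, and how badly?
    worst_endpoint, error_count = errors.most_common(1)[0]
    print(f"{worst_endpoint} produced {error_count} errors")

    # IMPROVEMENT: the decision and the fix still belong to people (for now).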

Ladder of Monitoring Maturity

Understanding the process of monitoring as the transformation of data into action gives us what can be thought of as a “ladder of monitoring maturity”.

LEVEL 0 – Ignorance
An organisation does not collect data, and remains ignorant when something goes wrong (until the company fails or a customer complains).

LEVEL 1 – Availability Tracking
At this level, an organisation monitors whether its services are available or not. This can allow someone to be alerted when something goes wrong (Not that they’d know what to do about it). Typically, this will mean that services are restarted manually and there is no insight into what’s going wrong.

LEVEL 2 – Data Collection
At this level, an organisation will collect logs and data from the services it runs. This may involve quite complex setups or quite simple ones, but will allow developers and operations engineers to carry out a post-facto forensic investigation of a downtime incident. Note that this has to be practical: many systems will collect logs, but if getting them back (say, syslog from a transient instance) is too difficult, then this is not carried out in practice. For level 2, logs must be available and actually used after an incident or downtime to figure out what went wrong.

LEVEL 3 – Aggregation & Correlation
At this level, an organisation will correlate and aggregate data from different sources, so that a service issue can be turned into an event, and the cause-and-effect relationship between service issues can be easily established after the fact. This lays the groundwork for the next stage.

LEVEL 4 – Analysis
At this level, an organisation will routinely analyse the aggregated and correlated event information to provide insight into problems with the architecture and services. While today this is a manual process, it is imaginable that in the future AI could be used to analyse the aggregated and collected data. This is what enables an issue to actually be fixed.

LEVEL 5 – Learning Organisation
It is not sufficient to analyse and fix a problem. At this level, organisations have a practice of improvement and root-cause analysis on failures. They close the loop and carry out retrospectives on incidents and downtimes, so that not only is the immediate problem fixed, but other problems of the same class are solved proactively. This is the beginning of proactivity and forward planning, even though it is carried out as a response to external stimuli. The system gets stronger with each problem and could be considered antifragile.

LEVEL 6 – Automation
At this level, the organisation has finally learnt the truth of systems engineering: Shit breaks. Their entire architecture has automated remedial actions in place, so that when an issue occurs, it is detected and handled by the system itself automatically. For example, a failure might be detected in a subsystem, so the subsystem instance will be replaced while an event with all the relevant information (perhaps containing a complete image of the failed system) will be made available for analysis. This level of automated proactivity should greatly reduce the stress of the operations team, since they won’t be woken up in the middle of the night. The system will recover automatically.
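A very rough sketch of such a remediation loop (every name here is a hypothetical stand-in; a real implementation would live in your orchestrator or monitoring stack):

    import time

    def is_healthy(instance):
        """Stand-in for a real health check (HTTP ping, heartbeat, ...)."""
        return True  # pretend everything is fine in this sketch

    def capture_for_analysis(instance):
        """Snapshot the failed instance (logs, image) and record an event."""
        print(f"captured diagnostics for {instance}")

    def replace(instance):
        """Stand-in for provisioning a fresh instance to take over."""
        print(f"replaced {instance}")

    def remediation_loop(instances, interval_seconds=30):
        """Detect failures and handle them without waking anyone up."""
        while True:
            for instance in instances:
                if not is_healthy(instance):
                    capture_for_analysis(instance)  # keep evidence for later
                    replace(instance)               # recover automatically
            time.sleep(interval_seconds)

    # remediation_loop(["api-1", "api-2"])  # would run forever as a daemon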

LEVEL 7 – Counter-Engineering
At this level, an organisation will proactively attack its own systems and create artificial downtimes and latencies on the production systems. This enables the organisation to study real issues in relative calm. Instead of waiting for weaknesses to reveal themselves, they are proactively hunted down, with the deliberate aim of making the system more robust.

Monitoring as an organisational practice

The essence of monitoring is in fact at a higher level still.

Monitoring is an organisational process related directly to both management and operational science.

From the operational science side, monitoring is a key concept that relates to John Boyd’s OODA loop:

“OBSERVE, ORIENT, DECIDE, ACT”

Monitoring is an organisation’s observation and orientation. The OODA loop has real implications for organisations. Exactly as with aircraft in a dogfight, shortening the complete OODA cycle and “getting inside the competition’s OODA loop” lets an organisation outmanoeuvre the competition or the environment. Monitoring solutions, processes, and tools should be built with these ideas in mind, and should endeavour to help the organisation shorten its OODA loop.

From the management side, it greatly reminds us of Peter Drucker’s maxim:

“You can’t manage what you can’t measure”, or “you improve what you measure”.

It should be clear from this that monitoring plays a crucial role in a business, and is a much broader idea that transcends the traditional boundaries of an IT department. It is in fact part of the fundamental principles of Business-As-A-System, a perspective that many companies have not yet adopted into their practice.

Once we understand what monitoring is, we can see that it should involve the entire organisation, and in particular for our discussion, the accounting department.

“Show me the Money!”

Any monitoring metric besides revenue is only a weak proxy for it, and often quite a poor one. It doesn’t matter how fast the page loads or whether all the tests pass if the purchase button is hidden behind a styling issue: traffic will still flow, uptime will look good, but revenue will be zero.
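One hedged sketch of what monitoring closer to the money might look like: an alert on the revenue figure itself, compared against a baseline (all names and numbers below are illustrative stand-ins, not recommended thresholds):

    def orders_in_last_hour():
        """Stand-in for a query against the order or payment system."""
        return 3  # illustrative value

    def expected_orders_for_this_hour():
        """Stand-in for a seasonal baseline, e.g. the same hour last week."""
        return 100  # illustrative value

    def revenue_alert(threshold=0.5):
        """Alert when actual orders fall well below the expected baseline.

        Uptime and page-load graphs can look perfect while this fires --
        for instance when the purchase button is hidden by a styling bug.
        """
        actual, expected = orders_in_last_hour(), expected_orders_for_this_hour()
        if expected and actual / expected < threshold:
            print(f"ALERT: {actual} orders this hour vs ~{expected} expected")

    revenue_alert()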

Of course, revenue need not be your ultimate goal. Charities, governments, NGOs, and academic institutions will have different goals, and monitoring should take these into account instead. If you’re in business though, and monitoring doesn’t show you how your actions affect the bottom line, you’re not monitoring at all.