How You Can Create Resilience Out of Chaos Engineering
by Caitlin Stanford, on 6/30/20
Uptime is the performance measure customers and service users judge you on. But in today’s interconnected world, a good score is getting harder to achieve.
We’ve moved on from systems that are monolithic, highly controlled and only occasionally updated. Today, software runs on multiple servers, relies on distributed networks, and is frequently updated. There are many more opportunities for things to go wrong, problems are much harder to find, and in many cases the cause lies outside your control.
There really isn’t any room for maneuver, either. The difference between 99% and the gold standard of 99.9999% uptime is significant. It’s the difference between over 3.5 days of downtime in a year, which would be unacceptable to many people, and just over 30 seconds, which may be barely noticeable.
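The arithmetic behind those two figures can be checked with a short sketch. It assumes a 365-day year, and the function name `downtime_seconds_per_year` is purely illustrative:

```python
# Downtime budget implied by an uptime percentage, assuming a 365-day year.
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def downtime_seconds_per_year(uptime_percent: float) -> float:
    """Return the seconds per year a system may be down at the given uptime."""
    return SECONDS_PER_YEAR * (100.0 - uptime_percent) / 100.0


for pct in (99.0, 99.9999):
    secs = downtime_seconds_per_year(pct)
    days = secs / SECONDS_PER_DAY
    print(f"{pct}% uptime allows about {secs:.1f} s ({days:.4f} days) of downtime per year")
```

At 99% the budget works out to roughly 3.65 days; at 99.9999% it is roughly 31.5 seconds.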
Testing goes some way to finding and fixing the problems. But by its very nature testing only finds and fixes known problems or problems that can be anticipated.
It typically doesn’t cover different configurations, different error conditions or the many factors beyond your control, such as the failure of a third-party host server or a sudden surge in usage.
It’s these problems that will really trip you up – and bring down your systems.
Of course, your customers or service users don’t care about the complexity of your systems or the load that’s being placed on them. All they see is a system that is unreliable. In many cases, downtime is frustrating and leads to a damaging loss of reputation. But in some sectors, such as aerospace, defense, or health, downtime could be literally life-threatening.
It is possible to fix problems, or at least find workarounds, on the fly when they occur. But fixing problems under pressure carries risks, and with the clock ticking, every second counts.
Much better is to investigate and develop solutions for problems before they happen. To do this, many organizations are turning to a testing concept developed by engineers at Netflix. It’s called chaos engineering.
The subject and its benefits to business are the theme of our latest ebook – Achieving Strength Through Chaos Engineering.
The method sees testers proactively perform experiments, inject failures and engineer disaster scenarios so that solutions can be developed thoroughly and calmly rather than in the heat of the moment.
The idea is to understand what happens when chaos ensues, not to cause chaos. Experiments are therefore very tightly controlled. To tighten that control further, best practice recommends using tools to give the testing structure, and automating tests for known failures to maximize efficiency.
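To make the idea of a controlled, automated experiment concrete, here is a minimal sketch in Python. It is not taken from Netflix's tooling or from Eggplant Performance. Every name in it (`fetch_price`, `flaky`, `price_with_fallback`, `run_experiment`) is hypothetical, and the retry-with-fallback behavior is simply an assumed example of a resilience mechanism under test. The seeded random generator keeps each run repeatable, which is one simple way to keep the chaos under control.

```python
import random


class InjectedFailure(Exception):
    """Raised on purpose by the experiment, never by the real dependency."""


def fetch_price(item):
    """Hypothetical dependency, standing in for a remote service call."""
    return {"widget": 9.99}[item]


def flaky(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("injected by chaos experiment")
        return func(*args, **kwargs)
    return wrapper


def price_with_fallback(fetch, item, retries=3, default=None):
    """Code under test: retry a few times, then fall back to a default."""
    for _ in range(retries):
        try:
            return fetch(item)
        except InjectedFailure:
            continue
    return default


def run_experiment(failure_rate, calls=1000, seed=42):
    """Return the fraction of calls that still produced a price."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    fetch = flaky(fetch_price, failure_rate, rng)
    successes = sum(
        price_with_fallback(fetch, "widget") is not None for _ in range(calls)
    )
    return successes / calls


if __name__ == "__main__":
    for rate in (0.0, 0.3, 0.9):
        print(f"failure rate {rate:.0%}: success rate {run_experiment(rate):.1%}")
```

Running the experiment at increasing failure rates shows where the fallback stops protecting the user, which is the kind of finding that can then be addressed calmly before a real outage occurs.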
One of the most powerful platforms that gives testers the structure and automation they need is Eggplant Performance. The software provides open, extensible, and easy-to-use performance and load testing tools that can test the widest range of technology and scale up to simulate any load you need. It has the unique ability to simulate virtual users at both the application UI and the network protocol level, which makes it the only solution that gives a true understanding of the UX impact at scale.
In short, by harnessing what chaos engineering has to offer and embracing tools such as Eggplant Performance, testers can deliver more resilient systems that offer more reliability and a better ROI.