On the S3 Outage and Agile Cloud Native Serverless Failures

Our newsfeeds are filled with hundreds of articles about Amazon’s S3 failure on Tuesday. Doom, gloom, S3 down in US-East-1 and cascading failure — dogs and cats, living together: mass hysteria! It’s been covered on Forbes, Business Insider — even The Daily Mail and The Sun (hardly bastions of technology reporting, but “OMG Instagram is down!” cuts off their supply of click bait).

The real story is not about S3 taking a nap, nor about the cascading failures across Amazon’s services, nor about Amazon’s response — it’s about companies that use AWS failing to build resilient, scalable solutions. It’s about the continuing trend of people misrepresenting Cloud and Agile as blanket buzzwords for ‘unplanned’ and ‘unmanaged’.

Computers break, systems break, software breaks. It’s a fact of life, and the more complex the system, the more points of failure it will have. We’ve known about that for decades, and we have decades of experience building resilient, scalable solutions to address this.
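To put rough numbers on that, here is a minimal sketch of how availability erodes when every component in a chain has to be up. The 99.9% figure and the five-dependency chain are made up for illustration, and the maths assumes failures are independent, which a regional outage shows they often aren't.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def serial_availability(availabilities):
    """Availability of a chain where every component must be up.

    Assumes independent failures; correlated failures make things worse.
    """
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Hypothetical service that depends on five components, each at 99.9%.
chain = [0.999] * 5
combined = serial_availability(chain)

print(f"single dependency: {downtime_minutes_per_year(0.999):.0f} min/year")
print(f"five in series:    {combined:.4%}, "
      f"{downtime_minutes_per_year(combined):.0f} min/year")
```

The point isn't the exact figures. Each extra hard dependency multiplies in another chance to fail, so the availability of the whole is always lower than that of its weakest part unless we design redundancy around it.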

I think part of the problem is that HA and DR — proper, tested, reliable HA and DR — aren’t viewed as interesting or sexy. There’s a lot of testing, a lot of planning, a lot of expense — and when people don’t understand the value in the first place, it’s hard to get them to sign off on this.

We can’t see or measure potential failure; we can only measure uptime and availability. And if a system delivers five nines, does it really need all that extra care and maintenance? Surely it will work without that? Because, you know, there are these Cloud services where the provider takes care of everything. They can provide solutions at web scale!

Cloud is fundamentally other people’s commodity infrastructure. Agile is just another approach to writing code, but we’re still building and delivering software that runs on infrastructure. We still have to plan for infrastructure failure, but now more of the burden of architecting around failure falls on our developers. We still need project management and KPIs to measure milestones and success.

Increasingly though, I’m witnessing people misrepresenting Cloud and Agile as “Infrastructure is someone else’s problem” and “We don’t need project management, RACIs, or KPIs”. We can reinvent the wheel but this time make it turn faster! We’ll ignore any consequences because by that time we’ll have moved on to a new startup! The whole ‘Serverless Architecture’ conversation just makes this worse.

Used in this way, Cloud and Agile aren’t new or innovative, and they don’t deliver the value they should — they are excuses for mismanagement, bad planning, and poor implementation.

If our business is purely web based, then our software services are core: they run the business systems. They need to be treated as such and designed as such. The hype-gasm to be first to market creates technical debt in the solution (as well as the business plan). We just need to look at the current insane valuations of companies without a viable business plan. That is not just a warning sign of a tech bubble: it’s a giant warning sign of impending failure for common applications that people are relying on more and more.

Too often “governance” and “project management” are seen as dirty words, pointless make-work that has no place in the shiny new world of Agile Cloud Native Serverless applications. In reality, project management processes like PRINCE2 are a perfect fit for Agile delivery.

When we get down to it, these are the processes via which we provide the proof that we have planned the solution, tested it, and will measure its success. It is exactly the same discipline we apply when writing software using Agile methodologies: we create the tests, then write the code, then test the software.

People love to give presentations about OODA, PDCA, or the Shewhart Cycle when they talk about writing code. But these are valuable tools that can equally be applied to delivering infrastructure and entire programmes.

I’ve recently had conversations with clients where I’ve heard the following (and not just from the client!):

“We don’t need a RACI to define support roles — it’s Cloud”

“Why would we need any KPIs? This is an Agile project!”

These attitudes are driven by lack of understanding, and the belief that Cloud and Agile can be used as an excuse to reinvent the wheel: only this time it’s got half the spokes and is slightly square. These behaviours are driven by snake oil technologists who are more interested in cashing out than building and delivering a viable, sustainable business.

Until we address this, at both management and technology levels, we’ll see more instances of a localised Cloud outage causing disproportionate, cascading business failures. Buckle up, it’s going to be a bumpy ride.


Niloufar Namvar