No, the AWS outage doesn't necessarily strengthen multi-cloud

Yesterday, AWS experienced service interruptions that lasted a good part of the day. There is a lot of talk about DNS, but according to the timeline shared by Amazon, it appears that after the initial problem, difficulties with launching new instances and managing their internal network in the us-east-1 region prolonged the downtime of all services.

Obviously, when AWS’s DNS goes down, the entire Internet trembles and, inevitably, we are treated to an avalanche of comments from many cloud computing “experts”. While waiting for AWS’s detailed analysis of the exact origin of the problem, this post aims to restate a few important truths to keep in mind in the face of this deluge of armchair analyses, often written by LLMs…

  • “Customers just need to have redundant architectures to avoid depending on a single region.” In principle, I entirely agree, except that in this case, the DNS service is a global service. Consequently, even though it is tied to the us-east-1 region, a failure there affects all other regions. For anyone making this kind of comment, I recommend taking the Solutions Architect Associate certification.

  • “This is the trap of a single cloud provider, adopt multi-cloud architectures.” Generally, this is the kind of comment that comes from cloud architects who have never worked in complex environments in large enterprises. The reality is that, while multi-cloud architectures are indeed possible, they represent a huge investment that is not justified when you look at how few hours of service interruption your cloud provider actually has per year. It’s a very good idea on paper, but one that generally does not survive a detailed evaluation, except for very rare and particularly critical business cases.

  • “The cloud is a trap, we must stay on-prem.” Now that one is my favorite! Once again, these are the words of “professionals” who confuse activism with business reality. The choice to migrate to the cloud is not made on a promise of 100% availability, but on a promise of value in terms of innovation, elasticity, and speed. And very often, availability is also vastly improved, because (brace yourselves) on-prem can also break and become unavailable! It actually happens regularly; we just talk about it much less because, by definition, it doesn’t affect many companies simultaneously.
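To put the "service interruption hours per year" argument from the multi-cloud point in perspective, here is a rough back-of-the-envelope sketch that converts yearly downtime into an availability percentage and back. The function names and the example figures are assumptions chosen for illustration; they are not measured AWS numbers.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_pct(downtime_hours: float) -> float:
    """Yearly availability (in percent) for a given number of downtime hours."""
    if not 0 <= downtime_hours <= HOURS_PER_YEAR:
        raise ValueError("downtime_hours must be between 0 and 8760")
    return 100.0 * (1 - downtime_hours / HOURS_PER_YEAR)


def downtime_budget_hours(target_pct: float) -> float:
    """Yearly downtime (in hours) allowed by a given availability target."""
    if not 0 <= target_pct <= 100:
        raise ValueError("target_pct must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - target_pct / 100.0)


# Illustrative only: a hypothetical year with 15 hours of provider downtime.
print(f"{availability_pct(15):.2f}% available")
# Illustrative only: the downtime a 99.9% target tolerates (roughly 8.8 hours).
print(f"{downtime_budget_hours(99.9):.2f} hours of downtime budget")
```

The point of the exercise is to compare those few hours with the permanent cost of building and operating a second cloud stack.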

So, what should be done?

This type of event should not make you doubt your cloud strategy. It should, however, alert you to the need to implement highly decoupled and multi-region architectures for your critical systems.
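As a minimal sketch of what a multi-region failover plan boils down to, the snippet below picks the first healthy region from an ordered preference list. Everything here is hypothetical: `pick_region`, the simulated health map, and the choice of `us-west-2` and `eu-west-1` as fallback regions are assumptions for illustration, and a real health check would probe your own service endpoints rather than read a dictionary.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    preferred: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in preference order, that passes the health check."""
    for region in preferred:
        if is_healthy(region):
            return region
    return None  # no healthy region: the caller has to degrade gracefully


# Simulated health state (assumption): the primary region is down.
simulated_health = {"us-east-1": False, "us-west-2": True, "eu-west-1": True}

active = pick_region(
    ["us-east-1", "us-west-2", "eu-west-1"],
    lambda region: simulated_health.get(region, False),
)
print(active)  # prints "us-west-2"
```

The selection logic is the easy part; the investment lies in having data and capacity already available in the fallback regions when the switch happens.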

After these AWS problems, it was noticeable that the SaaS providers relying on AWS clearly did not all recover at the same time. On one hand, Reddit, for example, came back online quickly; on the other, Atlassian once again stood out for how long it took to return to production. Why the difference? Most likely because the fastest providers rely on decoupled architectures, with highly elastic systems that absorb the influx of messages to be processed when the services come back online, and perhaps also on failover plans to other regions for when the EC2 service experienced difficulties in us-east-1.
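To illustrate why elasticity matters once services come back, here is a simplified sketch that estimates how long a backlog of queued messages takes to drain with a fixed worker pool versus a scaled-out one. The model and all numbers are assumptions: it ignores new messages arriving during the drain, assumes every worker processes at the same constant rate, and uses hypothetical function names.

```python
import math


def drain_time_seconds(backlog: int, per_worker_rate: float, workers: int) -> float:
    """Seconds needed to drain `backlog` messages with `workers` workers."""
    if per_worker_rate <= 0 or workers <= 0:
        raise ValueError("per_worker_rate and workers must be positive")
    return backlog / (per_worker_rate * workers)


def workers_needed(
    backlog: int,
    per_worker_rate: float,
    target_seconds: float,
    max_workers: int,
) -> int:
    """Workers required to drain the backlog within the target, capped at max_workers."""
    if per_worker_rate <= 0 or target_seconds <= 0:
        raise ValueError("per_worker_rate and target_seconds must be positive")
    if backlog <= 0:
        return 0
    needed = math.ceil(backlog / (per_worker_rate * target_seconds))
    return min(needed, max_workers)


# Hypothetical figures: 3.6 million queued messages, 10 messages/s per worker.
fixed_pool = drain_time_seconds(3_600_000, 10.0, workers=10)
print(f"fixed pool of 10 workers: {fixed_pool / 3600:.1f} h")

scaled = workers_needed(3_600_000, 10.0, target_seconds=3600.0, max_workers=500)
print(f"workers needed to drain in 1 h: {scaled}")
```

With these made-up figures, a fixed pool takes ten hours where an elastic pool takes one, which is the kind of gap that separates a quick recovery from a slow one.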

A gap of more than 6 hours in return-to-production times between SaaS providers is significant, and it clearly warrants an investment in your architectures. Unfortunately, this is not really the type of post we can read today on LinkedIn…
