With countless hours of lost productivity, financial systems disrupted for millions of users, and potentially hundreds of billions of dollars in losses, this week’s AWS outage made for an unquestionably terrible day for global IT teams. Of course, it was also the worst global cloud catastrophe since the last one…and until the next.

Whether you’re on AWS, GCP, Azure, or any other platform, major outages are a fact of life in cloud computing. So what can your firm do to soften the blow? Below are four steps your team can take immediately.

1. Bring Your Skepticism – Do Your Homework

Teams often court disaster by walking into cloud arrangements assuming that the major cloud providers are inherently reliable. To be sure, the most-trusted firms have earned their reputations for a reason. At the same time, every hyperscaler offers a wide array of infrastructure options – AWS North America alone has 31 Availability Zones and 31 Edge Network Locations – and some of those options are vastly more reliable than others.

Indeed, AWS’s US-EAST-1 Region, the epicenter of this week’s outage, was behind major disruptions in 2020, 2021, and 2023, and has long been known in certain IT circles as the least reliable region. Many firms likely understood the situation but took a calculated risk given the region’s low cost and plentiful offerings. Given the scope of the outage, though, it’s impossible not to wonder how many firms were taken wholly by surprise – and would surely have opted for a more reliable region had they been aware of the trade-offs. I have personally met IT leaders who moved to other AWS regions only after bad experiences with US-EAST-1.

The lesson here is to do your due diligence on infrastructure options, no matter which cloud you’re working with. Free tools such as Cloudprice, Cloudping, and the historical incident views in each hyperscaler’s Cloud Service Health dashboard are good places to start.
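Some of that homework is also cheap to automate. Below is a minimal sketch (Python, standard library only) that compares TCP connect latency to a handful of regional EC2 endpoints. The endpoint naming pattern ec2.&lt;region&gt;.amazonaws.com is public, but the region list is an illustrative assumption, and connect time is only a first-pass signal – it should be weighed alongside incident history and pricing, not in place of them.

```python
# latency_probe.py - rough first-pass comparison of a few AWS regions.
# Connect time is only a crude signal; pair it with incident history and pricing.
import socket
import time

REGIONS = ["us-east-1", "us-east-2", "us-west-2", "eu-west-1"]  # illustrative sample


def connect_time_ms(host: str, port: int = 443, timeout: float = 3.0) -> float | None:
    """Return TCP connect time in milliseconds, or None if the host is unreachable."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000
    except OSError:
        return None


if __name__ == "__main__":
    for region in REGIONS:
        ms = connect_time_ms(f"ec2.{region}.amazonaws.com")
        label = f"{ms:6.1f} ms" if ms is not None else "unreachable"
        print(f"{region:12s} {label}")
```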

2. Choose Portable Over Cloud-Native 

When you’re architecting cloud configurations, the simpler route is to go cloud-native. But while it’s convenient to adopt managed services ready-built by and for your cloud provider, those cloud-native options leave you more exposed if your cloud goes down.

To avoid that additional layer of cloud dependency, opt for independent and/or open-source products where possible. The table below lists a few common replacements:

| Category | Native Offering Example | Open-Source Alternative |
| --- | --- | --- |
| Authentication & Identity | AWS Cognito | Keycloak |
| Search | Azure AI Search | Elasticsearch |
| Relational Databases | Google Cloud SQL | PostgreSQL |
| NoSQL Databases | AWS DynamoDB | MongoDB |
| Container Orchestration | Azure Kubernetes Service (AKS) | Kubernetes (self-managed) |
| Monitoring & Observability | Google Cloud Monitoring | Prometheus + Grafana |
| Message Queues | AWS SQS/SNS | Apache Kafka |
| Object Storage | Azure Blob Storage | MinIO |
| API Gateway | Google Cloud API Gateway | Kong |

 

To be sure, building more of your cloud stack from independent components means more work for your teams. However, in my experience, once the infrastructure is up and running, there’s little practical difference between adding workloads to an established self-managed stack and adding them to a cloud-native one. And the resiliency benefits – not to mention the reduced lock-in – make independent options well worth the effort.
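One reason portability is less painful than it sounds is that many open-source alternatives speak the same APIs as their managed counterparts. MinIO, for example, exposes the S3 API, so the same client code can target AWS S3 or a self-hosted MinIO cluster simply by changing the endpoint. The sketch below assumes boto3 plus placeholder environment variables and bucket names; it illustrates the pattern rather than offering a drop-in module.

```python
# storage_client.py - one S3-compatible client, two possible backends.
# Endpoint, credentials, and bucket name below are placeholders.
import os

import boto3


def make_storage_client():
    """Build an S3-compatible client for AWS S3 or a self-hosted MinIO.

    Because MinIO speaks the S3 API, switching backends is a configuration
    change: point OBJECT_STORE_ENDPOINT at MinIO, or leave it unset for AWS.
    """
    return boto3.client(
        "s3",
        endpoint_url=os.environ.get("OBJECT_STORE_ENDPOINT"),  # e.g. "https://minio.internal:9000"
        aws_access_key_id=os.environ["OBJECT_STORE_KEY"],
        aws_secret_access_key=os.environ["OBJECT_STORE_SECRET"],
    )


if __name__ == "__main__":
    s3 = make_storage_client()
    s3.upload_file("report.pdf", "backups", "reports/report.pdf")
```

The same reasoning applies to several other rows in the table above: Keycloak speaks standard OpenID Connect, and PostgreSQL is PostgreSQL wherever it runs.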

3. Engineer for Failure

Given that cloud failures will happen, design your products with failure in mind. Datadog offers one example worth studying: in a 2023 incident, the firm suddenly lost access to over half of its Kubernetes nodes in production, and it completely redesigned its disaster approach in response. The changes included removing architectural bottlenecks and paying down technical debt so that partial failures couldn’t cascade through the system, improving data ingestion and storage so that data remains available during outages, and building systems that recover automatically at scale. A good place to start your own journey is Datadog’s recommendation to “start with what’s important to the end user,” and to build fail-safes around what matters most.
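One concrete way to “start with what’s important to the end user” is graceful degradation: when a dependency fails, serve slightly stale data instead of an error page. The sketch below is a generic illustration of that idea, not Datadog’s implementation; the fetch function and the five-minute staleness window are hypothetical stand-ins.

```python
# degrade_gracefully.py - serve stale-but-useful data when a dependency fails.
# fetch_live_quotes and the staleness window are hypothetical stand-ins.
import time

_cache: dict[str, tuple[float, dict]] = {}  # key -> (timestamp, value)
STALE_OK_SECONDS = 300                      # how old a fallback is allowed to be


def fetch_live_quotes() -> dict:
    """Placeholder for a call into a regional dependency that can fail."""
    raise TimeoutError("regional dependency unavailable")


def get_quotes() -> tuple[dict, bool]:
    """Return (quotes, is_fresh), falling back to cached data instead of erroring."""
    try:
        quotes = fetch_live_quotes()
        _cache["quotes"] = (time.time(), quotes)
        return quotes, True
    except Exception:
        cached = _cache.get("quotes")
        if cached and time.time() - cached[0] < STALE_OK_SECONDS:
            return cached[1], False  # degraded, but the user still sees data
        raise                        # nothing usable cached: surface the failure
```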

4. Run on at Least Two Clouds

Of course, the best way to avoid being at the mercy of any single provider’s failures is multicloud redundancy. Achieving true multicloud fluidity is a huge undertaking for most firms, because infrastructure rarely translates cleanly from one cloud to another. But building out infrastructure on just two clouds is a strong – and often achievable – place to start. Critical to making this work is having a team with deep expertise in each of the clouds you run on.
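At its simplest, dual-cloud redundancy is an active-passive arrangement: the same service deployed independently on two clouds, with traffic steered to whichever deployment is healthy. In production this decision usually lives in DNS or a global load balancer, but the sketch below shows the idea in application code; the health-check URLs are placeholders.

```python
# failover.py - pick a healthy deployment across two clouds.
# The URLs are placeholders; real setups usually push this decision
# into DNS or a global load balancer rather than application code.
import urllib.request

DEPLOYMENTS = [
    "https://api.cloud-a.example.com",  # primary
    "https://api.cloud-b.example.com",  # standby on a second cloud
]


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Treat any HTTP 200 from the /healthz endpoint as 'up'."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def pick_endpoint() -> str:
    """Return the first healthy deployment, preferring the primary."""
    for base_url in DEPLOYMENTS:
        if is_healthy(base_url):
            return base_url
    raise RuntimeError("no healthy deployment found on either cloud")
```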

To be sure, nothing can shield firms completely from the impact of a massive outage like the one we saw this week. But with the right due diligence, a cloud-portable architecture, engineering for failure, and a dual-cloud footprint as a stepping stone to true multicloud, firms can be far more nimble when the next (and unfortunately inevitable) major cloud incident strikes.