A global outage of cloud services provided by Amazon Web Services (AWS) today disrupted access to hundreds of thousands of Web sites and applications that depended on as many as 78 impacted services.

AWS confirmed that there were significant error rates for application programming interface (API) requests made to the DynamoDB database endpoint in its US-EAST-1 Region. The root cause appears to be related to a Domain Name System (DNS) resolution issue involving the DynamoDB API endpoint, according to AWS.

At 6:35 a.m. EST today, AWS reported that the underlying DNS issue had been fully mitigated and that most AWS service operations were succeeding normally. However, some requests may be throttled while the company works toward full resolution. Additionally, some services, such as CloudTrail and Lambda, are continuing to work through a backlog of events. While most operations have recovered, requests to launch new EC2 instances, or to use services that launch EC2 instances such as ECS, in the US-EAST-1 Region are still experiencing increased error rates, according to AWS.

Organizations impacted by the outage include Amazon, AT&T, Canva, Delta Air Lines, Disney+, Fortnite, Hulu, McDonald’s, Pokémon, Roblox, Snapchat and United Airlines.

It’s too early to assess the economic impact of this latest outage, but many IT teams will undoubtedly be reviewing their dependency on cloud service providers in its wake. Calls to revamp how DNS is designed might also gain traction, given how vulnerable the system used to provide access to Web services has proven to be of late.

JP Morgenthal, founder and fractional CTO for NexAgent Solutions, a systems integrator, said that regardless of the root cause, the lesson here is that distributed applications require redundancy of services to ensure high availability. DNS-based outages, in particular, can be extremely difficult for cloud customers because DNS manages communications within the cloud as well as those external to it, he added.

In theory, applications running on cloud services should be able to take advantage of multiple regions to ensure availability, but given the number of recent outages involving AWS, Google, IBM, Microsoft and others, it’s clear there are multiple single points of failure. Most organizations today rely on multiple cloud service providers to run applications, but very few of those applications are designed to fail over to another cloud service provider when there is a disruption. Instead, organizations prefer to rely on the availability enabled by distributing application workloads across multiple geographic regions.
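As a rough illustration of what client-side failover between regions can look like, the Python sketch below walks an ordered list of endpoints and returns the first one that passes a check, here a simple DNS lookup. The host names are hypothetical placeholders, not real AWS endpoints, and a production design would likely also need health checks beyond name resolution, retries and data replication.

```python
import socket
from typing import Callable, Optional, Sequence

# Hypothetical endpoint names, used only for illustration.
ENDPOINTS = [
    "api.us-east-1.example.com",
    "api.us-west-2.example.com",
]


def resolves(host: str) -> bool:
    """Return True if the host name currently resolves via DNS."""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except socket.gaierror:
        # Raised when name resolution fails.
        return False


def pick_endpoint(
    endpoints: Sequence[str],
    check: Callable[[str], bool] = resolves,
) -> Optional[str]:
    """Return the first endpoint that passes the check, or None if none do."""
    for host in endpoints:
        if check(host):
            return host
    return None
```

Passing the check in as a parameter keeps the selection logic separate from the network call, so the same function could be driven by a fuller health probe.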

Jack Gold, principal for J.Gold Associates, said AWS is such an important backbone that when it experiences an issue, it ripples through the ecosystem and has a huge impact. When applications are hosted by one service, that means there is a single point of failure if for some reason that cloud service becomes unavailable or is corrupted in some way, he added.

Gold noted that it is actually surprising there haven’t been more of these outages, which shows that AWS and others are pretty competent at keeping things running smoothly. But no system is perfect, as today’s events show, he said.

However, there has been an increasing number of cloud outages, which might drive more organizations to reconsider their strategy. According to ThousandEyes, a unit of Cisco, the most common causes of these issues are unintentional failure vectors that accidentally spread failures across a distributed environment, hidden functional failures, and misconfigurations that cascade through interconnected systems.

The challenge, of course, is to minimize those risks as much as possible in a way that doesn’t wind up breaking the IT budget.
