
Data centers promise near-perfect uptime, with an industry standard of 99.999%. But even the remaining 0.001% of downtime can be detrimental to most companies and catastrophic to some, much as a single small vulnerability was for the Death Star in Star Wars.
A seemingly minor problem, a DNS error, caused havoc in the recent AWS outage. As AWS reported:
“Between 11:49 p.m. PDT on October 19 and 2:24 a.m. PDT on October 20, AWS experienced increased error rates across services in the US-EAST-1 Region, impacting Amazon.com, its subsidiaries and AWS support operations. By 12:26 a.m. PDT on October 20, we determined that the root cause was the result of DNS resolution issues for the regional DynamoDB service endpoints. The incident was fully mitigated by 2:24 a.m. PDT.”
The error was identified quickly and resolved soon after; however, the backed-up traffic took roughly 12 hours to clear, prolonging problems for companies whose critical applications were hosted in the affected AWS region. Outages like this aren't isolated to Amazon: Google Cloud Platform suffered an outage in June that took down customers' critical internet services and applications, and Microsoft's Azure also suffered an outage that affected customers worldwide. All three major U.S. cloud providers have had a major outage in less than six months.
The Uptime Institute reports that outages are becoming both less frequent and less severe. Most data centers have redundancies in place to mitigate problems, including uninterruptible power supplies, multiple telco options (fiber, ethernet) and redundant array of independent disks (RAID) configurations for data safety. At the same time, risks such as the impact of climate change, growing energy demands and human error are rising. And when an outage does occur, cascading failures can pile up (as we saw with AWS), with devastating effect.
According to the Uptime Institute, “More than half (54%) of the respondents to Uptime’s 2024 annual survey say their most recent significant, serious or severe outage cost more than $100,000, with one in five saying that their most recent outage cost more than $1 million.”
Lowering Risks
How can businesses that rely on cloud hosting for critical applications lessen the risk of devastating outages? A three-pronged strategy of diversified hosting, advanced technology and an engaged managed services provider (MSP) can work wonders.
Diversified Hosting With Co-Located Servers: A multi-cloud approach, with multiple providers running in parallel, minimizes disruption by allowing a smooth failover to a healthy provider when another goes down. The providers should be fully independent of one another, so that a simultaneous outage across both is unlikely.
Keep High-Cost Workloads on GPU Clouds: Access to the latest hardware enables faster, parallel processing and ensures scalability for fluctuating demand, which is especially important for compute-intensive tasks such as AI and ML.
Engage an MSP: Numerous MSPs are available 24/7 to handle companies' cloud needs, such as security, migration, optimization and compliance. An MSP can help ensure rapid mitigation of issues that might otherwise cripple mission-critical applications. With co-located servers, MSPs can handle the switch from one cloud to the other as soon as an outage occurs, rather than waiting for outage alerts to stack up, especially outside business hours when in-house IT staff are unavailable.
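The failover idea behind the first prong can be sketched in a few lines: probe a health endpoint on each provider and route traffic to whichever responds. This is a minimal illustration, not a production load balancer; the endpoint URLs are hypothetical placeholders, not real services.

```python
import urllib.request

# Hypothetical health-check endpoints for two independent cloud providers
# (illustrative placeholders, not real services).
PRIMARY = "https://app.primary-cloud.example/healthz"
SECONDARY = "https://app.secondary-cloud.example/healthz"

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # DNS failure, connection refused, timeout, etc.
        return False

def pick_endpoint() -> str:
    """Route traffic to the primary cloud, falling back to the secondary."""
    if is_healthy(PRIMARY):
        return PRIMARY
    return SECONDARY
```

In practice this logic lives in DNS failover or a global load balancer rather than application code, and an MSP would typically manage the cutover; the sketch only shows why independent providers make the fallback path meaningful.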
Identifying the Most Critical Applications
For teams hosting and building applications in the cloud, it can be tempting to treat every component as absolutely critical (some may be, depending on the size of the company). However, replicating every component across multiple clouds can drive up costs significantly. To keep costs in check, it's essential to identify the components that are truly critical to smooth operation: those whose sudden unavailability would stop the application from functioning. Those components should be hosted in parallel to ensure availability.
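One way to make that identification concrete is to walk the dependency graph from a customer-facing entry point: anything reachable is a hard dependency and belongs in the replicated tier. The component names and dependency map below are illustrative assumptions, not a real architecture.

```python
# Illustrative dependency map: each component lists what it cannot run without.
DEPENDENCIES = {
    "web-frontend": ["auth-service", "order-api"],
    "order-api": ["orders-db", "payment-gateway"],
    "auth-service": ["users-db"],
    "reporting-dashboard": ["analytics-db"],  # internal-only; can wait out an outage
}

def hard_dependencies(entrypoint: str, deps=DEPENDENCIES) -> set:
    """Transitively collect every component the entrypoint cannot run without."""
    critical, stack = set(), [entrypoint]
    while stack:
        component = stack.pop()
        for dep in deps.get(component, []):
            if dep not in critical:
                critical.add(dep)
                stack.append(dep)
    return critical

# Everything reachable from the customer-facing entry point gets multi-cloud
# replication; the reporting stack can stay on a single provider.
replicate = hard_dependencies("web-frontend")
```

Here the traversal flags five components for parallel hosting while leaving the internal analytics stack on a single provider, which is exactly the cost trade-off described above.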
Danger Lies in Complacency
It can be tempting to trust claims of 99.999% availability; after all, what harm could a few minutes of downtime cause? In the abstract, very little; in practice, that vulnerability can be catastrophic, as the cascade effect can turn a brief interruption into hours of backed-up traffic and millions of dollars in lost revenue. As companies increasingly rely on cloud hosting for its benefits, they must also be aware of the risks. With the string of high-profile outages continuing, taking steps to mitigate those risks can pay huge dividends, keeping critical applications and operations live.

