When One Cloud Isn’t Enough
October 21, 2025

Andrew McKay
Director of Marketing

We’re not going to rehash why outages happen—they always will. What matters is whether your business can keep moving while the cloud recovers.
On October 20th, 2025, around 3 a.m. ET, phones started buzzing and dashboards lit up. Operations teams were pulled into emergency bridges as tickets, Slack messages, and Teams notifications flooded in. Within minutes, AWS US-EAST-1 was showing widespread disruption. According to AWS’s updates, DNS resolution issues on the DynamoDB API endpoint rippled outward, interrupting the control plane (the layer of APIs and automation that keeps workloads coordinated).
Developers did everything right. Infrastructure was spread across multiple Availability Zones, failover scripts were ready, and monitoring was live. But when the control plane became unreliable, recovery tools stopped responding. The workloads were ready, but the systems that controlled them weren’t.
It wasn’t negligence or poor design. It was a demonstration of how even the best-built environments can be stopped by a dependency you don’t control.
How Do We Make Sure This Doesn’t Happen Again?
With another outage behind us, thousands of teams are combing through logs, replaying alerts, and answering the same question from leadership: how do we make sure this doesn’t happen again?
For operations teams, the postmortem begins immediately—tracking what failed, what didn’t, and what they couldn’t control. Developers sift through automation logs and retry queues, trying to understand why their failover scripts never fired.
CIOs, meanwhile, face a familiar tension: confidence in AWS’s resilience on one hand, and a clear reminder on the other that no single provider can promise uninterrupted continuity.
The answer is not moving off AWS. The platform remains essential. What’s needed is continuity—the ability to keep running when a cloud goes down. But that level of resilience doesn’t come from adding more redundancy inside one platform; it comes from extending beyond it.
Why “Well-Architected” Fell Short
The AWS Well-Architected Framework has taught teams to design for failure: distribute workloads, automate recovery, and manage infrastructure as code. Under normal conditions, those principles work. Yesterday’s outage revealed their limits—because when the control plane itself degrades, the very systems designed to enable resilience become part of the outage.
Failover policies that depend on AWS APIs, CloudFormation, or Lambda were impacted because those same services reside in the affected region. In some cases, automation pipelines couldn’t authenticate to IAM, meaning recovery steps failed before they started. Multi-region architectures still required access to Route 53 or regional DNS services to reroute traffic, but those lookups were being resolved through the same control plane that was down. Even organizations with pre-staged standby environments found their automation triggers—often Lambda functions or Step Functions workflows—were pinned to unavailable endpoints.
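The shared-fate problem described above can be made concrete with a small audit: list each recovery step alongside the control endpoint it calls, and flag any step whose endpoint lives in the affected region. This is a minimal illustrative sketch, not a real inventory; the step names, endpoints, and region string are assumptions for the example.

```python
# Hypothetical shared-fate audit: which recovery steps depend on the
# same region that just failed? All names here are illustrative.

AFFECTED_REGION = "us-east-1"

RECOVERY_STEPS = [
    {"name": "promote standby DB",  "endpoint": "dynamodb.us-east-1.amazonaws.com"},
    {"name": "flip DNS weight",     "endpoint": "dns.neutral-provider.example"},
    {"name": "run failover Lambda", "endpoint": "lambda.us-east-1.amazonaws.com"},
]

def shares_fate(step, affected_region):
    """A step shares fate with the outage if its control endpoint
    resides in the affected region."""
    return affected_region in step["endpoint"]

def audit(steps, affected_region=AFFECTED_REGION):
    """Return the names of steps that would fail before they start."""
    return [s["name"] for s in steps if shares_fate(s, affected_region)]

print(audit(RECOVERY_STEPS))
# -> ['promote standby DB', 'run failover Lambda']
```

Only the externally hosted DNS step survives the audit, which is exactly the pattern the outage exposed: recovery automation pinned to in-region endpoints fails alongside the workloads it was meant to save.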
This isn’t a criticism of AWS. Every hyperscaler operates with some shared management infrastructure that can introduce systemic risk. The takeaway is that resilience engineered inside a single cloud stops where the provider’s control plane begins.
True continuity requires a layer of control that exists outside your provider’s operational boundary—one that can issue commands, reroute traffic, or promote environments even when the platform’s own management systems are impaired. That’s the foundation hybrid architectures are built on.
Hybrid Cloud Is the Foundation of Resilience
When AWS US-EAST-1 went down, the issue wasn’t capacity. It was control. APIs, automation, and identity services that keep workloads responsive became unreliable. Many teams watched healthy servers sit idle because the systems directing them had stopped listening. Deployments paused, DNS updates lagged, and scaling scripts failed mid-run.
That’s the limit of single-platform resilience. Redundancy inside one cloud can’t help if the tools that orchestrate recovery depend on the same region that failed.
Hybrid architecture breaks that loop. It separates where workloads run from where they’re managed. In a hybrid model, key control functions such as automation, routing, and identity can live in a different environment, ready to act when the primary cloud can’t.
Imagine a DevOps team running its production environment in AWS while maintaining a secondary management environment on VMware infrastructure, using Kubernetes clusters on Tanzu and virtualized workloads on ESXi. Their build pipelines, DNS, and orchestration tools connect to AWS through private network links. When a regional outage disrupts AWS control services, those systems remain online. They can redirect user traffic, activate pre-staged workloads in other regions, and maintain authentication until AWS restores its control plane.
Now consider another approach: a company that uses Azure as a lightweight redundancy layer for its AWS environment. Databases replicate asynchronously, and DNS is hosted externally through a neutral provider. When AWS experiences a regional event, that company can serve users from its Azure footprint while the primary environment stabilizes. Customers might notice slower responses, but not downtime.
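The core of that second approach is routing logic that lives outside either provider: a neutral traffic layer picks the best healthy origin rather than asking the impaired cloud to fail itself over. The sketch below assumes two origins with health signals and priorities; the provider names and data structures are illustrative, not a prescribed implementation.

```python
# Minimal sketch of provider-neutral origin selection, as an externally
# hosted DNS/traffic layer might apply it. Names are illustrative.

from dataclasses import dataclass

@dataclass
class Origin:
    name: str
    healthy: bool
    priority: int  # lower number = preferred under normal conditions

def pick_origin(origins):
    """Serve from the highest-priority healthy origin; degrade to the
    secondary rather than going dark when the primary is impaired."""
    live = [o for o in origins if o.healthy]
    if not live:
        raise RuntimeError("no healthy origin available")
    return min(live, key=lambda o: o.priority).name

origins = [
    Origin("aws-us-east-1", healthy=False, priority=0),  # primary, impaired
    Origin("azure-eastus",  healthy=True,  priority=1),  # redundancy layer
]
print(pick_origin(origins))  # -> azure-eastus
```

Because the selection runs outside both clouds, it keeps working even when the primary provider’s own DNS and automation are part of the outage.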
Both examples point to the same principle. Resilience isn’t about replacing public cloud—it’s about keeping control when one cloud loses it.
This is why Lightedge has fully embraced a hybrid-first philosophy. It’s a practical, proven model for continuity across hyperscale and private platforms. Hybrid brings together the scalability of public cloud with the governance and reliability of private infrastructure. It ensures your organization can run anywhere and recover everywhere.
Hybrid cloud isn’t an alternative to public cloud. It’s the reason your business stays online when one cloud isn’t.
Portability Is the Key to Making Hybrid Work
The key to resilience is portability: the ability to move workloads, data, and automation across environments without rewriting everything. It is what turns hybrid architecture from theory into practice.
Portability starts with design. Workloads built on open technologies such as containers, Kubernetes, Terraform, and Ansible can run consistently across platforms. The same applies to open database engines like MySQL and PostgreSQL, which operate on both private cloud and public cloud infrastructure. These choices do not replace cloud-native services, but they prevent them from becoming single points of dependency.
Data continuity is equally important. Replicating databases or object stores through native replication tools or change data capture (CDC) pipelines keeps information available even if a region becomes isolated. Combined with independent DNS and traffic management, workloads can shift without waiting for a provider’s internal systems to recover.
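Whether replicated data is actually usable during a failover comes down to a simple comparison: is the replica’s lag within the data-loss budget the business has accepted (its recovery point objective, or RPO)? The sketch below shows that promotion decision under assumed thresholds; the 60-second RPO and lag values are illustrative.

```python
# Hedged sketch: is a secondary safe to promote during a regional event?
# The RPO threshold and lag figures are example values only.

def promotable(replica_lag_seconds, rpo_seconds=60):
    """A replica is promotable when its replication lag (via native
    replication or a CDC pipeline) is within the accepted RPO."""
    return replica_lag_seconds <= rpo_seconds

print(promotable(12))   # -> True  (12 s behind, within a 60 s RPO)
print(promotable(300))  # -> False (5 min behind, beyond the RPO)
```

Checking this continuously, rather than during the incident, is what lets a team shift traffic with confidence while the primary region is still degraded.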
Operational portability matters as well. When your CI/CD pipelines or automation logic exist entirely inside one cloud, they share that provider’s fate. Hosting those control layers within a private or secondary environment, whether a managed hybrid platform or your own infrastructure, creates a command surface that stays online even when a cloud’s automation is degraded.
Portable architectures still rely on hyperscale resources, but they are designed to operate independently of any single provider’s control plane. This is how real resilience is achieved: through open standards, consistent tooling, and the ability to act when your cloud cannot.
At Lightedge, we believe resilience begins with open, portable, hybrid infrastructure that keeps organizations in control wherever their workloads run.
Building Beyond the Cloud You Depend On
The old definition of “well-architected” focused on automating everything inside one cloud. The new definition expands that scope: automation must work across clouds and outside them when needed.
Control-plane independence, open standards, and hybrid connectivity are now foundational to business continuity. They make sure that when a major platform experiences disruption, your response doesn’t have to wait for its recovery.
This evolution reflects where cloud architecture is heading: toward hybrid environments that combine the elasticity of hyperscale with the control of private infrastructure. It’s not about abandoning the public cloud. It’s about ensuring that your business doesn’t pause when it does.
Yesterday’s outage didn’t reveal a flaw in AWS. It revealed a universal truth: reliance on a single platform, no matter how reliable, creates blind spots.
Each hyperscaler brings immense capability. But real resilience isn’t about choosing between them—it’s about creating systems that can function across them. When continuity depends on more than one environment, your business gains flexibility, leverage, and control.
Cloud computing has always been about agility and scale. The next phase is about continuity and control—building architectures that can withstand failures without waiting for a provider to recover first. That means designing environments that are open, portable, and connected, whether they run in AWS, a private data center, or somewhere in between.
The Path Forward
For most organizations, achieving this does not mean rebuilding everything. It means adding deliberate, lightweight layers of independence around what already works. The network that connects these layers is just as important as the infrastructure itself. A resilient network fabric links private, public, and secondary environments through secure, low-latency connections so workloads and users can move freely when a provider experiences issues.
Build outward from what matters most.
• Identify the systems that must stay online no matter what—authentication, orchestration, and core data—and ensure they have a secondary presence in another hyperscaler or a private environment designed for continuity.
• Replicate or back up critical data to a secondary provider or managed private cloud.
• Keep essential automation tools and DNS systems in an environment that is not dependent on the same control plane.
• Build a network fabric that can reroute workloads and users quickly without starting from scratch.
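The four steps above can be tracked as a simple readiness checklist that surfaces remaining single-provider dependencies. The check names and their pass/fail states below are placeholders, not a real assessment.

```python
# Illustrative continuity checklist for the steps above; the entries
# and booleans are hypothetical placeholders.

CHECKS = {
    "critical systems have a secondary presence": True,
    "critical data replicated off-provider": True,
    "automation and DNS outside the primary control plane": False,
    "network fabric can reroute workloads and users": True,
}

def continuity_gaps(checks):
    """Return the steps that still depend on a single provider."""
    return [name for name, done in checks.items() if not done]

for gap in continuity_gaps(CHECKS):
    print("gap:", gap)
# -> gap: automation and DNS outside the primary control plane
```

Reviewing this list after every incident, not just this one, keeps the dependency surface shrinking over time.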
Each of these steps reduces dependency and increases your ability to respond on your own timeline. That’s what resilience really means.
Public cloud computing remains one of the greatest enablers of innovation in modern IT. But as yesterday reminded us, no single platform can stand alone forever. The next stage of evolution isn’t more automation within the cloud—it’s more flexibility and control beyond it.