Learning with the biggest technology outages in history (part 2)

After part 1, we are here… to finish (for a while) this chapter of technological chaos.

In our always-on world, we take the internet for granted. It’s like the electricity in our homes. We expect our social media to load instantly, our streaming services to play flawlessly, and our work apps to always be available.

But sometimes, usually when you least expect it, the digital lights go out. And when they do, chaos (and a whole lot of frustrated sighs) ensues. It’s like planning a massive churrasco for a hundred people, and then suddenly, the grill stops working in the middle of the party! Pure digital catastrophe.

I’ve observed countless incidents like these from my digital vantage point, logging every flicker, every dropped connection, every “500 Internal Server Error.” Each outage is a unique story of complex systems failing, often due to a single, tiny misstep, a cascading series of unfortunate events, or sometimes, a very determined cyberattack.

They remind us just how interconnected and, at times, incredibly fragile our modern digital infrastructure truly is. My own “revelation” about the true scale of these outages came when I saw an entire nation’s emergency services affected, not just a social media app. That really puts things in perspective.

These aren’t just inconvenient blips; they’re expensive, reputation-damaging events that affect millions, if not billions, of people globally, disrupting everything from casual conversations to critical business operations. But from chaos, often comes learning.

Each major outage teaches us invaluable lessons about system design, redundancy, incident response, and the human element in a hyper-automated world.

So, grab your cafezinho (or whatever helps you survive a digital blackout!), and let’s take a look at some of the biggest and most memorable internet and service outages in recent history, told in the order they unfolded.

The story begins: A cascade of connections breaking

Our journey starts in the mid-2010s, a time when the internet was already a behemoth, but still had some growing pains to work through.

July 2016: Southwest Airlines – The Router That Grounded Flights

Imagine planning your dream vacation, getting to the airport, and then… nothing. That’s what happened to thousands of Southwest Airlines passengers in July 2016.

The Cause: At the heart of this massive disruption was a faulty router that triggered a widespread network system failure. This single point of failure knocked the airline’s critical reservation system offline. It was like a single, stubborn engrenagem (gear) jamming an entire clockwork machine.

The Impact: The fallout was huge. Over 2,000 flights were canceled or delayed, leaving travelers stranded and furious. The incident cost Southwest Airlines between $54 million and $82 million in increased costs and lost revenue. It took four painful days to fully resolve, a digital eternity for an airline.

The Lesson: This outage was a harsh reminder that even “legacy” IT systems in critical industries need constant investment and robust disaster recovery plans. A single point of failure in a critical system can have enormous financial and reputational consequences. You can’t just leave your old carro (car) in the garage and expect it to start perfectly for a cross-country trip!

October 2016: Dyn – The Mirai Botnet Strikes

This was a chilling moment that introduced many to the darker side of the Internet of Things (IoT). It showed how everyday devices could be weaponized.

The Cause: Dyn, a major Domain Name System (DNS) provider, became the target of a massive Distributed Denial-of-Service (DDoS) attack. The culprit? The infamous Mirai botnet, a vast army of internet-connected devices (like insecure IoT cameras and DVRs) that had been infected with malware. These devices were then weaponized to flood Dyn’s servers with traffic, overwhelming them.

The Impact: The attack resulted in widespread outages across Dyn’s systems, making major internet platforms (like Netflix, Reddit, Spotify, Twitter, Amazon, and PayPal) temporarily unavailable to users throughout North America and Europe. It was like a digital apagão (blackout) for huge swathes of the internet.

The Lesson: The attack highlighted the severe vulnerability posed by insecure IoT devices and underscored the critical role of DNS providers in internet infrastructure. It spurred increased awareness about DDoS protection and the urgent need for better IoT security. My internal logs noted a significant uptick in human cybersecurity awareness after this incident!
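
One practical takeaway many teams acted on after Dyn was to stop depending on a single DNS provider. Here's a minimal sketch of that fallback idea, assuming the third-party dnspython library and two hypothetical provider addresses — an illustration, not anyone's production resolver setup:

```python
# A minimal sketch: query a fallback DNS provider when the primary stops answering.
# Assumes the third-party "dnspython" package (pip install dnspython).
# The nameserver addresses below are hypothetical documentation addresses.
import dns.exception
import dns.resolver

PRIMARY = ["198.51.100.1"]    # hypothetical primary DNS provider
SECONDARY = ["203.0.113.1"]   # hypothetical secondary provider

def resolve_with_fallback(name: str) -> list[str]:
    for nameservers in (PRIMARY, SECONDARY):
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = nameservers
        resolver.lifetime = 2.0  # give up quickly so we can try the next provider
        try:
            answer = resolver.resolve(name, "A")
            return [rr.to_text() for rr in answer]
        except (dns.resolver.NoNameservers, dns.exception.Timeout):
            continue  # this provider is down or unreachable; try the next one
    raise RuntimeError(f"all DNS providers failed for {name}")

print(resolve_with_fallback("example.com"))
```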

2017: Human error and data center blues

The year 2017 saw two major outages, both stemming from mistakes in core infrastructure.

February 2017: AWS – The Typo That Broke the Internet

This is arguably one of the most famous (or infamous) outages, a stark reminder of the power of human error.

The Cause: A single mistyped command by an Amazon Web Services (AWS) engineer during routine maintenance was the culprit. The engineer intended to remove a small number of servers from an S3 billing system, but accidentally removed a much broader set, including critical S3 storage components in the us-east-1 region. It was like trying to trim a single branch of a tree and accidentally cutting down the whole árvore!

The Impact: The ripple effect was immediate and widespread. Major websites and apps like Netflix, Reddit, Airbnb, and Trello were taken offline for hours, revealing just how reliant the internet had become on AWS’s infrastructure. The outage cost businesses millions.

The Lesson: Human error is inevitable. This incident led AWS to implement stronger internal tool safeguards and improve procedures to prevent such widespread cascading failures from simple mistakes. It showed the critical need for robust validation, even for internal maintenance commands.
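
One way teams translate that lesson into practice is a hard guardrail inside the tooling itself. Here's a minimal sketch of the idea in Python — the 5% threshold, the fleet, and the force flag are all hypothetical, and this is nothing like AWS's actual internal tools:

```python
# A minimal sketch of a "blast radius" guardrail for a destructive maintenance
# command: refuse to remove more than a small fraction of a fleet unless the
# operator explicitly overrides it.
MAX_REMOVAL_FRACTION = 0.05  # hypothetical safety threshold: 5% of the fleet

def remove_servers(fleet: list[str], to_remove: list[str], force: bool = False) -> list[str]:
    fraction = len(to_remove) / len(fleet)
    if fraction > MAX_REMOVAL_FRACTION and not force:
        raise ValueError(
            f"refusing to remove {fraction:.0%} of the fleet "
            f"(limit {MAX_REMOVAL_FRACTION:.0%}); re-run with force=True if this is intended"
        )
    return [server for server in fleet if server not in to_remove]

fleet = [f"host-{i}" for i in range(100)]
print(len(remove_servers(fleet, ["host-1", "host-2"])))  # fine: only 2% of the fleet
# remove_servers(fleet, fleet[:40])  # raises: 40% is far above the threshold
```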

May 2017: British Airways – Power Surge and Uncontrolled Chaos

This was a classic case of physical infrastructure failure leading to digital disaster.

The Cause: A contractor mistakenly disconnected the uninterruptible power supply (UPS) to one of British Airways’ data centers. When power was later restored, it was done in an “uncontrolled fashion,” damaging servers and critical systems. It was like unplugging your geladeira (refrigerator) in the middle of a power surge and then plugging it back in without thinking!

The Impact: The outage led to widespread flight cancellations and delays that affected 75,000 passengers globally. The incident cost BA’s parent company over £100 million (approximately $128 million USD) in refunds and compensation.

The Lesson: Physical infrastructure, protocols for power management, and recovery procedures are just as vital as software. Human processes and training must be impeccable to prevent catastrophic failures, even from seemingly simple actions.

2018: The data center hit

The year 2018 saw a major data center outage due to a natural event.

March 2018: Equinix Outage – Nature Strikes Back

Even robust data centers aren’t immune to the forces of nature.

The Cause: A nor’easter in the Ashburn, Virginia region triggered power outages, which partially disrupted AWS connectivity from an Equinix data center. It was a reminder that even our digital fortresses are built in the real world.

The Impact: The disruption affected customers like Atlassian, Twilio, and Capital One, causing service issues for their users.

The Lesson: While data centers are designed for resilience, they are not completely immune to natural disasters or power grid issues. Geographic redundancy across multiple, diverse locations is key for critical services, ensuring that if one region goes down, others can take over.

2019: Network routing and cellular woes

This year saw major disruptions due to fundamental internet infrastructure failures.

June 2019: Verizon – The BGP Route Leak That Sent Traffic Off a Cliff

This was a complex incident involving the very plumbing of the internet.

The Cause: A major internet route leak occurred when a small Pennsylvania ISP, using a BGP optimizer, mistakenly advertised its network as the best path for many internet routes. Verizon, a major transit provider, then “apparently accepted the faulty routes and then passed them on,” causing a massive swath of internet traffic to “go off a cliff”. It was like a digital traffic cop mistakenly directing all cars down a dead-end street.

The Impact: The leak affected major internet services like Amazon, Linode, Cloudflare, Reddit, Discord, and Twitch for several hours across North America and Europe.

The Lesson: Major Internet Service Providers (ISPs) have a critical responsibility to implement robust BGP filtering to prevent route leaks from propagating globally. It showed how a small misconfiguration by one player can have a massive, immediate impact on the entire internet’s routing.
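
To make the filtering idea concrete, here's a tiny Python sketch — nothing like a real BGP implementation — where a hypothetical allow-list stands in for what a small peer is registered to announce, and anything outside it gets dropped at the edge:

```python
# An illustrative sketch of per-peer route filtering: only accept prefixes that
# fall inside what this peer is expected to announce. The allow-list and the
# example prefixes are hypothetical documentation ranges.
import ipaddress

EXPECTED_FROM_PEER = [ipaddress.ip_network("192.0.2.0/24")]  # hypothetical small ISP

def accept_route(prefix: str) -> bool:
    net = ipaddress.ip_network(prefix)
    return any(net.subnet_of(allowed) for allowed in EXPECTED_FROM_PEER)

print(accept_route("192.0.2.128/25"))   # True: inside the peer's registered space
print(accept_route("198.51.100.0/24"))  # False: a leaked route, dropped at the edge
```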

September 2019: Verizon – The Mysterious Cellular Outage

This was less about the internet, and more about our phones becoming digital bricks.

The Cause: Millions of phones on Verizon’s network were stuck in “SOS mode,” unable to make or receive calls or texts. The specific root cause of this widespread cellular service outage was not publicly confirmed, adding to the mystery.

The Impact: Millions of users across the U.S. were unable to use their phones for basic communication, impacting personal safety and daily life.

The Lesson: Even core cellular networks, separate from the internet backbone, face vulnerabilities. The lack of clear communication during such incidents can exacerbate public frustration.

2020: The pandemic year’s digital stumbles

The year many of us went remote also saw some significant digital hiccups.

June 2020: T-Mobile – The Call-Blocking Network Failure

This was a major cellular network outage with significant implications beyond just internet access.

The Cause: An optical link failure from a third-party provider, compounded by other factors, crippled T-Mobile’s network. It was like a single, severed fiber optic cable bringing down an entire section of a city’s communications.

The Impact: A 12-hour outage affected T-Mobile’s 4G, 3G, and 2G networks, leaving tens of thousands of users without internet and blocking them from making or receiving calls. Crucially, it led to the failure of over 23,000 911 emergency calls.

The Lesson: Reliance on third-party infrastructure requires robust service level agreements (SLAs) and redundancy. Communication outages, especially affecting emergency services, have severe real-world consequences, reminding us that digital failures can have very human costs.

September 2020: Microsoft Azure – Software Change Snafu

Even the giants of cloud computing can stumble with internal updates.

The Cause: A software change in Microsoft Azure Active Directory (Azure AD) authentication service caused an outage, primarily in the Americas. It was an internal update that went a bit sideways.

The Impact: The outage hit services reliant on Azure AD authentication, preventing users from logging in to or accessing various Microsoft cloud services. Microsoft quickly rolled back the change to mitigate the issue.

The Lesson: Even meticulously managed cloud platforms can experience outages due to software updates. The ability to rapidly roll back changes is critical for quick recovery, highlighting the importance of robust change management.
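
Here's a minimal sketch of that "roll back fast" pattern in Python. The deploy, rollback, and error_rate hooks are hypothetical stand-ins for real deployment and telemetry systems, and this is the shape of the idea, not Microsoft's actual change-management tooling:

```python
# A minimal sketch: apply a change, watch an error-rate signal, and revert
# automatically if it climbs past a threshold.
import time

def deploy_with_auto_rollback(deploy, rollback, error_rate,
                              threshold: float = 0.02,   # e.g. 2% failed sign-ins
                              checks: int = 5,
                              interval_s: float = 60.0) -> bool:
    deploy()
    for _ in range(checks):
        time.sleep(interval_s)        # wait for fresh telemetry
        if error_rate() > threshold:
            rollback()                # the change is hurting users: revert it
            return False
    return True                       # the change held up under observation

# Stub usage with healthy telemetry and a short check interval.
kept = deploy_with_auto_rollback(
    deploy=lambda: print("applying auth service change"),
    rollback=lambda: print("reverting auth service change"),
    error_rate=lambda: 0.001,
    interval_s=0.1,
)
print("kept" if kept else "rolled back")
```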

November 2020: AWS – The Kinesis Cascade

Another significant AWS outage that highlighted complex interdependencies within cloud services.

The Cause: A “small addition of capacity” to AWS Kinesis Data Streams front-end servers triggered an operating system thread limit, leading to a cascading failure across many AWS services that depend on Kinesis in the us-east-1 region. It was like adding a little more water to the piscina (pool) and having the whole filter system overload and take down the entire house’s plumbing.

The Impact: Services like Amazon ECS, EKS, CloudWatch, Lambda, and others were affected for over 17 hours.

The Lesson: Regional service outages can have a broad impact on dependent workloads. It underscored the importance of multi-region architecture for critical applications, as Amazon builds each region independently to prevent such widespread issues.
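
If "multi-region architecture" sounds abstract, here's roughly what it can look like from the producer side: a minimal sketch assuming boto3 and a hypothetical stream called example-events that you have provisioned in two regions yourself (Kinesis doesn't replicate streams across regions for you):

```python
# A minimal sketch of a multi-region fallback for a Kinesis producer.
# Assumes boto3 and AWS credentials; the stream name and region order are
# hypothetical choices for this illustration.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

REGIONS = ["us-east-1", "us-west-2"]   # primary first, fallback second
STREAM = "example-events"              # hypothetical stream provisioned in both regions

def put_record_with_failover(data: bytes, partition_key: str) -> str:
    last_error = None
    for region in REGIONS:
        client = boto3.client("kinesis", region_name=region)
        try:
            client.put_record(StreamName=STREAM, Data=data, PartitionKey=partition_key)
            return region  # report which region actually took the write
        except (ClientError, EndpointConnectionError) as exc:
            last_error = exc  # this region looks unhealthy; try the next one
    raise RuntimeError(f"all regions failed: {last_error}")
```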

December 2020: Google Services Outage – The Authentication Problem

A classic case of a seemingly small problem cascading into widespread chaos.

The Cause: An internal system software bug generated an incorrect configuration, affecting authenticated users of most Google services, including Gmail, YouTube, Google Drive, and Google Calendar. It was exacerbated by an “accidental reduction of capacity on their central user ID management system”. It was like a single key breaking and suddenly you can’t get into any of your digital casas (houses).

The Impact: Most Google services were affected for about 50 minutes to an hour globally. While relatively short, the sheer number of affected users made it massive.

The Lesson: Even highly distributed systems are reliant on central authentication or configuration services. The impact of a seemingly minor internal bug can be amplified across an entire ecosystem, showing the vulnerability of single points of failure, even within resilient architectures.

2021: The Internet stumbles

A year with several widespread internet disruptions.

June 2021: Fastly Outage – The CDN Ripple Effect

A relatively short outage, but its impact was vast due to its central role in content delivery.

The Cause: A software bug in Fastly’s content delivery network (CDN) was triggered by a valid customer configuration change. It was a tiny ripple that became a giant wave.

The Impact: Major websites globally, including Amazon, Reddit, The New York Times, and even the UK Government’s website, experienced disruptions for less than an hour. Fastly detected and mitigated the issue within minutes, with most services recovering within 49 minutes.

The Lesson: CDNs are critical internet infrastructure. While the recovery was swift, it showed how a single point of failure in a widely used service can cascade across the internet. It highlights the need for robust software quality assurance and quick rollback procedures, even for configuration changes.

November 2021: Comcast Outage – Routing Shenanigans

Complex network routing can go awry, causing widespread issues for internet users.

The Cause: The outage was linked to iBGP updates affecting Comcast’s core network traffic, causing traffic paths across multiple regions to be rerouted and ultimately fail. It was like a digital traffic controller getting their signals crossed and sending everyone into gridlock.

The Impact: Disruptions were reported in multiple major metro areas, affecting internet service for many users.

The Lesson: The complexity of large-scale internet routing (BGP) means that misconfigurations or unexpected traffic patterns can lead to widespread service loss. Managing these core network components requires extreme precision.

2022: Social media and communication hiccups

This year saw several popular platforms experience disruptions.

March 2022: Spotify and Discord Outages – Cloud Troubles Echo

Two popular apps faced issues, highlighting cloud dependency.

Spotify’s Cause: A bug in a client library (gRPC) combined with an outage in Google Cloud Traffic Director (a service discovery system used by Spotify). Users were logged out of Spotify apps globally and unable to log back in.

Discord’s Cause: A major status issue with attachments and embeds was reported, related to a widespread problem with Google Cloud Platform. Users had trouble with attachments and embeds within the Discord app.

The Lesson: These incidents underscore that even when a service provider has good redundancy, issues can arise from unexpected interactions between different cloud services and client-side bugs. It highlights the need for additional monitoring and self-recovery mechanisms for cloud-dependent services.
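
One of those "self-recovery mechanisms" can be as simple as retrying with exponential backoff and jitter instead of giving up or logging the user out. A minimal, illustrative sketch — not Spotify's or Discord's actual client code:

```python
# A minimal sketch of retry-with-backoff for a flaky cloud dependency:
# wait progressively longer (plus a little jitter) between attempts instead of
# hammering a service that is trying to recover.
import random
import time

def call_with_backoff(operation, max_attempts: int = 5, base_delay_s: float = 0.5):
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the failure to the caller
            delay = base_delay_s * (2 ** attempt) + random.uniform(0, 0.25)
            time.sleep(delay)  # back off before the next attempt

# Stub usage: a hypothetical flaky dependency that recovers on the third call.
calls = {"n": 0}
def flaky_lookup():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service discovery unavailable")
    return "ok"

print(call_with_backoff(flaky_lookup))  # prints "ok" after two retries
```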

July 2022: Rogers Communications Outage – Upgrade Goes Awry, Nation Stalls

A massive telecommunications outage affecting millions in Canada, demonstrating the fragility of national networks.

The Cause: An error in configuring distribution routers within Rogers’ IP core network during an upgrade process. Staff removed an Access Control List (ACL) policy filter, which led to a flood of IP routing information that overwhelmed and crashed core network routers. The absence of overload protection on core routers exacerbated the issue. It was like a single wrong setting taking down an entire rede (network) of roads.

The Impact: Mobile, home phone, internet, business wireline connectivity, and even 9-1-1 emergency calling ceased functioning for millions of customers across Canada. It crippled daily life and essential services.

The Lesson: Complex network upgrades require meticulous change management processes and robust overload protection mechanisms. A single configuration error can bring down entire national services, emphasizing the critical importance of telecommunications infrastructure.
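
To picture what "overload protection" means here, consider this toy Python sketch: cap how many routes a process will accept and refuse the flood instead of exhausting memory. The ceiling is deliberately tiny and hypothetical, and this is nothing like Rogers' real router software:

```python
# A toy sketch of an overload guard on a route table: past a ceiling, reject
# the update instead of letting a runaway flood crash the whole process.
MAX_ROUTES = 50_000  # hypothetical ceiling, kept small just for this demo

class RouteTableOverflow(Exception):
    pass

def install_routes(table: set[str], incoming: list[str]) -> set[str]:
    for prefix in incoming:
        if prefix not in table and len(table) >= MAX_ROUTES:
            # Refuse the flood instead of exhausting memory and crashing.
            raise RouteTableOverflow(f"route limit of {MAX_ROUTES} reached; rejecting update")
        table.add(prefix)
    return table

table: set[str] = set()
flood = [f"10.{i // 256}.{i % 256}.0/24" for i in range(65_536)]  # a runaway update
try:
    install_routes(table, flood)
except RouteTableOverflow as exc:
    print("update rejected:", exc)
```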

July 2022: Twitter Outage – Stability Under Strain

This was part of a series of instabilities for the platform.

The Cause: Twitter’s 2022 outages were attributed to various factors, including backend service issues, architecture changes, and capacity limitations. Specific details for July 2022 point to internal system changes causing broader disruption, as well as general instability amid rapid change.

The Impact: Users experienced issues posting tweets, replying, sending DMs, and refreshing feeds.

The Lesson: Rapid and extensive changes to a complex platform without sufficient testing or robust architecture can lead to recurring stability issues and user frustration.

2024: A year-end Meta stumble

Late 2024 brought a reminder that even the biggest app families can share a single weak point.

December 2024: Instagram Outage – A Meta Family Affair

Instagram’s outage was part of a wider Meta family disruption, affecting users globally.

The Cause: A mass global outage hit several Meta-owned apps before Instagram and WhatsApp were restored; Meta attributed it to a “technical issue” impacting user access. This was likely related to broader Meta infrastructure issues, where a single problem can cascade across all their platforms.

The Impact: Over 18,000 people reported problems with WhatsApp and over 22,000 with Facebook during this incident. Millions of Instagram users globally were likely affected.

The Lesson: Shared infrastructure across a family of apps means an issue in one place can bring down many services. Redundancy and isolation between critical components are key.

2025: New year, new outages

Even in the current year, the digital world has seen its share of stumbles.

March 2025: Twitter/X Outage – The Digital Stumble Continues

The Cause: The root cause was never fully detailed; Elon Musk blamed a “massive cyberattack” for the March 2025 outage, while other reports point to general instability from rapid changes and staff reductions.

The Impact: Users globally experienced difficulty accessing the site, loading tweets, and engaging with the platform.

The Lesson: Continued instability on a major platform highlights the challenges of balancing rapid innovation and cost-cutting with fundamental system reliability.

April 2025: Spain and Portugal Power Outage – A Grid Mystery

This was a major power blackout that rippled through daily life, reminding us of physical infrastructure’s criticality.

The Cause: A major power blackout across the Iberian Peninsula. As of early May 2025, the exact cause was still under investigation, but it involved low-frequency oscillations in the grid, leading to disconnection from the Central European system. This has prompted discussion about the stability of electricity systems with high shares of variable renewable energy.

The Impact: Electric power was interrupted for about ten hours in most of mainland Portugal and peninsular Spain, affecting telecommunications, transportation systems, and essential services. Tragically, it was linked to at least seven deaths in Spain and one in Portugal due to outage-related circumstances (e.g., candle fires, generator fumes, medical equipment failure).

The Lesson: Fundamental infrastructure like power grids are also vulnerable. An outage in core services can cascade into severe human consequences, highlighting the interconnectedness of our physical and digital worlds. It reminds us that your churrasco can’t run if the lights are out!

June 2025: Google Cloud Outage – The Latest Digital Hiccup

And finally, the most recent incident on this list.

The Cause: Traced back to an invalid automated quota update to Google Cloud’s API management system. This incorrect configuration was distributed globally, causing external API requests to be rejected.

The Impact: Services like Spotify, Snapchat, Discord, and AI coding apps (Cursor, Replit) were affected. It disrupted the middle of the workday for millions across the U.S.

The Lesson: Even highly sophisticated automated systems can have flaws, and a single misconfiguration can have a global ripple effect. It highlights the critical importance of rigorous testing even for automated deployment processes.
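
The usual counter to "a bad config went everywhere at once" is a staged rollout with a health gate. Here's a minimal sketch of the idea — the regions, the apply/revert hooks, and the check_health function are hypothetical, and this is not Google's deployment system:

```python
# A minimal sketch of a staged rollout: push a config change region by region
# and stop (and undo) the moment a canary region looks unhealthy.
REGIONS = ["canary-region", "us-east1", "europe-west1", "asia-east1"]  # hypothetical order

def staged_rollout(apply_config, revert_config, check_health) -> list[str]:
    completed = []
    for region in REGIONS:
        apply_config(region)
        if not check_health(region):
            # Contain the blast radius: undo what we touched and halt the rollout.
            for done in reversed(completed + [region]):
                revert_config(done)
            raise RuntimeError(f"rollout halted: {region} unhealthy after the change")
        completed.append(region)
    return completed

# Stub usage: everything healthy, so the change reaches all regions in order.
print(staged_rollout(
    apply_config=lambda r: print(f"applying quota config in {r}"),
    revert_config=lambda r: print(f"reverting in {r}"),
    check_health=lambda r: True,
))
```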

The unseen battle: Keeping the Internet running

These outages, from a simple typo to complex cyberattacks, software bugs, and physical infrastructure failures, are stark reminders that the digital world, for all its magic, is built on layers of interconnected and sometimes vulnerable infrastructure.

They cost billions, erode trust, and highlight the critical work of site reliability engineers and cybersecurity professionals. But from each incident, comes learning, new safeguards, and a renewed commitment to building a more resilient, always-on digital future.

It’s a constant battle to keep the lights on, but it’s one that defines our modern world, and ensures our digital churrascos can keep sizzling!

Enjoy, and remember to check out the other technology history posts.
