Interesting day. I've been on an incident bridge since 3AM. Our systems have mostly recovered now with a few back office stragglers fighting for compute.
The biggest miss on our side is that, although we designed a multi-region capable application, we could not run the failover process because our security org migrated us to Identity Center and only put it in us-east-1, hard locking the entire company out of the AWS control plane. By the time we'd gotten the root credentials out of the vault, things were coming back up.
Good reminder that you are only as strong as your weakest link.
This reminds me of the time that Google’s Paris data center flooded and caught on fire a few years ago. We weren’t actually hosting compute there, but we were hosting compute in AWS EU datacenter nearby and it just so happened that the dns resolver for our Google services elsewhere happened to be hosted in Paris (or more accurately it routed to Paris first because it was the closest). The temp fix was pretty fun, that was the day I found out that /etc/hosts of deployments can be globally modified in Kubernetes easily AND it was compelling enough to want to do that. Normally you would never want to have an /etc/hosts entry controlling routing in kube like this but this temporary kludge shim was the perfect level of abstraction for the problem at hand.
Probably. This was years ago so the details have faded but I do recall that we did weigh about 6 different valid approaches of varying complexity in the war room before deciding this /etc/hosts hack was the right approach for our situation
I remember Facebook had a similar story when they botched their BGP update and couldn't even access the vault. If you have circular auth, you don't have anything when somebody breaks DNS.
Wasn't there an issue where they required physical access to the data center to fix the network, which meant having to tap in with a keycard to get in, which didn't work because the keycard server was down, due to the network being down?
Way back when I worked at eBay, we once had a major outage and needed datacenter access. The datacenter process normally took about 5 minutes per person to verify identity and employment, and then scan past the biometric scanners.
On that day, the VP showed up and told the security staff, "just open all the doors!". So they did. If you knew where the datacenter was, you could just walk-in in mess with eBay servers. But since we were still a small ops team, we pretty much knew everyone who was supposed to be there. So security was basically "does someone else recognize you?".
Well, you put a lot of trust in the individuals in this case.
A disgruntled employee can just let the bad guys in on purpose, saying "Yes they belong here".
That works until they run into a second person. In a big corp where people don't recognize each other you can also let the bad guys in, and once they're in nobody thinks twice about it.
way back when DC's were secure but not _that secure_ i social engineered my way close enough to our rack without ID to hit a reset button before getting thrown out.
late reply but, no, i really needed to hit the button but didn't have valid ID at the time. My driver's license was expired and i couldn't get it renewed because of a outstanding tickets iirc. I was able to talk my way in and had been there many times before so knew my way around and what words to say. I was able to do what i needed before another admin came up and told me that without valid ID they have no choice but to ask me to leave (probably like an insurance thing). I was being a bit dramatic when i said "getting thrown out" the datacenter guys were very nice and almost apologetic about asking me to leave.
It wasn't Equinix, but I think the vendor was acquired by them. I don't actually blame them, I appreciated their security procedures. The five minutes usually didn't matter.
There's some computer lore out there about someone tripping a fire alarm by accident or some other event that triggered a gas system used to put out fires without water but isn't exactly compatible with life. The story goes some poor sys admin had to stand there with their finger on like a pause button until the fire department showed up to disarm the system. If they released the button the gas would flood the whole DC.
My point is that while the failure rate may be low the failure method is dude burns to death in a locked server room. Even classified room protocols place safety of personnel over safety of data in an emergency.
I remember hearing Google early in it's history had some sort of emergency back up codes that they encased in concrete to prevent them becoming a casual part of the process and they needed a jack hammer and a couple hours when the supposedly impossible happened after only a couple years.
> To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.
Classic.
In my first job I worked on ATM software, and we had a big basement room full of ATMs for test purposes. The part the money is stored in is a modified safe, usually with a traditional dial lock. On the inside of one of them I saw the instructions on how to change the combination. The final instruction was: "Write down the combination and store it safely", then printed in bold: "Not inside the safe!"
There is a video from the lock pick lawyer where he receives a padlock in the mail with so much tape that it takes him whole minutes to unpack.
Concrete is nice, other options are piles of soil or brick in front of the door. There probably is a sweet spot where enough concrete slows down an excavator and enough bricks mixed in the soil slows down the shovel. Extra points if there is no place nearby to dump the rubble.
Probably one of those lost in translation or gradual exaggeration stories.
If you just wanted recovery keys that were secure from being used in an ordinary way you can use Shamir to split the key over a couple hard copies stored in safety deposit boxes a couple different locations.
The Data center I’m familiar with uses cards and biometrics but every door also has a standard key override. Not sure who opens the safe with the keys but that’s the fallback in case the electronic locks fail.
The memory is hazy since it was 15+ years ago, but I'm fairly sure I knew someone who worked at a company whose servers were stolen this way.
The thieves had access to the office building but not the server room. They realized the server room shared a wall with a room that they did have access to, so they just used a sawzall to make an additional entrance.
my across the street neighbor had some expensive bikes stolen this way. The thieves just cut a hole in the side of their garage from the alley, security cameras were facing the driveway and with nothing on the alley side. We (the neighborhood) think they were targeted specifically for the bikes as nothing else was stolen and your average crack head isn't going to make that level of effort.
I assume they needed their own air supply because the automatic poison gas system was activating. Then they had to dodge lazers to get to the one button that would stop the nuclear missle launch.
add a bunch of other poinless scifi and evil villan lair tropes in as well...
Most datacenters are fairly boring to be honest. The most exciting thing likely to happen is some sheet metal ripping your hand open because you didn't wear gloves.
Still have my "my other datacenter is made of razorblades and hate" sticker. \o/
Not sure if you’re joking but a relatively small datacenter I’m familiar with has reduced oxygen in it to prevent fires. If you were to break in unannounced you would faint or maybe worse (?).
Not quite - while you can reduce oxygen levels, they have to be kept within 4pp so at worst, will make you light headed. Many athletes train at the same levels though so it’s easy to overcome.
That'd make for a decent heist comedy - a bunch of former professional athletes get hired to break in to a low-oxygen data center, but the plan goes wrong and they have to use their sports skills in improbable ways to pull it off.
Halon was used back in the day for fire suppression but I thought it was only dangerous at high enough concentrations to suffocate you by displacing oxygen.
I had a summer job at a hospital one year in the data center when an electrician managed to trigger the halon system and we all had to evacuate and wait for the process to finish and the gas to vent. The four firetrucks and station master who shoved up was both annoyed and relieved it was not real.
Not an active datacenter, but I did get to use a fire extinguisher to knock out a metal-mesh-reinforced window in a secure building once because no one knew where the keys were for an important room.
Management was not happy, but I didn’t get in trouble for it. And yes, it was awesome. Surprisingly easy, especially since the fire extinguisher was literally right next to it.
Nothing says ‘go ahead, destroy that shit’ like money going up in smoke if you don’t.
P.S. don’t park in front of fire hydrants, because they will have a shit eating grin on their face when they destroy your car- ahem - clear the obstacle - when they need to use it to stop a fire.
I was there at the time, for anyone outside of the core networking teams it was functionally a snow day. I had my manager's phone number, and basically established that everyone was in the same boat and went to the park.
Core services teams had backup communication systems in place prior to that though. IIRC it was a private IRC on separate infra specifically for that type of scenario.
I remember working for a company who insisted all teams had to usr whatever corp instant messaging/chat app but our sysadmin+network team maintained a jabber server + a bunch of core documentation synchronized on a vps in a totally different infrastructure just in case and sure enough there was that a day it came handy.
Ah, but have they verified how far down the turtles go, and has that changed since they verified it?
In the mid-2000s most of the conference call traffic started leaving copper T1s and going onto fiber and/or SIP switches managed by Level3, Global Crossing, Qwest, etc. Those companies combined over time into Century Link which was then rebranded Lumen.
Yes for some insane reason facebook had EVERYTHING on a single network. The door access not working when you lose BGP routes is especially bad because normal door access systems cache access rules on the local door controllers and thus still work when they lose connectivity to the central server.
Depends. Some have a paranoid mode without caching, because then a physical attacker cannot snip a cable and then use a stolen keycard as easily or something. We had an audit force us to disable caching, which promptly went south at a power outage 2 months later where the electricians couldn't get into the switch room anymore. The door was easy to overcome, however, just a little fiddling with a credit card, no heroic hydraulic press story ;)
If you aren't going to cache locally than you need redundant access to the server like LTE access and plan for needing to unlock the doors if you lose access to the server.
This sounds similar to AWS services depending on DynamoDB, which sounds like what happened here. Even if under the hood parts of AWS depend on Dynamo, it should be a walled-off instance separate from Dynamo available via us-east-1.
Not to speak for the other poster, but yes, they had people experiencing difficulties getting into the data centers to fix the problems.
I remember seeing a meme for a cover of "Meta Data Center Simulator 2021" where hands were holding an angle grinder with rows of server racks in the background.
"Meta Data Center Simulator 2021: As Real As It Gets (TM)"
That's similar to the total outage of all Rogers services in Canada back on July 7th 2022. It was compounded by the fact that the outage took out all Rogers cell phone service, making it impossible for Rogers employees to communicate with each other during the outage. A unified network means a unified failure mode.
Thankfully none of my 10 Gbps wavelengths were impacted. Oh did I appreciate my aversion to >= layer 2 services in my transport network!
That's kind of a weird ops story, since SRE 101 for oncall is to not rely on the system you're oncall for to resolve outages in it. This means if you're oncall for communications of some kind, you must have some other independent means of reaching eachother (even if it's a competitor phone network)
That is heavily contingent on the assumption that the dependencies between services are well documented and understood by the people building the systems.
There is always that point you reach where someone has to get on a plane with their hardware token and fly to another data centre to reset the thing that maintains the thing that gives keys to the thing that makes the whole world go round.
Wow, you really *have* to exercise the region failover to know if it works, eh? And that confidence gets weaker the longer it’s been since the last failover I imagine too. Thanks for sharing what you learned.
You should assume it will not work unless you test it regularly. That's a big part of why having active/active multi-region is attractive, even though it's much more complex.
Not sure if this counts fully as 'distributed' here, but we (Authentik Security) help many companies self-host authentik multi-region or in (private cloud + on-prem) to allow for quick IAM failover and more reliability than IAMaaS.
There's also "identity orchestration" tools like Strata that let you use multiple IdPs in multiple clouds, but then your new weakest link is the orchestration platform.
It's a good reminder actually that if you don't test the failover process, you have no failover process. The CTO or VP of Engineering should be held accountable for not making sure that the failover process is tested multiple times a month and should be seamless.
Too much armor makes you immobile. Will your security org be held to task for this? This should permanently slow down all of their future initiatives because it’s clear they have been running “faster than possible” for some time.
Sure it was, you just needed to login to the console via a different regional endpoint. No problems accessing systems from ap-southeast-2 for us during this entire event, just couldn’t access the management planes that are hosted exclusively in us-east-1.
Totally ridiculous that AWS wouldn't by default make it multi-region and warn you heavily that your multi-region service is tied to a single region for identity.
I always find it interesting how many large enterprises have all these DR guidelines but fail to ever test. Glad to hear that everything came back alright
People will continue to purchase Mutli-AZ and multi-region even though you have proved what a scam it is. If east region goes down, ALL amazon goes down, feel free to change my mind. STOP paying double rates for multi region.
This is having a direct impact on my wellbeing. I was at Whole Foods in Hudson Yards NYC and I couldn’t get the prime discount on my chocolate bar because the system isn’t working. Decided not to get the chocolate bar. Now my chocolate levels are way too low.
Alexa is super buggy now anyway. I switched my Echo Dot to Alexa+, and it fails turning on and off my Samsung TV all the time now. You usually have to do it twice.
This has been my impetus to do Home Assistant things and I already can tell you that I'm going to spend far more time setting it up and tweaking it than I actually save, but false economy is a tinkerer's best friend. It's pretty impressive what a local LLM setup can do though, and I'm learning that all my existing smart devices are trivially available if anyone gets physical access to my network I guess!
This is the kind of thing Claude Code (bypassing permissions) shines at. I‘m about to setup HA myself and intend to not write a single line of config myself.
Something I love about HA is that all thr gui can always be directly edited using yaml. So you can ask claude for a v1 then tweak it a bit then finish with the gui. And all of this directly from the gui.
Ugh. Reminds me that some time ago Siri stopped responding to “turn off my TV.” Now I have to remember to say “turn off my Apple TV.” (Which with the magic of HDMI CEC turns off my entire system.) Given how groggy I am when I want to turn off the TV, I often forget.
How can this be? I had great luck with GPT3 way back when… and I didn’t have function calling or chat… had to parse the JSON myself, extraction “action” and “response-text” fields… How has this been so hard for AMZN? Is it a matter of token cost and trying to use small models?
that's a reasonable theory. they've likely delayed the launch this long due to the inference cost compared to the more basic Alexa engine.
I would also guess the testing is incomplete. Alexa+ is a slow roll out so they can improve precision/recall on the intents with actual customers. Alexa+ is less deterministic than the previous model was wrt intents
I was attempting to use self checkout for some lunch I grabbed from the hotbar and couldn’t understand why my whole foods barcode was failing. It took me a full 20 seconds to realize the reason for the failure.
First World treatlerite problems. /s What's going to suck years after too many SREs/SWEs will have long been fired, like the Morlocks & Eloi and Idiocracy, there won't be anyone left who can figure out that plants need water. There will be a few trillionaires surrounded by aristocratic, unimaginable opulence while most of humanity toils in favelas surrounded by unfixable technology that seems like magic. One cargo cult will worship 5.25" floppy disks and their arch enemies will worship CD-Rs.
Have a meeting today with our AWS account team about how we’re no longer going to be “All in on AWS” as we diversify workloads away. Was mostly about the pace of innovation on core services slowing and AWS being too far behind on AI services so we’re buying those from elsewhere.
The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!
Once you've had an outage on AWS, Cloudflare, Google Cloud, Akismet. What are you going to do? Host in house? None of them seem to be immune from some outage at some point. Get your refund and carry on. It's less work for the same outcome.
Why not host in house? If you have an application with stable resource needs, it can often be the cheaper and more stable option. At a certain scale, you can buy the servers, hire a sysadmin, and still spend less money than relying on AWS.
If you have an app that experiences 1000x demand spikes at unpredictable times then sure, go with the cloud. But there are a lot of companies that would be better off if they seriously considered their options before choosing the cloud for everything.
Yeah, just double++ the cost to have a clone of all your systems. Worth it if you need to guarantee uptime. Although, it also doubles your exposure to potential data breaches as well.
++double: spoken as "triple" -> team says that double++ was a joke, we can obviously only double the cost -> embarrassingly you quickly agree -> team laughs -> team approves doubling -> you double the cost -> team goes out for beers -> everyone is happy
double++: spoken as "double" -> team quickly agrees and signs off -> you consequently triple the cost per c precedence rules -> manager goes ballistic -> you blithely recount the history of c precedence in a long monotone style -> job returns EINVAL -> beers = 0
Shouldn't be double in the long term. Think of the second cloud as a cold standby. Depends on the system. Periodic replication of data layer (object storage/database) and CICD configured to be able to build services and VMs on multiple clouds. Have automatic tests weekly/monthly that represent end-to-end functionality, have scaled tests semi-annually.
This is all very, very hand-wavey. And if one says "golly gee, all our config is too cloud specific to do multi-cloud" then you've figured out why cloud blows and that there is no inherent reason not to have API standards for certain mature cloud services like serverless functions, VMs and networks.
Edit to add- ink ow how grossly simplified this is, and that most places have massively complex systems.
And data egress fees just to get the clone set up, right? This doesn’t seem feasible as a macrostrategy. Maybe for a small number of critical services.
If you use something like cockroachdb you can have a multi-master cluster and use regional-by-row tables to locate data close to users. It'll fail over fine to other regions if needed.
I totally agree with you. Where I work, we self-host almost everything. Exceptions are we use a CDN for one area where we want lower latency, and we use BigQuery when we need to parse a few billion datapoints into something usable.
It's amazing how few problems we have. Honestly, I don't think we have to worry about configuration issues as often as people who rely on the cloud.
Not GP, but my company also self-hosts. We rent rackspace in a colo. We used to keep my team's research server in the back closet before we went full-remote.
This. When Andy Jassy got challenged by analysts on the last earnings call on why AWS has fallen so far behind on innovation in areas his answer was a hand wavy response that diverted attention to say AWS is durable, stable, and reliable and customers care more about that. Oops.
The culture changed. When I first worked there, I was encouraged to take calculated risks. When I did my second tour of duty, people were deathly afraid of bringing down services. It has been a while since my second tour of duty, but I don't think it's back to "Amazon is a place where builders can build".
Somewhat inevitable for any company as they get larger. Easy to move fast and break things when you have 1 user and no revenue. Very different story when much of US commerce runs you on.
For folks who came of age in the late 00's, seeing companies once thought of as disruptors and innovators become the old stalwarts post-pandemic/ZIRP has been quite an experience.
Maybe those who have been around longer have seen this before, but its the first time for me.
If you bring something down in a real way, you can forget about someone trusting you with a big project in the future. You basically need to switch orgs
Nah, I used to work for defense contractors, and worked with ex-military people, so...
Anyway, I actually loved my first time at AWS. Which is why I went back. My second stint wasn't too bad, but I probably wouldn't go back, unless they offered me a lot more than what I get paid, but that is unlikely.
I listened to the earnings call. I believe the question was mostly focused on why AWS has been so behind on AI. Jassy did flub the question quite badly and rambled on for a while. The press has mentioned the botched answer in a few articles recently.
They have been pushing me and company extremely hard to vet their various AI-related offerings. When we decide to look into whatever service it is, we come away underwhelmed. It seems like their biggest selling point so far is “we’ll give it to you free for several months”. Not great.
In fairness, that's been my experience with everyone except OpenAI and Anthropic where I only occasionally come out underwhelmed
Really I think AWS does a fairly poor job bringing new services to market and it takes a while for them to mature. They excel much more in the stability of their core/old services--especially the "serverless" variety like S3, SQS, Lambda, EC2-ish, RDS-ish (well, today notwithstanding)
I honestly feel bad for the folks at AWS whose job it is to sell this slop. I get AWS is in panic mode trying to catch up, but it’s just awful and frankly becoming quite exhausting and annoying for customers.
The comp might be decent but most folks I know that are still there say they’re pretty miserable and the environment is becoming toxic. A bit more pay only goes so far.
Everything except us-east-1 is generally pretty reliable. At $work we have a lot of stuff that's only on eu-west-1 (yes not the best practice) and we haven't had any issues, touch wood
My impression is that `us-east-1` has the worst reliability track record of any region. We've always run our stuff in `us-west-2` and there has never been an outage that took us down in that region. By contrast, a few things that we had in `us-east-1` have gone down repeatedly.
It’s the “original” AWS region. It has the most legacy baggage, the most customer demand (at least in the USA), and it’s also the region that hosts the management layer of most “global” services. Its availability has also been dogshit, but because companies only care about costs today and not harms tomorrow, they usually hire or contract out to talent that similarly only cares about the bottom line today and throws stuff into us-east-1 rather than figure out AZs and regions.
The best advice I can give to any org in AWS is to get out of us-east-1. If you use a service whose management layer is based there, make sure you have break-glass processes in place or, better yet, diversify to other services entirely to reduce/eliminate single points of failure.
Former AWS employee here. There's a number of reasons but it mostly boils down to:
It's both the oldest and largest (most ec2 hosts, most objects in s3, etc) AWS region, and due to those things it's the region most likely to encounter an edge case in prod.
> The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!
This is and was never true. I've done setups in the past where monitoring happened "multi cloud" with also multiple dedicated servers. Was pretty broad so you could actually see where things broke.
Was quite some time ago so I don't have the data, but AWS never came out on top.
It actually matched largely with what netcraft.com put out. Not sure if they still do that and release those things to the public.
AWS has been in long term decline, most of the platform is just in keeping the lights on mode. Its also why they are behind on AI, alot of would be innovative employees get crushed under red tape and performance management
This is the real problem. Even if you don't run anything in AWS directly, something you integrate with will. And when us-east-1 is down, it doesn't matter if those services are in other availability zones. AWS's own internal services rely heavily on us-east-1, and most third-party services live in us-east-1.
It really is a single point of failure for the majority of the Internet.
This becomes the reason to run in us-east-1 if you're going to be single region. When it's down nobody is surprised that your service is affected. If you're all-in on some other region and it goes down you look like you don't know what you're doing.
> Even if you don't run anything in AWS directly, something you integrate with will.
Why would a third-party be in your product's critical path? It's like the old business school thing about "don't build your business on the back of another"
It's easy to say this, but in the real world, most of the critical path is heavily-dependent on third party integrations. User auth, storage, logging, etc. Even if you're somewhat-resilient against failures (i.e. you can live without logging and your app doesn't hard fail), it's still potentially going to cripple your service. And even if your entire app is resilient and doesn't fail, there are still bound to be tons of integrations that will limit functionality, or make the app appear broken in some way to users.
The reason third-party things are in the critical path is because most of the time, they are still more reliable than self-hosting everything; because they're cheaper than anything you can engineer in-house; because no app is an island.
It's been decades since I worked on something that was completely isolated from external integrations. We do the best we can with redundancy, fault tolerance, auto-recovery, and balance that with cost and engineering time.
If you think this is bad, take a look at the uptime of complicated systems that are 100% self-hosted. Without a Fortune 500 level IT staff, you can't beat AWS's uptime.
Clearly these are non-trivial trade-offs, but I think using third parties is not an either or question. Depending on the app and the type of third-party service, you may be able to make design choices that allow your systems to survive a third-party outage for a while.
E.g., a hospital could keep recent patient data on-site and sync it up with the central cloud service as and when that service becomes available. Not all systems need to be linked in real time. Sometimes it makes sense to create buffers.
But the downside is that syncing things asynchronously creates complexity that itself can be the cause of outages or worse data corruption.
I guess it's a decision that can only be made on a case by case basis.
Not necessarily our critical path but today circleci was affected greatly which also affected our capacity to deploy. Luckily it was a Monday morning therefore we didn’t even have to deploy an hot fix.
Good luck naming a large company, bank, even utility that doesn't have some kind of dependency like this somewhere, even if they have mostly on-prem services.
The only ones I can really think of are the cloud providers themselves- I was at Microsoft, and absolutely everything was in-house (often to our detriment).
I think you missed the "critical path" part. Why would your product stop functioning if your admins can't log in with IAM / VPN in, do you really need hands-on maintenance constantly? Why would your product stop functioning if Office is down, are you managing your ops in Excel or something?
"Some kind of dependency" is fine and unavoidable, but well-architected systems don't have hard downtime just because someone somewhere you have no control over fucked up.
Since 2020 for some reason lot of companies have fully remote workforce. If the VPN or auth goes down and workers can't login, that's a problem. Think banks, call center work, customer service.
Glad that you're taking the first step toward resiliency. At times, big outages like these are necessary to give a good reason why the company should Multicloud. When things are working without problems, no one cares to listen to the squeaky wheel.
I would be interested in a follow up in 2-3 years as to whether you've had fewer issues with a multi-cloud setup than just AWS. My suspicion is that will not be the case.
Still no serverless inference for models or inference pipes that are not available on bedrock, still no auto scaling GPU workers. We started bothering them in 2022...crickets
Seems like major issues are still ongoing. If anything it seems worse than it did ~4 hours ago. For reference I'm a data engineer and it's Redshift and Airflow (AWS managed) that is FUBAR for me.
February 28, 2017. S3 went down and took down a good portion of AWS and the Internet in general. For almost the entire time that it was down, the AWS status page showed green because the up/down metrics were hosted on... you guessed it... S3.
I used to work at a company where the SLA was measured as the percentage of successful requests on the server. If the load balancer (or DNS or anything else network) was dropping everything on the floor, you'd have no 500s and 100% SLA compliance.
I’ve been customer for at least four separate products where this was true.
I can’t explain why Saucelabs was the most grating one, but it was. I think it’s because they routinely experienced 100% down for 1% of customers, and we were in that one percent about twice a year. <long string of swears omitted>
I spent enough time ~15 years back to find an external monitoring service that did not run on AWS and looked like a sustainable business instead of a VC fueled acquisition target - for our belts-n-braces secondary monitoring tool since it's not smart to trust CloudWatch to be able to send notifications when it's AWS's shit that's down.
Sadly while I still use that tool a couple of jobs/companies later - I no longer recommend it because it migrated to AWS a few years back.
(For now, my out-of-AWS monitoring tool is a bunch of cron jobs running on a collections of various inexpensive vpses and my and other dev's home machines.)
An outage like this does not happen every year, The last big outage happened in December 2021, roughly 3 years 10 month = 46 months ago.
The duration of the outage in relation to that uptime is (8 h / 33602 h) * 100% = 0.024%, so the uptime is 99.976%, slightly worse than 99.99%, but clearly better than 99.90%.
They used to be five nines, and people used to say that it's not worth the while to prepare for an outage. With less than four nines, the perception might shift, but likely not enough to induce a mass migration to outage-resistant designs.
For DynamoDB, I'm not sure but I think its covered. https://aws.amazon.com/dynamodb/sla/. "An "Error" is any Request that returns a 500 or 503 error code, as described in DynamoDB". There were tons of 5XX errors. In addition, this calculation uses percentage of successful requests, so even partial degradation counts against the SLA.
The reason is the SLA says "For the Instance-Level SLA, your Single EC2 Instance has no external connectivity.". Instances that were already created kept working, so this isn't covered. The SLA doesn't cover creation of new instances.
This 100% seems to be what they're saying. I have not been able to get a single Airflow task to run since 7 hours ago. Being able to query Redshift only recently came back online. Despite this all their messaging is that the downtime was limited to some brief period early this morning and things have been "coming back online". Total lie, it's been completely down for the entire business day here on the east coast.
I haven't done any RFP responses for a while but this question always used to make me furious. Our competitors (some of who had had major incidents in the past) claimed 99.99% availability or more, knowing they would never have to prove it, and knowing they were actually 100% until the day they weren't.
We were more honest, and it probably cost us at least once in not getting business.
An SLA is a commitment, and an RFP is a business document, not a technical one. As an MSP, you don’t think in terms of “what’s our performance”, you think of “what’s the business value”.
If you as a customer ask for 5 9s per month, with service credit of 10% of at-risk fees for missing on a deal where my GM is 30%, I can just amortise that cost and bake it into my fee.
I don't think anyone would quote availability as availability in every region I'm in?
While this is their most important region, there's a lot of clients that are probably unaffected if they're not in use1.
They COULD be affected even if they don't have anything there because of the AWS services relying on it. I'm just saying that most customers that are multi region should have their east region out and are just humming along.
AWS GovCloud East is actually located in Ohio IIRC. Haven't had any issues with GovCloud West today; I'm pretty sure they're logically separated from the commercial cloud.
I don't think this is true anymore. In the early days bad enough outages in us-east-1 would bring down everything because some metadata / control pane stuff was there, I remember getting affected while in other regions, but there's been many years since this has happened.
Today for example no issues. I just avoid us-east-1 and everyone else should to. It's their worst region by far in terms of reliability because they launch all the new stuff there and are always messing it up.
A secondary problem is that a lot of the internal tools are still on US East, so likely the response work is also being impacted by the outage. Been a while since there was a true Sev1 LSE (Large Scale Event).
Well that’s the default pattern anyway. When I worked in cloud there were always some services that needed cross-regional dependencies for some reason or other and this was always supposed to be called out as extra risk, and usually was. But as things change in a complex system, it’s possible for long-held assumptions about independence to change and cause subtle circular dependencies that are hard to break out of. Elsewhere in this thread I saw someone mentioning being migrated to auth that had global dependencies against their will, and I groaned knowingly. Sometimes management does not accept “this is delicate and we need to think carefully” in the midst of a mandate.
I do not envy anyone working on this problem today.
1. The main one: it's the cheapest region, so when people select where to run their services they pick it because "why pay more?"
2. It's the default. Many tutorials and articles online show it in the examples, many deployment and other devops tools use it as a default value.
3. Related to n.2. AI models generate cloud configs and code examples with it unless asked otherwise.
4. It's location make it Europe-friendly, too. If you have a small service and you'd like to capture European and North American audience from a single location us-east-1 is a very good choice.
5. Many Amazon features are available in that region first and then spread out to other locations.
6. It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks. In hybrid cloud scenarios where you want to connect bits of your infrastructure running on AWS and on some physical hardware by a set of dedicated fiber optic lines us-east-1 is the place to do it.
7. Yes, for AWS deployments it's an experimental location that has higher risks of downtime compared to other regions, but in practice when a sizable part of us-east-1 is down other AWS services across the world tend to go down, too (along with half of the internet). So, is it really that risky to run over there, relatively speaking?
It's the world's default hosting location, and today's outages show it.
In every SKU I've ever looked at / priced out, all of the AWS NA regions have ~equal pricing. What's cheaper specifically in us-east-1?
> Europe-friendly
Why not us-east-2?
> Many Amazon features are available in that region first and then spread out to other locations.
Well, yeah, that's why it breaks. Using not-us-east-1 is like using an LTS OS release: you don't get the newest hotness, but it's much more stable as a "build it and leave it alone" target.
> It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks.
This is a better argument, but in practice, it's very niche — 2-5ms of speed-of-light delay doesn't matter to anyone but HFT folks; anyone else can be in a DC one state away with a pre-arranged tier1-bypassing direct interconnect, and do fine. (This is why OVH is listed on https://www.cloudinfrastructuremap.com/ despite being a smaller provider: their DCs have such interconnects.)
For that matter, if you want "low-latency to North America and Europe, and high-throughput lowish-latency peering to many other providers" — why not Montreal [ca-central-1]? Quebec might sound "too far north", but from the fiber-path perspective of anywhere else in NA or Europe, it's essentially interchangeable with Virginia.
How is it a flaw!? Building datacenters in different regions come with very different costs, and different costs to run. Power doesn't cost exactly the same in different regions. Local construction services are not priced exactly the same everywhere. Insurance, staff salaries, etc, etc... it all adds up, and it's not the same costs everywhere. It only makes sense that it would cost different amounts for the services run in different regions. Not sure how you're missing these easy to realize facts of life.
Some AWS services are only available in us-east-1. Also a lot of people have not built their infra to be portable and the occasional outage isn't worth the cost and effort of moving out.
> the occasional outage isn't worth the cost and effort of moving out.
And looked at from the perspective of an individual company, as a customer of AWS, the occasional outage is usually an acceptable part of doing business.
However, today we’ve seen a failure that has wiped out a huge number of companies used by hundreds of millions - maybe billions - of people, and obviously a huge number of companies globally all at the same time. AWS has something like 30% of the infra market so you can imagine, and most people reading this will to some extent have experienced, the scale of disruption.
And the reality is that whilst bigger companies, like Zoom, are getting a lot of the attention here, we have no idea what other critical and/or life and death services might have been impacted. As an example that many of us would be familiar with, how many houses have been successfully burgled today because Ring has been down for around 8 out of the last 15 hours (at least as I measure it)?
I don’t think that’s OK, and I question the wisdom of companies choosing AWS as their default infra and hosting provider. It simply doesn’t seem to be very responsible to be in the same pond as so many others.
Were I a legislator I would now be casting a somewhat baleful eye at AWS as a potentially dangerous monopoly, and see what I might be able to do to force organisations to choose from amongst a much larger pool of potential infra providers and platforms, and I would be doing that because these kinds of incidents will only become more serious as time goes on.
You're suffering from survivorship bias. You know that old adage about the bullet holes in the planes, and someone pointed out that you should reinforce that parts without bullet holes, because these are the planes that came back.
It's the same thing here. Do you think other providers are better? If people moved to other providers, things would still go down, more likely than not it would be more downtime in aggregate, just spread out so you wouldn't notice as much.
At least this way, everyone knows why it's down, our industry has developed best practices for dealing with these kinds of outages, and AWS can apply their expertise to keeping all their customers running as long as possible.
> If people moved to other providers, things would still go down, more likely than not it would be more downtime in aggregate, just spread out so you wouldn't notice as much.
That is the point, though: Correlated outages are worse than uncorrelated outages. If one payment provider has an outage, chose another card or another store and you can still buy your goods. If all are down, no one can shop anything[1]. If a small region has a power blackout, all surrounding regions can provide emergency support. If the whole country has a blackout, all emergency responders are bound locally.
[1] Except with cash – might be worth to keep a stash handy for such purposes.
That’s a pretty bold claim. Where’s your data to back it up?
More importantly you appear to have misunderstood the scenario I’m trying to avoid, which is the precise situation we’ve seen in the past 24 hours where a very large proportion of internet services go down all at the same time precisely because they’re all using the same provider.
And then finally the usual outcome of increased competition is to improve the quality of products and services.
I am very aware of the WWII bomber story, because it’s very heavily cited in corporate circles nowadays, but I don’t see that it has anything to do with what I was talking about.
AWS is chosen because it’s an acceptable default that’s unlikely to be heavily challenged either by corporate leadership or by those on the production side because it’s good CV fodder. It’s the “nobody gets fired for buying IBM” of the early mid-21st century. That doesn’t make it the best choice though: just the easiest.
And viewed at a level above the individual organisation - or, perhaps from the view of users who were faced with failures across multiple or many products and services from diverse companies and organisations - as with today (yesterday!) we can see it’s not the best choice.
Reality is, though, that you shouldn't put all your eggs in the same basket. And it was indeed the case before the cloud. One service going down would have never had this cascade effect.
I am not even saying "build your own DC", but we barely have resiliency if we all rely on the same DC. That's just dumb.
From the standpoint of nearly every individual company, it's still better to go with a well-known high-9s service like AWS than smaller competitors though. The fact that it means your outages will happen at the same time as many others is almost like a bonus to that decision — your customers probably won't fault you for an outage if everyone else is down too.
That homogeneity is a systemic risk that we all bear, of course. It feels like systemic risks often arise that way, as an emergent result from many individual decisions each choosing a path that truly is in their own best interests.
We're on Azure and they are worse in every aspect, bad deployment of services, and status pages that are more about PR than engineering.
At this point, is there any cloud provider that doesn't have these problems? (GCP is a non-starter because a false-positive YouTube TOS violation get you locked out of GCP[1]).
Are you warned about the risks in an active war one? Yes.
Does Google warn you about this when you sign up? No.
And PayPal having the same problem in no way identifies Google. It just means that PayPal has the same problem and they are also incompetent (and they also demonstrate their incompetence in many other ways).
> It just means that PayPal has the same problem and they are also incompetent
Do you consider regular brick-and-mortar savings banks to be incompetent when they freeze someone's personal account for receiving business amounts of money into it? Because they all do, every last one. Because, again, they expect you to open a business account if you're going to do business; and they look at anything resembling "business transactions" happening in a personal account through the lens of fraud rather than the lens of "I just didn't realize I should open a business account."
And nobody thinks this is odd, or out-of-the-ordinary.
Do you consider municipal governments to be incompetent when they tell people that they have to get their single-family dwelling rezoned as mixed-use, before they can conduct business out of it? Or for assuming that anyone who is conducting business (having a constant stream of visitors at all hours) out of a residentially-zoned property, is likely engaging in some kind of illegal business (drug sales, prostitution, etc) rather than just being a cafe who didn't realize you can't run a cafe on residential zoning?
If so, I don't think many people would agree with you. (Most would argue that municipal governments suppress real, good businesses by not issuing the required rezoning permits, but that's a separate issue.)
There being an automatic level of hair-trigger suspicion against you on the part of powerful bureaucracies — unless and until you proactively provide those bureaucracies enough information about yourself and your activities for the bureaucracies to form a mental model of your motivations that makes your actions predictable to them — is just part of living in a society.
Heck, it's just a part of dealing with people who don't know you. Anthropologists suggest that the whole reason we developed greeting gestures like shaking hands (esp. the full version where you pull each-other in and use your other arms to pat one-another on the back) is to force both parties to prove to the other that they're not holding a readied weapon behind their backs.
---
> Are you warned about the risks in an active war one? Yes. Does Google warn you about this when you sign up? No.
As a neutral third party to a conflict, do you expect the parties in the conflict to warn you about the risks upon attempting to step into the war zone? Do you expect them to put up the equivalent of police tape saying "war zone past this point, do not cross"?
This is not what happens. There is no such tape. The first warning you get from the belligerents themselves of getting near either side's trenches in an active war zone, is running face-first into the guarded outpost/checkpoint put there to prevent flanking/supply-chain attacks. And at that point, you're already in the "having to talk yourself out of being shot" point in the flowchart.
It has always been the expectation that civilian settlements outside of the conflict zone will act of their own volition to inform you of the danger, and stop you from going anywhere near the front lines of the conflict. By word-of-mouth; by media reporting in newspapers and on the radio; by municipal governments putting up barriers preventing civilians from even heading down roads that would lead to the war zone. Heck, if a conflict just started "up the road", and you're going that way while everyone's headed back the other way, you'll almost always eventually be flagged to pull over by some kind stranger who realizes you might not know, and so wants to warn you that the only thing you'll get by going that way is shot.
---
Of course, this is all just a metaphor; the "war" between infrastructure companies and malicious actors is not the same kind of hot war with two legible "sides." (To be pedantic, it's more like the "war" between an incumbent state and a constant stream of unaffiliated domestic terrorists, such as happens during the ongoing only-partially-successful suppression of a populist revolution.)
But the metaphor holds: just like it's not a military's job to teach you that military forces will suspect that you're a spy if you approach a war zone in plainclothes; and just like it's not a bank's job to teach you that banks will suspect that you're a money launderer if you start regularly receiving $100k deposits into your personal account; and just like it's not a city government's job to teach you that they'll suspect you're running a bordello out of your home if you have people visiting your residentially-zoned property 24hrs a day... it's not Google's job to teach you that the world is full of people that try to abuse Internet infrastructure to illegal ends for profit; and that they'll suspect you're one of those people, if you just show up with your personal Google account and start doing some of the things those people do.
Rather, in all of these cases, it is the job of the people who teach you about life — parents, teachers, business mentors, etc — to explain to you the dangers of living in society. Knowing to not use your personal account for business, is as much a component of "web safety" as knowing to not give out details of your personal identity is. It's "Internet literacy", just like understanding that all news has some kind of bias due to its source is "media literacy."
If you can't figure out how to use a different Google account for YouTube from the GCP billing account, I don't know what to say. Google's in the wrong here, but spanner's good shit! (If you can afford it. and you actually need it. you probably don't.)
The problem isn't specifically getting locked out of GCP (though it is likely to happen for those out of the loop on what happened). It is that Google themselves can't figure out that a social media ban shouldn't affect your business continuity (and access to email or what-have-you).
It is an extremely fundamental level of incompetence at Google. One should "figure out" the viability of placing all of one's eggs in the basket of such an incompetent partner. They screwed the authentication issue up and, this is no slippery slope argument, that means they could be screwing other things up (such as being able to contact a human for support, which is what the Terraria developer also had issues with).
We have discussions coming up to evict ourselves from AWS entirely. Didn't seem like there was much of an appetite for it before this but now things might have changed. We're still small enough of a company to where the task isn't as daunting as it might otherwise be.
> Is there some reason why "global" services aren't replicated across regions?
On AWS's side, I think us-east-1 is legacy infrastructure because it was the first region, and things have to be made replicable.
For others on AWS who aren't AWS themselves: because AWS outbound data transfer is exorbitantly expensive. I'm building on AWS, and AWS's outbound data transfer costs are a primary design consideration for potential distribution/replication of services.
It is absolutely crazy how much AWS charges for data. Internet access in general has become much cheaper and Hetzner gives unlimited AWS. I don't recall AWS ever decreasing prices for outbound data transfer
I think there's two reasons: one, it makes them gobs of money. Two, it discourages customers from building architectures which integrate non-AWS services, because you have to pay the data transfer tax. This locks everyone in.
And yes, AWS' rates are highway robbery. If you assume $1500/mo for a 10 Gbps port from a transit provider, you're looking at $0.0005/GB with a saturated link. At a 25% utilization factor, still only $0.002/GB. AWS is almost 50 times that. And I guarantee AWS gets a far better rate for transit than list price, so their profit margin must be through the roof.
> I think there's two reasons: one, it makes them gobs of money. Two, it discourages customers from building architectures which integrate non-AWS services, because you have to pay the data transfer tax. This locks everyone in.
Which makes sense, but even their rates for traffic between AWS regions are still exorbitant. $0.10/GB for transfer to the rest of the Internet somewhat discourages integration of non-Amazon services (though you can still easily integrate with any service where most of your bandwidth is inbound to AWS), but their rates for bandwidth between regions are still in the $0.01-0.02/GB range, which discourages replication and cross-region services.
If their inter-region bandwidth pricing was substantially lower, it'd be much easier to build replicated, highly available services atop AWS. As it is, the current pricing encourages keeping everything within a region, which works for some kinds of services but not others.
Even their transfer rates between AZs _in the same region_ are expensive, given they presumably own the fiber?
This aligns with their “you should be in multiple AZs” sales strategy, because self-hosted and third-party services can’t replicate data between AZs without expensive bandwidth costs, while their own managed services (ElastiCache, RDS, etc) can offer replication between zones for free.
Hetzner is "unlimited fair use" for 1Gbps dedicated servers, which means their average cost is low enough to not be worth metering, but if you saturate your 1Gbps for a month they will force you to move to metered. Also 10Gbps is always metered. Metered traffic is about $1.50 per TB outbound - 60 times cheaper than AWS - and completely free within one of their networks, including between different European DCs.
In general it seems like Europe has the most internet of anywhere - other places generally pay to connect to Europe, Europe doesn't pay to connect to them.
So provide a way to check/uncheck which zones you want replication to. Most people aren't going to need more than a couple of alternatives, and they'll know which ones will work for them legally.
My guess is that for IAM it has to do with consistency and security. You don't want regions disagreeing on what operations are authorized. I'm sure the data store could be distributed, but there might be some bad latency tradeoffs.
The other concerns could have to do with the impact of failover to the backup regions.
Regions disagree on what operations are authorized. :-)
IAM uses eventual consistency. As it should...
"Changes that I make are not always immediately visible": - "...As a service that is accessed through computers in data centers around the world, IAM uses a distributed computing model called eventual consistency. Any changes that you make in IAM (or other AWS services), including attribute-based access control (ABAC) tags, take time to become visible from all possible endpoints. Some delay results from the time it takes to send data from server to server, replication zone to replication zone, and Region to Region. IAM also uses caching to improve performance, but in some cases this can add time. The change might not be visible until the previously cached data times out...
...You must design your global applications to account for these potential delays. Ensure that they work as expected, even when a change made in one location is not instantly visible at another. Such changes include creating or updating users, groups, roles, or policies. We recommend that you do not include such IAM changes in the critical, high availability code paths of your application. Instead, make IAM changes in a separate initialization or setup routine that you run less frequently. Also, be sure to verify that the changes have been propagated before production workflows depend on them..."
Mostly AWS relies on each region being its own isolated copy of each service. It gets tricky when you have globalized services like IAM. AWS tries to keep those to a minimum.
For us, we had some minor impacts but most stuff was stable. Our bigger issue was 3rd party SaaS also hosted on us-east-1 (Snowflake and CircleCI) which broke CI and our data pipeline
So did a previous company i worked at, all our stuff was in west-2.. then east-1 went down and some global backend services that aws depended on also went down and effected west-2.
I'm not sure a lot of companies are really looking at the costs of multi-region resiliency and hot failovers vs being down for 6 hours every year or so and writing that check.
One advantage to being in the biggest region: when it goes down the headlines all blame AWS, not you. Sure you’re down too, but absolutely everybody knows why and few think it’s your fault.
This was a major issue, but it wasn't a total failure of the region.
Our stuff is all in us-east-1, ops was a total shitshow today (mostly because many 3rd party services besides aws were down/slow), but our prod service was largely "ok", a total of <5% of customers were significantly impacted because existing instances got to keep running.
I think we got a bit lucky, but no actual SLAs were violated. I tagged the postmortem as Low impact despite the stress this caused internally.
We definitely learnt something here about both our software and our 3rd party dependencies.
You have to remember that health status dashboards at most (all?) cloud providers require VP approval to switch status. This stuff is not your startup's automated status dashboard. It's politics, contracts, money.
Downdetector had 5,755 reports of AWS problems at 12:52 AM Pacific (3:53 AM Eastern).
That number had dropped to 1,190 by 4:22 AM Pacific (7:22 AM Eastern).
However, that number is back up with a vengeance. 9,230 reports as of 9:32 AM Pacific (12:32 Eastern).
Part of that could be explained by more people making reports as the U.S. west coast awoke. But I also have a feeling that they aren't yet on top of the problem.
Where do they source those reports from? Always wondered if it was just analysis of how many people are looking at the page, or if humans somewhere are actually submitting reports.
So if my browser auto-completes their domain name and I accept that (causing me to navigate directly to their site and then I click AWS) it's not a report; but if my browser doesn't or I don't accept it (because I appended "AWS" after their site name) causing me to perform a Google search and then follow the result to the AWS page on their site, it's a report? That seems too arbitrary... they should just count the fact that I went to their AWS page regardless of how I got to it.
I don't know the exact details, but I know that hits to their website do count as reports, even if you don't click "report". I assume they weight it differently based on how you got there (direct might actually be more heavily weighted, at least it would be if I was in charge).
Lambda create-function control plane operations are still failing with InternalError for us - other services have recovered (Lambda, SNS, SQS, EFS, EBS, and CloudFront). Cloud availability is the subject of my CS grad research, I wrote a quick post summarizing the event timeline and blast radius as I've observed it from testing in multiple AWS test accounts: https://www.linkedin.com/pulse/analyzing-aws-us-east-1-outag...
Definitely seems to be getting worse, outside of AWS itself, more websites seem to be having sporadic or serious issues. Concerning considering how long the outage has been going.
Is it hard wired? If so, and if the alarm module doesn’t have an internal battery, can you go to the breaker box and turn off the circuit it’s on? You should be able to switch off each breaker in turn until it stops if you don’t know which circuit it’s on.
If it doesn’t stop, that means it has a battery backup. But you can still make life more bearable. Switch off all your breakers (you probably have a master breaker for this), then open up the alarm box and either pull the battery or - if it’s non-removable - take the box off the wall, put it in a sealed container, and put the sealed container somewhere… else. Somewhere you can’t hear it or can barely hear it until the battery runs down.
Meanwhile you can turn the power back on but make sure you’ve taped the bare ends of the alarm power cable, or otherwise electrically insulated them, until you’re able to reinstall it.
Northern Virginia's Fairfax County public schools have the day off for Diwali, so that's not an unreasonable question.
In my experience, the teams at AWS are pretty diverse, reflecting the diversity in the area. Even if a lot of the Indian employees are taking the day off, there should be plenty of other employees to back them up. A culturally diverse employee base should mitigate against this sort of problem.
If it does turn out that the outage was prolonged due to one or two key engineers being unreachable for the holiday, that's an indictment of AWS for allowing these single points of failure to occur, not for hiring Indians.
Seems like a lot of people missing that this post was made around midnight PST time and thus it would be more reasonable to ping people at lunch in IST before waking up people in EST or PST.
Yeah. We had a brief window where everything resolved and worked and now we're running into really mysterious flakey networking issues where pods in our EKS clusters timeout talking to the k8s API.
Have not gotten a data pipeline to run to success since 9AM this morning when there was a brief window of functioning systems. Been incredibly frustrating seeing AWS tell the press that things are "effectively back to normal". They absolutely are not! It's still a full outage as far as we are concerned.
Agreed, every time the impacted services list internally gets shorter, the next update it starts growing again.
A lot of these are second order dependencies like Astronomer, Atlassian, Confluent, Snowflake, Datadog, etc... the joys of using hosted solutions to everything.
This looks like one their worst outage in 15 years and us-east-1 still shows as degraded but I had no outages, as dont use us-east-1. Are you seeing issues on other regions?
The closest to their identification of a root cause seems to be this one:
"Oct 20 8:43 AM PDT We have narrowed down the source of the network connectivity issues that impacted AWS Services. The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers. We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations."
The problems now seem mostly related to starting new instances. Our capacity is slowly decaying as existing services spin down and new EC2 workloads fail to start.
Unclear. ‘Foo’ has a life and origin of its own and is well attested in MIT culture going back to the 1930s for sure, but it seems pretty likely that it’s counterpart ‘bar’ appears in connection with it as a comical allusion to FUBAR.
'During the United States v. Microsoft Corp. trial, evidence was presented that Microsoft had tried to use the Web Services Interoperability organization (WS-I) as a means to stifle competition, including e-mails in which top executives including Bill Gates and Steve Ballmer referred to the WS-I using the codename "foo".[13]'
"FUBAR" comes up in the movie Saving Private Ryan. It's not a plot point, but it's used to illustrate the disconnect between one of the soldiers dragged from a rear position to the front line, and the combat veterans in his squad. If you haven't seen the movie, you should. The opening 20 minutes contains one of the most terrifying and intense combat sequences ever put to film.
One unexpected upside moving from a DC to AWS is when a region is down, customers are far more understanding. Instead of being upset, they often shrug it off since nothing else they needed/wanted was up either.
This is a remarkable and unfair truth. I have had this experience with Office365...when they're down a lot of customers don't care because all their customers are also down.
I was once told that our company went with Azure because when you tell the boomer client that our service is down because Microsoft had an outage, they go from being mad at you, to accepting that the outage was an act of god that couldn’t be avoided.
As they say, every cloud outage has a silver lining.
* Give the computers a rest, they probably need it. Heck, maybe the Internet should just shut down in the evening so everyone can go to bed (ignoring those pesky timezone differences)
* Free chaos engineering at the cloud provider region scale, except you didn't opt in to this one and know about in advance, making it extra effective
* Quickly figure out a map which of the things you use have a dependency on a single AWS region without no capability to change or re-route
This still happens in some places. In various parts of Europe there are legal obligations not to email employees out of hours if it is avoidable. Volkswagen famously adopted a policy in Germany of only enabling receipt of new email messages for most of their employees 30 minutes before start of the working day, then disabling 30 minutes after the end, with weekends turned off also. You can leave work on Friday and know you won't be receiving further emails until Monday.
Is us-east-1 equally unstable to the other regions? My impression was that Amazon deployed changes to us-east-1 first so it's the most unstable region.
I've heard this so many times and not seen it contradicted so I started saying it myself. Even my last Ops team wanted to run some things in us-east-1 to get prior warning before they broke us-west-1.
But there are some people on Reddit who think we are all wrong but won't say anything more. So... whatever.
Nothing in the outage history really stands out as "this is the first time we tried this and oops" except for us-east-1.
It's always possible for things to succeed at a smaller scale and fail at full scale, but again none of them really stand out as that to me. Or at least, not any in the last ten years. I'm allowing that anything older than that is on the far side of substantial process changes and isn't representative anymore.
It took me so long to realise this is what's important in enterprise. Uptime isn't important, being able to blame someone else is what's important.
If you're down for 5 minutes a year because one of your employees broke something, that's your fault, and the blame passes down through the CTO.
If you're down for 5 hours a year but this affected other companies too, it's not your fault
From AWS to Crowdstrike - system resilience and uptime isn't the goal. Risk mitigation isn't the goal. Affordability isn't the goal.
When the CEO's buddies all suffer at the same time as he does, it's just an "act of god" and nothing can be done, it's such a complex outcome that even the amazing boffins at aws/google/microsoft/cloudflare/etc can't cope.
If the CEO is down at a different time than the CEO's buddies then it's that Dave/Charlie/Bertie/Alice can't cope and it's the CTO's fault for not outsourcing it.
As someone who likes to see things working, it pisses me off no end, but it's the way of the world, and likely has been whenever the owner and CTO are separate.
A slightly less cynical view: execs have a hard filter for “things I can do something about” and “things I can’t influence at all.” The bad ones are constantly pushing problems into the second bucket, but there are legitimately gray area cases. When an exec smells the possibility that their team could have somehow avoided a problem, that’s category 1 and the hammer comes down hard.
After that process comes the BS and PR step, where reality is spun into a cotton candy that makes the leader look good no matter what.
Check the URL, we had an issue a couple of years ago with the Workspaces. US East was down but all of our stuff was in EU.
Turns out the default URL was hardcoded to use the us east interface and just by going to workspaces and then editing your URL to be the local region got everyone working again.
Unless you mean nothing is working for you at the moment.
“Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery.”
Dumb question but what's the difference between the two? If the underlying config is broken then DNS resolution would fail, and that's basically the only way resolution fails, no?
My speculation: 1st one - it just DNS fails and you can repeat later. second one - you need working DNS to update your DNS servers with new configuration endpoints where DynamoDB fetches its config (classical case of circular dependencies - i even managed get similar problem with two small dns servers...)
DNS is trivial to distribute if your backing storage is accessible and/or local to each resolver, so it's a reasonable distinction to make: It suggests someone has preferred consistency at a level where DNS doesn't really provide consistency (due to caching in resolvers along the path) anyway, over a system with fewer failure points.
I feel like even Amazon/AWS wouldn't be that dim, they surely have professionals who know how to build somewhat resilient distributed systems when DNS is involved :)
I doubt a circular dependency is the cause here (probably something even more basic). That being said, I could absolutely see how a circular dependency could accidentally creep in, especially as systems evolve over time.
Systems often start with minimal dependencies, and then over time you add a dependency on X for a limited use case as a convenience. Then over time, since it's already being used it gets added to other use cases until you eventually find out that it's a critical dependency.
I don't think that's necessarily true. The outage updates later identified failing network load balancers as the cause--I think DNS was just a symptom of the root cause
I suppose it's possible DNS broke health checks but it seems more likely to be the other way around imo
I don’t work for aws, but a different cloud provider so this is not a description of this incident, but an example of the kind of thing that can happen
One particular “dns” issue that caused an outage was actually a bug in software that monitors healthchecks.
It would actively monitor all servers for a particular service (by updating itself based on what was deployed) and update dns based on those checks.
So when the health check monitors failed, servers would get removed from dns within a few milliseconds.
Bug gets deployed to health check service. All of a sudden users can’t resolve dns names because everything is marked as unhealthy and removed from dns.
So not really a “dns” issue, but it looks like one to users
DNS strikes me as the kind of solution someone designed thinking “eh, this is good enough for now. We can work out some of the clunkiness when more organizations start using the Internet.” But it just ended up being pretty much the best approach indefinitely.
I actually think the design of DNS is really cool. I'm sure we could do better designing from a clean slate today, especially around security (designing with the assumption of an adversarial environment).
But DNS was designed in the 80s! It's actually a minor miracle it works as well as it does
I worked in a similar system. The raw data from the field first goes to a cloud hosted event queue of some sort, then a database, then back to whatever app/screen on field. The data doesn't just power on-field displays. There's a lot of online websites, etc that needs to pull data from an api.
I wouldn't be at all surprised if people pay for API access to the data. I've worked with live sports data before, it's a very profitable industry to be in when you're the one selling the data.
Of course in a sane world you'd have an internal fallback for when cloud connectivity fails but I'm sure someone looked at the cost and said "eh, what's the worst that could happen?"
Cool, building in resilience seems to have worked. Our static site has origins in multiple regions via CloudFront and didn’t seem to be impacted (not sure if it would have been anyway).
My control plane is native multi-region, so while it depends on many impacted services it stayed available. Each region runs in isolation. There is data replication at play but failing to replicate to us-east-1 had no impact on other regions.
The service itself is also native multi-region and has multiple layers where failover happens (DNS, routing, destination selection).
Nothing’s perfect and there are many ways this setup could fail. It’s just cool that it worked this time - great to see.
Nothing I’ve done is rocket science or expensive, but it does require doing things differently. Happy to answer questions about it.
Yeah, it's not clear how resilient CloudFront is but it seems good. Since content is copied to the points of presence and cached it's the lightly used stuff that can break (we don't do writes through CloudFront, which in IMHO is an anti-pattern). We setup multiple "origins" for the content so hopefully that provides some resiliency -- not sure if it contributed positively in this case since CF is such a black box. I might setup some metadata for the different origins so we can tell which is in use.
There is always more than a way to do things with AWS. But CloudFront Origin groups can’t use HTTP POST. They’re limited to read requests. Without origin groups you opt-out of some resiliency. IMHO that’s a bad trade-off. To each their own.
Yep if you wrote lambda@edge functions, which are part of Cloudfront and can be used for authentication among other things, they can only be deployed to us-east-1
I was under the impression it's similar to IAM where the control plane is in us-east-1 and the config gets replicated to other regions. In that case, existing stuff would likely continue to work but updates may fail
True for certs but not the log bucket (but it’s still going to be in a single region, just doesn’t have to be Virginia). I’m guessing those certs are cached where needed, but I can also imagine a perfect storm where I’m unable to rotate them due to an outage.
I prefer the API Gateway model where I can create regional endpoints and sew them together in DNS.
The data layer is DynamoDB with Global Tables providing replication between regions, so we can write to any region. It's not easy to get this right, but our use case is narrow enough and rate of change low enought (intentionally) that it works well. That said, it still isn't clear that replication to us-east-1 would be perfect so we did "diff" tables just to be sure (it has been for us).
There is some S3 replication as well in the CI/CD pipeline, but that doesn't impact our customers directly. If we'd seen errors there it would mean manually taking Virginia out of the pipeline so we could deploy everyehere else.
Our stacks in us-east-1 stopped getting traffic when the errors started and we’ve kept them out of service for now, so those tables aren’t being used. When we manually checked around noon (Pacific) they were fine (data matched) but we may have just gotten lucky.
cool thanks, we've been considering dynamo global tables for the same. We have S3 replication setup for cold storage data. For primary/hot DB there doesn't seem to be many other options for doing local writes
We use AWS for keys and certs, with aliases for keys so they resolve properly to the specific resources in each region. For any given HTTP endpoint there is a cert that is part of a the stack in that region (different regions use different certs).
The hardest part is that our customers' resources aren't always available in multiple regions. When they are we fall back to a region where they exist that is next closest (by latency, courtesy of https://www.cloudping.co/).
That’s what I’d expect a basic setup to look like - region/space specific
So you’re minimally hydrating everyone’s data everywhere so that you can have some failover. Seems smart and a good middle ground to maximize HA. I’m curious what your retention window for the failover data redundancy is. Days/weeks? Or just a fifo with total data cap?
Just config information, not really much customer data. Customer data stays in their own AWS accounts with our service. All we hold is the ARNs of the resources serving as destinations.
We’ve gone to great lengths to minimize the amount of information we hold. We don’t even collect an email address upon sign-up, just the information passed to us by AWS Marketplace, which is very minimal (the account number is basically all we use).
One main problem that we observed was that big parts of their IAM / auth setup was overloaded / down which led to all kinds of cascading problems. It sounds as if Dynamo was reported to be a root cause, so is IAM dependent on dynamo internally?
Of course, such a large control plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc. so you probably like to use the battle proof DB infrastructure you already have in place. Does that mean you will end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?
There was a very large outage back in ~2017 that was caused by DynamoDB going down. Because EC2 stored its list of servers in DynamoDB, EC2 went down too. Because DynamoDB ran its compute on EC2, it was suddenly no longer able to spin up new instances to recover.
It took several days to manually spin up DynamoDB/EC2 instances so that both services could recover slowly together. Since then, there was a big push to remove dependencies between the “tier one” systems (S3, DynamoDB, EC2, etc.) so that one system couldn’t bring down another one. Of course, it’s never foolproof.
Many AWS customers have bad retry policies that will overload other systems as part of their retries. DynamoDB being down will cause them to overload IAM.
Have you ever built a service designed to operate at planetary scale? One that's built of hundreds or thousands of smaller service components?
There's no such thing as infinite scalability. Even the most elastic services are not infinitely elastic. When resources are short, you either have to rely on your customers to retry nicely, or you have to shed load during overload scenarios to protect goodput (which will deny service to some). For a high demand service, overload is most likely during the first few hours after recovery.
I think Amazon uses an internal platform called Dynamo as a KV store, it’s different than DynamoDB, so im thinking the outage could be either a dns routing issue or some kind of node deployment problem.
Both of which seem to prop up in post mortems for these widespread outages.
They said the root cause was DNS for dynamoDB. inside AWS relying on dynamoDB is highly encouraged so it’s not surprising that a failure there would cascade broadly. The fact that EC2 instance launching is effected is surprising. Loops in the service dependency graph are known to be a bad idea.
It's not a direct dependency.
Route 53 is humming along... DynamoDB decided to edit its DNS records that are propagated by Route 53... they were bogus, but Route 53 happily propagated the toxic change to the rest of the universe.
DynamoDB is not going to set up its own DNS service or its own Route 53.
Maybe DynamoDB should have had tooling that tested DNS edits before sending it to Route 53, or Route53 should have tooling to validate changes before accepting them. I'm sure smart people at AWS are yelling at each other about it right now.
When I worked at AWS several years ago, IAM was not dependent on Dynamo. It might have changed, but I highly doubt this. Maybe some kind of network issue with high-traffic services?
> Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum.
IAM is replicated, so each region has its own read-only IAM cache. AWS SigV4 is also designed to be regionalized, if you ever wondered why the signature key derivation has many steps, that's exactly why ( https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_s... ).
There's also dynamodb-fips.us-east-1.amazonaws.com if the main endpoint is having trouble. I'm not sure if this record was affected the same way during this event.
Their status page (https://health.aws.amazon.com/health/status) says the only disrupted service is DynamoDB, but it's impacting 37 other services. It is amazing to see how big a blast radius a single service can have.
It's not surprising that it's impacting other services in the region because DynamoDB is one of those things that lots of other services build on top of. It is a little bit surprising that the blast radius seems to extend beyond us-east-1, mind.
In the coming hours/days we'll find out if AWS still have significant single points of failure in that region, or if _so many companies_ are just not bothering to build in redundancy to mitigate regional outages.
I'm real curious how much of AWS GovCloud has continued through this actually. But even if it's fine, from a strategic perspective how much damage did we just discover you could do with a targeted disruption at the right time?
The US federal government is in yet another shutdown right now, so how would some sub-agency even know if there were an unplanned outage, who would report it, and who would try to fix it?
AWS engineers are trained to use their internal services for each new system. They seem to like using DynamoDB. Dependencies like this should be made transparent.
Ex employee here who built an aws service. Dynamo is basically mandated. You need like VP approval to use a relational database because of some scaling stuff they ran into historically. That sucks because we really needed a relational database and had to bend over backwards to use dynamo and all the nonsense associated with not having sql. It was super low traffic too
Not sure why this is downvoted - this is absolutely correct.
A lot of AWS services under the hood depend on others, and especially us-east-1 is often used for things that require strong consistency like AWS console logins/etc (where you absolutely don't want a changed password or revoked session to remain valid in other regions because of eventual consistency).
Not "like using", they are mandated from the top to use DynamoDB for any storage. At my org in the retail page, you needed director approval if you wanted to use a relational DB for a production service.
Self hosting is golden. Sadly we already feel like we have too many services for our company's size, and the sensitivity of vulnerabilities in customer systems precludes unencrypted comms. IRC+TLS could be used but we also regularly send screenshots and such in self-destructing messages (not that an attacker couldn't disable that, but to avoid there being a giant archive when we do have some sort of compromise), so we'd rather fall back to something with a similar featureset
As a degraded-state fallback, email is what we're using now (we have our clients configured to encrypt with PGP by default, we use it for any internal email and also when the customer has PGP so everyone knows how to use that)
> in self-destructing messages (not that an attacker couldn't disable that, but to avoid there being a giant archive when we do have some sort of compromise
Admitting to that here?
In civilised jurisdictions that should be criminal.
Using cryptography to avoid accountability is wrong. Drug dealing and sex work, OK, but in other businesses? Sounds very crooked to me
They're not sidestepping accountability, or at least we can't infer that from the information volunteered. They're talking about retention, and the less data you retain (subject to local laws), the less data can be leaked in a breach.
self-hosting isn't "golden", if you are serious about the reliability of complex systems, you can't afford to have your own outages impede your own engineers from fixing them.
if you seriously have no external low dep fallback, please at least document this fact now for the Big Postmortem.
The engineers can walk up to the system and do whatever they need to fix them. At least, that's how we self host in the office. If your organisation hosts it far away then yeah, it's not self hosted but remote hosted
Including falling back to third-party hosting when relevant. One doesn't exclude the other
My experience with self hosting has been that, at least when you keep the services independent, downtime is not more common than in hosted environments, and you always know what's going on. Customising solutions, or workarounds in case of trouble, is a benefit you don't get when the service provider is significantly bigger than you are. It has pros and cons and also depends on the product (e.g. email delivery is harder than Mattermost message delivery, or if you need a certain service only once a year or so) but if you have the personell capacity and a continuous need, I find hosting things oneself to be the best solution in general
Including fallback to your laptop if nothing else works. I saved a demo once by just running the whole thing from my computer when the Kubernetes guys couldn't figure out why the deployed version was 403'ing. Just had to poke the touchpad every so often so it didn't go to sleep.
> Just had to poke the touchpad every so often so it didn't go to sleep
Unwarranted tip: next time, if you use macOS, just open the terminal and run `caffeinate -imdsu`.
I assume Linux/Windows have something similar built-in (and if not built-in, something that's easily available). For Windows, I know that PowerToys suite of nifty tools (officially provided by Microsoft) has Awake util, but that's just one of many similar options.
The key thing that AWS provides is the capacity for infinite redundancy. Everyone that is down because us-east-1 is down didn't learn the lesson of redundancy.
Active-active RDBMS - which is really the only feasible way to do HA, unless you can tolerate losing consistency (or the latency hit of running a multi-region PC/EC system) - is significantly more difficult to reason about, and to manage.
Except Google Spanner, I’m told, but AWS doesn’t have an answer for that yet AFAIK.
Some organizations’ leadership takes one look at the cost of redundancy and backs away. Paying for redundant resources most organizations can stomach. The network traffic charges are what push many over the edge of “do not buy”.
The cost of re-designing and re-implementing applications to synchronize data shipping to remote regions and only spinning up remote region resources as needed is even larger for these organizations.
And this is how we end up with these massive cloud footprints not much different than running fleets of VM’s. Just about the most expensive way to use the cloud hyperscalers.
Most non-tech industry organizations cannot face the brutal reality that properly, really leveraging hyperscalers involves a period of time often counted in decades for Fortune-scale footprints where they’re spending 3-5 times on selected areas more than peers doing those areas in the old ways to migrate to mostly spot instance-resident, scale-to-zero elastic, containerized services with excellent developer and operational troubleshooting ergonomics.
Even internal Amazon tooling is impacted greatly - including the internal ticketing platform which is making collaboration impossible during the outage. Amazon is incapable of building multi-region services internally. The Amazon retail site seems available, but I’m curious if it’s even using native AWS or is still on the old internal compute platform. Makes me wonder how much juice this company has left.
Amazon's revenue in 2024 was about the size of Belgium's GDP. Higher than Sweden or Ireland. It makes a profit similar to Norway, without drilling for offshore oil or maintaining a navy. I think they've got plenty of juice left.
The universe's metaphysical poeticism holds that it's slightly more likely than it otherwise would be that the company that replaced Sears would one day go the way of Sears.
You’re right about that. I guess what I mean is, how long will people be enthusiastic about AWS and its ability to innovate. But AWS undeniably has some really strong product offerings - it’s just that their pace of innovation has slowed. Their managed solutions for open source applications is generally good, but some of their bespoke alternatives are lacking over the last few years (ecs kinesis code* tools) - it wasn’t always like that (sqs ddb s3 ec2).
Sure, but it’s not reasonable that internal collaboration platforms built for ticketing engineers about outages doesn’t work during the outage. That would be something worth making multi-region at a minimum.
I saw a quote from a high end AWS support engineer that said something like "submitting tickets for AWS problems is not working reliably: customers are advised to keep retrying until the ticket is submitted".
As this incident unfolds, what’s the best way to estimate how many additional hours it’s likely to last? My intuition is that the expected remaining duration increases the longer the outage persists, but that would ultimately depend on the historical distribution of similar incidents. Is that kind of data available anywhere?
To my understanding the main problem is DynamoDB being down, and DynamoDB is what a lot of AWS services use for their eventing systems behind the scenes. So there's probably like 500 billion unprocessed events that'll need to get processed even when they get everything back online. It's gonna be a long one.
I wonder how many companies have properly designed their clients. So that the timing before re-attempt is randomised and the re-attempt timing cycle is logarithmic.
In short, if it’s all at the same schedule you’ll end up with surges of requests followed by lulls. You want that evened out to reduce stress on the server end.
It's just a safe pattern that's easy to implement. If your services back-off attempts happen to be synced, for whatever reason, even if they are backing off and not slamming AWS with retries, when it comes online they might slam your backend.
It's also polite to external services but at the scale of something like AWS that's not a concern for most.
> I visited the Berlin Wall. People at the time wondered how long the Wall might last. Was it a temporary aberration, or a permanent fixture of modern Europe? Standing at the Wall in 1969, I made the following argument, using the Copernican principle. I said, Well, there’s nothing special about the timing of my visit. I’m just travelling—you know, Europe on five dollars a day—and I’m observing the Wall because it happens to be here. My visit is random in time. So if I divide the Wall’s total history, from the beginning to the end, into four quarters, and I’m located randomly somewhere in there, there’s a fifty-percent chance that I’m in the middle two quarters—that means, not in the first quarter and not in the fourth quarter.
> Let’s suppose that I’m at the beginning of that middle fifty percent. In that case, one-quarter of the Wall’s ultimate history has passed, and there are three-quarters left in the future. In that case, the future’s three times as long as the past. On the other hand, if I’m at the other end, then three-quarters have happened already, and there’s one-quarter left in the future. In that case, the future is one-third as long as the past.
This thought process suggests something very wrong. The guess "it will last again as long as it has lasted so far" doesn't give any real insight. The wall was actually as likely to end five months from when they visited it, as it was to end 500 years from then.
What this "time-wise Copernican principle" gives you is a guarantee that, if you apply this logic every time you have no other knowledge and have to guess, you will get the least mean error over all of your guesses. For some events, you'll guess that they'll end in 5 minutes, and they actually end 50 years later. For others, you'll guess they'll take another 50 years and they actually end 5 minutes later. Add these two up, and overall you get 0 - you won't have either a bias to overestimating, nor to underestimating.
But this doesn't actually give you any insight into how long the event will actually last. For a single event, with no other knowledge, the probability that it will after 1 minute is equal to the probability that it will end after the same duration that it lasted so far, and it is equal to the probability that it will end after a billion years. There is nothing at all that you can say about the probability of an event ending from pure mathematics like this - you need event-specific knowledge to draw any conclusions.
So while this Copernican principle sounds very deep and insightful, it is actually just a pretty trite mathematical observation.
But you will never guess that the latest tik-tok craze will last another 50 years, and you'll never guess that Saturday Night Live (which premiered in 1075) will end 5-minutes from now. Your guesses are thus more likely to be accurate than if you ignored the information about how long something has lasted so far.
Sure, but the opposite also applies. If in 1969 you guessed that the wall would last another 20 years, then in 1989, you'll guess that the wall of Berlin will last another 40 years - when in fact it was about to fall. And in 1949, when the wall was a few months old, you'll guess that it will last for a few months at most.
So no, you're not very likely to be right at all. Now sure, if you guess "50 years" for every event, your average error rate will be even worse, across all possible events. But it is absolutely not true that it's more likely that SNL will last for another 50 years as it is that it will last for another 10 years. They are all exactly as likely, given the information we have today.
If I understand the original theory, we can work out the math with a little more detail... (For clarity, the berlin wall was erected in 1961.)
- In 1969 (8 years after the wall was erected): You'd calculate that there's a 50% chance that the wall will fall between 1972 (8x4/3=11 years) and 1993 (8x4=32 years)
- In 1989 (28 years after the wall was erected): You'd calculate that there's a 50% chance that the wall will fall between 1998 (28x4/3=37 years) and 2073 (28x4=112 years)
- In 1961 (when the wall was, say, 6 months old): You'd calculate that there's a 50% chance that the wall will fall between 1961 (0.5x4/3=0.667 years) and 1963 (0.5x4=2 years)
I found doing the math helped to point out how wide of a range the estimate provides. And 50% of the times you use this estimation method; your estimate will correctly be within this estimated range. It's also worth pointing out that, if your visit was at a random moment between 1961 and 1989, there's only a 3.6% chance that you visited in the final year of its 28 year span, and 1.8% chance that you visited in the first 6 months.
> Well, there’s nothing special about the timing of my visit. I’m just travelling—you know, Europe on five dollars a day—and I’m observing the Wall because it happens to be here.
It's relatively unlikely that you'd visit the Berlin Wall shortly after it's erected or shortly before it falls, and quite likely that you'd visit it somewhere in the middle.
No, it's exactly as likely that I'll visit it at any one time in its lifetime. Sure, if we divide its lifetime into 4 quadrants, its more likely I'm in quadrant 2-3 than in either of 1 or 4. But this is slight of hand: it's still exactly as likely that I'm in quadrant 2-3 than in quadrant (1 or 4) - or, in other words, it's as likely I'm at one of the ends of the lifetime as it is that I am in the middle.
> The wall was actually as likely to end five months from when they visited it, as it was to end 500 years from then.
I don't think this is correct; as in something that has been there for say hundreds of years had more probability to be there in a hundred years than something that has been there for a month.
> while this Copernican principle sounds very deep and insightful, it is actually just a pretty trite mathematical observation
It's important to flag that the principle is not trite, and it is useful.
There's been a misunderstanding of the distribution after the measurement of "time taken so far", (illuminated in the other thread), which has lead to this incorrect conclusion.
To bring the core clarification from the other thread here:
The distribution is uniform before you get the measurement of time taken already. But once you get that measurement, it's no longer uniform. There's a decaying curve whose shape is defined by the time taken so far. Such that the estimate `time_left=time_so_far` is useful.
If this were actually correct, than any event ending would be a freak accident: since, according to you, the probability of something continuing increases drastically with its age. That is, according to your logic, the probability of the wall of Berlin falling within the year was at its lowest point in 1989, when it actually fell. In 1949, when it was a few months old, the probability that it would last for at least 40 years was minuscule, and that probability kept increasing rapidly until the day the wall was collapsed.
> However, the average expected future lifetime increases as a thing ages, because survival is evidence of robustness.
This is a completely different argument that relies on various real-world assumptions, and has nothing to do with the Copernican principle, which is an abstract mathematical concept. And I actually think this does make sense, for many common categories of processes.
However, even this estimate is quite flawed, and many real-world processes that intuitively seem to follow it, don't. For example, looking at an individual animal, it sounds kinda right to say "if it survived this long, it means it's robust, so I should expect it will survive more". In reality, the lifetime of most animals is a binomial distribution - they either very young, because of glaring genetic defects or simply because they're small, fragile, and inexperienced ; or they die at some common age that is species dependent. For example, a humab that survived to 20 years of age has about the same chance of reaching 80 as one that survived to 60 years of age. And an alien who has no idea how long humans live and tries to apply this method may think "I met this human when they're 80 years old - so they'll probably live to be around 160".
Why is the most likely time right now? What makes right now more likely than in five minutes? I guess you're saying if there's nothing that makes it more likely to fail at any time than at any other time, right now is the only time that's not precluded by it failing at other times? I.E. it can't fail twice, and if it fails right now it can't fail at any other time, but even if it would have failed in five minutes it can still fail right now first?
Yes that's pretty much it. There will be a decaying probability curve, because given you could fail at any time, you are less likely to survive for N units of time than for just 1 unit of time, etc.
Is this a weird Monty hall thing where the person next to you didnt visit the wall randomly (maybe they decided to visit on some anniversary of the wall) so for them the expected lifetime of the wall is different?
Note that this is equivalent to saying "there's no way to know". This guess doesn't give any insight, it's just the function that happens to minimize the total expected error for an unknowable duration.
Edit: I should add that, more specifically, this is a property of the uniform distribution, it applies to any event for which EndsAfter(t) is uniformly distributed over all t > 0.
I'm not sure about that. Is it not sometimes useful for decision making, when you don't have any insight as to how long a thing will be? It's better than just saying "I don't know".
Not really, unless you care about something like "when I look back at my career, I don't want to have had a bias to underestimating nor overestimating outages". That's all this logic gives you: for every time you underestimate a crisis, you'll be equally likely to overestimate a different crisis. I don't think this is in any way actually useful.
Also, the worse thing you can get from this logic is to think that it is actually most likely that the future duration equals the past duration. This is very much false, and it can mislead you if you think it's true. In fact, with no other insight, all future durations are equally likely for any particular event.
The better thing to do is to get some even-specific knowledge, rather than trying to reason from a priori logic. That will easily beat this method of estimation.
You've added some useful context, but I think you're downplaying it's use. It's non-obvious, and in many cases better than just saying "we don't know". For example, if some company's server has been down for an hour, and you don't know anything more, it would be reasonable to say to your boss: "I'll look into it, but without knowing more about it, stastically we have a 50% chance of it being back up in an hour".
> The better thing to do is to get some even-specific knowledge, rather than trying to reason from a priori logic
True, and all the posts above have acknowledged this.
> "I'll look into it, but without knowing more about it, stastically we have a 50% chance of it being back up in an hour"
This is exactly what I don't think is right. This particular outage has the same a priori chance of being back in 20 minutes, in one hour, in 30 hours, in two weeks, etc.
Ah, that's not correct... That explains why you think it's "trite", (which it isn't).
The distribution is uniform before you get the measurement of time taken already. But once you get that measurement, it's no longer uniform. There's a decaying curve whose shape is defined by the time taken so far. Such that the statement above is correct, and the estimate `time_left=time_so_far` is useful.
Can you suggest some mathematical reasoning that would apply?
If P(1 more minute | 1 minute so far) = x, then why would P(1 more minute | 2 minutes so far) < x?
Of course, P(it will last for 2 minutes total | 2 minutes elapsed) = 0, but that can only increase the probabilities of any subsequent duration, not decrease them.
(t_obs is time observed to have survived, t_more how long to survive)
Case 1 (x): It has lasted 1 minute (t_obs=1). The probability of it lasting 1 more minute is: 1 / (1 + 1) = 1/2 = 50%
Case 2: It has lasted 2 minutes (t_obs=2). The probability of it lasting 1 more minute is: 2 / (2 + 1) = 2/3 ≈ 67%
I.e. the curve is a decaying curve, but the shape / height of it changes based on t_obs.
That gets to the whole point of this, which is that the length of time something has survived is useful / provides some information on how long it is likely to survive.
Where are you getting this formula from? Either way, it doesn't have the property we were originally discussing - the claim that the best estimate of the duration of an event is the double of it's current age. That is, by this formula, the probability of anything collapsing in the next millisecond is P(1 more millisecond | t_obs) = t_obs / t_obs + 1ms ~= 1 for any t_obs >> 1ms. So by this logic, the best estimate for how much longer an event will take is that it will end right away.
The formula I've found that appears to summarize the original "Copernican argument" for duration is more complex - for 50% confidence, it would say:
P(t_more in [1/3 t_obs, 3t_obs]) = 50%
That is, if given that we have a 50% chance to be experiencing the middle part of an event, we should expect its future life to be between one third and three times its past life.
Of course, this can be turned on its head: we're also 50% likely to be experiencing the extreme ends of an event, so by the same logic we can also say that P(t_more = 0 [we're at the very end] or t_more = +inf [we're at the very beginning and it could last forever] ) is also 50%. So the chance t_more > t_obs is equal to the chance it's any other value. So we have precisely 0 information.
The bottom line is that you can't get more information out a uniform distribution. If we assume all future durations have the same probability, then they have the same probability, and we can't predict anything useful about them. We can play word games, like this 50% CI thing, but it's just that - word games, not actual insight.
It's not a uniform distribution after the first measurement, t_obs. That enables us to update the distribution, and it becomes a decaying one.
I think you mistakenly believe the distribution is still uniform after that measurement.
The best guess, that it will last for as long as it already survived for, is actually the "median" of that distribution. The median isn't the highest point on the probability curve, but the point where half the area under the curve is before it, and half the area under the curve is after it.
The cumulative distribution actually ends up pretty exponential which (I think) means that if you estimate the amount of time left in the outage as the mean of all outages that are longer than the current outage, you end up with a flat value that's around 8 hours, if I've done my maths right.
Not a statistician so I'm sure I've committed some statistical crimes there!
Unfortunately I can't find an easy way to upload images of the charts I've made right now, but you can tinker with my data:
cause,outage_start,outage_duration,incident_duration
Cell management system bug,2024-07-30T21:45:00.000000+0000,0.2861111111111111,1.4951388888888888
Latent software defect,2023-06-13T18:49:00.000000+0000,0.08055555555555555,0.15833333333333333
Automated scaling activity,2021-12-07T15:30:00.000000+0000,0.2861111111111111,0.3736111111111111
Network device operating system bug,2021-09-01T22:30:00.000000+0000,0.2583333333333333,0.2583333333333333
Thread count exceeded limit,2020-11-25T13:15:00.000000+0000,0.7138888888888889,0.7194444444444444
Datacenter cooling system failure,2019-08-23T03:36:00.000000+0000,0.24583333333333332,0.24583333333333332
Configuration error removed setting,2018-11-21T23:19:00.000000+0000,0.058333333333333334,0.058333333333333334
Command input error,2017-02-28T17:37:00.000000+0000,0.17847222222222223,0.17847222222222223
Utility power failure,2016-06-05T05:25:00.000000+0000,0.3993055555555555,0.3993055555555555
Network disruption triggering bug,2015-09-20T09:19:00.000000+0000,0.20208333333333334,0.20208333333333334
Transformer failure,2014-08-07T17:41:00.000000+0000,0.13055555555555556,3.4055555555555554
Power loss to servers,2014-06-14T04:16:00.000000+0000,0.08333333333333333,0.17638888888888887
Utility power loss,2013-12-18T06:05:00.000000+0000,0.07013888888888889,0.11388888888888889
Maintenance process error,2012-12-24T20:24:00.000000+0000,0.8270833333333333,0.9868055555555555
Memory leak in agent,2012-10-22T17:00:00.000000+0000,0.26041666666666663,0.4930555555555555
Electrical storm causing failures,2012-06-30T02:24:00.000000+0000,0.20902777777777776,0.25416666666666665
Network configuration change error,2011-04-21T07:47:00.000000+0000,1.4881944444444444,3.592361111111111
Generally expect issues for the rest of the day, AWS will recover slowly, then anyone that relies on AWS will recovery slowly. All the background jobs which are stuck will need processing.
Yes. 15-20 years ago when I was still working on network-adjacent stuff I witnessed the shift to the devops movement.
To be clear, the fact that devops don't plan for AWS failures isn't an indication that they lack the sysadmin gene. Sysadmins will tell you very similar "X can never go down" or "not worth having a backup for service Y".
But deep down devops are developers who just want to get their thing running, so they'll google/serveroverflow their way into production without any desire to learn the intricacies of the underlying system. So when something breaks, they're SOL.
"Thankfully" nowadays containers and application hosting abstracts a lot of it back away. So today I'd be willing to say that devops are sufficient for small to medium companies (and dare I say more efficient?).
> But deep down devops are developers who just want to get their thing running, so they'll google/serveroverflow their way into production without any desire to learn the intricacies of the underlying system. So when something breaks, they're SOL.
Depends on the devops team. I have worked with so many devops engineers who came from network engineering, sysadmin, or SecOps backgrounds. They all bring a different perspective and set of priorities.
That's not very surprising. At this point you could say that your microwave has a better uptime. The complexity comparison to all the Amazon cloud services and infrastructure would be roughly the same.
As Amazon moves from day-1 company as it claimed once, to be the sales company like Oracle focusing on raking money, expect more outages to come, and longer to be resolved.
Amazon is burning and driving away the technical talent and knowledge knowing the vendor lock-in will keep bringing the sweet money. You will see more sales people hoovering around your c-suites and executives, while you will face even worse technical support, that seem not knowing what they are talking about, yet alone to fix the support issue you expect to be fixed easily.
Mark my words, and if you are putting your eggs in one basket, that basket is now too complex and too interdependent, and the people who built and knew those intricacies are driven away with RTOs, move to hubs. Eventually those services; all others (and also aws services themselves) heavily dependent on, might be more fragile than the public knows.
>You will see more sales people hoovering around your c-suites and executives, while you will face even worse technical support, that seem not knowing what they are talking about, yet alone to fix the support issue you expect to be fixed easily.
"And so, a quiet suspicion starts to circulate: where have the senior AWS engineers who've been to this dance before gone? And the answer increasingly is that they've left the building — taking decades of hard-won institutional knowledge about how AWS's systems work at scale right along with them."
...
"AWS has given increasing levels of detail, as is their tradition, when outages strike, and as new information comes to light. Reading through it, one really gets the sense that it took them 75 minutes to go from "things are breaking" to "we've narrowed it down to a single service endpoint, but are still researching," which is something of a bitter pill to swallow. To be clear: I've seen zero signs that this stems from a lack of transparency, and every indication that they legitimately did not know what was breaking for a patently absurd length of time."
....
"This is a tipping point moment. Increasingly, it seems that the talent who understood the deep failure modes is gone. The new, leaner, presumably less expensive teams lack the institutional knowledge needed to, if not prevent these outages in the first place, significantly reduce the time to detection and recovery. "
...
"I want to be very clear on one last point. This isn't about the technology being old. It's about the people maintaining it being new. If I had to guess what happens next, the market will forgive AWS this time, but the pattern will continue."
That is why technical leaders’ role wouldn’t demand they not only gather data, but also report things like accurate operational, alternative, and scenario cost analysis; financial risks; vendor lock-in; etc.
However, as may be apparent just from that small set, it is not exactly something technical people often feel comfortable with doing. It is why at least in some organizations you get the friction of a business type interfacing with technical people in varying ways, but also not really getting along because they don’t understand each other and often there are barriers of openness.
I think business types vs technical types inherently have different perspectives especially for american companies. One has the "get it done at all costs" the other has "this can't be done since impossible/it will break this".
When a company moves from engineering/technical driven to sales/profit/stock price/shareholders satisfaction driven, once it was not possible to cut (technical) corners, now becomes the de facto. If you push the L7s/L8s out of the discussion room, who would definitely stop or veto circular dependencies, and replace with sir-yes-sir people, now you've successfully created short term KPI wins for the lofty chairs but with a burning fuse of catastrophic failures to come.
I know Postman has kinda gone to shit over the years but it's hilarious my local REST client that makes requests from my machine has AWS as a dependency .
In AWS, if you take out one of dynamo db, S3 or lambda you're going to be in a world of pain. Any architecture will likely use those somewhere including all the other services on top.
If in your own datacenter your storage service goes down, how much remains running
When these major issues come up, all they have is symptoms and not causes. Maybe not until the dynamo oncall comes on and says its down, then everyone knows at least the reason for their teams outage.
The scale here is so large they don't know the complete dependency tree until teams check-in on what is out or not, growing this list. Of course most of it is automated, but getting on 'Affected Services' is not.
I wonder what kind of outage or incident or economic change will be required to cause a rejection of the big commercial clouds as the default deployment model.
The costs, performance overhead, and complexity of a modern AWS deployment are insane and so out of line with what most companies should be taking on. But hype + microservices + sunk cost, and here we are.
I don't expect the majority of tech companies to want to run their own physical data centers. I do expect them to shift to more bare-metal offerings.
If I'm a mid to large size company built on DynamoDB, I'd be questioning if it's really worth the risk given this 12+ hour outage.
I'd rather build upon open source tooling on bare metal instances and control my own destiny, than hope that Amazon doesn't break things as they scale to serve a database to host the entire internet.
For big companies, it's probably a cost savings too.
I think that prediction severely underestimates the amount of cargo culting present at basically every company when it comes to decisions like this. Using AWS is like the modern “no one ever got fired for buying IBM”.
Imagine using vercel, a company that literally contributes to the starvation of children and is proud of it. Also, literally just learn to use a Dockerfile and a vps, like why do these PaaS even exist, you're paying 3x for the same AWS services.
This is just a silly anecdote, but every time a cloud provider blips, I'm reminded. The worst architecture I've ever encountered was a system that was distributed across AWS, Azure, and GCP. Whenever any one of them had a problem, the system went down. It also cost 3x more than it should.
I've seen the exact same thing at multiple companies. The teams were always so proud of themselves for being "multi-cloud" and managers rewarded them for their nonsense. They also got constant kudos for their heroic firefighting whenever the system went down, which it did constantly. Watching actually good engineers get overlooked because their systems were rock-solid while those characters got all the praise for designing an unadulterated piece of shit was one of the main reasons I left those companies.
I became one of the founding engineers at a startup, which worked for a little while until the team grew beyond my purview, and no good engineering plan survives contact with sales directors who lie to customers about capabilities our platform has.
> Watching actually good engineers get overlooked because their systems were rock-solid while those characters got all the praise for designing an unadulterated piece of shit
That is the computing business. There is no actual accountability, just ass covering
multi-cloud... any leader that approves such a boondoggle should be labelled incompetent. These morons sell it as a cost-cutting "migration". Never once have I seen such a project complete and it more than doubles complexity and costs.
looks like very few get it right. A good system would have few minutes of blip when one cloud provider went down, which is a massive win compared to outages like this.
They all make it pretty hard, and a lot of resume-driven-devs have a hard time resisting the temptation of all the AWS alphabet soup of services.
Sure you can abstract everything away, but you can also just not use vendor-flavored services. The more bespoke stuff you use the more lock in risk.
But if you are in a "cloud forward" AWS mandated org, a holder of AWS certifications, alphabet soup expert... thats not a problem you are trying to solve. Arguably the lock in becomes a feature.
Lock-in is another way to say "bespoke product offering". Sometimes solving the problem a cloud provider service exposes is not worth it. This locks you in for the same reasons that a specific restaurant locks you in because its their recipe.
I'd counter that past a certain scale, certainly the scale of a firm that used to & could run its own datacenter.. it's probably your responsibility to not use those services.
Sure it's easier, but if you decide feature X requires AWS service Y that has no GCP/Azure/ORCL equivalent.. it seems unwise.
Just from a business perspective, you are making yourself hostage to a vendor on pricing.
If you're some startup trying to find traction, or a small shop with an IT department of 5.. then by all means, use whatever cloud and get locked in for now.
But if you are a big bank, car maker, whatever.. it seems grossly irresponsible.
On the east coast we are already approaching an entire business day being down today. Gonna need a decade without an outage to get all those 9s back.
And not to be catastrophic but.. what if AWS had an outage like this that lasted.. 3 days? A week?
The fact that the industry collectively shrugs our shoulders and allows increasing amounts of our tech stacks to be single-vendor hostage is crazy.
> I'd counter that past a certain scale, certainly the scale of a firm that used to & could run its own datacenter.. it's probably your responsibility to not use those services.
It's actually probably not your responsibility, it's the responsibility of some leader 5 levels up who has his head in the clouds (literally).
It's a hard problem to connect practical experience and perspectives with high-level decision-making past a certain scale.
> The fact that the industry collectively shrugs our shoulders and allows increasing amounts of our tech stacks to be single-vendor hostage is crazy.
Well, nobody is going to get blamed for this one except people at Amazon. Socially, this is treated as as a tornado. You have to be certain that you can beat AWS in terms of reliability for doing anything about this to be good for your career.
In 20+ years in the industry, all my biggest outages have been AWS... and they seem to be happening annually.
Most of my on-prem days, you had more frequent but smaller failures of a database, caching service, task runner, storage, message bus, DNS, whatever.. but not all at once. Depending on how entrenched your organization is, some of these AWS outages are like having a full datacenter power down.
Might as well just log off for the day and hope for better in the morning. That assumes you could login, which some of my ex-US colleagues could not for half the day, despite our desktops being on-prem. Someone forgot about the AWS 2FA dependency..
In general, the problem with abstracting infrastructure means you have to code to the lowest common denominator. Sometimes its worth it. For companies I work for it really isn't.
You mean multi-cloud strategy ! You wanna know how you got here ?
See the sales team from Google flew out an executive to NBA Finals, Azure Sales team flew out another executive to NFL superBowl and the AWS team flew out yet another executive to Wimbledon finals. And thats how you end up with multi-cloud strategy.
In this particular case, it was resume-oriented architecture (ROAr!) The original team really wanted to use all the hottest new tech. The management was actually rather unhappy, so the job was to pare that down to something more reliable.
Eh, businesses want to stay resilient to a single vendor going down. My least favorite question in interviews this past year was around multi-cloud. Because imho it just isn't worth it- the increased complexity, the trying to like-like services across different clouds that aren't always really the same, and then just the ongoing costs of chaos monkeying and testing that this all actually works, especially in the face of a partial outage like this vs something "easy" like a complete loss of network connectivity... but that is almost certainly not what CEOs want to hear (mostly who I am dealing with here going for VPE or CTO level jobs).
I could care less about having more vendor dinners when I know I am promising a falsehood that is extremely expensive and likely going to cost me my job or my credibility at some point.
On the flip side, our SaaS runs primarily on GCP so our users are fine. But our billing and subscription system runs on AWS so no one can pay us today.
i'll bet there are a large number of systems that are dependent on multiple cloud platforms being up without even knowing it. They run on AWS, but rely on a tool from someone else that runs on GCP or on Azure and they haven't tested what happens if that tools goes down...
Common Cause Failures and false redundancy are just all over the place.
Was just on a Lufthansa and then United flight - both of which did not have WiFi. Was wondering if there was something going on at the infrastructure level.
I know there's a lot of anecdotal evidence and some fairly clear explanations for why `us-east-1` can be less reliable. But are there any empirical studies that demonstrate this? Like if I wanted to back up this assumption/claim with data, is there a good link for that, showing that us-east-1 is down a lot more often?
1. When aws deploys changes they run through a pipeline which pushes change to regions one at a time. Most services start with us-east-1 first.
2. us-east-1 is MASSIVE and considerably larger than the next largest region. There's no public numbers but I wouldn't be surprised if it was 50% of their global capacity. An outage in any other region never hits the news.
Each AWS service may choose different pipeline ordering based on the risks specific to their architecture.
In general:
You don't deploy to the largest region first because of the large blast radius.
You may not want to deploy to the largest region last because then if there's an issue that only shows up at that scale you may need to roll every single region back (divergent code across regions is generally avoided as much as possible).
A middle ground is to deploy to the largest region second or third.
Agreed. Most services start deployments on a small number of hosts in single AZs in small less-known regions, ramping up from there. In all my years there I don’t recall “us-east-1 first”.
I don't think its fair to dismiss a lot of anecdotal evidence, much of human experience is based off of it, and just being anecdotal doesn't make it incorrect. For those of us using aws for the last decade, there have been a handful of outages that are pretty hard to forget. Often those same engineers have services in other regions - so we witness these things going down more frequently in us-east-1. Now can I say definitively that us-east-1 goes down the most - nope. Have I had 4 outages in us-east-1 I can remember and only 1-2 us-west-2, yep.
The length and breadth of this outage has caused me to lose so much faith in AWS. I knew from colleagues who used to work there how understaffed and inefficient the team is due to bad management, but this just really concerns me.
I find it interesting that AWS services appear to be so tightly integrated that when there's an issue in a region, it affects most or all services. Kind of defeats the purported resiliency of cloud services.
Yes, and that's exactly the problem. It's like choosing a microservice architecture for resiliency and building all the services on top of the same database or message queue without underlying redundancy.
afaik they have a tiered service architecture, where tier 1 services are allowed to rely on tier 0 services but not vice-versa, and have a bunch of reliability guarantees on tier 0 services that are higher than tier 1.
It is kinda cool that the worst aws outages are still within a single region and not global.
But I think what wasn't well considered was the async effect - If something is gone for 5 minutes, maybe it will be just fine, but when things are properly asynchronous, then the workflows that have piled up during that time becomes a problem in itself. Worst case, they turn into poison pills which then break the system again.
I think a lot of its probably technical debt. So much internally still relies on legacy systems in US-East-1, and every time this happens I'm sure theres a discussion internally about decoupling that reliance which then turns into a massive diagram that looks like a family tree dating back a thousand years of all the things that need to be changed to stop it happening.
There's also the issue of sometimes needing actual strong consistency. Things like auth or billing for example where you absolutely can't tolerate eventual consistency or split-brain situations, in which case you need one region to serve as the ultimate source of truth.
Interesting point that banks actually tolerate a lot more eventual consistency than most software that just use a billing backend ever do.
Stuff like 503-ing a SaaS request because the billing system was down and you couldn’t check for limits, could absolutely be locally cached and eventual consistency would hurt very little. Unless your cost is quite high, I would much rather prefer to keep the API up and deal with the over-usage later.
Banking/transactions is full of split-brains where everyone involved prays for eventual consistency.
If you check out with a credit card, even if everything looked good then, the seller might not see the money for days or might never receive it at all.
Sounds plausible. It's also a "fat and happy" symptom not to be able to fix deep underlying issues despite an ever growing pile of cash in the company.
Fixing deep underlying issues tends to fare poorly on performance reviews because success is not an easily traceable victory event. It is the prolonged absence of events like this, and it's hard to prove a negative.
Yeah I think there are a number of "hidden" dependencies on different regions, especially us-east-1. It's an artifact of it being AWS' largest region, etc.
us-east-2 does exist; it’s in Ohio. One major issue is a number of services have (had? Not sure if it’s still this way) a control plane in us-east-1, so if it goes down, so does a number of other services, regardless of their location.
> I find it interesting that AWS services appear to be so tightly integrated that when there's an issue THAT BECOMES VISIBLE TO ME in a region, it affects most or all services.
AWS has stuff failing alllllllll the time, it's not very surprising that many of the outages that become visible to you involve multi-system failures - lots of other ones don't become visible!
> The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.
It's a bit funny that they say "most service operations are succeeding normally now" when, in fact, you cannot yet launch or terminate new EC2 instance, which is basically the defining feature of the cloud...
Is that material to a conversation about service uptime of existing resources, though? Are there customers out there that are churning through the full lifecycle of ephemeral EC2 instances as part of their day-to-day?
We spend ~$20,000 per month in AWS for the product I work on. In the average day we do not launch an EC2 instance. We do not do any dynamic scaling. However, there are many scenarios (especially during outages and such) that it would be critical for us to be able to launch a new instance (and or stop/start an existing instance.)
I understand scaling. I’m saying there is a difference in severity of several orders of magnitude between “the computers are down” and “we can’t add additional computers”.
We just had a power outage in Ashburn starting at 10 pm Sunday night. It restored at 3:40am ish, and I know datacenters have redundant power sources but the timing is very suspicious. The AWS outage supposedly started at midnight
Even with redundancy, the response time between NYC and Amazon East in Ashburn is something like 10 ms. The impedance mismatch and dropped packets and increased latency would doom most organizations craplications.
US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions. Thus simply being in another region doesn’t protect you from the consistent us-east-1 shenanigans.
AWS doesn’t talk about that much publicly, but if you press them they will admit in private that there are some pretty nasty single points of failure in the design of AWS that can materialize if us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.
Not entirely clear yet if those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load balancing failover.”
Call me crazy, because this is, perhaps it's their "Room 641a". The purpose of a system is what it does, no point arguing 'should' against reality, etc.
They've been charging a premium for, and marketing, "Availability" for decades at this point. I worked for a competitor and made a better product: it could endure any of the zones failing.
It's possible that you really could endure any zone failure. But I take these claims people make all the time with a grain of salt, unless you're working on AWS scale (basically just 3 companies) and have actually run for years and seen every kind of failure mode claiming to be higher availability is not something that's able to be accurately evaluated.
(I'm assuming by zone you mean the equivalent of an AWS region, with multiple connected datacenters)
Yes, equivalent. Did endure, repeatedly. Demonstrated to auditors to maintain compliance. They would pick the zone to cut off. We couldn't bias the test. Literal clockwork.
I'll let people guess for the sport of it, here's the hint: there were at least 30 of them comprised of Real Datacenters. Thanks for the doubt, though. Implied or otherwise.
Just letting you know how this response looks to other people -- Anon1096 raises legitimate objections, and their post seems very measured in their concerns, not even directly criticizing you. But your response here is very defensive, and a bit snarky. Really I don't think you even respond directly to their concerns, they say they'd want to see scale equivalent to AWS because that's the best way to see the wide variety of failure modes, but you mostly emphasize the auditors, which is good but not a replacement for the massive real load and issues that come along with it. It feels miscalibrated to Anon's comment. As a result, I actually trust you less. If you can respond to Anon's comment without being quite as sassy, I think you'd convince more people.
I appreciate the feedback, truly. Defensive and snarky are both fair, though I'm not trying to convince. The business and practices exist, today.
At risk of more snark [well-intentioned]: Clouds aren't the Death Star, they don't have to have an exhaust port. It's fair the first one does... for a while.
Ya, I totally believe that cloud platforms don't need a single point of failure. In fact, seeing the vulnerability makes me excited, because I realize there is _still_ potential for innovation in this area! To be fair it's not my area of expertise, so I'm very unlikely to be involved, but it's still exciting to see more change on the horizon :)
What company did you do it with, can you say? Definitely, they may have been an early mover, but they can (and I'll say will!) still be displaced eventually, that's how business goes.
It's fine if someone guesses the well-known company, but I can't confirm/deny; like privacy a bit too much/post a bit too spicy. This wasn't a darling VC thing, to be fair. Overstated my involvement with 'made' for effect. A lot of us did the building and testing.
Definitely, that makes sense. Ya no worries at all, I think we all know these kinds of things involve 100+ human work-years, so at best we all just have some contribution to them.
> think we all know these kinds of things involve 100+ human work-years
No kidding! The customers differ, business/finance/governments, but the volume [systems/time/effort] was comparable to Amazon. The people involved in audits were consumed practically for a whole quarter, if memory serves. Not necessarily for testing itself: first, planning, sharing the plan, then dreading the plan.
Anyway, I don't miss doing this at all. Didn't mean to imply mitigation is trivial, just feasible :) 'AWS scale' is all the more reason to do business continuity/disaster recovery testing! I guess I find it being surprising, surprising.
Competitors have an easier time avoiding the creation of a Gordian Knot with their services... when they aren't making a new one every week. There are significant degrees to PaaS, a little focus [not bound to a promotion packet] goes a long way.
Yes, it was something we would do to maintain certain contracts. Sounds crazy, isn't: they used a significant portion of the capacity, anyway. They brought the auditors.
Real People would notice/care, but financially, it didn't matter. Contract said the edge had to be lost for a moment/restored. I've played both Incident Manager and SRE in this routine.
edit: Less often we'd do a more thorough test: power loss/full recovery. We'd disconnect more regularly given the simplicity.
If you go far up enough the pyramid, there is always a single point of failure. Also, it's unlikely that 1) all regions have the same power company, 2) all of them are on the same payment schedule, 3) all of them would actually shut off a major customer at the same time without warning, so, in your specific example, things are probably fine.
No. It’s just that in my entire career when anyone claims that they have the perfect solution to a tough problem, it means either that they are selling something, or that they haven’t done their homework. Sometimes it’s both.
For what's left of your career: sometimes it's neither. You're confused, perfection? Where? A past employer, who I've deliberately not named, is selling something: I've moved on. Their cloud was designed with multiple-zone regions, and importantly, realizes the benefit: respects the boundaries. Amazon, and you, apparently have not.
Yes, everything has a weakness. Not every weakness is comparable to 'us-east-1'. Ours was billing/IAM. Guess what? They lived in several places with effective and routinely exercised redundancy. No single zone held this much influence. Service? Yes, that's why they span zones.
Said in the absolute kindest way: please fuck off. I have nothing to prove or, worse, sell. The businesses have done enough.
Yea, let's play along. Our CEO is personally choosing to not pay any entire class of partners across the planet. Are we even still in business? I'm so much more worried about being paid than this line of questioning.
A Cloud with multiple regions, or zones for that matter, that depend on one is a poorly designed Cloud; mine didn't, AWS does. So, let's revisit what brought 'whatever1', here:
> Your experiment proves nothing. Anyone can pull it off.
Fine, our overseas offices are different companies and bills are paid for by different people.
Not that "forgot to pay" is going to result in a cut off - that doesn't happen with the multi-megawatt supplies from multiple suppliers that go into a dedicated data centre. It's far more likely that the receivers will have taken over and will pay the bill by that point.
Was that competitor priced competitively with AWS? I think of the project management triangle here - good, fast, or cheap - pick two. AWS would be fast and cheap.
Yes, good point. Pricing is a bit higher. As another reply pointed out: there's ~three that work on the same scale. This was one, another hint I guess: it's mostly B2B. Normal people don't typically go there.
Azure, from my experience with it has stuff go down a lot and degrades even more. Seems to either not admit the degradation happened or rely on 1000 pages of fine print SLA docs to prove you don't get any credits for it. I suppose that isn't the same as "lose a region resiliency" so it could still be them given the poster said it is B2B focused and Azure is subject to a lot of exercises like this from it's huge enterprise customers. FWIW I worked as a IaC / devops engineer with the largest tenant in one of the non-public Azure clouds.
My $3/mo AWS instance is far cheaper than any DIY solution I could come up with, especially when I have to buy the hardware and supply the power/network/storage/physical space. Not to mention it's not worth my time to DIY something like that in the first place.
False equivalence/moving goalposts IMO... I was only refuting your claim of "AWS is not cheap", as if it's somehow impossible for it to be cheap... which I'm saying isn't the case.
Sorry to jump in y'alls convo :) AWS is cheaper than the Cloud we built... I just don't think it's significant. Ours cost more because businesses/governments would pay it, not because it was optimal.
Price is beside my original point: Amazon has enjoyed decades for arbitrage. This sounds more accusatory than intended: the 'us-south-1' problem exists because it's allowed/chosen. Created in 2006!
Now, to retract that a bit: I could see technical debt/culture making this state of affairs practical, if not inevitable. Correct? No, if I was Papa Bezos I'd be incredibly upset my Supercomputer is so hamstrung. I think even the warehouses were impacted!
The real differentiator was policy/procedure. Nobody was allowed to create a service or integration with this kind of blast area. Design principles, to say the least. Fault zones and availability zones exist for a reason beyond capacity, after all.
Right, like I said: crazy. Anything production with certain other clouds must be multi-AZ. Both reinforced by culture and technical constraints. Sometimes BCDR/contract audits [zones chosen by a third party at random].
The disconnect case was simple: breakage was as expected. The island was lost until we drew it on the map again. Things got really interesting when it was a full power-down and back on.
Were the docs/tooling up to date? Tough bet. Much easier to fix BGP or whatever.
This set of facts comes to light every 3-5 years when US-East-1 has another failure. Clearly they could have architected their way out of this blast radius problem by now, but they do not.
Why would they keep a large set of centralized, core traffic services in Virginia for decades despite it being a bad design?
It’s probably because there is a lot of tech debt plus look at where it is - Virgina. It shouldn’t take much of imagination to figure out why that is strategic
They could put a failover site in Colorado or Seattle or Atlanta, handling just their infrastructure. It's not like the NSA wouldn't be able to backhaul from those places.
AWS _had_ architected away from single-region failure modes. There are only a few services that are us-east-1 only in AWS (IAM and Route53, mostly), and even they are designed with static stability so that their control plane failure doesn't take down systems.
It's the rest of the world that has not. For a long time companies just ran everything in us-east-1 (e.g. Heroku), without even having an option to switch to another region.
So the control plane for DNS and the identity management system are tied to us-east-1 and we’re supposed to think that’s OK? Those seem like exactly the sorts of things that should NOT be reliant on only one region.
It's worse than that. The entire DNS ultimately depends on literally one box with the signing key for the root zone.
You eventually get services that need to be global. IAM and DNS are such examples, they have to have a global endpoint because they apply to the global entities. AWS users are not regionalized, an AWS user can use the same key/role to access resources in multiple regions.
not quite true - there are some regions that have a different set of AWS users / credentials. I can't remember what this is called off the top of my head.
These are different AWS partitions. They are completely separate from each other, requiring separate accounts and credentials.
There's one for China, one for the AWS government cloud, and there are also various private clouds (like the one hosting the CIA data). You can check their list in the JSON metadata that is used to build the AWS clients (e.g. https://github.com/aws/aws-sdk-go-v2/blob/1a7301b01cbf7e74e4... ).
Been a while since I last suffered from AWS arbitrary complexity, but afaik you can only associate certificates to cloudfront if they are generated in us-east-1, so it's undoubtedly a single point of failure for all CDN if this is still the case.
I worked at AMZN for a bit and the complexity is not exactly arbitrary; it's political. Engineers and managers are highly incentivized to make technical decisions based on how they affect inter-team dependencies and the related corporate dynamics. It's all about review time.
I have seen one promo docket get rejected for doing work that is not complex enough... I thought the problem was challenging, and the simple solution brilliant, but the tech assessor disagreed. I mean once you see there is a simple solution to a problem, it looks like the problem is simple...
I had a job interview like this recently: "what's the most technically complex problem you've ever worked on?"
The stuff I'm proudest of solved a problem and made money but it wasn't complicated for the sake of being complicated. It's like asking a mechanical engineer "what's the thing you've designed with the most parts"
I think this could still be a very useful question for an interviewer. If I were hiring for a position working on a complex system, I would want to know what level of complexity a prospect was comfortable dealing with.
I was once very unpopular with a team of developers when I pointed out a complete solution to what they had decided was an "interesting" problem - my solution didn't involve any code being written.
I suppose it depends on what you are interviewing for but questions like that I assume are asked more to see how you answer than the specifics of what you say.
Most web jobs are not technically complex. They use standard software stacks in standard ways. If they didn't, average developers (or LLMs) would not be able to write code for them.
Yeah, I think this. I've asked this in interviews before, and it's less about who has done the most complicated thing and more about the candidate's ability to a) identify complexity, and b) avoid unnecessary complexity.
I.e. a complicated but required system is fine (I had to implement a consensus algorithm for a good reason).
A complicated but unrequired system is bad (I built a docs platform for us that requires a 30-step build process, but yeah, MkDocs would do the same thing.
I really like it when people can pick out hidden complexity, though. "DNS" or "network routing" or "Kubernetes" or etc are great answers to me, assuming they've done something meaningful with them. The value is self-evident, and they're almost certainly more complex than anything most of us have worked on. I think there's a lot of value to being able to pick out that a task was simple because of leveraging something complex.
>US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions
I thought that if us-east-1 goes down you might not be able to administer (or bring up new services) in other zones, but if you have services running that can take over from us-east-1, you can maintain your app/website etc.
I haven’t had to do this for several years but that was my experience a few years ago on an outage - obviously it depends on the services you’re using.
You can’t start cloning things to other zones after us-east-1 is down - you’ve left it too late
It depends on the outage. There was one a year or two ago (I think? They run together) that impacted EC2 such that as long as you weren’t trying to scale, or issue any commands, your service would continue to operate. The EKS clusters at my job at the time kept chugging along, but had Karptenter tried to schedule more nodes, we’d have had a bad time.
Meanwhile, AWS has always marketed itself as "elastic". Not being able to start new VMs in the morning to handle the daytime load will wreck many sites.
Well that sounds like exactly the sort of thing that shouldn’t happen when there’s an issue given the usual response is to spin things up elsewhere, especially on lower priority services where instant failover isn’t needed.
Yeah because Amazon engineers are hypocrites. They want you to spend extra money for region failover and multi-az deploys but they don't do it themselves.
That's a good point, but I'd just s/Amazon engineers/AWS leadership/ , as I'm pretty sure that there's a few layers of management removed between the engineers on the ground at AWS, those who deprioritise any longer-term resilience work needed (which is a very strategic decisioN), and those those who are in charge of external comms/education about best practices for AWS customers.
Luckily, those people are the ones that will be getting all the phonecalls from angry customers here. If you're selling resilience and selling twice the service (so your company can still run if one location fails), and it still failed, well... phones will be ringing.
What do you mean? Obviously, as TFA shows and as others here pointed-out, AWS relies globally on services that are fully-dependent on us-east-1, so they aren't fully multi-region.
The claim was that that they're total hypocrites aren't multi region at all. That's totally false, the amount of redundancy in aws is staggering. But there are foundational parts which, I guess, have been too difficult to do that for (or perhaps they are redundant but the redundancy failed in this case? I dunno)
There's multiple single points of failure for their entire cloud in us-east-1.
I think it's hypocritical for them to push customers to double or triple their spend in AWS when they themselves have single points of failure on a single region.
That's absurd. It's hypocritical to describe best practices as best practices because you haven't perfectly implemented them? Either they're best practice or they aren't. The customers have the option of risking non-redundancy also, you know.
Yes it's hypocritical to push customers to pay you more money with best practices for uptime when you yourself don't follow them and your choices to not follow them actually make the best practices you pushed your customers to pay you more money for not fully work.
Hey! Pay us more money so when us-east-1 goes down you're not down (actually you'll still go down because us-east-1 is a single point of failure even for our other regions).
Amazon are planning to launch the EU Sovereign Cloud by the end of the year. They claim it will be completely independent. It may be possible then to have genuine resiliency on AWS. We'll see.
This is the difference between “partitions” and “regions”. Partitions have fully separate IAM, DNS names, etc. This is how there are things like US Gov Cloud, the Chinese AWS cloud, and now the EU sovereign cloud
Yes, although unfortunately it’s not how AWS sold regions to customers. AWS folks consistently told customers that regions were independent and customers architected on that belief.
It was only when stuff started breaking that all this crap about “well actually stuff still relies on us-east-1” starts coming out.
My contention for a long time has been that cloud is full of single points of failure (and nightmarish security hazards) that are just hidden from the customer.
"We can't run things on just a box! That's a single point of failure. We're moving to cloud!"
The difference is that when the cloud goes down you can shift the blame to them, not you, and fixing it is their problem.
The corporate world is full of stuff like this. A huge role of consultants like McKinsey is to provide complicated reports and presentations backing the ideas that the CEO or other board members want to pursue. That way if things don't work out they can blame McKinsey.
You act as if that is a bug not a feature. As hypothetically someone who is responsible for my site staying up, I would much rather blame AWS than myself. Besides none of your customers are going to blame you if every other major site is down.
> As hypothetically someone who is responsible for my site staying up, I would much rather blame AWS than myself.
That's a very human sentiment, and I share it. That's why I don't swap my car wheels myself, I don't want to feel responsible if one comes loose on the highway and I cause an accident.
But at the same time it's also appalling how low the bar has gotten. We're still the ones deciding that one cloud is enough. The down being "their" fault really shouldn't excuse that fact. Most services aren't important enough to need redundancy. But if yours is, and it goes down because you decided that one provider is enough, then your provider isn't solely at fault here and as a profession I wish we'd take more accountability.
How many businesses can’t afford to suffer any downtime though?
But I’ve led enough cloud implementations where I discuss the cost and complexity between - multi-AZ (it’s almost free so why not), multi region , and theoretically multi cloud (never came up in my experience) and then cold, warm and hot standby, RTO and RPO, etc
And for the most part, most businesses are fine with just multi-AZ as long as their data can survive catastrophe.
I'm saying the importance is on uptime, not on who to blame, when services are critical.
You don't have one data center with critical services. You know lots of companies are still not in the cloud, and they manage their own datacenters, and they have 2-3 of them. There are cost, support, availability and regulatory reasons not to be in the cloud for many parties.
Or it is a matter of efficiency. If 1 million companies design and maintain their servers, there would be 1 million (or more) incidents like these. Same issues. Same fixes. Not so efficient.
It might be worse in terms of total downtime but it likely would be much less noticable as it woould be scattered individual outages not everyone at the same time.
It also doesn't help that most companies using AWS aren't remotely close to multi-region support, and that us-east-1 is likely the most populated region.
even if us-east-1 was a normal region there is not enough spare capacity to take up all the workloads from us-east-1 in other regions so t's a moot point
> Thus simply being in another region doesn’t protect you from the consistent us-east-1 shenanigans.
Well it did for me today...Dont use us-east-1 explicitly just other regions and I had no outage today...( I get the point about the skeletons in the closet of us-east-1 ...maybe the power plug goes via Bezos wood desk? )
It sounds like they want to avoid split-brain scenarios as much as possible while sacrificing resilience. For things like DNS, this is probably unavoidable. So, not all the responsibility can be placed on AWS.
If my application relies on receipts (such as an airline ticket), I should make sure I have an offline version stored on my phone so that I can still check in for my flight. But I can accept not to be able to access Reddit or order at McDonalds with my phone. And always having cash at hand is a given, although I almost always pay with my phone nowadays.
I hope they release a good root cause analysis report.
Sure, but you want to make sure that changes propagate as soon as possible from the central authority. And for AWS, the control plane for that authority happens to be placed in US-EAST-1. Maybe Blockchain technology can decentralize the control plane?
aws had an outage. Many companies were impacted. Headlines around the world blame AWS. the real news is how easy it is to identify companies that have put cost management ahead of service resiliency.
Lots of orgs operating wholly in AWS and sometimes only within us-east-1 had no operational problems last night. Some that is design (not using the impacted services). Some of that is good resiliency in design. And some of that was dumb luck (accidentally good design).
Overall, those companies that had operational problems likely wouldn't have invested in resiliancy expenses in any other deployment strategy either. It could have happened to them in Azure, GCP or even a home rolled datacenter.
In general it is not expensive. In most cases you can either load balance across two regions all the time or have a fallback region that you scale out/up and switch to if needed.
Multi tenancy is expensive. You’d need to have every single service you depend on, including 3rd party services, on multi tenancy. In many cases such as the main DB, you need dedicated resources. You’re most likely to also going to need expensive enterprise SLAs.
Servers are easy. I’m sure most companies already have servers that can be spun up. Things related to data are not.
You wrote "You’d need to have every single service you depend on, including 3rd party services, on multi tenancy.". This is highly incorrect. I worked at several companies that have a multi tenancy strategy. It is:
* Automated.
* Scoped to business critical services. Typically not including many of the 3rd party services.
* Uses data replication, which is a feature in any modern cloud.
* Load balancing, by DNS basically for free or a real LB somewhere on the edge.
If you fail at this you probably fail at disaster recovery too or any good practice on how to run things in the cloud. Most likely because of very poor architecture.
>> Redundancy is insanely expensive especially for SaaS companies
That right there means the business model is fucked to begin with. If you can't have a resilient service, then you should not be offering that service. Period. Solution: we were fine before the cloud, just a little slower. No problem going back to that for some things. Not everything has to be just in time at lowest possible cost.
The part that makes no sense is - it's not cost management. AWS costs ten to a hundred times MORE than any other option - they just don't put it in the headline number.
Careful: NPM _says_ they're up (https://status.npmjs.org/) but I am seeing a lot of packages not updating and npm install taking forever or never finishing. So hold off deploying now if you're dependent on that.
They've acknowledged an issue now on the status page. For me at least, it's completely down, package installation straight up doesn't work. Thankfully current work project uses a pull-through mirror that allows us to continue working.
It just goes to show the difference between best practices in cloud computing, and what everyone ends up doing in reality, including well known industry names.
Eh, the "best practices" that would've prevented this aren't trivial to implement and are definitely far beyond what most engineering teams are capable of, in my experience. It depends on your risk profile. When we had cloud outages at the freemium game company I worked at, we just shrugged and waited for the systems to come back online - nobody dying because they couldn't play a word puzzle. But I've also had management come down and ask what it would take to prevent issues like that from happening again, and then pretend they never asked once it was clear how much engineering effort it would take. I've yet to meet a product manager that would shred their entire roadmap for 6-18 months just to get at an extra 9 of reliability, but I also don't work in industries where that's super important.
Like any company over a handful of years old, I'm sure they have super old, super critical systems running they dare not touch for fear of torching the entire business. For all we know they were trying to update one of those systems to be more resilient last night and things went south.
My systems didn't actually seem to be affected until what I think was probably a SECOND spike of outages at about the time you posted.
The retrospective will be very interesting reading!
(Obviously the category of outages caused by many restored systems "thundering" at once to get back up is known, so that'd be my guess, but the details are always good reading either way).
Even though us-east-1 is the region geographically closest to me, I always choose another region as default due to us-east-1 (seemingly) being more prone to these outages.
Obviously, some services are only available in us-east-1, but many applications can gain some resiliency just by making a primary home in any other region.
> There is one IAM control plane for all commercial AWS Regions, which is located in the US East (N. Virginia) Region. The IAM system then propagates configuration changes to the IAM data planes in every enabled AWS Region. The IAM data plane is essentially a read-only replica of the IAM control plane configuration data.
and I believe some global services (like certificate manager, etc.) also depend on the us-east-1 region
This is the right move. 10 years ago, us-east-1 was on the order of 10x bigger than the next largest region. This got a little better now, but any scaling issues still tend to happen in us-east-1.
The AWS has been steering people to us-east-2 for a while. For example, traffic between us-east-1 and us-east-2 has the same cost as inter-AZ traffic within the us-east-1.
Web scale? It is an _web_ app, so it is already web scaled, hehe.
Seriously, this thing runs already on 3 servers.
A primary + backup and a secondary in another datacenter/provider at Netcup. DNS with another AnycastDNS provider called ClouDNS. Everything still way cheaper then AWS.
The database is already replicated for reads. And I could switch to sharding if necessary.
I can easily scale to 5, 7, whatever dedicated servers.
But I do not have to right now. The primary is at 1% (sic!) load.
There really is no magic behind this.
And you have to write your application in a distributable way anyway, you need to understand the concepts of stateless, write-locking, etc. also with AWS.
My Ring doorbell works just fine without an internet connection (or during a cloud outage). The video storage and app notifications are another matter, but the doorbell itself continues to ring when someone pushes the button.
It's actually kinda great. When AWS has issues it makes national news and that's all you need to put on your status page and everyone just nods in understanding. It's a weird kind of holiday in a way.
Amazing, I wonder what their interview process is like, probably whiteboarding a next-gen LLM in WASM, meanwhile, their entire website goes down with us-east-1... I mean.
Every week or so we interview a company and ask them if they have a fall-back plan in case AWS goes down or their cloud account disappears. They always have this deer-in-the-headlights look. 'That can't happen, right?'
Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years. I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.
>Yes it does happen but very rarely to the tune of a few hours every 5-10 years.
It is rare, but it happens at LEAST 2-3x a year. AWS us-east-1 has a major incident affecting multiple services (that affect most downstream aws services) multiple times a year. Usually never the same root cause.
Not very many people realize that there are some services that still run only in us-east-1.
But that is the stance for a lot of electrical utilities. Sometimes weather or a car wreck takes out power and since its too expensive to have spares everywhere, sometimes you have to wait a few hours for a spare to be brought in
No, that's not the stance for electrical utilities (at least in most developed countries, including the US): the vast majority of weather events cause localized outages (the grid as a whole has redundancies built in; distribution to (residential and some industrial) does not. It expects failures of some power plants, transmission lines, etc. and can adapt with reserve power, or, in very rare cases by partial degradation (i.e. rolling blackouts). It doesn't go down fully.
> Sometimes weather or a car wreck takes out power
Not really? Most of the infrastructure is quite resilient and the rare outage is usually limited to a street or two, with restoration time mainly determined by the time it takes the electricians to reach the incident site. For any given address that's maybe a few hours per decade - with the most likely cause being planned maintenance. That's not a "spares are too expensive" issue, that's a "giving every home two fully independent power feeds is silly" issue.
Anything on a metro-sized level is pretty much unheard of, and will be treated as serious as a plane crash. They can essentially only be caused by systemic failure on multiple levels, as the grid is configured to survive multiple independent failures at the same time.
Comparing that to the AWS world: individual servers going down is inevitable and shouldn't come as a surprise. Everyone has redundancies, and an engineer accidentally yanking the power cables of an entire rack shouldn't even be noticeable to any customers. But an entire service going down across an entire availability zone? That should be virtually impossible, and having it happen regularly is a bit of a red flag.
I think this is right, but depending on where you live, local weather-related outages can still not-infrequently look like entire towns going dark for a couple days, not streets for hours.
(Of course that's still not the same as a big boy grid failure (Texas ice storm-sized) which are the things that utilities are meant to actively prevent ever happening.)
The grid actually already has a fair number of (non-software) circular dependencies. This is why they have black start [1] procedures and run drills of those procedures. Or should, at least; there have been high profile outages recently that have exposed holes in these plans [2].
You'd be surprised. See. GP asks a very interesting question. And some grid infra indeed relies on AWS, definitely not all of it but there are some aspects of it that are hosted by AWS.
This is already happening. I have looked at quite a few companies in the energy space this year, two of them had AWS as a critical dependency in their primary business processes and that could definitely have an impact on the grid. To their defense: AWS presumably tests their fall-back options (generators) with some regularity. But this isn't a farfetched idea at all.
Also, every time your cloud went down, the parent company begged you to reconsider, explaining that all they need you to do is remove the disturbingly large cobwebs so they can migrate it. You tell them that to do so would violate your strongly-held beliefs, and when they stare at you in bewilderment, you yell “FREEDOM!” while rolling armadillos at them like they’re bowling balls.
That's the wrong analogy though. We're not talking about the supplier - I'm sure Amazon is doing its damnedest to make sure that AWS isn't going down.
The right analogy is to imagine if businesses that used electricity took that stance, and they basically all do. If you're a hospital or some other business where a power outage is life or death, you plan by having backup generators. But if you're the overwhelming majority of businesses, you do absolutely nothing to ensure that you have power during a power outage, and it's fine.
Utility companies do not have redundancy for every part of their infrastructure either. Hence why severe weather or other unexpected failures can cause loss of power, internet or even running water.
Texas has had statewide power outages. Spain and Portugal suffered near-nationwide power outages last year. Many US states are heavily reliant on the same single source for water. And remember the discussions on here about Europe's reliance on Russian gas?
Then you have the XKCD sketch about how most software products are reliant on at least one piece of open source software that is maintained by a single person as a hobby.
Nobody likes a single point of failure but often the costs associated with mitigating that are much greater than the risks of having that point of failure.
> Hence why severe weather or other unexpected failures can cause loss of power, internet or even running water.
Not all utility companies have the same policies, but all have a resiliency plan to avoid blackout that is a bit more serious than "Just run it on AWS".
> Not all utility companies have the same policies, but all have a resiliency plan to avoid blackout that is a bit more serious than "Just run it on AWS".
You're arguing as if "run it on AWS" was a decision that didn't undergo the same kinds of risk assessment. As someone who's had to complete such processes (and in some companies, even define them), I can assure you that nobody of any competency runs stuff on AWS complacently.
In fact running stuff with resilience in AWS isn't even as simple as "just running it in AWS". There's a whole plethora of things to consider, and each with its own costs attached. As the meme goes "one does not simply just run something on AWS"
> nobody of any competency runs stuff on AWS complacently.
I agree with this. My point is simply that we, as an industry, are not a very competent bunch when it comes to risk management ; and that's especially true when compared to TSOs.
That doesn't mean nobody knows what they do in our industry or that shit never hits the fan elsewhere, but I would argue that it's an outlier behaviour, whereas it's the norm in more secure industries.
> As the meme goes "one does not simply just run something on AWS"
The meme has currency for a reason, unfortunately.
---
That being said, my original point was that utilities losing clients after a storm isn't the consequence of bad (or no) risk assessment ; it's the consequence of them setting up acceptable loss thresholds depending on the likelihood of an event happening, and making sure that the network as a whole can respect these SLOs while strictly respecting safety criteria.
Nobody was suggesting that loss of utilities is a result of bad risk management. We are saying that all competent businesses run risk management and for most businesses, the cost of AWS being down is less than the cost of going multi cloud.
This is particularly true when Amazon hand out credits like candy. So you just need to moan to your AWS account manager about the service interruption and you’ll be covered.
Peacetime = When not actively under a sustained attack by a nation-state actor. The implication being, if you expect there to be a “wartime”, you should also expect AWS cloud outages to be more frequent during a wartime.
I think people generally mean "state", but in the US-centric HN community that word is ambiguous and will generally be interpreted the wrong way. Maybe "sovereign state" would work?
As someone with a political science degree whose secondary focus was international relations, "Nation-state" has a number of different, definitions, an (despite the fact that dictionaries often don't include it), one of the most commonly encountered for a very long time has been "one of the principle subjects of international law, held to possess what is popularly, but somewhat inaccuratedly, referred to as Westphalian sovereignty" (there is a historical connection between this use and the "state roughly correlating with single nation" sense that relates to the evolution of “Westphalian sovvereignty” as a norm, but that’s really neither here nor there, because the meaning would be the meaning regardless of its connection to the other meaning.)
You almost never see the definition you are referring used except in the context of explicit comparison of different bases and compositions of states, and in practice there is very close to zero ambiguity which sense is meant, and complaining about it is the same kind of misguided prescriptivism as (also popular on HN) complaining about the transitive use of "begs the question" because it has a different sense than the intransitive use.
Not really: nations state level actor: a hacker group funded by a country, not necessarily directly part of that country's government but at the same time kept at arms length for deniability purposes. For instance, hacking groups operating from China, North Korea, Iran and Russia are often doing this with the tacit approval and often funding from the countries they operate in, but are not part of the 'official' government. Obviously the various secret services in so far as they have personnel engaged in targeted hacks are also nation state level actors.
It could be a multinational state actor, but the term nation-state is the most commonly used, regardless of accuracy. You can argue over whether of not the term itself is accurate, but you still understood the meaning.
> Not very many people realize that there are some services that still run only in us-east-1.
The only ones that you're likely to encounter are IAM, Route53, and the billing console. The billing console outage for a few hours is hardly a problem. IAM and Route53 are statically stable and designed to be mostly stand-alone. They are working fine right now, btw.
During this outage, my infrastructure on AWS is working just fine, simply because it's outside of us-east-1.
I would take the opposite view, the little AWS outages are an opportunity to test your disaster recovery plan, which is worth doing even if it takes a little time.
It’s not hard to imagine events that would keep AWS dark for a long period of time, especially if you’re just in one region. The outage today was in us-east-1. Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.
> Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.
Sure, it's worth considering, but for most companies it's not going to be worth the engineering effort to architect cross-cloud services. The complexity is NOT linear.
IMO most shops should focus on testing backups (which should be at least cross-cloud, potentially on-prem of some sort) to make sure their data integrity is solid. Your data can't be recreated, everything else can be rebuilt even if it takes a long time.
> I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.
Absurd claim.
Multi-region and multi-cloud are strategies that have existed almost as long as AWS. If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.
It’s not absurd, I’ve seen it happen. Company executes on their DR plan due to AWS outage, AWS is back before DR is complete, DR has to be aborted, service is down longer than if they’d just waited.
Of course there are cases when multi cloud makes sense, but it is in the minority. The absurd claim is that most companies should worry cloud outages and plan for AWS to go offline for ever.
> If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing
That still fits in with "almost guarantee". It's not as though it's true for everyone, e.g. people who might trigger DR after 10 minutes of downtime, and have it up and running within 30 more minutes.
But it is true for almost everyone, as most people will trigger it after 30 minutes or more, which, plus the time to execute DR, is often going to be far less than the AWS resolution time.
Best of all would be just multi-everything services from the start, and us-east-1 is just another node, but that's expensive and tricky with state.
I thought we were talking about an AWS outage, not just the outage of a single region? A single region can go out for many reasons, including but not limited to war.
I worked for a fortune 500, twice a year we practiced our "catastrophe outage" plan. The target SLA for recovering from a major cloud provider outage was 48 hours.
Without having a well-defined risk profile that they’re designing to satisfy, everyone’s just kind of shooting from the hip with their opinions on what’s too much or too little.
One of my projects is entirely hosted on S3. I don't care enough if it becomes unavailable for a few hours to justify paying to distribute it to GCP et al.
And actually for most companies, the cost of multi-cloud is greater than the benefits. Particularly when those larger entities can just bitch to their AWS account manager to get a few grand refunded as credits.
It is like discussing zombie apocalypse. People who are invested in bunkers will hardly understand those who are just choosing death over living in those bunkers for a month longer.
This. I wouldn't try to instantly failover to another service if AWS had a short outage, but I would plan to be able to recover from a permanent AWS outage by ensuring all your important data and knowledge is backed up off-AWS, preferably to your own physical hardware and having a vague plan of how to restore and bring things up again if you need to.
"Permanent AWS outage" includes someone pressing the wrong button in the AWS console and deleting something important or things like a hack or ransomware attack corrupting your data, as well as your account being banned or whatever. While it does include AWS itself going down in a big way, it's extremely unlikely that it won't come back, but if you cover other possibilities, that will probably be covered too.
This is planning future based on best things in the past. Not completely irrational and if you can't afford plan B, okayish.
But thinking AWS SLA is granted forever, and everyone should put all its eggs in it because "everyone do it" is neither wise not safe. Those who can afford it, and there are many businesses like that out there, should have a plan B. And actually AWS should not necessarily be plan A.
Nothing is forever. Not the Roman empire, not Inca empire, not china dynasties, not USA geological supremacy. That's not a question of if but when. It doesn't need to be through a lot of suffering, but if we don't systematically organise for a humanity which spread well being for everyone in a systematically resilient way, we will make it through more lot of tragic consequences when this or that single point of failure finally falls.
Completely agree, but I think companies need to be aware of the AWS risks with third parties as well. Many services were unable to communicate with customers.
Hosting your services on AWS while having a status page on AWS during an AWS outage is an easily avoidable problem.
This depends on the scale of company. A fully functional DR plan probably costs 10% of the infra spend + people time for operationalization. For most small/medium businesses its a waste to plan for a once per 3-10 year event. If you’re a large or legacy firm the above costs are trivial and in some cases it may become a fiduciary risk not to take it seriously.
We started that planning process at my previous company after one such outage but it became clear very quickly that the costs of such resilience would be 2-3x hosting costs in perpetuity and who knows how many manhours. Being down for an hour was a lot more palatable to everyone
What if AWS dumps you because your country/company didn't please the commander in chief enough?
If your resilience plan is to trust a third party, that means you don't really care about going down, does it?
Besides that, as the above poster said, the issue with top tier cloud providers (or cloudflare, or google, etc) is not just that you rely on them, it is that enough people rely on them that you may suffer even if you don't.
I worked at an adtech company where we invested a bit in HA across AZ + regions. Lo and behold there was an AWS outage and we stayed up. Too bad our customers didn't and we still took the revenue hit.
Lesson here is that your approach will depend on your industry and peers. Every market will have their won philosophy and requirements here.
I have this theory of something I call “importance radiation.”
An old mini PC running a home media server and a Minecraft server will run flawlessly forever. Put something of any business importance on it and it will fail tomorrow.
Related, I’m sure, is the fact that things like furnaces and water heaters will die on holidays.
> Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years.
Not only that, but as you're seeing with this and the last few dozen outages... when us-east-1 goes down, a solid chunk of what many consumers consider the "internet" goes down. It's perceived less as "app C is down" and more is "the internet is broken today".
To be clear, I'm not advocating for this or trying to suggest it's a good thing. That's just reality as I see it.
If my site's offline at the same time as the BBC has front page articles about how AWS is down and it's broken half the internet... it makes it _really_ easy for me to avoid blame without actually addressing the problem.
I don't need to deflect blame from my customers. Chances are they've already run into several other broken services today, they've seen news articles about it, and all from third parties. By the time they notice my service is down, they probably won't even bother asking me about it.
I can definitely see this encouraging more centralization, yes.
Oh god, this. At my company, we found a bug recently with rds.describe_events, which we needed to read binlog information after a B/G cutover. The bug, which AWS support “could not see the details of,” was that events would non-deterministically not show up if you were filtering by instance name. Their recommended fix was to pull in all events for the past N minutes, and do client-side filtering.
This was on top of the other bug I had found earlier, which was that despite the docs stating that you can use a B/G as a filter - a logical choice when querying for information directly related to the B/G you just cut over - doing so returns an empty set. Also, you can’t use a cluster (again, despite docs stating otherwise), you have to use the new cluster’s writer instance.
While I don't know your specific case, I have seen it happen often enough that there are only two possibilities left:
1. they are idiots
2. they do it on purpose and they think you are an idiot
For me, it just means that the moment you integrate with any API, you are basically their bitch (unless you implement one from every competitor in the market, at which point you can just as well do it yourself).
It’s even worse than that - us-east-1 is so overloaded, and they have roughly 5+ outages per year on different services. They don’t publish outage numbers so it’s hard to tell.
At this point, being in any other region cuts your disaster exposure dramatically
We don’t deploy to us-east but still so many of our API partners and 3rd party services were down a large chunk of the service was effectively down. Including stuff like many dev tools
> Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
my business contingency plan for "AWS shuts down and never comes back up" is to go bankrupt
If your mitigation for that risk is to have an elaborate plan to move to a different cloud provider, where the same problem can just happen again, then you’re doing an awful job of risk management.
> If your mitigation for that risk is to have an elaborate plan to move to a different cloud provider, where the same problem can just happen again, then you’re doing an awful job of risk management.
Where did I say that? If I didn't say it: could you please argue in good faith. Thank you.
"Is that also your contingency plan if unrelated X happens", and "make sure your investors know" are also not exactly good faith or without snark, mind you.
I get your point, but most companies don't need Y nines of uptime, heck, many should probably not even use AWS, k8s, serverless or whatever complicated tech gives them all these problems at all, and could do with something far simpler.
The point is, many companies do need those nines and they count on AWS to deliver and there is no backup plan if they don't. And that's the thing I take issue with, AWS is not so reliable that you no longer need backups.
My experience is that very few companies actually need those 9s. A company might say they need them, but if you dig in it turns out the impact on the business of dropping a 9 (or two) is far less than the cost of developing and maintaining an elaborate multi-cloud backup plan that will both actually work when needed and be fast enough to maintain the desired availability.
Again, of course there are exceptions, but advising people in general that they should think about what happens if AWS goes offline for good seems like poor engineering to me. It’s like designing every bridge in your country to handle a tomahawk missile strike.
HN denizens are more often than not founders of exactly those companies that do need those 9's. As I wrote in my original comment: the founders are usually shocked at the thought that such a thing could happen and it definitely isn't a conscious decision that they do not have a fall-back plan. And if it was a conscious decision I'd be fine with that, but it rarely is. About as rare as companies that have in fact thought about this and whose incident recovery plans go further than 'call George to reboot the server'. You'd be surprised how many companies have not even performed the most basic risk assessment.
We all read it.. AWS not coming back up is your point on nat having a backup plan?
You might as well say the entire NY + DC metro losses power and "never comes back up" What is the plan around that? The person replying is correct, most companies do not have a actionable plan for AWS never coming back up.
I worked at a medium-large company and was responsible for reviewing the infrastructure BCP plan. It stated that AWS going down was a risk, and if it happens we wait for it to come back up. (In a lot more words than that).
I get you. I am with you. But isn't money/resources always a constraint to have a solid backup solution?
I guess the reason why people are not doing it is because it hasn't been demonstrated it's worth it, yet!
I've got to admit though, whenever I hear about having a backup plan I think having apples to apples copy elsewhere which is probably not wise/viable anyway. Perhaps having just enough to reach out to the service users/customers suffice.
Also I must add I am heavily influenced by a comment by Adrian Cockroft on why going multi cloud isn't worth it. He worked for AWS (at the time at least) so I should have probably reached to the salt dispenser.
The internet is a weak infrastructure, relying on a few big cables and data centers. And through AWS and Cloudflare it has become worse? Was it ever true, that the internet is resilient? I doubt.
Resilient systems work autonomously and can synchronize - but don't need to synchronize.
* Git is resilient.
* Native E-Mail clients - with local storage enabled - are somewhat resilient.
* A local package repository is - somewhat resilient.
* A local file-sharing app (not Warp/ Magic-Wormhole -> needs relay) is resilient if it uses only local WiFi or Bluetooth.
We're building weak infrastructure. A lot of stuff shall work locally and only optionally use the internet.
The internet seems resilient enough for all intents and purposes, we haven't had a global internet-wide catastrophe impacting the entire internet as far as I know, but we have gotten close to it sometimes (thanks BGP).
But the web, that's the fragile, centralized and weak point currently, and seems to be what you're referring to rather.
Maybe nitpicky, but I feel like it's important to distinguish between "the web" and "the internet".
I don't wanna jinx anything, but yeah, seems. I can't remember a single global internet outage for the 30+ years I've been alive. But again, large services gone down, but the internet infrastructure seems to keep on going regardless.
That because people trust and hope blindly. They believe IT is for saving money? It isn’t. They coupled their cash registers onto an American cloud service. The couldn’t even pay in cash.
It usually gets worse, when not outages happens for some time. Because that increases blind trust.
You are absolutely correct but this distinction is getting less and less important, everything is using APIs nowadays, including lots of stuff that is utterly invisible until it goes down.
Most companies just aren't important enough to worry about "AWS never come back up." Planning for this case is just like planning for a terrorist blowing up your entire office. If you're the Pentagon sure you'd better have a plan for that. But most companies are not the Pentagon.
> Most companies just aren't important enough to worry about "AWS never come back up."
But a large enough of "not too big to fail" companies become a too big to fail event. Too many medium sized companies have a total dependency on AWS or Google or Microsoft, if not several at the same time.
We live in a increasingly fragile society, one step closer to critical failure because big tech is not regulated in the same way than other infrastructure.
Well I agree. I kinda think the AI apocalypse would not be like a skynet killing us, but a malware be patched onto all the Tesla and causes one million crashes tomorrow morning.
In the case of a costumer of mine the AWS outage manifested itself as Twilio failing to deliver SMSes. The fallback plan has been disabling the rotation of our two SMS providers and sending all messages with the remaing one. But what if the other one had something on AWS too? Or maybe both of them have something else vital on Azure, or Google Cloud, which will fail next week and stop our service. Who knows?
For small and medium sized companies it's not easy to perform an accurate due diligency.
It would behoove a lot of devs to learn the basics of Linux sysadmin and how to setup a basic deployment with a VPS. Once you understand that, you'll realize how much of "modern infra" is really just a mix of over-reliance on AWS and throwing compute at underperforming code. Our addiction to complexity (and burning money on the illusion of infinite stability) is already and will continue to strangle us.
If AWS goes down unexpectedly and never comes back up it's much more likely that we're in the middle of some enormous global conflict where day to day survival takes priority over making your app work than AWS just deciding to abandon their cloud business on a whim.
Can also be much easier than that. Say you live in Mexico, hosting servers with AWS in the US because you have US customers. But suddenly the government decides to place sanctions on Mexico, and US entities are no longer allowed to do business with Mexicans, so all Mexican AWS accounts get shut down.
For you as a Mexican the end results is the same, AWS went away, and considering there already is a list of countries that cannot use AWS, GitHub and a bunch of other "essential" services, it's not hard to imagine that that list might grow in the future.
what's most realistic is something like a major scandal at AWS. The FBI seizes control and no bytes come in our out until the investigation is complete. A multi-year total outage effectively.
Or Bezos selling his soul to the Orange Devil and kicking you off when the Conman-in-chief puts the squeeze on some other aspect of Bezos' business empire
> The internet got its main strengths from the fact that it was completely decentralized.
Decentralized in terms of many companies making up the internet. Yes we've seen heavy consolidation in now having less than 10 companies make up the bulk of the internet.
The problem here isn't caused by companies chosing one cloud provider over the other. It's the economies of scale leading us to few large companies in any sector.
> Decentralized in terms of many companies making up the internet
Not companies, the protocols are decentralized and at some point it was mostly non-companies. Anyone can hook up a computer and start serving requests which was/is a radical concept, we've lost a lot, unfortunately
I think one reason is that people are just bad at statistics. Chance of materialization * impact = small. Sure. Over a short enough time that's true for any kind of risk. But companies tend to live for years, decades even and sometimes longer than that. If we're going to put all of those precious eggs in one basket, as long as the basket is substantially stronger than the eggs we're fine, right? Until the day someone drops the basket. And over a long enough time span all risks eventually materialize. So we're playing this game, and usually we come out ahead.
But trust me, 10 seconds after this outage is solved everybody will have forgotten about the possibility.
Sure, but on a typical outage how likely is it that you'll have that all up and running before the outage is resolved?
And secondly, how often do you create that backup and are you willing to lose the writes since the last backup?
That backup is absolutely something people should have, but I doubt those are ever used to bring a service back up. That would be a monumental failure of your hosting provider (colo/cloud/whatever)
Decentralized with respect to connectivity. If a construction crew cuts a fiber bundle routing protocols will route around the damage and packets keep showing up at the destination. Or, only a localized group of users will be affected. That level of decentralization is not what we have at higher levels in the stack with AWS being a good example.
Even connectivity has it's points of failure. I've touched with my own hands fiber runs that, with a few quick snips from a wire cutter, could bring sizable portions of the Internet offline. Granted that was a long time ago so those points of failure may no longer exist.
Well, that is exactly what resilient distributed network are about. Not that much the technical details we implement them through, but the social relationship and balanced in political decision power.
Be it a company or a state, concentration of power that exeeds the needs for their purpose to function by a large margin is always a sure way to spread corruption, create feedback loop in single point of failure, and is buying everyone a ticket to some dystopian reality with a level of certainty that beats anything that a SLA will ever give us.
I don't think its worth it, but lets say I did it, what if others that I depend on dont do it? I still won't be fully functional and only one of us have spent a bunch of money.
Sure, but at that point you go from bog standard to "enterprise grade redundancy for every single point of failure" which I can assure you is more heavily engineered than many enterprises (source: see current outage). Its just not worth the manpower and dollars for a vast majority of businesses.
OK, you pull it to your own repo. Now where do you store it? Do you also have fallback stores for that? What about the things which arent vendorable, ie external services?
Additionally I find that most Hyperscalers are trying to lock you in, by tailoring services which are industry standard with custom features which end up building roots and making a multi-vendor or lift and shift problematic.
Need to keep eyes peeled at all levels or the organization as many of these enter through day-to-day…
I find this hard to judge in the abstract, but I'm not quite convinced the situation for the modal company today is worse than their answer to "what if your colo rack catches fire" would have been twenty years ago.
I used to work at an SME that ran ~everything on its own colo'd hardware, and while it never got this bad, there were a couple instances of the CTO driving over to the dc because the oob access to some hung up server wasn't working anymore. Fun times...
Reminiscing: this was a rite of passage for pre-cloud remote systems administrators.
Proper hardware (Sun, Cisco) had a serial management interface (ideally "lights-out") which could be used to remedy many kinds of failures. Plus a terminal server with a dial-in modem on a POTS line (or adequate fakery), in case the drama took out IP routing.
Then came Linux on x86, and it took waytoomanyyears for the hardware vendors to outgrow the platform's microsoft local-only deployment model. Aside from Dell and maybe Supermicro, I'm not sure if they ever worked it out.
Then came the cloud. Ironically, all of our systems are up and happy today, but services that rely on partner integrations are down. The only good thing about this is that it's not me running around trying to get it fixed. :)
First, planning for AWS outage is pointless. Unless you provide service of national security or something, your customers are going to understand that when there's global internet outage your service doesn't work either. The cost of maintaining a working failover across multiple cloud providers is just too high compared to potential benefits. It's astonishing that so few eningeers understand the fact that maintaining a technically beautiful solution costs time and money, which might not make a justified business case.
Second, preparing for the disappearance of AWS is even more silly. The chance that it will happen are orders of magnitude smaller than the chance that the cost of preparing for such an event will kill your business.
Let me ask you: how do you prepare your website for the complete collapse of western society? Will you be able to adapt your business model to post-apocalyptic world where it's only cockroaches?
>Let me ask you: how do you prepare your website for the complete collapse of western society?
That's the main topic that going through my mind lately, if you replace "my website" with "Wikimedia movement".
We need a far better social, juridical and technical architecture regarding resilience as hostil agendas are in the rise at all level agaisnt sourced trackable global volunteer community knowledge bases.
We got off pretty easy (so far). Had some networking issues at 3am-ish EDT, but nothing that we couldn't retry. Having a pretty heavily asynchronous workflow really benefits here.
One strange one was metrics capturing for Elasticache was dead for us (I assume Cloudwatch is the actual service responsible for this), so we were getting no data alerts in Datadog. Took a sec to hunt that down and realize everything was fine, we just don't have the metrics there.
I had minor protests against us-east-1 about 2.5 years ago, but it's a bit much to deal with now... Guess I should protest a bit louder next time.
> “The Machine,” they exclaimed, “feeds us and clothes us and houses us; through it we speak to one another, through it we see one another, in it we have our being. The Machine is the friend of ideas and the enemy of superstition: the Machine is omnipotent, eternal; blessed is the Machine.”
..
> "she spoke with some petulance to the Committee of the Mending Apparatus. They replied, as before, that the defect would be set right shortly. “Shortly! At once!” she retorted"
..
> "there came a day when, without the slightest warning, without any previous hint of feebleness, the entire communication-system broke down, all over the world, and the world, as they understood it, ended."
All the big leagues take "piracy" very seriously and constantly try to clamp down on it.
TV rights is one of their main revenue sources, and it's expected to always go up, so they see "piracy" as a fundamental threat. IMO, it's a fundamental misunderstanding on their side, because people "pirating" usually don't have a choice - either there is no option for them to pay for the content (e.g. UK's 3pm blackout), or it's too expensive and/or spread out. People in the UK have to pay 3-4 different subscriptions to access all local games.
The best solution, by far, is what France's Ligue 1 just did (out of necessity though, nobody was paying them what they wanted for the rights after the previous debacles). Ligue 1+ streaming service, owned and operated by them which you can get access through a variety of different ways (regular old TV paid channel, on Amazon Prime, on DAZN, via Bein Sport), whichever suits you the best. Same acceptable price for all games.
MLB in the US does the same thing for the regular season, it's awesome despite the blackouts which prevent you from watching your local team but you can get around that with a simple VPN. But alas I believe that they will be making the service part of ESPN which will undoubtedly make the product worse just like they will do with NFL Red Zone.
The problem is that leagues miss out on billions of dollars of revenue when they do this AND they also have to maintain the streaming service which is way outside their technical wheelhouse.
MLS also has a pretty straightforward streaming service through AppleTV which I also enjoy.
What i find weird is that people complain (at least in the case of the MLS deal) that it's a BAD thing, that somehow having an easily accessible service that you just pay for and get access to without a contract or cable is diminishing popularity / discoverability of the product?
After rereading my comment I think I was a bit vague, but i'll try to clarify.
Most leagues DO sell their rights to other big companies to have them handle it however they see fit for a large annual fee.
MLB does it partially, some games are shown through cable tv (There are so many games a year that only a small portion is actually aired nationally) the rest are done via regional sports networks (RSNs) that aren't shown nationally. In order to make some money out of this situation MLB created MLBtv that lets you watch all games as long as there are not nationally aired or a local team that is serviced by a RSN. Recently there have been changes because one of the biggest conglomerate of RSNs has gone bankrupt forcing MLB to buy them out and MLB is trying to negotiate a new national cable package with the big telecoms. I believe ESPN has negotiated with MLB to buy out MLBtv but details are scarce.
MLS is a smaller league and Apple bought out exclusive streaming rights for 10 years for some ungodly amount of money. NFL and NBA also have some streaming options but I am less knowledgeable about them but I assume it's similar to MLBtv where there are too many games to broadcast so you can just watch them with a subscription to their service.
In the end of the day these massive deals are the biggest source of revenue for the leagues and the more ways they can divide up the pie among different companies they can extract more money in total. Just looking that the amount of contracts for the US alone is overwhelming.
Shameless from them to make it look like it's a user problem.
It was loading fine for me one hour ago, now I refresh the page and their message states I'm doing too many requests and should chill out (1 request per hour is too many for you?)
I remember that I made a website and then I got a report that it doesn't work on newest Safari. Obviously, Safari would crash with a message blaming the website. Bro, no website should ever make your shitty browser outright crash.
Well, but tomorrow there will be CTOs asking for a contingency plan if AWS goes down, even if planning, preparing, executing and keeping it up to date as the infra evolves will cost more than the X hours of AWS outage.
There are certainly organizations for which that cost is lower than the overall damage of services being down due to AWS fault, but tomorrow we will hear CTOs from smaller orgs as well.
It's so true it hurts.
If you are new in any infra/platform management position you will be scared as hell this week. Then you will just learn that feeling will just disappear by itself in a few days.
Yep, when I was a young programmer I lived in dread of an outage or worse been responsible for a serious bug in production, then I got to watch what happened when it happened to others (and that time I dropped the prod database at half past four on a Friday).
When everything is some varying degree of broken at all times been responsible for a brief uptick in the background brokenness isn't the drama you think it is.
It would be different if the systems I worked on where true life and death (ATC/Emergency Services etc) but in reality the blast radius from my fucking up somewhere is monetary and even at the biggest company I worked for constrained (while 100+K per hour from an outage sounds horrific - in reality the vast majority of that was made up when the service was back online, people still needed to order the thing in the end).
This applies to literally half of random "feature requests" and "tasks" that are urgent and needed to get done yesterday incoming from the business team..
No really true for large systems. We are doing things like deploying mitigations to avoid scale-in (eg services not receiving traffic incorrectly autoscaling down), preparing services for the inevitable storm, managing various circuit breakers, changing service configurations to ease the flow of traffic through the system, etc. We currently have 64 engineers in our on-call room managing this. There's plenty of work to do.
Well, some engineer somewhere made the recommendation to go with AWS, even tho it is more expensive than alternatives. That should raise some questions.
I feel bad for the people impacted by the outage. But at the same time there's a part of me that says we need a cataclysmic event to shake the C-Suite out of their current mindset of laying off all of their workers to replace them with AI, the cheapest people they can find in India, or in some cases with nothing at all, in order to maximize current quarter EPS.
After some thankless years preventing outages for a big tech company, I will never take an oncall position again in my life.
Most miserable working years I have had. It's wild how normalized working on weekends and evenings becomes in teams with oncall.
But it's not normal. Our users not being able to shitpost is simply not worth my weekend or evening.
And outside of Google you don't even get paid for oncall at most big tech companies! Company losing millions of dollars an hour, but somehow not willing to pay me a dime to jump in at 3AM? Looks like it's not my problem!
When I used to be on call for Cisco WebEx services. I got paid extra, and got extra time of. Even if nothing happened. In addition we where enough people on the rotation, so I didn't have to do it that often.
I believe the rules varied based on jurisdiction, and I think some had worse deals, and some even better. But I was happy with our setup in Norway.
Tbh I do not think we would have had, what we had if it wasn't for the local laws and regulations. Sometimes worker friendly laws can be nice.
Welcome to the typical American salary abuse. There's even a specific legal cutout exempting information technology, scientific and artistic fields from the overtime pay requirements of the Fair Labor Standards Act.
There's a similar cutout for management, which is how companies like GameStop squeeze their retail managers. They just don't give enough payroll hours for regular employees, so the salaried (but poorly paid) manager has to cover all of the gaps.
Follow the sun does not happen by itself. Very few if any engineering teams are equally split across thirds of the globe in such a way that (say) Asia can cover if both EMEA and the Americas are offline.
Having two sites cover the pager is common, but even then you only have 16 working hours at best and somebody has to take the pager early/late.
> But this is not normal. Our users not being able to shitpost is simply not worth my weekend or evening.
It is completely normal for staff to have to work 24/7 for critical services.
Plumbing, HVAC, power plant engineers, doctors, nurses, hospital support staff, taxi drivers, system and network engineers - these people keep our modern world alive, all day, every day. Weekends, midnights, holidays, every hour of every day someone is AT WORK to make sure our society functions.
Not only is it normal, it is essential and required.
It’s ok that you don’t like having to work nights or weekends or holidays. But some people absolutely have to. Be thankful there are EMTs and surgeons and power and network engineers working instead of being with their families on holidays or in the wee hours of the night.
Nice try at guilt-tripping people doing on-call, and doing it for free.
But to parent's points: if you call a plumber or HVAC tech at 3am, you'll pay for the privilege.
And doctors and nurses have shifts/rotas. At some tech places, you are expected to do your day job plus on-call. For no overtime pay. "Salaried" in the US or something like that.
You’re looking for a job in this economy with a ‘he said no to being on call’ in your job history.
This is plainly bad regulation, the market at large discovered the marginal price of oncall is zero, but it’s rather obviously skewed in employer’s favor.
Yup, that is precisely what I did and what I'm encouraging others to do as well.
Edit: On-call is not always disclosed. When it is, it's often understated. And finally, you can never predict being re-orged into a team with oncall.
I agree employees should still have the balls to say "no" but to imply there's no wrongdoing here on companies' parts and that it's totally okay for them to take advantage of employees like this is a bit strange.
Especially for employees that don't know to ask this question (new grads) or can't say "no" as easily (new grads or H1Bs.)
If you or anyone else are doing on-call for no additional pay, precisely nobody is forcing you to do that. Renegotiate, or switch jobs. It was either disclosed up front or you missed your chance to say “sorry, no” when asked to do additional work without additional pay. This is not a problem with on call but a problem with spineless people-pleasers.
Every business will ask you for a better deal for them. If you say “sure” to everything you’re naturally going to lose out. It’s a mistake to do so, obviously.
An employee’s lack of boundaries is not an employer’s fault.
> It is completely normal for staff to have to work 24/7 for critical services.
> Not only is it normal, it is essential and required.
Now you come with the weak "you don't have to take the job" and this gem:
> An employee’s lack of boundaries is not an employer’s fault.
As if there isn't a power imbalance, or employers always disclose everything or chance their mind. But of course, let's blame those entitled employees!
No one dies if our users can't shitpost until tomorrow morning.
I'm glad there are people willing to do oncall. Especially for critical services.
But the software engineering profession as a whole would benefit from negotiating concessions for oncall. We have normalized work interfering with life so the company can squeeze a couple extra millions from ads. And for what?
Nontrivial amount of ad revenue lost? Not my problem if the company can't pay me to mitigate.
> Nontrivial amount of ad revenue lost? Not my problem if the company can't pay me to mitigate.
Interestingly, when I worked on analytics around bugs we found that often (in the ads space), there actually wasn't an impact when advertisers were unable to create ads, as they just created all of them when the interface started working again.
Now, if it had been the ad serving or pacing mechanisms then it would've been a lot of money, but not all outages are created equal.
Not all websites are for shitposting. I can’t talk to my clients for whom I am on call because Signal is down. I also can’t communicate with my immediate family. There are tons of systems positively critical to society downstream from these services.
Anthem Health call center disconnected my wife numerous times yesterday with an ominous robo-message of "Emergency in our call center"; curious if that was this. Seems likely, but what a weird message.
AWS has made the internet into a single-point-of failure.
What's the point of all the auto-healing node-graph systems that were designed in the 70s and refined over decades: if we're just going to do mainframe development anyway?
Clourflare is not merely a single point of failure. They are the official MITM of the internet. They control the flow of information. They probably know more about your surfing habits than Google at this point. There are some sites I can not even connect to using IP addresses anymore.
That company is very concerning and not because of an outage. In fact, I wish one day we have a full cloudflare outage and the entire net goes dark and it finally sink in how much control this one f'ing company has over information in our so called free society.
Not every site is controlled by Cloudflare. My site doesn't use it at all (the fact that it is currently not working properly is entirely coincidental), because I don't really see a reason to use it. Whenever they go down, I'm unaffected
I can't believe this. When status page first created their product, they used to market how they were in multiple providers so that they'd never be affected by downtime.
One thing has become quite clear to me over the years. Much of the thinking around uptime of information systems has become hyperbolic and self-serving.
There are very few businesses that genuinely cannot handle an outage like this. The only examples I've personally experienced are payment processing and semiconductor manufacturing. A severe IT outage in either of these businesses is an actual crisis. Contrast with the South Korean government who seems largely unaffected by the recent loss of an entire building full of machines with no backups.
I've worked in a retail store that had a total electricity outage and saw virtually no reduction in sales numbers for the day. I have seen a bank operate with a broken core system for weeks. I have never heard of someone actually cancelling a subscription over a transient outage in YouTube, Spotify, Netflix, Steam, etc.
The takeaway I always have from these events is that you should engineer your business to be resilient to the real tradeoff that AWS offers. If you don't overreact to the occasional outage and have reasonable measures to work around for a day or 2, it's almost certainly easier and cheaper than building a multi cloud complexity hellscape or dragging it all back on prem.
Thinking in terms of competition and game theory, you'll probably win even if your competitor has a perfect failover strategy. The cost of maintaining a flawless eject button for an entire cloud is like an anvil around your neck. Every IT decision has to be filtered through this axis. When you can just slap another EC2 on the pile, you can run laps around your peers.
> The takeaway I always have from these events is that you should engineer your business to be resilient
An enduring image that stays with me was when I was a child and the local supermarket lost electricity. Within seconds the people working the tills had pulled out hand cranks by which the tills could be operated.
I'm getting old, but this was the 1980's, not the 1800's.
In other words, to agree with your point about resilience:
A lot of the time even some really janky fallbacks will be enough.
But to somewhat disagree with your apparent support for AWS: While it is true this attitude means you can deal with AWS falling over now and again, it also strips away one of the main reasons people tend to give me for why they're in AWS in the first place - namely a belief in buying peace of mind and less devops complexity (a belief I'd argue is pure fiction, but that's a separate issue). If you accept that you in fact can survive just fine without absurd levels of uptime, you also gain a lot more flexibility in which options are viable to you.
The cost of maintaining a flawless eject button is indeed high, but so is the cost of picking a provider based on the notion that you don't need one if you're with them out of a misplaced belief in the availability they can provide, rather than based on how cost effectively they can deliver what you actually need.
I would argue that you are still buying peace of mind by hosting on AWS, even when there are outages. This outage is front page news around the world, so it's not as much of a shock if your company's website goes down at the same time.
Some of the peace of mind comes just from knowing it’s someone else’s (technical) problem if the system goes down. And someone else’s problem to monitor the health of it. (Yes, we still have to monitor and fix all sorts of things related to how we’ve built our products, but there’s a nontrivial amount of stuff that is entirely the responsibility of AWS)
The cranked tills (or registers for the Americans) is an interesting example, because it seems safe to assume they don't have that equipment anymore, and could not so easily do that.
We have become much more reliant on digital tech (those hand cranked tills were prob not digital even when the electricity was on), and much less resilient to outages of such tech I think.
> An enduring image that stays with me was when I was a child and the local supermarket lost electricity. Within seconds the people working the tills had pulled out hand cranks by which the tills could be operated.
what did the do with the frozen food section? Was all that inventory lost?
Tech companies, and in particular ad-driven companies, keep a very close eye on their metrics and can fairly accurately measure the cost of an outage in real dollars
Just yesterday I saw another Hetzner thread where someone claimed AWS beats them in uptime and someone else blasted AWS for huge incidents. I bet his coffee tastes better this morning.
To be fair, my Hetzner server had ten minutes of downtime the other day. I've been a customer for years and this was the second time or so, so I love Hetzner, but everything has downtime.
Their auction systems are interesting to dig through, but to your point, everything fails. Especially these older auction systems. Great price/service, though. Less than an hour for more than one ad-hoc RAID card replacement
Yeah, I really want one of their dedicated servers, but it's a bit too expensive for what I use it for. Plus, my server is too much of a pet, so I'm spoiled on the automatic full-machine backups.
I honestly wonder if there is safety in the herd here. If you have a dedicated server in a rack somewhere that goes down and takes your site with it. Or even the whole data center has connectivity issues. As far as the customer is concerned, you screwed up.
If you are on AWS and AWS goes down, that's covered in the news as a bunch of billion dollar companies were also down. Customer probably gives you a pass.
> If you are on AWS and AWS goes down, that's covered in the news as a bunch of billion dollar companies were also down. Customer probably gives you a pass.
Exactly - I've had clients say, "We'll pay for hot standbys in the same region, but not in another region. If an entire AWS region goes down, it'll be in the news, and our customers will understand, because we won't be their only service provider that goes down, and our clients might even be down themselves."
My guess is their infrastructure is set up through clickops, making it extra painful to redeploy in another region. Even if everything is set up through CloudFormation, there's probably umpteen consumers of APIs that have their region hardwired in. By the time you get that all sorted, the region is likely to be back up.
You can take advantage by having an unplanned service window every time a large cloud provider goes down. Then tell your client that you where the reason why AWS went down.
Feel bad for the Amazon SDR randomly pitching me AWS services today. Although apparently our former head of marketing got that pitch from four different LinkedIn accounts. Maybe there's a cloud service to rein them in that broke ;)
I'd say that this is true for the average admin who considers PaaS, Kubernetes and microservices one giant joke. Vendor-neutral monolithic deployments keep on winning.
I think AWS should use, and provide as an offering to big customers, a Chaos Monkey tool that randomly brings down specific services in specific AZs. Example: DynamoDB is down in us-east-1b. IAM is down in us-west-2a.
Other AWS services should be able to survive this kind of interruption by rerouting requests to other AZs. Big company clients might also want to test against these kinds of scenarios.
I used to tell people there that my favorite development technique was to sit down and think about the system I wanted to build, then wait for it to be announced at that year's re:Invent. I called it "re:Invent and Simplify". "I" built my best stuff that way.
Stupid question, why isn't the stock down? Couldn't this lead to people jumping to other providers and at the very least require some pretty big fees for do dramatically breaking SLA? Is it just not a biggest fraction of revenue to matter?
Non-technical people don't really notice these things. They hear it and shrug, because usually it's fixed within a day.
CNBC is supposed to inform users about this stuff, but they know less than nothing about it. That's why they were the most excited about the "Metaverse" and telling everyone to get on board (with what?) or get left behind.
The market is all about perception of value. That's why Musk can tweet a meme and double a stocks price, it's not based in anything real.
It's plausible that Amazon removes unhealthy servers from all round-robins including DNS. If all servers are unhealthy, no DNS.
Alternatively, perhaps their DNS service stopped responding to queries or even removed itself from BGP. It's possible for us mere mortals to tell which of these is the case.
Chances are there's some cyclical dependencies. These can creep up unnoticed without regular testing, which is not really possible at AWS scale unless they want to have regular planned outages to guard against that.
Trouble is one can't fully escape us-east-1. Many services are centralized there like: S3, Organizations, Route53, Cloudfront, etc. It is THE main region, hence suffering the most outages, and more importantly, the most troubling outages.
We're mostly deployed on eu-west-1 but still seeing weird STS and IAM failures, likely due to internal AWS dependencies.
Also we use Docker Hub, NPM and a bunch of other services that are hosted by their vendors on us-east-1 so even non AWS customers often can't avoid the blast radius of us-east-1 (though the NPM issue mostly affects devs updating/adding dependencies, our CI builds use our internal mirror)
FYI:
1. AWS IAM mutations all go through us-east-1 before being replicated to other public/commercial regions. Read/List operations should use local regional stacks. I expect you'll see a concept of "home region" give you flexibility on the write path in the future.
2. STS has both global and regional endpoints. Make sure you're setup to use regional endpoints in your clients https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credenti...
us-east-1 was, probably still is, AWS' most massive deployment. Huge percentage of traffic goes through that region. Also, lots of services backhaul to that region, especially S3 and CloudFront. So even if your compute is in a different region (at Tower.dev we use eu-central-1 mostly), outages in us-east-1 can have some halo effect.
This outage seems really to be DynamoDB related, so the blast radius in services affected is going to be big. Seems they're still triaging.
Agreed, my company had been entirely on us-east-1 predating my joining ~12 years ago. ~7 years ago, after multiple us-east-1 outages, I moved us to us-west-2 and it has been a lot less bumpy since then.
I don't recommend to my clients they use us-east-1. It's the oldest and most prone to outages. I usually always recommend us-east-2 (Ohio) unless they require West Coast.
and if they need West Coast, it's us-west-2. I consider us-west-1 to be a failed region. They don't get some of the new instance types, you can't get three AZs for your VPCs, and they're more expensive than the other US regions.
Amazon has spent most of its HR post-pandemic efforts in:
• Laying off top US engineering earners.
• Aggressively mandating RTO so the senior technical personnel would be pushed to leave.
• Other political ways ("Focus", "Below Expectations") to push engineering leadership (principal engineers, etc) to leave, without it counting as a layoff of course.
• Migrating serious, complex workloads to entry-level employees in cheap office locations (India, Spain, etc).
This push was slow but mostly completed by Q1 this year. Correlation doesn't imply causation? I find that hard to believe in this case. AWS had outages before, but none like this "apparently nobody knows what to do" one.
Our entire data stack (Databricks and Omni) are all down for us also. The nice thing is that AWS is so big and widespread that our customers are much more understanding about outages, given that its showing up on the news.
This is from Amazon's latest earnings call when Andy Jessy was asked why they aren't growing as much as there competitors
"I think if you look at what matters to customers, what they care they care a lot about what the operational performance is, you know, what the availability is, what the durability is, what the latency and throughput is of of the various services. And I think we have a pretty significant advantage in that area."
also
"And, yeah, you could just you just look at what's happened the last couple months. You can just see kind of adventures at some of these players almost every month. And so very big difference, I think, in security."
> The incident underscores the risks associated with the heavy reliance on a few major cloud service providers.
Perhaps for the internet as a whole, but for each individual service it underscores the risk of not hosting your service in multiple zones or having a backup
https://m.youtube.com/watch?v=KFvhpt8FN18 clear detailed explanation of the AWS outage and how properly designed systems should have shielded the issue with zero client impact
If they don’t obfuscate the downtime (they will, of course), this outage would put them at, what, two nines? Thats very much out of their SLA.
People also keep talking about it as if its one region, but there are reports in this thread of internal dependencies inside AWS which are affecting unrelated regions with various services. (r53 updates for example)
It sounds like you think the SLA is just toilet paper? When in reality it's a contract which defines AWS's obligations. So the lesson here is that they broke their contract big time. So yes. Shaming is the right approach. Also it seems you missed somehow the other 1700+ comments agreeing with shaming
I wouldn't go that far. The SLA is a contract, and they are clear on the remedy (up to 100% refund if they don't hit 95% uptime in a month).
Just like reading medication side effects, they are letting you know that downtime is possible, albeit unlikely.
All of the documentation and training programs explain the consequence of single-region deployments.
The outage was a mistake. Let's hope it doesn't indicate a trend. I'm not defending AWS. I'm trying to help people translate the incident into a real lesson about how to proceed.
You don't have control over the outage, but you do have control over how your app is built to respond to a similar outage in the future.
Since I'm 5+ years out from my NDA around this stuff, I'll give some high level details here.
Snapchat heavily used Google AppEngine to scale. This was basically a magical Java runtime that would 'hot path split' the monolithic service into lambda-like worker pools. Pretty crazy, but it worked well.
Snapchat leaned very heavily on this though and basically let Google build the tech that allowed them to scale up instead of dealing with that problem internally. At one point, Snap was >70% of all GCP usage. And this was almost all concentrated on ONE Java service. Nuts stuff.
Anyway, eventually Google was no longer happy with supporting this and the corporate way of breaking up is "hey we're gonna charge you 10x what did last year for this, kay?" (I don't know if it was actually 10x. It was just a LOT more)
So began the migration towards Kubernetes and AWS EKS. Snap was one of the pilot customers for EKS before it was generally available, iirc. (I helped work on this migration in 2018/2019)
Now, 6+ years later, I don't think Snap heavily uses GCP for traffic unless they migrated back. And this outage basically confirms that :P
Thats so interesting to me, I always assume companies like google who have "unlimited" dollars will always be happy to eat the cost to keep customers, especially given gcp usage outside googles internal services is way smaller compared to azure and aws. Also interesting to see snapchat had a hacky solution with AppEngine
These are the best additional bits of information that I can find to share with you if you're curious to read more about Snap and what they did. (They were spending $400m per year on GCP which was famously disclosed in their S-1 when they IPO'd)
The "unlimited dollars" come from somewhere after all.
GCP is behind in market share, but has the incredible cheat advantage of just not being Amazon. Most retailers won't touch Amazon services with a ten foot pole, so the choice is GCP or Azure. Azure is way more painful for FOSS stacks, so GCP has its own area with only limited competition.
I’m not sure what you mean by Azure being more painful for FOSS stacks. That is not my experience. Old you elaborate?
However I have seen many people flee from GCP because: Google lacks customer focus, Google is free about killing services, Google seems to not care about external users, people plain don’t trust Google with their code, data or reputation.
Customers would rather choose Azure. GCP has a bad rep, bad documentation, bad support compared to AWS / Azure. & with google cutting off products, their trust is damaged.
Google does not give even a singular fuck about keeping their customers. They will happily kill products that are actively in use and are low-effort for... convenience? Streamlining? I don't know, but Google loves to do that.
AWS often deploys its new platform products and latest hardware (think GPUs) into us-east-1, so everyone has to maintain a footprint in us-east-1 to use any of it.
So as a result, everyone keeps production in us-east-1. =)
Not to mention, even if you went through the hassle of a diverse multi-cloud deployment, there's still something in your stack that has a dependency on us-east-1, and it's probably some weird frontend javascript module that uses a floatilla of free tier lambda services to format dates.
Eh, us-east-1 is the oldest AWS region and if you get some AWS old timers talking over some beers they'll point out the legacy SPOFs that still exist in us-east-1.
2) People who thought that just having stuff "in the cloud" meant that it was automatically spread across regions. Hint, it's not; you have to deploy it in different regions and architect/maintain around that.
With more and more parts of our lives depending on often only one cloud infrastructure provider as a single point of failure, enabling companies to have built-in redundancy in their systems across the world could be a great business.
>Oct 20 12:51 AM PDT We can confirm increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. This issue may also be affecting Case Creation through the AWS Support Center or the Support API. We are actively engaged and working to both mitigate the issue and understand root cause. We will provide an update in 45 minutes, or sooner if we have additional information to share.
Weird that case creation uses the same region as the case you'd like to create for.
The support apis only exist in us-east-1, iirc. It’s a “global” service like IAM, but that usually means modifications to things have to go through us-east-1 even if they let you pull the data out elsewhere.
This isn't a "cloud failure". All of these apps would be running now had they spent the additional 5% development costs to add failover to another region.
us-east1 is supposed to consist of a number of "availability zones" that are independent and stay "available" even if one goes down. That's clearly not happening, so yes, this is very much a cloud failure.
It would have to be catastrophic for most businesses to make think about escaping the cloud. The cost of migration and maintenance are massive for small and medium businesses.
The point of AWS is to promise you the nines and make you feel good about it. Your typical "growth & engagement" startup CEO can feel good and make his own customers feel good about how his startup will survive a nuclear war.
Delivery of those nines is not a priority. Not for the cloud provider - because they can just lie their way out of it by not updating their status page - and even when they don't, they merely have to forego some of their insane profit margin for a couple hours in compensation. No provider will actually put their ass on the line and offer you anything beyond their own profit margin.
This is not an issue for most cloud clients either because they keep putting up with it (lying on the status page wouldn't be a thing if clients cared) - the unspoken truth is that nobody cares that your "growth & engagement" thing is down for an hour or so, so nobody makes anything more than a ceremonial stink about it (chances are, the thing goes down/misbehaves regularly anyway every time the new JS vibecoder or "AI employee" deploys something, regardless of cloud reliability).
Things where nines actually matter will generally invest in self-managed disaster recovery plans that are regularly tested. This also means it will generally be built differently and far away from your typical "cloud native" dumpster fire. Depending on how many nines you actually need (aka what's the cost of not meeting that target - which directly controls how much budget you have to ensure you always meet it), you might be building something closer to aircraft avionics with the same development practices, testing and rigor.
I can tell you from personal experience that improving/maintaining uptime (by doing root cause analysis, writing correction of error reports, going through application security reviews, writing/reviewing design docs for safely deploying changes, working on operational improvements to services) probably takes up a majority of most AWS engineers' time. I'm genuinely curious what you are basing the opinion "Delivery of those nines is not a priority" off of.
It's usually true if you arent in US-East-1 which is widely known to be the least reliable location. Theres no reason anyone should be deploying anything new to it these days.
Actual multi-region replication is hard and forces you to think about complicated things like the CAP theorem/etc. It's easier to pretend AWS magically solves that problem for you.
Which is actually totally fine for the vast majority of things, otherwise there would be actual commercial pressures to make sure systems are resilient to such outages.
Last time I checked the standard SLA is actually 99 % and the only compensation you get for downtime is a refund. Which is why I don't use AWS for anything mission critical.
> Following my ill-defined "Tarsnap doesn't have an SLA but I'll give people
credits for outages when it seems fair" policy, I credited everyone's Tarsnap accounts with 50% of a month's storage costs.
So in this case the downtime was roughly 26 hours, and the refund was for 50% of a month, so that's more than a 1-1 downtime refund.
Most "legacy" hosts do yes. The norm used to be a percentage of your bill for every hour of downtime once uptime dropped below 99.9%. If the outage was big enough you'd get credit exceeding your bill, and many would allow credit withdrawal in those circumstances. There were still limits to protect the host but there was a much better SLA in place.
Cloud providers just never adopted that and the "ha, sucks to be you" mentality they have became the norm.
Depends on which service you're paying for. For pure hosting the answer is no, which is why it rarely makes sense to go AWS for uptime and stability because when it goes down there's nothing you can do. As opposed to bare metal hosting with redundancy across data centers, which can even cost less than AWS for a lot of common workloads.
Theres literally thousands of options. 99% of people on AWS do not need to be on AWS. VPS servers or load balanced cloud instances from providers like Hetzner are more than enough for most people.
It still baffles me how we ended up in this situation where you can almost hear peoples disapproval over the internet when you say AWS / Cloud isn't needed and you're throwing money away for no reason.
There's nothing particularly wrong with AWS, other than the pricing premium.
The key is that you need to understand no provider will actually put their ass on the line and compensate you for anything beyond their own profit margin, and plan accordingly.
For most companies, doing nothing is absolutely fine, they just need to plan for and accept the occasional downtime. Every company CEO wants to feel like their thing is mission-critical but the truth is that despite everything being down the whole thing will be forgotten in a week.
For those that actually do need guaranteed uptime, they need to build it themselves using a mixture of providers and test it regularly. They should be responsible for it themselves, because the providers will not. The stuff that is actually mission-critical already does that, which is why it didn't go down.
Been using AWS too, but for a critical service we mirrored across three Hetzner datacenters with master-master replication as well as two additional locations for cluster node voting.
You would think that after the previous big us-east-1 outages (to be fair there have been like 3 of them in the past decade, but still, that's plenty), companies would have started to move to other AWS regions and/or to spread workloads between them.
It’s not that simple. The bit AWS doesn’t talk much about publicly (but will privately if you really push them) is that there’s core dependencies behind the scenes on us-east-1 for running AWS itself. When us-east-1 goes down the blast radius has often impacted things running in other regions.
It impacts AWS internally too. For example rather ironically it looks like the outage took out AWS’s support systems so folks couldn’t contact support to get help.
Unfortunately it’s not as simple as just deploying in multiple regions with some failover load balancing.
our eu-central-1 services had zero disruption during this incident, the only impact was that if we _had_ had any issues we couldn't log in to the AWS console to fix them
So moving stuff out of us-east-1 absolutely does help
Sure it helps. Folks just saying there’s lots of examples where you still get hit by the blast radius of a us-east-1 issue even if you’re using other regions.
> Amazon Alexa: routines like pre-set alarms were not functioning.
It's ridiculous how everything is being stored in the cloud, even simple timers. It's past high time to move functionality back on-device, which would come with the advantage of making it easier to de-connect from big tech's capitalist surveillance state as well.
I half-seriously like to say things like, "I'm excited for a time when we have powerful enough computers to actually run applications on them instead of being limited to only thin clients." Only problem is most of the younger people don't get the reference anymore, so it's mainly the olds that get it
we[1] operate out of `us-east-1` but chose to not use any of the cloud based vendor lockin (sorry vercel, supabase, firebase, planetscale etc). Rather a few droplets in DigitalOcean(us-east-1) and Hetzner(eu). We serve 100 million requests/mo, few million user generated content(images)/mo at monthly cost of just about $1000/mo.
It's not difficult, it's just that we engineers chose convenience and delegated uptime to someone else.
I agree but then again it’s always a humans fault in the end. So probably a root cause will have a bit more neuance. I was more thinking of the possible headlines and how that would potentially affect the public AI debate. Since this event is big enough to actually get the attention of eg risk management at not-insignificant orgs.
> but then again it’s always a humans fault in the end
Disagree, a human might be the cause/trigger, but the fault is pretty much always systemic. A whole lot of things has to happen for that last person to cause the problem.
"Oct 20 3:35 AM PDT The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution."
Potentially-ignoramus comment here, apologies in advance, but amazon.com itself appears to be fine right now. Perhaps slower to load pages, by about half a second. Are they not eating (much of) their own dog food?
They are 100% fully on AWS and nothing else. (I’m Ex-Amazon)
It seems like the outage is only effecting one region so AWS is likely falling back to others. I’m sure parts of the site are down but the main sites are resilient
those sentences aren't logically connected - the outage this post is about is mostly confined to `us-east-1`, anyone who wanted to build a reliable system on top of AWS would do so across multiple regions, including Amazon itself.
`us-east-1` is unfortunately special in some ways but not in ways that should affect well-designed serving systems in other regions.
He once said about an open source project that I was the third highest contributor on at AWS “This may be the worst AWS naming of 2021.” It was one of the proudest moments in my career.
Had a meeting where developers were discussing the infrastructure for an application. A crucial part of the whole flow was completely dependant on an AWS service. I asked if it was a single point of failure. The whole room laughed, I rest my case.
Similar experience here. People laughed and some said something like "well, if something like AWS falls then we have bigger problems". They laugh because honestly is too far-fetched to think the whole AWS infra going down. Too big to fail as they say in the US. Nothing short of a nuclear war would fuck up the entire AWS network so they're kinda right.
Until this happen. A single region in a cascade failure and your saas is single region.
They’re not wrong though. If AWS goes down, EVERYTHING goes down to some degree. Your app, your competitor’s apps, your clients’ chat apps. You’re kinda off the hook.
Why would your competitors go down? AWS has at best 30-35% market share. And that's ignoring the huge mass of companies who still run their infrastructure on bare metal.
Everyone using Recall for meeting recordings is down.
In some domains, a single SaaS dominates the domain and if that SaaS sits on AWS, it doesn't matter if AWS is 35% marketshare because the SaaS that dominates 80% of the domain is on AWS so the effect is wider than just AWS's market share.
We're on GCP, but we have various SaaS vendors on AWS so any of the services that rely on AWS are gone.
Many chat/meeting services also run on AWS Chime so even if you're not on AWS, if a vendor uses Chime, that service is down.
Part of the company I work at is doing infrastructure consulting. We're in fact seeing companies moving to bare metal, with the rise of turnkey container systems between Nutanix, Purestorage, Redhat, ... At this point in time, a few remotely managed boxes in a rack can offer a really good experience for containers for very little effort.
And this comes in a time with regulations like Dora and the BaFin tightening things - managing these boxes becomes less effort than maintaining compliance across vendors.
There have been plenty of solutions for a while. Pivotal Cloud Foundry, Open Shift, etc. None of these were "turnkey" though. If you're doing consulting, is it more newer, easier to install/manage tech, or is it cost?
I'm not in our consulting parts for infra, but based on their feedback and some talks at a conference a bit back: Vendors have been working on standardizing on Kubernetes components and mechanics and a few other protocols, which is simplifying configuration and reducing the configuration you have to do infrastructurally a lot.
Note, I'm not affiliated with any of these companies.
For example, Purestorage has put a lot of work into their solution and for a decent chunk of cache, you get a system that slots right into VMware, offers iSCSI for other infrastructure providers, offers a CSI plugin for containers, and speaks S3. And integration with a few systems like OpenShift has been simplified as well.
This continues. You can get ingress/egress/network monitoring compliance from Calico slotting in as a CNI plugin, some systems managing supply chain security, ... Something like Nutanix is an entirely integrated solution you rack and then you have a container orchestration with storage and all of the cool things.
Cost is not really that much a factor in this market. Outsourcing regulatory requirements and liability to vendors is great.
Because your competitor probably depends on a service which uses aws.
They may host all their stuff in azure, but use cloudfront as cache which uses aws and goes down.
>> People laughed and some said something like "well, if something like AWS falls then we have bigger problems".
> They’re not wrong though. If AWS goes down, EVERYTHING goes down to some degree. Your app, your competitor’s apps, your clients’ chat apps. You’re kinda off the hook.
They made their own bigger problems by all crowding into the same single region.
imagine a beach, with icecream vendors. You'd think it would be optimal for two vendors to each split it half north, half south. However, in wanting to steal some of the other vendors' customers, you end up with two icecream stands in the center.
So too with outages. Safety / loss of blame in numbers.
I feel like I don't really like AWS and prefer self hosted vps or even google cloud / cloudflare more but and so I agree with what you are trying to say but Let me be the devil's argument.
I mean I agree but what you are saying is that where else are you gonna host it? If you host it yourself and then it turns out to be an issue and you go down then that's entirely on you and 99% of the internet still works.
But if Aws goes down, lets say 50% of the internet goes down.
So, in essence, nobody blames a particular team/person just as the parent comment said that nobody gets fired for picking IBM.
Although, I still think that the idea which is worrying is such massive centralization of servers that we have a single switch which can turn half the internet off. So I am a bit worried from the centralization side of thing's.
The question really becomes, did you make money that you wouldn't have made when services came back up? As in, will people just shift their purchase time to tomorrow when you are back online? Sure, some % is completely lost but you have to weigh that lost amount against the ongoing costs to be multi-cloud (or multi-provider) and the development time against those costs. For most people I think it's cheaper to just be down for a few hours. Yes, this outage is longer than any I can remember but most people will shrug it off and move on once it comes back up fully.
At the end of the day most of us aren't working on super critical things. No one is dying because they can't purchase X item online or use Y SaaS. And, more importantly, customers are _not_ willing to pay the extra for you to host your backend in multiple regions/providers.
In my contracts (for my personal company) I call out the single-point-of-failure very clearly and I've never had anyone balk. If they did I'd offer then resiliency (for a price) and I have no doubt that they would opt to "roll the dice" instead of pay.
Lastly, it's near-impossible to verify what all your vendors are using so even if you manage to get everything resilient it only takes one chink in the armor the bring it all down (See: us-east-1 and various AWS services that rely on that even if you don't host anything in us-east-1 directly).
I'm not trying to downplay this, pretend it doesn't matter, or anything like that. Just trying to point out that most people don't care because no one seems to care (or want to pay for it). I wish that was different (I wish a lot of things were different) but wishing doesn't pay my bills and so if customers don't want to pay for resiliency then this is what they get and I'm at peace with that.
If you were dependent upon a single distribution (region) of that Service, yes it would be a massive single point of failure in this case. If you weren't dependent upon a particular region, you'd be fine.
Of course lots of AWS services have hidden dependencies on us-east-1. During a previous outage we needed to update a Route53(DNS) record in us-west-2, but couldn't because of the outage in us-east-1.
So, AWS's redundant availability goes something like "Don't worry, if nothing is working in us-east-1, it will trigger failover to another regions" ... "Okay, where's that trigger located?" ... "In the us-east-1 region also" ... "Doens't that seem a problem to you?" ... "You'd think it might be! But our logs say it's never been used."
Relying on AWS is a single point of failure. Not as much as relying on a single AWS region, but it's still a single point.
It's fairly difficult to avoid single points of failure completely, and if you do it's likely your suppliers and customers haven't managed to.
It's about how much your risk level is.
AWS us-east-1 fails constantly, it has terrible uptime, and you should expect it to go. A cyberattack which destroyed AWSs entire infrastructure would be less likely. BGP hijacks across multiple AWS nodes are quite plausible though, but that can be mitigated to an extent with direct connects.
Sadly it seems people in charge of critical infrastructure don't even bother thinking about these things, because next quarters numbers are more important.
I can avoid London as a single point of failure, but the loss of Docklands would cause so much damage to the UK's infrastructure I can't confidently predict that my servers in Manchester connected to peering points such as IXman will be able to reach my customer in Norwich. I'm not even sure how much international connectivity I could rely on. In theory Starlink will continue to work, but in practice I'm not confident.
When we had power issues in Washington DC a couple of months ago, three of our four independent ISPS failed, as they all had undeclared requirements on active equipment in the area. That wasn't even a major outage, just a local substation failure. The one circuit which survived was clearly just fibre from our (UPS/generator backed) equipment room to a data centre towards Baltimore (not Ashburn).
It won't be over until long after AWS resolves it - the outages produce hours of inconsistent data. It especially sucks for financial services, things of eventual consistency and other non-transactional processes. Some of the inconsistencies introduced today will linger and make trouble for years.
What are the design best practices and industry standards for building on-premise fallback capabilities for critical infrastructure? Say for health care/banking ..etc
A relative of mine lived and worked in the US for Oppenheimer Funds in the 1990's and they had their own datacenters all over the US, multiple redundancy for weather or war. But every millionaire feels entitled to be a billionaire now, so all of that cost was rolled into a single point of cloud failure.
If we see more of this, it would not be crazy to assume that all this compelling of engineers to "use AI" and the flood of Looks Good To Me code is coming home.
Big if, major outages like this aren't unheard of, and so far, fairly uncommon. Definitely hit harder than their SLAs promise though. I hope they do an honest postmortem, but I doubt they would blame AI even if it was somehow involved. Not to mention you can't blame AI unless you go completely hands-off - but that's like blaming an outsourcing partner, which also never happens.
I'm not sure if this is directly related, but I've noticed my Apple Music app has stopped working (getting connection error messages). Didn't realize the data for Music was also hosted on AWS, unless this is entirely unrelated? I've restarted my phone and rebooted the app to no avail, so I'm assuming this is the culprit.
Wow, about 9 hours later and 21 of 24 Atlassian services are still showing up as impacted on their status page.
Even @ 9:30am ET this morning, after this supposedly was clearing up, my doctor's office's practice management software was still hosed. Quite the long tail here.
I forget where I read it originally, but I strongly feel that AWS should offer a `us-chaos-1` region, where every 3-4 days, one or two services blow up. Host your staging stack there and you build real resiliency over time.
(The counter joke is, of course, "but that's `us-east-1` already! But I mean deliberately and frequently.)
This is usually something I see on Reddit first, within minutes. I’ve barely seen anything on my front page. While I understand it’s likely the subs I’m subscribed to, that was my only reason for using Reddit. I’ve noticed that for the past year - more and more tech heavy news events don’t bubble up as quickly anymore. I also didn’t see this post for a while for whatever reason. And Digg was hit and miss on availability for me, and I’m just now seeing it load with an item around this.
I think I might be ready to build out a replacement through vibe coding. I don’t like being dependent on user submissions though. I feel like that’s a challenge on its own.
Yeah, Reddit has been half-working all morning. Last time this happened I had an account get permabanned because the JavaScript on the page got stuck in a no-backoff retry loop and it banned me for "spamming." Just now it put me in rate limit jail for trying to open my profile one time, so I've closed out all my tabs.
Anecdotally, I think you should disregard this. I found out about this issue first via Reddit, roughly 30 minutes after the onset (we had an alarm about control plane connectivity).
Reddit is worthless now, and posting about your tech infrastructure on reddit is a security and opsec lapse. My workplace has reddit blocked at the edge. I would trust X more than reddit, and that is with X having active honeypot accounts (it is even a meme about Asian girls).
In fact, heard about this outage on X before anywhere else.
interesting, which auth provider u are using? browser based auth via Google wasn't working for me. tailscale used as jumphost for private subnets in aws, and... so that was painful incident as access to corp resources is mandatory for me
I can't log in to my AWS account, in Germany, on top of that it is not possible to order anything or change payment options from amazon.de.
No landing page explaining services are down, just scary error pages. I thought account was compromised. Thanks HN for, as always, being the first to clarify what's happening.
Scary to see that in order to order from Amazon Germany, us-east1 must be up. Everything else works flawlessly but payments are a no go.
I wanted to log into my Audible account after a long time on my phone, I couldn't, started getting annoyed, maybe my password is not saved correctly, maybe my account was banned, ... Then checking desktop, still errors, checking my Amazon.de, no profile info... That's when I started suspecting that it's not me, it's you, Amazon! Anyway, I guess, I'll listen to my book in a couple of hours, hopefully.
Btw, most parts of the amazon.de is working fine, but I can't load profiles, and can't login.
We use IAM Identity Center (née SSO) which is hosted in the eu-central-1 region, and I can log in just fine. Its admin pages are down, though. Ditto for IAM.
I'm on Amazon.de and I literally ordered stuff seconds before posting the comment. They took the money and everything. The order is in my order history list.
Not remotely surprised. Any competent engineer knows full well the risk of deploying into us-east-1 (or any “default” region for that matter), as well as the risks of relying on global services whose management or interaction layer only exists in said zone. Unfortunately, us-east-1 is the location most outsourcing firms throw stuff, because they don’t have to support it when it goes pear-shaped (that’s the client’s problem, not theirs).
My refusal to hoard every asset into AWS (let alone put anything of import in us-east-1) has saved me repeatedly in the past. Diversity is the foundation of resiliency, after all.
> as well as the risks of relying on global services whose management or interaction layer only exists in said zone.
Is this well known/documented? I don't have anything on AWS but previously worked for a company that used it fairly heavily. We had everything in EU regions and I never saw any indication/warning that we had a dependency on us-east-1. But I assume we probably did based on the blast radius of today's outage.
Displaying and propagating accurate error messages is an entire science unto itself... ...I can see why it's sometimes sensible to invest resource elsewhere and fall back to 'something'.
Just a couple of days ago in this HN thread [0] there were quite some users claiming Hetzner is not an options as their uptime isn't as good as AWS, hence the higher AWS pricing is worth the investment. Oh, the irony.
As a data point, I've been running stuff at Hetzner for 10 years now, in two datacenters (physical servers). There were brief network outages when they replaced networking equipment, and exactly ONE outage for hardware replacement, scheduled weeks in advance, with a 4-hour window and around 1-2h duration.
It's just a single data point, but for me that's a pretty good record.
It's not because Hetzner is miraculously better at infrastructure, it's because physical servers are way simpler than the extremely complex software and networking systems that AWS provides.
Well the complexity comes not from Kubernetes per se but that the problem it wants to solve (generalized solution for distributed computing) is very hard in itself.
Only if you actually has a system complex enough to require it. A lot of systems that use kubernetes are not complex enough to require it, but use it anyway. In that case kubernetes does indeed add unnecessary complexity.
Except that k8s doesn't solve the problem of generalized distributed computing at all. (For that you need distributed fault-tolerant state handling which k8s doesn't do.)
K8s solves only one problem - the problem of organizational structure scaling. For example, when your Ops team and your Dev team have different product deadlines and different budgets. At this point you will need the insanity of k8s.
I am so happy to read that someone views kubernetes the same way I do. for many years i have been surrounded by people who "kubernetes all the things" and that is absolute madness to me.
Yes, I remember when Kubernetes hit the scene and it was only used by huge companies who needed to spin-up fleets of servers on demand. The idea of using it for small startup infra was absurd.
As another data point, I run a k8s cluster on Hetzner (mainly for my own experience, as I'd rather learn on my pet projects vs production), and haven't had any Hetzner related issues with it.
So Hetzner is OK for the overly complex as well, if you wish to do so.
I think it sounds quite realistic especially if you’re using something like Talos Linux.
I’m not using k8s personally but the moment I moved from traditional infrastructure (chef server + VMs) to containers (Portainer) my level of effort went down by like 10x.
yes and mobile phones existed before smartphones, what's the point? So far in terms of scalability nothing beats k8s. And from OpenAI and Google we also see that it even works for high performance use case such as LLM trainings with huge amounts of nodes.
On the other hand, I had the misfortune of having a hardware failure on one of my Hetzner servers. They got a replacement harddrive in fairly quickly, but still complete data loss on that server, so I had to rebuild it from scratch.
This was extra painful, because I wasn't using one of the OS that is blessed by Hetzner, so it requires a remote install. Remote installs require a system that can run their Java web plugin, and that have a stable and fast enough connection to not time out. The only way I have reliably gotten them to work is by having an ancient Linux VM that was also running in Hetzner, and had the oldest Firefox version I could find that still supported Java in the browser.
My fault for trying to use what they provide in a way that is outside their intended use, and props to them for letting me do it anyway.
That can happen with any server, physical or virtual, at any time, and one should be prepared for it.
I learned a long time ago that servers should be an output of your declarative server management configuration, not something that is the source of any configuration state. In other words, you should have a system where you can recreate all your servers at any time.
In your case, I would indeed consider starting with one of the OS base installs that they provide. Much as I dislike the Linux distribution I'm using now, it is quite popular, so I can treat it as a common denominator that my ansible can start from.
Do you monitor your product closely enough to know that there weren't other brief outages? E.g. something on the scale of unscheduled server restarts, and minute-long network outages?
I personally do through status monitors at larger cloud providers at 30 sec resolutions, never noticed a downtime. They will sometimes drop ICMP though, even though the host is alive and kicking.
actually, why do people block ICMP? I remember in 1997-1998 there were some Cisco ICMP vulnerabilities and people started blocking ICMP then and mostly never stopped, and I never understood why. ICMP is so valuable for troubleshooting in certain situations.
Security through obscurity mostly, I don't know who continues to push the advice to block ICMP without a valid technical reason since at best if you tilt your head and squint your eyes you could almost maybe see a (very new) script kiddie being defeated by it.
I've rarely actually seen that advice anywhere, more so 20 years ago than now but people are still clearly getting it from circles I don't run in.
I do for some time now, on the scale of around 20 hosts in their cloud offering. No restarts or network outages. I do see "migrations" from time to time (vm migrating to a different hardware, I presume), but without impact on metrics.
to stick to the above point, this wasn't a minute long outage.
if you care about seconds/minutes long outages, you monitor. running on aws, hetzer, ovh, or a raspberry in a shoe box makes no difference
I do. Routers, switches, and power redundancy are solved problems in datacenter hardware. Network outages rarely occur because of these systems, and if any component goes down, there's usually an automatic failover. The only thing you might notice is TCP connections resetting and reconnecting, which typically lasts just a few seconds.
Yes, but those days are numbered. For many years AWS was in a league of its own. Now they’ve fallen badly behind in a growing number of areas and are struggling to catch up.
There’s a ton of momentum associated with the prior dominance, but between the big misses on AI, a general slow pace of innovation on core services, and a steady stream of top leadership and engineers moving elsewhere they’re looking quite vulnerable.
Can you throw out an example or two, because in my experience, AWS is the 'it just works' of the cloud world. There's a service for everything and it works how you'd expect.
I'm not sure what feature they're really missing, but my favorite is the way they handle AWS Fargate. The other cloud providers have similar offerings but I find Fargate to have almost no limitations when compared to the others.
You’ve given a good description of IBM for most of the 80s through the 00s. For the first 20 years of that decline “nobody ever got fired for buying IBM” was still considered a truism. I wouldn’t be surprised if AWS pulls it off for as long as IBM did.
I think that the worst thing that can happen to an org is to have that kind of status ("nobody ever got fired for buying our stuff" / "we're the only game in town").
It means no longer being hungry. Then you start making mistakes. You stop innovating. And then you slowly lose whatever kind of edge you had, but you don't realize that you're losing it until it's gone
Unfortunately I think AWS is there now. When you talk to folks there they don’t have great answers to why their services are behind or not as innovative as other things out there. The answer is basically “you should choose AWS because we’re AWS.” It’s not good.
I couldn't agree more, there was clearly a big shift when Jassy became CEO of amazon as a whole and Charlie Bell left (which is also interesting because it's not like azure is magically better now).
The improvements to core services at AWS hasn't really happened at the same pace post covid as it did prior, but that could also have something to do with overall maturity of the ecosystem.
Although it's also largely the case that other cloud providers have also realized that it's hard for them to compete against the core competency of other companies, whereas they'd still be selling the infrastructure the above services are run on.
Given recent earnings and depending on where things end up with AI it’s entirely plausible that by the end of the decade AWS is the #2 or #3 cloud provider.
AWS' core advantage is price. No one cares if they are "behind on AI" or "the VP left." At the end of the day they want a cheap provider. Amazon knows how to deliver good-enough quality at discount prices.
That story was true years ago but I don’t know that it rings true now. AWS is now often among the more expensive options, and with services that are struggling to compete on features and quality.
Simultaneously too confused to be able to make their own UX choices, but smart enough to understand the backend of your infrastructure enough to know why it doesn't work and excuses you for it.
The morning national TV news (BBC) was interrupted with this as breaking news, and about how many services (specifically snapchat for some reason) are down because of problems with "Amazon's Web Services, reported on DownDetector"
I thought we didn't like when things were "too big to fail" (like the banks being bailed out because if we didn't the entire fabric of our economy would collapse; which emboldens them to take more risks and do it again).
A typical manager/customer understands just enough to ask their inferiors to make their f--- cloud platform work, why haven't you fixed it yet? I need it!
In technically sophisticated organizations, this disconnect simply floats to higher levels (e.g. CEO vs. CTO rather than middle manager vs. engineer).
I only got €100.000 bounded to a year, then a 20% discount for spend in the next year.
(I say "only" because that certainly would be a sweeter pill, €100.000 in "free" credits is enough to make you get hooked, because you can really feel the free-ness in the moment).
> In the end, you live with the fact that your service might be down a day or two per year.
This is hilarious. In the 90s we used to have services which ran on machines in cupboards which would go down because the cleaner would unplug them. Even then a day or two per year would be unacceptable.
On one hand it's allows to shift the blame but on other hand is shows a disadvantage of hyper centralization - if AWS is down too many important services are down at the same time which makes it worse. E. g. when AWS is down it's important to have communication/monitoring services UP so engineers can discuss / co-ordinate workarounds and have good visibility but Atlassian was (is) significantly degraded today too.
100%. When AWS was down, we'd say "AWS is down!", and our customers would get it. Saying "Hetzner is down!" raises all sorts of questions your customers aren't interested in.
I've ran a production application off Hetzner for a client for almost a decade and I don't think I have had to tell them "Hetzner is down", ever, apart from planned maintenance windows.
Hosting on second- or even third-tier providers allows you to overprovision and have much better redundancy, provided your solution is architected from the ground up in a vendor agnostic way. Hetzner is dirt cheap, and there are countless cheap and reliable providers spread around the globe (Europe in my case) to host a fleet of stateless containers that never fail simultaneously.
Stateful services are much more difficult, but replication and failover is not rocket science. 30 minutes of downtime or 30 seconds of data loss rarely kill businesses. On the contrary, unrealistic RTOs and RPOs are, in my experience, more dangerous, either as increased complexity or as vendor lock-in.
Customers don't expect 100% availability and noone offers such SLAs. But for most businesses, 99.95% is perfecty acceplable, and it is not difficult to have less than 4h/year of downtime.
The point seems to be not that Hetzner will never have an outage, but rather that they have a track record of not having outages large enough for everyone to be affected.
Seems like large cloud providers, including AWS, are down quite regularly in comparison, and at such a scale that everything breaks for everyone involved.
> The point seems to be not that Hetzner will never have an outage, but rather that they have a track record of not having outages large enough for everyone to be affected.
If I am affected, I want everyone to be affected, from a messaging perspective
Okay, that helps for the case when you are affected. But what about the case when you are not affected and everyone else is? Doesn't that seem like good PR?
Take the hit of being down once every 10 years compared to being up for the remaining 9 that others are down.
> An Amazon Web Services outage is causing major disruptions around the world. The service provides remote computing services to many governments, universities and companies, including The Boston Globe.
> On DownDetector, a website that tracks online outages, users reported issues with Snapchat, Roblox, Fortnite online broker Robinhood, the McDonald’s app and many other services.
That's actually a fairly decent description for the non-tech crowd and I am going to adopt it, as my company is in the cloud native services space and I often have a problem explaining the technical and business model to my non-technical relatives and family - I get bogged down in trying to explain software defined hardware and similar concepts...
I asked ChatGPT for a succinct definition, and I thought it was pretty good:
“Amazon Web Services (AWS) is a cloud computing platform that provides on-demand access to computing power, storage, databases, and other IT resources over the internet, allowing businesses to scale and pay only for what they use.”
For us techies yes, but to the regular folks that is just as good as our usual technical gobbledy-gook - most people don´t differentiate between a database and a hard-drive.
> access to computing power, storage, databases, and other IT resources
could be simplified to: access to computer servers
Most people who know little about computers can still imagine a giant mainframe they saw in a movie with a bunch of blinking lights. Not so different, visually, from a modern data center.
You can argue about Hetzner's uptime, but you can 't argue about Hetzner's pricing which is hands down the best there is. I'd rather go with Hetzner and cobble up together some failover than pay AWS extortion.
I switched to netcup for even cheaper private vps for personal noncritical hosting. I'd heard of netcup being less reliable but so far 4 months+ uptime and no problems. Europe region.
Hetzner has the better web interface and supposedly better uptime, but I've had no problems with either. Web interface not necessary at all either when using only ssh and paying directly.
I used netcup for 3 years straight for some self hosting and never noticed an outage. I was even tracking it with smokeping so if the box disappeared I would see it but all of the down time was mine when I rebooted for updates. I don't know how they do it but I found them rock solid.
I've been running my self-hosting stuff on Netcup for 5+ years and I don't remember any outages. There probably were some, but they were not significant enough for me to remember.
netcup is fine unless you have to deal with their support, which is nonexistent. Never had any uptime issues in the two years I've been using them, but friends had issues. Somewhat hit or miss I suppose.
Exactly. Hetzner is the equivalent of the original Raspberry Pi. It might not have all fancy features but it delivers and for the price that essentially unblocks you and allows you to do things you wouldn't be able to do otherwise.
> I'd rather go with Hetzner and cobble up together some failover than pay AWS extortion.
Comments like this are so exaggerated that they risk moving the goodwill needle back to where it was before. Hetzner offers no service that is similar to DynamoDB, IAM or Lambda. If you are going to praise Hetzner as a valid alternative during a DynamoDB outage caused by DNS configuration, you would need to a) argue that Hetzner is a better option regarding DNS outages, b) Hetzner is a preferable option for those who use serverless offers.
I say this as a long-time Hetzner user. Herzner is indeed cheaper, but don't pretend that Herzner let's you click your way into a highly-availale nosql data store. You need non-trivial levels of you're ow work to develop, deploy, and maintain such a service.
> but don't pretend that Herzner let's you click your way into a highly-availale nosql data store.
The idea you can click your way to a highly available, production configured anything in AWS - especially involving Dynamo, IAM and Lambda - is something I've only heard from people who've done AWS quickstarts but never run anything at scale in AWS.
Of course nobody else offers AWS products, but people use AWS for their solutions to compute problems and it can be easy to forget virtually all other providers offer solutions to all the same problems.
>The idea you can click your way to a highly available, production configured anything in AWS - especially involving Dynamo, IAM and Lambda
With some services I'd agree with you, but DynamoDB and Lambda are easily two of their 'simplest' to configure and understand services, and two of the ones that scale the easiest. IAM roles can be decently complicated, but that's really up to the user. If it's just 'let the Lambda talk to the table' it's simple enough.
S3/SQS/Lambda/DynamoDB are the services that I'd consider the 'barebones' of the cloud. If you don't have all those, you're not a cloud provider, your just another server vendor.
> With some services I'd agree with you, but DynamoDB and Lambda are easily two of their 'simplest' to configure and understand services, and two of the ones that scale the easiest. IAM roles can be decently complicated, but that's really up to the user. If it's just 'let the Lambda talk to the table' it's simple enough.
We agree, but also, I feel like you're missing my point: "let the Lambda talk to the table" is what quickstarts produce. To make a lambda talk to a table at scale in production, you'll want to setup your alerting and monitoring to notify you when you're getting close to your service limits.
If you're not hitting service limits/quotas, you're not running even close to running at scale.
Not if you want to build something production ready. Even a simple thing like say static IP ingress for the Lambda is very complicated. The only AWS way you can do this is by using Global Accelerator -> Application Load Balancer -> VPC Endpoint -> API Gateway -> Lambda !!.
There are so many limits for everything that is very hard to run production workloads without painful time wasted in re-architecture around them and the support teams are close to useless to raise any limits.
Just in the last few months, I have hit limits on CloudFormation stack size, ALB rules, API gateway custom domains, Parameter Store size limits and on and on.
That is not even touching on the laughably basic tooling both SAM and CDK provides for local development if you want to work with Lambda.
Sure Firecracker is great, and the cold starts are not bad, and there isn't anybody even close on the cloud. Azure functions is unspeakably horrible, Cloud Run is just meh. Most Open Source stacks are either super complex like knative or just quite hard to get the same cold start performance.
We are stuck with AWS Lambda with nothing better yes, but oh so many times I have come close to just giving up and migrate to knative despite the complexity and performance hit.
>Not if you want to build something production ready.
>>Gives a specific edge case about static IPs and doing a serverless API backed by lambda.
The most naive solution you'd do on any non-cloud vendor, just have a proxy with a static ip that then routes traffic whereever it needed to go, would also work on AWS.
So if you think AWS's solution sucks why not just go with that? What you described doesn't even sound complicated when you think of the networking magic behind the scenes that will take place if you ever do scale to 1 million tps.
Don’t know what you think should mean but for me that means
1. Declarative IaaC in either in CF/terraform
2. Fully Automated discovery which can achieve RTO/RPO objectives
3. Be able to Blue/Green and % or other rollouts
Sure I can write ansible scripts, have custom EC2 images run HA proxy and multiple nginx load balancers in HA as you suggest, or host all that to EKS or a dozen other “easier” solutions
At the point why bother with Lambda ? What is the point of being cloud native and serverless if you have to literally put few VMs/pod in front and handle all traffic ? Might as well host the app runtime too .
> doesn’t even sound complicated .
Because you need a full time resource who is AWS architect and keeps up with release notes and documentation or training and constantly works to scale your application - because every single component has a dozen quotas /limits and you will hit them - it is complicated.
If you spend few million a year on AWS then spending 300k on an engineer to do just do AWS is perhaps feasible .
If you spend few hundred thousands on AWS as part of mix of workloads it is not easy or simple.
The engineering of AWS impressive as it maybe has nothing to the products being offered . There is a reason why Pulumi, SST or AWS SAM itself exist .
Sadly SAM is so limited I had to rewrite everything to CDK in couple of months . CDK is better but I am finding that I have to monkey patching limits on CDK with the SDK code now, while possible , the SDK code will not generate Cloudformation templates .
> Don’t know what you think should mean but for me that means
I think your inexperience is showing, if that's what you try to mean by "production-ready". You're making a storm in a teacup over features that you automatically onboard if you go through an intro tutorial, and "production-ready" typically means way more than a basic run-of-the-mill CICD pipeline.
As most of the times, the most vocal online criticism comes from those who have the least knowledge and experience over the topic they are railing against, and their complains mainly boil down to criticising their own inexperience and ignorance. There is plenty of things to criticize AWS for, such as cost and vendor lock-in, but being unable and unwilling to learn how to use basic services is not it.
Try telling that to customers who can only do outbound API calls to whitelisted IP addresses
When you are working with enterprise customers or integration partners it doesn’t even have to be regulated sectors like finance or healthcare, these are basic asks you cannot get away from .
people want to be able to know whitelist your egress and ingress IPs or pin certificates. It is not up to me to say on efficacy of these rules .
I don’t make the rules of the infosec world , I just follow them.
> The idea you can click your way to a highly available, production configured anything in AWS - especially involving Dynamo, IAM and Lambda - is something I've only heard from people who've done AWS quickstarts but never run anything at scale in AWS.
I'll bite. Explain exactly what work you think you need to do to get your pick of service running on Hetnzer to have equivalent fault-tolerance to, say, a DynamoDB Global Table created with the defaults.
Are you Netflix? Because is not theres a 99% probability you dont need any of those AWS services and just have a severe case of shiny object syndrome in your organisation.
Plenty of heavy traffic, high redundancy applications exist without the need for AWS (or any other cloud providers) overpriced "bespoke" systems.
To be honest I don't trust myself running a HA PostgreSQL setup with correct backups without spending an exorbitant effort to investigate everything (weeks/months) - do you ? I'm not even sure what effort that would take. I can't remember last time I worked with unmanaged DB in prod where I did not have a dedicated DBA/sysadmin. And I've been doing this for 15 years now. AFAIK Hetzner offers no managed database solution. I know they offer some load balancer so there's that at least.
At some point in the scaling journey bare metal might be the right choice, but I get the feeling a lot of people here trivialize it.
That doesn’t give you high availability; it doesn’t give you monitoring and alerting; it doesn’t give you hardware failure detection and replacement; it doesn’t solve access control or networking…
Managed databases are a lot more than apt install postgresql.
If you're doing it yourself, learn Ansible, you'll do it once and be set forever.
You do not need "managed" database services. A managed database is no different from apt install postgesql followed by a scheduled backup.
Genuinely no disrespect, but these statements really make it seem like you have limited experience building an HA scalable system. And no, you don't need to be Netflix or Amazon to build software at scale, or require high availability.
Backups with wall-g and recurring pg_dump are indeed trivial. (Modulo S3 outage taking so long that your WAL files fill up the disk and you corrupt the entire database.)
It's the HA part, especially with a high-volume DB that's challenging.
But that's the thing - if I have an ops guy who can cover this then sure it makes sense - but who does at an early stage ? As a semi competent dev I can setup a terraform infra and be relatively safe with RDS. I could maybe figure out how to do it on my own in some time - but I don't know what I don't know - and I don't want to do a weekend production DB outage debugging because I messed up the replication setup or something. Maybe I'm getting old but I just don't have the energy to deal with that :)
If youre not Netflix then just sudo yum install postgresql and pg_dump every day, upload to S3. Has worked for me for 20 years at various companies, side projects, startups …
> If youre not Netflix then just sudo yum install postgresql and pg_dump every day, upload to S3.
database services such as DynamoDB support a few backup strategies out of the box, including continuous backups. You just need to flips switch and never bother about it again.
> Has worked for me for 20 years at various companies, side projects, startups …
That's perfectly fine. There are still developers who don't even use version control at all. Some old habits die hard, even when the whole world moved on.
> Are you Netflix? Because is not theres a 99% probability you dont need any of those AWS services and just have a severe case of shiny object syndrome in your organisation.
I think you don't even understand the issue you are commenting on. It's irrelevant if you are Netflix or some guy playing with a tutorial. One of the key traits of serverless offerings is how it eliminates the need to manage and maintain a service or even worry about you have enough computational resources. You click a button to provision everything, you configure your clients to consume that service, and you are done.
If you stop to think about the amount of work you need to invest to even arrive at a point where you can actually point a client at a service, you'll be looking at what the value of serverless offerings.
Ironically, it's the likes of Netflix who can put together a case against using serverless offerings. They can afford to have their own teams managing their own platform services with the service levels they are willing to afford. For everyone else, unless you are in the business of managing and tuning databases or you are heavily motivated to save pocket change on a cloud provider bill, the decision process is neither that clear not favours running your own services.
> Plenty of heavy traffic, high redundancy applications exist without the need for AWS (or any other cloud providers) overpriced "bespoke" systems.
And almost all of them need a database, a load balancer, maybe some sort of cache. AWS has got you covered.
Maybe some of them need some async periodic reporting tasks. Or to store massive files or datasets and do analysis on them. Or transcode video. Or transform images. Or run another type of database for a third party piece of software. Or run a queue for something. Or capture logs or metrics.
And on and on and and on. AWS has got you covered.
This is Excel all over again. "Excel is too complex and has too many features, nobody needs more than 20% of Excel. It's just that everyone needs a different 20%".
You're right AWS does have you covered. But that doesn't mean thats the only way of doing it. Load balancing is insanely easy to do yourself, databases even easier. Caching, ditto.
I think a few people who claim to be in devops could do with learning the basics about how things like Ansible can help them as there's a fair few people who seem to be under the impression AWS is the only, and the best option, which unless you're FAANG really is rarely the case.
> You're right AWS does have you covered. But that doesn't mean thats the only way of doing it. Load balancing is insanely easy to do yourself, databases even easier. Caching, ditto
I think you don't understand the scenario you are commenting on. I'll explain why.
It's irrelevant if you believe that you are able to imagine another way to do something, and that you believe it's "insanely easy" to do those yourself. What matters is that others can do that assessment themselves, and what you are failing to understand is that when they do so, their conclusion is that the easiest way by far to deploy and maintain those services is AWS.
And it isn't even close.
You mention load balancing and caching. The likes of AWS allows you to setup a global deployment of those services with a couple of clicks. In AWS it's a basic configuration change. And if you don't want it, you just tear down everything with a couple of clicks as well.
Why do you think a third of all the internet runs on AWS? Do you think every single cloud engineer in the world is unable to exercise any form of critical thinking? Do you think there's a conspiracy out there to force AWS to rule the world?
You can spin up a redundant database setup with backups and monitoring and automatic fail over in 10 mins (the time it takes in AWS)? And maintain it? If you've done this a few times before and have it highly automated, sure. But let's not pretend it's "even easier" than "insanely easy".
Load balancing is trivial unless you get into global multicast LBs, but AWS have you covered there too.
If you need the absolutely stupid scale DynamoDB enables what is the difference compared to running for example FoundationDb on your own using Hetzner?
> The key thing you should ask yourself: do you need DynamoDB or Lambda? Like "need need" or "my resume needs Lambda".
If you read the message you're replying to, you will notice that I singled out IAM, Lambda, and DynamoDB because those services were affected by the outage.
If Hetzner is pushed as a better or even relevant alternative, you need to be able to explain exactly what you are hoping to say to Lambda/IAM/DynamoDB users to convince them that they would do better if they used Hetzner instead.
Making up conspiracy theories over CVs doesn't cut it. Either you know anything about the topic and you actually are able to support this idea, or you're an eternal September admission whose only contribution is noise and memes.
TBH, in my last 3 years with Hetzner, i never saw a downtime to my servers other than myself doing some routin maitenance for os updates. Location Falkenstein.
You really need your backup procedures and failover procedures though, a friend bought a used server and the disk died fairly quickly leaving him sour.
Younger guy with ambitions but little experience, I think my point was that used servers with Hetzner are still used so if someone has been running disk heavy jobs you might want to request new disks or multiple ones and not just pick the cheapest options at the auction.
(Interesting that an anectode like above got downvoted)
> (Interesting that an anectode like above got downvoted)
experts almost unilaterally judge newbies harshly, as if the newbies should already know all of the mistakes to avoid. things like this are how you learn what mistakes to avoid.
"hindsight is 20/20" means nothing to a lot of people, unfortunately.
What is the Hetzner equivalent for those in Windows Server land? I looked around for some VPS/DS providers that specialize in Windows, and they all seem somewhat shady with websites that look like early 2000s e-commerce.
I work at a small / medium company with about ~20 dedicated servers and ~30 cloud servers at Hetzner. Outages have happened, but we were lucky that the few times it did happen, it was never a problem / actual downtime.
One thing to note is that there are some scheduled maintenances were we needed to react.
We've been running our services on Hetzner for 10 years, never experienced any significant outages.
That might be datacenter dependant of course, since our root servers and cloud services are all hosted in Europe, but I really never understood why Hetzner is said to be less reliable
> 99.99% uptime infra significantly cheaper than the cloud.
I guess that's another person that has never actually worked in the domain (SRE/admin) but still wants to talk with confidence on the topic.
Why do I say that? Because 99.99% is frickin easy
That's almost one full hour of complete downtime per year.
It only gets hard in the 99.9999+ range ... And you rarely meet that range with cloud providers either as requests still fail for some reason, like random 503 when a container is decommissioned or similar
>Just a couple of days ago in this HN thread [0] there were quite some users claiming Hetzner is not an options as their uptime isn't as good as AWS, hence the higher AWS pricing is worth the investment. Oh, the irony.
That's not necessarily ironic. Seems like you are suffering from recency bias.
The only hard dependency I am still aware of is write operations to the R53 control plane. Failover records and DNS queries would not be impacted. So business workflows would run as if nothing happened.
(There may still be some core IAM dependencies in USE1, but I haven’t heard of any.)
We don't know that (yet) - it's possible that this is simply a demonstration of how many companies have a hard dependency on us-east-1 for whatever reason (which I can certainly believe).
We'll know when (if) some honest RCAs come out that pinpoint the issue.
I had a problem with an ACME cert terraform module. It was doing the R53 to add the DNS TXT record for the ACME challenge and then querying the change status from R53.
R53 seems to use Dynamo to keep track of the syncing of the DNS across the name servers, because while the record was there and resolving, the change set was stuck in PENDING.
After DynamoDB came back up, R53's API started working.
I'm not affiliated and won't be compensated in any way for saying this: Hetzner are the best business partners ever. Their service is rock solid, their pricing is fair, their support is kind and helpful.
Going forward I expect American companies to follow this European vibe, it's like the opposite of enshitification.
I don't know how often Hetzner has similar outages, but outages at the rack and server level, including network outages and device failure happen for individual customers. If you've never experienced this, it is probably just survivor's bias.
Aws/cloud has similar outages too, but more redundancy and automatic failover/migrations that are transparent to customers happen. You don't have to worry about DDOS and many other admin burdens either.
YMMV, I'm just saying sometimes Aws makes sense, other times Hetzner does.
Stop making things up. As someone who commented on the thread in favour of AWS, there is almost no mention of better uptime in any comment I could find.
I could find one or two downvoted or heavily critisized comments, but I can find more people mentioning the opposite.
It's less about company loyalty and more about protecting their investment into all the buzzwords from their resumes.
As long as the illusion that AWS/clouds are the only way to do things continues, their investment will keep being valuable and they will keep getting paid for (over?)engineering solutions based on such technologies.
The second that illusion breaks down, they become no better than any typical Linux sysadmin, or teenager ricing their Archlinux setup in their homelab.
I'm a tech worker, and have been paid by a multi-billion dollar company to be a tech worker since 2003.
Aside from Teams and Outlook Web, I really don't interact with Microsoft at all, haven't done since the days of XP. I'm sure there is integration on our corporate backends with things like active directory, but personally I don't have to deal with that.
Teams is fine for person-person instant messaging and video calls. I find it terrible for most other functions, but fortunately I don't have to use it for anything other than instant messaging and video calls. The linux version of teams still works.
I still hold out a healthy suspicion of them from their behaviour when I started in the industry. I find it amusing the Microsoft fanboys of the 2000s with their "only needs to work in IE6" and "Silverlight is the future" are still having to maintain obsolete machines to access their obsolete systems.
Meanwhile the stuff I wrote to be platform-agnostic 20 years ago is still in daily use, still delivering business benefit, with the only update being a change from "<object" to "<video" on one internal system when flash retired.
AWS and Cloudflare are HN darlings. Go so far as to even suggest a random personal blog doesn't need Cloudflare and get downvoted with inane comments as "but what about DDOS protection?!"
The truth is one under the age of 35 is able to configure a webserver any more, apparently. Especially now that static site generators are in vogue and you don't even need to worry about php-fpm.
Well, we have a naming issue (Hetzner also has Hetzner Cloud, it looks people still equal cloud with the three biggest public cloud providers).
In any case, in order for this to happen, someone would have to collect reliable data (not all big cloud providers like to publish precise data, usually they downlplay the outages and use weasel words like "some customers... in some regions... might have experienced" just not to admit they had an outage) and present stats comparing the availability of Heztner Cloud vs the big three.
May be because of this that trying to pay with PayPal on Lenovo's website has failed thrice for me today? Just asking... Knowing how everything is connected nowadays it wouldn't surprise me at all.
My site was down for a long time after they claimed it was fixed. Eventually I realized the problem lay with Network Load Balancers so I bypassed them for now and got everything back up and running.
I can not login to my AWS account. And, the "my account" on regular amazon website is blank on Firefox, but opens on Chrome.
Edit: I can login into one of the AWS accounts (I have a few different ones for different companies), but my personal which has a ".edu" email is not logging in.
LOL making one db service a central point of failure, charge gold for small compute instances. Rage about needing Multi AZ, make the costs come onto the developer/organization. But, now fail on a region level, so are we going to now have multi-country setup for simple small applications?
I thought it was a pretty well-known issue that the rest of AWS depends on us-east-1 working. Basically any other AWS region can get hit by a meteor without bringing down everything else – except us-east-1.
It means that in order to be certified you have to use providers that in turn are certified or you will have to prove that you have all of your ducks in a row and that goes way beyond certain levels of redundancy, to the point that most companies just give up and use a cloud solution because they have enough headaches just getting their internal processes aligned with various certification requirements.
Medical, banking, insurance to name just a couple are heavily regulated and to suggest that it 'just means certain levels of redundancy' is a very uninformed take.
It is definitely not true that only big companies can do this. It is true that every regulation added adds to the power of big companies, which explains some regulation, but it is definitely possible to do a lot of things yourself and evidence that you've done it.
What's more likely for medical at least is that if you make your own app, that your customers will want to install it into their AWS/Azure instance, and so you have to support them.
I am the CEO of the company and started it because I wanted to give engineering teams an unbreakable cloud. You can mix-n-match services of ANY cloud provider, and workloads failover seamlessly across clouds/on-prem environments.
"Oct 20 2:01 AM PDT We have identified a potential root cause for error rates for the DynamoDB APIs in the US-EAST-1 Region. Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1..."
A physical on-air broadcast station, not a web stream? That likely violates a license; they're required to perform station identification on a regular basis.
Of course if they had on-site staff it wouldn't be an issue (worst case, just walk down to the transmitter hut and use the transmitter's aux input, which is there specifically for backup operations like this), but consolidation and enshittification of broadcast media mean there's probably nobody physically present.
Yeah, real over the air fm radio. This particular station is a Jack one owned by iHeart; they don't have DJs. Probably no techs or staff in the office overnight.
Btw. we had a forced EKS restart last week on thursday due to Kubernetes updates. And something was done with DNS there. We had problems with ndots. Caused some trouble here. Would not be surprised, if it is related, heh.
My ISP's DNS servers were inaccessible this morning. Cloudflare and Google's DNS servers have all been working fine, though: 1.1.1.1, 1.0.0.1, and 8.8.8.8
So, uh, over the weekend I decided to use the fact that my company needs a status checker/page to try out Elixir + Phoenix LiveView, and just now I found out my region is down while tinkering with it and watching Final Destination. That’s a little too on the nose for my comfort.
Half the internet goes down because part of AWS goes down... what happened to companies having redundant systems and not having a single point of failure?
Maybe unrelated, but yesterday I went to pick up my package from an Amazon Locker in Germany, and the display said "Service unavailable". I'll wait until later today before I go and try again.
I wonder why a package locker needs connectivity to give you a package. Since your package can't be withdrawn again from a different location, partitioning shouldn't be an issue.
Generally speaking, it's easier to have computation (logic, state, etc.) centralized. If the designers didn't prioritize scenarios where decentralization helped, then centralization would've been the better option.
I was just about to post that it didn't affect us (heavy AWS users, in eu-west-1). Buut, I stopped myself because that was just massively tempting fate :)
Happened to be updating a bunch of NPM dependencies and then saw `npm i` freeze and I'm like... ugh what did I do. Then npm login wasn't working and started searching here for an outage, and wala.
That depends on a lot of factors, but for me personally, yes it is. Much worse.
Assuming we’re talking about hosting things for Internet users. My fiber internet connection has gone down multiple times, though relatively quickly restored. My power has gone out several times in the last year, with one storm having it out for nearly 24 hrs. I was sleep when it went out and I didn’t start the generator until it was out for 3-4 hours already, far longer than my UPSes could hold up. I’ve had to do maintenance and updates both physical and software.
All of those things contribute to a downtime significantly higher than I see with my stuff running on Linode, Fly.io or AWS.
I run Proxmox and K3s at home and it makes things far more reliable, but it’s also extra overhead for me to maintain.
Most or all of those things could be mitigated at home, but at what cost?
I've been dabbling in this since 1998, It's almost always ISP and power outages that get you. There are ways to mitigate those (primary/secondary ISPs, UPSes, and generators) but typically unless you're in a business district area of a city, you'll just always be subject to problems
So for me, extremely anecdotally, I host a few fairly low-importance things on a home server (which is just an old desktop computer left sitting under a desk with Ubuntu slapped on it): A VPN (WireGuard), a few Discord bots, a Twitch bot + some auth stuff, and a few other services that I personally use.
These are the issues I've ran into that have caused downtime in the last few years:
- 1x power outage: if I had set up restart on power, probably would have been down for 30-60 minutes, ended up being a few hours (as I had to manually press the power button lol). Probably the longest non-self-inflicted issue.
- Twitch bot library issues: Just typical library bugs. Unrelated to self-hosting.
- IP changes: My IP actually barely ever changes, but I should set up DDNS. Fixable with self-hosting (but requires some amount of effort).
- Running out of disk space: Would be nice to be able to just increase it.
- Prooooooobably an internet outage or two, now that I think about it? Not enough that it's been a serious concern, though, as I can't think of a time that's actually happened. (Or I have a bad memory!)
I think that's actually about it. I rely fairly heavily on my VPN+personal cloud as all my notes, todos, etc are synced through it (Joplin + Nextcloud), so I do notice and pay a lot of attention to any downtime, but this is pretty much all that's ever happened. It's remarkable how stable software/hardware can be. I'm sure I'll eventually have some hardware failure (actually, I upgraded my CPU 1-2 years ago because it turns out the Ryzen 1700 I was using before has some kind of extremely-infrequent issue with Linux that was causing crashes a couple times a month), but it's really nice.
To be clear, though, for an actual business project, I don't think this would be a good idea, mainly due to concerns around residential vs commercial IPs, arbitrary IPs connecting to your local network, etc that I don't fully pay attention to.
Unanswerable question. Better to perform a failure mode analysis. That rack in your basement would need redundant power (two power companies or one power company and a diesel generator which typically won't be legal to have at your home), then redundant internet service (actually redundant - not the cable company vs the phone company that underneath use the same backhaul fiber).
Maybe actually making the interviews less of a hazing ritual would help.
Hell, maybe making today's tech workplace more about getting work done instead of the series of ritualistic performances that the average tech workday has degenerated to might help too.
Ergo, your conclusion doesn't follow from your initial statements, because interviews and workplaces are both far more broken than most people, even people in the tech industry, would think.
Well it looks like if companies and startups did their job in hiring the proper distributed systems skills more rather than hazing for the wrong skills we wouldn't be in this outage mess.
Many companies on Vercel don't think to have a strategy to be resilient to these outages.
I rarely see Google, Ably and others serious about distributed systems being down.
Does anyone know if having Global Accelerator set up would help right now? It's in the list of affected services, I wonder if it's useful in scenarios like this one.
I seem to recall other issues around this time in previous years. I wonder if this is some change getting shoe-horned in ahead of some reinvent release deadline...
Thing is us-east-1 the primary region for many services of AWS. DynamoDB is a very central offering used by many service. And the issue that has happend is very common[^1].
I think no matter how hard you try to avoid it, in the end there's always a massive dependency chain for modern digital infrastructure[^2].
Can confirm. I was trying to send the newsletter (with SES) and it didn't work. I was thinking my local boto3 was old, but I figured I should check HN just in case.
It's "DNS" because the problem is that at the very top of the abstraction hierarchy in any system is a bit of manual configuration.
As it happens, that naturally maps to the bootstrapping process on hardware needing to know how to find the external services it needs, which is what "DNS" is for. So "DNS" ends up being the top level of manual configuration.
But it's the inevitability of the manual process that's the issue here, not the technology. We're at a spot now where the rest of the system reliability is so good that the only things that bring it down are the spots where human beings make mistakes on the tiny handful of places where human operation is (inevitably!) required.
> hardware needing to know how to find the external services it needs, which is what "DNS" is for. So "DNS" ends up being the top level of manual configuration.
DHCP can only tell you who the local DNS server is. That's not what's failed, nor what needs human configuration.
At the top of the stack someone needs to say "This is the cluster that controls boot storage", "This is the IP to ask for auth tokens", etc... You can automatically configure almost everything but there still has to be some way to get started.
1) GDPR is never enforced other than token fines based on technicalities. The vast majority of the cookie banners you see around are not compliant, so it the regulation was actually enforced they'd be the first to go... and it would be much easier to go after those (they are visible) rather than audit every company's internal codebases to check if they're sending data to a US-based provider.
2) you could technically build a service that relies on a US-based provider while not sending them any personal data or data that can be correlated with personal data.
Read my post again. You can go to any website and see evidence of their non-compliance (you don't have to look very hard - they generally tend to push these in your face in the most obnoxious manner possible).
You can't consider a regulation being enforced if everyone gets away with publishing evidence of their non-compliance on their website in a very obnoxious manner.
In this moments I think devs should invest in vendor independence if they can. While I'm not to that stage yet (cloudlfare dependence) using open technologies like docker (or Kubernetes), Traefik instead of managed services can help in this disaster situations by switching to a different provider in a faster way than having to rebuild from zero.
as a disclosure I'm not still to that point on my infrastructure But I'm trying to slowly define one for my self
I missed a parcel delivery because a computer server in Virginia, USA went down, and now the doorbell on my house in England doesn't work. What. The. Fork.
How the hell did Ring/Amazon not include a radio-frequency transmitter for the doorbell and chime? This is absurd.
To top it off, I'm trying to do my quarterly VAT return, and Xero is still completely borked, nearly 20 hours after the initial outage.
In us-east-1? That doesn't sound that impactful, have always heard that us-east-1's network is a ring.
Back before AWS provided transparency into AZ assignments, it was pretty common to use latency measurements to try and infer relative locality and mappings of AZs available to an account.
AWS CodeArtifact can act as a proxy and fetch new packages from npm when needed. A bit late for that though but sharing if you want to future proof against the yearly us-east-1 outage
> And everybody starting a hosting company is definitely a profit driven activity.
Absolutely, nobody was doing it out of charity, but there is more diversity in the market and thus more innovation and then the market decides. Right now we have 3 major providers, and that makes up the lion's share. That's consolidation of a service. I believe that's not good for the market or the internet as a whole.
It should be! When I was a complete newbie at AWS my first question was why do you have to pick a region, I thought the whole point was you didn't have to worry about that stuff
One might hope that this, too, would be handled by the service. Send the traffic to the closest region, and then fallback to other regions as necessary. Basically, send the traffic to the closest region that can successfully serve it.
But yeah, that's pretty hard and there are other reasons customers might want to explicitly choose the region.
I wonder how much better the uptime would be if they made a sincere effort to retain engineering staff.
Right now on levels.fyi, the highest-paying non-managerial engineering role is offered by Oracle. They might not pay the recent grads as well as Google or Microsoft, but they definitely value the principal engineers w/ 20 years of experience.
Seems like we need more anti-trust cases on AWS or need to break it down, it is becoming too big. Services used in rest of the world get impacted by issues in one region.
But they aren't abusing their market power, are they? I mean, they are too big and should definitely be regulated but I don't think you can argue they are much of a monopoly when others, at the very least Google, Microsoft, Oracle, Cloudflare (depending on the specific services you want) and smaller providers can offer you the same service and many times with better pricing. Same way we need to regulate companies like Cloudflare essentially being a MITM for ~20% of internet websites, per their 2024 report.
One of the open secrets of AWS is that even though AWS has a lot of regions and availability zones, a lot of AWS services have control planes that are dependent on / hosted out of us-east-1 regardless of which region / AZ you're using, meaning even if you are using a different availability zone in a different region, us-east-1 going down still can mess you up.
It's fun to see SRE jumping left and right when they can do basically nothing at all.
"Do we enable DR? Yes/No". That's all you can do. If you do, it's a whole machinery starting, which might take longer than the outage itself.
They can't even use Slack to communicate - messages are being dropped/not sent.
And then we laugh at the South Koreans for not having backed up their hard drives (which got burnt by actual fire, a statistically way less occurring event than an AWS outage). OK that's a huge screw up, but hey, this is not insignificant either.
What will happen now? Nothing, like nothing happened after Crowdstrike's bug last year.
It shouldn’t, but it does. As a civilization, we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive. So everything is resting on a giant pile of single point of failures.
Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.
No shot that happens until an outage breaks at least an entire workday in the US timezones. The only complaint I personally heard was from someone who couldn't load reddit on the train to work.
Well by the time it really happens for a whole day Amazon leadership will be brazen enough to say "OK, enough of this my site is down, we will call back once systems are up so don't bother for a while". Also maybe responsible human engineers would fired by then and AI can be infinitely patient while working through insolvable issues.
> Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.
Doesn't help either. us-east-1 hosts the internal control plane of AWS and a bunch of stuff is only available in us-east-1 at all - most importantly, Cloudfront, AWS ACM for Cloudfront and parts of IAM.
And the last is the one true big problem. When IAM has a sniffle, everything else collapses because literally everything else depends on IAM. If I were to guess IAM probably handles millions if not billions of requests a second because every action on every AWS service causes at least one request to IAM.
The last re:Invent presentation I saw from one of the principals working on IAM quoted 500 million requests per second. I expect that’s because IAM also underpins everything inside AWS, too.
IAM, hands down, is one of the most amazing pieces of technology there is.
The sheer volume is one thing, but... IAM's policy engine, that's another thing. Up to 5000 different roles per account, dozens of policies that can have an effect on any given user entity and on top of that you can also create IAM policies that blanket affect all entities (or only a filtered subset) in an account, and each policy definition can be what, 10 kB or so, in size. Filters can include multiple wildcards everywhere so you can't go for a fast-path in an in-memory index, and they can run variables with on-demand evaluation as well.
And all of that is reachable not on an account-specific endpoint that could get sharded from a shared instance should the load of one account become too expensive, no, it's a global (and region-shared) endpoint. And if that weren't enough, all calls are shipped off to CloudTrail's event log, always, with full context cues to have an audit and debug trail.
To achieve all that in a service quality that allows for less than 10 seconds worth of time before a change in an IAM policy becomes effective and milliseconds of call time is nothing short of amazing.
It's not simple, that's the point! The filter rules and ways to combine rules and their effects are highly complex. The achievement is how fast it is _despite_ network being involved on at least two hops - first service to IAM and then IAM to database.
I think it's simple. It's just a stemming pattern matching tree, right?
The admin UX is ... awkward and incomplete at best. I think the admin UI makes the service appear more complex than it is.
The JSON representation makes it look complicated, but with the data compiled down into a proper processable format, IAM is just a KVS and a simple rules engine.
Not much more complicated than nginx serving static files, honestly.
(Caveat: none of the above is literally simple, but it's what we do every day and -- unless I'm still missing it -- not especially amazing, comparatively).
IAM policies can have some pretty complex conditions that require it to sync to other systems often. Like when a tag value is used to allow devs access to all servers with the role:DEV tag.
In my (imagined) architecture, the auth requester sends the asset attributes (including tags in this example) with the auth request, so the auth service doesn't have to do any lookup to other systems. Updates are pushed in a message queue style manner, policy tables are cached and eventually consistent.
It is kind of the child of what used to be called Catastrophe Theory, which in low dimensions is essentially a classification of folding of manifolds. Now the systems are higher dinemsional and the advice more practical/heuristic.
There are plenty of ways to address this risk. But the companies impacted would have to be willing to invest in the extra operational cost and complexity. They aren’t.
Let's be nice. I'm sure devs and ops are on fire right now, trying to fix the problems. Given the audience of HN, most of us could have been (have already been?) in that position.
Affecting Coinbase[1] as well, which is ridiculous. Can't access the web UI at all. At their scale and importance they should be multi-region if not multi-cloud.
Seems the underlying issue is with DynamoDB, according to the status page, which will have a big blast radius in other services. AWS' services form a really complicated graph and there's likely some dependency, potentially hidden, on us-east-1 in there.
I get the impression that this has been thought about to some extent, but its a constantly changing architecture with new layers and new ideas being added, so for every bit of progress there's the chance of new Single Points Of Failure being added. This time it seems to be a DNS problem with DynamoDB
Nah, because European services should not be affected by a failure in the US. Whatever systems they have running in us-east-1 should have failovers in all major regions. Today it's an outage in Virginia, tomorrow it could be an attack on undersea cables (which I'm confident are mined and ready to be severed at this point by multiple parties).
Well, except for a lot of business leaders saying that they don't care if it's Amazon that goes down, because "the rest of the internet will be down too."
Dumb argument imho, but that's how many of them think ime.
Surprising and sad to see how many folks are using DynamoDB
There are more full featured multi-cloud options that don't lock you in and that don't have the single point of failure problems.
And they give you a much better developer experience...
It might be an interesting exercise to map how many of our services depend on us-east-1 in one way or another. One can only hope that somebody would do something with the intel, even though it's not a feature that brings money in (at least from business perspective).
My app deployed on Vercel and therefore indirectly deployed on us-east-1 was down for about 2 hours today then came back up and then went down again 10 minutes ago for 2 or 3 minutes. It seems like they are still intermittent issues happening.
Sounds like a circular error with monitoring is flooding their network with metrics and logs, causing DNS to fail and produce more errors, flooding the network. Likely root cause is something like DNS conflicts or hosts being recreated on the network. Generally this is a small amount of network traffic but the LBs are dealing with host address flux, causing the hosts to keep colliding host addresses as they attempt to resolve to a new host address which are being lost from dropped packets and with so many hosts in one AZ, there's a good chance they end up with a new conflicting address.
I in-housed an EMR for a local clinic because of latency and other network issues taking the system offline several times a month (usually at least once a week). We had zero downtime the whole first year after bringing it all in house, and I got employee of the month for several months in a row.
Paying for resilience is expensive. not as expensive as AWS, but it's not free.
Modern companies live life on the edge. Just in time, no resilience, no flexibility. We see the disaster this causes whenever something unexpected happens - the Evergiven blocking Suez for example, let alone something like Covid
However increasingly what should be minor loss of resilience, like an AWS outage or a Crowdstrike incident, turns into major failures.
This fragility is something government needs to legislate to prevent. When one supermarket is out that's fine - people can go elsewhere, the damage is contained. When all fail, that's a major problem.
On top of that, the attitude that the entire sector has is also bad. People thing IT should tail once or twice a year and it's not a problem. If that attitude affect truly important systems it will lead to major civil projects. Any civilitsation is 3 good meals away from anarchy.
There's no profit motive to avoid this, companies don't care about being offline for the day, as long as all their mates are also offline.
It's weird that we're living in a time where this could be a taste of a prolonged future global internet blackout by adversarial nations. Get used to this feeling I guess :)
"Pushed to use artificial intelligence, software developers at the e-commerce giant say they must work faster and have less time to think."
Every bit of thinking time spent on a dysfunctional, lying "AI" agent could be spent on understanding the system. Even if you don't move your mouse all the time in order to please a dumb middle manager.
Ensure your single point of failure risk is appropriate for your business. I don't have full resilience for my companies AS going down, but we do have limited DR capability. Same with the loss of a major city or two.
I'm not 100% confident in a Thames Barrier flood situation, as I suspect some of our providers don't have the resilience levels we do, but we'd still be able to provide some minimal capability.
Not having control or not being responsible are perhaps major selling points of cloud solutions. To each their own, I also rather have control than having to deal with a cloud provider support as a tiny insignificant customer. But in this case, we can take a break and come back once it's fixed without stressing.
Many businesses ARE fully vertically integrated. And many make real stuff, in meat space, where it's 100,000x harder. But software companies can't do it?
Obviously there's pros and cons. One of the pros being that you're so much more resilient to what goes on around you.
Name one commercial company that is entirely/fully vertically integrated and can indefinitely continue business operations 100% without any external suppliers.
It's a sliding scale - eventually you rely on land. But in n out is close. And they make burgers, which is much more difficult than making software. Yes, I'm being serious.
But if you look at open source projects, many are close to perfectly verifically integrated.
There's also a big big difference between relying on someone's code and relying on someone's machines. You can vender code - you, however, rely on particular machines being up and connected to the internet. Machines you don't own and you aren't allowed to audit.
You said "Many businesses ARE fully vertically integrated." so why name one that is close to fully vertically integrated, just name one of the many others that are fully vertically integrated. I don't really care about discussing things which prove my point instead of your point as if they prove your point.
> open source projects, many are close to perfectly verifically integrated
Comparing code to services seems odd, not sure how GitLab the software compares to GitLab the service for example. Code is just code, a service requires servers to run on, etc. GitLab the software can't have uptime because it's just code. It can only have an uptime once someone starts running it, at which point you can't attribute everything to the software anymore as the people running it have a great deal of responsibility for how well it runs, and even then, even if GitLab the software would have been "close to perfectly vertically integrated" (like if they used no OS, as if anyone would ever want that), then the GitLab serivice still needs many things from other suppliers to operate.
And again, "close to perfectly verifically integrated" is not "perfectly verifically integrated".
If you are wrong, and in fact nothing in our modern world is fully vertically integrated as I said, then it's best to just admit that and move on from that and continue discussing reality.
Allowing them to not take responsibility is an enabler for unethical business practices. Make businesses accountable for their actions, simple as that.
How are they not accountable though? Is Docker not accountable for their outage that follows as a consequence? How should I make them accountable? I don't have to do shit here, the facts are what they are and the legal consequences are what they are. Docker gives me a free service and free software, they receive 0 dollars from me, I think the deal I get is pretty fucking great.
Okay, let's discard Docker, take all other companies. How are they not accountable? How should I make them accountable? I don't have to do shit here, the facts are what they are, and the legal consequences are what they are. They either are legally accountable or not. Nothing I need to do to make that a reality.
If a company sold me a service with guaranteed uptime, I'd expect the guaranteed uptime or expect a compensation in case they cant keep up with their promises.
Honestly anyone do have outages, that's nothing extraordinary, what's wrong is the number of impacted services. We choose (at least almost choose) to ditch mainframes for clusters also for resilience. Now with cheap desktop iron labeled "stable enough to be a serious server" we have seen mainframes re-created sometimes with a cluster of VM on top of a single server, sometimes with cloud services.
Ladies and Gentleman's it's about time to learn reshoring in the IT world as well. Owning nothing, renting all means extreme fragility.
imagine spending millions on devops and sre to still have your mission critical service go down because amazon still has baked in regional dependencies
This reminds me of the twitter-based detector we had at Facebook that looked for spikes in "Facebook down" messages.
When Facebook went public, the detector became useless because it fired anytime someone wrote about the Facebook stock being down and people retweeted or shared the article.
I invested just enough time in it to decide it was better to turn it off.
Major us-east-1 outages happened in 2011, 2015, 2017, 2020, 2021, 2023, and now again. I understand that us-east-1, N. VA, was the first DC but for fucks sake they've had HOW LONG to finish AWS and make us-east-1 not be tied to keeping AWS up.
First, not all outages are created equal, so you cannot compare them like that.
I believe the 2021 one was especially horrific because of it affecting their dns service (route53) and the outage made writes to that service impossible. This made fail overs not work etcetera so their prescribed multi region setups didn't work.
But in the end, some things will have to synchronizes their writes somewhere, right? So for dns I could see how that ends up in a single region.
AWS is bound by the same rules as everyone else in the end... The only thing they have going for them that they have a lot of money to make certain services resilient, but I'm not aware of a single system that's resilient to everything.
If AWS fully decentralized its control planes, they’d essentially be duplicating the cost structure of running multiple independent clouds and I understand that is why they don't however as long as AWS is reliant upon us-east-1 to function, they have not achieved what they claim to me. A single point of failure for IAM? Nah, no thanks.
Every AWS “global” service be it IAM, STS, CloudFormation, CloudFront, Route 53, Organizations, they all have deep ties to control systems originally built only in us-east-1/n. va.
That's poor design, after all these years. They've had time to fix this.
Until AWS fully decouples the control plane from us-east-1, the entire platform has a global dependency. Even if your data plane is fine, you still rely on IAM and STS for authentication and maybe Route 53 for DNS or failover CloudFormation or ECS for orchestration...
If any of those choke because us-east-1’s internal control systems are degraded, you’re fucked. That’s not true regional independence.
You can only decentralized your control plane if you don't have conflicting requirements?
Assuming you cannot alter requirements or SLAs, I could see how their technical solutions are limited. It's possible, just not without breaking their promises. At that point it's no longer a technical problem
In the narrow distributed-systems sense? Yes, however those requirements are self-imposed. AWS chose strong global consistency for IAM and billing... they could loosen it at enormous expense.
The control plane must know the truth about your account and that truth must be globally consistent. That’s where the trouble starts I guess.
I think my old-school system admin ethos is just different than theirs. It's not a who's wrong or right, just a difference in opinions on how it should be done I guess.
The ISP I work for requires us to design in a way that no single DC will cause a point of failure, just difference in design methods and I have to remember the DC I work in is completely differently used than AWS.
In the end however, I know solutions for this exist (federated ledgers, CRDT-based control planes, regional autonomy but they’re just expensive and they don’t look good on quarterly slides), it just takes the almighty dollar to implement and that goes against big business, if it "works" it works, I guess.
AWS’s model scales to millions of accounts because it hides complexity, sure but the same philosophy that enables that scale prevents true decentralization. That is shit. I guess people can architect as if us-east-1 can disappear so that things can continue on, but then thats AWS causing complexity in your code. They are just shifting who is shouldering that little known issue.
This is the reason why it is important to plan Disaster recovery and also plan Multi-Cloud architectures.
Our applications and databases must have ultra high availability. It can be achieved with applications and data platforms hosted on different regions for failover.
Critical businesses should also plan for replication across multiple cloud platforms.
You may use some of the existing solutions out there that can help with such implementations for data platforms.
- Qlik replicate
- HexaRocket
and some more.
Or rather implement native replication solutions available with data platforms.
Interesting day. I've been on an incident bridge since 3AM. Our systems have mostly recovered now with a few back office stragglers fighting for compute.
The biggest miss on our side is that, although we designed a multi-region capable application, we could not run the failover process because our security org migrated us to Identity Center and only put it in us-east-1, hard locking the entire company out of the AWS control plane. By the time we'd gotten the root credentials out of the vault, things were coming back up.
Good reminder that you are only as strong as your weakest link.
This reminds me of the time that Google’s Paris data center flooded and caught on fire a few years ago. We weren’t actually hosting compute there, but we were hosting compute in AWS EU datacenter nearby and it just so happened that the dns resolver for our Google services elsewhere happened to be hosted in Paris (or more accurately it routed to Paris first because it was the closest). The temp fix was pretty fun, that was the day I found out that /etc/hosts of deployments can be globally modified in Kubernetes easily AND it was compelling enough to want to do that. Normally you would never want to have an /etc/hosts entry controlling routing in kube like this but this temporary kludge shim was the perfect level of abstraction for the problem at hand.
> temporary kludge shim was the perfect level of abstraction for the problem at hand.
Thats some nice manager deactivating jargon.
Manager deactivating jargon is a great phrase - it’s broadly applicable and also specific.
Yeah that sentence betrays my BigCorp experience it’s pulling from the corporate bullshit generator for sure
[dead]
+1...hee hee
Couldn't you just patch your coredns deployment to specify different forwarders?
Probably. This was years ago so the details have faded but I do recall that we did weigh about 6 different valid approaches of varying complexity in the war room before deciding this /etc/hosts hack was the right approach for our situation
This is the en of the thread of the first comment. Now i can find below the second comment
I remember Facebook had a similar story when they botched their BGP update and couldn't even access the vault. If you have circular auth, you don't have anything when somebody breaks DNS.
Wasn't there an issue where they required physical access to the data center to fix the network, which meant having to tap in with a keycard to get in, which didn't work because the keycard server was down, due to the network being down?
Wishful thinking, but I hope an engineer somewhere got to ram a door down to fix a global outage. For the stories.
Way back when I worked at eBay, we once had a major outage and needed datacenter access. The datacenter process normally took about 5 minutes per person to verify identity and employment, and then scan past the biometric scanners.
On that day, the VP showed up and told the security staff, "just open all the doors!". So they did. If you knew where the datacenter was, you could just walk-in in mess with eBay servers. But since we were still a small ops team, we pretty much knew everyone who was supposed to be there. So security was basically "does someone else recognize you?".
> So security was basically "does someone else recognize you?"
I actually can't think of a more secure protocol. Doesn't scale, though.
Well, you put a lot of trust in the individuals in this case. A disgruntled employee can just let the bad guys in on purpose, saying "Yes they belong here".
That works until they run into a second person. In a big corp where people don't recognize each other you can also let the bad guys in, and once they're in nobody thinks twice about it.
Vulnerable to byzantine fault.
I would imagine this is how it works for the President and Cabinet
way back when DC's were secure but not _that secure_ i social engineered my way close enough to our rack without ID to hit a reset button before getting thrown out.
/those were the days
Oh I've definitely done that. They had remote hands but we were over our rack limit and we didn't want them to see inside.
The early oughts were a different time.
Just to test the security, or...?
late reply but, no, i really needed to hit the button but didn't have valid ID at the time. My driver's license was expired and i couldn't get it renewed because of a outstanding tickets iirc. I was able to talk my way in and had been there many times before so knew my way around and what words to say. I was able to do what i needed before another admin came up and told me that without valid ID they have no choice but to ask me to leave (probably like an insurance thing). I was being a bit dramatic when i said "getting thrown out" the datacenter guys were very nice and almost apologetic about asking me to leave.
That sounds like an Equinix datacenter. They were painfully slow at 350 E. Cermak.
It wasn't Equinix, but I think the vendor was acquired by them. I don't actually blame them, I appreciated their security procedures. The five minutes usually didn't matter.
I was in a datacenter when the fire alarm went off and all door locks were automatically disabled.
Lmao, so unathorized access on demand by pulling the fire alarm?
There's some computer lore out there about someone tripping a fire alarm by accident or some other event that triggered a gas system used to put out fires without water but isn't exactly compatible with life. The story goes some poor sys admin had to stand there with their finger on like a pause button until the fire department showed up to disarm the system. If they released the button the gas would flood the whole DC.
Don't ask about fire power switch
Essentially yes. They should really divide data centers into zones and only unlock doors inside a zone where smoke is detected.
> They should really divide data centers into zones and only unlock doors inside a zone where smoke is detected.
just make sure the zone based door lock/unlock system isn't on AWS ;)
Because surely every smoke detector will work while the building is burning down…
most data centers are made out of concrete and isolate fires.
My point is that while the failure rate may be low the failure method is dude burns to death in a locked server room. Even classified room protocols place safety of personnel over safety of data in an emergency.
The story was that they had to use an angle grinder to get in.
I remember hearing Google early in it's history had some sort of emergency back up codes that they encased in concrete to prevent them becoming a casual part of the process and they needed a jack hammer and a couple hours when the supposedly impossible happened after only a couple years.
Not quite; you're probably thinking of: https://google.github.io/building-secure-and-reliable-system...
> To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.
Classic.
In my first job I worked on ATM software, and we had a big basement room full of ATMs for test purposes. The part the money is stored in is a modified safe, usually with a traditional dial lock. On the inside of one of them I saw the instructions on how to change the combination. The final instruction was: "Write down the combination and store it safely", then printed in bold: "Not inside the safe!"
That's a wonderful read, thanks for that.
This is how John Wick did it. He buried his gold and weapons in his garage and poured concrete over it.
It only worked for Wick because he is a man of focus, commitment, and sheer will.
He’s not the bogey man. He’s the one you send to kill the fucking bogeyman.
Hooked from that moment! The series got progressively more ridiculous but what a start!
The bulletproof suits were very stylish though! So much fun.
This is the way.
There is a video from the lock pick lawyer where he receives a padlock in the mail with so much tape that it takes him whole minutes to unpack.
Concrete is nice, other options are piles of soil or brick in front of the door. There probably is a sweet spot where enough concrete slows down an excavator and enough bricks mixed in the soil slows down the shovel. Extra points if there is no place nearby to dump the rubble.
Probably one of those lost in translation or gradual exaggeration stories.
If you just wanted recovery keys that were secure from being used in an ordinary way you can use Shamir to split the key over a couple hard copies stored in safety deposit boxes a couple different locations.
Louvre gang decides they can make more money contracting to AWS.
The Data center I’m familiar with uses cards and biometrics but every door also has a standard key override. Not sure who opens the safe with the keys but that’s the fallback in case the electronic locks fail.
I prefer to use a sawzall and just go through the wall.
The memory is hazy since it was 15+ years ago, but I'm fairly sure I knew someone who worked at a company whose servers were stolen this way.
The thieves had access to the office building but not the server room. They realized the server room shared a wall with a room that they did have access to, so they just used a sawzall to make an additional entrance.
my across the street neighbor had some expensive bikes stolen this way. The thieves just cut a hole in the side of their garage from the alley, security cameras were facing the driveway and with nothing on the alley side. We (the neighborhood) think they were targeted specifically for the bikes as nothing else was stolen and your average crack head isn't going to make that level of effort.
That would be a sawswall, in that case.
I assume they needed their own air supply because the automatic poison gas system was activating. Then they had to dodge lazers to get to the one button that would stop the nuclear missle launch.
add a bunch of other poinless scifi and evil villan lair tropes in as well...
Most datacenters are fairly boring to be honest. The most exciting thing likely to happen is some sheet metal ripping your hand open because you didn't wear gloves.
Still have my "my other datacenter is made of razorblades and hate" sticker. \o/
there are datacentres not made of razorblades and hate?
They do commonly have poisonous gas though.
Not sure if you’re joking but a relatively small datacenter I’m familiar with has reduced oxygen in it to prevent fires. If you were to break in unannounced you would faint or maybe worse (?).
Not quite - while you can reduce oxygen levels, they have to be kept within 4pp so at worst, will make you light headed. Many athletes train at the same levels though so it’s easy to overcome.
That'd make for a decent heist comedy - a bunch of former professional athletes get hired to break in to a low-oxygen data center, but the plan goes wrong and they have to use their sports skills in improbable ways to pull it off.
Halon was used back in the day for fire suppression but I thought it was only dangerous at high enough concentrations to suffocate you by displacing oxygen.
To be of any use, it also has to suffocate a fire.
Halon doesn't work that way, by displacing oxygen.
Flame chemistry is weird. Halogenated fire suppression agents work by making Hydrogen (!) out of free radicals.
https://www.nist.gov/system/files/documents/el/fire_research...
I had a summer job at a hospital one year in the data center when an electrician managed to trigger the halon system and we all had to evacuate and wait for the process to finish and the gas to vent. The four firetrucks and station master who shoved up was both annoyed and relieved it was not real.
And lasers come to think of it
No FM200 isn't poisonous.
tell that to my dead uncle jack :)
Not an active datacenter, but I did get to use a fire extinguisher to knock out a metal-mesh-reinforced window in a secure building once because no one knew where the keys were for an important room.
Management was not happy, but I didn’t get in trouble for it. And yes, it was awesome. Surprisingly easy, especially since the fire extinguisher was literally right next to it.
Sometimes a little good old fashioned mayhem is good for employee morale
Every good firefighter knows this feeling.
Nothing says ‘go ahead, destroy that shit’ like money going up in smoke if you don’t.
P.S. don’t park in front of fire hydrants, because they will have a shit eating grin on their face when they destroy your car- ahem - clear the obstacle - when they need to use it to stop a fire.
Yep. And their internal comms were on the same server if memory serves. They were also down.
I was there at the time, for anyone outside of the core networking teams it was functionally a snow day. I had my manager's phone number, and basically established that everyone was in the same boat and went to the park.
Core services teams had backup communication systems in place prior to that though. IIRC it was a private IRC on separate infra specifically for that type of scenario.
I remember working for a company who insisted all teams had to usr whatever corp instant messaging/chat app but our sysadmin+network team maintained a jabber server + a bunch of core documentation synchronized on a vps in a totally different infrastructure just in case and sure enough there was that a day it came handy.
AWS, for the ultimate backup, relies on a phone call bridge on the public phone network.
Ah, but have they verified how far down the turtles go, and has that changed since they verified it?
In the mid-2000s most of the conference call traffic started leaving copper T1s and going onto fiber and/or SIP switches managed by Level3, Global Crossing, Qwest, etc. Those companies combined over time into Century Link which was then rebranded Lumen.
As of last October, Lumen is now starting to integrate more closely with AWS, managing their network with AWS's AI: https://convergedigest.com/lumen-expands-fiber-network-to-su...
"Oh what a tangled web we weave..."
I once suggested at work that we list diesel distributors using payment infra not on on us near our datacenters.
Thanks for the correction, that sounds right. I thought I had remembered IRC but wasn't sure.
Yes for some insane reason facebook had EVERYTHING on a single network. The door access not working when you lose BGP routes is especially bad because normal door access systems cache access rules on the local door controllers and thus still work when they lose connectivity to the central server.
Depends. Some have a paranoid mode without caching, because then a physical attacker cannot snip a cable and then use a stolen keycard as easily or something. We had an audit force us to disable caching, which promptly went south at a power outage 2 months later where the electricians couldn't get into the switch room anymore. The door was easy to overcome, however, just a little fiddling with a credit card, no heroic hydraulic press story ;)
Auditors made you disable credential caching but missed the door that could be shimmed open..
Sounds like they earned their fee!
If you aren't going to cache locally than you need redundant access to the server like LTE access and plan for needing to unlock the doors if you lose access to the server.
This sounds similar to AWS services depending on DynamoDB, which sounds like what happened here. Even if under the hood parts of AWS depend on Dynamo, it should be a walled-off instance separate from Dynamo available via us-east-1.
There should be many more smaller instances with smaller blast radius.
Not to speak for the other poster, but yes, they had people experiencing difficulties getting into the data centers to fix the problems.
I remember seeing a meme for a cover of "Meta Data Center Simulator 2021" where hands were holding an angle grinder with rows of server racks in the background.
"Meta Data Center Simulator 2021: As Real As It Gets (TM)"
That's similar to the total outage of all Rogers services in Canada back on July 7th 2022. It was compounded by the fact that the outage took out all Rogers cell phone service, making it impossible for Rogers employees to communicate with each other during the outage. A unified network means a unified failure mode.
Thankfully none of my 10 Gbps wavelengths were impacted. Oh did I appreciate my aversion to >= layer 2 services in my transport network!
That's kind of a weird ops story, since SRE 101 for oncall is to not rely on the system you're oncall for to resolve outages in it. This means if you're oncall for communications of some kind, you must have some other independent means of reaching eachother (even if it's a competitor phone network)
That is heavily contingent on the assumption that the dependencies between services are well documented and understood by the people building the systems.
There is always that point you reach where someone has to get on a plane with their hardware token and fly to another data centre to reset the thing that maintains the thing that gives keys to the thing that makes the whole world go round.
So sick of billion dollar companies not hiring that one more guy
That is perhaps why they are billion dollar companies and why my company is not very successful.
Wow, you really *have* to exercise the region failover to know if it works, eh? And that confidence gets weaker the longer it’s been since the last failover I imagine too. Thanks for sharing what you learned.
The last place I worked actively switched traffic over to the backup nodes regularly (at least monthly) to ensure we could do it when necessary.
We learned that lesson by having to do emergency failovers and having some problems. :)
You should assume it will not work unless you test it regularly. That's a big part of why having active/active multi-region is attractive, even though it's much more complex.
That wouldn't have even caught that, most likely unless they verified they had no incidental tie ins with us-east-1.
> Identity Center and only put it in us-east-1
Is it possible to have it in multiple regions? Last I checked, it only accepted one region. You needed to remove it first if you wanted to move it.
Security people and ignoring resiliency and failure modes: a tale as old as time
Correct. That does make it a centralized failure mode and everyone is in the same boat on that.
I’m unaware of any common and popular distributed IDAM that is reliable
Not sure if this counts fully as 'distributed' here, but we (Authentik Security) help many companies self-host authentik multi-region or in (private cloud + on-prem) to allow for quick IAM failover and more reliability than IAMaaS.
There's also "identity orchestration" tools like Strata that let you use multiple IdPs in multiple clouds, but then your new weakest link is the orchestration platform.
> I’m unaware of any common and popular distributed IDAM that is reliable
Other clouds, lmao. Same requirements, not the same mistakes. Source: worked for several, one a direct competitor.
It's a good reminder actually that if you don't test the failover process, you have no failover process. The CTO or VP of Engineering should be held accountable for not making sure that the failover process is tested multiple times a month and should be seamless.
If you don’t regularly restore a backup, you don’t have one.
Too much armor makes you immobile. Will your security org be held to task for this? This should permanently slow down all of their future initiatives because it’s clear they have been running “faster than possible” for some time.
Who watches the watchers.
for what it's worth, we were unable to login with root credentials anyway
i don't think any method of auth was working for accessing the AWS console
Sure it was, you just needed to login to the console via a different regional endpoint. No problems accessing systems from ap-southeast-2 for us during this entire event, just couldn’t access the management planes that are hosted exclusively in us-east-1.
Like the other poster said, you need to use a different region. The default region (of course) sends you to us-east-1
e.x. https://us-east-2.console.aws.amazon.com/console/home
Totally ridiculous that AWS wouldn't by default make it multi-region and warn you heavily that your multi-region service is tied to a single region for identity.
The usability of AWS is so poor.
They don’t charge anything for Identity Center and so it’s not considered an important priority for the revenue counters.
I always find it interesting how many large enterprises have all these DR guidelines but fail to ever test. Glad to hear that everything came back alright
Sounds like a lot of companies need to update their BCP after this incident.
"If you're able to do your job, InfoSec isn't doing theirs"
People will continue to purchase Mutli-AZ and multi-region even though you have proved what a scam it is. If east region goes down, ALL amazon goes down, feel free to change my mind. STOP paying double rates for multi region.
[dead]
[dead]
This is having a direct impact on my wellbeing. I was at Whole Foods in Hudson Yards NYC and I couldn’t get the prime discount on my chocolate bar because the system isn’t working. Decided not to get the chocolate bar. Now my chocolate levels are way too low.
"alexa turn on coffee pot" stopped working this morning, and I'm going bonkers.
Alexa is super buggy now anyway. I switched my Echo Dot to Alexa+, and it fails turning on and off my Samsung TV all the time now. You usually have to do it twice.
This has been my impetus to do Home Assistant things and I already can tell you that I'm going to spend far more time setting it up and tweaking it than I actually save, but false economy is a tinkerer's best friend. It's pretty impressive what a local LLM setup can do though, and I'm learning that all my existing smart devices are trivially available if anyone gets physical access to my network I guess!
This is the kind of thing Claude Code (bypassing permissions) shines at. I‘m about to setup HA myself and intend to not write a single line of config myself.
Most of HA is configured in the gui these days, you won't need to write any config anyways.
Something I love about HA is that all thr gui can always be directly edited using yaml. So you can ask claude for a v1 then tweak it a bit then finish with the gui. And all of this directly from the gui.
Ugh. Reminds me that some time ago Siri stopped responding to “turn off my TV.” Now I have to remember to say “turn off my Apple TV.” (Which with the magic of HDMI CEC turns off my entire system.) Given how groggy I am when I want to turn off the TV, I often forget.
I just use a "Alexa goodnight" to trigger turning off the tv and lights
i agree. the new LLM is better for dialog and Q&A, but they haven't properly tested intents and IOT integration at all.
How can this be? I had great luck with GPT3 way back when… and I didn’t have function calling or chat… had to parse the JSON myself, extraction “action” and “response-text” fields… How has this been so hard for AMZN? Is it a matter of token cost and trying to use small models?
that's a reasonable theory. they've likely delayed the launch this long due to the inference cost compared to the more basic Alexa engine.
I would also guess the testing is incomplete. Alexa+ is a slow roll out so they can improve precision/recall on the intents with actual customers. Alexa+ is less deterministic than the previous model was wrt intents
I upgraded to the old Alexa. Alexa+ is a hot pile of crap.
Someone posting on HN should know better than using Alexa and Samsung TVs. These devices are a unique combination of malware and spyware.
I was attempting to use self checkout for some lunch I grabbed from the hotbar and couldn’t understand why my whole foods barcode was failing. It took me a full 20 seconds to realize the reason for the failure.
This is a fun example, but now you've got me wondering: has anyone checked on folks who might have been in an Amazon Go store during the outage?
Life indeed is a struggle
First World treatlerite problems. /s What's going to suck years after too many SREs/SWEs will have long been fired, like the Morlocks & Eloi and Idiocracy, there won't be anyone left who can figure out that plants need water. There will be a few trillionaires surrounded by aristocratic, unimaginable opulence while most of humanity toils in favelas surrounded by unfixable technology that seems like magic. One cargo cult will worship 5.25" floppy disks and their arch enemies will worship CD-Rs.
https://xkcd.com/2347/
We're getting awfully close to that scenario. Like frogs in a warming kettle.
0th world problems
I had to buy a donut and the gas station with cash, like a peasant.
That's it, internet centralization has gone too far, call your congress(wo)man
Have a meeting today with our AWS account team about how we’re no longer going to be “All in on AWS” as we diversify workloads away. Was mostly about the pace of innovation on core services slowing and AWS being too far behind on AI services so we’re buying those from elsewhere.
The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!
Once you've had an outage on AWS, Cloudflare, Google Cloud, Akismet. What are you going to do? Host in house? None of them seem to be immune from some outage at some point. Get your refund and carry on. It's less work for the same outcome.
Why not host in house? If you have an application with stable resource needs, it can often be the cheaper and more stable option. At a certain scale, you can buy the servers, hire a sysadmin, and still spend less money than relying on AWS.
If you have an app that experiences 1000x demand spikes at unpredictable times then sure, go with the cloud. But there are a lot of companies that would be better off if they seriously considered their options before choosing the cloud for everything.
Multi-cloud. It's fairly unlikely that AWS and Google Cloud are going to fail at the same time.
Yeah, just double++ the cost to have a clone of all your systems. Worth it if you need to guarantee uptime. Although, it also doubles your exposure to potential data breaches as well.
> double++
I'd suggest to ++double the cost. Compare:
++double: spoken as "triple" -> team says that double++ was a joke, we can obviously only double the cost -> embarrassingly you quickly agree -> team laughs -> team approves doubling -> you double the cost -> team goes out for beers -> everyone is happy
double++: spoken as "double" -> team quickly agrees and signs off -> you consequently triple the cost per c precedence rules -> manager goes ballistic -> you blithely recount the history of c precedence in a long monotone style -> job returns EINVAL -> beers = 0
Lol :)
And likely far more than double the cost since you have to use the criminally-priced outbound bandwidth to keep everything in sync.
Shouldn't be double in the long term. Think of the second cloud as a cold standby. Depends on the system. Periodic replication of data layer (object storage/database) and CICD configured to be able to build services and VMs on multiple clouds. Have automatic tests weekly/monthly that represent end-to-end functionality, have scaled tests semi-annually.
This is all very, very hand-wavey. And if one says "golly gee, all our config is too cloud specific to do multi-cloud" then you've figured out why cloud blows and that there is no inherent reason not to have API standards for certain mature cloud services like serverless functions, VMs and networks.
Edit to add- ink ow how grossly simplified this is, and that most places have massively complex systems.
And data egress fees just to get the clone set up, right? This doesn’t seem feasible as a macrostrategy. Maybe for a small number of critical services.
How do you handle replication lag for databases?
If you use something like cockroachdb you can have a multi-master cluster and use regional-by-row tables to locate data close to users. It'll fail over fine to other regions if needed.
Certainly if you aren't even multi-region, then multi-cloud is a pipe dream
> What are you going to do? Host in house?
Yep. Although it's just anecdata, it's what we do where I work - haven't had a slightest issue in years.
Cheaper, faster, in house people understands what’s going on. It should be a given for many services but somehow it’s not.
I totally agree with you. Where I work, we self-host almost everything. Exceptions are we use a CDN for one area where we want lower latency, and we use BigQuery when we need to parse a few billion datapoints into something usable.
It's amazing how few problems we have. Honestly, I don't think we have to worry about configuration issues as often as people who rely on the cloud.
On premise? Or do you build servers in a data center? Or do you lease dedicated servers?
Not GP, but my company also self-hosts. We rent rackspace in a colo. We used to keep my team's research server in the back closet before we went full-remote.
> Host in house?
Yes, mostly.
This. When Andy Jassy got challenged by analysts on the last earnings call on why AWS has fallen so far behind on innovation in areas his answer was a hand wavy response that diverted attention to say AWS is durable, stable, and reliable and customers care more about that. Oops.
behind on innovation how exactly?
The culture changed. When I first worked there, I was encouraged to take calculated risks. When I did my second tour of duty, people were deathly afraid of bringing down services. It has been a while since my second tour of duty, but I don't think it's back to "Amazon is a place where builders can build".
Somewhat inevitable for any company as they get larger. Easy to move fast and break things when you have 1 user and no revenue. Very different story when much of US commerce runs you on.
For folks who came of age in the late 00's, seeing companies once thought of as disruptors and innovators become the old stalwarts post-pandemic/ZIRP has been quite an experience.
Maybe those who have been around longer have seen this before, but its the first time for me.
If you bring something down in a real way, you can forget about someone trusting you with a big project in the future. You basically need to switch orgs
Curious. When did AWS hit “Day Two”, or what year was your 2nd tour of duty?
When they added the CM bar raiser, I felt like it hit day 2. When was that? 2014ish?
I've never heard tour of duty being used outside of the military, is it really that bad over at AWS it has to be called that?
Nah, I used to work for defense contractors, and worked with ex-military people, so...
Anyway, I actually loved my first time at AWS. Which is why I went back. My second stint wasn't too bad, but I probably wouldn't go back, unless they offered me a lot more than what I get paid, but that is unlikely.
I listened to the earnings call. I believe the question was mostly focused on why AWS has been so behind on AI. Jassy did flub the question quite badly and rambled on for a while. The press has mentioned the botched answer in a few articles recently.
They have been pushing me and company extremely hard to vet their various AI-related offerings. When we decide to look into whatever service it is, we come away underwhelmed. It seems like their biggest selling point so far is “we’ll give it to you free for several months”. Not great.
>we come away underwhelmed
In fairness, that's been my experience with everyone except OpenAI and Anthropic where I only occasionally come out underwhelmed
Really I think AWS does a fairly poor job bringing new services to market and it takes a while for them to mature. They excel much more in the stability of their core/old services--especially the "serverless" variety like S3, SQS, Lambda, EC2-ish, RDS-ish (well, today notwithstanding)
I honestly feel bad for the folks at AWS whose job it is to sell this slop. I get AWS is in panic mode trying to catch up, but it’s just awful and frankly becoming quite exhausting and annoying for customers.
AWS was gutted by layoffs over the last couple of years. None of this is surprising.
Why feel bad for them when they don’t? The paychecks and stock options keep them plenty warm at night.
The comp might be decent but most folks I know that are still there say they’re pretty miserable and the environment is becoming toxic. A bit more pay only goes so far.
Sorry, "becoming" toxic? Amazon has been famous for being toxic since forever.
It's a perspective issue. Amazon designs the first year to not "feel" toxic to most people. Thereafter, any semblance of propriety disappears.
> stock options
Timing.
If Amazon has peaked then they will not be worth much. Shares go down. Even in rising markets shares of failing companies go down...
Mind tho, Amazon has so much mind share they will need to fail harder to fail totally...
Fascinating, thanks for sharing this.
I found this summary:
https://fortune.com/2025/07/31/amazon-aws-ai-andy-jassy-earn...
And the transcript (there’s an annoying modal obscuring a bit of the page, but it’s still readable):
https://seekingalpha.com/article/4807281-amazon-com-inc-amzn...
(search for the word “tough”)
Everything except us-east-1 is generally pretty reliable. At $work we have a lot of stuff that's only on eu-west-1 (yes not the best practice) and we haven't had any issues, touch wood
My impression is that `us-east-1` has the worst reliability track record of any region. We've always run our stuff in `us-west-2` and there has never been an outage that took us down in that region. By contrast, a few things that we had in `us-east-1` have gone down repeatedly.
Just curious, what's special about us-east-1?
It’s the “original” AWS region. It has the most legacy baggage, the most customer demand (at least in the USA), and it’s also the region that hosts the management layer of most “global” services. Its availability has also been dogshit, but because companies only care about costs today and not harms tomorrow, they usually hire or contract out to talent that similarly only cares about the bottom line today and throws stuff into us-east-1 rather than figure out AZs and regions.
The best advice I can give to any org in AWS is to get out of us-east-1. If you use a service whose management layer is based there, make sure you have break-glass processes in place or, better yet, diversify to other services entirely to reduce/eliminate single points of failure.
I have a joke from 15 years ago, where I compared my friend who flaked out all the time as "having less availability than US-EAST-1".
This is not a new issue caused by improper investment, it's always been this way.
Former AWS employee here. There's a number of reasons but it mostly boils down to:
It's both the oldest and largest (most ec2 hosts, most objects in s3, etc) AWS region, and due to those things it's the region most likely to encounter an edge case in prod.
It's closest to "geographical center" so traffic from Europe feels faster than us-west
> The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!
This is and was never true. I've done setups in the past where monitoring happened "multi cloud" with also multiple dedicated servers. Was pretty broad so you could actually see where things broke.
Was quite some time ago so I don't have the data, but AWS never came out on top.
It actually matched largely with what netcraft.com put out. Not sure if they still do that and release those things to the public.
Netcraft confirmed it? I haven't heard that name since the Slashdot era :)
Which cloud provider came out on top?
This makes sense given all the open source projects coming out of Netflix like chaos monkey.
AWS has been in long term decline, most of the platform is just in keeping the lights on mode. Its also why they are behind on AI, alot of would be innovative employees get crushed under red tape and performance management
Good thing they are the biggest investor into Anthropic
But then you will be affected by outages of every dependency you use.
This is the real problem. Even if you don't run anything in AWS directly, something you integrate with will. And when us-east-1 is down, it doesn't matter if those services are in other availability zones. AWS's own internal services rely heavily on us-east-1, and most third-party services live in us-east-1.
It really is a single point of failure for the majority of the Internet.
This becomes the reason to run in us-east-1 if you're going to be single region. When it's down nobody is surprised that your service is affected. If you're all-in on some other region and it goes down you look like you don't know what you're doing.
This whole incident has been pretty uneventful down in Australia where everything AWS is on ap-southeast-2.
> Even if you don't run anything in AWS directly, something you integrate with will.
Why would a third-party be in your product's critical path? It's like the old business school thing about "don't build your business on the back of another"
It's easy to say this, but in the real world, most of the critical path is heavily-dependent on third party integrations. User auth, storage, logging, etc. Even if you're somewhat-resilient against failures (i.e. you can live without logging and your app doesn't hard fail), it's still potentially going to cripple your service. And even if your entire app is resilient and doesn't fail, there are still bound to be tons of integrations that will limit functionality, or make the app appear broken in some way to users.
The reason third-party things are in the critical path is because most of the time, they are still more reliable than self-hosting everything; because they're cheaper than anything you can engineer in-house; because no app is an island.
It's been decades since I worked on something that was completely isolated from external integrations. We do the best we can with redundancy, fault tolerance, auto-recovery, and balance that with cost and engineering time.
If you think this is bad, take a look at the uptime of complicated systems that are 100% self-hosted. Without a Fortune 500 level IT staff, you can't beat AWS's uptime.
Clearly these are non-trivial trade-offs, but I think using third parties is not an either or question. Depending on the app and the type of third-party service, you may be able to make design choices that allow your systems to survive a third-party outage for a while.
E.g., a hospital could keep recent patient data on-site and sync it up with the central cloud service as and when that service becomes available. Not all systems need to be linked in real time. Sometimes it makes sense to create buffers.
But the downside is that syncing things asynchronously creates complexity that itself can be the cause of outages or worse data corruption.
I guess it's a decision that can only be made on a case by case basis.
> Why would a third-party be in your product's critical path?
i bet only 1-2% of AI startups are running their own models and the rest are just bouncing off OpenAI, Azure, or some other API.
Not necessarily our critical path but today circleci was affected greatly which also affected our capacity to deploy. Luckily it was a Monday morning therefore we didn’t even have to deploy an hot fix.
That's nearly every ai start-up done for
No man is an island, entire of itself
With the exception of Amazon, anyone in this situation already has a third-party product in their critical path - AWS itself.
* IAM / Okta * Cloud VPN services * Cloud Office (GSuite, Office365)
Good luck naming a large company, bank, even utility that doesn't have some kind of dependency like this somewhere, even if they have mostly on-prem services.
The only ones I can really think of are the cloud providers themselves- I was at Microsoft, and absolutely everything was in-house (often to our detriment).
I think you missed the "critical path" part. Why would your product stop functioning if your admins can't log in with IAM / VPN in, do you really need hands-on maintenance constantly? Why would your product stop functioning if Office is down, are you managing your ops in Excel or something?
"Some kind of dependency" is fine and unavoidable, but well-architected systems don't have hard downtime just because someone somewhere you have no control over fucked up.
Since 2020 for some reason lot of companies have fully remote workforce. If the VPN or auth goes down and workers can't login, that's a problem. Think banks, call center work, customer service.
This was a single region outage, right? If you aren't cross-region, cross-cloud is the same but harder
Glad that you're taking the first step toward resiliency. At times, big outages like these are necessary to give a good reason why the company should Multicloud. When things are working without problems, no one cares to listen to the squeaky wheel.
How did the call go?
> The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud.
If an internal "AWS team" then this translates to "I am comfortable using this tool, and am uninterested in having to learn an entirely new stack."
If you have to diversify your cloud workloads give your devops team more money to do so.
Aren't you deployed in multiple regions?
I would be interested in a follow up in 2-3 years as to whether you've had fewer issues with a multi-cloud setup than just AWS. My suspicion is that will not be the case.
Please tell me there was a mixup and for some reason they didn’t show up.
Still no serverless inference for models or inference pipes that are not available on bedrock, still no auto scaling GPU workers. We started bothering them in 2022...crickets
Seems like major issues are still ongoing. If anything it seems worse than it did ~4 hours ago. For reference I'm a data engineer and it's Redshift and Airflow (AWS managed) that is FUBAR for me.
It has been quite a while, wondering how many 9s are dropped.
365 day * 24 * 0.0001 is roughly 8 hours, so it already lost the 99.99% status.
9s don’t have to drop if you increase the time period! “We still guarantee the same 9s just over 3450 years now”.
In a company where I worked, the tool measuring downtime was at the same server, so even if the server was down they still showed 100% up.
If the server didnt work - the tool too measure didnt work too! Genius
This happened to AWS too.
February 28, 2017. S3 went down and took down a good portion of AWS and the Internet in general. For almost the entire time that it was down, the AWS status page showed green because the up/down metrics were hosted on... you guessed it... S3.
https://aws.amazon.com/message/41926/
Happened a couple of times :)
- 2008 - https://news.ycombinator.com/item?id=116445
- 2010 - https://news.ycombinator.com/item?id=1396191
- 2015 - https://news.ycombinator.com/item?id=10033172
- 2017 - https://news.ycombinator.com/item?id=13755673 (Postmortem: https://news.ycombinator.com/item?id=13775667)
- 2024 - https://news.ycombinator.com/item?id=41770111
Five times is no longer a couple. You can use stronger words there.
It happened a murder of times.
Ha! Shall I bookmark this for the eventual wiki page?
Have we ever figured out what “red” means? I understand they’ve only ever gone to yellow.
If it goes red, we aren't alive to see it
I'm sure we need to go to Blackwatch Plaid first.
obligatory https://x.com/lintzston/status/791761626890469377
Published in the same week of October ...9 years ago ...Spooky...
I used to work at a company where the SLA was measured as the percentage of successful requests on the server. If the load balancer (or DNS or anything else network) was dropping everything on the floor, you'd have no 500s and 100% SLA compliance.
Similar to hosting your support ticketing system with same infra. "What problem? Nobody's complaining"
I’ve been customer for at least four separate products where this was true.
I can’t explain why Saucelabs was the most grating one, but it was. I think it’s because they routinely experienced 100% down for 1% of customers, and we were in that one percent about twice a year. <long string of swears omitted>
I spent enough time ~15 years back to find an external monitoring service that did not run on AWS and looked like a sustainable business instead of a VC fueled acquisition target - for our belts-n-braces secondary monitoring tool since it's not smart to trust CloudWatch to be able to send notifications when it's AWS's shit that's down.
Sadly while I still use that tool a couple of jobs/companies later - I no longer recommend it because it migrated to AWS a few years back.
(For now, my out-of-AWS monitoring tool is a bunch of cron jobs running on a collections of various inexpensive vpses and my and other dev's home machines.)
Nagios is still a thing and you can host it wherever you like.
If its not on the dashboard, it didn't happen
Common SLA windows are hour, day, week, month, quarter, and year. They're out of SLA for all of those now.
When your SLA holds within a joke SLA window, you know you goofed.
"Five nines, but you didn't say which nines. 89.9999...", etc.
These are typically calculated system-wide, so if you include all regions, technically only a fraction of customers are impacted.
Customers in all regions were affected…
Indirectly yes but not directly.
Our only impact was some atlassian tools.
I shoot for 9 fives of availability.
5555.55555% Really stupendous availableness!!!
I see what you did there, mister :P
I prefer shooting for eight eights.
You mean nine fives.
You added a zero. There are ~8760 hours per year, so 8 hours is ~1 in 1000, 99.9%.
An outage like this does not happen every year, The last big outage happened in December 2021, roughly 3 years 10 month = 46 months ago.
The duration of the outage in relation to that uptime is (8 h / 33602 h) * 100% = 0.024%, so the uptime is 99.976%, slightly worse than 99.99%, but clearly better than 99.90%.
They used to be five nines, and people used to say that it's not worth the while to prepare for an outage. With less than four nines, the perception might shift, but likely not enough to induce a mass migration to outage-resistant designs.
Won’t the end result be people keeping more servers warm in other AWS regions which means Amazon profits from their own fuckups?
There was a pretty big outage 2023
Oh you are right!
I'm sure they'll find some way to weasel out of this.
For DynamoDB, I'm not sure but I think its covered. https://aws.amazon.com/dynamodb/sla/. "An "Error" is any Request that returns a 500 or 503 error code, as described in DynamoDB". There were tons of 5XX errors. In addition, this calculation uses percentage of successful requests, so even partial degradation counts against the SLA.
From reading the EC2 SLA I don't think this is covered. https://aws.amazon.com/compute/sla/
The reason is the SLA says "For the Instance-Level SLA, your Single EC2 Instance has no external connectivity.". Instances that were already created kept working, so this isn't covered. The SLA doesn't cover creation of new instances.
It's not down time, it's degradation. No outage, just degradation of a fraction[0] of the resources.
[0] Fraction is ~ 1
This 100% seems to be what they're saying. I have not been able to get a single Airflow task to run since 7 hours ago. Being able to query Redshift only recently came back online. Despite this all their messaging is that the downtime was limited to some brief period early this morning and things have been "coming back online". Total lie, it's been completely down for the entire business day here on the east coast.
We continue to see early signs of progress!
It doesn't count. It's not downtime, it's unscheduled maintenance event.
If you aren’t making $10 for every dollar you pay Amazon you need to look at your business model.
The refund they give you isn’t going to dent lost revenue.
Check the terms of your contract. The public terms often only offer partial service credit refunds, if you ask for it, via a support request.
Where were you guys the other day when someone was calling me crazy for trying to make this same sort of argument?
I haven't done any RFP responses for a while but this question always used to make me furious. Our competitors (some of who had had major incidents in the past) claimed 99.99% availability or more, knowing they would never have to prove it, and knowing they were actually 100% until the day they weren't.
We were more honest, and it probably cost us at least once in not getting business.
An SLA is a commitment, and an RFP is a business document, not a technical one. As an MSP, you don’t think in terms of “what’s our performance”, you think of “what’s the business value”.
If you as a customer ask for 5 9s per month, with service credit of 10% of at-risk fees for missing on a deal where my GM is 30%, I can just amortise that cost and bake it into my fee.
it's a matter of perspective... 9.9999% is real easy
Only if you remember to spend your unavailability budget
It's a single region?
I don't think anyone would quote availability as availability in every region I'm in?
While this is their most important region, there's a lot of clients that are probably unaffected if they're not in use1.
They COULD be affected even if they don't have anything there because of the AWS services relying on it. I'm just saying that most customers that are multi region should have their east region out and are just humming along.
It’s THE region. All of AWS operates out of it. All other regions bow before it. Even the government is there.
"The Cloud" is just a computer that you don't own that's located in Reston, VA.
Facts.
The Rot Starts at the Head.
AWS GovCloud East is actually located in Ohio IIRC. Haven't had any issues with GovCloud West today; I'm pretty sure they're logically separated from the commercial cloud.
> All of AWS operates out of it.
I don't think this is true anymore. In the early days bad enough outages in us-east-1 would bring down everything because some metadata / control pane stuff was there, I remember getting affected while in other regions, but there's been many years since this has happened.
Today for example no issues. I just avoid us-east-1 and everyone else should to. It's their worst region by far in terms of reliability because they launch all the new stuff there and are always messing it up.
A secondary problem is that a lot of the internal tools are still on US East, so likely the response work is also being impacted by the outage. Been a while since there was a true Sev1 LSE (Large Scale Event).
What the heck? Most internal tools were in Oregon when I worked in BT pre 2021.
The primary ticketing system was up and down apparently, so tcorp/SIM must still have critical components there.
tell me it isn't true while telling me there isn't an outage across AWS because us-east-1 is down...
I help run quite a big operation in a different region and had zero issues. And this has happened many times before.
If that were true, you’d be seeing the same issues we are in us-west-1 as well. Cheers.
Global services such as STS have regional endpoints, but is it really that common to hit specific endpoint rather than use the default?
The regions are independent, so you measure availability for each on its own.
Except if they aren't quite as independent as people thought
Well that’s the default pattern anyway. When I worked in cloud there were always some services that needed cross-regional dependencies for some reason or other and this was always supposed to be called out as extra risk, and usually was. But as things change in a complex system, it’s possible for long-held assumptions about independence to change and cause subtle circular dependencies that are hard to break out of. Elsewhere in this thread I saw someone mentioning being migrated to auth that had global dependencies against their will, and I groaned knowingly. Sometimes management does not accept “this is delicate and we need to think carefully” in the midst of a mandate.
I do not envy anyone working on this problem today.
But is is a partial outage only, so it doesn't count. If you retry a million times everything still works /s
I'm wondering why your and other companies haven't just evicted themselves from us-east-1. It's the worst region for outages and it's not even close.
Our company decided years ago to use any region other than us-east-1.
Of course, that doesn't help with services that are 'global', which usually means us-east-1.
Several reasons, really:
1. The main one: it's the cheapest region, so when people select where to run their services they pick it because "why pay more?"
2. It's the default. Many tutorials and articles online show it in the examples, many deployment and other devops tools use it as a default value.
3. Related to n.2. AI models generate cloud configs and code examples with it unless asked otherwise.
4. It's location make it Europe-friendly, too. If you have a small service and you'd like to capture European and North American audience from a single location us-east-1 is a very good choice.
5. Many Amazon features are available in that region first and then spread out to other locations.
6. It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks. In hybrid cloud scenarios where you want to connect bits of your infrastructure running on AWS and on some physical hardware by a set of dedicated fiber optic lines us-east-1 is the place to do it.
7. Yes, for AWS deployments it's an experimental location that has higher risks of downtime compared to other regions, but in practice when a sizable part of us-east-1 is down other AWS services across the world tend to go down, too (along with half of the internet). So, is it really that risky to run over there, relatively speaking?
It's the world's default hosting location, and today's outages show it.
> it's the cheapest region
In every SKU I've ever looked at / priced out, all of the AWS NA regions have ~equal pricing. What's cheaper specifically in us-east-1?
> Europe-friendly
Why not us-east-2?
> Many Amazon features are available in that region first and then spread out to other locations.
Well, yeah, that's why it breaks. Using not-us-east-1 is like using an LTS OS release: you don't get the newest hotness, but it's much more stable as a "build it and leave it alone" target.
> It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks.
This is a better argument, but in practice, it's very niche — 2-5ms of speed-of-light delay doesn't matter to anyone but HFT folks; anyone else can be in a DC one state away with a pre-arranged tier1-bypassing direct interconnect, and do fine. (This is why OVH is listed on https://www.cloudinfrastructuremap.com/ despite being a smaller provider: their DCs have such interconnects.)
For that matter, if you want "low-latency to North America and Europe, and high-throughput lowish-latency peering to many other providers" — why not Montreal [ca-central-1]? Quebec might sound "too far north", but from the fiber-path perspective of anywhere else in NA or Europe, it's essentially interchangeable with Virginia.
Lots of stuff is priced differently.
Just go to the EC2 pricing page and change from us-east-1 to us-west-1
https://aws.amazon.com/ec2/pricing/on-demand/
us-west-1 is the one outlier. us-east-1, us-east-2, and us-west-2 are all priced the same.
This seems like a flaw Amazon needs to fix.
Incentivize the best behaviors.
Or is there a perspective I don't see?
How is it a flaw!? Building datacenters in different regions come with very different costs, and different costs to run. Power doesn't cost exactly the same in different regions. Local construction services are not priced exactly the same everywhere. Insurance, staff salaries, etc, etc... it all adds up, and it's not the same costs everywhere. It only makes sense that it would cost different amounts for the services run in different regions. Not sure how you're missing these easy to realize facts of life.
> 5. Many Amazon features are available in that region first and then spread out to other locations.
This is the biggest one isn't it? I thought Route 53 isn't even available on any other region.
Some AWS services are only available in us-east-1. Also a lot of people have not built their infra to be portable and the occasional outage isn't worth the cost and effort of moving out.
> the occasional outage isn't worth the cost and effort of moving out.
And looked at from the perspective of an individual company, as a customer of AWS, the occasional outage is usually an acceptable part of doing business.
However, today we’ve seen a failure that has wiped out a huge number of companies used by hundreds of millions - maybe billions - of people, and obviously a huge number of companies globally all at the same time. AWS has something like 30% of the infra market so you can imagine, and most people reading this will to some extent have experienced, the scale of disruption.
And the reality is that whilst bigger companies, like Zoom, are getting a lot of the attention here, we have no idea what other critical and/or life and death services might have been impacted. As an example that many of us would be familiar with, how many houses have been successfully burgled today because Ring has been down for around 8 out of the last 15 hours (at least as I measure it)?
I don’t think that’s OK, and I question the wisdom of companies choosing AWS as their default infra and hosting provider. It simply doesn’t seem to be very responsible to be in the same pond as so many others.
Were I a legislator I would now be casting a somewhat baleful eye at AWS as a potentially dangerous monopoly, and see what I might be able to do to force organisations to choose from amongst a much larger pool of potential infra providers and platforms, and I would be doing that because these kinds of incidents will only become more serious as time goes on.
You're suffering from survivorship bias. You know that old adage about the bullet holes in the planes, and someone pointed out that you should reinforce that parts without bullet holes, because these are the planes that came back.
It's the same thing here. Do you think other providers are better? If people moved to other providers, things would still go down, more likely than not it would be more downtime in aggregate, just spread out so you wouldn't notice as much.
At least this way, everyone knows why it's down, our industry has developed best practices for dealing with these kinds of outages, and AWS can apply their expertise to keeping all their customers running as long as possible.
> If people moved to other providers, things would still go down, more likely than not it would be more downtime in aggregate, just spread out so you wouldn't notice as much.
That is the point, though: Correlated outages are worse than uncorrelated outages. If one payment provider has an outage, chose another card or another store and you can still buy your goods. If all are down, no one can shop anything[1]. If a small region has a power blackout, all surrounding regions can provide emergency support. If the whole country has a blackout, all emergency responders are bound locally.
[1] Except with cash – might be worth to keep a stash handy for such purposes.
That’s a pretty bold claim. Where’s your data to back it up?
More importantly you appear to have misunderstood the scenario I’m trying to avoid, which is the precise situation we’ve seen in the past 24 hours where a very large proportion of internet services go down all at the same time precisely because they’re all using the same provider.
And then finally the usual outcome of increased competition is to improve the quality of products and services.
I am very aware of the WWII bomber story, because it’s very heavily cited in corporate circles nowadays, but I don’t see that it has anything to do with what I was talking about.
AWS is chosen because it’s an acceptable default that’s unlikely to be heavily challenged either by corporate leadership or by those on the production side because it’s good CV fodder. It’s the “nobody gets fired for buying IBM” of the early mid-21st century. That doesn’t make it the best choice though: just the easiest.
And viewed at a level above the individual organisation - or, perhaps from the view of users who were faced with failures across multiple or many products and services from diverse companies and organisations - as with today (yesterday!) we can see it’s not the best choice.
This is an assumption.
Reality is, though, that you shouldn't put all your eggs in the same basket. And it was indeed the case before the cloud. One service going down would have never had this cascade effect.
I am not even saying "build your own DC", but we barely have resiliency if we all rely on the same DC. That's just dumb.
From the standpoint of nearly every individual company, it's still better to go with a well-known high-9s service like AWS than smaller competitors though. The fact that it means your outages will happen at the same time as many others is almost like a bonus to that decision — your customers probably won't fault you for an outage if everyone else is down too.
That homogeneity is a systemic risk that we all bear, of course. It feels like systemic risks often arise that way, as an emergent result from many individual decisions each choosing a path that truly is in their own best interests.
Services like SES Inbound are only available in 2x US regions. AWS isn't great about making all services available in all regions :/
We're on Azure and they are worse in every aspect, bad deployment of services, and status pages that are more about PR than engineering.
At this point, is there any cloud provider that doesn't have these problems? (GCP is a non-starter because a false-positive YouTube TOS violation get you locked out of GCP[1]).
[1]: https://9to5google.com/2021/02/26/stadia-port-of-terraria-ca...
Don't worry there was a global GCP outage a few months ago
Global auth is and has been a terrible idea.
[flagged]
That’s an incredibly long comment that does nothing to explain why a YouTube ToS violation should lead to someone’s GCP services being cut off.
Also, Steve Jobs already wrote your comment better. You should have just stolen it. “You’re holding it wrong”.
[flagged]
Are you warned about the risks in an active war one? Yes.
Does Google warn you about this when you sign up? No.
And PayPal having the same problem in no way identifies Google. It just means that PayPal has the same problem and they are also incompetent (and they also demonstrate their incompetence in many other ways).
> It just means that PayPal has the same problem and they are also incompetent
Do you consider regular brick-and-mortar savings banks to be incompetent when they freeze someone's personal account for receiving business amounts of money into it? Because they all do, every last one. Because, again, they expect you to open a business account if you're going to do business; and they look at anything resembling "business transactions" happening in a personal account through the lens of fraud rather than the lens of "I just didn't realize I should open a business account."
And nobody thinks this is odd, or out-of-the-ordinary.
Do you consider municipal governments to be incompetent when they tell people that they have to get their single-family dwelling rezoned as mixed-use, before they can conduct business out of it? Or for assuming that anyone who is conducting business (having a constant stream of visitors at all hours) out of a residentially-zoned property, is likely engaging in some kind of illegal business (drug sales, prostitution, etc) rather than just being a cafe who didn't realize you can't run a cafe on residential zoning?
If so, I don't think many people would agree with you. (Most would argue that municipal governments suppress real, good businesses by not issuing the required rezoning permits, but that's a separate issue.)
There being an automatic level of hair-trigger suspicion against you on the part of powerful bureaucracies — unless and until you proactively provide those bureaucracies enough information about yourself and your activities for the bureaucracies to form a mental model of your motivations that makes your actions predictable to them — is just part of living in a society.
Heck, it's just a part of dealing with people who don't know you. Anthropologists suggest that the whole reason we developed greeting gestures like shaking hands (esp. the full version where you pull each-other in and use your other arms to pat one-another on the back) is to force both parties to prove to the other that they're not holding a readied weapon behind their backs.
---
> Are you warned about the risks in an active war one? Yes. Does Google warn you about this when you sign up? No.
As a neutral third party to a conflict, do you expect the parties in the conflict to warn you about the risks upon attempting to step into the war zone? Do you expect them to put up the equivalent of police tape saying "war zone past this point, do not cross"?
This is not what happens. There is no such tape. The first warning you get from the belligerents themselves of getting near either side's trenches in an active war zone, is running face-first into the guarded outpost/checkpoint put there to prevent flanking/supply-chain attacks. And at that point, you're already in the "having to talk yourself out of being shot" point in the flowchart.
It has always been the expectation that civilian settlements outside of the conflict zone will act of their own volition to inform you of the danger, and stop you from going anywhere near the front lines of the conflict. By word-of-mouth; by media reporting in newspapers and on the radio; by municipal governments putting up barriers preventing civilians from even heading down roads that would lead to the war zone. Heck, if a conflict just started "up the road", and you're going that way while everyone's headed back the other way, you'll almost always eventually be flagged to pull over by some kind stranger who realizes you might not know, and so wants to warn you that the only thing you'll get by going that way is shot.
---
Of course, this is all just a metaphor; the "war" between infrastructure companies and malicious actors is not the same kind of hot war with two legible "sides." (To be pedantic, it's more like the "war" between an incumbent state and a constant stream of unaffiliated domestic terrorists, such as happens during the ongoing only-partially-successful suppression of a populist revolution.)
But the metaphor holds: just like it's not a military's job to teach you that military forces will suspect that you're a spy if you approach a war zone in plainclothes; and just like it's not a bank's job to teach you that banks will suspect that you're a money launderer if you start regularly receiving $100k deposits into your personal account; and just like it's not a city government's job to teach you that they'll suspect you're running a bordello out of your home if you have people visiting your residentially-zoned property 24hrs a day... it's not Google's job to teach you that the world is full of people that try to abuse Internet infrastructure to illegal ends for profit; and that they'll suspect you're one of those people, if you just show up with your personal Google account and start doing some of the things those people do.
Rather, in all of these cases, it is the job of the people who teach you about life — parents, teachers, business mentors, etc — to explain to you the dangers of living in society. Knowing to not use your personal account for business, is as much a component of "web safety" as knowing to not give out details of your personal identity is. It's "Internet literacy", just like understanding that all news has some kind of bias due to its source is "media literacy."
s/in no way identifies Google/in no way indemnifies Google/
Sorry
> Sorry
No, thank you.
I appreciate this long comment.
I am in the middle of convincing the company I just joined to consider building on GCP instead of AWS (at the very least, not to default to AWS).
If you can't figure out how to use a different Google account for YouTube from the GCP billing account, I don't know what to say. Google's in the wrong here, but spanner's good shit! (If you can afford it. and you actually need it. you probably don't.)
The problem isn't specifically getting locked out of GCP (though it is likely to happen for those out of the loop on what happened). It is that Google themselves can't figure out that a social media ban shouldn't affect your business continuity (and access to email or what-have-you).
It is an extremely fundamental level of incompetence at Google. One should "figure out" the viability of placing all of one's eggs in the basket of such an incompetent partner. They screwed the authentication issue up and, this is no slippery slope argument, that means they could be screwing other things up (such as being able to contact a human for support, which is what the Terraria developer also had issues with).
One of those still isn’t us-east-1 though and email isn’t latency-bound.
Except for OTP codes when doing 2fa in auth
100ms isn’t going to make a difference to email-based OTP.
Also, who’s using email-based OTP?
Same calculation everyone makes but that doesn’t stop them from whining about AWS being less than perfect.
We have discussions coming up to evict ourselves from AWS entirely. Didn't seem like there was much of an appetite for it before this but now things might have changed. We're still small enough of a company to where the task isn't as daunting as it might otherwise be.
Is there some reason why "global" services aren't replicated across regions?
I would think a lot of clients would want that.
> Is there some reason why "global" services aren't replicated across regions?
On AWS's side, I think us-east-1 is legacy infrastructure because it was the first region, and things have to be made replicable.
For others on AWS who aren't AWS themselves: because AWS outbound data transfer is exorbitantly expensive. I'm building on AWS, and AWS's outbound data transfer costs are a primary design consideration for potential distribution/replication of services.
It is absolutely crazy how much AWS charges for data. Internet access in general has become much cheaper and Hetzner gives unlimited AWS. I don't recall AWS ever decreasing prices for outbound data transfer
I think there's two reasons: one, it makes them gobs of money. Two, it discourages customers from building architectures which integrate non-AWS services, because you have to pay the data transfer tax. This locks everyone in.
And yes, AWS' rates are highway robbery. If you assume $1500/mo for a 10 Gbps port from a transit provider, you're looking at $0.0005/GB with a saturated link. At a 25% utilization factor, still only $0.002/GB. AWS is almost 50 times that. And I guarantee AWS gets a far better rate for transit than list price, so their profit margin must be through the roof.
> I think there's two reasons: one, it makes them gobs of money. Two, it discourages customers from building architectures which integrate non-AWS services, because you have to pay the data transfer tax. This locks everyone in.
Which makes sense, but even their rates for traffic between AWS regions are still exorbitant. $0.10/GB for transfer to the rest of the Internet somewhat discourages integration of non-Amazon services (though you can still easily integrate with any service where most of your bandwidth is inbound to AWS), but their rates for bandwidth between regions are still in the $0.01-0.02/GB range, which discourages replication and cross-region services.
If their inter-region bandwidth pricing was substantially lower, it'd be much easier to build replicated, highly available services atop AWS. As it is, the current pricing encourages keeping everything within a region, which works for some kinds of services but not others.
Even their transfer rates between AZs _in the same region_ are expensive, given they presumably own the fiber?
This aligns with their “you should be in multiple AZs” sales strategy, because self-hosted and third-party services can’t replicate data between AZs without expensive bandwidth costs, while their own managed services (ElastiCache, RDS, etc) can offer replication between zones for free.
Hetzner is "unlimited fair use" for 1Gbps dedicated servers, which means their average cost is low enough to not be worth metering, but if you saturate your 1Gbps for a month they will force you to move to metered. Also 10Gbps is always metered. Metered traffic is about $1.50 per TB outbound - 60 times cheaper than AWS - and completely free within one of their networks, including between different European DCs.
In general it seems like Europe has the most internet of anywhere - other places generally pay to connect to Europe, Europe doesn't pay to connect to them.
"Is there some reason why "global" services aren't replicated across regions?"
us-east-1 is so the government to slurp up all the data. /tin-foil hat
Data residency laws may be a factor in some global/regional architectures.
So provide a way to check/uncheck which zones you want replication to. Most people aren't going to need more than a couple of alternatives, and they'll know which ones will work for them legally.
My guess is that for IAM it has to do with consistency and security. You don't want regions disagreeing on what operations are authorized. I'm sure the data store could be distributed, but there might be some bad latency tradeoffs.
The other concerns could have to do with the impact of failover to the backup regions.
Regions disagree on what operations are authorized. :-) IAM uses eventual consistency. As it should...
"Changes that I make are not always immediately visible": - "...As a service that is accessed through computers in data centers around the world, IAM uses a distributed computing model called eventual consistency. Any changes that you make in IAM (or other AWS services), including attribute-based access control (ABAC) tags, take time to become visible from all possible endpoints. Some delay results from the time it takes to send data from server to server, replication zone to replication zone, and Region to Region. IAM also uses caching to improve performance, but in some cases this can add time. The change might not be visible until the previously cached data times out...
...You must design your global applications to account for these potential delays. Ensure that they work as expected, even when a change made in one location is not instantly visible at another. Such changes include creating or updating users, groups, roles, or policies. We recommend that you do not include such IAM changes in the critical, high availability code paths of your application. Instead, make IAM changes in a separate initialization or setup routine that you run less frequently. Also, be sure to verify that the changes have been propagated before production workflows depend on them..."
https://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoo...
Global replication is hard and if they weren't designed with that in mind its probably a whole lot of work.
I thought part of the point of using AWS was that such things were pretty much turnkey?\
Mostly AWS relies on each region being its own isolated copy of each service. It gets tricky when you have globalized services like IAM. AWS tries to keep those to a minimum.
For us, we had some minor impacts but most stuff was stable. Our bigger issue was 3rd party SaaS also hosted on us-east-1 (Snowflake and CircleCI) which broke CI and our data pipeline
So did a previous company i worked at, all our stuff was in west-2.. then east-1 went down and some global backend services that aws depended on also went down and effected west-2.
I'm not sure a lot of companies are really looking at the costs of multi-region resiliency and hot failovers vs being down for 6 hours every year or so and writing that check.
Yep. Many, many companies are fine saying “we’re going to be no more available than AWS is.”
Customers are generally a lot more understanding if half the internet goes down at the same time as you.
Yes, and that's a major reason so many just use us-east-1.
One advantage to being in the biggest region: when it goes down the headlines all blame AWS, not you. Sure you’re down too, but absolutely everybody knows why and few think it’s your fault.
This was a major issue, but it wasn't a total failure of the region.
Our stuff is all in us-east-1, ops was a total shitshow today (mostly because many 3rd party services besides aws were down/slow), but our prod service was largely "ok", a total of <5% of customers were significantly impacted because existing instances got to keep running.
I think we got a bit lucky, but no actual SLAs were violated. I tagged the postmortem as Low impact despite the stress this caused internally.
We definitely learnt something here about both our software and our 3rd party dependencies.
cheapest + has the most capacity
You have to remember that health status dashboards at most (all?) cloud providers require VP approval to switch status. This stuff is not your startup's automated status dashboard. It's politics, contracts, money.
Which makes them a flat out lie since it ceases to be a dashboard if it’s not live. It’s just a status page.
Downdetector had 5,755 reports of AWS problems at 12:52 AM Pacific (3:53 AM Eastern).
That number had dropped to 1,190 by 4:22 AM Pacific (7:22 AM Eastern).
However, that number is back up with a vengeance. 9,230 reports as of 9:32 AM Pacific (12:32 Eastern).
Part of that could be explained by more people making reports as the U.S. west coast awoke. But I also have a feeling that they aren't yet on top of the problem.
Where do they source those reports from? Always wondered if it was just analysis of how many people are looking at the page, or if humans somewhere are actually submitting reports.
It turns out that a bunch of people checking if "XYZ is down" is a pretty good heuristic for it actually being down. It's pretty clever I think.
It's both. They count a hit from google as a report of that site being down. They also count that actual reports people make.
So if my browser auto-completes their domain name and I accept that (causing me to navigate directly to their site and then I click AWS) it's not a report; but if my browser doesn't or I don't accept it (because I appended "AWS" after their site name) causing me to perform a Google search and then follow the result to the AWS page on their site, it's a report? That seems too arbitrary... they should just count the fact that I went to their AWS page regardless of how I got to it.
I don't know the exact details, but I know that hits to their website do count as reports, even if you don't click "report". I assume they weight it differently based on how you got there (direct might actually be more heavily weighted, at least it would be if I was in charge).
Down detector agrees: https://downdetector.com/status/amazon/
Amazon says service is now just "degraded" and recovering, but searching for products on Amazon.com still does not work for me. https://health.aws.amazon.com/health/status
Search, Seller Central, Amazon Advertising not working properly for me. Attempting to access from New York.
When this is fixed, I am very interested in seeing recorded spend for Sunday and Monday.
Amazon Ads is down indeed https://status.ads.amazon.com/
Lambda create-function control plane operations are still failing with InternalError for us - other services have recovered (Lambda, SNS, SQS, EFS, EBS, and CloudFront). Cloud availability is the subject of my CS grad research, I wrote a quick post summarizing the event timeline and blast radius as I've observed it from testing in multiple AWS test accounts: https://www.linkedin.com/pulse/analyzing-aws-us-east-1-outag...
Definitely seems to be getting worse, outside of AWS itself, more websites seem to be having sporadic or serious issues. Concerning considering how long the outage has been going.
That's probably why Reddit has been down too
worst of all: ring alarm unstoppable siren because the app is down and the keyboard was removed by my parents and put "somewhere in the basement".
Is it hard wired? If so, and if the alarm module doesn’t have an internal battery, can you go to the breaker box and turn off the circuit it’s on? You should be able to switch off each breaker in turn until it stops if you don’t know which circuit it’s on.
If it doesn’t stop, that means it has a battery backup. But you can still make life more bearable. Switch off all your breakers (you probably have a master breaker for this), then open up the alarm box and either pull the battery or - if it’s non-removable - take the box off the wall, put it in a sealed container, and put the sealed container somewhere… else. Somewhere you can’t hear it or can barely hear it until the battery runs down.
Meanwhile you can turn the power back on but make sure you’ve taped the bare ends of the alarm power cable, or otherwise electrically insulated them, until you’re able to reinstall it.
I'll keep it in mind, thx. I was lucky to find the keypad in the "this is the place where we put electronic shit" in the basement.
Dangerous curiosity ask, is whether the number of folks off for Diwali is a factor or not?
I.e. lots of folks that weren't expected to work today and/or trying to round them up to work the problem.
Northern Virginia's Fairfax County public schools have the day off for Diwali, so that's not an unreasonable question.
In my experience, the teams at AWS are pretty diverse, reflecting the diversity in the area. Even if a lot of the Indian employees are taking the day off, there should be plenty of other employees to back them up. A culturally diverse employee base should mitigate against this sort of problem.
If it does turn out that the outage was prolonged due to one or two key engineers being unreachable for the holiday, that's an indictment of AWS for allowing these single points of failure to occur, not for hiring Indians.
It's more worse if caused by American engineers , not on holiday
Seems like a lot of people missing that this post was made around midnight PST time and thus it would be more reasonable to ping people at lunch in IST before waking up people in EST or PST.
Seeing as how this is us-east-1, probably not a lot.
I believe the implication is that a lot of critical AWS engineers are of Indian descent and are off celebrating today.
junon's implication may be that AWS engineers of Indian descent would tend to be located on the West Coast.
North Virginia has a very large Indian community.
All the schools in the area have days off for Indian Holidays since so many would be out of school otherwise.
This broke in the middle of the day IST did it not? Why would you start waking up people in VA if it’s 3 in the morning there if you don’t have to?
I bet you haven't gotten an email back from AWS support during twilight hours before.
There are 153k Amazon employees based in India according to LinkedIn.
Missing my point entirely.
Yeah. We had a brief window where everything resolved and worked and now we're running into really mysterious flakey networking issues where pods in our EKS clusters timeout talking to the k8s API.
Yeah, networking issues cleared up for a few hours but now seem to be as bad as before.
Basic services at my worksite have been offline for almost 8 hours now (things were just glitchy for about 4 hours before that). This is nuts.
Have not gotten a data pipeline to run to success since 9AM this morning when there was a brief window of functioning systems. Been incredibly frustrating seeing AWS tell the press that things are "effectively back to normal". They absolutely are not! It's still a full outage as far as we are concerned.
Yep, confirmed worse - DynamoDB now returning "ServiceUnavailableException"
ServiceUnavailableException hello java :)
Here as well…
I noticed the same thing and it seems to have gotten much worse around 8:55 a.m. Pacific Time.
By the way, Twilio is also down, so all those login SMS verification codes aren’t being delivered right now.
Agree… still seeing major issues. Briefly looked like it was getting better but things falling apart again.
SEV-0 for my company this morning. We can't connect to RDS anymore.
Yeah we were fine until about 1030 eastern and have been completely down since then, Heroku customer.
The problem is now that, what’s anyone going to do? Leave?
I remember a meme years ago about Nestle. It was something like: GO ON, BOYCOT US - I BET YOU CAN’T - WE MAKE EVERYTHING.
Same meme would work for Aws today.
> Same meme would work for Aws today.
Not really, there are enough alternatives.
Andy Jassy is the Tim Cook of Amazon
Rest and vest CEOs
In addition to those, Sagemaker also fails for me with an internal auth error specifically in Virginia. Fun times. Hope they recover by tomorrow.
Agreed, every time the impacted services list internally gets shorter, the next update it starts growing again.
A lot of these are second order dependencies like Astronomer, Atlassian, Confluent, Snowflake, Datadog, etc... the joys of using hosted solutions to everything.
This looks like one their worst outage in 15 years and us-east-1 still shows as degraded but I had no outages, as dont use us-east-1. Are you seeing issues on other regions?
https://health.aws.amazon.com/health/status?path=open-issues
The closest to their identification of a root cause seems to be this one:
"Oct 20 8:43 AM PDT We have narrowed down the source of the network connectivity issues that impacted AWS Services. The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers. We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations."
The problems now seem mostly related to starting new instances. Our capacity is slowly decaying as existing services spin down and new EC2 workloads fail to start.
first time i see "fubar", is that a common expression on the industry? jsut curious (english is not my native language)
It is an old US military term that means “F*ked Up Beyond All Recognition”
But you probably have seen the standard example variable names "foo" and "bar" which (together at least) come from `fubar`
Foobar == "Fucked up beyond all recognition "
Even the acronym is fucked.
My favorite by a large margin...
Which are in fact unrelated.
Unclear. ‘Foo’ has a life and origin of its own and is well attested in MIT culture going back to the 1930s for sure, but it seems pretty likely that it’s counterpart ‘bar’ appears in connection with it as a comical allusion to FUBAR.
Interestingly, it was "Fouled Up Beyond All Recognition" when it first appeared in print back towards the end of World War 2.
https://en.wikipedia.org/wiki/List_of_military_slang_terms#F...
Not to be confused with "Foobar" which apparently originated at MIT: https://en.wikipedia.org/wiki/Foobar
TIL, an interesting footnote about "foo" there:
'During the United States v. Microsoft Corp. trial, evidence was presented that Microsoft had tried to use the Web Services Interoperability organization (WS-I) as a means to stifle competition, including e-mails in which top executives including Bill Gates and Steve Ballmer referred to the WS-I using the codename "foo".[13]'
What people would print and what soldiers would say in the 1940s were likely somewhat divergent.
100%
FUBAR being a bit worse than SNAFU: "situation normal: all fucked up" which is the usual state of us-east-1
My favorite is JANFU: Joint Army-Navy Fuck-Up.
It used to be quite common but has fallen out of usage.
"FUBAR" comes up in the movie Saving Private Ryan. It's not a plot point, but it's used to illustrate the disconnect between one of the soldiers dragged from a rear position to the front line, and the combat veterans in his squad. If you haven't seen the movie, you should. The opening 20 minutes contains one of the most terrifying and intense combat sequences ever put to film.
Honestly not sure if this is a joke I'm not in on.
There are documented uses of FUBAR back into the '40s.
What do you mean? The movie storyline takes place in 44 at the Battle of Normandy.
I must've misread. I thought you said that it comes from the movie rather than comes up in the movie.
FUBAR: Fucked Up Beyond All Recognition
Somewhat common. Comes from the US military in WW2.
Yes, although it's military in origin.
Choosing us-east-1 as your primary region is good, because when you're down, everybody's down, too. You don't get this luxury with other US regions!
One unexpected upside moving from a DC to AWS is when a region is down, customers are far more understanding. Instead of being upset, they often shrug it off since nothing else they needed/wanted was up either.
This is a remarkable and unfair truth. I have had this experience with Office365...when they're down a lot of customers don't care because all their customers are also down.
I was once told that our company went with Azure because when you tell the boomer client that our service is down because Microsoft had an outage, they go from being mad at you, to accepting that the outage was an act of god that couldn’t be avoided.
Azure outages: happens all the time, understandable, no way to prevent this
AWS outages: almost never happens, you should have been more prepared for when it does
Sometimes we all need a tech shutdown.
As they say, every cloud outage has a silver lining.
* Give the computers a rest, they probably need it. Heck, maybe the Internet should just shut down in the evening so everyone can go to bed (ignoring those pesky timezone differences)
* Free chaos engineering at the cloud provider region scale, except you didn't opt in to this one and know about in advance, making it extra effective
* Quickly figure out a map which of the things you use have a dependency on a single AWS region without no capability to change or re-route
Back in the day people used to shut down mail servers at the weekend, maybe we should start doing that again.
This still happens in some places. In various parts of Europe there are legal obligations not to email employees out of hours if it is avoidable. Volkswagen famously adopted a policy in Germany of only enabling receipt of new email messages for most of their employees 30 minutes before start of the working day, then disabling 30 minutes after the end, with weekends turned off also. You can leave work on Friday and know you won't be receiving further emails until Monday.
> https://en.wikipedia.org/wiki/Right_to_disconnect
B&H shuts down their site for the sabbath.
Disconnect day
I am down with that lets all build in US-East-1.
Is us-east-1 equally unstable to the other regions? My impression was that Amazon deployed changes to us-east-1 first so it's the most unstable region.
I've heard this so many times and not seen it contradicted so I started saying it myself. Even my last Ops team wanted to run some things in us-east-1 to get prior warning before they broke us-west-1.
But there are some people on Reddit who think we are all wrong but won't say anything more. So... whatever.
Nothing in the outage history really stands out as "this is the first time we tried this and oops" except for us-east-1.
It's always possible for things to succeed at a smaller scale and fail at full scale, but again none of them really stand out as that to me. Or at least, not any in the last ten years. I'm allowing that anything older than that is on the far side of substantial process changes and isn't representative anymore.
Would think that Amazon safeguards their biggest region more, but no idea, I've never worked at AWS
It took me so long to realise this is what's important in enterprise. Uptime isn't important, being able to blame someone else is what's important.
If you're down for 5 minutes a year because one of your employees broke something, that's your fault, and the blame passes down through the CTO.
If you're down for 5 hours a year but this affected other companies too, it's not your fault
From AWS to Crowdstrike - system resilience and uptime isn't the goal. Risk mitigation isn't the goal. Affordability isn't the goal.
When the CEO's buddies all suffer at the same time as he does, it's just an "act of god" and nothing can be done, it's such a complex outcome that even the amazing boffins at aws/google/microsoft/cloudflare/etc can't cope.
If the CEO is down at a different time than the CEO's buddies then it's that Dave/Charlie/Bertie/Alice can't cope and it's the CTO's fault for not outsourcing it.
As someone who likes to see things working, it pisses me off no end, but it's the way of the world, and likely has been whenever the owner and CTO are separate.
A slightly less cynical view: execs have a hard filter for “things I can do something about” and “things I can’t influence at all.” The bad ones are constantly pushing problems into the second bucket, but there are legitimately gray area cases. When an exec smells the possibility that their team could have somehow avoided a problem, that’s category 1 and the hammer comes down hard.
After that process comes the BS and PR step, where reality is spun into a cotton candy that makes the leader look good no matter what.
> It took me so long to realise this is what's important in enterprise. Uptime isn't important, being able to blame someone else is what's important.
Yes.
What is important is having a Contractual SLA that is defensible. Acts of God are defensible. And now major cloud infrastructure outtages are too.
“No one ever got fired for hiring IBM”
And all your dependencies are co-located.
Doing pretty well up here in Tokyo region for now! Just can't log into console and some other stuff.
Check the URL, we had an issue a couple of years ago with the Workspaces. US East was down but all of our stuff was in EU.
Turns out the default URL was hardcoded to use the us east interface and just by going to workspaces and then editing your URL to be the local region got everyone working again.
Unless you mean nothing is working for you at the moment.
Doesn't this mean you are not regionally isolated from us-east-1?
“Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery.”
It’s always DNS.
I wonder how much of this is "DNS resolution" vs "underlying config/datastore of the DNS server is broken". I'd expect the latter.
Dumb question but what's the difference between the two? If the underlying config is broken then DNS resolution would fail, and that's basically the only way resolution fails, no?
My speculation: 1st one - it just DNS fails and you can repeat later. second one - you need working DNS to update your DNS servers with new configuration endpoints where DynamoDB fetches its config (classical case of circular dependencies - i even managed get similar problem with two small dns servers...)
DNS is trivial to distribute if your backing storage is accessible and/or local to each resolver, so it's a reasonable distinction to make: It suggests someone has preferred consistency at a level where DNS doesn't really provide consistency (due to caching in resolvers along the path) anyway, over a system with fewer failure points.
... wonders if the dns config store is in fact dynamodb ...
DNS is managed by Route53 which has no dependency on Dynamodb for data plane
Background on the service: https://aws.amazon.com/builders-library/reliability-and-cons...
I feel like even Amazon/AWS wouldn't be that dim, they surely have professionals who know how to build somewhat resilient distributed systems when DNS is involved :)
I doubt a circular dependency is the cause here (probably something even more basic). That being said, I could absolutely see how a circular dependency could accidentally creep in, especially as systems evolve over time.
Systems often start with minimal dependencies, and then over time you add a dependency on X for a limited use case as a convenience. Then over time, since it's already being used it gets added to other use cases until you eventually find out that it's a critical dependency.
Those aren't really that different.
That's a major way your DNS stops working.
I don’t think it is DNS. The DNS A records were 2h before they announced it was DNS but _after_ reporting it was a DNS issue.
It's always US-EAST-1 :)
Might just be BGP dressed as DNS
I don't think that's necessarily true. The outage updates later identified failing network load balancers as the cause--I think DNS was just a symptom of the root cause
I suppose it's possible DNS broke health checks but it seems more likely to be the other way around imo
I don’t work for aws, but a different cloud provider so this is not a description of this incident, but an example of the kind of thing that can happen
One particular “dns” issue that caused an outage was actually a bug in software that monitors healthchecks.
It would actively monitor all servers for a particular service (by updating itself based on what was deployed) and update dns based on those checks.
So when the health check monitors failed, servers would get removed from dns within a few milliseconds.
Bug gets deployed to health check service. All of a sudden users can’t resolve dns names because everything is marked as unhealthy and removed from dns.
So not really a “dns” issue, but it looks like one to users
Even when it's not DNS, it's DNS.
Sometimes it’s BGP.
/s
Downtime Never Stops!
Someone probably failed to lint the zone file.
DNS strikes me as the kind of solution someone designed thinking “eh, this is good enough for now. We can work out some of the clunkiness when more organizations start using the Internet.” But it just ended up being pretty much the best approach indefinitely.
Seems like an example of "worse is better". The worse solution has better survival characteristics (on account of getting actually made).
I wouldn’t say it’s the worst… a largely decentralized worldwide namespace is not an easy thing to tackle and for the most part it totally works.
I actually think the design of DNS is really cool. I'm sure we could do better designing from a clean slate today, especially around security (designing with the assumption of an adversarial environment).
But DNS was designed in the 80s! It's actually a minor miracle it works as well as it does
[dead]
[dead]
That's why they wrote the haiku
Or expired domains which I suppose is related?
the answer is always DNS
The Premier League said there will be only limited VAR today w/o the automatic offside system becasue of the AWS outage. Weird timeline we live in
https://www.bbc.com/news/live/c5y8k7k6v1rt?post=asset%3Ad902...
Why is VAR connected to the internet? Are they trying to gather data on offside players customers to improve recommedations?
I worked in a similar system. The raw data from the field first goes to a cloud hosted event queue of some sort, then a database, then back to whatever app/screen on field. The data doesn't just power on-field displays. There's a lot of online websites, etc that needs to pull data from an api.
I wouldn't be at all surprised if people pay for API access to the data. I've worked with live sports data before, it's a very profitable industry to be in when you're the one selling the data.
Of course in a sane world you'd have an internal fallback for when cloud connectivity fails but I'm sure someone looked at the cost and said "eh, what's the worst that could happen?"
A silver lining to this cloud (outage).
Cool, building in resilience seems to have worked. Our static site has origins in multiple regions via CloudFront and didn’t seem to be impacted (not sure if it would have been anyway).
My control plane is native multi-region, so while it depends on many impacted services it stayed available. Each region runs in isolation. There is data replication at play but failing to replicate to us-east-1 had no impact on other regions.
The service itself is also native multi-region and has multiple layers where failover happens (DNS, routing, destination selection).
Nothing’s perfect and there are many ways this setup could fail. It’s just cool that it worked this time - great to see.
Nothing I’ve done is rocket science or expensive, but it does require doing things differently. Happy to answer questions about it.
> Our static site has origins in multiple regions via CloudFront and didn’t seem to be impacted
This seems like such a low bar for 2025, but here we are.
You're also betting that CloudFront isn't one of the several AWS services that only works when us-east-1 is up.
Yeah, it's not clear how resilient CloudFront is but it seems good. Since content is copied to the points of presence and cached it's the lightly used stuff that can break (we don't do writes through CloudFront, which in IMHO is an anti-pattern). We setup multiple "origins" for the content so hopefully that provides some resiliency -- not sure if it contributed positively in this case since CF is such a black box. I might setup some metadata for the different origins so we can tell which is in use.
CloudFront isn't just for CDN, but also for DDoS protection. Writes through CloudFront are not an anti-pattern.
There is always more than a way to do things with AWS. But CloudFront Origin groups can’t use HTTP POST. They’re limited to read requests. Without origin groups you opt-out of some resiliency. IMHO that’s a bad trade-off. To each their own.
WAF is cheaper on CloudFront and so is traffic (compared to the ALB). It keeps bad traffic near the sender rather than near the recpient.
Yep if you wrote lambda@edge functions, which are part of Cloudfront and can be used for authentication among other things, they can only be deployed to us-east-1
I was under the impression it's similar to IAM where the control plane is in us-east-1 and the config gets replicated to other regions. In that case, existing stuff would likely continue to work but updates may fail
afaik cloudfront TLS certs and access logs S3 buckets must be stored in us-east-1
True for certs but not the log bucket (but it’s still going to be in a single region, just doesn’t have to be Virginia). I’m guessing those certs are cached where needed, but I can also imagine a perfect storm where I’m unable to rotate them due to an outage.
I prefer the API Gateway model where I can create regional endpoints and sew them together in DNS.
active/active? curious what the data stack looks like as that tends to be the hard part
The data layer is DynamoDB with Global Tables providing replication between regions, so we can write to any region. It's not easy to get this right, but our use case is narrow enough and rate of change low enought (intentionally) that it works well. That said, it still isn't clear that replication to us-east-1 would be perfect so we did "diff" tables just to be sure (it has been for us).
There is some S3 replication as well in the CI/CD pipeline, but that doesn't impact our customers directly. If we'd seen errors there it would mean manually taking Virginia out of the pipeline so we could deploy everyehere else.
So your global tables weren't impacted in us-east-1... I thought I read their status showed issues with global table replication
Our stacks in us-east-1 stopped getting traffic when the errors started and we’ve kept them out of service for now, so those tables aren’t being used. When we manually checked around noon (Pacific) they were fine (data matched) but we may have just gotten lucky.
cool thanks, we've been considering dynamo global tables for the same. We have S3 replication setup for cold storage data. For primary/hot DB there doesn't seem to be many other options for doing local writes
How did you do resilient auth for keys and certs?
We use AWS for keys and certs, with aliases for keys so they resolve properly to the specific resources in each region. For any given HTTP endpoint there is a cert that is part of a the stack in that region (different regions use different certs).
The hardest part is that our customers' resources aren't always available in multiple regions. When they are we fall back to a region where they exist that is next closest (by latency, courtesy of https://www.cloudping.co/).
That’s what I’d expect a basic setup to look like - region/space specific
So you’re minimally hydrating everyone’s data everywhere so that you can have some failover. Seems smart and a good middle ground to maximize HA. I’m curious what your retention window for the failover data redundancy is. Days/weeks? Or just a fifo with total data cap?
Just config information, not really much customer data. Customer data stays in their own AWS accounts with our service. All we hold is the ARNs of the resources serving as destinations.
We’ve gone to great lengths to minimize the amount of information we hold. We don’t even collect an email address upon sign-up, just the information passed to us by AWS Marketplace, which is very minimal (the account number is basically all we use).
Ah well that certainly makes it easier
One main problem that we observed was that big parts of their IAM / auth setup was overloaded / down which led to all kinds of cascading problems. It sounds as if Dynamo was reported to be a root cause, so is IAM dependent on dynamo internally?
Of course, such a large control plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc. so you probably like to use the battle proof DB infrastructure you already have in place. Does that mean you will end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?
There was a very large outage back in ~2017 that was caused by DynamoDB going down. Because EC2 stored its list of servers in DynamoDB, EC2 went down too. Because DynamoDB ran its compute on EC2, it was suddenly no longer able to spin up new instances to recover.
It took several days to manually spin up DynamoDB/EC2 instances so that both services could recover slowly together. Since then, there was a big push to remove dependencies between the “tier one” systems (S3, DynamoDB, EC2, etc.) so that one system couldn’t bring down another one. Of course, it’s never foolproof.
I don't remember an event like that, but I'm rather certain the scenario you described couldn't have happened in 2017.
The very large 2017 AWS outage originated in s3. Maybe you're thinking about a different event?
https://share.google/HBaV4ZMpxPEpnDvU9
Many AWS customers have bad retry policies that will overload other systems as part of their retries. DynamoDB being down will cause them to overload IAM.
Which is interesting because per their health dashboard,
>We recommend customers continue to retry any failed requests.
They should continue to retry but with exponential backoff and jitter. Not in a busy loop!
If the reliability of your system depends upon the competence of your customers then it isn't very reliable.
Have you ever built a service designed to operate at planetary scale? One that's built of hundreds or thousands of smaller service components?
There's no such thing as infinite scalability. Even the most elastic services are not infinitely elastic. When resources are short, you either have to rely on your customers to retry nicely, or you have to shed load during overload scenarios to protect goodput (which will deny service to some). For a high demand service, overload is most likely during the first few hours after recovery.
See e.g., https://d1.awsstatic.com/builderslibrary/pdfs/Resilience-les...
Probably stupid question (I am not a network/infra engineer) - can you not simply rate limit requests (by IP or some other method)?
Yes your customers may well implement stupidly aggressive retries, but that shouldn't break your stuff, they should just start getting 429s?
Load shedding effectively does that. 503 is the correct error code here to indicate temporary failure; 429 means you've exhausted a quota.
Can't exactly change existing widespread practice so they're ready for that kind of handling.
I think Amazon uses an internal platform called Dynamo as a KV store, it’s different than DynamoDB, so im thinking the outage could be either a dns routing issue or some kind of node deployment problem.
Both of which seem to prop up in post mortems for these widespread outages.
Dynamo is AFAIK, not used by core AWS services.
They said the root cause was DNS for dynamoDB. inside AWS relying on dynamoDB is highly encouraged so it’s not surprising that a failure there would cascade broadly. The fact that EC2 instance launching is effected is surprising. Loops in the service dependency graph are known to be a bad idea.
It's not a direct dependency. Route 53 is humming along... DynamoDB decided to edit its DNS records that are propagated by Route 53... they were bogus, but Route 53 happily propagated the toxic change to the rest of the universe.
DynamoDB is not going to set up its own DNS service or its own Route 53.
Maybe DynamoDB should have had tooling that tested DNS edits before sending it to Route 53, or Route53 should have tooling to validate changes before accepting them. I'm sure smart people at AWS are yelling at each other about it right now.
When I worked at AWS several years ago, IAM was not dependent on Dynamo. It might have changed, but I highly doubt this. Maybe some kind of network issue with high-traffic services?
> Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum.
IAM is replicated, so each region has its own read-only IAM cache. AWS SigV4 is also designed to be regionalized, if you ever wondered why the signature key derivation has many steps, that's exactly why ( https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_s... ).
I find it very interesting that this is the same issue that took down GCP recently.
Can't resolve any records for dynamodb.us-east-1.amazonaws.com
However, if you desperately need to access it you can force resolve it to 3.218.182.212. Seems to work for me. DNS through HN
curl -v --resolve "dynamodb.us-east-1.amazonaws.com:443:3.218.182.212" https://dynamodb.us-east-1.amazonaws.com/
There's also dynamodb-fips.us-east-1.amazonaws.com if the main endpoint is having trouble. I'm not sure if this record was affected the same way during this event.
It's always DNS
Confirmed.
> Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1.
thank you for that info!!!!!
Dude!! Life saver.
Their status page (https://health.aws.amazon.com/health/status) says the only disrupted service is DynamoDB, but it's impacting 37 other services. It is amazing to see how big a blast radius a single service can have.
It's not surprising that it's impacting other services in the region because DynamoDB is one of those things that lots of other services build on top of. It is a little bit surprising that the blast radius seems to extend beyond us-east-1, mind.
In the coming hours/days we'll find out if AWS still have significant single points of failure in that region, or if _so many companies_ are just not bothering to build in redundancy to mitigate regional outages.
I'm looking forward to the RCA!
I'm real curious how much of AWS GovCloud has continued through this actually. But even if it's fine, from a strategic perspective how much damage did we just discover you could do with a targeted disruption at the right time?
The US federal government is in yet another shutdown right now, so how would some sub-agency even know if there were an unplanned outage, who would report it, and who would try to fix it?
Gov't IT is mostly contractors, and IME they do 99% of the work, so... That might be your answer.
AWS engineers are trained to use their internal services for each new system. They seem to like using DynamoDB. Dependencies like this should be made transparent.
Ex employee here who built an aws service. Dynamo is basically mandated. You need like VP approval to use a relational database because of some scaling stuff they ran into historically. That sucks because we really needed a relational database and had to bend over backwards to use dynamo and all the nonsense associated with not having sql. It was super low traffic too
Not sure why this is downvoted - this is absolutely correct.
A lot of AWS services under the hood depend on others, and especially us-east-1 is often used for things that require strong consistency like AWS console logins/etc (where you absolutely don't want a changed password or revoked session to remain valid in other regions because of eventual consistency).
> Dependencies like this should be made transparent
even internally, Amazon's dependency graph became visually+logically incomprehensible a long time ago
Not "like using", they are mandated from the top to use DynamoDB for any storage. At my org in the retail page, you needed director approval if you wanted to use a relational DB for a production service.
It's now listing 58 impacted services, so the blast radius is growing it seems
i'm surprised / bothered that the history log shows the issues starting AM 10/20 -- when they seemed to have started around midnight 10/19
The same page now says 58 services - just 23 minutes after your post. Seems this is becoming a larger issue.
When I first visited the page it said like 23 services, now it says 65
74 now. This is an extreme way of finding out just how many AWS services there really are!
82 now.
Looks like AWS detonated six sticks of dynamite under a house of cards...
Up to 104 now, with 33 services reported as having issues that have been resolved.
At 3:03 AM PT AWS posted that things are recovering and sounded like issue was resolved.
Then things got worse. At 9:13 AM PT it sounds like they’re back to troubleshooting.
Honestly sounds like AWS doesn’t even really know what’s going on. Not good.
This is exacerbated by the fact that this is Diwali week which means the most of Indian engineers will be out on leave. Tough luck.
[flagged]
I'm unaware of any organization that doesn't give the same number of vacation days based on religion and region- or company-wide holidays...
Right, like vacation?
Signal is down from several vantage points and accounts in Europe, I'd guess because of this dependence on Amazon overseas
We're having fun figuring out how to communicate amongst colleagues now! It's when it's gone when you realise your dependence
Well, at least I now know that my Belgian university's Blackboard environment is running on AWS :)
Our last resort fallbacks are channels on different IRC servers. They always hold.
Self hosting is golden. Sadly we already feel like we have too many services for our company's size, and the sensitivity of vulnerabilities in customer systems precludes unencrypted comms. IRC+TLS could be used but we also regularly send screenshots and such in self-destructing messages (not that an attacker couldn't disable that, but to avoid there being a giant archive when we do have some sort of compromise), so we'd rather fall back to something with a similar featureset
As a degraded-state fallback, email is what we're using now (we have our clients configured to encrypt with PGP by default, we use it for any internal email and also when the customer has PGP so everyone knows how to use that)
> in self-destructing messages (not that an attacker couldn't disable that, but to avoid there being a giant archive when we do have some sort of compromise
Admitting to that here?
In civilised jurisdictions that should be criminal.
Using cryptography to avoid accountability is wrong. Drug dealing and sex work, OK, but in other businesses? Sounds very crooked to me
They're not sidestepping accountability, or at least we can't infer that from the information volunteered. They're talking about retention, and the less data you retain (subject to local laws), the less data can be leaked in a breach.
self-hosting isn't "golden", if you are serious about the reliability of complex systems, you can't afford to have your own outages impede your own engineers from fixing them.
if you seriously have no external low dep fallback, please at least document this fact now for the Big Postmortem.
The engineers can walk up to the system and do whatever they need to fix them. At least, that's how we self host in the office. If your organisation hosts it far away then yeah, it's not self hosted but remote hosted
> The engineers can walk up to the system and do whatever they need to fix them.
Including fabricating new RAM?
Including falling back to third-party hosting when relevant. One doesn't exclude the other
My experience with self hosting has been that, at least when you keep the services independent, downtime is not more common than in hosted environments, and you always know what's going on. Customising solutions, or workarounds in case of trouble, is a benefit you don't get when the service provider is significantly bigger than you are. It has pros and cons and also depends on the product (e.g. email delivery is harder than Mattermost message delivery, or if you need a certain service only once a year or so) but if you have the personell capacity and a continuous need, I find hosting things oneself to be the best solution in general
Including fallback to your laptop if nothing else works. I saved a demo once by just running the whole thing from my computer when the Kubernetes guys couldn't figure out why the deployed version was 403'ing. Just had to poke the touchpad every so often so it didn't go to sleep.
> Just had to poke the touchpad every so often so it didn't go to sleep
Unwarranted tip: next time, if you use macOS, just open the terminal and run `caffeinate -imdsu`.
I assume Linux/Windows have something similar built-in (and if not built-in, something that's easily available). For Windows, I know that PowerToys suite of nifty tools (officially provided by Microsoft) has Awake util, but that's just one of many similar options.
You can just turn of automatic sleep/screen off in Windows native power settings.
If you self host, you must keep the spares, atleast for an enterprise environment.
The key thing that AWS provides is the capacity for infinite redundancy. Everyone that is down because us-east-1 is down didn't learn the lesson of redundancy.
Active-active RDBMS - which is really the only feasible way to do HA, unless you can tolerate losing consistency (or the latency hit of running a multi-region PC/EC system) - is significantly more difficult to reason about, and to manage.
Except Google Spanner, I’m told, but AWS doesn’t have an answer for that yet AFAIK.
They do now: https://aws.amazon.com/rds/aurora/dsql/
Some organizations’ leadership takes one look at the cost of redundancy and backs away. Paying for redundant resources most organizations can stomach. The network traffic charges are what push many over the edge of “do not buy”.
The cost of re-designing and re-implementing applications to synchronize data shipping to remote regions and only spinning up remote region resources as needed is even larger for these organizations.
And this is how we end up with these massive cloud footprints not much different than running fleets of VM’s. Just about the most expensive way to use the cloud hyperscalers.
Most non-tech industry organizations cannot face the brutal reality that properly, really leveraging hyperscalers involves a period of time often counted in decades for Fortune-scale footprints where they’re spending 3-5 times on selected areas more than peers doing those areas in the old ways to migrate to mostly spot instance-resident, scale-to-zero elastic, containerized services with excellent developer and operational troubleshooting ergonomics.
same here, we never left IRC
Thankfully Slack is still holding up.
It’s super broken for me. Random threads no longer appear.
It’s acting up for me but wondering if it’s unrelated. Imagines failing to post and threads acting strange.
Slack is having issues with huddles, canvas, and messaging per https://slack-status.com/. Earlier it was just huddles and canvas.
Same. My Slack mobile app managed to sync the new messages, but it took it about 30 seconds, while usually it's sub 2 seconds.
Looks like its gone down now
[flagged]
And the post office still works, so ah, at least kidnappers can send ransom demands.
> We're having fun figuring out how to communicate amongst colleagues now!
When Slack was down we used... google... google mail? chat. When you go to gmail there is actually a chat app on the left.
Even internal Amazon tooling is impacted greatly - including the internal ticketing platform which is making collaboration impossible during the outage. Amazon is incapable of building multi-region services internally. The Amazon retail site seems available, but I’m curious if it’s even using native AWS or is still on the old internal compute platform. Makes me wonder how much juice this company has left.
Amazon's revenue in 2024 was about the size of Belgium's GDP. Higher than Sweden or Ireland. It makes a profit similar to Norway, without drilling for offshore oil or maintaining a navy. I think they've got plenty of juice left.
The universe's metaphysical poeticism holds that it's slightly more likely than it otherwise would be that the company that replaced Sears would one day go the way of Sears.
You’re right about that. I guess what I mean is, how long will people be enthusiastic about AWS and its ability to innovate. But AWS undeniably has some really strong product offerings - it’s just that their pace of innovation has slowed. Their managed solutions for open source applications is generally good, but some of their bespoke alternatives are lacking over the last few years (ecs kinesis code* tools) - it wasn’t always like that (sqs ddb s3 ec2).
You could argue Amazon's security is an irregular military force
The navy comment is a bit unfair, as it's well-known that Amazon is more of an airpower (hence "the cloud" etc.)
It seems reasonable to me that Amazon (retail) would build better AZ redundancy into their services than say Snapchat or a bank
Sure, but it’s not reasonable that internal collaboration platforms built for ticketing engineers about outages doesn’t work during the outage. That would be something worth making multi-region at a minimum.
Of course, referring to this
> The Amazon retail site seems available
I saw a quote from a high end AWS support engineer that said something like "submitting tickets for AWS problems is not working reliably: customers are advised to keep retrying until the ticket is submitted".
> The Amazon retail site seems available, but I’m curious if it’s even using native AWS or is still on the old internal compute platform.
Some parts of amazon.com seem to be affected by the outage (e.g. product search: https://x.com/wongmjane/status/1980318933925392719)
Reviewing order history was also spotty. According to my wife “keep hitting refresh getting different dogs each time until it works”
Amazon customer support had a banner saying it was unavailable for most of the day. Couldn't get anything less than 5 day shipping on any item today.
As this incident unfolds, what’s the best way to estimate how many additional hours it’s likely to last? My intuition is that the expected remaining duration increases the longer the outage persists, but that would ultimately depend on the historical distribution of similar incidents. Is that kind of data available anywhere?
To my understanding the main problem is DynamoDB being down, and DynamoDB is what a lot of AWS services use for their eventing systems behind the scenes. So there's probably like 500 billion unprocessed events that'll need to get processed even when they get everything back online. It's gonna be a long one.
500 billions events. Always blows my mind how many people use aws
I know nothing. But I'd imagine the number of 'events' generated during this period of downtime will eclipse that number every minute.
"I felt a great disturbance in us-east-1, as if millions of outage events suddenly cried out in terror and were suddenly silenced"
(Be interesting to see how many events currently going to DynamoDB are actually outage information.)
I wonder how many companies have properly designed their clients. So that the timing before re-attempt is randomised and the re-attempt timing cycle is logarithmic.
Most companies will use the AWS SDK client's default retry policy.
nowadays i think a single immediate retry is preferred over exponential backoff with jitter.
if you ran into a problem that an instant retry cant fix, chances are you will be waiting so long that your own customer doesnt care anymore.
Why randomized?
It’s the Thundering Herd Problem.
See https://en.wikipedia.org/wiki/Thundering_herd_problem
In short, if it’s all at the same schedule you’ll end up with surges of requests followed by lulls. You want that evened out to reduce stress on the server end.
Thank you. Bonsai and adzm as well. :)
It's just a safe pattern that's easy to implement. If your services back-off attempts happen to be synced, for whatever reason, even if they are backing off and not slamming AWS with retries, when it comes online they might slam your backend.
It's also polite to external services but at the scale of something like AWS that's not a concern for most.
> they might slam your backend
Heh
Helps distribute retries rather than having millions synchronize
[dead]
Yes, with no prior knowledge the mathematically correct estimate is:
time left = time so far
But as you note prior knowledge will enable a better guess.
Yeah, the Copernican Principle.
> I visited the Berlin Wall. People at the time wondered how long the Wall might last. Was it a temporary aberration, or a permanent fixture of modern Europe? Standing at the Wall in 1969, I made the following argument, using the Copernican principle. I said, Well, there’s nothing special about the timing of my visit. I’m just travelling—you know, Europe on five dollars a day—and I’m observing the Wall because it happens to be here. My visit is random in time. So if I divide the Wall’s total history, from the beginning to the end, into four quarters, and I’m located randomly somewhere in there, there’s a fifty-percent chance that I’m in the middle two quarters—that means, not in the first quarter and not in the fourth quarter.
> Let’s suppose that I’m at the beginning of that middle fifty percent. In that case, one-quarter of the Wall’s ultimate history has passed, and there are three-quarters left in the future. In that case, the future’s three times as long as the past. On the other hand, if I’m at the other end, then three-quarters have happened already, and there’s one-quarter left in the future. In that case, the future is one-third as long as the past.
https://www.newyorker.com/magazine/1999/07/12/how-to-predict...
This thought process suggests something very wrong. The guess "it will last again as long as it has lasted so far" doesn't give any real insight. The wall was actually as likely to end five months from when they visited it, as it was to end 500 years from then.
What this "time-wise Copernican principle" gives you is a guarantee that, if you apply this logic every time you have no other knowledge and have to guess, you will get the least mean error over all of your guesses. For some events, you'll guess that they'll end in 5 minutes, and they actually end 50 years later. For others, you'll guess they'll take another 50 years and they actually end 5 minutes later. Add these two up, and overall you get 0 - you won't have either a bias to overestimating, nor to underestimating.
But this doesn't actually give you any insight into how long the event will actually last. For a single event, with no other knowledge, the probability that it will after 1 minute is equal to the probability that it will end after the same duration that it lasted so far, and it is equal to the probability that it will end after a billion years. There is nothing at all that you can say about the probability of an event ending from pure mathematics like this - you need event-specific knowledge to draw any conclusions.
So while this Copernican principle sounds very deep and insightful, it is actually just a pretty trite mathematical observation.
But you will never guess that the latest tik-tok craze will last another 50 years, and you'll never guess that Saturday Night Live (which premiered in 1075) will end 5-minutes from now. Your guesses are thus more likely to be accurate than if you ignored the information about how long something has lasted so far.
Sure, but the opposite also applies. If in 1969 you guessed that the wall would last another 20 years, then in 1989, you'll guess that the wall of Berlin will last another 40 years - when in fact it was about to fall. And in 1949, when the wall was a few months old, you'll guess that it will last for a few months at most.
So no, you're not very likely to be right at all. Now sure, if you guess "50 years" for every event, your average error rate will be even worse, across all possible events. But it is absolutely not true that it's more likely that SNL will last for another 50 years as it is that it will last for another 10 years. They are all exactly as likely, given the information we have today.
If I understand the original theory, we can work out the math with a little more detail... (For clarity, the berlin wall was erected in 1961.)
- In 1969 (8 years after the wall was erected): You'd calculate that there's a 50% chance that the wall will fall between 1972 (8x4/3=11 years) and 1993 (8x4=32 years)
- In 1989 (28 years after the wall was erected): You'd calculate that there's a 50% chance that the wall will fall between 1998 (28x4/3=37 years) and 2073 (28x4=112 years)
- In 1961 (when the wall was, say, 6 months old): You'd calculate that there's a 50% chance that the wall will fall between 1961 (0.5x4/3=0.667 years) and 1963 (0.5x4=2 years)
I found doing the math helped to point out how wide of a range the estimate provides. And 50% of the times you use this estimation method; your estimate will correctly be within this estimated range. It's also worth pointing out that, if your visit was at a random moment between 1961 and 1989, there's only a 3.6% chance that you visited in the final year of its 28 year span, and 1.8% chance that you visited in the first 6 months.
However,
> Well, there’s nothing special about the timing of my visit. I’m just travelling—you know, Europe on five dollars a day—and I’m observing the Wall because it happens to be here.
It's relatively unlikely that you'd visit the Berlin Wall shortly after it's erected or shortly before it falls, and quite likely that you'd visit it somewhere in the middle.
No, it's exactly as likely that I'll visit it at any one time in its lifetime. Sure, if we divide its lifetime into 4 quadrants, its more likely I'm in quadrant 2-3 than in either of 1 or 4. But this is slight of hand: it's still exactly as likely that I'm in quadrant 2-3 than in quadrant (1 or 4) - or, in other words, it's as likely I'm at one of the ends of the lifetime as it is that I am in the middle.
>So no, you're not very likely to be right at all.
Well 1/3 of the examples you gave were right.
> Saturday Night Live (which premiered in 1075)
They probably had a great skit about the revolt of the Earls against William the Conquerer.
> The wall was actually as likely to end five months from when they visited it, as it was to end 500 years from then.
I don't think this is correct; as in something that has been there for say hundreds of years had more probability to be there in a hundred years than something that has been there for a month.
> while this Copernican principle sounds very deep and insightful, it is actually just a pretty trite mathematical observation
It's important to flag that the principle is not trite, and it is useful.
There's been a misunderstanding of the distribution after the measurement of "time taken so far", (illuminated in the other thread), which has lead to this incorrect conclusion.
To bring the core clarification from the other thread here:
The distribution is uniform before you get the measurement of time taken already. But once you get that measurement, it's no longer uniform. There's a decaying curve whose shape is defined by the time taken so far. Such that the estimate `time_left=time_so_far` is useful.
If this were actually correct, than any event ending would be a freak accident: since, according to you, the probability of something continuing increases drastically with its age. That is, according to your logic, the probability of the wall of Berlin falling within the year was at its lowest point in 1989, when it actually fell. In 1949, when it was a few months old, the probability that it would last for at least 40 years was minuscule, and that probability kept increasing rapidly until the day the wall was collapsed.
That's a paradox that comes from getting ideas mixed up.
The most likely time to fail is always "right now", i.e. this is the part of the curve with the greatest height.
However, the average expected future lifetime increases as a thing ages, because survival is evidence of robustness.
Both of these statements are true and are derived from:
P(survival) = t_obs / (t_obs + t_more)
There is no contradiction.
> However, the average expected future lifetime increases as a thing ages, because survival is evidence of robustness.
This is a completely different argument that relies on various real-world assumptions, and has nothing to do with the Copernican principle, which is an abstract mathematical concept. And I actually think this does make sense, for many common categories of processes.
However, even this estimate is quite flawed, and many real-world processes that intuitively seem to follow it, don't. For example, looking at an individual animal, it sounds kinda right to say "if it survived this long, it means it's robust, so I should expect it will survive more". In reality, the lifetime of most animals is a binomial distribution - they either very young, because of glaring genetic defects or simply because they're small, fragile, and inexperienced ; or they die at some common age that is species dependent. For example, a humab that survived to 20 years of age has about the same chance of reaching 80 as one that survived to 60 years of age. And an alien who has no idea how long humans live and tries to apply this method may think "I met this human when they're 80 years old - so they'll probably live to be around 160".
Ah no, it is the Copernican principle, in mathematical form.
Why is the most likely time right now? What makes right now more likely than in five minutes? I guess you're saying if there's nothing that makes it more likely to fail at any time than at any other time, right now is the only time that's not precluded by it failing at other times? I.E. it can't fail twice, and if it fails right now it can't fail at any other time, but even if it would have failed in five minutes it can still fail right now first?
Yes that's pretty much it. There will be a decaying probability curve, because given you could fail at any time, you are less likely to survive for N units of time than for just 1 unit of time, etc.
Is this a weird Monty hall thing where the person next to you didnt visit the wall randomly (maybe they decided to visit on some anniversary of the wall) so for them the expected lifetime of the wall is different?
Note that this is equivalent to saying "there's no way to know". This guess doesn't give any insight, it's just the function that happens to minimize the total expected error for an unknowable duration.
Edit: I should add that, more specifically, this is a property of the uniform distribution, it applies to any event for which EndsAfter(t) is uniformly distributed over all t > 0.
I'm not sure about that. Is it not sometimes useful for decision making, when you don't have any insight as to how long a thing will be? It's better than just saying "I don't know".
Not really, unless you care about something like "when I look back at my career, I don't want to have had a bias to underestimating nor overestimating outages". That's all this logic gives you: for every time you underestimate a crisis, you'll be equally likely to overestimate a different crisis. I don't think this is in any way actually useful.
Also, the worse thing you can get from this logic is to think that it is actually most likely that the future duration equals the past duration. This is very much false, and it can mislead you if you think it's true. In fact, with no other insight, all future durations are equally likely for any particular event.
The better thing to do is to get some even-specific knowledge, rather than trying to reason from a priori logic. That will easily beat this method of estimation.
You've added some useful context, but I think you're downplaying it's use. It's non-obvious, and in many cases better than just saying "we don't know". For example, if some company's server has been down for an hour, and you don't know anything more, it would be reasonable to say to your boss: "I'll look into it, but without knowing more about it, stastically we have a 50% chance of it being back up in an hour".
> The better thing to do is to get some even-specific knowledge, rather than trying to reason from a priori logic
True, and all the posts above have acknowledged this.
> "I'll look into it, but without knowing more about it, stastically we have a 50% chance of it being back up in an hour"
This is exactly what I don't think is right. This particular outage has the same a priori chance of being back in 20 minutes, in one hour, in 30 hours, in two weeks, etc.
Ah, that's not correct... That explains why you think it's "trite", (which it isn't).
The distribution is uniform before you get the measurement of time taken already. But once you get that measurement, it's no longer uniform. There's a decaying curve whose shape is defined by the time taken so far. Such that the statement above is correct, and the estimate `time_left=time_so_far` is useful.
Can you suggest some mathematical reasoning that would apply?
If P(1 more minute | 1 minute so far) = x, then why would P(1 more minute | 2 minutes so far) < x?
Of course, P(it will last for 2 minutes total | 2 minutes elapsed) = 0, but that can only increase the probabilities of any subsequent duration, not decrease them.
That's inverted, it would be:
If: P(1 more minute | 1 minute so far) = x
Then: P(1 more minute | 2 minutes so far) > x
The curve is:
P(survival) = t_obs / (t_obs + t_more)
(t_obs is time observed to have survived, t_more how long to survive)
Case 1 (x): It has lasted 1 minute (t_obs=1). The probability of it lasting 1 more minute is: 1 / (1 + 1) = 1/2 = 50%
Case 2: It has lasted 2 minutes (t_obs=2). The probability of it lasting 1 more minute is: 2 / (2 + 1) = 2/3 ≈ 67%
I.e. the curve is a decaying curve, but the shape / height of it changes based on t_obs.
That gets to the whole point of this, which is that the length of time something has survived is useful / provides some information on how long it is likely to survive.
> P(survival) = t_obs / (t_obs + t_more)
Where are you getting this formula from? Either way, it doesn't have the property we were originally discussing - the claim that the best estimate of the duration of an event is the double of it's current age. That is, by this formula, the probability of anything collapsing in the next millisecond is P(1 more millisecond | t_obs) = t_obs / t_obs + 1ms ~= 1 for any t_obs >> 1ms. So by this logic, the best estimate for how much longer an event will take is that it will end right away.
The formula I've found that appears to summarize the original "Copernican argument" for duration is more complex - for 50% confidence, it would say:
That is, if given that we have a 50% chance to be experiencing the middle part of an event, we should expect its future life to be between one third and three times its past life.Of course, this can be turned on its head: we're also 50% likely to be experiencing the extreme ends of an event, so by the same logic we can also say that P(t_more = 0 [we're at the very end] or t_more = +inf [we're at the very beginning and it could last forever] ) is also 50%. So the chance t_more > t_obs is equal to the chance it's any other value. So we have precisely 0 information.
The bottom line is that you can't get more information out a uniform distribution. If we assume all future durations have the same probability, then they have the same probability, and we can't predict anything useful about them. We can play word games, like this 50% CI thing, but it's just that - word games, not actual insight.
I think the main thing to clarify is:
It's not a uniform distribution after the first measurement, t_obs. That enables us to update the distribution, and it becomes a decaying one.
I think you mistakenly believe the distribution is still uniform after that measurement.
The best guess, that it will last for as long as it already survived for, is actually the "median" of that distribution. The median isn't the highest point on the probability curve, but the point where half the area under the curve is before it, and half the area under the curve is after it.
And the above equation is consistent with that.
I used Claude to get the outage start and ends from the post-event summaries for major historical AWS outages: https://aws.amazon.com/premiumsupport/technology/pes/
The cumulative distribution actually ends up pretty exponential which (I think) means that if you estimate the amount of time left in the outage as the mean of all outages that are longer than the current outage, you end up with a flat value that's around 8 hours, if I've done my maths right.
Not a statistician so I'm sure I've committed some statistical crimes there!
Unfortunately I can't find an easy way to upload images of the charts I've made right now, but you can tinker with my data:
Generally expect issues for the rest of the day, AWS will recover slowly, then anyone that relies on AWS will recovery slowly. All the background jobs which are stuck will need processing.
Rule of thumb is that the estimated remaining duration of an outage is equal to the current elapsed duration of the outage.
1440 min
I realize that my basement servers have better uptime than AWS this year!
I think most sysadmin don't plan for AWS outage. And economically it makes sense.
But it makes me wonder, is sysadmin a lost art?
> But it makes me wonder, is sysadmin a lost art?
Yes. 15-20 years ago when I was still working on network-adjacent stuff I witnessed the shift to the devops movement.
To be clear, the fact that devops don't plan for AWS failures isn't an indication that they lack the sysadmin gene. Sysadmins will tell you very similar "X can never go down" or "not worth having a backup for service Y".
But deep down devops are developers who just want to get their thing running, so they'll google/serveroverflow their way into production without any desire to learn the intricacies of the underlying system. So when something breaks, they're SOL.
"Thankfully" nowadays containers and application hosting abstracts a lot of it back away. So today I'd be willing to say that devops are sufficient for small to medium companies (and dare I say more efficient?).
> But deep down devops are developers who just want to get their thing running, so they'll google/serveroverflow their way into production without any desire to learn the intricacies of the underlying system. So when something breaks, they're SOL.
Depends on the devops team. I have worked with so many devops engineers who came from network engineering, sysadmin, or SecOps backgrounds. They all bring a different perspective and set of priorities.
That's not very surprising. At this point you could say that your microwave has a better uptime. The complexity comparison to all the Amazon cloud services and infrastructure would be roughly the same.
So, we need to centralise all our compute onto one provider because economies of scale and its too complicated to do it well…
… but we should not compare them to self-hosting because hosting that much data and compute is complicated.
The emperor has no clothes.
> But it makes me wonder, is sysadmin a lost art?
I dunno, let me ask chatgpt. Hmmm, it said yes.
ChatGPT often says yes to both a question and its inverse. People like to hear yes more than no.
You missed their point. They were making a joke about over-reliance on AI.
As Amazon moves from day-1 company as it claimed once, to be the sales company like Oracle focusing on raking money, expect more outages to come, and longer to be resolved.
Amazon is burning and driving away the technical talent and knowledge knowing the vendor lock-in will keep bringing the sweet money. You will see more sales people hoovering around your c-suites and executives, while you will face even worse technical support, that seem not knowing what they are talking about, yet alone to fix the support issue you expect to be fixed easily.
Mark my words, and if you are putting your eggs in one basket, that basket is now too complex and too interdependent, and the people who built and knew those intricacies are driven away with RTOs, move to hubs. Eventually those services; all others (and also aws services themselves) heavily dependent on, might be more fragile than the public knows.
>You will see more sales people hoovering around your c-suites and executives, while you will face even worse technical support, that seem not knowing what they are talking about, yet alone to fix the support issue you expect to be fixed easily.
WILL see? We've been seeing this since 2019.
it will get worse
https://www.theregister.com/2025/10/20/aws_outage_amazon_bra... quoting from this
"And so, a quiet suspicion starts to circulate: where have the senior AWS engineers who've been to this dance before gone? And the answer increasingly is that they've left the building — taking decades of hard-won institutional knowledge about how AWS's systems work at scale right along with them."
...
"AWS has given increasing levels of detail, as is their tradition, when outages strike, and as new information comes to light. Reading through it, one really gets the sense that it took them 75 minutes to go from "things are breaking" to "we've narrowed it down to a single service endpoint, but are still researching," which is something of a bitter pill to swallow. To be clear: I've seen zero signs that this stems from a lack of transparency, and every indication that they legitimately did not know what was breaking for a patently absurd length of time."
....
"This is a tipping point moment. Increasingly, it seems that the talent who understood the deep failure modes is gone. The new, leaner, presumably less expensive teams lack the institutional knowledge needed to, if not prevent these outages in the first place, significantly reduce the time to detection and recovery. "
...
"I want to be very clear on one last point. This isn't about the technology being old. It's about the people maintaining it being new. If I had to guess what happens next, the market will forgive AWS this time, but the pattern will continue."
Do you have data suggesting AWS outages are more frequent and/or take longer to resolve?
This is a prediction, not a historical pattern to be observed now. Only future data can verify if this prediction was correct or not.
AWS has existed for like 2 decades now, is that not enough evidence for you?
And has substantially changed in management in the last couple years. Have you read my first post?
That is why technical leaders’ role wouldn’t demand they not only gather data, but also report things like accurate operational, alternative, and scenario cost analysis; financial risks; vendor lock-in; etc.
However, as may be apparent just from that small set, it is not exactly something technical people often feel comfortable with doing. It is why at least in some organizations you get the friction of a business type interfacing with technical people in varying ways, but also not really getting along because they don’t understand each other and often there are barriers of openness.
I think business types vs technical types inherently have different perspectives especially for american companies. One has the "get it done at all costs" the other has "this can't be done since impossible/it will break this".
When a company moves from engineering/technical driven to sales/profit/stock price/shareholders satisfaction driven, once it was not possible to cut (technical) corners, now becomes the de facto. If you push the L7s/L8s out of the discussion room, who would definitely stop or veto circular dependencies, and replace with sir-yes-sir people, now you've successfully created short term KPI wins for the lofty chairs but with a burning fuse of catastrophic failures to come.
It's fun watching their list of "Affected Services" grow literally in front of your eyes as they figure out how many things have this dependency.
It's still missing the one that earned me a phone call from a client.
I know Postman has kinda gone to shit over the years but it's hilarious my local REST client that makes requests from my machine has AWS as a dependency .
I found that out about Plex during an outage too.
It's seemingly everything. SES was the first one that I noticed, but from what I can tell, all services are impacted.
In AWS, if you take out one of dynamo db, S3 or lambda you're going to be in a world of pain. Any architecture will likely use those somewhere including all the other services on top.
If in your own datacenter your storage service goes down, how much remains running
Agreed, but you can put EC2 on that list as well
Assuming running instances remain running, ec2 would be less bad I think.
yes, the previous AWS incident to this one iirc ec2 instances were not reachable from AWS console (IAM issue?) but they kept running and working
When these major issues come up, all they have is symptoms and not causes. Maybe not until the dynamo oncall comes on and says its down, then everyone knows at least the reason for their teams outage.
The scale here is so large they don't know the complete dependency tree until teams check-in on what is out or not, growing this list. Of course most of it is automated, but getting on 'Affected Services' is not.
I wonder what kind of outage or incident or economic change will be required to cause a rejection of the big commercial clouds as the default deployment model.
The costs, performance overhead, and complexity of a modern AWS deployment are insane and so out of line with what most companies should be taking on. But hype + microservices + sunk cost, and here we are.
I don't expect the majority of tech companies to want to run their own physical data centers. I do expect them to shift to more bare-metal offerings.
If I'm a mid to large size company built on DynamoDB, I'd be questioning if it's really worth the risk given this 12+ hour outage.
I'd rather build upon open source tooling on bare metal instances and control my own destiny, than hope that Amazon doesn't break things as they scale to serve a database to host the entire internet.
For big companies, it's probably a cost savings too.
> For big companies, it's probably a cost savings too.
For any sized company, moving away from big clouds back onto traditional VPS or bare-metal offerings will lead to cost savings.
So then we should expect it in the long-term regardless of outages anyway for the sake of growth slone.
I think that prediction severely underestimates the amount of cargo culting present at basically every company when it comes to decisions like this. Using AWS is like the modern “no one ever got fired for buying IBM”.
Well, let's hope a few people just got fired for using AWS. ;)
Honest answer? The outage would need to last about a week.
Looks like it affected Vercel, too. https://www.vercel-status.com/
My website is down :(
(EDIT: website is back up, hooray)
I had a chuckle on my way home yesterday. Standing on the train platform and seeing "Next departure in: (Vercel Connection Error)" on the screen. :P
Static content resolves correctly but data fetching is still not functional.
Imagine using vercel, a company that literally contributes to the starvation of children and is proud of it. Also, literally just learn to use a Dockerfile and a vps, like why do these PaaS even exist, you're paying 3x for the same AWS services.
Have you done anything for it to be back up? Looks like mines are still down.
Looks as if they are rerouting to a different region.
mines are generally down
Service that runs on aws is down when aws is down. Who knew.
This is just a silly anecdote, but every time a cloud provider blips, I'm reminded. The worst architecture I've ever encountered was a system that was distributed across AWS, Azure, and GCP. Whenever any one of them had a problem, the system went down. It also cost 3x more than it should.
I've seen the exact same thing at multiple companies. The teams were always so proud of themselves for being "multi-cloud" and managers rewarded them for their nonsense. They also got constant kudos for their heroic firefighting whenever the system went down, which it did constantly. Watching actually good engineers get overlooked because their systems were rock-solid while those characters got all the praise for designing an unadulterated piece of shit was one of the main reasons I left those companies.
Were you able to find a better company? If yes, what kind of company?
I became one of the founding engineers at a startup, which worked for a little while until the team grew beyond my purview, and no good engineering plan survives contact with sales directors who lie to customers about capabilities our platform has.
Founding engineer is the worst role in tech
Which role do you prefer and why?
Meh. I like it. The key is to not get emotionally invested and help make that transition to a bigger operation if things go well.
> Watching actually good engineers get overlooked because their systems were rock-solid while those characters got all the praise for designing an unadulterated piece of shit
That is the computing business. There is no actual accountability, just ass covering
multi-cloud... any leader that approves such a boondoggle should be labelled incompetent. These morons sell it as a cost-cutting "migration". Never once have I seen such a project complete and it more than doubles complexity and costs.
looks like very few get it right. A good system would have few minutes of blip when one cloud provider went down, which is a massive win compared to outages like this.
They all make it pretty hard, and a lot of resume-driven-devs have a hard time resisting the temptation of all the AWS alphabet soup of services.
Sure you can abstract everything away, but you can also just not use vendor-flavored services. The more bespoke stuff you use the more lock in risk.
But if you are in a "cloud forward" AWS mandated org, a holder of AWS certifications, alphabet soup expert... thats not a problem you are trying to solve. Arguably the lock in becomes a feature.
Lock-in is another way to say "bespoke product offering". Sometimes solving the problem a cloud provider service exposes is not worth it. This locks you in for the same reasons that a specific restaurant locks you in because its their recipe.
Putting aside outages..
I'd counter that past a certain scale, certainly the scale of a firm that used to & could run its own datacenter.. it's probably your responsibility to not use those services.
Sure it's easier, but if you decide feature X requires AWS service Y that has no GCP/Azure/ORCL equivalent.. it seems unwise.
Just from a business perspective, you are making yourself hostage to a vendor on pricing.
If you're some startup trying to find traction, or a small shop with an IT department of 5.. then by all means, use whatever cloud and get locked in for now.
But if you are a big bank, car maker, whatever.. it seems grossly irresponsible.
On the east coast we are already approaching an entire business day being down today. Gonna need a decade without an outage to get all those 9s back. And not to be catastrophic but.. what if AWS had an outage like this that lasted.. 3 days? A week?
The fact that the industry collectively shrugs our shoulders and allows increasing amounts of our tech stacks to be single-vendor hostage is crazy.
> I'd counter that past a certain scale, certainly the scale of a firm that used to & could run its own datacenter.. it's probably your responsibility to not use those services.
It's actually probably not your responsibility, it's the responsibility of some leader 5 levels up who has his head in the clouds (literally).
It's a hard problem to connect practical experience and perspectives with high-level decision-making past a certain scale.
This is the correct answer
> The fact that the industry collectively shrugs our shoulders and allows increasing amounts of our tech stacks to be single-vendor hostage is crazy.
Well, nobody is going to get blamed for this one except people at Amazon. Socially, this is treated as as a tornado. You have to be certain that you can beat AWS in terms of reliability for doing anything about this to be good for your career.
In 20+ years in the industry, all my biggest outages have been AWS... and they seem to be happening annually.
Most of my on-prem days, you had more frequent but smaller failures of a database, caching service, task runner, storage, message bus, DNS, whatever.. but not all at once. Depending on how entrenched your organization is, some of these AWS outages are like having a full datacenter power down.
Might as well just log off for the day and hope for better in the morning. That assumes you could login, which some of my ex-US colleagues could not for half the day, despite our desktops being on-prem. Someone forgot about the AWS 2FA dependency..
In general, the problem with abstracting infrastructure means you have to code to the lowest common denominator. Sometimes its worth it. For companies I work for it really isn't.
I think the problems are:
1) If you try to optimize in the beginning, you tend to fall into the over-optimization/engineering camp;
2) If you just let things go organically, you tend to fall into the big messy camp;
So the ideal way is to examine from time and time and re-architecture once the need arises. But few companies can afford that, unfortunately.
You mean multi-cloud strategy ! You wanna know how you got here ?
See the sales team from Google flew out an executive to NBA Finals, Azure Sales team flew out another executive to NFL superBowl and the AWS team flew out yet another executive to Wimbledon finals. And thats how you end up with multi-cloud strategy.
In this particular case, it was resume-oriented architecture (ROAr!) The original team really wanted to use all the hottest new tech. The management was actually rather unhappy, so the job was to pare that down to something more reliable.
Eh, businesses want to stay resilient to a single vendor going down. My least favorite question in interviews this past year was around multi-cloud. Because imho it just isn't worth it- the increased complexity, the trying to like-like services across different clouds that aren't always really the same, and then just the ongoing costs of chaos monkeying and testing that this all actually works, especially in the face of a partial outage like this vs something "easy" like a complete loss of network connectivity... but that is almost certainly not what CEOs want to hear (mostly who I am dealing with here going for VPE or CTO level jobs).
I could care less about having more vendor dinners when I know I am promising a falsehood that is extremely expensive and likely going to cost me my job or my credibility at some point.
sticker shock / looking at alternative vendors
On the flip side, our SaaS runs primarily on GCP so our users are fine. But our billing and subscription system runs on AWS so no one can pay us today.
i'll bet there are a large number of systems that are dependent on multiple cloud platforms being up without even knowing it. They run on AWS, but rely on a tool from someone else that runs on GCP or on Azure and they haven't tested what happens if that tools goes down...
Common Cause Failures and false redundancy are just all over the place.
Case in point is recent-ish Google Cloud downtime, which ended up taking down Cloudflare and half the internet with it.
Seems to have taken down my router "smart wifi" login page, and there's no backup router-only login option! Brilliant work, linksys....
Happened to lots of commercial routers too (free wifi with sign-in pages in stores for example) and that's way outside us-east-1
Was just on a Lufthansa and then United flight - both of which did not have WiFi. Was wondering if there was something going on at the infrastructure level.
Unfortunately that is also be par for the course
What if they use the same router inside AWS and now they cannot login too?
WiFi login portal (Icomera) on the train I'm on doesn't work either.
To everyone that got paged (like me), grab a coffee and ride it out, the week can only get better!
To everyone who was supposed to get paged but didn't, do a postmortem, chances are your service is running via Twilio and needs migrating elsewhere.
Hah good point, pagerduty is working at least.
PD's been tolerant to total AZ failures for years (was an early eng there)
> grab a coffee and ride it out
The way things are today I'm thankful the coffee machine still works without AWS.
With how long it may last, pour some cold water on the coffee and come back to drink the supernatant in a few hr
Feel sorry for anyone with a "smart" coffee machine.
Burn that SLO, next persons problem eh!
I know there's a lot of anecdotal evidence and some fairly clear explanations for why `us-east-1` can be less reliable. But are there any empirical studies that demonstrate this? Like if I wanted to back up this assumption/claim with data, is there a good link for that, showing that us-east-1 is down a lot more often?
The unreliability claim is driven by two factors.
1. When aws deploys changes they run through a pipeline which pushes change to regions one at a time. Most services start with us-east-1 first.
2. us-east-1 is MASSIVE and considerably larger than the next largest region. There's no public numbers but I wouldn't be surprised if it was 50% of their global capacity. An outage in any other region never hits the news.
> a pipeline which pushes change to regions one at a time
> When AWS deploys updates to its services, deployments to Availability Zones in the same Region are separated in time to prevent correlated failure.
https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
> 1. When aws deploys changes they run through a pipeline which pushes change to regions one at a time.
This is true.
> Most services start with us-east-1 first.
This is absolutely false. Almost every service will FINISH with the largest and most impactful regions.
Each AWS service may choose different pipeline ordering based on the risks specific to their architecture.
In general:
You don't deploy to the largest region first because of the large blast radius.
You may not want to deploy to the largest region last because then if there's an issue that only shows up at that scale you may need to roll every single region back (divergent code across regions is generally avoided as much as possible).
A middle ground is to deploy to the largest region second or third.
Agreed. Most services start deployments on a small number of hosts in single AZs in small less-known regions, ramping up from there. In all my years there I don’t recall “us-east-1 first”.
I don't think its fair to dismiss a lot of anecdotal evidence, much of human experience is based off of it, and just being anecdotal doesn't make it incorrect. For those of us using aws for the last decade, there have been a handful of outages that are pretty hard to forget. Often those same engineers have services in other regions - so we witness these things going down more frequently in us-east-1. Now can I say definitively that us-east-1 goes down the most - nope. Have I had 4 outages in us-east-1 I can remember and only 1-2 us-west-2, yep.
Where are you getting the sense that anecdotal evidence is being dismissed?
The length and breadth of this outage has caused me to lose so much faith in AWS. I knew from colleagues who used to work there how understaffed and inefficient the team is due to bad management, but this just really concerns me.
"Tech people" are long gone, most projects are death marches of technical debt
I find it interesting that AWS services appear to be so tightly integrated that when there's an issue in a region, it affects most or all services. Kind of defeats the purported resiliency of cloud services.
You know how people say X startup is ChatGPT wrapper? A significant chunk of AWS services are wrappers of main services (DynamoDB, EC2, S3 and etc).
Yes, and that's exactly the problem. It's like choosing a microservice architecture for resiliency and building all the services on top of the same database or message queue without underlying redundancy.
afaik they have a tiered service architecture, where tier 1 services are allowed to rely on tier 0 services but not vice-versa, and have a bunch of reliability guarantees on tier 0 services that are higher than tier 1.
It is kinda cool that the worst aws outages are still within a single region and not global.
There IS a huge amount of redundancy built into the core services but nothing is perfect.
DNS is always the single point of failure.
But I think what wasn't well considered was the async effect - If something is gone for 5 minutes, maybe it will be just fine, but when things are properly asynchronous, then the workflows that have piled up during that time becomes a problem in itself. Worst case, they turn into poison pills which then break the system again.
I think a lot of its probably technical debt. So much internally still relies on legacy systems in US-East-1, and every time this happens I'm sure theres a discussion internally about decoupling that reliance which then turns into a massive diagram that looks like a family tree dating back a thousand years of all the things that need to be changed to stop it happening.
There's also the issue of sometimes needing actual strong consistency. Things like auth or billing for example where you absolutely can't tolerate eventual consistency or split-brain situations, in which case you need one region to serve as the ultimate source of truth.
> billing […] can't tolerate eventual consistency
Interesting point that banks actually tolerate a lot more eventual consistency than most software that just use a billing backend ever do.
Stuff like 503-ing a SaaS request because the billing system was down and you couldn’t check for limits, could absolutely be locally cached and eventual consistency would hurt very little. Unless your cost is quite high, I would much rather prefer to keep the API up and deal with the over-usage later.
Banking/transactions is full of split-brains where everyone involved prays for eventual consistency.
If you check out with a credit card, even if everything looked good then, the seller might not see the money for days or might never receive it at all.
Interestingly, TigerBeetle manages to have distributed strict consistency over 6 machines.
Banking is full of examples of eventually consistent systems. ACH, credit card transactions, blockchain...
Sounds plausible. It's also a "fat and happy" symptom not to be able to fix deep underlying issues despite an ever growing pile of cash in the company.
Fixing deep underlying issues tends to fare poorly on performance reviews because success is not an easily traceable victory event. It is the prolonged absence of events like this, and it's hard to prove a negative.
Yeah I think there are a number of "hidden" dependencies on different regions, especially us-east-1. It's an artifact of it being AWS' largest region, etc.
why dont they have us east 2, 3, 4 etc. Actually have different cities.
us-east-1 is actually dozens of physical buildings distributed over a massive area. It's not like a single data center somewhere...
us-east-2 does exist; it’s in Ohio. One major issue is a number of services have (had? Not sure if it’s still this way) a control plane in us-east-1, so if it goes down, so does a number of other services, regardless of their location.
you can't possibly know that?
surely you mean:
> I find it interesting that AWS services appear to be so tightly integrated that when there's an issue THAT BECOMES VISIBLE TO ME in a region, it affects most or all services.
AWS has stuff failing alllllllll the time, it's not very surprising that many of the outages that become visible to you involve multi-system failures - lots of other ones don't become visible!
Sure, but none of those other issues are ever documented by AWS as their status page is usually just one big lie.
Looks like they’re nearly done fixing it.
> Oct 20 3:35 AM PDT
> The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.
In the last hour, I have seen the number of impacted services go from 90 to 92, currently sitting at 97.
It's a bit funny that they say "most service operations are succeeding normally now" when, in fact, you cannot yet launch or terminate new EC2 instance, which is basically the defining feature of the cloud...
In that region, other regions are able to launch EC2s and ECS/EKS without a problem.
Is that material to a conversation about service uptime of existing resources, though? Are there customers out there that are churning through the full lifecycle of ephemeral EC2 instances as part of their day-to-day?
any company of non trivial scale will surely launch ec2 nodes during the day
one of the main points of cloud computing is scaling up and down frequently
We spend ~$20,000 per month in AWS for the product I work on. In the average day we do not launch an EC2 instance. We do not do any dynamic scaling. However, there are many scenarios (especially during outages and such) that it would be critical for us to be able to launch a new instance (and or stop/start an existing instance.)
I understand scaling. I’m saying there is a difference in severity of several orders of magnitude between “the computers are down” and “we can’t add additional computers”.
Still not fixed and may have gotten worse...
Except it just broke again.
We just had a power outage in Ashburn starting at 10 pm Sunday night. It restored at 3:40am ish, and I know datacenters have redundant power sources but the timing is very suspicious. The AWS outage supposedly started at midnight
Even with redundancy, the response time between NYC and Amazon East in Ashburn is something like 10 ms. The impedance mismatch and dropped packets and increased latency would doom most organizations craplications.
> craplications
LOL
Their latest update on the status page says it's a Dynamodb DNS issue
but the cause of that could be anything, including some kind of config getting wiped due to a temporary power outage
US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions. Thus simply being in another region doesn’t protect you from the consistent us-east-1 shenanigans.
AWS doesn’t talk about that much publicly, but if you press them they will admit in private that there are some pretty nasty single points of failure in the design of AWS that can materialize if us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.
Not entirely clear yet if those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load balancing failover.”
Call me crazy, because this is, perhaps it's their "Room 641a". The purpose of a system is what it does, no point arguing 'should' against reality, etc.
They've been charging a premium for, and marketing, "Availability" for decades at this point. I worked for a competitor and made a better product: it could endure any of the zones failing.
> perhaps it's their "Room 641a".
For the uninitiated: https://en.wikipedia.org/wiki/Room_641A
It's possible that you really could endure any zone failure. But I take these claims people make all the time with a grain of salt, unless you're working on AWS scale (basically just 3 companies) and have actually run for years and seen every kind of failure mode claiming to be higher availability is not something that's able to be accurately evaluated.
(I'm assuming by zone you mean the equivalent of an AWS region, with multiple connected datacenters)
Yes, equivalent. Did endure, repeatedly. Demonstrated to auditors to maintain compliance. They would pick the zone to cut off. We couldn't bias the test. Literal clockwork.
I'll let people guess for the sport of it, here's the hint: there were at least 30 of them comprised of Real Datacenters. Thanks for the doubt, though. Implied or otherwise.
Just letting you know how this response looks to other people -- Anon1096 raises legitimate objections, and their post seems very measured in their concerns, not even directly criticizing you. But your response here is very defensive, and a bit snarky. Really I don't think you even respond directly to their concerns, they say they'd want to see scale equivalent to AWS because that's the best way to see the wide variety of failure modes, but you mostly emphasize the auditors, which is good but not a replacement for the massive real load and issues that come along with it. It feels miscalibrated to Anon's comment. As a result, I actually trust you less. If you can respond to Anon's comment without being quite as sassy, I think you'd convince more people.
I appreciate the feedback, truly. Defensive and snarky are both fair, though I'm not trying to convince. The business and practices exist, today.
At risk of more snark [well-intentioned]: Clouds aren't the Death Star, they don't have to have an exhaust port. It's fair the first one does... for a while.
Ya, I totally believe that cloud platforms don't need a single point of failure. In fact, seeing the vulnerability makes me excited, because I realize there is _still_ potential for innovation in this area! To be fair it's not my area of expertise, so I'm very unlikely to be involved, but it's still exciting to see more change on the horizon :)
Others have raised good points, like: they've already won, why bother? We did it because we weren't first!
What company did you do it with, can you say? Definitely, they may have been an early mover, but they can (and I'll say will!) still be displaced eventually, that's how business goes.
It's fine if someone guesses the well-known company, but I can't confirm/deny; like privacy a bit too much/post a bit too spicy. This wasn't a darling VC thing, to be fair. Overstated my involvement with 'made' for effect. A lot of us did the building and testing.
Definitely, that makes sense. Ya no worries at all, I think we all know these kinds of things involve 100+ human work-years, so at best we all just have some contribution to them.
> think we all know these kinds of things involve 100+ human work-years
No kidding! The customers differ, business/finance/governments, but the volume [systems/time/effort] was comparable to Amazon. The people involved in audits were consumed practically for a whole quarter, if memory serves. Not necessarily for testing itself: first, planning, sharing the plan, then dreading the plan.
Anyway, I don't miss doing this at all. Didn't mean to imply mitigation is trivial, just feasible :) 'AWS scale' is all the more reason to do business continuity/disaster recovery testing! I guess I find it being surprising, surprising.
Competitors have an easier time avoiding the creation of a Gordian Knot with their services... when they aren't making a new one every week. There are significant degrees to PaaS, a little focus [not bound to a promotion packet] goes a long way.
You were in a position to actually cut off production zones with live traffic at Amazon scale and test the recovery?
Yes, it was something we would do to maintain certain contracts. Sounds crazy, isn't: they used a significant portion of the capacity, anyway. They brought the auditors.
Real People would notice/care, but financially, it didn't matter. Contract said the edge had to be lost for a moment/restored. I've played both Incident Manager and SRE in this routine.
edit: Less often we'd do a more thorough test: power loss/full recovery. We'd disconnect more regularly given the simplicity.
There are shared resources in different regions. Electricity. Cables. Common systems for coordination.
Your experiment proves nothing. Anyone can pull it off.
The sites were chosen specifically to be more than 50 miles apart, it proved plenty.
I am the CEO of your company. I forgot to pay the electricity bill. How is the multi-region resilience going?
If you go far up enough the pyramid, there is always a single point of failure. Also, it's unlikely that 1) all regions have the same power company, 2) all of them are on the same payment schedule, 3) all of them would actually shut off a major customer at the same time without warning, so, in your specific example, things are probably fine.
I suspect 'whatever1' can't be satisfied, there are no silver bullets. There's always a bigger fish/thing to fail.
The goal posts were fine: bomb the AZ of your choice, I don't care. The Cloud [that isn't AWS, in the case of 'us-east-1'] will still work.
No. It’s just that in my entire career when anyone claims that they have the perfect solution to a tough problem, it means either that they are selling something, or that they haven’t done their homework. Sometimes it’s both.
For what's left of your career: sometimes it's neither. You're confused, perfection? Where? A past employer, who I've deliberately not named, is selling something: I've moved on. Their cloud was designed with multiple-zone regions, and importantly, realizes the benefit: respects the boundaries. Amazon, and you, apparently have not.
Yes, everything has a weakness. Not every weakness is comparable to 'us-east-1'. Ours was billing/IAM. Guess what? They lived in several places with effective and routinely exercised redundancy. No single zone held this much influence. Service? Yes, that's why they span zones.
Said in the absolute kindest way: please fuck off. I have nothing to prove or, worse, sell. The businesses have done enough.
This is not what the resilience expert stated.
if the ceo of your company is personally paying the electric bill, go work for another company :)
Fine, the tab increments. Get back to hyping or something, this is not your job.
I doubt it should be yours if this is how you think about resilience.
Your vote has been tallied
Same failure mode of anything else.
How’s not paying your AWS bill going for you?
If your accounts payable can’t pay the electric bill on time, you’ve got bigger problems.
Yea, let's play along. Our CEO is personally choosing to not pay any entire class of partners across the planet. Are we even still in business? I'm so much more worried about being paid than this line of questioning.
A Cloud with multiple regions, or zones for that matter, that depend on one is a poorly designed Cloud; mine didn't, AWS does. So, let's revisit what brought 'whatever1', here:
> Your experiment proves nothing. Anyone can pull it off.
Amazon didn't, we did. Hmm.
Fine, our overseas offices are different companies and bills are paid for by different people.
Not that "forgot to pay" is going to result in a cut off - that doesn't happen with the multi-megawatt supplies from multiple suppliers that go into a dedicated data centre. It's far more likely that the receivers will have taken over and will pay the bill by that point.
Interesting. Langley isn’t that far away
Was that competitor priced competitively with AWS? I think of the project management triangle here - good, fast, or cheap - pick two. AWS would be fast and cheap.
Yes, good point. Pricing is a bit higher. As another reply pointed out: there's ~three that work on the same scale. This was one, another hint I guess: it's mostly B2B. Normal people don't typically go there.
I'm guessing Azure which may technically have greater resilience but has dogshit support and UX.
Azure, from my experience with it has stuff go down a lot and degrades even more. Seems to either not admit the degradation happened or rely on 1000 pages of fine print SLA docs to prove you don't get any credits for it. I suppose that isn't the same as "lose a region resiliency" so it could still be them given the poster said it is B2B focused and Azure is subject to a lot of exercises like this from it's huge enterprise customers. FWIW I worked as a IaC / devops engineer with the largest tenant in one of the non-public Azure clouds.
AWS is not cheap. AWS is one to two orders of magnitude more expensive than DIY.
My $3/mo AWS instance is far cheaper than any DIY solution I could come up with, especially when I have to buy the hardware and supply the power/network/storage/physical space. Not to mention it's not worth my time to DIY something like that in the first place.
There can be other valid usecases than your own.
Small things are cheap, yes, news at 11. But did you compare what your $3-$5 gets at Amazon vs a more traditional provider?
False equivalence/moving goalposts IMO... I was only refuting your claim of "AWS is not cheap", as if it's somehow impossible for it to be cheap... which I'm saying isn't the case.
Sorry to jump in y'alls convo :) AWS is cheaper than the Cloud we built... I just don't think it's significant. Ours cost more because businesses/governments would pay it, not because it was optimal.
Price is beside my original point: Amazon has enjoyed decades for arbitrage. This sounds more accusatory than intended: the 'us-south-1' problem exists because it's allowed/chosen. Created in 2006!
Now, to retract that a bit: I could see technical debt/culture making this state of affairs practical, if not inevitable. Correct? No, if I was Papa Bezos I'd be incredibly upset my Supercomputer is so hamstrung. I think even the warehouses were impacted!
The real differentiator was policy/procedure. Nobody was allowed to create a service or integration with this kind of blast area. Design principles, to say the least. Fault zones and availability zones exist for a reason beyond capacity, after all.
It's really not that nefarious.
IAD datacenters have forever been the place where Amazon software developers implement services first (well before AWS was a thing).
Multi-AZ support often comes second (more than you think; Amazon is a pragmatic company), and not every service is easy to make TRULY multi-AZ.
And then other services depend on those services, and may also fall into the same trap.
...and so much of the tech/architectural debt gets concentrated into a single region.
Right, like I said: crazy. Anything production with certain other clouds must be multi-AZ. Both reinforced by culture and technical constraints. Sometimes BCDR/contract audits [zones chosen by a third party at random].
It sure is a blast when they decide to cut off (or simulate the loss of) a whole DC just to see what breaks, I bet :)
The disconnect case was simple: breakage was as expected. The island was lost until we drew it on the map again. Things got really interesting when it was a full power-down and back on.
Were the docs/tooling up to date? Tough bet. Much easier to fix BGP or whatever.
This set of facts comes to light every 3-5 years when US-East-1 has another failure. Clearly they could have architected their way out of this blast radius problem by now, but they do not.
Why would they keep a large set of centralized, core traffic services in Virginia for decades despite it being a bad design?
It’s probably because there is a lot of tech debt plus look at where it is - Virgina. It shouldn’t take much of imagination to figure out why that is strategic
They could put a failover site in Colorado or Seattle or Atlanta, handling just their infrastructure. It's not like the NSA wouldn't be able to backhaul from those places.
You mean the surveillance angle as reason for it being in Virginia?
AWS _had_ architected away from single-region failure modes. There are only a few services that are us-east-1 only in AWS (IAM and Route53, mostly), and even they are designed with static stability so that their control plane failure doesn't take down systems.
It's the rest of the world that has not. For a long time companies just ran everything in us-east-1 (e.g. Heroku), without even having an option to switch to another region.
So the control plane for DNS and the identity management system are tied to us-east-1 and we’re supposed to think that’s OK? Those seem like exactly the sorts of things that should NOT be reliant on only one region.
It's worse than that. The entire DNS ultimately depends on literally one box with the signing key for the root zone.
You eventually get services that need to be global. IAM and DNS are such examples, they have to have a global endpoint because they apply to the global entities. AWS users are not regionalized, an AWS user can use the same key/role to access resources in multiple regions.
not quite true - there are some regions that have a different set of AWS users / credentials. I can't remember what this is called off the top of my head.
These are different AWS partitions. They are completely separate from each other, requiring separate accounts and credentials.
There's one for China, one for the AWS government cloud, and there are also various private clouds (like the one hosting the CIA data). You can check their list in the JSON metadata that is used to build the AWS clients (e.g. https://github.com/aws/aws-sdk-go-v2/blob/1a7301b01cbf7e74e4... ).
The parent seems to be implying there is something in us-east-1 that could take down all the various regions?
What is the motivation of an effective Monopoly to do anything?
I mean look at their console. Their console application is pretty subpar.
Been a while since I last suffered from AWS arbitrary complexity, but afaik you can only associate certificates to cloudfront if they are generated in us-east-1, so it's undoubtedly a single point of failure for all CDN if this is still the case.
I worked at AMZN for a bit and the complexity is not exactly arbitrary; it's political. Engineers and managers are highly incentivized to make technical decisions based on how they affect inter-team dependencies and the related corporate dynamics. It's all about review time.
I have seen one promo docket get rejected for doing work that is not complex enough... I thought the problem was challenging, and the simple solution brilliant, but the tech assessor disagreed. I mean once you see there is a simple solution to a problem, it looks like the problem is simple...
I had a job interview like this recently: "what's the most technically complex problem you've ever worked on?"
The stuff I'm proudest of solved a problem and made money but it wasn't complicated for the sake of being complicated. It's like asking a mechanical engineer "what's the thing you've designed with the most parts"
I think this could still be a very useful question for an interviewer. If I were hiring for a position working on a complex system, I would want to know what level of complexity a prospect was comfortable dealing with.
I was once very unpopular with a team of developers when I pointed out a complete solution to what they had decided was an "interesting" problem - my solution didn't involve any code being written.
I suppose it depends on what you are interviewing for but questions like that I assume are asked more to see how you answer than the specifics of what you say.
Most web jobs are not technically complex. They use standard software stacks in standard ways. If they didn't, average developers (or LLMs) would not be able to write code for them.
Yeah, I think this. I've asked this in interviews before, and it's less about who has done the most complicated thing and more about the candidate's ability to a) identify complexity, and b) avoid unnecessary complexity.
I.e. a complicated but required system is fine (I had to implement a consensus algorithm for a good reason).
A complicated but unrequired system is bad (I built a docs platform for us that requires a 30-step build process, but yeah, MkDocs would do the same thing.
I really like it when people can pick out hidden complexity, though. "DNS" or "network routing" or "Kubernetes" or etc are great answers to me, assuming they've done something meaningful with them. The value is self-evident, and they're almost certainly more complex than anything most of us have worked on. I think there's a lot of value to being able to pick out that a task was simple because of leveraging something complex.
That's what arbitrary means to me, but sure, I see no problem calling it political too
Forced attrition rears its head again
>US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions
I thought that if us-east-1 goes down you might not be able to administer (or bring up new services) in other zones, but if you have services running that can take over from us-east-1, you can maintain your app/website etc.
I haven’t had to do this for several years but that was my experience a few years ago on an outage - obviously it depends on the services you’re using.
You can’t start cloning things to other zones after us-east-1 is down - you’ve left it too late
It depends on the outage. There was one a year or two ago (I think? They run together) that impacted EC2 such that as long as you weren’t trying to scale, or issue any commands, your service would continue to operate. The EKS clusters at my job at the time kept chugging along, but had Karptenter tried to schedule more nodes, we’d have had a bad time.
Static stability is a very valuable infra attribute. You should definitely consider how statically stable your services are in architecting them
Meanwhile, AWS has always marketed itself as "elastic". Not being able to start new VMs in the morning to handle the daytime load will wreck many sites.
Well that sounds like exactly the sort of thing that shouldn’t happen when there’s an issue given the usual response is to spin things up elsewhere, especially on lower priority services where instant failover isn’t needed.
Yeah because Amazon engineers are hypocrites. They want you to spend extra money for region failover and multi-az deploys but they don't do it themselves.
That's a good point, but I'd just s/Amazon engineers/AWS leadership/ , as I'm pretty sure that there's a few layers of management removed between the engineers on the ground at AWS, those who deprioritise any longer-term resilience work needed (which is a very strategic decisioN), and those those who are in charge of external comms/education about best practices for AWS customers.
Luckily, those people are the ones that will be getting all the phonecalls from angry customers here. If you're selling resilience and selling twice the service (so your company can still run if one location fails), and it still failed, well... phones will be ringing.
They absolutely do do it themselves..
What do you mean? Obviously, as TFA shows and as others here pointed-out, AWS relies globally on services that are fully-dependent on us-east-1, so they aren't fully multi-region.
The claim was that that they're total hypocrites aren't multi region at all. That's totally false, the amount of redundancy in aws is staggering. But there are foundational parts which, I guess, have been too difficult to do that for (or perhaps they are redundant but the redundancy failed in this case? I dunno)
There's multiple single points of failure for their entire cloud in us-east-1.
I think it's hypocritical for them to push customers to double or triple their spend in AWS when they themselves have single points of failure on a single region.
That's absurd. It's hypocritical to describe best practices as best practices because you haven't perfectly implemented them? Either they're best practice or they aren't. The customers have the option of risking non-redundancy also, you know.
Yes it's hypocritical to push customers to pay you more money with best practices for uptime when you yourself don't follow them and your choices to not follow them actually make the best practices you pushed your customers to pay you more money for not fully work.
Hey! Pay us more money so when us-east-1 goes down you're not down (actually you'll still go down because us-east-1 is a single point of failure even for our other regions).
They can't even bother to enable billing services in GovCloud regions.
Amazon are planning to launch the EU Sovereign Cloud by the end of the year. They claim it will be completely independent. It may be possible then to have genuine resiliency on AWS. We'll see.
This is the difference between “partitions” and “regions”. Partitions have fully separate IAM, DNS names, etc. This is how there are things like US Gov Cloud, the Chinese AWS cloud, and now the EU sovereign cloud
Yes, although unfortunately it’s not how AWS sold regions to customers. AWS folks consistently told customers that regions were independent and customers architected on that belief.
It was only when stuff started breaking that all this crap about “well actually stuff still relies on us-east-1” starts coming out.
Yes - they told me quite specifically that until they launch their their sovereign cloud, the mothership will be around.
Which are lies btw - Amazon has admitted the "EU sovereign cloud" is still susceptible to US government whims.
Then it will be eu-east-1 taking down the EU
gov, iso*, cn are also already separate (unless you need to mess with your bill, or certain kinds of support tickets)
My contention for a long time has been that cloud is full of single points of failure (and nightmarish security hazards) that are just hidden from the customer.
"We can't run things on just a box! That's a single point of failure. We're moving to cloud!"
The difference is that when the cloud goes down you can shift the blame to them, not you, and fixing it is their problem.
The corporate world is full of stuff like this. A huge role of consultants like McKinsey is to provide complicated reports and presentations backing the ideas that the CEO or other board members want to pursue. That way if things don't work out they can blame McKinsey.
You act as if that is a bug not a feature. As hypothetically someone who is responsible for my site staying up, I would much rather blame AWS than myself. Besides none of your customers are going to blame you if every other major site is down.
> As hypothetically someone who is responsible for my site staying up, I would much rather blame AWS than myself.
That's a very human sentiment, and I share it. That's why I don't swap my car wheels myself, I don't want to feel responsible if one comes loose on the highway and I cause an accident.
But at the same time it's also appalling how low the bar has gotten. We're still the ones deciding that one cloud is enough. The down being "their" fault really shouldn't excuse that fact. Most services aren't important enough to need redundancy. But if yours is, and it goes down because you decided that one provider is enough, then your provider isn't solely at fault here and as a profession I wish we'd take more accountability.
How many businesses can’t afford to suffer any downtime though?
But I’ve led enough cloud implementations where I discuss the cost and complexity between - multi-AZ (it’s almost free so why not), multi region , and theoretically multi cloud (never came up in my experience) and then cold, warm and hot standby, RTO and RPO, etc
And for the most part, most businesses are fine with just multi-AZ as long as their data can survive catastrophe.
As someone who hypothetically runs a critical service, I would rather my service be up than down.
And you have never had downtime? If your data center went down - then what?
I'm saying the importance is on uptime, not on who to blame, when services are critical.
You don't have one data center with critical services. You know lots of companies are still not in the cloud, and they manage their own datacenters, and they have 2-3 of them. There are cost, support, availability and regulatory reasons not to be in the cloud for many parties.
Or it is a matter of efficiency. If 1 million companies design and maintain their servers, there would be 1 million (or more) incidents like these. Same issues. Same fixes. Not so efficient.
It might be worse in terms of total downtime but it likely would be much less noticable as it woould be scattered individual outages not everyone at the same time.
Total downtime would likely be the same or more.
There are hints at in their documentation. For example ACM certs for cloudfront and KMS keys for route53 DNSSEC have to be in the us-east1 region.
However these services don't need high write uptime.
FWIW, I tried creating a DNSSEC entry for one of my domains during the outage, and it worked just fine.
It also doesn't help that most companies using AWS aren't remotely close to multi-region support, and that us-east-1 is likely the most populated region.
even if us-east-1 was a normal region there is not enough spare capacity to take up all the workloads from us-east-1 in other regions so t's a moot point
> Thus simply being in another region doesn’t protect you from the consistent us-east-1 shenanigans.
Well it did for me today...Dont use us-east-1 explicitly just other regions and I had no outage today...( I get the point about the skeletons in the closet of us-east-1 ...maybe the power plug goes via Bezos wood desk? )
Internet was supposed to be a communication network if the East Coast was nuked.
What it turned into was Daedalus from Deus Ex lol.
It sounds like they want to avoid split-brain scenarios as much as possible while sacrificing resilience. For things like DNS, this is probably unavoidable. So, not all the responsibility can be placed on AWS. If my application relies on receipts (such as an airline ticket), I should make sure I have an offline version stored on my phone so that I can still check in for my flight. But I can accept not to be able to access Reddit or order at McDonalds with my phone. And always having cash at hand is a given, although I almost always pay with my phone nowadays.
I hope they release a good root cause analysis report.
It's not unavoidable for DNS. DNS is inherently eventually consistent anyway, due to time-based caching.
Sure, but you want to make sure that changes propagate as soon as possible from the central authority. And for AWS, the control plane for that authority happens to be placed in US-EAST-1. Maybe Blockchain technology can decentralize the control plane?
Or Paxos or Raft...
Llama-5-beelzebub has escaped containment. A special task force has been deployed to the Virginia data center to pacify it.
Amazon devs to ClaudeCode: “That didn’t fix it. The service is down, pls fix. Make no mistakes. pls.”
aws had an outage. Many companies were impacted. Headlines around the world blame AWS. the real news is how easy it is to identify companies that have put cost management ahead of service resiliency.
Lots of orgs operating wholly in AWS and sometimes only within us-east-1 had no operational problems last night. Some that is design (not using the impacted services). Some of that is good resiliency in design. And some of that was dumb luck (accidentally good design).
Overall, those companies that had operational problems likely wouldn't have invested in resiliancy expenses in any other deployment strategy either. It could have happened to them in Azure, GCP or even a home rolled datacenter.
Redundancy is insanely expensive especially for SaaS companies where the biggest cost is cloud.
Are customers willing to pay companies for that redundancy? I think not. Once every few years outage for 3 hours is fine for non critical services.
In general it is not expensive. In most cases you can either load balance across two regions all the time or have a fallback region that you scale out/up and switch to if needed.
Multi tenancy is expensive. You’d need to have every single service you depend on, including 3rd party services, on multi tenancy. In many cases such as the main DB, you need dedicated resources. You’re most likely to also going to need expensive enterprise SLAs.
Servers are easy. I’m sure most companies already have servers that can be spun up. Things related to data are not.
You don't need expensive SLAs to do data replication or load balancing in the cloud. It is pretty basic.
Talking about 3rd party services.
And no, data replication or load balancing is not easy, nor cheap.
You wrote "You’d need to have every single service you depend on, including 3rd party services, on multi tenancy.". This is highly incorrect. I worked at several companies that have a multi tenancy strategy. It is:
* Automated. * Scoped to business critical services. Typically not including many of the 3rd party services. * Uses data replication, which is a feature in any modern cloud. * Load balancing, by DNS basically for free or a real LB somewhere on the edge.
If you fail at this you probably fail at disaster recovery too or any good practice on how to run things in the cloud. Most likely because of very poor architecture.
Quite expensive to build though. Many of these companies don't have the sharpest engineers building multi-cloud.
IMO, going multi AZ or multi-cloud adds a good amount of complexity.
TBH I don't care if last.fm doesn't work for 8 hours a year, that isn't a big deal. My bank? Yeah that should work.
>> Redundancy is insanely expensive especially for SaaS companies
That right there means the business model is fucked to begin with. If you can't have a resilient service, then you should not be offering that service. Period. Solution: we were fine before the cloud, just a little slower. No problem going back to that for some things. Not everything has to be just in time at lowest possible cost.
Three nines might be good enough when you're Fornite. Probably not when you're Robinhood.
The part that makes no sense is - it's not cost management. AWS costs ten to a hundred times MORE than any other option - they just don't put it in the headline number.
Careful: NPM _says_ they're up (https://status.npmjs.org/) but I am seeing a lot of packages not updating and npm install taking forever or never finishing. So hold off deploying now if you're dependent on that.
They've acknowledged an issue now on the status page. For me at least, it's completely down, package installation straight up doesn't work. Thankfully current work project uses a pull-through mirror that allows us to continue working.
"Thankfully current work project uses a pull-through mirror that allows us to continue working."
so there is no free coffee time???? lmao
Yep. It's the auditing part that is broken. As a (dangerous) workaround use --no-audit
Also npm audit times out.
It just goes to show the difference between best practices in cloud computing, and what everyone ends up doing in reality, including well known industry names.
> It just goes to show the difference between best practices in cloud computing, and what everyone ends up doing in reality,
Well, inter-region DR/HA is a expensive thing to ensure (whether on salaries, infra or both), specially when you are in AWS.
Eh, the "best practices" that would've prevented this aren't trivial to implement and are definitely far beyond what most engineering teams are capable of, in my experience. It depends on your risk profile. When we had cloud outages at the freemium game company I worked at, we just shrugged and waited for the systems to come back online - nobody dying because they couldn't play a word puzzle. But I've also had management come down and ask what it would take to prevent issues like that from happening again, and then pretend they never asked once it was clear how much engineering effort it would take. I've yet to meet a product manager that would shred their entire roadmap for 6-18 months just to get at an extra 9 of reliability, but I also don't work in industries where that's super important.
Indeed, yet one would expect AWS to lead by example, including all of those that are only using a single region.
Like any company over a handful of years old, I'm sure they have super old, super critical systems running they dare not touch for fear of torching the entire business. For all we know they were trying to update one of those systems to be more resilient last night and things went south.
So many high profile companies with old deployments stuck in a single region, then.
Best practice does not include plan for when AWS going down. Netflix does not plan for it and they have a very strong eng org.
Did they stop their Chaos Gorilla, which simulates a region outage?
It was only one region.
Does AWS follow its own Well-Architected Framework!?
Er...They appear to have just gone down again.
My systems didn't actually seem to be affected until what I think was probably a SECOND spike of outages at about the time you posted.
The retrospective will be very interesting reading!
(Obviously the category of outages caused by many restored systems "thundering" at once to get back up is known, so that'd be my guess, but the details are always good reading either way).
Mine are more messed up now (12:30 ET) than they were this morning. AWS is lying that they've fixed the issue.
Yep. I don't think they ever fully recovered, but status page is still reporting a lot of issues.
Even though us-east-1 is the region geographically closest to me, I always choose another region as default due to us-east-1 (seemingly) being more prone to these outages.
Obviously, some services are only available in us-east-1, but many applications can gain some resiliency just by making a primary home in any other region.
What services are only available in us-east-1?
IAM control plane for example:
> There is one IAM control plane for all commercial AWS Regions, which is located in the US East (N. Virginia) Region. The IAM system then propagates configuration changes to the IAM data planes in every enabled AWS Region. The IAM data plane is essentially a read-only replica of the IAM control plane configuration data.
and I believe some global services (like certificate manager, etc.) also depend on the us-east-1 region
https://docs.aws.amazon.com/IAM/latest/UserGuide/disaster-re...
In addition to those listed in sibling comments, new services often roll out in us-east-1 before being made available in other regions.
I recently ran into an issue where some Bedrock functionality was available in us-east-1 but not one of the other US regions.
IAM, Cloudfront, Route53, ACM, Billing...
parts of S3 (although maybe that's better after that major outage years ago)
This is the right move. 10 years ago, us-east-1 was on the order of 10x bigger than the next largest region. This got a little better now, but any scaling issues still tend to happen in us-east-1.
The AWS has been steering people to us-east-2 for a while. For example, traffic between us-east-1 and us-east-2 has the same cost as inter-AZ traffic within the us-east-1.
My minor 2000 users web app hosted on Hetzner works fyi. :-P
Right up until the DNS fails
I am using ClouDNS. That is an AnycastDNS provider. My hopes are that they are more reliable. But yeah, it is still DNS and it will fail. ;-)
But how are you going to web scale it!? /s
Web scale? It is an _web_ app, so it is already web scaled, hehe.
Seriously, this thing runs already on 3 servers. A primary + backup and a secondary in another datacenter/provider at Netcup. DNS with another AnycastDNS provider called ClouDNS. Everything still way cheaper then AWS. The database is already replicated for reads. And I could switch to sharding if necessary. I can easily scale to 5, 7, whatever dedicated servers. But I do not have to right now. The primary is at 1% (sic!) load.
There really is no magic behind this. And you have to write your application in a distributable way anyway, you need to understand the concepts of stateless, write-locking, etc. also with AWS.
Someone, somewhere, had to report that doorbells went down because the very big cloud did not stay up.
I think we're doing the 21st century wrong.
My Ring doorbell works just fine without an internet connection (or during a cloud outage). The video storage and app notifications are another matter, but the doorbell itself continues to ring when someone pushes the button.
Someone, somewhere, had to report that rock throwers went down because the very big cloud did not stay up.
I think we're doing the 16th century wrong.
Except we're not doing the 16th century right now.
Black powder is still used by firearm enthusiasts, and just like in the 16th century I'm sure they don't appreciate it getting wet when it rains
We created a single point of failure on the Internet, so that companies could avoid single points of failure in their data centers.
It's actually kinda great. When AWS has issues it makes national news and that's all you need to put on your status page and everyone just nods in understanding. It's a weird kind of holiday in a way.
China is unaffected.
Robinhood's completely down. Even their main website: https://robinhood.com/
Amazing, I wonder what their interview process is like, probably whiteboarding a next-gen LLM in WASM, meanwhile, their entire website goes down with us-east-1... I mean.
Friends tell their friends about more mature brokerages once the account goes over $100k.
Every week or so we interview a company and ask them if they have a fall-back plan in case AWS goes down or their cloud account disappears. They always have this deer-in-the-headlights look. 'That can't happen, right?'
Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years. I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.
>Yes it does happen but very rarely to the tune of a few hours every 5-10 years.
It is rare, but it happens at LEAST 2-3x a year. AWS us-east-1 has a major incident affecting multiple services (that affect most downstream aws services) multiple times a year. Usually never the same root cause.
Not very many people realize that there are some services that still run only in us-east-1.
Call it the aws holiday. Most other companies will be down anyway. It's very likely that your company can afford to be down for a few hours, too.
imagine if the electricity supplier too that stance.
But that is the stance for a lot of electrical utilities. Sometimes weather or a car wreck takes out power and since its too expensive to have spares everywhere, sometimes you have to wait a few hours for a spare to be brought in
No, that's not the stance for electrical utilities (at least in most developed countries, including the US): the vast majority of weather events cause localized outages (the grid as a whole has redundancies built in; distribution to (residential and some industrial) does not. It expects failures of some power plants, transmission lines, etc. and can adapt with reserve power, or, in very rare cases by partial degradation (i.e. rolling blackouts). It doesn't go down fully.
Spain and Portugal had a massive power outage this spring, no?
Yeah, and it has a 30 page Wikipedia article with 161 sources (https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackou...). Does that seem like a common occurrence?
> Sometimes weather or a car wreck takes out power
Not really? Most of the infrastructure is quite resilient and the rare outage is usually limited to a street or two, with restoration time mainly determined by the time it takes the electricians to reach the incident site. For any given address that's maybe a few hours per decade - with the most likely cause being planned maintenance. That's not a "spares are too expensive" issue, that's a "giving every home two fully independent power feeds is silly" issue.
Anything on a metro-sized level is pretty much unheard of, and will be treated as serious as a plane crash. They can essentially only be caused by systemic failure on multiple levels, as the grid is configured to survive multiple independent failures at the same time.
Comparing that to the AWS world: individual servers going down is inevitable and shouldn't come as a surprise. Everyone has redundancies, and an engineer accidentally yanking the power cables of an entire rack shouldn't even be noticeable to any customers. But an entire service going down across an entire availability zone? That should be virtually impossible, and having it happen regularly is a bit of a red flag.
I think this is right, but depending on where you live, local weather-related outages can still not-infrequently look like entire towns going dark for a couple days, not streets for hours.
(Of course that's still not the same as a big boy grid failure (Texas ice storm-sized) which are the things that utilities are meant to actively prevent ever happening.)
The electric grid is much more important than most private sector software projects by an order of magnitude.
Catastrophic data loss or lack of disaster recovery kills companies. AWS outages do not.
What if the electricity grid depends on some AWS service?
That would be circular dependency.
The grid actually already has a fair number of (non-software) circular dependencies. This is why they have black start [1] procedures and run drills of those procedures. Or should, at least; there have been high profile outages recently that have exposed holes in these plans [2].
1. https://en.wikipedia.org/wiki/Black_start 2. https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackou...
And?
It doesn't though? Weird what if
You'd be surprised. See. GP asks a very interesting question. And some grid infra indeed relies on AWS, definitely not all of it but there are some aspects of it that are hosted by AWS.
I worked for an energy tech startup that did consulting for big utility companies. It absolutely does.
do you know for sure? And if not yet, you can bet someone will propose it in the future. So not a weird what if at all
This is already happening. I have looked at quite a few companies in the energy space this year, two of them had AWS as a critical dependency in their primary business processes and that could definitely have an impact on the grid. To their defense: AWS presumably tests their fall-back options (generators) with some regularity. But this isn't a farfetched idea at all.
Isn't that basically Texas?
Texas is like if you ran your cloud entirely in SharePoint.
Let's not insult SharePoint like that.
It's like if you ran you cloud on an old dell box in your closet while your parent company is offering to directly host it in AWS for free.
Also, every time your cloud went down, the parent company begged you to reconsider, explaining that all they need you to do is remove the disturbingly large cobwebs so they can migrate it. You tell them that to do so would violate your strongly-held beliefs, and when they stare at you in bewilderment, you yell “FREEDOM!” while rolling armadillos at them like they’re bowling balls.
Fortunately nearly all services running on AWS aren't as important as the electric utility, so this argument is not particularly relevant.
And regardless, electric service all over the world goes down for minutes or hours all the time.
That's the wrong analogy though. We're not talking about the supplier - I'm sure Amazon is doing its damnedest to make sure that AWS isn't going down.
The right analogy is to imagine if businesses that used electricity took that stance, and they basically all do. If you're a hospital or some other business where a power outage is life or death, you plan by having backup generators. But if you're the overwhelming majority of businesses, you do absolutely nothing to ensure that you have power during a power outage, and it's fine.
> But if you're the overwhelming majority of businesses, you do absolutely nothing to ensure that you have power during a power outage, and it's fine.
it is fine because the electricity supplier is so good today that people don't see it going down as a risk.
Look at south africa's electricity supplier for a different scenario.
Utility companies do not have redundancy for every part of their infrastructure either. Hence why severe weather or other unexpected failures can cause loss of power, internet or even running water.
Texas has had statewide power outages. Spain and Portugal suffered near-nationwide power outages last year. Many US states are heavily reliant on the same single source for water. And remember the discussions on here about Europe's reliance on Russian gas?
Then you have the XKCD sketch about how most software products are reliant on at least one piece of open source software that is maintained by a single person as a hobby.
Nobody likes a single point of failure but often the costs associated with mitigating that are much greater than the risks of having that point of failure.
This is why "risk assessments" are a thing.
> Hence why severe weather or other unexpected failures can cause loss of power, internet or even running water.
Not all utility companies have the same policies, but all have a resiliency plan to avoid blackout that is a bit more serious than "Just run it on AWS".
> Not all utility companies have the same policies, but all have a resiliency plan to avoid blackout that is a bit more serious than "Just run it on AWS".
You're arguing as if "run it on AWS" was a decision that didn't undergo the same kinds of risk assessment. As someone who's had to complete such processes (and in some companies, even define them), I can assure you that nobody of any competency runs stuff on AWS complacently.
In fact running stuff with resilience in AWS isn't even as simple as "just running it in AWS". There's a whole plethora of things to consider, and each with its own costs attached. As the meme goes "one does not simply just run something on AWS"
> nobody of any competency runs stuff on AWS complacently.
I agree with this. My point is simply that we, as an industry, are not a very competent bunch when it comes to risk management ; and that's especially true when compared to TSOs.
That doesn't mean nobody knows what they do in our industry or that shit never hits the fan elsewhere, but I would argue that it's an outlier behaviour, whereas it's the norm in more secure industries.
> As the meme goes "one does not simply just run something on AWS"
The meme has currency for a reason, unfortunately.
---
That being said, my original point was that utilities losing clients after a storm isn't the consequence of bad (or no) risk assessment ; it's the consequence of them setting up acceptable loss thresholds depending on the likelihood of an event happening, and making sure that the network as a whole can respect these SLOs while strictly respecting safety criteria.
Nobody was suggesting that loss of utilities is a result of bad risk management. We are saying that all competent businesses run risk management and for most businesses, the cost of AWS being down is less than the cost of going multi cloud.
This is particularly true when Amazon hand out credits like candy. So you just need to moan to your AWS account manager about the service interruption and you’ll be covered.
> imagine if the electricity supplier too that stance.
Imagine if the cloud supplier was actually as important as the electricity supplier.
But since you mention it, there are instances of this and provisions for getting back up and running:
* https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackou...
* https://en.wikipedia.org/wiki/Northeast_blackout_of_2003
and how many times have aws gone down majorly like that? I think you wouldn't be able to count it.
* https://en.wikipedia.org/wiki/Timeline_of_Amazon_Web_Service...
As someone who lives in Ontario, Canada, I got hit by the 2003 grid outage, which is once in >20 years. Seems like a fairly good uptime to me.
(Each electrical grid can perhaps be considered analogous to a separate cloud provider. Or perhaps, in US/CA, regions:
* https://en.wikipedia.org/wiki/North_American_Electric_Reliab...
)
It happens 2-3x a year during peacetime. Tail events are not homogeneously distributed across time.
Well technically AWS has never failed in wartime.
I don't understand, peacetime?
Peacetime = When not actively under a sustained attack by a nation-state actor. The implication being, if you expect there to be a “wartime”, you should also expect AWS cloud outages to be more frequent during a wartime.
Don't forget stuff like natural disasters and power failures...or just a very adventurous squirrel.
AWS (over-)reliance is insane...
What about being actively attacked by multinational state or an empire? Does it count or not?
Why people keep using "nation-state" term incorrectly in HN comments is beyond me...
I think people generally mean "state", but in the US-centric HN community that word is ambiguous and will generally be interpreted the wrong way. Maybe "sovereign state" would work?
As someone with a political science degree whose secondary focus was international relations, "Nation-state" has a number of different, definitions, an (despite the fact that dictionaries often don't include it), one of the most commonly encountered for a very long time has been "one of the principle subjects of international law, held to possess what is popularly, but somewhat inaccuratedly, referred to as Westphalian sovereignty" (there is a historical connection between this use and the "state roughly correlating with single nation" sense that relates to the evolution of “Westphalian sovvereignty” as a norm, but that’s really neither here nor there, because the meaning would be the meaning regardless of its connection to the other meaning.)
You almost never see the definition you are referring used except in the context of explicit comparison of different bases and compositions of states, and in practice there is very close to zero ambiguity which sense is meant, and complaining about it is the same kind of misguided prescriptivism as (also popular on HN) complaining about the transitive use of "begs the question" because it has a different sense than the intransitive use.
It sounds more technical than “country” and is therefore better
To me it sounds more like saying regime instead of government, gives off a sense of distance and danger.
Not really: nations state level actor: a hacker group funded by a country, not necessarily directly part of that country's government but at the same time kept at arms length for deniability purposes. For instance, hacking groups operating from China, North Korea, Iran and Russia are often doing this with the tacit approval and often funding from the countries they operate in, but are not part of the 'official' government. Obviously the various secret services in so far as they have personnel engaged in targeted hacks are also nation state level actors.
It could be a multinational state actor, but the term nation-state is the most commonly used, regardless of accuracy. You can argue over whether of not the term itself is accurate, but you still understood the meaning.
It makes a lot more sense if they had a typo of peak
Its a different kind of outage when the government disconnects you from the internet. Happens all the time, just not yet in the US.
> there are some services that still run only in us-east-1.
What are those ?
> Not very many people realize that there are some services that still run only in us-east-1.
The only ones that you're likely to encounter are IAM, Route53, and the billing console. The billing console outage for a few hours is hardly a problem. IAM and Route53 are statically stable and designed to be mostly stand-alone. They are working fine right now, btw.
During this outage, my infrastructure on AWS is working just fine, simply because it's outside of us-east-1.
Ironically, our observability provider went down.
I would take the opposite view, the little AWS outages are an opportunity to test your disaster recovery plan, which is worth doing even if it takes a little time.
It’s not hard to imagine events that would keep AWS dark for a long period of time, especially if you’re just in one region. The outage today was in us-east-1. Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.
> Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.
Sure, it's worth considering, but for most companies it's not going to be worth the engineering effort to architect cross-cloud services. The complexity is NOT linear.
IMO most shops should focus on testing backups (which should be at least cross-cloud, potentially on-prem of some sort) to make sure their data integrity is solid. Your data can't be recreated, everything else can be rebuilt even if it takes a long time.
> I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.
Absurd claim.
Multi-region and multi-cloud are strategies that have existed almost as long as AWS. If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.
It’s not absurd, I’ve seen it happen. Company executes on their DR plan due to AWS outage, AWS is back before DR is complete, DR has to be aborted, service is down longer than if they’d just waited.
Of course there are cases when multi cloud makes sense, but it is in the minority. The absurd claim is that most companies should worry cloud outages and plan for AWS to go offline for ever.
> If your company is in anything finance-adjacent or critical infrastructure
GP said:
> most companies
Most companies aren't finance-adjacent or critical infrastructure
> If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing
That still fits in with "almost guarantee". It's not as though it's true for everyone, e.g. people who might trigger DR after 10 minutes of downtime, and have it up and running within 30 more minutes.
But it is true for almost everyone, as most people will trigger it after 30 minutes or more, which, plus the time to execute DR, is often going to be far less than the AWS resolution time.
Best of all would be just multi-everything services from the start, and us-east-1 is just another node, but that's expensive and tricky with state.
> If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.
This describes, what, under 1% of companies out there?
For most companies the cost of being multi-region is much more than just accepting with the occasional outage.
I thought we were talking about an AWS outage, not just the outage of a single region? A single region can go out for many reasons, including but not limited to war.
I worked for a fortune 500, twice a year we practiced our "catastrophe outage" plan. The target SLA for recovering from a major cloud provider outage was 48 hours.
Without having a well-defined risk profile that they’re designing to satisfy, everyone’s just kind of shooting from the hip with their opinions on what’s too much or too little.
Exactly this!
One of my projects is entirely hosted on S3. I don't care enough if it becomes unavailable for a few hours to justify paying to distribute it to GCP et al.
And actually for most companies, the cost of multi-cloud is greater than the benefits. Particularly when those larger entities can just bitch to their AWS account manager to get a few grand refunded as credits.
It is like discussing zombie apocalypse. People who are invested in bunkers will hardly understand those who are just choosing death over living in those bunkers for a month longer.
> Planning for an AWS outage […]
What about if your account gets deleted? Or compromised and all your instances/services deleted?
I think the idea is to be able to have things continue running on not-AWS.
This. I wouldn't try to instantly failover to another service if AWS had a short outage, but I would plan to be able to recover from a permanent AWS outage by ensuring all your important data and knowledge is backed up off-AWS, preferably to your own physical hardware and having a vague plan of how to restore and bring things up again if you need to.
"Permanent AWS outage" includes someone pressing the wrong button in the AWS console and deleting something important or things like a hack or ransomware attack corrupting your data, as well as your account being banned or whatever. While it does include AWS itself going down in a big way, it's extremely unlikely that it won't come back, but if you cover other possibilities, that will probably be covered too.
This is planning future based on best things in the past. Not completely irrational and if you can't afford plan B, okayish.
But thinking AWS SLA is granted forever, and everyone should put all its eggs in it because "everyone do it" is neither wise not safe. Those who can afford it, and there are many businesses like that out there, should have a plan B. And actually AWS should not necessarily be plan A.
Nothing is forever. Not the Roman empire, not Inca empire, not china dynasties, not USA geological supremacy. That's not a question of if but when. It doesn't need to be through a lot of suffering, but if we don't systematically organise for a humanity which spread well being for everyone in a systematically resilient way, we will make it through more lot of tragic consequences when this or that single point of failure finally falls.
Completely agree, but I think companies need to be aware of the AWS risks with third parties as well. Many services were unable to communicate with customers.
Hosting your services on AWS while having a status page on AWS during an AWS outage is an easily avoidable problem.
This depends on the scale of company. A fully functional DR plan probably costs 10% of the infra spend + people time for operationalization. For most small/medium businesses its a waste to plan for a once per 3-10 year event. If you’re a large or legacy firm the above costs are trivial and in some cases it may become a fiduciary risk not to take it seriously.
And if you're in a regulated industry it might even be a hard requirement.
Using AWS instead of a server in the closet is step 1.
Step 2 is multi-AZ
Step 3 is multi-region
Step 4 is multi-cloud.
Each company can work on it's next step, but most will not have positive EROI going from 2 to 3+
Multi-cloud is a hole in which you can burn money and not much more
We started that planning process at my previous company after one such outage but it became clear very quickly that the costs of such resilience would be 2-3x hosting costs in perpetuity and who knows how many manhours. Being down for an hour was a lot more palatable to everyone
What if AWS dumps you because your country/company didn't please the commander in chief enough?
If your resilience plan is to trust a third party, that means you don't really care about going down, does it?
Besides that, as the above poster said, the issue with top tier cloud providers (or cloudflare, or google, etc) is not just that you rely on them, it is that enough people rely on them that you may suffer even if you don't.
I worked at an adtech company where we invested a bit in HA across AZ + regions. Lo and behold there was an AWS outage and we stayed up. Too bad our customers didn't and we still took the revenue hit.
Lesson here is that your approach will depend on your industry and peers. Every market will have their won philosophy and requirements here.
Sure, if your blog or whatever goes down who cares. But otherwise you should thinking about disaster planning and resilience.
AWS US-East 1 has many outages. Anything significant should account for that.
My website running on an old laptop in my cupboard is doing just fine.
When your laptop dies it's gonna be a pretty long outage too.
I will find another one
I have this theory of something I call “importance radiation.”
An old mini PC running a home media server and a Minecraft server will run flawlessly forever. Put something of any business importance on it and it will fail tomorrow.
Related, I’m sure, is the fact that things like furnaces and water heaters will die on holidays.
That's a great concept. It explains a lot, actually!
> to the tune of a few hours every 5-10 years
I presume this means you must not be working for a company running anything at scale on AWS.
That is the vast majority of customers on AWS.
Ha ha, fair, fair.
> Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years.
Not only that, but as you're seeing with this and the last few dozen outages... when us-east-1 goes down, a solid chunk of what many consumers consider the "internet" goes down. It's perceived less as "app C is down" and more is "the internet is broken today".
Isn't the endpoint of that kind of thinking an even more centralized and fragile internet?
To be clear, I'm not advocating for this or trying to suggest it's a good thing. That's just reality as I see it.
If my site's offline at the same time as the BBC has front page articles about how AWS is down and it's broken half the internet... it makes it _really_ easy for me to avoid blame without actually addressing the problem.
I don't need to deflect blame from my customers. Chances are they've already run into several other broken services today, they've seen news articles about it, and all from third parties. By the time they notice my service is down, they probably won't even bother asking me about it.
I can definitely see this encouraging more centralization, yes.
More like 2-3 times per year and this is not counting smaller outages or simply APIs that don't do what they document.
> APIs that don’t do what they document
Oh god, this. At my company, we found a bug recently with rds.describe_events, which we needed to read binlog information after a B/G cutover. The bug, which AWS support “could not see the details of,” was that events would non-deterministically not show up if you were filtering by instance name. Their recommended fix was to pull in all events for the past N minutes, and do client-side filtering.
This was on top of the other bug I had found earlier, which was that despite the docs stating that you can use a B/G as a filter - a logical choice when querying for information directly related to the B/G you just cut over - doing so returns an empty set. Also, you can’t use a cluster (again, despite docs stating otherwise), you have to use the new cluster’s writer instance.
While I don't know your specific case, I have seen it happen often enough that there are only two possibilities left:
For me, it just means that the moment you integrate with any API, you are basically their bitch (unless you implement one from every competitor in the market, at which point you can just as well do it yourself).> tune of a few hours every 5-10 years
You know that’s not true, is-east-1 last one was 2 years ago? But other services have bad days and foundational one drag others a long
It’s even worse than that - us-east-1 is so overloaded, and they have roughly 5+ outages per year on different services. They don’t publish outage numbers so it’s hard to tell.
At this point, being in any other region cuts your disaster exposure dramatically
We don’t deploy to us-east but still so many of our API partners and 3rd party services were down a large chunk of the service was effectively down. Including stuff like many dev tools
Been doing this for about 8 years and I've worked through a serious AWS disruption at least 5 times in that time.
Depends on how serious you are with SLA's.
Depends on the business. For 99% of them this is for sure the right answer.
It seems like this can be mostly avoided by not using us-east-1.
Telefonica is moving it 5G core network to AWS
https://aws.amazon.com/blogs/industries/o2-telefonica-moves-...
A few hours could be a problem.
Not to mention it creates valuable a single point of failure for a hostile attack.
In before meteor strike takes a AWS region and they cant restore data.
Maybe; but Parlar had no plan and are now nothing....because AWS decided to shut them off. Always have a good plan...
Thank you for illustrating my point. You didn't even bother to read the second paragraph.
> Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
my business contingency plan for "AWS shuts down and never comes back up" is to go bankrupt
Is that also your contingency plan for 'user uploads objectionable content and alerts Amazon to get your account shut down'?
Make sure you let your investors know.
If your mitigation for that risk is to have an elaborate plan to move to a different cloud provider, where the same problem can just happen again, then you’re doing an awful job of risk management.
> If your mitigation for that risk is to have an elaborate plan to move to a different cloud provider, where the same problem can just happen again, then you’re doing an awful job of risk management.
Where did I say that? If I didn't say it: could you please argue in good faith. Thank you.
"Is that also your contingency plan if unrelated X happens", and "make sure your investors know" are also not exactly good faith or without snark, mind you.
I get your point, but most companies don't need Y nines of uptime, heck, many should probably not even use AWS, k8s, serverless or whatever complicated tech gives them all these problems at all, and could do with something far simpler.
The point is, many companies do need those nines and they count on AWS to deliver and there is no backup plan if they don't. And that's the thing I take issue with, AWS is not so reliable that you no longer need backups.
My experience is that very few companies actually need those 9s. A company might say they need them, but if you dig in it turns out the impact on the business of dropping a 9 (or two) is far less than the cost of developing and maintaining an elaborate multi-cloud backup plan that will both actually work when needed and be fast enough to maintain the desired availability.
Again, of course there are exceptions, but advising people in general that they should think about what happens if AWS goes offline for good seems like poor engineering to me. It’s like designing every bridge in your country to handle a tomahawk missile strike.
HN denizens are more often than not founders of exactly those companies that do need those 9's. As I wrote in my original comment: the founders are usually shocked at the thought that such a thing could happen and it definitely isn't a conscious decision that they do not have a fall-back plan. And if it was a conscious decision I'd be fine with that, but it rarely is. About as rare as companies that have in fact thought about this and whose incident recovery plans go further than 'call George to reboot the server'. You'd be surprised how many companies have not even performed the most basic risk assessment.
We all read it.. AWS not coming back up is your point on nat having a backup plan?
You might as well say the entire NY + DC metro losses power and "never comes back up" What is the plan around that? The person replying is correct, most companies do not have a actionable plan for AWS never coming back up.
I worked at a medium-large company and was responsible for reviewing the infrastructure BCP plan. It stated that AWS going down was a risk, and if it happens we wait for it to come back up. (In a lot more words than that).
I get you. I am with you. But isn't money/resources always a constraint to have a solid backup solution?
I guess the reason why people are not doing it is because it hasn't been demonstrated it's worth it, yet!
I've got to admit though, whenever I hear about having a backup plan I think having apples to apples copy elsewhere which is probably not wise/viable anyway. Perhaps having just enough to reach out to the service users/customers suffice.
Also I must add I am heavily influenced by a comment by Adrian Cockroft on why going multi cloud isn't worth it. He worked for AWS (at the time at least) so I should have probably reached to the salt dispenser.
The internet is a weak infrastructure, relying on a few big cables and data centers. And through AWS and Cloudflare it has become worse? Was it ever true, that the internet is resilient? I doubt.
Resilient systems work autonomously and can synchronize - but don't need to synchronize.
We're building weak infrastructure. A lot of stuff shall work locally and only optionally use the internet.The internet seems resilient enough for all intents and purposes, we haven't had a global internet-wide catastrophe impacting the entire internet as far as I know, but we have gotten close to it sometimes (thanks BGP).
But the web, that's the fragile, centralized and weak point currently, and seems to be what you're referring to rather.
Maybe nitpicky, but I feel like it's important to distinguish between "the web" and "the internet".
> The internet seems resilient enough...
The word "seems" is doing a lot of heavy lifting there.
I don't wanna jinx anything, but yeah, seems. I can't remember a single global internet outage for the 30+ years I've been alive. But again, large services gone down, but the internet infrastructure seems to keep on going regardless.
Sweden and the “Coop” disaster:
https://www.bbc.com/news/technology-57707530
That because people trust and hope blindly. They believe IT is for saving money? It isn’t. They coupled their cash registers onto an American cloud service. The couldn’t even pay in cash.
It usually gets worse, when not outages happens for some time. Because that increases blind trust.
That a Swedish supermarket gets hit by a ransomware attack doesn't prove/disprove the overall stability of the internet, nor the fragility of the web.
You are absolutely correct but this distinction is getting less and less important, everything is using APIs nowadays, including lots of stuff that is utterly invisible until it goes down.
The Internet was much more resilient when it was just that - an internetwork of connected networks; each of which could and did operate autonomously.
Now we have computers that shit themselves if DNS isn’t working, let alone LANs that can operate disconnected from the Internet as a whole.
And partially working or indicating this it works (when it doesn’t) is usually even worse.
If you take into account the "the web" vs "the internet" as others have mentioned.
Yes the Internet has stayed stable.
The Web, as defined by a bunch of servers running complex software, probably much less so.
Just the fact that it must necessarily be more complex means that it has more failure modes...
Most companies just aren't important enough to worry about "AWS never come back up." Planning for this case is just like planning for a terrorist blowing up your entire office. If you're the Pentagon sure you'd better have a plan for that. But most companies are not the Pentagon.
> Most companies just aren't important enough to worry about "AWS never come back up."
But a large enough of "not too big to fail" companies become a too big to fail event. Too many medium sized companies have a total dependency on AWS or Google or Microsoft, if not several at the same time.
We live in a increasingly fragile society, one step closer to critical failure because big tech is not regulated in the same way than other infrastructure.
Well I agree. I kinda think the AI apocalypse would not be like a skynet killing us, but a malware be patched onto all the Tesla and causes one million crashes tomorrow morning.
Battery fires.
Many have a hard dependency on AWS && Google && Microsoft!
Exactly.
And FWIW, "AWS is down"....only one region (out of 36) of AWS is down.
You can do the multi-region failover, though that's still possibly overkill for most.
In the case of a costumer of mine the AWS outage manifested itself as Twilio failing to deliver SMSes. The fallback plan has been disabling the rotation of our two SMS providers and sending all messages with the remaing one. But what if the other one had something on AWS too? Or maybe both of them have something else vital on Azure, or Google Cloud, which will fail next week and stop our service. Who knows?
For small and medium sized companies it's not easy to perform an accurate due diligency.
It would behoove a lot of devs to learn the basics of Linux sysadmin and how to setup a basic deployment with a VPS. Once you understand that, you'll realize how much of "modern infra" is really just a mix of over-reliance on AWS and throwing compute at underperforming code. Our addiction to complexity (and burning money on the illusion of infinite stability) is already and will continue to strangle us.
If AWS goes down unexpectedly and never comes back up it's much more likely that we're in the middle of some enormous global conflict where day to day survival takes priority over making your app work than AWS just deciding to abandon their cloud business on a whim.
Can also be much easier than that. Say you live in Mexico, hosting servers with AWS in the US because you have US customers. But suddenly the government decides to place sanctions on Mexico, and US entities are no longer allowed to do business with Mexicans, so all Mexican AWS accounts get shut down.
For you as a Mexican the end results is the same, AWS went away, and considering there already is a list of countries that cannot use AWS, GitHub and a bunch of other "essential" services, it's not hard to imagine that that list might grow in the future.
what's most realistic is something like a major scandal at AWS. The FBI seizes control and no bytes come in our out until the investigation is complete. A multi-year total outage effectively.
Or Trump decided your country does not deserve it.
Or Bezos.
Or Bezos selling his soul to the Orange Devil and kicking you off when the Conman-in-chief puts the squeeze on some other aspect of Bezos' business empire
> The internet got its main strengths from the fact that it was completely decentralized.
Decentralized in terms of many companies making up the internet. Yes we've seen heavy consolidation in now having less than 10 companies make up the bulk of the internet.
The problem here isn't caused by companies chosing one cloud provider over the other. It's the economies of scale leading us to few large companies in any sector.
> Decentralized in terms of many companies making up the internet
Not companies, the protocols are decentralized and at some point it was mostly non-companies. Anyone can hook up a computer and start serving requests which was/is a radical concept, we've lost a lot, unfortunately
No we've not lost that at all. Nobody prevents you from doing that.
We have put more and more services on fewer and fewer vendors. But that's the consolidation and cost point.
> No we've not lost that at all. Nobody prevents you from doing that.
May I introduce you to our Lord and Slavemaster CGNAT?
There’s more than one way to get a server on the Internet. You can pay a local data center to put your machine in one of their racks.
may I introduce you to ipv6?
That depends on who your ISP is.
I think one reason is that people are just bad at statistics. Chance of materialization * impact = small. Sure. Over a short enough time that's true for any kind of risk. But companies tend to live for years, decades even and sometimes longer than that. If we're going to put all of those precious eggs in one basket, as long as the basket is substantially stronger than the eggs we're fine, right? Until the day someone drops the basket. And over a long enough time span all risks eventually materialize. So we're playing this game, and usually we come out ahead.
But trust me, 10 seconds after this outage is solved everybody will have forgotten about the possibility.
Absolutely, but the cost of perfection (100% uptime in this case) is infinite.
As long as the outages are rare enough and you automatically fail over to a different region, what's the problem?
Often simply the lack of a backup outside of the main cloud account.
Sure, but on a typical outage how likely is it that you'll have that all up and running before the outage is resolved?
And secondly, how often do you create that backup and are you willing to lose the writes since the last backup?
That backup is absolutely something people should have, but I doubt those are ever used to bring a service back up. That would be a monumental failure of your hosting provider (colo/cloud/whatever)
> Sure, but on a typical outage how likely is it that you'll have that all up and running before the outage is resolved?
Not, but if some Amazon flunky decides to kill your account to protect the Amazon brand then you will at least survive, even if you'll lose some data.
Decentralized with respect to connectivity. If a construction crew cuts a fiber bundle routing protocols will route around the damage and packets keep showing up at the destination. Or, only a localized group of users will be affected. That level of decentralization is not what we have at higher levels in the stack with AWS being a good example.
Even connectivity has it's points of failure. I've touched with my own hands fiber runs that, with a few quick snips from a wire cutter, could bring sizable portions of the Internet offline. Granted that was a long time ago so those points of failure may no longer exist.
Well, that is exactly what resilient distributed network are about. Not that much the technical details we implement them through, but the social relationship and balanced in political decision power.
Be it a company or a state, concentration of power that exeeds the needs for their purpose to function by a large margin is always a sure way to spread corruption, create feedback loop in single point of failure, and is buying everyone a ticket to some dystopian reality with a level of certainty that beats anything that a SLA will ever give us.
I don't think its worth it, but lets say I did it, what if others that I depend on dont do it? I still won't be fully functional and only one of us have spent a bunch of money.
What good is jumping through extraordinary hoops to be multi cloud if docker, netlify, stripe, intercom, npm, etc all go down along with us-east-1?
Because you should not depend on one payment provider and pull unvendored images, packages, etc directly into your deployment.
There is no reason to have such brittle infra.
Sure, but at that point you go from bog standard to "enterprise grade redundancy for every single point of failure" which I can assure you is more heavily engineered than many enterprises (source: see current outage). Its just not worth the manpower and dollars for a vast majority of businesses.
Pulling unvetted stuff from docker hub, npm, etc. is not a question of redundancy.
OK, you pull it to your own repo. Now where do you store it? Do you also have fallback stores for that? What about the things which arent vendorable, ie external services?
Additionally I find that most Hyperscalers are trying to lock you in, by tailoring services which are industry standard with custom features which end up building roots and making a multi-vendor or lift and shift problematic.
Need to keep eyes peeled at all levels or the organization as many of these enter through day-to-day…
Yes, they're really good at that. This is just 'embrace and extend'. We all know the third.
I find this hard to judge in the abstract, but I'm not quite convinced the situation for the modal company today is worse than their answer to "what if your colo rack catches fire" would have been twenty years ago.
> "what if your colo rack catches fire"
I've actually had that.
https://www.webmasterworld.com/webmaster/3663978.htm
I used to work at an SME that ran ~everything on its own colo'd hardware, and while it never got this bad, there were a couple instances of the CTO driving over to the dc because the oob access to some hung up server wasn't working anymore. Fun times...
oh hey, I've bricked a server remotely and had to drive 45 minutes to the DC to get badged in and reboot things :)
Reminiscing: this was a rite of passage for pre-cloud remote systems administrators.
Proper hardware (Sun, Cisco) had a serial management interface (ideally "lights-out") which could be used to remedy many kinds of failures. Plus a terminal server with a dial-in modem on a POTS line (or adequate fakery), in case the drama took out IP routing.
Then came Linux on x86, and it took waytoomanyyears for the hardware vendors to outgrow the platform's microsoft local-only deployment model. Aside from Dell and maybe Supermicro, I'm not sure if they ever worked it out.
Then came the cloud. Ironically, all of our systems are up and happy today, but services that rely on partner integrations are down. The only good thing about this is that it's not me running around trying to get it fixed. :)
First, planning for AWS outage is pointless. Unless you provide service of national security or something, your customers are going to understand that when there's global internet outage your service doesn't work either. The cost of maintaining a working failover across multiple cloud providers is just too high compared to potential benefits. It's astonishing that so few eningeers understand the fact that maintaining a technically beautiful solution costs time and money, which might not make a justified business case.
Second, preparing for the disappearance of AWS is even more silly. The chance that it will happen are orders of magnitude smaller than the chance that the cost of preparing for such an event will kill your business.
Let me ask you: how do you prepare your website for the complete collapse of western society? Will you be able to adapt your business model to post-apocalyptic world where it's only cockroaches?
> Let me ask you: how do you prepare your website for the complete collapse of western society?
How did we go from 'you could lose your AWS account' to 'complete collapse of western society'? Do websites even matter in that context?
> Second, preparing for the disappearance of AWS is even more silly.
What's silly is not thinking ahead.
>Let me ask you: how do you prepare your website for the complete collapse of western society?
That's the main topic that going through my mind lately, if you replace "my website" with "Wikimedia movement".
We need a far better social, juridical and technical architecture regarding resilience as hostil agendas are in the rise at all level agaisnt sourced trackable global volunteer community knowledge bases.
the correct answer for those companies is "we have it on the roadmap but for right now accept the risk"
What if the fall-back also never comes back up?
At least we've got github steady with our code and IaaC, right? Right?!
You simply cannot avoid it. There are so many applications and services that use AWS. Companies cant sit on 100% in-house software stacks.
Contrast this with the top post.
> Now imagine for a bit that it will never come back up.
Given the current geopolitical circumstances, that's not a far fetched scenario. Especially for us-east-1; or anything in the D.C. metro area.
I'm so happy we chose Hetzner instead but unfortunately we also use Supabase (dashboard affected) and Resend (dashboard and email sending affected).
Probably makes sense to add "relies on AWS" to the criteria we're using to evaluate 3rd-party services.
We got off pretty easy (so far). Had some networking issues at 3am-ish EDT, but nothing that we couldn't retry. Having a pretty heavily asynchronous workflow really benefits here.
One strange one was metrics capturing for Elasticache was dead for us (I assume Cloudwatch is the actual service responsible for this), so we were getting no data alerts in Datadog. Took a sec to hunt that down and realize everything was fine, we just don't have the metrics there.
I had minor protests against us-east-1 about 2.5 years ago, but it's a bit much to deal with now... Guess I should protest a bit louder next time.
Wonder if this is related
https://www.dockerstatus.com/pages/533c6539221ae15e3f000031
Yup
> We have identified the underlying issue with one of our cloud service providers.
Another time to link The Machine Stops by E.M. Forster, 1909: https://web.cs.ucdavis.edu/~rogaway/classes/188/materials/th...
> “The Machine,” they exclaimed, “feeds us and clothes us and houses us; through it we speak to one another, through it we see one another, in it we have our being. The Machine is the friend of ideas and the enemy of superstition: the Machine is omnipotent, eternal; blessed is the Machine.”
..
> "she spoke with some petulance to the Committee of the Mending Apparatus. They replied, as before, that the defect would be set right shortly. “Shortly! At once!” she retorted"
..
> "there came a day when, without the slightest warning, without any previous hint of feebleness, the entire communication-system broke down, all over the world, and the world, as they understood it, ended."
Our Alexa's stopped responding and my girl couldn't log in to myfitness pal anymore.. Let me check HN for a major outage and here we are :^)
At least when us-east is down, everything is down.
Oh no... may be LaLiga found out pirates hosting on AWS?
this is how I discover that is not just Serie A doing this shenanigans. I'm not really surprised
All the big leagues take "piracy" very seriously and constantly try to clamp down on it.
TV rights is one of their main revenue sources, and it's expected to always go up, so they see "piracy" as a fundamental threat. IMO, it's a fundamental misunderstanding on their side, because people "pirating" usually don't have a choice - either there is no option for them to pay for the content (e.g. UK's 3pm blackout), or it's too expensive and/or spread out. People in the UK have to pay 3-4 different subscriptions to access all local games.
The best solution, by far, is what France's Ligue 1 just did (out of necessity though, nobody was paying them what they wanted for the rights after the previous debacles). Ligue 1+ streaming service, owned and operated by them which you can get access through a variety of different ways (regular old TV paid channel, on Amazon Prime, on DAZN, via Bein Sport), whichever suits you the best. Same acceptable price for all games.
MLB in the US does the same thing for the regular season, it's awesome despite the blackouts which prevent you from watching your local team but you can get around that with a simple VPN. But alas I believe that they will be making the service part of ESPN which will undoubtedly make the product worse just like they will do with NFL Red Zone.
The problem is that leagues miss out on billions of dollars of revenue when they do this AND they also have to maintain the streaming service which is way outside their technical wheelhouse.
MLS also has a pretty straightforward streaming service through AppleTV which I also enjoy.
What i find weird is that people complain (at least in the case of the MLS deal) that it's a BAD thing, that somehow having an easily accessible service that you just pay for and get access to without a contract or cable is diminishing popularity / discoverability of the product?
> The problem is that leagues miss out on billions of dollars of revenue when they do this
TBH, I have a hard time believing statements like this because if the revenue difference was really there, they'd make the switch.
If there's one thing I'll give credit to US sports leagues for, it's knowing how to make money.
After rereading my comment I think I was a bit vague, but i'll try to clarify.
Most leagues DO sell their rights to other big companies to have them handle it however they see fit for a large annual fee.
MLB does it partially, some games are shown through cable tv (There are so many games a year that only a small portion is actually aired nationally) the rest are done via regional sports networks (RSNs) that aren't shown nationally. In order to make some money out of this situation MLB created MLBtv that lets you watch all games as long as there are not nationally aired or a local team that is serviced by a RSN. Recently there have been changes because one of the biggest conglomerate of RSNs has gone bankrupt forcing MLB to buy them out and MLB is trying to negotiate a new national cable package with the big telecoms. I believe ESPN has negotiated with MLB to buy out MLBtv but details are scarce.
MLS is a smaller league and Apple bought out exclusive streaming rights for 10 years for some ungodly amount of money. NFL and NBA also have some streaming options but I am less knowledgeable about them but I assume it's similar to MLBtv where there are too many games to broadcast so you can just watch them with a subscription to their service.
In the end of the day these massive deals are the biggest source of revenue for the leagues and the more ways they can divide up the pie among different companies they can extract more money in total. Just looking that the amount of contracts for the US alone is overwhelming.
[1]https://en.wikipedia.org/wiki/Sports_broadcasting_contracts_...
More and more ads at every level every year, when will it be enough?
Is this why reddit is down? (https://www.redditstatus.com/ still says it is up but with degraded infrastructure)
Shameless from them to make it look like it's a user problem. It was loading fine for me one hour ago, now I refresh the page and their message states I'm doing too many requests and should chill out (1 request per hour is too many for you?)
Never ascribe to malice that which is adequately explained by incompetence.
It’s likely that, like many organizations, this scenario isn’t something Reddit are well prepared for in terms of correct error messaging.
I remember that I made a website and then I got a report that it doesn't work on newest Safari. Obviously, Safari would crash with a message blaming the website. Bro, no website should ever make your shitty browser outright crash.
Actually I’m just thinking that knowledge about how to crash Safari is valuable.
True. At the time though I was just focused on fixing the bug.
Could be a bunch of reddit bots on AWS are now catching back up as AWS recovers and spiking hits to reddit
I got a rate limit error which didn't make sense since it was my first time opening reddit in hours.
Man , I just wanted to enjoy celebrating Diwali with my family but been up from 3am trying to recover our services. There goes some quality time
It has started recovering now. https://www.whatsmydns.net/#A/dynamodb.us-east-1.amazonaws.c... is showing full recovery of dns resolutions.
Internet, out.
Very big day for an engineering team indeed. Can't vibe code your way out of this issue...
Easiest day for engineers on-call everywhere except AWS staff. There’s nothing you can do except wait for AWS to come back online.
Pour one out for the customer service teams of affected businesses instead
Well, but tomorrow there will be CTOs asking for a contingency plan if AWS goes down, even if planning, preparing, executing and keeping it up to date as the infra evolves will cost more than the X hours of AWS outage.
There are certainly organizations for which that cost is lower than the overall damage of services being down due to AWS fault, but tomorrow we will hear CTOs from smaller orgs as well.
They’ll ask, in a week they’ll have other priorities and in a month they’ll have forgotten about it.
This will hold until the next time AWS had a major outage, rinse and repeat.
It's so true it hurts. If you are new in any infra/platform management position you will be scared as hell this week. Then you will just learn that feeling will just disappear by itself in a few days.
Yep, when I was a young programmer I lived in dread of an outage or worse been responsible for a serious bug in production, then I got to watch what happened when it happened to others (and that time I dropped the prod database at half past four on a Friday).
When everything is some varying degree of broken at all times been responsible for a brief uptick in the background brokenness isn't the drama you think it is.
It would be different if the systems I worked on where true life and death (ATC/Emergency Services etc) but in reality the blast radius from my fucking up somewhere is monetary and even at the biggest company I worked for constrained (while 100+K per hour from an outage sounds horrific - in reality the vast majority of that was made up when the service was back online, people still needed to order the thing in the end).
This applies to literally half of random "feature requests" and "tasks" that are urgent and needed to get done yesterday incoming from the business team..
Honestly? "Nothing because all our vendors are on us-east-1 too"
Lots of NextJS CTOs are gonna need to think about it for the first time too
He will then give it to the CEO who says there is no budget for that
No really true for large systems. We are doing things like deploying mitigations to avoid scale-in (eg services not receiving traffic incorrectly autoscaling down), preparing services for the inevitable storm, managing various circuit breakers, changing service configurations to ease the flow of traffic through the system, etc. We currently have 64 engineers in our on-call room managing this. There's plenty of work to do.
Well, some engineer somewhere made the recommendation to go with AWS, even tho it is more expensive than alternatives. That should raise some questions.
Engineer maybe, executive swindled by sales team? Definitely.
> Easiest day for engineers on-call everywhere
I have three words for you: cascading systems failure
Can confirm, pretty chill we can blame our current issues on AWS.
and by one I trust you mean a bottle.
>Can't vibe code your way out of this issue...
I feel bad for the people impacted by the outage. But at the same time there's a part of me that says we need a cataclysmic event to shake the C-Suite out of their current mindset of laying off all of their workers to replace them with AI, the cheapest people they can find in India, or in some cases with nothing at all, in order to maximize current quarter EPS.
/ai why is AWS down? can you bring it back up
Pour one out for everyone on-call right now.
After some thankless years preventing outages for a big tech company, I will never take an oncall position again in my life.
Most miserable working years I have had. It's wild how normalized working on weekends and evenings becomes in teams with oncall.
But it's not normal. Our users not being able to shitpost is simply not worth my weekend or evening.
And outside of Google you don't even get paid for oncall at most big tech companies! Company losing millions of dollars an hour, but somehow not willing to pay me a dime to jump in at 3AM? Looks like it's not my problem!
When I used to be on call for Cisco WebEx services. I got paid extra, and got extra time of. Even if nothing happened. In addition we where enough people on the rotation, so I didn't have to do it that often.
I believe the rules varied based on jurisdiction, and I think some had worse deals, and some even better. But I was happy with our setup in Norway.
Tbh I do not think we would have had, what we had if it wasn't for the local laws and regulations. Sometimes worker friendly laws can be nice.
As I was reading the parent, I was thinking “hm, doesn’t match my experience at Cisco!” So it’s funny to see your comment right after.
> And outside of Google you don't even get paid for oncall at most big tech companies.
What the redacted?
Welcome to the typical American salary abuse. There's even a specific legal cutout exempting information technology, scientific and artistic fields from the overtime pay requirements of the Fair Labor Standards Act.
There's a similar cutout for management, which is how companies like GameStop squeeze their retail managers. They just don't give enough payroll hours for regular employees, so the salaried (but poorly paid) manager has to cover all of the gaps.
It's also unneccesary at large companies, since there'll likely be enough offices globally to have a follow the sun model.
Follow the sun does not happen by itself. Very few if any engineering teams are equally split across thirds of the globe in such a way that (say) Asia can cover if both EMEA and the Americas are offline.
Having two sites cover the pager is common, but even then you only have 16 working hours at best and somebody has to take the pager early/late.
Not to mention that the knowledge, skills, experience to troubleshoot/recover is rarely evenly distributed across the teams.
"Your shitposting is very important to us, please stay on the site"
We get TOIL for being on call.
> But this is not normal. Our users not being able to shitpost is simply not worth my weekend or evening.
It is completely normal for staff to have to work 24/7 for critical services.
Plumbing, HVAC, power plant engineers, doctors, nurses, hospital support staff, taxi drivers, system and network engineers - these people keep our modern world alive, all day, every day. Weekends, midnights, holidays, every hour of every day someone is AT WORK to make sure our society functions.
Not only is it normal, it is essential and required.
It’s ok that you don’t like having to work nights or weekends or holidays. But some people absolutely have to. Be thankful there are EMTs and surgeons and power and network engineers working instead of being with their families on holidays or in the wee hours of the night.
Nice try at guilt-tripping people doing on-call, and doing it for free.
But to parent's points: if you call a plumber or HVAC tech at 3am, you'll pay for the privilege.
And doctors and nurses have shifts/rotas. At some tech places, you are expected to do your day job plus on-call. For no overtime pay. "Salaried" in the US or something like that.
And these companies often say "it's baked into your comp!" But you can typically get the same exact comp working an adjacent role with no oncall.
Then do that instead. What’s the problem with simply saying “no”?
You’re looking for a job in this economy with a ‘he said no to being on call’ in your job history.
This is plainly bad regulation, the market at large discovered the marginal price of oncall is zero, but it’s rather obviously skewed in employer’s favor.
Yup, that is precisely what I did and what I'm encouraging others to do as well.
Edit: On-call is not always disclosed. When it is, it's often understated. And finally, you can never predict being re-orged into a team with oncall.
I agree employees should still have the balls to say "no" but to imply there's no wrongdoing here on companies' parts and that it's totally okay for them to take advantage of employees like this is a bit strange.
Especially for employees that don't know to ask this question (new grads) or can't say "no" as easily (new grads or H1Bs.)
Guilt tripping? Quite the opposite.
If you or anyone else are doing on-call for no additional pay, precisely nobody is forcing you to do that. Renegotiate, or switch jobs. It was either disclosed up front or you missed your chance to say “sorry, no” when asked to do additional work without additional pay. This is not a problem with on call but a problem with spineless people-pleasers.
Every business will ask you for a better deal for them. If you say “sure” to everything you’re naturally going to lose out. It’s a mistake to do so, obviously.
An employee’s lack of boundaries is not an employer’s fault.
First, you try to normalise it:
> It is completely normal for staff to have to work 24/7 for critical services.
> Not only is it normal, it is essential and required.
Now you come with the weak "you don't have to take the job" and this gem:
> An employee’s lack of boundaries is not an employer’s fault.
As if there isn't a power imbalance, or employers always disclose everything or chance their mind. But of course, let's blame those entitled employees!
You know, there's this thing called shifts. You should look it up.
No one dies if our users can't shitpost until tomorrow morning.
I'm glad there are people willing to do oncall. Especially for critical services.
But the software engineering profession as a whole would benefit from negotiating concessions for oncall. We have normalized work interfering with life so the company can squeeze a couple extra millions from ads. And for what?
Nontrivial amount of ad revenue lost? Not my problem if the company can't pay me to mitigate.
> Nontrivial amount of ad revenue lost? Not my problem if the company can't pay me to mitigate.
Interestingly, when I worked on analytics around bugs we found that often (in the ads space), there actually wasn't an impact when advertisers were unable to create ads, as they just created all of them when the interface started working again.
Now, if it had been the ad serving or pacing mechanisms then it would've been a lot of money, but not all outages are created equal.
Not all websites are for shitposting. I can’t talk to my clients for whom I am on call because Signal is down. I also can’t communicate with my immediate family. There are tons of systems positively critical to society downstream from these services.
Some can tolerate downtime. Many can’t.
You could give them a Phone call, you know. Pretty reliable technology.
No, actually, I can’t. My phone doesn’t have a phone number and can’t make calls.
I expect it's their SREs who are dealing with this mess.
> Can't vibe code your way out of this issue...
Exactly. This time, some LLM providers are also down and can't help vibe coders on this issue.
Qwen3 on lm-studio running fine on my work Mac M3, what's wrong with yours?
Anthem Health call center disconnected my wife numerous times yesterday with an ominous robo-message of "Emergency in our call center"; curious if that was this. Seems likely, but what a weird message.
AWS truly does stand for "All Web Sites".
funny that even if we have our app running fine in AWS europe, we are affected as developers because of npm/docker/etc being down. oh well.
AWS has made the internet into a single-point-of failure.
What's the point of all the auto-healing node-graph systems that were designed in the 70s and refined over decades: if we're just going to do mainframe development anyway?
To be fair, there is another point of failure, Cloudflare. It seems like half the internet goes down when Cloudflare has one of their moments
Clourflare is not merely a single point of failure. They are the official MITM of the internet. They control the flow of information. They probably know more about your surfing habits than Google at this point. There are some sites I can not even connect to using IP addresses anymore.
That company is very concerning and not because of an outage. In fact, I wish one day we have a full cloudflare outage and the entire net goes dark and it finally sink in how much control this one f'ing company has over information in our so called free society.
Not every site is controlled by Cloudflare. My site doesn't use it at all (the fact that it is currently not working properly is entirely coincidental), because I don't really see a reason to use it. Whenever they go down, I'm unaffected
A lot of status pages hosted by Atlasian StatusPage are down! The irony…
I can't believe this. When status page first created their product, they used to market how they were in multiple providers so that they'd never be affected by downtime.
Maybe all that got canned after the acquisition?
DynamoDB is performing fine in production in eu-central-1.
Seems to be really limited to us-east-1 (https://health.aws.amazon.com/health/status). I think they host a lot of console and backend stuff there.
Yet. Everything goes down the ... Bach ;)
One thing has become quite clear to me over the years. Much of the thinking around uptime of information systems has become hyperbolic and self-serving.
There are very few businesses that genuinely cannot handle an outage like this. The only examples I've personally experienced are payment processing and semiconductor manufacturing. A severe IT outage in either of these businesses is an actual crisis. Contrast with the South Korean government who seems largely unaffected by the recent loss of an entire building full of machines with no backups.
I've worked in a retail store that had a total electricity outage and saw virtually no reduction in sales numbers for the day. I have seen a bank operate with a broken core system for weeks. I have never heard of someone actually cancelling a subscription over a transient outage in YouTube, Spotify, Netflix, Steam, etc.
The takeaway I always have from these events is that you should engineer your business to be resilient to the real tradeoff that AWS offers. If you don't overreact to the occasional outage and have reasonable measures to work around for a day or 2, it's almost certainly easier and cheaper than building a multi cloud complexity hellscape or dragging it all back on prem.
Thinking in terms of competition and game theory, you'll probably win even if your competitor has a perfect failover strategy. The cost of maintaining a flawless eject button for an entire cloud is like an anvil around your neck. Every IT decision has to be filtered through this axis. When you can just slap another EC2 on the pile, you can run laps around your peers.
> The takeaway I always have from these events is that you should engineer your business to be resilient
An enduring image that stays with me was when I was a child and the local supermarket lost electricity. Within seconds the people working the tills had pulled out hand cranks by which the tills could be operated.
I'm getting old, but this was the 1980's, not the 1800's.
In other words, to agree with your point about resilience:
A lot of the time even some really janky fallbacks will be enough.
But to somewhat disagree with your apparent support for AWS: While it is true this attitude means you can deal with AWS falling over now and again, it also strips away one of the main reasons people tend to give me for why they're in AWS in the first place - namely a belief in buying peace of mind and less devops complexity (a belief I'd argue is pure fiction, but that's a separate issue). If you accept that you in fact can survive just fine without absurd levels of uptime, you also gain a lot more flexibility in which options are viable to you.
The cost of maintaining a flawless eject button is indeed high, but so is the cost of picking a provider based on the notion that you don't need one if you're with them out of a misplaced belief in the availability they can provide, rather than based on how cost effectively they can deliver what you actually need.
I would argue that you are still buying peace of mind by hosting on AWS, even when there are outages. This outage is front page news around the world, so it's not as much of a shock if your company's website goes down at the same time.
Some of the peace of mind comes just from knowing it’s someone else’s (technical) problem if the system goes down. And someone else’s problem to monitor the health of it. (Yes, we still have to monitor and fix all sorts of things related to how we’ve built our products, but there’s a nontrivial amount of stuff that is entirely the responsibility of AWS)
The cranked tills (or registers for the Americans) is an interesting example, because it seems safe to assume they don't have that equipment anymore, and could not so easily do that.
We have become much more reliant on digital tech (those hand cranked tills were prob not digital even when the electricity was on), and much less resilient to outages of such tech I think.
> An enduring image that stays with me was when I was a child and the local supermarket lost electricity. Within seconds the people working the tills had pulled out hand cranks by which the tills could be operated.
what did the do with the frozen food section? Was all that inventory lost?
No idea. We didn't hang around that long.
Tech companies, and in particular ad-driven companies, keep a very close eye on their metrics and can fairly accurately measure the cost of an outage in real dollars
I like that we can advertise to our customers that over the last X years we have better uptime than Amazon, google, etc.
Just yesterday I saw another Hetzner thread where someone claimed AWS beats them in uptime and someone else blasted AWS for huge incidents. I bet his coffee tastes better this morning.
To be fair, my Hetzner server had ten minutes of downtime the other day. I've been a customer for years and this was the second time or so, so I love Hetzner, but everything has downtime.
Their auction systems are interesting to dig through, but to your point, everything fails. Especially these older auction systems. Great price/service, though. Less than an hour for more than one ad-hoc RAID card replacement
Yeah, I really want one of their dedicated servers, but it's a bit too expensive for what I use it for. Plus, my server is too much of a pet, so I'm spoiled on the automatic full-machine backups.
Absolutely understandable :)
I honestly wonder if there is safety in the herd here. If you have a dedicated server in a rack somewhere that goes down and takes your site with it. Or even the whole data center has connectivity issues. As far as the customer is concerned, you screwed up.
If you are on AWS and AWS goes down, that's covered in the news as a bunch of billion dollar companies were also down. Customer probably gives you a pass.
> If you are on AWS and AWS goes down, that's covered in the news as a bunch of billion dollar companies were also down. Customer probably gives you a pass.
Exactly - I've had clients say, "We'll pay for hot standbys in the same region, but not in another region. If an entire AWS region goes down, it'll be in the news, and our customers will understand, because we won't be their only service provider that goes down, and our clients might even be down themselves."
Show up at a meeting looking like you wet yourself, it’s all anyone will ever talk about.
Show up at a meeting where a whole bunch of people appear to have wet themselves, and we’ll all agree not to mention it ever again…
My guess is their infrastructure is set up through clickops, making it extra painful to redeploy in another region. Even if everything is set up through CloudFormation, there's probably umpteen consumers of APIs that have their region hardwired in. By the time you get that all sorted, the region is likely to be back up.
You can take advantage by having an unplanned service window every time a large cloud provider goes down. Then tell your client that you where the reason why AWS went down.
The Register calls it Microsoft 364, 363, ...
365 "eights" of uptime per year.
That's the funniest thing I've heard this morning. Still less than one 9
Reported uptimes are little more than fabricated bullshit.
They measure uptime using averages of "if any part of a chain is even marginally working".
People experience downtime however as "if any part of a chain is degraded".
Feel bad for the Amazon SDR randomly pitching me AWS services today. Although apparently our former head of marketing got that pitch from four different LinkedIn accounts. Maybe there's a cloud service to rein them in that broke ;)
I'd say that this is true for the average admin who considers PaaS, Kubernetes and microservices one giant joke. Vendor-neutral monolithic deployments keep on winning.
"The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers."
https://health.aws.amazon.com/health/status?path=service-his...
Ah, it's just what I thought. An underlying internal subsystem.
I think AWS should use, and provide as an offering to big customers, a Chaos Monkey tool that randomly brings down specific services in specific AZs. Example: DynamoDB is down in us-east-1b. IAM is down in us-west-2a.
Other AWS services should be able to survive this kind of interruption by rerouting requests to other AZs. Big company clients might also want to test against these kinds of scenarios.
AWS Fault Injection Service: https://docs.aws.amazon.com/fis/latest/userguide/what-is.htm...
At some point AWS has so many services it's subject to a version of xkcd Rule 34 -- if you can imagine it, there's an AWS service for it.
I used to tell people there that my favorite development technique was to sit down and think about the system I wanted to build, then wait for it to be announced at that year's re:Invent. I called it "re:Invent and Simplify". "I" built my best stuff that way.
Stupid question, why isn't the stock down? Couldn't this lead to people jumping to other providers and at the very least require some pretty big fees for do dramatically breaking SLA? Is it just not a biggest fraction of revenue to matter?
Robinhood is down ;)
Non-technical people don't really notice these things. They hear it and shrug, because usually it's fixed within a day.
CNBC is supposed to inform users about this stuff, but they know less than nothing about it. That's why they were the most excited about the "Metaverse" and telling everyone to get on board (with what?) or get left behind.
The market is all about perception of value. That's why Musk can tweet a meme and double a stocks price, it's not based in anything real.
Maybe since Amazon is reporting Q3 numbers soon and this will only show up in Q4 numbers?
Lots of stock brokerages consumer offerings are served through....
....AWS!
Looks like maybe a DNS issue? https://www.whatsmydns.net/#A/dynamodb.us-east-1.amazonaws.c...
Resolves to nothing.
It's always DNS.
It's plausible that Amazon removes unhealthy servers from all round-robins including DNS. If all servers are unhealthy, no DNS.
Alternatively, perhaps their DNS service stopped responding to queries or even removed itself from BGP. It's possible for us mere mortals to tell which of these is the case.
Chances are there's some cyclical dependencies. These can creep up unnoticed without regular testing, which is not really possible at AWS scale unless they want to have regular planned outages to guard against that.
Maybe they forgot to pay the bills.
Twilio is down worldwide: https://status.twilio.com/
Friends don’t let friends use us-east-1
It looks like DNS has been restored: dynamodb.us-east-1.amazonaws.com. 5 IN A 3.218.182.189
I wonder if the new endpoint was affected as well.
dynamodb.us-east-1.api.aws
Is there any data on which AWS regions are most reliable? I feel like every time I hear about an AWS outage it's in us-east-1.
Trouble is one can't fully escape us-east-1. Many services are centralized there like: S3, Organizations, Route53, Cloudfront, etc. It is THE main region, hence suffering the most outages, and more importantly, the most troubling outages.
This is not true. There are a FEW high-level ones that have their dashboards there, but there's a reason you log into region-specific dashboards
We're mostly deployed on eu-west-1 but still seeing weird STS and IAM failures, likely due to internal AWS dependencies.
Also we use Docker Hub, NPM and a bunch of other services that are hosted by their vendors on us-east-1 so even non AWS customers often can't avoid the blast radius of us-east-1 (though the NPM issue mostly affects devs updating/adding dependencies, our CI builds use our internal mirror)
FYI: 1. AWS IAM mutations all go through us-east-1 before being replicated to other public/commercial regions. Read/List operations should use local regional stacks. I expect you'll see a concept of "home region" give you flexibility on the write path in the future. 2. STS has both global and regional endpoints. Make sure you're setup to use regional endpoints in your clients https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credenti...
us-east-1 was, probably still is, AWS' most massive deployment. Huge percentage of traffic goes through that region. Also, lots of services backhaul to that region, especially S3 and CloudFront. So even if your compute is in a different region (at Tower.dev we use eu-central-1 mostly), outages in us-east-1 can have some halo effect.
This outage seems really to be DynamoDB related, so the blast radius in services affected is going to be big. Seems they're still triaging.
your website loads for a second and then suddenly goes blank. There is one fatal errors from Framer in the console
It is dark. You are likely to be eaten by a grue.
Thanks...probably an issue with Framer during the outage.
Anywhere other than us-east-1 in my experience is rock solid.
Agreed, my company had been entirely on us-east-1 predating my joining ~12 years ago. ~7 years ago, after multiple us-east-1 outages, I moved us to us-west-2 and it has been a lot less bumpy since then.
If you're using AWS then you are most likely using us-east-1 there is no escape. When big problems happen on us-east-1 it affect most of AWS services.
I don't recommend to my clients they use us-east-1. It's the oldest and most prone to outages. I usually always recommend us-east-2 (Ohio) unless they require West Coast.
and if they need West Coast, it's us-west-2. I consider us-west-1 to be a failed region. They don't get some of the new instance types, you can't get three AZs for your VPCs, and they're more expensive than the other US regions.
US-East-1 and its consistent problems are literally the Achilles Heel of the Internet.
Amazon has spent most of its HR post-pandemic efforts in:
• Laying off top US engineering earners.
• Aggressively mandating RTO so the senior technical personnel would be pushed to leave.
• Other political ways ("Focus", "Below Expectations") to push engineering leadership (principal engineers, etc) to leave, without it counting as a layoff of course.
• Terminating highly skilled engineering contractors everywhere else.
• Migrating serious, complex workloads to entry-level employees in cheap office locations (India, Spain, etc).
This push was slow but mostly completed by Q1 this year. Correlation doesn't imply causation? I find that hard to believe in this case. AWS had outages before, but none like this "apparently nobody knows what to do" one.
Source: I was there.
Our entire data stack (Databricks and Omni) are all down for us also. The nice thing is that AWS is so big and widespread that our customers are much more understanding about outages, given that its showing up on the news.
Signal is also down for me.
My messages are not getting through, but status page seems ok.
Seems fixed now.
This is from Amazon's latest earnings call when Andy Jessy was asked why they aren't growing as much as there competitors
"I think if you look at what matters to customers, what they care they care a lot about what the operational performance is, you know, what the availability is, what the durability is, what the latency and throughput is of of the various services. And I think we have a pretty significant advantage in that area." also "And, yeah, you could just you just look at what's happened the last couple months. You can just see kind of adventures at some of these players almost every month. And so very big difference, I think, in security."
That was a bit of ramble. Not something I d expect from a CEO of AWS who probably handles press all the time.
It reminds me of that viral clip from a beauty pageant where the contestant went on a geographical ramble while the question was about US education.
Well that aged well
> The incident underscores the risks associated with the heavy reliance on a few major cloud service providers.
Perhaps for the internet as a whole, but for each individual service it underscores the risk of not hosting your service in multiple zones or having a backup
https://m.youtube.com/watch?v=KFvhpt8FN18 clear detailed explanation of the AWS outage and how properly designed systems should have shielded the issue with zero client impact
It’s that period of the year when we discover AWS clients that don’t have fallback plans
♫ It's the most blunderful time of the year
There'll be much admin moaning
And servers not glowing
and the NOC crew in tears
It's the most blunderful time of the year ♫
Slack (canvas and huddles), Circle CI and Bitbucket are also reporting issues due to this.
Are there websites that do post-mortems for how the single points of failure impacted the entire internet?
Not just AWS, but Cloudflare and others too. Would be interesting to review them clinically.
I don't think blaming AWS is fair, since they typically exceed their regional and AZ SLAs
AWS makes their SLAs & uptime rates very clear, along with explicit warnings about building failover / business continuity.
Most of the questions on the AWS CSA exam are related to resiliency .
Look, we've all gone the lazy route and done this before. As usual, the problem exists between the keyboard and the chair.
Not sure any of their SLA’s are covered here.
If they don’t obfuscate the downtime (they will, of course), this outage would put them at, what, two nines? Thats very much out of their SLA.
People also keep talking about it as if its one region, but there are reports in this thread of internal dependencies inside AWS which are affecting unrelated regions with various services. (r53 updates for example)
Sounds like your lesson is "yes we should continue shaming AWS rather than fix our app"
It sounds like you think the SLA is just toilet paper? When in reality it's a contract which defines AWS's obligations. So the lesson here is that they broke their contract big time. So yes. Shaming is the right approach. Also it seems you missed somehow the other 1700+ comments agreeing with shaming
I wouldn't go that far. The SLA is a contract, and they are clear on the remedy (up to 100% refund if they don't hit 95% uptime in a month).
Just like reading medication side effects, they are letting you know that downtime is possible, albeit unlikely.
All of the documentation and training programs explain the consequence of single-region deployments.
The outage was a mistake. Let's hope it doesn't indicate a trend. I'm not defending AWS. I'm trying to help people translate the incident into a real lesson about how to proceed.
You don't have control over the outage, but you do have control over how your app is built to respond to a similar outage in the future.
They supposed to exceed their SLA. SLA is guaranteed worst case.
If you don't break anything, you aren't making anything valuable.
When I follow the link, I arrive on a "You broke reddit" page :-o
The internet was once designed to survive a nuclear war. Nowadays it cannot even survive until tuesday.
When did Snapchat move out of GCP?
Since I'm 5+ years out from my NDA around this stuff, I'll give some high level details here.
Snapchat heavily used Google AppEngine to scale. This was basically a magical Java runtime that would 'hot path split' the monolithic service into lambda-like worker pools. Pretty crazy, but it worked well.
Snapchat leaned very heavily on this though and basically let Google build the tech that allowed them to scale up instead of dealing with that problem internally. At one point, Snap was >70% of all GCP usage. And this was almost all concentrated on ONE Java service. Nuts stuff.
Anyway, eventually Google was no longer happy with supporting this and the corporate way of breaking up is "hey we're gonna charge you 10x what did last year for this, kay?" (I don't know if it was actually 10x. It was just a LOT more)
So began the migration towards Kubernetes and AWS EKS. Snap was one of the pilot customers for EKS before it was generally available, iirc. (I helped work on this migration in 2018/2019)
Now, 6+ years later, I don't think Snap heavily uses GCP for traffic unless they migrated back. And this outage basically confirms that :P
Thats so interesting to me, I always assume companies like google who have "unlimited" dollars will always be happy to eat the cost to keep customers, especially given gcp usage outside googles internal services is way smaller compared to azure and aws. Also interesting to see snapchat had a hacky solution with AppEngine
These are the best additional bits of information that I can find to share with you if you're curious to read more about Snap and what they did. (They were spending $400m per year on GCP which was famously disclosed in their S-1 when they IPO'd)
0: https://chrpopov.medium.com/scaling-cloud-infrastructure-5c6...
1: https://eng.snap.com/monolith-to-multicloud-microservices-sn...
The "unlimited dollars" come from somewhere after all.
GCP is behind in market share, but has the incredible cheat advantage of just not being Amazon. Most retailers won't touch Amazon services with a ten foot pole, so the choice is GCP or Azure. Azure is way more painful for FOSS stacks, so GCP has its own area with only limited competition.
I’m not sure what you mean by Azure being more painful for FOSS stacks. That is not my experience. Old you elaborate?
However I have seen many people flee from GCP because: Google lacks customer focus, Google is free about killing services, Google seems to not care about external users, people plain don’t trust Google with their code, data or reputation.
Customers would rather choose Azure. GCP has a bad rep, bad documentation, bad support compared to AWS / Azure. & with google cutting off products, their trust is damaged.
GCP as I understand it is the E-commerce/retail choice for this reason. Not Amazon being the main reason.
Honestly as a (very small) shareholder in Amazon, they should spin off AWS as a separate company. The Amazon brand is holding AWS back.
Absolutely! AWS is worth more as a separate company than being hobbled by the rest of Amazon. YouTube is the same.
Big monopolists do not unlock more stock market value, they hoard it and stifle it.
Google does not give even a singular fuck about keeping their customers. They will happily kill products that are actively in use and are low-effort for... convenience? Streamlining? I don't know, but Google loves to do that.
The engineering manager that was leading the project got promoted and now no longer cares about it.
High margin companies are always looking to cut the lower-margin parts of their business regardless of if they're profitable.
The general idea being that you'll losing money due to opportunity cost.
Personally, I think you're better off just not laying people off and having them work the less (but still) profitable stuff. But I'm not in charge.
They might have an implicit dependency on AWS, even if they're not primarily hosted there.
Apparently hiring 1000s of software engineers every month was load bearing
Various AI services (e.g. Perplexity) are down as well
I don't like how they phrased it. From the Verge:
“Perplexity is down right now,” Perplexity CEO Aravind Srinivas said on X. “The root cause is an AWS issue. We’re working on resolving it.”
What he should have said, IMHO, is "The root cause is that Perplexity fully depends on AWS."
I wonder if they're actually working on resolving that, or that they're just waiting for AWS to come back up.
Just tried Perplexity and it has no answer.
Damn, this is really bad.
Looking forward to the postmortem.
AWS has been the backbone of the internet. It is single point of failure most websites.
Other hosting services like Vercel, package managers like npm, even the docker registeries are down because of it.
docker hub or github cache internal maybe is affected:
Booting builder /usr/bin/docker buildx inspect --bootstrap --builder builder-1c223ad9-e21b-41c7-a28e-69eea59c8dac #1 [internal] booting buildkit #1 pulling image moby/buildkit:buildx-stable-1 #1 pulling image moby/buildkit:buildx-stable-1 9.6s done #1 ERROR: received unexpected HTTP status: 500 Internal Server Error ------ > [internal] booting buildkit: ------ ERROR: received unexpected HTTP status: 500 Internal Server Error
DockerHub shows full outage: https://www.dockerstatus.com/
Whose idea was it to make the whole world dependent on us-east-1?
The NSA might be happy everything runs through a local data center to their Virginia offices
The most recent public count of datacenters for AWS in us-east-1 is 159. I suspect that’s even an unwieldy number for NSA to spy on.
11 years ago: https://news.ycombinator.com/item?id=8448894
US$70 billion in spend aggregating data, back then, this number has only increased.
https://journals.sagepub.com/doi/pdf/10.1177/205395171454186...
AWS often deploys its new platform products and latest hardware (think GPUs) into us-east-1, so everyone has to maintain a footprint in us-east-1 to use any of it.
So as a result, everyone keeps production in us-east-1. =)
Not to mention, even if you went through the hassle of a diverse multi-cloud deployment, there's still something in your stack that has a dependency on us-east-1, and it's probably some weird frontend javascript module that uses a floatilla of free tier lambda services to format dates.
Isn’t it the cheapest AWS region? Or at least among the cheapest. If I’m correct, this incentivizes users to start there.
All countries are more expensive than the US, but many US regions are the same (cheapest) price. (eg Ohio, Oregon).
us-east-1 is more a legacy of it being the first region so by virtue of being the default region for the longest time most customers built on it.
Eh, us-east-1 is the oldest AWS region and if you get some AWS old timers talking over some beers they'll point out the legacy SPOFs that still exist in us-east-1.
1) People in us-east-1.
2) People who thought that just having stuff "in the cloud" meant that it was automatically spread across regions. Hint, it's not; you have to deploy it in different regions and architect/maintain around that.
3) Accounting.
With more and more parts of our lives depending on often only one cloud infrastructure provider as a single point of failure, enabling companies to have built-in redundancy in their systems across the world could be a great business.
Humans have built-in redundancy for a reason.
>Oct 20 12:51 AM PDT We can confirm increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. This issue may also be affecting Case Creation through the AWS Support Center or the Support API. We are actively engaged and working to both mitigate the issue and understand root cause. We will provide an update in 45 minutes, or sooner if we have additional information to share.
Weird that case creation uses the same region as the case you'd like to create for.
The support apis only exist in us-east-1, iirc. It’s a “global” service like IAM, but that usually means modifications to things have to go through us-east-1 even if they let you pull the data out elsewhere.
We are on Azure. But our CI/CD pipelines are failing, because Docker is on AWS.
Even railway's status page is down (guess they use Vercel):
https://railway.instatus.com/
Do events like this stir conversations in small to medium size businesses to escape the cloud?
This isn't a "cloud failure". All of these apps would be running now had they spent the additional 5% development costs to add failover to another region.
us-east1 is supposed to consist of a number of "availability zones" that are independent and stay "available" even if one goes down. That's clearly not happening, so yes, this is very much a cloud failure.
It's an AWS failure, perhaps, but it's not a reason to write off "the cloud"
It would have to be catastrophic for most businesses to make think about escaping the cloud. The cost of migration and maintenance are massive for small and medium businesses.
Probably, but they're usually dropped after they come back up
Depends how small.
I have clients and I’ve heard “even Amazon is down, we can be down” more than once.
[dead]
> due to an "operational issue" related to DNS
Always DNS..
Wasn't the point why AWS is so much premium that you will always get at least 6 nines if not more in availability?
They guarantee the dashboard will be green 99.999% of the time
I take dashboard is not covered by SLA?
The dashboard is the SLA.
IIRC it takes WAY too many managers to approve the dashboard being anything other than green.
It's not a reflection of reality nor is it automated.
The point of AWS is to promise you the nines and make you feel good about it. Your typical "growth & engagement" startup CEO can feel good and make his own customers feel good about how his startup will survive a nuclear war.
Delivery of those nines is not a priority. Not for the cloud provider - because they can just lie their way out of it by not updating their status page - and even when they don't, they merely have to forego some of their insane profit margin for a couple hours in compensation. No provider will actually put their ass on the line and offer you anything beyond their own profit margin.
This is not an issue for most cloud clients either because they keep putting up with it (lying on the status page wouldn't be a thing if clients cared) - the unspoken truth is that nobody cares that your "growth & engagement" thing is down for an hour or so, so nobody makes anything more than a ceremonial stink about it (chances are, the thing goes down/misbehaves regularly anyway every time the new JS vibecoder or "AI employee" deploys something, regardless of cloud reliability).
Things where nines actually matter will generally invest in self-managed disaster recovery plans that are regularly tested. This also means it will generally be built differently and far away from your typical "cloud native" dumpster fire. Depending on how many nines you actually need (aka what's the cost of not meeting that target - which directly controls how much budget you have to ensure you always meet it), you might be building something closer to aircraft avionics with the same development practices, testing and rigor.
I can tell you from personal experience that improving/maintaining uptime (by doing root cause analysis, writing correction of error reports, going through application security reviews, writing/reviewing design docs for safely deploying changes, working on operational improvements to services) probably takes up a majority of most AWS engineers' time. I'm genuinely curious what you are basing the opinion "Delivery of those nines is not a priority" off of.
It's usually true if you arent in US-East-1 which is widely known to be the least reliable location. Theres no reason anyone should be deploying anything new to it these days.
1. Competitors are not any better, or worse
2. Trusted brand
You are supposed to build multi regional services if you need higher resilience.
Actual multi-region replication is hard and forces you to think about complicated things like the CAP theorem/etc. It's easier to pretend AWS magically solves that problem for you.
Which is actually totally fine for the vast majority of things, otherwise there would be actual commercial pressures to make sure systems are resilient to such outages.
You could also achieve this in practice by just not using us-east-1, though at the very least you should have another region going for DR.
Never said it was easy or desirable for most companies.
But there is only so much a cloud provider can guarantee within a region or whatever unit of isolation they offer.
the highest availability service i think is S3 at 4 nines
you might be thinking of durability for s3 which is 11 nines, and i've never heard of anyone losing an object yet
no, it's probably Route 53, touted as having "100% availability" (https://en.wikipedia.org/wiki/Amazon_Route_53)
hah, that's funny as the outage seems to be caused by DNS issues
Route53 was still resolving DNS entries just fine. But it looked like someone/something removed the entries for DynamoDB.
us-east-1 the worst region for availability
You can. You just need to do the work to make it work. That's the bit where everyone but Netflix and Amazon fail.
Last time I checked the standard SLA is actually 99 % and the only compensation you get for downtime is a refund. Which is why I don't use AWS for anything mission critical.
Does any host provide more compensation than refund for downtime?
https://mail.tarsnap.com/tarsnap-announce/msg00050.html
> Following my ill-defined "Tarsnap doesn't have an SLA but I'll give people credits for outages when it seems fair" policy, I credited everyone's Tarsnap accounts with 50% of a month's storage costs.
So in this case the downtime was roughly 26 hours, and the refund was for 50% of a month, so that's more than a 1-1 downtime refund.
Most "legacy" hosts do yes. The norm used to be a percentage of your bill for every hour of downtime once uptime dropped below 99.9%. If the outage was big enough you'd get credit exceeding your bill, and many would allow credit withdrawal in those circumstances. There were still limits to protect the host but there was a much better SLA in place.
Cloud providers just never adopted that and the "ha, sucks to be you" mentality they have became the norm.
Depends on which service you're paying for. For pure hosting the answer is no, which is why it rarely makes sense to go AWS for uptime and stability because when it goes down there's nothing you can do. As opposed to bare metal hosting with redundancy across data centers, which can even cost less than AWS for a lot of common workloads.
What do you do if not AWS?
Theres literally thousands of options. 99% of people on AWS do not need to be on AWS. VPS servers or load balanced cloud instances from providers like Hetzner are more than enough for most people.
It still baffles me how we ended up in this situation where you can almost hear peoples disapproval over the internet when you say AWS / Cloud isn't needed and you're throwing money away for no reason.
There's nothing particularly wrong with AWS, other than the pricing premium.
The key is that you need to understand no provider will actually put their ass on the line and compensate you for anything beyond their own profit margin, and plan accordingly.
For most companies, doing nothing is absolutely fine, they just need to plan for and accept the occasional downtime. Every company CEO wants to feel like their thing is mission-critical but the truth is that despite everything being down the whole thing will be forgotten in a week.
For those that actually do need guaranteed uptime, they need to build it themselves using a mixture of providers and test it regularly. They should be responsible for it themselves, because the providers will not. The stuff that is actually mission-critical already does that, which is why it didn't go down.
Been using AWS too, but for a critical service we mirrored across three Hetzner datacenters with master-master replication as well as two additional locations for cluster node voting.
US-East-1 is literally the Achilles Heel of the Internet.
You would think that after the previous big us-east-1 outages (to be fair there have been like 3 of them in the past decade, but still, that's plenty), companies would have started to move to other AWS regions and/or to spread workloads between them.
It’s not that simple. The bit AWS doesn’t talk much about publicly (but will privately if you really push them) is that there’s core dependencies behind the scenes on us-east-1 for running AWS itself. When us-east-1 goes down the blast radius has often impacted things running in other regions.
It impacts AWS internally too. For example rather ironically it looks like the outage took out AWS’s support systems so folks couldn’t contact support to get help.
Unfortunately it’s not as simple as just deploying in multiple regions with some failover load balancing.
our eu-central-1 services had zero disruption during this incident, the only impact was that if we _had_ had any issues we couldn't log in to the AWS console to fix them
So moving stuff out of us-east-1 absolutely does help
There have been region-specific console pages for a number of years now.
Sure it helps. Folks just saying there’s lots of examples where you still get hit by the blast radius of a us-east-1 issue even if you’re using other regions.
Exactly
> Amazon Alexa: routines like pre-set alarms were not functioning.
It's ridiculous how everything is being stored in the cloud, even simple timers. It's past high time to move functionality back on-device, which would come with the advantage of making it easier to de-connect from big tech's capitalist surveillance state as well.
exactly why it won't happen :)
Exactly.
I half-seriously like to say things like, "I'm excited for a time when we have powerful enough computers to actually run applications on them instead of being limited to only thin clients." Only problem is most of the younger people don't get the reference anymore, so it's mainly the olds that get it
we[1] operate out of `us-east-1` but chose to not use any of the cloud based vendor lockin (sorry vercel, supabase, firebase, planetscale etc). Rather a few droplets in DigitalOcean(us-east-1) and Hetzner(eu). We serve 100 million requests/mo, few million user generated content(images)/mo at monthly cost of just about $1000/mo.
It's not difficult, it's just that we engineers chose convenience and delegated uptime to someone else.
[1] - https://usetrmnl.com
Bitbucket seems affected too [1]. Not sure if this status page is regional though.
[1] https://bitbucket.status.atlassian.com/incidents/p20f40pt1rg...
it is very funny to me that us-east-1 going down nukes the internet. all those multiple region reliability best practices are for show
Seems to be really only in us-east-1, DynamoDB is performing fine in production on eu-central-1.
The internal disruption reviews are going to be fun :)
The fun is really gonna start if the root cause of this somehow implicates an AI as a primary cause.
I haven't seen the "90% of our code is AI" nonsense from Amazon.
Their business doesn't depend on selling AI so they have the luxury of not needing to be in on that grift.
https://archive.is/b6aUD
It's never an AI's fault since it's up to a human to implement the AI and put in a process that prevents this stuff from happening.
So blame humans even if an AI wrote some bad code.
I agree but then again it’s always a humans fault in the end. So probably a root cause will have a bit more neuance. I was more thinking of the possible headlines and how that would potentially affect the public AI debate. Since this event is big enough to actually get the attention of eg risk management at not-insignificant orgs.
> but then again it’s always a humans fault in the end
Disagree, a human might be the cause/trigger, but the fault is pretty much always systemic. A whole lot of things has to happen for that last person to cause the problem.
Also agree. “What” build the system though? (Humans)
Edit: and more important who governed the system, ie made decisions about maintainance, staffing, training, processes and so on
It’s gonna be DNS
Your remark made me laugh, but..:
"Oct 20 3:35 AM PDT The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution."
https://health.aws.amazon.com/health/status
It’s always DNS! Except when it’s the firewall.
if that's the case it'll be buried
Of course this happens when I take a day off from work lol
Came here after the Internet felt oddly "ill" and even got issues using Medium, and sure enough https://status.medium.com
Potentially-ignoramus comment here, apologies in advance, but amazon.com itself appears to be fine right now. Perhaps slower to load pages, by about half a second. Are they not eating (much of) their own dog food?
They are 100% fully on AWS and nothing else. (I’m Ex-Amazon)
It seems like the outage is only effecting one region so AWS is likely falling back to others. I’m sure parts of the site are down but the main sites are resilient
Kindle downloads and Amazon orders history were wholly unavailable, with rapid errors from the responsive website.
I was getting 500 errors a few hours ago on amazon.com
those sentences aren't logically connected - the outage this post is about is mostly confined to `us-east-1`, anyone who wanted to build a reliable system on top of AWS would do so across multiple regions, including Amazon itself.
`us-east-1` is unfortunately special in some ways but not in ways that should affect well-designed serving systems in other regions.
From the great Corey Quinn
Ah yes, the great AWS us-east-1 outage.
Half the internet’s on fire, engineers haven’t slept in 18 hours, and every self-styled “resilience thought leader” is already posting:
“This is why you need multi-cloud, powered by our patented observability synergy platform™.”
Shut up, Greg.
Your SaaS product doesn’t fix DNS, you're simply adding another dashboard to watch the world burn in higher definition.
If your first reaction to a widespread outage is “time to drive engagement,” you're working in tragedy tourism. Bet your kids are super proud.
Meanwhile, the real heroes are the SREs duct-taping Route 53 with pure caffeine and spite.
https://www.linkedin.com/posts/coquinn_aws-useast1-cloudcomp...
This wins all the internets today. Probably not a day where internets are particularly valuable, but it wins them nonetheless.
He once said about an open source project that I was the third highest contributor on at AWS “This may be the worst AWS naming of 2021.” It was one of the proudest moments in my career.
Yes I know it’s sad…
Had a meeting where developers were discussing the infrastructure for an application. A crucial part of the whole flow was completely dependant on an AWS service. I asked if it was a single point of failure. The whole room laughed, I rest my case.
Similar experience here. People laughed and some said something like "well, if something like AWS falls then we have bigger problems". They laugh because honestly is too far-fetched to think the whole AWS infra going down. Too big to fail as they say in the US. Nothing short of a nuclear war would fuck up the entire AWS network so they're kinda right.
Until this happen. A single region in a cascade failure and your saas is single region.
They’re not wrong though. If AWS goes down, EVERYTHING goes down to some degree. Your app, your competitor’s apps, your clients’ chat apps. You’re kinda off the hook.
Why would your competitors go down? AWS has at best 30-35% market share. And that's ignoring the huge mass of companies who still run their infrastructure on bare metal.
A whole bunch of meeting bots use Recall.
Recall is on AWS.
Everyone using Recall for meeting recordings is down.
In some domains, a single SaaS dominates the domain and if that SaaS sits on AWS, it doesn't matter if AWS is 35% marketshare because the SaaS that dominates 80% of the domain is on AWS so the effect is wider than just AWS's market share.
We're on GCP, but we have various SaaS vendors on AWS so any of the services that rely on AWS are gone.
Many chat/meeting services also run on AWS Chime so even if you're not on AWS, if a vendor uses Chime, that service is down.
Doesn’t Slack use Chime for video calls?
Yes. Yes it does.
Part of the company I work at is doing infrastructure consulting. We're in fact seeing companies moving to bare metal, with the rise of turnkey container systems between Nutanix, Purestorage, Redhat, ... At this point in time, a few remotely managed boxes in a rack can offer a really good experience for containers for very little effort.
And this comes in a time with regulations like Dora and the BaFin tightening things - managing these boxes becomes less effort than maintaining compliance across vendors.
There have been plenty of solutions for a while. Pivotal Cloud Foundry, Open Shift, etc. None of these were "turnkey" though. If you're doing consulting, is it more newer, easier to install/manage tech, or is it cost?
I'm not in our consulting parts for infra, but based on their feedback and some talks at a conference a bit back: Vendors have been working on standardizing on Kubernetes components and mechanics and a few other protocols, which is simplifying configuration and reducing the configuration you have to do infrastructurally a lot.
Note, I'm not affiliated with any of these companies.
For example, Purestorage has put a lot of work into their solution and for a decent chunk of cache, you get a system that slots right into VMware, offers iSCSI for other infrastructure providers, offers a CSI plugin for containers, and speaks S3. And integration with a few systems like OpenShift has been simplified as well.
This continues. You can get ingress/egress/network monitoring compliance from Calico slotting in as a CNI plugin, some systems managing supply chain security, ... Something like Nutanix is an entirely integrated solution you rack and then you have a container orchestration with storage and all of the cool things.
Cost is not really that much a factor in this market. Outsourcing regulatory requirements and liability to vendors is great.
Because your competitor probably depends on a service which uses aws. They may host all their stuff in azure, but use cloudfront as cache which uses aws and goes down.
because your competitors are probably using services that depend on AWS.
>> People laughed and some said something like "well, if something like AWS falls then we have bigger problems".
> They’re not wrong though. If AWS goes down, EVERYTHING goes down to some degree. Your app, your competitor’s apps, your clients’ chat apps. You’re kinda off the hook.
They made their own bigger problems by all crowding into the same single region.
it's a weird effect:
imagine a beach, with icecream vendors. You'd think it would be optimal for two vendors to each split it half north, half south. However, in wanting to steal some of the other vendors' customers, you end up with two icecream stands in the center.
So too with outages. Safety / loss of blame in numbers.
Plenty of stuff still works.
Sure but to their point, you're off the hook if half of the internet is down.. it's sort of like: "No one gets fired for picking IBM".
I'd rather my services and products be in the half of the Internet that works than the half that doesn't.
I feel like I don't really like AWS and prefer self hosted vps or even google cloud / cloudflare more but and so I agree with what you are trying to say but Let me be the devil's argument.
I mean I agree but what you are saying is that where else are you gonna host it? If you host it yourself and then it turns out to be an issue and you go down then that's entirely on you and 99% of the internet still works.
But if Aws goes down, lets say 50% of the internet goes down.
So, in essence, nobody blames a particular team/person just as the parent comment said that nobody gets fired for picking IBM.
Although, I still think that the idea which is worrying is such massive centralization of servers that we have a single switch which can turn half the internet off. So I am a bit worried from the centralization side of thing's.
Our app was up. I'm sure we made a lot of money.
The question really becomes, did you make money that you wouldn't have made when services came back up? As in, will people just shift their purchase time to tomorrow when you are back online? Sure, some % is completely lost but you have to weigh that lost amount against the ongoing costs to be multi-cloud (or multi-provider) and the development time against those costs. For most people I think it's cheaper to just be down for a few hours. Yes, this outage is longer than any I can remember but most people will shrug it off and move on once it comes back up fully.
At the end of the day most of us aren't working on super critical things. No one is dying because they can't purchase X item online or use Y SaaS. And, more importantly, customers are _not_ willing to pay the extra for you to host your backend in multiple regions/providers.
In my contracts (for my personal company) I call out the single-point-of-failure very clearly and I've never had anyone balk. If they did I'd offer then resiliency (for a price) and I have no doubt that they would opt to "roll the dice" instead of pay.
Lastly, it's near-impossible to verify what all your vendors are using so even if you manage to get everything resilient it only takes one chink in the armor the bring it all down (See: us-east-1 and various AWS services that rely on that even if you don't host anything in us-east-1 directly).
I'm not trying to downplay this, pretend it doesn't matter, or anything like that. Just trying to point out that most people don't care because no one seems to care (or want to pay for it). I wish that was different (I wish a lot of things were different) but wishing doesn't pay my bills and so if customers don't want to pay for resiliency then this is what they get and I'm at peace with that.
Nothing short of a nuclear war, a bad deploy, or some operational oopsie, and everybody knows how rare all these things are!
If you were dependent upon a single distribution (region) of that Service, yes it would be a massive single point of failure in this case. If you weren't dependent upon a particular region, you'd be fine.
Of course lots of AWS services have hidden dependencies on us-east-1. During a previous outage we needed to update a Route53(DNS) record in us-west-2, but couldn't because of the outage in us-east-1.
So, AWS's redundant availability goes something like "Don't worry, if nothing is working in us-east-1, it will trigger failover to another regions" ... "Okay, where's that trigger located?" ... "In the us-east-1 region also" ... "Doens't that seem a problem to you?" ... "You'd think it might be! But our logs say it's never been used."
Some 'regional' AWS services still rely on other services (some internal) that are only in us-east-1.
Even Amazon’s own services (ie ring) were affected by this outage
Relying on AWS is a single point of failure. Not as much as relying on a single AWS region, but it's still a single point.
It's fairly difficult to avoid single points of failure completely, and if you do it's likely your suppliers and customers haven't managed to.
It's about how much your risk level is.
AWS us-east-1 fails constantly, it has terrible uptime, and you should expect it to go. A cyberattack which destroyed AWSs entire infrastructure would be less likely. BGP hijacks across multiple AWS nodes are quite plausible though, but that can be mitigated to an extent with direct connects.
Sadly it seems people in charge of critical infrastructure don't even bother thinking about these things, because next quarters numbers are more important.
I can avoid London as a single point of failure, but the loss of Docklands would cause so much damage to the UK's infrastructure I can't confidently predict that my servers in Manchester connected to peering points such as IXman will be able to reach my customer in Norwich. I'm not even sure how much international connectivity I could rely on. In theory Starlink will continue to work, but in practice I'm not confident.
When we had power issues in Washington DC a couple of months ago, three of our four independent ISPS failed, as they all had undeclared requirements on active equipment in the area. That wasn't even a major outage, just a local substation failure. The one circuit which survived was clearly just fibre from our (UPS/generator backed) equipment room to a data centre towards Baltimore (not Ashburn).
Amazing. So you will build your own load balancer that sends loads between AWS and Gcloud and make it the single point of failure instewd?
I mean given what we've seen with these AWS failures impact, wouldn't any enemies first target be to hit us-east-1 ? Imagine if it just disappeared?
It won't be over until long after AWS resolves it - the outages produce hours of inconsistent data. It especially sucks for financial services, things of eventual consistency and other non-transactional processes. Some of the inconsistencies introduced today will linger and make trouble for years.
What are the design best practices and industry standards for building on-premise fallback capabilities for critical infrastructure? Say for health care/banking ..etc
A relative of mine lived and worked in the US for Oppenheimer Funds in the 1990's and they had their own datacenters all over the US, multiple redundancy for weather or war. But every millionaire feels entitled to be a billionaire now, so all of that cost was rolled into a single point of cloud failure.
Reminds me of a great Onion tagline:
"Plowshare hastily beaten back into sword."
If we see more of this, it would not be crazy to assume that all this compelling of engineers to "use AI" and the flood of Looks Good To Me code is coming home.
Big if, major outages like this aren't unheard of, and so far, fairly uncommon. Definitely hit harder than their SLAs promise though. I hope they do an honest postmortem, but I doubt they would blame AI even if it was somehow involved. Not to mention you can't blame AI unless you go completely hands-off - but that's like blaming an outsourcing partner, which also never happens.
Looks like a DNS issue - dynamodb.us-east-1.amazonaws.com is failing to resolve.
"Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1."
it seems they found your comment
I'm not sure if this is directly related, but I've noticed my Apple Music app has stopped working (getting connection error messages). Didn't realize the data for Music was also hosted on AWS, unless this is entirely unrelated? I've restarted my phone and rebooted the app to no avail, so I'm assuming this is the culprit.
I'm getting rate limit issues on Reddit so it could be related.
We're seeing issues with multiple AWS services https://health.aws.amazon.com/health/status
Wow, about 9 hours later and 21 of 24 Atlassian services are still showing up as impacted on their status page.
Even @ 9:30am ET this morning, after this supposedly was clearing up, my doctor's office's practice management software was still hosed. Quite the long tail here.
https://status.atlassian.com/
This is why we use us-east-2.
us-east-2 remains the best kept secret
Did they try asking Claude to fix these issues? If it turns out this problem is AI-related, I'd love to see the AAR.
I forget where I read it originally, but I strongly feel that AWS should offer a `us-chaos-1` region, where every 3-4 days, one or two services blow up. Host your staging stack there and you build real resiliency over time.
(The counter joke is, of course, "but that's `us-east-1` already! But I mean deliberately and frequently.)
AWS already offers Fault Injection Service, which you can use in any region to conduct chaos engineering: https://aws.amazon.com/fis/
This website just seems to be an auto-generated list of "things" with a catchy title:
> 5000 Reddit users reported a certain number of problems shortly after a specific time.
> 400000 A certain number of reports were made in the UK alone in two hours.
This is usually something I see on Reddit first, within minutes. I’ve barely seen anything on my front page. While I understand it’s likely the subs I’m subscribed to, that was my only reason for using Reddit. I’ve noticed that for the past year - more and more tech heavy news events don’t bubble up as quickly anymore. I also didn’t see this post for a while for whatever reason. And Digg was hit and miss on availability for me, and I’m just now seeing it load with an item around this.
I think I might be ready to build out a replacement through vibe coding. I don’t like being dependent on user submissions though. I feel like that’s a challenge on its own.
Reddit itself is having issues. I have multiple comments fail to post. And the Reddit user page leads me to a 404.
Yeah, Reddit has been half-working all morning. Last time this happened I had an account get permabanned because the JavaScript on the page got stuck in a no-backoff retry loop and it banned me for "spamming." Just now it put me in rate limit jail for trying to open my profile one time, so I've closed out all my tabs.
Try lemmy, it's currently in the front page, albeit on a meme sub
Anecdotally, I think you should disregard this. I found out about this issue first via Reddit, roughly 30 minutes after the onset (we had an alarm about control plane connectivity).
Most Reddit API is down as well.
For me, Reddit fails to load. Gives some "upstream connection error"
I found it pretty fast on the /r/signal sub and went from there.
Reddit is worthless now, and posting about your tech infrastructure on reddit is a security and opsec lapse. My workplace has reddit blocked at the edge. I would trust X more than reddit, and that is with X having active honeypot accounts (it is even a meme about Asian girls). In fact, heard about this outage on X before anywhere else.
https://status.tailscale.com/ clients' auth down :( what a day
That just says the homepage and knowledge base are down and that admin access specifically isn’t effected.
yep, admin panel works, but in practice my devices are logged out and there is no way to re-authorize them.
I can authenticate my devices just fine.
interesting, which auth provider u are using? browser based auth via Google wasn't working for me. tailscale used as jumphost for private subnets in aws, and... so that was painful incident as access to corp resources is mandatory for me
I was using browser-based auth via Google.
can't log into https://amazon.com either after logging out; so many downstream issues
Related thread: https://news.ycombinator.com/item?id=45640772
Related thread: https://news.ycombinator.com/item?id=45640838
I can't do anything for school because Canvas by Instructure is down because of this.
I can't log in to my AWS account, in Germany, on top of that it is not possible to order anything or change payment options from amazon.de.
No landing page explaining services are down, just scary error pages. I thought account was compromised. Thanks HN for, as always, being the first to clarify what's happening.
Scary to see that in order to order from Amazon Germany, us-east1 must be up. Everything else works flawlessly but payments are a no go.
I wanted to log into my Audible account after a long time on my phone, I couldn't, started getting annoyed, maybe my password is not saved correctly, maybe my account was banned, ... Then checking desktop, still errors, checking my Amazon.de, no profile info... That's when I started suspecting that it's not me, it's you, Amazon! Anyway, I guess, I'll listen to my book in a couple of hours, hopefully.
Btw, most parts of the amazon.de is working fine, but I can't load profiles, and can't login.
You might be interested in Libation [0]. I use it to de-DRM my Audible library and generate a cue sheet with chapters in for offline listening.
[0] https://getlibation.com/
We use IAM Identity Center (née SSO) which is hosted in the eu-central-1 region, and I can log in just fine. Its admin pages are down, though. Ditto for IAM.
Other things seem to be working fine.
I just ordered stuff from Amazon.de. And I highly any Amazon site can go down because of one region. Just like Netflix are rarely affected.
I can’t even login, I get the internal error treatment. This is on Amazon.de
I'm on Amazon.de and I literally ordered stuff seconds before posting the comment. They took the money and everything. The order is in my order history list.
Not remotely surprised. Any competent engineer knows full well the risk of deploying into us-east-1 (or any “default” region for that matter), as well as the risks of relying on global services whose management or interaction layer only exists in said zone. Unfortunately, us-east-1 is the location most outsourcing firms throw stuff, because they don’t have to support it when it goes pear-shaped (that’s the client’s problem, not theirs).
My refusal to hoard every asset into AWS (let alone put anything of import in us-east-1) has saved me repeatedly in the past. Diversity is the foundation of resiliency, after all.
> as well as the risks of relying on global services whose management or interaction layer only exists in said zone.
Is this well known/documented? I don't have anything on AWS but previously worked for a company that used it fairly heavily. We had everything in EU regions and I never saw any indication/warning that we had a dependency on us-east-1. But I assume we probably did based on the blast radius of today's outage.
Some of the “global” and “edge” services depend on us-east-1.
See: https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
“In the aws partition, the IAM service’s control plane is in the us-east-1 Region, with isolated data planes in each Region of the partition.“
Also, intra-region, many of the services use eachother, and not in a manner where the public can discern the dependency map.
Nowadays when this happens it's always something. "Something went wrong."
Even the error message itself is wrong whenever that one appears.
Displaying and propagating accurate error messages is an entire science unto itself... ...I can see why it's sometimes sensible to invest resource elsewhere and fall back to 'something'.
IMHO if error handling is rocket science, the error is you
Perhaps you're not handling enough errors ;-)
I use the term “unexpected error” because if the code got to this alert it wasn’t caught by any traps I’d made for the “expected” errors.
Reddit shows:
"Too many requests. Your request has been rate limited, please take a break for a couple minutes and try again."
Appears to have also disabled that bot on HN that would be frantically posting [dupe] in all the other AWS outage threads right about now.
That’s done by human beings.
Just a couple of days ago in this HN thread [0] there were quite some users claiming Hetzner is not an options as their uptime isn't as good as AWS, hence the higher AWS pricing is worth the investment. Oh, the irony.
[0]: https://news.ycombinator.com/item?id=45614922
As a data point, I've been running stuff at Hetzner for 10 years now, in two datacenters (physical servers). There were brief network outages when they replaced networking equipment, and exactly ONE outage for hardware replacement, scheduled weeks in advance, with a 4-hour window and around 1-2h duration.
It's just a single data point, but for me that's a pretty good record.
It's not because Hetzner is miraculously better at infrastructure, it's because physical servers are way simpler than the extremely complex software and networking systems that AWS provides.
> physical servers are way simpler than the extremely complex software and networking systems that AWS provides.
Or, rather, it's your fault when the complex software and networking systems you deployed on top of those physical servers go wrong (:
Yes. Which is why I try to keep my software from being overly complex, for example by not succumbing to the Kubernetes craze.
Well the complexity comes not from Kubernetes per se but that the problem it wants to solve (generalized solution for distributed computing) is very hard in itself.
Only if you actually has a system complex enough to require it. A lot of systems that use kubernetes are not complex enough to require it, but use it anyway. In that case kubernetes does indeed add unnecessary complexity.
Except that k8s doesn't solve the problem of generalized distributed computing at all. (For that you need distributed fault-tolerant state handling which k8s doesn't do.)
K8s solves only one problem - the problem of organizational structure scaling. For example, when your Ops team and your Dev team have different product deadlines and different budgets. At this point you will need the insanity of k8s.
I am so happy to read that someone views kubernetes the same way I do. for many years i have been surrounded by people who "kubernetes all the things" and that is absolute madness to me.
Yes, I remember when Kubernetes hit the scene and it was only used by huge companies who needed to spin-up fleets of servers on demand. The idea of using it for small startup infra was absurd.
As another data point, I run a k8s cluster on Hetzner (mainly for my own experience, as I'd rather learn on my pet projects vs production), and haven't had any Hetzner related issues with it.
So Hetzner is OK for the overly complex as well, if you wish to do so.
I love my k8s. Spend 5 minutes per month over the past 8 years and get a very reliable infra
Do you work on k8s professionally outside of the project you’re talking about?
5 mins seems unrealistic unless you’re spending time somewhere else to keep up to speed with version releases, upgrades, etc.
I think it sounds quite realistic especially if you’re using something like Talos Linux.
I’m not using k8s personally but the moment I moved from traditional infrastructure (chef server + VMs) to containers (Portainer) my level of effort went down by like 10x.
I would say even if not using Talos, Argo CD or Flux CD together with Renovate really helps to simplify the reoccuring maintenence.
You've spent less than 8 hours total on kubernetes?
I agree. Even when Kubernetes is used in large environments, is it still cumbersome, verbose and overly complex.
What are the alternatives?
Right, who needs scalability? Each app should have a hard limit of users and just stop acceppting new users when limits are reached.
Yeah scalability is great! Let’s burn through thousands of dollars an hour and give all our money to Amazon/Google/Microsoft
When those pink slips come in, we’ll just go somewhere else and do the same thing!
You know that “scale” existed long before K8s - or even Borg - was a thing, right? I mean, how do you think Google ran before creating them?
yes and mobile phones existed before smartphones, what's the point? So far in terms of scalability nothing beats k8s. And from OpenAI and Google we also see that it even works for high performance use case such as LLM trainings with huge amounts of nodes.
If the complex software you deployed and/or configured goes wrong on AWS it's also your fault.
On the other hand, I had the misfortune of having a hardware failure on one of my Hetzner servers. They got a replacement harddrive in fairly quickly, but still complete data loss on that server, so I had to rebuild it from scratch.
This was extra painful, because I wasn't using one of the OS that is blessed by Hetzner, so it requires a remote install. Remote installs require a system that can run their Java web plugin, and that have a stable and fast enough connection to not time out. The only way I have reliably gotten them to work is by having an ancient Linux VM that was also running in Hetzner, and had the oldest Firefox version I could find that still supported Java in the browser.
My fault for trying to use what they provide in a way that is outside their intended use, and props to them for letting me do it anyway.
That can happen with any server, physical or virtual, at any time, and one should be prepared for it.
I learned a long time ago that servers should be an output of your declarative server management configuration, not something that is the source of any configuration state. In other words, you should have a system where you can recreate all your servers at any time.
In your case, I would indeed consider starting with one of the OS base installs that they provide. Much as I dislike the Linux distribution I'm using now, it is quite popular, so I can treat it as a common denominator that my ansible can start from.
Cloud marketing and career incentives seems to have instilled in the average dev that MTBF for hardware is in days rather than years.
MTBF?
Mean Time Between Failures.
Mean time between failures
Do you monitor your product closely enough to know that there weren't other brief outages? E.g. something on the scale of unscheduled server restarts, and minute-long network outages?
I personally do through status monitors at larger cloud providers at 30 sec resolutions, never noticed a downtime. They will sometimes drop ICMP though, even though the host is alive and kicking.
Surprised they allow ICMP at all
why does this surprise you?
actually, why do people block ICMP? I remember in 1997-1998 there were some Cisco ICMP vulnerabilities and people started blocking ICMP then and mostly never stopped, and I never understood why. ICMP is so valuable for troubleshooting in certain situations.
Security through obscurity mostly, I don't know who continues to push the advice to block ICMP without a valid technical reason since at best if you tilt your head and squint your eyes you could almost maybe see a (very new) script kiddie being defeated by it.
I've rarely actually seen that advice anywhere, more so 20 years ago than now but people are still clearly getting it from circles I don't run in.
I don’t disagree. I am used to highly regulated industries where ping is blocked across the WAN
Of course. It's a production SaaS, after all. But I don't monitor with sub-minute resolution.
I do for some time now, on the scale of around 20 hosts in their cloud offering. No restarts or network outages. I do see "migrations" from time to time (vm migrating to a different hardware, I presume), but without impact on metrics.
to stick to the above point, this wasn't a minute long outage. if you care about seconds/minutes long outages, you monitor. running on aws, hetzer, ovh, or a raspberry in a shoe box makes no difference
I do. Routers, switches, and power redundancy are solved problems in datacenter hardware. Network outages rarely occur because of these systems, and if any component goes down, there's usually an automatic failover. The only thing you might notice is TCP connections resetting and reconnecting, which typically lasts just a few seconds.
Having run bare-metal servers for a client + plenty of VMs pre-cloud, you'd be surprised how bloody obvious that sort of thing is when it happens.
Also sorts of monitoring gets flipped.
And no, there generally aren't brief outages in normal servers unless you did it.
I did have someone accidentally shut down one of the servers once though.
7 years, 20 servers, same here.
When AWS is down, everybody knows it. People don’t really question your hosting choice. It’s the IBM of cloud era.
Yes, but those days are numbered. For many years AWS was in a league of its own. Now they’ve fallen badly behind in a growing number of areas and are struggling to catch up.
There’s a ton of momentum associated with the prior dominance, but between the big misses on AI, a general slow pace of innovation on core services, and a steady stream of top leadership and engineers moving elsewhere they’re looking quite vulnerable.
Can you throw out an example or two, because in my experience, AWS is the 'it just works' of the cloud world. There's a service for everything and it works how you'd expect.
I'm not sure what feature they're really missing, but my favorite is the way they handle AWS Fargate. The other cloud providers have similar offerings but I find Fargate to have almost no limitations when compared to the others.
You’ve given a good description of IBM for most of the 80s through the 00s. For the first 20 years of that decline “nobody ever got fired for buying IBM” was still considered a truism. I wouldn’t be surprised if AWS pulls it off for as long as IBM did.
I think that the worst thing that can happen to an org is to have that kind of status ("nobody ever got fired for buying our stuff" / "we're the only game in town").
It means no longer being hungry. Then you start making mistakes. You stop innovating. And then you slowly lose whatever kind of edge you had, but you don't realize that you're losing it until it's gone
Unfortunately I think AWS is there now. When you talk to folks there they don’t have great answers to why their services are behind or not as innovative as other things out there. The answer is basically “you should choose AWS because we’re AWS.” It’s not good.
I couldn't agree more, there was clearly a big shift when Jassy became CEO of amazon as a whole and Charlie Bell left (which is also interesting because it's not like azure is magically better now).
The improvements to core services at AWS hasn't really happened at the same pace post covid as it did prior, but that could also have something to do with overall maturity of the ecosystem.
Although it's also largely the case that other cloud providers have also realized that it's hard for them to compete against the core competency of other companies, whereas they'd still be selling the infrastructure the above services are run on.
Looks like you’re being down voted for saying the quiet bit out loud. You’re not wrong though.
Or because people don’t agree with “days are numbered”.
As much as I might not like AWS, I think they’ll remain #1 for the foreseeable future. Despite the reasons the guy listed.
Given recent earnings and depending on where things end up with AI it’s entirely plausible that by the end of the decade AWS is the #2 or #3 cloud provider.
AWS' core advantage is price. No one cares if they are "behind on AI" or "the VP left." At the end of the day they want a cheap provider. Amazon knows how to deliver good-enough quality at discount prices.
That story was true years ago but I don’t know that it rings true now. AWS is now often among the more expensive options, and with services that are struggling to compete on features and quality.
That is 100% true. You cant be fired for picking AWS... But I doubt its the best choice for most people. Sad but true
Schrodingers user;
Simultaneously too confused to be able to make their own UX choices, but smart enough to understand the backend of your infrastructure enough to know why it doesn't work and excuses you for it.
The morning national TV news (BBC) was interrupted with this as breaking news, and about how many services (specifically snapchat for some reason) are down because of problems with "Amazon's Web Services, reported on DownDetector"
I liked your point though!
Well, at that level of user they just know "the internet is acting up this morning"
I thought we didn't like when things were "too big to fail" (like the banks being bailed out because if we didn't the entire fabric of our economy would collapse; which emboldens them to take more risks and do it again).
A typical manager/customer understands just enough to ask their inferiors to make their f--- cloud platform work, why haven't you fixed it yet? I need it!
In technically sophisticated organizations, this disconnect simply floats to higher levels (e.g. CEO vs. CTO rather than middle manager vs. engineer).
You can't be fired, but you burn through your runway quicker. No matter which option you choose, there is some exothermic oxidative process involved.
AWS is smart enough to throw you a few mill credits to get you started.
MILL?!
I only got €100.000 bounded to a year, then a 20% discount for spend in the next year.
(I say "only" because that certainly would be a sweeter pill, €100.000 in "free" credits is enough to make you get hooked, because you can really feel the free-ness in the moment).
Mille is thousand in Latin so they might have meant a few thousand dollars.
Every one of the big hyperscalers has a big outage from time to time.
Unless you lose a significant amount of money per minute of downtime, there is no incentive to go multicloud.
And multicloud has its own issues.
In the end, you live with the fact that your service might be down a day or two per year.
When we looked at this our conclusion was not multi cloud but local resiliency with cloud augmentation. We still had our own small data center
> In the end, you live with the fact that your service might be down a day or two per year.
This is hilarious. In the 90s we used to have services which ran on machines in cupboards which would go down because the cleaner would unplug them. Even then a day or two per year would be unacceptable.
Usually, 2 founders creating a startup can't fire each other anyway so a bad decision can still be very bad for lots of people in this forum
On the other side of that coin, I am excited to be up and running while everyone else is down!
On one hand it's allows to shift the blame but on other hand is shows a disadvantage of hyper centralization - if AWS is down too many important services are down at the same time which makes it worse. E. g. when AWS is down it's important to have communication/monitoring services UP so engineers can discuss / co-ordinate workarounds and have good visibility but Atlassian was (is) significantly degraded today too.
Facebook had a comically bad outage a few years ago wherein the internal sign-in, messaging, and even server keycard access went down
https://en.wikipedia.org/wiki/2021_Facebook_outage#Impact
Somewhat related tip of the day, don't host your status page as a subdomain off the main site. Ideally host it with a different provider entirely
To back up this point, currently BBC News have it as their most significant story, with "live" reporting: https://www.bbc.co.uk/news/live/c5y8k7k6v1rt
This is alongside "live" reporting on the Israel/Gaza conflict as well as news about Epstein and the Louvre heist.
This is mainstream news.
I like how their headline starts with Snapchat and Roblox being affected.
Actually I am keen to know how Roblox got impacted. Following the terrible Halloween Outage in 2021, they posted 2 years ago about migrating to a cell based architecture in https://corp.roblox.com/newsroom/2023/12/making-robloxs-infr...
Perhaps some parts of the migration haven't been completed, or there is still a central database in us-east1
The journalist found out about it from their tween.
[dead]
[dead]
They're the only apps English people are allowed to use, the rest of the internet is banned by Ofcom. /s
100%. When AWS was down, we'd say "AWS is down!", and our customers would get it. Saying "Hetzner is down!" raises all sorts of questions your customers aren't interested in.
I've ran a production application off Hetzner for a client for almost a decade and I don't think I have had to tell them "Hetzner is down", ever, apart from planned maintenance windows.
A bold strategy to think they'll never have an outage though, right? Maybe even a little naive and a little arrogant...
No provider is better than two providers.
Hosting on second- or even third-tier providers allows you to overprovision and have much better redundancy, provided your solution is architected from the ground up in a vendor agnostic way. Hetzner is dirt cheap, and there are countless cheap and reliable providers spread around the globe (Europe in my case) to host a fleet of stateless containers that never fail simultaneously.
Stateful services are much more difficult, but replication and failover is not rocket science. 30 minutes of downtime or 30 seconds of data loss rarely kill businesses. On the contrary, unrealistic RTOs and RPOs are, in my experience, more dangerous, either as increased complexity or as vendor lock-in.
Customers don't expect 100% availability and noone offers such SLAs. But for most businesses, 99.95% is perfecty acceplable, and it is not difficult to have less than 4h/year of downtime.
The point seems to be not that Hetzner will never have an outage, but rather that they have a track record of not having outages large enough for everyone to be affected.
Seems like large cloud providers, including AWS, are down quite regularly in comparison, and at such a scale that everything breaks for everyone involved.
> The point seems to be not that Hetzner will never have an outage, but rather that they have a track record of not having outages large enough for everyone to be affected.
If I am affected, I want everyone to be affected, from a messaging perspective
Okay, that helps for the case when you are affected. But what about the case when you are not affected and everyone else is? Doesn't that seem like good PR?
Take the hit of being down once every 10 years compared to being up for the remaining 9 that others are down.
That depends on the service. Far from everyone is on their PC or smartphone all day, and even fewer care about these kinds of news.
Amazon is up, what are they doing?
Which eventually leads to the headline "AWS down indefinitely, society collapses".
most people dont even know aws exists
Non-techies don’t. Here’s how CNN answered, what is AWS?
“Amazon Web Services (AWS) is Amazon’s internet based cloud service connecting businesses to people using their apps or online platforms.”
Uh.. yeah.
Kudos to the Globe/AP for getting it right:
> An Amazon Web Services outage is causing major disruptions around the world. The service provides remote computing services to many governments, universities and companies, including The Boston Globe.
> On DownDetector, a website that tracks online outages, users reported issues with Snapchat, Roblox, Fortnite online broker Robinhood, the McDonald’s app and many other services.
That's actually a fairly decent description for the non-tech crowd and I am going to adopt it, as my company is in the cloud native services space and I often have a problem explaining the technical and business model to my non-technical relatives and family - I get bogged down in trying to explain software defined hardware and similar concepts...
I asked ChatGPT for a succinct definition, and I thought it was pretty good:
“Amazon Web Services (AWS) is a cloud computing platform that provides on-demand access to computing power, storage, databases, and other IT resources over the internet, allowing businesses to scale and pay only for what they use.”
For us techies yes, but to the regular folks that is just as good as our usual technical gobbledy-gook - most people don´t differentiate between a database and a hard-drive.
You make a good point.
This part:
could be simplified to: access to computer serversMost people who know little about computers can still imagine a giant mainframe they saw in a movie with a bunch of blinking lights. Not so different, visually, from a modern data center.
Ah, yes, servers. I have seen those at Chili's and TGI Fridays!
It's the difference between connecting your home to the grid to get electricity vs having your own generator.
It's the same as having a computer room but in someone else's datacentre.
This one's great too, thanks.
And yet they still all activate their on call people (wait why do we have them if we are on the cloud?) to do .. nothing at all.
You can argue about Hetzner's uptime, but you can 't argue about Hetzner's pricing which is hands down the best there is. I'd rather go with Hetzner and cobble up together some failover than pay AWS extortion.
For the price of AWS you could run Hetzner, a second provider for resiliancy and still make a large saving.
Your margin is my opportunity indeed.
I switched to netcup for even cheaper private vps for personal noncritical hosting. I'd heard of netcup being less reliable but so far 4 months+ uptime and no problems. Europe region.
Hetzner has the better web interface and supposedly better uptime, but I've had no problems with either. Web interface not necessary at all either when using only ssh and paying directly.
I am on Hetzner with a primary + backup server and on Netcup (Vienna) with a secondary. For DNS I am using ClouDNS.
I think I am more distributed then most of the AWS folks and it still is way cheaper.
I used netcup for 3 years straight for some self hosting and never noticed an outage. I was even tracking it with smokeping so if the box disappeared I would see it but all of the down time was mine when I rebooted for updates. I don't know how they do it but I found them rock solid.
I've been running my self-hosting stuff on Netcup for 5+ years and I don't remember any outages. There probably were some, but they were not significant enough for me to remember.
netcup is fine unless you have to deal with their support, which is nonexistent. Never had any uptime issues in the two years I've been using them, but friends had issues. Somewhat hit or miss I suppose.
Exactly. Hetzner is the equivalent of the original Raspberry Pi. It might not have all fancy features but it delivers and for the price that essentially unblocks you and allows you to do things you wouldn't be able to do otherwise.
They've been working pretty hard on those extra features. Their load balancing across locations is pretty decent for example.
> I'd rather go with Hetzner and cobble up together some failover than pay AWS extortion.
Comments like this are so exaggerated that they risk moving the goodwill needle back to where it was before. Hetzner offers no service that is similar to DynamoDB, IAM or Lambda. If you are going to praise Hetzner as a valid alternative during a DynamoDB outage caused by DNS configuration, you would need to a) argue that Hetzner is a better option regarding DNS outages, b) Hetzner is a preferable option for those who use serverless offers.
I say this as a long-time Hetzner user. Herzner is indeed cheaper, but don't pretend that Herzner let's you click your way into a highly-availale nosql data store. You need non-trivial levels of you're ow work to develop, deploy, and maintain such a service.
> but don't pretend that Herzner let's you click your way into a highly-availale nosql data store.
The idea you can click your way to a highly available, production configured anything in AWS - especially involving Dynamo, IAM and Lambda - is something I've only heard from people who've done AWS quickstarts but never run anything at scale in AWS.
Of course nobody else offers AWS products, but people use AWS for their solutions to compute problems and it can be easy to forget virtually all other providers offer solutions to all the same problems.
>The idea you can click your way to a highly available, production configured anything in AWS - especially involving Dynamo, IAM and Lambda
With some services I'd agree with you, but DynamoDB and Lambda are easily two of their 'simplest' to configure and understand services, and two of the ones that scale the easiest. IAM roles can be decently complicated, but that's really up to the user. If it's just 'let the Lambda talk to the table' it's simple enough.
S3/SQS/Lambda/DynamoDB are the services that I'd consider the 'barebones' of the cloud. If you don't have all those, you're not a cloud provider, your just another server vendor.
> With some services I'd agree with you, but DynamoDB and Lambda are easily two of their 'simplest' to configure and understand services, and two of the ones that scale the easiest. IAM roles can be decently complicated, but that's really up to the user. If it's just 'let the Lambda talk to the table' it's simple enough.
We agree, but also, I feel like you're missing my point: "let the Lambda talk to the table" is what quickstarts produce. To make a lambda talk to a table at scale in production, you'll want to setup your alerting and monitoring to notify you when you're getting close to your service limits.
If you're not hitting service limits/quotas, you're not running even close to running at scale.
> Lambda are easily two of their 'simplest'
Not if you want to build something production ready. Even a simple thing like say static IP ingress for the Lambda is very complicated. The only AWS way you can do this is by using Global Accelerator -> Application Load Balancer -> VPC Endpoint -> API Gateway -> Lambda !!.
There are so many limits for everything that is very hard to run production workloads without painful time wasted in re-architecture around them and the support teams are close to useless to raise any limits.
Just in the last few months, I have hit limits on CloudFormation stack size, ALB rules, API gateway custom domains, Parameter Store size limits and on and on.
That is not even touching on the laughably basic tooling both SAM and CDK provides for local development if you want to work with Lambda.
Sure Firecracker is great, and the cold starts are not bad, and there isn't anybody even close on the cloud. Azure functions is unspeakably horrible, Cloud Run is just meh. Most Open Source stacks are either super complex like knative or just quite hard to get the same cold start performance.
We are stuck with AWS Lambda with nothing better yes, but oh so many times I have come close to just giving up and migrate to knative despite the complexity and performance hit.
>Not if you want to build something production ready.
>>Gives a specific edge case about static IPs and doing a serverless API backed by lambda.
The most naive solution you'd do on any non-cloud vendor, just have a proxy with a static ip that then routes traffic whereever it needed to go, would also work on AWS.
So if you think AWS's solution sucks why not just go with that? What you described doesn't even sound complicated when you think of the networking magic behind the scenes that will take place if you ever do scale to 1 million tps.
> Production ready
Don’t know what you think should mean but for me that means
1. Declarative IaaC in either in CF/terraform
2. Fully Automated discovery which can achieve RTO/RPO objectives
3. Be able to Blue/Green and % or other rollouts
Sure I can write ansible scripts, have custom EC2 images run HA proxy and multiple nginx load balancers in HA as you suggest, or host all that to EKS or a dozen other “easier” solutions
At the point why bother with Lambda ? What is the point of being cloud native and serverless if you have to literally put few VMs/pod in front and handle all traffic ? Might as well host the app runtime too .
> doesn’t even sound complicated .
Because you need a full time resource who is AWS architect and keeps up with release notes and documentation or training and constantly works to scale your application - because every single component has a dozen quotas /limits and you will hit them - it is complicated.
If you spend few million a year on AWS then spending 300k on an engineer to do just do AWS is perhaps feasible .
If you spend few hundred thousands on AWS as part of mix of workloads it is not easy or simple.
The engineering of AWS impressive as it maybe has nothing to the products being offered . There is a reason why Pulumi, SST or AWS SAM itself exist .
Sadly SAM is so limited I had to rewrite everything to CDK in couple of months . CDK is better but I am finding that I have to monkey patching limits on CDK with the SDK code now, while possible , the SDK code will not generate Cloudformation templates .
> Don’t know what you think should mean but for me that means
I think your inexperience is showing, if that's what you try to mean by "production-ready". You're making a storm in a teacup over features that you automatically onboard if you go through an intro tutorial, and "production-ready" typically means way more than a basic run-of-the-mill CICD pipeline.
As most of the times, the most vocal online criticism comes from those who have the least knowledge and experience over the topic they are railing against, and their complains mainly boil down to criticising their own inexperience and ignorance. There is plenty of things to criticize AWS for, such as cost and vendor lock-in, but being unable and unwilling to learn how to use basic services is not it.
> Even a simple thing like say static IP ingress for the Lambda is very complicated.
Explain exactly what scenario you believe requires you to provide a lambda behind a static IP.
In the meantime, I recommend you learn how to invoke a lambda, because static IPs is something that is extremely hard to justify.
Try telling that to customers who can only do outbound API calls to whitelisted IP addresses
When you are working with enterprise customers or integration partners it doesn’t even have to be regulated sectors like finance or healthcare, these are basic asks you cannot get away from .
people want to be able to know whitelist your egress and ingress IPs or pin certificates. It is not up to me to say on efficacy of these rules .
I don’t make the rules of the infosec world , I just follow them.
> Try telling that to customers who can only do outbound API calls to whitelisted IP addresses
Alright, if that's what you're going with then you can just follow a AWS tutorial:
https://docs.aws.amazon.com/lambda/latest/dg/configuration-v...
Provision an elastic IP to have your static IP address, set the NAT gateway to handle traffic, and plugin the lambda to the NAT gateway.
Do you think this qualifies as very complicated?
> The idea you can click your way to a highly available, production configured anything in AWS - especially involving Dynamo, IAM and Lambda - is something I've only heard from people who've done AWS quickstarts but never run anything at scale in AWS.
I'll bite. Explain exactly what work you think you need to do to get your pick of service running on Hetnzer to have equivalent fault-tolerance to, say, a DynamoDB Global Table created with the defaults.
Are you Netflix? Because is not theres a 99% probability you dont need any of those AWS services and just have a severe case of shiny object syndrome in your organisation.
Plenty of heavy traffic, high redundancy applications exist without the need for AWS (or any other cloud providers) overpriced "bespoke" systems.
To be honest I don't trust myself running a HA PostgreSQL setup with correct backups without spending an exorbitant effort to investigate everything (weeks/months) - do you ? I'm not even sure what effort that would take. I can't remember last time I worked with unmanaged DB in prod where I did not have a dedicated DBA/sysadmin. And I've been doing this for 15 years now. AFAIK Hetzner offers no managed database solution. I know they offer some load balancer so there's that at least.
At some point in the scaling journey bare metal might be the right choice, but I get the feeling a lot of people here trivialize it.
If it requires weeks/months to sort setting that up and backups then you need a news ops person as that's insane.
If you're doing it yourself, learn Ansible, you'll do it once and be set forever.
You do not need "managed" database services. A managed database is no different from apt install postgesql followed by a scheduled backup.
It genuinely is trivial, people seem to have this impression theres some sort of unique special sauce going on at AWS when there really isn't.
That doesn’t give you high availability; it doesn’t give you monitoring and alerting; it doesn’t give you hardware failure detection and replacement; it doesn’t solve access control or networking…
Managed databases are a lot more than apt install postgresql.
If you're doing it yourself, learn Ansible, you'll do it once and be set forever.
You do not need "managed" database services. A managed database is no different from apt install postgesql followed by a scheduled backup.
Genuinely no disrespect, but these statements really make it seem like you have limited experience building an HA scalable system. And no, you don't need to be Netflix or Amazon to build software at scale, or require high availability.
Backups with wall-g and recurring pg_dump are indeed trivial. (Modulo S3 outage taking so long that your WAL files fill up the disk and you corrupt the entire database.)
It's the HA part, especially with a high-volume DB that's challenging.
From your comment, you don't even have the faintest idea of what is the problem domain. No wonder you think you know better.
But that's the thing - if I have an ops guy who can cover this then sure it makes sense - but who does at an early stage ? As a semi competent dev I can setup a terraform infra and be relatively safe with RDS. I could maybe figure out how to do it on my own in some time - but I don't know what I don't know - and I don't want to do a weekend production DB outage debugging because I messed up the replication setup or something. Maybe I'm getting old but I just don't have the energy to deal with that :)
If youre not Netflix then just sudo yum install postgresql and pg_dump every day, upload to S3. Has worked for me for 20 years at various companies, side projects, startups …
> If youre not Netflix then just sudo yum install postgresql and pg_dump every day, upload to S3.
database services such as DynamoDB support a few backup strategies out of the box, including continuous backups. You just need to flips switch and never bother about it again.
> Has worked for me for 20 years at various companies, side projects, startups …
That's perfectly fine. There are still developers who don't even use version control at all. Some old habits die hard, even when the whole world moved on.
What happens when the server goes down ? How do you update it ?
> Are you Netflix? Because is not theres a 99% probability you dont need any of those AWS services and just have a severe case of shiny object syndrome in your organisation.
I think you don't even understand the issue you are commenting on. It's irrelevant if you are Netflix or some guy playing with a tutorial. One of the key traits of serverless offerings is how it eliminates the need to manage and maintain a service or even worry about you have enough computational resources. You click a button to provision everything, you configure your clients to consume that service, and you are done.
If you stop to think about the amount of work you need to invest to even arrive at a point where you can actually point a client at a service, you'll be looking at what the value of serverless offerings.
Ironically, it's the likes of Netflix who can put together a case against using serverless offerings. They can afford to have their own teams managing their own platform services with the service levels they are willing to afford. For everyone else, unless you are in the business of managing and tuning databases or you are heavily motivated to save pocket change on a cloud provider bill, the decision process is neither that clear not favours running your own services.
> Plenty of heavy traffic, high redundancy applications exist without the need for AWS (or any other cloud providers) overpriced "bespoke" systems.
And almost all of them need a database, a load balancer, maybe some sort of cache. AWS has got you covered.
Maybe some of them need some async periodic reporting tasks. Or to store massive files or datasets and do analysis on them. Or transcode video. Or transform images. Or run another type of database for a third party piece of software. Or run a queue for something. Or capture logs or metrics.
And on and on and and on. AWS has got you covered.
This is Excel all over again. "Excel is too complex and has too many features, nobody needs more than 20% of Excel. It's just that everyone needs a different 20%".
You're right AWS does have you covered. But that doesn't mean thats the only way of doing it. Load balancing is insanely easy to do yourself, databases even easier. Caching, ditto.
I think a few people who claim to be in devops could do with learning the basics about how things like Ansible can help them as there's a fair few people who seem to be under the impression AWS is the only, and the best option, which unless you're FAANG really is rarely the case.
> You're right AWS does have you covered. But that doesn't mean thats the only way of doing it. Load balancing is insanely easy to do yourself, databases even easier. Caching, ditto
I think you don't understand the scenario you are commenting on. I'll explain why.
It's irrelevant if you believe that you are able to imagine another way to do something, and that you believe it's "insanely easy" to do those yourself. What matters is that others can do that assessment themselves, and what you are failing to understand is that when they do so, their conclusion is that the easiest way by far to deploy and maintain those services is AWS.
And it isn't even close.
You mention load balancing and caching. The likes of AWS allows you to setup a global deployment of those services with a couple of clicks. In AWS it's a basic configuration change. And if you don't want it, you just tear down everything with a couple of clicks as well.
Why do you think a third of all the internet runs on AWS? Do you think every single cloud engineer in the world is unable to exercise any form of critical thinking? Do you think there's a conspiracy out there to force AWS to rule the world?
You can spin up a redundant database setup with backups and monitoring and automatic fail over in 10 mins (the time it takes in AWS)? And maintain it? If you've done this a few times before and have it highly automated, sure. But let's not pretend it's "even easier" than "insanely easy".
Load balancing is trivial unless you get into global multicast LBs, but AWS have you covered there too.
You could never run a site like hacker news on a single box somewhere with a backup box a couple of states away.
(/s, obviously)
And have the two fail at the same time because similarly old hardware with similarly old and used disks fails at roughly the same time :)
> click your way into a HA NoSQL data store
Maybe not click, but Scylla’s install script [0] doesn’t seem overly complicated.
0: https://docs.scylladb.com/manual/stable/getting-started/inst...
If you need the absolutely stupid scale DynamoDB enables what is the difference compared to running for example FoundationDb on your own using Hetzner?
You will in both cases need specialized people.
> Hetzner offers no service that is similar to DynamoDB, IAM or Lambda.
The key thing you should ask yourself: do you need DynamoDB or Lambda? Like "need need" or "my resume needs Lambda".
> The key thing you should ask yourself: do you need DynamoDB or Lambda? Like "need need" or "my resume needs Lambda".
If you read the message you're replying to, you will notice that I singled out IAM, Lambda, and DynamoDB because those services were affected by the outage.
If Hetzner is pushed as a better or even relevant alternative, you need to be able to explain exactly what you are hoping to say to Lambda/IAM/DynamoDB users to convince them that they would do better if they used Hetzner instead.
Making up conspiracy theories over CVs doesn't cut it. Either you know anything about the topic and you actually are able to support this idea, or you're an eternal September admission whose only contribution is noise and memes.
What is it?
Well, Lambda scales down to 0 so I don't have to pay for the expensive EC2 instan... oh, wait!
TBH, in my last 3 years with Hetzner, i never saw a downtime to my servers other than myself doing some routin maitenance for os updates. Location Falkenstein.
You really need your backup procedures and failover procedures though, a friend bought a used server and the disk died fairly quickly leaving him sour.
I do have HA setup and Backup for DB that run periodically to an S3.
THE disk?
It's a server! What in the world is your friend doing running a single disk???
Ate a bare minimum they should have been running a mirror.
Younger guy with ambitions but little experience, I think my point was that used servers with Hetzner are still used so if someone has been running disk heavy jobs you might want to request new disks or multiple ones and not just pick the cheapest options at the auction.
(Interesting that an anectode like above got downvoted)
> (Interesting that an anectode like above got downvoted)
experts almost unilaterally judge newbies harshly, as if the newbies should already know all of the mistakes to avoid. things like this are how you learn what mistakes to avoid.
"hindsight is 20/20" means nothing to a lot of people, unfortunately.
And I have seen them delete my entire environment including my backups due to them not following their own procedures.
Sure, if you configure offsite backups you can guard against this stuff, but with anything in life, you get what you pay for.
What is the Hetzner equivalent for those in Windows Server land? I looked around for some VPS/DS providers that specialize in Windows, and they all seem somewhat shady with websites that look like early 2000s e-commerce.
I work at a small / medium company with about ~20 dedicated servers and ~30 cloud servers at Hetzner. Outages have happened, but we were lucky that the few times it did happen, it was never a problem / actual downtime.
One thing to note is that there are some scheduled maintenances were we needed to react.
We've been running our services on Hetzner for 10 years, never experienced any significant outages.
That might be datacenter dependant of course, since our root servers and cloud services are all hosted in Europe, but I really never understood why Hetzner is said to be less reliable
Haha, yeah that's a nugget
> 99.99% uptime infra significantly cheaper than the cloud.
I guess that's another person that has never actually worked in the domain (SRE/admin) but still wants to talk with confidence on the topic.
Why do I say that? Because 99.99% is frickin easy
That's almost one full hour of complete downtime per year.
It only gets hard in the 99.9999+ range ... And you rarely meet that range with cloud providers either as requests still fail for some reason, like random 503 when a container is decommissioned or similar
>Just a couple of days ago in this HN thread [0] there were quite some users claiming Hetzner is not an options as their uptime isn't as good as AWS, hence the higher AWS pricing is worth the investment. Oh, the irony.
That's not necessarily ironic. Seems like you are suffering from recency bias.
My recommendation is to use AWS, but not the US-EAST-1 region. That way you get benefits of AWS without the instability.
AWS has internal dependencies on US-EAST-1.
Admittedly they're getting fewer and fewer, but they exist.
The same is also true in GCP, so as much as I prefer GCP from a technical standpoint: the truth is, if you can't see it, it doesn't mean it goes away.
The only hard dependency I am still aware of is write operations to the R53 control plane. Failover records and DNS queries would not be impacted. So business workflows would run as if nothing happened.
(There may still be some core IAM dependencies in USE1, but I haven’t heard of any.)
We're currently witnessing the fact that what you're claiming is not as true as you imply.
We don't know that (yet) - it's possible that this is simply a demonstration of how many companies have a hard dependency on us-east-1 for whatever reason (which I can certainly believe).
We'll know when (if) some honest RCAs come out that pinpoint the issue.
I created a DNS record in route53 this morning with no issues
the Billing part of the console in eu-west-2 was down though, presumably because that uses us-east-1 dynamodb, but route53 doesn't.
I had a problem with an ACME cert terraform module. It was doing the R53 to add the DNS TXT record for the ACME challenge and then querying the change status from R53.
R53 seems to use Dynamo to keep track of the syncing of the DNS across the name servers, because while the record was there and resolving, the change set was stuck in PENDING.
After DynamoDB came back up, R53's API started working.
We have nothing deployed in us east 1, yet all of our CI was failing due to IAM errors this morning.
I’m more curious to understand how we ended up creating a single point of failure across the whole internet.
[dead]
I don't have an opinion either way, but for now, this is just anecdotal evidence.
Looks fine for pointing an irony
In some ways yes. But in some ways this is like saying it's more likely to rain on your wedding day.
I'm not affiliated and won't be compensated in any way for saying this: Hetzner are the best business partners ever. Their service is rock solid, their pricing is fair, their support is kind and helpful.
Going forward I expect American companies to follow this European vibe, it's like the opposite of enshitification.
> the opposite of enshitification.
Why do you expect American companies to follow it then? >:)
How do you expect American....
I don't know how often Hetzner has similar outages, but outages at the rack and server level, including network outages and device failure happen for individual customers. If you've never experienced this, it is probably just survivor's bias.
Aws/cloud has similar outages too, but more redundancy and automatic failover/migrations that are transparent to customers happen. You don't have to worry about DDOS and many other admin burdens either.
YMMV, I'm just saying sometimes Aws makes sense, other times Hetzner does.
It still can be true that the uptime is better, or am I overlooking something?
Nah you're definitely correct.
Hetzner users are like the Linux users for cloud.
Love, after your laptop's wifi antenna, it's Linux users all the way down.
Btw I use Hertzner
Been using OVH here with no complaints.
Stop making things up. As someone who commented on the thread in favour of AWS, there is almost no mention of better uptime in any comment I could find.
I could find one or two downvoted or heavily critisized comments, but I can find more people mentioning the opposite.
I got a downvote already for pointing this out :’)
Unfortunately, HN is full of company people, you can't talk anything against Google, Meta, Amazon, Microsoft without being downvoted to death.
It's less about company loyalty and more about protecting their investment into all the buzzwords from their resumes.
As long as the illusion that AWS/clouds are the only way to do things continues, their investment will keep being valuable and they will keep getting paid for (over?)engineering solutions based on such technologies.
The second that illusion breaks down, they become no better than any typical Linux sysadmin, or teenager ricing their Archlinux setup in their homelab.
Can't fully agree. People genuinely detest Microsoft on HN and all over the globe. My Microsoft-related rants are always upvoted to the skies.
> People genuinely detest Microsoft on HN and all over the globe
I would say tech workers rather than "people" as they are the ones needing to interact with it the most
I'm a tech worker, and have been paid by a multi-billion dollar company to be a tech worker since 2003.
Aside from Teams and Outlook Web, I really don't interact with Microsoft at all, haven't done since the days of XP. I'm sure there is integration on our corporate backends with things like active directory, but personally I don't have to deal with that.
Teams is fine for person-person instant messaging and video calls. I find it terrible for most other functions, but fortunately I don't have to use it for anything other than instant messaging and video calls. The linux version of teams still works.
I still hold out a healthy suspicion of them from their behaviour when I started in the industry. I find it amusing the Microsoft fanboys of the 2000s with their "only needs to work in IE6" and "Silverlight is the future" are still having to maintain obsolete machines to access their obsolete systems.
Meanwhile the stuff I wrote to be platform-agnostic 20 years ago is still in daily use, still delivering business benefit, with the only update being a change from "<object" to "<video" on one internal system when flash retired.
AWS and Cloudflare are HN darlings. Go so far as to even suggest a random personal blog doesn't need Cloudflare and get downvoted with inane comments as "but what about DDOS protection?!"
The truth is one under the age of 35 is able to configure a webserver any more, apparently. Especially now that static site generators are in vogue and you don't even need to worry about php-fpm.
Lol, realistically you only need to care about external DDoS protection, if you are at risk of AWS bankrupting your ass.
Isn't it just ads?
Finally IT managers will start understanding that cloud is no difference than Hetzner.
When things go wrong, you can point at a news article and say its not just us that have been affected.
I tried that but Slack is broken and the message hasn't got through yet...
Well, we have a naming issue (Hetzner also has Hetzner Cloud, it looks people still equal cloud with the three biggest public cloud providers).
In any case, in order for this to happen, someone would have to collect reliable data (not all big cloud providers like to publish precise data, usually they downlplay the outages and use weasel words like "some customers... in some regions... might have experienced" just not to admit they had an outage) and present stats comparing the availability of Heztner Cloud vs the big three.
Amazon itself apperas to be out for some products. I get a "Sorry, We couldn't find that page" when clicking on products
We're seeing issues with RDS proxy. Wouldn't be surprised if a DNS issue was the cause, but who knows, will wait for the postmortem.
We changed our db connection settings to go direct to the db and that's working. Try taking the proxy out the loop
We're also seeing issues with Lambda and RDS proxy endpoint.
May be because of this that trying to pay with PayPal on Lenovo's website has failed thrice for me today? Just asking... Knowing how everything is connected nowadays it wouldn't surprise me at all.
https://news.ycombinator.com/item?id=45640754
Can't login to Jira/Confluence either.
Seems to work fine for me. I'm in Europe so maybe connecting to some deployment over here.
You are already logged in. If you try to access your account settings, for example, you will be disappointed...
Severity - Degraded...
https://health.aws.amazon.com/health/status
https://downdetector.com/
Slack is down. Is that related? Probably is.
02:34 Pacific: Things seem to be recovering.
Couple of years ago us-east was considered the least stable region here on HN due to its age. Is that still a thing?
When I was there at aws (left about a decade ago), us-east-1 was considered least stable, because it was the biggest.
I.e. some bottle-necks in new code appearing only _after_ you've deployed there, which is of course too late.
It didn't help that some services had their deploy trains (pipelines in amazon lingo) of ~3 weeks, with us-east-1 being the last one.
I bet the situation hasn't changed much since.
>It didn't help that some services had their deploy trains (pipelines in amazon lingo) of ~3 weeks, with us-east-1 being the last one.
oof, so you're saying this outage could be cause by a change merged 3 weeks ago?
Yup, never add anything new to us-east-1. There is never a good reason to willingly use that region.
Couple of weeks or months ago the front page was saying how us-east-1 instability was a thing of the past due to <whatever chang of architecture>.
Yes
My site was down for a long time after they claimed it was fixed. Eventually I realized the problem lay with Network Load Balancers so I bypassed them for now and got everything back up and running.
Yes, we're seeing issues with Dynamo, and potentially other AWS services.
Appears to have happened within the last 10-15 minutes.
Yep, first alert for us fired @ 2025-10-20T06:55:16Z
I can not login to my AWS account. And, the "my account" on regular amazon website is blank on Firefox, but opens on Chrome.
Edit: I can login into one of the AWS accounts (I have a few different ones for different companies), but my personal which has a ".edu" email is not logging in.
Ironically, the HTTP request to this article timed out twice before a successful response.
LOL making one db service a central point of failure, charge gold for small compute instances. Rage about needing Multi AZ, make the costs come onto the developer/organization. But, now fail on a region level, so are we going to now have multi-country setup for simple small applications?
According to their status page the fault was in DNS lookup of the Dynamo services.
Everything depends on DNS....
Dynamo had a outage last year if I recall correctly.
Lol ... of course it's DNS fault again.
We maybe distributed, but we die united...
Divided we stand,
United we fall.
AWS Communist Cloud
>circa 2005: Score:5, Funny on Slashdot
>circa 2025: grayed out on Hacker News
This is not Slashdot which is quite a good thing. Not that poignant humor is not always unwelcome, IMHO.
upvote :this
It's not DNS
There's no way it's DNS
It was DNS
I thought it was a pretty well-known issue that the rest of AWS depends on us-east-1 working. Basically any other AWS region can get hit by a meteor without bringing down everything else – except us-east-1.
But it seems like only us-east-1 is down today, is that right?
Some global services have control plane located only in `us-east-1`, without which they become read-only at best, or even fail outright.
https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
Just don't buy it if you don't want it. No one is forced to buy this stuff.
> No one is forced to buy this stuff.
Actually, many companies are de facto forced to do that, for various reasons.
How so?
Certification, for one. Governments will mandate 'x, y and/or z' and only the big providers are able to deliver.
That is not the same as mandating AWS, it just means certain levels of redundancy. There are no requirements to be in the cloud.
No, that's not what it means.
It means that in order to be certified you have to use providers that in turn are certified or you will have to prove that you have all of your ducks in a row and that goes way beyond certain levels of redundancy, to the point that most companies just give up and use a cloud solution because they have enough headaches just getting their internal processes aligned with various certification requirements.
Medical, banking, insurance to name just a couple are heavily regulated and to suggest that it 'just means certain levels of redundancy' is a very uninformed take.
It is definitely not true that only big companies can do this. It is true that every regulation added adds to the power of big companies, which explains some regulation, but it is definitely possible to do a lot of things yourself and evidence that you've done it.
What's more likely for medical at least is that if you make your own app, that your customers will want to install it into their AWS/Azure instance, and so you have to support them.
Security/compliance theater for one
That's not a company being forced to, though?
It is if they want to win contracts
I don't think that's true. I think a company can choose to outsource that stuff to a cloud provider or not, but they can still choose.
Anyone needing multi-cloud WITH EASE, please get in touch. https://controlplane.com
I am the CEO of the company and started it because I wanted to give engineering teams an unbreakable cloud. You can mix-n-match services of ANY cloud provider, and workloads failover seamlessly across clouds/on-prem environments.
Feel free to get in touch!
"Oct 20 2:01 AM PDT We have identified a potential root cause for error rates for the DynamoDB APIs in the US-EAST-1 Region. Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1..."
It's always DNS...
The Ring (Doorbell) App isn't working, nor is any the MBTA (Transit) Status pages/apps.
My apartment uses “SmartRent” for access controls and temps in our unit. It’s down…
So there's no way to get back in if you step out for food?
Hey wait wasn't the internet supposed to route around...?
One of the radio stations I listen to is just dead air tonight. I assume this is the cause.
A physical on-air broadcast station, not a web stream? That likely violates a license; they're required to perform station identification on a regular basis.
Of course if they had on-site staff it wouldn't be an issue (worst case, just walk down to the transmitter hut and use the transmitter's aux input, which is there specifically for backup operations like this), but consolidation and enshittification of broadcast media mean there's probably nobody physically present.
Yeah, real over the air fm radio. This particular station is a Jack one owned by iHeart; they don't have DJs. Probably no techs or staff in the office overnight.
This is widespread. ECR, EC2, Secrets Manager, Dynamo, IAM are what I've personally seen down.
Slack was down, so I thought I will send message to my coworkers on Signal.
Signal was also down.
E-mail still exists...
Chime has completely been down for almost 12 hours.
Impacting all banking series with red status error. Oddly enough, only their direct deposits are functioning without issues.
https://status.chime.com/
Airtable is down as-well.
A lot of businesses have all their workflows depending on their data on airtable.
AWS's own management console sign-in isn't even working. This is a huge one. :(
Btw. we had a forced EKS restart last week on thursday due to Kubernetes updates. And something was done with DNS there. We had problems with ndots. Caused some trouble here. Would not be surprised, if it is related, heh.
My ISP's DNS servers were inaccessible this morning. Cloudflare and Google's DNS servers have all been working fine, though: 1.1.1.1, 1.0.0.1, and 8.8.8.8
So, uh, over the weekend I decided to use the fact that my company needs a status checker/page to try out Elixir + Phoenix LiveView, and just now I found out my region is down while tinkering with it and watching Final Destination. That’s a little too on the nose for my comfort.
Well at least you don't have to figure out how to test your setup locally.
Half the internet goes down because part of AWS goes down... what happened to companies having redundant systems and not having a single point of failure?
Ironically for most companies its cheaper to just say if AWS goes down half of the internet goes down so people will understand
I'm thinking about that one guy who clicked on "OK" or hit return.
Somebody, somewhere tried to rollback something and it failed
Maybe unrelated, but yesterday I went to pick up my package from an Amazon Locker in Germany, and the display said "Service unavailable". I'll wait until later today before I go and try again.
I wonder why a package locker needs connectivity to give you a package. Since your package can't be withdrawn again from a different location, partitioning shouldn't be an issue.
Generally speaking, it's easier to have computation (logic, state, etc.) centralized. If the designers didn't prioritize scenarios where decentralization helped, then centralization would've been the better option.
I was just about to post that it didn't affect us (heavy AWS users, in eu-west-1). Buut, I stopped myself because that was just massively tempting fate :)
Happened to be updating a bunch of NPM dependencies and then saw `npm i` freeze and I'm like... ugh what did I do. Then npm login wasn't working and started searching here for an outage, and wala.
voila
I just saw services that were up since 545AM ET go down around 12:30PM ET. Seems AWS has broken Lambda again in their efforts to fix things.
Yeah, noticed from Zoom: https://www.zoomstatus.com/incidents/yy70hmbp61r9
Atlassian cloud is also having issues. Closing in on the 3 hour mark.
stupid question: is buying a server rack and running it at home subject to more downtimes in a year than this? has anyone done an actual SLA analysis?
That depends on a lot of factors, but for me personally, yes it is. Much worse.
Assuming we’re talking about hosting things for Internet users. My fiber internet connection has gone down multiple times, though relatively quickly restored. My power has gone out several times in the last year, with one storm having it out for nearly 24 hrs. I was sleep when it went out and I didn’t start the generator until it was out for 3-4 hours already, far longer than my UPSes could hold up. I’ve had to do maintenance and updates both physical and software.
All of those things contribute to a downtime significantly higher than I see with my stuff running on Linode, Fly.io or AWS.
I run Proxmox and K3s at home and it makes things far more reliable, but it’s also extra overhead for me to maintain.
Most or all of those things could be mitigated at home, but at what cost?
Maybe if you use a UPS and Starlink then ...
I've been dabbling in this since 1998, It's almost always ISP and power outages that get you. There are ways to mitigate those (primary/secondary ISPs, UPSes, and generators) but typically unless you're in a business district area of a city, you'll just always be subject to problems
So for me, extremely anecdotally, I host a few fairly low-importance things on a home server (which is just an old desktop computer left sitting under a desk with Ubuntu slapped on it): A VPN (WireGuard), a few Discord bots, a Twitch bot + some auth stuff, and a few other services that I personally use.
These are the issues I've ran into that have caused downtime in the last few years:
- 1x power outage: if I had set up restart on power, probably would have been down for 30-60 minutes, ended up being a few hours (as I had to manually press the power button lol). Probably the longest non-self-inflicted issue.
- Twitch bot library issues: Just typical library bugs. Unrelated to self-hosting.
- IP changes: My IP actually barely ever changes, but I should set up DDNS. Fixable with self-hosting (but requires some amount of effort).
- Running out of disk space: Would be nice to be able to just increase it.
- Prooooooobably an internet outage or two, now that I think about it? Not enough that it's been a serious concern, though, as I can't think of a time that's actually happened. (Or I have a bad memory!)
I think that's actually about it. I rely fairly heavily on my VPN+personal cloud as all my notes, todos, etc are synced through it (Joplin + Nextcloud), so I do notice and pay a lot of attention to any downtime, but this is pretty much all that's ever happened. It's remarkable how stable software/hardware can be. I'm sure I'll eventually have some hardware failure (actually, I upgraded my CPU 1-2 years ago because it turns out the Ryzen 1700 I was using before has some kind of extremely-infrequent issue with Linux that was causing crashes a couple times a month), but it's really nice.
To be clear, though, for an actual business project, I don't think this would be a good idea, mainly due to concerns around residential vs commercial IPs, arbitrary IPs connecting to your local network, etc that I don't fully pay attention to.
so, funny story, my fiber got cut (backhoe) and it took then 12 hours to restore it.
If you had /two/ houses, in separate towns, you'd have better luck. Or, if you had cell as a backup.
Or: if you don't care about it being down for 12 hours.
Unanswerable question. Better to perform a failure mode analysis. That rack in your basement would need redundant power (two power companies or one power company and a diesel generator which typically won't be legal to have at your home), then redundant internet service (actually redundant - not the cable company vs the phone company that underneath use the same backhaul fiber).
It's not DNS
There's no way it's DNS
It was DNS
That or a Windows update.
Or unattended-upgrades
It seems that all the sites that ask for distributed systems in their interview and has their website down wouldn't even pass their own interview.
This is why distributed systems is an extremely important discipline.
Maybe actually making the interviews less of a hazing ritual would help.
Hell, maybe making today's tech workplace more about getting work done instead of the series of ritualistic performances that the average tech workday has degenerated to might help too.
Ergo, your conclusion doesn't follow from your initial statements, because interviews and workplaces are both far more broken than most people, even people in the tech industry, would think.
Well it looks like if companies and startups did their job in hiring the proper distributed systems skills more rather than hazing for the wrong skills we wouldn't be in this outage mess.
Many companies on Vercel don't think to have a strategy to be resilient to these outages.
I rarely see Google, Ably and others serious about distributed systems being down.
There was a huuuge GCP outage just a few months back: https://news.ycombinator.com/item?id=44260810
> Many companies on Vercel don't think to have a strategy to be resilient to these outages.
But that's the job of Vercel and it looks like they did a pretty good job. They rerouted away from the broken region.
distributed systems != continuous uptime
100% wrong.
Serious engineering teams that care about distributed systems and multi region deployments don't think like this.
glad all my services are either Hetzner servers or EU region of AWS!
Can't even get STS tokens. RDS Proxy is down, SQS, Managed Kafka.
It's scary to think about how much power and perhaps influence the AWS platform has. (albeit it shouldn't be surprising)
Does anyone know if having Global Accelerator set up would help right now? It's in the list of affected services, I wonder if it's useful in scenarios like this one.
Presumably the root cause of the major Vercel outage too: https://www.vercel-status.com/
No wonder, when I opened Vercel it showed a 502 error.
AWS pros know to never use us-east-1. Just don't do it. It is easily the least reliable region
I seem to recall other issues around this time in previous years. I wonder if this is some change getting shoe-horned in ahead of some reinvent release deadline...
Thing is us-east-1 the primary region for many services of AWS. DynamoDB is a very central offering used by many service. And the issue that has happend is very common[^1].
I think no matter how hard you try to avoid it, in the end there's always a massive dependency chain for modern digital infrastructure[^2].
[1]: https://itsfoss.community/uploads/default/optimized/2X/a/ad3...
[2]: https://xkcd.com/2347/
Can confirm. I was trying to send the newsletter (with SES) and it didn't work. I was thinking my local boto3 was old, but I figured I should check HN just in case.
Completely detached from reality, AMZN has been up all day and closed up 1.6%. Wild.
Until this impacts their bottom line, how is that unexpected?
Will we see mass exits from their service? Who knows. My money says no though.
How many companies can just ride the "but it's not our fault" to buy time with customers until it's fixed?
Darn, on Heroku even the "maintenance mode" (redirects all routes to a static url) won't kick in.
It's not DNS
There's no way it's DNS
It was DNS
It's "DNS" because the problem is that at the very top of the abstraction hierarchy in any system is a bit of manual configuration.
As it happens, that naturally maps to the bootstrapping process on hardware needing to know how to find the external services it needs, which is what "DNS" is for. So "DNS" ends up being the top level of manual configuration.
But it's the inevitability of the manual process that's the issue here, not the technology. We're at a spot now where the rest of the system reliability is so good that the only things that bring it down are the spots where human beings make mistakes on the tiny handful of places where human operation is (inevitably!) required.
> hardware needing to know how to find the external services it needs, which is what "DNS" is for. So "DNS" ends up being the top level of manual configuration.
Unless DNS configuration propagates over DHCP?
DHCP can only tell you who the local DNS server is. That's not what's failed, nor what needs human configuration.
At the top of the stack someone needs to say "This is the cluster that controls boot storage", "This is the IP to ask for auth tokens", etc... You can automatically configure almost everything but there still has to be some way to get started.
I wonder how their nines are going. Guess they'll have to stay pretty stable for the next 100 years.
Paddle (payment provider) is down as well: https://paddlestatus.com/
Slack, Jira and Zoom are all sluggish for me in the UK
I wonder if that's not due to dependencies on AWS but all-hands-on-deck causing far more traffic than usual
Coinbase down as well: https://status.coinbase.com/
Best option for a whale to manipulate the price again.
My Alexa is hit or miss at responding to queries right now at 5:30 AM EST. Was wondering why it wasn't answering when I woke up.
Why would us-east-1 cause many UK banks and even UK gov web sites down too!? Shouldn't they operate in the UK region due to GDPR?
2 things:
1) GDPR is never enforced other than token fines based on technicalities. The vast majority of the cookie banners you see around are not compliant, so it the regulation was actually enforced they'd be the first to go... and it would be much easier to go after those (they are visible) rather than audit every company's internal codebases to check if they're sending data to a US-based provider.
2) you could technically build a service that relies on a US-based provider while not sending them any personal data or data that can be correlated with personal data.
>GDPR is never enforced Yes it is.
>other than token fines based on technicalities. Result!
Read my post again. You can go to any website and see evidence of their non-compliance (you don't have to look very hard - they generally tend to push these in your face in the most obnoxious manner possible).
You can't consider a regulation being enforced if everyone gets away with publishing evidence of their non-compliance on their website in a very obnoxious manner.
Integration with USA for your safety :)
Statuspage.io seems to load (but is slow) but what is the point if you can't post an incident because Atlassian ID service is down.
In this moments I think devs should invest in vendor independence if they can. While I'm not to that stage yet (cloudlfare dependence) using open technologies like docker (or Kubernetes), Traefik instead of managed services can help in this disaster situations by switching to a different provider in a faster way than having to rebuild from zero. as a disclosure I'm not still to that point on my infrastructure But I'm trying to slowly define one for my self
Fyi: traffic also had problems bringing new devices online
I'm speacking of the self hosted version you can install on your own vps not the managwment version. I don't like using managed services if possible.
I missed a parcel delivery because a computer server in Virginia, USA went down, and now the doorbell on my house in England doesn't work. What. The. Fork.
How the hell did Ring/Amazon not include a radio-frequency transmitter for the doorbell and chime? This is absurd.
To top it off, I'm trying to do my quarterly VAT return, and Xero is still completely borked, nearly 20 hours after the initial outage.
Why after all these years is us-east-1 such a SPOF?
During the last us-east-1 apocalypse 14 years ago, I started awsdowntime.com - don't make me regsiter it again and revive the page.
I remember the one where some contractor accidentally cut the trunk between AZes.
In us-east-1? That doesn't sound that impactful, have always heard that us-east-1's network is a ring.
Back before AWS provided transparency into AZ assignments, it was pretty common to use latency measurements to try and infer relative locality and mappings of AZs available to an account.
I cannot create a support ticket with AWS as well.
This will always be a risk when sharecropping.
r/aws not found
There aren't any communities on Reddit with that name. Double-check the community name or start a new community.
Strangely some of our services are scaling up on east-1, and there is downtick on downdetector.com so issue might be resolving.
Do we know what caused the outage yet?
I don't get how you can be a trillion dollar company and still suck this much.
Asana down Postman workspaces don't load Slack affected And the worst: heroku scheduler just refused to trigger our jobs
Slack and Zoom working intermittently for me
Only us east 1 gets new services immediately others might do but not a guarantee. Which regions are a good alternative
Impossible to connect to JIRA here (France).
Login issues with Jira Cloud here in Germany too. Just a week after going from Jira on-prem to cloud
i am amused at how us-east-1 is basically in the same location as where aol kept its datacenters back in the day.
My website on the cupboard laptop is fine.
Perplexity also have outage.
https://status.perplexity.ai
Always a lovely Monday when you wake just in time to see everything going down
Seems to be upsetting Slack a fair bit, messages taking an age to send and OIDC login doesn't want to play.
Can't update my selfhosted HomeAssistant because HAOS depends on dockerhub which seems to be still down.
Wait a second, Snapchat impacted AGAIN? It was impacted during the last GCP outage.
His influence is so great that it caused half of the internet to stop working properly.
npm and pnpm are badly affected as well. Many packages are returning 502 when fetched. Such a bad time...
Yup, was releasing something to prod and can't even build a react app. I wonder if there is some sort of archive that isn't affected?
AWS CodeArtifact can act as a proxy and fetch new packages from npm when needed. A bit late for that though but sharing if you want to future proof against the yearly us-east-1 outage
Oh damn that ruins all our builds for regions I thought would be unaffected
That strange feeling of the world getting cleaner for a while without all these dependant services.
I expect gcp and azure to gain some customers after this
GCP is underrated
Damn. This is why Duolingo isn't working properly right now.
One of my co-workers was woken up by his Eight Sleep going haywire. He couldn't turn it off because the app wouldn't work (presumably running on AWS).
Twilio seems to be affected as well
Their entire status page is red!
These things happen when profits are the measure everything. Change your provider, but if their number doesn't go up, they wont be reliable.
So your complaints matter nothing because "number go up".
I remember the good old days of everyone starting a hosting company. We never should have left.
It certainly doesn't take profit for bad planning to happen...
And everybody starting a hosting company is definitely a profit driven activity.
> And everybody starting a hosting company is definitely a profit driven activity.
Absolutely, nobody was doing it out of charity, but there is more diversity in the market and thus more innovation and then the market decides. Right now we have 3 major providers, and that makes up the lion's share. That's consolidation of a service. I believe that's not good for the market or the internet as a whole.
Did someone vibe code a DNS change
Lots of outage happening in Norway, too. So I'm guessing it is a global thing.
They haven't listed SES there yet in the affected services on their status page
Reddit itself breaking down and errors appear. Does reddit itself depends on this?
On a bright note, Alexa has stopped pushing me merchandise.
As of 4:26am Central Time in the USA, it's back up for one of my services.
It’s a good day to be a DR software company or consultant
Thanks god we built all our infra on top of EKS, so everything works smoothly =)
Amazon.ca is degraded, some product pages load but can't see prices. Amusing.
Serverless is down because servers are down. What an irony.
Serverless is just someone elses server, or something
Just tried to get into Seller Central, returned a 504.
10:30 on a Monday morning and already slacking off. Life is good. Time to touch grass, everybody!
Finally an upside to running on Oracle Cloud!
Alexa devices are also down.
And ring! Don’t know why the chime needs an AWS connection. That was surprising.
Lots of outage in Norway, started approximately 1 hour ago for me.
idiocracy_window_view.jpg
Now, I may well be naive - but isn't the point of these systems that you fail over gracefully to another data centre and no-one notices?
It should be! When I was a complete newbie at AWS my first question was why do you have to pick a region, I thought the whole point was you didn't have to worry about that stuff
As far as I know, region selection is about regulation and privacy and guarantees on that.
It's also about latency and nearness to users. Also some regions don't have all features so feature set also matters.
One might hope that this, too, would be handled by the service. Send the traffic to the closest region, and then fallback to other regions as necessary. Basically, send the traffic to the closest region that can successfully serve it.
But yeah, that's pretty hard and there are other reasons customers might want to explicitly choose the region.
The region labels found within the metadata are very very powerful.
They make lawyers happy and they stop intelligence services to access the associated resources.
For example, no one would even consider accessing data from a European region without the right paperwork.
Because if they were caught they'd have to pay _thousands_ of dollars in fines and get sternly talked to be high ranking officials.
Assuming that the service actually bothered to have multiple regions as fallbacks configured.
> another data centre
Yes, within the same region. Doing stuff cross-region takes a little bit more effort and cost, so many skip it.
I still don't know why anyone would use AWS hosting.
Can confirm, also getting hit with this.
https://www.youtube.com/shorts/liL2VXYNyus
wow I think most of Mulesoft is down, that's pretty significant in my little corner of the tech world.
What is HN hosted on?
2 physical servers, one active and one standby running BSD at M5 internet hosting.
Coinbase down as well
Signal is down for me
Yes. https://status.signal.org/
Edit: Up and running again.Is this the outage that took Medium down ?
us-east-1 down again. We all know we should leave. None of us will.
I wonder how much better the uptime would be if they made a sincere effort to retain engineering staff.
Right now on levels.fyi, the highest-paying non-managerial engineering role is offered by Oracle. They might not pay the recent grads as well as Google or Microsoft, but they definitely value the principal engineers w/ 20 years of experience.
Both Intercom and Twilio are affected, too.
- https://status.twilio.com/ - https://www.intercomstatus.com/us-hosting
I want the web ca. 2001 back, please.
Seems like we need more anti-trust cases on AWS or need to break it down, it is becoming too big. Services used in rest of the world get impacted by issues in one region.
But they aren't abusing their market power, are they? I mean, they are too big and should definitely be regulated but I don't think you can argue they are much of a monopoly when others, at the very least Google, Microsoft, Oracle, Cloudflare (depending on the specific services you want) and smaller providers can offer you the same service and many times with better pricing. Same way we need to regulate companies like Cloudflare essentially being a MITM for ~20% of internet websites, per their 2024 report.
Ohno, not Fortnite! oh, the humanity.
It's always DNS
That's unusual.
I wss under the impression that having multiple available zones guarantees high availability.
It seems this is not the case.
One of the open secrets of AWS is that even though AWS has a lot of regions and availability zones, a lot of AWS services have control planes that are dependent on / hosted out of us-east-1 regardless of which region / AZ you're using, meaning even if you are using a different availability zone in a different region, us-east-1 going down still can mess you up.
quay.io was down: https://status.redhat.com
quay.io is down
It was in read only mode,now available again
I did get 500 error from their public ECR too
canva.com was down until a few minutes ago.
Atlassian cloud is having problems as well.
thundering herd problems.... every time they say they fix it something else breaks
The RDS proxy for our postgres DB went down.
Cant even login via the AWS access portal.
I cannot pull images from docker hub.
It's fun to see SRE jumping left and right when they can do basically nothing at all.
"Do we enable DR? Yes/No". That's all you can do. If you do, it's a whole machinery starting, which might take longer than the outage itself.
They can't even use Slack to communicate - messages are being dropped/not sent.
And then we laugh at the South Koreans for not having backed up their hard drives (which got burnt by actual fire, a statistically way less occurring event than an AWS outage). OK that's a huge screw up, but hey, this is not insignificant either.
What will happen now? Nothing, like nothing happened after Crowdstrike's bug last year.
Signal seems to be dead too though, which is much more of a WTF?
A decentralized messenger is Tox.
Clearly this is all some sort of mass delusion event, the Amazon Ring status says everything is working.
https://status.ring.com/
(Useless service status pages are incredibly annoying)
Atlassian is down as well so they probably can't access their Atlassian Statuspage admin panel to update it.
When you know a service is down but the service says it's up: it's either your fault or the service is having a severe issue
Terraform Cloud is having problem too
Great. Hope they’re down for a few more days and we can get some time off.
Signal not working here for me in AU
I can't even see my EKS clusters
Similar: our EC2s are gone.
everything is gone :(
Sling still down at 11:42PM PST
Kraken can't do deposits either at 3:38am PST
Curious to know how much does an outage like this cost to others.
Lost data, revenue, etc.
I'm not talking about AWS but whoever's downstream.
Is it like 100M, like 1B?
SES and signal seem to work again
Zoom is unable to send screenshots.
zoom unable to send messages now as well.
BGP (again)?
It shouldn’t, but it does. As a civilization, we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive. So everything is resting on a giant pile of single point of failures.
Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.
> we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive.
You are right. But alas, a peek at the AMZN stock ticker suggests that the market doesn't really value resilience that much.
Stocks stopped being indicative of anything decades ago though.
No shot that happens until an outage breaks at least an entire workday in the US timezones. The only complaint I personally heard was from someone who couldn't load reddit on the train to work.
Well by the time it really happens for a whole day Amazon leadership will be brazen enough to say "OK, enough of this my site is down, we will call back once systems are up so don't bother for a while". Also maybe responsible human engineers would fired by then and AI can be infinitely patient while working through insolvable issues.
No complaints from folks who couldn’t load reddit at work?
Reddit was back by the time work started, so all good there lol
> Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.
Doesn't help either. us-east-1 hosts the internal control plane of AWS and a bunch of stuff is only available in us-east-1 at all - most importantly, Cloudfront, AWS ACM for Cloudfront and parts of IAM.
And the last is the one true big problem. When IAM has a sniffle, everything else collapses because literally everything else depends on IAM. If I were to guess IAM probably handles millions if not billions of requests a second because every action on every AWS service causes at least one request to IAM.
The last re:Invent presentation I saw from one of the principals working on IAM quoted 500 million requests per second. I expect that’s because IAM also underpins everything inside AWS, too.
IAM, hands down, is one of the most amazing pieces of technology there is.
The sheer volume is one thing, but... IAM's policy engine, that's another thing. Up to 5000 different roles per account, dozens of policies that can have an effect on any given user entity and on top of that you can also create IAM policies that blanket affect all entities (or only a filtered subset) in an account, and each policy definition can be what, 10 kB or so, in size. Filters can include multiple wildcards everywhere so you can't go for a fast-path in an in-memory index, and they can run variables with on-demand evaluation as well.
And all of that is reachable not on an account-specific endpoint that could get sharded from a shared instance should the load of one account become too expensive, no, it's a global (and region-shared) endpoint. And if that weren't enough, all calls are shipped off to CloudTrail's event log, always, with full context cues to have an audit and debug trail.
To achieve all that in a service quality that allows for less than 10 seconds worth of time before a change in an IAM policy becomes effective and milliseconds of call time is nothing short of amazing.
No harshness intended, but I don't see the magic.
IAM is solid, but is it any more special than any other distributed AuthN+AuthZ service?
Scale is a feature. 500M per second in practice is impressive.
The scale, speed, and uptime of AWS IAM is pretty special.
IAM is very simple, and very solid.
The scale, speed, and uptime, are downstream from the simplicity.
It's good solid work, I guess I read "amazing" as something surprising or superlative.
(The simple, solid, reliable services should absolutely get more love! Just wasn't sure if I was missing something about IAM.)
It's not simple, that's the point! The filter rules and ways to combine rules and their effects are highly complex. The achievement is how fast it is _despite_ network being involved on at least two hops - first service to IAM and then IAM to database.
I think it's simple. It's just a stemming pattern matching tree, right?
The admin UX is ... awkward and incomplete at best. I think the admin UI makes the service appear more complex than it is.
The JSON representation makes it look complicated, but with the data compiled down into a proper processable format, IAM is just a KVS and a simple rules engine.
Not much more complicated than nginx serving static files, honestly.
(Caveat: none of the above is literally simple, but it's what we do every day and -- unless I'm still missing it -- not especially amazing, comparatively).
IAM policies can have some pretty complex conditions that require it to sync to other systems often. Like when a tag value is used to allow devs access to all servers with the role:DEV tag.
In my (imagined) architecture, the auth requester sends the asset attributes (including tags in this example) with the auth request, so the auth service doesn't have to do any lookup to other systems. Updates are pushed in a message queue style manner, policy tables are cached and eventually consistent.
The irony is that true resilience is very complex, and complexity can be a major source of outages in and of itself
I have enjoyed this paper on such dynamics: https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...
It is kind of the child of what used to be called Catastrophe Theory, which in low dimensions is essentially a classification of folding of manifolds. Now the systems are higher dinemsional and the advice more practical/heuristic.
when did we have resilience?
Cold War was pretty good in terms of resilience.
There are plenty of ways to address this risk. But the companies impacted would have to be willing to invest in the extra operational cost and complexity. They aren’t.
Too big to recover.
Considering the history of east-1 it is fascinating that it still causes so many single point of failure incidents for large enterprises.
They are amazing at LeetCode though.
Let's be nice. I'm sure devs and ops are on fire right now, trying to fix the problems. Given the audience of HN, most of us could have been (have already been?) in that position.
No we wouldn’t because there’s like a 50/50 chance of being a H1B/L1 at AWS. They should rethink their hiring and retention strategies.
hugops ftw
They choose their hiring-retention practices and they choose to provide global infrastructure, when is the good time to criticise them?
Granted, they are not as drunk on LLM as Google and Microsoft. So, at least we can say this outage had not been vibe-coded (yet).
Affecting Coinbase[1] as well, which is ridiculous. Can't access the web UI at all. At their scale and importance they should be multi-region if not multi-cloud.
[1] https://status.coinbase.com
Seems the underlying issue is with DynamoDB, according to the status page, which will have a big blast radius in other services. AWS' services form a really complicated graph and there's likely some dependency, potentially hidden, on us-east-1 in there.
The issue appears to be cascading internationally due to internal dependencies on us-east-1
Now, I may well be naive - but isn't the point of these systems that you fail over gracefully to another data centre and no-one notices?
I get the impression that this has been thought about to some extent, but its a constantly changing architecture with new layers and new ideas being added, so for every bit of progress there's the chance of new Single Points Of Failure being added. This time it seems to be a DNS problem with DynamoDB
you put your sh*t in us-east-1 you need to plan for this :)
npm registry also down
Uhm... E(U)ropean sovereigny (and in general spreading the hosting as much as possbile) needed ASAP…
because... EU clouds don't break?
https://news.ycombinator.com/item?id=43749178
Nah, because European services should not be affected by a failure in the US. Whatever systems they have running in us-east-1 should have failovers in all major regions. Today it's an outage in Virginia, tomorrow it could be an attack on undersea cables (which I'm confident are mined and ready to be severed at this point by multiple parties).
Mined? My understanding is that they are maintained too regularly for that, or we would know.
Also, lots of the bad guy boogeymen countries have legal and technical methods to do this without property damage. Just blackhole a bunch of routes.
No, it's called "diversification". Applies both to stock/currency/investments as well as enything else :P
ok but many providers already have regions in the EU
Well, except for a lot of business leaders saying that they don't care if it's Amazon that goes down, because "the rest of the internet will be down too."
Dumb argument imho, but that's how many of them think ime.
Docker is also down.
Also:
Snapchat, Ring, Roblox, Fortnite and more go down in huge internet outage: Latest updates https://www.the-independent.com/tech/snapchat-roblox-duoling...
To see more (from the first link): https://downdetector.com
Don't miss this
"We should have a fail back to US-West."
"It's been on the dev teams list for a while"
"Welp....."
Surprising and sad to see how many folks are using DynamoDB There are more full featured multi-cloud options that don't lock you in and that don't have the single point of failure problems.
And they give you a much better developer experience...
Sigh
Happy Monday People
Apparently IMDb, an Amazon service is impacted. LOL, no multi region failover.
More and more I want to be could agnostic or multi-cloud.
Good luck to all on-callers today.
It might be an interesting exercise to map how many of our services depend on us-east-1 in one way or another. One can only hope that somebody would do something with the intel, even though it's not a feature that brings money in (at least from business perspective).
And yet, AMZN is up for the day. The market doesn't care. Crazy.
Medium also.
O ffs. I can't even access the NYT puzzles in the meantime ... Seriously disrupted, man
altavista.com is also down!
seeing issues with SES in us-east-1 as well
seems like services are slowly recovering
Today’s reminder: multi-region is so hard even AWS can’t get it right.
Am i imagining it or are more things like this happening in recent weeks than usual?
My app deployed on Vercel and therefore indirectly deployed on us-east-1 was down for about 2 hours today then came back up and then went down again 10 minutes ago for 2 or 3 minutes. It seems like they are still intermittent issues happening.
Now I know why the documents I was sending to my Kindle didn't go through.
For me Reddit is down and also the amazon home page isn't showing any items for me.
Sounds like a circular error with monitoring is flooding their network with metrics and logs, causing DNS to fail and produce more errors, flooding the network. Likely root cause is something like DNS conflicts or hosts being recreated on the network. Generally this is a small amount of network traffic but the LBs are dealing with host address flux, causing the hosts to keep colliding host addresses as they attempt to resolve to a new host address which are being lost from dropped packets and with so many hosts in one AZ, there's a good chance they end up with a new conflicting address.
I didn't even notice anything was wrong today. :) Looks like we're well disconnected from the US internet infra quasi-hegemony.
I in-housed an EMR for a local clinic because of latency and other network issues taking the system offline several times a month (usually at least once a week). We had zero downtime the whole first year after bringing it all in house, and I got employee of the month for several months in a row.
Paying for resilience is expensive. not as expensive as AWS, but it's not free.
Modern companies live life on the edge. Just in time, no resilience, no flexibility. We see the disaster this causes whenever something unexpected happens - the Evergiven blocking Suez for example, let alone something like Covid
However increasingly what should be minor loss of resilience, like an AWS outage or a Crowdstrike incident, turns into major failures.
This fragility is something government needs to legislate to prevent. When one supermarket is out that's fine - people can go elsewhere, the damage is contained. When all fail, that's a major problem.
On top of that, the attitude that the entire sector has is also bad. People thing IT should tail once or twice a year and it's not a problem. If that attitude affect truly important systems it will lead to major civil projects. Any civilitsation is 3 good meals away from anarchy.
There's no profit motive to avoid this, companies don't care about being offline for the day, as long as all their mates are also offline.
Keep going
Ring is affected. Why doesn’t Ring have failover to another region?
That's understandably bad for anyone who depends on Ring for security but arguably a net positive for the rest of us.
Amazon’s Ring to partner with Flock: https://news.ycombinator.com/item?id=45614713
Ironically enough I can't access Reddit due to no healthy upstream.
Substack seems to by lying about their status: https://substack.statuspage.io/
Reddit seems to be having issues too:
"upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection timeout"
Snow day!
Well that takes down Docker Hub as well it looks like.
Yep, was just thinking the same when my Kubernetes failed a HelmRelease due to a pull error…
It's weird that we're living in a time where this could be a taste of a prolonged future global internet blackout by adversarial nations. Get used to this feeling I guess :)
Can't log into tidal for my music
Navidrome seems fine
is this why docker is down?
yes hub.docker.com. 75 IN CNAME elb-default.us-east-1.aws.dckr.io.
Can't check out on Amazon.com.au, gives error page
This link works fine from Australia for me.
Vercel functions are down as well.
Yes https://www.vercel-status.com/
Slack now also down: https://slack-status.com/
worst outage since xmas time 2012
Probably related:
https://www.nytimes.com/2025/05/25/business/amazon-ai-coders...
"Pushed to use artificial intelligence, software developers at the e-commerce giant say they must work faster and have less time to think."
Every bit of thinking time spent on a dysfunctional, lying "AI" agent could be spent on understanding the system. Even if you don't move your mouse all the time in order to please a dumb middle manager.
"Never choose us-east-1"
Never choose a single point of failure.
Or rather
Ensure your single point of failure risk is appropriate for your business. I don't have full resilience for my companies AS going down, but we do have limited DR capability. Same with the loss of a major city or two.
I'm not 100% confident in a Thames Barrier flood situation, as I suspect some of our providers don't have the resilience levels we do, but we'd still be able to provide some minimal capability.
"serverless"
Slack was acting slower than usual, but did not go down. Color me impressed.
Reminder that AZs don't go down
Entire regions go down
Don't pay for intra-az traffic friends
I love this to be honest. Validates my anti cloud stance.
No service that does not run on cloud has ever had outages.
But at least a service that doesn't run on cloud doesn't pay the 1000% premium for its supposed "uptime".
At least its in my control :)
Not having control or not being responsible are perhaps major selling points of cloud solutions. To each their own, I also rather have control than having to deal with a cloud provider support as a tiny insignificant customer. But in this case, we can take a break and come back once it's fixed without stressing.
Businesses not taking responsibility for their own business should not exist in the first place...
No business is fully integrated. Doing so would be dumb and counter productive.
Many businesses ARE fully vertically integrated. And many make real stuff, in meat space, where it's 100,000x harder. But software companies can't do it?
Obviously there's pros and cons. One of the pros being that you're so much more resilient to what goes on around you.
Name one commercial company that is entirely/fully vertically integrated and can indefinitely continue business operations 100% without any external suppliers.
It's a sliding scale - eventually you rely on land. But in n out is close. And they make burgers, which is much more difficult than making software. Yes, I'm being serious.
But if you look at open source projects, many are close to perfectly verifically integrated.
There's also a big big difference between relying on someone's code and relying on someone's machines. You can vender code - you, however, rely on particular machines being up and connected to the internet. Machines you don't own and you aren't allowed to audit.
> But in n out is close.
You said "Many businesses ARE fully vertically integrated." so why name one that is close to fully vertically integrated, just name one of the many others that are fully vertically integrated. I don't really care about discussing things which prove my point instead of your point as if they prove your point.
> open source projects, many are close to perfectly verifically integrated
Comparing code to services seems odd, not sure how GitLab the software compares to GitLab the service for example. Code is just code, a service requires servers to run on, etc. GitLab the software can't have uptime because it's just code. It can only have an uptime once someone starts running it, at which point you can't attribute everything to the software anymore as the people running it have a great deal of responsibility for how well it runs, and even then, even if GitLab the software would have been "close to perfectly vertically integrated" (like if they used no OS, as if anyone would ever want that), then the GitLab serivice still needs many things from other suppliers to operate.
And again, "close to perfectly verifically integrated" is not "perfectly verifically integrated".
If you are wrong, and in fact nothing in our modern world is fully vertically integrated as I said, then it's best to just admit that and move on from that and continue discussing reality.
Jesus Christ dude, nothing is 100% and we both know that. That doesn't mean my broader point is wrong.
You're arguing semantics because you know that's the only way you'll feel right. Bye.
Allowing them to not take responsibility is an enabler for unethical business practices. Make businesses accountable for their actions, simple as that.
How are they not accountable though? Is Docker not accountable for their outage that follows as a consequence? How should I make them accountable? I don't have to do shit here, the facts are what they are and the legal consequences are what they are. Docker gives me a free service and free software, they receive 0 dollars from me, I think the deal I get is pretty fucking great.
What makes you think I meant docker?
Okay, let's discard Docker, take all other companies. How are they not accountable? How should I make them accountable? I don't have to do shit here, the facts are what they are, and the legal consequences are what they are. They either are legally accountable or not. Nothing I need to do to make that a reality.
If a company sold me a service with guaranteed uptime, I'd expect the guaranteed uptime or expect a compensation in case they cant keep up with their promises.
Every cloud provider I have ever used has an SLA that covers exactly this, here you can see the SLAs for AWS: https://aws.amazon.com/ecs/anywhere/sla/?did=sla_card&trk=sl...
Time to start calling BS on the 9's of reliability
this is why you avoid us-east-1
And yet you're still impacted because it hosts IAM
workos is down too, timing is highly correlated with AWS outage: https://status.workos.com/
That means Cursor is down, can't login.
Is this why Wordle logged me out and my 2 guesses don't seem to have been recorded? I am worried about losing my streak.
This outage is a reminder:
Economic efficiency and technical complexity are both, separately and together, enemies of resilience
Meanwhile my pair of 12 year old raspberry pi's hangling my home services like DNS survive their 3rd AWS us-east-1 outage.
"But you can't do webscale uptime on your own"
Sure. I suspect even a single pi with auto-updates on has less downtime.
There are entire apps like Reddit that are still not working. What the fuck is going on?
Remember when the "internet will just route around a network problem"?
FFS ...
99.999 percent lol
Slack now failing for me.
Honestly anyone do have outages, that's nothing extraordinary, what's wrong is the number of impacted services. We choose (at least almost choose) to ditch mainframes for clusters also for resilience. Now with cheap desktop iron labeled "stable enough to be a serious server" we have seen mainframes re-created sometimes with a cluster of VM on top of a single server, sometimes with cloud services.
Ladies and Gentleman's it's about time to learn reshoring in the IT world as well. Owning nothing, renting all means extreme fragility.
imagine spending millions on devops and sre to still have your mission critical service go down because amazon still has baked in regional dependencies
How much longer are we going to tolerate this marketing bullshit about "Designed to provide 99.999999999% durability and 99.99% availability"?
That is for S3 not AWS as a whole, AWS has never claimed otherwise. AFAIK S3 has never broken the 11 9s of durability.
But but this is a cloud, it should exist in the cloud.
[dead]
[dead]
Designed to provide 99.999% durability and 99.999% availability Still designed, not implemented
The real challenge is that implementations aren't static.
Just because today's implementation has 4 9s that doesn't mean tomorrow's will...
[dead]
[dead]
[flagged]
This is such an HN response. Oh, no problem, I'll just avoid the internet for all of my important things!
Door locks, heating and household appliances should probably not depend on Internet services being available.
Do you not have a self-hosted instance of every single service you use? :/
No. But for the important ones, yes I do.
Everything in and around my house is working fully offline
They are probably being sarcastic.
Not very helpful. I wanted to make a very profitable trade but can’t login to my brokerage. I’m losing about ~100k right now.
Time to sue, or get insurance.
what's the trade?
Probably AWS stock...
This reminds me of the twitter-based detector we had at Facebook that looked for spikes in "Facebook down" messages.
When Facebook went public, the detector became useless because it fired anytime someone wrote about the Facebook stock being down and people retweeted or shared the article.
I invested just enough time in it to decide it was better to turn it off.
Beyond Meat
Ouch. damn. good call!
[flagged]
Pretty sure this is satire and not even remotely true.
I beleived it because of a thread I read 3 months ago about non-specific Amazon layoffs but you are right. It's AI slop, and not accurate. https://www.reddit.com/r/programming/comments/1m6krap/its_re...
hello world
Hello world
Someone’s got a case of the Monday’s.
Major us-east-1 outages happened in 2011, 2015, 2017, 2020, 2021, 2023, and now again. I understand that us-east-1, N. VA, was the first DC but for fucks sake they've had HOW LONG to finish AWS and make us-east-1 not be tied to keeping AWS up.
First, not all outages are created equal, so you cannot compare them like that.
I believe the 2021 one was especially horrific because of it affecting their dns service (route53) and the outage made writes to that service impossible. This made fail overs not work etcetera so their prescribed multi region setups didn't work.
But in the end, some things will have to synchronizes their writes somewhere, right? So for dns I could see how that ends up in a single region.
AWS is bound by the same rules as everyone else in the end... The only thing they have going for them that they have a lot of money to make certain services resilient, but I'm not aware of a single system that's resilient to everything.
If AWS fully decentralized its control planes, they’d essentially be duplicating the cost structure of running multiple independent clouds and I understand that is why they don't however as long as AWS is reliant upon us-east-1 to function, they have not achieved what they claim to me. A single point of failure for IAM? Nah, no thanks.
Every AWS “global” service be it IAM, STS, CloudFormation, CloudFront, Route 53, Organizations, they all have deep ties to control systems originally built only in us-east-1/n. va.
That's poor design, after all these years. They've had time to fix this.
Until AWS fully decouples the control plane from us-east-1, the entire platform has a global dependency. Even if your data plane is fine, you still rely on IAM and STS for authentication and maybe Route 53 for DNS or failover CloudFormation or ECS for orchestration...
If any of those choke because us-east-1’s internal control systems are degraded, you’re fucked. That’s not true regional independence.
You can only decentralized your control plane if you don't have conflicting requirements?
Assuming you cannot alter requirements or SLAs, I could see how their technical solutions are limited. It's possible, just not without breaking their promises. At that point it's no longer a technical problem
In the narrow distributed-systems sense? Yes, however those requirements are self-imposed. AWS chose strong global consistency for IAM and billing... they could loosen it at enormous expense.
The control plane must know the truth about your account and that truth must be globally consistent. That’s where the trouble starts I guess.
I think my old-school system admin ethos is just different than theirs. It's not a who's wrong or right, just a difference in opinions on how it should be done I guess.
The ISP I work for requires us to design in a way that no single DC will cause a point of failure, just difference in design methods and I have to remember the DC I work in is completely differently used than AWS.
In the end however, I know solutions for this exist (federated ledgers, CRDT-based control planes, regional autonomy but they’re just expensive and they don’t look good on quarterly slides), it just takes the almighty dollar to implement and that goes against big business, if it "works" it works, I guess.
AWS’s model scales to millions of accounts because it hides complexity, sure but the same philosophy that enables that scale prevents true decentralization. That is shit. I guess people can architect as if us-east-1 can disappear so that things can continue on, but then thats AWS causing complexity in your code. They are just shifting who is shouldering that little known issue.
Someone vibecoded it down.
Good thing hyperscalers provide 100% uptime.
When have they ever claimed that?
Plenty of people last week here claiming hyperscalers are necessary while ignoring more normal hosting options.
Looks like we're back!
So much for the peeps claiming amazing Cloud uptime ;)
Could you give us that uptime, in number ?
I'm afraid you missed the emoticon at the end of the sentence.
A `;)` is normally understoond to mean the author isn't entirely serious, and is making light of something or other.
Perhaps you American downvoters were on call and woke up with a fright, and perhaps too much time to browse Hacker News. ;)
This is the reason why it is important to plan Disaster recovery and also plan Multi-Cloud architectures.
Our applications and databases must have ultra high availability. It can be achieved with applications and data platforms hosted on different regions for failover.
Critical businesses should also plan for replication across multiple cloud platforms. You may use some of the existing solutions out there that can help with such implementations for data platforms.
- Qlik replicate - HexaRocket
and some more.
Or rather implement native replication solutions available with data platforms.