US-East AWS Connectivity Issues

rbranson · on Sept 13, 2013

This appears to be connectivity issues entirely to/from the Internet or other EC2 regions from a single availability zone in us-east-1. The intra-AZ networks within us-east-1 have remained available during the event. One of the AZs we use was affected, but no external traffic flows to it. I noticed this because an auto-scale group was trying to bring up instances inside of the affected zone (our us-east-1a) and was unable to contact a server outside of AWS.

cperciva · on Sept 13, 2013

I'm definitely seeing issues in multiple AZs. It seems to be partly firewall-related, however: I've seen cases where it's hard to get an initial SYN through, but once a TCP connection is established it stays established.
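
One rough way to check the "hard to get an initial SYN through" symptom is to open fresh TCP connections repeatedly and count how many complete. This is only a sketch: `probe_connects` and its parameters are made-up names, and the `connect` argument exists only so the logic can be tried without a network.

```python
import socket
import time


def probe_connects(host, port, attempts=10, timeout=3.0,
                   connect=socket.create_connection):
    """Try `attempts` fresh TCP connections; return (successes, per-try results).

    Each result is (succeeded, seconds_taken). A low success count while
    already-open connections keep working would match the symptom above.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Covers timeouts, refused connections and unreachable networks.
            results.append((False, time.monotonic() - start))
        else:
            conn.close()
            results.append((True, time.monotonic() - start))
    succeeded = sum(1 for ok, _ in results if ok)
    return succeeded, results
```

Something like `probe_connects("www.example.com", 80)` run from an affected instance and from a healthy one would give a crude comparison.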

fjordan · on Sept 13, 2013

This, in addition to the increase in traffic we detected directly before, smells of DoS. Also, it is Friday the 13th.

jd007 · on Sept 13, 2013

Not sure if it's the entire AZ. We have some Internet-reachable instances and some Internet-unreachable instances in the same AZ.

fjordan · on Sept 13, 2013

We are able to reach some but not all of the servers in our us-east AZ, and which ones are reachable seems to be random.

tomweingarten · on Sept 13, 2013

Did anyone else notice a huge spike in incoming network traffic on their EC2 instance immediately before the outage? Roughly 9:55AM EST.

justinsb · on Sept 13, 2013

Did it look like a DDoS attack, or do you think something went wrong where you were getting traffic meant for other EC2 nodes?

I'm not quite sure how you would tell the difference of course...

tomweingarten · on Sept 15, 2013

We didn't get enough data to be able to determine that, but I'd be very curious to hear if someone else did.

acdha · on Sept 13, 2013

Nothing extraordinary but there was definitely a healthy spike in one AZ.

sadris · on Sept 13, 2013

rschmitty · on Sept 13, 2013

Does anyone know why in the world they display a green checkmark with a near invisible little 'i' for this?

iota · on Sept 13, 2013

There are 4 statuses.

Green checkmark (status 0)

Green checkmark with "info" badge (status 1)

Yellow triangle (status 2)

Red "do not enter" rectangle (status 3)

I suspect that status 1 indicates that they are investigating a problem with the server, and it switches to status 2 once the problem has been confirmed.

This is also a good example of poor icon design...they aren't self-explanatory, and so they should not be used.

cperciva · on Sept 13, 2013

What happened to status 2?

iota · on Sept 13, 2013

Good catch. Fixed!

rschmitty · on Sept 13, 2013

Thanks!

aroch · on Sept 13, 2013

Because checkmarks make everything look fine...

ceph_ · on Sept 13, 2013

Non-snarky answer: They want to post that they are investigating an issue, but do not want to comment on the scope of the problem when it is still not fully understood.

jordanmessina · on Sept 13, 2013

http://www.youtube.com/watch?v=uriZZ3Slz74

teoruiz · on Sept 13, 2013

They switch to white-triangle-in-yellow-circle when they find out that there is a real problem. They just switched.

jolan · on Sept 13, 2013

Amazon is continuing the trend of announcing outages 30 minutes after they start.

Just signed up for a support contract since the status page said everything was fine.

colinbartlett · on Sept 13, 2013

And by "announcing" you mean indicating everything is a-okay with the green checkmark but putting a tiny footnote next to it.

acdha · on Sept 13, 2013

“outage (n): a period when a power supply or other service is not available or when equipment is closed down”

A subset of traffic taking longer than normal is not an outage.

frabcus · on Sept 13, 2013

We (ScraperWiki) can still access some of our US East servers. From those, we can daisy-chain SSH into the ones that are offline. Those servers can't see the world, but are working fine and can see other EC2 instances.
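
A minimal sketch of that hop, assuming an OpenSSH client recent enough to support the `-J` (jump host) option. The host names, the `ec2-user` login and the `ssh_via_jump` helper are hypothetical, not anything from ScraperWiki's setup.

```python
import shlex


def ssh_via_jump(jump_host, target_host, user="ec2-user"):
    """Build an ssh argv that reaches target_host by hopping through jump_host."""
    # -J tells ssh to connect to the jump host first and tunnel onward from it.
    return ["ssh", "-J", f"{user}@{jump_host}", f"{user}@{target_host}"]


# Hypothetical hosts: one still reachable from outside, one reachable only
# from inside EC2.
cmd = ssh_via_jump("reachable.example.com", "10.0.1.23")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` to open the session.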

devy · on Sept 13, 2013

Hi, can you use port forwarding to get the website up on those affected nodes?

jpea · on Sept 13, 2013

I wonder if it extends beyond Amazon, since my gmail now doesn't pull anything up after 2009, web or IMAP.

aquark · on Sept 13, 2013

I'm getting external monitoring failures that are firing on and off, but have no problem reaching the servers or the site.

Interestingly, New Relic is reporting the site down at the same time it is reporting a normal level of load on it.

joe010 · on Sept 13, 2013

I've recently moved some of our servers over to Digital Ocean, but I'm still using AWS for DNS since their Route 53 weighted DNS with health checks works as a basic load balancer for our needs. I'm seeing DNS health checks that point at individual servers at Digital Ocean showing 0.91 for a status (1 being up and 0 being down). The alarms attached to the health checks keep flipping from "alarm" to "ok" and causing tons of alerts. As of about 15 minutes ago, all of my checks started holding steady back at a status of 1 (ok). Good stuff :)
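
Alarm flapping like this is what a plain threshold does when a metric hovers just under 1. One common way to damp it is to change state only after several consecutive samples agree. The sketch below is illustrative only, not Route 53 or CloudWatch code; `alarm_states`, `threshold` and `required` are made-up names.

```python
def alarm_states(samples, threshold=1.0, required=3):
    """Return the alarm state after each sample.

    The state flips only once `required` consecutive samples disagree with
    the current state, so a single dip to 0.91 does not raise an alert.
    """
    state = "ok"
    streak = 0
    states = []
    for value in samples:
        observed = "alarm" if value < threshold else "ok"
        if observed == state:
            streak = 0          # back in agreement, forget the partial streak
        else:
            streak += 1
            if streak >= required:
                state = observed
                streak = 0
        states.append(state)
    return states
```

The trade-off is slower detection: a real failure takes `required` samples to show up.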

jd007 · on Sept 13, 2013

ELBs are also having problems. One of mine is reporting all instances out of service (transient error), then all instances in service, intermittently. But the ELB is never reachable (even when it reports all instances healthy and up). All instances behind this one are reachable, up and running. US-East-1.

Some of our other instances are reachable but some are not, same as others have been reporting.

sadris · on Sept 13, 2013

Why does this never happen to AWS West? I should really get to migrating over with 3 outages in the past 2 years on US East.

knodi · on Sept 13, 2013

It does happen, you just never notice because you don't have instances in US West.

brryant · on Sept 13, 2013

There are definitely issues with network connectivity between AZs as well as public internet connectivity.

jipumarino · on Sept 13, 2013

I got into one of our machines that presented the connectivity issues from another one which was still reachable. It had no external (curl www.google.com) connectivity. Just two minutes ago it started resolving again.

ihaveajob · on Sept 13, 2013

It looks ok now for us (appfluence.com), but even when it was down, our website was still up, only the sync services went offline. And even then, they were accessible from the web server...

NotDaveLane · on Sept 13, 2013

It's availability zone us-east-1c for now, at least from where I'm sitting... I have instances in other us-east datacenters that are fine.

trevyn · on Sept 13, 2013

Specific availability zones in a region are mapped per-account, so your east-1c might be my east-1a:

"To ensure that resources are distributed across the Availability Zones for a region, we independently map Availability Zones to identifiers for each account. For example, your Availability Zone us-east-1a might not be the same location as us-east-1a for another account. Note that there's no way for you to coordinate Availability Zones between accounts."

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-reg...
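
To make the per-account mapping concrete, here is a toy model. It is purely illustrative: the quoted documentation does not say how the mapping is computed, so the seeded shuffle, the `az_mapping` helper, the account IDs and the `dc-N` zone names are all assumptions.

```python
import random
import string


def az_mapping(account_id, region, physical_zones):
    """Map per-account AZ names (region + letter) onto physical zones.

    Seeding the shuffle with the account ID keeps the mapping stable for one
    account while letting different accounts get different orderings.
    """
    rng = random.Random(f"{account_id}:{region}")
    shuffled = list(physical_zones)
    rng.shuffle(shuffled)
    letters = string.ascii_lowercase
    return {f"{region}{letters[i]}": zone for i, zone in enumerate(shuffled)}


zones = ["dc-1", "dc-2", "dc-3", "dc-4"]  # hypothetical physical locations
print(az_mapping("111111111111", "us-east-1", zones))
print(az_mapping("222222222222", "us-east-1", zones))
```

In this model two accounts comparing notes about "us-east-1c" may be talking about different buildings, which is why zone letters in outage reports are hard to correlate.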

ceejayoz · on Sept 13, 2013

I wonder how that works with new zones. I remember us-east-1e being added separately to the original four. Presumably, that one's the same for all accounts that'd already signed up at the time.

astrodust · on Sept 13, 2013

Clever and ridiculous at the same time.

lreeves · on Sept 13, 2013

Availability zone letters are specific to your account.

scrabble · on Sept 13, 2013

So what is the best way to balance a hosted site between Amazon and a separate service? Because these connectivity issues suck.

bredman · on Sept 13, 2013

One option would be to use Route 53 weighted round robin (WRR) DNS records and health checks to accomplish this.
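
The idea behind weighted records with health checks can be sketched as follows. This is a simplified model rather than Route 53's actual resolution logic; `pick_endpoint`, the host names and the weights are hypothetical.

```python
import random


def pick_endpoint(records, healthy, rng=random):
    """Pick one endpoint by weight, considering only healthy endpoints.

    records: dict of endpoint -> weight.
    healthy: set of endpoints currently passing their health checks.
    Returns None when nothing is healthy; a real DNS service may behave
    differently in that case.
    """
    candidates = {ep: w for ep, w in records.items() if ep in healthy and w > 0}
    if not candidates:
        return None
    endpoints = list(candidates)
    weights = [candidates[ep] for ep in endpoints]
    return rng.choices(endpoints, weights=weights, k=1)[0]


# Hypothetical split: roughly 3 of every 4 answers point at AWS while both are up.
records = {"aws.example.com": 3, "other.example.com": 1}
print(pick_endpoint(records, healthy={"aws.example.com", "other.example.com"}))
print(pick_endpoint(records, healthy={"other.example.com"}))
```

When the AWS-hosted endpoint fails its check, every answer goes to the other provider, subject to DNS TTLs and resolver caching.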

jday · on Sept 13, 2013

this has taken openredis offline: https://twitter.com/openredis

heroku is also reporting issues:

https://status.heroku.com/incidents/554

martin_ · on Sept 13, 2013

All of mine just started magically working

martin_ · on Sept 13, 2013

I retract that statement! http://shutter.io/img/vs6jjs/raw

TallboyOne · on Sept 13, 2013

Aaand we're back up now.

knodi · on Sept 13, 2013

Always on a Friday...

jlgaddis · on Sept 13, 2013

Not just any Friday...

    $ date
    Fri Sep 13 12:07:58 EDT 2013

xdissent · on Sept 13, 2013

Not just any Friday the 13th... http://en.wikipedia.org/wiki/Programmers'_Day

TallboyOne · on Sept 13, 2013

Not just any Friday the 13th Programmer's Day... http://www.holidayinsights.com/other/fortunecookie.htm

o0-0o · on Sept 13, 2013

Down in Manhattan