US-East AWS Connectivity Issues (amazon.com)
77 points by fjordan on Sept 13, 2013 | 49 comments


This appears to be connectivity issues entirely to/from the Internet or other EC2 regions from a single availability zone in us-east-1. The intra-AZ networks within us-east-1 have remained available during the event. One of the AZs we use was affected, but no external traffic flows to it. I noticed this because an auto-scale group was trying to bring up instances inside of the affected zone (our us-east-1a) and was unable to contact a server outside of AWS.


I'm definitely seeing issues in multiple AZs. It seems to be partly firewall-related, however: I've seen cases where it's hard to get an initial SYN through, but once a TCP connection is established it stays established.
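One rough way to check the "initial SYN is hard to get through" symptom is to open several fresh TCP connections and count how many handshakes complete within a timeout. The sketch below is only an illustration: the function names, attempt count, and timeout are assumptions, and it says nothing about what was actually happening inside AWS.

```python
import socket
import time


def probe_connect(host, port, attempts=5, timeout=3.0):
    """Open `attempts` fresh TCP connections and return a list of (ok, seconds).

    Each attempt needs a new handshake, so dropped SYNs show up as
    timeouts or errors even if long-lived connections keep working.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection performs the full TCP handshake or raises OSError.
            with socket.create_connection((host, port), timeout=timeout):
                ok = True
        except OSError:
            ok = False
        results.append((ok, time.monotonic() - start))
    return results


def summarize(results):
    """Return (successful_attempts, total_attempts)."""
    ok_count = sum(1 for ok, _ in results if ok)
    return ok_count, len(results)
```

Running `summarize(probe_connect(host, 80))` from a machine outside the affected zone, with `host` set to one of your own instances, would show whether new connections are failing intermittently.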


This, in addition to the increase in traffic we detected directly before, smells like a DoS attack. Also, it is Friday the 13th.


Not sure if it's the entire AZ. We have some Internet-reachable instances and Internet unreachable instances in the same AZ.


We can reach some, but not all, of the servers in our us-east AZ, and which ones are reachable seems to be random.


Did anyone else notice a huge spike in incoming network traffic on their EC2 instance immediately before the outage? Roughly 9:55AM EST.


Did it look like a DDoS attack, or do you think something went wrong and you were getting traffic meant for other EC2 nodes?

I'm not quite sure how you would tell the difference of course...


We didn't get enough data to be able to determine that, but I'd be very curious to hear if someone else did.


Nothing extraordinary but there was definitely a healthy spike in one AZ.


Yes


Does anyone know why in the world they display a green checkmark with a near invisible little 'i' for this?


There are 4 statuses.

Green checkmark (status 0)

Green checkmark with "info" badge (status 1)

Yellow triangle (status 2)

Red "do not enter" rectangle (status 3)

I suspect that status 1 indicates that they are investigating a problem with the service, and it switches to status 2 once the problem has been confirmed.

This is also a good example of poor icon design...they aren't self-explanatory, and so they should not be used.


What happened to status 2?


Good catch. Fixed!


Thanks!


Because checkmarks make everything look fine...


Non-snarky answer: They want to post that they are investigating an issue, but do not want to comment on the scope of the problem when it is still not fully understood.



They switch to white-triangle-in-yellow-circle when they find out that there is a real problem. They just switched.


Amazon is continuing the trend of announcing outages 30 minutes after they start.

Just signed up for a support contract since the status page said everything was fine.


And by "announcing" you mean indicating everything is a-okay with the green checkmark but putting a tiny footnote next to it.


“outage (n): a period when a power supply or other service is not available or when equipment is closed down”

A subset of traffic taking longer than normal is not an outage


We (ScraperWiki) can still access some of our US East servers. From those, can daisy chain SSH into the ones that are offline. Those servers can't see the world, but are working fine and can see other EC2 instances.


Hi, can you use port forwarding to bring the website up on the affected nodes?


I wonder if it extends beyond Amazon, since my Gmail now doesn't pull up anything after 2009, over web or IMAP.


I'm getting external monitoring failures that are firing on and off, but have no problem reaching the servers or the site.

Interestingly, New Relic is reporting the site down at the same time it is reporting a normal level of load on it.


I've recently moved some of our servers over to Digital Ocean, but I'm still using AWS for DNS, since their Route 53 weighted DNS with health checks works as a basic load balancer for our needs. I'm seeing DNS health checks that point at individual servers at Digital Ocean showing 0.91 for a status (1 being up and 0 being down). The alarms attached to the health checks keep flipping from "alarm" to "ok" and causing tons of alerts. As of about 15 minutes ago, all of my checks started holding steady back at a status of 1 (ok). Good stuff :)


ELBs are also having problems. One of mine is reporting all instances out of service (transient error), then all instances in service, intermittently. But the ELB is never reachable (even when it reports all instances healthy and up). All instances behind this one are reachable, up and running. US-East-1.

Some of our other instances are reachable but some are not, same as others have been reporting.


Why does this never happen to AWS West? I should really get to migrating over with 3 outages in the past 2 years on US East.


It does happen, you just never notice because you don't have instances in US West.


There are definitely issues with network connectivity between AZs as well as public internet connectivity.


I got into one of our machines that presented the connectivity issues from another one which was still reachable. It had no external (curl www.google.com) connectivity. Just two minutes ago it started resolving again.


It looks ok now for us (appfluence.com), but even when it was down, our website was still up, only the sync services went offline. And even then, they were accessible from the web server...


It's region us-east-1c for now, at least from where I'm sitting... I have instances in other us-east datacenters that are fine.


Specific availability zones in a region are mapped per-account, so your east-1c might be my east-1a:

"To ensure that resources are distributed across the Availability Zones for a region, we independently map Availability Zones to identifiers for each account. For example, your Availability Zone us-east-1a might not be the same location as us-east-1a for another account. Note that there's no way for you to coordinate Availability Zones between accounts."

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-reg...


I wonder how that works with new zones. I remember us-east-1e being added separately to the original four. Presumably, that one's the same for all accounts that'd already signed up at the time.


Clever and ridiculous at the same time.


Availability zone letters are specific to your account.


So what is the best way to balance a hosted site between Amazon and a separate service? Because these connectivity issues suck.


One option would be to use Route 53 weighted round robin (WRR) DNS records and health checks to accomplish this.
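To make the idea concrete, here is a small simulation of weighted selection among endpoints that pass a health check. It is a sketch of the concept only, not the Route 53 API or its exact behavior: the record names, weights, and the `pick_endpoint` function are hypothetical, and returning `None` when nothing is healthy is a simplification.

```python
import random


def pick_endpoint(records, healthy, rng=random):
    """Pick one endpoint name, weighted, from the records that are healthy.

    records: list of (name, weight) pairs (hypothetical names and weights).
    healthy: set of names currently passing their health checks.
    Returns None if no healthy record with a positive weight exists.
    """
    candidates = [(name, weight) for name, weight in records
                  if name in healthy and weight > 0]
    if not candidates:
        return None
    names = [name for name, _ in candidates]
    weights = [weight for _, weight in candidates]
    # random.choices draws proportionally to the given weights.
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical setup: 3/4 of lookups go to AWS, 1/4 to another provider.
records = [("aws-us-east", 3), ("other-provider", 1)]
print(pick_endpoint(records, healthy={"aws-us-east", "other-provider"}))
print(pick_endpoint(records, healthy={"other-provider"}))  # AWS failing its check
```

With both endpoints healthy, traffic splits roughly according to the weights; when the AWS endpoint fails its check, every lookup goes to the remaining one.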


This has taken openredis offline: https://twitter.com/openredis

Heroku is also reporting issues:

https://status.heroku.com/incidents/554


All of mine just started magically working


I retract that statement! http://shutter.io/img/vs6jjs/raw


Aaand we're back up now.


Always on a Friday...


Not just any Friday...

    $ date
    Fri Sep 13 12:07:58 EDT 2013


Not just any Friday the 13th... http://en.wikipedia.org/wiki/Programmers'_Day


Not just any Friday the 13th Programmer's Day... http://www.holidayinsights.com/other/fortunecookie.htm


Down in Manhattan



