Help, my AWS instance has fallen and it can't get up...
Our first day back after the festive break was an eventful one! Just as we were easing back into working life, our AWS instance, a nicely optimised Fedora AMI that served both our suite of Tegel-hosted sites and the core testing area for our new Struktor application platform, suddenly went silent!
While I thanked my lucky stars this didn't happen while I was stuffing my face over Christmas dinner, or glued to the TV watching the Boxing Day Test, I was nonetheless perplexed and a little lost.
The symptoms were strange. The instance appeared to be running when we checked with the ec2-describe-instances call, but no matter what we tried, whether HTTP, SSH or SFTP, we couldn't raise it at all. We checked further, and all our S3 data and the Elastic Block Store (EBS) data appeared to be intact (thank jeebers for that!), but the EC2 instance was toast.
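For anyone wanting to confirm the same "running but unreachable" symptom quickly, here is a minimal Python sketch that simply tries to open a TCP connection on the SSH and HTTP ports. The function name and the default port list are my own assumptions for illustration, not something we ran at the time.

```python
import socket


def reachable_ports(host, ports=(22, 80), timeout=3.0):
    """Return {port: True/False} for whether each TCP port accepts a connection."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout or DNS failure
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

If every port comes back False while the EC2 API still reports the instance as running, the problem is likely somewhere between you and the instance rather than in anything you deployed.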
So we raised the topic on the AWS EC2 forums and got a fairly quick response, which was nice:
Hi James,
We are investigating a misbehaving network device that seems to be affecting connectivity to a small number of instances. We are working to fix or replace that device. You can relaunch your instance or wait and we should be able to restore connectivity.
Regards,
JoeJ
Hmm, that is strange, but at least it wasn't something we did. As you can imagine, we were mentally running through all the things we might have done to bork the server, so it was nice to hear we were in the clear.
After some more checking we decided the best course of action was indeed to rebuild the server on a new instance. However, this posed a few issues:
- The running instance could not be stopped; it hung in the terminating state indefinitely
- It proved difficult to detach the EBS volume from the running instance unless we used the force option
- We had neglected to set up Elastic IPs, so the IP attached to this instance was lost to us, meaning all our domains needed to be repointed to the new server.
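As a rough illustration of the forced detach, here is a hedged Python sketch written against the boto3 EC2 client (we used the EC2 command-line tools at the time, so boto3 is an assumption here). The `ec2` argument is expected to be something like `boto3.client("ec2")`, and the function name is hypothetical.

```python
def force_detach(ec2, volume_id, instance_id):
    """Force-detach an EBS volume from an unresponsive instance, then wait for it."""
    # Force=True skips the clean unmount, so unflushed writes can be lost.
    # Only reach for it when the instance cannot be logged into at all.
    response = ec2.detach_volume(
        VolumeId=volume_id,
        InstanceId=instance_id,
        Force=True,
    )
    # Block until the volume reports "available" and can be attached elsewhere.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    return response
```

Since a forced detach can leave the filesystem in an unclean state, it is probably worth running a filesystem check after attaching the volume to the replacement instance.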
Once we realised what was involved, the disaster recovery proved relatively straightforward, if inconvenient:
1. Launch a new instance from the saved AMI snapshot
2. Attach and mount the detached EBS volume on the new server, and test the instance
3. Grab yourself an Elastic IP if you don't have one already and associate it with the new instance (this article proved most useful)
4. Make sure all your domains have the new IP address listed
5. Test
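The first three recovery steps can be sketched in Python roughly as follows. This is an outline under assumptions, not the exact procedure we followed: it uses the boto3 EC2 client (passed in as `ec2`, e.g. `boto3.client("ec2")`), the function name and parameters are hypothetical, and it allocates a VPC-style Elastic IP.

```python
def recover_instance(ec2, ami_id, instance_type, zone, volume_id, device):
    """Launch a replacement instance, attach the EBS volume, give it an Elastic IP."""
    # Launch in the same availability zone as the volume,
    # because an EBS volume can only attach to an instance in its own zone.
    launched = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
    instance_id = launched["Instances"][0]["InstanceId"]
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # Attaching only exposes the block device; the filesystem still
    # has to be mounted from inside the instance afterwards.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)

    # Allocate an Elastic IP and point it at the new instance.
    address = ec2.allocate_address(Domain="vpc")
    ec2.associate_address(
        AllocationId=address["AllocationId"],
        InstanceId=instance_id,
    )
    return instance_id, address["PublicIp"]
```

The returned public IP is the address the domains would then be repointed to. If you already hold an Elastic IP, you would skip the allocation and just associate the existing one.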
The beauty of the Elastic IP, as we discovered, is that if this were to happen again we can simply assign that IP to a new instance and skip step 4 (repointing the domains), which can be a huge timesaver if you have a lot of domains with different registrars.
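Either way, it can be handy to confirm which domains actually resolve to the address you expect. Here is a small Python sketch using only the standard library; the function name and the injectable `resolve` parameter are assumptions of mine, added so the check can be exercised without touching real DNS.

```python
import socket


def domains_needing_repoint(domains, expected_ip, resolve=socket.gethostbyname):
    """Return (domain, resolved_ip) pairs that do not point at expected_ip."""
    stale = []
    for domain in domains:
        try:
            ip = resolve(domain)
        except OSError:
            # Covers socket.gaierror: the name did not resolve at all.
            ip = None
        if ip != expected_ip:
            stale.append((domain, ip))
    return stale
```

An empty list means every domain already points at the Elastic IP. Bear in mind that DNS caching can make a freshly updated record look stale for a while.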
A learning experience to be sure, but now we know the recovery should be pretty quick and painless!
