First month with Amazon EC2
We started evaluating and using Amazon EC2 almost a month ago. Here are our 'lessons learned'.
Be prepared...
We have evaluated and used encryption with OpenSolaris and ZFS on EBS, and we have successfully rebundled the instance to migrate our Subversion repository to this server. Although we had always typed the encryption password correctly since the migration, we finally decided to check a failure scenario: what happens if we type it wrong, and can we lose data that way? In case something did go wrong, we created EBS snapshots of the volumes first. After some testing, we consider data loss unlikely, because a wrong password produces something like the following:
Initial state:
pool: safe
state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
see: http://www.sun.com/msg/ZFS-8000-72
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
safe FAULTED 0 0 0 corrupted data
/dev/lofi/1 ONLINE 0 0 0
pool: storage
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Fri Jul 24 12:42:15 2009
config:
NAME STATE READ WRITE CKSUM
storage ONLINE 0 0 0
mirror ONLINE 0 0 0
c7d2 ONLINE 0 0 0
c7d3 ONLINE 0 0 0 43.5K resilvered
errors: No known data errors
# zfs mount safe
cannot open 'safe': I/O error
So all we need to do is remove the lofi storage with lofiadm and remount it, this time with the right password; that solves the problem.
Automate...
It is always a good idea to document things, and this is especially true with a sometimes transient service like Amazon EC2. It turned out that there was a startup bug in the official OpenSolaris bundle, and to get the fix you need to rebundle your server on the new version. We did, as we had run into this bug a few times, and the documentation came in very handy: we only had to copy and paste the commands into the console and wait for the output, since most of our documentation read like a shell script.
The next level of automation will be to create expect scripts that set up and bundle full images automatically. I'd suggest that anyone starting with EC2 write the setup scripts in this latter fashion from the beginning. For hard-core Java people like myself, ExpectJ or Enchanter are viable options too, but the ultimate solution is to use something like JSch and Groovy to control every aspect of the communication.
Automate, automate...
When we start an instance, we attach the drives and the elastic IP, then execute a few commands to mount the encrypted storage and start the services. This is a very boring process, and fortunately you can automate it too:
- Use the Amazon EC2 command line tools to query information from your available resources.
- Tag your resources according to your service needs (e.g. if you have a Redmine server, put the redmine tag on both the EBS volume and the elastic IP of that instance).
- Write scripts that process these tags with the help of the above-mentioned command line tools and attach the drives and IP automatically.
- Execute other scripts (e.g. the encryption) on the running instance to fire up everything.
Even if you use encryption, late service startup or other exotic requirements, you can reduce the number of manual steps to a very small number (1-5, including entering the password).
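The tag-processing step could be sketched roughly as follows. This is only an illustration under assumptions: the inventory format (one resource per line: type, id, tag, optional device) and the plan_attach function are hypothetical and would come from your own wrapper around the EC2 command line tools, and the ids, IP and device names are placeholders. The sketch prints the ec2-attach-volume and ec2-associate-address commands instead of running them, so the plan can be reviewed first.

```shell
#!/bin/sh
# plan_attach TAG INSTANCE_ID < inventory
# Hypothetical inventory format, one resource per line:
#   <type> <resource-id> <tag> [device]
# Prints one command per resource carrying TAG; nothing is executed.
plan_attach() {
    tag=$1
    instance=$2
    while read -r type id rtag device; do
        # Skip resources that belong to another service.
        [ "$rtag" = "$tag" ] || continue
        case $type in
            volume)  echo "ec2-attach-volume $id -i $instance -d $device" ;;
            address) echo "ec2-associate-address -i $instance $id" ;;
        esac
    done
}

# Example run with placeholder ids; the device value is passed through
# unchanged, so use whatever naming your image expects.
plan_attach redmine i-12345678 <<'EOF'
volume  vol-11111111 redmine /dev/sdf
address 203.0.113.10 redmine
volume  vol-22222222 svn     /dev/sdg
EOF
```

Once the printed plan looks right, the same output can be piped to sh to actually perform the attach and associate steps.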
Automate, automate, automate...
Sometimes it is not known before the server setup how often you will want backups or report processes. Rebundling the server just to add a new crontab entry is a thankless task for anyone involved. It is better to prepare the bundle image with a few cron jobs that might never be used; if we do need them later, we do not have to rebundle the image. For example, the following commands help to define an hourly report script:
export EDITOR=nano
crontab -e
# 58 * * * * [ -x /safe/home/root/hourly-report.sh ] && /safe/home/root/hourly-report.sh
As you can see, this script is placed in the '/safe' directory, which is on the encrypted volume. If for some reason the encryption or mount fails, or if there is no such file at that place, there will be no error: the [ -x ... ] test ensures the script is executed if and only if it is present and executable. Placing it on the encrypted volume also lets us keep a few more confidential items there, e.g. our script can encrypt the report mail, or use some sftp mechanism to deliver the report to a remote site.
Of course, the type and variety of scripts you define in your crontab are entirely up to you.
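To see how the [ -x ... ] guard behaves outside cron, here is a small sketch. The run_if_executable function is a hypothetical helper, not part of our setup; it wraps the same test used in the crontab line. /bin/sh stands in for a report script that is present and executable, and /nonexistent/hourly-report.sh for one that is missing because the volume was not mounted.

```shell
#!/bin/sh
# run_if_executable PATH [ARGS...]
# Runs PATH only when it exists and is executable. When the guard skips
# the script, nothing is printed and the return status is non-zero.
run_if_executable() {
    script=$1
    shift
    [ -x "$script" ] && "$script" "$@"
}

# Present and executable: the command runs.
run_if_executable /bin/sh -c 'echo "report sent"'

# Missing (e.g. the encrypted volume is not mounted): silently skipped.
run_if_executable /nonexistent/hourly-report.sh || echo "skipped"
```

The second call prints only the "skipped" marker added for the demonstration; the guard itself stays silent, just as the crontab entry does.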
Be patient...
With the ElasticFox plugin we have encountered some strange problems, e.g. sometimes it takes a very long time to get the list of KeyPairs. One impatient team member clicked on the 'create' button and typed the same name we had used previously, which silently removed our old key and put a new one in its place. We distributed the new KeyPair internally again, but this is a silly mistake that is better avoided.
Labels: automation, aws, encryption, expect, opensolaris, zfs

