Wednesday, December 22, 2010

Using Amazon EC2 to speed up matlab optimisation III: Getting it all working together, and some speed results

OK, this is the final part of the tutorial.
In part 1, you should have created the matlab server to listen for commands. In part 2, you should have started an EC2 instance, installed the necessary software, and saved it for future use.
In part 3, we will write a client in Matlab to send off optimisation jobs to the servers to run and get back the results.
I have implemented this as a matlab class, @socket_client.
The code is available to download
The constructor (socket_client.m) just takes as an argument the location of the ssh key for accessing the machines. To use the default location, the constructor is run as follows:
s = socket_client();
The next step is to update the server list:
[s,servers] = updateserverlist(s);
The program updateserverlist.m calls the program findinstances.m, which uses the program ec2-describe-instances, which should have been installed as part of the EC2 API. In this way, all servers that have been started (as described in part 2) will be utilised.
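As a rough illustration of the kind of filtering findinstances.m has to do (this is a sketch, not the code from the download), the running servers can be pulled out of the ec2-describe-instances output with a short shell function. It assumes the tab-separated layout printed by the EC2 API tools, where each INSTANCE line carries the public DNS name in the fourth field and the state in the sixth. The name running_hosts is made up for this example.

```shell
# Print the public DNS name of every running instance, one per line.
# Reads ec2-describe-instances output on stdin.
# Assumption: tab-separated INSTANCE lines with the public DNS name in
# field 4 and the instance state in field 6.
running_hosts() {
    awk -F'\t' '$1 == "INSTANCE" && $6 == "running" && $4 != "" { print $4 }'
}

# Usage (needs the EC2 API tools and your credentials to be set up):
#   ec2-describe-instances | running_hosts
```

Instances that are still starting up have no public address yet, so they are skipped until their state changes to running.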
Then a set of jobs can be constructed, for example:
% Set up 20 jobs
for k=1:20
    joblist(k).command = codes.decompose;
    joblist(k).arguments = {a(k).time,a(k).vel,[],[],[]};
end
Then finally they can be run:
[results,finishtimes] = runjobs(s,joblist,1);
The runjobs code keeps track of what each server is doing. When a server finishes a job, runjobs saves the results and assigns that server the next job from the queue.
And that is it! I'll be happy to hear if you get this working yourself or have an alternative solution to this problem.

Using Amazon EC2 to speed up matlab optimisation Part II: Setting up an Ubuntu EC2 instance to run the compiled Matlab code

In this tutorial (part 2 of 3), I go through the steps to set up an Amazon EC2 instance to run compiled Matlab code. You need to go through this procedure the first time to install your software. Subsequent times, you can just re-run (many times simultaneously if you like) your instance.

Before running this, you will need to set up an Amazon AWS account (including giving them your credit card details to pay for this).

I have used mostly the AWS management console because it is easy to use.

The first step is to log onto the AWS management console:
https://console.aws.amazon.com/s3/home


and select the EC2 tab. Under region, select whichever region is closest to you (for fastest performance - I chose the Asia-Pacific one, in Singapore).

Before starting the instance, there is some work to do first!

You need to create a key. Select "key-pairs" (on the bottom of the left menu), give it a name, then save it somewhere where you will remember (you will need it later!). The key ensures that only you can log into your server.

You also need to define some "security policies". These basically say which ports will be open so that the outside world can communicate with the machine. Click on "Security groups" on the left menu, then "Create security group". You will need one for ssh. You can call it "ssh", description "ssh". At the bottom of the screen, select "ssh" from the options, and click on "save". In the source, you can put your computer's IP if you want to make sure only you can log onto the computer (I didn't bother).


I also defined another security policy for the Matlab server. I decided (arbitrarily) to use ports 9000-9100 for my application. So repeat the process, but call it "Matlab listeners", description "Matlab listeners", and then down the bottom, select "Custom", "TCP protocol", From port 9000, To port 9100, source 0.0.0.0/0 (i.e. the whole internet) and click on save.



Now we are ready to select an "image". Under images (on the left menu), select AMIs. AMIs are Amazon Machine Images. Under viewing, I chose "64-bit", "Ubuntu" and in the text field "Lucid" (the name of the latest Ubuntu release). This will give you a list of images that other people have created and shared publicly. I chose to use Ubuntu because I am familiar with it, and I can easily install the same version on my computer to compile the code (and be confident that both are using the same libraries, etc). An added bonus of using Ubuntu (or another open source OS) is that it is free, so there is nothing else to pay (apart from the AWS fees).

I selected ami-9c2957. Right-click on it and select "Launch instance".

In step 1, I selected an "extra large". For now, it is not so important, but when you are actually using it for optimisation you probably want to work out the best trade off between more machines and more cores / machine (and cost!)


For instance details (step 2), I just left the defaults


For create key pair (step 3), select the key that you created earlier:
For security groups (step 4), select the "ssh" and "Matlab listeners"
At step 5, review and make sure everything is OK, then press "Launch". Congratulations, you have launched your first instance (may it be the first of many).

Now click on "instances" on the left menu, and your instance should appear (it may take a little time for it to start up). Right mouse click on it and press "Connect". It will give you instructions on how to ssh to your server. Rather than connecting to root@XXXXX, connect instead to ubuntu@XXXXX. This is because Ubuntu doesn't like you logging in as root. If you have linux / OSX, you can ssh from any terminal window. If using Windows, try PuTTY.

Once logged in, I updated the server (as you are logged in as ubuntu rather than root, these need sudo):

sudo apt-get update
sudo apt-get upgrade


I then copied over the Matlab MCR (found on my Ubuntu distribution in /opt/MATHWORKS_R2010A/toolbox/compiler/deploy/glnxa64/MCRInstaller.bin) using sftp in another window. You could also use scp. Replace the 111-111-111-111 with the address of your instance:

sftp -oIdentityFile=~/.ec2/singapore-key.pem ubuntu@ec2-111-111-111-111.ap-southeast-1.compute.amazonaws.com
sftp> put MCRInstaller.bin
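For the scp route, a minimal sketch could look like the following. The function name upload_to_instance is invented for this example. The key path and host are whatever you used for sftp above, and the file ends up in the home directory of the ubuntu user.

```shell
# Copy one file to the home directory of the ubuntu user on an instance.
#   $1 = path to the ssh key (.pem), $2 = instance address, $3 = file to send
upload_to_instance() {
    key=$1
    host=$2
    file=$3
    scp -i "$key" "$file" "ubuntu@$host:"
}

# Usage (substitute your own key and instance address):
#   upload_to_instance ~/.ec2/singapore-key.pem \
#       ec2-111-111-111-111.ap-southeast-1.compute.amazonaws.com MCRInstaller.bin
```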


The Matlab MCR is 200+ MB, so it takes a while . . . The MCR is needed to run compiled Matlab programs on computers that do not have Matlab installed.

Run the installer:

sudo ./MCRInstaller.bin -console

(to run it without the gui)

Press enter a few times (defaults for everything should be fine).

Install necessary software:
sudo apt-get install zip unzip ruby openssl libopenssl-ruby curl libxpm4 libxt6 libxmu6 libxp6

Download Amazon AMI tools:
curl http://s3.amazonaws.com/ec2-downloads/ec2-ami-tools.zip > ec2-ami-tools.zip

Install them:

mkdir ec2
cp ec2-ami-tools.zip ec2
cd ec2
unzip ec2-ami-tools.zip
ln -s ec2-ami-tools-* current


Edit the .bashrc file (e.g. nano ~/.bashrc) and add to the end:

export EC2_AMITOOL_HOME=~/ec2/current
export PATH=${PATH}:~/ec2/current/bin


Also make a matlab directory to store the files
mkdir matlab

In order that the servers will always run the latest version, I copied the server code into an Amazon S3 store, and when the servers are run, it will copy the latest version each time.
The easiest way to create a "bucket" is with the management console (under the S3 tab). I called mine "jasonfriedman.software".
Now copy the compiled Matlab code that was created in Part 1 into the bucket. Using the management console, there is an "upload" button. I uploaded the two files needed, socket_server and run_socket_server.sh
I made each of them public (right click on the files) so that the EC2 instances can download them. If you then select properties, you can get the url of the file (which you will need).
Then I wrote a small perl script to count the number of processors and run that number of servers. It looks at /proc/cpuinfo to count the number. This is not very robust but should do for Amazon EC2 instances. At the beginning, it also downloads the latest version of the servers from the S3 store (as it is stored also on AWS, the transfer is quick and free). Write the script using your favourite text editor, put it in matlab/runservers, and make it executable (chmod +x matlab/runservers) so that it can be run at startup.


#!/usr/bin/perl -w

system('wget https://s3.amazonaws.com/jasonfriedman.software/socket_server -O /home/ubuntu/matlab/socket_server');
system('wget https://s3.amazonaws.com/jasonfriedman.software/run_socket_server.sh -O /home/ubuntu/matlab/run_socket_server.sh');
system('chmod a+x /home/ubuntu/matlab/run_socket_server.sh /home/ubuntu/matlab/socket_server');
my $numCPUs = `cat /proc/cpuinfo | grep processor | wc -l `;
chomp($numCPUs);

print "There are $numCPUs CPUs\n";

# Now run an instance for each of the CPUs
for (my $i = 1; $i<= $numCPUs; $i++) {
        my $port = $i + 9000;
        system("/home/ubuntu/matlab/run_socket_server.sh /opt/MATLAB/MATLAB_Compiler_Runtime/v713/ $port &");
}
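If you want the CPU count to be a little less fragile, one option (a sketch, not what my script does) is to count only the lines of /proc/cpuinfo that start with "processor", so that a model name containing the word is not counted by mistake. The function names count_cpus and server_ports are made up for this example. The ports follow the same 9000 + i rule as the perl script, so they stay inside the 9000-9100 range opened in the security group as long as there are at most 100 cores.

```shell
# Count logical CPUs from a cpuinfo-style file (default: /proc/cpuinfo).
# Only lines beginning with "processor" are counted.
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Print the port for each of n servers, one per line: 9001, 9002, ...
server_ports() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        echo $((9000 + i))
        i=$((i + 1))
    done
}

# Usage on the instance:
#   server_ports "$(count_cpus)"
```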


The next step is to make the instance run the Matlab servers by itself when it starts up. This will make it easy to start up many servers.

We do this by adding one line to /etc/rc.local, which is run on each reboot:
/home/ubuntu/matlab/runservers
(put it one line before the last line (exit 0))
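If you would rather script this edit than do it by hand, here is a minimal sketch. The function name add_to_rc_local is invented for this example. It assumes the file ends with a line that is exactly "exit 0", as the default Ubuntu rc.local does, and it does nothing if the line is already there. It writes back into the existing file so the file's permissions are left alone.

```shell
# Insert a line into an rc.local-style file just before the final "exit 0".
#   $1 = file to edit, $2 = line to insert
add_to_rc_local() {
    rc_file=$1
    line=$2
    # Already present: nothing to do (keeps the edit safe to repeat).
    grep -qxF -- "$line" "$rc_file" && return 0
    tmp=$(mktemp) || return 1
    # Print the new line once, immediately before the first bare "exit 0".
    awk -v line="$line" '$0 == "exit 0" && !done { print line; done = 1 } { print }' \
        "$rc_file" > "$tmp" && cat "$tmp" > "$rc_file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage (run as root, since /etc/rc.local is owned by root):
#   add_to_rc_local /etc/rc.local /home/ubuntu/matlab/runservers
```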

The final step is to save the instance so that you don't need to go through all this installing each time. If you have used EBS, you can do it with the management console - just right click on the instance and select "Create Image (AMI)". Then next time, you can run this image as you left it (rather than running someone else's image). If not, then instructions on how to do it are here: http://instantbadger.blogspot.com/2009/09/how-to-create-and-save-ami-image-from.html
This tutorial continues in Part 3.

Note: some of the instructions on this page were modified from:
http://robrohan.com/2009/01/30/saving-a-customised-linux-amazon-instance-ec2-and-s3/

Using Amazon EC2 to speed up matlab optimisation I: Writing a socket interface in Matlab to send / receive the commands

The aim in this tutorial (part 1 of 3) is to have a small program running in Matlab on your computer, which will send off requests to a program (compiled Matlab code) running on an Amazon EC2 server or servers.

I am using sockets to do the communication, and in Matlab I use the free msocket toolbox to do this (you will need to download it and add it to your matlab path). The benefit of this toolbox is that it allows you to send matlab variables between machines. In this case, I send a matlab structure containing the command, and the parameters.

The architecture I use is to have a single server running on every core of the target machine. Each server waits for a connection. Once it receives one, it runs the desired program and returns the results.

The source code for the entire program can be found here:
socketserver.m. It will also require the program messagecodes.m


It specifies which port to listen on:


socket = mslisten(port);


Then there is an endless loop that waits for a connection:

% Keep listening until a connection is received
sock = -1;
while sock == -1
sock = msaccept(socket,0.0000001);
drawnow;
end


Once a connection has been accepted, a confirmation is returned to the creator:

m.accepted = 1;
mssend(sock,m);


and another loop is started to wait to receive commands:

success = -1;
while success<0
    [received,success] = msrecv(sock,0.0000001);
    drawnow;
end


I use a "switch" command to execute the appropriate command (in this example,
there is only one, but there is no reason not to have multiple possible commands).

In this case, it is executing the "decompose" command (an optimisation program I have written).
After running, it sends back the result in the rv variable:


switch received.command
case {codes.decompose}
[time,vel,numsubmovements,method,algorithm] = deal(received.arguments{:});
[rv.best,rv.bestresult,rv.bestfitresult] = ...
decompose(time,vel,numsubmovements,method,algorithm);
mssend(sock,rv);


Once a client has finished running the program, it can close the socket, which is dealt
with by the server as follows:


case {codes.closesocket}
msclose(sock);
break;


The break causes it to leave the innermost while loop, and wait again for a new connection.

In order to use this code on Amazon EC2 (without a license server), it is necessary to first compile it. You will need to have a license for the Matlab compiler (available on the
computer doing the compiling, but not on the one running the final program). Note that you will
need to compile this on a machine similar to the one you are planning on running it on
(e.g., I compiled mine on a 64-bit ubuntu machine). I installed ubuntu as a Virtualbox
image as I don't have a "real" ubuntu machine available.

Then from inside matlab, it is as simple as running:


mcc -m socket_server


and matlab will compile it for you. If this is the first time using mcc, you may
have to answer some questions. Part 2 describes how to upload this server to EC2.

Now for the client. The client has to connect to the server:


sock = msconnect(address,port);


It then can send commands to the server:


m.command = codes.decompose;
m.arguments{1} = time;
m.arguments{2} = vel;
m.arguments{3} = numsubmovements;
m.arguments{4} = method;
m.arguments{5} = algorithm;
success = mssend(sock,m);


It then needs to wait for a result:

[thisrv,success] = msrecv(sock);


Then the return value can be used as desired.

Part 2 continues by explaining how to set up Amazon EC2 to run the server component.

Part 3 of the tutorial will describe an automated way to run many servers and collect the results.

Using Amazon EC2 to speed up matlab optimisation

I run lots of optimisation programs in Matlab as part of my research. One major problem is that they can be very slow, especially if you have lots of variables. One solution is to use more computer hardware to run the procedure faster. The more the better. Amazon offer their EC2 service, which allows you basically to rent computers by the hour. So rather than running your Matlab software on one computer, you can rent a lot of computers say for a few hours or a day and get the optimisation run much, much quicker.

I received an academic research grant from Amazon (Thanks!) which consisted of $3500 credit for their AWS services (including EC2). Mathworks have published a "white paper" on how to use EC2 with Matlab, but it relies on having available licenses for the instances of Matlab running on the EC2 servers, and those servers being able to access your license server. Here at MACCS the license server has a limited number of licenses, and they are behind the university firewall, so there is no way for the EC2 instances to use them.

My solution was to use the Matlab compiler to compile the optimisation part of my work into a stand-alone component. Then, I will get my computer running matlab to connect to the computer(s) running on EC2, send them commands, and get the results. I chose to do this using a socket interface.

These tutorials, split into three parts, will explain the process I went through (mostly so that I can remember how to do it next time!):

  1. Writing a socket interface in Matlab to send / receive the commands
  2. Setting up an Ubuntu EC2 instance to run the compiled Matlab code
  3. Writing matlab software to communicate with the server.