Barry Grant < http://thegrantlab.org// >
2020-12-06 (22:38:26 on Sun, Dec 06)

Background

The goal of this hands-on session is to show you how to configure and launch your very own new computer in the cloud. This will allow you to do computational work on remote hardware with capabilities beyond those you may have at hand. For example, in bioinformatics we often need to analyze datasets that are too large (or would take too long) to analyze on our local lab computers.

What is cloud computing?

Cloud computing allows access to arbitrary amounts of compute resources that are physically located elsewhere. Typically you pay for what you use rather than having to pay upfront for new hardware and all its associated maintenance costs. Cloud computing resources exist on servers managed by cloud providers. The most popular cloud providers include Amazon (Amazon Web Services, a.k.a. AWS), Google (Google Compute Engine) and Microsoft (Azure). For academic work in the US we also have NSF/XSEDE (who manage access to Jetstream and other supercomputers around the country).

At the time of writing Amazon’s AWS is the market leader by a rather wide margin. We will focus here on learning the basics of AWS but the concepts apply to other cloud computing services as well. The main differences are the brand-specific acronyms and the terminology used to describe various services and actions.

Important Cloud concepts

Virtual Machines (or VMs for short) emulate the architecture and functionality of physical computers. However, they do not sit on our desks but rather live “in the cloud”. They are actually portions of large servers housed in remote data centers, not individual machines in the conventional sense, and hence we call them virtual machines. In Amazon’s AWS parlance VMs are called EC2 instances.

Side-note: EC2 stands for Elastic Compute Cloud and by now you should appreciate that this is an area with lots of acronyms made worse by the fact that different vendors use different terminology for similar things.

EC2 instances can be created using different operating systems (e.g. Linux, Windows and macOS) with different CPU, memory, storage and GPU sizes.

Being able to access and use VMs, like we are going to learn here, can eliminate the need to invest in new expensive hardware and avoids the hassle of configuration and maintenance downtime. Many feel that this will become more prevalent in biomedical research in the near future and hence being able to use cloud computing effectively is an important and in-demand skill for a growing number of employers.

Once you have these skills you can launch as many virtual servers as you need, configure their security and networking, and manage extra storage options (more on this later). Amazon EC2 also optionally enables auto-scaling up or down to handle changes in requirements (such as spikes in usage) without having to pay up-front. This is one reason why Netflix, Uber and of course Amazon itself are built on cloud computing resources.

Accessing the AWS console

The AWS console is a password protected website where you can configure, launch and control your EC2 instances. We will not cover all its functionality here but rather focus on how to launch new instances.

If you are an enrolled student in this course you can access your own AWS console at https://awsed.ucsd.edu/. This will ask for your regular UCSD single-sign-on details and then you should be able to select our course and be re-directed to the AWS console.

To launch your first instance click the large orange “Launch instances” button on the upper right. This will take you to the “quick start” page below. Browse through the list of different machine types there.

Acronym alert: These machine type options are known as Amazon Machine Images (or AMIs for short). Basically, they are preconfigured templates for launching a virtual machine instance. They package the software you may need for your server (including the operating system and possibly additional applications).

Select “Ubuntu Server 20.04 LTS (HVM), SSD Volume Type” with the default 64-bit (x86) processor type.

Choosing an instance type

This will bring you to a series of pages with options to further configure your instance. First up is “2. Choose Instance Type”. Scroll down to see the wide range of options you can select from for number of virtual processors (vCPUs), memory and storage (think of this as equivalent to hard-drive space for now).

We will select “m5.2xlarge”, which has 8 vCPUs and 32 GiB of memory. This is a very respectable choice for most small to medium size bioinformatics work, although typical human genome work (sequence read mapping etc.) may often need more memory than this. Feel free to explore other options here later.

Configuring security settings

With “m5.2xlarge” selected, jump to step number 6 “Configure Security Group”. This is where you can control how you and others can access your computer in the cloud.

Click “Select an existing security group” and choose the BIMM143/BGGN213 option. If, and only if, you are doing this on your own AWS account (i.e. not as part of the official class) then you will want to create a new security group and make sure you have SSH access on port 22 enabled, along with adding a new rule for HTTP access on port 80 and TCP access on port 53XX.
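For readers on their own AWS account, the same two rules (SSH on port 22, HTTP on port 80) could also be added from the command line. This is only a sketch and not part of the class setup: it assumes the AWS CLI is installed and configured, that your account uses a default VPC (otherwise the group would be referenced by ID rather than name), and the group name is hypothetical. The third rule is left out because its port number is not given in full above. By default the sketch only prints the commands it would run.

```shell
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to run it for real.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# make_security_group NAME: create a group, then allow inbound SSH (22) and HTTP (80).
make_security_group() {
  sg_name="$1"
  run aws ec2 create-security-group \
    --group-name "$sg_name" \
    --description "ssh-and-http-access"
  for port in 22 80; do
    # 0.0.0.0/0 means "any IPv4 address".
    run aws ec2 authorize-security-group-ingress \
      --group-name "$sg_name" \
      --protocol tcp --port "$port" --cidr 0.0.0.0/0
  done
}

# "my-bioinf-sg" is a hypothetical name; choose your own.
make_security_group "my-bioinf-sg"
```

Printing first and executing second is a deliberate choice here: it lets you read exactly what would be opened to the internet before anything changes on your account.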

The page displayed next gives you one last chance to change configuration settings (such as adding more memory or CPUs). It will also likely display a message or two about how your instance is not eligible for the free usage tier and how your instance may be accessible from any IP address. We want the latter so that we can access the instance from off campus (i.e. from home).

Finally, click the blue “Launch” button.

Getting your private key file

One last but very important step is creating and downloading a special key file that will allow you to access your instance from the command line.

  1. Select “Create a new key pair” (option 1 below, or use a previous one if you already have one and know what you are doing here).
  2. Name your new key file something that you will remember, with no spaces or funny characters. I strongly suggest using your first name, an underscore, then bioinf (e.g. barry_bioinf) so you can find and use it later.
  3. Download the private key to your computer.
  4. Click the blue “Launch Instance” button (step 4 below).

You will find out more about what these key files are in the next section.
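The naming advice in step 2 can be checked with a few lines of shell before you type a name into the console. This is a minimal sketch of one conservative convention (letters, digits, underscores and hyphens only); it is an assumption made for illustration, not a rule taken from AWS, which may well accept other characters. The function name is made up for this example.

```shell
# is_safe_key_name NAME: succeed only if NAME is non-empty and uses
# nothing but letters, digits, underscores and hyphens.
is_safe_key_name() {
  case "$1" in
    "" | *[!A-Za-z0-9_-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Try one good name and two names that would be awkward on the command line.
for name in "barry_bioinf" "barry bioinf" "barry's key!"; do
  if is_safe_key_name "$name"; then
    echo "ok:     $name"
  else
    echo "avoid:  $name"
  fi
done
```

Names without spaces or punctuation matter later because the key file name is typed directly into shell commands, where such characters would need quoting.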

If everything went to plan then your new computer in the cloud is now starting up and will shortly (~2-5 mins) be available for you to log into using your private key that you downloaded in the previous step.

Connecting to your instance

To see how to connect, click either the instance ID or “View Instances” at the bottom of the page.

Then click “Connect” > “SSH client” and copy the example ssh command displayed there.

We will use a slight variant of this command in your favorite terminal application to access your EC2 instance via ssh (a secure shell connection). This is covered in detail in the next section (a separate page), but briefly, for completeness:

In your Terminal, change directory to where you downloaded the private key file:

cd ~/Downloads

Change the permissions of the key file so that only you can read it (make sure to use the name of your key file here and not mine - it is unlikely that your name is also Barry - if it is, hi Barry, nice name!):

chmod 400 barry_bioinf.pem

Now it is time to ssh into your EC2 instance with this key file, using the command you copied from the web site previously:

# Use YOUR copied ssh command from above, e.g.
ssh -i barry_bioinf.pem ubuntu@ec2-44-234-27-254.us-west-2.compute.amazonaws.com
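The chmod step matters because ssh will refuse to use a private key file that other users on your computer are able to read. The sketch below uses a throwaway temporary file as a stand-in (it is not a real key) to show what mode 400 looks like in a long listing, and then assembles, without running, an ssh command from its three parts. The key file name and host name are example values in the style used above; yours will differ.

```shell
# Stand-in for a private key: an empty temporary file (not a real key).
demo_key="$(mktemp)"

chmod 644 "$demo_key"                        # readable by everyone: too open for ssh
before="$(ls -l "$demo_key" | cut -c1-10)"   # keep only the permission string

chmod 400 "$demo_key"                        # read-only, and only for the owner
after="$(ls -l "$demo_key" | cut -c1-10)"

echo "before: $before"   # before: -rw-r--r--
echo "after:  $after"    # after:  -r--------
rm -f "$demo_key"

# The three parts of the ssh command: key file, user name, host name.
key_file="barry_bioinf.pem"
user="ubuntu"
host="ec2-44-234-27-254.us-west-2.compute.amazonaws.com"

# Print the assembled command for inspection rather than connecting.
ssh_cmd="ssh -i $key_file $user@$host"
echo "$ssh_cmd"
```

The user name ubuntu goes with the Ubuntu Server image selected earlier; the host name changes every time a new instance is launched, which is why copying the command from the console is the safest route.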

Again, this will be discussed in detail in the next section, but here is what success looks like for me…

Next we will get to work running some typical bioinformatics analysis on your shiny new VM.

Side-note: Once you are done with your work, please Stop or Terminate your instance so that we are not charged for it any longer. To do this select your instance then click “Instance State” > “Stop Instance”.
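To see why stopping matters under pay-for-what-you-use billing, here is a small back-of-the-envelope calculation. The hourly rate is a placeholder chosen for easy arithmetic, not a quoted AWS price; look up the current rate for your instance type and region before relying on any number like this.

```shell
# Placeholder hourly rate in USD. This is NOT a quoted AWS price.
hourly_rate="0.40"

# cost HOURS: print the compute charge for that many running hours.
cost() {
  awk -v rate="$hourly_rate" -v hours="$1" 'BEGIN { printf "%.2f\n", rate * hours }'
}

echo "3-hour session:        \$$(cost 3)"
echo "left running 30 days:  \$$(cost $((24 * 30)))"
```

With this placeholder rate a short session costs about a dollar, while a forgotten instance costs a couple of hundred dollars a month. Note also that a stopped instance no longer accrues compute charges, but storage attached to it is typically still billed until the instance is terminated.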
