On YOUR computer
export KEY=~/Downloads/bioinf_barry.pemClass 17: Obtaining and processing SRA datasets on AWS
In bioinformatics we often need to analyze data-sets that are too large (or would take too long) to analyze on our local lab computers. The goal of this hands-on session is to show you how to obtain and work with large biomolecular sequencing data-sets using cloud computing resources. In particular, we will query, download, decompress and analyze data from NCBI’s main Sequence Read Archive (SRA) the world’s largest publicly available repository of high throughput sequencing data.
Cloud computing allows access to arbitrary amounts of compute resources that are physically located elsewhere. Typically you pay for what you use rather than having to pay upfront for new hardware and all its associated maintenance costs. Cloud computing resources exist on servers managed by cloud providers. The most popular cloud providers include Amazon (Amazon Web Services, a.k.a. AWS), Google (Google Compute Engine) and Microsoft (Azure) (Figure 1). For academic work in the US we also have NSF/XSEDE (who manage access to JetStream and other supercomputers around the country).
In our last hands-on session we introduced important cloud concepts and walked through configuring and launching AWS EC2 instances.
The AWS console is a password protected website where you can configure, launch and control your EC2 instances. We introduced this console last day and we will not re-cover that material here beyond the instructions for starting an instance.
If you are an enrolled student in this course you can access your own AWS console at https://awsed.ucsd.edu/ This will ask for your regular UCSD single-sign-on details and then you should be able to select our course and be re-directed to the AWS console (Figure 2):
Note that in the image above I have a stopped instance listed. This is the one I used for our last class. We can re-start this instance for use today. To do this select your instance (by clicking on the radio button next to the stopped instance name). Then click “Instance State” > “Start Instance”. Note that the instance will have a new IP address that we can find from selecting instance and then clicking “Connect” > SSH client. See the connecting to your instance section below.
If you don’t have a previous instance listed you can start a new one by following the same steps from our last class. Briefly, click the orange Launch instances button in the upper left and on the resulting launch page:
Give a useful name for your instance (e.g. a combination of class number and your UCSD username, for example: class17_bjgrant);
Select Ubuntu as the OS and the version Ubuntu Server 22.04
Instance type m5.2xlarge
You can use your previous key pair from the last class or (if in doubt about where to find it) create a new one and use it.
Select existing security group and chose BIMM143/BGGN213 FA20
Configure storage with 100Gib
Verify all your settings in the right hand panel and click “Launch instance”.
After configuring and launching your instance it will take a short amount of time to setup and become available for use.
Once available you can login via SSH in your Terminal as we did previously.
-i bioinf_barry.pem along with my username ubuntu and the EC2 address provided from the AWS console (Figure 4):Note that if it is your first time connecting to your instance it will ask if you are sure you want to connect, to which you can answer yes.
Now that we have successfully gained access to a UNIX based machine with sufficient memory and cpu resources we can introduce the main Sequence Read Archive - currently the largest publicly available repository of high throughput sequencing data.
The SSH and SCP commands for connecting to and copying from your AWS instance can get long and error-prone — a single typo in the key filename or IP address will cause the connection to fail.
One clean solution is to store these values in shell environment variables so that you only have to type (and check) them once.
bioinf_barry.pem and it lives in my Downloads folder (i.e. ~/Downloads):On YOUR computer
export KEY=~/Downloads/bioinf_barry.pemOn YOUR computer
export SERVER=ubuntu@ec2-34-210-54-74.us-west-2.compute.amazonaws.comecho:On YOUR computer
echo $KEY
echo $SERVERNow you can use these variables in place of the full strings for all subsequent ssh and scp commands, making them much shorter and less error-prone:
On YOUR computer
# Connect to your instance
ssh -i $KEY $SERVER
# Transfer results.tsv back to your current directory
scp -r -i $KEY $SERVER:~/results.tsv .Note: These variables only persist for the duration of your current Terminal session. If you close your Terminal and reopen it (or open a new Terminal), you will need to re-run the two
exportcommands above. I suggest you make a note of these in a reference text file on your computer.
The Sequence Read Archive (SRA), https://www.ncbi.nlm.nih.gov/sra is NCBI’s major repository for sequencing data and includes all types of high-throughput sequencing experiments.
Many other domain specific databases will link to entries in the SRA. For example the Gene Expression Omnibus (GEO) that focuses on gene expression data links to SRA for retrieval of raw data from sequencing-based experiments whose processed data is described in a GEO record. To get the actual raw data you need to go to the associated SRA entry.
Sequencing data stored on SRA is highly-compressed to enable efficient storage and downloading. The main consequences of this, beyond faster download speeds, is that we must use special tools to decompress these records into usable FASTQ files.
The SRA-toolkit software package provides functionality to retrieve data from the SRA, and to extract the retrieved records to FASTQ files.
Instructions for obtaining the package on different platforms can be found on the projects GitHub repo. Note that precompiled binaries are available for Ubuntu that we will make use of here.
curl command. Then unzip and untar to be able to use the software:On your AWS instance
# Download
curl -O https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
# Unzip and Untar
tar -zxvf sratoolkit.current-ubuntu64.tar.gzNote the name of the directory that tar created. The name of this directory changes with each release and varies by platform, i.e. it follows the pattern
sratoolkit.<release>-<platform>e.g. at the time of writing a directory calledsratoolkit.3.3.0-ubuntu64was produced.
There should be a new directory called sratoolkit.3.3.0-ubuntu64 (or a similar name with a higher number as the version changes in the future). Check for this directory with your regular ls command. Then cd into the associated bin/ directory, (e.g. Figure 5)
On AWS
ls
cd sratoolkit.3.3.0-ubuntu64/bin
pwdIn the bin/ directory there are lots of programs that we can use to work with SRA data. We will focus on two main programs - prefetch and fasterq-dump.
The first of these, prefetch, retrieves compressed data from SRA. The second, fasterq-dump, reconstructs the actual FASTQ file(s) from the output of prefetch that we want to analyze.
We can check if these programs are working by running them with the --version or --help options.
For example:
On AWS
~/sratoolkit.3.3.0-ubuntu64/bin/prefetch --version
#/home/ubuntu/sratoolkit.3.3.0-ubuntu64/bin/prefetch : 3.0.1Note that even though I can use the TAB key to help it is still cumbersome to type the full PATH to the prefetch program every time we want to run it. To avoid needing to do this we can add its location to a special $PATH environment variable:
On AWS
export PATH=$PATH:/home/ubuntu/sratoolkit.3.3.0-ubuntu64/binNow we will not need to type the absolute PATH every time we want to run one of the sra-tools programs from its bin/ directory, e.g. this will now just work
On AWS
prefetch --version
#prefetch : 3.3.0With this setup complete we are now ready to get some sequencing data to work with.
Let’s say we wanted to obtain the ChIP-seq data from an ovarian cancer DNA sample, such as that cited in:
N Chapman-Rothe et al, “Chromatin H3K27me3/H3K4me3 histone marks define gene sets in high-grade serous ovarian cancer that distinguish malignant, tumour-sustaining and chemo-resistant ovarian tumour cells” Oncogene 32(38):4586 (2013).
Visit the full text link for this manuscript from the link above. Note that the manuscript lists an SRA accession for the project of ‘SRP016075’ under the Accession codes section.
Searching for the corresponding record on the SRA database, we will find a list of the sequencing runs associated with this manuscript (Figure 6).
Specifically look at the ChIP-seq with the H3K4me3 antibody, which can be found at: https://www.ncbi.nlm.nih.gov/sra/SRX193578. See Figure 7:
From the corresponding web page (Figure 7 above), there is a run identifier listed: ‘SRR600956’. This is the SRA identifier we need to obtain the raw sequencing data with the prefetch program.
To download the corresponding data, call the prefetch program (using either its absolute path ~/sratoolkit.3.3.0-ubuntu64/bin/ or if you have followed the instructions above simply its name) with the run accession number as its only argument:
On AWS
cd
pwd
#/home/ubuntu
prefetch SRR600956
#2023-02-27T20:38:08 prefetch.3.0.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
#2023-02-27T20:38:08 prefetch.3.0.1: 1) Downloading 'SRR600956'...
#2023-02-27T20:38:08 prefetch.3.0.1: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
#2023-02-27T20:38:08 prefetch.3.0.1: Downloading via HTTPS...
#2023-02-27T20:38:27 prefetch.3.0.1: HTTPS download succeed
#2023-02-27T20:38:28 prefetch.3.0.1: 'SRR600956' is valid
#2023-02-27T20:38:28 prefetch.3.0.1: 1) 'SRR600956' was downloaded successfullyRun an ls to see the resulting directory - note that the contents here is compressed and not yet usable for further analysis.
To obtain usable raw data we first need to reconstruct the FASTQ file(s) with the fasterq-dump program:
On AWS
fasterq-dump SRR600956This will take some time but eventually result in a FASTQ file, appearing in your current working directory, which constitutes the raw data from the corresponding sequencing run.
In this section we will move on to an arguably more interesting RNA-Seq dataset from a human cancer cell line: two replicates of a cell line treated with CRISPR-Cas9 targeting a gene of interest, and two replicates of the cell line treated with a control construct as described in:
Zhang et al. “Identification of focally amplified lineage-specific super-enhancers in human epithelial cancers”. Nat Genet 2016 Feb;48(2):176-82. PMID: 26656844
We will start with their SRA accession numbers, download their raw data and follow up with a complete analysis from raw reads to gene specific counts, principal component analysis and differential expression analysis.
The GEO entry for the study in question can be found at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72001
Let’s use the SRA toolkit to download the study’s data and extract it into FASTQ format:
Side-note: With single-end reads we will have only one file for each sample, with paired-end reads (which are more common in RNA-seq) we will have two files for each sample as we have here.
On AWS
prefetch SRR2156848
#2023-02-27T20:47:07 prefetch.3.0.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
#2023-02-27T20:47:07 prefetch.3.0.1: 1) Downloading 'SRR2156848'...
#2023-02-27T20:47:07 prefetch.3.0.1: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
#2023-02-27T20:47:07 prefetch.3.0.1: Downloading via HTTPS...
#2023-02-27T20:47:14 prefetch.3.0.1: HTTPS download succeed
#2023-02-27T20:47:15 prefetch.3.0.1: 'SRR2156848' is valid
#2023-02-27T20:47:15 prefetch.3.0.1: 1) 'SRR2156848' was downloaded successfully
#2023-02-27T20:47:15 prefetch.3.0.1: 'SRR2156848' has 0 unresolved dependenciesRecall that the prefetch program prepares temporary files for obtaining sequence data efficiently from the SRA. With this run, you can use fasterq-dump to extract the data in FASTQ format to a file in the current directory.
On AWS
fasterq-dump SRR2156848
#Read 2959900 spots for SRR2156848
#Written 2959900 spots for SRR2156848 Note that as this sequencing run is a paired-end library, this command results in two separate files for the different mate-pairs. Now if we look at the contents of the current directory, we should see the following:
On AWS
ls
#SRR2156848/
#SRR2156848_1.fastq
#SRR2156848_2.fastqA fast approach to analysis of RNA-seq data makes use of the fact that if you only wish to quantify the expression of a gene, you don’t need to know where in the gene the read maps to. In fact, it is enough to know which genes a read might have come from. This is the principle behind Kallisto. In brief, Kallisto compares each RNA-seq read against an indexed table of transcript sequences to find the sets of transcripts that each read could have come from. These sets are then used to estimate the most likely set of transcript counts over the whole RNA-seq library.
On AWS
wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz
# Unzip and Untar the resulting file
___ -___ kallisto_linux-v0.44.0.tar.gzWe can check the program works by calling the kallisto executable with its full PATH:
On AWS
~/kallisto_linux-v0.44.0/kallisto
#kallisto 0.44.0
#
#Usage: kallisto <CMD> [arguments] ..
#
#Where <CMD> can be one of:
#
# index Builds a kallisto index
# quant Runs the quantification algorithm
# pseudo Runs the pseudoalignment step
# h5dump Converts HDF5-formatted results to plaintext
# inspect Inspects and gives information about an index
# version Prints version information
# cite Prints citation information
#
#Running kallisto <CMD> without arguments prints usage information for <CMD>On AWS
# Fill in the blank
___ PATH=$PATH:/home/ubuntu/kallisto_linux-v0.44.0Now we should be able to run it by simply calling kallisto.
Let’s pull the hg19 Ensembl reference transcriptome directly from the ENSEMBLE ftp site:
wget command or curl command for this.On AWS
wget ftp://ftp.ensembl.org/pub/release-67/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.67.cdna.all.fa.gzOn AWS
gunzip Homo_sapiens.GRCh37.67.cdna.all.fa.gzOn AWS
kallisto index -i hg19.ensembl Homo_sapiens.GRCh37.67.cdna.all.faThis command specified that we give this transcriptome index the name hg19.ensembl. This will take a little while.
Once complete we will have an indexed human transcriptome to match the reads against, we can then use Kallisto for fast transcript quantification on our FASTQ files.
To quantify transcripts with Kallisto we use the quant command option, specifying a transcript index (with the -i flag), an output directory (with the -o flag) in which to write results to file, and a list of our Fastq input files:
On AWS
kallisto quant -i hg19.ensembl -o SRR2156848_quant SRR2156848_1.fastq SRR2156848_2.fastqThis again will take some time (but not nearly as long as the alignment based mapping we did on our previous galaxy lab).
SRR2156848_quant directory:On AWS
ls SRR2156848_quant
abundance.h5 abundance.tsv run_info.jsonRather than running kallisto quant separately for each of our remaining three samples, this is a good opportunity to write a simple shell script.
Scripting repetitive tasks is one of the most important habits to develop in bioinformatics. Not only does it save time and reduce the chance of typos creeping into individual commands, it also makes your analysis reproducible — anyone (including your future self) can re-run the entire analysis by executing a single file.
Just like writing functions in R, whenever you find yourself copy-pasting a command and changing only one small thing each time, that’s a signal that a shell script with a loop would serve you better.
nano) to create a script called run_kallisto.sh:On AWS
nano run_kallisto.shCtrl+O, Enter, Ctrl+X):run_kallisto.sh
#!/bin/bash
# Transcript quantification with Kallisto for samples SRR2156849-51
for id in SRR2156849 SRR2156850 SRR2156851; do
echo "Processing sample: $id"
kallisto quant -i hg19.ensembl -o ${id}_quant ${id}_1.fastq ${id}_2.fastq
done
echo "All samples complete!"On AWS
chmod +x run_kallisto.sh
./run_kallisto.shThe echo statements act as progress markers so you can track which sample is currently being processed.
You should now have a *_quant/ directory for each sample.
With the transcript quantification completed, we are now finished our UNIX intensive work on AWS.
We can thus transfer our main results files to date (the various abundance.tsv files) back to our laptop for further analysis.
scp to transfer all your abundance results to this directory.On YOUR computer
# Make sure you are accessing your keyfile and instance
scp -r -i ~/Downloads/bimm143w23.pem ubuntu@ec2-34-219-113-54.us-west-2.compute.amazonaws.com:~/*_quant .To make this command simpler to type you can use your shell environment variables from earlier.
On YOUR computer
# Change these to match your key and instance address:
export KEY=~/Downloads/bioinf_barry.pem
export SERVER=ubuntu@ec2-34-210-54-74.us-west-2.compute.amazonaws.com
# Transfer results back to your current directory
scp -r -i $KEY $SERVER:~/*_quant .Please Stop your AWS instances at this point as we want to conserve funds for future work. To stop your instance:
Return to the AWS console by visiting https://awsed.ucsd.edu/,
Select your running instance,
Click “Instance State” > “Stop” to suspend and stop being charged for resources.
Back on our laptop we can now use R and Bioconductor tools to further explore this large scale data-set.
For example there is an R function called tximport() in the tximport package, which enables straightforward import of Kallisto results.
With each sample having its own directory containing the Kallisto output, we can import the transcript count estimates into R using:
# BiocManager::install("tximport")library(tximport)
# setup the folder and file-names to read
folders <- dir(pattern="SRR21568*")
samples <- sub("_quant", "", folders)
files <- file.path( folders, "abundance.h5" )
names(files) <- samples
txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)1 2 3 4
head(txi.kallisto$counts) SRR2156848 SRR2156849 SRR2156850 SRR2156851
ENST00000539570 0 0 0.00000 0
ENST00000576455 0 0 2.62037 0
ENST00000510508 0 0 0.00000 0
ENST00000474471 0 1 1.00000 0
ENST00000381700 0 0 0.00000 0
ENST00000445946 0 0 0.00000 0
We now have our estimated transcript counts for each sample in R. We can see how many transcripts we have for each sample:
colSums(txi.kallisto$counts)SRR2156848 SRR2156849 SRR2156850 SRR2156851
2563611 2600800 2372309 2111474
And how many transcripts are detected in at least one sample:
sum(rowSums(txi.kallisto$counts)>0)[1] 94561
Before subsequent analysis, we might want to filter out those annotated transcripts with no reads:
to.keep <- rowSums(txi.kallisto$counts) > 0
kset.nonzero <- txi.kallisto$counts[to.keep,]And those with no change over the samples:
keep2 <- apply(kset.nonzero,1,sd)>0
x <- kset.nonzero[keep2,]We can now apply any exploratory analysis technique to this counts matrix. As an example, we will perform a PCA of the transcriptomic profiles of these samples.
Now we compute the principal components, centering and scaling each transcript’s measured levels so that each feature contributes equally to the PCA:
pca <- prcomp(t(x), scale=TRUE)summary(pca)Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 183.6379 177.3605 171.3020 1e+00
Proportion of Variance 0.3568 0.3328 0.3104 1e-05
Cumulative Proportion 0.3568 0.6895 1.0000 1e+00
Now we can use the first two principal components as a co-ordinate system for visualizing the summarized transcriptomic profiles of each sample:
plot(pca$x[,1], pca$x[,2],
col=c("blue","blue","red","red"),
xlab="PC1", ylab="PC2", pch=16)library(ggplot2)
library(ggrepel)
mycols <- c("blue","blue","red","red")
ggplot(pca$x) +
aes(PC1, PC2, label=rownames(pca$x)) +
geom_point( col=mycols ) +
geom_text_repel( col=mycols ) +
theme_bw()The plot makes it clear that PC1 separates the two control samples (SRR2156848 and SRR2156849) from the two enhancer-targeting CRISPR-Cas9 samples (SRR2156850 and SRR2156851). PC2 separates the two control samples from each other, and PC3 separates the two enhancer-targeting CRISPR samples from each other. This could be investigated further to see which genes result in these separation patterns. It is also at least slightly reassuring, implying that there are considerable differences between the treated and control samples.
We can use DESeq2 to complete the differential-expression analysis that we are already familiar with:
An example of creating a DESeqDataSet for use with DESeq2:
library(DESeq2)sampleTable <- data.frame(condition = factor(rep(c("control", "treatment"), each = 2)))
rownames(sampleTable) <- colnames(txi.kallisto$counts)dds <- DESeqDataSetFromTximport(txi.kallisto,
sampleTable,
~condition)using counts and average transcript lengths from tximport
# dds is now ready for DESeq() see our previous classes on thisdds <- DESeq(dds)res <- results(dds)
head(res)log2 fold change (MLE): condition treatment vs control
Wald test p-value: condition treatment vs control
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENST00000539570 0.000000 NA NA NA NA
ENST00000576455 0.761453 3.155061 4.86052 0.6491203 0.516261
ENST00000510508 0.000000 NA NA NA NA
ENST00000474471 0.484938 0.181923 4.24871 0.0428185 0.965846
ENST00000381700 0.000000 NA NA NA NA
ENST00000445946 0.000000 NA NA NA NA
padj
<numeric>
ENST00000539570 NA
ENST00000576455 NA
ENST00000510508 NA
ENST00000474471 NA
ENST00000381700 NA
ENST00000445946 NA
These results could go on to be visualized and subjected to pathway analysis etc. as we have done in previous classes.