Class 16: Obtaining and processing SRA datasets on AWS
In bioinformatics we often need to analyze datasets that are too large (or would take too long) to analyze on our local lab computers. The goal of this hands-on session is to show you how to obtain and work with large biomolecular sequencing datasets using cloud computing resources. In particular, we will query, download, decompress and analyze data from NCBI's main Sequence Read Archive (SRA), the world's largest publicly available repository of high-throughput sequencing data.
Cloud computing allows access to arbitrary amounts of compute resources that are physically located elsewhere. Typically you pay for what you use rather than having to pay upfront for new hardware and all its associated maintenance costs. Cloud computing resources exist on servers managed by cloud providers. The most popular cloud providers include Amazon (Amazon Web Services, a.k.a. AWS), Google (Google Compute Engine) and Microsoft (Azure). For academic work in the US we also have NSF/XSEDE (which manages access to JetStream and other supercomputers around the country).
In our last hands-on session we introduced important cloud concepts and walked through configuring and launching AWS EC2 instances.
The AWS console is a password-protected website where you can configure, launch and control your EC2 instances. We introduced this console in our last session and we will not re-cover that material here beyond the instructions for starting a new instance.
If you are an enrolled student in this course you can access your own AWS console at https://awsed.ucsd.edu/. This will ask for your regular UCSD single-sign-on details and then you should be able to select our course and be re-directed to the AWS console:
Briefly, click the orange Launch instances button in the upper left and on the resulting launch page:
Give a useful name for your instance (e.g. a combination of class number and your UCSD username, for example: class16_bjgrant);
Select Ubuntu as the OS and the version Ubuntu Server 20.04
Instance type m5.2xlarge
You can use your previous key pair from the last class or (if in doubt about where to find it) create a new one and use it.
Select existing security group and choose BIMM143/BGGN213 FA20
Configure storage with 100 GiB
Verify all your settings in the right hand panel and click “Launch instance”.
After configuring and launching your instance on the last page it will take a short amount of time to set up and become available for use.
Once available you can login via SSH in your Terminal as we did previously. Here I am using my keyfile with -i bioinf_barry.pem along with the username ubuntu and the EC2 address provided from the AWS console (see the example command below). Note that if it is your first time connecting to your instance it will ask if you are sure you want to connect, to which you can answer yes.
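The full connection command will look something like the one below (the EC2 address here is only a placeholder; substitute the address shown for your own instance in the AWS console):
On YOUR computer
ssh -i bioinf_barry.pem ubuntu@ec2-XX-XX-XX-XX.us-west-2.compute.amazonaws.com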
Now that we have successfully gained access to a UNIX-based machine with sufficient memory and CPU resources we can introduce the main Sequence Read Archive - currently the largest publicly available repository of high-throughput sequencing data.
The Sequence Read Archive (SRA), https://www.ncbi.nlm.nih.gov/sra is NCBI’s major repository for sequencing data and includes all types of high-throughput sequencing experiments.
Many other domain-specific databases link to entries in the SRA. For example, the Gene Expression Omnibus (GEO), which focuses on gene expression data, links to SRA for retrieval of raw data from sequencing-based experiments whose processed data is described in a GEO record. To get the actual raw data you need to go to the associated SRA entry.
Sequencing data stored on SRA is highly compressed to enable efficient storage and downloading. The main consequence of this, beyond faster download speeds, is that we must use special tools to decompress these records into usable FASTQ files.
The SRA-toolkit software package provides functionality to retrieve data from the SRA, and to extract the retrieved records to FASTQ files.
Instructions for obtaining the package on different platforms can be found on the project's GitHub repo. Note that precompiled binaries are available for Ubuntu that we will make use of here.
Download the compressed toolkit with the curl command. Then unzip and untar to be able to use the software:
On your AWS instance
# Download
curl -O https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
# Unzip and Untar
tar -zxvf sratoolkit.current-ubuntu64.tar.gz
Note the name of the directory that tar created. The name of this directory changes with each release and varies by platform, i.e. it follows the pattern
sratoolkit.<release>-<platform>
e.g. at the time of writing a directory called sratoolkit.3.0.1-ubuntu64 was produced.
There should be a new directory called sratoolkit.3.0.1-ubuntu64 (or a similar name with a higher number as the version changes in the future). Check for this directory with your regular ls command. Then cd into the associated bin/ directory, e.g.
On AWS
ls
cd sratoolkit.3.0.1-ubuntu64/bin
pwd
In the bin/ directory there are lots of programs that we can use to work with SRA data. We will focus on two main programs - prefetch and fastq-dump.
The first of these, prefetch, retrieves compressed data from SRA. The second, fastq-dump, reconstructs the actual FASTQ file(s) from the output of prefetch that we want to analyze.
We can check if these programs are working by running them with the --version or --help options.
For example:
On AWS
$ ~/sratoolkit.3.0.1-ubuntu64/bin/prefetch --version
/home/ubuntu/sratoolkit.3.0.1-ubuntu64/bin/prefetch : 3.0.1
Note that even though I can use the TAB key to help, it is still cumbersome to type the full PATH to the prefetch program every time we want to run it. To avoid needing to do this we can add its location to a special $PATH environment variable:
On AWS
$ export PATH=$PATH:/home/ubuntu/sratoolkit.3.0.1-ubuntu64/bin
Now we will not need to type the absolute PATH every time we want to run one of the sra-tools programs from its bin/ directory, e.g. this will now just work
On AWS
$ prefetch --version
prefetch : 3.0.1
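Side-note: this export only lasts for your current login session. If you want it to persist across future logins you could append the same line to your ~/.bashrc file (a quick sketch, assuming the default bash shell on Ubuntu):
On AWS
$ echo 'export PATH=$PATH:/home/ubuntu/sratoolkit.3.0.1-ubuntu64/bin' >> ~/.bashrc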
With this setup complete we are now ready to get some sequencing data to work with.
Let’s say we wanted to obtain the ChIP-seq data from an ovarian cancer DNA sample, such as that cited in:
N Chapman-Rothe et al, “Chromatin H3K27me3/H3K4me3 histone marks define gene sets in high-grade serous ovarian cancer that distinguish malignant, tumour-sustaining and chemo-resistant ovarian tumour cells” Oncogene 32(38):4586 (2013).
Visit the full text of this manuscript via the link above. Note that the manuscript lists an SRA accession for the project of 'SRP016075' under the Accession codes section.
Searching for the corresponding record on the SRA database, we will find a list of the sequencing runs associated with this manuscript.
Specifically look at the ChIP-seq with the H3K4me3 antibody, which can be found at: https://www.ncbi.nlm.nih.gov/sra/SRX193578. See the image below:
From the corresponding web page (image above), there is a run identifier listed: 'SRR600956'. This is the SRA identifier we need to obtain the raw sequencing data with the prefetch program.
To download the corresponding data, call the prefetch program (using either its absolute path ~/sratoolkit.3.0.1-ubuntu64/bin/ or, if you have followed the instructions above, simply its name) with the run accession number as its only argument:
On AWS
$ cd
$ pwd
/home/ubuntu
$ prefetch SRR600956
2023-02-27T20:38:08 prefetch.3.0.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2023-02-27T20:38:08 prefetch.3.0.1: 1) Downloading 'SRR600956'...
2023-02-27T20:38:08 prefetch.3.0.1: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
2023-02-27T20:38:08 prefetch.3.0.1: Downloading via HTTPS...
2023-02-27T20:38:27 prefetch.3.0.1: HTTPS download succeed
2023-02-27T20:38:28 prefetch.3.0.1: 'SRR600956' is valid
2023-02-27T20:38:28 prefetch.3.0.1: 1) 'SRR600956' was downloaded successfully
Run an ls to see the resulting directory - note that its contents are compressed and not yet usable for further analysis.
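For example, you should see something like the following (the exact layout may vary slightly with toolkit version):
On AWS
$ ls SRR600956
SRR600956.sra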
To obtain usable raw data we first need to reconstruct the FASTQ file(s) with the fastq-dump program:
On AWS
$ fastq-dump SRR600956
This will take some time but eventually result in a FASTQ file, appearing in your current working directory, which constitutes the raw data from the corresponding sequencing run.
You can use the head command to see the start of the resulting FASTQ file.
On AWS
$ head SRR600956.fastq
@SRR600956.1 HWI-EAS486_0002:3:1:1382:1342 length=38
GTGTTCCAAATGCTGCAAATGGGTGTCAATGTATGTTA
+SRR600956.1 HWI-EAS486_0002:3:1:1382:1342 length=38
D?BCCA?BDBDBACD@=??BAAC>CBBBBBCBBBD?%%
@SRR600956.2 HWI-EAS486_0002:3:1:1382:5487 length=38
GATGATAGTTTCTTTTGCCGTTAGCACAATTTTTCCAA
+SRR600956.2 HWI-EAS486_0002:3:1:1382:5487 length=38
DCEECEAECEFFDFECEEEFFFFDB?BBADEEEEE???
@SRR600956.3 HWI-EAS486_0002:3:1:1382:4694 length=38
TGTAGGCTCCACCTCTGGGGGCAGGGCACAGACAAACA
Note that each new read starts with the pattern @SRR600956. Let's use grep with this pattern to determine how many total reads are in this file:
On AWS
$ grep -c "@SRR600956" SRR600956.fastq
25849655
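Side-note: since each FASTQ record spans exactly four lines, you can cross-check this count by dividing the total line count of the file by four, e.g.:
On AWS
$ echo $(( $(wc -l < SRR600956.fastq) / 4 ))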
In this section we will move on to an arguably more interesting RNA-Seq dataset from a human cancer cell line: two replicates of a cell line treated with CRISPR-Cas9 targeting a gene of interest, and two replicates of the cell line treated with a control construct as described in:
Zhang et al. “Identification of focally amplified lineage-specific super-enhancers in human epithelial cancers”. Nat Genet 2016 Feb;48(2):176-82. PMID: 26656844
We will start with their SRA accession numbers, download their raw data and follow up with a complete analysis from raw reads to gene specific counts, principal component analysis and differential expression analysis.
The GEO entry for the study in question can be found at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72001
Let’s use the SRA toolkit to download the study’s data and extract it into FASTQ format:
Side-note: With single-end reads we will have only one file for each sample, with paired-end reads (which are more common in RNA-seq) we will have two files for each sample as we have here.
On AWS
$ prefetch SRR2156848
2023-02-27T20:47:07 prefetch.3.0.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2023-02-27T20:47:07 prefetch.3.0.1: 1) Downloading 'SRR2156848'...
2023-02-27T20:47:07 prefetch.3.0.1: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
2023-02-27T20:47:07 prefetch.3.0.1: Downloading via HTTPS...
2023-02-27T20:47:14 prefetch.3.0.1: HTTPS download succeed
2023-02-27T20:47:15 prefetch.3.0.1: 'SRR2156848' is valid
2023-02-27T20:47:15 prefetch.3.0.1: 1) 'SRR2156848' was downloaded successfully
2023-02-27T20:47:15 prefetch.3.0.1: 'SRR2156848' has 0 unresolved dependencies
Recall that the prefetch program prepares temporary files for obtaining sequence data efficiently from the SRA. With this run, you can use fastq-dump to extract the data in FASTQ format to a file in the current directory.
On AWS
$ fastq-dump --split-3 SRR2156848
Read 2959900 spots for SRR2156848
Written 2959900 spots for SRR2156848
Note that as this sequencing run is a paired-end library, this command uses the --split-3 argument to indicate that we wish to extract separate files for the different mate-pairs. Now if we look at the contents of the current directory, we should see the following:
On AWS
$ ls
SRR2156848/
SRR2156848_1.fastq
SRR2156848_2.fastq
SRR600956/
SRR600956.fastq
Q. How would you check that these files with extension ‘.fastq’ actually look like what we expect for a FASTQ file? You could try printing the first few lines to the shell standard output:
On AWS
# Fill in the blank!
___ SRR2156848_1.fastq
Q. How could you check the number of sequences in each file?
On AWS
# Fill in the blanks
$ ____ -___ "___" SRR2156848_1.fastq
2959900
Q. Check your answer against the bottom of the file using the tail command and also check the matching mate pair FASTQ file. Do these numbers match? Why or why not?
Now let's download the remaining datasets (SRR2156849, SRR2156850 and SRR2156851) we need for our analysis, first retrieving them with prefetch and then processing them with fastq-dump:
On AWS
$ prefetch SRR2156849 SRR2156850 SRR2156851
$ fastq-dump --split-3 SRR2156849 SRR2156850 SRR2156851
On AWS
# fill in the blank
___ *.fastq
A fast approach to analysis of RNA-seq data makes use of the fact that if you only wish to quantify the expression of a gene, you don’t need to know where in the gene the read maps to. In fact, it is enough to know which genes a read might have come from. This is the principle behind Kallisto. In brief, Kallisto compares each RNA-seq read against an indexed table of transcript sequences to find the sets of transcripts that each read could have come from. These sets are then used to estimate the most likely set of transcript counts over the whole RNA-seq library.
On AWS
wget https://github.com/pachterlab/kallisto/releases/download/v0.44.0/kallisto_linux-v0.44.0.tar.gz
# Unzip and Untar the resulting file
___ -___ kallisto_linux-v0.44.0.tar.gz
We can check the program works by calling the kallisto executable with its full PATH:
On AWS
$ ~/kallisto_linux-v0.44.0/kallisto
kallisto 0.44.0
Usage: kallisto <CMD> [arguments] ..
Where <CMD> can be one of:
index Builds a kallisto index
quant Runs the quantification algorithm
pseudo Runs the pseudoalignment step
h5dump Converts HDF5-formatted results to plaintext
inspect Inspects and gives information about an index
version Prints version information
cite Prints citation information
Running kallisto <CMD> without arguments prints usage information for <CMD>
On AWS
# Fill in the blank
$ ___ PATH=$PATH:/home/ubuntu/kallisto_linux-v0.44.0
Now we should be able to run it by simply calling kallisto.
Q. Can you run kallisto to print out its citation information?
Let's pull the hg19 Ensembl reference transcriptome directly from the Ensembl FTP site. You can use either the wget command or the curl command for this.
On AWS
$ wget ftp://ftp.ensembl.org/pub/release-67/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.67.cdna.all.fa.gz
On AWS
$ gunzip Homo_sapiens.GRCh37.67.cdna.all.fa.gz
On AWS
$ kallisto index -i hg19.ensembl Homo_sapiens.GRCh37.67.cdna.all.fa
This command specifies that we give this transcriptome index the name hg19.ensembl. Building the index will take a little while.
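Once it finishes, a quick sanity check is to confirm the index file exists and see how large it is:
On AWS
$ ls -lh hg19.ensembl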
Once complete we will have an indexed human transcriptome to match the reads against, and we can then use Kallisto for fast transcript quantification on our FASTQ files.
To quantify transcripts with Kallisto we use the quant command option, specifying a transcript index (with the -i flag), an output directory (with the -o flag) in which to write results to file, and a list of our FASTQ input files:
On AWS
$ kallisto quant -i hg19.ensembl -o SRR2156848_quant SRR2156848_1.fastq SRR2156848_2.fastq
This again will take some time (but not nearly as long as the alignment-based mapping we did in our previous Galaxy lab).
Once complete, have a look at the output files in the new SRR2156848_quant directory:
On AWS
ls SRR2156848_quant
abundance.h5 abundance.tsv run_info.json
We can now run the quantification for the remaining samples:
On AWS
# You complete these steps
$ kallisto quant -i ___ -o SRR2156849_quant SRR2156849_1.fastq SRR2156849_2.fastq
$ kallisto quant -i hg19.ensembl -o SRR2156850_quant ___.fastq ___.fastq
$ kallisto quant -i hg19.ensembl -o SRR2156851_quant SRR2156851_1.fastq SRR2156851_2.fastq
You should now have a *_quant/ directory for each sample.
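You can confirm this with a quick listing:
On AWS
$ ls -d *_quant
SRR2156848_quant  SRR2156849_quant  SRR2156850_quant  SRR2156851_quant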
Q. Have a look at the TSV format versions of these files to understand their structure. What do you notice about these files' contents?
With the transcript quantification completed, we have now finished our UNIX-intensive work on AWS.
We can thus transfer our main results files to date (the various abundance.tsv files) back to our laptop for further analysis.
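First make a new directory on your laptop to hold these results and move into it (the directory name used here is just an example):
On YOUR computer
mkdir class16_results
cd class16_results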
Then use scp to transfer all your abundance results to this directory.
On YOUR computer
# Make sure you are accessing your keyfile and instance
scp -r -i "~/Downloads/bimm143w23.pem" ubuntu@ec2-34-219-113-54.us-west-2.compute.amazonaws.com:~/*_quant .
Back on our laptop we can now use R and Bioconductor tools to further explore this large scale dataset.
For example there is an R function called tximport() in the tximport package, which enables straightforward import of Kallisto results.
With each sample having its own directory containing the Kallisto output, we can import the transcript count estimates into R using:
# BiocManager::install("tximport")
library(tximport)
# setup the folder and filenames to read
folders <- dir(pattern="SRR21568*")
samples <- sub("_quant", "", folders)
files <- file.path( folders, "abundance.h5" )
names(files) <- samples

txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)
1 2 3 4
head(txi.kallisto$counts)
SRR2156848 SRR2156849 SRR2156850 SRR2156851
ENST00000539570 0 0 0.00000 0
ENST00000576455 0 0 2.62037 0
ENST00000510508 0 0 0.00000 0
ENST00000474471 0 1 1.00000 0
ENST00000381700 0 0 0.00000 0
ENST00000445946 0 0 0.00000 0
We now have our estimated transcript counts for each sample in R. We can see how many transcripts we have for each sample:
colSums(txi.kallisto$counts)
SRR2156848 SRR2156849 SRR2156850 SRR2156851
2563611 2600800 2372309 2111474
And how many transcripts are detected in at least one sample:
sum(rowSums(txi.kallisto$counts)>0)
[1] 94561
Before subsequent analysis, we might want to filter out those annotated transcripts with no reads:
to.keep <- rowSums(txi.kallisto$counts) > 0
kset.nonzero <- txi.kallisto$counts[to.keep,]
And those with no change over the samples:
keep2 <- apply(kset.nonzero, 1, sd) > 0
x <- kset.nonzero[keep2,]
We can now apply any exploratory analysis technique to this counts matrix. As an example, we will perform a PCA of the transcriptomic profiles of these samples.
Now we compute the principal components, centering and scaling each transcript’s measured levels so that each feature contributes equally to the PCA:
pca <- prcomp(t(x), scale=TRUE)
summary(pca)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 183.6379 177.3605 171.3020 1e+00
Proportion of Variance 0.3568 0.3328 0.3104 1e-05
Cumulative Proportion 0.3568 0.6895 1.0000 1e+00
Now we can use the first two principal components as a co-ordinate system for visualizing the summarized transcriptomic profiles of each sample:
plot(pca$x[,1], pca$x[,2],
col=c("blue","blue","red","red"),
xlab="PC1", ylab="PC2", pch=16)
Q. Use ggplot to make a similar figure of PC1 vs PC2 and separate figures of PC1 vs PC3 and PC2 vs PC3.
library(ggplot2)
library(ggrepel)
# Make metadata object for the samples
colData <- data.frame(condition = factor(rep(c("control", "treatment"), each = 2)))
rownames(colData) <- colnames(txi.kallisto$counts)
# Make the data.frame for ggplot
y <- as.data.frame(pca$x)
y$Condition <- as.factor(colData$condition)
ggplot(y) +
aes(PC1, PC2, col=Condition) +
geom_point() +
geom_text_repel(label=rownames(y)) +
theme_bw()
The plot makes it clear that PC1 separates the two control samples (SRR2156848 and SRR2156849) from the two enhancer-targeting CRISPR-Cas9 samples (SRR2156850 and SRR2156851). PC2 separates the two control samples from each other, and PC3 separates the two enhancer-targeting CRISPR samples from each other. This could be investigated further to see which genes result in these separation patterns. It is also at least slightly reassuring, implying that there are considerable differences between the treated and control samples.
We can use DESeq2 to complete the differential-expression analysis that we are already familiar with:
An example of creating a DESeqDataSet for use with DESeq2:
library(DESeq2)
sampleTable <- data.frame(condition = factor(rep(c("control", "treatment"), each = 2)))
rownames(sampleTable) <- colnames(txi.kallisto$counts)

dds <- DESeqDataSetFromTximport(txi.kallisto,
                                sampleTable,
                                ~condition)
using counts and average transcript lengths from tximport
# dds is now ready for DESeq() see our previous classes on this
dds <- DESeq(dds)
res <- results(dds)
head(res)
log2 fold change (MLE): condition treatment vs control
Wald test p-value: condition treatment vs control
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENST00000539570 0.000000 NA NA NA NA
ENST00000576455 0.761453 3.155061 4.86052 0.6491203 0.516261
ENST00000510508 0.000000 NA NA NA NA
ENST00000474471 0.484938 0.181923 4.24871 0.0428185 0.965846
ENST00000381700 0.000000 NA NA NA NA
ENST00000445946 0.000000 NA NA NA NA
padj
<numeric>
ENST00000539570 NA
ENST00000576455 NA
ENST00000510508 NA
ENST00000474471 NA
ENST00000381700 NA
ENST00000445946 NA
These results could go on to be visualized and subjected to pathway analysis etc. as we have done in previous classes.
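For example, a quick overview of fold change versus adjusted p-value can be sketched with base R (the thresholds drawn here are arbitrary illustrative cut-offs):

# Volcano-style overview of the DESeq2 results
plot(res$log2FoldChange, -log10(res$padj),
     xlab="log2 fold change", ylab="-log10 adjusted p-value",
     pch=16, cex=0.4)
abline(v=c(-2, 2), col="gray", lty=2)
abline(h=-log10(0.05), col="gray", lty=2)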
Please Stop your AWS instances at this point as we want to conserve funds for future work. To stop your instance:
Return to the AWS console by visiting https://awsed.ucsd.edu/,
Select your running instance,
Click “Instance State” > “Stop” to suspend and stop being charged for resources.