Barry Grant < http://thegrantlab.org/ >
2022-02-28 (09:57:19 PST on Mon, Feb 28)
Cancer is fundamentally a disease of the genome, caused by changes in the DNA, RNA, and proteins of a cell that push cell growth into overdrive. Identifying the genomic alterations that arise in a given cancer can help researchers decode how a particular cancer develops and improve upon the diagnosis and treatment of cancers based on their distinct molecular abnormalities.
With the ability to sequence whole genomes and exomes, attention has turned to trying to understand the full spectrum of genetic mutations that underlie cancer.
The genomes or exomes of tens of thousands of cancers have now been sequenced. Analyzing this data can yield important new insights into cancer biology. This is important because it is estimated that cancer will strike over 40% of people at some point in their lifetime with frequently devastating effects.
The National Cancer Institute (NCI) in the US has established the Genomic Data Commons (or GDC for short) for sharing cancer genomics data-sets.
This includes data from a range of large scale projects such as The Cancer Genome Atlas (TCGA) and other projects. The TGCA project aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and sub-types of cancer. As of writing, TCGA has analyzed matched tumor and normal tissues from over 11,000 patients covering 33 cancer types and sub-types.
You can get a feel for the types of cancer data contained in the NCI-GDC by visiting their new web portal: https://portal.gdc.cancer.gov.
Visit the NCI-GDC web portal and enter p53 into the search box.
Q1. How many Cases (i.e. patient samples) have been found to have p53 mutations?
Q2. What are the top 6 misssense mutations found in this gene?
HINT: Scroll down to the ‘TP53 - Protein’ section and mouse over the displayed plot. For example R175H is found in 156 cases.
Q3. Which domain of the protein (as annotated by PFAM) do these mutations reside in?
Q4. What are the top 6 primary sites (i.e. cancer locations such as Lung, Brain, etc.) with p53 mutations and how many primary sites have p53 mutations been found in?
HINT: Clicking on the number links in the Cancer Distribution section will take you to a summary of available data accross cases, genes, and mutations for p53. Looking at the cases data will give you a ranked listing of primary sites.
Return to the NCI-GDC homepage and using a similar search and explore strategy answer the following questions:
Q5. What is the most frequentely mutated position associated with cancer in the KRas protein (i.e. the amino acid with the most mutations)?
Q6. Are KRas mutations common in Pancreatic Adenocarcinoma (i.e. is the Pancreas a common ‘primary site’ for KRas mutations?).
Q6. What is the ‘TGCA project’ with the most KRas mutations?
Q7. What precent of cases for this ‘TGCA project’ have KRas mutations and what precent of cases have p53 mutations?
HINT: Placing your mouse over the project bar in the Cancer Distribution panel will bring up a tooltip with useful summary data.
Q8. How many TGCA Pancreatic Adenocarcinoma cases (i.e. patients from the TCGA-PAAD project) have RNA-Seq data available?
Side-Note: If Barry forgets please remind him to show the top miss-sense mutation sites on the protein structure (and discuss mechanism) as well as demo how to find RNA-Seq FASTQ files and other data such as biopsy images etc.
By now it should be clear that the NCI-GDC is a rich source of both genomic and clinical data for a wide range of cancers. For example, at the time of writing there are 5,306 files associated with Pancreatic Adenocarcinoma and 14,278 for Colon Adenocarcinoma. These include RNA-Seq, WXS (whole exome sequencing), Methylation and Genotyping arrays as well as well as rich metadata associated with each case, file and biospecimen.
The GenomicDataCommons Bioconductor package provides functions for querying, accessing, and mining the NCI-GDC in R. Using this package allows us to couple large cancer genomics data sets (for example the actual RNA-Seq, WXS or SNP data) directly to the plethora of state-of-the-art bioinformatics methods available in R. This is important because it greatly facilitates both targeted and exploratory analysis of molecular cancer data well beyond that accessible via a web portal.
This section highlights how one can couple the GenomicDataCommons, TCGAbiolinks and maftools bioconductor packages to quickly gain insight into public cancer genomics data-sets.
We will first use functions from the GenomicDataCommons
package to identify and then fetch, using the TCGAbiolinks
package, somatic variant results from the NCI-GDC and then provide a high-level assessment of those variants using the maftools
package. The later package works with Mutation Annotation Format or MAF format files used by GDC and others to store somatic variants.
The workflow will be:
#source("https://bioconductor.org/biocLite.R")
#biocLite(c("GenomicDataCommons", "TCGAbiolinks", "maftools"))
::install( c("GenomicDataCommons", "TCGAbiolinks", "maftools") ) BiocManager
Once installed, load the packages, as usual.
library(GenomicDataCommons)
library(TCGAbiolinks)
library(maftools)
Now lets check on GDC status:
status()
## $commit
## [1] "b49b90e1318040f447906940f3dff145809d9ea0"
##
## $data_release
## [1] "Data Release 31.0 - October 29, 2021"
##
## $status
## [1] "OK"
##
## $tag
## [1] "3.0.0"
##
## $version
## [1] 1
If this statement results in an error such as SSL connect error
, then please see the troubleshooting section here.
We will typically start our interaction with the GDC by searching the resource to find data that we are interested in investigating further. In GDC speak this is called “Querying GDC metadata”. Metadata here refers to the extra descriptive information associated with the actual patient data (i.e. ‘cases’) in the GDC.
For example: Our query might be ‘find how many patients were studied for each major project’ or ‘find and download all gene expression quantification data files for all pancreatic cancer patients’. We will answer both of these questions below.
The are four main sets of metadata that we can query, namely projects()
, cases()
, files()
, and annotations()
. We will start with projects()
<- getGDCprojects()
projects head(projects)
If you use the View(projects)
function call you can see all the project names (such as Neuroblastoma, Pancreatic Adenocarcinoma, etc.) along with their project IDs (such as TARGET-NBL, TCGA-PAAD, etc.) and associated information.
Moving onto cases()
we can use an example from the package associated publication to answer our first from question above (i.e. find the number of cases/patients across different projects within the GDC):
<- cases() %>%
cases_by_project facet("project.project_id") %>%
aggregations()
head(cases_by_project)
## $project.project_id
## doc_count key
## 1 18004 FM-AD
## 2 16824 GENIE-MSK
## 3 14232 GENIE-DFCI
## 4 3857 GENIE-MDA
## 5 3320 GENIE-JHU
## 6 2632 GENIE-UHN
## 7 2145 TARGET-AML
## 8 2052 GENIE-VICC
## 9 1587 TARGET-ALL-P2
## 10 1132 TARGET-NBL
## 11 1098 TCGA-BRCA
## 12 1038 GENIE-GRCC
## 13 995 MMRF-COMMPASS
## 14 801 GENIE-NKI
## 15 795 CPTAC-3
## 16 652 TARGET-WT
## 17 617 TCGA-GBM
## 18 608 TCGA-OV
## 19 585 TCGA-LUAD
## 20 583 BEATAML1.0-COHORT
## 21 560 TCGA-UCEC
## 22 537 TCGA-KIRC
## 23 528 TCGA-HNSC
## 24 516 TCGA-LGG
## 25 507 TCGA-THCA
## 26 504 TCGA-LUSC
## 27 500 TCGA-PRAD
## 28 489 NCICCR-DLBCL
## 29 470 TCGA-SKCM
## 30 461 TCGA-COAD
## 31 443 TCGA-STAD
## 32 440 REBC-THYR
## 33 412 TCGA-BLCA
## 34 383 TARGET-OS
## 35 377 TCGA-LIHC
## 36 342 CPTAC-2
## 37 339 TRIO-CRU
## 38 307 TCGA-CESC
## 39 291 TCGA-KIRP
## 40 261 TCGA-SARC
## 41 211 CGCI-HTMCP-CC
## 42 200 CMI-MBC
## 43 200 TCGA-LAML
## 44 191 TARGET-ALL-P3
## 45 185 TCGA-ESCA
## 46 185 TCGA-PAAD
## 47 179 TCGA-PCPG
## 48 176 OHSU-CNL
## 49 172 TCGA-READ
## 50 150 TCGA-TGCT
## 51 124 TCGA-THYM
## 52 120 CGCI-BLGSP
## 53 113 TCGA-KICH
## 54 109 HCMI-CMDC
## 55 101 WCDT-MCRPC
## 56 92 TCGA-ACC
## 57 87 TCGA-MESO
## 58 80 TCGA-UVM
## 59 70 ORGANOID-PANCREATIC
## 60 69 TARGET-RT
## 61 58 TCGA-DLBC
## 62 57 TCGA-UCS
## 63 56 BEATAML1.0-CRENOLANIB
## 64 51 TCGA-CHOL
## 65 45 CTSP-DLBCL1
## 66 36 CMI-ASC
## 67 30 CMI-MPC
## 68 24 TARGET-ALL-P1
## 69 13 TARGET-CCSK
## 70 7 VAREPOP-APOLLO
Note that the facet() and aggregations() functions here are from the GenomicDataCommons
package and act to group all cases by the project id and then count them up.
If you use the View() function on our new cases_by_project
object you will find that the data we are after is accessible via cases_by_project$project.project_id
.
Q9. Write the R code to make a barplot of the cases per project. Lets plot this data with a log scale for the y axis (
log="y"
), rotated axis labels (las=2
) and color the bar coresponding to the TCGA-PAAD project.
<- cases_by_project$project.project_id
x
# Make a custom color vector for our plot
<- rep("lightblue", nrow(x))
colvec <- "red"
colvec[___]
# Plot with 'log' for y axis and rotate labels with 'las'
#par(___)
barplot(___, names.arg=___, log="y", col=colvec, las=2)
Lets explore some other functions from the related TCGAbiolinks
package.
We can use the getSampleFilesSummary() function to determine for a given project how many cases and what type of data we have available for each case:
<- getSampleFilesSummary("TCGA-PAAD") samp
## Accessing information for project: TCGA-PAAD
head(samp)
Now we can use GDCquery() function to focus in on a particular data type that we are interested in. For example, to answer our second question from above - namely ‘find all gene expression data files for all pancreatic cancer patients’:
<- GDCquery(project="TCGA-PAAD",
query data.category="Transcriptome Profiling",
data.type="Gene Expression Quantification")
<- getResults(query) ans
head(ans)
In RStudio we can now use the View() function to get a feel for the data organization and values in the returned ans
object.
# Run in your RStudio CONSOLE
View(ans)
We should see that ans
contains a row for every RNA-Seq data file from the ‘TCGA-PAAD’ project. At the time of writing this was 546 RNA-Seq data files.
nrow(ans)
## [1] 546
We could download these with standard R tools, or for larger data-sets such as this one, use the packages transfer() function, which uses the GDC transfer client (a separate command-line tool) to perform more robust data downloads.
Note we could go to the NCI-GDC web portal and enter the Advanced Search page and then construct a search query to find MAF format somatic mutation files for our ‘TCGA-PAAD’ project.
After some exploration of the website I came up with the following query: “cases.project.project_id in ["TCGA-PAAD"] and files.data_type in ["Masked Somatic Mutation"] and files.data_format in ["MAF"]
”.
Q9. Did you get this online search to work? If so how many MAF files for the TCGA-PAAD project were found from this advanced web search?
Lets do the same search in R with the help of the TCGAbiolinks
package function GDCquery_Maf(). For brevity we will focus on only one of the MAF files for this project, namely the MuTect2 workflow variant calls.
<- GDCquery_Maf(tumor="PAAD", pipelines = "mutect") maf.file
##
indexed 5.40kB in 0s, 314.75MB/s
indexed 1.00TB in 0s, 5.26TB/s
And lets take a peak at the first 6 rows of this data:
head(maf.file)
Q10. What argument could we use to write the MAF file into a csv document in your current working directory?
The MAF file contents is now stored as a dataframe and the maftools package workflow, which starts with a MAF file or dataframe, can proceed, starting with reading the pancreatic cancer MAF file.
= read.maf(maf = maf.file, verbose = FALSE) vars
With the data now available as a maftools MAF object, a lot of functionality is available with little code. While the maftools package offers quite a few functions, here are a few highlights. Cancer genomics and bioinformatics researchers will recognize these plots:
We can use plotmafSummary()
function to plot a summary of the maf object, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification. We can add either mean or median line to the stacked barplot to display average/median number of variants across the cohort.
plotmafSummary(vars)
A very useful summary representation of this data can be obtained via so-called oncoplots, also known as waterfall plots.
oncoplot(maf = vars, top = 10)
You might need to run the oncoplot() command in the R Console and then zoom the display to see the full plot (as it is rather large and may not appear initially in your Rmarkdown document before Knitting. Another option is to send your plot to a PNG or PDF plot device directly, for example:
# Oncoplot for our top 10 most frequently mutated genes
pdf("oncoplot_panc.pdf")
oncoplot(maf = vars, top = 10, fontSize = 12)
dev.off()
## quartz_off_screen
## 2
Oncoplots display the “mutational landscape” of patients accross specific genes of interest. By default, and in the plot above, the top 10 mutated genes in the cohort are used. Each column is a patient and each row is a gene of interest (sorted by the mutation rate of that gene in the cohort). Each colored cell indicates that the patient has a mutation in that gene and it is color-coded by the type of mutation (depicted in the legend on the bottom left of the plot).
For example, in the oncoplot above KRAS is depicted as the most frequently mutated gene in the cohort (77% of the time), following by TP53 (63%). KRAS misssense mutations predominate. Many patients have both KRAS and TP53 mutations. A small subset of patients have either TP53 or GNAS mutations but no KRAS mutation. These observations could lead to questions such as – What role does mutated KRAS play in these patients? What is distinct about TP53 and GNAS mutations in the absence of KRAS mutations? Could these top 10 genes be affecting certain pathways that may lead to these patients acquiring this disease?
NOTE: The oncoplot() function is a wrapper around ComplexHeatmap’s
OncoPrint()
function and there are lots and lots of possible customization options as usual with R graphics.
NOTE: Variants annotated as Multi_Hit are those genes which are mutated more than once in the same sample.
We can visualize any subset set of genes using the oncostrip() function, which draws mutations in each sample similar to the graphic on the NCI-GDC web portal. Note that oncostrip() can be used to draw any number of genes using the input top
or genes
arguments
oncostrip(maf=vars, genes=c("KRAS", "TP53"))
Another plot focusing on KRAS in our particular dataset.
lollipopPlot(vars, gene='KRAS')
## HGNC refseq.ID protein.ID aa.length
## 1: KRAS NM_004985 NP_004976 188
## 2: KRAS NM_033360 NP_203524 189
Q11. Adapt the code above to produce a lollipop plot for p53 (i.e. the ‘TP53’ gene)?
Your p53 plot should look like this:
## HGNC refseq.ID protein.ID aa.length
## 1: TP53 NM_000546 NP_000537 393
## 2: TP53 NM_001126112 NP_001119584 393
## 3: TP53 NM_001126113 NP_001119585 346
## 4: TP53 NM_001126114 NP_001119586 341
## 5: TP53 NM_001126115 NP_001119587 261
## 6: TP53 NM_001126116 NP_001119588 209
## 7: TP53 NM_001126117 NP_001119589 214
## 8: TP53 NM_001126118 NP_001119590 354
Additional functionality is available for each of the GenomicDataCommons
, TCGAbiolinks
and maftools
packages not to mention the 100’s of other bioinformatics R packages that can now work with this type of data in both exploratory and targeted analysis modes.
The purpose of this hands-on session was to highlight how one can leverage three such packages to quickly gain insight into rapidly expanding public cancer genomics data-sets. Hopefully this will inspire your further exploration of these and other bioinformatics R packages.