Barry Grant < http://thegrantlab.org/ >
2022-02-28 (09:57:19 PST on Mon, Feb 28)

Overview

Cancer is fundamentally a disease of the genome, caused by changes in the DNA, RNA, and proteins of a cell that push cell growth into overdrive. Identifying the genomic alterations that arise in a given cancer can help researchers decode how a particular cancer develops and improve upon the diagnosis and treatment of cancers based on their distinct molecular abnormalities.

With the ability to sequence whole genomes and exomes, attention has turned to trying to understand the full spectrum of genetic mutations that underlie cancer.

The genomes or exomes of tens of thousands of cancers have now been sequenced. Analyzing this data can yield important new insights into cancer biology. This is important because it is estimated that cancer will strike over 40% of people at some point in their lifetime with frequently devastating effects.

1. The NCI Genomic Data Commons

The National Cancer Institute (NCI) in the US has established the Genomic Data Commons (or GDC for short) for sharing cancer genomics data-sets.

This includes data from a range of large scale projects such as The Cancer Genome Atlas (TCGA) and other projects. The TGCA project aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and sub-types of cancer. As of writing, TCGA has analyzed matched tumor and normal tissues from over 11,000 patients covering 33 cancer types and sub-types.

You can get a feel for the types of cancer data contained in the NCI-GDC by visiting their new web portal: https://portal.gdc.cancer.gov.

Exploring the GDC online

Visit the NCI-GDC web portal and enter p53 into the search box.

Q1. How many Cases (i.e. patient samples) have been found to have p53 mutations?

Q2. What are the top 6 misssense mutations found in this gene?
HINT: Scroll down to the ‘TP53 - Protein’ section and mouse over the displayed plot. For example R175H is found in 156 cases.

Q3. Which domain of the protein (as annotated by PFAM) do these mutations reside in?

Q4. What are the top 6 primary sites (i.e. cancer locations such as Lung, Brain, etc.) with p53 mutations and how many primary sites have p53 mutations been found in?
HINT: Clicking on the number links in the Cancer Distribution section will take you to a summary of available data accross cases, genes, and mutations for p53. Looking at the cases data will give you a ranked listing of primary sites.

Return to the NCI-GDC homepage and using a similar search and explore strategy answer the following questions:

Q5. What is the most frequentely mutated position associated with cancer in the KRas protein (i.e. the amino acid with the most mutations)?

Q6. Are KRas mutations common in Pancreatic Adenocarcinoma (i.e. is the Pancreas a common ‘primary site’ for KRas mutations?).

Q6. What is the ‘TGCA project’ with the most KRas mutations?

Q7. What precent of cases for this ‘TGCA project’ have KRas mutations and what precent of cases have p53 mutations?
HINT: Placing your mouse over the project bar in the Cancer Distribution panel will bring up a tooltip with useful summary data.

Q8. How many TGCA Pancreatic Adenocarcinoma cases (i.e. patients from the TCGA-PAAD project) have RNA-Seq data available?


Side-Note: If Barry forgets please remind him to show the top miss-sense mutation sites on the protein structure (and discuss mechanism) as well as demo how to find RNA-Seq FASTQ files and other data such as biopsy images etc.

By now it should be clear that the NCI-GDC is a rich source of both genomic and clinical data for a wide range of cancers. For example, at the time of writing there are 5,306 files associated with Pancreatic Adenocarcinoma and 14,278 for Colon Adenocarcinoma. These include RNA-Seq, WXS (whole exome sequencing), Methylation and Genotyping arrays as well as well as rich metadata associated with each case, file and biospecimen.

2. The GenomicDataCommons R package

The GenomicDataCommons Bioconductor package provides functions for querying, accessing, and mining the NCI-GDC in R. Using this package allows us to couple large cancer genomics data sets (for example the actual RNA-Seq, WXS or SNP data) directly to the plethora of state-of-the-art bioinformatics methods available in R. This is important because it greatly facilitates both targeted and exploratory analysis of molecular cancer data well beyond that accessible via a web portal.

This section highlights how one can couple the GenomicDataCommons, TCGAbiolinks and maftools bioconductor packages to quickly gain insight into public cancer genomics data-sets.

We will first use functions from the GenomicDataCommons package to identify and then fetch, using the TCGAbiolinks package, somatic variant results from the NCI-GDC and then provide a high-level assessment of those variants using the maftools package. The later package works with Mutation Annotation Format or MAF format files used by GDC and others to store somatic variants.

The workflow will be:

  • Install packages if not already installed
  • Load libraries
  • Identify and download somatic variants for a representative TCGA dataset, in this case pancreatic adenocarcinoma.
  • Use maftools to provide rich summaries of the data.
#source("https://bioconductor.org/biocLite.R")
#biocLite(c("GenomicDataCommons", "TCGAbiolinks", "maftools"))
BiocManager::install( c("GenomicDataCommons", "TCGAbiolinks", "maftools") )

Once installed, load the packages, as usual.

library(GenomicDataCommons)
library(TCGAbiolinks)
library(maftools)

Now lets check on GDC status:

status()
## $commit
## [1] "b49b90e1318040f447906940f3dff145809d9ea0"
## 
## $data_release
## [1] "Data Release 31.0 - October 29, 2021"
## 
## $status
## [1] "OK"
## 
## $tag
## [1] "3.0.0"
## 
## $version
## [1] 1

If this statement results in an error such as SSL connect error, then please see the troubleshooting section here.

3. Querying the GDC from R

We will typically start our interaction with the GDC by searching the resource to find data that we are interested in investigating further. In GDC speak this is called “Querying GDC metadata”. Metadata here refers to the extra descriptive information associated with the actual patient data (i.e. ‘cases’) in the GDC.

For example: Our query might be ‘find how many patients were studied for each major project’ or ‘find and download all gene expression quantification data files for all pancreatic cancer patients’. We will answer both of these questions below.

The are four main sets of metadata that we can query, namely projects(), cases(), files(), and annotations(). We will start with projects()

projects <- getGDCprojects()
head(projects)

If you use the View(projects) function call you can see all the project names (such as Neuroblastoma, Pancreatic Adenocarcinoma, etc.) along with their project IDs (such as TARGET-NBL, TCGA-PAAD, etc.) and associated information.

Moving onto cases() we can use an example from the package associated publication to answer our first from question above (i.e. find the number of cases/patients across different projects within the GDC):

cases_by_project <- cases() %>%
  facet("project.project_id") %>%
  aggregations()
head(cases_by_project)
## $project.project_id
##    doc_count                   key
## 1      18004                 FM-AD
## 2      16824             GENIE-MSK
## 3      14232            GENIE-DFCI
## 4       3857             GENIE-MDA
## 5       3320             GENIE-JHU
## 6       2632             GENIE-UHN
## 7       2145            TARGET-AML
## 8       2052            GENIE-VICC
## 9       1587         TARGET-ALL-P2
## 10      1132            TARGET-NBL
## 11      1098             TCGA-BRCA
## 12      1038            GENIE-GRCC
## 13       995         MMRF-COMMPASS
## 14       801             GENIE-NKI
## 15       795               CPTAC-3
## 16       652             TARGET-WT
## 17       617              TCGA-GBM
## 18       608               TCGA-OV
## 19       585             TCGA-LUAD
## 20       583     BEATAML1.0-COHORT
## 21       560             TCGA-UCEC
## 22       537             TCGA-KIRC
## 23       528             TCGA-HNSC
## 24       516              TCGA-LGG
## 25       507             TCGA-THCA
## 26       504             TCGA-LUSC
## 27       500             TCGA-PRAD
## 28       489          NCICCR-DLBCL
## 29       470             TCGA-SKCM
## 30       461             TCGA-COAD
## 31       443             TCGA-STAD
## 32       440             REBC-THYR
## 33       412             TCGA-BLCA
## 34       383             TARGET-OS
## 35       377             TCGA-LIHC
## 36       342               CPTAC-2
## 37       339              TRIO-CRU
## 38       307             TCGA-CESC
## 39       291             TCGA-KIRP
## 40       261             TCGA-SARC
## 41       211         CGCI-HTMCP-CC
## 42       200               CMI-MBC
## 43       200             TCGA-LAML
## 44       191         TARGET-ALL-P3
## 45       185             TCGA-ESCA
## 46       185             TCGA-PAAD
## 47       179             TCGA-PCPG
## 48       176              OHSU-CNL
## 49       172             TCGA-READ
## 50       150             TCGA-TGCT
## 51       124             TCGA-THYM
## 52       120            CGCI-BLGSP
## 53       113             TCGA-KICH
## 54       109             HCMI-CMDC
## 55       101            WCDT-MCRPC
## 56        92              TCGA-ACC
## 57        87             TCGA-MESO
## 58        80              TCGA-UVM
## 59        70   ORGANOID-PANCREATIC
## 60        69             TARGET-RT
## 61        58             TCGA-DLBC
## 62        57              TCGA-UCS
## 63        56 BEATAML1.0-CRENOLANIB
## 64        51             TCGA-CHOL
## 65        45           CTSP-DLBCL1
## 66        36               CMI-ASC
## 67        30               CMI-MPC
## 68        24         TARGET-ALL-P1
## 69        13           TARGET-CCSK
## 70         7        VAREPOP-APOLLO

Note that the facet() and aggregations() functions here are from the GenomicDataCommons package and act to group all cases by the project id and then count them up.

If you use the View() function on our new cases_by_project object you will find that the data we are after is accessible via cases_by_project$project.project_id.

Q9. Write the R code to make a barplot of the cases per project. Lets plot this data with a log scale for the y axis (log="y"), rotated axis labels (las=2) and color the bar coresponding to the TCGA-PAAD project.

x <- cases_by_project$project.project_id

# Make a custom color vector for our plot
colvec <- rep("lightblue", nrow(x))
colvec[___] <- "red"

# Plot with 'log' for y axis and rotate labels with 'las'
#par(___)  
barplot(___, names.arg=___, log="y", col=colvec, las=2)

Lets explore some other functions from the related TCGAbiolinks package.

We can use the getSampleFilesSummary() function to determine for a given project how many cases and what type of data we have available for each case:

samp <- getSampleFilesSummary("TCGA-PAAD")
## Accessing information for project: TCGA-PAAD
head(samp)

Now we can use GDCquery() function to focus in on a particular data type that we are interested in. For example, to answer our second question from above - namely ‘find all gene expression data files for all pancreatic cancer patients’:

query <- GDCquery(project="TCGA-PAAD",
                  data.category="Transcriptome Profiling",
                  data.type="Gene Expression Quantification")

ans <- getResults(query)
head(ans)

In RStudio we can now use the View() function to get a feel for the data organization and values in the returned ans object.

# Run in your RStudio CONSOLE
View(ans)

We should see that ans contains a row for every RNA-Seq data file from the ‘TCGA-PAAD’ project. At the time of writing this was 546 RNA-Seq data files.

nrow(ans)
## [1] 546

We could download these with standard R tools, or for larger data-sets such as this one, use the packages transfer() function, which uses the GDC transfer client (a separate command-line tool) to perform more robust data downloads.

4. Variant analysis with R

Note we could go to the NCI-GDC web portal and enter the Advanced Search page and then construct a search query to find MAF format somatic mutation files for our ‘TCGA-PAAD’ project.

After some exploration of the website I came up with the following query: “cases.project.project_id in ["TCGA-PAAD"] and files.data_type in ["Masked Somatic Mutation"] and files.data_format in ["MAF"]”.

Q9. Did you get this online search to work? If so how many MAF files for the TCGA-PAAD project were found from this advanced web search?

Lets do the same search in R with the help of the TCGAbiolinks package function GDCquery_Maf(). For brevity we will focus on only one of the MAF files for this project, namely the MuTect2 workflow variant calls.

maf.file <- GDCquery_Maf(tumor="PAAD", pipelines = "mutect")
## 
indexed 5.40kB in  0s, 314.75MB/s
                                                                              
indexed 1.00TB in  0s, 5.26TB/s
                                                                              

And lets take a peak at the first 6 rows of this data:

head(maf.file)

Q10. What argument could we use to write the MAF file into a csv document in your current working directory?

MAF analysis

The MAF file contents is now stored as a dataframe and the maftools package workflow, which starts with a MAF file or dataframe, can proceed, starting with reading the pancreatic cancer MAF file.

vars = read.maf(maf = maf.file, verbose = FALSE)

With the data now available as a maftools MAF object, a lot of functionality is available with little code. While the maftools package offers quite a few functions, here are a few highlights. Cancer genomics and bioinformatics researchers will recognize these plots:

Plotting MAF summary.

We can use plotmafSummary() function to plot a summary of the maf object, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification. We can add either mean or median line to the stacked barplot to display average/median number of variants across the cohort.

plotmafSummary(vars)

Drawing oncoplots

A very useful summary representation of this data can be obtained via so-called oncoplots, also known as waterfall plots.

oncoplot(maf = vars, top = 10)

You might need to run the oncoplot() command in the R Console and then zoom the display to see the full plot (as it is rather large and may not appear initially in your Rmarkdown document before Knitting. Another option is to send your plot to a PNG or PDF plot device directly, for example:

# Oncoplot for our top 10 most frequently mutated genes
pdf("oncoplot_panc.pdf")
oncoplot(maf = vars, top = 10, fontSize = 12)
dev.off()
## quartz_off_screen 
##                 2

Oncoplots display the “mutational landscape” of patients accross specific genes of interest. By default, and in the plot above, the top 10 mutated genes in the cohort are used. Each column is a patient and each row is a gene of interest (sorted by the mutation rate of that gene in the cohort). Each colored cell indicates that the patient has a mutation in that gene and it is color-coded by the type of mutation (depicted in the legend on the bottom left of the plot).

For example, in the oncoplot above KRAS is depicted as the most frequently mutated gene in the cohort (77% of the time), following by TP53 (63%). KRAS misssense mutations predominate. Many patients have both KRAS and TP53 mutations. A small subset of patients have either TP53 or GNAS mutations but no KRAS mutation. These observations could lead to questions such as – What role does mutated KRAS play in these patients? What is distinct about TP53 and GNAS mutations in the absence of KRAS mutations? Could these top 10 genes be affecting certain pathways that may lead to these patients acquiring this disease?

NOTE: The oncoplot() function is a wrapper around ComplexHeatmap’s OncoPrint() function and there are lots and lots of possible customization options as usual with R graphics.

NOTE: Variants annotated as Multi_Hit are those genes which are mutated more than once in the same sample.

Oncostrip

We can visualize any subset set of genes using the oncostrip() function, which draws mutations in each sample similar to the graphic on the NCI-GDC web portal. Note that oncostrip() can be used to draw any number of genes using the input top or genes arguments

oncostrip(maf=vars, genes=c("KRAS", "TP53"))

Another plot focusing on KRAS in our particular dataset.

lollipopPlot(vars, gene='KRAS')
##    HGNC refseq.ID protein.ID aa.length
## 1: KRAS NM_004985  NP_004976       188
## 2: KRAS NM_033360  NP_203524       189

Q11. Adapt the code above to produce a lollipop plot for p53 (i.e. the ‘TP53’ gene)?

Your p53 plot should look like this:

##    HGNC    refseq.ID   protein.ID aa.length
## 1: TP53    NM_000546    NP_000537       393
## 2: TP53 NM_001126112 NP_001119584       393
## 3: TP53 NM_001126113 NP_001119585       346
## 4: TP53 NM_001126114 NP_001119586       341
## 5: TP53 NM_001126115 NP_001119587       261
## 6: TP53 NM_001126116 NP_001119588       209
## 7: TP53 NM_001126117 NP_001119589       214
## 8: TP53 NM_001126118 NP_001119590       354

Summary

Additional functionality is available for each of the GenomicDataCommons, TCGAbiolinks and maftools packages not to mention the 100’s of other bioinformatics R packages that can now work with this type of data in both exploratory and targeted analysis modes.

The purpose of this hands-on session was to highlight how one can leverage three such packages to quickly gain insight into rapidly expanding public cancer genomics data-sets. Hopefully this will inspire your further exploration of these and other bioinformatics R packages.