Marcos Díaz-Gay < https://marcos-diazg.github.io/ >
2023-03-07 (16:42:53 on Tue, Mar 07)

Background

The goal of this hands-on mini-project is to perform a mutational signature analysis of a set of cancer patients and decipher the biological processes underlying the mutations found in the patients’ tumors, and ultimately leading to the development of the disease.

While analyzing the somatic mutations found in a tumor has been commonly done looking for potential driver alterations (defined as those leading to cancer development), in the last decade, mutational signature analysis has revolutionized this paradigm. By exploring and characterizing the patterns of all somatic mutations (not only those considered drivers), it is possible to identify the causes leading to a specific tumor, such as, for example, environmental exposures (UV light exposure) or lifestyle choices (tobacco smoking), as well as internal endogenous cellular processes (like a deficient DNA repair machinery). Somatic mutations are present in all cells of the human body and occur throughout life. They are the consequence of multiple mutational processes, which generate unique combinations of mutation types, termed mutational signatures.

1. Exploring a cancer sequencing data portal

Our input data will be cancer sequencing data from the publicly available cBioPortal platform. In particular, we will analyze data from The Cancer Genome Atlas (TCGA) project. To focus on this particular project and, specifically, on the latest published results, you should search for “tcga pancancer atlas” in the cBioPortal Query tab, as shown below.



We will perform this mini-project divided into three groups that will jointly discuss the results together at the end of the session. The goal is for each group to identify the shared patterns of mutations of a specific cancer type, as well as the most common mutational signatures active. The potential environmental exposures associated with different cancer types should be considered when analyzing the results. Group 1 will analyze liver hepatocellular carcinomas, group 2 lung squamous cell carcinomas, and group 3 skin cutaneous melanoma.

Side-note: For this lab, colorectal adenocarcinomas will be used as an example, but the questions should be answered considering the specific cancer type of your group.

First of all, after selecting the data set of choice, we will choose the Explore selected studies button. This will lead us to the Summary panel, where different clinical features can be explored.


Discussion #1

Please work with your group to explore the clinical and molecular data categories present in cBioPortal and answer the questions below.

Side-note: Please provide your answers for your specific cancer type in the red rectangles after each question. Correct answers will be indicated by a change of rectangle color.

  • Q. How many cancer samples are included in the dataset?
  • Q. Which is the most mutated gene?
  • Q. Which is the most common treatment undergone by patients?

Please reply to the question above for any of the cancer types of groups 1, 2, or 3, i.e., liver hepatocellular carcinomas, lung squamous cell carcinomas , or skin cutaneous melanoma.

Hint: cancer samples are not the same as cancer patients, as more than one sample can come from the same patient, e.g., because the patient had multiple tumors or there are various biopsies from the same tumor (at different time points or locations).

In addition, please reflect on the data available in the Ethnicity and Race categories. Do you observe any bias regarding the ethnic and racial composition of the TCGA cohort compared to the US general population? Please discuss these clinical data results for your correspondent cancer type with your team. Please select a person to share your thoughts in a joint discussion among all teams.

2. Downloading cancer sequencing data

To download the cancer sequencing data to use for the mutational signature analysis, we will use the following button within the Summary panel.



This will download all available data for the cancer data set, including clinical and molecular data. The file containing the mutation data is data_mutations.txt. Although the extension of this file is .txt, it was actually generated using the MAF format, commonly seen in cancer sequencing data. More information about this format can be found here.

3. Generating mutational matrices and visualizing mutational profiles

Benefiting from the standard MAF format, we can use the maftools R package to manage this sequencing data. This package contains lots of useful commands to work with this specific format, starting from reading the actual files in their native format using read.maf, as well as many other helpful features. More information can be found on the Bioconductor website, as well as in the maftools vignette.

Following the code below, you can install the maftools R package, as well as read your input data and generate mutational matrices. Mutational matrices are the first step for mutational signature analysis and correspond to a helpful data type, as it contains no protected information.

In this case, we are going to focus on the SBS96 mutational context, which, as mentioned in the lecture, allows classifying single nucleotide mutations in different categories based on the mutated nucleotide, as well as the immediately preceding and posterior nucleotides to the mutation (a.k.a. the 5’ and 3’ nucleotides from the mutation). Other contexts exist for single nucleotide variants, as well as additional variant types, such as doublet base substitutions (DBSs) or short insertions and deletions (indels; IDs).

Side-note: The code presented was built to run the analysis on the colorectal adenocarcinomas cohort, including the input data used, but also the names of variables (“coad” abbreviation standing for “colorectal adenocarcinomas”), etc. Each group should use this code below as the basis and then adapt it for their own analysis, which will be based on their specific cancer type (defined above in section 1).

# Install required packages
if (!require("BiocManager")){
    install.packages("BiocManager")
}
if (!require("maftools")){
    BiocManager::install("maftools")
}
if (!require("BSgenome.Hsapiens.UCSC.hg19")){         # reference genome needed to
BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")   # generate mutational matrices
}

# Read maf file
library(maftools)
coad = read.maf('data/coadread_tcga_pan_can_atlas_2018/data_mutations.txt')

# Generate mutational matrix (SBS96 context)
mm_coad = trinucleotideMatrix(maf = coad, prefix = 'chr', add = TRUE,
                              ref_genome = "BSgenome.Hsapiens.UCSC.hg19")
mm_coad = t(mm_coad$nmf_matrix)

For the visualization of SBS96 mutational profiles, we will make use of the MutationalPatterns R package. This library is commonly used for all kinds of mutational signature analysis, and we will also use it for the subsequent assignment analysis.

To generate mutational profiles, we will use the previously generated mutational matrices as input. As the cohorts are pretty large, we will focus on exploring the mutational profiles of 4 random samples, the 4 samples harboring the highest number of mutations, the 4 samples with the lowest number of mutations, and the average mutational profile of all samples of the cohort. For the latest visualization, it’s important to keep in mind that we need to first obtain the relative mutational matrix, using percentages instead of absolute values. This step is required to avoid samples with high numbers of mutations to bias the average mutational profile.

Side-note: Although MutationalPatterns also allows users to generate mutational matrices, this is only possible using VCF files as input, not MAF format files. The VCF format is the most standard format for data obtained from variant calling tools, and they are commonly individual sample files, i.e., one VCF file per sample. In contrast, MAF files are widely used in large-scale projects for sharing multi-sample data.

# Install MutationalPatterns package
if (!require("MutationalPatterns")){
BiocManager::install("MutationalPatterns")
}

# Generate mutational profiles (4 random samples)
library(MutationalPatterns)
set.seed(11111) # fixing the seed for random number generation

samples_to_plot = sample(1:ncol(mm_coad),4) # selecting 4 random samples
plot_96_profile(mm_coad[,samples_to_plot], condensed = T)

# Generate mutational profiles (top 4 mutated samples and top 4 less mutated)
mutations_in_samples = colSums(mm_coad)
mutations_in_samples = sort(mutations_in_samples, decreasing = T)
samples_to_plot = names(mutations_in_samples)[1:4]
plot_96_profile(mm_coad[,samples_to_plot], condensed = T)

mutations_in_samples = sort(mutations_in_samples, decreasing = F)
samples_to_plot = names(mutations_in_samples)[1:4]
plot_96_profile(mm_coad[,samples_to_plot], condensed = T)

# Generate average mutational profiles
relative_mutational_profile = apply(mm_coad, 2, prop.table) # obtained relative
                                                            # mutational matrix
average_mutational_profile = rowMeans(relative_mutational_profile)
average_mutational_profile = data.frame(average_mutational_profile)
plot_96_profile(average_mutational_profile, condensed = T)

4. COSMIC reference mutational signatures

The COSMIC reference set of mutational signatures was derived after the analysis of many thousands of DNA sequenced samples from various cancer types by large international consortia over the last decade. These reference mutational signatures have been associated with specific environmental exposures, lifestyle choices, and endogenous cellular mechanisms. More information about the etiologies of the signatures and the cancer types where they have previously been found is available at the COSMIC Mutational Signatures website.

For our analysis, we are focusing on the SBS96 mutational context. In this regard, all the extracted reference signatures to date can be found here. You will use the COSMIC Mutational Signature website after the mutational signature assignment analysis to gather more information about the most active signatures in your cohort.


5. Assigning reference mutational signatures

Leveraging the COSMIC mutational signatures, we will perform a mutational signature assignment analysis to quantify the number of mutations contributed by each signature to a given cancer sample and, therefore, decipher which mutational processes have been active in each individual tumor.

We will perform two different assignment analyses using the MutationalPatterns package (as shown in their vignette). First, a standard assignment analysis carried out by the fit_to_signatures function using the non-negative least squares (NNLS) algorithm. On the other hand, we will also use the fit_to_signatures_strict function. In this case, a strict analysis including downstream optimization of the NNLS results is used, improving the overall accuracy of the signatures assignment albeit drastically increasing the running time.

# Mutational signature assignment
cosmic_signatures = get_known_signatures(source = 'COSMIC_v3.2')
fit_res = fit_to_signatures(mm_coad, cosmic_signatures)

# Top contributing signatures
contributions = fit_res$contribution

top_contributing_signatures_abs = rowMeans(contributions)
top_contributing_signatures_abs = sort(top_contributing_signatures_abs,
                                       decreasing = T)[1:4]

## Top 4 contributing signatures (absolute values)
top_contributing_signatures_abs
##     SBS15    SBS10b      SBS1    SBS10a 
## 114.41157  97.00990  87.95620  55.34383
relative_contributions = apply(contributions,2,prop.table)
top_contributing_signatures_rel = rowMeans(relative_contributions)
top_contributing_signatures_rel = sort(top_contributing_signatures_rel,
                                       decreasing = T)[1:4]

## Top 4 contributing signatures (relative values)
top_contributing_signatures_rel
##       SBS1      SBS15      SBS87       SBS6 
## 0.25443581 0.13583121 0.12910496 0.05854452
# Mutational signature assignment strict
fit_res_strict = fit_to_signatures_strict(mm_coad, cosmic_signatures)
fit_res_strict = fit_res_strict$fit_res
contributions_strict = fit_res_strict$contribution

6. Visualizing mutational signature assignment results

To visualize the mutational signature assignment results, we will use the default visualizations available in the MutationalPatterns package. However, other visualizations are also present as part of maftools (please check the appropriate section in their vignette) or can be created using ggplot2 and the contributions output matrix from the mutational signature assignment analysis (contributions or contributions_strict).

# Visualization of signature assignment results (fit_to_signatures)
set.seed(11111)
samples_to_plot = sample(1:ncol(mm_coad),4)

plot_contribution(contributions[,samples_to_plot], mode = "absolute")

plot_contribution(contributions[,samples_to_plot], mode = "relative")

plot_contribution_heatmap(contributions, cluster_samples = F)

# Visualization of signature assignment results (strict)
plot_contribution(contributions_strict[,samples_to_plot], mode = "absolute")

plot_contribution(contributions_strict[,samples_to_plot], mode = "relative")

plot_contribution_heatmap(contributions_strict, cluster_samples = F)

Assigning mutational signatures is a mathematical optimization problem, which consists in obtaining the best combination of reference signatures that better reconstruct the original mutational profile. Considering this, it is a good practice to check how good is this reconstruction, because it can happen that even if the method tries to optimize the reconstruction, this is still not good enough. Different similarity measures can be used to check this. The most common in the mutational signatures field is the cosine similarity, although others such as correlation, L1/L2 norm, or KL divergence can be used. A cosine similarity above 0.90 commonly indicates a good reconstruction.

To check the cosine similarity of the reconstruction for some specific samples, we can use the following visualization from the MutationalPatterns R package.

# Cosine similarity reconstruction vs. original mutational profile (fit_to_signatures)
set.seed(11111)
samples_to_plot = sample(1:ncol(mm_coad),4)

plot_original_vs_reconstructed(mm_coad[,samples_to_plot],
                               fit_res$reconstructed[,samples_to_plot], 
                               y_intercept = 0.90)

# Cosine similarity reconstruction vs. original mutational profile (strict)
plot_original_vs_reconstructed(mm_coad[,samples_to_plot],
                               fit_res_strict$reconstructed[,samples_to_plot], 
                               y_intercept = 0.90)

Discussion #2

Please work with your group to jointly evaluate the results obtained for the mutational signature analysis, including both analysis (using fit_to_signatures and fit_to_signatures_strict), as well as relative and absolute contributions of the different COSMIC reference signatures. Subsequently, please select a person to present the results to other groups in a joint discussion. Please answer the questions below using the results obtained by your group as well as those shared by other groups. In addition, if time allows, please feel free to perform the analysis individually for each of the three cancer types.

  • Q. Which is the etiology of the top absolute contributing signature for liver cancer?
  • Q. Which is the most prominent mutational context for the top contributing signature in skin cancer?
  • Q. The etiology of the top contributing signature for lung cancer corresponds to an endogenous cellular mechanism.
  • Q. SBS4 is one of the most common signatures found in lung cancer and is associated with tobacco smoking.
  • Q. SBS7d is one of the most common signatures in skin cancer and is associated with UV light exposure and high numbers of C>T mutations.

Please make use of the COSMIC Mutational Signatures database to learn more about the etiologies (or causes) of the reference signatures, as well as the other cancer types where they have been previously found active.

7. Advanced mutational signature analysis using the SigProfilerAssignment python package [OPTIONAL]

Mutational signature assignment is a dynamic field, where new methods are currently being developed. One of the most recent tools is SigProfilerAssignment, developed in the Alexandrov lab at UC San Diego. In contrast to maftools or MutationalPatterns, SigProfilerAssignment is a python package, so the installation process differs from prior packages considered in this lab. However, SigProfilerAssignment is also available as an R wrapper (SigProfilerAssignmentR), making this process easier.

First, we will need to install the Miniconda package manager, which is highly recommended for python users and can also be used to manage your R dependencies. Miniconda philosophy is built around conda environments, which correspond to closed containers where all the versions of the different python/R dependencies are fixed. This structure highly benefits the reproducibility of the code, as each analysis can be done with an exclusively dedicated environment, which will not change any dependency version, even though the general python installation of the computer or a particular package is updated. Please note that Miniconda installation differs between macOS and Windows users.

# Install Miniconda
## macOS (Intel)
system('wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh')
system('bash Miniconda3-latest-MacOSX-x86_64.sh')

## macOS (Apple M1)
system('wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh')
system('Miniconda3-latest-MacOSX-arm64.sh')

## Windows
# Download https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe
# Install python
## macOS
new_environment_name = 'bimm143_wi22'
path_of_miniconda = ' /Users/mdiazgay/opt/miniconda3/' # Should be changed
path_of_conda = paste0(path_of_miniconda, '/bin/conda')
system(paste0(path_of_conda, ' create --name ', new_environment_name, ' -y'))
system(paste0(path_of_conda, ' install -n ', new_environment_name, ' python=3.10 -y'))

## Windows (using Anaconda Prompt)
# conda create --name bimm143_wi23 -y
# conda activate bimm143_wi23
# conda install python=3.10 -y

# Install SigProfilerAssigment python package
## macOS
system(paste0(path_of_miniconda, '/envs/', new_environment_name,
              '/bin/pip install SigProfilerAssignment'))

## Windows (using Anaconda Prompt)
# conda activate bimm143_wi23
# pip install SigProfilerAssignment

# Install reticulate (package connecting python & R)
if (!require("reticulate")){
    install.packages("reticulate")
}

# Install R wrapper (SigProfilerAssignmentR)
if (!require("devtools")){
    install.packages("devtools")
}
devtools::install_github("AlexandrovLab/SigProfilerAssignmentR")
devtools::install_github('marcos-diazg/utils.mdg')
# Generate mutational assignment analysis using SigProfilerAssignment
library(reticulate)
use_condaenv('bimm143_wi22')
py_config()
## python:         /Users/mdiazgay/opt/miniconda3/envs/bimm143_wi22/bin/python
## libpython:      /Users/mdiazgay/opt/miniconda3/envs/bimm143_wi22/lib/libpython3.10.dylib
## pythonhome:     /Users/mdiazgay/opt/miniconda3/envs/bimm143_wi22:/Users/mdiazgay/opt/miniconda3/envs/bimm143_wi22
## version:        3.10.9 (main, Mar  1 2023, 12:33:47) [Clang 14.0.6 ]
## numpy:          /Users/mdiazgay/opt/miniconda3/envs/bimm143_wi22/lib/python3.10/site-packages/numpy
## numpy_version:  1.24.2
## 
## NOTE: Python version was forced by RETICULATE_PYTHON
library(utils.mdg)
mm_coad_sigprofiler = map_MP_to_SP(mm_coad) # Order of rows is different in
                                            # SigProfiler vs. MutationalPatterns
library(SigProfilerAssignmentR)
cosmic_fit(mm_coad_sigprofiler, './results', export_probabilities = F)
# Output results folder will be in the new "results" directory inside
# your current working directory

About this document

Here we use the sessionInfo() function to report on our R systems setup at the time of document execution.

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.60.0                  
##  [3] rtracklayer_1.52.1                Biostrings_2.60.2                
##  [5] XVector_0.32.0                    MutationalPatterns_3.2.0         
##  [7] NMF_0.25                          Biobase_2.52.0                   
##  [9] cluster_2.1.2                     rngtools_1.5.2                   
## [11] registry_0.5-1                    GenomicRanges_1.44.0             
## [13] GenomeInfoDb_1.28.4               IRanges_2.26.0                   
## [15] S4Vectors_0.30.2                  BiocGenerics_0.38.0              
## [17] maftools_2.8.05                   BiocManager_1.30.16              
## [19] labsheet_0.1.2                   
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7                matrixStats_0.61.0         
##  [3] doParallel_1.0.17           RColorBrewer_1.1-2         
##  [5] tools_4.1.1                 bslib_0.4.2                
##  [7] utf8_1.2.2                  R6_2.5.1                   
##  [9] DBI_1.1.1                   colorspace_2.0-2           
## [11] tidyselect_1.1.1            compiler_4.1.1             
## [13] cli_3.6.0                   DelayedArray_0.18.0        
## [15] sass_0.4.5                  scales_1.1.1               
## [17] stringr_1.5.0               digest_0.6.31              
## [19] Rsamtools_2.8.0             rmarkdown_2.20             
## [21] pkgconfig_2.0.3             htmltools_0.5.4            
## [23] MatrixGenerics_1.4.3        highr_0.10                 
## [25] fastmap_1.1.1               rlang_1.0.6                
## [27] rstudioapi_0.13             farver_2.1.0               
## [29] jquerylib_0.1.4             BiocIO_1.2.0               
## [31] generics_0.1.0              jsonlite_1.8.4             
## [33] BiocParallel_1.26.2         dplyr_1.0.7                
## [35] RCurl_1.98-1.5              magrittr_2.0.3             
## [37] GenomeInfoDbData_1.2.6      Matrix_1.3-4               
## [39] Rcpp_1.0.7                  munsell_0.5.0              
## [41] fansi_0.5.0                 lifecycle_1.0.3            
## [43] stringi_1.7.12              yaml_2.3.7                 
## [45] ggalluvial_0.12.5           SummarizedExperiment_1.22.0
## [47] zlibbioc_1.38.0             plyr_1.8.6                 
## [49] grid_4.1.1                  crayon_1.5.1               
## [51] lattice_0.20-45             splines_4.1.1              
## [53] knitr_1.42                  pillar_1.8.1               
## [55] rjson_0.2.21                reshape2_1.4.4             
## [57] codetools_0.2-18            XML_3.99-0.8               
## [59] glue_1.6.2                  evaluate_0.20              
## [61] data.table_1.14.2           vctrs_0.5.2                
## [63] foreach_1.5.1               gtable_0.3.0               
## [65] purrr_0.3.4                 tidyr_1.1.4                
## [67] assertthat_0.2.1            cachem_1.0.7               
## [69] ggplot2_3.3.5               xfun_0.37                  
## [71] gridBase_0.4-7              restfulr_0.0.13            
## [73] pracma_2.4.2                survival_3.2-13            
## [75] tibble_3.1.8                iterators_1.0.13           
## [77] GenomicAlignments_1.28.0    ellipsis_0.3.2
 

Powered by labsheet