Marcos Díaz-Gay < https://marcos-diazg.github.io/ >
2023-03-07 (16:42:53 on Tue, Mar 07)
The goal of this hands-on mini-project is to perform a mutational signature analysis of a set of cancer patients and decipher the biological processes underlying the mutations found in the patients’ tumors, and ultimately leading to the development of the disease.
While analyzing the somatic mutations found in a tumor has been commonly done looking for potential driver alterations (defined as those leading to cancer development), in the last decade, mutational signature analysis has revolutionized this paradigm. By exploring and characterizing the patterns of all somatic mutations (not only those considered drivers), it is possible to identify the causes leading to a specific tumor, such as, for example, environmental exposures (UV light exposure) or lifestyle choices (tobacco smoking), as well as internal endogenous cellular processes (like a deficient DNA repair machinery). Somatic mutations are present in all cells of the human body and occur throughout life. They are the consequence of multiple mutational processes, which generate unique combinations of mutation types, termed mutational signatures.
Our input data will be cancer sequencing data from the publicly available cBioPortal platform. In particular, we will analyze data from The Cancer Genome Atlas (TCGA) project. To focus on this particular project and, specifically, on the latest published results, you should search for “tcga pancancer atlas” in the cBioPortal Query
tab, as shown below.
We will perform this mini-project divided into three groups that will jointly discuss the results together at the end of the session. The goal is for each group to identify the shared patterns of mutations of a specific cancer type, as well as the most common mutational signatures active. The potential environmental exposures associated with different cancer types should be considered when analyzing the results. Group 1 will analyze liver hepatocellular carcinomas, group 2 lung squamous cell carcinomas, and group 3 skin cutaneous melanoma.
Side-note: For this lab, colorectal adenocarcinomas will be used as an example, but the questions should be answered considering the specific cancer type of your group.
First of all, after selecting the data set of choice, we will choose the Explore selected studies
button. This will lead us to the Summary panel
, where different clinical features can be explored.
Please work with your group to explore the clinical and molecular data categories present in cBioPortal and answer the questions below.
Side-note: Please provide your answers for your specific cancer type in the red rectangles after each question. Correct answers will be indicated by a change of rectangle color.
Please reply to the question above for any of the cancer types of groups 1, 2, or 3, i.e., liver hepatocellular carcinomas, lung squamous cell carcinomas , or skin cutaneous melanoma.
Hint: cancer samples are not the same as cancer patients, as more than one sample can come from the same patient, e.g., because the patient had multiple tumors or there are various biopsies from the same tumor (at different time points or locations).
In addition, please reflect on the data available in the Ethnicity
and Race
categories. Do you observe any bias regarding the ethnic and racial composition of the TCGA cohort compared to the US general population? Please discuss these clinical data results for your correspondent cancer type with your team. Please select a person to share your thoughts in a joint discussion among all teams.
To download the cancer sequencing data to use for the mutational signature analysis, we will use the following button within the Summary
panel.
This will download all available data for the cancer data set, including clinical and molecular data. The file containing the mutation data is data_mutations.txt
. Although the extension of this file is .txt
, it was actually generated using the MAF format, commonly seen in cancer sequencing data. More information about this format can be found here.
Benefiting from the standard MAF format, we can use the maftools
R package to manage this sequencing data. This package contains lots of useful commands to work with this specific format, starting from reading the actual files in their native format using read.maf
, as well as many other helpful features. More information can be found on the Bioconductor website, as well as in the maftools
vignette.
Following the code below, you can install the maftools
R package, as well as read your input data and generate mutational matrices. Mutational matrices are the first step for mutational signature analysis and correspond to a helpful data type, as it contains no protected information.
In this case, we are going to focus on the SBS96 mutational context, which, as mentioned in the lecture, allows classifying single nucleotide mutations in different categories based on the mutated nucleotide, as well as the immediately preceding and posterior nucleotides to the mutation (a.k.a. the 5’ and 3’ nucleotides from the mutation). Other contexts exist for single nucleotide variants, as well as additional variant types, such as doublet base substitutions (DBSs) or short insertions and deletions (indels; IDs).
Side-note: The code presented was built to run the analysis on the colorectal adenocarcinomas cohort, including the input data used, but also the names of variables (“coad” abbreviation standing for “colorectal adenocarcinomas”), etc. Each group should use this code below as the basis and then adapt it for their own analysis, which will be based on their specific cancer type (defined above in section 1).
# Install required packages
if (!require("BiocManager")){
install.packages("BiocManager")
}if (!require("maftools")){
::install("maftools")
BiocManager
}if (!require("BSgenome.Hsapiens.UCSC.hg19")){ # reference genome needed to
::install("BSgenome.Hsapiens.UCSC.hg19") # generate mutational matrices
BiocManager
}
# Read maf file
library(maftools)
= read.maf('data/coadread_tcga_pan_can_atlas_2018/data_mutations.txt')
coad
# Generate mutational matrix (SBS96 context)
= trinucleotideMatrix(maf = coad, prefix = 'chr', add = TRUE,
mm_coad ref_genome = "BSgenome.Hsapiens.UCSC.hg19")
= t(mm_coad$nmf_matrix) mm_coad
For the visualization of SBS96 mutational profiles, we will make use of the MutationalPatterns
R package. This library is commonly used for all kinds of mutational signature analysis, and we will also use it for the subsequent assignment analysis.
To generate mutational profiles, we will use the previously generated mutational matrices as input. As the cohorts are pretty large, we will focus on exploring the mutational profiles of 4 random samples, the 4 samples harboring the highest number of mutations, the 4 samples with the lowest number of mutations, and the average mutational profile of all samples of the cohort. For the latest visualization, it’s important to keep in mind that we need to first obtain the relative mutational matrix, using percentages instead of absolute values. This step is required to avoid samples with high numbers of mutations to bias the average mutational profile.
Side-note: Although
MutationalPatterns
also allows users to generate mutational matrices, this is only possible using VCF files as input, not MAF format files. The VCF format is the most standard format for data obtained from variant calling tools, and they are commonly individual sample files, i.e., one VCF file per sample. In contrast, MAF files are widely used in large-scale projects for sharing multi-sample data.
# Install MutationalPatterns package
if (!require("MutationalPatterns")){
::install("MutationalPatterns")
BiocManager
}
# Generate mutational profiles (4 random samples)
library(MutationalPatterns)
set.seed(11111) # fixing the seed for random number generation
= sample(1:ncol(mm_coad),4) # selecting 4 random samples
samples_to_plot plot_96_profile(mm_coad[,samples_to_plot], condensed = T)
# Generate mutational profiles (top 4 mutated samples and top 4 less mutated)
= colSums(mm_coad)
mutations_in_samples = sort(mutations_in_samples, decreasing = T)
mutations_in_samples = names(mutations_in_samples)[1:4]
samples_to_plot plot_96_profile(mm_coad[,samples_to_plot], condensed = T)
= sort(mutations_in_samples, decreasing = F)
mutations_in_samples = names(mutations_in_samples)[1:4]
samples_to_plot plot_96_profile(mm_coad[,samples_to_plot], condensed = T)
# Generate average mutational profiles
= apply(mm_coad, 2, prop.table) # obtained relative
relative_mutational_profile # mutational matrix
= rowMeans(relative_mutational_profile)
average_mutational_profile = data.frame(average_mutational_profile)
average_mutational_profile plot_96_profile(average_mutational_profile, condensed = T)
The COSMIC reference set of mutational signatures was derived after the analysis of many thousands of DNA sequenced samples from various cancer types by large international consortia over the last decade. These reference mutational signatures have been associated with specific environmental exposures, lifestyle choices, and endogenous cellular mechanisms. More information about the etiologies of the signatures and the cancer types where they have previously been found is available at the COSMIC Mutational Signatures website.
For our analysis, we are focusing on the SBS96 mutational context. In this regard, all the extracted reference signatures to date can be found here. You will use the COSMIC Mutational Signature website after the mutational signature assignment analysis to gather more information about the most active signatures in your cohort.
Leveraging the COSMIC mutational signatures, we will perform a mutational signature assignment analysis to quantify the number of mutations contributed by each signature to a given cancer sample and, therefore, decipher which mutational processes have been active in each individual tumor.
We will perform two different assignment analyses using the MutationalPatterns
package (as shown in their vignette). First, a standard assignment analysis carried out by the fit_to_signatures
function using the non-negative least squares (NNLS) algorithm. On the other hand, we will also use the fit_to_signatures_strict
function. In this case, a strict analysis including downstream optimization of the NNLS results is used, improving the overall accuracy of the signatures assignment albeit drastically increasing the running time.
# Mutational signature assignment
= get_known_signatures(source = 'COSMIC_v3.2')
cosmic_signatures = fit_to_signatures(mm_coad, cosmic_signatures)
fit_res
# Top contributing signatures
= fit_res$contribution
contributions
= rowMeans(contributions)
top_contributing_signatures_abs = sort(top_contributing_signatures_abs,
top_contributing_signatures_abs decreasing = T)[1:4]
## Top 4 contributing signatures (absolute values)
top_contributing_signatures_abs
## SBS15 SBS10b SBS1 SBS10a
## 114.41157 97.00990 87.95620 55.34383
= apply(contributions,2,prop.table)
relative_contributions = rowMeans(relative_contributions)
top_contributing_signatures_rel = sort(top_contributing_signatures_rel,
top_contributing_signatures_rel decreasing = T)[1:4]
## Top 4 contributing signatures (relative values)
top_contributing_signatures_rel
## SBS1 SBS15 SBS87 SBS6
## 0.25443581 0.13583121 0.12910496 0.05854452
# Mutational signature assignment strict
= fit_to_signatures_strict(mm_coad, cosmic_signatures)
fit_res_strict = fit_res_strict$fit_res
fit_res_strict = fit_res_strict$contribution contributions_strict
To visualize the mutational signature assignment results, we will use the default visualizations available in the MutationalPatterns
package. However, other visualizations are also present as part of maftools
(please check the appropriate section in their vignette) or can be created using ggplot2
and the contributions output matrix from the mutational signature assignment analysis (contributions
or contributions_strict
).
# Visualization of signature assignment results (fit_to_signatures)
set.seed(11111)
= sample(1:ncol(mm_coad),4)
samples_to_plot
plot_contribution(contributions[,samples_to_plot], mode = "absolute")
plot_contribution(contributions[,samples_to_plot], mode = "relative")
plot_contribution_heatmap(contributions, cluster_samples = F)
# Visualization of signature assignment results (strict)
plot_contribution(contributions_strict[,samples_to_plot], mode = "absolute")
plot_contribution(contributions_strict[,samples_to_plot], mode = "relative")
plot_contribution_heatmap(contributions_strict, cluster_samples = F)
Assigning mutational signatures is a mathematical optimization problem, which consists in obtaining the best combination of reference signatures that better reconstruct the original mutational profile. Considering this, it is a good practice to check how good is this reconstruction, because it can happen that even if the method tries to optimize the reconstruction, this is still not good enough. Different similarity measures can be used to check this. The most common in the mutational signatures field is the cosine similarity, although others such as correlation, L1/L2 norm, or KL divergence can be used. A cosine similarity above 0.90 commonly indicates a good reconstruction.
To check the cosine similarity of the reconstruction for some specific samples, we can use the following visualization from the MutationalPatterns
R package.
# Cosine similarity reconstruction vs. original mutational profile (fit_to_signatures)
set.seed(11111)
= sample(1:ncol(mm_coad),4)
samples_to_plot
plot_original_vs_reconstructed(mm_coad[,samples_to_plot],
$reconstructed[,samples_to_plot],
fit_resy_intercept = 0.90)
# Cosine similarity reconstruction vs. original mutational profile (strict)
plot_original_vs_reconstructed(mm_coad[,samples_to_plot],
$reconstructed[,samples_to_plot],
fit_res_stricty_intercept = 0.90)
Please work with your group to jointly evaluate the results obtained for the mutational signature analysis, including both analysis (using fit_to_signatures
and fit_to_signatures_strict
), as well as relative and absolute contributions of the different COSMIC reference signatures. Subsequently, please select a person to present the results to other groups in a joint discussion. Please answer the questions below using the results obtained by your group as well as those shared by other groups. In addition, if time allows, please feel free to perform the analysis individually for each of the three cancer types.
Please make use of the COSMIC Mutational Signatures database to learn more about the etiologies (or causes) of the reference signatures, as well as the other cancer types where they have been previously found active.
Mutational signature assignment is a dynamic field, where new methods are currently being developed. One of the most recent tools is SigProfilerAssignment
, developed in the Alexandrov lab at UC San Diego. In contrast to maftools
or MutationalPatterns
, SigProfilerAssignment
is a python package, so the installation process differs from prior packages considered in this lab. However, SigProfilerAssignment
is also available as an R wrapper (SigProfilerAssignmentR
), making this process easier.
First, we will need to install the Miniconda
package manager, which is highly recommended for python users and can also be used to manage your R dependencies. Miniconda
philosophy is built around conda environments, which correspond to closed containers where all the versions of the different python/R dependencies are fixed. This structure highly benefits the reproducibility of the code, as each analysis can be done with an exclusively dedicated environment, which will not change any dependency version, even though the general python installation of the computer or a particular package is updated. Please note that Miniconda
installation differs between macOS and Windows users.
# Install Miniconda
## macOS (Intel)
system('wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh')
system('bash Miniconda3-latest-MacOSX-x86_64.sh')
## macOS (Apple M1)
system('wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh')
system('Miniconda3-latest-MacOSX-arm64.sh')
## Windows
# Download https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe
# Install python
## macOS
= 'bimm143_wi22'
new_environment_name = ' /Users/mdiazgay/opt/miniconda3/' # Should be changed
path_of_miniconda = paste0(path_of_miniconda, '/bin/conda')
path_of_conda system(paste0(path_of_conda, ' create --name ', new_environment_name, ' -y'))
system(paste0(path_of_conda, ' install -n ', new_environment_name, ' python=3.10 -y'))
## Windows (using Anaconda Prompt)
# conda create --name bimm143_wi23 -y
# conda activate bimm143_wi23
# conda install python=3.10 -y
# Install SigProfilerAssigment python package
## macOS
system(paste0(path_of_miniconda, '/envs/', new_environment_name,
'/bin/pip install SigProfilerAssignment'))
## Windows (using Anaconda Prompt)
# conda activate bimm143_wi23
# pip install SigProfilerAssignment
# Install reticulate (package connecting python & R)
if (!require("reticulate")){
install.packages("reticulate")
}
# Install R wrapper (SigProfilerAssignmentR)
if (!require("devtools")){
install.packages("devtools")
}::install_github("AlexandrovLab/SigProfilerAssignmentR")
devtools::install_github('marcos-diazg/utils.mdg') devtools
# Generate mutational assignment analysis using SigProfilerAssignment
library(reticulate)
use_condaenv('bimm143_wi22')
py_config()
## python: /Users/mdiazgay/opt/miniconda3/envs/bimm143_wi22/bin/python
## libpython: /Users/mdiazgay/opt/miniconda3/envs/bimm143_wi22/lib/libpython3.10.dylib
## pythonhome: /Users/mdiazgay/opt/miniconda3/envs/bimm143_wi22:/Users/mdiazgay/opt/miniconda3/envs/bimm143_wi22
## version: 3.10.9 (main, Mar 1 2023, 12:33:47) [Clang 14.0.6 ]
## numpy: /Users/mdiazgay/opt/miniconda3/envs/bimm143_wi22/lib/python3.10/site-packages/numpy
## numpy_version: 1.24.2
##
## NOTE: Python version was forced by RETICULATE_PYTHON
library(utils.mdg)
= map_MP_to_SP(mm_coad) # Order of rows is different in
mm_coad_sigprofiler # SigProfiler vs. MutationalPatterns
library(SigProfilerAssignmentR)
cosmic_fit(mm_coad_sigprofiler, './results', export_probabilities = F)
# Output results folder will be in the new "results" directory inside
# your current working directory
Here we use the sessionInfo()
function to report on our R systems setup at the time of document execution.
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.60.0
## [3] rtracklayer_1.52.1 Biostrings_2.60.2
## [5] XVector_0.32.0 MutationalPatterns_3.2.0
## [7] NMF_0.25 Biobase_2.52.0
## [9] cluster_2.1.2 rngtools_1.5.2
## [11] registry_0.5-1 GenomicRanges_1.44.0
## [13] GenomeInfoDb_1.28.4 IRanges_2.26.0
## [15] S4Vectors_0.30.2 BiocGenerics_0.38.0
## [17] maftools_2.8.05 BiocManager_1.30.16
## [19] labsheet_0.1.2
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 matrixStats_0.61.0
## [3] doParallel_1.0.17 RColorBrewer_1.1-2
## [5] tools_4.1.1 bslib_0.4.2
## [7] utf8_1.2.2 R6_2.5.1
## [9] DBI_1.1.1 colorspace_2.0-2
## [11] tidyselect_1.1.1 compiler_4.1.1
## [13] cli_3.6.0 DelayedArray_0.18.0
## [15] sass_0.4.5 scales_1.1.1
## [17] stringr_1.5.0 digest_0.6.31
## [19] Rsamtools_2.8.0 rmarkdown_2.20
## [21] pkgconfig_2.0.3 htmltools_0.5.4
## [23] MatrixGenerics_1.4.3 highr_0.10
## [25] fastmap_1.1.1 rlang_1.0.6
## [27] rstudioapi_0.13 farver_2.1.0
## [29] jquerylib_0.1.4 BiocIO_1.2.0
## [31] generics_0.1.0 jsonlite_1.8.4
## [33] BiocParallel_1.26.2 dplyr_1.0.7
## [35] RCurl_1.98-1.5 magrittr_2.0.3
## [37] GenomeInfoDbData_1.2.6 Matrix_1.3-4
## [39] Rcpp_1.0.7 munsell_0.5.0
## [41] fansi_0.5.0 lifecycle_1.0.3
## [43] stringi_1.7.12 yaml_2.3.7
## [45] ggalluvial_0.12.5 SummarizedExperiment_1.22.0
## [47] zlibbioc_1.38.0 plyr_1.8.6
## [49] grid_4.1.1 crayon_1.5.1
## [51] lattice_0.20-45 splines_4.1.1
## [53] knitr_1.42 pillar_1.8.1
## [55] rjson_0.2.21 reshape2_1.4.4
## [57] codetools_0.2-18 XML_3.99-0.8
## [59] glue_1.6.2 evaluate_0.20
## [61] data.table_1.14.2 vctrs_0.5.2
## [63] foreach_1.5.1 gtable_0.3.0
## [65] purrr_0.3.4 tidyr_1.1.4
## [67] assertthat_0.2.1 cachem_1.0.7
## [69] ggplot2_3.3.5 xfun_0.37
## [71] gridBase_0.4-7 restfulr_0.0.13
## [73] pracma_2.4.2 survival_3.2-13
## [75] tibble_3.1.8 iterators_1.0.13
## [77] GenomicAlignments_1.28.0 ellipsis_0.3.2
Powered by labsheet