Bio3D_nma-dhfr-partII.Rmd
Bio3D1 is an R package that provides interactive tools for structural bioinformatics. The primary focus of Bio3D is the analysis of bimolecular structure, sequence and simulation data (Grant et al. 2006).
Normal mode analysis (NMA) is one of the major simulation techniques used to probe large-scale motions in biomolecules. Typical application is for the prediction of functional motions in proteins. Version 2.0 of the Bio3D package now includes extensive NMA facilities (Skjaerven et al. 2015). These include a unique collection of multiple elastic network model force-fields, automated ensemble analysis methods, and variance weighted NMA (see also the NMA Vignette). Here we provide an in-depth demonstration of ensemble NMA with working code that comprise complete executable examples2.
Detailed instructions for obtaining and installing the Bio3D package on various platforms can be found in the Installing Bio3D Vignette available both on-line and from within the Bio3D package. In addition to Bio3D the MUSCLE and CLUSTALO multiple sequence alignment programs (available from the muscle home page and clustalo home page) must be installed on your system and in the search path for executables. Please see the installation vignette for further details.
In this vignette we extend the analysis from Part I by including a more extensive search of distant homologues within the DHFR family. Based on a HMMER search we identify and collect protein species down to a pairwise sequence identity of 20%. Normal modes analysis (NMA) across these species reveals a remarkable similarity of the fluctuation profiles, but also features which are characteristic to specific species.
Having identified relevant PDB structures through the hmmer search we proceed by fetching and pre-processing the PDB files with functions get.pdb() and pdbsplit().
As in the previous vignette, we are interested in protein structures without missing in-structure residues, and we also want to limit the number of identifical conformers:
# fetch and split PDBs raw.files <- get.pdb(ids, path = "raw_pdbs", gzip=TRUE) files <- pdbsplit(raw.files, ids = ids, path = "raw_pdbs/split_chain", ncore=4) pdbs.all <- pdbaln(files) # exclude hits with fusion proteins gaps <- gap.inspect(pdbs.all$ali) pdbs <- trim(pdbs.all, row.inds=which(gaps$row > 200)) # exclude specific hits excl.inds <- unlist(lapply(c("5dxv"), grep, pdbs$id)) pdbs <- trim(pdbs, row.inds=-excl.inds) # exclude structures with missing residues conn <- inspect.connectivity(pdbs, cut=4.05) pdbs <- trim(pdbs, row.inds=which(conn)) # exclude conformational redundant structures rd <- filter.rmsd(pdbs$xyz, cutoff=0.25, fit=TRUE, ncore=4) pdbs <- trim(pdbs, row.inds=rd$ind)
In this particular case a standard sequence alignment (e.g. through function pdbaln() or seqaln()) is not sufficient for a correct alignment. We will therefore make use of the Pfam profile alignment, and align our selected PDBs to this using argument profile
to function seqaln(). Subsequently, we re-read the fasta file, and use function read.fasta.pdb() to obtain aligned C-alpha atom data (including coordinates etc.) for the PDB ensemble:
# align pdbs to Pfam-profile alignment aln <- seqaln(pdbs, profile=pfam.aln, exefile="clustalo", extra.args="--dealign") # final alignment will also contain the profile # store only PDBs in alignment aln$ali <- aln$ali[1:length(pdbs$id),] aln$id <- aln$id[1:length(pdbs$id)] # re-read PDBs to match the new alignment pdbs <- read.fasta.pdb(aln) # exclude gap-only columns pdbs <- trim(pdbs) # refit coordinates pdbs$xyz <- pdbfit(pdbs)
# refit coordinates, and write PDBs to disk pdbs$xyz <- pdbfit(pdbs, outpath="flsq/")
# fetch IDs again ids <- basename.pdb(pdbs$id) species <- hmm$hit.tbl$species[hmm$hit.tbl$acc %in% ids]
# labels for annotating plots labs <- paste(substr(species, 1,1), ". ", lapply(strsplit(species, " "), function(x) x[2]), sep="") print(unique(labs))
## [1] "G. stearothermophilus" "B. anthracis" "M. profunda"
## [4] "Y. pestis" "K. pneumoniae" "E. coli"
## [7] "E. faecalis" "S. aureus" "S. pneumoniae"
## [10] "L. casei" "M. tuberculosis" "M. smegmatis"
## [13] "M. avium" "H. volcanii" "S. mansoni"
## [16] "P. carinii" "M. musculus" "G. gallus"
## [19] "H. sapiens" "C. albicans" "[. glabrata"
The pdbs object now contains aligned C-alpha atom data, including Cartesian coordinates, residue numbers, residue types, and B-factors. The sequence alignment is also stored by default to the FASTA format file ‘aln.fa’ (to view this you can use an alignment viewer such as SEAVIEW, see Requirements section above).
Function seqidentity() can be used to calculate the sequence identity for the PDBs ensemble. Below we also print a summary of the calculated sequence identities, and perform a clustering of the structures based on sequence identity:
seqide <- seqidentity(pdbs) summary(c(seqide))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2130 0.3080 0.3380 0.4485 0.3910 1.0000
hclustplot(hc, k=3, labels=labs, cex=0.25, fillbox=FALSE)
Function nma.pdbs() will calculate the normal modes of each protein structures stored in the pdbs object. The normal modes are calculated on the full structures as provided by object pdbs. Use argument rm.gaps=FALSE
to visualize fluctuations also of un-aligned residues:
modes <- nma(pdbs, rm.gaps=FALSE, ncore=4)
The modes object of class enma contains aligned normal mode data including fluctuations, RMSIP data (only whenrm.gaps=FALSE
), and aligned eigenvectors. A short summary of the modes object can be obtain by calling the function print(), and the aligned fluctuations can be plotted with function plot.enma():
print(modes)
##
## Call:
## nma.pdbs(pdbs = pdbs, rm.gaps = FALSE, ncore = 4)
##
## Class:
## enma
##
## Number of structures:
## 259
##
## Attributes stored:
## - Aligned atomic fluctuations
## - Aligned eigenvectors (gaps not removed)
## - Dimensions of x$U.subspace: 720x381x259
##
## Coordinates were aligned prior to NMA calculations
##
## + attr: fluctuations, rmsip, U.subspace, L, full.nma, xyz,
## call
## Warning in dssp.pdb(pdb.ref, ...): Non-protein residues detected in input PDB:
## SO4, HOH
resno <- pdbs$resno[1, ] resno[is.na(resno)] <- "" xlab <- paste0("Residue number (reference PDB: ", basename.pdb(pdbs$id[1]), ")") par(mfrow=c(2,1), mar=c(4, 4, 2, 2)) plot(modes, pdbs, ylim=c(0,2), col=grps.seq, label=NULL, xlab=xlab)
## Warning in dssp.pdb(pdb.ref, ...): Non-protein residues detected in input PDB:
## SO4, HOH
plot.bio3d(cons, resno=resno, sse=sse, ylab="Conservation", xlab=xlab)
In some cases it can be difficult to interpret the fluctuation plot when all lines are plotted on top of each other. Argument spread=TRUE
adds a small gap between grouped fluctuation profiles. Use this argument in combination with a new groups (grps
) variable to function plot.enma():
grps <- rep(NA, length(grps.seq)) grps[grep("coli", labs)]=1 grps[grep("aureus", labs)]=2 grps[grep("anthracis", labs)]=3 grps[grep("tubercu", labs)]=4 grps[grep("casei", labs)]=5 grps[grep("sapiens", labs)]=6 grps[grep("albicans", labs)]=7 grps[grep("glabrata", labs)]=8 grps[grep("carinii", labs)]=9 plot(modes, pdbs=pdbs, col=grps, spread=TRUE, ylim=c(0,1.5), label=NULL)
## Warning in dssp.pdb(pdb.ref, ...): Non-protein residues detected in input PDB:
## NDP, TOP, HOH
A function call to mktrj.enma() will generate a trajectory PDB file for the visualization of a specific normal mode for one of the structures in the pdbs object. This allows for a visual comparison of the calculated normal modes. Below we make a PDB trajectory of the first mode (argument m.inds=1
) of 3 relevant species (e.g. argument s.inds=1
). Note that we use grep() to fetch the indices (in the modes and pdbs objects) of the relevant species:
inds <- c(grep("coli", species)[1], grep("sapiens", species)[1], grep("albicans", species)[1]) # E.coli mktrj(modes, pdbs, m.inds=1, s.inds=inds[1], file="ecoli-mode1.pdb") # H. sapiens mktrj(modes, pdbs, m.inds=1, s.inds=inds[2], file="hsapiens-mode1.pdb") # C. albicans mktrj(modes, pdbs, m.inds=1, s.inds=inds[3], file="calbicans-mode1.pdb")
This document is shipped with the Bio3D package in both R and PDF formats. All code can be extracted and automatically executed to generate Figures and/or the PDF with the following commands:
print(sessionInfo(), FALSE)
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.15.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bio3d_2.4-1.9000
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 digest_0.6.25 crayon_1.3.4 rprojroot_1.3-2
## [5] assertthat_0.2.1 grid_3.6.0 R6_2.4.1 backports_1.1.8
## [9] magrittr_1.5 evaluate_0.14 highr_0.8 stringi_1.4.6
## [13] rlang_0.4.6.9000 fs_1.4.2 rmarkdown_2.3.2 pkgdown_1.5.1.9000
## [17] desc_1.2.0 tools_3.6.0 stringr_1.4.0 parallel_3.6.0
## [21] yaml_2.2.1 xfun_0.15 compiler_3.6.0 memoise_1.1.0
## [25] htmltools_0.5.0 knitr_1.29
Grant, B. J., A. P. D. C Rodrigues, K. M. Elsawy, A. J. Mccammon, and L. S. D. Caves. 2006. “Bio3d: An R Package for the Comparative Analysis of Protein Structures.” Bioinformatics 22: 2695–6. https://doi.org/10.1093/bioinformatics/btl461.
Skjaerven, L., X. Q. Yao, G. Scarabelli, and B. J. Grant. 2015. “Integrating Protein Structural Dynamics and Evolutionary Analysis with Bio3d.” BMC Bioinformatics 15: 399. https://doi.org/10.1186/s12859-014-0399-6.
The latest version of the package, full documentation and further vignettes (including detailed installation instructions) can be obtained from the main Bio3D website: http://thegrantlab.org/bio3d/↩︎
This vignette contains executable examples, see help(vignette)
for further details.↩︎