Barry Grant < http://thegrantlab.org/teaching/ >
2022-02-16 (11:54:44 on Wed, Feb 16)

1: Introduction to the RCSB Protein Data Bank (PDB)

The PDB archive is the major repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. Understanding the shape of these molecules helps to understand how they work. This knowledge can be used to help deduce a structure’s role in human health and disease, and in drug development. The structures in the PDB range from tiny proteins and bits of DNA or RNA to complex molecular machines like the ribosome composed of many chains of protein and RNA.

In the first section of this lab we will interact with the main US based PDB website (note there are also sites in Europe and Japan).

Visit: http://www.rcsb.org/ and answer the following questions

NOTE: The “Analyze” > “PDB Statistics” > “by Experimental Method and Molecular Type” on the PDB home page should allow you to determine most of these answers.

PDB statistics

Open RStudio and begin a new class09 project. If we have covered GitHub in a previous class then you should create this within your GitHub tacked directory/folder from that class. Make sure “Create a git repository” option is NOT ticked. This is because we want to use the same git repository as we used last day and not start a new one - if you are not sure what this means ask Barry now!

Next, open a new R Markdown document (File > New File > R Markdown…). Chose “From Template” and select “GitHub Document”. Note we will aim to have a rendered GitHub and PDF document with working code on GitHub by the end of this class!

Download a CSV file from the PDB site (accessible from “Analyze” > “PDB Statistics” > “by Experimental Method and Molecular Type”. Move this CSV file into your RStudio project and use it to answer the following questions:

Q1: What percentage of structures in the PDB are solved by X-Ray and Electron Microscopy.
Q2: What proportion of structures in the PDB are protein?
Q3: Type HIV in the PDB website search box on the home page and determine how many HIV-1 protease structures are in the current PDB?

The PDB format

Now download the “PDB File” for the HIV-1 protease structure with the PDB identifier 1HSG. On the website you can “Display” the contents of this “PDB format” file.

Alternatively, you can examine the contents of your downloaded file in a suitable text editor or use the Terminal tab from within RStudio (or your favorite Terminal/Shell) and try the following command:

less ~/Downloads/1hsg.pdb         ## (use ‘q’ to quit)

NOTE: When viewing the file stop when you come the lines beginning with the word “ATOM”. We will discuss this ubiquitous PDB file format when you have got this far.

Protein Data Bank files (or PDB files) are the most common format for the distribution and storage of high-resolution biomolecular coordinate data. At their most basic, PDB coordinate files contain a list of all the atoms of one or more molecular structures. Each atom position is defined by its x, y, z coordinates in a conventional orthogonal coordinate system. Additional data, including listings of observed secondary structure elements, are also commonly (but not always) detailed in PDB files.

Molecular graphics programs such as VMD, PyMol and Chimera take these files and plot them in 3D with the ability to make simplified and stylized representations such as the one shown below:

Figure 1. HIV-1 protease structure (PDB code: 1HSG) in complex with the small molecule indinavir.

2. Visualizing the HIV-1 protease structure

The HIV-1 protease is an enzyme that is vital for the replication of HIV. It cleaves newly formed polypeptide chains at appropriate locations so that they form functional proteins. Hence, drugs that target this protein could be vital for suppressing viral replication. A handful of drugs - called HIV-1 protease inhibitors (saquinavir, ritonavir, indinavir, nelfinavir, etc.) - are currently commercially available that inhibit the function of this protein, by binding in the catalytic site that typically binds the polypeptide.

In this section we will use the 2Å resolution X-ray crystal structure of HIV-1 protease with a bound drug molecule indinavir (PDB ID: 1HSG). We will use the VMD molecular viewer to visually inspect the protein, the binding site and the drug molecule. After exploring features of the complex we will move on to computationally dock a couple of drug molecules into the binding site of HIV-1 protease to see how well computational docking can reproduce the crystallographically observed binding pose. If time permits, we will also explore the conformational dynamics and flexibility of the protein - important for it’s function and for considering during drug design.

NOTE: If you have not already done so please download and install VMD from here. You may need to register for a free account with a username and password of your choice. Remember these as you will be asked for them twice. If you are on a mac chose either the “M1” version (if you on a newer mac with an M1 cpu type) or “x86” version (if you on a older Intel chip based mac).

On a mac: Once you download the “VMD*.dmg” file, double click to open the disc image and then drag the resulting VMD icon to your Applications folder in a new Finder window. Do not attempt to open VMD from the disc image as this will not work well. Instead, open VMD from your Applications folder by Control-clicking on the icon and then select open.

Getting to know VMD

Open VMD and load 1hsg.pdb by using the VMD Main window and going to “File” -> “New Molecule” and then from the new window that appears click “Browse” and select your downloaded 1hsg.pdb file. Then click “Load”.

You should now see the protein structure displayed as lines and water molecules as little red dots. Use the mouse to zoom and rotate. Once you have the hang of rotation we will start exploring different “Graphical Representations”.

VMD can display molecules in various ways by choosing different options in the Graphical Representations window shown in the figure. You can access this window by clicking Graphics > Representations from the small VMD Main window.

Figure 2. The VMD graphical Representation window. Note that the Drawing Method (labeled 1 in the figure) defines which graphical representation is used and (2) the Selected Atoms determines which part of the molecule is drawn, and (3) defines the color it is displayed with. You are encouraged to explore different drawing styles (i.e. Drawing Methods - labeled 1) including Licorice, Tube and NewCartoon (see below for examples A-C). Licorice, Tube and NewCartoon Drawing Methods in VMD

Figure 3. VMD Drawing Methods: Licorice, Tube and NewCartoon

Also try different selections by entering text in the (Selected Atoms box - labeled 2 in the Graphical Representations Figure 3. above). Some examples to try include:

       chain A and backbone
       resname ASP
       within 5 of resname MK1

Using Atom Selections

Now type “protein” in the Selected Atoms text box (labeled 2 in Figure) and show the protein using the Cartoon representation and color by chain (see label 3 in Figure)

Lets add a new representation by clicking the “Create Rep” (circled in Figure) and using the selection text “not protein and not water”

Add more representations (by clicking the “Create Rep” button) and hiding (by double clicking) or deleting previous ones (with the “Delete Rep” button) to explore different representations for both the ligand and the protein.

NOTE: you can use the residue name of the ligand “resname MK1” to select just the ligand

Water molecules have the residue name HOH. Select and display all water molecules as red spheres. If you think the spheres are too big, how would you reduce their size?

Q4: Water molecules normally have 3 atoms. Why do we see just one atom per water molecule in this structure?

Q5: There is a conserved water molecule in the binding site. Can you identify this water molecule? What residue number does this water molecule have (see note below)?

NOTE: From the VMD Main window click Mouse > Label > Atoms and then click on the water in question to display its residue number. A short cut is to press the #1 key when your mouse is active in the OpenGL window.

Now you should be able to produce an image similar or even superior to Figure 2 and save it to an image file on disk with VMD Main window, File > Render > Start Rendering.

NOTE: You can chose different rendering engines including Tachyon (internal), which is commonly used for publication quality images.

Optional: Generate and save a figure clearly showing the two distinct chains of HIV-protease along with the ligand. You might also consider showing the catalytic residues ASP 25 in each chain (we recommend Licorice for these side-chains). Upload this figure to Piazza for some extra credit.

Discussion Topic: Can you think of a way in which indinavir, or even larger ligands and substrates, could enter the binding site?

Sequence Viewer Extension [OPTIONAL]

When dealing with a protein for the first time, it is very useful to be able to find and display different amino acids quickly. The sequence viewer extension allows viewing of the protein sequence, as well as to easily pick and display one or more residues of interest.

To launch the Sequence Viewer click VMD Main window, Extensions > Analysis > Sequence Viewer. The different color scales beside the sequence correspond to the B-factor and Secondary structure type (the major ones being Extended (beta) in yellow and Helix in purple).

Q6: As you have hopefully observed HIV protease is a homodimer (i.e. it is composed of two identical chains). With the aid of the graphic display and the sequence viewer extension can you identify secondary structure elements that are likely to only form in the dimer rather than the monomer?

3. Introduction to Bio3D in R

Bio3D is an R package for structural bioinformatics. Features include the ability to read, write and analyze biomolecular structure, sequence and dynamic trajectory data.

In your existing Rmarkdown document load the Bio3D package by typing in a new code chunk:

library(bio3d)

Side-Note: If you see an error message reported then you will first need to install the package with the command: install.packages("bio3d") in your R Console (i.e. don’t put this in your Rmarkdown document or it will be re-installed every time you knit/render your document). This is only required once whereas the library(bio3d) command is required at the start of every new R session where you want to use Bio3D.

Reading PDB file data into R

To read a single PDB file with Bio3D we can use the read.pdb() function. The minimal input required for this function is a specification of the file to be read. This can be either the file name of a local file on disc, or the RCSB PDB identifier of a file to read directly from the on-line PDB repository. For example to read and inspect the on-line file with PDB ID 1HSG:

pdb <- read.pdb("1hsg")

##   Note: Accessing on-line PDB file

To get a quick summary of the contents of the pdb object you just created you can issue the command print(pdb) or simply type pdb (which is equivalent in this case):

pdb

## 
##  Call:  read.pdb(file = "1hsg")
## 
##    Total Models#: 1
##      Total Atoms#: 1686,  XYZs#: 5058  Chains#: 2  (values: A B)
## 
##      Protein Atoms#: 1514  (residues/Calpha atoms#: 198)
##      Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)
## 
##      Non-protein/nucleic Atoms#: 172  (residues: 128)
##      Non-protein/nucleic resid values: [ HOH (127), MK1 (1) ]
## 
##    Protein sequence:
##       PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYD
##       QILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNFPQITLWQRPLVTIKIGGQLKE
##       ALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTP
##       VNIIGRNLLTQIGCTLNF
## 
## + attr: atom, xyz, seqres, helix, sheet,
##         calpha, remark, call

Q7: How many amino acid residues are there in this pdb object?
Q8: Name one of the two non-protein residues?
Q9: How many protein chains are in this structure?

Note that the attributes (+ attr:) of this object are listed on the last couple of lines. To find the attributes of any such object you can use:

attributes(pdb)

## $names
## [1] "atom"   "xyz"    "seqres" "helix"  "sheet"  "calpha" "remark" "call"  
## 
## $class
## [1] "pdb" "sse"

To access these individual attributes we use the dollar-attribute name convention that is common with R list objects. For example, to access the atom attribute or component use pdb$atom:

head(pdb$atom)

4. Comparative structure analysis of Adenylate Kinase

The goal of this section is to perform principal component analysis (PCA) on the complete collection of Adenylate kinase structures in the protein data-bank (PDB).

Adenylate kinase (often called simply Adk) is a ubiquitous enzyme that functions to maintain the equilibrium between cytoplasmic nucleotides essential for many cellular processes. Adk operates by catalyzing the reversible transfer of a phosphoryl group from ATP to AMP. This reaction requires a rate limiting conformational transition (i.e. change in shape). Here we analyze all currently available Adk structures in the PDB to reveal detailed features and mechanistic principles of these essential shape changing transitions.

Figure 5. Adenylate kinase structure (PDB code: 1AKE) with a bound inhibitor molecule.

The bio3d package pca() function provides a convenient interface for performing PCA of biomolecular structure data. As we have discussed in previous classes, PCA is a statistical approach used to transform large data-sets down to a few important components that usefully describe the directions where there is most variance. In terms of protein structures PCA can be used to capture major structural variations within a set of structures (a.k.a. structure ensemble). This can make interpreting major conformational states (such as ‘active’ and ‘inactive’ or ‘ligand bound’ and ‘un-bound’ states) and structural mechanisms for activation or regulation more clear.

Overview

Starting from only one Adk PDB identifier (PDB ID: 1AKE) we will search the entire PDB for related structures using BLAST, fetch, align and superpose the identified structures, perform PCA and finally calculate the normal modes of each individual structure in order to probe for potential differences in structural flexibility.

Setup

We will begin by first installing the packages we need for today’s session.

# Install packages in the R console not your Rmd

install.packages("bio3d")
install.packages("ggplot2")
install.packages("ggrepel")
install.packages("devtools")
install.packages("BiocManager")

BiocManager::install("msa")
devtools::install_bitbucket("Grantlab/bio3d-view")

Q10. Which of the packages above is found only on BioConductor and not CRAN?
Q11. Which of the above packages is not found on BioConductor or CRAN?:
Q12. True or False? Functions from the devtools package can be used to install packages from GitHub and BitBucket?

The install.packages() function is used to install packages from the main CRAN repository for R packages. BioConductor is a separate repository of R packages focused on high-throughput biomolecular assays and biomolecular data. Packages from BioConductor can be installed using the BiocManager::install() function. Finally, R packages found on GitHub or BitBucket can be installed using devtools::install_github() and devtools::install_bitbucket() functions.

Search and retrieve ADK structures

Below we perform a blast search of the PDB database to identify related structures to our query Adenylate kinase (ADK) sequence. In this particular example we use function get.seq() to fetch the query sequence for chain A of the PDB ID 1AKE and use this as input to blast.pdb(). Note that get.seq() would also allow the corresponding UniProt identifier.

library(bio3d)
aa <- get.seq("1ake_A")

## Fetching... Please wait. Done.

aa

##              1        .         .         .         .         .         60 
## pdb|1AKE|A   MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAKDIMDAGKLVT
##              1        .         .         .         .         .         60 
## 
##             61        .         .         .         .         .         120 
## pdb|1AKE|A   DELVIALVKERIAQEDCRNGFLLDGFPRTIPQADAMKEAGINVDYVLEFDVPDELIVDRI
##             61        .         .         .         .         .         120 
## 
##            121        .         .         .         .         .         180 
## pdb|1AKE|A   VGRRVHAPSGRVYHVKFNPPKVEGKDDVTGEELTTRKDDQEETVRKRLVEYHQMTAPLIG
##            121        .         .         .         .         .         180 
## 
##            181        .         .         .   214 
## pdb|1AKE|A   YYSKEAEAGNTKYAKVDGTKPVAEVRADLEKILG
##            181        .         .         .   214 
## 
## Call:
##   read.fasta(file = outfile)
## 
## Class:
##   fasta
## 
## Alignment dimensions:
##   1 sequence rows; 214 position columns (214 non-gap, 0 gap) 
## 
## + attr: id, ali, call

Q13. How many amino acids are in this sequence, i.e. how long is this sequence?

Now we can use this sequence as a query to BLAST search the PDB to find similar sequences and structures.

# Blast or hmmer search 
#b <- blast.pdb(aa)

Side-note: Due to the number of students in this class session this command, which uses online NCBI blast service, may time-out. If this happens please jump ahead to the next Side-note below to skip running the actual blast search.

The function plot.blast() facilitates the visualization and filtering of the Blast results. It will attempt to set a seed position to the point of largest drop-off in normalized scores (i.e. the biggest jump in E-values). In this particular case we specify a cutoff (after initial plotting) of to include only the relevant E.coli structures:

# Plot a summary of search results
#hits <- plot(b)

Figure 6: Blast results. Visualize and filter blast results through function plot.blast(). Here we proceed with only the top scoring hits (black).

# List out some 'top hits'
#head(hits$pdb.id)

Side-note: If blast did not return results (likely due to the large number of simultaneous requests from the class) you can use the following vector of PDB IDs

hits <- NULL
hits$pdb.id <- c('1AKE_A','6S36_A','6RZE_A','3HPR_A','1E4V_A','5EJE_A','1E4Y_A','3X2S_A','6HAP_A','6HAM_A','4K46_A','3GMT_A','4PZL_A')

The Blast search and subsequent filtering identified a total of 13 related PDB structures to our query sequence. The PDB identifiers of this collection are accessible through the $pdb.id attribute to the hits object (i.e. hits$pdb.id). Note that adjusting the cutoff argument (to plot.blast()) will result in a decrease or increase of hits.

We can now use function get.pdb() and pdbslit() to fetch and parse the identified structures.

# Download releated PDB files
files <- get.pdb(hits$pdb.id, path="pdbs", split=TRUE, gzip=TRUE)

Align and superpose structures

Next we will use the pdbaln() function to align and also optionally fit (i.e. superpose) the identified PDB structures.

# Align releated PDBs
pdbs <- pdbaln(files, fit = TRUE)#, exefile="msa")

# Vector containing PDB codes for figure axis
ids <- basename.pdb(pdbs$id)

# Draw schematic alignment
plot(pdbs, labels=ids)

Figure 7: Schematic representation of alignment. Grey regions depict aligned residues, while white depict gap regions. The red bar at the top depict sequence conservation.

Optional: Viewing our superposed structures

We can view our superposed results with the new bio3d.view view() function:

library(bio3d.view)
library(rgl)

view.pdbs(pdbs)

Figure 8: 3D view of superposed ADK structures available in the PDB.

Annotate collected PDB structures [Optional]

The function pdb.annotate() provides a convenient way of annotating the PDB files we have collected. Below we use the function to annotate each structure to its source species. This will come in handy when annotating plots later on:

anno <- pdb.annotate(ids)
unique(anno$source)

## [1] "Escherichia coli"                                
## [2] "Escherichia coli K-12"                           
## [3] "Escherichia coli O139:H28 str. E24377A"          
## [4] "Escherichia coli str. K-12 substr. MDS42"        
## [5] "Photobacterium profundum"                        
## [6] "Vibrio cholerae O1 biovar El Tor str. N16961"    
## [7] "Burkholderia pseudomallei 1710b"                 
## [8] "Francisella tularensis subsp. tularensis SCHU S4"

We can view all available annotation data:

anno

Principal component analysis

Function pca() provides principal component analysis (PCA) of the structure data. PCA is a statistical approach used to transform a data set down to a few important components that describe the directions where there is most variance. In terms of protein structures PCA is used to capture major structural variations within an ensemble of structures.

PCA can be performed on the structural ensemble (stored in the pdbs object) with the function pca.xyz(), or more simply pca().

# Perform PCA
pc.xray <- pca(pdbs)
plot(pc.xray)

Figure 9: Results of PCA on Adenylate kinase X-ray structures. Each dot represents one PDB structure.

Function rmsd() will calculate all pairwise RMSD values of the structural ensemble. This facilitates clustering analysis based on the pairwise structural deviation:

# Calculate RMSD
rd <- rmsd(pdbs)

# Structure-based clustering
hc.rd <- hclust(dist(rd))
grps.rd <- cutree(hc.rd, k=3)

plot(pc.xray, 1:2, col="grey50", bg=grps.rd, pch=21, cex=1)

Figure 10: Projection of Adenylate kinase X-ray structures. Each dot represents one PDB structure.

The plot shows a conformer plot – a low-dimensional representation of the conformational variability within the ensemble of PDB structures. The plot is obtained by projecting the individual structures onto two selected PCs (e.g. PC-1 and PC-2). These projections display the inter-conformer relationship in terms of the conformational differences described by the selected PCs.

5. Optional further visualization

To visualize the major structural variations in the ensemble the function mktrj() can be used to generate a trajectory PDB file by interpolating along a give PC (eigenvector):

# Visualize first principal component
pc1 <- mktrj(pc.xray, pc=1, file="pc_1.pdb")

You can open this file, pc_1.pdb, in VMD, chose the “Drawing Method” Tube and “Coloring Method” Index. Then click the play button shown below to animate the structure and visualize the major structural variations along PC1.

Figure 11: Visualization of PC-1 in VMD. Trajectory PDB file is generated using mktrj().

We can also view our results with the new bio3d.view view() function:

view.xyz(pc1)

## Potential all C-alpha atom structure(s) detected: Using calpha.connectivity()

Figure 12: Visualization of PC-1 trajectory generated using mktrj().

We could set the color to highlight the most variable regions like so:

view.xyz(pc1, col=vec2color( rmsf(pc1) ))

## Potential all C-alpha atom structure(s) detected: Using calpha.connectivity()

We can also plot our main PCA results with ggplot:

#Plotting results with ggplot2
library(ggplot2)
library(ggrepel)

df <- data.frame(PC1=pc.xray$z[,1], 
                 PC2=pc.xray$z[,2], 
                 col=as.factor(grps.rd),
                 ids=ids)

p <- ggplot(df) + 
  aes(PC1, PC2, col=col, label=ids) +
  geom_point(size=2) +
  geom_text_repel(max.overlaps = 20) +
  theme(legend.position = "none")
p

6. Normal mode analysis [optional]

Function nma() provides normal mode analysis (NMA) on both single structures (if given a singe PDB input object) or the complete structure ensemble (if provided with a PDBS input object). This facilitates characterizing and comparing flexibility profiles of related protein structures.

# NMA of all structures
modes <- nma(pdbs)

plot(modes, pdbs, col=grps.rd)

Q14. What do you note about this plot? Are the black and colored lines similar or different? Where do you think they differ most and why?

Collectively these results indicate the existence of two major distinct conformational states for Adk. These differ by a collective low frequency displacement of two nucleotide-binding site regions that display distinct flexibilities upon nucleotide binding.

Important-Note: Remember to save your R markdown script document and Knit to generate a GitHub format .md report. Then stage, commit and push both these documents to GitHub by following the steps outlined in the last class - ask Barry if you are unsure of this process.

Here we use the sessionInfo() function to report on our R systems setup at the time of document execution.

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggrepel_0.9.1         ggplot2_3.3.5         rgl_0.108.3          
## [4] bio3d.view_0.1.0.9000 bio3d_2.4-2.9000      NGLVieweR_1.3.1      
## [7] labsheet_0.1.2       
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.1  xfun_0.29         bslib_0.3.1       purrr_0.3.4      
##  [5] generics_0.1.1    colorspace_2.0-2  vctrs_0.3.8       htmltools_0.5.2  
##  [9] yaml_2.2.1        utf8_1.2.2        rlang_0.4.12      jquerylib_0.1.4  
## [13] later_1.3.0       pillar_1.6.4      withr_2.4.3       DBI_1.1.1        
## [17] glue_1.5.1        lifecycle_1.0.1   stringr_1.4.0     munsell_0.5.0    
## [21] gtable_0.3.0      htmlwidgets_1.5.4 evaluate_0.14     labeling_0.4.2   
## [25] knitr_1.37        fastmap_1.1.0     extrafont_0.17    httpuv_1.6.4     
## [29] parallel_4.1.2    curl_4.3.2        fansi_0.5.0       Rttf2pt1_1.3.9   
## [33] highr_0.9         Rcpp_1.0.7        xtable_1.8-4      promises_1.2.0.1 
## [37] scales_1.1.1      jsonlite_1.7.2    farver_2.1.0      mime_0.12        
## [41] png_0.1-7         digest_0.6.29     stringi_1.7.6     dplyr_1.0.7      
## [45] shiny_1.7.1       grid_4.1.2        tools_4.1.2       magrittr_2.0.1   
## [49] sass_0.4.0        tibble_3.1.6      crayon_1.4.2      extrafontdb_1.0  
## [53] pkgconfig_2.0.3   ellipsis_0.3.2    assertthat_0.2.1  rmarkdown_2.11   
## [57] httr_1.4.2        R6_2.5.1          compiler_4.1.2

Class 9

Structural Bioinformatics (Pt. 1)