Barry Grant < http://thegrantlab.org/teaching/ >
2026-02-10 (09:12:46 on Tue, Feb 10)
By the end of this lab, you will be able to:
The PDB archive is the major repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. Understanding the shape of these molecules helps to understand how they work. This knowledge can be used to help deduce a structure’s role in human health and disease, and in drug development. The structures in the PDB range from tiny proteins and bits of DNA or RNA to complex molecular machines like the ribosome composed of many chains of protein and RNA.
In the first section of this lab we will interact with the main US-based PDB website (note that there are also sites in Europe and Japan).
Visit: http://www.rcsb.org/ and answer the following questions
NOTE: The “Analyze” > “PDB Statistics” > “by Experimental Method and Molecular Type” on the PDB home page should allow you to determine most of these answers. If the statistics page is taking too long to update due to server load then you can obtain the Feb 2026 numbers here.
Open RStudio and begin a new class project. If we have covered GitHub in a previous class then you should create this within your GitHub tracked directory/folder from that class. Make sure the “Create a git repository” option is NOT ticked. This is because we want to use the same git repository as we used last class and not start a new one. If you are not sure what this means, ask Barry now!
Next, open a new Quarto document (File > New File > Quarto Document…). As always, we will aim to have a rendered PDF report with working code by the end of this class!
Download this CSV file into your RStudio project and use it to answer the following questions. Note that this data was obtained from the RCSB PDB website on Feb 6th 2026 using their “Analyze” > “PDB Statistics” > “by Experimental Method and Molecular Type” tool.
Q1: What percentage of structures in the PDB are solved by X-Ray and Electron Microscopy?
Q2: What proportion of structures in the PDB are protein?
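One way to approach Q1 and Q2 in R is sketched below. It is a minimal sketch under stated assumptions: the numbers in `toy_csv` are invented placeholders (not real PDB counts), the column names are an assumption about how the exported CSV is laid out, and `to_count()` is a hypothetical helper name. The main point it illustrates is that counts exported with thousands separators (e.g. "6,000") are read as text and need their commas removed before any arithmetic.

```r
# Toy stand-in for the downloaded file: the numbers are invented placeholders,
# and the column names are an assumption about the export layout.
toy_csv <- 'Molecular Type,X-ray,EM,NMR,Total
Protein (only),"6,000","3,000","1,000","10,000"
Nucleic acid (only),"600","100","300","1,000"'

stats <- read.csv(text = toy_csv, check.names = FALSE)

# Counts written with thousands separators arrive as text, so strip the commas.
to_count <- function(x) as.numeric(gsub(",", "", x))

# Column totals across all molecular types.
totals <- sapply(stats[, -1], function(col) sum(to_count(col)))

# Q1-style answer: percentage of all entries solved by each method.
pct_by_method <- round(100 * totals[c("X-ray", "EM")] / totals["Total"], 2)

# Q2-style answer: percentage of all entries in the protein-only row.
pct_protein <- round(100 * to_count(stats$Total[1]) / totals["Total"], 2)

pct_by_method
pct_protein
```

To run this on the real data, replace `text = toy_csv` with the path to the CSV file you downloaded into your project, and check the actual column names with `colnames(stats)` first.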
Q3: Type HIV in the PDB website search box on the home page and determine how many HIV-1 protease structures are in the current PDB?
Now download the “PDB File” for the HIV-1 protease structure with the PDB identifier 1HSG. On the website you can “Display” the contents of this “PDB format” file.
Alternatively, you can examine the contents of your downloaded file in a suitable text editor or use the Terminal tab from within RStudio (or your favorite Terminal/Shell) and try the following command:
NOTE: When viewing the file stop when you come to the lines beginning with the word “ATOM”. We will discuss this ubiquitous PDB file format when you have got this far.
Protein Data Bank files (or PDB files) are the most common format for the distribution and storage of high-resolution biomolecular coordinate data. At their most basic, PDB coordinate files contain a list of all the atoms of one or more molecular structures. Each atom position is defined by its x, y, z coordinates in a conventional orthogonal coordinate system. Additional data, including listings of observed secondary structure elements, are also commonly (but not always) detailed in PDB files.
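Because the classic PDB format is fixed-width, each field of an ATOM record lives in a set range of character columns. The base R sketch below illustrates this, assuming the standard column layout (for example x in columns 31-38, y in 39-46 and z in 47-54). The record is built with `sprintf()` so that the column widths are exact; its values are for illustration only, and `field()` is a hypothetical helper name.

```r
# Build one illustrative ATOM record with exact fixed-width columns.
# The values are for demonstration only, not read from a downloaded file.
atom_line <- sprintf("%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f",
                     "ATOM", 1L, " N", "", "PRO", "A", 1L, "",
                     29.361, 39.686, 5.862, 1.00, 38.10)

# Pull a field out by its start and end column, dropping padding spaces.
field <- function(line, from, to) trimws(substr(line, from, to))

atom <- list(
  record  = field(atom_line,  1,  6),
  serial  = as.integer(field(atom_line,  7, 11)),
  name    = field(atom_line, 13, 16),
  resname = field(atom_line, 18, 20),
  chain   = field(atom_line, 22, 22),
  resno   = as.integer(field(atom_line, 23, 26)),
  x       = as.numeric(field(atom_line, 31, 38)),
  y       = as.numeric(field(atom_line, 39, 46)),
  z       = as.numeric(field(atom_line, 47, 54))
)
str(atom)
```

In practice you would not parse these records by hand; dedicated readers take care of the column layout for you.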
Molecular graphics programs such as Mol*, VMD, PyMol and Chimera take these files and plot them in 3D with the ability to make simplified and stylized representations such as the one shown below:
Figure 1. HIV-1 protease structure (PDB code: 1HSG) in complex with the small molecule indinavir.
The HIV-1 protease is an enzyme that is vital for the replication of HIV. It cleaves newly formed polypeptide chains at appropriate locations so that they form functional proteins. Hence, drugs that target this protein could be vital for suppressing viral replication. A handful of drugs - called HIV-1 protease inhibitors (saquinavir, ritonavir, indinavir, nelfinavir, etc.) - are currently commercially available that inhibit the function of this protein, by binding in the catalytic site that typically binds the polypeptide.
In this section we will use the 2Å resolution X-ray crystal structure of HIV-1 protease with a bound drug molecule indinavir (PDB ID: 1HSG). We will use the Mol* molecular viewer to visually inspect the protein, the binding site and the drug molecule. After exploring features of the complex we will move on to perform bioinformatics analysis of single and multiple crystallographic structures to explore the conformational dynamics and flexibility of the protein, which are important for its function and for consideration during drug design.
Mol* (pronounced “molstar”) is a new web-based molecular viewer that is rapidly gaining in popularity and utility. At the time of writing it is still a long way from having the full feature set of stand-alone molecular viewer programs like VMD, PyMol or Chimera. However, it is gaining new features all the time and does not require any download or complicated installation.
You can use Mol* directly at the PDB website (as well as UniProt and elsewhere). However, for the latest and greatest version we will visit the Mol* homepage at: https://molstar.org/viewer/.
To load a structure from the PDB we can enter the PDB code and click “Apply” in the “Download Structure” menu (see figure below)
Once loaded, the sidebar should change to the so-called hierarchical “State Tree” menu. Of particular note, there are entries for Polymer, Ligand and Water. You can turn the display of any of these entries OFF/ON by clicking on the eye icon or delete them by clicking the “trash” bin icon (but we will not do that just yet). We can turn this left-side control panel off to save screen space, especially as we will not need it again until we come to close the molecule or read a new molecule later.
Key-point: You can access and change all visual representations on the opposite right side control panel under the “Components” drop-down menu (see figure below). Try toggling ON/OFF the display of Ligand and Water with the “eye” icon.
Let’s temporarily toggle OFF/ON the display of water molecules and change the display representation of the Ligand to Spacefill (a.k.a. VdW spheres). To do this:
Let’s also change the protein “Polymer” > “Set Coloring” > “Residue Property” > “Secondary Structure”.
Key-point: All these expanding drop-down menus can quickly become overwhelming. I find that closing them by clicking the 3 dots again can help keep things tidy and avoid menu items disappearing off small screens.
Once you are happy with your display you can save a high-resolution image to your computer for including in your Quarto document. To do this, find the “iris-like” screenshot icon on the right side of the display region, select your resolution and click download (see figure below).
To help highlight important amino acid residues that interact with the ligand you can click on the ligand itself. This will cause a new “Focus Surroundings (5A)” display component to appear. Mousing over this will highlight the corresponding amino acids in the Sequence display panel.
Note: Zoom in and rotate to examine these ligand interactions. Of these positions Asp 25 (D25) in both chains is critical for protease activity. Can you find this amino acid in both chains? Note the residue information displayed in the bottom right of the viewing window as you mouse over different amino acids.
Most viewers will find that displaying all ligand surrounding amino acids is too busy for a single display. Turn off the display of these positions by clicking the “eye” icon for the “Focus Surroundings (5A)” Components entry in the right side control panel.
Now we can highlight a subset of the most important positions:
Note that a new “Custom Selection” component has appeared in the right side control panel. This will contain your two D25 positions. You can again delete the “Focus Surroundings (5A)” and Focus Target Components to clean up the display.
At this point you should consider saving an image as discussed above.
Toggle on the display of all water molecules again.
Q4: Water molecules normally have 3 atoms. Why do we see just one atom per water molecule in this structure?
Q5: There is a critical “conserved” water molecule in the binding site. Can you identify this water molecule? What residue number does this water molecule have?
Now you should be able to produce an image similar or even superior to Figure 2 and save it to an image file.
Q6: Generate and save a figure clearly showing the two distinct chains of HIV-protease along with the ligand. You might also consider showing the catalytic residues ASP 25 in each chain and the critical water (we recommend “Ball & Stick” for these side-chains). Add this figure to your Quarto document.
Discussion Topic: Can you think of a way in which indinavir, or even larger ligands and substrates, could enter the binding site?
Q7: [Optional] As you have hopefully observed HIV protease is a homodimer (i.e. it is composed of two identical chains). With the aid of the graphic display can you identify secondary structure elements that are likely to only form in the dimer rather than the monomer?
Bio3D is an R package for structural bioinformatics. Features include the ability to read, write and analyze biomolecular structure, sequence and dynamic trajectory data.
In your existing Quarto document load the Bio3D package by typing in a new code chunk:
Side-Note: If you see an error message reported then you will first need to install the package with the command:
install.packages("bio3d")
in your R Console (i.e. don’t put this in your Quarto document or it will be re-installed every time you render your document). This is only required once, whereas the library(bio3d) command is required at the start of every new R session where you want to use Bio3D.
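The code chunk for loading the package could look like the sketch below. The `requireNamespace()` guard is an addition of this sketch rather than something the lab requires: it stops with a pointer to the one-off install step instead of installing the package from inside the document.

```r
# Load Bio3D; stop with an actionable message if it is not installed yet.
# There is deliberately no install.packages() call here, so rendering the
# document never triggers a re-install.
if (!requireNamespace("bio3d", quietly = TRUE)) {
  stop('Bio3D is not installed: run install.packages("bio3d") once in the R Console.')
}
library(bio3d)
```

A plain `library(bio3d)` on its own line works just as well once the package is installed.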
To read a single PDB file with Bio3D we can use the
read.pdb() function. The minimal input required for this
function is a specification of the file to be read. This can be either
the file name of a local file on disc, or the RCSB PDB identifier of a
file to read directly from the on-line PDB repository. For example to
read and inspect the on-line file with PDB ID 1HSG:
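A minimal sketch of that call follows. The object name `pdb` matches the one used in the text that follows, and the call itself matches the `Call:` line in the printed summary. It assumes Bio3D is installed and that you have a working internet connection.

```r
library(bio3d)

# Passing a four-character PDB ID rather than a local file name makes
# read.pdb() fetch the entry from the online PDB archive.
pdb <- read.pdb("1hsg")
```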
## Note: Accessing on-line PDB file
To get a quick summary of the contents of the pdb object you just
created you can issue the command print(pdb) or simply type
pdb (which is equivalent in this case):
##
## Call: read.pdb(file = "1hsg")
##
## Total Models#: 1
## Total Atoms#: 1686, XYZs#: 5058 Chains#: 2 (values: A B)
##
## Protein Atoms#: 1514 (residues/Calpha atoms#: 198)
## Nucleic acid Atoms#: 0 (residues/phosphate atoms#: 0)
##
## Non-protein/nucleic Atoms#: 172 (residues: 128)
## Non-protein/nucleic resid values: [ HOH (127), MK1 (1) ]
##
## Protein sequence:
## PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYD
## QILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNFPQITLWQRPLVTIKIGGQLKE
## ALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTP
## VNIIGRNLLTQIGCTLNF
##
## + attr: atom, xyz, seqres, helix, sheet,
## calpha, remark, call
Q7: How many amino acid residues are there in this pdb object?
Q8: Name one of the two non-protein residues?
Q9: How many protein chains are in this structure?
Note that the attributes (+ attr:) of this object are
listed on the last couple of lines. To find the attributes of any such
object you can use:
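The listing that follows has the shape of output from base R's `attributes()` function, so a sketch of the command, assuming `pdb` was read as described above, is:

```r
library(bio3d)
pdb <- read.pdb("1hsg")  # needs internet access

# attributes() is base R: for a list-like object it returns the component
# names ($names) and the class vector ($class).
attributes(pdb)
```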
## $names
## [1] "atom" "xyz" "seqres" "helix" "sheet" "calpha" "remark" "call"
##
## $class
## [1] "pdb" "sse"
To access these individual attributes we use the
dollar-attribute name convention that is common with R list
objects. For example, to access the atom attribute or
component use pdb$atom:
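As a hedged illustration of this access pattern, the sketch below prints the first few rows of the atom component and then uses it like any other data frame. The second command is an extra example rather than part of the original lab; it relies on `pdb$calpha` being a logical vector that marks the C-alpha atom rows.

```r
library(bio3d)
pdb <- read.pdb("1hsg")  # needs internet access

# pdb$atom is a data frame with one row per atom; head() shows the first six.
head(pdb$atom)

# Count residues per chain by tabulating the chain label of each C-alpha atom.
table(pdb$atom$chain[pdb$calpha])
```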
We can use the Bio3D partner package, bio3dview, to generate quick interactive molecular visualizations. To install the development version of bio3dview from GitHub, along with the related NGLVieweR package use:
install.packages("remotes")
remotes::install_github("bioboot/bio3dview")
install.packages("NGLVieweR")
Then load the respective packages and generate a quick NGL (WebGL-based) structure overview of a bio3d pdb class object with a number of simple defaults. The returned NGLVieweR object can be extended further to build custom interactive visualizations:
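A minimal sketch of this step is given below. Note the assumption: the function name `view.pdb()` is inferred by analogy with the `view.nma()` function that this lab names for the same package, so check the bio3dview help pages if it is not found.

```r
library(bio3d)
library(bio3dview)   # development package installed from GitHub as shown above
library(NGLVieweR)

pdb <- read.pdb("1hsg")  # needs internet access

# Assumption: bio3dview provides view.pdb(), which takes a bio3d pdb object
# and returns an interactive NGLVieweR widget with default styling.
view.pdb(pdb)
```

The widget is interactive in RStudio and in HTML output, so it will not display in a rendered PDF report.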
You can also customize the display in many ways with minimal code. For example, let’s custom color the chains and highlight some key residues as spacefill/vdw:
Let’s read a new PDB structure of Adenylate Kinase and perform Normal mode analysis.
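The printed summary that follows reports the call as `read.pdb(file = "6s36")`, so the read step can be sketched as below. The object name `adk` is an assumption chosen to match the "adk_m7.pdb" file name used later in this section.

```r
library(bio3d)

# 6s36 is the PDB ID shown in the printed summary (needs internet access).
adk <- read.pdb("6s36")

# Typing the object name prints the summary.
adk
```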
## Note: Accessing on-line PDB file
## PDB has ALT records, taking A only, rm.alt=TRUE
##
## Call: read.pdb(file = "6s36")
##
## Total Models#: 1
## Total Atoms#: 1898, XYZs#: 5694 Chains#: 1 (values: A)
##
## Protein Atoms#: 1654 (residues/Calpha atoms#: 214)
## Nucleic acid Atoms#: 0 (residues/phosphate atoms#: 0)
##
## Non-protein/nucleic Atoms#: 244 (residues: 244)
## Non-protein/nucleic resid values: [ CL (3), HOH (238), MG (2), NA (1) ]
##
## Protein sequence:
## MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAKDIMDAGKLVT
## DELVIALVKERIAQEDCRNGFLLDGFPRTIPQADAMKEAGINVDYVLEFDVPDELIVDKI
## VGRRVHAPSGRVYHVKFNPPKVEGKDDVTGEELTTRKDDQEETVRKRLVEYHQMTAPLIG
## YYSKEAEAGNTKYAKVDGTKPVAEVRADLEKILG
##
## + attr: atom, xyz, seqres, helix, sheet,
## calpha, remark, call
Normal mode analysis (NMA) is a structural bioinformatics method to predict protein flexibility and potential functional motions (a.k.a. conformational changes).
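A sketch of running NMA with Bio3D's `nma()` function follows. It assumes the adenylate kinase structure is stored in `adk` as above; the result name `m` is a placeholder.

```r
library(bio3d)
adk <- read.pdb("6s36")  # needs internet access

# nma() builds an elastic network model from the structure and computes
# its normal modes.
m <- nma(adk)

# The default plot summarizes eigenvalues, mode frequencies and the
# predicted per-residue fluctuations.
plot(m)
```

Peaks in the fluctuation panel point to the regions predicted to be most flexible.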
## Building Hessian... Done in 0.01 seconds.
## Diagonalizing Hessian... Done in 0.178 seconds.
To view a “movie” of these predicted motions we can generate a
molecular “trajectory” with the mktrj() function.
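A minimal sketch of this step is shown below, assuming the NMA result is stored in `m`. The output file name follows the "adk_m7.pdb" name used in the next paragraph. Mode 7 is chosen because it is the first non-trivial mode: modes 1 to 6 correspond to rigid-body translation and rotation.

```r
library(bio3d)
adk <- read.pdb("6s36")  # needs internet access
m <- nma(adk)

# Write a multi-model PDB file that animates the structure along mode 7.
mktrj(m, mode = 7, file = "adk_m7.pdb")
```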
Now we can load the resulting “adk_m7.pdb” PDB into Mol* with the “Open Files” option on the right side control panel. Once loaded click the “play” button to see a movie (see image below). We will discuss how this method works at the end of this lab when we apply it across a large set of homologous structures.
Here is what the output movie looks like:
Alternatively, for a quicker display you can use the
view.nma() function from the bio3dview package mentioned
previously:
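A hedged sketch of that call is below. The function name comes from the text above, but passing the NMA result as the only argument is an assumption, so consult the function's help page if it asks for more.

```r
library(bio3d)
library(bio3dview)   # development package installed from GitHub

adk <- read.pdb("6s36")  # needs internet access
m <- nma(adk)

# Assumption: view.nma() accepts the nma result directly, as mktrj() does.
view.nma(m)
```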