Schedule

Course delivery for Winter 2022 will be a mix of online and in-person. Our first 3 weeks will be online only. From week 4 onward we will hold in-person hands-on sessions twice a week on Wed/Fri 1:00-4:00 pm in TATA 2501 (Map). All online course components will be made available on a weekly basis via this public-facing website.

Clicking on the topics below will take you to corresponding video lectures, hands-on “lab session” supporting walk-through screencasts, required reading material and homework assignments.


# | Date | Topics for Winter 2022
0 | - | Getting Oriented
Course introduction, Learning goals & expectations, Meet the instructional team. Set up your computer with required software.
1 | Week 1 01/05/22 | Welcome to Bioinformatics
Biology is an information science, History of Bioinformatics, Types of data, Application areas and introduction to upcoming course segments, Hands on with major Bioinformatics databases and key online NCBI and EBI resources.
2 | Week 2 01/12/22 | Sequence alignment fundamentals, algorithms and applications
Homology, Sequence similarity, Local and global alignment, classic Needleman-Wunsch, Smith-Waterman and BLAST heuristic approaches, Hands on with dot plots, Needleman-Wunsch and BLAST algorithms highlighting their utility and limitations.
3 | Week 3 01/19/22 | Project: Find a gene project assignment
(Part 1) Principles of database searching, due in 2 weeks. (Part 2) Sequence analysis, structure analysis and general data analysis with R, due at the end of the quarter.
* | Week 3 01/19/22 | Optional: Advanced sequence alignment and database searching
Detecting remote sequence similarity, Database searching beyond BLAST, Substitution matrices, Using PSI-BLAST, Profiles and HMMs, Protein structure comparisons as a gold standard.
4 | Week 4 01/26/22 | Bioinformatics data analysis with R
Why do we use R for bioinformatics? R language basics and the RStudio IDE, Major R data structures and functions, Using R interactively from the RStudio console. Introducing Rmarkdown documents.
5 | Wed 02/02/22 | In-Person: Data exploration and visualization in R
The exploratory data analysis mindset, Data visualization best practices, Simple base graphics (including scatterplots, histograms, bar graphs, dot charts, boxplots and heatmaps), Building more complex charts with ggplot. N.B. This is a 1pm TATA 2501 in-person class session.
6 | Fri 02/04/22 | Why, when and how of writing your own R functions
The basics of writing your own functions that promote code robustness, reduce duplication and facilitate code re-use. Extending functionality and utility with R packages from CRAN and Bioconductor, Working with Bio3D for molecular data.
7 | Wed 02/09/22 | Introduction to machine learning for Bioinformatics 1
Unsupervised learning, K-means clustering, Hierarchical clustering, Heatmap representations. Dimensionality reduction, Principal Component Analysis (PCA).
8 | Fri 02/11/22 | Unsupervised learning mini-project
Longer hands-on session with unsupervised learning analysis of cancer cells, further highlighting practical considerations and best practices for the analysis and visualization of high dimensional datasets.
9 | Wed 02/16/22 | Structural Bioinformatics
Protein structure function relationships, Protein visualization resources, Modeling energy as a function of structure, Working with sequence and structure data in R. Structure prediction with AlphaFold2 and the new age of structural biology. (If time allows) Protein motion and conformational variants, Molecular simulation and small molecule docking and drug optimization.
10 | Fri 02/18/22 | Genome informatics and high throughput sequencing
Searching genes and gene functions, Genome databases, Variation in the Genome, High-throughput sequencing technologies, biological applications, bioinformatics analysis methods; The Galaxy platform along with resources from the EBI & UCSC.
11 | Wed 02/23/22 | Transcriptomics, RNA-Seq analysis, and the interpretation of gene lists
RNA-Seq aligners, Differential expression tests, RNA-Seq statistics, Counts and FPKMs and avoiding P-value misuse, Hands-on analysis of RNA-Seq data with R. Gene functional annotation, Functional databases KEGG, InterPro, GO ontologies and functional enrichment.
12 | Fri 02/25/22 | RNA-Seq mini project
Differential expression analysis project with DESeq2 followed by gene enrichment and functional annotation with KEGG, InterPro, and GO ontologies.
13 | Wed 03/02/22 | Essential UNIX for bioinformatics
Bioinformatics on the command line, Understanding processes, File system structure, Connecting to remote servers, Redirection, streams and pipes, Workflows for batch processing, Launching and using AWS EC2 instances (A.K.A. Virtual Machines).
14 | Fri 03/04/22 | Vaccination rate mini project
A topical mini-project using ggplot and dplyr with the latest statewide COVID-19 vaccination data. Practical considerations and best practices for exploratory data analysis.
15 | Wed 03/09/22 | Investigating pertussis resurgence mini project
A topical mini-project using web-scraping, JSON based APIs and advanced dplyr and ggplot to investigate brand new datasets associated with pertussis cases and longitudinal RNA-Seq on the immune response to vaccination.
16 | Fri 03/11/22 | Portfolio building and discussion of bioinformatics in industry
Hands-on with Git and GitHub, Why you should use a version control system, Making a public-facing GitHub pages portfolio of your bioinformatics work.
Livestream interview with leading bioinformatics and genomics scientists from industry.
Project: Find a gene assignment due!

Class material


0: Getting oriented

Topics:
Course introduction, Learning goals & expectations, Meet the instructional team. Setting up your computer with required software.

Goals:

  • Understand course scope, expectations, logistics and ethics code.
  • Complete the pre-course questionnaire.
  • Set up your computer for this course.

Videos:

Supporting material:


1: Welcome to Bioinformatics

Topics: Biology is an information science, History of Bioinformatics, Types of data, Application areas and introduction to upcoming course segments, Introduction to NCBI & EBI resources for the molecular domain of bioinformatics, Hands-on session using NCBI-BLAST, Entrez, GENE, UniProt, Muscle and PDB bioinformatics tools and databases.

Goals:

  • Understand the increasing necessity for computation in modern life sciences research.
  • Get introduced to how bioinformatics is practiced.
  • Be able to query, search, compare and contrast the data contained in major bioinformatics databases (GenBank, GENE, UniProt, PFAM, OMIM, PDB) and describe how these databases intersect.
  • The goal of the hands-on session is to introduce a range of core bioinformatics databases and associated online services whilst actively investigating the molecular basis of several common human diseases.

Videos:

Supporting Material:

Homework:


2: Sequence alignment fundamentals, algorithms and applications

Topics: Sequence Alignment and Database Searching: Homology, Sequence similarity, Local and global alignment, Heuristic approaches, Database searching with BLAST, E-values and evaluating alignment scores and statistics.

Goals:

  • Be able to describe how dynamic programming works for pairwise sequence alignment.
  • Appreciate the differences between global and local alignment along with their major application areas.
  • Understand how aligning novel sequences with previously characterized genes or proteins provides important insights into their common attributes and evolutionary origins.
  • The goals of the hands-on session are to explore the principles underlying the computational tools that can be used to compute and evaluate sequence alignments.
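
As a rough illustration of the dynamic programming idea named in the goals above, the sketch below fills a Needleman-Wunsch style score table for global alignment, one row at a time. It is written in Python rather than the R used in later classes. The function name nw_score and the simple +1/-1/-1 match/mismatch/gap scores are illustrative assumptions, not the parameters used in the lab.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    # prev[j] = best score for aligning the current prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap in b
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(nw_score("GATTACA", "GCATGCG"))  # 0
```

Only the score is returned here; recovering the alignment itself would require keeping the full table and tracing back from the final cell. A local (Smith-Waterman style) variant would additionally clamp each cell at zero and report the table maximum.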

Videos:

Supporting Material:

Homework:

  • Questions,
  • Submit your completed lab report (i.e. filled in PDF form) to GradeScope,
  • OPTIONAL: Complete the following Alignment Problem,
  • For next week please install R and RStudio,
  • DataCamp: Sign up to our F21_Bioinformatics group/organization via the link on Piazza or in your UCSD email. We will use this from next week onward. However, feel free to get started with your first course, Introduction to R!

Readings:



(Project:) Find a Gene Assignment Part 1

The find-a-gene project is a required assignment for BGGN-213. The objective of this assignment is for you to demonstrate your grasp of database searching, sequence analysis, structure analysis and the R environment that we have covered to date in class.

You may wish to consult the scoring rubric at the end of the above linked project description and the example report for format and content guidance.

  • Your responses to questions Q1-Q4 are due Friday Feb 4th (02/04/22) at 12pm San Diego time.

  • The complete assignment, including responses to all questions, is due Friday March 11th (03/11/22) at 12pm San Diego time.

  • In both instances your PDF format report should be submitted to GradeScope. Late responses will not be accepted under any circumstances.

Videos:


3: (Optional extension) Advanced sequence alignment and database searching

Topics:
Detecting remote sequence similarity, Substitution matrices, Database searching beyond BLAST with PSI-BLAST, Profiles and HMMs, Protein structure comparisons, Beginning with command line based database searches.

Goals:

  • Be able to calculate the alignment score between protein (or nucleotide) sequences using a provided scoring matrix such as BLOSUM62.
  • Understand the limits of homology detection with tools such as BLAST.
  • Know how to derive a PROSITE style regular expression for aligned motifs.
  • Be able to calculate a PSSM profile for aligned sequences and subsequently score new sequences using a PSSM.
  • Be able to perform PSI-BLAST, HMMER and protein structure based database searches and interpret the results in terms of the biological significance of an e-value.
  • Be familiar with the concepts of True Positives, False Positives, Sensitivity and Specificity.
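
To make the last goal concrete, here is a minimal Python sketch of how sensitivity and specificity follow from true/false positive and negative counts. The counts are hypothetical numbers chosen for illustration, not results from any real database search.

```python
def sensitivity(tp, fn):
    """Fraction of true homologs that the search recovered: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-homologs correctly rejected: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical counts for one database search at a chosen E-value cutoff
tp, fp, tn, fn = 90, 50, 950, 10
print(sensitivity(tp, fn))  # 0.9
print(specificity(tn, fp))  # 0.95
```

Loosening the E-value cutoff typically raises sensitivity at the expense of specificity, which is why the two are usually reported together.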

Material:

Homework:

  • Optional Questions click and select “make a copy” then follow instructions,
  • DataCamp: Sign-up to our F21_Bioinformatics group/organization via the link in your UCSD email and start (you do not have to finish yet) Introduction to R! (we will complete this next week).
  • RStudio and R download and setup.

4: Bioinformatics data analysis with R

Topics: Why do we use R for bioinformatics? R language basics and the RStudio IDE, Major R data structures and functions, Using R interactively from the RStudio console.

Goals:

  • Understand why we use R for bioinformatics,
  • Familiarity with R’s basic syntax,
  • Familiarity with major R data structures (vectors, data.frames and lists),
  • Understand the basics of using functions (arguments, vectorization and recycling).

Videos:

Supporting Material:

Homework:


5: Data exploration and visualization in R

Topics: The exploratory data analysis mindset, Data visualization best practices, Simple base graphics (including scatterplots, histograms, bar graphs, dot charts, boxplots and heatmaps), Building more complex charts with ggplot.

Goals:

  • Appreciate the major elements of exploratory data analysis and why it is important to visualize data.
  • Be conversant with data visualization best practices and understand how good visualizations optimize for the human visual system.
  • Be able to generate informative graphical displays including scatterplots, histograms, bar graphs, boxplots, dendrograms and heatmaps and thereby gain exposure to the extensive graphical capabilities of R.
  • Appreciate that you can build even more complex charts with ggplot and additional R packages.
  • Be able to write and (re)use basic R scripts to aid with reproducibility.

Videos:

Supporting Material:

Homework:


6: R functions and R packages from CRAN and Bioconductor

Topics: The why, when and how of writing your own R functions with worked examples. Further extending functionality and utility with R packages, Obtaining R packages from CRAN and Bioconductor, Working with Bio3D for molecular data, Managing genome-scale data with Bioconductor.

Goals:

  • Understand the structure and syntax of R functions and how to view the code of any R function,
  • Be able to follow a step by step process of going from a working code snippet to a more robust function that reduces duplication and facilitates code re-use,
  • Be able to find and install R packages from CRAN and Bioconductor,
  • Understand how to find and use package vignettes, demos, documentation, tutorials and source code repositories where available.

Videos:

Supporting material:

Homework:

  • Questions,
  • Submit your completed PDF lab report to GradeScope,
  • Write a function: See Q6 of the hands-on lab supplement above. This entails turning a supplied code snippet into a more robust and re-usable function that will take any of the three listed input proteins and plot the effect of drug binding. Note the assessment rubric and submission instructions within the document. Update: this item is no longer required due to in-person transition week chaos.
  • DataCamp: Please work toward completing any outstanding courses including Intro to R, Intro to ggplot2 and Intermediate R.

Other:


7: Introduction to machine learning for Bioinformatics

Topics: Unsupervised learning, supervised learning and reinforcement learning; Focus on unsupervised learning, K-means clustering, Hierarchical clustering, Dimensionality reduction, visualization and analysis, Principal Component Analysis (PCA), Practical considerations and best practices for the analysis of high dimensional datasets.

Goals:

  • Understand the major differences between unsupervised and supervised learning.
  • Be able to create k-means and hierarchical cluster models in R.
  • Be able to describe how the k-means and bottom-up hierarchical cluster algorithms work.
  • Know how to visualize and integrate clustering results and select good cluster models.
  • Be able to describe in general terms how PCA works and its major objectives.
  • Be able to apply PCA to high dimensional datasets and visualize and integrate PCA results (e.g. identify outliers, find structure in features and aid in complex dataset visualization).
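
For the goal of describing how k-means works, the two alternating steps (assign each point to its nearest center, then move each center to the mean of its points) can be sketched in a few lines. This is an illustrative Python version with caller-supplied starting centers, not the R kmeans() function used in class; the function name and the toy points are assumptions made for the example.

```python
from math import dist


def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm: alternate assignment and update steps until stable."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point
        labels = [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
                  for p in points]
        # Update step: move each center to the mean of its assigned points
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(old)  # leave an empty cluster's center as is
        if new_centers == centers:
            break  # centers stopped moving, so assignments are stable
        centers = new_centers
    return labels, centers


# Two well-separated toy groups with hand-picked starting centers
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(pts, [(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the starting centers are chosen at random and the algorithm is rerun several times, since different starts can converge to different clusterings.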

Videos:

Supporting material:

Homework:

Other Material:


8: Unsupervised Learning Mini-Project

Topics: Hands-on project session with unsupervised learning analysis of cancer cells, Practical considerations and best practices for the analysis and visualization of high dimensional datasets.

Goals:

  • Be able to import data and prepare data for unsupervised learning analysis.
  • Be able to apply and test combinations of PCA, k-means and hierarchical clustering to high dimensional datasets and critically review results.

Material:

Homework:


9: Structural Bioinformatics (Focus on new AlphaFold2)

Topics: Protein structure function relationships, Protein structure and visualization resources, Modeling energy as a function of structure, Homology modeling, AlphaFold, Predicting functional dynamics, Inferring protein function from structure.

Goals:

  • View and interpret the structural models in the PDB,
  • Understand the classic Sequence>Structure>Function via energetics and dynamics paradigm,
  • Be able to use VMD for biomolecular visualization and analysis,
  • Appreciate the role of AlphaFold in advancing structural bioinformatics,
  • Be able to use the Bio3D package for exploratory analysis of protein sequence-structure-function-dynamics relationships.

Videos:

Material:

  • Lecture Slides: Large PDF, Small PDF,
  • Lab: Hands-on structural bioinformatics analysis (pt. 1),

  • Software links: VMD download, MUSCLE download,
  • Alternative Windows install and setup cmd: curl -o "muscle.exe" "https://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86win32.exe"
  • Alternative Intel Mac install and setup cmd: sudo curl -o "/usr/local/bin/muscle" "http://thegrantlab.org/misc/muscle"; sudo chmod +x /usr/local/bin/muscle
  • Alternative M1 Mac install and setup cmd: sudo curl -o "/usr/local/bin/muscle" "http://thegrantlab.org/misc/m1/muscle"; sudo chmod +x /usr/local/bin/muscle
  • Side-note: Check your Mac cpu type with cmd: sysctl -a | grep cpu.brand
  • Feedback: Muddy point assessment.

Homework:

  • Questions.

10: Genome informatics

Topics: Genome sequencing technologies past, present and future (Sanger, Shotgun, PacBio, Illumina, toward the $500 human genome), Biological applications of sequencing, Variation in the genome, RNA-Sequencing for gene expression analysis; Major genomic databases, tools and visualization resources from the EBI & UCSC, The Galaxy platform for quality control and analysis; Sample Galaxy RNA-Seq workflow with FastQC and Bowtie2.

Goals:

  • Appreciate and describe in general terms the rapid advances in sequencing technologies and the new areas of investigation that these advances have made accessible.
  • Understand the process by which genomes are currently sequenced and the bioinformatics processing and analysis required for their interpretation.
  • For a genomic region of interest (e.g. the neighborhood of a particular SNP), use a genome browser to view nearby genes, transcription factor binding regions, epigenetic information, etc.
  • Be able to use the Galaxy platform for basic RNA-Seq analysis from raw reads to expression value determination.
  • Understand the FASTQ file format and the information it holds.
  • Understand the SAM/BAM file format and the information it holds.
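
As a small illustration of what a FASTQ file holds, the Python sketch below reads four-line records (header, sequence, separator, quality string) and decodes the quality characters to Phred scores. It assumes the common Phred+33 encoding and unwrapped four-line records; the function name parse_fastq and the example read are made up for the illustration.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, phred_scores) for each 4-line FASTQ record."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at non-blank line {i + 1}")
        # Phred+33 (an assumption): score = ASCII code of the character minus 33
        scores = [ord(c) - 33 for c in qual]
        yield header[1:], seq, scores


example = "@read1\nGATTACA\n+\nIIIIII!\n"
for read_id, seq, scores in parse_fastq(example):
    print(read_id, seq, scores)  # read1 GATTACA [40, 40, 40, 40, 40, 40, 0]
```

A Phred score Q corresponds to an estimated base-call error probability of 10^(-Q/10), so 40 means roughly a 1 in 10,000 chance of error and 0 means the call carries no confidence.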

Videos:

Supporting material:

Homework:

  • Population analysis: Submit to GradeScope your RMarkdown generated PDF with working code, output and narrative text answering Q13 and Q14 in this week's Hands-on section worksheet.

IPs

  • nt1 IP: http://3.212.78.120/galaxy
  • nt2 IP: http://3.231.195.172/galaxy

OPTIONAL: Bioinformatics in structure prediction and design (Focus on new AlphaFold2)

Topics: The traditional path to drug discovery; High throughput screening approaches; Computational receptor/target-based bioinformatics approaches; Computational ligand/drug-based bioinformatics approaches; Small molecule docking methods; Prediction and analysis of biomolecular dynamics, conformational variants and functional dynamics; Molecular simulation and drug optimization.

Goals:

  • Appreciate how bioinformatics can aid drug discovery,
  • Be able to perform in silico docking and virtual screening strategies for drug discovery,
  • Appreciate how bioinformatics can predict the functional dynamics of biomolecules,
  • Be able to use Bio3D for the analysis and prediction of protein flexibility,
  • Understand the increasing role of bioinformatics in pharma and the drug discovery process in particular.

Videos:

  • X.1 - No videos for this class.

Material:

Homework:

  • Questions.

11: Transcriptomics and the analysis of RNA-Seq data

Topics: Analysis of RNA-Seq data with R, Differential expression tests, RNA-Seq statistics, Counts and FPKMs, Normalizing for sequencing depth, DESeq2 analysis. Gene finding and functional annotation from high throughput sequencing data, Functional databases KEGG, InterPro, GO ontologies and functional enrichment.

Goals:

  • Given an RNA-Seq dataset, find the set of significantly differentially expressed genes and their annotations.
  • Gain competency with data import, processing and analysis with DESeq2 and other bioconductor packages.
  • Understand the structure of count data and metadata required for running analysis.
  • Be able to extract, explore, visualize and export results.
  • Perform a GO analysis to identify the pathways relevant to a set of genes (e.g. identified by a transcriptomic study or a proteomic experiment). Use both Bioconductor packages and online tools to interpret gene lists and annotate potential gene functions.
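
The "Counts and FPKMs" and "Normalizing for sequencing depth" topics above can be illustrated with the standard FPKM formula: fragments per kilobase of transcript per million mapped fragments. The Python sketch below uses hypothetical numbers and a made-up function name. It is only meant to show how gene length and library size enter the calculation, and it is not how DESeq2 normalizes, since DESeq2 works from raw counts with its own size factors.

```python
def fpkm(count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    # count / (length in kb) / (library size in millions), rearranged
    return count * 1e9 / (gene_length_bp * total_mapped_fragments)


# Same raw count and gene length, two hypothetical sequencing depths
print(fpkm(100, 2000, 1_000_000))  # 50.0
print(fpkm(100, 2000, 2_000_000))  # 25.0
```

The same raw count yields half the FPKM when the library is twice as deep, which is the sense in which the measure corrects for sequencing depth.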

Videos:

Supporting material:

Readings:

Homework:

  • Submit your completed PDF lab report to GradeScope,

12: RNA-Seq analysis mini-project

Topics: Differential expression analysis project with DESeq2 followed by gene enrichment and functional annotation with KEGG, InterPro, and GO ontologies.


13: Essential UNIX for bioinformatics

Topics: Bioinformatics on the command line, Why do we use UNIX for bioinformatics? UNIX philosophy, 21 Key commands, Understanding processes, File system structure, Connecting to remote servers, Redirection, streams and pipes, Workflows for batch processing, Organizing computational projects, Going further with your own computer in the cloud, Launching and using AWS EC2 instances (A.K.A. Virtual Machines).

Goals:

  • Understand why we use UNIX for bioinformatics
  • Use UNIX command-line tools for file system navigation and text file manipulation.
  • Have a familiarity with 21 key UNIX commands that we will use ~90% of the time.
  • Be able to connect to remote servers from the command line.
  • Use existing programs at the UNIX command line to analyze bioinformatics data.
  • Understand IO Redirection, Streams and pipes.
  • Understand best practices for organizing computational projects.

Videos:

Supporting material:

Homework:


18: Guest lecture: Immunoinformatics, immunotherapy and cancer

Topics: Cancer genomics resources and bioinformatics tools for investigating the molecular basis of cancer. Large scale cancer sequencing projects; NCI Genomic Data Commons; What has been learned from genome sequencing of cancer? Immunoinformatics, immunotherapy and cancer; Using genomics and bioinformatics to harness a patient’s own immune system to fight cancer. Implications for the development of personalized medicine.

Material:


14: Vaccination rate mini project

Topics: A topical mini-project using ggplot and dplyr with the latest statewide COVID-19 vaccination data. Practical considerations and best practices for exploratory data analysis.


15: Mini Project: Investigating Pertussis Resurgence

Topics: A topical mini-project using web-scraping, JSON based APIs and advanced dplyr and ggplot to investigate new datasets associated with pertussis cases and longitudinal RNA-Seq on the immune response to distinct vaccination strategies. This class will be co-taught with Dr. Bjoern Peters from the La Jolla Institute for Immunology.

Homework:

  • Generate a complete lab report with all sections and question responses for submission to GradeScope.
  • There are no homework quiz questions this week.

16: Hands-on with Git & online portfolio completion plus bonus Bioinformatics in industry session

Topics: Today’s lecture and hands-on sessions introduce Git, currently the most popular version control system. We will learn how to perform common operations with Git and RStudio. We will also cover making a public facing GitHub pages portfolio of your bioinformatics work. Project assignment troubleshooting. Discussion of Bioinformatics and genomics career opportunities.

Videos:

  • 16.1 - OPTIONAL: Git for humans,
  • 16.2 Introduction to GitHub Pages that we will use for building your portfolio website.
  • 16.3 Live stream interview with leading bioinformatics and genomics scientists from industry including Dr. Ali Crawford (Associate Director, Scientific Research, Illumina Inc.), Dr. Bjoern Peters (Full Professor and Principal Investigator, La Jolla Institute) and Dr. Ana Grant (Director of Research Informatics, Synthetic Genomics Inc.).

Supporting material:

Or student topic of choice to be selected from those below:

  • Biological network analysis
  • Cancer genomics
  • Unix tips and tricks for Bioinformatics
  • Structural Bioinformatics and computational drug design
  • Introduction to the tidyverse
  • Writing R packages
  • Advanced RMarkdown
  • Creating online work portfolios with GitHub-pages