Lectures
All lectures are Tu/Th 9:00 am-12:00 pm in Warren Lecture Hall 2015 (WLH 2015) (Map). Clicking on the class topics below will take you to the corresponding lecture notes, homework assignments, pre-class video screencasts and required reading material.
# | Date | Topics for Winter 2018 |
---|---|---|
1 | Tu, 01/09 | Welcome to Bioinformatics Course introduction, Learning goals & expectations, Biology is an information science, History of Bioinformatics, Types of data, Application areas and introduction to upcoming course segments, Hands on with major Bioinformatics databases and key online NCBI and EBI resources |
2 | Th, 01/11 | Sequence alignment fundamentals, algorithms and applications Homology, Sequence similarity, Local and global alignment, classic Needleman-Wunsch, Smith-Waterman and BLAST heuristic approaches, Hands on with dot plots, Needleman-Wunsch and BLAST algorithms highlighting their utility and limitations |
3 | Tu, 01/16 | Advanced sequence alignment and database searching Detecting remote sequence similarity, Database searching beyond BLAST, Substitution matrices, Using PSI-BLAST, Profiles and HMMs, Protein structure comparisons |
4 | Th, 01/18 | Bioinformatics data analysis with R Why do we use R for bioinformatics? R language basics and the RStudio IDE, Major R data structures and functions, Using R interactively from the RStudio console |
5 | Tu, 01/23 | Data exploration and visualization in R The exploratory data analysis mindset, Data visualization best practices, Using and customizing base graphics (scatterplots, histograms, bar graphs and boxplots), Building more complex charts with ggplot and rgl |
6 | Th, 01/25 | Why, when and how of writing your own R functions The basics of writing your own functions that promote code robustness, reduce duplication and facilitate code re-use |
7 | Tu, 01/30 | Bioinformatics R packages from CRAN and Bioconductor Extending functionality and utility with R packages, Obtaining R packages from CRAN and Bioconductor, Working with Bio3D for molecular data |
8 | Th, 02/01 | Introduction to Machine Learning for Bioinformatics 1 Unsupervised learning, K-means clustering, Hierarchical clustering, Heatmap representations. Dimensionality reduction, Principal Component Analysis (PCA) |
9 | Tu, 02/06 | Unsupervised Learning Mini-Project Longer hands-on session with unsupervised learning analysis of cancer cells, further highlighting practical considerations and best practices for the analysis and visualization of high dimensional datasets |
10 | Th, 02/08 | Project: Find a gene assignment (Part 1) Principles of database searching, sequence analysis and structure analysis, along with a hands-on introduction to Git: how to perform common operations with the Git version control system. We will also cover the popular social code-hosting platforms GitHub and BitBucket. |
11 | Tu, 02/13 | Structural Bioinformatics (Part 1) Protein structure function relationships, Protein structure and visualization resources, Modeling energy as a function of structure |
12 | Th, 02/15 | Bioinformatics in drug discovery and design Target identification, Lead identification, Small molecule docking methods, Protein motion and conformational variants, Molecular simulation and drug optimization |
13 | Tu, 02/20 | Genome informatics and high throughput sequencing (Part 1) Genome sequencing technologies past, present and future; Biological applications of sequencing, Analysis of variation in the genome, and gene expression; The Galaxy platform along with resources from the EBI & UCSC; Sample Galaxy RNA-Seq workflow with FastQC and Bowtie2 |
14 | Th, 02/22 | Transcriptomics and the analysis of RNA-Seq data RNA-Seq aligners, Differential expression tests, RNA-Seq statistics, Counts and FPKMs and avoiding P-value misuse, Hands-on analysis of RNA-Seq data with R. N.B. Find a gene assignment part 1 due today! |
15 | Tu, 02/27 | Genome annotation and the interpretation of gene lists Gene finding and functional annotation, Functional databases KEGG, InterPro, GO ontologies and functional enrichment |
16 | Th, 03/01 | Essential statistics for bioinformatics Everything you wanted to know about statistics for bioinformatics but were afraid to ask. Extensive R examples and applications. |
17 | Tu, 03/06 | Biological network analysis Network based approaches for integrating and interpreting large heterogeneous high throughput data sets; Discovering relationships in ‘omics’ data; Network construction, manipulation, visualization and analysis; Major graph theory and network topology measures and concepts (Degree, Communities, Shortest Paths, Centralities, Betweenness, Random graphs vs scale free); Hands-on with Cytoscape and igraph packages. |
18 | Th, 03/08 | Cancer genomics Cancer genomics resources and bioinformatics tools for investigating the molecular basis of cancer. Mining the NCI Genomic Data Commons; Immunoinformatics and immunotherapy; Using genomics and bioinformatics to design a personalized cancer vaccine. Implications for personalized medicine. N.B. Find a gene assignment due on Tuesday 03/13! |
19 | Tu, 03/13 | Course summary Summary of learning goals, Student course evaluation time and exam preparation; Find a gene assignment due! |
20 | Th, 03/15 | Final exam! |
Class material
1: Welcome to Bioinformatics and introduction to Bioinformatics databases and key online resources
Topics:
Course introduction, Learning goals & expectations, Biology is an information science, History of Bioinformatics, Types of data, Application areas and introduction to upcoming course segments, Student 30-second introductions, Introduction to NCBI & EBI resources for the molecular domain of bioinformatics, Hands-on session using NCBI-BLAST, Entrez, GENE, UniProt, Muscle and PDB bioinformatics tools and databases.
Goals:
- Understand course scope, expectations, logistics and ethics code.
- Understand the increasing necessity for computation in modern life sciences research.
- Get introduced to how bioinformatics is practiced.
- Complete the pre-course questionnaire.
- Setup your laptop computer for this course.
- The goal of the hands-on session is to introduce a range of core bioinformatics databases and associated online services whilst actively investigating the molecular basis of several common human diseases.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Lab: Hands-on section worksheet
- Feedback: Muddy Point Assessment,
- Feedback: Results.
- Handout: Class Syllabus
- Computer Setup Instructions.
Homework:
- Questions,
- Readings:
- PDF1: What is bioinformatics? An introduction and overview,
- PDF2: Advancements and Challenges in Computational Biology,
- Other: For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights New York Times, 2014.
Screen Casts:
2: Sequence alignment fundamentals, algorithms and applications
Topics:
Further coverage of major NCBI & EBI resources for the molecular domain of bioinformatics with a focus on GenBank, UniProt, Entrez and Gene Ontology. There are many bioinformatics databases (see handout) and being able to judge their utility and quality is important. Sequence Alignment and Database Searching: Homology, Sequence similarity, Local and global alignment, Heuristic approaches, Database searching with BLAST, E-values and evaluating alignment scores and statistics.
Goals:
- Be able to query, search, compare and contrast the data contained in major bioinformatics databases (GenBank, GENE, UniProt, PFAM, OMIM, PDB) and describe how these databases intersect.
- Be able to describe how nucleotide and protein sequence and structure data are represented (FASTA, FASTQ, GenBank, UniProt, PDB).
- Be able to describe how dynamic programming works for pairwise sequence alignment
- Appreciate the differences between global and local alignment along with their major application areas.
- Understand how aligning novel sequences with previously characterized genes or proteins provides important insights into their common attributes and evolutionary origins.
- The goals of the hands-on session are to explore the principles underlying the computational tools that can be used to compute and evaluate sequence alignments.
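To make the dynamic programming goal above concrete, here is a minimal sketch of Needleman-Wunsch global alignment scoring. It is written in Python purely for illustration (the course labs use R), and the scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for the example rather than values taken from the course material.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b.

    Scoring values are illustrative assumptions; a linear gap penalty is used.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell is the best of: align two residues, or open a gap in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diagonal, up, left)

    # The bottom-right cell holds the score of the best end-to-end alignment.
    return score[-1][-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

A local (Smith-Waterman) variant would additionally floor every cell at zero and report the maximum cell rather than the bottom-right one.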
Material:
- Lecture Slides: Large PDF, Small PDF,
- Handout: Major Bioinformatics Databases
- Lab: Hands-on section worksheet
- Feedback: Muddy Point Assessment.
- Feedback: Results.
Homework:
- Readings:
- PDF1: What is dynamic programming?,
- PDF2: Fundamentals of database searching.
3: Advanced sequence alignment and database searching
Topics:
Detecting remote sequence similarity, Database searching beyond BLAST, PSI-BLAST, Profiles and HMMs, Protein structure comparisons, Beginning with command line based database searches.
Goals:
- Be able to calculate the alignment score between two nucleotide or protein sequences using a provided scoring matrix
- Understand the limits of homology detection with tools such as BLAST
- Be able to perform PSI-BLAST, HMMER and protein structure based database searches and interpret the results in terms of the biological significance of an e-value.
- Run our first bioinformatics tool from the command line.
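As a small illustration of the first goal above, the sketch below scores an already-aligned pair of sequences with a supplied scoring matrix. It is in Python for illustration only (the course uses R). The matrix `TOY_MATRIX`, the function name and the linear gap penalty of -2 are hypothetical choices made for this example; in practice a published matrix such as BLOSUM62 would be supplied instead.

```python
# Hypothetical toy nucleotide matrix: +2 for identity, -1 for any substitution.
TOY_MATRIX = {(x, y): (2 if x == y else -1) for x in "ACGT" for y in "ACGT"}


def alignment_score(aligned_a, aligned_b, matrix, gap=-2):
    """Sum per-column scores of two gapped sequences of equal length.

    '-' marks a gap; each gap column costs `gap` (a linear penalty, assumed here).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap
        else:
            total += matrix[(x, y)]
    return total


print(alignment_score("ACG-T", "ACGAT", TOY_MATRIX))
```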
Material:
- Lecture Slides: Large PDF, Small PDF,
- Lab: Hands-on section worksheet
- Feedback: Muddy Point Assessment.
Homework:
4: Bioinformatics data analysis with R
Topics: Why do we use R for bioinformatics? R language basics and the RStudio IDE, Major R data structures and functions, Using R interactively from the RStudio console.
Goals:
- Understand why we use R for bioinformatics
- Familiarity with R’s basic syntax,
- Be able to use R to read and parse comma-separated (.csv) formatted files ready for subsequent analysis,
- Familiarity with major R data structures (vectors, matrices and data.frames),
- Understand the basics of using functions (arguments, vectorization and recycling).
Material:
- Lecture Slides: Large PDF, Small PDF,
- Lab: Hands-on section 1,
- Feedback: Muddy point assessment,
- Feedback: Responses.
Homework:
5: Data exploration and visualization in R
Topics: The exploratory data analysis mindset, Data visualization best practices, Simple base graphics (including scatterplots, histograms, bar graphs, dot charts, boxplots and heatmaps), Building more complex charts with ggplot.
Goals:
- Appreciate the major elements of exploratory data analysis and why it is important to visualize data.
- Be conversant with data visualization best practices and understand how good visualizations optimize for the human visual system.
- Be able to generate informative graphical displays including scatterplots, histograms, bar graphs, boxplots, dendrograms and heatmaps and thereby gain exposure to the extensive graphical capabilities of R.
- Appreciate that you can build even more complex charts with ggplot and additional R packages such as rgl.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Rmarkdown documents for plot session 1, and more advanced plots,
- Lab: Hands-on section worksheet,
- Example data for hands-on sections bimm143_05_rstats.zip,
- Feedback: Muddy point assessment,
- Feedback: Responses.
Homework:
- This unit's homework is all via DataCamp (see lecture 4 above).
6: Why, When and How of Writing Your Own R Functions
Topics: Using R scripts and Rmarkdown files, Importing data in various formats from both local and online sources, The basics of writing your own functions that promote code robustness, reduce duplication and facilitate code re-use.
Goals:
- Be able to import data in various flat file formats from both local and online sources.
- Understand the structure and syntax of R functions and how to view the code of any R function.
- Understand when you should be writing functions.
- Be able to follow a step by step process of going from a working code snippet to a more robust function.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Flat files for importing with read.table: test1.txt, test2.txt, test3.txt.
- Lab: Hands-on section worksheet,
- Feedback: Muddy point assessment,
- Feedback: Responses.
Homework:
- See Q6 of the hands-on lab sheet above. This entails turning a supplied code snippet into a more robust and re-usable function that will take any of the three listed input proteins and plot the effect of drug binding. Note assessment rubric and submission instructions within document. (Submission deadline: 9am Th, 02/08).
- The remainder of this unit's homework is all via DataCamp.
7: Using CRAN and Bioconductor Packages for Bioinformatics
Topics: More on how to write R functions with worked examples. Further extending functionality and utility with R packages, Obtaining R packages from CRAN and Bioconductor, Working with Bio3D for molecular data, Managing genome-scale data with Bioconductor.
Goals:
- Be able to find and install R packages from CRAN and Bioconductor,
- Understand how to find and use package vignettes, demos, documentation, tutorials and source code repository where available.
- Be able to write and (re)use basic R scripts to aid with reproducibility.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Collaborative Google Doc based notes on selected R packages,
- Introductory tutorial on R packages,
- Feedback: Muddy point assessment.
- Feedback: Responses.
Homework:
- See Q6 from the previous lecture's hands-on lab sheet above. This entails turning a supplied code snippet into a more robust and re-usable function that will take any of the three listed input proteins and plot the effect of drug binding. Note assessment rubric and submission instructions within document. (Submission deadline: 9am Th, 02/08).
- The remainder of this unit's homework is all via DataCamp.
8: Introduction to Machine Learning for Bioinformatics
Topics: Unsupervised learning, supervised learning and reinforcement learning; Focus on unsupervised learning, K-means clustering, Hierarchical clustering, Heatmap representations; Dimensionality reduction, visualization and analysis, Principal Component Analysis (PCA); Practical considerations and best practices for the analysis of high dimensional datasets.
Goals:
- Understand the major differences between unsupervised and supervised learning.
- Be able to create k-means and hierarchical cluster models in R
- Be able to describe how the k-means and bottom-up hierarchical cluster algorithms work.
- Know how to visualize and integrate clustering results and select good cluster models.
- Be able to describe in general terms how PCA works and its major objectives.
- Be able to apply PCA to high dimensional datasets and visualize and integrate PCA results (e.g. identify outliers, find structure in features and aid in complex dataset visualization).
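For the goal of describing how k-means works, the following minimal sketch shows the two alternating steps (assign each point to its nearest center, then move each center to the mean of its members). It is written in Python for illustration only; in class the analysis is done in R. The function names and the fixed starting centers are assumptions of this example, and real implementations typically use random or repeated initialization.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
        for p in points
    ]


def kmeans(points, initial_centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for k, old_center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                # New center is the coordinate-wise mean of the cluster members.
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Keep an empty cluster's center where it was.
                new_centers.append(old_center)
        if new_centers == centers:
            break
        centers = new_centers
    return assign(points, centers), centers


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
labels, centers = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels, centers)
```

Because the result depends on the starting centers, comparing several runs (for example by total within-cluster sum of squares) is one common way to select a good cluster model.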
Material:
- Lecture Slides: Large PDF, Small PDF,
- Lab: Hands-on section worksheet for PCA
- Data files: UK_foods.csv.
- Introduction to PCA site.
- Feedback: Muddy point assessment.
- Feedback: Responses.
9: Unsupervised Learning Mini-Project
Topics: Longer hands-on session with unsupervised learning analysis of cancer cells, Practical considerations and best practices for the analysis and visualization of high dimensional datasets.
Goals:
- Be able to import data and prepare for unsupervised learning analysis.
- Be able to apply and test combinations of PCA, k-means and hierarchical clustering to high dimensional datasets and critically review results.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Lab: Hands-on section worksheet for PCA
- Data file: WisconsinCancer.csv.
- Bio3D PCA App: http://bio3d.ucsd.edu.
- Feedback: Muddy point assessment.
10: (Project:) Find a Gene Assignment Part 1
The find-a-gene project is a required assignment for BIMM-143. The objective with this assignment is for you to demonstrate your grasp of database searching, sequence analysis, structure analysis and the R environment that we have covered to date in class.
You may wish to consult the scoring rubric at the end of the above linked project description and the example report for format and content guidance.
Your responses to questions Q1-Q4 are due at the beginning of class Thursday February 22nd (02/22/18).
The complete assignment, including responses to all questions, is due at the beginning of class Tuesday March 13th (03/13/18).
Late responses will not be accepted under any circumstances.
Bonus: Hands-on with Git
Today’s lecture and hands-on sessions will introduce Git, currently the most popular version control system. We will learn how to perform common operations with Git and RStudio. We will also cover the popular social code-hosting platforms GitHub and BitBucket.
- Lecture Slides: Large PDF, Small PDF,
- Lab: Hands-on with Git
11: Structural Bioinformatics
Topics: Protein structure function relationships, Protein structure and visualization resources, Modeling energy as a function of structure, Homology modeling, Predicting functional dynamics, Inferring protein function from structure.
Goals:
- View and interpret the structural models in the PDB,
- Understand the classic Sequence>Structure>Function via energetics and dynamics paradigm,
- Appreciate the role of bioinformatics in mapping the ENERGY LANDSCAPE of biomolecules,
- Be able to use the Bio3D package for exploratory analysis of protein sequence-structure-function-dynamics relationships.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Lab: Hands-on section worksheet for VMD and Bio3D
- Software links: VMD download, MUSCLE download
- Feedback: Muddy point assessment.
12: Bioinformatics in drug discovery and design
Topics: The traditional path to drug discovery; High throughput screening approaches; Computational receptor/target-based bioinformatics approaches; Computational ligand/drug-based bioinformatics approaches; Small molecule docking methods; Prediction and analysis of biomolecular motion, conformational variants and functional dynamics; Molecular simulation and drug optimization.
Goals:
- Appreciate how bioinformatics can predict functional dynamics & aid drug discovery,
- Be able to use Bio3D and R for the analysis and prediction of protein flexibility,
- Be able to perform in silico docking and virtual screening strategies for drug discovery,
- Understand the increasing role of bioinformatics in the drug discovery process.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Lab: Hands-on section worksheet for In silico drug docking
- Software download links: AutoDock Tools, AutoDock Vina, VMD, MUSCLE
- Optional backup files: 1, 2, 3, log.txt, all.pdbqt
- Feedback: Muddy point assessment
- Feedback: Responses.
13: Genome informatics and high throughput sequencing (Part 1)
Topics: Genome sequencing technologies past, present and future (Sanger, Shotgun, PacBio, Illumina, toward the $500 human genome), Biological applications of sequencing, Variation in the genome, RNA-Sequencing for gene expression analysis; Major genomic databases, tools and visualization resources from the EBI & UCSC, The Galaxy platform for quality control and analysis; Sample Galaxy RNA-Seq workflow with FastQC and Bowtie2
Goals:
- Appreciate and describe in general terms the rapid advances in sequencing technologies and the new areas of investigation that these advances have made accessible.
- Understand the process by which genomes are currently sequenced and the bioinformatics processing and analysis required for their interpretation.
- For a genomic region of interest (e.g. the neighborhood of a particular SNP), use a genome browser to view nearby genes, transcription factor binding regions, epigenetic information, etc.
- Be able to use the Galaxy platform for basic RNA-Seq analysis from raw reads to expression value determination.
- Understand the FASTQ file format and the information it holds.
- Understand the SAM/BAM file format and the information it holds.
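As an illustration of the information a FASTQ record holds, the sketch below splits a record into its four lines (identifier, sequence, separator, quality string) and decodes the quality characters into Phred scores. It is written in Python for illustration only, assumes unwrapped four-line records, and assumes the Phred+33 (Sanger/Illumina 1.8+) encoding; the function name and the example read are hypothetical.

```python
def parse_fastq(text):
    """Parse FASTQ text into a list of dicts with id, sequence and Phred scores.

    Assumes exactly four lines per record and Phred+33 quality encoding.
    """
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")

    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        records.append({
            "id": header[1:].split(" ", 1)[0],
            "sequence": sequence,
            # Phred+33: quality score = ASCII code of the character minus 33.
            "phred": [ord(char) - 33 for char in quality],
        })
    return records


example = "@read1 example\nGATT\n+\nII5!\n"
print(parse_fastq(example))
```

A Phred score Q corresponds to an estimated base-call error probability of 10^(-Q/10), so Q = 20 means roughly a 1 in 100 chance that the base is wrong.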
Material:
- Lecture Slides: Large PDF, Small PDF,
- Hands-on section worksheet,
- RNA-Seq data files: HG00109_1.fastq, HG00109_2.fastq, genes.chr17.gtf, Expression genotype results, Example R script.
- SAM/BAM file format description.
- Feedback: Muddy point assessment.
IPs
- 149.165.169.245
- 129.114.17.65
- 129.114.17.251
- 149.165.156.226
- 149.165.170.88
- 129.114.17.244
14: Transcriptomics and the analysis of RNA-Seq data
Topics: Analysis of RNA-Seq data with R, Differential expression tests, RNA-Seq statistics, Counts and FPKMs, Normalizing for sequencing depth, DESeq2 analysis.
Goals:
- Given an RNA-Seq dataset, find the set of significantly differentially expressed genes and their annotations.
- Gain competency with data import, processing and analysis with DESeq2 and other Bioconductor packages.
- Understand the structure of count data and metadata required for running analysis.
- Be able to extract, explore, visualize and export results.
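To illustrate what normalizing counts for sequencing depth and gene length means, here is a minimal sketch of the standard FPKM calculation (fragments per kilobase of transcript per million mapped fragments). It is in Python for illustration only; the function name and the example numbers are made up for this sketch. Note that DESeq2 does not use FPKM internally and instead estimates its own size factors from raw counts.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    FPKM = count * 1e9 / (gene length in bp * total mapped fragments)
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments on a 2,000 bp gene in a library of 10 million fragments.
print(fpkm(500, 2000, 10_000_000))
```

Doubling the library size while keeping the count fixed halves the FPKM, which is exactly the depth correction the formula is meant to provide.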
Material:
- Lecture Slides: Large PDF, Small PDF.
- Detailed Bioconductor setup instructions.
- Hands-on section worksheet
- Data files: airway_scaledcounts.csv, airway_metadata.csv, annotables_grch38.csv.
- Muddy point assessment
Readings:
- Excellent review article: Conesa et al. A survey of best practices for RNA-seq data analysis. Genome Biology 17:13 (2016).
- An oldie but a goodie: Soneson et al. “Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.” F1000Research 4 (2015).
- Abstract and introduction sections of: Himes et al. “RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells.” PLoS ONE 9.6 (2014): e99625.
15: Genome annotation and the interpretation of gene lists
Topics: Gene finding and functional annotation from high throughput sequencing data, Functional databases KEGG, InterPro, GO ontologies and functional enrichment
Goals: Perform a GO analysis to identify the pathways relevant to a set of genes (e.g. identified by transcriptomic study or a proteomic experiment). Use both Bioconductor packages and online tools to interpret gene lists and annotate potential gene functions.
Material:
- Lecture Slides: Large PDF, Small PDF.
- Hands-on section worksheet
- Data files: GSE37704_featurecounts.csv, GSE37704_metadata.csv.
- Muddy point assessment
Homework: Quiz Assessment
Readings:
- Good review article: Trapnell C, Hendrickson DG, Sauvageau M, Goff L et al. “Differential analysis of gene regulation at transcript resolution with RNA-seq”. Nat Biotechnol 2013 Jan;31(1):46-53. PMID: 23222703.
16: Essential statistics for bioinformatics
Topics: Data summary statistics; Inferential statistics; Significance testing; Two sample T-test in R; Power analysis in R; Multiple testing correction; and almost everything you wanted to know about Principal Component Analysis (PCA) but were afraid to ask! Extensive R examples and applications.
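As a small illustration of multiple testing correction, the sketch below computes Bonferroni and Benjamini-Hochberg adjusted p-values. It is written in Python for illustration only (in R the same adjustments are available through `p.adjust`); the function names and example p-values are assumptions of this sketch.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices from smallest p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; Benjamini-Hochberg controls the false discovery rate, which is usually the more practical target when testing thousands of genes.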
Material:
- Lecture Slides: PDF.
- Data files:
- Feedback: Muddy point assessment.
17: Biological network analysis
Topics: Network graph approaches for integrating and interpreting large heterogeneous high throughput data sets; Discovering relationships in ‘omics’ data; Network construction, manipulation, visualization and analysis; Graph theory; Major network topology measures and concepts (Degree, Communities, Shortest Paths, Centralities, Betweenness, Random graphs vs scale free); De novo sub-network construction and clustering. Hands-on with Cytoscape and R packages for network visualization and analysis.
Goals:
- Understand the challenges in integrating and interpreting large heterogeneous high throughput data sets into their functional context.
- Be able to describe the major goals of biological network analysis and the concepts underlying network visualization and analysis.
- Be able to use Cytoscape for network visualization and manipulation.
- Be able to find and install Cytoscape Apps to extend network analysis functionality.
- Appreciate that the igraph R package has extensive network analysis functionality beyond that in Cytoscape, and that the Bioconductor R package RCy3 allows us to bring networks and associated data from R to Cytoscape so we can have the best of both worlds.
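To make two of the topology measures named above concrete (degree and shortest paths), here is a minimal sketch on a toy undirected network stored as an adjacency dictionary. It is in Python for illustration only; in class these measures are computed with Cytoscape and igraph. The node names and function names are hypothetical.

```python
from collections import deque

# Hypothetical toy network: edges A-B, B-C, B-D, C-D; node E is isolated.
NETWORK = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"B", "C"},
    "E": set(),
}


def degree(graph, node):
    """Number of edges touching a node in an undirected graph."""
    return len(graph[node])


def shortest_path_length(graph, start, goal):
    """Fewest edges between two nodes via breadth-first search, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == goal:
                return distance + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


print(degree(NETWORK, "B"), shortest_path_length(NETWORK, "A", "D"))
```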
Material:
- Lecture Slides: Large PDF, Small PDF.
- Hands-on section worksheet Part 1.
- Hands-on section worksheet Part 2.
- Data files:
- Muddy point assessment
18: Cancer genomics
Topics: Cancer genomics resources and bioinformatics tools for investigating the molecular basis of cancer. Large scale cancer sequencing projects; NCI Genomic Data Commons; What has been learned from genome sequencing of cancer? Immunoinformatics, immunotherapy and cancer; Using genomics and bioinformatics to harness a patient’s own immune system to fight cancer. Implications for the development of personalized medicine.
Material:
- Lecture Slides: Large PDF, Small PDF.
- Hands-on section worksheet Part 1.
- Hands-on section worksheet Part 2.
- Data files:
- Solutions:
- Example mutant identification and subsequence extraction with R walk through.
- subsequences.fa,
- Solutions.pdf.
- IEDB HLA binding prediction website http://tools.iedb.org/mhci/.
19: Course summary
Topics: Summary of learning goals, Student course evaluation time and exam preparation; Find a gene assignment due. Open study.
Hand-out: Exam guidelines, topics, and example questions
Ether-pad: Feedback
20: Final Exam
This open-book, open-notes 150-minute test consists of 35 questions. The number of points for each question is indicated in green font at the beginning of each question. There are 80 total points on offer.
Please remember to:
- Read all questions carefully before starting.
- Put your name, UCSD email and PID number on your test.
- Write all your answers in the space provided in the exam paper.
- Remember that concise answers are preferable to wordy ones.
- Clearly state any simplifying assumptions you make in solving a problem.
- No copies of this exam are to be removed from the class-room.
- No talking or communication (electronic or otherwise) with your fellow students once the exam has begun.
- Good luck!