Lectures
All Lectures are Tu/Th 9:00-12:00 pm in Warren Lecture Hall 2015 (WLH 2015) (Map). Clicking on the class topics below will take you to corresponding lecture notes, homework assignments, pre-class video screen-casts and required reading material.
# | Date | Topics for Fall 2017 |
---|---|---|
1 | Th, 09/28 | Welcome to Foundations of Bioinformatics Course introduction, Learning goals & expectations, Biology is an information science, History of Bioinformatics, Types of data, Application areas and introduction to upcoming course segments, Student computer setup |
2 | Tu, 10/03 | Bioinformatics databases and key online resources NCBI & EBI resources for the molecular domain of bioinformatics, Focus on GenBank, UniProt, Entrez and Gene Ontology. Hands on with BLAST, GenBank, OMIM, GENE, UniProt, Muscle, PFAM and PDB bioinformatics tools and databases |
3 | Th, 10/05 | Sequence alignment fundamentals, algorithms and applications Homology, Sequence similarity, Local and global alignment, classic Needleman-Wunsch, Smith-Waterman and BLAST heuristic approaches |
4 | Tu, 10/10 | Advanced database searching Database searching beyond BLAST, PSI-BLAST, Profiles and HMMs, Protein structure comparisons |
5 | Th, 10/12 | Introduction to UNIX for bioinformatics Why do we use UNIX for bioinformatics? UNIX philosophy, 21 Key commands, Understanding processes, File system structure, Connecting to remote servers |
6 | Tu, 10/17 | Working with Unix Bioinformatics on the command line, Redirection, streams and pipes, Workflows for batch processing, Shell scripting, Organizing computational projects |
7 | Th, 10/19 | Bioinformatics data analysis with R R language basics and the RStudio IDE, Major R data structures and functions, Using R scripts from the command line |
8 | Tu, 10/24 | Data exploration and visualization in R Import data in various formats (both local and from online sources), The exploratory data analysis mindset, Data visualization best practices, Simple base graphics (scatterplots, histograms, bar graphs and boxplots), Building more complex charts with ggplot |
9 | Th, 10/26 | Why, when and how of writing your own R functions Import data in various formats both local and from online sources, The basics of writing your own functions that promote code robustness, reduce duplication and facilitate code re-use |
10 | Tu, 10/31 | Working with R packages for bioinformatics Extending functionality and utility with R packages, Obtaining R packages from CRAN and bioconductor, Working with Bio3D for molecular data, Managing and analyzing genome-scale data with bioconductor |
11 | Th, 11/02 | Structural Bioinformatics Protein structure function relationships, Protein structure and visualization resources, Modeling energy as a function of structure, Homology modeling, Predicting functional dynamics, Inferring protein function from structure |
12 | Tu, 11/07 | Bioinformatics in drug discovery and design Target identification, Lead identification, Small molecule docking methods, Protein motion and conformational variants, Molecular simulation and drug optimization |
13 | Th, 11/09 | Project: Find a gene assignment Principles of database searching, sequence analysis, structure analysis and bioinformatic data analysis with the R environment |
14 | Tu, 11/14 | Genome informatics and high throughput sequencing Searching genes and gene functions, Genome databases, Variation in the genome, Sequencing technologies past, present and future (Sanger, Shotgun, PacBio, Illumina, toward the $500 human genome), Biological applications of sequencing, Bioinformatics analysis methods |
15 | Th, 11/16 | Major bioinformatics resources for genomics. Databases, tools and visualization resources from NCBI, EBI & UCSC, The Galaxy platform for quality control and analysis; FASTQ, SAM and BAM file formats; Sample workflows with FastQC and Bowtie2 |
16 | Tu, 11/21 | Immunoinformatics resources for the understanding of immunological information Guest lecture from Dr. Bjoern Peters (LIAI) with topics including: Epitope prediction, Reverse vaccinology, Immune system modeling, Disease diagnosis and therapy along with implications for the development of personalized medicine. |
Th, 11/23 | Happy Thanksgiving! No class N.B. Find a gene assignment due on Monday 11/27! | |
17 | Tu, 11/28 | Transcriptomics and the analysis of RNA-Seq data RNA-Seq aligners, Differential expression tests, RNA-Seq statistics, Counts and FPKMs and avoiding P-value misuse, Hands-on analysis of RNA-Seq data with R |
18 | Th, 11/30 | Genome annotation and the interpretation of gene lists Gene finding and functional annotation, Functional databases KEGG, InterPro, GO ontologies and functional enrichment |
19 | Tu, 12/05 | Guest lecture Student selected guest presentation with possible topics including: Metagenomics / Pharmacogenomics / Epigenomics / Personal genomics / Genome evolution / Genome editing and synthetic genomics / Social impacts and ethical implications of continuing genomic advances |
20 | Th, 12/07 | Course summary Summary of learning goals, Student course evaluation time and exam preparation |
TBD (Th, 12/12) | Final exam! |
Class material
1: Welcome to Foundations of Bioinformatics
Topics:
Course introduction, Learning goals & expectations, Biology is an information science, History of Bioinformatics, Types of data, Application areas and introduction to upcoming course segments, Student 30-second introductions, Student computer setup.
Goals:
- Understand course scope, expectations, logistics and ethics code.
- Understand the increasing necessity for computation in modern life sciences research.
- Get introduced to how bioinformatics is practiced.
- Complete the pre-course questionnaire.
- Set up your laptop computer for this course.
Material:
- Pre class screen casts (also see below):
- SC1: Welcome to BGGN-213,
- SC2: What is Bioinformatics? and
- SC3: How do we do Bioinformatics?.
- Lecture Slides: Large PDF, Small PDF,
- Handout: Class Syllabus
- Computer Setup Instructions.
Homework:
- Questions,
- Readings:
- PDF1: What is bioinformatics? An introduction and overview,
- PDF2: Advancements and Challenges in Computational Biology,
- Other: For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights New York Times, 2014.
Screen Casts:
2: Bioinformatics databases and key online resources
Topics:
NCBI & EBI resources for the molecular domain of bioinformatics, Focus on GenBank, UniProt, Entrez and Gene Ontology. Hands on with BLAST, GenBank, OMIM, GENE, UniProt, Muscle, PFAM and PDB bioinformatics tools and databases. There are many bioinformatics databases (see handout) and being able to judge their utility and quality is important.
Goals:
- Be able to query, search, compare and contrast the data contained in major bioinformatics databases (GenBank, GENE, UniProt, PFAM, OMIM, PDB) and describe how these databases intersect.
- Be able to describe how nucleotide and protein sequence and structure data are represented (FASTA, FASTQ, GenBank, UniProt, PDB).
- Be familiar with online tools at the EBI and NCBI including Muscle and BLAST.
- The goal of the hands-on session is to introduce a range of core bioinformatics databases and associated online services whilst actively investigating the molecular basis of several common human diseases.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Handout: Major Bioinformatics Databases
- Hands-on section worksheet
- Muddy point assessment
Homework:
- Readings: PDF1: What is dynamic programming?,
- Readings: PDF2: Fundamentals of database searching.
3: Alignment fundamentals, algorithms and applications
Topics:
Sequence Alignment and Database Searching Homology, Sequence similarity, Local and global alignment, Heuristic approaches, Database searching with BLAST, E-values and evaluating alignment scores and statistics.
Goal:
- Be able to describe how dynamic programming works for pairwise sequence alignment
- Appreciate the differences between global and local alignment along with their major application areas.
- Understand how aligning novel sequences with previously characterized genes or proteins provides important insights into their common attributes and evolutionary origins.
- The goals of the hands-on session are to explore the principles underlying the computational tools that can be used to compute and evaluate sequence alignments.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Hands-on section worksheet
- Muddy point assessment
Homework:
4: Advanced Database Searching
Topics: Database searching beyond BLAST, Using PSI-BLAST, Profiles and HMMs, Protein structure comparisons, Beginning with command line based database searches.
Goal:
- Be able to calculate the alignment score between two nucleotide or protein sequences using a provided scoring matrix
- Understand the limits of homology detection with tools such as BLAST
- Be able to perform PSI-BLAST, HMMER and protein structure based database searches and interpret the results in terms of the biological significance of an e-value.
- Run our first bioinformatics tool from the command line.
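As a minimal sketch of the first goal above, the score of an existing pairwise alignment can be computed by summing a score for each aligned column. The two aligned sequences and the scoring scheme used here (+1 match, -1 mismatch, -2 per gap position) are assumptions chosen for illustration. They are not the scoring matrix provided in class, and a real substitution matrix such as BLOSUM62 would replace the simple match/mismatch test with a table lookup.

```shell
#!/bin/sh
# Two sequences that are already aligned ("-" marks a gap). Made-up example data.
aligned_a="GATTACA"
aligned_b="GAT-ACC"

# Walk the alignment column by column and add up the per-column scores.
# Assumed scheme: match = +1, mismatch = -1, gap = -2.
score=$(awk -v a="$aligned_a" -v b="$aligned_b" \
            -v match_s=1 -v mismatch_s=-1 -v gap_s=-2 'BEGIN {
  total = 0
  for (i = 1; i <= length(a); i++) {
    x = substr(a, i, 1)
    y = substr(b, i, 1)
    if (x == "-" || y == "-") total += gap_s        # gap in either sequence
    else if (x == y)          total += match_s      # identical residues
    else                      total += mismatch_s   # substitution
  }
  print total
}')

echo "alignment score: $score"
```

Here five matches, one mismatch and one gap give 5 - 1 - 2 = 2.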
Material:
- Lecture Slides: Large PDF, Small PDF,
- Hands-on section worksheet
- Muddy point assessment
Homework:
- Questions and alignment problem from Lecture 3 above are due before the next class.
5: Introduction to UNIX for bioinformatics
Topics: Why do we use UNIX for bioinformatics? UNIX philosophy, 21 Key commands, Understanding processes, File system structure, Connecting to remote servers, Starting up and managing a Jetstream service virtual machine instance.
Goal:
- Understand why we use UNIX for bioinformatics
- Use UNIX command-line tools for file system navigation and text file manipulation.
- Have a familiarity with 21 key UNIX commands that we will use ~90% of the time.
- Be able to connect to remote servers from the command line.
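The file system navigation and text file manipulation goals above can be sketched with a handful of common commands. This is only an illustration: the directory names, the file `genes.tsv` and its contents are invented, and everything happens in a temporary scratch directory.

```shell
#!/bin/sh
# Create a scratch directory and move into it.
scratch=$(mktemp -d)
cd "$scratch"
pwd                                   # print the current working directory

# Make nested directories and a small tab-separated text file (made-up data).
mkdir -p project/data
printf 'gene\tlength\ngeneA\t100\ngeneB\t250\n' > project/data/genes.tsv

ls project/data                       # list the contents of a directory
head -n 1 project/data/genes.tsv      # show the first line (the header)

# Count lines, then pull the first column of the first data row.
n_lines=$(wc -l < project/data/genes.tsv | tr -d ' ')
first_gene=$(tail -n +2 project/data/genes.tsv | head -n 1 | cut -f 1)
echo "lines: $n_lines, first gene: $first_gene"

# Copy the file to a new name one directory up.
cp project/data/genes.tsv project/genes_backup.tsv
```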
Material:
- Pre class screen cast,
- Lecture Slides: Large PDF, Small PDF,
- Hands-on sections taken from https://swcarpentry.github.io/shell-novice/,
- Unix Reference Commands and Glossary,
- Example data to download and explore: bggn213_01_unix.zip. Please download and move it to your Desktop and unzip.
- Starting and connecting to a Jetstream virtual machine.
- Muddy point assessment
Homework:
- Complete Software Carpentry UNIX lesson sections 5 and 6.
- Read: A Quick Guide to Organizing Computational Biology Projects (Noble 2009).
- Optional: Introduction to Bash Shell Scripting.
6: Working with Unix
Topics: Bioinformatics on the command line, Redirection, streams and pipes, Workflows for batch processing, Shell scripting, Organizing computational projects.
Goal:
- Use existing programs at the UNIX command line to analyze bioinformatics data,
- Understand IO Redirection, Streams and pipes,
- Think in terms of modular workflows for batch processing,
- Understand best practices for organizing computational projects.
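The redirection, streams and pipes goal above can be sketched as follows. The FASTA file and its three sequences are made up for illustration, and the `ls` call is pointed at a file that does not exist so that something is written to standard error.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing in your own project is touched.
workdir=$(mktemp -d)
cd "$workdir"

# ">" redirects standard output into a file (a tiny, made-up FASTA file).
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > example.fa

# "|" is a pipe: grep writes the header lines to its standard output,
# and wc reads them on its standard input and counts them.
n_seqs=$(grep '^>' example.fa | wc -l | tr -d ' ')
echo "sequences: $n_seqs"

# Standard error is a separate stream; "2>" sends it to its own file.
# ls fails on purpose here, so listing.txt stays empty and errors.log does not.
ls missing_file.txt > listing.txt 2> errors.log || true
```

Keeping each step small and joined by pipes is what makes the same commands easy to reuse in a batch workflow or shell script.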
Material:
- Lecture Slides: Large PDF, Small PDF,
- Hands-on section worksheet
- Muddy point assessment.
Homework:
- Questions,
- List an unexpected feature of a command of your choice, that is, a feature you would not have expected from reading about the command.
- The file SGD_features.tab contains the annotations for genomic features of the yeast genome. The feature type is stored in the second column.
- Create a file that counts how many times each type occurs.
- What command would show the top ten most common features?
- What command would show the least common features?
- Readings:
- A Quick Guide to Organizing Computational Biology Projects, Plos Comp Bio, 2009
- Blog posts on Designing projects and a research workflow based on github.
7: Bioinformatics data analysis with R
Topics: R language basics and the RStudio IDE, Major R data structures and functions, Using R for data exploration and visualization. R scripts and R Markdown.
Goal:
- Familiarity with R’s basic syntax,
- Be able to use R to read and parse comma-separated (.csv) formatted files ready for subsequent analysis,
- Familiarity with major R data structures (vectors, matrices and data.frames),
- Understand the basics of using functions (arguments, vectorization and recycling).
Material:
- Lecture Slides: Large PDF, Small PDF,
- Hands-on section 1,
- Muddy point assessment.
Homework:
8: Data exploration and visualization in R
Topics: The exploratory data analysis mindset, Data visualization best practices, Simple base graphics (including scatterplots, histograms, bar graphs, dot charts, boxplots and heatmaps), Building more complex charts with ggplot.
Goal:
- Appreciate the major elements of exploratory data analysis and why it is important to visualize data.
- Be conversant with data visualization best practices and understand how good visualizations optimize for the human visual system.
- Be able to generate informative graphical displays including scatterplots, histograms, bar graphs, boxplots, dendrograms and heatmaps and thereby gain exposure to the extensive graphical capabilities of R.
- Appreciate that you can build even more complex charts with ggplot and additional R packages such as rgl.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Rmarkdown documents for plot session 1, and more advanced plots,
- Hands-on section worksheet,
- Example data for hands-on sections bggn213_08_rstats.zip,
- Muddy point assessment
Homework:
- This unit's homework is all via DataCamp (see lecture 7 above).
9: Why, When and How of Writing Your Own R Functions
Topics: Import data in various formats both local and from online sources, The basics of writing your own functions that promote code robustness, reduce duplication and facilitate code re-use.
Goals:
- Be able to import data in various flat file formats from both local and online sources.
- Understand the structure and syntax of R functions and how to view the code of any R function.
- Understand when you should be writing functions.
- Be able to follow a step by step process of going from a working code snippet to a more robust function.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Hands-on section worksheet,
- Flat files for importing with read.table: test1.txt, test2.txt, test3.txt.
- Muddy point assessment
Homework:
- This unit's homework is all via DataCamp (see lecture 7 above).
10: Using CRAN and Bioconductor Packages for Bioinformatics
Topics: More on how to write R functions with worked examples. Further extending functionality and utility with R packages, Obtaining R packages from CRAN and Bioconductor, Working with Bio3D for molecular data, Managing genome-scale data with bioconductor.
Goals:
- Be able to find and install R packages from CRAN and bioconductor,
- Understand how to find and use package vignettes, demos, documentation, tutorials and source code repository where available.
- Be able to write and (re)use basic R scripts to aid with reproducibility.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Collaborative Google Doc based notes on selected R packages,
- Introductory tutorial on R packages,
- Muddy point assessment.
Homework:
- Complete question 6 from the lecture 9 worksheet. This entails turning a supplied code snippet into a more robust and re-usable function that will take any of the three listed input proteins and plot the effect of drug binding. Note assessment rubric within document. (Submission deadline: 9am Th, 11/09).
11: Structural Bioinformatics
Topics: Protein structure function relationships, Protein structure and visualization resources, Modeling energy as a function of structure, Homology modeling, Predicting functional dynamics, Inferring protein function from structure.
Goal:
- View and interpret the structural models in the PDB,
- Understand the classic Sequence>Structure>Function via energetics and dynamics paradigm,
- Appreciate the role of bioinformatics in mapping the ENERGY LANDSCAPE of biomolecules,
- Be able to use the Bio3D package for exploratory analysis of protein sequence-structure-function-dynamics relationships.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Hands-on section worksheet,
- VMD software download link,
- Muddy point assessment.
12: Bioinformatics in drug discovery and design
Topics: The traditional path to drug discovery; High throughput screening approaches; Computational receptor/target-based bioinformatics approaches; Computational ligand/drug-based bioinformatics approaches; Small molecule docking methods; Prediction and analysis of biomolecular motion, conformational variants and functional dynamics; Molecular simulation and drug optimization.
Goals:
- Appreciate how bioinformatics can predict functional dynamics & aid drug discovery,
- Be able to use Bio3D and R for the analysis and prediction of protein flexibility,
- Be able to perform in silico docking and virtual screening strategies for drug discovery,
- Understand the increasing role of bioinformatics in the drug discovery process.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Hands-on section worksheet,
- MGLTools software download link,
- VMD software download link,
- Muddy point assessment.
13: Project Assignment Introduction!
The find-a-gene project is a required assignment for BGGN-213. The objective with this assignment is for you to demonstrate your grasp of database searching, sequence analysis, structure analysis and the R environment that we have covered to date in class.
You may wish to consult the scoring rubric at the end of the above linked project description and the example report for format and content guidance.
Your responses to questions Q1-Q4 are due at the beginning of class Thursday November 16th (11/16/17).
The complete assignment, including responses to all questions, is due at the beginning of class Tuesday December 5th (12/5/17).
Late responses will not be accepted under any circumstances.
Hands-on with Git:
Today’s lecture and hands-on sessions will introduce Git, currently the most popular version control system. We will learn how to perform common operations with Git that you’ll do every day. We will also cover the popular social code-hosting platforms GitHub and BitBucket.
- Lecture Slides: Large PDF, Small PDF,
- Hands-on section worksheet 1,
- Optional: Hands-on section worksheet 2.
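A minimal sketch of the everyday Git operations mentioned above, assuming `git` is installed. The repository lives in a temporary directory, and the file name, commit messages and user identity are placeholders. Publishing to GitHub or BitBucket is not shown because it needs a remote repository of your own.

```shell
#!/bin/sh
# Start a brand new repository in a scratch directory.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Identity for this repository only (placeholder values).
git config user.name "Example Student"
git config user.email "student@example.com"

# Create a file, check what Git sees, then stage and commit it.
printf 'First draft of my analysis notes\n' > notes.txt
git status --short                 # notes.txt is listed as untracked (??)
git add notes.txt
git commit -q -m "Add analysis notes"

# Edit the file, review the change, then record it as a second commit.
printf 'Second line of notes\n' >> notes.txt
git diff                           # show what changed since the last commit
git add notes.txt
git commit -q -m "Extend analysis notes"

# Review the history.
git log --oneline
n_commits=$(git rev-list --count HEAD)
echo "commits so far: $n_commits"
```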
14: Genome informatics and high throughput sequencing
Topics: Searching genes and gene functions, Genome databases, Variation in the genome, Sequencing technologies past, present and future (Sanger, Shotgun, PacBio, Illumina, toward the $500 human genome), Biological applications of sequencing, RNA-Sequencing for gene expression analysis, Bioinformatics analysis methods
Goals:
- Appreciate and describe in general terms the rapid advances in sequencing technologies and the new areas of investigation that these advances have made accessible.
- Understand the process by which genomes are currently sequenced and the bioinformatics processing and analysis required for their interpretation.
- Be able to launch your own cloud based Galaxy server for NGS analysis.
- Be able to navigate the Galaxy platform, input NGS sequence data and access common NGS tools for sequence analysis.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Hands-on section worksheet,
- RNA-Seq data files: HG00109_1.fastq, HG00109_2.fastq, genes.chr17.gtf, Expression genotype results, Example R script.
- Hands-on section Solutions.pdf
- Muddy point assessment.
15: Major bioinformatics resources for genomics.
Topics: Databases, tools and visualization resources from NCBI, EBI & UCSC, The Galaxy platform for quality control and analysis; FASTQ, SAM and BAM file formats; Sample Galaxy workflow with FastQC and Bowtie2
Goals:
- For a genomic region of interest (e.g. the neighborhood of a particular SNP), use a genome browser to view nearby genes, transcription factor binding regions, epigenetic information, etc.
- Understand the FASTQ file format and the information it holds.
- Understand the SAM/BAM file format and the information it holds.
- Be able to launch your own cloud based Galaxy server for NGS analysis.
- Be able to use the Galaxy platform for basic RNA-Seq analysis from raw reads to expression value determination.
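To make the FASTQ goal above concrete, here is a small sketch using two made-up reads. Each FASTQ record conventionally spans four lines: an identifier line starting with `@`, the sequence, a separator line starting with `+`, and a quality string with one character per base. The sketch assumes this unwrapped four-line layout, which is the common case but is not guaranteed for every file.

```shell
#!/bin/sh
# Write a tiny, made-up FASTQ file with two reads.
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIHHHH
@read2
GGCCTTAA
+
FFFFFFFF
EOF

# With four lines per record, the number of reads is the line count / 4.
n_lines=$(wc -l < "$fastq" | tr -d ' ')
n_reads=$((n_lines / 4))
echo "reads: $n_reads"

# Line 2 of every record holds the sequence itself.
awk 'NR % 4 == 2' "$fastq"

# Mean read length, computed from the sequence lines only.
mean_len=$(awk 'NR % 4 == 2 { total += length($0); n++ } END { print total / n }' "$fastq")
echo "mean read length: $mean_len"
```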
Material:
- Lecture Slides: Large PDF, Small PDF,
- Hands-on section worksheet,
- RNA-Seq data files: HG00109_1.fastq, HG00109_2.fastq, genes.chr17.gtf, Expression genotype results, Example R script.
- SAM/BAM file format description.
- Muddy point assessment.
16: Immunoinformatics
Topics: Immunoinformatics resources for the understanding of immunological information. A case study in personalized cancer immunotherapy.
Guest lecture from Dr. Bjoern Peters (LIAI) with topics including: Epitope prediction, Reverse vaccinology, Immune system modeling, Disease diagnosis and therapy along with implications for the development of personalized medicine.
Material:
- Lecture Slides: Large PDF, Small PDF,
- Hands-on section tasks,
- Data files: lecture16_sequences.fa, Example mutant identification and subsequence extraction with R walk through.
- Patient HLA typing results:
HLA-A*02:01 HLA-A*68:01 HLA-B*07:02 HLA-B*35:01
- Results: subsequences.fa, Solutions.pdf
- IEDB HLA binding prediction website http://tools.iedb.org/mhci/.
17: Transcriptomics and the analysis of RNA-Seq data
Topics: Analysis of RNA-Seq data with R, Differential expression tests, RNA-Seq statistics, Counts and FPKMs, Normalizing for sequencing depth, DESeq2 analysis.
Goals:
- Given an RNA-Seq dataset, find the set of significantly differentially expressed genes and their annotations.
- Gain competency with data import, processing and analysis with DESeq2 and other bioconductor packages
- Understand the structure of count data and metadata required for running analysis
- Be able to extract, explore, visualize and export results
Material:
- Lecture Slides: PDF.
- Detailed Bioconductor setup instructions.
- Hands-on section worksheet
- Data files: airway_scaledcounts.csv, airway_metadata.csv, annotables_grch38.csv.
- Muddy point assessment
Readings:
- Excellent review article: Conesa et al. A survey of best practices for RNA-seq data analysis. Genome Biology 17:13 (2016).
- An oldie but a goodie: Soneson et al. “Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.” F1000Research 4 (2015).
- Abstract and introduction sections of: Himes et al. “RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells.” PLoS ONE 9.6 (2014): e99625.
18: Genome annotation and the interpretation of gene lists
Topics: Gene finding and functional annotation, Functional databases KEGG, InterPro, GO ontologies and functional enrichment
Goals: Perform a GO analysis to identify the pathways relevant to a set of genes (e.g. identified by transcriptomic study or a proteomic experiment). Use both Bioconductor packages and online tools to interpret gene lists and annotate potential gene functions.
Material:
- Lecture Slides: Large PDF.
- Hands-on section worksheet
- Data files: GSE37704_featurecounts.csv, GSE37704_metadata.csv.
Readings:
- Good review article: Trapnell C, Hendrickson DG, Sauvageau M, Goff L et al. “Differential analysis of gene regulation at transcript resolution with RNA-seq”. Nat Biotechnol 2013 Jan;31(1):46-53. PMID: 23222703.
19: Guest lecture
Topics: Student selected industry based genomic scientist presentation with possible topics including: Metagenomics / Pharmacogenomics / Epigenomics / Personal genomics / Genome evolution / Genome editing and synthetic genomics / Social impacts and ethical implications of continuing genomic advances
Goals: Understand the challenges in integrating and interpreting large heterogeneous high throughput data sets into their functional context.
20: Foundational statistics for bioinformatics
Topics: Data summary statistics; Inferential statistics; Significance testing; Two sample T-test in R; Power analysis in R; Chi-square test in R; Multiple testing correction; and almost everything you wanted to know about Principal Component Analysis (PCA) but were afraid to ask!
Material: