# Module 2. Introduction to Statistics in Bioinformatics

Basic statistics as used in bioinformatics, especially standard statistical tests of significance and when they apply. Applications to genetics, experimental and observational medical data, as well as exploration of multiple testing issues that arise in bioinformatics and other experimental settings.

**N.B.** Please complete this pre-course **questionnaire** if you have not already done so.

#### Lecture (2-1): **Framework for statistical analysis of biomedical data**

**Time**: Feb 9 (Tuesday), 2:30 - 4:00 PM

**Topics**: Probability distributions, quantifying central values and variability, quantifying association, graphical displays of data

**Material**:

Lecture slides (PDF)

We will use R throughout this module to demonstrate data analysis concepts and best practices. To prepare for the first lab session, please complete the free interactive online tutorial “TryR” (http://tryr.codeschool.com). It gives a gentle introduction to R syntax and to some of the major R data structures (vectors, matrices, and data.frames).
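As a quick preview of the three data structures named above, here is a minimal sketch in R. The variable names and values (`systolic`, `patients`, and so on) are made up for illustration and are not taken from the course datasets.

```r
# A vector: an ordered set of values that all share one type.
# (Hypothetical blood pressure readings, for illustration only.)
systolic <- c(128, 135, 142, 131)

# A matrix: a two-dimensional array of one type.
# By default R fills it column by column.
m <- matrix(1:6, nrow = 2, ncol = 3)

# A data.frame: a table whose columns may have different types.
patients <- data.frame(
  id = c("p1", "p2", "p3", "p4"),
  systolic = systolic,
  treated = c(TRUE, FALSE, TRUE, FALSE)
)

mean(systolic)                 # average of the vector
dim(m)                         # number of rows and columns
str(patients)                  # column names and types
patients[patients$treated, ]   # keep only rows where treated is TRUE
```

Running these lines in the R console one at a time is a reasonable way to check that R is installed and working before the lab.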

#### Lab (2-1): **Descriptive statistics and summarizing data**

**Time**: Feb 11 (Thursday), 2:30 - 4:00 PM, or Feb 12 (Friday), 10:30 AM - 12:00 PM

**Topics**: Introduction to R, probability distributions, quantifying central values and variability, quantifying association, graphical displays of data

**Material**:

PDF slides: Introduction to R, Video

Lab worksheet

Lab worksheet with key

Dataset TROPHY.csv

Readings: Feasibility of Treating Prehypertension with an Angiotensin-Receptor Blocker (TROPHY, S. Julius, 2006), R Data Types

Muddy point assessment

**Homework**:

Homework Assignment 1

Homework Assignment 1 with key

#### Lecture (2-2): **Approaches to statistical estimation and testing**

**Time**: Feb 16 (Tuesday), 2:30 - 4:00 PM

**Topics**: Estimation and standard errors; standard errors for means, correlations, and log odds ratios; formal hypothesis testing; tests involving means, correlations, and log odds ratios; power

**Material**:

Lecture slides (PDF)

#### Lab (2-2): **Statistical estimation and hypothesis testing**

**Time**: Feb 18 (Thursday), 2:30 - 4:00 PM, or Feb 19 (Friday), 10:30 AM - 12:00 PM

**Topics**: Estimation and standard errors; standard errors for means, correlations, and log odds ratios; formal hypothesis testing; one- and two-sample tests involving means; power

**Material**:

Lab worksheet

Lab worksheet with key

Dataset TROPHY.csv

Muddy point assessment

**Homework**:

Homework Assignment 2

Homework Assignment 2 with key

#### Lecture (2-3): **Analyses involving associations**

**Time**: Feb 23 (Tuesday), 2:30 - 4:00 PM

**Topics**: Pearson correlation, t-test, odds ratios, discussion of a research article

**Material**:

Lecture slides (PDF)

#### Lab (2-3): **Pearson correlation, t-test, and log odds ratios**

**Time**: Feb 25 (Thursday), 2:30 - 4:00 PM, or Feb 26 (Friday), 10:30 AM - 12:00 PM

**Topics**: Tests based on Pearson correlation, t-test, and log odds ratios

**Material**:

Lab worksheet

Lab worksheet with key

Muddy point assessment

**Homework**:

Homework Assignment 3

Homework Assignment 3 with key

#### Lecture (2-4): **Linear regression**

**Time**: Mar 8 (Tuesday), 2:30 - 4:00 PM

**Topics**: Single and multiple variable linear regression, Bonferroni correction, power for regression analysis

**Material**:

Lecture slides (PDF)

#### Lab (2-4): **Regression models**

**Time**: Mar 10 (Thursday), 2:30 - 4:00 PM, or Mar 11 (Friday), 10:30 AM - 12:00 PM

**Topics**: Fitting regression models for prediction and effect estimation, inference for regression effects, R^2, diagnostics, comparing models

**Material**:

Lab worksheet

Lab worksheet with key

Muddy point assessment

**Homework**:

Homework Assignment 4

Homework Assignment 4 with key

#### Lecture (2-5): **Introduction to graphical methods for multivariate data analysis**

**Time**: Mar 15 (Tuesday), 2:30 - 4:00 PM

**Topics**: Clustering methods, multidimensional scaling, and principal component analysis

**Material**:

Lecture slides (PDF)

#### Lab (2-5): **Clustering and principal component analysis**

**Time**: Mar 17 (Thursday), 2:30 - 4:00 PM, or Mar 18 (Friday), 10:30 AM - 12:00 PM

**Topics**: Multivariate data, heat maps and dendrograms, clustering methods, principal component analysis

**Material**:

Lab worksheet with key

Muddy point assessment

### Reference material

RStudio cheatsheet: A well-designed reference card for RStudio features.

Try R: An excellent interactive online R tutorial.