Barry Grant < http://thegrantlab.org/bimm143/ >
2019-09-10 (13:21:15 on Tue, Sep 10)

Overview

One of the biggest attractions of the R programming language is the ability to have complete programmatic control over the plotting of complex graphs and figures. In this session we will introduce base R functions for generating simple plots and explore the generation of more complex multi-layered figures with custom positioning, layout, fonts and colors.

Sections 1 to 3 are exercises designed to test and expand your familiarity with R base plotting. Section 4 feature a detailed guide to plotting with base R and will be useful for answering questions throughout your subsequent projects. My objective with the last section is to provide you with everything you would need to create plots for almost any kind of data you may encounter in the future.

Side-note: While I have attempted to cover all of the major graph types you are likely to use as an R newcomer these are obviously not the only way to do plotting in R. There are many other R graphing packages and libraries, which can be imported and each of these has advantages and disadvantages. The most common, and ridiculously popular, external library is ggplot2, – we will introduce this add on package from CRAN next day.

Section 1: Getting organized

1A. Creating a Project

1A. We will begin by getting organized with a new RStudio Project specifically for todays class. For now think of this as a folder on your computer where we will store all our work (documented in a new R script), input (the data we will analyze) and output (the graphs, figures and reports we will generate).

Begin by opening RStudio and creating a new Project File > New Project > New Directory > New Project make sure you are working in the directory (a.k.a. folder!) where you want to keep all your work for this course organized. For example, for me this is a directory on my Desktop with the course name (see animated figure below). We will create our project as a sub-directory called class05 in this location.

The key step here is to name your project after this class session (i.e. “class05”) and make sure it is a sub-directory of where ever you are organizing all your work for this course. N.B. Staying organized like this will be very helpful later in the course when we come to make an on-line portfolio of your work.

1B. Getting data to plot

1B. Your next task is to download the ZIP file containing all the input data for this lab to your computer and move it to your RStudio project folder.

Download the ZIP file
Note that on most computers this will typically unzip itself after downloading to create a new directory/folder named bimm143_05_rstats. Check your Downloads directory/folder. However, some computers may require you to double click on the ZIP file to begin the unzipping process.
Once unzipped move the resulting bimm143_05_rstats folder into your R project directory. You can use your Finder window (on Mac) or File Explorer (on Windows) to do this.

1C. Create an R script

1C. Finally, open a new R script: File > New File > R Script and save as class05.R. This is nothing more than a text file where we can write and save our R code. The big advantage is that we will have a record of our work and thus be able to reproduce and automate our analysis later. We will also turn this into a fancy HTML and PDF report for sharing with others (more on this later…).

Section 2: Customizing plots

2A. Line plot

2A. Scatterplots represent the most common visualization when we want to show one quantitative variable relative to another. It is often helpful to add a line to indicate the relationship between x and y or when x represents a strictly increasing quantity such as time or drug dose etc.

These types of plots are produced with the plot() function. This function takes x (and optionally) y as input along with a vast number of additional optional input arguments that customize the the resulting plot. Here we will explore some of the most common customizations.

The file weight_chart.txt from the example data you downloaded above contains data for a growth of a typical baby over the first 9 months of life. Your first task is to get this data into R:

Find the file weight_chart.txt in your RStudio Files panel. This should typically be on the right lower side of your RStudio window.
Click on the weight_chart.txt file to get a preview of its contents in RStudio and answer the following questions.
How are the records (i.e. columns) in the file separated? Is there a comma, space, tab or other character between the data entries?
True or False? The file has a “Header” line that contains the names of the variables (this is often the first line of a file that gives information on each column of data)?

Now we know something about the file format we can decide on the R function to use for data reading (a.k.a. “importing”) into R.

Look at the help page for the read.table() function and answer the following questions:

What is a good read function to use for importing the weight_chart.txt file?
What should you set the argument header equal to in this function for this input file?

Make sure to examine the object you create and store and note how changing between header=FALSE and header=TRUE changes your result.

Add your working code for reading the file to your R script and make sure to save your file!.

The trickiest part here, after deciding on the function to use and the sep and header arguments, is spelling out exactly where the input file is on your computer.

weight <- read.table("bimm143_05_rstats/weight_chart.txt", header=TRUE)

Note that in the code above I needed to explicitly tell the read.table() function that the weight_chart.txt file resides in the folder bimm143_05_rstats/ by giving it’s directory name and file name together as "bimm143_05_rstats/weight_chart.txt".

Note that the bimm143_05_rstats/ folder (a.k.a. directory) name came from our downloaded zip archive and the forward slash character / at the end is used to separate directory and file names in the UNIX world. We will learn much more about the UNIX file system and how to use it productively in a future class all about UNIX.

Now we will use the plot() function to plot this data as a point and line graph with various customizations. The intention here is for you to experiment and try changing and adding additional inoput arguments to the plot() function whilst examining the results at each step:

Change the basic scatterplot produced by plot(weight$Age, weight$Weight) to a line plot. Which additional type= argument yielded the result you were after?
What argument changes the point character to be a filled square?
What argument changes the plot point size to be 1.5x normal size?
What argument changes the line width thickness to be twice the default size?
What argument changes the y-axis limits to scale between 2 and 10kg?
What argument changes the x-axis label to be Age (months)
What argument changes the y-axis label to be Weight (kg)?
What argument add a suitable title to the top of the plot

Each extra input argument to an R function must be separated by a comma. For example:

plot(weight$Age, weight$Weight, typ="o", pch=15)

Note how the x, y, typ, and pch are all comma separated. If you leave a comma out the code will give an error message.

Finally, in your R script write the complete code to generate the following plot and make sure you save your script!

plot(weight$Age, weight$Weight, typ="o", 
     pch=15, cex=1.5, lwd=2, ylim=c(2,10), 
     xlab="Age (months)", ylab="Weight (kg)", 
     main="Baby weight with age", col="blue")

2B. Barplot

2B. The most common approach to visualizing amounts (i.e. numerical values shown for some set of categories) is using bars, either vertically or horizontally arranged. Here we will explore generating and customizing output of the barplot() function.

The file feature_counts.txt contains a summary of the number of features of different types in the mouse GRCm38 genome. Read this data into R using the same overall procedure you used in section A above (i.e. check the format of the file to decide on the read function and arguments to use).

What is the field separator of this file?
What should the header argument be set to for properly importing this file?

Read this data into an object called mouse and then produce the following barplot. Be sure to save your working code in your R script.

Note that the data you need to plot is contained within the Count column of the data frame, so you pass only that column as the data.

What should your code for the barplot() function call look like in this case?

mouse <- read.table("bimm143_05_rstats/feature_counts.txt", sep="\t", header=TRUE)

barplot(mouse$Count)

Once you have the basic plot above make the following changes:

The bars should be horizontal rather than vertical.
The feature names should be added to the y axis. (set names.arg to the Feature column of the data frame).
The plot should be given a suitable title.
The text labels should all be horizontal. Note that this parameter (along with many, many others) is documented in the par() function help page (i.e. see ?par). What parameter setting should you use here?
The margins should be adjusted to accommodate the labels (par mar parameter). You need to supply a 4 element vector for the bottom, left, top and right margin values.

Look at the value of par()$mar to see what the default values are so you know where to start. Note that you will have to redraw the barplot after making the changes to par.

The parameter las=1 will set the axis labels horizontal to the axis (see ?par).

par(mar=c(3.1, 11.1, 4.1, 2))
barplot(mouse$Count, names.arg=mouse$Feature, 
        horiz=TRUE, ylab="", 
        main="Number of features in the mouse GRCm38 genome", 
        las=1, xlim=c(0,80000))

2C. Histograms

[Extension if you have time] We often want to understand how a particular variable is distributed in a dataset. Histograms and density plots provide the most intuitive visualizations of a given distribution.

Use this hist() function to plot out the distribution of 10000 points sampled from a standard normal distribution (use the rnorm() function) along with another 10000 points sampled from the same distribution but with an offset of 4.

Example: x <- c(rnorm(10000),rnorm(10000)+4)
Find a suitable number of breaks to make the plot look nicer (breaks=10 for example)

x <- c(rnorm(10000),rnorm(10000)+4)
hist(x, breaks=80)

Section 3: Using color in plots

3A. Providing color vectors

3A.There are three fundamental use cases for color in data visualizations: we can use color to distinguish groups of data from each other, to represent data values, and to highlight. In all cases with base R plotting functions we need to provide a vector of colors as input to the plotting function.

The file male_female_counts.txt contains a time series split into male and female count values.

Plot this as a barplot
Make all bars different colors using the rainbow() function.

Note that the rainbow() function takes a single argument, which is the number of colors to generate, eg rainbow(10). Try making the vector of colors separately before passing it as the col argument to barplot (col=rainbow(10)).

Rather than hard coding the number of colors, think how you could use the nrow() function to automatically generate the correct number of colors for the size of dataset.

mf <- read.delim("bimm143_05_rstats/male_female_counts.txt")

barplot(mf$Count, names.arg=mf$Sample, col=rainbow(nrow(mf)), 
        las=2, ylab="Counts")

Re-plot, and make the bars for the males a different color to those for the females.

In this case the male and female samples alternate so you can just pass a 2 color vector to the col parameter to achieve this effect. (e.g. col=c("blue2","red2")) for example.

barplot(mf$Count, names.arg=mf$Sample, col=c("blue2","red2"), 
        las=2, ylab="Counts")

3B. Coloring by value

3B. The file up_down_expression.txt contains an expression comparison dataset, but has an extra column that classifies the rows into one of 3 groups (up, down or unchanging). Here we aim to produce a scatterplot (plot) with the up being red, the down being blue and the unchanging being grey.

Import the up_down_expression.txt into an object called genes.

What read function would be the most straightforward to use for this input file?

genes <- read.delim("bimm143_05_rstats/up_down_expression.txt")

How many genes are detailed in this file (i.e. how many rows are there)?

You can use the nrow() function or the dim() function to find the number of rows:

nrow(genes)

## [1] 5196

To determine how many genes are up, down and unchanging in their expression values between the two conditions we can inspect the genes$State column. A useful function for this is the table() function, e.g. try table(genes$State).

How many genes are annotated as ‘up’ regulated in this dataset?

table(genes$State)

## 
##       down unchanging         up 
##         72       4997        127

For graphing start by just plotting the Condition1 column against the Condition2 column in a plot
Pass the State column as the col parameter (col=genes$State for example). This will set the color according to the state of each point, but the colors will be set automatically from the default output of palette().

plot(genes$Condition1, genes$Condition2, col=genes$State, 
     xlab="Expression condition 1", ylab="Expression condition 2")

Run palette() to see what colors are there initially and check that you can see how these relate to the colors you get in your plot.
Run levels() on the State column and match this with what you saw in palette() to see how each color was selected. Work out what colors you would need to put into palette to get the color selection you actually want.
Use the palette() function to set the corresponding colors you want to use (eg palette(c("red","green","blue")) – but using the correct colors in the correct order (up being red, the down being blue and the unchanging being grey).
Redraw the plot and check that the colors are now what you wanted.

palette(c("blue","gray","red"))
plot(genes$Condition1, genes$Condition2, col=genes$State, xlab="Expression condition 1", ylab="Expression condition 2")

3C. Dynamic use of color

The uses of color described above would be what you would use to do categorical coloring, but often we want to use color for a more quantitative purpose. In some graph types you can specify a color function and it will dynamically generate an appropriate color for each data point, but it can be useful to do this more manually in other plot types.

Coloring by point density

One common use of dynamic color is to color a scatterplot by the number of points overlaid in a particular area so that you can get a better impression for where the majority of points fall. R has a built in function densCols() which can take in the data for a scatterplot and will calculate an appropriate color vector to use to accurately represent the density. This has a built in color scheme which is just shades of blue, but you can pass in your own color generating function (such as one generated by colorRampPalette()) to get whatever coloring you prefer.

The file expression_methylation.txt contains data for gene body methylation, promoter methylation and gene expression.

Read this file into an R object called meth.

# Lets plot expresion vs gene regulation
meth <- read.delim("bimm143_05_rstats/expression_methylation.txt")

How many genes are in this dataset?

nrow(meth)

## [1] 9241

Draw a scatterplot (plot) of the gene.meth column against the expression column.

plot(meth$gene.meth, meth$expression)

This is a busy plot with lots of data points in top of one another. Let improve it a little by coloring by point density. Use the densCols() function to make a new color vector that you can use in a new plot along with solid plotting character (e.g. pch=20).

dcols <- densCols(meth$gene.meth, meth$expression)

# Plot changing the plot character ('pch') to a solid circle
plot(meth$gene.meth, meth$expression, col = dcols, pch = 20)

It looks like most of the data is clustered near the origin. Lets restrict ourselves to the genes that have more than zero expresion values

# Find the indices of genes with above 0 expresion
inds <- meth$expression > 0

# Plot just these genes
plot(meth$gene.meth[inds], meth$expression[inds])

## Make a desnisty color vector for these genes and plot
dcols <- densCols(meth$gene.meth[inds], meth$expression[inds])

plot(meth$gene.meth[inds], meth$expression[inds], col = dcols, pch = 20)

Change the colramp used by the densCols() function to go between blue, green, red and yellow with the colorRampPalette() function.

dcols.custom <- densCols(meth$gene.meth[inds], meth$expression[inds],
                         colramp = colorRampPalette(c("blue2",
                                                      "green2",
                                                      "red2",
                                                      "yellow")) )

plot(meth$gene.meth[inds], meth$expression[inds], 
     col = dcols.custom, pch = 20)

Section 4: Detailed guide

A detailed guide to plotting with base R is provided as a PDF suplement that you can use as a reference whenever you need to plot data with base R. From Page 7 onward it goes into more details and covers more plot types than we could in today’s hands-on session.

Muddy Point Assessment

Link to today’s muddy point assesment.

About this document

Here we use the sessionInfo() function to report on our R systems setup at the time of document execution.

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] labsheet_0.1.2
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.0     magrittr_1.5       tools_3.6.0       
##  [4] htmltools_0.3.6    curl_4.0           yaml_2.2.0        
##  [7] Rcpp_1.0.2         KernSmooth_2.23-15 stringi_1.4.3     
## [10] rmarkdown_1.15     knitr_1.24         jsonlite_1.6      
## [13] stringr_1.4.0      xfun_0.9           digest_0.6.20     
## [16] evaluate_0.14

BIMM-143, Lecture 5

Data Exploration and Visualization in R