Barry Grant < http://thegrantlab.org/bimm143/ >
2021-10-11 (21:45:53 on Mon, Oct 11)
The ability to make clear and compelling data visualizations is a vital skill for scientists. The difference between good and bad figures can be the difference between a highly influential and an obscure paper, a grant or contract won or lost, a job interview gone well or poorly. In short, if you want to be successful in any technical field, you will need to master these skills!
The goals for this hands-on session are for you to:
One of the biggest attractions of the R programming language is the ability to have complete programmatic control over the plotting of complex graphs and figures. R offers a ridiculously large set of tools and packages for data visualization. The core R language already provides a rich set of plotting functions and plot types. These plotting functions require users to specify how to plot each element on the canvas step by step. These “base R plots” offer complete control over virtually every pixel. However, they can be fiddly and time-consuming to get just the way you want. By contrast, the ggplot2 package allows the specification of all plots through a set of common plotting layers that minimally includes the data, the aesthetic mappings (aes) from variables in the data to visual properties, and the geometries (geoms) used to draw them.
Data visualization with ggplot always involves these steps. Once you have mastered this sequence of steps we will layer on additional customizations and you will see that beautiful and sophisticated plots come within your reach very quickly. Let’s get cracking!
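As a minimal sketch of that sequence, the example below builds a scatter plot from three layers: the data, an aesthetic mapping made with aes(), and a geometry. It assumes ggplot2 is already installed and uses R's built-in cars dataset purely for illustration; the object name p is an arbitrary choice.

```r
library(ggplot2)

# Layer 1: the data (a built-in data frame with columns speed and dist)
# Layer 2: the aesthetic mapping from columns to the x and y positions
# Layer 3: the geometry, here points for a scatter plot
p <- ggplot(data = cars) +
  aes(x = speed, y = dist) +
  geom_point()

# Printing the plot object draws it in the Plots panel
print(p)
```

Further customizations, such as labels or themes, would be added to p with the same + operator.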
Side note:
If you have not already, please watch this week’s intro videos, which cover why we want to visualize data graphically, what makes an effective figure, and an introduction to ggplot. You may also wish to (re)visit our previous introduction to R and RStudio as well as our video on major R data structures, data types, and using functions.
Q. For which phases is data visualization important in our scientific workflows?
Q. True or False? The ggplot2 package comes already installed with R?
Data visualization is important for all phases of our scientific workflows from exploratory data analysis (EDA), quality control and the detection of outliers, through formal analysis and the communication of results.
The ggplot2 package does not come pre-installed with R. Before you use it for the first time you will need to install it with the install.packages("ggplot2") command in the R console panel of RStudio.
We will see in next week’s class (and throughout the rest of the course) that lots of useful functionality comes in the form of add-on R packages.
You can think of “base R” like your smartphone’s OS when you first take it out of the box and “R packages” like apps that you can optionally install to allow you to do more cool things.
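To continue the analogy, an app is installed once but opened every time you want to use it. The sketch below shows one common pattern, offered here as a suggestion rather than a course requirement: install ggplot2 only if it is missing, then load it with library() at the start of each R session.

```r
# Install ggplot2 only if it is not already available on this machine
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package; this is needed in every new R session
library(ggplot2)
```

Installation needs an internet connection and only has to happen once per machine, whereas the library(ggplot2) call belongs at the top of any script that uses ggplot2 functions.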
We will begin by getting organized. This entails you opening up RStudio and creating a new RStudio Project, then creating a new R script for storing your work and notes for this session.
Side-note: If you are already familiar with RMarkdown format documents, feel free to use one of these rather than an R script. If you have not yet heard of these, don’t worry, we will be building towards them in our next class.
File > New Project > New Directory > New Project
Make sure you are working in the directory (a.k.a. folder!) where you want to keep all your work for this class organized. For example, for me this is a directory on my Desktop with the class name (see animated figure below). We will create our project as a subdirectory called class05 in this location.
Side-note: The key step here is to name your project after this class session (i.e. “class05”) and make sure it is a subdirectory of wherever you are organizing all your work for this course.
Finally, open a new R script: File > New File > R Script and save as class05.R. This is nothing more than a text file where we can write and save our R code. The big advantage is that we will have a record of our work and thus be able to reproduce and automate our analysis later. We will also turn this into a fancy HTML and PDF report for sharing with others (more on this later…)
There are many plot types available that can help you understand different features and relationships in your data.
During the exploratory data analysis phase we typically want to detect the most obvious patterns by looking at each variable in isolation or by examining the relationships between variables. The appropriate plot type is also determined by the data type of the input variables, such as continuous numeric or discrete categorical.
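To make the link between data type and plot type concrete, here is a small sketch using R's built-in mtcars dataset. It assumes ggplot2 is installed. The dataset, the object names, and the particular geoms are illustrative choices, not the only reasonable ones.

```r
library(ggplot2)

# One continuous variable in isolation: a histogram shows its distribution
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# One discrete variable in isolation: a bar chart counts rows per category
# (cyl is stored as a number, so factor() treats it as categorical)
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Two continuous variables: a scatter plot shows their relationship
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# One discrete and one continuous variable: a boxplot compares groups
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()
```

Typing the name of any of these objects at the console (for example p_scatter) draws that plot.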