Barry Grant < http://thegrantlab.org >
2025-10-21 15:23:50 PDT
In this hands-on session, we will examine some real life multivariate data in order to explain, in simple terms, what principal component analysis (PCA) achieves. This builds on our PCA introduction page. If you have not visited and fully explored that page yet then please do so now! Here we will perform a principal component analysis of several different data sets of increasing complexity and examine the results.
Before analyzing any dataset we need to get ourselves organized by following the steps below:
First open a new RStudio Project for today’s work called class07 (File > New Project > New Directory > New Project), making sure to save this along with the rest of your course work for this class.

Then open a new Quarto Document (File > New File > Quarto Document) for saving your code and accompanying narrative notes. We will use this to generate our lab report later (see section 3 below).
In your new Quarto document be sure to keep the YAML header but remove any boilerplate example text and code (i.e. delete from line 6 or so onward) so you have a clean document to work in and add your content to.
Suppose that we are examining the following data, from the UK’s ‘Department for Environment, Food and Rural Affairs’ (DEFRA), showing the consumption in grams (per person, per week) of 17 different types of foodstuff measured and averaged in the four countries of the United Kingdom in 1997.
| Food | England | Wales | Scotland | N.Ireland |
|---|---|---|---|---|
| Cheese | 105 | 103 | 103 | 66 |
| Carcass_meat | 245 | 227 | 242 | 267 |
| Other_meat | 685 | 803 | 750 | 586 |
| Fish | 147 | 160 | 122 | 93 |
| Fats_and_oils | 193 | 235 | 184 | 209 |
| Sugars | 156 | 175 | 147 | 139 |
| Fresh_potatoes | 720 | 874 | 566 | 1033 |
| Fresh_Veg | 253 | 265 | 171 | 143 |
| Other_Veg | 488 | 570 | 418 | 355 |
| Processed_potatoes | 198 | 203 | 220 | 187 |
| Processed_Veg | 360 | 365 | 337 | 334 |
| Fresh_fruit | 1102 | 1137 | 957 | 674 |
| Cereals | 1472 | 1582 | 1462 | 1494 |
| Beverages | 57 | 73 | 53 | 47 |
| Soft_drinks | 1374 | 1256 | 1572 | 1506 |
| Alcoholic_drinks | 375 | 475 | 458 | 135 |
| Confectionery | 54 | 64 | 62 | 41 |
We shall say that the 17 food types are the variables and the 4 countries are the observations. This would be equivalent to our genes and samples respectively from the lecture video example (and indeed the second main example further below).
First we will read the provided UK_foods.csv input file (note we can read this directly from the tinyurl short link “https://tinyurl.com/UK-foods”, which simply points to a CSV file on our class website).
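A first pass at this import might look like the following (a sketch using the default read.csv() arguments for now; the row-name wrinkle this creates is discussed below):

```r
# Read the food consumption data directly from the class short link
url <- "https://tinyurl.com/UK-foods"
x <- read.csv(url)
```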
Q1. How many rows and columns are in your new data frame named x? What R functions could you use to answer this question?
```r
dim(x)
```
## [1] 17 5

The individual nrow(x) and ncol(x) functions would also work here.
It is always a good idea to examine your imported data to make sure it meets your expectations. At this stage we want to make sure that no odd things have happened during the importing phase that will come back to haunt us later.
For this task we can use the View() function to display all the data (in a new tab in RStudio) or the head() and tail() functions to print only a portion of the data (by default 6 rows from either the top or bottom of the dataset respectively).
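For example (assuming the imported data frame is still called x):

```r
head(x)   # first 6 rows by default
tail(x)   # last 6 rows by default
# View(x) # interactive use only (see side-note below)
```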
Side-Note: Never leave a View() function call uncommented in your Quarto document as this is intended for interactive use and will slow down or even stop the Rendering process when you go to generate HTML, PDF, MD etc. format reports.
Hmm, it looks like the row-names here were not set properly as we were expecting 4 columns (one for each of the 4 countries of the UK - not 5 as reported from the dim() function).
Here it appears that the row-names are incorrectly set as the first column of our x data frame (rather than set as proper row-names). This is very common and sometimes what we want - but not in this case. Let's try to fix this up with the following code, which sets the rownames() to the first column and then removes the troublesome first column (with the -1 column index):
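A sketch of that fix (note it is destructive: running the x <- x[,-1] line more than once will keep deleting columns, a point Q2 below returns to):

```r
rownames(x) <- x[,1]  # set the row-names from the first column
x <- x[,-1]           # then remove that now redundant first column
head(x)
```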
This looks much better. Now let's check the dimensions again:
## [1] 17 4
Side-note: An alternative approach to setting the correct row-names in this case would be to read the data file again and this time set the row.names argument of read.csv() to be the first column (i.e. use the argument setting row.names=1), see below:
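For example (re-using the url object from above):

```r
x <- read.csv(url, row.names=1)
head(x)
```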
Q2. Which approach to solving the ‘row-names problem’ mentioned above do you prefer and why? Is one approach more robust than another under certain circumstances? (Hint: what happens if you run the first approach code, particularly the line x <- x[,-1], multiple times?)
A cursory glance over the numbers in this table does not reveal much of anything. Indeed, in general, it is difficult to extract meaning regarding major differences and trends from any given array of numbers. Generating regular bar-plots and various pairwise plots does not help too much either:
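For instance, a grouped barplot in base R might look like this (a sketch; the col and beside settings here are assumptions about the original figure):

```r
# One bar per food type, grouped by country
barplot(as.matrix(x), beside=TRUE, col=rainbow(nrow(x)))
```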
Q3: Changing what optional argument in the above barplot() function results in the following plot?
Hint: try beside=FALSE in your barplot() code. Why would leaving this argument out have the same effect as setting it to FALSE? See the help page with ?barplot for further details.
Using the ggplot2 package for these relatively simple figures is somewhat more convoluted than using “base” R. This is due to the ggplot requirement for so-called “long” data rather than the conventional “wide” data format that we currently have. We will talk more about “long” format data a little later in the course when we introduce the dplyr package.
## [1] 17 4
Data organized with one row per Food (17 observations) and multiple columns for Country (4 different measurements across a row):
To convert this to “long” format we want one row per measurement - maximizing rows (17x4=68) and minimizing columns (with a single Consumption measurement value per Country). We will do this tidying with the pivot_longer() function from the tidyr package:
```r
library(tidyr)

# Convert data to long format for ggplot with `pivot_longer()`
x_long <- x |>
  tibble::rownames_to_column("Food") |>
  pivot_longer(cols = -Food,
               names_to = "Country",
               values_to = "Consumption")

dim(x_long)
```
## [1] 68 3
```r
# Create grouped bar plot
library(ggplot2)

ggplot(x_long) +
  aes(x = Country, y = Consumption, fill = Food) +
  geom_col(position = "dodge") +
  theme_bw()
```

Q4: Changing what optional argument in the above ggplot() code results in a stacked barplot figure?
Pairs plots, also known as scatterplot matrices, are collections of bivariate scatter plots used to visualize relationships between multiple variables simultaneously. For small datasets, pairs plots are a particularly valuable exploratory analysis technique.
Q5: We can use the pairs() function to generate all pairwise plots for our countries. Can you make sense of the following code and resulting figure? What does it mean if a given point lies on the diagonal for a given plot?
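The code in question is presumably along these lines (the color choice is an assumption):

```r
# All pairwise country-vs-country scatter plots, one color per food
pairs(x, col=rainbow(nrow(x)), pch=16)
```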
Another useful visualization approach that is frequently used in bioinformatics is a heatmap. In a heatmap, the data values are represented as colors (often “hot” to “cool” for high and low values respectively). The objective is to make it easier to spot patterns and differences at a glance from the color scale. Heatmaps can also incorporate hierarchical clustering of both the rows and columns, which groups similar observations together and helps order your data by similar trends.
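A minimal sketch using the pheatmap package (attached in the session info below; base R's heatmap() would also work):

```r
library(pheatmap)

# Rows (foods) and columns (countries) are clustered by default
pheatmap(as.matrix(x))
```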
In this heatmap, darker colors represent higher values (consumption in grams), and the dendrogram (tree diagram) on the left shows how foods cluster together, while the top dendrogram shows how countries cluster together.
Q6. Based on the pairs and heatmap figures, which countries cluster together and what does this suggest about their food consumption patterns? Can you easily tell what the main differences are between N. Ireland and the other countries of the UK in terms of this data-set?
Given that it is quite difficult to make sense of even this relatively small data set, hopefully we can clearly see that a powerful analytical method is absolutely necessary if we wish to observe trends and patterns in larger datasets.
We need some way of making sense of the above data. Are there any trends present which are not obvious from glancing at the array of numbers?
Traditionally, we would use a series of pairwise plots (i.e. bivariate scatter plots) and analyse these to try and determine any relationships between variables, however the number of such plots required for such a task is clearly too large even for this small dataset. Therefore, for large data sets, this is not feasible or fun.
PCA generalizes this idea and allows us to perform such an analysis simultaneously, for many variables. In our example here, we have 17 dimensional data for 4 countries. We can thus ‘imagine’ plotting the 4 coordinates representing the 4 countries in 17 dimensional space. If there is any correlation between the observations (the countries), this will be observed in the 17 dimensional space by the correlated points being clustered close together, though of course since we cannot visualize such a space, we are not able to see such clustering directly (see the lecture slides for a clear description and example of this).
To perform PCA in R there are actually lots of functions to choose from and many packages offer slick PCA implementations and useful graphing approaches. However here we will stick to the base R prcomp() function.
As we noted in the lecture portion of class, prcomp() expects the observations to be rows and the variables to be columns therefore we need to first transpose our data.frame matrix with the t() transpose function.
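In code this looks like the following (assuming our cleaned-up data frame x from above):

```r
# prcomp() expects observations as rows, so transpose first
pca <- prcomp(t(x))
summary(pca)
```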
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 324.1502 212.7478 73.87622 2.921e-14
## Proportion of Variance 0.6744 0.2905 0.03503 0.000e+00
## Cumulative Proportion 0.6744 0.9650 1.00000 1.000e+00
The first task of PCA is to identify a new set of principal axes through the data. This is achieved by finding the directions of maximal variance through the coordinates in the 17 dimensional space. This is equivalent to obtaining the (least-squares) line of best fit through the plotted data where it has the largest spread. We call this new axis the first principal component (or PC1) of the data. The second best axis is PC2, the third best PC3, and so on.
The summary print-out above indicates that PC1 accounts for more than 67% of the sample variance, PC2 29% and PC3 3%. Collectively PC1 and PC2 together capture 96% of the original 17 dimensional variance. Thus these first two new principal axes (PC1 and PC2) represent useful ways to view and further investigate our data set. Let's start with a simple plot of PC1 vs PC2.
Q7. Complete the code below to generate a plot of PC1 vs PC2. The second line adds text labels over the data points.
```r
# Create a data frame for plotting (ggplot needs a data frame, not a matrix)
df <- as.data.frame(pca$x)
df$Country <- rownames(df)

# Plot PC1 vs PC2 with ggplot
ggplot(df) +
  aes(x = ___, y = ___, label = Country) +
  geom_point(size = 3) +
  geom_text(vjust = -0.5) +
  xlim(-270, 500) +
  xlab("PC1") +
  ylab("PC2") +
  theme_bw()
```

Q8. Customize your plot so that the colors of the country names match the colors in our UK and Ireland map and table at start of this document.
Once the principal components have been obtained, we can use them to map the relationship between the observations (i.e. countries) in terms of these major PCs (i.e. new axes that maximally describe the original data variance).
In our food example here, the four 17 dimensional coordinates are projected down onto the two principal components to obtain the graph above.
As part of the PCA method, we automatically obtain information about the contributions of each principal component to the total variance of the coordinates. This is typically contained in the eigenvalues returned from such calculations. In the prcomp() case we can use the summary() command above or examine the returned pca$sdev (see below).
In this case approximately 67% of the variance in the data is accounted for by the first principal component, and approximately 96.5% is accounted for in total by the first two principal components. We have therefore accounted for the vast majority of the variation in the data using only a two dimensional plot - a dramatic reduction in dimensionality from seventeen dimensions to two.
In practice, it is usually sufficient to include enough principal components so that somewhere in the region of 70% of the variation in the data is accounted for. Looking at the so-called scree plot can help in this regard. Ask Barry about this if you are unsure what we mean here.
Below we can use the square of pca$sdev, which stands for “standard deviation”, to calculate how much variation in the original data each PC accounts for.
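A sketch of that calculation (rounding to whole percentages to match the output below; the result is stored in v for re-use in the scree plot code):

```r
# Percent of total variance captured by each PC
v <- round(pca$sdev^2 / sum(pca$sdev^2) * 100)
v
```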
## [1] 67 29 4 0
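The same information is available from the importance matrix of the summary() result:

```r
summary(pca)$importance
```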
## PC1 PC2 PC3 PC4
## Standard deviation 324.15019 212.74780 73.87622 2.921348e-14
## Proportion of Variance 0.67444 0.29052 0.03503 0.000000e+00
## Cumulative Proportion 0.67444 0.96497 1.00000 1.000000e+00
This information can be summarized in a plot of the variances (eigenvalues) with respect to the principal component number (eigenvector number), which is given below.
```r
# Create scree plot with ggplot
variance_df <- data.frame(
  PC = factor(paste0("PC", 1:length(v)), levels = paste0("PC", 1:length(v))),
  Variance = v
)

ggplot(variance_df) +
  aes(x = PC, y = Variance) +
  geom_col(fill = "steelblue") +
  xlab("Principal Component") +
  ylab("Percent Variation") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 0))
```

We can also consider the influence of each of the original variables upon the principal components (typically known as loading scores). This information can be obtained from the prcomp() returned $rotation component. It can also be summarized with a call to biplot(), see below:
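The biplot() version is a quick one-liner on the prcomp() result:

```r
biplot(pca)
```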
```r
## Let's focus on PC1 as it accounts for the majority of variance (67%)
loadings_df <- as.data.frame(pca$rotation)  # ggplot needs a data frame

ggplot(loadings_df) +
  aes(x = PC1,
      y = reorder(rownames(loadings_df), PC1)) +
  geom_col(fill = "steelblue") +
  xlab("PC1 Loading Score") +
  ylab("") +
  theme_bw() +
  theme(axis.text.y = element_text(size = 9))
```

Here we see the variables (foods) with the largest positive loading scores that effectively “push” N. Ireland to the right (positive) side of the plot (including Fresh_potatoes and Soft_drinks).

We can also see the variables/foods with high negative scores that push the other countries to the left side of the plot (including Fresh_fruit and Alcoholic_drinks).
Q9: Generate a similar ‘loadings plot’ for PC2. What two food groups feature prominently and what does PC2 mainly tell us about?
Key-Point: PCA has the awesome ability to effectively reduce the dimensionality of our data set down from 17 to 2, allowing us to assert (using our figures above) that the countries England, Wales and Scotland are ‘similar’ with Northern Ireland being different in some way. Furthermore, digging deeper into the loadings we were able to associate certain food types with each cluster of countries. This is super useful!
RNA-seq results often contain a PCA (or related MDS plot). Usually we use these graphs to verify that the control samples cluster together. However, there’s a lot more going on, and if you are willing to dive in, you can extract a lot more information from these plots. The good news is that PCA only sounds complicated. Conceptually, as we have hopefully demonstrated here and in the lecture, it is readily accessible and understandable.
In this example, a small RNA-seq count data set (available from the class website as expression.csv and via the tinyurl short link “https://tinyurl.com/expression-CSV”) is read into a data frame called rna.data, where the columns are individual samples (i.e. cells) and rows are measurements taken for all the samples (i.e. genes).
NOTE: The samples are columns, and the genes are rows!
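A minimal import sketch (assuming the gene names are in the first column of the CSV, hence row.names=1):

```r
rna.data <- read.csv("https://tinyurl.com/expression-CSV", row.names=1)
head(rna.data, 2)
```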
Q10: How many genes and samples are in this data set? How many PCs do you think it will take to have a useful overview of this data set (see below)?
Generating barplots etc. to make sense of this data is really not an exciting or worthwhile option to consider. So let's do PCA and plot the results:
```r
## Again we have to take the transpose of our data
pca <- prcomp(t(rna.data), scale=TRUE)

# Create data frame for plotting
df <- as.data.frame(pca$x)
df$Sample <- rownames(df)

## Plot with ggplot
ggplot(df) +
  aes(x = PC1, y = PC2, label = Sample) +
  geom_point(size = 3) +
  geom_text(vjust = -0.5, size = 3) +
  xlab("PC1") +
  ylab("PC2") +
  theme_bw()
```

This quick plot looks interesting, with a nice separation of samples into two groups of 5 samples each. Before delving into the details of this grouping, let's first examine a summary of how much variation in the original data each PC accounts for:
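As before, summary() reports the proportion of variance captured by each PC:

```r
summary(pca)
```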
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 9.6237 1.5198 1.05787 1.05203 0.88062 0.82545 0.80111
## Proportion of Variance 0.9262 0.0231 0.01119 0.01107 0.00775 0.00681 0.00642
## Cumulative Proportion 0.9262 0.9493 0.96045 0.97152 0.97928 0.98609 0.99251
## PC8 PC9 PC10
## Standard deviation 0.62065 0.60342 3.345e-15
## Proportion of Variance 0.00385 0.00364 0.000e+00
## Cumulative Proportion 0.99636 1.00000 1.000e+00
We can see from these results that PC1 is where all the action is (92.6% of it in fact!). This indicates that we have successfully reduced a 100-dimensional data set down to only one dimension that retains the main essential (or principal) features of the original data. PC1 captures 92.6% of the original variance, with the first two PCs capturing 94.9%. This is quite amazing!
A quick scree plot summary of this Proportion of Variance for each PC can be obtained using ggplot:
```r
# Calculate variance explained
pca.var <- pca$sdev^2
pca.var.per <- round(pca.var/sum(pca.var)*100, 1)

# Create scree plot data
scree_df <- data.frame(
  PC = factor(paste0("PC", 1:10), levels = paste0("PC", 1:10)),
  Variance = pca.var[1:10]
)

ggplot(scree_df) +
  aes(x = PC, y = Variance) +
  geom_col(fill = "steelblue") +
  ggtitle("Quick scree plot") +
  xlab("Principal Component") +
  ylab("Variance") +
  theme_bw()
```

Let's make another version showing percent variation:
## [1] 92.6 2.3 1.1 1.1 0.8 0.7 0.6 0.4 0.4 0.0
```r
# Create percent variance scree plot
scree_pct_df <- data.frame(
  PC = factor(paste0("PC", 1:10), levels = paste0("PC", 1:10)),
  PercentVariation = pca.var.per[1:10]
)

ggplot(scree_pct_df) +
  aes(x = PC, y = PercentVariation) +
  geom_col(fill = "steelblue") +
  ggtitle("Scree Plot") +
  xlab("Principal Component") +
  ylab("Percent Variation") +
  theme_bw()
```

Again we can see from this plot that PC1 is where all the action is.
Now let's make our main PCA plot a bit more attractive and useful…
```r
## A vector of colors for wt and ko samples
colvec <- colnames(rna.data)
colvec[grep("wt", colvec)] <- "red"
colvec[grep("ko", colvec)] <- "blue"

# Add condition to data frame
df$condition <- substr(df$Sample, 1, 2)
df$color <- colvec

ggplot(df) +
  aes(x = PC1, y = PC2, color = color, label = Sample) +
  geom_point(size = 3) +
  geom_text(vjust = -0.5, hjust = 0.5, show.legend = FALSE) +
  scale_color_identity() +
  xlab(paste0("PC1 (", pca.var.per[1], "%)")) +
  ylab(paste0("PC2 (", pca.var.per[2], "%)")) +
  theme_bw()
```

For demonstration purposes let's find the top 10 measurements (genes) that contribute most to PC1 in either direction (+ or -).
```r
loading_scores <- pca$rotation[,1]

## Find the top 10 measurements (genes) that contribute
## most to PC1 in either direction (+ or -)
gene_scores <- abs(loading_scores)
gene_score_ranked <- sort(gene_scores, decreasing=TRUE)

## show the names of the top 10 genes
top_10_genes <- names(gene_score_ranked[1:10])
top_10_genes
```
## [1] "gene100" "gene66" "gene45" "gene68" "gene98" "gene60" "gene21"
## [8] "gene56" "gene10" "gene90"
These may be the genes that we would like to focus on for further analysis (if their expression changes are significant - we will deal with this and further steps of RNA-Seq analysis in subsequent classes).
Finally for this lab session, please compile a summary report of your work with answers to the above 10 questions and submit to Gradescope. To do this you will need your working Quarto or RMarkdown document to be error free (i.e. you can source it without errors) and select the Render option with format: pdf in your YAML header section.
If you have your GitHub account set up correctly (and your git tracked repo from a previous class already synced to GitHub) you can now sync today’s work to GitHub. Note that class order changes in certain quarters due to holidays and other factors so we may not have had our git and GitHub class session yet.
Talk to Barry at this point for some extra discussion and guidance. Essentially, the way you do this will depend on how your current project is set up. Is it already a folder within your GitHub tracked folder, or is it a separate directory/folder? If it is the latter then you will want to quit RStudio and copy your folder into your GitHub tracked folder. Then open this new copy and sync to GitHub via the add/commit/push cycle we used previously. If it is the former then you should be fine to go through the git add/commit/push cycle. Again, discuss with Barry if this is unclear.
Link to today’s muddy point assessment.
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pheatmap_1.0.12 ggplot2_4.0.0 tidyr_1.3.1 formattable_0.2.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 dplyr_1.1.4 compiler_4.4.2
## [5] tidyselect_1.2.1 jquerylib_0.1.4 scales_1.4.0 yaml_2.3.10
## [9] fastmap_1.2.0 R6_2.6.1 labeling_0.4.3 generics_0.1.3
## [13] knitr_1.50 htmlwidgets_1.6.4 tibble_3.2.1 bslib_0.9.0
## [17] pillar_1.10.2 RColorBrewer_1.1-3 rlang_1.1.6 cachem_1.1.0
## [21] xfun_0.52 sass_0.4.10 S7_0.2.0 cli_3.6.4
## [25] withr_3.0.2 magrittr_2.0.3 digest_0.6.37 grid_4.4.2
## [29] rstudioapi_0.17.1 lifecycle_1.0.4 vctrs_0.6.5 evaluate_1.0.3
## [33] glue_1.8.0 farver_2.1.2 rmarkdown_2.29 purrr_1.0.4
## [37] tools_4.4.2 pkgconfig_2.0.3 htmltools_0.5.8.1