BIMM-143, Lecture 15

Pathway analysis from RNA-seq differential expression results

BIMM-143 Lecture 15:
Barry Grant < http://thegrantlab.org >
Date: 2018-02-22 (23:17:42 PST on Thu, Feb 22)

Overview

Analysis of high-throughput biological data typically yields a list of genes or proteins requiring further interpretation - for example the ranked lists of differentially expressed genes we have been generating from our RNA-seq analysis to date.

Our intention is typically to use such lists to gain novel insights about genes and proteins that may have roles in a given phenomenon, phenotype or disease progression. However, in many cases these ‘raw’ gene lists are challenging to interpret due to their large size and lack of useful annotations. Hence, our expensively assembled gene lists often fail to convey the full degree of possible insight about the condition being studied.

Pathway analysis (also known as gene set analysis or over-representation analysis), aims to reduce the complexity of interpreting gene lists via mapping the listed genes to known (i.e. annotated) biological pathways, processes and functions.

Side-note: Pathway analysis can actually mean many different things to different people. This includes analysis of Gene Ontology (GO) terms, protein–protein interaction networks, flux-balance analysis from kinetic simulations of pathways, etc. However, pathway analysis most commonly focuses on methods that exploit existing pathway knowledge (e.g. in public repositories such as GO or KEGG), rather than on methods that infer pathways from molecular measurements. These more general approaches are nicely reviewed in this paper:
Khatri, et al. “Ten years of pathway analysis: current approaches and outstanding challenges.” PLoS Comput Biol 8.2 (2012): e1002375.

Patway analysis with R and Bioconductor

There are many freely available tools for pathway or over-representation analysis. As of Nov 2017 Bioconductor alone has over 80 packages categorized under gene set enrichment and over 120 packages categorized under pathways.

Here we play with just one, the GAGE package (which stands for Generally Applicable Gene set Enrichment), to do KEGG pathway enrichment analysis on RNA-seq based differential expression results.

The KEGG pathway database, unlike GO for example, provides functional annotation as well as information about gene products that interact with each other in a given pathway, how they interact (e.g., activation, inhibition, etc.), and where they interact (e.g., cytoplasm, nucleus, etc.). Hence KEGG has the potential to provide extra insight beyond annotation lists of simple molecular function, process etc. from GO terms.

In this analysis, we check for coordinated differential expression over gene sets from KEGG pathways instead of changes of individual genes. The assumption here is that consistent perturbations over a given pathway (gene set) may suggest mechanistic changes.

About our Input Data

The data for for hands-on session comes from GEO entry: GSE37704, which is associated with the following publication:

Trapnell C, Hendrickson DG, Sauvageau M, Goff L et al. “Differential analysis of gene regulation at transcript resolution with RNA-seq”. Nat Biotechnol 2013 Jan;31(1):46-53. PMID: 23222703

The authors report on differential analysis of lung fibroblasts in response to loss of the developmental transcription factor HOXA1. Their results and others indicate that HOXA1 is required for lung fibroblast and HeLa cell cycle progression. In particular their analysis show that “loss of HOXA1 results in significant expression level changes in thousands of individual transcripts, along with isoform switching events in key regulators of the cell cycle”. For our session we have used their Sailfish gene-level estimated counts and hence are restricted to protein-coding genes only.

Section 1. Differential Expression Analysis

You can download the count data and associated metadata from here: GSE37704_featurecounts.csv and GSE37704_metadata.csv. This is similar to our starting point for the last class where we used DESeq2 for the first time. We will use it again today!

#library(dplyr)
library(DESeq2)

Load our data files

metaFile <- "data/GSE37704_metadata.csv"
countFile <- "data/GSE37704_featurecounts.csv"

# Import metadata and take a peak
colData = read.csv(metaFile, row.names=1)
head(colData)

##               condition
## SRR493366 control_sirna
## SRR493367 control_sirna
## SRR493368 control_sirna
## SRR493369      hoxa1_kd
## SRR493370      hoxa1_kd
## SRR493371      hoxa1_kd

# Import countdata
countData = read.csv(countFile, row.names=1)
head(countData)

##                 length SRR493366 SRR493367 SRR493368 SRR493369 SRR493370
## ENSG00000186092    918         0         0         0         0         0
## ENSG00000279928    718         0         0         0         0         0
## ENSG00000279457   1982        23        28        29        29        28
## ENSG00000278566    939         0         0         0         0         0
## ENSG00000273547    939         0         0         0         0         0
## ENSG00000187634   3214       124       123       205       207       212
##                 SRR493371
## ENSG00000186092         0
## ENSG00000279928         0
## ENSG00000279457        46
## ENSG00000278566         0
## ENSG00000273547         0
## ENSG00000187634       258

Hmm… remember that we need the countData and colData files to match up so we will need to remove that odd first column in countData namely contData$length.

# Note we need to remove the odd first $length col
countData <- as.matrix(countData[,-1])
head(countData)

##                 SRR493366 SRR493367 SRR493368 SRR493369 SRR493370
## ENSG00000186092         0         0         0         0         0
## ENSG00000279928         0         0         0         0         0
## ENSG00000279457        23        28        29        29        28
## ENSG00000278566         0         0         0         0         0
## ENSG00000273547         0         0         0         0         0
## ENSG00000187634       124       123       205       207       212
##                 SRR493371
## ENSG00000186092         0
## ENSG00000279928         0
## ENSG00000279457        46
## ENSG00000278566         0
## ENSG00000273547         0
## ENSG00000187634       258

This looks better but there are lots of zero entries in there so let’s get rid of them as we have no data for these.

# Filter count data where you have 0 read count across all samples.
countData = countData[rowSums(countData)>1, ]
head(countData)

##                 SRR493366 SRR493367 SRR493368 SRR493369 SRR493370
## ENSG00000279457        23        28        29        29        28
## ENSG00000187634       124       123       205       207       212
## ENSG00000188976      1637      1831      2383      1226      1326
## ENSG00000187961       120       153       180       236       255
## ENSG00000187583        24        48        65        44        48
## ENSG00000187642         4         9        16        14        16
##                 SRR493371
## ENSG00000279457        46
## ENSG00000187634       258
## ENSG00000188976      1504
## ENSG00000187961       357
## ENSG00000187583        64
## ENSG00000187642        16

Nice now lets setup the DESeqDataSet object required for the DESeq() function and then run the DESeq pipeline. This is again similar to our last days hands-on session.

dds = DESeqDataSetFromMatrix(countData=countData,
                             colData=colData,
                             design=~condition)
dds = DESeq(dds)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

dds

## class: DESeqDataSet 
## dim: 15280 6 
## metadata(1): version
## assays(3): counts mu cooks
## rownames(15280): ENSG00000279457 ENSG00000187634 ...
##   ENSG00000276345 ENSG00000271254
## rowData names(21): baseMean baseVar ... deviance maxCooks
## colnames(6): SRR493366 SRR493367 ... SRR493370 SRR493371
## colData names(2): condition sizeFactor

Next, get results for the HoxA1 knockdown versus control siRNA (remember we labeled these as “hoxa1_kd” and “control_sirna” in our original colData metaFile input to DESeq, you can check this above and by running resultsNames(dds) command).

res = results(dds, contrast=c("condition", "hoxa1_kd", "control_sirna"))

Let’s reorder these results by p-value and call summary() on the results object to get a sense of how many genes are up or down-regulated at the default FDR of 0.1.

res = res[order(res$pvalue),]
summary(res)

## 
## out of 15280 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)     : 4352, 28% 
## LFC < 0 (down)   : 4400, 29% 
## outliers [1]     : 0, 0% 
## low counts [2]   : 590, 3.9% 
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Since we mapped and counted against the Ensembl annotation, our results only have information about Ensembl gene IDs. However, our pathway analysis downstream will use KEGG pathways, and genes in KEGG pathways are annotated with Entrez gene IDs. So lets add them as we did the last day.

library("AnnotationDbi")
library("org.Hs.eg.db")

columns(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

res$symbol = mapIds(org.Hs.eg.db,
                    keys=row.names(res), 
                    column="SYMBOL",
                    keytype="ENSEMBL",
                    multiVals="first")

res$entrez = mapIds(org.Hs.eg.db,
                    keys=row.names(res), 
                    column="ENTREZID",
                    keytype="ENSEMBL",
                    multiVals="first")

res$name =   mapIds(org.Hs.eg.db,
                    keys=row.names(res), 
                    column="GENENAME",
                    keytype="ENSEMBL",
                    multiVals="first")

head(res, 10)

## log2 fold change (MLE): condition hoxa1_kd vs control_sirna 
## Wald test p-value: condition hoxa1 kd vs control sirna 
## DataFrame with 10 rows and 9 columns
##                  baseMean log2FoldChange      lfcSE      stat    pvalue
##                 <numeric>      <numeric>  <numeric> <numeric> <numeric>
## ENSG00000117519  4483.627      -2.422719 0.06001850 -40.36620         0
## ENSG00000183508  2053.881       3.201955 0.07241968  44.21388         0
## ENSG00000159176  5692.463      -2.313737 0.05757255 -40.18820         0
## ENSG00000150938  7442.986      -2.059631 0.05386627 -38.23601         0
## ENSG00000116016  4423.947      -1.888019 0.04318301 -43.72134         0
## ENSG00000136068  3796.127      -1.649792 0.04394825 -37.53942         0
## ENSG00000164251  2348.770       3.344508 0.06907610  48.41773         0
## ENSG00000124766  2576.653       2.392288 0.06171493  38.76352         0
## ENSG00000124762 28106.119       1.832258 0.03892405  47.07264         0
## ENSG00000106366 43719.126      -1.844046 0.04194432 -43.96415         0
##                      padj      symbol      entrez
##                 <numeric> <character> <character>
## ENSG00000117519         0        CNN3        1266
## ENSG00000183508         0      FAM46C       54855
## ENSG00000159176         0       CSRP1        1465
## ENSG00000150938         0       CRIM1       51232
## ENSG00000116016         0       EPAS1        2034
## ENSG00000136068         0        FLNB        2317
## ENSG00000164251         0       F2RL1        2150
## ENSG00000124766         0        SOX4        6659
## ENSG00000124762         0      CDKN1A        1026
## ENSG00000106366         0    SERPINE1        5054
##                                                        name
##                                                 <character>
## ENSG00000117519                                  calponin 3
## ENSG00000183508 family with sequence similarity 46 member C
## ENSG00000159176         cysteine and glycine rich protein 1
## ENSG00000150938 cysteine rich transmembrane BMP regulator 1
## ENSG00000116016            endothelial PAS domain protein 1
## ENSG00000136068                                   filamin B
## ENSG00000164251                 F2R like trypsin receptor 1
## ENSG00000124766                                   SRY-box 4
## ENSG00000124762        cyclin dependent kinase inhibitor 1A
## ENSG00000106366                    serpin family E member 1

Great, this is looking good so far. Now lets see how pathway analysis can help us make further sense out of this ranked list of differentially expressed genes.

Section 2. Pathway Analysis

Here we are going to use the gage package for pathway analysis. Once we have a list of enriched pathways, we’re going to use the pathview package to draw pathway diagrams, shading the molecules in the pathway by their degree of up/down-regulation.

KEGG pathways

The gageData package has pre-compiled databases mapping genes to KEGG pathways and GO terms for common organisms. kegg.sets.hs is a named list of 229 elements. Each element is a character vector of member gene Entrez IDs for a single KEGG pathway. (See also go.sets.hs). The sigmet.idx.hs is an index of numbers of signaling and metabolic pathways in kegg.set.gs. In other words, KEGG pathway include other types of pathway definitions, like “Global Map” and “Human Diseases”, which may be undesirable in pathway analysis. Therefore, kegg.sets.hs[sigmet.idx.hs] gives you the “cleaner” gene sets of signaling and metabolic pathways only.

Side-Note: While there are many freely available tools to do pathway analysis, and some like gage are truly fantastic, many of them are poorly maintained or rarely updated. The DAVID tool that a lot of folks use for simple gene set enrichment analysis was not updated at all between Jan 2010 and Oct 2016.

First we need to do our one time install of these required bioconductor packages:

source("http://bioconductor.org/biocLite.R")
biocLite( c("pathview", "gage", "gageData") )

Now we can load the packages and setup the KEGG data-sets we need.

library(pathview)

## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
## 
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################

library(gage)
library(gageData)

data(kegg.sets.hs)
data(sigmet.idx.hs)

kegg.sets.hs = kegg.sets.hs[sigmet.idx.hs]
head(kegg.sets.hs, 3)

## $`hsa00232 Caffeine metabolism`
## [1] "10"   "1544" "1548" "1549" "1553" "7498" "9"   
## 
## $`hsa00983 Drug metabolism - other enzymes`
##  [1] "10"     "1066"   "10720"  "10941"  "151531" "1548"   "1549"  
##  [8] "1551"   "1553"   "1576"   "1577"   "1806"   "1807"   "1890"  
## [15] "221223" "2990"   "3251"   "3614"   "3615"   "3704"   "51733" 
## [22] "54490"  "54575"  "54576"  "54577"  "54578"  "54579"  "54600" 
## [29] "54657"  "54658"  "54659"  "54963"  "574537" "64816"  "7083"  
## [36] "7084"   "7172"   "7363"   "7364"   "7365"   "7366"   "7367"  
## [43] "7371"   "7372"   "7378"   "7498"   "79799"  "83549"  "8824"  
## [50] "8833"   "9"      "978"   
## 
## $`hsa00230 Purine metabolism`
##   [1] "100"    "10201"  "10606"  "10621"  "10622"  "10623"  "107"   
##   [8] "10714"  "108"    "10846"  "109"    "111"    "11128"  "11164" 
##  [15] "112"    "113"    "114"    "115"    "122481" "122622" "124583"
##  [22] "132"    "158"    "159"    "1633"   "171568" "1716"   "196883"
##  [29] "203"    "204"    "205"    "221823" "2272"   "22978"  "23649" 
##  [36] "246721" "25885"  "2618"   "26289"  "270"    "271"    "27115" 
##  [43] "272"    "2766"   "2977"   "2982"   "2983"   "2984"   "2986"  
##  [50] "2987"   "29922"  "3000"   "30833"  "30834"  "318"    "3251"  
##  [57] "353"    "3614"   "3615"   "3704"   "377841" "471"    "4830"  
##  [64] "4831"   "4832"   "4833"   "4860"   "4881"   "4882"   "4907"  
##  [71] "50484"  "50940"  "51082"  "51251"  "51292"  "5136"   "5137"  
##  [78] "5138"   "5139"   "5140"   "5141"   "5142"   "5143"   "5144"  
##  [85] "5145"   "5146"   "5147"   "5148"   "5149"   "5150"   "5151"  
##  [92] "5152"   "5153"   "5158"   "5167"   "5169"   "51728"  "5198"  
##  [99] "5236"   "5313"   "5315"   "53343"  "54107"  "5422"   "5424"  
## [106] "5425"   "5426"   "5427"   "5430"   "5431"   "5432"   "5433"  
## [113] "5434"   "5435"   "5436"   "5437"   "5438"   "5439"   "5440"  
## [120] "5441"   "5471"   "548644" "55276"  "5557"   "5558"   "55703" 
## [127] "55811"  "55821"  "5631"   "5634"   "56655"  "56953"  "56985" 
## [134] "57804"  "58497"  "6240"   "6241"   "64425"  "646625" "654364"
## [141] "661"    "7498"   "8382"   "84172"  "84265"  "84284"  "84618" 
## [148] "8622"   "8654"   "87178"  "8833"   "9060"   "9061"   "93034" 
## [155] "953"    "9533"   "954"    "955"    "956"    "957"    "9583"  
## [162] "9615"

The main gage() function requires a named vector of fold changes, where the names of the values are the Entrez gene IDs.

foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez
head(foldchanges)

##      1266     54855      1465     51232      2034      2317 
## -2.422719  3.201955 -2.313737 -2.059631 -1.888019 -1.649792

Now, let’s run the pathway analysis. See help on the gage function with ?gage. Specifically, you might want to try changing the value of same.dir. This value determines whether to test for changes in a gene set toward a single direction (all genes up or down regulated) or changes towards both directions simultaneously (i.e. any genes in the pathway dysregulated).

Here, we’re using same.dir=TRUE, which will give us separate lists for pathways that are upregulated versus pathways that are down-regulated. Let’s look at the first few results from each.

# Get the results
keggres = gage(foldchanges, gsets=kegg.sets.hs, same.dir=TRUE)

Lets look at the result object. It is a list with three elements (“greater”, “less” and “stats”).

attributes(keggres)

## $names
## [1] "greater" "less"    "stats"

So it is a list object (you can check it with str(keggres)) and we can use the dollar syntax to access a named element, e.g.

head(keggres$greater)

##                                         p.geomean stat.mean       p.val
## hsa04640 Hematopoietic cell lineage   0.002709366  2.857393 0.002709366
## hsa04630 Jak-STAT signaling pathway   0.005655916  2.557207 0.005655916
## hsa04142 Lysosome                     0.008948808  2.384783 0.008948808
## hsa00140 Steroid hormone biosynthesis 0.009619717  2.432105 0.009619717
## hsa04740 Olfactory transduction       0.014450242  2.239717 0.014450242
## hsa04916 Melanogenesis                0.022339115  2.023074 0.022339115
##                                           q.val set.size        exp1
## hsa04640 Hematopoietic cell lineage   0.3847887       49 0.002709366
## hsa04630 Jak-STAT signaling pathway   0.3847887      103 0.005655916
## hsa04142 Lysosome                     0.3847887      117 0.008948808
## hsa00140 Steroid hormone biosynthesis 0.3847887       26 0.009619717
## hsa04740 Olfactory transduction       0.4624078       39 0.014450242
## hsa04916 Melanogenesis                0.5297970       85 0.022339115

head(keggres$less)

##                                      p.geomean stat.mean        p.val
## hsa04110 Cell cycle               1.004024e-05 -4.353447 1.004024e-05
## hsa03030 DNA replication          8.909718e-05 -3.968605 8.909718e-05
## hsa03013 RNA transport            1.471026e-03 -3.007785 1.471026e-03
## hsa04114 Oocyte meiosis           1.987557e-03 -2.915377 1.987557e-03
## hsa03440 Homologous recombination 2.942017e-03 -2.868137 2.942017e-03
## hsa00240 Pyrimidine metabolism    5.800212e-03 -2.549616 5.800212e-03
##                                         q.val set.size         exp1
## hsa04110 Cell cycle               0.001606438      120 1.004024e-05
## hsa03030 DNA replication          0.007127774       36 8.909718e-05
## hsa03013 RNA transport            0.078454709      143 1.471026e-03
## hsa04114 Oocyte meiosis           0.079502292       98 1.987557e-03
## hsa03440 Homologous recombination 0.094144560       28 2.942017e-03
## hsa00240 Pyrimidine metabolism    0.138500584       95 5.800212e-03

Each keggres$greater and keggres$less object is data matrix with gene sets as rows sorted by p-value. Lets look at both up (greater), down (less), and statistics by calling head() with the lapply() function. As always if you want to find out more about a particular function or its return values use the R help system (e.g. ?gage or ?lapply).

lapply(keggres, head)

## $greater
##                                         p.geomean stat.mean       p.val
## hsa04640 Hematopoietic cell lineage   0.002709366  2.857393 0.002709366
## hsa04630 Jak-STAT signaling pathway   0.005655916  2.557207 0.005655916
## hsa04142 Lysosome                     0.008948808  2.384783 0.008948808
## hsa00140 Steroid hormone biosynthesis 0.009619717  2.432105 0.009619717
## hsa04740 Olfactory transduction       0.014450242  2.239717 0.014450242
## hsa04916 Melanogenesis                0.022339115  2.023074 0.022339115
##                                           q.val set.size        exp1
## hsa04640 Hematopoietic cell lineage   0.3847887       49 0.002709366
## hsa04630 Jak-STAT signaling pathway   0.3847887      103 0.005655916
## hsa04142 Lysosome                     0.3847887      117 0.008948808
## hsa00140 Steroid hormone biosynthesis 0.3847887       26 0.009619717
## hsa04740 Olfactory transduction       0.4624078       39 0.014450242
## hsa04916 Melanogenesis                0.5297970       85 0.022339115
## 
## $less
##                                      p.geomean stat.mean        p.val
## hsa04110 Cell cycle               1.004024e-05 -4.353447 1.004024e-05
## hsa03030 DNA replication          8.909718e-05 -3.968605 8.909718e-05
## hsa03013 RNA transport            1.471026e-03 -3.007785 1.471026e-03
## hsa04114 Oocyte meiosis           1.987557e-03 -2.915377 1.987557e-03
## hsa03440 Homologous recombination 2.942017e-03 -2.868137 2.942017e-03
## hsa00240 Pyrimidine metabolism    5.800212e-03 -2.549616 5.800212e-03
##                                         q.val set.size         exp1
## hsa04110 Cell cycle               0.001606438      120 1.004024e-05
## hsa03030 DNA replication          0.007127774       36 8.909718e-05
## hsa03013 RNA transport            0.078454709      143 1.471026e-03
## hsa04114 Oocyte meiosis           0.079502292       98 1.987557e-03
## hsa03440 Homologous recombination 0.094144560       28 2.942017e-03
## hsa00240 Pyrimidine metabolism    0.138500584       95 5.800212e-03
## 
## $stats
##                                       stat.mean     exp1
## hsa04640 Hematopoietic cell lineage    2.857393 2.857393
## hsa04630 Jak-STAT signaling pathway    2.557207 2.557207
## hsa04142 Lysosome                      2.384783 2.384783
## hsa00140 Steroid hormone biosynthesis  2.432105 2.432105
## hsa04740 Olfactory transduction        2.239717 2.239717
## hsa04916 Melanogenesis                 2.023074 2.023074

Now, let’s process the results to pull out the top 5 upregulated pathways, then further process that just to get the IDs. We’ll use these KEGG pathway IDs downstream for plotting.

## Sanity check displaying all pathways data
pathways = data.frame(id=rownames(keggres$greater), keggres$greater)
head(pathways)

##                                                                          id
## hsa04640 Hematopoietic cell lineage     hsa04640 Hematopoietic cell lineage
## hsa04630 Jak-STAT signaling pathway     hsa04630 Jak-STAT signaling pathway
## hsa04142 Lysosome                                         hsa04142 Lysosome
## hsa00140 Steroid hormone biosynthesis hsa00140 Steroid hormone biosynthesis
## hsa04740 Olfactory transduction             hsa04740 Olfactory transduction
## hsa04916 Melanogenesis                               hsa04916 Melanogenesis
##                                         p.geomean stat.mean       p.val
## hsa04640 Hematopoietic cell lineage   0.002709366  2.857393 0.002709366
## hsa04630 Jak-STAT signaling pathway   0.005655916  2.557207 0.005655916
## hsa04142 Lysosome                     0.008948808  2.384783 0.008948808
## hsa00140 Steroid hormone biosynthesis 0.009619717  2.432105 0.009619717
## hsa04740 Olfactory transduction       0.014450242  2.239717 0.014450242
## hsa04916 Melanogenesis                0.022339115  2.023074 0.022339115
##                                           q.val set.size        exp1
## hsa04640 Hematopoietic cell lineage   0.3847887       49 0.002709366
## hsa04630 Jak-STAT signaling pathway   0.3847887      103 0.005655916
## hsa04142 Lysosome                     0.3847887      117 0.008948808
## hsa00140 Steroid hormone biosynthesis 0.3847887       26 0.009619717
## hsa04740 Olfactory transduction       0.4624078       39 0.014450242
## hsa04916 Melanogenesis                0.5297970       85 0.022339115

Now, let’s try out the pathview() function from the pathview package to make a pathway plot with our result shown in color. To begin with lets manually supply a pathway.id (namely the first part of the "hsa04110 Cell cycle") that we could see from the print out above.

pathview(gene.data=foldchanges, pathway.id="hsa04110")

## Info: Downloading xml files for hsa04110, 1/1 pathways..
## Info: Downloading png files for hsa04110, 1/1 pathways..
## 'select()' returned 1:1 mapping between keys and columns
## Info: Writing image file hsa04110.pathview.png

This downloads the patway figure data from KEGG and adds our results to it. You can play with the other input arguments to pathview() to change the dispay in various ways including generating a PDF graph. For example:

# A different PDF based output of the same data
pathview(gene.data=foldchanges, pathway.id="hsa04110", kegg.native=FALSE)

Here is the default low resolution raster PNG output from the first pathview() call above:

Note how many of the genes in this pathway are pertubed (i.e. colored) in our results.

Now, let’s process our results a bit more to automagicaly pull out the top 5 upregulated pathways, then further process that just to get the IDs needed by the pathview() function. We’ll use these KEGG pathway IDs for plotting below.

## Focus on top 5 upregulated pathways here for demo purposes only
keggrespathways <- rownames(keggres$greater)[1:5]

# Extract the IDs part of each string
keggresids = substr(keggrespathways, start=1, stop=8)
keggresids

## [1] "hsa04640" "hsa04630" "hsa04142" "hsa00140" "hsa04740"

Finally, lets pass these IDs in keggresids to the pathview() function to draw plots for all the top 5 pathways.

pathview(gene.data=foldchanges, pathway.id=keggresids, species="hsa")

Here are the plots:

Section 3. Gene Ontology (GO)

We can also do a similar procedure with gene ontology. Similar to above, go.sets.hs has all GO terms. go.subs.hs is a named list containing indexes for the BP, CC, and MF ontologies. Let’s only do Biological Process.

data(go.sets.hs)
data(go.subs.hs)
gobpsets = go.sets.hs[go.subs.hs$BP]

gobpres = gage(foldchanges, gsets=gobpsets, same.dir=TRUE)

lapply(gobpres, head)

## $greater
##                                              p.geomean stat.mean
## GO:0007156 homophilic cell adhesion       4.893044e-05  3.971869
## GO:0060429 epithelium development         6.727999e-05  3.834578
## GO:0007610 behavior                       2.171759e-04  3.534089
## GO:0048729 tissue morphogenesis           2.471263e-04  3.498950
## GO:0002009 morphogenesis of an epithelium 3.227727e-04  3.429293
## GO:0016337 cell-cell adhesion             8.194676e-04  3.163087
##                                                  p.val     q.val set.size
## GO:0007156 homophilic cell adhesion       4.893044e-05 0.1337863      107
## GO:0060429 epithelium development         6.727999e-05 0.1337863      478
## GO:0007610 behavior                       2.171759e-04 0.2457053      404
## GO:0048729 tissue morphogenesis           2.471263e-04 0.2457053      403
## GO:0002009 morphogenesis of an epithelium 3.227727e-04 0.2567334      326
## GO:0016337 cell-cell adhesion             8.194676e-04 0.3753986      318
##                                                   exp1
## GO:0007156 homophilic cell adhesion       4.893044e-05
## GO:0060429 epithelium development         6.727999e-05
## GO:0007610 behavior                       2.171759e-04
## GO:0048729 tissue morphogenesis           2.471263e-04
## GO:0002009 morphogenesis of an epithelium 3.227727e-04
## GO:0016337 cell-cell adhesion             8.194676e-04
## 
## $less
##                                             p.geomean stat.mean
## GO:0000279 M phase                       1.582159e-16 -8.314874
## GO:0048285 organelle fission             8.120979e-16 -8.149796
## GO:0000280 nuclear division              2.314155e-15 -8.024006
## GO:0007067 mitosis                       2.314155e-15 -8.024006
## GO:0000087 M phase of mitotic cell cycle 6.404776e-15 -7.881237
## GO:0007059 chromosome segregation        1.055849e-11 -6.988384
##                                                 p.val        q.val
## GO:0000279 M phase                       1.582159e-16 6.292245e-13
## GO:0048285 organelle fission             8.120979e-16 1.614857e-12
## GO:0000280 nuclear division              2.314155e-15 2.300848e-12
## GO:0007067 mitosis                       2.314155e-15 2.300848e-12
## GO:0000087 M phase of mitotic cell cycle 6.404776e-15 5.094359e-12
## GO:0007059 chromosome segregation        1.055849e-11 6.998521e-09
##                                          set.size         exp1
## GO:0000279 M phase                            492 1.582159e-16
## GO:0048285 organelle fission                  373 8.120979e-16
## GO:0000280 nuclear division                   349 2.314155e-15
## GO:0007067 mitosis                            349 2.314155e-15
## GO:0000087 M phase of mitotic cell cycle      359 6.404776e-15
## GO:0007059 chromosome segregation             141 1.055849e-11
## 
## $stats
##                                           stat.mean     exp1
## GO:0007156 homophilic cell adhesion        3.971869 3.971869
## GO:0060429 epithelium development          3.834578 3.834578
## GO:0007610 behavior                        3.534089 3.534089
## GO:0048729 tissue morphogenesis            3.498950 3.498950
## GO:0002009 morphogenesis of an epithelium  3.429293 3.429293
## GO:0016337 cell-cell adhesion              3.163087 3.163087

Section 4. Reactome Pathway Analysis

Reactome is database consisting of biological molecules and their relation to pathways and processes. Reactome, such as many other tools, has an online software available (https://reactome.org/) and R package available (https://bioconductor.org/packages/release/bioc/html/ReactomePA.html).

If you would like more information, the documentation is available here: https://reactome.org/user/guide

Let’s now conduct over-representation enrichment analysis and pathway-topology analysis with Reactome using the previous list of significant genes generated from our differential expression results above.

First, Using R, output the list of significant genes at the 0.05 level as a plain text file:

sig_genes <- res[res$padj <= 0.05 & !is.na(res$padj), "symbol"]
print(paste("Total number of significant genes:", length(sig_genes)))

## [1] "Total number of significant genes: 8151"

write.table(sig_genes, file="significant_genes.txt", row.names=FALSE, col.names=FALSE, quote=FALSE)

Then, to perform pathway analysis online go to the Reactome website (https://reactome.org/PathwayBrowser/#TOOL=AT). Select “choose file” to upload your significant gene list. Then, select the parameters “Project to Humans”, then click “Analyze”.

Question: What pathway has the most significant “Entities p-value”? Do the most significant pathways listed match your previous KEGG results? What factors could cause differences between the two methods?

Section 5. GO Analysis

Gene Set Gene Ontology (GO) Enrichment is a method to determine over-represented or under-represented GO terms for a given set of genes. GO terms are formal structured controlled vocabularies (ontologies) for gene products in terms of their biological function. The goal of this analysis is to determine the biological process the given set of genes are associated with.

To perform Gene Set GO Enrichment online go to the website http://www.geneontology.org/page/go-enrichment-analysis. Paste your significant gene list from section 4. Then, select “biological process” and “homo sapiens”, and click submit.

Question: What pathway has the most significant “Entities p-value”? Do the most significant pathways listed match your previous KEGG results? What factors could cause differences between the two methods?

Session Information

The sessionInfo() prints version information about R and any attached packages. It’s a good practice to always run this command at the end of your R session and record it for the sake of reproducibility in the future.

sessionInfo()

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] gageData_2.16.0            gage_2.28.2               
##  [3] pathview_1.18.2            org.Hs.eg.db_3.5.0        
##  [5] AnnotationDbi_1.40.0       DESeq2_1.18.1             
##  [7] SummarizedExperiment_1.8.1 DelayedArray_0.4.1        
##  [9] matrixStats_0.52.2         Biobase_2.38.0            
## [11] GenomicRanges_1.30.1       GenomeInfoDb_1.14.0       
## [13] IRanges_2.12.0             S4Vectors_0.16.0          
## [15] BiocGenerics_0.24.0       
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.3.1             bit64_0.9-7            splines_3.4.1         
##  [4] Formula_1.2-2          latticeExtra_0.6-28    blob_1.1.0            
##  [7] GenomeInfoDbData_1.0.0 yaml_2.1.16            RSQLite_2.0           
## [10] backports_1.1.2        lattice_0.20-35        digest_0.6.14         
## [13] RColorBrewer_1.1-2     XVector_0.18.0         checkmate_1.8.5       
## [16] colorspace_1.3-2       htmltools_0.3.6        Matrix_1.2-12         
## [19] plyr_1.8.4             XML_3.98-1.9           pkgconfig_2.0.1       
## [22] genefilter_1.60.0      zlibbioc_1.24.0        xtable_1.8-2          
## [25] scales_0.5.0           BiocParallel_1.12.0    htmlTable_1.11.2      
## [28] tibble_1.3.4           annotate_1.56.1        KEGGREST_1.18.0       
## [31] ggplot2_2.2.1          nnet_7.3-12            lazyeval_0.2.1        
## [34] survival_2.41-3        magrittr_1.5           memoise_1.1.0         
## [37] evaluate_0.10.1        KEGGgraph_1.38.0       foreign_0.8-69        
## [40] graph_1.56.0           tools_3.4.1            data.table_1.10.4-3   
## [43] stringr_1.2.0          munsell_0.4.3          locfit_1.5-9.1        
## [46] cluster_2.0.6          Biostrings_2.46.0      compiler_3.4.1        
## [49] rlang_0.1.6            grid_3.4.1             RCurl_1.95-4.10       
## [52] rstudioapi_0.7         htmlwidgets_1.0        bitops_1.0-6          
## [55] base64enc_0.1-3        rmarkdown_1.8          gtable_0.2.0          
## [58] DBI_0.7                R6_2.2.2               gridExtra_2.3         
## [61] knitr_1.18             bit_1.1-12             Hmisc_4.1-1           
## [64] rprojroot_1.3-2        Rgraphviz_2.22.0       stringi_1.1.6         
## [67] Rcpp_0.12.15           png_0.1-7              geneplotter_1.56.0    
## [70] rpart_4.1-12           acepack_1.4.1