Data wrangling in R - the tidy way!

Outline

Intro. to Nanopore direct RNA sequencing and today’s dataset (see powerpoint)
Intro. to tidyverse
- What is tidyverse and why use it?
- Pipes
Key tidyverse commands
- Filter and select
- Rename, mutate, summarise
- Join
- Pivot longer and pivot wider
Practice with ggplot2
Resources

3. Key tidyverse commands

# Install tidyverse if you haven't already
# install.packages("tidyverse")

# load libraries
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# read in data
nanopolish = read_tsv("polya_results_with_ERCC.tsv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   readname = col_character(),
##   contig = col_character(),
##   position = col_double(),
##   leader_start = col_double(),
##   adapter_start = col_double(),
##   polya_start = col_double(),
##   transcript_start = col_double(),
##   read_rate = col_double(),
##   polya_length = col_double(),
##   qc_tag = col_character()
## )

nanopolish

## # A tibble: 191,157 x 10
##    readname contig position leader_start adapter_start polya_start
##    <chr>    <chr>     <dbl>        <dbl>         <dbl>       <dbl>
##  1 e7f03c0… ENSMU…      236           -1            -1          -1
##  2 649de85… ENSMU…       45            2             3        6961
##  3 69bd9ca… ENSMU…       71            2             3        5489
##  4 eebb6a9… ENSMU…        3          354           693        4852
##  5 ef4c72f… ENSMU…      162            2             3        4592
##  6 45d2338… ENSMU…      139            2             3        7035
##  7 dc9cf0c… ENSMU…      192            2             3        6198
##  8 bb9d4ba… ENSMU…      291            3             4        4771
##  9 0a5988a… ENSMU…       91            3             4        7763
## 10 e5136e3… ENSMU…       26            2             3        8007
## # … with 191,147 more rows, and 4 more variables: transcript_start <dbl>,
## #   read_rate <dbl>, polya_length <dbl>, qc_tag <chr>

Filter rows and select columns

# How many polyA reads pass QC?
filtered_nanopolish = nanopolish %>%
  filter(polya_length > 0, qc_tag == "PASS")

# Export read-ids of passQC polyA reads
passQC_polyA_readIDs = nanopolish %>%
  filter(polya_length > 0, qc_tag == "PASS") %>%
  select(-readname, -polya_length)
write_tsv(passQC_polyA_readIDs, "polya_results_onlypassQC.tsv")

Rename, mutate, summarise

# Let's rename the "contig" column to "transcript"
renamed_nanopolish = filtered_nanopolish %>%
  rename("transcript" = contig)

# Let's calculate read_time and polya_end
# read_time should be equal to read_rate * polya_length
# polya_end = polya_start + polya_length
calculated_nanopolish = filtered_nanopolish %>%
  mutate(read_time = read_rate * polya_length,
         polya_end = polya_start + polya_length)

# Summarize data by transcript
summarized_mrna = renamed_nanopolish %>%
  group_by(transcript) %>%
  summarise(numreads = length(unique(readname)),
            arithmean = mean(polya_length),
            sd = sd(polya_length),
            median = median(polya_length))

## `summarise()` ungrouping output (override with `.groups` argument)

Join

# What if we were interested in gene-level stats?
# I made a "dictionary" that translates transcript ID to gene ID and gene symbol, so let's read it in
dictionary = read_tsv("dictionary.tsv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   ensembl_gene_id = col_character(),
##   external_gene_name = col_character(),
##   ensembl_transcript_id_version = col_character()
## )

# Join our renamed_nanopolish dataframe with the dictionary dataframe we just made by transcript ID
nanopolish_gene = left_join(renamed_nanopolish, dictionary,
                            by = c("transcript" = "ensembl_transcript_id_version"))

# Take a look at the documentation for join. What are the different flavors of join and how do they differ?

Now you try it!

# Try summarizing this nanopolish_gene dataframe like we did with our renamed_nanopolish dataframe previously

# What if I was only interested in genes with a minimum of 10 reads and only wanted to know their mean polyA tail length and standard deviation? Can you filter your summarized dataframe and select the appropriate columns?

4.) Practice with ggplot2

Let’s plot our transcript-level stats!

# What is the distribution of polyA tail lengths binned by transcript?
# Let's make a histogram
summarized_mrna %>%
  ggplot(aes(x = arithmean)) +
  geom_histogram(binwidth = 1, fill = "steelblue") +
  labs(title = "Distribution of mean* polyA tail lengths binned by transcript",
       x = "Mean polyA tail length",
       y = "Number of reads")

Pivot_longer and pivot_wider

# What is the distribution of all the polyA tail length stats we calculated?
# Let's make violin plots
# But wait! We need to first reformat our dataframe by pivoting longer.
df_to_plot = summarized_mrna %>%
  select(-numreads) %>%
  pivot_longer(-transcript, names_to = "stat", values_to = "value")

df_to_plot %>%
  ggplot(aes(x = stat, y = value, color = stat, fill = stat)) +
  geom_violin() +
  labs(title = "Transcript-level polyA tail length statistics",
       x = "Statistic",
       y = "Value")

## Warning: Removed 5509 rows containing non-finite values (stat_ydensity).

# Pivot_wider does the exact opposite of pivot_longer so we can return our df to its original format
original = df_to_plot %>%
  pivot_wider(names_from = "stat", values_from = "value")

5. Resources

For visualization of tidy functions (helpful for pivot_longer/pivot_wider aka gather/spread): https://github.com/gadenbuie/tidyexplain
Cheatsheets: https://www.rstudio.com/resources/cheatsheets/

Data wrangling in R - the tidy way!

Kat Lee

5/11/2021

Outline

3. Key tidyverse commands

Filter rows and select columns

Rename, mutate, summarise

Join

Now you try it!

4.) Practice with ggplot2

Pivot_longer and pivot_wider

5. Resources