Day 3 Practice - Solutions for Practice Examples

I-1) Exploring FASTQ files in UNIX.

5. What are the barcodes that appear most frequently?

Answer : For the full command, see below (scroll to right to see all)

zcat data/bioboot_2015a_R2.fastq.gz | awk 'NR % 4 == 2 {print;}' | sort | uniq -c | sort -n -r | less

Expected Output : First several lines are shown below.

 20442 TGACCAA
 20095 CAGATCA
 17475 TTAGGCA
 16342 ACAGTGA
    91 ACAGTGC
    74 GCCCATA
    72 ACCGTGA
    69 TTAGCAT
    69 CAGATCC
 ...

6a. Skeleton Code for FASTQ Demultiplexing

Below is the skeleton python code you can start with

import gzip

bc2id = { ... }  # dictionary to map barcodes to sample IDs
outfs = {}       # dictionary to map barcodes to output file handles
for bc in bc2id:
	outfs[bc] = gzip.open( ... )  # open output file for each barcode
unknown = gzip.open( ... )        # open output file for unknown file

f1 = gzip.open( ... )  # open FASTQ file for the first read
f2 = gzip.open( ... )  # open FASTQ file for the second read

for line in f2:
   # fill in the code below in the following way
   # 1. Read barcode information from second read
   # 2. determine the file handle to write to based on the barcode
   # 3. Read from the first FASTQ and write to the file handle

# close output handle if needed

6b. Answer for FASTQ Demultiplexing - Exact Match

Below is the example solution based on exact match

import gzip

bc2id = {'ACAGTGA' : 'Sample1', 'CAGATCA' : 'Sample2' , 'GCCAATA' : 'Sample3', 'TGACCAA' : 'Sample4' ,'TTAGGCA' : 'Sample5'}
outfs = {}
for bc in bc2id:   ## for each barcode, create a filehandle and store to a dictionary
  outfs[bc] = gzip.open('out/bioboot_2015a_'+bc2id[bc]+'.fastq.gz','w')
outunknown = gzip.open('out/bioboot_2015a_UNKNOWN.fastq.gz','w')

f1 = gzip.open('data/bioboot_2015a_R1.fastq.gz','r')  ## read the first FASTQ
f2 = gzip.open('data/bioboot_2015a_R2.fastq.gz','r')  ## read the second FASTQ

for rname2 in f2:    ## iterate over second FASTQ file (read name first)
  seq2 = f2.readline().rstrip()  ## next line is barcode sequence (remove end-of-line EOL marker)
  dummy2 = f2.readline()         ## next line is '+'
  qual2 = f2.readline()          ## next line is quality
  outf = outunknown              ## initially consider as output
  if ( seq2 in outfs ):          ## if the barcode is in the dictionary
    outf = outfs[seq2]           ## change the output file handle adequately
  outf.write(f1.readline())      ## read four lines from the first FASTQ 
  outf.write(f1.readline())      ##   and write to the designated file handle
  outf.write(f1.readline())
  outf.write(f1.readline())

outunknown.close()               ## need to close the file handles
for bc in outfs:
  outfs[bc].close()

See the expected output. (Number of unclassified reads)

$ zcat data/bioboot_2015a_R1.fastq.gz | awk 'NR % 4 == 2 {print;}' | wc -l
100000
$ zcat zcat out/bioboot_2015a_UNKNOWN.fastq.gz | awk 'NR % 4 == 2 {print;}' | wc -l
4976

4.98% of reads are unclassified.

6c. Answer for FASTQ Demultiplexing - Allowing one mismatch

Below is the example solution allowing up to one mismatch

import gzip

bc2id = {'ACAGTGA' : 'Sample1', 'CAGATCA' : 'Sample2' , 'GCCAATA' : 'Sample3', 'TGACCAA' : 'Sample4' ,'TTAGGCA' : 'Sample5'}
outfs = {}
nucleotides = ('N','A','C','G','T') 
for bc in bc2id:
  f = gzip.open('out/bioboot_2015a_'+bc2id[bc]+'.fastq.gz','w')
  outfs[bc] = f
  bclist = list(bc)            # convert string to list
  for i in range(len(bclist)): # choose position to substitute
    for n in nucleotides:      # choose a nucleotide substitute to
      bclist[i] = n
      newbc = "".join(bclist)
      if ( newbc != bc ):
        outfs[newbc] = f   # add dictionary with additional barcode with one mismatch
  bclist[i] = bc[i]        # convert the character back to the original one
outunknown = gzip.open('out/bioboot_2015a_UNKNOWN.fastq.gz','w')

f1 = gzip.open('data/bioboot_2015a_R1.fastq.gz')  # the rest is identical to the previous one
f2 = gzip.open('data/bioboot_2015a_R2.fastq.gz')

for rname2 in f2:
  seq2 = f2.readline().rstrip()
  dummy2 = f2.readline()
  qual2 = f2.readline()
  outf = outunknown
  if ( seq2 in outfs ):
    outf = outfs[seq2]
  outf.write(f1.readline())
  outf.write(f1.readline())
  outf.write(f1.readline())
  outf.write(f1.readline())

outunknown.close()
for bc in outfs:
  outfs[bc].close()

See the expected output. (Number of unclassified reads)

$ zcat data/bioboot_2015a_R1.fastq.gz | awk 'NR % 4 == 2 {print;}' | wc -l
100000
$ zcat zcat out/bioboot_2015a_UNKNOWN.fastq.gz | awk 'NR % 4 == 2 {print;}' | wc -l
3556

3.56% of reads are unclassified.

III-1) Understanding how to access FASTA files.

7. Are the two subsets of sequences in the above identical to each other?

You can extract the two sequences into separate files using samtools

$ samtools faidx data/hs37d5.fa 20:10000000-10001000 > out/day3.extracted.hs37d5.txt
$ samtools faidx data/ucsc.hg19.fasta chr20:10000000-10001000 > out/day3.extracted.hg19.txt

Running diff on these data will highlight many differences in case.

$ diff out/day3.extracted.hs37d5.txt out/day3.extracted.hg19.txt
1,10c1,10
< >20:10000000-10001000
< TTGTTTACTACAGCTTTGTAGTAAATTTTGAACTCTAAAGTGTTAGTTCTCTAACTTTGT
< TTGTTTTTCAAGAGTGTTTTGACTCTTCTTACTGCATCCCTTGCATTTCCATATGGACTT
< TATAATCAGCCATGTCAACTTCTGCAAGAAAGACAGCTAGGATTTTGATAAGGATTGTGT
< TGAATCTGTAGTCCAATTTTGGGAATACTACCATGTTAACATCGTCTTCCCATCCATGCA
< CGTGCAATAGCTTACCATTTATTTGGTCTTCCTCAATTTCTGTCAACAATGATTTGTAGT
< TTTCAGTTGCAAGTCTTGCACTTCTTTTGTTAAACTTTTTCCAAATATTTATTCTTCTTA
< ATTACATAATTCTCATAATTAATAAAATTCTGAAATTTTCTTAATTTCATTTTTGTGGCC
< ATGCACTTTAAAACTTGCCTTTTAACAAGACTTCCAGATGATTCTTGTGAACATTAAGTT
< TGTGAAGTGCTTCTCTATTGACAAACTGAACTCACGTGACATTCCACACAGCACTGGCAA
---
> >chr20:10000000-10001000
> Ttgtttactacagctttgtagtaaattttgaactctaaagtgttagttctctaactttgt
> ttgtttttcaagagtgttttgactcttcttactgcatcccttgcatttccatatggactt
> tataatcagccatgtcaacttctgcaagaaagacagctaggattttgataaggattgtgt
> tgaatctgtagtccaattttgggaatactaccatgttaacatcgtcttcccatccatgca
> cgtgcaatagcttaccatttatttggtcttcctcaatttctgtcaacaatgatttgtagt
> tttcagttgcaagtcttgcacttcttttgttaaactttttccaaatatttattcttctta
> attacataattctcataattaataaaattctgaaattttcttaatttcatttttgTGGCC
> ATGCACTttaaaacttgccttttaacaagacttccagatgattcttgtgaacattaagtt
> tgtgaagtgctTCTCTATTGACAAACTGAACTCACGTGACATTCCACACAGCACTGGCAA
17,18c17,18
< CTTTCTGGTTCCAGATCTGACACCAATTACCTTTCTGATTTTGGACAAACCACTTAATAT
< TTGTAACTTACAATTACTTCAACTGAATAATAAAAGAATTG
---
> CTttctggttccagatctgacaccaattacctttctgattttggacaaaccacttaatat
> ttgtaacttacaattacttcaactgaataataaaagaattg

Changing the UCSC (hg19) sequences to uppercases will show that they are identical.

$ samtools faidx data/ucsc.hg19.fasta chr20:10000000-10001000 | awk '{ print toupper($0); }' > out/day3.extracted.hg19.txt
$ diff out/day3.extracted.hs37d5.txt out/day3.extracted.hg19.txt 
1c1
< >20:10000000-10001000
---
> >CHR20:10000000-10001000

III-3) Representing genes and transcripts using GTF and genePred format

2a. How many total refFlat entries exist? What is the unique number of canonical gene names and transcript IDs that exist in the refFlat database?

Number of total refFlat entries
```
$ zcat refFlat.txt.gz | wc -l
54196 
```

Number of unique canonical gene names

$ zcat refFlat.txt.gz | cut -f 1 | sort | uniq | wc -l
26397

Number of unique transcript IDs

$ zcat refFlat.txt.gz | cut -f 2 | sort | uniq | wc -l
50354

2b. How many different transcript exist for LDLR gene? How are they similar and different?

List all transcripts with canonical gene name LDLR

$ zcat refFlat.txt.gz | grep -w LDLR
LDLR	NM_001195799	chr19	+	11200037	11244505	11200224	11241992	17	11200037,11210898,11215895,11217240,11218067,11221327,11222189,11223953,11224210,11226769,11227534,11230767,11231045,11233849,11238683,11240188,11241956,	11200291,11211021,11216276,11217363,11218190,11221447,11222315,11224125,11224438,11226888,11227674,11230909,11231198,11234020,11238761,11240346,11244505,
LDLR	NM_001195798	chr19	+	11200037	11244505	11200224	11241992	18	11200037,11210898,11213339,11215895,11217240,11218067,11221327,11222189,11223953,11224210,11226769,11227534,11230767,11231045,11233849,11238683,11240188,11241962,	11200291,11211021,11213462,11216276,11217363,11218190,11221447,11222315,11224125,11224438,11226888,11227674,11230909,11231198,11234020,11238761,11240346,11244505,
LDLR	NM_001195803	chr19	+	11200037	11244505	11200224	11241992	16	11200037,11210898,11213339,11217240,11218067,11221327,11222189,11223953,11224210,11226769,11227534,11230767,11233849,11238683,11240188,11241956,	11200291,11211021,11213462,11217363,11218190,11221447,11222315,11224125,11224438,11226888,11227674,11230909,11234020,11238761,11240346,11244505,
LDLR	NM_001195800	chr19	+	11200037	11244505	11200224	11241992	16	11200037,11210898,11213339,11218067,11221327,11222189,11223953,11224210,11226769,11227534,11230767,11231045,11233849,11238683,11240188,11241956,	11200291,11211021,11213462,11218190,11221447,11222315,11224125,11224438,11226888,11227674,11230909,11231198,11234020,11238761,11240346,11244505,
LDLR	NM_000527	chr19	+	11200037	11244505	11200224	11241992	18	11200037,11210898,11213339,11215895,11217240,11218067,11221327,11222189,11223953,11224210,11226769,11227534,11230767,11231045,11233849,11238683,11240188,11241956,	11200291,11211021,11213462,11216276,11217363,11218190,11221447,11222315,11224125,11224438,11226888,11227674,11230909,11231198,11234020,11238761,11240346,11244505,

They all have the same start/end positions, but the number of exons and coding sequences are slightly different.

2c. For multiple transcripts with the same canonical gene names, are they always located in the same chromosome or not?

First, the total number of canonical gene names is 26,397

$ zcat refFlat.txt.gz | cut -f 1 | sort | uniq | wc -l
26397

Next, the total number of unique chromosome x canonical gene names is 27,843
```
$ cat refFlat.txt.gz | cut -f 1,3 | sort | uniq | wc -l
27843 
```
This means that up to 1,446 genes are duplicated across multiple chromosomes.

4a. How many genes, transcripts, exons, and CDS exists in GENCODE v19 GTF?

You can check the third column in the GTF

$ zcat data/gencode.v19.annotation.gtf.gz | grep -v ^# | cut -f 3 | sort | uniq -c
723784 CDS
1196293 exon
57820 gene
114 Selenocysteine
84144 start_codon
76196 stop_codon
196520 transcript
284573 UTR

III-4) Working with BED files.

3. Write a python program that print out lines that overlaps with a region.

import sys

script, chrom, beg, end = sys.argv

f = open('refFlat.bed','r')
for line in f:
    tok = line.split()
    if ( ( tok[0] == chrom ) and ( ( tok[2] > beg ) and ( tok[1] < end ) ) ):
        print line,
f.close()