Day 3 Practice - Solutions for Practice Examples

I-1) Exploring FASTQ files in UNIX.

5. What are the barcodes that appear most frequently?

  • Answer : For the full command, see below (scroll to right to see all)

    zcat data/bioboot_2015a_R2.fastq.gz | awk 'NR % 4 == 2 {print;}' | sort | uniq -c | sort -n -r | less
  • Expected Output : First several lines are shown below.

     20442 TGACCAA
     20095 CAGATCA
     17475 TTAGGCA
     16342 ACAGTGA
        91 ACAGTGC
        74 GCCCATA
        72 ACCGTGA
        69 TTAGCAT
        69 CAGATCC
     ...

6a. Skeleton Code for FASTQ Demultiplexing

  • Below is the skeleton python code you can start with
    import gzip
    
    bc2id = { ... }  # dictionary to map barcodes to sample IDs
    outfs = {}       # dictionary to map barcodes to output file handles
    for bc in bc2id:
    	outfs[bc] = gzip.open( ... )  # open output file for each barcode
    unknown = gzip.open( ... )        # open output file for unknown file
    
    f1 = gzip.open( ... )  # open FASTQ file for the first read
    f2 = gzip.open( ... )  # open FASTQ file for the second read
    
    for line in f2:
       # fill in the code below in the following way
       # 1. Read barcode information from second read
       # 2. determine the file handle to write to based on the barcode
       # 3. Read from the first FASTQ and write to the file handle
    
    # close output handle if needed 

6b. Answer for FASTQ Demultiplexing - Exact Match

  • Below is the example solution based on exact match
    import gzip
    
    bc2id = {'ACAGTGA' : 'Sample1', 'CAGATCA' : 'Sample2' , 'GCCAATA' : 'Sample3', 'TGACCAA' : 'Sample4' ,'TTAGGCA' : 'Sample5'}
    outfs = {}
    for bc in bc2id:   ## for each barcode, create a filehandle and store to a dictionary
      outfs[bc] = gzip.open('out/bioboot_2015a_'+bc2id[bc]+'.fastq.gz','w')
    outunknown = gzip.open('out/bioboot_2015a_UNKNOWN.fastq.gz','w')
    
    f1 = gzip.open('data/bioboot_2015a_R1.fastq.gz','r')  ## read the first FASTQ
    f2 = gzip.open('data/bioboot_2015a_R2.fastq.gz','r')  ## read the second FASTQ
    
    for rname2 in f2:    ## iterate over second FASTQ file (read name first)
      seq2 = f2.readline().rstrip()  ## next line is barcode sequence (remove end-of-line EOL marker)
      dummy2 = f2.readline()         ## next line is '+'
      qual2 = f2.readline()          ## next line is quality
      outf = outunknown              ## initially consider as output
      if ( seq2 in outfs ):          ## if the barcode is in the dictionary
        outf = outfs[seq2]           ## change the output file handle adequately
      outf.write(f1.readline())      ## read four lines from the first FASTQ 
      outf.write(f1.readline())      ##   and write to the designated file handle
      outf.write(f1.readline())
      outf.write(f1.readline())
    
    outunknown.close()               ## need to close the file handles
    for bc in outfs:
      outfs[bc].close() 
  • See the expected output. (Number of unclassified reads)
    $ zcat data/bioboot_2015a_R1.fastq.gz | awk 'NR % 4 == 2 {print;}' | wc -l
    100000
    $ zcat zcat out/bioboot_2015a_UNKNOWN.fastq.gz | awk 'NR % 4 == 2 {print;}' | wc -l
    4976 
    • 4.98% of reads are unclassified.

6c. Answer for FASTQ Demultiplexing - Allowing one mismatch

  • Below is the example solution allowing up to one mismatch
    import gzip
    
    bc2id = {'ACAGTGA' : 'Sample1', 'CAGATCA' : 'Sample2' , 'GCCAATA' : 'Sample3', 'TGACCAA' : 'Sample4' ,'TTAGGCA' : 'Sample5'}
    outfs = {}
    nucleotides = ('N','A','C','G','T') 
    for bc in bc2id:
      f = gzip.open('out/bioboot_2015a_'+bc2id[bc]+'.fastq.gz','w')
      outfs[bc] = f
      bclist = list(bc)            # convert string to list
      for i in range(len(bclist)): # choose position to substitute
        for n in nucleotides:      # choose a nucleotide substitute to
          bclist[i] = n
          newbc = "".join(bclist)
          if ( newbc != bc ):
            outfs[newbc] = f   # add dictionary with additional barcode with one mismatch
      bclist[i] = bc[i]        # convert the character back to the original one
    outunknown = gzip.open('out/bioboot_2015a_UNKNOWN.fastq.gz','w')
    
    f1 = gzip.open('data/bioboot_2015a_R1.fastq.gz')  # the rest is identical to the previous one
    f2 = gzip.open('data/bioboot_2015a_R2.fastq.gz')
    
    for rname2 in f2:
      seq2 = f2.readline().rstrip()
      dummy2 = f2.readline()
      qual2 = f2.readline()
      outf = outunknown
      if ( seq2 in outfs ):
        outf = outfs[seq2]
      outf.write(f1.readline())
      outf.write(f1.readline())
      outf.write(f1.readline())
      outf.write(f1.readline())
    
    outunknown.close()
    for bc in outfs:
      outfs[bc].close() 
  • See the expected output. (Number of unclassified reads)
    $ zcat data/bioboot_2015a_R1.fastq.gz | awk 'NR % 4 == 2 {print;}' | wc -l
    100000
    $ zcat zcat out/bioboot_2015a_UNKNOWN.fastq.gz | awk 'NR % 4 == 2 {print;}' | wc -l
    3556 
    • 3.56% of reads are unclassified.

III-1) Understanding how to access FASTA files.

7. Are the two subsets of sequences in the above identical to each other?

  • You can extract the two sequences into separate files using samtools

    $ samtools faidx data/hs37d5.fa 20:10000000-10001000 > out/day3.extracted.hs37d5.txt
    $ samtools faidx data/ucsc.hg19.fasta chr20:10000000-10001000 > out/day3.extracted.hg19.txt 
  • Running diff on these data will highlight many differences in case.

    $ diff out/day3.extracted.hs37d5.txt out/day3.extracted.hg19.txt
    1,10c1,10
    < >20:10000000-10001000
    < TTGTTTACTACAGCTTTGTAGTAAATTTTGAACTCTAAAGTGTTAGTTCTCTAACTTTGT
    < TTGTTTTTCAAGAGTGTTTTGACTCTTCTTACTGCATCCCTTGCATTTCCATATGGACTT
    < TATAATCAGCCATGTCAACTTCTGCAAGAAAGACAGCTAGGATTTTGATAAGGATTGTGT
    < TGAATCTGTAGTCCAATTTTGGGAATACTACCATGTTAACATCGTCTTCCCATCCATGCA
    < CGTGCAATAGCTTACCATTTATTTGGTCTTCCTCAATTTCTGTCAACAATGATTTGTAGT
    < TTTCAGTTGCAAGTCTTGCACTTCTTTTGTTAAACTTTTTCCAAATATTTATTCTTCTTA
    < ATTACATAATTCTCATAATTAATAAAATTCTGAAATTTTCTTAATTTCATTTTTGTGGCC
    < ATGCACTTTAAAACTTGCCTTTTAACAAGACTTCCAGATGATTCTTGTGAACATTAAGTT
    < TGTGAAGTGCTTCTCTATTGACAAACTGAACTCACGTGACATTCCACACAGCACTGGCAA
    ---
    > >chr20:10000000-10001000
    > Ttgtttactacagctttgtagtaaattttgaactctaaagtgttagttctctaactttgt
    > ttgtttttcaagagtgttttgactcttcttactgcatcccttgcatttccatatggactt
    > tataatcagccatgtcaacttctgcaagaaagacagctaggattttgataaggattgtgt
    > tgaatctgtagtccaattttgggaatactaccatgttaacatcgtcttcccatccatgca
    > cgtgcaatagcttaccatttatttggtcttcctcaatttctgtcaacaatgatttgtagt
    > tttcagttgcaagtcttgcacttcttttgttaaactttttccaaatatttattcttctta
    > attacataattctcataattaataaaattctgaaattttcttaatttcatttttgTGGCC
    > ATGCACTttaaaacttgccttttaacaagacttccagatgattcttgtgaacattaagtt
    > tgtgaagtgctTCTCTATTGACAAACTGAACTCACGTGACATTCCACACAGCACTGGCAA
    17,18c17,18
    < CTTTCTGGTTCCAGATCTGACACCAATTACCTTTCTGATTTTGGACAAACCACTTAATAT
    < TTGTAACTTACAATTACTTCAACTGAATAATAAAAGAATTG
    ---
    > CTttctggttccagatctgacaccaattacctttctgattttggacaaaccacttaatat
    > ttgtaacttacaattacttcaactgaataataaaagaattg 
  • Changing the UCSC (hg19) sequences to uppercases will show that they are identical.

    $ samtools faidx data/ucsc.hg19.fasta chr20:10000000-10001000 | awk '{ print toupper($0); }' > out/day3.extracted.hg19.txt
    $ diff out/day3.extracted.hs37d5.txt out/day3.extracted.hg19.txt 
    1c1
    < >20:10000000-10001000
    ---
    > >CHR20:10000000-10001000 

III-3) Representing genes and transcripts using GTF and genePred format

2a. How many total refFlat entries exist? What is the unique number of canonical gene names and transcript IDs that exist in the refFlat database?

  • Number of total refFlat entries
    $ zcat refFlat.txt.gz | wc -l
    54196 
  • Number of unique canonical gene names
    $ zcat refFlat.txt.gz | cut -f 1 | sort | uniq | wc -l
    26397 
  • Number of unique transcript IDs
    $ zcat refFlat.txt.gz | cut -f 2 | sort | uniq | wc -l
    50354 

2b. How many different transcript exist for LDLR gene? How are they similar and different?

  • List all transcripts with canonical gene name LDLR
    $ zcat refFlat.txt.gz | grep -w LDLR
    LDLR	NM_001195799	chr19	+	11200037	11244505	11200224	11241992	17	11200037,11210898,11215895,11217240,11218067,11221327,11222189,11223953,11224210,11226769,11227534,11230767,11231045,11233849,11238683,11240188,11241956,	11200291,11211021,11216276,11217363,11218190,11221447,11222315,11224125,11224438,11226888,11227674,11230909,11231198,11234020,11238761,11240346,11244505,
    LDLR	NM_001195798	chr19	+	11200037	11244505	11200224	11241992	18	11200037,11210898,11213339,11215895,11217240,11218067,11221327,11222189,11223953,11224210,11226769,11227534,11230767,11231045,11233849,11238683,11240188,11241962,	11200291,11211021,11213462,11216276,11217363,11218190,11221447,11222315,11224125,11224438,11226888,11227674,11230909,11231198,11234020,11238761,11240346,11244505,
    LDLR	NM_001195803	chr19	+	11200037	11244505	11200224	11241992	16	11200037,11210898,11213339,11217240,11218067,11221327,11222189,11223953,11224210,11226769,11227534,11230767,11233849,11238683,11240188,11241956,	11200291,11211021,11213462,11217363,11218190,11221447,11222315,11224125,11224438,11226888,11227674,11230909,11234020,11238761,11240346,11244505,
    LDLR	NM_001195800	chr19	+	11200037	11244505	11200224	11241992	16	11200037,11210898,11213339,11218067,11221327,11222189,11223953,11224210,11226769,11227534,11230767,11231045,11233849,11238683,11240188,11241956,	11200291,11211021,11213462,11218190,11221447,11222315,11224125,11224438,11226888,11227674,11230909,11231198,11234020,11238761,11240346,11244505,
    LDLR	NM_000527	chr19	+	11200037	11244505	11200224	11241992	18	11200037,11210898,11213339,11215895,11217240,11218067,11221327,11222189,11223953,11224210,11226769,11227534,11230767,11231045,11233849,11238683,11240188,11241956,	11200291,11211021,11213462,11216276,11217363,11218190,11221447,11222315,11224125,11224438,11226888,11227674,11230909,11231198,11234020,11238761,11240346,11244505, 
  • They all have the same start/end positions, but the number of exons and coding sequences are slightly different.

2c. For multiple transcripts with the same canonical gene names, are they always located in the same chromosome or not?

  • First, the total number of canonical gene names is 26,397
    $ zcat refFlat.txt.gz | cut -f 1 | sort | uniq | wc -l
    26397 
  • Next, the total number of unique chromosome x canonical gene names is 27,843
    $ cat refFlat.txt.gz | cut -f 1,3 | sort | uniq | wc -l
    27843 
  • This means that up to 1,446 genes are duplicated across multiple chromosomes.

4a. How many genes, transcripts, exons, and CDS exists in GENCODE v19 GTF?

  • You can check the third column in the GTF
    $ zcat data/gencode.v19.annotation.gtf.gz | grep -v ^# | cut -f 3 | sort | uniq -c
    723784 CDS
    1196293 exon
    57820 gene
    114 Selenocysteine
    84144 start_codon
    76196 stop_codon
    196520 transcript
    284573 UTR 

III-4) Working with BED files.

3. Write a python program that print out lines that overlaps with a region.

import sys

script, chrom, beg, end = sys.argv

f = open('refFlat.bed','r')
for line in f:
    tok = line.split()
    if ( ( tok[0] == chrom ) and ( ( tok[2] > beg ) and ( tok[1] < end ) ) ):
        print line,
f.close()