I-1 Exploring FASTQ files in UNIX.
In this part, we will learn how to view FASTQ files from UNIX file system.
- Login to the server, as you learned from Day 1, using
    - Terminal (Mac OS X)
- MobaXTerm (Windows)
- or any other SSH client you prefer.
 $ ssh [your_id]@flux-login.engin.umich.edu (...Enter login credentials...) $ mkdir --p bioboot/day3 $ cd bioboot/day3 
- 
    Create a symbolic link to the directory containing the input files, and add PATH$ ln -s /home/hmkang/bioboot/data/ data - `ln -s [target] [link_name]` creates a *shortcut* of the target file. - See `man ln` to see the detailed usage of `ln`$ export PATH=/home/hmkang/bioboot/bin:${PATH}- Verify that the `data` and `bin` directories are correctly configured.$ ls data/ ls data/ ALL.chr20.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.annotation.vcf.gz gencode.v19.annotation.gtf.gz hs37d5.fa.fai ALL.chr20.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.annotation.vcf.gz.tbi HG00096.chrom20.ILLUMINA.bwa.GBR.exome.20120522.bam ucsc.hg19.fasta bioboot_2015a_R1.fastq.gz HG00096.chrom20.ILLUMINA.bwa.GBR.exome.20120522.bam.bai ucsc.hg19.fasta.fai bioboot_2015a_R2.fastq.gz hs37d5.fa $ which samtools /home/hmkang/bioboot/bin/samtools(Note) You can scroll the box above to the right 
- 
    Examine the contents of the FASTQ files $ zless data/bioboot_2015a_R2.fastq.gz - 
        What do you see? Can you interpret what each line means? 
- Let’s check whether the two FASTQ files are paired.
        $ zcat data/bioboot_2015a_R1.fastq.gz | wc -l $ zcat data/bioboot_2015a_R2.fastq.gz | wc -l 
- Do you think the files are paired? Why?
- Let’s check another way.
        $ zcat data/bioboot_2015a_R1.fastq.gz | head $ zcat data/bioboot_2015a_R2.fastq.gz | head 
- What additional information does it tell us?
 
- 
        
- Display only the sequence read information from FASTQ
    - What should we display? Which line numbers?
 $ zcat data/bioboot_2015a_R2.fastq.gz | awk 'NR % 4 == 2 {print;}' | less- awkis a script language useful for quick data manipulation.
- The script above takes each line as input, and print the 2nd line of every four lines (line number NR is 4N+2).
- We won’t introduce awktoday, but will use it in some occasions explaining specific details.- For more information about awk, This link could be a good starting point.
 
- For more information about 
- sedand- perlone-liners are also useful tools for manipulating data easily- …without having to write a separate program!
 
 
- (DIY) What are the barcodes that appear most frequently?
    - Given : Your input file is data/bioboot_2015a_R2.fastq.gz
- Want : Use UNIX utilities you learned plus awkto find out the sequences that most frequently occurs int the file.
- Hint 1 : You need to use UNIX pipe multiple times, like the example below
        $ zcat data/bioboot_2015a_R2.fastq.gz | awk 'NR % 4 == 2 {print;}' | *** DO SOMETHING HERE *** | less
- Hint 2 : sortutility sorts input lines alphabetically by default. Adding-nsorts input lines numerically. Adding-rsorts input in decreasing order. Seeman sortfor detailed usage.
- Hint 3 : uniqutility filter redundant lines appearing consecutively.uniq -cprovides the number of repeats for each input line. Seeman uniqfor detailed usage
- Need help or done? CLICK HERE FOR THE ANSWER
 
- Given : Your input file is 
- (DIY) Demultiplex a pair of FASTQ files.
    - Given
        - You have a pair of FASTQ files, data/bioboot_2015a_R1.fastq.gzanddata/bioboot_2015a_R2.fastq.gz
- The first file (51bp) contains actual sequence reads.
- The second file (7bp) contains sample barcodes.
            Sample1 ACAGTGA Sample2 CAGATCA Sample3 GCCAATA Sample4 TGACCAA Sample5 TTAGGCA 
- Any pair of barcodes differ by 4 or more nucleotides.
 
- You have a pair of FASTQ files, 
- Want
        - Write a python program that splits the first FASTQ files into six parts.
            - Five for each sample
- and another one for ‘UNKNOWN’ samples.
 
- Find out how many reads belong to each part.
 
- Write a python program that splits the first FASTQ files into six parts.
            
- 
        Hint 1 : Use gziplibrary to read or write compressed files.import gzip fi = gzip.open(input_file_name, 'r') // for reading fo = gzip.open(output_file_name, 'w') // for writing 
- Hint 2 : Simplest way is split based on exact match.
        - If the barcode does not match to any of the sample, consider as UNKNOWN.
- This approach may lose may reads if barcodes have sequencing errors.
- Need more hints? CLICK HERE FOR A SKELETON CODE
- Need a solution? CLICK HERE FOR A SOLUTION
 
- Hint 3 : Allow up to one mismatch in the barcode
        - Because any pair of barcodes are different by 4 or more nucleotides, if there was one base call error in the barcode, it can be unambiguously resolved.
- By modifying outfsfrom the exact match code, we can allow up to one mismatch (how?)
- Need a solution? CLICK HERE FOR A SOLUTION
 
- Bored? Try an advanced version : Use the base quality scores to probablistically calculate the best-matching barcode.
        - You can use ordfunction to convert characters to ASCII code integer.
- Calculate the likelihood of barcode reads for each case to determine the best matching one.
- What would be advantages and disadvantages for this approach?
 
- You can use 
 
- Given
        
Congratulations! Well done in exploring FASTQ files in UNIX!
Now let’s go back to Day 3 Overview.