III-1. Understanding how to access FASTA files
In this part, we will learn how to access FASTA files for genomic analysis.
- If you’re logged out, login back to the server.
$ ssh [your_id]@flux-login.engin.umich.edu (...Enter login credentials...) $ cd bioboot/day3 $ export PATH=/home/hmkang/bioboot/bin:${PATH}
-
Explore the 1000 GRCh37(+decoy) reference in the data directory.
$ less data/hs37d5.fa
This version of FASTA file is the most widely used version of FASTA file in DNA sequence mapping, including in the 1000 Genomes Project.
- (DIY) List all chromosome names available in the FASTA file.
- Hint 1 : use
grep
command. - Hint 2 : Adding
^
in the beginning of thegrep
query searches for the string at the beginning of the line. - Solution : You can list all chromosomes using the following command (make sure to include quotation marks)
$ grep "^>" data/hs37d5.fa | less
- Is this an efficient way to get the answer? Why?
- Hint 1 : use
- (DIY) Alternative method - use FASTA index files produced by
samtools
.samtools
provides a convenient way to explore FASTA files.- Running
samtools faidx
on a FASTA file will create.fai
file (you need a write permission to the directory).$ samtools faidx [input.fasta]
- FASTA index file is human-readable.
$ less data/hs37d5.fa.fai 1 249250621 52 60 61 2 243199373 253404903 60 61 3 198022430 500657651 60 61 4 191154276 701980507 60 61 5 180915260 896320740 60 61 6 171115067 1080251307 60 61 7 159138663 1254218344 60 61 8 146364022 1416009371 60 61 9 141213431 1564812846 60 61 10 135534747 1708379889 60 61 11 135006516 1846173603 60 61 ...
- What does each column represent?
- Chromosome (or sequence) name
- Number of nucleotides (or characters) in each chromosome (or sequence).
- The byte offset where each chromosome sequence starts
- For example, you can access the beginning of chromosome 2 this way.
$ tail -c +253404904 data/hs37d5.fa | head NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
- Number of bases pairs appearing each line.
- Number of bytes used for each line.
- Use
samtools faidx
to access an arbitrary region of genome.- If
.fai
file is available, you can randomly access the FASTA file, without scanning from the beginning of the file - For example, to look at the 1000 bases starting from the 10M-th bases in chromosome 20,
$ samtools faidx data/hs37d5.fa 20:10000000-10001000 >20:10000000-10001000 TTGTTTACTACAGCTTTGTAGTAAATTTTGAACTCTAAAGTGTTAGTTCTCTAACTTTGT TTGTTTTTCAAGAGTGTTTTGACTCTTCTTACTGCATCCCTTGCATTTCCATATGGACTT TATAATCAGCCATGTCAACTTCTGCAAGAAAGACAGCTAGGATTTTGATAAGGATTGTGT TGAATCTGTAGTCCAATTTTGGGAATACTACCATGTTAACATCGTCTTCCCATCCATGCA CGTGCAATAGCTTACCATTTATTTGGTCTTCCTCAATTTCTGTCAACAATGATTTGTAGT TTTCAGTTGCAAGTCTTGCACTTCTTTTGTTAAACTTTTTCCAAATATTTATTCTTCTTA ATTACATAATTCTCATAATTAATAAAATTCTGAAATTTTCTTAATTTCATTTTTGTGGCC ATGCACTTTAAAACTTGCCTTTTAACAAGACTTCCAGATGATTCTTGTGAACATTAAGTT TGTGAAGTGCTTCTCTATTGACAAACTGAACTCACGTGACATTCCACACAGCACTGGCAA ATTCTGTCCCGTAACTCCGCTAGCTCTCCAACAAGAAGTGACTTGACGCAGCCCAAGGTT ACTACTTAACACAATGAATTAAATGTTTTTAAATAAGAGGAAGCAATAAGATTCTAAAGG CTTTTCTGTTTTAATTTTCATGCAATGGAAAACTGGTATTAAATATCTATTTAATTAGGA GGAAACTACAATGCTGACTTTTGTCTGAATTATGTAGATAAGTGATTCATTTGAAACAAT TATTTTGATAATTGTCAATTATCCATTTCATTTTAATGCATTTTTTATTCTTTTTTCAAA AATAGCAACAATTACAACAGTTAAACCTTATAATGAATATGTTTCCTAAACCCTGTTCTA CTTTCTGGTTCCAGATCTGACACCAATTACCTTTCTGATTTTGGACAAACCACTTAATAT TTGTAACTTACAATTACTTCAACTGAATAATAAAAGAATTG
- If
- The FASTA file used above is GRCh37 sequence used for the 1000 Genomes project. Another widely used version of GRCh37 sequence can be downloaded from UCSC (hg19).
- The FASTA and index file is available.
$ less data/ucsc.hg19.fasta.fai chrM 16571 6 50 51 chr1 249250621 16915 50 51 chr2 243199373 254252555 50 51 chr3 198022430 502315922 50 51 chr4 191154276 704298807 50 51 chr5 180915260 899276175 50 51 chr6 171115067 1083809747 50 51 chr7 159138663 1258347122 50 51 chr8 146364022 1420668565 50 51 ...
- What are similarities and differences between the two versions of FASTA files?
- The FASTA and index file is available.
- Note the difference between the representations of chromosome names.
-
Running the same command on the new FASTA file would not work
$ samtools faidx data/ucsc.hg19.fasta 20:10000000-10001000
-
Adding ‘chr’ in the chromosome name will give the similar results, except that repeated sequences are represented in lowercases.
$ samtools faidx data/ucsc.hg19.fasta chr20:10000000-10001000 >chr20:10000000-10001000 Ttgtttactacagctttgtagtaaattttgaactctaaagtgttagttctctaactttgt ttgtttttcaagagtgttttgactcttcttactgcatcccttgcatttccatatggactt tataatcagccatgtcaacttctgcaagaaagacagctaggattttgataaggattgtgt tgaatctgtagtccaattttgggaatactaccatgttaacatcgtcttcccatccatgca cgtgcaatagcttaccatttatttggtcttcctcaatttctgtcaacaatgatttgtagt tttcagttgcaagtcttgcacttcttttgttaaactttttccaaatatttattcttctta attacataattctcataattaataaaattctgaaattttcttaatttcatttttgTGGCC ATGCACTttaaaacttgccttttaacaagacttccagatgattcttgtgaacattaagtt tgtgaagtgctTCTCTATTGACAAACTGAACTCACGTGACATTCCACACAGCACTGGCAA ATTCTGTCCCGTAACTCCGCTAGCTCTCCAACAAGAAGTGACTTGACGCAGCCCAAGGTT ACTACTTAACACAATGAATTAAATGTTTTTAAATAAGAGGAAGCAATAAGATTCTAAAGG CTTTTCTGTTTTAATTTTCATGCAATGGAAAACTGGTATTAAATATCTATTTAATTAGGA GGAAACTACAATGCTGACTTTTGTCTGAATTATGTAGATAAGTGATTCATTTGAAACAAT TATTTTGATAATTGTCAATTATCCATTTCATTTTAATGCATTTTTTATTCTTTTTTCAAA AATAGCAACAATTACAACAGTTAAACCTTATAATGAATATGTTTCCTAAACCCTGTTCTA CTttctggttccagatctgacaccaattacctttctgattttggacaaaccacttaatat ttgtaacttacaattacttcaactgaataataaaagaattg
-
- (DIY) Are the two subsets of sequences in the above identical to each other?
- Hint 1 : You can use
diff
utility to compare the contents between the two files. - Hint 2 : If you want to convert a string to uppercase, a simple
awk
one-liner can do it.$ (previous-command) | awk '{ print toupper($0); }' | (next-command)
- Or alternatively, you can use
tr [:lower:] [:upper:]
. Seeman tr
for detailed usage oftr
command.
- Or alternatively, you can use
- Need help or want to check solutions? CLICK HERE FOR SOLUTIONS
- Each FASTA file may have slight differences. It is important to understand where the FASTA file I am using is coming from (and you should be able to reproduce the FASTA file on your own).
- If two FASTA files are based on the same genome build, for the primary chromosomes, the sequences will be identical (except for the upper-lower cases and handling of ambiguous bases).
- Hint 1 : You can use
Congratulations! Well done in accessing FASTA files.
Now let’s go to next step Accessing aligned sequence reads in SAM/BAM format, or go
back to Day 3 Overview.