III-1. Understanding how to access FASTA files

In this part, we will learn how to access FASTA files for genomic analysis.

If you’re logged out, login back to the server.

$ ssh [your_id]@flux-login.engin.umich.edu
(...Enter login credentials...)
$ cd bioboot/day3
$ export PATH=/home/hmkang/bioboot/bin:${PATH}

Explore the 1000 GRCh37(+decoy) reference in the data directory.
```
 $ less data/hs37d5.fa 
```
This version of FASTA file is the most widely used version of FASTA file in DNA sequence mapping, including in the 1000 Genomes Project.
(DIY) List all chromosome names available in the FASTA file.
- Hint 1 : use grep command.
- Hint 2 : Adding ^ in the beginning of the grep query searches for the string at the beginning of the line.
- Solution : You can list all chromosomes using the following command (make sure to include quotation marks)
```
 $ grep "^>" data/hs37d5.fa | less 
```
- Is this an efficient way to get the answer? Why?

(DIY) Alternative method - use FASTA index files produced by samtools.

samtools provides a convenient way to explore FASTA files.
Running samtools faidx on a FASTA file will create .fai file (you need a write permission to the directory).
```
 $ samtools faidx [input.fasta] 
```

FASTA index file is human-readable.

$ less data/hs37d5.fa.fai
1       249250621       52      60      61
2       243199373       253404903       60      61
3       198022430       500657651       60      61
4       191154276       701980507       60      61
5       180915260       896320740       60      61
6       171115067       1080251307      60      61
7       159138663       1254218344      60      61
8       146364022       1416009371      60      61
9       141213431       1564812846      60      61
10      135534747       1708379889      60      61
11      135006516       1846173603      60      61
...

What does each column represent?

Chromosome (or sequence) name
Number of nucleotides (or characters) in each chromosome (or sequence).

The byte offset where each chromosome sequence starts

For example, you can access the beginning of chromosome 2 this way.

$ tail -c +253404904 data/hs37d5.fa | head
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Number of bases pairs appearing each line.
Number of bytes used for each line.

Use samtools faidx to access an arbitrary region of genome.

If .fai file is available, you can randomly access the FASTA file, without scanning from the beginning of the file

For example, to look at the 1000 bases starting from the 10M-th bases in chromosome 20,

$ samtools faidx data/hs37d5.fa 20:10000000-10001000
>20:10000000-10001000
TTGTTTACTACAGCTTTGTAGTAAATTTTGAACTCTAAAGTGTTAGTTCTCTAACTTTGT
TTGTTTTTCAAGAGTGTTTTGACTCTTCTTACTGCATCCCTTGCATTTCCATATGGACTT
TATAATCAGCCATGTCAACTTCTGCAAGAAAGACAGCTAGGATTTTGATAAGGATTGTGT
TGAATCTGTAGTCCAATTTTGGGAATACTACCATGTTAACATCGTCTTCCCATCCATGCA
CGTGCAATAGCTTACCATTTATTTGGTCTTCCTCAATTTCTGTCAACAATGATTTGTAGT
TTTCAGTTGCAAGTCTTGCACTTCTTTTGTTAAACTTTTTCCAAATATTTATTCTTCTTA
ATTACATAATTCTCATAATTAATAAAATTCTGAAATTTTCTTAATTTCATTTTTGTGGCC
ATGCACTTTAAAACTTGCCTTTTAACAAGACTTCCAGATGATTCTTGTGAACATTAAGTT
TGTGAAGTGCTTCTCTATTGACAAACTGAACTCACGTGACATTCCACACAGCACTGGCAA
ATTCTGTCCCGTAACTCCGCTAGCTCTCCAACAAGAAGTGACTTGACGCAGCCCAAGGTT
ACTACTTAACACAATGAATTAAATGTTTTTAAATAAGAGGAAGCAATAAGATTCTAAAGG
CTTTTCTGTTTTAATTTTCATGCAATGGAAAACTGGTATTAAATATCTATTTAATTAGGA
GGAAACTACAATGCTGACTTTTGTCTGAATTATGTAGATAAGTGATTCATTTGAAACAAT
TATTTTGATAATTGTCAATTATCCATTTCATTTTAATGCATTTTTTATTCTTTTTTCAAA
AATAGCAACAATTACAACAGTTAAACCTTATAATGAATATGTTTCCTAAACCCTGTTCTA
CTTTCTGGTTCCAGATCTGACACCAATTACCTTTCTGATTTTGGACAAACCACTTAATAT
TTGTAACTTACAATTACTTCAACTGAATAATAAAAGAATTG

The FASTA file used above is GRCh37 sequence used for the 1000 Genomes project. Another widely used version of GRCh37 sequence can be downloaded from UCSC (hg19).

The FASTA and index file is available.

 
$ less data/ucsc.hg19.fasta.fai
chrM    16571   6       50      51
chr1    249250621       16915   50      51
chr2    243199373       254252555       50      51
chr3    198022430       502315922       50      51
chr4    191154276       704298807       50      51
chr5    180915260       899276175       50      51
chr6    171115067       1083809747      50      51
chr7    159138663       1258347122      50      51
chr8    146364022       1420668565      50      51
...

What are similarities and differences between the two versions of FASTA files?

Note the difference between the representations of chromosome names.

Running the same command on the new FASTA file would not work
```
$ samtools faidx data/ucsc.hg19.fasta 20:10000000-10001000 
```

Adding ‘chr’ in the chromosome name will give the similar results, except that repeated sequences are represented in lowercases.

$ samtools faidx data/ucsc.hg19.fasta chr20:10000000-10001000   
>chr20:10000000-10001000
Ttgtttactacagctttgtagtaaattttgaactctaaagtgttagttctctaactttgt
ttgtttttcaagagtgttttgactcttcttactgcatcccttgcatttccatatggactt
tataatcagccatgtcaacttctgcaagaaagacagctaggattttgataaggattgtgt
tgaatctgtagtccaattttgggaatactaccatgttaacatcgtcttcccatccatgca
cgtgcaatagcttaccatttatttggtcttcctcaatttctgtcaacaatgatttgtagt
tttcagttgcaagtcttgcacttcttttgttaaactttttccaaatatttattcttctta
attacataattctcataattaataaaattctgaaattttcttaatttcatttttgTGGCC
ATGCACTttaaaacttgccttttaacaagacttccagatgattcttgtgaacattaagtt
tgtgaagtgctTCTCTATTGACAAACTGAACTCACGTGACATTCCACACAGCACTGGCAA
ATTCTGTCCCGTAACTCCGCTAGCTCTCCAACAAGAAGTGACTTGACGCAGCCCAAGGTT
ACTACTTAACACAATGAATTAAATGTTTTTAAATAAGAGGAAGCAATAAGATTCTAAAGG
CTTTTCTGTTTTAATTTTCATGCAATGGAAAACTGGTATTAAATATCTATTTAATTAGGA
GGAAACTACAATGCTGACTTTTGTCTGAATTATGTAGATAAGTGATTCATTTGAAACAAT
TATTTTGATAATTGTCAATTATCCATTTCATTTTAATGCATTTTTTATTCTTTTTTCAAA
AATAGCAACAATTACAACAGTTAAACCTTATAATGAATATGTTTCCTAAACCCTGTTCTA
CTttctggttccagatctgacaccaattacctttctgattttggacaaaccacttaatat
ttgtaacttacaattacttcaactgaataataaaagaattg

(DIY) Are the two subsets of sequences in the above identical to each other?
- Hint 1 : You can use diff utility to compare the contents between the two files.
- Hint 2 : If you want to convert a string to uppercase, a simple awk one-liner can do it.
```
 $ (previous-command) | awk '{ print toupper($0); }' | (next-command) 
```
  - Or alternatively, you can use tr [:lower:] [:upper:]. See man tr for detailed usage of tr command.
- Need help or want to check solutions? CLICK HERE FOR SOLUTIONS
- Each FASTA file may have slight differences. It is important to understand where the FASTA file I am using is coming from (and you should be able to reproduce the FASTA file on your own).
- If two FASTA files are based on the same genome build, for the primary chromosomes, the sequences will be identical (except for the upper-lower cases and handling of ambiguous bases).

Congratulations! Well done in accessing FASTA files.
Now let’s go to next step Accessing aligned sequence reads in SAM/BAM format, or go back to Day 3 Overview.