III-1. Working with VCF files.

Here we will download a file of variant calls and genotypes from the 1000 Genomes project.

  • Log in to flux

    ssh USERNAME@flux-login.engin.umich.edu
    
  • Update your PATH to point to programs we will use

    which samtools   # this won't be able to find samtools
    
    export PATH=${PATH}:/scratch/biobootcamp_fluxod/kitzmanj/samtools-1.1/:/scratch/biobootcamp_fluxod/kitzmanj/samtools-1.1/htslib-1.1/ 
    
    which samtools   # this won't be able to find samtools
    
    • You should now be able to find and run samtools

      kitzmanj@flux-login3:/scratch/biobootcamp_fluxod/kitzmanj$ which samtools
      /scratch/biobootcamp_fluxod/kitzmanj/samtools-1.1/samtools
      
      kitzmanj@flux-login3:/scratch/biobootcamp_fluxod/kitzmanj$ samtools
      
      Program: samtools (Tools for alignments in the SAM format)
      Version: 1.1 (using htslib 1.1)
      
      Usage:   samtools  [options]
      ... 
      
  • Make a new directory to work in

    cd /scratch/biobootcamp_fluxod/$( whoami )/
    mkdir day3_vcf
    cd day3_vcf
    export BASE_DIR=/scratch/biobootcamp_fluxod/$( whoami )/day3_vcf
    
  • Navigate in your web browser to this address: [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/]

  • Download variant calls and genotypes for chr20.

    • Right-click the link to ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz and select “Copy Link Address”

    • Back in Flux, type “wget “ and paste this link:

    wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
    
  • QC check #1: make sure the size of the downloaded file is right:

 ls -l ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
 
  • QC check #2: a “hash” of the file is a characteristic ‘fingerprint’ of its contents. Damage or truncation to the file will change the value of this hash. Often a file will be provided with an accompanying hash (e.g., an md5sum) which can be checked.
 md5sum ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
 


Back to Day 3 Overview.