III-2. Subsetting VCF files
- 
    Extract the metadata and header Rename the VCF file mv ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz chr20.vcf.gz Recall that VCF files have metadata lines beginning with “##” and header lines beginning with “#” zcat chr20.vcf.gz | egrep "^##" 
- 
    Take a subset of rows zcat chr20.vcf.gz | head -n100000 > chr20.first100krows.vcf 
- 
    Examine this file via less less chr20.first100krows.vcf 
- 
    Get the header cat chr20.first100krows.vcf | egrep "^#[^#]" 
- 
    Pick just a selection of columns cat chr20.first100krows.vcf | cut -f1-10,20,30 | less cat chr20.first100krows.vcf | cut -f1-10,20,30 > chr20.first100krows_3samps.vcf 
- 
    Pick variants on chromosome 20 between positions 70000 and 72000 cat chr20.first100krows_3samps.vcf | awk '{OFS="\t"; if ($2 > 70000 && $2 < 71000){ print }}'
- 
    What if we wanted to do this for the whole file, somewhere in the middle? What needs to change? zcat chr20.vcf.gz | awk '{OFS="\t"; if ($2 > 30000000 && $2 < 30001000){ print }}'
- 
    This is much too slow - so let’s use tabix to build an index to the file to enable fast random access tabix tabix -p vcf chr20.vcf.gz # wait .... 
- 
    Now we can quickly retreive individual rows tabix chr20.vcf.gz 20:30000000-30001000 | cut -f1-5 
- 
    How many variants in this region? tabix chr20.vcf.gz 20:30000000-30001000 | cut -f1-5 | wc -l 
- 
    How many are variant in the first sample in this region (column #10, HG00096)? tabix chr20.vcf.gz 20:30000000-30001000 | cut -f10 | wc -l tabix chr20.vcf.gz 20:30000000-30001000 | cut -f10 | grep -v "0.0" | wc -l 
- 
    How many are variant in two samples in this region for two individuals, HG00096 (from British in England and Scotland), and NA18506 (from Yoruba in Ibadan, Nigeria)? cat chr20.first100krows.vcf | cut -f1-10,1775 > chr20.100krows.twoindivs.vcf cat chr20.100krows.twoindivs.vcf | grep CHROM | cut -f10 | head cat chr20.100krows.twoindivs.vcf | grep CHROM | cut -f11 | head cat chr20.100krows.twoindivs.vcf | grep -v "^#" | cut -f10 | grep -v "0.0" | wc -l cat chr20.100krows.twoindivs.vcf | grep -v "^#" | cut -f11 | grep -v "0.0" | wc -l 
- 
    Newly tabix index this file cat chr20.100krows.twoindivs.vcf | bgzip > chr20.100krows.twoindivs.vcf.gz tabix -p vcf chr20.100krows.twoindivs.vcf.gz 
Back to Day 3 Overview.