Finding SNP Sites in the Whole Genome of a Plant Using GATK

As I have discussed in the previous blog, we already have aligned SAM (.sam) file to be further utilized for analyzing genomic variations. In this post, I am going to instruct the steps involved in the analysis of genomic variations among the closely related organisms. In particular, I am presenting the workflow pipeline for the GATK (Genome Analysis Tool Kit) HaplotypeCaller for detecting SNP (Single Nucleotide Polymorphism) sites and their annotation by snpEff

Note: In order to carry out SNP sites detection by GATK, there should be SAM file of the whole genome sequence of an organism, preinstalled GATK software version 4.1.0.1 and above (McKenna et al., 2010) with its accessory packages, Samtools, and snpEff  4.3t (Cingolani et al., 2012) in Linux server.

Image 1. a screenshot of a home page of gatk website (source: https://software.broadinstitute.org/gatk/)

In brief
1. get the SAM file of the whole genome sequence of the sample organism;
2. do SNP call with HaplotypeCaller in GATK4.1.0.1;
3. perform joint genotyping of HaplotypeCaller operated files (for multiple numbers of whole genome sequences);
4. filter the variants after they have been selected;
5. annotate the SNP and INDEL sites with snpEff.

Image 2. a screenshot of a page of snpEff on sourceforge web repository (source: http://snpeff.sourceforge.net/)

In detail
1. Create sorted BAM file from aligned SAM file
In this step, we are going to create a sorted BAM file, which will have a sequence sorted according to coordinates of the reference genome, using the following command. 

java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar SortSam
    --INPUT=VRiparia.sam \ 
    --OUTPUT=VRiparia_sorted.bam \ 
    --SORT_ORDER=coordinate
--TMP_DIR=/B42T/mukesh/tmp/

Here, VRiparia.sam is an aligned SAM file of Vitis riparia whole genome sequence (the sample sequence) and TMP_DIR is a temporary directory to manage memory problems in Linux.

2. Create markduplicated BAM file or to mark duplicate reads in sorted BAM file
java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar MarkDuplicates
    --INPUT=VRiparia_sorted.bam \ 
    --OUTPUT=VRiparia_dedup_reads.bam \
    --METRICS_FILE=VRiparia_dedup_metrics.txt
--TMP_DIR=/B42T/mukesh/tmp/

OR

java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar MarkDuplicates \
--INPUT=VRiparia_sorted.bam\                                  
--OUTPUT=VRiparia_markduplicates.bam \
--METRICS_FILE=VRiparia_markduplicates_metrics.txt \ 
--OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \              
--CREATE_INDEX=true \                                       
--TMP_DIR=/tmp

3. Create index of reference genome
samtools faidx Vitis_vinifera.fa

4. Create a sequence dictionary for a reference genome 
java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar CreateSequenceDictionary \       
    --REFERENCE=/B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \ 
    --OUTPUT=Vitis_vinifera.fa.dict
5. Add and replace read groups in markduplicated .bam file
java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar AddOrReplaceReadGroups
    --INPUT=/B42T/mukesh/DNA_Seq_Work/VRiparia_dedup_reads.bam \ 
    --OUTPUT=/B42T/mukesh/DNA_Seq_Work/VRiparia_addRG.bam \ 
    --RGID=HWI-D00525a \                                       
    --RGLB= VRiparia_11 \ 
    --RGPL=illumina \ 
    --RGPU=C6DFYANXX \ 
    --RGSM=SRP161488 \ 

6. Create bam index of read group added BAM file
java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar BuildBamIndex
    --INPUT=VRiparia_addRG.bam 

7. to index the .vcf file with --known-sites of SNPs [when no --known-sites, may be skipped]
java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar IndexFeatureFile \
-F /B42T/mukesh/DNA_Seq_Work/VvSNPs_EnsemblPlants/vitis_vinifera.vcf

8. Do base quality score recalibration [when no --known-sites, may be skipped
 java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar BaseRecalibrator \
   -R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \
   --input /B42T/mukesh/DNA_Seq_Work/VRiparia__addRG.bam \
   --known-sites /B42T/mukesh/DNA_Seq_Work/VvSNPs_EnsemblPlants/vitis_vinifera.vcf \
   --output VRiparia_recal_data.table

9. Run HaplotypeCaller for Short Variants Discovery   
 java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar HaplotypeCaller -nct 32 -I \
   -R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \ 
   --input /B42T/mukesh/FastqFiles/VRiparia_addRG.bam\
   --emit-ref-confidence GVCF
   --output VRiparia_raw_SNPs_INDELS.g.vcf 

After we get the GVCF (or .g.vcf) file with short variants like SNPs and INDELs, we can follow two ways to perform joint genotyping:
a) by giving commands in separate steps for combining g.vcf files and GenotypeGVCFs: 

As we did not provide commands for combining g.vcf files and their joint genotyping in separate steps, the details in step 10 are missing.

10. Combine all the g.vcf files in one 



11. Perform GenotypeGVCFs     
   java -Xmx8G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar GenotypeGVCFs \
   -R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \
   --variant Vitis_raw_vcfs.list \
   --output Vitis_raw_variants_output.vcf

          OR

   java -Xmx8G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar GenotypeGVCFs \
   -R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \
   --variant VRupestris.g.vcf \
   --output VRupestris_raw_SNPs_INDELs.vcf;

b) by writing bash shell having command for both the combine g.vcf and GenotypeGVCFs and executing it:
   
#!/bin/bash
filename=$1
GATK4=/B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
REFERENCE=/B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa
LISTFILE=$PWD/raw_gvcfs_files.txt
CVCFFILE=${filename}.g.vcf
GVCFFILE=${filename}.vcf
for line in `cat $LISTFILE`
do
    {
        i=`expr $i + 1`
        variant="-V $line $variant"
    }
done
echo $variant
java -Xmx24G -jar $GATK4 CombineGVCFs -R $REFERENCE  $variant -O $CVCFFILE
java -Xmx24G -jar $GATK4 GenotypeGVCFs -R $REFERENCE  -V $CVCFFILE -O $GVCFFILE
mv $CVCFFILE $PWD/tmp
bgzip $GVCFFILE
tabix -p vcf $GVCFFILE.gz
echo "$VCFFILE successfully created."


12. SelectVariants

To select SNPs
java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar SelectVariants
    -R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \
    -V /B42T/mukesh/DNA_Seq_Work/raw_gvcfs_files.txt.vcf.gz \ 
    --select-type-to-include SNP \ 
    --output Vitis_11_raw_snps.vcf 
To select INDELs
java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar SelectVariants \ 
    -R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \ 
    -V /B42T/mukesh/DNA_Seq_Work/raw_gvcfs_files.txt.vcf.gz \ 
    --select-type-to-include INDEL \
    --select-type-to-include MIXED \
    --output Vitis_11_raw_INDEL_MIXED.vcf

After we selected variants from the VCF file, now, we can perform VQSR (Variant Quality Score Recalibration). Since there were no enough SNP sites for grapevines in pubicly available databases we skipped this step, instead, we did hard filtering of variants. So, I am mentioning the HardFilter command here. 
    
13. Do HardFilter the Variants
java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar VariantFiltration \
-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa 
--variant /B42T/mukesh/DNA_Seq_Work/Vitis_11_raw_snps.vcf 
--filter-expression "QD < 2.0 || FS > 60.0 || SOR > 3.0 || MQ < 40.0 || MQRankSum < -10.0 || ReadPosRankSum < -8.0 || QUAL < 30" --filter-name "my_snp_filter" 
--output Vitis_11_Filtered_snps.vcf

java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar VariantFiltration \ 
    -R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \ 
    --variant /B42T/mukesh/DNA_Seq_Work/Vitis_11_raw_INDEL_MIXED.vcf \ 
    --filter-expression "QD < 2.0 || FS > 100.0 || ReadPosRankSum < -20.0 || InbreedingCoeff < -0.8" \ 
    --filter-name "my_indel_filter" \ 
    --output Vitis_11_Filtered_INDEL_MIXED.vcf
14. Conduct SNPs and INDELs annotation using snpEff programme
To annotate SNPs only
java -Xmx4g -jar /B42T/mukesh/snpEff/snpEff.jar Vitis_vinifera Vitis_11_Filtered_snps.vcf > Vitis_11_Ann_SNPs.vcf 

To annotate INDEL and SNPs mixed
java -Xmx8g -jar /B42T/mukesh/snpEff/snpEff.jar Vitis_vinifera Vitis_11_Filtered_INDEL_MIXED.vcf > Vitis_11_Ann_INDEL_MIXED.vcf

References
Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., … Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2), 80–92. https://doi.org/10.4161/fly.19695

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., … DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation  DNA sequencing data. Genome Research, 20(9), 1297–1303. https://doi.org/10.1101/gr.107524.110

Comments

Popular posts from this blog

How we performed the estimation of total flavonoid content ?

The Rapid Method of Estimating Antioxidant Activity and Total Polyphenols Content of Sample Plant Extracts

How to Align Fastq Files of Sample Genomic DNA Sequence to Reference Genome