Finding SNP Sites in the Whole Genome of a Plant Using GATK

As I have discussed in the previous blog, we already have aligned SAM (.sam) file to be further utilized for analyzing genomic variations. In this post, I am going to instruct the steps involved in the analysis of genomic variations among the closely related organisms. In particular, I am presenting the workflow pipeline for the GATK (Genome Analysis Tool Kit) HaplotypeCaller for detecting SNP (Single Nucleotide Polymorphism) sites and their annotation by snpEff.

Note: In order to carry out SNP sites detection by GATK, there should be SAM file of the whole genome sequence of an organism, preinstalled GATK software version 4.1.0.1 and above (McKenna et al., 2010) with its accessory packages, Samtools, and snpEff 4.3t (Cingolani et al., 2012) in Linux server.

Image 1. a screenshot of a home page of gatk website (source: https://software.broadinstitute.org/gatk/)

In brief

1. get the SAM file of the whole genome sequence of the sample organism;

2. do SNP call with HaplotypeCaller in GATK4.1.0.1;

3. perform joint genotyping of HaplotypeCaller operated files (for multiple numbers of whole genome sequences);
4. filter the variants after they have been selected;

5. annotate the SNP and INDEL sites with snpEff.

Image 2. a screenshot of a page of snpEff on sourceforge web repository (source: http://snpeff.sourceforge.net/)

In detail

1. Create sorted BAM file from aligned SAM file

In this step, we are going to create a sorted BAM file, which will have a sequence sorted according to coordinates of the reference genome, using the following command.

java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar SortSam \

--INPUT=VRiparia.sam \

--OUTPUT=VRiparia_sorted.bam \

--SORT_ORDER=coordinate

--TMP_DIR=/B42T/mukesh/tmp/

Here, VRiparia.sam is an aligned SAM file of Vitis riparia whole genome sequence (the sample sequence) and TMP_DIR is a temporary directory to manage memory problems in Linux.

2. Create markduplicated BAM file or to mark duplicate reads in sorted BAM file

java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar MarkDuplicates \

--INPUT=VRiparia_sorted.bam \

--OUTPUT=VRiparia_dedup_reads.bam \

--METRICS_FILE=VRiparia_dedup_metrics.txt

--TMP_DIR=/B42T/mukesh/tmp/

java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar MarkDuplicates \

--INPUT=VRiparia_sorted.bam\

--OUTPUT=VRiparia_markduplicates.bam \

--METRICS_FILE=VRiparia_markduplicates_metrics.txt \

--OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \

--CREATE_INDEX=true \

--TMP_DIR=/tmp

3. Create index of reference genome

samtools faidx Vitis_vinifera.fa

4. Create a sequence dictionary for a reference genome

java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar CreateSequenceDictionary \

--REFERENCE=/B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \

--OUTPUT=Vitis_vinifera.fa.dict

5. Add and replace read groups in markduplicated .bam file

java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar AddOrReplaceReadGroups \

--INPUT=/B42T/mukesh/DNA_Seq_Work/VRiparia_dedup_reads.bam \

--OUTPUT=/B42T/mukesh/DNA_Seq_Work/VRiparia_addRG.bam \

--RGID=HWI-D00525a \

--RGLB= VRiparia_11 \

--RGPL=illumina \

--RGPU=C6DFYANXX \

--RGSM=SRP161488 \

6. Create bam index of read group added BAM file

java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar BuildBamIndex \

--INPUT=VRiparia_addRG.bam

7. to index the .vcf file with --known-sites of SNPs [when no --known-sites, may be skipped]

java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar IndexFeatureFile \

-F /B42T/mukesh/DNA_Seq_Work/VvSNPs_EnsemblPlants/vitis_vinifera.vcf

8. Do base quality score recalibration [when no --known-sites, may be skipped]

java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar BaseRecalibrator \

-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \

--input /B42T/mukesh/DNA_Seq_Work/VRiparia__addRG.bam \

--known-sites /B42T/mukesh/DNA_Seq_Work/VvSNPs_EnsemblPlants/vitis_vinifera.vcf \

--output VRiparia_recal_data.table

9. Run HaplotypeCaller for Short Variants Discovery

java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar HaplotypeCaller -nct 32 -I \

-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \

--input /B42T/mukesh/FastqFiles/VRiparia_addRG.bam\

--emit-ref-confidence GVCF

--output VRiparia_raw_SNPs_INDELS.g.vcf

After we get the GVCF (or .g.vcf) file with short variants like SNPs and INDELs, we can follow two ways to perform joint genotyping:

a) by giving commands in separate steps for combining g.vcf files and GenotypeGVCFs:

As we did not provide commands for combining g.vcf files and their joint genotyping in separate steps, the details in step 10 are missing.

10. Combine all the g.vcf files in one

11. Perform GenotypeGVCFs

java -Xmx8G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar GenotypeGVCFs \

-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \

--variant Vitis_raw_vcfs.list \

--output Vitis_raw_variants_output.vcf

java -Xmx8G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar GenotypeGVCFs \

-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \

--variant VRupestris.g.vcf \

--output VRupestris_raw_SNPs_INDELs.vcf;

b) by writing bash shell having command for both the combine g.vcf and GenotypeGVCFs and executing it:

#!/bin/bash

filename=$1

GATK4=/B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar

REFERENCE=/B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa

LISTFILE=$PWD/raw_gvcfs_files.txt

CVCFFILE=${filename}.g.vcf

GVCFFILE=${filename}.vcf

for line in `cat $LISTFILE`

{

i=`expr $i + 1`

variant="-V $line $variant"

}

done

echo $variant

java -Xmx24G -jar $GATK4 CombineGVCFs -R $REFERENCE $variant -O $CVCFFILE

java -Xmx24G -jar $GATK4 GenotypeGVCFs -R $REFERENCE -V $CVCFFILE -O $GVCFFILE

mv $CVCFFILE $PWD/tmp

bgzip $GVCFFILE

tabix -p vcf $GVCFFILE.gz

echo "$VCFFILE successfully created."

12. SelectVariants

To select SNPs

java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar SelectVariants \

-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \

-V /B42T/mukesh/DNA_Seq_Work/raw_gvcfs_files.txt.vcf.gz \

--select-type-to-include SNP \

--output Vitis_11_raw_snps.vcf

To select INDELs

java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar SelectVariants \

-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \

-V /B42T/mukesh/DNA_Seq_Work/raw_gvcfs_files.txt.vcf.gz \

--select-type-to-include INDEL \

--select-type-to-include MIXED \

--output Vitis_11_raw_INDEL_MIXED.vcf

After we selected variants from the VCF file, now, we can perform VQSR (Variant Quality Score Recalibration). Since there were no enough SNP sites for grapevines in pubicly available databases we skipped this step, instead, we did hard filtering of variants. So, I am mentioning the HardFilter command here.

13. Do HardFilter the Variants

java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar VariantFiltration \
-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa

--variant /B42T/mukesh/DNA_Seq_Work/Vitis_11_raw_snps.vcf

--filter-expression "QD < 2.0 || FS > 60.0 || SOR > 3.0 || MQ < 40.0 || MQRankSum < -10.0 || ReadPosRankSum < -8.0 || QUAL < 30" --filter-name "my_snp_filter"

--output Vitis_11_Filtered_snps.vcf

java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar VariantFiltration \

-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \

--variant /B42T/mukesh/DNA_Seq_Work/Vitis_11_raw_INDEL_MIXED.vcf \

--filter-expression "QD < 2.0 || FS > 100.0 || ReadPosRankSum < -20.0 || InbreedingCoeff < -0.8" \

--filter-name "my_indel_filter" \

--output Vitis_11_Filtered_INDEL_MIXED.vcf

14. Conduct SNPs and INDELs annotation using snpEff programme

To annotate SNPs only

java -Xmx4g -jar /B42T/mukesh/snpEff/snpEff.jar Vitis_vinifera Vitis_11_Filtered_snps.vcf > Vitis_11_Ann_SNPs.vcf

To annotate INDEL and SNPs mixed

java -Xmx8g -jar /B42T/mukesh/snpEff/snpEff.jar Vitis_vinifera Vitis_11_Filtered_INDEL_MIXED.vcf > Vitis_11_Ann_INDEL_MIXED.vcf

References

Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., … Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2), 80–92. https://doi.org/10.4161/fly.19695

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., … DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–1303. https://doi.org/10.1101/gr.107524.110

Search This Blog

Plantenthusiast

Finding SNP Sites in the Whole Genome of a Plant Using GATK

Comments

Post a Comment

Popular posts from this blog

How to Align Fastq Files of Sample Genomic DNA Sequence to Reference Genome

How did we perform preliminary screening for the phytochemical classes?