Finding SNP Sites in the Whole Genome of a Plant Using GATK
As I have discussed in the previous blog, we already have aligned SAM (.sam) file to be further utilized for analyzing genomic variations. In this post, I am going to instruct the steps involved in the analysis of genomic variations among the closely related organisms. In particular, I am presenting the workflow pipeline for the GATK (Genome Analysis Tool Kit) HaplotypeCaller for detecting SNP (Single Nucleotide Polymorphism) sites and their annotation by snpEff.
Note: In order to carry out SNP sites detection by GATK, there should be SAM file of the whole genome sequence of an organism, preinstalled GATK software version 4.1.0.1 and above (McKenna et al., 2010) with its accessory packages, Samtools, and snpEff 4.3t (Cingolani et al., 2012) in Linux server.
Image 1. a screenshot of a home page of gatk website (source: https://software.broadinstitute.org/gatk/) |
In brief
1. get the SAM file of the whole genome sequence of the sample organism;
2. do SNP call with HaplotypeCaller in GATK4.1.0.1;
3. perform joint genotyping of HaplotypeCaller operated files (for multiple numbers of whole genome sequences);
4. filter the variants after they have been selected;
4. filter the variants after they have been selected;
Image 2. a screenshot of a page of snpEff on sourceforge web repository (source: http://snpeff.sourceforge.net/) |
In detail
1. Create sorted BAM file from aligned SAM file
In this step, we are going to create a sorted BAM file, which will have a sequence sorted according to coordinates of the reference genome, using the following command.
java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar SortSam \
--INPUT=VRiparia.sam \
--OUTPUT=VRiparia_sorted.bam \
--SORT_ORDER=coordinate
--TMP_DIR=/B42T/mukesh/tmp/
Here, VRiparia.sam is an aligned SAM file of Vitis riparia whole genome sequence (the sample sequence) and TMP_DIR is a temporary directory to manage memory problems in Linux.
2. Create markduplicated BAM file or to mark duplicate reads in sorted BAM file
java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar MarkDuplicates \
--INPUT=VRiparia_sorted.bam \
--OUTPUT=VRiparia_dedup_reads.bam \
--METRICS_FILE=VRiparia_dedup_metrics.txt
--TMP_DIR=/B42T/mukesh/tmp/
OR
java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar MarkDuplicates \
--INPUT=VRiparia_sorted.bam\
--OUTPUT=VRiparia_markduplicates.bam \
--METRICS_FILE=VRiparia_markduplicates_metrics.txt \
--OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
--CREATE_INDEX=true \
--TMP_DIR=/tmp
3. Create index of reference genome
samtools faidx Vitis_vinifera.fa
4. Create a sequence dictionary for a reference genome
java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar CreateSequenceDictionary \
--REFERENCE=/B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \
--OUTPUT=Vitis_vinifera.fa.dict
5. Add and replace read groups in markduplicated .bam file
java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar AddOrReplaceReadGroups \
--INPUT=/B42T/mukesh/DNA_Seq_Work/VRiparia_dedup_reads.bam \
--OUTPUT=/B42T/mukesh/DNA_Seq_Work/VRiparia_addRG.bam \
--RGID=HWI-D00525a \
--RGLB= VRiparia_11 \
--RGPL=illumina \
--RGPU=C6DFYANXX \
--RGSM=SRP161488 \
6. Create bam index of read group added BAM file
java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar BuildBamIndex \
--INPUT=VRiparia_addRG.bam
7. to index the .vcf file with --known-sites of SNPs [when no --known-sites, may be skipped]
java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar IndexFeatureFile \
-F /B42T/mukesh/DNA_Seq_Work/VvSNPs_EnsemblPlants/vitis_vinifera.vcf
8. Do base quality score recalibration [when no --known-sites, may be skipped]
java -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar BaseRecalibrator \
-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \
--input /B42T/mukesh/DNA_Seq_Work/VRiparia__addRG.bam \
--known-sites /B42T/mukesh/DNA_Seq_Work/VvSNPs_EnsemblPlants/vitis_vinifera.vcf \
--output VRiparia_recal_data.table
9. Run HaplotypeCaller for Short Variants Discovery
java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar HaplotypeCaller -nct 32 -I \
-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \
--input /B42T/mukesh/FastqFiles/VRiparia_addRG.bam\
--emit-ref-confidence GVCF
--output VRiparia_raw_SNPs_INDELS.g.vcf
After we get the GVCF (or .g.vcf) file with short variants like SNPs and INDELs, we can follow two ways to perform joint genotyping:
a) by giving commands in separate steps for combining g.vcf files and GenotypeGVCFs:
As we did not provide commands for combining g.vcf files and their joint genotyping in separate steps, the details in step 10 are missing.
10. Combine all the g.vcf files in one
11. Perform GenotypeGVCFs
java -Xmx8G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar GenotypeGVCFs \
-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \
--variant Vitis_raw_vcfs.list \
--output Vitis_raw_variants_output.vcf
OR
java -Xmx8G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar GenotypeGVCFs \
-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \
--variant VRupestris.g.vcf \
--output VRupestris_raw_SNPs_INDELs.vcf;
b) by writing bash shell having command for both the combine g.vcf and GenotypeGVCFs and executing it:
#!/bin/bash
filename=$1
GATK4=/B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
REFERENCE=/B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa
LISTFILE=$PWD/raw_gvcfs_files.txt
CVCFFILE=${filename}.g.vcf
GVCFFILE=${filename}.vcf
for line in `cat $LISTFILE`
do
{
i=`expr $i + 1`
variant="-V $line $variant"
}
done
echo $variant
java -Xmx24G -jar $GATK4 CombineGVCFs -R $REFERENCE $variant -O $CVCFFILE
java -Xmx24G -jar $GATK4 GenotypeGVCFs -R $REFERENCE -V $CVCFFILE -O $GVCFFILE
mv $CVCFFILE $PWD/tmp
bgzip $GVCFFILE
tabix -p vcf $GVCFFILE.gz
echo "$VCFFILE successfully created."
12. SelectVariants
To select SNPs
java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar SelectVariants \
-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \
-V /B42T/mukesh/DNA_Seq_Work/raw_gvcfs_files.txt.vcf.gz \
--select-type-to-include SNP \
--output Vitis_11_raw_snps.vcf
To select INDELs
java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar SelectVariants \
-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \
-V /B42T/mukesh/DNA_Seq_Work/raw_gvcfs_files.txt.vcf.gz \
--select-type-to-include INDEL \
--select-type-to-include MIXED \
--output Vitis_11_raw_INDEL_MIXED.vcf
After we selected variants from the VCF file, now, we can perform VQSR (Variant Quality Score Recalibration). Since there were no enough SNP sites for grapevines in pubicly available databases we skipped this step, instead, we did hard filtering of variants. So, I am mentioning the HardFilter command here.
13. Do HardFilter the Variants
java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar VariantFiltration \
-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa
-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa
--variant /B42T/mukesh/DNA_Seq_Work/Vitis_11_raw_snps.vcf
--filter-expression "QD < 2.0 || FS > 60.0 || SOR > 3.0 || MQ < 40.0 || MQRankSum < -10.0 || ReadPosRankSum < -8.0 || QUAL < 30" --filter-name "my_snp_filter"
--output Vitis_11_Filtered_snps.vcf
java -Xmx64G -jar /B42T/mukesh/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar VariantFiltration \
-R /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera.fa \
--variant /B42T/mukesh/DNA_Seq_Work/Vitis_11_raw_INDEL_MIXED.vcf \
--filter-expression "QD < 2.0 || FS > 100.0 || ReadPosRankSum < -20.0 || InbreedingCoeff < -0.8" \
--filter-name "my_indel_filter" \
--output Vitis_11_Filtered_INDEL_MIXED.vcf
14. Conduct SNPs and INDELs annotation using snpEff programme
To annotate SNPs only
java -Xmx4g -jar /B42T/mukesh/snpEff/snpEff.jar Vitis_vinifera Vitis_11_Filtered_snps.vcf > Vitis_11_Ann_SNPs.vcf
To annotate INDEL and SNPs mixed
java -Xmx8g -jar /B42T/mukesh/snpEff/snpEff.jar Vitis_vinifera Vitis_11_Filtered_INDEL_MIXED.vcf > Vitis_11_Ann_INDEL_MIXED.vcf
References
References
Cingolani,
P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., … Ruden, D. M.
(2012). A program for annotating and predicting the effects of single
nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster
strain w1118; iso-2; iso-3. Fly, 6(2), 80–92.
https://doi.org/10.4161/fly.19695
McKenna,
A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., …
DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for
analyzing next-generation DNA sequencing
data. Genome Research, 20(9), 1297–1303.
https://doi.org/10.1101/gr.107524.110
Comments
Post a Comment