How to Align Fastq Files of Sample Genomic DNA Sequence to Reference Genome

Since this blog shares information on various aspects of plants and plant sciences, this time I am writing about bioinformatics tools that help study the genomes of organisms, including plants. Aligning the FASTQ (.fq) files of a sample's genomic DNA sequence to a reference genome, with BWA or any comparable aligner, is an essential step before further investigations such as genome analysis. In this post, I present instructions for aligning quality-trimmed FASTQ (.fq) files of a sample genome to a reference genome using BWA (Burrows-Wheeler Aligner). The instructions cover installing BWA, indexing the reference genome, quality trimming the raw FASTQ files, and aligning the trimmed files to the reference genome to obtain a SAM (Sequence Alignment Map) file for the sample.

Image 1. The vines of Vitis vinifera cv. 'Pinot Noir' with berries (Source: https://commons.wikimedia.org/)

In brief

Have a Linux server with software such as Aspera Connect, SRA Toolkit, and fastp preinstalled;
Download fastq files of sample genomic DNA sequence;
Trim and quality control the sample genomic DNA sequences;
Download the reference genome sequence and build its index;
Download and install BWA software;
Align the sample genome to the reference genome.

NOTE: The entire process described here is based on what we did on a Linux server [Ubuntu 16.04.4 LTS (GNU/Linux 4.15.0-47-generic x86_64)], following the instructions provided on SourceForge and GitHub. To carry out the whole process, besides BWA, other software must be preinstalled on the Linux system: Aspera Connect, SRA Toolkit, and fastp (or any software that removes low-quality and unidentified reads from raw FASTQ files).
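Before starting, it may help to confirm that the prerequisite tools are reachable from the shell. The sketch below assumes the executables are named ascp (Aspera Connect), fastq-dump (SRA Toolkit), and fastp; these names are not stated in this post, so adjust them if your installation differs.

```shell
# Print the name of every tool in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed executable names: ascp (Aspera Connect), fastq-dump (SRA Toolkit), fastp.
missing=$(missing_tools ascp fastq-dump fastp)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All prerequisite tools found."
fi
```

The same function can be called again later with bwa as an argument, once BWA has been built.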

Image 2. Screenshot of the sourceforge.net page that contains the BWA download link (Source: https://sourceforge.net/projects/bio-bwa/ ; Date: 01 September 2019)

In detail

Let us begin by installing BWA.

1. In order to install BWA software, download the package from its source using the following command:

wget https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.17.tar.bz2
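Download links on hosting sites can occasionally return a web page instead of the archive itself. As a hedged sanity check, the sketch below tests whether the downloaded file starts with the bzip2 signature "BZh". The function name looks_like_bzip2 is my own, chosen only for illustration.

```shell
# Return success if the file starts with the bzip2 magic bytes "BZh".
looks_like_bzip2() {
    [ "$(head -c 3 "$1" 2>/dev/null)" = "BZh" ]
}

archive=bwa-0.7.17.tar.bz2
if [ -f "$archive" ]; then
    if looks_like_bzip2 "$archive"; then
        echo "$archive looks like a bzip2 archive"
    else
        echo "$archive is not bzip2 data; the download may be an HTML page"
    fi
fi
```

If the check fails, downloading the file again through the browser link is the simplest remedy.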

2. Then, decompress the downloaded .bz2 file of the bwa package using the command:

bunzip2 bwa-0.7.17.tar.bz2

3. After you have decompressed the .bz2 file, you will see a file named bwa-0.7.17.tar (where 0.7.17 is the version and .tar is the file extension). Give the following command to extract the .tar file:

tar xvf bwa-0.7.17.tar

4. After you have extracted the .tar file of bwa, go to the bwa-0.7.17 folder by entering the command:

cd bwa-0.7.17

5. Then, build BWA by typing 'make', which is itself a command, and pressing Enter:

make

6. After you have run 'make', add the BWA directory to your PATH so that the shell can find the bwa executable. For this, execute the command:

export PATH=$PATH:/B42T/mukesh/DNA_Seq_Work/bwa-0.7.17

[Here, /B42T/mukesh/DNA_Seq_Work/bwa-0.7.17 is the directory in your Linux server where you have built the bwa package. To keep this setting for future sessions, add the same export line to ~/.bashrc and reload it with 'source ~/.bashrc'.]
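If the PATH line is run more than once, the same directory gets appended repeatedly. A minimal sketch of one way to avoid that is shown here; the helper name add_to_path is hypothetical, and the directory is the one from the step above, so replace it with your own location.

```shell
# Append a directory to PATH only if it is not already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

# Directory taken from the step above; replace it with your own install location.
add_to_path /B42T/mukesh/DNA_Seq_Work/bwa-0.7.17
export PATH
```

Wrapping PATH in colons before matching means a directory at the very start or end of PATH is detected as well.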

7. Now, you can check whether bwa is installed by executing the following command, which should print the program's version and usage message:

bwa

If you do not already have an index of a reference genome, you will have to download the reference genome from a genome database, concatenate its chromosome files in order, and index it. In my illustration, I downloaded the wine grape (Vitis vinifera L.) genome (version 12X) as the reference genome from the grape genome database of CRIBI Biotech Centre (http://genomes.cribi.unipd.it/) and indexed it following the steps below.

Image 3. Screenshot of a homepage of CRIBI grape genome database

8. Give the following command to download the reference genome of Vitis vinifera, version 12X:

wget http://genomes.cribi.unipd.it/DATA/GENOME_12X/Genome12X.tar.gz

9. After you have downloaded the .tar.gz file of the reference genome, give the following command to extract it:

tar xvfz Genome12X.tar.gz

Once you have entered the above command, you will get FASTA (.fa) files of all the chromosomes of wine grape.

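10. Then, you have to concatenate the chromosome files into a single FASTA file, ideally in sequential order: chr1 to chr19, then chr1_random to chr19_random, then chr_Un for the grape genome. The simplest command would be the one below; note that the shell expands *.fa in alphabetical order (chr10 comes before chr2), so by itself it does not produce the numeric order: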

cat *.fa > Vitis_vinifera.fa
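To get the sequential order described in this step, the file names can be listed explicitly instead of relying on the wildcard. The sketch below assumes the extracted files are named chr1.fa to chr19.fa, chr1_random.fa to chr19_random.fa, and chrUn.fa; these names are my assumption, so check them against the files you actually extracted. The function names are also only illustrative.

```shell
# Print chromosome FASTA names in the intended order.
# The names chrN.fa, chrN_random.fa and chrUn.fa are assumptions; adjust to your files.
ordered_chromosome_files() {
    i=1
    while [ "$i" -le 19 ]; do echo "chr${i}.fa"; i=$((i + 1)); done
    i=1
    while [ "$i" -le 19 ]; do echo "chr${i}_random.fa"; i=$((i + 1)); done
    echo "chrUn.fa"
}

# Concatenate the files that exist, in that order, into the output FASTA ($1).
concat_genome() {
    : > "$1"
    for f in $(ordered_chromosome_files); do
        if [ -f "$f" ]; then
            cat "$f" >> "$1"
        else
            echo "skipping missing file: $f" >&2
        fi
    done
}

# Example call (left commented out so nothing is overwritten by accident):
# concat_genome Vitis_vinifera.fa
```

Any file reported as missing is a hint that the assumed names do not match the archive contents.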

11. Since you have concatenated all the chromosome files, you may no longer need the individual files. So, you can remove them with the command:

rm chr*.fa

12. Now, it is time to create the index of the reference genome with the following command:

bwa index [options] <in.fasta>

For my illustration, the command would be:

bwa index -p Vitis_vinifera -a bwtsw Vitis_vinifera.fa
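Indexing a whole genome takes a while, so it is worth confirming that it finished. bwa index writes five files that share the chosen prefix, with the extensions .amb, .ann, .bwt, .pac, and .sa. The sketch below checks that all five exist and are not empty; the function name index_complete is hypothetical.

```shell
# Return success if all five files written by `bwa index` exist for the given prefix.
index_complete() {
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$1.$ext" ]; then
            echo "missing or empty: $1.$ext" >&2
            return 1
        fi
    done
}

# The prefix matches the -p value used in the indexing command.
if index_complete Vitis_vinifera 2>/dev/null; then
    echo "index looks complete"
else
    echo "index not found or incomplete for prefix Vitis_vinifera"
fi
```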

Up to this step, we have installed BWA and created the index of the reference genome. In the steps below, we are going to align the FASTQ (.fq) files of the sample's genomic DNA sequence reads. In this case, I used sample .fq files of Vitis riparia Michx., which were generated from raw paired-end SRA (.sra; Sequence Read Archive) files. To generate the .fq files, the SRA files have to be dumped to FASTQ and the low-quality reads trimmed away. To convert raw SRA files into quality-trimmed .fq files, I followed the steps below:

13. Download the SRA file of the sample (here, SRR7819179.sra) from the NCBI SRA database.

14. Since SRR7819179.sra is a paired-end file, dump it into two FASTQ files with the command:

fastq-dump --split-files SRR7819179.sra

This command will give SRR7819179_1.fastq and SRR7819179_2.fastq.
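Paired-end alignment expects the two files to hold the same number of reads, so a quick count before trimming can catch a truncated or uneven dump. This is a minimal sketch that assumes uncompressed FASTQ with exactly four lines per read; the function names are my own.

```shell
# Count reads in an uncompressed FASTQ file, assuming four lines per record.
count_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Return success if both mate files contain the same number of reads.
pairs_match() {
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}

r1=SRR7819179_1.fastq
r2=SRR7819179_2.fastq
if [ -f "$r1" ] && [ -f "$r2" ]; then
    if pairs_match "$r1" "$r2"; then
        echo "read counts match: $(count_reads "$r1") pairs"
    else
        echo "read counts differ between $r1 and $r2"
    fi
fi
```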

15. Then, quality trim both FASTQ files with fastp using this command:

fastp -i SRR7819179_1.fastq -I SRR7819179_2.fastq -o VRiparia_1.fq -O VRiparia_2.fq

16. Finally, the quality-trimmed .fq files of the sample can be aligned to the reference genome to get a SAM (.sam; Sequence Alignment Map) file with this command:

bwa mem [options] idxbase in1.fq in2.fq > result.sam

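In my instance, the command is:

bwa mem -M -t 16 /B42T/mukesh/DNA_Seq_Work/index/Vitis_vinifera VRiparia_1.fq VRiparia_2.fq > VRiparia.sam

[Here, idxbase is the index prefix that was given to 'bwa index' with -p (Vitis_vinifera), not the name of the FASTA file. The option -t 16 runs 16 threads, so adjust it to the number of cores available on your server.]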
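As a rough first look at the result, the alignment records in the SAM file can be counted with standard tools. The sketch below skips header lines (which start with "@") and treats a record as unmapped when bit 0x4 of the FLAG field (column 2) is set, as defined in the SAM format. The function name sam_summary is hypothetical, and the counts are records rather than reads, since one read can have more than one record.

```shell
# Print "total mapped" counts of alignment records in a SAM file.
# Header lines start with "@"; FLAG (column 2) bit 0x4 marks an unmapped record.
sam_summary() {
    awk -F '\t' '
        /^@/ { next }
        { total++; if (int($2 / 4) % 2 == 0) mapped++ }
        END { printf "%d %d\n", total, mapped }
    ' "$1"
}

if [ -f VRiparia.sam ]; then
    sam_summary VRiparia.sam
fi
```

A mapped count far below the total may point to a wrong reference or index prefix rather than to poor reads.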

Now that we have the SAM (.sam) file, we can use it for further analyses, such as the detection of genomic variation.
