Viral Metagenome from a dolphin sample: hunting for a disease causing virus¶
In this tutorial you will learn how to investigate metagenomics data and retrieve draft genome from an assembled metagenome.
We will use a real dataset published in 2017 in a study in dolphins, where fecal samples where prepared for viral metagenomics study. The dolphin had a self-limiting gastroenteritis of suspected viral origin.
Getting the Data¶
First, create an appropriate directory to put the data:
mkdir -p ~/dolphin/data cd ~/dolphin/data
You can download them from here:
curl -O -J -L https://osf.io/4x6qs/download curl -O -J -L https://osf.io/z2xed/download
Alternatively, your instructor will let you know where to get the dataset from.
You should get 2 compressed files:
We will use FastQC to check the quality of our data, as well as fastp for trimming the bad quality part of the reads. If you need a refresher on how and why to check the quality of sequence data, please check the Quality Control and Trimming tutorial
mkdir -p ~/dolphin/results cd ~/dolphin/results ln -s ~/dolphin/data/Dol1* . fastqc Dol1_*.fastq.gz
What is the average read length? The average quality?
Compared to single genome sequencing, which graphs differ?
Quality control with Fastp¶
We will removing the adapters and trim by quality.
Now we run fastp our read files
fastp -i Dol1_S19_L001_R1_001.fastq.gz -o Dol1_trimmed_R1.fastq \ -I Dol1_S19_L001_R2_001.fastq.gz -O Dol1_trimmed_R2.fastq \ --detect_adapter_for_pe --length_required 30 \ --cut_front --cut_tail --cut_mean_quality 10
Check the html report produced.
How many reads were trimmed?
Removing the host sequences by mapping/aligning on the dolphin genome¶
For this we will use Bowtie2. We have downloaded the genome of Tursiops truncatus from Ensembl (fasta file). Then we have run the following command to produce the indexes of the dolphin genome for Bowtie2 (do not run it, we have pre-calculated the results for you):
bowtie2-build Tursiops_truncatus.turTru1.dna.toplevel.fa Tursiops_truncatus
Because this step takes a while, we have precomputed the index files, you can get them from here:
curl -O -J -L https://osf.io/wfk9t/download
First we will extract the bowtie indexes of the dolphin genome into our results directory:
tar -xzvf host_genome.tar.gz
Now we are ready to map our sequencing reads on the dolphin genome:
bowtie2 -x host_genome/Tursiops_truncatus \ -1 Dol1_trimmed_R1.fastq -2 Dol1_trimmed_R2.fastq \ -S dol_map.sam --un-conc Dol_reads_unmapped.fastq --threads 4
How many reads mapped on the dolphin genome?
Taxonomic classification of the trimmed reads¶
We will use Kaiju for the classification of the produced contigs. As we are mainly interested in detecting the viral sequences in our dataset and we want to reduce the computing time and the memory needed, we will build a viruses-only database.
# Kaiju needs to be installed cd mkdir -p databases/kaijudb cd databases/kaijudb makeDB.sh -v
Once the database built, Kaiju tells us that we need only 3 files: Then other files and directories can be removed
# Removing un-needed things rm -r genomes/ rm kaiju_db.bwt rm kaiju_db.sa rm merged.dmp rm kaiju_db.faa
We are now ready to run Kaiju on our trimmed reads
cd dolphin/results kaiju -t ~/databases/kaijudb/nodes.dmp -f ~/databases/kaijudb/kaiju_db.fmi \ -i Dol_reads_unmapped.1.fastq -j Dol_reads_unmapped.2.fastq \ -o Dol1_reads_kaiju.out
In order to visualise the results, we will produce a Krona chart. This step requires to have KronaTools installed:
conda install -c bioconda krona
kaiju2krona -t ~/databases/kaijudb/nodes.dmp -n ~/databases/kaijudb/names.dmp \ -i Dol1_reads_kaiju.out -o Dol1_reads_kaiju.krona -u ktImportText -o Dol1_reads_kaiju.krona.html Dol1_reads_kaiju.krona
If we were interested in bacteria and had a databases containing them, we could also use the command kaijuReport to get a text summary. Unfortunately, it does not provide taxonomic levels for viruses.
Then we copy the produced html file locally to visualise in our web browser.
Megahit will be used for the de novo assembly of the metagenome.
megahit -1 Dol_reads_unmapped.1.fastq -2 Dol_reads_unmapped.2.fastq -o assembly
The resulting assembly can be found under
How many contigs does this assembly contain? Is there any long contig?
Taxonomic classification of contigs¶
We will use Kaiju again with the same viruses-only database for the classification of the produced contigs.
cd assembly kaiju -t ~/databases/kaijudb/nodes.dmp -f ~/databases/kaijudb/kaiju_db.fmi \ -i final.contigs.fa -o Dol1_contigs_kaiju.out
Then we produce the Krona chart:
kaiju2krona -t ~/databases/kaijudb/nodes.dmp -n ~/databases/kaijudb/names.dmp \ -i Dol1_contigs_kaiju.out -o Dol1_contigs_kaiju.krona -u ktImportText -o Dol1_contigs_kaiju.krona.html Dol1_contigs_kaiju.krona
Does the classification of contigs produce different results than the classification of reads?
Extraction of the contig of interest¶
Let's go for a little practice of your Unix skills!
Find a way to to find the longest contig. They look like this:
>k141_1 flag=1 multi=1.0000 len=301 >k141_2 flag=1 multi=1.0000 len=303
hint 1: all lines containing '>'
hint 2: use grep and sed
Once we have this file, we want to sort all the sequences headers by the sequence length (len=X):
cat final.contigs.fa | grep ">" | sed s/len=// | sort -k4n | tail -1
What is the size of the longest contig?
Now that you have identified the sequence header or id of the longest contig, you want to save it to a fasta file.
grep -i '>k141_XXX' -A 1 final.contigs.fa > longest_contig.fasta
You need to replace the XXX by the correct header. The option -A 1 of grep allows to print 1 line additionally to the matching line (which enables to print the full sequence that corresponds to one line). Test with -A 2 (without the redirection to longest_contig.fasta) to see what happens.
Now that you have identified the longest contig, you will check in Kaiju results what was the taxon assigned to this contig.
Have a look at the file Dol1_contigs_kaiju.out. It is structured in 3 columns: classification status (C/U), sequence id, assigned TaxID.
Identify the TaxID of the longest contig and search on NCBI Taxonomy database to which species it corresponds to.
Genome annotation of the contig of interest¶
Once the contig to annotate is extracted and saved in the file longest_contig.fasta, we will use Prokka to detect ORFs (Open Reading Frames) in order to predict genes and their resulting proteins.
First, go to Uniprot database and retrieve a set of protein sequences
belonging to adenoviruses. Save the file as
adenovirus.faa and copy it in
your results directory.
prokka --outdir annotation --kingdom Viruses \ --proteins adenovirus.faa longest_contig.fasta
How many genes and proteins were predicted?
Visualization and manual curation.¶
If there is some time left, you can visualise the produced annotation (gff file) in Ugene or Artemis for example.
Go further with proteins functions¶
For the predicted proteins that are left "hypotetical", you can try running Interproscan on them to get more information on domains and motifs.