Skip to content

Metagenome assembly and binning

In this tutorial you'll learn how to inspect assemble metagenomic data and retrieve draft genomes from assembled metagenomes

We'll use a mock community of 20 bacteria sequenced using the Illumina HiSeq. In reality the data were simulated using InSilicoSeq.

The 20 bacteria in the dataset were selected from the Tara Ocean study that recovered 957 distinct Metagenome-assembled-genomes (or MAGs) that were previsouly unknown! (full list on figshare )

Getting the Data

mkdir -p ~/data
cd ~/data
curl -O -J -L
curl -O -J -L
chmod -w tara_reads_R*

Quality Control

we'll use FastQC to check the quality of our data, as well as sickle for trimming the bad quality part of the reads. If you need a refresher on how and why to check the quality of sequence data, please check the Quality Control and Trimming tutorial

mkdir -p ~/results
cd ~/results
ln -s ~/data/tara_reads_* .
fastqc tara_reads_*.fastq.gz


What is the average read length? The average quality?


Compared to single genome sequencing, what graphs differ?

Now we'll trim the reads using sickle

sickle pe -f tara_reads_R1.fastq.gz -r tara_reads_R2.fastq.gz -t sanger \
    -o tara_trimmed_R1.fastq -p tara_trimmed_R2.fastq -s /dev/null


How many reads were trimmed?


Megahit will be used for the assembly.

megahit -1 tara_trimmed_R1.fastq -2 tara_trimmed_R2.fastq -o tara_assembly

the resulting assenmbly can be found under tara_assembly/final.contigs.fa.


How many contigs does this assembly contain?


First we need to map the reads back against the assembly to get coverage information

ln -s tara_assembly/final.contigs.fa .
bowtie2-build final.contigs.fa final.contigs
bowtie2 -x final.contigs -1 tara_reads_R1.fastq.gz -2 tara_reads_R2.fastq.gz | \
    samtools view -bS -o tara_to_sort.bam
samtools sort tara_to_sort.bam -o tara.bam
samtools index tara.bam

then we run metabat -m 1500 final.contigs.fa tara.bam
mv final.contigs.fa.metabat-bins1500 metabat


How many bins did we obtain?

Checking the quality of the bins

The first time you run checkm you have to create the database

sudo checkm data setRoot ~/.local/data/checkm
checkm lineage_wf -x fa metabat checkm/
checkm bin_qa_plot -x fa checkm metabat plots


Which bins should we keep for downstream analysis?


checkm can plot a lot of metrics. If you have time, check the manual and try to produce different plots


if checkm fails at the phylogeny step, it is likely that your vm doesn't have enough RAM. pplacer requires about 35G of RAM to place the bins in the tree of life.

In that case, execute the following

cd ~/results
curl -O -J -L
tar xzf checkm.tar.gz
checkm qa checkm/ checkm

then plot the completeness

checkm bin_qa_plot -x fa checkm metabat plots

and take a look at plots/bin_qa_plot.png

Further reading