Metagenome assembly and binning¶
In this tutorial you'll learn how to inspect assemble metagenomic data and retrieve draft genomes from assembled metagenomes
We'll use a mock community of 20 bacteria sequenced using the Illumina HiSeq. In reality the data were simulated using InSilicoSeq.
Getting the Data¶
mkdir -p ~/data cd ~/data curl -O -J -L https://osf.io/th9z6/download curl -O -J -L https://osf.io/k6vme/download chmod -w tara_reads_R*
we'll use FastQC to check the quality of our data, as well as sickle for trimming the bad quality part of the reads. If you need a refresher on how and why to check the quality of sequence data, please check the Quality Control and Trimming tutorial
mkdir -p ~/results cd ~/results ln -s ~/data/tara_reads_* . fastqc tara_reads_*.fastq.gz
What is the average read length? The average quality?
Compared to single genome sequencing, what graphs differ?
Now we'll trim the reads using sickle
sickle pe -f tara_reads_R1.fastq.gz -r tara_reads_R2.fastq.gz -t sanger \ -o tara_trimmed_R1.fastq -p tara_trimmed_R2.fastq -s /dev/null
How many reads were trimmed?
Megahit will be used for the assembly.
megahit -1 tara_trimmed_R1.fastq -2 tara_trimmed_R2.fastq -o tara_assembly
the resulting assenmbly can be found under
How many contigs does this assembly contain?
First we need to map the reads back against the assembly to get coverage information
ln -s tara_assembly/final.contigs.fa . bowtie2-build final.contigs.fa final.contigs bowtie2 -x final.contigs -1 tara_reads_R1.fastq.gz -2 tara_reads_R2.fastq.gz | \ samtools view -bS -o tara_to_sort.bam samtools sort tara_to_sort.bam -o tara.bam samtools index tara.bam
then we run metabat
runMetaBat.sh -m 1500 final.contigs.fa tara.bam mv final.contigs.fa.metabat-bins1500 metabat
How many bins did we obtain?
Checking the quality of the bins¶
The first time you run
checkm you have to create the database
sudo checkm data setRoot ~/.local/data/checkm
checkm lineage_wf -x fa metabat checkm/ checkm bin_qa_plot -x fa checkm metabat plots
Which bins should we keep for downstream analysis?
checkm can plot a lot of metrics. If you have time, check the manual and try to produce different plots
if checkm fails at the phylogeny step, it is likely that your vm doesn't have enough RAM. pplacer requires about 35G of RAM to place the bins in the tree of life.
In that case, execute the following
curl -O -J -L https://osf.io/xuzhn/download
tar xzf checkm.tar.gz
checkm qa checkm/lineage.ms checkm
then plot the completeness
checkm bin_qa_plot -x fa checkm metabat plots
and take a look at