July 2, 2019

Hybrid PacBio + Illumina assembly: turning Unicycler on its head


tl;dr Unicycler is an amazing program that does a great job of assembling microbial genomes from short and long reads. In hybrid assembly mode, short reads are assembled into contigs, then these are spanned using long reads such as those from PacBio or Oxford Nanopore Technologies. This is an oversimplified version, but is the gist of it.

Occasionally though, I find that highly repetitive genome structures are better resolved (even vs. Unicycler's 'bold' mode) by doing the opposite: assemble the long reads, then use them as a trusted guide for short read assembly, and then use the long reads to fill the gaps in the guide short assembly.
___________

Note: All of the tools described below are available in BioConda, and can be installed using commands like "conda install lordec". Links to the original tools pages are provided for reference only.

Step 1. Employ LoRDEC to correct the PacBio reads using the short reads:

$ lordec-correct -2 all_illumina.fastq -k 19 -s 3 -i PacBio.filtered_subreads.fastq -o pacbio-corrected.fastq

Step 2. Run Canu on the long reads to get a draft assembly (this is about a 3Mbp genome in my example):


$ canu -p my_isolate_name -d canu genomeSize=3.0m -pacbio-raw pacbio-corrected.fastq

Step 3a. Pick the reasonable contigs from the canu output directory (which we imaginatively called "canu"):


$ cd canu
$ grep '>' my_isolate_name.contigs.fasta 
>tig00000001 len=2709880 reads=3198 covStat=2122.56 gappedBases=no class=contig suggestRepeat=no suggestCircular=no

If more than one contig, and one is a lot smaller than the other, maybe you have plasmids.

Step 3b. Run SPAdes on the short reads with these as trusted contigs:

$ spades.py -o spades.2002_S45 --trusted-contigs GD2002.contigs.fasta -1 ../2002_S45_L001_R1_001.fastq -2 ../2002_S45_L001_R2_001.fastq --cov-cutoff auto

And if you suspect you have a plasmid, run the same thing again but with a new output directory and the --plasmid flag:

$ spades.py -o spades.2002_S45.plasmid --trusted-contigs GD2002.contigs.fasta -1 ../2002_S45_L001_R1_001.fastq -2 ../2002_S45_L001_R2_001.fastq --cov-cutoff auto --plasmid

Step 4a. Run minimap2 piped into samtools, for the short reads against the Spades output (needed for Pilon in Step 5):

$ cd spades
$ minimap2 -ax sr contigs.fasta ../../1042_S33_L001_R1_001.fastq ../../1042_S33_L001_R2_001.fastq | samtools sort -T tmp -O bam > short_reads_to_asm.bam

Step 4b. Sanity check that most reads actually mapped, and index the BAM file:

$ samtools index short_reads_to_asm.bam
$ samtools flagstat short_reads_to_asm.bam
843188 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
5026 + 0 supplementary
0 + 0 duplicates
839119 + 0 mapped (99.52% : N/A)
838162 + 0 paired in sequencing
419081 + 0 read1
419081 + 0 read2
825398 + 0 properly paired (98.48% : N/A)
830940 + 0 with itself and mate mapped
3153 + 0 singletons (0.38% : N/A)
624 + 0 with mate mapped to a different chr
543 + 0 with mate mapped to a different chr (mapQ>=5)

Step 5. Run Pilon to polish the Spades assembly.

$ pilon --genome contigs.fasta --frags short_reads_to_asm.bam


Step 6. Map the Pilon assembly to the canu long reads assembly:

$ minimap2 -a ../*.contigs.fasta pilon.fasta > pilon_vs_trusted.sam

Step 7. Use this little script (rename to trusted_pilon_overlay and make executable on Linux) to overlay the Pilon polished Illumina assembly contigs (higher per base quality) onto the long reads canu assembly (lower per base quality) for the final genome:

$ trusted_pilon_overlay ../*.contigs.fasta pilon_vs_trusted.sam pilon_overlay_on_trusted.fasta

Step 8a: Check that the plasmids are real with BLAST matches to exiting plasmids:

blastn -db nt -query pilon_overlay_on_trusted.fasta -out pilon_overlay_on_trusted.fasta.blastn_nt.txt -remote

Step 8b: Check that you didn't assemble the Illumina phiX spike-in control sequence into your genome! That would be embarrassing (and all too common in GenBank).


$ makeblastdb -dbtype nucl -in phiX.fna
$ blastn -db phiX.fna -query pilon_overlay_on_trusted.fasta -out pilon_overlay_on_trusted.phiX.txt

The result should be "No hits found".  Woohoo, you're done!

Step 9 (optional).  If you still have multiple contigs, start exploring the assembly graph (the .gfa file in your outputs) visually in Bandage to see what might be the cause. Loading a Unicycler assembly and overlaying the pilon_overlay_on_trusted.fasta file as BLAST matches can be informative too, like below where the bubbles represent multiple nearly identical partially integrated phages in a genome. In the case below, my method builds two contigs, displayed as green and blue BLAST hits on Bandage's 'bold' 22 contigs.