December 13, 2019

Nanopore bacterial genome assemblies

tl;dr You should be able to get 99.95%+ DNA genome assembly accuracy from an Oxford Nanopore Technologies MinION device using the combination of Guppy modified base high accuracy base calling + Flye genome assembly w/ 2 rounds of polishing + 4 rounds of racon polishing + one round of medaka polishing. Steps to make a hybrid assembly with Illumina (SPAdes + pilon) are also included.

On a properly kitted Linux desktop computer with a good graphics card, this should take about an hour per genome, and does *not* require administrator (root or sudo) privileges. All the software in this tutorial can be installed in your local account using standard *nix commands and conda.

There may be similar scripts to do all this elsewhere, but I'll try to explain every step a bit here so you don't feel that you're cargo culting nanopore bioinformatics.

______

Handheld nanopore sequencing with the MinION and its ilk has put the power to generate into every lab, but the informatics can still be a bit intimidating for folks as there are so many options. This is one way to generate bacterial genome assemblies with nanopore data that has worked pretty well for us.

SETUP

I am assuming at minimum that you are running Linux on a computer with at least 4 CPU cores, and an NVIDIA GPU here. Pretty much any gaming laptop would suffice, or a ~$2000 desktop. You should know the basics of launching a terminal window and navigating the file system from the command line. If you don't know these essentials, please spend 45 minutes at Software Carpentry to learn it... trust me you won't regret it!

I am also assuming that you have some R9.4.1 flow cell data, which you've generated on the MinION or Flongle or PromethION and placed it on the Linux machine that you will use to run this tutorial.

If you don't already have a conda software manager in the Linux box, install one with all the defaults:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh 
bash Miniconda3-latest-Linux-x86_64.sh 

Complete the installation, then log out/in of your terminal window so it's all set up.

Add conda "channels" (repositories) that might be useful for finding packages we're going to use.

conda config --add channels conda-forge 
conda config --add channels defaults 
conda config --add channels bioconda 

Create an environment that will house all the programs we're using in this tutorial, then activate it:

conda create -n nanopore_assembly 
conda activate nanopore_assembly 

All of the conda commands in this tutorial only need to be run the very first time you try this tutorial, except the conda activate and deactivate commands, which should be run each time.

NANOPORE ASSEMBLY

Now, let's install the Oxford Nanopore Technologies base caller Guppy (for the latest version, update the address by visiting the Community Page):

wget https://mirror.oxfordnanoportal.com/software/analysis/ont-guppy_3.4.3_linux64.tar.gz 
tar zxvf ont-guppy_3.4.3_linux64.tar.gz 

Run the Guppy base caller on your data, using the high accuracy model with base modification (methylation) calls:

ont-guppy/bin/guppy_basecaller --input_path /path/to/your/directory/of/fast5_pass --save_path /path/where/you/want/to/write/results --num_callers 4 --config dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg --device cuda:all:100%

The output should look something like this if successful:

ONT Guppy basecalling software version 3.4.3+f4fc735
config file:        /opt/ont/guppy/data/dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg
model file:         /opt/ont/guppy/data/template_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.jsn
input path:         /localhome/gordonp/For Paul/19-11-19 pooled 6-10/pooled_6-10/pooled/20191119_1839_MN30852_FAL04937_2e9551ca/fast5/
save path:          guppy
chunk size:         1000
chunks per runner:  512
records per file:   4000
num basecallers:    8
gpu device:         cuda:all:100%
kernel path:        
runners per device: 8

Found 109 fast5 files to process.
Init time: 1951 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 1639062 ms, Samples called: 23534183594, samples/s: 1.43583e+07
Finishing up any open output files.

Basecalling completed successfully.

There are actually a lot of tweaks you can make to the previous command to accelerate the base calling, depending on the NVIDIA GPU you have on board your computer. This is left as an exercise to the reader. If the command fails, try removing the "--num_callers 4" option, and make sure nothing else is using the GPU on the system except /usr/bin processes when you run the command nvidia-smi.

Your results are in the guppy subdirectory, so let's go there:

cd guppy

*If* you barcoded a number of samples into one run (since a MinION run has capacity to do multiple 2-5Mbp bacterial genomes), you will need to demultiplex FASTQ files generated by Guppy, using the ONT software qcat:

conda install qcat
conda install future
cat *.fastq | qcat --trim -b demultiplexed

Hopefully you see an output that corresponds to the barcodes that you actually used! There will be a subdirectory for each sample in guppy/demultiplexed/barcode##, plus maybe a few false positive barcode IDs that you can safely ignore.

Adapters detected in 336008 of 432735 reads
  NBD104/NBD114 336008: |      ############### |  77.65 %
           none  85909: |                  ### |  19.85 %
Barcodes detected in 336008 of 432735 adapters
      barcode01     16: |                      |   0.00 %
      barcode02      1: |                      |   0.00 %
      barcode04      5: |                      |   0.00 %
      barcode05      9: |                      |   0.00 %
      barcode06 105209: |                 #### |  24.31 %
      barcode07  63486: |                   ## |  14.67 %
      barcode08  67515: |                  ### |  15.60 %
      barcode09  57684: |                   ## |  13.33 %
      barcode10  42037: |                    # |   9.71 %
      barcode12      1: |                      |   0.00 %
      barcode13      5: |                      |   0.00 %
      barcode14      9: |                      |   0.00 %
      barcode15      7: |                      |   0.00 %
      barcode16      1: |                      |   0.00 %
      barcode17      1: |                      |   0.00 %
      barcode18      6: |                      |   0.00 %
      barcode19      5: |                      |   0.00 %
      barcode21      4: |                      |   0.00 %
      barcode22      6: |                      |   0.00 %
      barcode24      1: |                      |   0.00 %
           none  85909: |                  ### |  19.85 %
10818 reads were skipped due to the min. length filter.

Demultiplexing finished in 308.34s

Now we are almost ready to assemble the reads from a sample into a draft genome. If you're multiplexed, you'll need to go into the subdirectory with the results of the demuplexing:

cd demultiplexed

For everything that follows in this tutorial, naturally you will run the commands for each barcode##.fastq file. *Alternatively*, if you only had one sample, concatenate all the Guppy output into one FASTQ file called 'all.fastq':

cat *_0.fastq > all.fastq

We are going to use the Flye assembler to generate a draft genome for a given barcode (or all.fastq if you only had one sample on the run). In this case, I'm assembling the sample labelled with barcode06, and I suspect it might have plasmids so we'll check for those. The genome should be about 3Mbp, hence the '-g 3m' option provided, but don't worry this need only be a rough estimate.

conda install flye 
flye --nano-raw barcode06.fastq --threads 8 --iterations 2 --plasmids -g 3m --out-dir barcode06

In my limited experience, asking for more than 2 iterations of polishing from Flye can be counterproductive, but your mileage may vary. The output is in the directory barcode06, so let's go there:

cd barcode06

The Flye output is in assembly.fasta, but it's kind of rough. Let's now apply a nice general genome polisher called racon to read overlaps of >500bp with a maximum of 15% mismatches. We'll do this over four iterations to fix up the consensus sequence (with the final result in racon4.fasta):

conda install racon
conda install minimap2
ln -s assembly.fasta racon0.fasta
for n in 1 2 3 4; 
do 
minimap2 racon`expr $n - 1`.fasta ../barcode06.fastq > minimap.paf; 
racon ../barcode06.fastq minimap.paf racon`expr $n - 1`.fasta -e 0.15 -t 8 -m 8 -x -6 -g -8 -w 500 > racon$n.fasta; 
done

Why four iterations? That's what the tool in the next step was trained using, so we're feeding it what it's seen before. There is also a fancier version of racon that uses the GPU, but if you're using this tutorial because you're a novice, maybe you're best to explore racon-gpu once you have more experience.

At this point, we'd like to run the ONT-developed medaka polisher which helps resolve a lot of homopolymer consensus errors typical of nanopore signal (e.g. deciding between 5 A's in a row vs. 7). *But*, its dependencies are incompatible with other tools we've used so far, so we'll need to switch to a new conda environment (this is one of the key strengths of conda, to put a fence between incompatible softwares).

conda deactivate
conda create -n nanopore_polishing
conda activate nanopore_polishing
conda install medaka 

Now we're ready to run the medaka polisher:

medaka_consensus -i ../barcode06.fastq -d racon4.fasta -o medaka -t 8 

Your final nanopore-only genome is in medaka/consensus.fasta. Most of the remaining errors ~0.05% errors will be homopolymer stretches.

HYBRID ASSEMBLY (OPTIONAL)

If you also have Illumina paired-end short read data for the same biological sample, these can be used to correct the assembly by first de novo assembling the short reads with SPAdes (using the nanopore long read assembly as a trusted guide to help get through the repetitive regions that are troublesome in short read assembly):

conda install spades
spades.py -o spades --trusted-contigs medaka/consensus.fasta -1 /path/to/illumina/sample_R1_001.fastq.gz  -2 /path/to/illumina/sample_R2_001.fastq.gz

Now let's polish the short read assembly with Pilon, which first requires mapping the Illumina reads to our new SPAdes assembly:

conda install pilon
minimap2 -a spades/contigs.fasta /path/to/illumina/sample_R1_001.fastq.gz /path/to/illumina/sample_R2_001.fastq.gz | samtools sort -T tmp -O bam - > illumina.bam
samtools index illumina.bam
pilon --genome spades/contigs.fasta --frags illumina.bam

Then we'll take note of where the SPAdes contigs are positioned within the nanopore consensus genome.

minimap2 -a medaka/consensus.fasta pilon.fasta > pilon_vs_trusted.sam

Now we need to overlay the high quality (read: much fewer homopolymer problems) short-read assembly on the trusted contiguous long read assembly (read: resolves long repeats properly). I've written a script for that, so we'll download it and change its file mode make it executable.

wget -O trusted_pilon_overlay https://owncloud.westgrid.ca/index.php/s/S4BV5FpEQao7Kbu/download
chmod u+x trusted_pilon_overlay

Now let's run the overlay script:

./trusted_pilon_overlay medaka/consensus.fasta pilon_vs_trusted.sam pilon_overlay_on_trusted.fasta

Your hybrid polished genome is in 'pilon_overlay_on_trusted.fasta'. Enjoy!  Note that if you want to polish any plasmids with the short reads as well, there are additional steps to be taken, but I will do a separate post about that since there are a number of caveats.




July 2, 2019

Hybrid PacBio + Illumina assembly: turning Unicycler on its head


tl;dr Unicycler is an amazing program that does a great job of assembling microbial genomes from short and long reads. In hybrid assembly mode, short reads are assembled into contigs, then these are spanned using long reads such as those from PacBio or Oxford Nanopore Technologies. This is an oversimplified version, but is the gist of it.

Occasionally though, I find that highly repetitive genome structures are better resolved (even vs. Unicycler's 'bold' mode) by doing the opposite: assemble the long reads, then use them as a trusted guide for short read assembly, and then use the long reads to fill the gaps in the guide short assembly.
___________

Note: All of the tools described below are available in BioConda, and can be installed using commands like "conda install lordec". Links to the original tools pages are provided for reference only.

Step 1. Employ LoRDEC to correct the PacBio reads using the short reads:

$ lordec-correct -2 all_illumina.fastq -k 19 -s 3 -i PacBio.filtered_subreads.fastq -o pacbio-corrected.fastq

Step 2. Run Canu on the long reads to get a draft assembly (this is about a 3Mbp genome in my example):


$ canu -p my_isolate_name -d canu genomeSize=3.0m -pacbio-raw pacbio-corrected.fastq

Step 3a. Pick the reasonable contigs from the canu output directory (which we imaginatively called "canu"):


$ cd canu
$ grep '>' my_isolate_name.contigs.fasta 
>tig00000001 len=2709880 reads=3198 covStat=2122.56 gappedBases=no class=contig suggestRepeat=no suggestCircular=no

If more than one contig, and one is a lot smaller than the other, maybe you have plasmids.

Step 3b. Run SPAdes on the short reads with these as trusted contigs:

$ spades.py -o spades.2002_S45 --trusted-contigs GD2002.contigs.fasta -1 ../2002_S45_L001_R1_001.fastq -2 ../2002_S45_L001_R2_001.fastq --cov-cutoff auto

And if you suspect you have a plasmid, run the same thing again but with a new output directory and the --plasmid flag:

$ spades.py -o spades.2002_S45.plasmid --trusted-contigs GD2002.contigs.fasta -1 ../2002_S45_L001_R1_001.fastq -2 ../2002_S45_L001_R2_001.fastq --cov-cutoff auto --plasmid

Step 4a. Run minimap2 piped into samtools, for the short reads against the Spades output (needed for Pilon in Step 5):

$ cd spades
$ minimap2 -ax sr contigs.fasta ../../1042_S33_L001_R1_001.fastq ../../1042_S33_L001_R2_001.fastq | samtools sort -T tmp -O bam > short_reads_to_asm.bam

Step 4b. Sanity check that most reads actually mapped, and index the BAM file:

$ samtools index short_reads_to_asm.bam
$ samtools flagstat short_reads_to_asm.bam
843188 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
5026 + 0 supplementary
0 + 0 duplicates
839119 + 0 mapped (99.52% : N/A)
838162 + 0 paired in sequencing
419081 + 0 read1
419081 + 0 read2
825398 + 0 properly paired (98.48% : N/A)
830940 + 0 with itself and mate mapped
3153 + 0 singletons (0.38% : N/A)
624 + 0 with mate mapped to a different chr
543 + 0 with mate mapped to a different chr (mapQ>=5)

Step 5. Run Pilon to polish the Spades assembly.

$ pilon --genome contigs.fasta --frags short_reads_to_asm.bam


Step 6. Map the Pilon assembly to the canu long reads assembly:

$ minimap2 -a ../*.contigs.fasta pilon.fasta > pilon_vs_trusted.sam

Step 7. Use this little script (rename to trusted_pilon_overlay and make executable on Linux) to overlay the Pilon polished Illumina assembly contigs (higher per base quality) onto the long reads canu assembly (lower per base quality) for the final genome:

$ trusted_pilon_overlay ../*.contigs.fasta pilon_vs_trusted.sam pilon_overlay_on_trusted.fasta

Step 8a: Check that the plasmids are real with BLAST matches to exiting plasmids:

blastn -db nt -query pilon_overlay_on_trusted.fasta -out pilon_overlay_on_trusted.fasta.blastn_nt.txt -remote

Step 8b: Check that you didn't assemble the Illumina phiX spike-in control sequence into your genome! That would be embarrassing (and all too common in GenBank).


$ makeblastdb -dbtype nucl -in phiX.fna
$ blastn -db phiX.fna -query pilon_overlay_on_trusted.fasta -out pilon_overlay_on_trusted.phiX.txt

The result should be "No hits found".  Woohoo, you're done!

Step 9 (optional).  If you still have multiple contigs, start exploring the assembly graph (the .gfa file in your outputs) visually in Bandage to see what might be the cause. Loading a Unicycler assembly and overlaying the pilon_overlay_on_trusted.fasta file as BLAST matches can be informative too, like below where the bubbles represent multiple nearly identical partially integrated phages in a genome. In the case below, my method builds two contigs, displayed as green and blue BLAST hits on Bandage's 'bold' 22 contigs.