CHGI Bioinformatics

January 30, 2020

Installing a local copy of NextStrain for nCov19 with Conda

NextStrain is great software for visualizing the relationships between isolates of a virus. It is used for influenza, but also for other viruses and they have made a system configured for the novel coronavirus that has emerged out of Wuhan, China. For most people, it is sufficient to view the public nCov19 instance of NextStrain at https://nextstrain.org/ncov?m=div. For the few that want to quickly check a new local sequence against the existing knowledge base, or at least the ability to do so, I've debugged the install process a bit to make it easier for you. Not that you'd be keeping your sequence private for any length of time, right? Submit to GISAID at the very least!
_________________________

Open a terminal.

If you don't already have a conda software manager in the Linux box, install one with all the defaults:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

bash Miniconda3-latest-Linux-x86_64.sh

Complete the installation, then log out/in of your terminal window so it's all set up.

Now add conda "channels" (repositories) that might be useful for finding packages we're going to use.

conda config --add channels conda-forge

conda config --add channels defaults

conda config --add channels bioconda

You now have Conda installed, congratulations! Your life will now be 37.4452% easier.

Create an environment that will house all the software we're using in for NextStrain, then activate it. There are some bits of code that don't work nicely together with the default packages NextStrain tries to install later, and there are some undocumented prerequisites, so let's avoid all those headaches by configuring our conda environment with the magic combination of packages that does work (disclaimer: I've only tested this on CentOS only).

conda create -n nextstrain

conda activate nextstrain

conda install nodejs python=3.7 datrie bcbio-gff mafft iqtree

Install the node.js code needed for NextStrain:

npm install --global auspice

Install the python code needed for NextStrain:

pip install nextstrain-augur nextstrain-cli

Make sure it has everything it needs:

nextstrain check-setup --set-default

Now you need the code and information (sample metadata, colour schemes, etc.) that have been generated for the nCoV19 virus to plug into the generic NextStrain framework.

git clone https://github.com/nextstrain/ncov.git

Enter the top level directory of the code:

cd ncov

The metadata is updated as new samples are made public, so every few days you'll want to refresh your instance by running this command from inside the top level directory of ncov:

git pull

Now you need a set of sequences for NextStrain to compare to each other. As of this writing there are 13 complete nCov19 genomes in the public GenBank repository. Download the latest FastA file of all complete genomes by clicking download on this page. If you have access to the GISAID data sharing portal, you can download those sequences too and append them to the GenBank data, saving it all in a file under the top level ncov directory called data/sequences.fasta.

The critical step at this stage is to make sure each FastA record's ID (everything between ">" and the first space or new line in the description line) matches the names NextStrain has given to each sequence in the file data/metadata.tsv. This is left as an exercise to the reader as the sequence data you have access to will be ever changing. Often you will need to change a '/' to a '-', remove the "BetaCov/" ID prefix, remove the '_EPI####' suffix from GISAID downloads, etc.

If you had a new sequence of your own that you wanted to add, you would append it to the data/sequences.fasta file, and add a line to the data/metadata.tsv with the relevant information. Note that this will prevent updates with git pull in the future, as git does not want to clobber your changes. Tip: Save the extra lines you wrote in another file for safe keeping, so you can delete data/metadata.tsv and auspice/ncov.json, then successfully run git pull for code updates.

To calculate the relationships between your sequences and generate the Web portal information, run the following (in the top level ncov directory):

snakemake -p

To access the Web interface, launch the portal server from that same directory:

nextstrain view auspice

——————————————————————————————————————————————————————————————————————————————

Open <http://127.0.0.1:4000/> in your browser.

Warning: No datasets detected.

[verbose] Serving index / favicon etc from "/gpfs/home/gordonp/.conda/envs/nextstrain/lib/node_modules/auspice"

[verbose] Serving built javascript from "/gpfs/home/gordonp/.conda/envs/nextstrain/lib/node_modules/auspice/dist"

---------------------------------------------------

Auspice server now running at http://127.0.0.1:4000

Serving auspice version 2.3.5

Looking for datasets in /gpfs/home/gordonp/ncov/auspice

Looking for narratives in /gpfs/home/gordonp/ncov/auspice

---------------------------------------------------

Then open your Web browser to the http address that command spits out, e.g. here I'm looking at the chart with Branch length set to "Divergence". I show you this because I find the (default) Time Branch length topology misleading with so many samples having no mutations, which generates some semi-random subtrees that seem to be dependent on the order in which you have the no-mutation records in data/sequences.fasta:

December 13, 2019

Nanopore bacterial genome assemblies

tl;dr You should be able to get 99.95%+ DNA genome assembly accuracy from an Oxford Nanopore Technologies MinION device using the combination of Guppy modified base high accuracy base calling + Flye genome assembly w/ 2 rounds of polishing + 4 rounds of racon polishing + one round of medaka polishing. Steps to make a hybrid assembly with Illumina (SPAdes + pilon) are also included.

On a properly kitted Linux desktop computer with a good graphics card, this should take about an hour per genome, and does *not* require administrator (root or sudo) privileges. All the software in this tutorial can be installed in your local account using standard *nix commands and conda.

There may be similar scripts to do all this elsewhere, but I'll try to explain every step a bit here so you don't feel that you're cargo culting nanopore bioinformatics.

______

Handheld nanopore sequencing with the MinION and its ilk has put the power to generate into every lab, but the informatics can still be a bit intimidating for folks as there are so many options. This is one way to generate bacterial genome assemblies with nanopore data that has worked pretty well for us.

SETUP

I am assuming at minimum that you are running Linux on a computer with at least 4 CPU cores, and an NVIDIA GPU here. Pretty much any gaming laptop would suffice, or a ~$2000 desktop. You should know the basics of launching a terminal window and navigating the file system from the command line. If you don't know these essentials, please spend 45 minutes at Software Carpentry to learn it... trust me you won't regret it!

I am also assuming that you have some R9.4.1 flow cell data, which you've generated on the MinION or Flongle or PromethION and placed it on the Linux machine that you will use to run this tutorial.

If you don't already have a conda software manager in the Linux box, install one with all the defaults:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Complete the installation, then log out/in of your terminal window so it's all set up.

Add conda "channels" (repositories) that might be useful for finding packages we're going to use.

conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels bioconda

Create an environment that will house all the programs we're using in this tutorial, then activate it:

conda create -n nanopore_assembly
conda activate nanopore_assembly

All of the conda commands in this tutorial only need to be run the very first time you try this tutorial, except the conda activate and deactivate commands, which should be run each time.

NANOPORE ASSEMBLY

Now, let's install the Oxford Nanopore Technologies base caller Guppy (for the latest version, update the address by visiting the Community Page):

wget https://mirror.oxfordnanoportal.com/software/analysis/ont-guppy_3.4.3_linux64.tar.gz
tar zxvf ont-guppy_3.4.3_linux64.tar.gz

Run the Guppy base caller on your data, using the high accuracy model with base modification (methylation) calls:

ont-guppy/bin/guppy_basecaller --input_path /path/to/your/directory/of/fast5_pass --save_path /path/where/you/want/to/write/results --num_callers 4 --config dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg --device cuda:all:100%

The output should look something like this if successful:

ONT Guppy basecalling software version 3.4.3+f4fc735

config file: /opt/ont/guppy/data/dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg

model file: /opt/ont/guppy/data/template_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.jsn

input path: /localhome/gordonp/For Paul/19-11-19 pooled 6-10/pooled_6-10/pooled/20191119_1839_MN30852_FAL04937_2e9551ca/fast5/

save path: guppy

chunk size: 1000

chunks per runner: 512

records per file: 4000

num basecallers: 8

gpu device: cuda:all:100%

kernel path:

runners per device: 8

Found 109 fast5 files to process.

Init time: 1951 ms

0% 10 20 30 40 50 60 70 80 90 100%

|----|----|----|----|----|----|----|----|----|----|

***************************************************

Caller time: 1639062 ms, Samples called: 23534183594, samples/s: 1.43583e+07

Finishing up any open output files.

Basecalling completed successfully.

There are actually a lot of tweaks you can make to the previous command to accelerate the base calling, depending on the NVIDIA GPU you have on board your computer. This is left as an exercise to the reader. If the command fails, try removing the "--num_callers 4" option, and make sure nothing else is using the GPU on the system except /usr/bin processes when you run the command nvidia-smi.

Your results are in the guppy subdirectory, so let's go there:

cd guppy

*If* you barcoded a number of samples into one run (since a MinION run has capacity to do multiple 2-5Mbp bacterial genomes), you will need to demultiplex FASTQ files generated by Guppy, using the ONT software qcat:

conda install qcat

conda install future
cat *.fastq | qcat --trim -b demultiplexed

Hopefully you see an output that corresponds to the barcodes that you actually used! There will be a subdirectory for each sample in guppy/demultiplexed/barcode##, plus maybe a few false positive barcode IDs that you can safely ignore.

Adapters detected in 336008 of 432735 reads

NBD104/NBD114 336008: | ############### | 77.65 %

none 85909: | ### | 19.85 %

Barcodes detected in 336008 of 432735 adapters

barcode01 16: | | 0.00 %

barcode02 1: | | 0.00 %

barcode04 5: | | 0.00 %

barcode05 9: | | 0.00 %

barcode06 105209: | #### | 24.31 %

barcode07 63486: | ## | 14.67 %

barcode08 67515: | ### | 15.60 %

barcode09 57684: | ## | 13.33 %

barcode10 42037: | # | 9.71 %

barcode12 1: | | 0.00 %

barcode13 5: | | 0.00 %

barcode14 9: | | 0.00 %

barcode15 7: | | 0.00 %

barcode16 1: | | 0.00 %

barcode17 1: | | 0.00 %

barcode18 6: | | 0.00 %

barcode19 5: | | 0.00 %

barcode21 4: | | 0.00 %

barcode22 6: | | 0.00 %

barcode24 1: | | 0.00 %

none 85909: | ### | 19.85 %

10818 reads were skipped due to the min. length filter.

Demultiplexing finished in 308.34s

Now we are almost ready to assemble the reads from a sample into a draft genome. If you're multiplexed, you'll need to go into the subdirectory with the results of the demuplexing:

cd demultiplexed

For everything that follows in this tutorial, naturally you will run the commands for each barcode##.fastq file. *Alternatively*, if you only had one sample, concatenate all the Guppy output into one FASTQ file called 'all.fastq':

cat *_0.fastq > all.fastq

We are going to use the Flye assembler to generate a draft genome for a given barcode (or all.fastq if you only had one sample on the run). In this case, I'm assembling the sample labelled with barcode06, and I suspect it might have plasmids so we'll check for those. The genome should be about 3Mbp, hence the '-g 3m' option provided, but don't worry this need only be a rough estimate.

conda install flye
flye --nano-raw barcode06.fastq --threads 8 --iterations 2 --plasmids -g 3m --out-dir barcode06

In my limited experience, asking for more than 2 iterations of polishing from Flye can be counterproductive, but your mileage may vary. The output is in the directory barcode06, so let's go there:

cd barcode06

The Flye output is in assembly.fasta, but it's kind of rough. Let's now apply a nice general genome polisher called racon to read overlaps of >500bp with a maximum of 15% mismatches. We'll do this over four iterations to fix up the consensus sequence (with the final result in racon4.fasta):

conda install racon
conda install minimap2
ln -s assembly.fasta racon0.fasta
for n in 1 2 3 4;
do
minimap2 racon`expr $n - 1`.fasta ../barcode06.fastq > minimap.paf;
racon ../barcode06.fastq minimap.paf racon`expr $n - 1`.fasta -e 0.15 -t 8 -m 8 -x -6 -g -8 -w 500 > racon$n.fasta;
done

Why four iterations? That's what the tool in the next step was trained using, so we're feeding it what it's seen before. There is also a fancier version of racon that uses the GPU, but if you're using this tutorial because you're a novice, maybe you're best to explore racon-gpu once you have more experience.

At this point, we'd like to run the ONT-developed medaka polisher which helps resolve a lot of homopolymer consensus errors typical of nanopore signal (e.g. deciding between 5 A's in a row vs. 7). *But*, its dependencies are incompatible with other tools we've used so far, so we'll need to switch to a new conda environment (this is one of the key strengths of conda, to put a fence between incompatible softwares).

conda deactivate
conda create -n nanopore_polishing
conda activate nanopore_polishing
conda install medaka

Now we're ready to run the medaka polisher:

medaka_consensus -i ../barcode06.fastq -d racon4.fasta -o medaka -t 8

Your final nanopore-only genome is in medaka/consensus.fasta. Most of the remaining errors ~0.05% errors will be homopolymer stretches.

HYBRID ASSEMBLY (OPTIONAL)

If you also have Illumina paired-end short read data for the same biological sample, these can be used to correct the assembly by first de novo assembling the short reads with SPAdes (using the nanopore long read assembly as a trusted guide to help get through the repetitive regions that are troublesome in short read assembly):

conda install spades
spades.py -o spades --trusted-contigs medaka/consensus.fasta -1 /path/to/illumina/sample_R1_001.fastq.gz -2 /path/to/illumina/sample_R2_001.fastq.gz

Now let's polish the short read assembly with Pilon, which first requires mapping the Illumina reads to our new SPAdes assembly:

conda install pilon
minimap2 -a spades/contigs.fasta /path/to/illumina/sample_R1_001.fastq.gz /path/to/illumina/sample_R2_001.fastq.gz | samtools sort -T tmp -O bam - > illumina.bam
samtools index illumina.bam
pilon --genome spades/contigs.fasta --frags illumina.bam

Then we'll take note of where the SPAdes contigs are positioned within the nanopore consensus genome.

minimap2 -a medaka/consensus.fasta pilon.fasta > pilon_vs_trusted.sam

Now we need to overlay the high quality (read: much fewer homopolymer problems) short-read assembly on the trusted contiguous long read assembly (read: resolves long repeats properly). I've written a script for that, so we'll download it and change its file mode make it executable.

wget -O trusted_pilon_overlay https://owncloud.westgrid.ca/index.php/s/S4BV5FpEQao7Kbu/download
chmod u+x trusted_pilon_overlay

Now let's run the overlay script:

./trusted_pilon_overlay medaka/consensus.fasta pilon_vs_trusted.sam pilon_overlay_on_trusted.fasta

Your hybrid polished genome is in 'pilon_overlay_on_trusted.fasta'. Enjoy! Note that if you want to polish any plasmids with the short reads as well, there are additional steps to be taken, but I will do a separate post about that since there are a number of caveats.

July 2, 2019

Hybrid PacBio + Illumina assembly: turning Unicycler on its head

tl;dr Unicycler is an amazing program that does a great job of assembling microbial genomes from short and long reads. In hybrid assembly mode, short reads are assembled into contigs, then these are spanned using long reads such as those from PacBio or Oxford Nanopore Technologies. This is an oversimplified version, but is the gist of it.

Occasionally though, I find that highly repetitive genome structures are better resolved (even vs. Unicycler's 'bold' mode) by doing the opposite: assemble the long reads, then use them as a trusted guide for short read assembly, and then use the long reads to fill the gaps in the guide short assembly.
___________

Note: All of the tools described below are available in BioConda, and can be installed using commands like "conda install lordec". Links to the original tools pages are provided for reference only.

Step 1. Employ LoRDEC to correct the PacBio reads using the short reads:

$ lordec-correct -2 all_illumina.fastq -k 19 -s 3 -i PacBio.filtered_subreads.fastq -o pacbio-corrected.fastq

Step 2. Run Canu on the long reads to get a draft assembly (this is about a 3Mbp genome in my example):

$ canu -p my_isolate_name -d canu genomeSize=3.0m -pacbio-raw pacbio-corrected.fastq

Step 3a. Pick the reasonable contigs from the canu output directory (which we imaginatively called "canu"):

$ cd canu

$ grep '>' my_isolate_name.contigs.fasta

>tig00000001 len=2709880 reads=3198 covStat=2122.56 gappedBases=no class=contig suggestRepeat=no suggestCircular=no

If more than one contig, and one is a lot smaller than the other, maybe you have plasmids.

Step 3b. Run SPAdes on the short reads with these as trusted contigs:

$ spades.py -o spades.2002_S45 --trusted-contigs GD2002.contigs.fasta -1 ../2002_S45_L001_R1_001.fastq -2 ../2002_S45_L001_R2_001.fastq --cov-cutoff auto

And if you suspect you have a plasmid, run the same thing again but with a new output directory and the --plasmid flag:

$ spades.py -o spades.2002_S45.plasmid --trusted-contigs GD2002.contigs.fasta -1 ../2002_S45_L001_R1_001.fastq -2 ../2002_S45_L001_R2_001.fastq --cov-cutoff auto --plasmid

Step 4a. Run minimap2 piped into samtools, for the short reads against the Spades output (needed for Pilon in Step 5):

$ cd spades

$ minimap2 -ax sr contigs.fasta ../../1042_S33_L001_R1_001.fastq ../../1042_S33_L001_R2_001.fastq | samtools sort -T tmp -O bam > short_reads_to_asm.bam

Step 4b. Sanity check that most reads actually mapped, and index the BAM file:

$ samtools index short_reads_to_asm.bam

$ samtools flagstat short_reads_to_asm.bam

843188 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

5026 + 0 supplementary

0 + 0 duplicates

839119 + 0 mapped (99.52% : N/A)

838162 + 0 paired in sequencing

419081 + 0 read1

419081 + 0 read2

825398 + 0 properly paired (98.48% : N/A)

830940 + 0 with itself and mate mapped

3153 + 0 singletons (0.38% : N/A)

624 + 0 with mate mapped to a different chr

543 + 0 with mate mapped to a different chr (mapQ>=5)

Step 5. Run Pilon to polish the Spades assembly.

$ pilon --genome contigs.fasta --frags short_reads_to_asm.bam

Step 6. Map the Pilon assembly to the canu long reads assembly:

$ minimap2 -a ../*.contigs.fasta pilon.fasta > pilon_vs_trusted.sam

Step 7. Use this little script (rename to trusted_pilon_overlay and make executable on Linux) to overlay the Pilon polished Illumina assembly contigs (higher per base quality) onto the long reads canu assembly (lower per base quality) for the final genome:

$ trusted_pilon_overlay ../*.contigs.fasta pilon_vs_trusted.sam pilon_overlay_on_trusted.fasta

Step 8a: Check that the plasmids are real with BLAST matches to exiting plasmids:

blastn -db nt -query pilon_overlay_on_trusted.fasta -out pilon_overlay_on_trusted.fasta.blastn_nt.txt -remote

Step 8b: Check that you didn't assemble the Illumina phiX spike-in control sequence into your genome! That would be embarrassing (and all too common in GenBank).

$ makeblastdb -dbtype nucl -in phiX.fna

$ blastn -db phiX.fna -query pilon_overlay_on_trusted.fasta -out pilon_overlay_on_trusted.phiX.txt

The result should be "No hits found". Woohoo, you're done!

Step 9 (optional). If you still have multiple contigs, start exploring the assembly graph (the .gfa file in your outputs) visually in Bandage to see what might be the cause. Loading a Unicycler assembly and overlaying the pilon_overlay_on_trusted.fasta file as BLAST matches can be informative too, like below where the bubbles represent multiple nearly identical partially integrated phages in a genome. In the case below, my method builds two contigs, displayed as green and blue BLAST hits on Bandage's 'bold' 22 contigs.

October 23, 2018

Graphing Kallisto RNASeq results: Pretty box plotting genes by experiment factor levels

Assuming you have already run a bunch of sample with Kallisto against a relevant transcript database, and have the outputs in folders called samplename.kallisto, run this simple script to generate the FPKM data at the transcript and gene levels (in this case, human):

$ kallisto_to_fpkm refseq2gene_human

Start R, and first load the experiment metadata:

> meta <- read.table("meta.tab", header=TRUE)

> meta$path <- as.character(meta$path)

> meta

sample path tgfb pfd

1 Library1 Library1.kallisto 0 0

2 Library2 Library2.kallisto 0 0

3 Library3 Library3.kallisto 0 0

4 Library4 Library4.kallisto 1 0

5 Library5 Library5.kallisto 1 0

6 Library6 Library6.kallisto 1 0

7 Library7 Library7.kallisto 0 1

8 Library8 Library8.kallisto 0 1

9 Library9 Library9.kallisto 0 1

10 Library10 Library10.kallisto 1 1

11 Library11 Library11.kallisto 1 1

12 Library12 Library12.kallisto 1 1

You'll see here that I have two treatments (TGFB and PFD), run independently and in combination, as well as wild type. I'm going to manually assign reasonable factor level names (e.g. "wt" for wild type) that will be used on the graph later.

> sample2category <- hashmap(meta$path, c(rep("wt",3), rep("tgfb", 3), rep("pfd", 3), rep("tgfb+pfd", 3)))

> sample2category

## (character) => (character)

## [Library4.kallisto] => [tgfb]

## [Library2.kallisto] => [wt]

## [Library11.kallisto] => [tgfb+pfd]

## [Library12.kallisto] => [tgfb+pfd]

## [Library5.kallisto] => [tgfb]

## [Library3.kallisto] => [wt]

## [...] => [...]

Looks good. Let's load up the gene level FPKM data we generated at the very start:

> gene_fpkm <- read.table("gene_fpkm.txt", header=TRUE, row.names=1)

Suppose we have a subset of genes that are of particular interest, let's load them. It's a simple text file with one gene name per line, in this case, 30 genes.

> cancer <- read.table("cancer.txt", colClasses=c("character"))

Let's just work with the subset of FPKM values from the genes of interest. For the sake of plotting a reasonable vertical axis range, I'm turning the FPKM values into log2(FPKM+1).

> gene_fpkm_cancer <- t(log1p(gene_fpkm[ecm_cancer$V1,], base=2))

What we need to do to generate the boxplots is turn the (gene, sample) matrix into a flatter long table where we have multiple gene -> value instances for each experiment factor combinations. The flattening is easy, using the function melt().

> library(reshape2)

> d <- melt(gene_fpkm_cancer)

> d[1,]

X1 X2 value

1 Library10.kallisto MMP11 4.196717

As you can see from just printing the first row of the melt()ed table, X1 is the library, X2 is the gene name, value if the log transformed FPKM value. Let's add the category labels we generated earlier.

> d$category <- sample2category[[d$X1]]

> d[1,]

X1 X2 value category

1 Library10.kallisto MMP11 4.273322 tgfb+pfd

Nice, now we are ready to plot, with some fancy options to make it pretty.

> ggplot(d)+ \

geom_boxplot(aes(category, y=value)) + \ # boxplot for each category

facet_grid(.~X2) + \ # boxplot pane for each gene

theme(axis.text.x=element_text(angle=90, hjust=1)) + \ # vertical category labels

scale_x_discrete(limits=c("wt","tgfb","pfd","tgfb+pfd")) # reorder category labels

February 8, 2018

Analyzing Bisulfite Treated Genome Data (differential methylation detection)

tl;dr Bismark+DSS+ChIPpeakAnno+BioMart is good way to find differentially methylated regions of genomes that have been bisulphite treated and sequenced, whether you have one or more experimental factors. There are some tricks to getting the annotations exported due to the non-one-to-one nature of diff methyl regions and nearby genes.
___________________

DSS is a good choice if you are trying to ensure very few false positives (i.e. specificity) while still getting decent recall/sensitivity (according to this nice comparison paper).

mkdir /export/common/dbs/bismark/danRer10

cp reference_genome.fa /export/common/dbs/bismark/danRer10

Ensure that the bowtie2 and samtools executables are in your PATH, e.g. install BioBuilds.

bismark_genome_preparation /export/common/dbs/bismark/danRer10

Then do the following for each of your samples to count the bisulphite converted sites:

bismark -p 4 -un --ambiguous -N 1 -L 28 --non_bs_mm /export/common/dbs/bismark/danRer10/ path/to/my_ctl_1.fastq.gz -B my_ctl_1

bismark_methylation_extractor --scaffolds --bedGraph --parallel 5 --comprehensive --single-end --merge_non_CpG my_ctl_1.bam

Install DSS if you don't already have it, by starting R then:

source("http://bioconductor.org/biocLite.R")

biocLite("DSS");

biocLite("org.Dr.eg.db");

Create an experiment metadata file for the sample factors:

library(DSS)
require(bsseq)

meta <- read.table("meta.tab", header=TRUE)

Blog Archive