January 30, 2020

Installing a local copy of NextStrain for nCov19 with Conda


NextStrain is great software for visualizing the relationships between isolates of a virus. It is used for influenza and other viruses, and its developers have configured a build for the novel coronavirus that emerged from Wuhan, China. For most people, it is sufficient to view the public nCov19 instance of NextStrain at https://nextstrain.org/ncov?m=div. For the few who want to quickly check a new local sequence against the existing knowledge base, or at least want the ability to do so, I've debugged the install process a bit to make it easier for you. Not that you'd be keeping your sequence private for any length of time, right? Submit to GISAID at the very least!
_________________________

Open a terminal.

If you don't already have a conda software manager on your Linux box, install one with all the defaults:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh 

bash Miniconda3-latest-Linux-x86_64.sh 

Complete the installation, then log out of your terminal session and back in (or open a new terminal window) so the changes take effect.

Now add conda "channels" (repositories) that might be useful for finding packages we're going to use.

conda config --add channels conda-forge 
conda config --add channels defaults 
conda config --add channels bioconda 

You now have Conda installed, congratulations!  Your life will now be 37.4452% easier. 

Create an environment that will house all the software we're using for NextStrain, then activate it. Some bits of code don't work nicely with the default packages NextStrain tries to install later, and there are some undocumented prerequisites, so let's avoid all those headaches by configuring our conda environment with the magic combination of packages that does work (disclaimer: I've only tested this on CentOS).

conda create -n nextstrain
conda activate nextstrain
conda install nodejs python=3.7 datrie bcbio-gff mafft iqtree

Install the node.js code needed for NextStrain:

npm install --global auspice

Install the python code needed for NextStrain:

pip install nextstrain-augur nextstrain-cli

Make sure it has everything it needs:

nextstrain check-setup --set-default
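
If you'd like an extra sanity check that the command-line tools from the steps above ended up on your PATH, something like the following sketch could help. It complements nextstrain check-setup rather than replacing it, and the function name missing_tools is made up for illustration.

```shell
# missing_tools: print the name of every command that is NOT found on the PATH.
# (Hypothetical helper, not part of NextStrain.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example, with the nextstrain environment active (snakemake is used further down):
#   missing_tools git npm auspice augur nextstrain snakemake mafft iqtree
# No output means every command was found.
```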

Now you need the code and information (sample metadata, colour schemes, etc.) that have been generated for the nCoV19 virus to plug into the generic NextStrain framework. 

git clone https://github.com/nextstrain/ncov.git

Enter the top level directory of the code:


cd ncov

The metadata is updated as new samples are made public, so every few days you'll want to refresh your instance by running this command from inside the top level directory of ncov:

git pull

Now you need a set of sequences for NextStrain to compare to each other. As of this writing there are 13 complete nCov19 genomes in the public GenBank repository. Download the latest FastA file of all complete genomes by clicking download on this page. If you have access to the GISAID data sharing portal, you can download those sequences too and append them to the GenBank data. Save everything in a single file called data/sequences.fasta under the top-level ncov directory.
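
Appending one download to the other can be as simple as concatenating the files. A minimal sketch follows; the input file names are placeholders for whatever you actually downloaded, and combine_fasta is a made-up helper. It goes through awk rather than plain cat so that a file missing its final newline, or one saved with Windows line endings, doesn't corrupt the header line of the next record.

```shell
# combine_fasta: concatenate FastA files into one (hypothetical helper).
combine_fasta() {
    # usage: combine_fasta OUTPUT.fasta INPUT1.fasta [INPUT2.fasta ...]
    out="$1"
    shift
    # Strip any trailing carriage return, then print every line with a newline.
    awk '{ sub(/\r$/, ""); print }' "$@" > "$out"
}

# Example (hypothetical input names), run from the top-level ncov directory:
#   combine_fasta data/sequences.fasta genbank.fasta gisaid.fasta
```

The output file should not also be listed as one of the inputs, since the shell truncates it before awk reads anything.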

The critical step at this stage is to make sure each FastA record's ID (everything between ">" and the first space or newline in the description line) matches the name NextStrain has given to that sequence in the file data/metadata.tsv. This is left as an exercise for the reader, because the sequence data you have access to will keep changing. Often you will need to change a '/' to a '-', remove the "BetaCov/" ID prefix, remove the '_EPI####' suffix from GISAID downloads, etc.
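
As a starting point for that exercise, here is a minimal sketch. normalize_ids applies two of the example rewrites mentioned above (dropping a "BetaCoV/" prefix and an "_EPI..." suffix from the ID), and missing_ids lists the FastA IDs that still have no match in the metadata. Both function names are invented for illustration, the rules will need adjusting to your actual headers, and the sketch assumes the sequence name is the first column of data/metadata.tsv.

```shell
# normalize_ids: rewrite the ID on each FastA header line; sequence lines pass through.
# The two rules are only examples; edit them to suit your own downloads.
normalize_ids() {
    # usage: normalize_ids INPUT.fasta > OUTPUT.fasta
    sed -E '/^>/ { s#^>BetaCo[Vv]/#>#; s/^(>[^[:space:]]*)_EPI[^[:space:]]*/\1/; }' "$1"
}

# missing_ids: print every FastA ID that is absent from column 1 of the metadata.
# Assumption: column 1 holds the sequence name and row 1 is a header.
missing_ids() {
    # usage: missing_ids SEQUENCES.fasta METADATA.tsv
    awk -F'\t' '
        NR == FNR { if (FNR > 1) known[$1] = 1; next }    # first file: the metadata
        /^>/ {
            id = substr($0, 2)            # drop the leading ">"
            sub(/[ \t\r].*/, "", id)      # keep only the text before any whitespace
            if (!(id in known)) print id
        }
    ' "$2" "$1"
}

# Example (raw.fasta is a hypothetical name), run from the top-level ncov directory:
#   normalize_ids raw.fasta > data/sequences.fasta
#   missing_ids data/sequences.fasta data/metadata.tsv
```

No output from missing_ids means every record has a matching metadata row.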

If you have a new sequence of your own to add, append it to the data/sequences.fasta file and add a line to data/metadata.tsv with the relevant information. Note that this will prevent updates with git pull in the future, as git does not want to clobber your changes. Tip: Save the extra lines you wrote in another file for safekeeping, so you can delete data/metadata.tsv and auspice/ncov.json, then successfully run git pull for code updates.
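
One way to act on that tip, sketched under assumptions: keep your hand-written metadata rows in a separate file (my_metadata_rows.tsv below is a made-up name, and the file has no header line), then re-append them after each refresh. The helper reapply_rows is hypothetical; it skips any row whose first column already appears in the metadata, so running it twice does no harm.

```shell
# reapply_rows: append saved metadata rows that are not already present.
# Assumptions: tab-separated files, sequence name in column 1, and a
# METADATA.tsv that ends with a newline.
reapply_rows() {
    # usage: reapply_rows EXTRA_ROWS.tsv METADATA.tsv
    # Collect the non-empty extra rows whose name is not yet in METADATA.tsv ...
    new_rows=$(awk -F'\t' 'NR == FNR { seen[$1] = 1; next } NF && !($1 in seen)' "$2" "$1")
    # ... and append them, if there are any.
    [ -z "$new_rows" ] || printf '%s\n' "$new_rows" >> "$2"
}

# Example, after a successful git pull in the top-level ncov directory:
#   reapply_rows my_metadata_rows.tsv data/metadata.tsv
```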

To calculate the relationships between your sequences and generate the Web portal information, run the following (in the top level ncov directory):

snakemake -p
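
Before launching the viewer, it can be worth confirming that the build actually wrote a dataset, since the viewer looks for its datasets in the auspice directory. A minimal sketch, assuming datasets are the .json files directly under that directory (has_datasets is a made-up helper name):

```shell
# has_datasets: succeed if the given directory contains at least one .json file.
has_datasets() {
    # usage: has_datasets DIR
    for f in "$1"/*.json; do
        # If nothing matches, the pattern stays literal and this test fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

# Example, run from the top-level ncov directory:
#   has_datasets auspice || echo "No .json files in auspice/ yet; check the snakemake output."
```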

To access the Web interface, launch the portal server from that same directory:

nextstrain view auspice

——————————————————————————————————————————————————————————————————————————————
    Open <http://127.0.0.1:4000/> in your browser.

    Warning: No datasets detected.
——————————————————————————————————————————————————————————————————————————————

[verbose] Serving index / favicon etc from  "/gpfs/home/gordonp/.conda/envs/nextstrain/lib/node_modules/auspice"
[verbose] Serving built javascript from     "/gpfs/home/gordonp/.conda/envs/nextstrain/lib/node_modules/auspice/dist"

---------------------------------------------------
Auspice server now running at http://127.0.0.1:4000
Serving auspice version 2.3.5
Looking for datasets in /gpfs/home/gordonp/ncov/auspice
Looking for narratives in /gpfs/home/gordonp/ncov/auspice
---------------------------------------------------

Then open your Web browser to the http address that command spits out. For example, here I'm looking at the chart with Branch length set to "Divergence". I show you this because I find the (default) Time Branch length topology misleading: with so many samples having no mutations, it generates some semi-random subtrees that seem to depend on the order of the no-mutation records in data/sequences.fasta: