CHGI Bioinformatics: The Mothur of all eukaryotes

Microbial small subunit rRNA analysis (16S) has been a commonplace way to profile microbial communities super cheaply per sample for a number of years now, and there seems to be an Illumina MiSeq lurking in every corner of every university these days. Around here we're seeing an uptick in its use for eukaryotic profiling (18S): fun stuff like dietary analysis of wild animal feces, or sequencing river water to get fish diversity information.

I'm not so silly as to get involved in the Qiime vs. Mothur methodology debate, but let's suppose you've settled on Mothur. Mothur's kind of nice insofar as it's a single binary (Qiime is typically installed as a virtual machine), and the recommended Silva reference files can be downloaded in a bundle from the Mothur Web site. There are two issues with using the Mothur-provided reference files for eukaryotic analysis: 1) the taxonomy for eukaryotes is truncated the same way as for prokaryotes (6 levels) even though the eukaryotic tree is deeper, so quite often you end up with only an order-, class- or family-level designation, and 2) the prebuilt DB is Silva 123 from July 2015. Silva is at version 128 as of this writing, with 577,832 quality entries vs. 526,361 in version 123, and 18,213 Eukarya represented vs. 16,209. Here's how to address both the eukaryotic labelling and latest-version issues:

Download and unpack (as newer versions come out, adjust the URLs accordingly):

wget -O SSURef_NR99_latest_opt.arb.gz https://www.arb-silva.de/fileadmin/silva_databases/current/ARB_files/SSURef_128_SILVA_23_09_16_opt.arb.gz

gzip -d SSURef_NR99_latest_opt.arb.gz

wget -O tax_slv_ssu_latest.txt https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_ssu_128.txt

perl -pe 's/ \(Animalia\)//;s/ /_/g' tax_slv_ssu_latest.txt > tax_slv_ssu_latest.fixed.txt

Export the FastA files:

Launch arb:

arb SSURef_NR99_latest_opt.arb

Export as per http://blog.mothur.org/2015/12/03/SILVA-v123-reference-files/, saving the result as silva.full_latest.fasta, then quit arb.
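Before moving on, it's worth a quick check that the export looks sane. The sketch below counts the sequences and reports how many distinct sequence lengths the file contains. I'm assuming here that each aligned sequence sits on a single line, in which case an aligned export should report exactly one length. The function names are mine.

```shell
# Count FASTA records (lines starting with ">").
count_fasta_seqs() {
  grep -c '^>' "$1"
}

# Count distinct sequence-line lengths; assumes one line per sequence.
count_seq_widths() {
  grep -v '^>' "$1" | awk '{ print length($0) }' | sort -u | wc -l | tr -d ' '
}

# Usage, once the export exists:
#   count_fasta_seqs silva.full_latest.fasta
#   count_seq_widths silva.full_latest.fasta   # an aligned file should give 1
```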

Note that I compiled ARB from source, since I'm on CentOS 7 and they only have precompiled binaries for earlier operating systems. If you're in the same boat, you'll need to install the following packages as root (or via sudo); not all of them are documented in the installation instructions.

yum install libxml2 transfig libXp libtiff gnuplot-common xorg-x11-xbitmaps Xaw3d xorg-x11-fonts-misc xfig xfig-common motif gnuplot libtiff-devel libxml2-devel libxml2-python lynx glib2-devel imake libXmu-devel libXp-devel motif-devel

Format the sequences:

Here we deviate a bit from the Mothur README for simplicity, and to get the right taxonomic labels for eukaryotes.

mothur "#screen.seqs(fasta=silva.full_latest.fasta, start=1044, end=43116, maxambig=5, processors=8); pcr.seqs(start=1044, end=43116, keepdots=T); degap.seqs(); unique.seqs();"

grep ">" silva.full_latest.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- > silva.full_latest.good.pcr.ng.unique.accnos

mothur "#get.seqs(fasta=silva.full_latest.good.pcr.fasta, accnos=silva.full_latest.good.pcr.ng.unique.accnos)"

mv silva.full_latest.good.pcr.pick.fasta silva.full_latest.align
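In other words, uniqueness is decided on the degapped sequences, and the accession list is then used to pull the still-aligned versions back out. The grep/cut step just grabs the accession from each header. Here's a small sketch of that step on its own, assuming the headers are tab-delimited with the accession first (which is what the cut -f 1 implies). The function name is mine.

```shell
# Print one accession per line from a FASTA file whose headers look like
# ">ACCESSION<TAB>other fields" (the layout the pipeline above relies on).
# fasta_to_accnos is a hypothetical helper name.
fasta_to_accnos() {
  grep '>' "$1" | cut -f 1 | cut -c 2-
}

# Usage:
#   fasta_to_accnos silva.full_latest.good.pcr.ng.unique.fasta \
#     > silva.full_latest.good.pcr.ng.unique.accnos
```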

Run my eukaryotic labelling adjustment script (modified from https://raw.githubusercontent.com/rec3141/diversity-scripts/master/convert_silva_taxonomy.r, since that one didn't work for me and wasn't parameterized):

Rscript convert_silva_taxonomy.r tax_slv_ssu_latest.fixed.txt silva.full_latest.align silva.full_latest.tax
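A cheap check once the R script finishes is to confirm that every sequence in the alignment got a taxonomy line, and vice versa. This sketch assumes the Mothur-style layout where the taxonomy file is "ID<TAB>taxonomy" and the alignment headers start with ">ID" followed by a tab. The function name is mine.

```shell
# List IDs that appear in only one of the two files.
# Unindented lines: in the alignment only; tab-indented lines: in the taxonomy only.
# Prints nothing when the two ID sets are identical.
diff_align_tax_ids() {
  align_ids=$(mktemp)
  tax_ids=$(mktemp)
  grep '^>' "$1" | cut -f 1 | cut -c 2- | sort -u > "$align_ids"
  cut -f 1 "$2" | sort -u > "$tax_ids"
  comm -3 "$align_ids" "$tax_ids"
  rm -f "$align_ids" "$tax_ids"
}

# Usage:
#   diff_align_tax_ids silva.full_latest.align silva.full_latest.tax
```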

Now you're good to go for any type of SSU analysis (16S or 18S), and can follow something like the ever-popular MiSeq SOP.

If the above instructions failed for you, download my SILVA 128 tax file here, along with the fasta and align files.

**Update: there seems to be a problem with the taxonomic classification of 4 Ralstonia sequences in the current SILVA release: they have only two levels of classification. You'll need to fix those manually in the output taxonomy file to get it to work properly.
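To track down entries like those, something along these lines should list any taxonomy line with fewer levels than you expect. It's a rough sketch that assumes the output file uses the usual Mothur "ID<TAB>level;level;...;" layout. The function name and the threshold of 3 in the usage line are just for illustration.

```shell
# Print taxonomy lines whose lineage has fewer than MIN_LEVELS ranks.
# Assumes "ID<TAB>rank1;rank2;...;" lines; find_shallow_tax is a hypothetical name.
find_shallow_tax() {
  awk -F '\t' -v min="$2" '{
    n = split($2, ranks, ";")
    if (ranks[n] == "") n--      # ignore the empty piece after a trailing ";"
    if (n < min) print
  }' "$1"
}

# Usage: list entries with fewer than 3 levels
#   find_shallow_tax silva.full_latest.tax 3
```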


October 13, 2016
