Metagenomic read assignment to taxonomy is not super accurate. I discuss a few ways to evaluate whether a signal is either a true or false positive in a human RNAseq experiment context.
___________
I had an RNASeq project that was quite resistant to yielding significant results, based on a patient medical factor and experimental exposure to a cold virus (a clinically approved Rhinovirus A strain). I suspected that the binary (yes/no) exposure factor wasn't useful in my Sleuth gene expression model because the virus wasn't spreading the same way, or maybe at all, in some patients and sampled tissues.
Because single stranded RNA rhinoviruses have polyadenylated genomes/transcripts, our lab's regular polydT bead extraction protocol should theoretically pull these out with the normally targeted human polyA-tailed transcripts. I adapted some of my non-reference Kallisto+Sleuth analysis steps to look for the virus in my data (Lane1 subset) like so, assuming you have your sample names in samplenames.txt (important: in the same order as the meta.tab file used later):
for f in `cat samplenames.txt`; do kallisto quant -i /export/common/dbs/kallisto/hg19.refMrna -o $f.kallisto -b 100 --single -l 180 -s 20 <(gzip -cd myrunid/Data/Intensities/BaseCalls/$f/*L001*.fastq.gz) & done
Note that I launched these in parallel in the background. If you don't have as many cores as you have samples, replace '&' with ';' before "done" to run in series, or use GNU parallel, etc.. Wait for these jobs to finish, then as a quick sanity check to see if the samples all map decently (typically 80%+ map to the RefSeq human transcriptome)...
for f in `cat samplenames.txt`; do kallisto quant -i /export/common/dbs/kallisto/hg19.refMrna -o $f.kallisto -b 100 --single -l 180 -s 20 <(gzip -cd myrunid/Data/Intensities/BaseCalls/$f/*L001*.fastq.gz) & done
Note that I launched these in parallel in the background. If you don't have as many cores as you have samples, replace '&' with ';' before "done" to run in series, or use GNU parallel, etc.. Wait for these jobs to finish, then as a quick sanity check to see if the samples all map decently (typically 80%+ map to the RefSeq human transcriptome)...
for f in `cat samplenames.txt`; do perl -ane '$t+=$F[3];END{$ARGV=~s/abundance.tsv/run_info.json/;print $ARGV,"\t", $t/$1,"\n" if `cat $ARGV`=~/n_processed": (\d+)/s;}' $f.kallisto/abundance.tsv; done
N1pre.kallisto/run_info.json 0.8336
N1time1.kallisto/run_info.json 0.7829
...
Maybe exclude any really bad ones (e.g. 40% mapping a.k.a. < 0.4) from further analysis for now and check what the input quantity and quality (RIN score) was. Then run the Diamond program in BLASTX mode:
for f in `cat samplenames.txt`; do /export/common/programs/diamond-0.7.9 blastx -d /export/common/dbs/diamond/nr -a $f.diamond_nr -q myrunid/Data/Intensities/BaseCalls/ProjectName/$f/*_L001*.fastq.gz -p 1 & done
Same waiting caveat as above, then...
for f in `cat samplenames.txt`; do diamond view -a $f.nr.daa -o $f.nr.diamond.tab -f tab ; done
for f in `cat samplenames.txt`; do /export/common/programs/megan5/tools/blast2lca -v -i $f.nr.diamond.tab -o $f.nr.diamond.tax -g2t /export/common/dbs/megan5/gi_taxid-March2015X.bin; done
In the 60 samples, there were 157 million sequences aligned at the protein level. I aligned these against all 75 million known protein sequences in less than a day on a single machine. Let that sink in and realize how awesome Diamond is. Note that you could use Kraken to do this analysis instead at the DNA level, please let me know how you fair.
Let's first do a sanity check and see how often species are identified at 100% (no matches outside the species lowest common ancestor for all hits) by MEGAN5, which should have the lowest false positive rate for such analyses:
echo viral_load > virus_count.txt
for f in *.tax; do echo -n "$f "; grep -i "Rhinovirus A; 100;" $f | wc -l >> virus_count.txt; done
paste meta.tab virus_count.txt > meta_with_viral_load.txt
Maybe exclude any really bad ones (e.g. 40% mapping a.k.a. < 0.4) from further analysis for now and check what the input quantity and quality (RIN score) was. Then run the Diamond program in BLASTX mode:
for f in `cat samplenames.txt`; do /export/common/programs/diamond-0.7.9 blastx -d /export/common/dbs/diamond/nr -a $f.diamond_nr -q myrunid/Data/Intensities/BaseCalls/ProjectName/$f/*_L001*.fastq.gz -p 1 & done
Same waiting caveat as above, then...
for f in `cat samplenames.txt`; do diamond view -a $f.nr.daa -o $f.nr.diamond.tab -f tab ; done
for f in `cat samplenames.txt`; do /export/common/programs/megan5/tools/blast2lca -v -i $f.nr.diamond.tab -o $f.nr.diamond.tax -g2t /export/common/dbs/megan5/gi_taxid-March2015X.bin; done
In the 60 samples, there were 157 million sequences aligned at the protein level. I aligned these against all 75 million known protein sequences in less than a day on a single machine. Let that sink in and realize how awesome Diamond is. Note that you could use Kraken to do this analysis instead at the DNA level, please let me know how you fair.
perl -ne '$c{$1}++ if /\[Species\] (.*?); 100/;END{for(keys %c){print "$c{$_}\t$_\n"}}' *.tax | sort -n -k 1
1 Abacion magnum
1 Abida secale
1 Acalymma trivittatum
1 Acanthamoeba culbertsoni
1 Acanthascus dawsoni
1 Acanthis flammea
1 Acanthocheilonema odendhali
1 Acanthochromis polyacanthus
1 Acanthocyclops vernalis
1 Acanthosaura crucigera
...
40808 Camelus ferus
45634 Rhinovirus A
47346 Saimiri boliviensis
49148 Rattus norvegicus
58556 Tupaia chinensis
62884 Cricetulus griseus
70044 Callithrix jacchus
74475 Papio anubis
88769 Mus musculus
98365 Ostrea edulis
101606 Chlorocebus sabaeus
119838 Pan paniscus
132084 Rhinopithecus roxellana
143348 Chlamydia psittaci
176297 Pongo abelii
278318 Pan troglodytes
333650 Macaca fascicularis
378764 Gorilla gorilla
472266 Macaca mulatta
889709 synthetic construct
15692650 Homo sapiens
There are 19978079 sequences with 100% at the species level, so we have a 78.5% true positive rate if we assume nearly all the actual sample is of human origin. A little Linnean naming knowledge quickly shows that the most frequent hits besides human are to constructs or primates like macaques, gorillas, and orangutans (17.7%). These are often due to high protein-level sequence similarity and random sequencing errors colliding. The same goes for other mammals, but to a lesser extent.
1 Abacion magnum
1 Abida secale
1 Acalymma trivittatum
1 Acanthamoeba culbertsoni
1 Acanthascus dawsoni
1 Acanthis flammea
1 Acanthocheilonema odendhali
1 Acanthochromis polyacanthus
1 Acanthocyclops vernalis
1 Acanthosaura crucigera
...
40808 Camelus ferus
45634 Rhinovirus A
47346 Saimiri boliviensis
49148 Rattus norvegicus
58556 Tupaia chinensis
62884 Cricetulus griseus
70044 Callithrix jacchus
74475 Papio anubis
88769 Mus musculus
98365 Ostrea edulis
101606 Chlorocebus sabaeus
119838 Pan paniscus
132084 Rhinopithecus roxellana
143348 Chlamydia psittaci
176297 Pongo abelii
278318 Pan troglodytes
333650 Macaca fascicularis
378764 Gorilla gorilla
472266 Macaca mulatta
889709 synthetic construct
15692650 Homo sapiens
There are 19978079 sequences with 100% at the species level, so we have a 78.5% true positive rate if we assume nearly all the actual sample is of human origin. A little Linnean naming knowledge quickly shows that the most frequent hits besides human are to constructs or primates like macaques, gorillas, and orangutans (17.7%). These are often due to high protein-level sequence similarity and random sequencing errors colliding. The same goes for other mammals, but to a lesser extent.
The Rhinovirus A count of 45634 stands out though. We should expect this to be picked up as an artifact much less frequently than a New World monkey based on evolutionary distance...even if we assume convergent evolution with the human host at the AA level the fact is that rhinovirus C has only 3 hits. Rhinovirus A has 4,327 Genbank protein entries and rhinovirus C has 2580, so database bias can't account for the 4 order of magnitude discrepancy. Rhinovirus C is actually completely in line with the mean of 4.2 observations for species assignments under a Poisson distribution for this dataset, i.e. it's part of the (low count) random false positive observations. That's 4.2/60 samples = 0.07 false observations/sample for any given species to be observed at least once with 100% confidence. I conclude that the rhinovirus A detection isn't likely to be artifactual when seen in control samples.
We can now add that to our default sample metadata file and model the viral load effect (log probably makes sense) as part of the expression regression model.
for f in *.tax; do echo -n "$f "; grep -i "Rhinovirus A; 100;" $f | wc -l >> virus_count.txt; done
paste meta.tab virus_count.txt > meta_with_viral_load.txt
Be sure to add 1 to the load so you aren't taking the log of zero, which will cause R to choke.
> library(sleuth)
> meta_with_viral_load <- read.table("meta_with_viral_load.txt");
> meta_with_viral_load
sample path patient_group_factor time viral_load
1 N1pre N1pre.kallisto no 0 0
2 N1time1 N1time1.kallisto no 1 29
> library(sleuth)
> meta_with_viral_load <- read.table("meta_with_viral_load.txt");
> meta_with_viral_load
sample path patient_group_factor time viral_load
1 N1pre N1pre.kallisto no 0 0
2 N1time1 N1time1.kallisto no 1 29
3 N1time2 N1time2.kallisto no 2 1187
4 N2pre N2pre.kallisto no 0 6
5 N2time1 N2time1.kallisto no 1 29
6 N2time2 N1time2.kallisto no 2 17
...
16 S1pre S1pre.kallisto yes 0 0
17 S1time1 S1time1.kallisto yes 1 2006
18 S1time2 S1time2.kallisto yes 2 7292
...
> so <- sleuth_prep(meta_with_viral_load, ~patient_group_factor*time*I(log(viral_load+1)))
...
Note that this sequence was generated on an Illumina NextSeq 500, which does not have the "spread-of-signal" problem to the same degree that has recently been characterized for patterned flow cell Illumina devices. If you are using a HiSeq 4000 or similar, you'll be needing to do extra cleanup steps to remove free primers during library prep before putting samples onto the machine (I assume that Illumina is working fervently to fix the protocol right now). A significant amount (~5-10%) of post-infection sample data might have pre-infection samples barcodes, making the kind of analysis done here moot.
6 N2time2 N1time2.kallisto no 2 17
...
16 S1pre S1pre.kallisto yes 0 0
17 S1time1 S1time1.kallisto yes 1 2006
18 S1time2 S1time2.kallisto yes 2 7292
...
> so <- sleuth_prep(meta_with_viral_load, ~patient_group_factor*time*I(log(viral_load+1)))
...
Note that this sequence was generated on an Illumina NextSeq 500, which does not have the "spread-of-signal" problem to the same degree that has recently been characterized for patterned flow cell Illumina devices. If you are using a HiSeq 4000 or similar, you'll be needing to do extra cleanup steps to remove free primers during library prep before putting samples onto the machine (I assume that Illumina is working fervently to fix the protocol right now). A significant amount (~5-10%) of post-infection sample data might have pre-infection samples barcodes, making the kind of analysis done here moot.
Spread-of-signal exists for the NextSeq 500 and MiSeq, probably more in the 1% range so judge your detection threshold accordingly. In the experiment described here, one of the patients appears to have caught a slightly different rhinovirus A strain before the experimental inoculation and was already in immunity overdrive (i.e. N2pre has a viral load, and inoculation didn't cause a big jump in load for time 1 and 2). I didn't bore you with all the strain level information, but trust me, it's there. If the N2pre load was artifactual by some spread-of-signal we'd expect the Poisson distance matrix to still look decent, but in fact this sample stands out, clustering with the post-infection samples. I therefore excluded it from the analysis, not by cherry picking but by metagenomic/metatranscriptomic (ssRNA viral genome) evidence that make the sample unsuitable for the study.
The sample exclusion and the quantitative modelling increased the number of significantly differentially expressed genes from a handful to hundreds, using data freely available but unmapped by Kallisto in a normal human gene expression analysis.