October 23, 2018

Graphing Kallisto RNASeq results: Pretty box plotting genes by experiment factor levels

Assuming you have already run a bunch of samples with Kallisto against a relevant transcript database, and have the outputs in folders called samplename.kallisto, run this simple script to generate the FPKM data at the transcript and gene levels (in this case, human):

$ kallisto_to_fpkm refseq2gene_human
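I won't reproduce kallisto_to_fpkm here, but the calculation it performs on Kallisto's abundance.tsv columns (est_counts and eff_length) boils down to the standard FPKM formula. A minimal Python sketch of that formula, under the assumption that kallisto_to_fpkm does nothing fancier:

```python
def fpkm(est_counts, eff_length, total_mapped):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    est_counts:   Kallisto's estimated fragment count for the transcript
    eff_length:   Kallisto's effective transcript length (bases)
    total_mapped: sum of est_counts over all transcripts in the sample
    """
    return est_counts / (eff_length / 1000.0) / (total_mapped / 1e6)

# e.g. 100 fragments on a 1kb transcript in a 1M-fragment library
print(fpkm(100, 1000, 1e6))  # 100.0
```

Gene-level FPKM is then just this value summed over each gene's transcripts (via the refseq2gene mapping).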

Start R, and first load the experiment metadata:

> meta <- read.table("meta.tab", header=TRUE)

> meta$path <- as.character(meta$path)
> meta
      sample               path tgfb pfd
1   Library1  Library1.kallisto    0   0
2   Library2  Library2.kallisto    0   0
3   Library3  Library3.kallisto    0   0
4   Library4  Library4.kallisto    1   0
5   Library5  Library5.kallisto    1   0
6   Library6  Library6.kallisto    1   0
7   Library7  Library7.kallisto    0   1
8   Library8  Library8.kallisto    0   1
9   Library9  Library9.kallisto    0   1
10 Library10 Library10.kallisto    1   1
11 Library11 Library11.kallisto    1   1
12 Library12 Library12.kallisto    1   1

You'll see here that I have two treatments (TGFB and PFD), run independently and in combination, as well as wild type.  I'm going to manually assign reasonable factor level names (e.g. "wt" for wild type) that will be used on the graph later.

> library(hashmap)
> sample2category <- hashmap(meta$path, c(rep("wt",3), rep("tgfb", 3), rep("pfd", 3), rep("tgfb+pfd", 3)))

> sample2category
##          (character) => (character)
##  [Library4.kallisto] => [tgfb]     
##  [Library2.kallisto] => [wt]       
## [Library11.kallisto] => [tgfb+pfd] 
## [Library12.kallisto] => [tgfb+pfd] 
##  [Library5.kallisto] => [tgfb]     
##  [Library3.kallisto] => [wt]       
##                [...] => [...]      

Looks good.  Let's load up the gene level FPKM data we generated at the very start:

> gene_fpkm <- read.table("gene_fpkm.txt", header=TRUE, row.names=1)

Suppose we have a subset of genes that are of particular interest; let's load them. It's a simple text file with one gene name per line, 30 genes in this case.

> cancer <- read.table("cancer.txt", colClasses=c("character"))

Let's just work with the subset of FPKM values from the genes of interest.  For the sake of plotting a reasonable vertical axis range, I'm turning the FPKM values into log2(FPKM+1).

> gene_fpkm_cancer <- t(log2(gene_fpkm[cancer$V1,] + 1))

What we need to do to generate the boxplots is turn the (gene, sample) matrix into a longer, flatter table where we have multiple gene -> value instances for each experiment factor combination. The flattening is easy, using the function melt().

> library(reshape2)
> d <- melt(gene_fpkm_cancer)

> d[1,]

                  X1    X2    value
1 Library10.kallisto MMP11 4.196717

As you can see from just printing the first row of the melt()ed table, X1 is the library, X2 is the gene name, and value is the log-transformed FPKM value. Let's add the category labels we generated earlier.

> d$category <- sample2category[[d$X1]]
> d[1,]

                  X1    X2    value category
1 Library10.kallisto MMP11 4.273322 tgfb+pfd

Nice, now we are ready to plot, with some fancy options to make it pretty.

> library(ggplot2)
> ggplot(d) +
  geom_boxplot(aes(category, y=value)) +                 # boxplot for each category
  facet_grid(.~X2) +                                     # boxplot panel for each gene
  theme(axis.text.x=element_text(angle=90, hjust=1)) +   # vertical category labels
  scale_x_discrete(limits=c("wt","tgfb","pfd","tgfb+pfd")) # reorder category labels

February 8, 2018

Analyzing Bisulfite Treated Genome Data (differential methylation detection)

tl;dr Bismark+DSS+ChIPpeakAnno+BioMart is a good way to find differentially methylated regions of genomes that have been bisulphite treated and sequenced, whether you have one or more experimental factors. There are some tricks to getting the annotations exported due to the non-one-to-one nature of diff methyl regions and nearby genes.

DSS is a good choice if you are trying to ensure very few false positives (i.e. high specificity) while still getting decent recall/sensitivity (according to this nice comparison paper).

mkdir /export/common/dbs/bismark/danRer10

cp reference_genome.fa /export/common/dbs/bismark/danRer10

Ensure that the bowtie2 and samtools executables are in your PATH, e.g. install BioBuilds

bismark_genome_preparation /export/common/dbs/bismark/danRer10

Then do the following for each of your samples to count the bisulphite converted sites:

bismark -p 4 -un --ambiguous -N 1 -L 28 --non_bs_mm /export/common/dbs/bismark/danRer10/ path/to/my_ctl_1.fastq.gz -B my_ctl_1

bismark_methylation_extractor --scaffolds --bedGraph --parallel 5 --comprehensive --single-end --merge_non_CpG my_ctl_1.bam

Install DSS if you don't already have it, by starting R then:

source("https://bioconductor.org/biocLite.R")
biocLite("DSS")
library(DSS)
Create an experiment metadata file for the sample factors:


meta <- read.table("meta.tab", header=TRUE)

      sample site sex
1  CUSH_27_M CUSH   M
2  CUSH_30_F CUSH   F
3  CUSH_35_F CUSH   F
4  CUSH_37_M CUSH   M
5  CUSH_38_F CUSH   F
6  CUSH_39_M CUSH   M
7  GLEN_60_M GLEN   M
8  GLEN_64_M GLEN   M
9  GLEN_65_F GLEN   F
10 GLEN_69_M GLEN   M
11 GLEN_72_F GLEN   F
12 GLEN_76_F GLEN   F

Transform the Bismark output into the format that DSS requires. With replicate whole genome bisulphite data, this can take a while and use a lot of memory (20GB+).

data <- lapply(meta$sample, function(x){
    t <- read.table(paste0(x, ".bismark.cov"),
                    colClasses=c("character", "integer", "NULL", "NULL", "integer", "integer"))
    colnames(t) <- c("chr", "pos", "X", "N")  # X = methylated count
    t[,"N"] <- t[,"N"] + t[,"X"]              # N becomes total coverage, as DSS expects
    t
})

BSobj <- makeBSseqData(data, meta$sample)

I've got multiple factors in this experiment, so I'll use the more complex fit function in DSS to find loci.

DMLfit <- DMLfit.multiFactor(BSobj, design=meta, formula=~site*sex)

DMLtest.site <- DMLtest.multiFactor(DMLfit, coef=2)

Now merge those loci into regions with more powerful statistics:

DMRtest.site <- callDMR(DMLtest.site, p.threshold=0.05)

ix <- sort(DMLtest.site[,"pvals"], index.return=TRUE)$ix

Inspecting these sorted results I see that the effect of sex is very minimal, so I can go ahead and just consider the site factor in a dichotomous model that includes smoothing for better dispersion estimation in the underlying beta-binomial model of DSS.

dmlTest.site <- DMLtest(BSobj, group1=as.character(meta$sample[1:6]), group2=as.character(meta$sample[7:12]), smoothing=TRUE)

Grab our site-dependent peak loci and peak regions:
dmls <- callDML(dmlTest.site, p.threshold=0.001)
dmrs <- callDMR(dmlTest.site, p.threshold=0.01)

Visually have a look at the most significant differentially methylated region, for sanity's sake:

showOneDMR(dmrs[1,], BSobj)

Then write them to a file:

write.csv(dmrs, "dds_dmr_site_only_fdr_lt_0.01.csv")

Annotate peaks with information from nearby genes:

library(ChIPpeakAnno)
library(biomaRt)
Unfortunately for me, danRer10 isn't a default in this module, so we need to roll our own data source from BioMart:


mart <- useMart(biomart="ensembl", dataset="drerio_gene_ensembl")

danRer10_anno <- getAnnotation(mart, featureType="TSS")

dmr_gRanges <- GRanges(seqnames=Rle(dmrs$chr), ranges=IRanges(start=dmrs$start, end=dmrs$end))

anno <- annotatePeakInBatch(dmr_gRanges, AnnotationData=danRer10_anno)

G_list <- getBM(filters="ensembl_gene_id", attributes=c("ensembl_gene_id", "entrezgene", "description"), values=anno$feature, mart=mart)

Now you have three tables with different numbers of rows, since some DMRs map to more than one feature and not all features have annotations. Blergh. Hash tables to the rescue!

library(hash)

e2g <- hash(G_list$ensembl_gene_id, G_list$entrezgene)

e2d <- hash(G_list$ensembl_gene_id, G_list$description)

p2dmr <- hash(1:dim(dmrs)[1], lapply(1:dim(dmrs)[1], function(x){as.vector(c(dmrs$nCG[x], dmrs$meanMethy1[x], dmrs$meanMethy2[x], dmrs$diff.Methy[x], dmrs$areaStat[x]))}))

good_annos <- anno[which(!is.na(anno$feature)),]

h2v <- function(hsh, ks){sapply(ks, function(x){values(hsh[x])})}

We also need to get rid of the leading zeros that ChIPpeakAnno annoyingly adds to the peak numbers, hence the as.character(as.numeric(...)) round trip below.

df <- data.frame(good_annos, h2v(e2g, good_annos$feature), h2v(e2d, good_annos$feature), t(h2v(p2dmr, as.character(as.numeric(good_annos$peak)))))

write.csv(df, "dss_site_only_dmr_fdr_lt_0.01_gene_annotation.csv")

Nice! Now we have a list of differentially methylated regions and the transcription start sites that they are closest to. You might then load this list into Ingenuity Pathway Analysis, DAVID or MatInspector to look for higher level biological patterns linking your genes.

February 1, 2018

Are you losing important transcripts in your Kallisto/Sleuth RNASeq analysis?

tl;dr The default transcript filter function parameters in Sleuth are suitable for a single factor, two level contrast RNASeq experiment. If you are running a two-factor experiment (e.g. knock out vs. wild type, plus control vs. treatment), or an experiment with multiple factor levels (e.g. time series), you should probably use a filter function such as the one described below. You will retain more true positive differentially expressed genes, without generating too many new false positives.


I've been a heavy user of Kallisto and Sleuth for RNASeq analysis for some time, and was used to seeing output similar to the following when loading up a dataset:

> so <- sleuth_prep(meta, ~ condition+cell_line+condition:cell_line)
reading in kallisto results
normalizing est_counts
26036 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps

I hadn't given much consideration to how the "filter" statistic was generated, until I had a 5 time point series experiment where we had a priori knowledge of the activation of a transcript only at the last two timepoints. This transcript did not show up in the Sleuth analysis with any p-value, let alone a significant one. A few days later in a two-factor experiment (growth condition and cell line), there were also some missing known transcripts.  

The default filtering function in Sleuth (called basic_filter) requires at least 5 mapped reads to a transcript in at least 47% of the samples. This reduces spurious identification of differential expression in near-zero abundance transcripts, but retains genes that are moderately but consistently expressed in one of two factor levels (e.g. expressed-in-control-only transcripts, or expressed-in-treatment-only transcripts).
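The gist of basic_filter, sketched in Python for clarity (the real function is R code inside Sleuth; this just mirrors its keep-if-mean(row >= min_reads) >= min_prop logic):

```python
def basic_filter(row, min_reads=5, min_prop=0.47):
    # keep the transcript if at least min_prop of samples have >= min_reads
    return sum(r >= min_reads for r in row) / len(row) >= min_prop

# expressed in all 6 control samples of a 12-sample experiment: kept (6/12 = 0.5 >= 0.47)
print(basic_filter([9, 12, 8, 10, 7, 11, 0, 0, 1, 0, 2, 0]))   # True
# expressed only in a 3-sample interaction group: dropped (3/12 = 0.25 < 0.47)
print(basic_filter([0, 0, 1, 0, 2, 0, 0, 0, 0, 15, 22, 18]))   # False
```

The second case is exactly the scenario that bites in the two-factor experiment below.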

If I have two factors in my RNASeq experiment (3 replicates is typical, for 12 samples), this filter would eliminate transcripts only expressed in the interaction term, such as condition:cell_line in the above example.  Here's the metadata:

> meta

        sample                 path condition cell_line
1    NSC_Ctl_1   NSC_Ctl_1.kallisto       NSC       Ctl
2    NSC_Ctl_2   NSC_Ctl_2.kallisto       NSC       Ctl
3    NSC_Ctl_3   NSC_Ctl_3.kallisto       NSC       Ctl
4     NSC_KO_1    NSC_KO_1.kallisto       NSC        KO
5     NSC_KO_2    NSC_KO_2.kallisto       NSC        KO
6     NSC_KO_3    NSC_KO_3.kallisto       NSC        KO
7  Odiff_Ctl_1 Odiff_Ctl_1.kallisto        OD       Ctl
8  Odiff_Ctl_2 Odiff_Ctl_2.kallisto        OD       Ctl
9  Odiff_Ctl_3 Odiff_Ctl_3.kallisto        OD       Ctl
10  Odiff_KO_1  Odiff_KO_1.kallisto        OD        KO
11  Odiff_KO_2  Odiff_KO_2.kallisto        OD        KO
12  Odiff_KO_3  Odiff_KO_3.kallisto        OD        KO

The condition:cell_line term gleans data from only 3 (25%) of the samples (i.e. those that are OD:KO). Let's change the filter to only require >=5 reads in 25% of the samples...

> so <- sleuth_prep(meta, ~ condition+cell_line+condition:cell_line,
                          filter_fun=function(x){basic_filter(x, 5, 0.25)})
reading in kallisto results
normalizing est_counts
36320 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps

Whoa! We just increased the number of transcripts passing the filter by almost 40%, which leads to a huge inflation of false positives in the differential expression, and just as importantly, detrimentally affects the q-values for the genes in our original, default-filtered analysis.  A smarter filter might be to require 100% of samples with any present factor level to have at least 5 reads, i.e. keep any transcript where all replicate samples for a factor moderately express it.

[Puts on thinking cap, writes several failed attempts, then...]

> design_filter <- function(design, row, min_reads=5, min_prop=0.47){
    sum(apply(design, 2, function(x){
        y <- as.factor(x)
        return(max(tapply(row, y, function(f){sum(f >= min_reads)})/
                   tapply(row, y, length)) == 1
               || basic_filter(row, min_reads, min_prop))
    })) > 0}
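Translated into Python for clarity (a hypothetical rendering, the analysis itself uses the R version above): for each design column, group samples by factor level, keep the transcript if every sample at some level has at least min_reads, and otherwise fall back to the basic proportion test.

```python
def design_filter(design_cols, row, min_reads=5, min_prop=0.47):
    # design_cols: list of per-sample factor level vectors (one per design column)
    for levels in design_cols:
        groups = {}
        for lvl, reads in zip(levels, row):
            groups.setdefault(lvl, []).append(reads)
        # is there a factor level where every replicate moderately expresses it?
        if any(all(r >= min_reads for r in g) for g in groups.values()):
            return True
    # otherwise fall back to the default proportion-based filter
    return sum(r >= min_reads for r in row) / len(row) >= min_prop

# interaction column: only the 3 OD:KO samples are level 1, and all 3 express the transcript
interaction = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(design_filter([interaction], [0, 0, 1, 0, 2, 0, 0, 1, 0, 15, 22, 18]))  # True
```

A transcript near zero everywhere still fails: no level has all replicates at 5+ reads, and the fallback proportion test fails too.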

To pass in the design matrix that my new filter requires, I can just reuse the one my first call to sleuth_prep() generated, rather than making it myself.  Probably not a bad idea to do it this way in any case, so we can then compare how many transcripts pass this new filter vs. the default filter.

> so_redux <- sleuth_prep(meta, ~cell_line*condition,
                          filter_fun=function(x){design_filter(so$design_matrix, x)})
reading in kallisto results
normalizing est_counts
26370 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps

Although for this dataset the new filter also requires ~25% of samples to have moderate expression, the added constraint that those 25% cover all replicates of some factor level means adding just 334 transcripts to the analysis instead of more than 10,000.  This seems much more reasonable to me, and my known true positive transcript suddenly appeared. #winning
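Quick arithmetic on the three filter runs quoted above:

```python
default_n = 26036  # basic_filter defaults (5 reads, 47%)
loose_n   = 36320  # basic_filter relaxed to 25%
design_n  = 26370  # design_filter

print(loose_n - default_n)    # 10284 transcripts added by the naive 25% filter
print(design_n - default_n)   # 334 transcripts added by design_filter
print(round(100 * (loose_n - default_n) / default_n, 1))  # 39.5 (% inflation from the naive filter)
```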

Note that design_filter() should work for any set of nominal factors, but not quantitative factors. A column slice of the design matrix could be passed in accordingly in the so_redux code above.

November 28, 2017

Basic brdU analysis

Here's a quick run down of how to generate a log ratio plot for brdU IP pulldown vs. whole cell extract run with NGS DNA sequencing. This type of experiment is done to find how commonly origins of replication are used in yeast genomes, for example, as the brdU is incorporated into soon-to-be daughter cell DNA during cell division.  For the purposes of this walkthrough, you can use https://www.ebi.ac.uk/ena/data/view/SRR1284645 (wt1) and https://www.ebi.ac.uk/ena/data/view/SRR1284649 (wcewt1).

1) Map the reads to the genome for both whole cell extract and the experimental brdu IP capture (e.g. wild type strain used here):

/export/common/programs/bwa-0.7.15 mem -t 8 /export/common/dbs/bwa/sacCer3.fa wcewt1.fastq.gz | samtools sort -O bam -T wcewt1 > wcewt1.genome.bam 

/export/common/programs/bwa-0.7.15 mem -t 8 /export/common/dbs/bwa/sacCer3.fa wt1.fastq.gz | samtools sort -O bam -T wt1 > wt1.genome.bam 

Note that if you have paired end data, specify both on the command line for each bwa command.

2) Calculate the coverage across the genome for each sample (sacCer3.genome has the name and length of each chromosome tab separated, one chr per line):

bedtools genomecov -ibam wt1.genome.bam -g sacCer3.genome -d > wt1.genome.bdg
bedtools genomecov -ibam wcewt1.genome.bam -g sacCer3.genome -d > wcewt1.genome.bdg

Note that if you have paired end data, add the flag -pc to the bedtools command.

3) Start R and calculate a 1000 base pair window to compare coverage of the WCE to the brdU pulldown, stepping 100bp for each window recalculation so we have 100bp "buckets" of smoothed coverage rather than base pair resolution.

wt1 <- read.table("wt1.genome.bdg")
wcewt1 <- read.table("wcewt1.genome.bdg")
window <- function(x, i, w){mean(x[max(1,i-w/2):min(length(x),i+w/2)])}
buckets <- seq(1, length(wt1$V3), 100)

Avoid annoying scientific notation parsing issues in IGV later with this magic:

options(scipen=999)
Before we actually apply the windowing, figure out how much more overall sequence there is in one sample vs. the other, as we need to normalize for that. Also, of course, add 1 to the totals so we aren't taking the log of zero (which causes R to have conniptions).

norm_factor <- sum(as.numeric(wt1$V3))/sum(as.numeric(wcewt1$V3))

log_ratio <- sapply(buckets, function(x){log(window(wt1$V3, x, 1000)/norm_factor+1)/log(window(wcewt1$V3, x, 1000)+1)})

write.table(data.frame(wt1$V1[buckets], wt1$V2[buckets], wt1$V2[buckets]+99, log_ratio), sep="\t", file="log_ratio_wt1_v_wcewt1.bed")

4) I'm lazy and reformat the output in Perl rather than putting all the right options into the R write.table() command. It also excludes chrM as that seems to raise the ceiling too much for the data range and makes everything else look small by default when graphed (e.g. 0-3 for normal but up to 16 for chrM).

perl -ane '$F[1]=~tr/"//d; print join("\t", @F[1..4]),"\n" unless $. == 1 or $F[1] eq "chrM"' log_ratio_wt1_v_wcewt1.bed > log_ratio_wt1_v_wcewt1.bdg

You're done!  Now you can load your BedGraph file into IGV, etc. and look at peak heights...more on comparing samples using quantile normalization, etc. to come.

May 29, 2017

The ungoogleable: Authenticating a Galaxy Portal with Active Directory

As a rule, Google is very useful for finding examples of how to configure software, especially when you're doing Unix-y stuff and using well worn programs like the Apache Web Server.  A recent configuration task I needed to perform was the exception that proved the rule: there is no working, non-trivial example on the Web of how to configure Apache to authenticate a user against an LDAP server that requires authentication to search (you don't want to hard code in a user name and password in the httpd.conf file, of course). Windows Active Directories often fall into this category, as they do not allow an anonymous search to start (Apache's default is an anon search, then bind with a user name, then search for attributes).

Specifically, the Galaxy Web server is a popular Python-based bioinformatics package.  It uses Paster to serve up the Web pages, but expects you to put a user authentication proxy in front of it if you want to use CAS, LDAP, etc. to manage the portal's user accounts. Galaxy's Paster process then just blindly accepts the REMOTE_USER variable sent along by the proxy with each HTTP request. Ideally, the REMOTE_USER is a fully qualified e-mail address so that Galaxy can send e-mail notifications for features that have that capability, e.g. notify a user when their analysis is complete.

The keys to getting this to work are Apache's AuthLDAPInitialBindPattern and AuthLDAPRemoteUserAttribute directives. Feast your eyes on the following Apache config, that:
  1. takes a user name like "work_group_name\user_login_name"
  2. authenticates against an Active Directory with a long, complex binding DN like "CN=user_login_name,OU=work_group_name,OU=Users,OU=Accounts,DC=..."
  3. queries the user information in the directory for an e-mail address containing attribute (here userPrincipalName)
  4. forwards the e-mail address on to the paster Web server in the REMOTE_USER variable

# If the LDAP server cert is not chained with any of your root certs on the box and you can't import it
LDAPVerifyServerCert off
<VirtualHost my_proxy_site:443>
        ServerName my_proxy_site
        # Require SSL since the users will be entering their org credentials so we shouldn't accept them as plain-text
        SSLEngine on
        SSLCertificateFile /etc/ssl/certs/my_proxy_site.crt
        SSLCertificateKeyFile /etc/ssl/certs/my_proxy_site.key

        RewriteEngine On
        # Serve up static content directly rather than proxying, for efficiency
        RewriteRule ^/static/style/(.*) /opt/gls/galaxy/static/june_2007_style/blue/$1 [L]
        RewriteRule ^/static/scripts/(.*) /opt/galaxy/static/scripts/packed/$1 [L]
        RewriteRule ^/static/(.*) /opt/galaxy/static/$1 [L]
        RewriteRule ^/favicon.ico /opt/galaxy/static/favicon.ico [L]
        RewriteRule ^/robots.txt /opt/galaxy/static/robots.txt [L]
        # Send everything else through the proxy
        RewriteRule ^(.*) "http://the_web_server_expecting_remote_user_set:8080/$1" [P]

        <location "/">
        Order deny,allow
        Deny from all
        # We're picking specific subnets that are allowed to connect and authenticate
        Allow from 192.168.
        Allow from 10.0.0.

        # Authenticate against the org's Active Directory (one of three in this case)
        AuthName "Please login using your Organization_Name_Here credentials"
        AuthType Basic
        AuthBasicProvider ldap
        AuthLDAPURL "ldaps://ldap_server_name1 ldap_server_name2 ldap_server_name3/DC=my_suborg,DC=my_org,DC=my_tld?sAMAccountName,userPrincipalName?sub"
        AuthLDAPInitialBindAsUser on
        AuthLDAPInitialBindPattern (.+)\\(.+) CN=$2,OU=$1,OU=Users,OU=Accounts,DC=my_suborg,DC=my_org,DC=my_tld
        AuthLDAPRemoteUserAttribute userPrincipalName
        Require valid-user

        # Pass the LDAP account name we just authenticated on to the final destination (the Web server process at the_web_server_expecting_remote_user_set:8080) with the REMOTE_USER header set
        RewriteEngine On
        RewriteCond %{LA-U:REMOTE_USER} (.+)
        RewriteRule .* - [E=RU:%1]
        RequestHeader set REMOTE_USER %{RU}e
        </Location>
</VirtualHost>
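The AuthLDAPInitialBindPattern line is just a regex substitution on whatever the user typed into the login box. The equivalent transformation in Python (an illustration only, Apache does this internally):

```python
import re

# Apache: AuthLDAPInitialBindPattern (.+)\\(.+) CN=$2,OU=$1,OU=Users,...
pattern = r"(.+)\\(.+)"
dn_template = r"CN=\2,OU=\1,OU=Users,OU=Accounts,DC=my_suborg,DC=my_org,DC=my_tld"

login = r"work_group_name\user_login_name"
bind_dn = re.sub(pattern, dn_template, login)
print(bind_dn)
# CN=user_login_name,OU=work_group_name,OU=Users,OU=Accounts,DC=my_suborg,DC=my_org,DC=my_tld
```

Apache then binds to the directory as that DN with the user's password, and on success pulls userPrincipalName out for REMOTE_USER.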


April 18, 2017

Data for free: viral metagenomics in human RNASeq data

tl;dr Even with a polydT bead RNASeq extraction, you can find interesting metagenomic data, as many single stranded RNA viruses have polyadenylated genomes/transcripts. In an experiment involving exposure of patients to rhinovirus, measuring the viral load in the samples (Diamond + MEGAN5) yielded a quantitative factor effect that 1) increased the number of genes significantly differentially expressed, and 2) allowed me to exclude a sample due to evidence of patient infection before the experimental inoculation.

Metagenomic read assignment to taxonomy is not super accurate. I discuss a few ways to evaluate whether a signal is either a true or false positive in a human RNAseq experiment context.

I had an RNASeq project that was quite resistant to yielding significant results, based on a patient medical factor and experimental exposure to a cold virus (a clinically approved Rhinovirus A strain). I suspected that the binary (yes/no) exposure factor wasn't useful in my Sleuth gene expression model because the virus wasn't spreading the same way, or maybe at all, in some patients and sampled tissues.

Because single stranded RNA rhinoviruses have polyadenylated genomes/transcripts, our lab's regular polydT bead extraction protocol should theoretically pull these out with the normally targeted human polyA-tailed transcripts.  I adapted some of my non-reference Kallisto+Sleuth analysis steps to look for the virus in my data (Lane1 subset) like so, assuming you have your sample names in samplenames.txt (important: in the same order as the meta.tab file used later):

for f in `cat samplenames.txt`; do kallisto quant -i /export/common/dbs/kallisto/hg19.refMrna -o $f.kallisto -b 100 --single -l 180 -s 20 <(gzip -cd myrunid/Data/Intensities/BaseCalls/$f/*L001*.fastq.gz) & done

Note that I launched these in parallel in the background. If you don't have as many cores as you have samples, replace '&' with ';' before "done" to run in series, or use GNU parallel, etc. Wait for these jobs to finish, then as a quick sanity check see if the samples all map decently (typically 80%+ map to the RefSeq human transcriptome)...

for f in `cat samplenames.txt`; do perl -ane '$t+=$F[3];END{$ARGV=~s/abundance.tsv/run_info.json/;print $ARGV,"\t", $t/$1,"\n" if `cat $ARGV`=~/n_processed": (\d+)/s;}' $f.kallisto/abundance.tsv; done
N1pre.kallisto/run_info.json 0.8336
N1time1.kallisto/run_info.json 0.7829

Maybe exclude any really bad ones (e.g. 40% mapping a.k.a. < 0.4) from further analysis for now and check what the input quantity and quality (RIN score) was. Then run the Diamond program in BLASTX mode:

for f in `cat samplenames.txt`; do /export/common/programs/diamond-0.7.9 blastx -d /export/common/dbs/diamond/nr -a $f.diamond_nr -q myrunid/Data/Intensities/BaseCalls/ProjectName/$f/*_L001*.fastq.gz -p 1 & done

Same waiting caveat as above, then...

for f in `cat samplenames.txt`; do diamond view -a $f.diamond_nr.daa -o $f.nr.diamond.tab -f tab ; done
for f in `cat samplenames.txt`; do /export/common/programs/megan5/tools/blast2lca -v -i $f.nr.diamond.tab -o $f.nr.diamond.tax -g2t /export/common/dbs/megan5/gi_taxid-March2015X.bin; done

In the 60 samples, there were 157 million sequences aligned at the protein level. I aligned these against all 75 million known protein sequences in less than a day on a single machine.  Let that sink in and realize how awesome Diamond is. Note that you could use Kraken to do this analysis instead at the DNA level; please let me know how you fare.

Let's first do a sanity check and see how often species are identified at 100% (no matches outside the species lowest common ancestor for all hits) by MEGAN5, which should have the lowest false positive rate for such analyses:

perl -ne '$c{$1}++ if /\[Species\] (.*?); 100/;END{for(keys %c){print "$c{$_}\t$_\n"}}' *.tax | sort -n -k 1

1 Abacion magnum
1 Abida secale
1 Acalymma trivittatum
1 Acanthamoeba culbertsoni
1 Acanthascus dawsoni
1 Acanthis flammea
1 Acanthocheilonema odendhali
1 Acanthochromis polyacanthus
1 Acanthocyclops vernalis
1 Acanthosaura crucigera
40808 Camelus ferus
45634 Rhinovirus A
47346 Saimiri boliviensis
49148 Rattus norvegicus
58556 Tupaia chinensis
62884 Cricetulus griseus
70044 Callithrix jacchus
74475 Papio anubis
88769 Mus musculus
98365 Ostrea edulis
101606 Chlorocebus sabaeus
119838 Pan paniscus
132084 Rhinopithecus roxellana
143348 Chlamydia psittaci
176297 Pongo abelii
278318 Pan troglodytes
333650 Macaca fascicularis
378764 Gorilla gorilla
472266 Macaca mulatta
889709 synthetic construct
15692650 Homo sapiens

There are 19978079 sequences with 100% at the species level, so we have a 78.5% true positive rate if we assume nearly all the actual sample is of human origin. A little Linnaean naming knowledge quickly shows that the most frequent hits besides human are to constructs or primates like macaques, gorillas, and orangutans (17.7%). These are often due to high protein-level sequence similarity and random sequencing errors colliding. The same goes for other mammals, but to a lesser extent.
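The 78.5% figure comes straight from the counts above:

```python
human = 15692650               # Homo sapiens hits at 100% species-level confidence
species_level_total = 19978079 # all sequences assigned at 100% species level

print(round(100 * human / species_level_total, 1))  # 78.5
```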

The Rhinovirus A count of 45,634 stands out though. We should expect this to be picked up as an artifact much less frequently than a New World monkey based on evolutionary distance...even if we assume convergent evolution with the human host at the AA level, the fact is that rhinovirus C has only 3 hits. Rhinovirus A has 4,327 GenBank protein entries and rhinovirus C has 2,580, so database bias can't account for the 4 orders of magnitude discrepancy. Rhinovirus C is actually completely in line with the mean of 4.2 observations for species assignments under a Poisson distribution for this dataset, i.e. it's part of the (low count) random false positive observations.  That's 4.2/60 samples = 0.07 false observations/sample for any given species to be observed at least once with 100% confidence. I conclude that the rhinovirus A detection isn't likely to be artifactual when seen in control samples.
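To see why 3 rhinovirus C hits is unremarkable while 45,634 rhinovirus A hits is not, plug the dataset-wide mean of 4.2 spurious observations per species into a Poisson model (a sketch of the reasoning above, not code from the original analysis):

```python
from math import exp, factorial

lam = 4.2  # mean spurious species observations across the whole dataset

def pois_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

# probability a species racks up >= 3 hits purely by chance: quite high
p_ge_3 = 1 - sum(pois_pmf(k, lam) for k in range(3))
print(round(p_ge_3, 2))  # 0.79, so rhinovirus C's 3 hits fit the noise model

# per-sample rate of any given species appearing at least once at 100% confidence
print(round(lam / 60, 2))  # 0.07 false observations/sample
```

45,634 hits is astronomically improbable under that same model, which is the point.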

We can now add that to our default sample metadata file and model the viral load effect (log probably makes sense) as part of the expression regression model.

echo viral_load > virus_count.txt

for f in *.tax; do grep -i "Rhinovirus A; 100;" $f | wc -l >> virus_count.txt; done

paste meta.tab virus_count.txt > meta_with_viral_load.txt

Be sure to add 1 to the load so you aren't taking the log of zero, which will cause R to choke.

> library(sleuth)
> meta_with_viral_load <- read.table("meta_with_viral_load.txt", header=TRUE)
> meta_with_viral_load
    sample             path patient_group_factor time viral_load
1    N1pre   N1pre.kallisto                   no    0          0
2  N1time1 N1time1.kallisto                   no    1         29
3  N1time2 N1time2.kallisto                   no    2       1187
4    N2pre   N2pre.kallisto                   no    0          6
5  N2time1 N2time1.kallisto                   no    1         29
6  N2time2 N2time2.kallisto                   no    2         17
[...]
16   S1pre   S1pre.kallisto                  yes    0          0
17 S1time1 S1time1.kallisto                  yes    1       2006
18 S1time2 S1time2.kallisto                  yes    2       7292

> so <- sleuth_prep(meta_with_viral_load, ~patient_group_factor*time*I(log(viral_load+1)))


Note that this sequence was generated on an Illumina NextSeq 500, which does not have the "spread-of-signal" problem to the same degree that has recently been characterized for patterned flow cell Illumina devices. If you are using a HiSeq 4000 or similar, you'll need to do extra cleanup steps to remove free primers during library prep before putting samples onto the machine (I assume that Illumina is working fervently to fix the protocol right now). A significant amount (~5-10%) of post-infection sample data might have pre-infection sample barcodes, making the kind of analysis done here moot.

Spread-of-signal exists for the NextSeq 500 and MiSeq, probably more in the 1% range, so judge your detection threshold accordingly. In the experiment described here, one of the patients appears to have caught a slightly different rhinovirus A strain before the experimental inoculation and was already in immunity overdrive (i.e. N2pre has a viral load, and inoculation didn't cause a big jump in load for time 1 and 2). I didn't bore you with all the strain level information, but trust me, it's there. If the N2pre load were artifactual by some spread-of-signal we'd expect the Poisson distance matrix to still look decent, but in fact this sample stands out, clustering with the post-infection samples. I therefore excluded it from the analysis, not by cherry picking but by metagenomic/metatranscriptomic (ssRNA viral genome) evidence that makes the sample unsuitable for the study.

The sample exclusion and the quantitative modelling increased the number of significantly differentially expressed genes from a handful to hundreds, using data freely available but unmapped by Kallisto in a normal human gene expression analysis.