tl;dr If you have a factor with an unknown dosage, make it binary (dosed vs. not) and run a Wilcoxon-Mann-Whitney U test to improve statistical power, assuming you have ~20 samples or more. In a real test case of 20 samples, we went from 2 significant changes using the likelihood ratio test (LRT) on each sample group independently, to over 1000 significant changes using the (non-parametric) U test.
Suppose you have RNASeq samples that can be divided into two groups: 1) variable dose of a factor, and 2) no dose. Also suppose you don't know the dose for any given sample in group 1. I ran into this issue analyzing environmental sampling across 5 riparian sites (4 samples at each site) upstream and downstream of a wastewater source. By running a U test between the upstream and downstream groups, the downstream pollution dosage doesn't matter: the U test is non-parametric and makes no assumption of normality for the "replicates" within each group. Once you have about 20 samples, real differentially expressed genes start to survive Benjamini-Hochberg multiple testing correction.
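To see why the dose doesn't matter, here's a toy sketch in R (made-up numbers, not from the real dataset): the downstream values vary wildly with exposure level, but as long as they all rank above the upstream values, the U test returns its smallest possible p-value for a 4-vs-4 comparison.
upstream <- c(1.2, 0.8, 2.1, 1.5) # low, roughly similar expression
downstream <- c(55, 900, 12, 3000) # wildly different "doses", all higher
wilcox.test(downstream, upstream) # p-value = 0.02857, the minimum possible for 4 vs 4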
Names have been changed to protect the unpublished.
____________
We'll pick up my blogged de novo assembly RNASeq analysis after the Kallisto mapping step, so we have a directory called samplename_trinity.kallisto for each sample. We make an experiment metadata file (metadata_trinity.tab) like so, noting for each sample whether its site is upstream (0) or downstream (1):
sample path site downstream
15 15_trinity.kallisto site_4 1
17 17_trinity.kallisto site_4 1
21 21_trinity.kallisto site_4 1
23 23_trinity.kallisto site_4 1
29 29_trinity.kallisto site_2 0
30 30_trinity.kallisto site_2 0
34 34_trinity.kallisto site_2 0
35 35_trinity.kallisto site_2 0
40 40_trinity.kallisto site_1 0
43 43_trinity.kallisto site_1 0
4 4_trinity.kallisto site_5 1
50 50_trinity.kallisto site_1 0
57 57_trinity.kallisto site_1 0
59 59_trinity.kallisto site_3 1
62 62_trinity.kallisto site_3 1
68 68_trinity.kallisto site_3 1
6 6_trinity.kallisto site_5 1
70 70_trinity.kallisto site_3 1
7 7_trinity.kallisto site_5 1
8 8_trinity.kallisto site_5 1
If you used a reference genome instead of Trinity, just omit the "_trinity" from the path values.
First, let's collate a Transcripts Per Million (TPM) file for all transcripts as produced by Kallisto for each sample. TPM is a normalized measure of transcript expression. Start by generating the header line:
perl -e 'print "target_id\t",join("\t",map {/(.*)\//;$1} @ARGV),"\n";' *_trinity.kallisto/abundance.tsv > all_trinity_abundance.tsv
Then append the TPM values. TPM is the 5th of the 5 columns in each pasted abundance.tsv, so we want field indexes 4, 9, 14, and so on, which are exactly the indexes ending in 4 or 9 (hence the /[49]$/ test):
paste *_trinity.kallisto/abundance.tsv | perl -ane 'print $F[0];for (1..$#F){print "\t$F[$_]" if /[49]$/}print "\n"' | tail -n +2 >> all_trinity_abundance.tsv
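A quick sanity check that the collated file has the expected 21 columns (target_id plus 20 samples) on every line:
head -1 all_trinity_abundance.tsv | tr '\t' '\n' | wc -l # expect 21
awk '{print NF}' all_trinity_abundance.tsv | sort -u # every line should yield 21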
Alternatively, if you want to use FPKM values, first generate a header line for all_abundance_fpkm.tsv in the same way as above, then append:
paste *.kallisto/abundance.tsv | tail -n +2 | perl -ane 'print $F[0];for (1..$#F){print "\t",$F[$_]/$F[$_-2]*1000 if /[49]$/}print "\n"' >> all_abundance_fpkm.tsv
Now we can start R, load the data and get to some statistics!
meta <- read.table("metadata_trinity.tab", header=TRUE, row.names=1) # the sample column becomes the row names
abundance_full <- read.table("all_trinity_abundance.tsv", header=TRUE)
Let's only keep the transcripts with an average of more than 1 TPM, by dropping rows whose sum across the 20 samples is 20 or less. In the previous analysis of this dataset, most of Kallisto's bootstrap estimates of technical noise were below 1 TPM.
keep <- rowSums(abundance_full[,2:21]) > 20 # column 1 is target_id, so only sum the 20 sample columns
abundance <- abundance_full[keep, 2:21]
rownames(abundance) <- abundance_full[keep, 1]
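Since we have 20 samples, a sum above 20 is the same as a mean above 1 TPM, which we can verify:
dim(abundance) # number of transcripts passing the filter x 20 samples
all(rowMeans(abundance) > 1) # TRUE by construction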
Now, for each transcript let's calculate the log2 fold change between the downstream samples and the base level of the factor, upstream. The downstream column holds 0/1, so we compare against 1 and 0 to get logical column indexes. (With 12 downstream vs. 8 upstream samples, this ratio of sums sits a constant log2(12/8) ≈ 0.58 above the corresponding ratio of means; keep that in mind when thresholding.)
log2fc <- function(x){log2(sum(abundance[x, meta$downstream == 1])/sum(abundance[x, meta$downstream == 0]))}
l <- c()
for(i in 1:length(abundance[,1])){l <- append(l,log2fc(i))}
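One caveat: a transcript with zero TPM in every upstream sample yields a log2fc of Inf (and -Inf in the opposite case). If you want finite values, e.g. for plotting later, a small pseudocount in each group works. A sketch, where the pseudocount of 1 TPM is an arbitrary choice:
log2fc_pc <- function(x){log2((sum(abundance[x, meta$downstream == 1]) + 1)/(sum(abundance[x, meta$downstream == 0]) + 1))}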
Now let's run our Wilcoxon-Mann-Whitney U test for each transcript. The test ranks the TPMs numerically and checks whether the downstream samples cluster non-randomly in that ranking. If the downstream samples have all the lowest or all the highest TPMs for a transcript, that yields the smallest possible p-value.
wil <- function(x){df <- data.frame(tpm=t(abundance[x,])[,1], downstream=meta$downstream); w <- wilcox.test(tpm ~ downstream, data=df); w$p.value}
pvals <- c()
for(i in 1:length(abundance[,1])){pvals <- append(pvals,wil(i))}
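The same calculation as a one-liner, if you prefer (identical result, and it avoids growing the vector one element at a time):
pvals <- sapply(1:nrow(abundance), wil)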
Of course, if you measure enough transcripts, eventually all downstream samples will rank together in some transcript purely by chance. Let's apply a multiple testing correction to the p-values. The False Discovery Rate (FDR) correction, a.k.a. Benjamini-Hochberg, is quite popular, as for most experiments it's neither too conservative nor too lax.
q <- p.adjust(pvals, method="fdr")
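If you're curious what Benjamini-Hochberg actually does under the hood: each sorted p-value is scaled by (number of tests)/(rank), followed by a cumulative minimum running from the largest p-value down. A toy example that reproduces p.adjust (p is already sorted here to keep the sketch short):
p <- c(0.001, 0.01, 0.02, 0.8)
n <- length(p)
bh <- pmin(1, rev(cummin(rev(p * n / 1:n))))
all.equal(bh, p.adjust(p, method="fdr")) # TRUE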
Let's decorate the abundance information with the fold change and U test stats, then note those with adjusted p-value (a.k.a. q-value) < 0.05.
abundance[,"log2fc"] <- l
abundance[,"qval"] <- q
qsig <- which(q < 0.05)
length(qsig)
[1] 1073
Hey, that's a lot more than we got using the likelihood ratio test for each individual site, where only two transcripts were significant across all the downstream sites! Strength in replicate sample numbers. The q-values won't blow you away with an 8:12 split; they're in the (0.01, 0.05) range, which is why I suggest this technique when you have 20 or more samples. You can try with fewer, but you're most likely just spinning your wheels.
The log2 fold change isn't super useful on its own, as we make no assumption of normality amongst replicates (the downstream sites saw different and unknown pollution exposure levels). BUT we can use it to pull out the union of transcripts with either a fairly strong effect at one site or consistent expression changes across sites. This could be useful for pollution biomarker discovery.
length(which(abs(abundance[qsig,"log2fc"]) > 1))
[1] 118
That's still quite a bit to work with! Let's write all the transcripts that passed the q-value cutoff to a file:
write.csv(abundance[qsig,], file="up_downstream_wilcox_q_lt_0.05.csv")
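Before moving on, it's handy to eyeball what you just wrote out, e.g. the ten lowest q-values, and which way the strong effects go (sign 1 means higher downstream, -1 means lower):
head(abundance[qsig,][order(abundance[qsig,"qval"]), c("log2fc","qval")], 10)
strong <- qsig[abs(abundance[qsig,"log2fc"]) > 1]
table(sign(abundance[strong,"log2fc"]))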
Not all of these are going to be real, but a subsequent analysis in IPA shows a statistically significant number of observations in the RNASeq data consistent with the literature, so we're probably on the right track. To be extra sure, let's randomly relabel the samples as upstream or downstream and redo the analysis. Hopefully we don't get any significant q-values.
meta$downstream <- sample(c(0,1), nrow(meta), replace=TRUE) # a random 0/1 label for each of the 20 samples
abundance <- abundance_full[keep, 2:21] # reset abundance to drop the log2fc and qval columns we appended
pvals <- c()
for(i in 1:length(abundance[,1])){pvals <- append(pvals,wil(i))}
q_random <- p.adjust(pvals, method="fdr")
qsig_random <- which(q_random < 0.05)
length(qsig_random)
[1] 0
*mic drop*
Hold on, that's a bit premature. Let's pick up the mic again, because we may not be doing a fair random comparison. In the real configuration, the downstream status is consistent with the site status of each sample. How many random ways could we assign 2 upstream and 3 downstream designations across the 5 sites?
library(combinat)
downstream_options <- unique(permn(c(0,0,1,1,1)))
This gives us the ten unique ways of assigning binary downstream status to the five sites.
downstream_options
[[1]]
[1] 0 0 1 1 1
[[2]]
[1] 0 1 0 1 1
[[3]]
[1] 1 0 0 1 1
[[4]]
[1] 0 1 1 0 1
[[5]]
[1] 1 0 1 0 1
[[6]]
[1] 1 1 0 0 1
[[7]]
[1] 1 0 1 1 0
[[8]]
[1] 1 1 0 1 0
[[9]]
[1] 0 1 1 1 0
[[10]]
[1] 1 1 1 0 0
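As a sanity check, that count matches picking which 2 of the 5 sites are upstream:
length(downstream_options) # 10
choose(5, 2) # also 10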
meta$site <- substr(meta$site, 6, 6) # strip the site name to just its number, which doubles as an index into each combo vector
combo_qvals <- function(combo){
  # give each sample the downstream status its site receives under this combo
  meta$downstream <<- apply(meta, 1, function(x){combo[as.numeric(x["site"])]})
  pvals <<- c()
  for(i in 1:length(abundance[,1])){pvals <<- append(pvals,wil(i))}
  q_random <- p.adjust(pvals, method="fdr")
  length(which(q_random < 0.05))
}
result <- lapply(downstream_options, combo_qvals)
result
[[1]]
[1] 1073
[[2]]
[1] 0
[[3]]
[1] 0
[[4]]
[1] 0
[[5]]
[1] 0
[[6]]
[1] 0
[[7]]
[1] 0
[[8]]
[1] 0
[[9]]
[1] 0
[[10]]
[1] 0
*mic drop* [for real this time]