tl;dr When you combine RNASeq experiments, the assumption of normality across samples with the same nominal factors goes out the window. The standard parametric tests give only 16 significant transcripts even when the experiment batch effect is modelled fully. For transcript-level (e.g. Kallisto output) abundance data with at least ~25 samples, we can test two factors non-parametrically (on expression rank) by using a Kruskal-Wallis test.
A post-hoc U-test for each transcript passing Kruskal-Wallis then tells us which factors (or their interaction) are significantly changing the expression level, even though a global U-test of the transcripts turns up nothing. In the example below, we get 760 candidates in the combined analysis, with 71 in the factor interaction term. Many of those 71 are enriched in specific pathways, and they do not overlap with the 16 found via the parametric tests.
Recently, I analyzed an RNASeq experiment redo. The experiment was redone because the original yielded no useful significant transcript changes across the experimental factor contrast of interest (exercise vs. non-exercise). The redone experiment did yield some significant contrasts for the exercise factor, but I thought to myself: can't we combine the two experiments and get even better statistical power? The short answer is no if you're using standard tests like the Wald and likelihood ratio tests, because they assume normally distributed replicates, and we shouldn't assume that the transcript expression levels for replicates across the two experiments are normally distributed.
Here's a three-step process to model the two-factor experiment non-parametrically with the Kruskal-Wallis test: 1) check for outliers, 2) run the Kruskal-Wallis test, 3) run post-hoc pairwise tests on the transcripts that pass.
Note that since we are going to be using specific commands to gather the raw abundance data from Kallisto later, it's important that the samples be in alphabetical order in the metadata file.
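For what it's worth, reading that metadata in might look roughly like the sketch below. The file name, column layout and helper lines here are illustrative assumptions on my part, not the exact commands from the original analysis; the point is that the samples end up sorted and that the combined condition/time "groups" factor (used later) gets built.

# Illustrative sketch only: read a (hypothetical) metadata table, keep samples in
# alphabetical order, and build the combined condition:time "groups" factor used later.
meta <- read.table("meta.tsv", header=TRUE, sep="\t", stringsAsFactors=FALSE)
meta <- meta[order(meta$sample),]                    # alphabetical, to match the abundance columns
meta$path <- paste0(meta$sample, ".kallisto")        # one Kallisto output directory per sample
meta$condition <- as.factor(meta$condition)          # ctl vs. exp
meta$time <- as.factor(meta$time)                    # five vs. ten
meta$groups <- as.factor(paste(meta$condition, meta$time, sep="_"))  # e.g. ctl_ten, exp_five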
Step 1: Check for outliers
What immediately leaps out at me is that the main grouping factor is the experiment, i.e. all the DayXXX samples are grouped together. The second biggest grouping factor within each experiment is the day (5 vs. 10). Also, good contrasts yield a "checkerboard" effect in the distance matrix; this one looks much more like a mosaic.
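If you want to draw a matrix like this yourself, it can be rolled by hand from the TPM values. A minimal sketch, assuming abundance_full is the transcripts-by-samples TPM matrix gathered later in this post:

# Minimal sketch: sample-to-sample distance heatmap from the TPM matrix (abundance_full).
# Euclidean distance on log2(TPM+1); heatmap() draws the clustering trees on the margins.
log_tpm <- log2(as.matrix(abundance_full) + 1)
sample_dist <- as.matrix(dist(t(log_tpm)))
heatmap(sample_dist, symm=TRUE, labRow=colnames(abundance_full),
        labCol=seq_len(ncol(abundance_full)))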
Does this mean that it'll be tough to pull out the exercise factor effect from the combined dataset? This is where a non-parametric test helps out, because it does not assume a normal distribution of the effect amongst "replicate" samples, just that the direction of the effect is the same (I'm glossing over details here, but it's generally true). Let's look at the residual sum of squares (RSS) of the model fit to get a relative measure of how well we are modelling the data with the known factors.
so <- sleuth_prep(meta, ~condition*time)
so <- sleuth_fit(so)
sum(so$fits$full$summary[,2])
[1] 1159405
Getting back to the distance matrix and outlier analysis, do you see the light (distant) streak running through the matrix for samples 9 and 12 (X-axis labels)? They sit off on their own (see the cluster tree on the left margin) and don't relate nicely to the general factor trends. Let's try excluding them from the model and see how that affects our RSS.
meta_no_outliers <- meta[c(1:8,10:11,13:28),]
so_no_outliers <- sleuth_prep(meta_no_outliers, ~condition*time)
so_no_outliers <- sleuth_fit(so_no_outliers)
sum(so_no_outliers$fits$full$summary[,2])
[1] 1052599
So we reduced the RSS by over 9% by excluding 7% of the samples (2 of 28). Generally, excluding a good replicate will reduce the RSS by only up to half its sample weight, i.e. we'd expect an RSS reduction of <5% for these two samples, but we got 9%, so they really are outliers. Maybe these samples were prepared separately, etc.; we'd have to go back to the lab book to look for an explanation.
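Just to make that arithmetic explicit, using the two RSS values above:

# RSS drop vs. fraction of samples removed (numbers from above).
(1159405 - 1052599) / 1159405   # ~0.092, a 9.2% RSS reduction
2 / 28                          # ~0.071, for dropping 7.1% of the samples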
Sample 27 from the second experiment run also looks a bit fishy, all off on its own in the cluster tree. Let's try dropping it too.
meta_no_outliers_redux <- meta[c(1:8,10:11,13:26,28),]
so_redux <- sleuth_prep(meta_no_outliers_redux, ~condition*time)
so_redux <- sleuth_fit(so_redux)
sum(so_redux$fits$full$summary[,2])
[1] 1048939
The exclusion of Sample 27 (Exercise_2) dropped the RSS by a measly incremental 0.35%, even though we dropped 3.8% of the samples (1 of 26). We should keep it, because it isn't an outlier contributing to a bad model fit.
While it's fresh in our minds, let's assign an experiment factor to the metadata; as you can see in the bottom right of the matrix, the second experiment is samples 10-20 once we eliminate the two big outliers. This will help us reduce the RSS later.
> meta_no_outliers$which_experiment <- as.factor(c(rep.int(0,8),rep.int(1,10),rep(0,8)))
> meta_no_outliers
sample path condition time groups which_experiment
1 D10_control_2 D10_control_2.kallisto ctl ten ctl_ten 0
2 D10_control_3 D10_control_3.kallisto ctl ten ctl_ten 0
3 D10_control_4 D10_control_4.kallisto ctl ten ctl_ten 0
4 D10_control_5 D10_control_5.kallisto ctl ten ctl_ten 0
5 D5_control_1 D5_control_1.kallisto ctl five ctl_five 0
6 D5_control_2 D5_control_2.kallisto ctl five ctl_five 0
7 D5_control_5 D5_control_5.kallisto ctl five ctl_five 0
8 D5_control_6 D5_control_6.kallisto ctl five ctl_five 0
10 Day10_Control2 Day10_Control2.kallisto ctl ten ctl_ten 1
11 Day10_Control3 Day10_Control3.kallisto ctl ten ctl_ten 1
13 Day10_Exercise2 Day10_Exercise2.kallisto exp ten exp_ten 1
14 Day10_Exercise3 Day10_Exercise3.kallisto exp ten exp_ten 1
15 Day5_Control1 Day5_Control1.kallisto ctl five ctl_five 1
16 Day5_Control2 Day5_Control2.kallisto ctl five ctl_five 1
17 Day5_Control3 Day5_Control3.kallisto ctl five ctl_five 1
18 Day5_Exercise1 Day5_Exercise1.kallisto exp five exp_five 1
19 Day5_Exercise2 Day5_Exercise2.kallisto exp five exp_five 1
20 Day5_Exercise3 Day5_Exercise3.kallisto exp five exp_five 1
21 Exercise_12 Exercise_12.kallisto exp five exp_five 0
22 Exercise_13 Exercise_13.kallisto exp ten exp_ten 0
23 Exercise_17 Exercise_17.kallisto exp ten exp_ten 0
24 Exercise_18 Exercise_18.kallisto exp ten exp_ten 0
25 Exercise_19 Exercise_19.kallisto exp ten exp_ten 0
26 Exercise_1 Exercise_1.kallisto exp five exp_five 0
27 Exercise_2 Exercise_2.kallisto exp five exp_five 0
28 Exercise_7 Exercise_7.kallisto exp five exp_five 0
Step 2: Run the non-parametric tests
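Before the tests, a quick note on the abundance matrix: the abundance_full object used in the next few commands isn't built by anything shown in this transcript. Gathering it from each sample's Kallisto abundance.tsv might look roughly like the sketch below; the read_tpm helper and the exact file layout are my assumptions, and the alphabetical sample order mentioned earlier is what keeps the columns aligned with the metadata.

# Sketch only: assemble a transcripts x samples TPM data frame from Kallisto output.
# Assumes each meta$path directory holds Kallisto's abundance.tsv for that sample.
read_tpm <- function(path){
  a <- read.table(file.path(path, "abundance.tsv"), header=TRUE, sep="\t")
  setNames(a$tpm, a$target_id)
}
abundance_full <- as.data.frame(sapply(meta$path, read_tpm))
colnames(abundance_full) <- meta$sample   # columns follow the metadata (alphabetical) order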
Let's first try the non-parametric U-test (a.k.a. the Wilcoxon or Mann-Whitney test) to check for differences between the exercise and non-exercise samples.
> abundance <- abundance_full[rowSums(abundance_full)>length(meta_no_outliers$sample),c(1:8,10:11,13:28)]
> wil <- function(x){df <- data.frame(tpm=t(abundance[x,])[,1], fctr=m); w <- wilcox.test(tpm ~ fctr, data=df); w$p.value} #[ed. m will be populated with the factor values for each sample in the next function]
> calc_wilcox_qvals <- function(combo){m <<- as.factor(combo); pvals <<- c(); for(i in 1:length(abundance[,1])){pvals <<- append(pvals,wil(i))}; return(p.adjust(pvals, method="fdr"))}
> condition_wilcox_qvals <- calc_wilcox_qvals(meta_no_outliers$condition)
There were 50 or more warnings (use warnings() to see the first 50) #[ed. This is due to rank ties in the U-test, you can generally ignore these]
> table(condition_wilcox_qvals < .05)
FALSE
23953
D'oh! We don't have any transcripts whose differential expression, as picked up by the U-test, survives the False Discovery Rate (Benjamini-Hochberg in this case) multiple-testing correction. We know from previous experience that we might pick up a two-level contrast once we have at least 15 Kallisto samples. We can also show that we have the power to do so in this experiment by looking at one of the primary grouping factors in the distance matrix: the time levels of Day5 vs. Day10.
> time_wilcox_qvals <- calc_wilcox_qvals(meta_no_outliers$time)
There were 50 or more warnings (use warnings() to see the first 50)
> table(time_wilcox_qvals < .05)
FALSE TRUE
21176 2777
If exercise-vs.-non-exercise isn't separating by rank nicely across the two experiments, this might be because the effect of exercise isn't very strong at Day 5, so those exercise samples mix ranks with the non-exercise samples. The way to improve our power to detect the Day10 exercise effect is to look at the interaction (synergistic) effect of the day and exercise factors, as per the "groups" metadata factor we created earlier that has four levels. The U-test only works with two factor levels; its generalization to more than two factor levels is the Kruskal-Wallis test, which we'll employ here.
> kru <- function(x){d <- data.frame(tpm=t(abundance[x,])[,1], groups=m); k <- kruskal.test(tpm ~ groups, data=d); k$p.value}
> kru_qvals <- function(combo){m <<- as.factor(combo); pvals <<- c(); for(i in 1:length(abundance[,1])){pvals <<- append(pvals,kru(i))}; return(p.adjust(pvals, method="fdr"))}
> k <- kru_qvals(meta_no_outliers$groups)
> table(k < .05)
FALSE TRUE
23193 760
[Aside: An astute reader will note that we have fewer than 8 samples in each of the four factor levels, yet we are picking out significant contrasts even though the simpler U-test is underpowered at <8 samples per factor level. Because we have four factor levels, there are many more unique permutations of the ranks possible than when there are only two, hence the better performance here. Thinking of it another way, for any fixed string length, the number of unique ways to write 1's and 0's is << the number of ways to write 0's, 1's, 2's and 3's. More unique ranking arrangements means more information entropy, which means greater statistical power for a given number of samples.]
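To put rough numbers on that intuition, compare how many distinct ways the rank labels can be arranged under the two designs for our 26 samples: the 13-vs-13 exercise split versus the 7/6/7/6 four-group split (the group sizes are printed a bit further down):

# Number of distinct label arrangements for the two designs (26 samples).
choose(26, 13)                                      # ~1.0e7 for the two-level 13 vs. 13 split
factorial(26) / (factorial(7)^2 * factorial(6)^2)   # ~3.1e13 for the 7/6/7/6 four-level split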
Let's note the Kruskal-Wallis significant transcripts for later.
> kru_sig_abundance <- abundance[which(k < .05),]
> sig_ks <- k[which(k < .05)]
Let's do a sanity check and see if a random labelling of the samples gives any significant changes (it shouldn't).
> k_random <- kru_qvals(sample(meta_no_outliers$groups))
> table(k_random < .05)
FALSE
23953
Phew! You could run the random sampling a bunch of times to be extra sure; that's mostly left as an exercise for the reader, but a quick sketch follows.
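Something along these lines would do it (each shuffle re-runs the Kruskal-Wallis over all ~24,000 transcripts, so expect it to take a while):

# Repeat the random-label sanity check 20 times; each count should be at or near zero.
null_hits <- replicate(20, sum(kru_qvals(sample(meta_no_outliers$groups)) < .05, na.rm=TRUE))
table(null_hits)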
Step 3: Post-hoc significance testing with many factor levels
Now, we know that 760 transcripts have factor levels with significantly different rankings according to our groups label, but which of the four levels of the groups factor are driving the significance? We need to contrast each pair of factor levels.
This page from the R Companion provides a nice summary of ways to do the post-hoc analysis of the differing groups. Let's use the U-test, since it behaves nicely with unequal numbers of samples in each category (remember that we removed two samples, and if you don't believe me...
> table(meta_no_outliers$groups)
ctl_five ctl_ten exp_five exp_ten
7 6 7 6
).
Not only do we want to use the U-test (a.k.a. Mann-Whitney or Wilcoxon) for each pair of factor levels and do multiple-testing correction (FDR again), but we also need to load a couple of libraries that'll give us useful functions to print out the group comparisons later.
> library(multcompView)
> library(rcompanion)
> categorize <- function(x){pt <- pairwise.wilcox.test(as.numeric(x), meta_no_outliers$groups, p.adjust.method="fdr");
pt$p.value[which(is.nan(pt$p.value))] <- 1; multcompLetters(fullPTable(pt$p.value), compare="<", threshold=0.05, Letters=letters, reverse=FALSE)$Letters}
> kru_sig_diff_groups <- apply(kru_sig_abundance, 1, categorize)
This will give us for each of the 760 transcripts a compact representation of the significant grouping information for the 4 factor levels, e.g.
> kru_sig_diff_groups[,1]
ctl_five ctl_ten exp_five exp_ten
"ab" "c" "a" "bc"
This means that ctl_five is statistically indistinguishable from exp_five (they share 'a') and from exp_ten (they share 'b'), ctl_ten is indistinguishable from exp_ten (they share 'c'), exp_five and exp_ten are distinct (they share no letters), etc. Let's write a function to keep only the U-test results that have exp_ten (our interaction term, index 4) distinct from both non-exercise control time points (indices 1 & 2).
> exercise_only <- function(x){!grepl(x[1],x[4]) && !grepl(x[4],x[1]) && !grepl(x[2],x[4]) && !grepl(x[4],x[2])}
> eo <- apply(kru_sig_diff_groups, 2, exercise_only)
> table(eo)
eo
FALSE TRUE
689 71
> kru_exercise_sigids <- colnames(kru_sig_diff_groups[,eo])
Eureka! We have 71 significant Day10 exercise differentially expressed transcripts across both experiments. This was done based solely on the expression rank of transcripts across samples, so we didn't need the experiments to be true replicates, just moving generally in the same direction. We could write a function to calculate the log2 fold changes, but it'd get complicated since we have multiple factor levels, and it wouldn't include the estimates of technical variability, etc. that Sleuth provides. Even if the parametric test results from Sleuth aren't going to be significant due to the non-normality of the samples, the fold change information is useful, so let's run it.
> so <- sleuth_prep(meta_no_outliers, ~condition*time+which_experiment+condition:which_experiment+time:which_experiment)
> sum(so$fits$full$summary[,2])
[1] 864793.9
Note that this is a much better RSS than the original (~1160K). We did two things: excluded the outliers (~9% drop in RSS), and explicitly modelled the experiments and their effect on the other factors (a further ~18% drop in RSS). This will help the Sleuth results for fold-change* be more realistic. Now run the Wald and likelihood ratio (LRT) tests.
> so <- sleuth_fit(so, ~time+which_experiment+time:which_experiment, "no_condition")
...
> so <- sleuth_fit(so, ~condition+which_experiment+condition:which_experiment, "no_time")
...
> so <- sleuth_fit(so, ~condition+time+which_experiment+condition:which_experiment+time:which_experiment, "no_interaction")
...
> so <- sleuth_lrt(so, 'no_condition', 'full')
> so <- sleuth_lrt(so, 'no_time', 'full')
> so <- sleuth_lrt(so, 'no_interaction', 'full')
> lrt_condition <- sleuth_results(so, 'no_condition:full', test_type = 'lrt')
> lrt_time <- sleuth_results(so, 'no_time:full', test_type = 'lrt')
> lrt_interaction <- sleuth_results(so, 'no_interaction:full', test_type = 'lrt')
> so <- sleuth_wt(so, 'conditionexp')
> so <- sleuth_wt(so, 'timeten')
> so <- sleuth_wt(so, 'conditionexp:timeten')
> so <- sleuth_wt(so, 'conditionexp:which_experiment1')
> so <- sleuth_wt(so, 'timeten:which_experiment1')
> wt_condition <- sleuth_results(so, 'conditionexp')
> wt_time <- sleuth_results(so, 'timeten')
> wt_interaction <- sleuth_results(so, 'conditionexp:timeten')
> wt_condition1 <- sleuth_results(so, 'conditionexp:which_experiment1')
> wt_time1 <- sleuth_results(so, 'timeten:which_experiment1')
Did the parametric tests turn up anything significant?
> table(wt_condition$qval < .05)
FALSE TRUE
33456 16
> table(lrt_condition$qval < .05)
FALSE TRUE
33466 6
> table(lrt_interaction$qval < .05)
A few, but not as many as our non-parametric Kruskal-Wallis (71). The LRT results (6) are a subset of the Wald results (16). Let's put the non-parametric and parametric results together for reporting.
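One small step that isn't shown in the transcript above: collecting the IDs of those 16 Wald-significant condition transcripts into wt_condition_sigids. A reconstruction (not the original command) along these lines would do:

# Reconstruction: IDs of transcripts passing the Wald condition test at FDR < 0.05.
wt_condition_sigids <- wt_condition$target_id[which(wt_condition$qval < .05)]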
> all_sigids <- unique(c(wt_condition_sigids,kru_exercise_sigids))
> length(all_sigids)
[1] 87
87 = 71 + 16, so there is no overlap between the two tests' results. Let's define a function that provides the post-hoc pairwise U-test results when available, or NAs when not, for reporting purposes.
> kru_data <- function(id){if(id %in% colnames(kru_sig_diff_groups)){c(kru_sig_diff_groups[,which(colnames(kru_sig_diff_groups) == id)],sig_ks[which(colnames(kru_sig_diff_groups)==id)])}else{rep(NA,5)}}
Gather the Wald test results (which include the fold-change*) and the pairwise U-test results for each of the 87 significantly differentially expressed transcripts, then transpose the rows and columns, then write to a file.
> dl <- vapply(all_sigids, function(id){unlist(c(wt_condition[wt_condition$target_id == id,],wt_condition1[wt_condition1$target_id == id,],wt_time[wt_time$target_id == id,],kru_data(id)))}, FUN.VALUE=rep("1", 38))
> write.csv(t(dl), "kruskal_wallis_q_lt_0.05_u_test_fdr_0.05.csv")
I'll load this into Excel, remove the extraneous Wald test result columns that just confuse most people reading the analysis, and convert the beta values to log2 fold-changes* with an Excel formula, for didactic purposes.**
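For reference, that conversion is just a change of log base: Sleuth's b (beta) is on the natural-log scale (see the first footnote), so dividing by ln(2) gives a log2-scale value. A minimal sketch in R is below (the log2_b column name is just my placeholder); the Excel equivalent would be something like =B2/LN(2), assuming the beta sits in column B.

# Convert Sleuth's natural-log-scale beta to a log2 fold-change-like value.
wt_condition$log2_b <- wt_condition$b / log(2)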
If you look at the exp_five column, you'll see that it is often indistinguishable from (shares a letter with) one or both of the controls. This explains why the two-level exercise U-test failed, and why using the Kruskal-Wallis test was key to seeing the interaction effect. As usual, take any individual transcript with a grain of salt, and look for pathway enrichment in IPA, DAVID, etc.
________
* Not really a fold-change, but the natural-log-scale beta, a biased estimator of the fold change. Read the Sleuth manual for details; it's worth it.
** A clever reader will note that you can change the log base by dividing by a constant. Once again, I used this formula for didactic purposes, making the base conversion more explicit.