tl;dr there are RNA-seq experiments where, despite reasonable sample selection, no significant changes will be found. Especially in patient-centered studies, sometimes you just can't get true "replicates".
In three stages we can correct for outliers, spurious correlations, and a quantitative factor like patient age that has a skewed (as opposed to Gaussian) dosage effect on transcripts. This process takes us from 2 significantly regulated transcripts to 285 (q-value < 0.05), using Kallisto, Sleuth, DESeq2, and some home-brewed code around R's built-in Wilcoxon-Mann-Whitney U test and skewness measure.
__________________________
Let's start with an experiment designed with a reasonable balance of males and females in each of three age categories: under 40, over 60, and in between. I noticed on the sample submission form that the input RNA concentrations were unusually divergent, so I've included those values in case they're informative (they might affect PCR duplicate levels).
library(sleuth)
meta <- read.table("meta.tab", header=TRUE)
meta$path <- as.character(meta$path)
meta$age_group <- as.factor(meta$age_group)
meta
sample path sex age_group age conc
1 57_50Y_P2 57_50Y_P2.kallisto M 1 50 55
2 63_75Y_P2 63_75Y_P2.kallisto M 2 75 99
3 SATP100 SATP100.kallisto M 0 27 203
4 SATP105 SATP105.kallisto F 0 35 136
5 SATP116 SATP116.kallisto M 1 46 131
6 SATP119 SATP119.kallisto M 2 74 511
7 SATP120 SATP120.kallisto M 1 41 466
8 SATP125 SATP125.kallisto F 0 30 697
9 SATP137 SATP137.kallisto M 0 19 183
10 SATP140 SATP140.kallisto F 0 39 264
11 SATP154 SATP154.kallisto F 2 61 256
12 SATP68 SATP68.kallisto M 0 37 303
13 SATP75 SATP75.kallisto M 2 65 112
14 SATP82 SATP82.kallisto F 2 78 21
15 SATP91 SATP91.kallisto F 1 57 736
16 SATP92 SATP92.kallisto F 1 53 136
17 SATP95 SATP95.kallisto F 1 52 53
18 SATP98 SATP98.kallisto F 2 73 83
> so <- sleuth_prep(meta, ~sex*age_group+age+sex:age+conc)
...
> so <- sleuth_fit(so)
...
> sum(so$fits$full$summary[,2])
[1] 347228.1
The last line gives us the residual sum of squares (RSS) of the model fit; the lower, the better. Let's continue on, as per halfway down a previous blog post, calculating the Likelihood Ratio Test value for each factor in each transcript.
so <- sleuth_fit(so, ~sex+conc, "no_age")
so <- sleuth_fit(so, ~age_group+age+conc, "no_sex")
so <- sleuth_fit(so, ~sex+age_group+age+conc, "no_int")
so <- sleuth_fit(so, ~sex*age_group+sex:age+age, "no_conc")
so <- sleuth_lrt(so, 'no_sex', 'full')
so <- sleuth_lrt(so, 'no_age', 'full')
so <- sleuth_lrt(so, 'no_int', 'full')
so <- sleuth_lrt(so, 'no_conc', 'full')
lrt_sex <- sleuth_results(so, 'no_sex:full', test_type = 'lrt')
lrt_age <- sleuth_results(so, 'no_age:full', test_type = 'lrt')
lrt_int <- sleuth_results(so, 'no_int:full', test_type = 'lrt')
lrt_conc <- sleuth_results(so, 'no_conc:full', test_type = 'lrt')
lrt_age.sig_ids <- lrt_age$target_id[which(lrt_age$qval < 0.05)]
lrt_sex.sig_ids <- lrt_sex$target_id[which(lrt_sex$qval < 0.05)]
lrt_int.sig_ids <- lrt_int$target_id[which(lrt_int$qval < 0.05)]
lrt_conc.sig_ids <- lrt_conc$target_id[which(lrt_conc$qval < 0.05)]
Let's start by identifying outliers.
Stage 1: Identifying outlier samples
First, we need to collate the normalized TPM transcript abundance data, as per my previous blog post. You could probably do it with the `so` object you've already got from the above analysis, but I find that trickier. Let's load all the transcripts with expression values using DESeq2.
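(For reference, a hedged sketch of the in-object route; sleuth_to_matrix's return value has varied between sleuth releases, which is partly why I collate from the Kallisto output files instead.)
# Pull the normalized TPM matrix straight from the sleuth object.
# Depending on your sleuth version this is a matrix, or a list wrapping one.
tpm_matrix <- sleuth_to_matrix(so, "obs_norm", "tpm")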
Normalize the count data so that the variance is similar across the range of expression values (necessary for a good Poisson distance measure); this is a nice, easy-to-use feature of DESeq2. Then visualize the distance matrix.
abundance_full <- read.table("all_abundance.tsv", header=TRUE, row.names=1)
abundance_full_integer <- apply(abundance_full, c(1,2), function(x){as.integer(x)})
library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData=abundance_full_integer, colData=data.frame(meta), design=~sex*age_group+age+sex:age+conc)
dds_redux <- dds[rowSums(counts(dds)) > 1,]
rld <- rlog(dds_redux, blind=FALSE)
library("pheatmap")
library("RColorBrewer")
library("PoiClaClu")
poisd <- PoissonDistance(t(abundance_full))
samplePoisDistMatrix <- as.matrix( poisd$dd )
rownames(samplePoisDistMatrix) <- paste(meta$sex, meta$age, sep=":")
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(samplePoisDistMatrix, clustering_distance_rows=poisd$dd,
clustering_distance_cols=poisd$dd, col=colors)
Let's also check that closely related samples (such as 8 & 12 in the lower right corner) show fairly even variance across the range of expression values; otherwise our nice matrix isn't very meaningful.
plot(assay(rld)[,c(8,12)], xlim=c(0,16), ylim=c(0,16), pch=16, cex=0.3)
Looks good, i.e. the cloud is fairly evenly thick above zero, and there's no feathering in the lower left. Using just a log2 transform on the Kallisto TPM data yields a different tree, and the scatter plot above is a lot fatter at the bottom, so it was worth using DESeq2's normalization.
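If you want to reproduce that comparison, here's a minimal sketch (my own, reusing the pheatmap setup from above) that swaps in a naive log2 transform for DESeq2's normalization:
# Naive transform: no library size normalization, pseudocount of 1
log2_abundance <- log2(abundance_full + 1)
log2_dist <- dist(t(log2_abundance))   # Euclidean sample-to-sample distances
pheatmap(as.matrix(log2_dist), clustering_distance_rows=log2_dist,
         clustering_distance_cols=log2_dist, col=colors)
# And the same two-sample scatter check as above
plot(log2_abundance[,c(8,12)], xlim=c(0,16), ylim=c(0,16), pch=16, cex=0.3)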
Back to the outliers. In the matrix, you'll see that there are three samples, with indices 4, 5, and 14 along the bottom, that are strong outliers. Not only are they strong outliers, but they are closely related to each other despite not sharing either a sex or an age factor level. This gives us a strong suspicion that something is wrong here (cross-contamination, a prep batch effect, etc.), but it's not obvious what. Let's exclude these samples and see how much the model of known factors improves.
> meta_no_outliers <- meta[c(1:3,6:13,15:18),]
> so_no_outliers <- sleuth_prep(meta_no_outliers, ~sex*age_group+age+sex:age+conc)
...
> so_no_outliers <- sleuth_fit(so_no_outliers)
...
> sum(so_no_outliers$fits$full$summary[,2])
[1] 193932.1
Much better. Now let's count the significant transcripts per factor from the LRT results (rerun on this outlier-free fit):
> length(lrt_age.sig_ids)
[1] 8
> length(lrt_sex.sig_ids)
[1] 8
> length(lrt_int.sig_ids)
[1] 4
> length(lrt_conc.sig_ids)
[1] 71
D'oh! Looks like the concentration of input RNA is explaining most of the variance. Collectively this is 84 genes, which we will write out to a file, even if the concentration effect is suspect.
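A sketch of that write-out (my own; it assumes the lrt_* sig_id vectors from the rerun LRT steps above):
# Union of the per-factor significant transcript IDs
all_lrt_sig_ids <- unique(c(lrt_age.sig_ids, lrt_sex.sig_ids,
                            lrt_int.sig_ids, lrt_conc.sig_ids))
length(all_lrt_sig_ids)   # 84, once the overlaps between the factor lists collapse
write.csv(data.frame(target_id=all_lrt_sig_ids), "lrt_sig_ids_q_lt_0.05.csv", row.names=FALSE)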
Stage 2: Spurious factor correlations
Getting back to our potential sources of failure in the analysis, let's check for a spurious correlation between the concentration and the age factor (the main factor of interest). Plot the data points and a trend line.
> plot(data.frame(meta$age,meta$conc))
> abline(lm(meta$age ~ meta$conc))
Any apparent trend is driven by a handful of samples with very low input concentration, so rather than model concentration as a continuous covariate, let's convert it to a binary low/normal factor; a cutoff of 55 captures the three lowest values.
> meta$conc_low <- meta$conc <= 55
> meta$conc_low
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE TRUE FALSE FALSE TRUE FALSE
In case you're wondering how just excluding the outlier samples and the concentration from the model works out: you get just one significant change, only in age_group1 (between 40yo and 60yo), and it's a transcript that showed up repeatedly in the previous analyses. You'll see shortly how the binary concentration modelling works out a lot better.
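For completeness, a sketch of that check (the pooled age contrast here is my own shorthand, not the per-factor LRTs used above):
# Refit on the outlier-free samples with no concentration term
so_nc <- sleuth_prep(meta_no_outliers, ~sex*age_group+age+sex:age)
so_nc <- sleuth_fit(so_nc)
so_nc <- sleuth_fit(so_nc, ~sex, "no_age")   # reduced model drops all age-related terms
so_nc <- sleuth_lrt(so_nc, "no_age", "full")
lrt_age_nc <- sleuth_results(so_nc, "no_age:full", test_type="lrt")
sum(lrt_age_nc$qval < 0.05, na.rm=TRUE)      # count age-driven significant transcripts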
Stage 3: Rationally defining a factor level threshold for a skewed factor effect (numeric factor -> two-level factor)
Let's keep only the transcripts with an average TPM of at least 1 (i.e., a row sum greater than the number of samples), using just the 15 outlier-free samples. Here `meta_clean` is the outlier-free metadata, which now carries the `conc_low` column from Stage 2.
> meta_clean <- meta[c(1:3,6:13,15:18),]
> abundance <- abundance_full[,c(1:3,6:13,15:18)]
> abundance <- abundance[rowSums(abundance)>length(meta_clean$age),]
Let's define all the combinations of young/old possible by setting the young/old threshold to each observed value in turn.
> oldness_treshold_combos <- t(sapply(meta_clean$age, function(x){meta_clean$age > x}))
Let's define a function to run the Wilcoxon-Mann-Whitney U test, as I have in a previous post. As an aside, I also tried the Kruskal-Wallis H test, which extends the U test notion to three or more factor levels (using the 0/1/2 age_group factor from earlier), but it's underpowered with this number of samples. As you'll see below, we'll have enough trouble with Wilcoxon already. We'll stick the young/old assignment of the samples in a global variable called "m" later.
wil <- function(x){
  # Pair transcript x's TPMs with the current young/old split (global "m")
  d <- data.frame(tpm=t(abundance[x,])[,1], old=m)
  w <- wilcox.test(tpm ~ old, data=d)
  w$p.value
}
Now define the function that assigns a young/old combination to "m", runs the Wilcoxon test for every transcript, and finally applies the FDR (Benjamini-Hochberg) multiple-testing correction to give us "q-values". The initial check on the combo sum below is a special condition for the oldest sample, which gives no age contrast if used as a threshold and buggers up the whole process.
perm_qvals <- function(combo){
  # The oldest sample as threshold makes everyone "young": no contrast to test
  if(!sum(combo)){return(rep_len(c(NA), length(abundance[,1])))}
  m <<- as.factor(combo)
  pvals <<- c()
  for(i in 1:length(abundance[,1])){pvals <<- append(pvals, wil(i))}
  return(p.adjust(pvals, method="fdr"))
}
Run the qval test on each combination we calculated earlier.
> combo_results <- apply(oldness_treshold_combos, 1, perm_qvals)
There were 50 or more warnings (use warnings() to see the first 50) [you can ignore these; it's complaining about ties in the rankings for the Wilcoxon test for certain transcripts, in which case it can't compute an exact p-value]
> apply(combo_results, 2, function(x){sum(as.integer(x<.05))})
[1] 285 NA 0 0 0 0 0 0 0 0 0 0 0 0 0
> res <- combo_results[,1]   # combo 1 is the age > 50 threshold (sample 1 is 50 years old)
> names(res) <- rownames(abundance)
Let's try the analysis again, but see what happens if we don't exclude the outliers, so that we have 18 samples, which gives us multiple 8+ split options (8/10, 9/9, 10/8):
> abundance <- abundance_full[rowSums(abundance_full)>length(meta$age),]
> rownames(abundance) <- rownames(abundance_full[rowSums(abundance_full)>length(meta$age),])
> oldness_treshold_combos <- t(sapply(meta$age, function(x){meta$age > x}))
> combo_results_full <- apply(oldness_treshold_combos, 1, perm_qvals)
> apply(combo_results_full, 2, function(x){sum(as.integer(x<.05))})
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 NA 0 0 0 0
It would seem that in this case the benefit of added samples for the U test is outweighed by the outlying nature of the expression values in those 3 extra samples, as we get no significant changes. The three samples are not just overall outliers, but outliers specifically in their response to the experimental factor of interest (age). That is to say, they disrupt the ranking of genes grouped as young/old, not just the replicate-normality assumption of the Wald and LRT tests on the TPM values. Had we gotten good results in the combo_results_full vector, we might have included the outliers and proceeded with caution in comparing the betas from Sleuth with either 15 or 18 samples.
Let's note the significant IDs from the 15-sample test for later, when we want to print out results to a file, and restore the 15-sample abundance matrix that the 18-sample digression above overwrote. I'm doing this in a few steps, for clarity.
> abundance <- abundance_full[,c(1:3,6:13,15:18)]
> abundance <- abundance[rowSums(abundance)>length(meta_clean$age),]
> sig_indices <- res < .05
> ids <- sort(names(res)[which(sig_indices)])
> meta_clean$old <- meta_clean$age > 50
> meta_clean
sample path sex age_group age conc conc_low old
1 57_50Y_P2 57_50Y_P2.kallisto M 1 50 55 TRUE FALSE
2 63_75Y_P2 63_75Y_P2.kallisto M 2 75 99 FALSE TRUE
3 SATP100 SATP100.kallisto M 0 27 203 FALSE FALSE
6 SATP119 SATP119.kallisto M 2 74 511 FALSE TRUE
7 SATP120 SATP120.kallisto M 1 41 466 FALSE FALSE
8 SATP125 SATP125.kallisto F 0 30 697 FALSE FALSE
9 SATP137 SATP137.kallisto M 0 19 183 FALSE FALSE
10 SATP140 SATP140.kallisto F 0 39 264 FALSE FALSE
11 SATP154 SATP154.kallisto F 2 61 256 FALSE TRUE
12 SATP68 SATP68.kallisto M 0 37 303 FALSE FALSE
13 SATP75 SATP75.kallisto M 2 65 112 FALSE TRUE
15 SATP91 SATP91.kallisto F 1 57 736 FALSE TRUE
16 SATP92 SATP92.kallisto F 1 53 136 FALSE TRUE
17 SATP95 SATP95.kallisto F 1 52 53 TRUE TRUE
18 SATP98 SATP98.kallisto F 2 73 83 FALSE TRUE
Luckily for us, the age threshold of >50 puts one low-concentration sample in the young category and one in the old, so both factors can be included in the final model.
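A quick cross-tabulation confirms the two binary factors aren't confounded:
> table(meta_clean$conc_low, meta_clean$old)   # one conc_low sample in each old level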
> so_final <- sleuth_prep(meta_clean, ~sex*old+age:old+conc_low)
> so_final <- sleuth_fit(so_final)
> sum(so_final$fits$full$summary[,2])
[1] 266102.6
By the way, did our spurious correlation removal from Stage 2 help? Let's see what the RSS is without it in the model.
> so_final <- sleuth_fit(so_final, ~sex*old+age:old, "no_conc")
> sum(so_final$fits$no_conc$summary[,2])
[1] 300209.5
Nice, including it reduces the noise by over 10%. It's not as good an RSS as we had with the quantitative concentration factor, but it's much less biased. Let's go on with the inclusive model and calculate the betas/fold-changes with a Wald test.
> so_final <- sleuth_wt(so_final, "sexM")
> so_final <- sleuth_wt(so_final, "oldTRUE")
> so_final <- sleuth_wt(so_final, "sexM:oldTRUE")
> so_final <- sleuth_wt(so_final, "oldTRUE:age")
> so_final <- sleuth_wt(so_final, "oldFALSE:age")
> so_final <- sleuth_wt(so_final, "conc_lowTRUE")
> wt_sex <- sleuth_results(so_final, 'sexM')
> wt_old <- sleuth_results(so_final, 'oldTRUE')
> wt_int <- sleuth_results(so_final, 'sexM:oldTRUE')
> wt_old_age <- sleuth_results(so_final, 'oldTRUE:age')
> wt_young_age <- sleuth_results(so_final, 'oldFALSE:age')
> wt_conc_low <- sleuth_results(so_final, 'conc_lowTRUE')
> table(wt_sex[,"qval"] <.05)
FALSE TRUE
28142 2
> table(wt_old[,"qval"] <.05)
FALSE TRUE
28142 2
> table(wt_int[,"qval"] <.05)
FALSE TRUE
28142 2
> table(wt_conc_low[,"qval"] <.05)
FALSE TRUE
28133 11
> table(wt_old_age[,"qval"] <.05)
FALSE TRUE
28142 2
> table(wt_young_age[,"qval"] <.05)
FALSE TRUE
28143 1
So we have very few significant q-values, but that's to be expected, because the Wald test assumes you have a normal distribution and an accurate assessment of variance. We have neither, but we want the beta values that the test generates: these are still our most accurate assessment of factor-driven fold-change in transcript abundance.
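One caveat on interpreting those betas (an assumption about the sleuth defaults in use here): sleuth fits on natural-log-transformed counts, so a beta is a natural-log fold-change, not log2. A quick conversion sketch:
# Convert sleuth's natural-log betas to the more familiar log2 scale
wt_old$b_log2 <- wt_old$b / log(2)
head(wt_old[,c("target_id","b","b_log2","qval")])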
wt_shared_sex <- wt_sex[wt_sex$target_id %in% ids,]
wt_shared_old <- wt_old[wt_old$target_id %in% ids,]
wt_shared_int <- wt_int[wt_int$target_id %in% ids,]
wt_shared_old_age <- wt_old_age[wt_old_age$target_id %in% ids,]
wt_shared_young_age <- wt_young_age[wt_young_age$target_id %in% ids,]
wt_shared_conc_low <- wt_conc_low[wt_conc_low$target_id %in% ids,]
wt_shared_sex <- wt_shared_sex[sort.list(wt_shared_sex[,1]),]
wt_shared_old <- wt_shared_old[sort.list(wt_shared_old[,1]),]
wt_shared_int <- wt_shared_int[sort.list(wt_shared_int[,1]),]
wt_shared_young_age <- wt_shared_young_age[sort.list(wt_shared_young_age[,1]),]
wt_shared_old_age <- wt_shared_old_age[sort.list(wt_shared_old_age[,1]),]
wt_shared_conc_low <- wt_shared_conc_low[sort.list(wt_shared_conc_low[,1]),]
Skewness measures the asymmetry of a distribution: a negative skew means a long tail to the left, a positive skew a long tail to the right. In our data, for example, a negative skew would mean that the fold-change effect is more intense in the older people within the old group. A positive skew would mean that the values for the old-designated samples pile up closer to the old cutoff (>50yo), so increased age isn't as big a factor. Let's see if the skews of the abundances of the 285 significant transcripts are quantitatively different from those of the general transcript population.
> library(moments)
> sig_abundance <- abundance[sig_indices,]
> sig_abundance <- sig_abundance[sort.list(rownames(sig_abundance)),]   # align with the sorted wt_shared tables
> sk <- apply(sig_abundance, 1, function(x){skewness(as.vector(x[meta_clean$old]))})
> summary(sk)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.72200 -0.29110 0.09093 0.07834 0.44870 1.50800
> sk0 <- apply(abundance, 1, function(x){skewness(as.vector(x[meta_clean$old]))})
> summary(sk0)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-2.0620 -0.1577 0.2699 0.3049 0.7294 2.2680 62
Looks like the 285 transcripts have a longer tail to the left (more negative) than the whole set of transcripts. Let's check more thoroughly. First, are both distributions (of skewness) normal? Run the Shapiro-Wilk test to find out.
> shapiro.test(sk)
...
W = 0.9934, p-value = 0.2449
> shapiro.test(sample(sk0, length(ids)))
...
W = 0.99399, p-value = 0.3266
Yes. Do they have the same variance? Run an F-test (you could also try Levene's test, Bartlett's test, etc.; they all reject equal variances for this dataset).
> var.test(sk, sk0)
F test to compare two variances
data: sk and sk0
F = 0.699, num df = 284, denom df = 19418, p-value = 7.016e-05
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.5961302 0.8307902
sample estimates:
ratio of variances
0.6990001
Which makes running a Student's t-test moot, since it assumes equality of variance in each sample group. What we can say easily is that the skewness is further left and tighter (roughly a third less variance) in the significantly changed transcripts than in the whole set of transcripts. That means many transcripts are affected in the really old a lot more than the moderately old, and luckily we don't have to pick a specific threshold (like 60 in the original analysis) to catch that phenomenon, nor did the effect have to be normally distributed in our samples.
> sum(sk)
[1] 22.32632
> sum(sample(sk0,length(sk)), na.rm=TRUE)
[1] 78.9279
> mean(sk)
[1] 0.07833795
> mean(sk0, na.rm=TRUE)
[1] 0.3049191
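If you want an actual p-value for that leftward shift (my addition, not part of the original analysis), the rank-based U test from Stage 3 is indifferent to the unequal variances:
> wilcox.test(sk, sk0, alternative="less")   # are the significant transcripts' skews shifted left?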
For all 285 U-test-passing transcripts, let's write out all the factor contrasts we've calculated so far, plus the old-samples skew statistic adjusted for sk0 (to help account for the effect of the specific distribution of sample ages we have in the analysis).
> sk_adjusted <- sk - (mean(sk0, na.rm=TRUE)-mean(sk))
> d <- data.frame(ids, avg_tpm=rowMeans(sig_abundance[ids,]), u_test_qval=res[ids], skew=sk_adjusted,
    sex_b=wt_shared_sex[,"b"], sex_qval=wt_shared_sex[,"qval"],
    old_b=wt_shared_old[,"b"], old_qval=wt_shared_old[,"qval"],
    int_b=wt_shared_int[,"b"], int_qval=wt_shared_int[,"qval"],
    young_age_b=wt_shared_young_age[,"b"], young_age_qval=wt_shared_young_age[,"qval"],
    old_age_b=wt_shared_old_age[,"b"], old_age_qval=wt_shared_old_age[,"qval"],
    conc_low_b=wt_shared_conc_low[,"b"], conc_low_qval=wt_shared_conc_low[,"qval"])
> write.csv(d, "u_test_no_outliers_q_lt_0.05.csv")
Pondering the merits and validity of including a kurtosis measure is left as an exercise for the reader.
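As a starting point for that exercise, the moments package also supplies a kurtosis() function (a sketch mirroring the skewness calls above):
> ku <- apply(sig_abundance, 1, function(x){kurtosis(as.vector(x[meta_clean$old]))})
> ku0 <- apply(abundance, 1, function(x){kurtosis(as.vector(x[meta_clean$old]))})
> summary(ku); summary(ku0)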
P.S.: significant changes were corroborated with consistency scores in pathway analysis (IPA) from both the 84- and 285-item lists, albeit in different pathways, so it's probably worth following up on both.
Paul, thank you very much for the posting. One question about the outliers: I have a dataset where the PCA (sleuth-generated) result and the heatmap (following your post) disagree. What should I check first to investigate this matter?
The heatmap says a couple of samples are outliers, but on the PCA plot these samples and the others in their group segregate nicely. A bit puzzled. Thank you.
There are a couple of possibilities. First, check whether a PCA plot based on TPM (rather than the default, est_counts) recapitulates the distance matrix in terms of outliers. If so, you have some pseudomapping biases related to longer genes. If not, look at how much of the total variance the first and second principal components explain. The matrix represents overall distance, but if there is really strong variance between samples for a specific set of genes (usually, and hopefully, corresponding to one of your experimental factors), this may mask any general outlying nature of samples in a PCA plot. In some sense this is good, because it means that the signal generated by the factor is strong enough to be resistant to noise.
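(A sketch of that first check, assuming a prepared sleuth object `so`:)
plot_pca(so, units="tpm", color_by="age_group")          # TPM-based PCA
plot_pca(so, units="est_counts", color_by="age_group")   # default units, for comparison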
Could you explain what "pseudomapping bias related to longer genes" is?