March 1, 2017

This is not normal: dealing with outliers, correlated factors, and skewed factor effects in RNASeq (a Kallisto/Sleuth/DESeq2/U-Test/skewness mash-up)



tl;dr there are RNA experiments where despite reasonable sample selection, no significant changes will be found. Especially in patient-cenetered studies, sometimes you just can't get true "replicates".

In three stages we can correct for outliers, spurious correlations, and a quantitative factor like patient age that has a skewed (as opposed to Gaussian) dosage effect on transcripts. This process gets us from 2 significantly regulated transcripts to 285 (q-value < 0.05). This involves using Kallisto, Sleuth, DESeq2, and some home brewed code around R's built-in Wilcox-Mann-Whitney U Test and skewness measure.
__________________________

Let's start with an experiment designed with a reasonable balance of males and females in each of three age categories, under 40, over 60, and in between.  I noticed on the sample submission form that the input RNA concentrations were unusually divergent, so I've included those values in case they were informative (might affect PCR duplicate levels).

library(sleuth)
meta <- read.table("meta.tab", header=TRUE)
meta$path <- as.character(meta$path)

meta$age_group <- as.factor(meta$age_group)
meta
      sample               path sex age_group age conc
1  57_50Y_P2 57_50Y_P2.kallisto   M         1  50   55
2  63_75Y_P2 63_75Y_P2.kallisto   M         2  75   99
3    SATP100   SATP100.kallisto   M         0  27  203
4    SATP105   SATP105.kallisto   F         0  35  136
5    SATP116   SATP116.kallisto   M         1  46  131
6    SATP119   SATP119.kallisto   M         2  74  511
7    SATP120   SATP120.kallisto   M         1  41  466
8    SATP125   SATP125.kallisto   F         0  30  697
9    SATP137   SATP137.kallisto   M         0  19  183
10   SATP140   SATP140.kallisto   F         0  39  264
11   SATP154   SATP154.kallisto   F         2  61  256
12    SATP68    SATP68.kallisto   M         0  37  303
13    SATP75    SATP75.kallisto   M         2  65  112
14    SATP82    SATP82.kallisto   F         2  78   21
15    SATP91    SATP91.kallisto   F         1  57  736
16    SATP92    SATP92.kallisto   F         1  53  136
17    SATP95    SATP95.kallisto   F         1  52   53
18    SATP98    SATP98.kallisto   F         2  73   83
> so <- sleuth_prep(meta, ~sex*age_group+age+sex:age+conc)
...
> so <- sleuth_fit(so)
...
> sum(so$fits$full$summary[,2])
[1] 347228.1

The last line gives us the residual sum of squares of the model fit.  The lower the better. Let's continue on as per half way down a previous blog post, calculating the Likelihood Ratio Test value for each factor in each transcript.


so <- sleuth_fit(so, ~sex+conc, "no_age")
so <- sleuth_fit(so, ~age_group+age+conc, "no_sex")
so <- sleuth_fit(so, ~sex+age_group+age+conc, "no_int")
so <- sleuth_fit(so, ~sex*age_group+sex:age+age+conc, "no_conc")
lrt_sex <- sleuth_results(so, 'no_sex:full', test_type = 'lrt')
so <- sleuth_lrt(so, 'no_sex', 'full')

so <- sleuth_lrt(so, 'no_age', 'full')
so <- sleuth_lrt(so, 'no_int', 'full')
so <- sleuth_lrt(so, 'no_conc', 'full')
lrt_sex <- sleuth_results(so, 'no_sex:full', test_type = 'lrt')
lrt_age <- sleuth_results(so, 'no_age:full', test_type = 'lrt')
lrt_int <- sleuth_results(so, 'no_int:full', test_type = 'lrt')
lrt_conc <- sleuth_results(so, 'no_conc:full', test_type = 'lrt')
lrt_age.sig_ids <- lrt_age$target_id[which(lrt_age$qval < 0.05)]
lrt_sex.sig_ids <- lrt_sex$target_id[which(lrt_sex$qval < 0.05)]
lrt_int.sig_ids <- lrt_int$target_id[which(lrt_int$qval < 0.05)]
lrt_conc.sig_ids <- lrt_conc$target_id[which(lrt_conc$qval < 0.05)]

If we inspect the *.sig_ids variables, we see only two genes or sometimes none.  And it's the same two genes popping up in the various lists.  Not good.  There are three main potential culprits: outlier samples, spurious factor correlation, or non-normal distributions (the LRT test assumes normally distributed values in each factor level).  The first and second can happen in any experiment, and the third can especially happen with a factor whose effect is as non-linear as the age of the study participants from which the cells are derived, and our selection of 40 and 60 to create the age_group levels is somewhat arbitrary.  

Let's start by identifying outliers.

Stage 1: Identifying outlier samples

First, we need to collate the normalized TPM transcript abundance data as per my previous blog post. You could probably do it with the so info you've already got from the above analysis, but it's trickier I find. Let's load all the transcripts with expression values using DESeq2.

abundance_full <- read.table("all_abundance.tsv", header=TRUE, row.names=1)
abundance_full_integer <- apply(abundance_full, c(1,2), function(x){as.integer(x)})
library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData=abundance_full_integer, colData=data.frame(meta), design=~sex*age_group+age+sex:age+conc)
dds_redux <- dds[rowSums(counts(dds)) > 1,]

Normalize the count data so that the variance is similar across the range of expression values (necessary for a good Poisson distance measure). This is a nice, easy to use feature of DESeq2. Then visualize the distance matrix.

rld <- rlog(dds_redux, blind=FALSE)
library("pheatmap")
library("RColorBrewer")
library("PoiClaClu")
poisd <- PoissonDistance(t(abundance_full))
samplePoisDistMatrix <- as.matrix( poisd$dd )
rownames(samplePoisDistMatrix) <- paste(meta$sex, meta$age, sep=":")
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(samplePoisDistMatrix, clustering_distance_rows=poisd$dd, 
clustering_distance_cols=poisd$dd, col=colors)


Let's also check that in closely related samples (such as 8 & 12 in the lower right corner) show fairly even variance across the range of expression values, otherwise our nice matrix isn't very meaningful.

plot(assay(rld)[,c(8,12)], xlim=c(0,16), ylim=c(0,16), pch=16, cex=0.3)




Looks good, i.e. it's fairly evenly thick above zero, and there's no feathering in the lower left. Using just a log 2 transform on the Kallisto TPM data yields a different tree, and the scatter plot above is a lot fatter at the bottom, so it was worth using DESeq2's normalization.

Back to the outliers. In the matrix, you'll see that there are three samples, with indices 4, 5, and 14 along the bottom that are strong outliers.  Not only are they strong outliers, but they are closely related to each other despite not sharing either a sex or age factor level.  This gives us a strong suspicion that there's something wrong here like cross-contamination, prep batch effect, etc. but it's not obvious what.  Let's exclude these samples and see how much the model of known factors improves.

> meta_no_outliers <- meta[c(1:3,6:13,15:18),]
> so_no_outliers <- sleuth_prep(meta_no_outliers, ~sex*age_group+age+sex:age+conc)
...
> so_no_outliers <- sleuth_fit(so_no_outliers)
...
> sum(so_no_outliers$fits$full$summary[,2])
[1] 193932.1

The 194K residual sum of squares (RSS) is a lot better than the 347K we had in the original analysis with all 18 samples! If we work through the original analysis with just the 15, we get...

> length(lrt_age.sig_ids)

[1] 8
> length(lrt_sex.sig_ids)
[1] 8
> length(lrt_int.sig_ids)
[1] 4
> length(lrt_conc.sig_ids)
[1] 71

D'oh! Looks like the concentration of input RNA is explaining most of the variance. Collectively this is 84 genes, which we will write out to a file, even if the concentration effect is suspect.

Stage 2: Spurious factor correlations

Getting back to our potential sources of failure in the analysis, let's check for spurious correlation between the concentration and the age factor (the main factor of interest). Plot the data points and a trend line.

> plot(data.frame(meta$age,meta$conc))
> abline(lm(meta$age ~ meta$conc))

Ouch, the concentration as a dosage factor is probably absorbing age dosage effect. The same plot for the outlier-less metadata set eliminates the lower right dot, leaving us with two really low concentrations that are centered in the age distribution (50 & 52). That's lucky for us, because it means that we can change the concentration numeric factor into a binary one to capture just those two points without disturbing the balance of the age factor (e.g. if we'd set the low threshold to 100 it'd have all sample that are older and could still confound the age factor modelling). It's the best we can do with the card we've been dealt.

> plot(data.frame(meta$age,meta$conc))
> meta$conc_low <- meta$conc <= 55
> meta$conc_low
 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE  TRUE FALSE FALSE  TRUE FALSE

In case you're wondering how just excluding the outlier samples and the concentration from the model works out, you get just one significant change, only in age_group1 (between 40yo and 60yo), and it's a transcript at that showed up in the previous analyses repeatedly. You'll see shortly how the binary concentration modelling works out a lot better.

Stage 3: Rationally defining a factor level threshold for a skewed factor effect (numeric factor -> two-level factor)

Let's keep only the samples with a TPM of at least 1 on average.

> abundance <- abundance_full[rowSums(abundance_full)>length(meta_clean$age),]

Let's define all the combinations of young/old possible by setting the young/old threshold to each observed value in turn.

> oldness_treshold_combos <- t(sapply(meta_clean$age, function(x){meta_clean$age > x}))

Let's define a function to run the Wilcox-Mann-Whitney U Test, as I have in a previous post. As an aside, I also tried using the Kruskal-Wallis H test which extends the U Test notion to three or more factor levels (using the 0/1/2 age_group factor from earlier), but it's underpowered with this number of samples. As you'll see below, we'll have enough trouble with Wilcox already.  We'll stick the distribution of old/young across samples in a global variable called "m" later.

> wil <- function(x){d <- data.frame(tpm=t(abundance[x,])[,1], old=m); w <- wilcox.test(tpm ~ old, data=d); w$p.value}

Now define the function that assigns a young/old combination to "m", then runs the Wilcox test and finally applies the FDR (Benjamini-Hochberg) p-value multiple testing correction to give us "q-values". The initial check for the sum combo below is a special condition for the oldest sample, which gives no age contrast if used as a threshold and buggers up the whole process. 

perm_qvals <- function(combo){if(!sum(combo)){return(rep_len(c(NA), length(abundance[,1])))}; m <<- as.factor(combo); pvals <<- c(); for(i in 1:length(abundance[,1])){pvals <<- append(pvals,wil(i))}; return(p.adjust(pvals, method="fdr"))}

Run the qval test on each combination we calculated earlier.

> combo_results <- apply(oldness_treshold_combos, 1, perm_qvals)
There were 50 or more warnings (use warnings() to see the first 50) [you can ignore these, it's complaining about ties in the rankings for the Wilcox test for certain transcripts, in which case it can't calculate an exact p-value)]

> apply(combo_results, 2, function(x){sum(as.integer(x<.05))})
 [1] 285  NA   0   0   0   0   0   0   0   0   0   0   0   0   0

> res <- combo_results[,1]
> names(res) <- rownames(abundance)

It may seem strange that only the first threshold (which is old = age > 50 if you look back at the order of samples in meta.tab) has significant changes, but you have to keep in mind, as we determined in a previous post, that you need at least eight samples in each factor level to make the U Test p-value survive FDR multiple testing correction in Kallisto-based RNASeq where you'll have 18,000+ tests (i.e. expressed mRNA isoforms in a tissue). Only an 8/7 split comes close to satisfying that when we have 15 samples, like we do here.

Let's try the analysis again, but see what happens if we don't exclude the outliers, so we have 18 samples, which will give us multiple 8+ split options (8/10, 9/9,10/8):

> abundance <- abundance_full[rowSums(abundance_full)>length(meta$age),]
> rownames(abundance) <- rownames(abundance_full[rowSums(abundance_full)>length(meta$age),])
> oldness_treshold_combos <- t(sapply(meta$age, function(x){meta$age > x}))
> combo_results_full <- apply(oldness_treshold_combos, 1, perm_qvals)
apply(combo_results_full, 2, function(x){sum(as.integer(x<.05))})
 [1]  0  0  0  0  0  0  0  0  0  0  0  0  0 NA  0  0  0  0

It would seem in this case that the benefit of added samples for the U Test is outweighed by the outlying nature of the expression values in those 3 extra samples, as we get no significant changes. The three samples are not just overall outliers, but also specifically in their response to the experimental factor of interest (age). That is to say, they disrupt the rank of genes grouped as young/old, not just the replicate normality assumption of the Wald and LRT tests on the TPM values. Had we gotten good results in the combo_results_full vector, we might have included the outliers and proceeded with caution in comparing the betas from Sleuth with either 15 or 18 samples.

Let's note the significant IDs from the 15 sample test for later when we want to print out results to a file. I'm doing this in two steps, for clarity.


> sig_indices <- res < .05
> ids <- sort(rownames(abundance)[sig_indices])

Let's run with the >50 cutoff in the 15 samples and redo the Sleuth analysis with both the binary age factor and the quantitative age factor.  The beta values are more useful than the raw fold-changes I could calculate directly from the abundance matrix variable because 1) the effect of other factors are subtracted from the data, and 2) especially for low abundance transcripts beta more accurately reflects the effect of the age factor since it properly accounts for measurement error though the bootstrap analysis Sleuth does.


> meta_clean$old <- meta_clean$age > 50
> meta_clean
      sample               path sex age_group age conc conc_low   old
1  57_50Y_P2 57_50Y_P2.kallisto   M         1  50   55     TRUE FALSE
2  63_75Y_P2 63_75Y_P2.kallisto   M         2  75   99    FALSE  TRUE
3    SATP100   SATP100.kallisto   M         0  27  203    FALSE FALSE
6    SATP119   SATP119.kallisto   M         2  74  511    FALSE  TRUE
7    SATP120   SATP120.kallisto   M         1  41  466    FALSE FALSE
8    SATP125   SATP125.kallisto   F         0  30  697    FALSE FALSE
9    SATP137   SATP137.kallisto   M         0  19  183    FALSE FALSE
10   SATP140   SATP140.kallisto   F         0  39  264    FALSE FALSE
11   SATP154   SATP154.kallisto   F         2  61  256    FALSE  TRUE
12    SATP68    SATP68.kallisto   M         0  37  303    FALSE FALSE
13    SATP75    SATP75.kallisto   M         2  65  112    FALSE  TRUE
15    SATP91    SATP91.kallisto   F         1  57  736    FALSE  TRUE
16    SATP92    SATP92.kallisto   F         1  53  136    FALSE  TRUE
17    SATP95    SATP95.kallisto   F         1  52   53     TRUE  TRUE
18    SATP98    SATP98.kallisto   F         2  73   83    FALSE  TRUE

Luckily for us, the age threshold of >50 puts one low concentration sample in the young category and one in the old, so both factors can be included in the final model.

> so_final <- sleuth_prep(meta_clean, ~sex*old+age:old+conc_low)
> so_final <- sleuth_fit(so_final)
> sum(so_final$fits$full$summary[,2])
[1] 266102.6

By the way, did our spurious correlation removal from Stage 2 help?  Let's see what the RSS is without it in the model.

> so_final <- sleuth_fit(so_final, ~sex*old+age:old, "no_conc")
> sum(so_final$fits$no_conc$summary[,2])
[1] 300209.5

Nice, it reduces the noise by over 10%.  It's not as good an RSS as we had with the quantitative concentration factor, but it's much less biased. Let's go on with the inclusive model and calculate the betas/fold-changes with a Wald test.

> so_final <- sleuth_wt(so_final, "sexM")
> so_final <- sleuth_wt(so_final, "oldTRUE")
> so_final <- sleuth_wt(so_final, "sexM:oldTRUE")
> so_final <- sleuth_wt(so_final, "oldTRUE:age")
> so_final <- sleuth_wt(so_final, "oldFALSE:age")
> so_final <- sleuth_wt(so_final, "conc_lowTRUE")
> wt_sex <- sleuth_results(so_final, 'sexM')
> wt_old <- sleuth_results(so_final, 'oldTRUE')
> wt_int <- sleuth_results(so_final, 'sexM:oldTRUE')
> wt_old_age <- sleuth_results(so_final, 'oldTRUE:age')
> wt_young_age <- sleuth_results(so_final, 'oldFALSE:age')
> wt_conc_low <- sleuth_results(so_final, 'conc_lowTRUE')
> table(wt_sex[,"qval"] <.05)

FALSE  TRUE 
28142     2 
> table(wt_old[,"qval"] <.05)

FALSE  TRUE 
28142     2 
> table(wt_int[,"qval"] <.05)

FALSE  TRUE 
28142     2 
> table(wt_conc[,"qval"] <.05)

FALSE  TRUE 
28133    11 
> table(wt_old_age[,"qval"] <.05)

FALSE  TRUE 
28142     2 
> table(wt_young_age[,"qval"] <.05)

FALSE  TRUE 
28143     1 

So we have very few significant q values, but that's to be expected, because the Wald test assumes you have a normal distribution and an accurate assessment of variance.  We have neither, but we want the beta values that the test generates.  These will still be our most accurate assessment of factor driven fold-change in transcript abundance.

wt_shared_sex <- wt_sex[wt_sex$target_id %in% ids,]
wt_shared_old <- wt_sex[wt_old$target_id %in% ids,]
wt_shared_int <- wt_int[wt_int$target_id %in% ids,]
wt_shared_old_age <- wt_old_age[wt_old_age$target_id %in% ids,]
wt_shared_young_age <- wt_old_age[wt_young_age$target_id %in% ids,]
wt_shared_young_age <- wt_old_age[wt_conc$target_id %in% ids,]
wt_shared_conc_low <- wt_conc_low[wt_conc_low$target_id %in% ids,]
wt_shared_sex <- wt_shared_sex[sort.list(wt_shared_sex[,1]),]
wt_shared_old <- wt_shared_old[sort.list(wt_shared_old[,1]),]
wt_shared_int <- wt_shared_int[sort.list(wt_shared_int[,1]),]
wt_shared_young_age <- wt_shared_young_age[sort.list(wt_shared_young_age[,1]),]
wt_shared_old_age <- wt_shared_old_age[sort.list(wt_shared_old_age[,1]),]
wt_shared_conc_low <- wt_shared_conc_low[sort.list(wt_shared_conc_low[,1]),]

sig_abundance <- sig_abundance[sort.list(rownames(sig_abundance)),]

If the sample values going into the fold-changes for the age factor level (old/young) are not normally distributed, is there something useful we can say about their distribution? One summary statistic we can grasp fairly easily is skewness, which when negative indicates that the mean is less than the median, and conversely that when positive the mean is greater than the median. From the Wikipedia entry:


In our data for example, a negative skew would mean that the effect of the fold-change is more intense in the older people within the old group. A positive skew would mean that the FC values for the old-designated samples pile up closer to the old cutoff (>50yo) so increased age isn't a big a factor. Let's see if the skew of the abundance of the 285 significant transcripts are quantitatively different from the general transcript abundance.

> library(moments)
> sig_abundance <- abundance[sig_indices,]
> sk <- apply(sig_abundance, 1, function(x){skewness(as.vector(x[meta_clean$old]))})
> summary(sk)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-1.72200 -0.29110  0.09093  0.07834  0.44870  1.50800 
> sk0 <- apply(abundance, 1, function(x){skewness(as.vector(x[meta_clean$old]))})
> summary(sk0)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-2.0620 -0.1577  0.2699  0.3049  0.7294  2.2680      62 

Looks like the 285 transcripts have a long tail more to the left (more negative) than the whole set of transcripts. Let's check more thoroughly.  First, are both distributions (of skewness) normal? Run the Shapiro-Wilk test to find out.

> shapiro.test(sk)
...
W = 0.9934, p-value = 0.2449

> shapiro.test(sample(sk0, length(ids)))
...
W = 0.99399, p-value = 0.3266

Yes.  Do they have the same variance? Run an F-test, or you could try Levene's test or Bartlett's test, etc.. they all fail for this dataset.

> var.test(sk, sk0)

F test to compare two variances

data:  sk and sk0
F = 0.699, num df = 284, denom df = 19418, p-value = 7.016e-05
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.5961302 0.8307902
sample estimates:
ratio of variances 
         0.6990001 

Which makes running a t-test moot since it assumes equality of variance in each sample group. What we can say easily is that the skewness is further left and tighter (about 1/3 as much) in the significantly changes transcripts than in a random sample of 285 transcripts. That means many transcripts are affected in the really old a lot more than the moderately old, and luckily we don't have to give a specific threshold (like 60 in the original analysis) to catch that phenomenon, nor did the effect have to be normally distributed in our samples.

> sum(sk)
[1] 22.32632
> sum(sample(sk0,length(sk)), na.rm=TRUE)
[1] 78.9279

> mean(sk)
[1] 0.07833795
> mean(sk0, na.rm=TRUE)
[1] 0.3049191

For all the 285 U-test passing transcripts let's write out all the factor contrasts we've calculated so far, and the old samples skew stat adjusted for sk0 (to help account for the effect of the specific distribution of sample ages we have in the analysis).

sk_adjusted <- sk - (mean(sk0, na.rm=TRUE)-mean(sk))

> d <- data.frame(ids,avg_tpm=rowMeans(sig_abundance[ids,]),u_test_qval=res[ids],skew=sk_adjusted,wt_shared_sex[,"b"],wt_shared_sex[,"qval"],wt_shared_old[,"b"],wt_shared_old[,"qval"],wt_shared_int[,"qval"],wt_shared_int[,"b"],wt_shared_young_age[,"qval"],wt_shared_young_age[,"b"],wt_shared_old_age[,"qval"],wt_shared_old_age[,"b"])


> write.csv(d, "u_test_no_outliers_q_lt_0.05.csv")

Pondering the merits and validity of including a kurtosis measure is left as an exercise for the reader.

So, that's one way to analyze RNASeq when the expected expression values aren't necessarily normally distributed and don't have a reasonable estimate of variance. Now you have two gene lists to play with: those that appear normally distributed amongst the age predefined groups (84, mostly attributable to input RNA concentration conflated with age), and those that show a less/greater then 50yo dichtomy (285). Some transcripts on the former list are also in the latter list. While the beta (~fold change) values in the latter aren't super meaningful, taken together with the skewness value for the transcript you can focus on transcript changes that are skewed to the really old, or more middle aged, etc.

P.S.: significant changes were corroborated with consistency scores in pathway analysis (IPA) from both the 84 and 285 item lists, albeit different pathways, therefore it's probably worth following up on both.

3 comments:

  1. Paul, Thank you very much for the posting. One question about the outlier. I have a data set that disagrees between pca(sleuth generated)result and the heatmap(following your post). What should I check first to investigate this matter?
    Heatmap say couple samples are outliers but on the pca plot these samples and others in the group segregate nicely. bit puzzled. Thank you.

    ReplyDelete
  2. There are a couple of possibilities. First, check if the PCA plot based on TPM (rather than the default, est_counts) recapitulates the distance matrix in terms of outliers. If so, you have some pseduomapping biases related to longer genes. If not, look at how much of the total variance the first and second principal components explain. The matrix represents overall distance, but if there is a really strong variance between samples for a specific set of genes (usually and hopefully corresponding to one of your experimental factors), this may mask any general outlying nature of samples in a PCA plot. In some sense this is good, because it means that the signal generated by the factor is strong enough to be resistant to noise.

    ReplyDelete
    Replies
    1. could you explain what "pseudomapping bias related to longer gene" is?

      Delete