tl;dr there are RNA-seq experiments where, despite reasonable sample selection, no significant changes will be found. Especially in patient-centered studies, sometimes you just can't get true "replicates".
In three stages we can correct for outliers, spurious correlations, and a quantitative factor like patient age that has a skewed (as opposed to Gaussian) dosage effect on transcripts. This process takes us from 2 significantly regulated transcripts to 285 (q-value < 0.05), using Kallisto, Sleuth, DESeq2, and some home-brewed code around R's built-in Wilcoxon-Mann-Whitney U test and skewness measure.
__________________________
Let's start with an experiment designed with a reasonable balance of males and females in each of three age categories: under 40, over 60, and in between. I noticed on the sample submission form that the input RNA concentrations were unusually divergent, so I've included those values in case they're informative (they might affect PCR duplicate levels).
library(sleuth)
meta <- read.table("meta.tab", header=TRUE)
meta$path <- as.character(meta$path)
meta$age_group <- as.factor(meta$age_group)
meta
sample path sex age_group age conc
1 57_50Y_P2 57_50Y_P2.kallisto M 1 50 55
2 63_75Y_P2 63_75Y_P2.kallisto M 2 75 99
3 SATP100 SATP100.kallisto M 0 27 203
4 SATP105 SATP105.kallisto F 0 35 136
5 SATP116 SATP116.kallisto M 1 46 131
6 SATP119 SATP119.kallisto M 2 74 511
7 SATP120 SATP120.kallisto M 1 41 466
8 SATP125 SATP125.kallisto F 0 30 697
9 SATP137 SATP137.kallisto M 0 19 183
10 SATP140 SATP140.kallisto F 0 39 264
11 SATP154 SATP154.kallisto F 2 61 256
12 SATP68 SATP68.kallisto M 0 37 303
13 SATP75 SATP75.kallisto M 2 65 112
14 SATP82 SATP82.kallisto F 2 78 21
15 SATP91 SATP91.kallisto F 1 57 736
16 SATP92 SATP92.kallisto F 1 53 136
17 SATP95 SATP95.kallisto F 1 52 53
18 SATP98 SATP98.kallisto F 2 73 83
> so <- sleuth_prep(meta, ~sex*age_group+age+sex:age+conc)
...
> so <- sleuth_fit(so)
...
> sum(so$fits$full$summary[,2])
[1] 347228.1
The last line gives us the residual sum of squares (RSS) of the model fit; the lower, the better. Let's continue on, as per halfway down a previous blog post, calculating the Likelihood Ratio Test value for each factor in each transcript.
so <- sleuth_fit(so, ~sex+conc, "no_age")
so <- sleuth_fit(so, ~age_group+age+conc, "no_sex")
so <- sleuth_fit(so, ~sex+age_group+age+conc, "no_int")
so <- sleuth_fit(so, ~sex*age_group+sex:age+age, "no_conc")
so <- sleuth_lrt(so, 'no_sex', 'full')
so <- sleuth_lrt(so, 'no_age', 'full')
so <- sleuth_lrt(so, 'no_int', 'full')
so <- sleuth_lrt(so, 'no_conc', 'full')
lrt_sex <- sleuth_results(so, 'no_sex:full', test_type = 'lrt')
lrt_age <- sleuth_results(so, 'no_age:full', test_type = 'lrt')
lrt_int <- sleuth_results(so, 'no_int:full', test_type = 'lrt')
lrt_conc <- sleuth_results(so, 'no_conc:full', test_type = 'lrt')
lrt_age.sig_ids <- lrt_age$target_id[which(lrt_age$qval < 0.05)]
lrt_sex.sig_ids <- lrt_sex$target_id[which(lrt_sex$qval < 0.05)]
lrt_int.sig_ids <- lrt_int$target_id[which(lrt_int$qval < 0.05)]
lrt_conc.sig_ids <- lrt_conc$target_id[which(lrt_conc$qval < 0.05)]
Let's start by identifying outliers.
Stage 1: Identifying outlier samples
First, we need to collate the normalized TPM transcript abundance data, as per my previous blog post. You could probably do it with the `so` object you've already got from the above analysis, but I find that trickier. Let's load all the transcripts with expression values using DESeq2.
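(For reference, a hedged sketch of the in-object route; sleuth_to_matrix's return value has varied between sleuth releases, which is partly why I collate from the Kallisto output files instead.)
# Pull the normalized TPM matrix straight from the sleuth object.
# Depending on your sleuth version this is a matrix, or a list wrapping one.
tpm_matrix <- sleuth_to_matrix(so, "obs_norm", "tpm")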
Normalize the count data so that the variance is similar across the range of expression values (necessary for a good Poisson distance measure); this is a nice, easy-to-use feature of DESeq2. Then visualize the distance matrix.
abundance_full <- read.table("all_abundance.tsv", header=TRUE, row.names=1)
abundance_full_integer <- apply(abundance_full, c(1,2), function(x){as.integer(x)})
library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData=abundance_full_integer, colData=data.frame(meta), design=~sex*age_group+age+sex:age+conc)
dds_redux <- dds[rowSums(counts(dds)) > 1,]
rld <- rlog(dds_redux, blind=FALSE)
library("pheatmap")
library("RColorBrewer")
library("PoiClaClu")
poisd <- PoissonDistance(t(abundance_full))
samplePoisDistMatrix <- as.matrix( poisd$dd )
rownames(samplePoisDistMatrix) <- paste(meta$sex, meta$age, sep=":")
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(samplePoisDistMatrix, clustering_distance_rows=poisd$dd,
clustering_distance_cols=poisd$dd, col=colors)
Let's also check that closely related samples (such as 8 & 12 in the lower right corner) show fairly even variance across the range of expression values; otherwise our nice matrix isn't very meaningful.
plot(assay(rld)[,c(8,12)], xlim=c(0,16), ylim=c(0,16), pch=16, cex=0.3)
Looks good, i.e. the cloud is fairly evenly thick above zero, and there's no feathering in the lower left. Using just a log2 transform on the Kallisto TPM data yields a different tree, and the scatter plot above is a lot fatter at the bottom, so it was worth using DESeq2's normalization.
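If you want to reproduce that comparison, here's a minimal sketch (my own, reusing the pheatmap setup from above) that swaps in a naive log2 transform for DESeq2's normalization:
# Naive transform: no library size normalization, pseudocount of 1
log2_abundance <- log2(abundance_full + 1)
log2_dist <- dist(t(log2_abundance))   # Euclidean sample-to-sample distances
pheatmap(as.matrix(log2_dist), clustering_distance_rows=log2_dist,
         clustering_distance_cols=log2_dist, col=colors)
# And the same two-sample scatter check as above
plot(log2_abundance[,c(8,12)], xlim=c(0,16), ylim=c(0,16), pch=16, cex=0.3)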
Back to the outliers. In the matrix, you'll see that there are three samples, with indices 4, 5, and 14 along the bottom, that are strong outliers. Not only are they strong outliers, but they are closely related to each other despite not sharing either a sex or an age factor level. This gives us a strong suspicion that something is wrong here (cross-contamination, a prep batch effect, etc.), but it's not obvious what. Let's exclude these samples and see how much the model of known factors improves.
> meta_no_outliers <- meta[c(1:3,6:13,15:18),]
> so_no_outliers <- sleuth_prep(meta_no_outliers, ~sex*age_group+age+sex:age+conc)
...
> so_no_outliers <- sleuth_fit(so_no_outliers)
...
> sum(so_no_outliers$fits$full$summary[,2])
[1] 193932.1
Much better. Now let's count the significant transcripts per factor from the LRT results (rerun on this outlier-free fit):
> length(lrt_age.sig_ids)
[1] 8
> length(lrt_sex.sig_ids)
[1] 8
> length(lrt_int.sig_ids)
[1] 4
> length(lrt_conc.sig_ids)
[1] 71
D'oh! Looks like the concentration of input RNA is explaining most of the variance. Collectively this is 84 genes, which we will write out to a file, even if the concentration effect is suspect.
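A sketch of that write-out (my own; it assumes the lrt_* sig_id vectors from the rerun LRT steps above):
# Union of the per-factor significant transcript IDs
all_lrt_sig_ids <- unique(c(lrt_age.sig_ids, lrt_sex.sig_ids,
                            lrt_int.sig_ids, lrt_conc.sig_ids))
length(all_lrt_sig_ids)   # 84, once the overlaps between the factor lists collapse
write.csv(data.frame(target_id=all_lrt_sig_ids), "lrt_sig_ids_q_lt_0.05.csv", row.names=FALSE)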
Stage 2: Spurious factor correlations
Getting back to our potential sources of failure in the analysis, let's check for a spurious correlation between the concentration and the age factor (the main factor of interest). Plot the data points and a trend line.
> plot(data.frame(meta$age,meta$conc))
> abline(lm(meta$age ~ meta$conc))
Any apparent trend is driven by a handful of samples with very low input concentration, so rather than model concentration as a continuous covariate, let's convert it to a binary low/normal factor; a cutoff of 55 captures the three lowest values.
> meta$conc_low <- meta$conc <= 55
> meta$conc_low
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE TRUE FALSE FALSE TRUE FALSE
In case you're wondering how just excluding the outlier samples and the concentration from the model works out: you get just one significant change, only in age_group1 (between 40yo and 60yo), and it's a transcript that showed up repeatedly in the previous analyses. You'll see shortly how the binary concentration modelling works out a lot better.
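For completeness, a sketch of that check (the pooled age contrast here is my own shorthand, not the per-factor LRTs used above):
# Refit on the outlier-free samples with no concentration term
so_nc <- sleuth_prep(meta_no_outliers, ~sex*age_group+age+sex:age)
so_nc <- sleuth_fit(so_nc)
so_nc <- sleuth_fit(so_nc, ~sex, "no_age")   # reduced model drops all age-related terms
so_nc <- sleuth_lrt(so_nc, "no_age", "full")
lrt_age_nc <- sleuth_results(so_nc, "no_age:full", test_type="lrt")
sum(lrt_age_nc$qval < 0.05, na.rm=TRUE)      # count age-driven significant transcripts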
Stage 3: Rationally defining a factor level threshold for a skewed factor effect (numeric factor -> two-level factor)
Let's keep only the transcripts with an average TPM of at least 1 (i.e., a row sum greater than the number of samples), using just the 15 outlier-free samples. Here `meta_clean` is the outlier-free metadata, which now carries the `conc_low` column from Stage 2.
> meta_clean <- meta[c(1:3,6:13,15:18),]
> abundance <- abundance_full[,c(1:3,6:13,15:18)]
> abundance <- abundance[rowSums(abundance)>length(meta_clean$age),]
Let's define all the combinations of young/old possible by setting the young/old threshold to each observed value in turn.
> oldness_treshold_combos <- t(sapply(meta_clean$age, function(x){meta_clean$age > x}))
Let's define a function to run the Wilcoxon-Mann-Whitney U test, as I have in a previous post. As an aside, I also tried the Kruskal-Wallis H test, which extends the U test notion to three or more factor levels (using the 0/1/2 age_group factor from earlier), but it's underpowered with this number of samples. As you'll see below, we'll have enough trouble with Wilcoxon already. We'll stick the young/old assignment of the samples in a global variable called "m" later.
wil <- function(x){
  # Pair transcript x's TPMs with the current young/old split (global "m")
  d <- data.frame(tpm=t(abundance[x,])[,1], old=m)
  w <- wilcox.test(tpm ~ old, data=d)
  w$p.value
}
Now define the function that assigns a young/old combination to "m", runs the Wilcoxon test for every transcript, and finally applies the FDR (Benjamini-Hochberg) multiple-testing correction to give us "q-values". The initial check on the combo sum below is a special condition for the oldest sample, which gives no age contrast if used as a threshold and buggers up the whole process.
perm_qvals <- function(combo){
  # The oldest sample as threshold makes everyone "young": no contrast to test
  if(!sum(combo)){return(rep_len(c(NA), length(abundance[,1])))}
  m <<- as.factor(combo)
  pvals <<- c()
  for(i in 1:length(abundance[,1])){pvals <<- append(pvals, wil(i))}
  return(p.adjust(pvals, method="fdr"))
}
Run the qval test on each combination we calculated earlier.
> combo_results <- apply(oldness_treshold_combos, 1, perm_qvals)
There were 50 or more warnings (use warnings() to see the first 50) [you can ignore these; it's complaining about ties in the rankings for the Wilcoxon test for certain transcripts, in which case it can't compute an exact p-value]
> apply(combo_results, 2, function(x){sum(as.integer(x<.05))})
[1] 285 NA 0 0 0 0 0 0 0 0 0 0 0 0 0
> res <- combo_results[,1]   # combo 1 is the age > 50 threshold (sample 1 is 50 years old)
> names(res) <- rownames(abundance)
Let's try the analysis again, but see what happens if we don't exclude the outliers, so that we have 18 samples, which gives us multiple 8+ split options (8/10, 9/9, 10/8):
> abundance <- abundance_full[rowSums(abundance_full)>length(meta$age),]
> rownames(abundance) <- rownames(abundance_full[rowSums(abundance_full)>length(meta$age),])
> oldness_treshold_combos <- t(sapply(meta$age, function(x){meta$age > x}))
> combo_results_full <- apply(oldness_treshold_combos, 1, perm_qvals)
> apply(combo_results_full, 2, function(x){sum(as.integer(x<.05))})
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 NA 0 0 0 0
It would seem that in this case the benefit of added samples for the U test is outweighed by the outlying nature of the expression values in those 3 extra samples, as we get no significant changes. The three samples are not just overall outliers, but outliers specifically in their response to the experimental factor of interest (age). That is to say, they disrupt the ranking of genes grouped as young/old, not just the replicate-normality assumption of the Wald and LRT tests on the TPM values. Had we gotten good results in the combo_results_full vector, we might have included the outliers and proceeded with caution in comparing the betas from Sleuth with either 15 or 18 samples.
Let's note the significant IDs from the 15-sample test for later, when we want to print out results to a file, and restore the 15-sample abundance matrix that the 18-sample digression above overwrote. I'm doing this in a few steps, for clarity.
> abundance <- abundance_full[,c(1:3,6:13,15:18)]
> abundance <- abundance[rowSums(abundance)>length(meta_clean$age),]
> sig_indices <- res < .05
> ids <- sort(names(res)[which(sig_indices)])
> meta_clean$old <- meta_clean$age > 50
> meta_clean
sample path sex age_group age conc conc_low old
1 57_50Y_P2 57_50Y_P2.kallisto M 1 50 55 TRUE FALSE
2 63_75Y_P2 63_75Y_P2.kallisto M 2 75 99 FALSE TRUE
3 SATP100 SATP100.kallisto M 0 27 203 FALSE FALSE
6 SATP119 SATP119.kallisto M 2 74 511 FALSE TRUE
7 SATP120 SATP120.kallisto M 1 41 466 FALSE FALSE
8 SATP125 SATP125.kallisto F 0 30 697 FALSE FALSE
9 SATP137 SATP137.kallisto M 0 19 183 FALSE FALSE
10 SATP140 SATP140.kallisto F 0 39 264 FALSE FALSE
11 SATP154 SATP154.kallisto F 2 61 256 FALSE TRUE
12 SATP68 SATP68.kallisto M 0 37 303 FALSE FALSE
13 SATP75 SATP75.kallisto M 2 65 112 FALSE TRUE
15 SATP91 SATP91.kallisto F 1 57 736 FALSE TRUE
16 SATP92 SATP92.kallisto F 1 53 136 FALSE TRUE
17 SATP95 SATP95.kallisto F 1 52 53 TRUE TRUE
18 SATP98 SATP98.kallisto F 2 73 83 FALSE TRUE
Luckily for us, the age threshold of >50 puts one low-concentration sample in the young category and one in the old, so both factors can be included in the final model.
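A quick cross-tabulation confirms the two binary factors aren't confounded:
> table(meta_clean$conc_low, meta_clean$old)   # one conc_low sample in each old level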
> so_final <- sleuth_prep(meta_clean, ~sex*old+age:old+conc_low)
> so_final <- sleuth_fit(so_final)
> sum(so_final$fits$full$summary[,2])
[1] 266102.6
By the way, did our spurious correlation removal from Stage 2 help? Let's see what the RSS is without it in the model.
> so_final <- sleuth_fit(so_final, ~sex*old+age:old, "no_conc")
> sum(so_final$fits$no_conc$summary[,2])
[1] 300209.5
Nice, including it reduces the noise by over 10%. It's not as good an RSS as we had with the quantitative concentration factor, but it's much less biased. Let's go on with the inclusive model and calculate the betas/fold-changes with a Wald test.
> so_final <- sleuth_wt(so_final, "sexM")
> so_final <- sleuth_wt(so_final, "oldTRUE")
> so_final <- sleuth_wt(so_final, "sexM:oldTRUE")
> so_final <- sleuth_wt(so_final, "oldTRUE:age")
> so_final <- sleuth_wt(so_final, "oldFALSE:age")
> so_final <- sleuth_wt(so_final, "conc_lowTRUE")
> wt_sex <- sleuth_results(so_final, 'sexM')
> wt_old <- sleuth_results(so_final, 'oldTRUE')
> wt_int <- sleuth_results(so_final, 'sexM:oldTRUE')
> wt_old_age <- sleuth_results(so_final, 'oldTRUE:age')
> wt_young_age <- sleuth_results(so_final, 'oldFALSE:age')
> wt_conc_low <- sleuth_results(so_final, 'conc_lowTRUE')
> table(wt_sex[,"qval"] <.05)
FALSE TRUE
28142 2
> table(wt_old[,"qval"] <.05)
FALSE TRUE
28142 2
> table(wt_int[,"qval"] <.05)
FALSE TRUE
28142 2
> table(wt_conc_low[,"qval"] <.05)
FALSE TRUE
28133 11
> table(wt_old_age[,"qval"] <.05)
FALSE TRUE
28142 2
> table(wt_young_age[,"qval"] <.05)
FALSE TRUE
28143 1
So we have very few significant q-values, but that's to be expected, because the Wald test assumes you have a normal distribution and an accurate assessment of variance. We have neither, but we want the beta values that the test generates: these are still our most accurate assessment of factor-driven fold-change in transcript abundance.
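One caveat on interpreting those betas (an assumption about the sleuth defaults in use here): sleuth fits on natural-log-transformed counts, so a beta is a natural-log fold-change, not log2. A quick conversion sketch:
# Convert sleuth's natural-log betas to the more familiar log2 scale
wt_old$b_log2 <- wt_old$b / log(2)
head(wt_old[,c("target_id","b","b_log2","qval")])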
wt_shared_sex <- wt_sex[wt_sex$target_id %in% ids,]
wt_shared_old <- wt_old[wt_old$target_id %in% ids,]
wt_shared_int <- wt_int[wt_int$target_id %in% ids,]
wt_shared_old_age <- wt_old_age[wt_old_age$target_id %in% ids,]
wt_shared_young_age <- wt_young_age[wt_young_age$target_id %in% ids,]
wt_shared_conc_low <- wt_conc_low[wt_conc_low$target_id %in% ids,]
wt_shared_sex <- wt_shared_sex[sort.list(wt_shared_sex[,1]),]
wt_shared_old <- wt_shared_old[sort.list(wt_shared_old[,1]),]
wt_shared_int <- wt_shared_int[sort.list(wt_shared_int[,1]),]
wt_shared_young_age <- wt_shared_young_age[sort.list(wt_shared_young_age[,1]),]
wt_shared_old_age <- wt_shared_old_age[sort.list(wt_shared_old_age[,1]),]
wt_shared_conc_low <- wt_shared_conc_low[sort.list(wt_shared_conc_low[,1]),]
Skewness measures the asymmetry of a distribution: a negative skew means a long tail to the left, a positive skew a long tail to the right. In our data, for example, a negative skew would mean that the fold-change effect is more intense in the older people within the old group. A positive skew would mean that the values for the old-designated samples pile up closer to the old cutoff (>50yo), so increased age isn't as big a factor. Let's see if the skews of the abundances of the 285 significant transcripts are quantitatively different from those of the general transcript population.
> library(moments)
> sig_abundance <- abundance[sig_indices,]
> sig_abundance <- sig_abundance[sort.list(rownames(sig_abundance)),]   # align with the sorted wt_shared tables
> sk <- apply(sig_abundance, 1, function(x){skewness(as.vector(x[meta_clean$old]))})
> summary(sk)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.72200 -0.29110 0.09093 0.07834 0.44870 1.50800
> sk0 <- apply(abundance, 1, function(x){skewness(as.vector(x[meta_clean$old]))})
> summary(sk0)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-2.0620 -0.1577 0.2699 0.3049 0.7294 2.2680 62
Looks like the 285 transcripts have a longer tail to the left (more negative) than the whole set of transcripts. Let's check more thoroughly. First, are both distributions (of skewness) normal? Run the Shapiro-Wilk test to find out.
> shapiro.test(sk)
...
W = 0.9934, p-value = 0.2449
> shapiro.test(sample(sk0, length(ids)))
...
W = 0.99399, p-value = 0.3266
Yes. Do they have the same variance? Run an F-test (you could also try Levene's test, Bartlett's test, etc.; they all reject equal variances for this dataset).
> var.test(sk, sk0)
F test to compare two variances
data: sk and sk0
F = 0.699, num df = 284, denom df = 19418, p-value = 7.016e-05
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.5961302 0.8307902
sample estimates:
ratio of variances
0.6990001
Which makes running a Student's t-test moot, since it assumes equality of variance in each sample group. What we can say easily is that the skewness is further left and tighter (roughly a third less variance) in the significantly changed transcripts than in the whole set of transcripts. That means many transcripts are affected in the really old a lot more than the moderately old, and luckily we don't have to pick a specific threshold (like 60 in the original analysis) to catch that phenomenon, nor did the effect have to be normally distributed in our samples.
> sum(sk)
[1] 22.32632
> sum(sample(sk0,length(sk)), na.rm=TRUE)
[1] 78.9279
> mean(sk)
[1] 0.07833795
> mean(sk0, na.rm=TRUE)
[1] 0.3049191
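If you want an actual p-value for that leftward shift (my addition, not part of the original analysis), the rank-based U test from Stage 3 is indifferent to the unequal variances:
> wilcox.test(sk, sk0, alternative="less")   # are the significant transcripts' skews shifted left?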
For all 285 U-test-passing transcripts, let's write out all the factor contrasts we've calculated so far, plus the old-samples skew statistic adjusted for sk0 (to help account for the effect of the specific distribution of sample ages we have in the analysis).
> sk_adjusted <- sk - (mean(sk0, na.rm=TRUE)-mean(sk))
> d <- data.frame(ids, avg_tpm=rowMeans(sig_abundance[ids,]), u_test_qval=res[ids], skew=sk_adjusted,
    sex_b=wt_shared_sex[,"b"], sex_qval=wt_shared_sex[,"qval"],
    old_b=wt_shared_old[,"b"], old_qval=wt_shared_old[,"qval"],
    int_b=wt_shared_int[,"b"], int_qval=wt_shared_int[,"qval"],
    young_age_b=wt_shared_young_age[,"b"], young_age_qval=wt_shared_young_age[,"qval"],
    old_age_b=wt_shared_old_age[,"b"], old_age_qval=wt_shared_old_age[,"qval"],
    conc_low_b=wt_shared_conc_low[,"b"], conc_low_qval=wt_shared_conc_low[,"qval"])
> write.csv(d, "u_test_no_outliers_q_lt_0.05.csv")
Pondering the merits and validity of including a kurtosis measure is left as an exercise for the reader.
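As a starting point for that exercise, the moments package also supplies a kurtosis() function (a sketch mirroring the skewness calls above):
> ku <- apply(sig_abundance, 1, function(x){kurtosis(as.vector(x[meta_clean$old]))})
> ku0 <- apply(abundance, 1, function(x){kurtosis(as.vector(x[meta_clean$old]))})
> summary(ku); summary(ku0)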
P.S.: significant changes were corroborated with consistency scores in pathway analysis (IPA) from both the 84- and 285-item lists, albeit in different pathways, so it's probably worth following up on both.
Paul, thank you very much for the posting. One question about the outliers: I have a dataset where the PCA (sleuth-generated) result and the heatmap (following your post) disagree. What should I check first to investigate this matter?
The heatmap says a couple of samples are outliers, but on the PCA plot these samples and the others in their group segregate nicely. A bit puzzled. Thank you.
There are a couple of possibilities. First, check whether a PCA plot based on TPM (rather than the default, est_counts) recapitulates the distance matrix in terms of outliers. If so, you have some pseudomapping biases related to longer genes. If not, look at how much of the total variance the first and second principal components explain. The matrix represents overall distance, but if there is really strong variance between samples for a specific set of genes (usually, and hopefully, corresponding to one of your experimental factors), this may mask any general outlying nature of samples in a PCA plot. In some sense this is good, because it means that the signal generated by the factor is strong enough to be resistant to noise.
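(A sketch of that first check, assuming a prepared sleuth object `so`:)
plot_pca(so, units="tpm", color_by="age_group")          # TPM-based PCA
plot_pca(so, units="est_counts", color_by="age_group")   # default units, for comparison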
Could you explain what "pseudomapping bias related to longer genes" is?