February 16, 2017

RNASeq mic drop: U Test for detecting differential expression when factor dosage is unknown



tl;dr If you have an unknown dosage factor, make it binary, and run a Wilcoxon-Mann-Whitney U test to improve statistical power if you have ~20 samples or more. In a real test case of 20 samples, we went from 2 significant changes using the likelihood ratio test (LRT) on each sample group independently, to over 1000 significant changes using the (non-parametric) U test.

Suppose you have RNASeq samples that can be divided into two groups: 1) a variable dose of some factor, and 2) no dose. Also suppose you don't know the dose for any given sample in group 1. I ran into this issue analyzing environmental sampling across 5 riparian sites (4 samples at each site) up and downstream of a wastewater source. If we run a U test between the upstream and downstream samples, the actual downstream pollution dosage doesn't matter: the test is non-parametric, so it makes no assumption of normality among the "replicates" in the upstream and downstream groups. Once you have about 20 samples, real differentially expressed genes start to survive Benjamini-Hochberg multiple testing correction.
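
To see why the unknown dosage doesn't matter, here's a toy sketch with made-up TPM values for one transcript: the downstream "doses" vary wildly, but the rank-based test only cares that the downstream values rank together.

# Made-up TPM values for one transcript: 8 upstream samples, 12 downstream samples
upstream   <- c(1.1, 0.9, 1.3, 1.0, 1.2, 0.8, 1.1, 1.0)
downstream <- c(3.2, 15.7, 6.1, 48.9, 4.4, 22.3, 9.8, 5.5, 31.0, 7.2, 12.6, 8.3)
wilcox.test(upstream, downstream)  # tiny p-value despite the huge spread downstream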

Names have been changed to protect the unpublished.
____________

We'll pick up my blogged de novo assembly RNASeq analysis after the Kallisto mapping step, so we have a directory called samplename_trinity.kallisto for each sample. We make an experiment metadata file (metadata_trinity.tab) like so, noting whether each site is upstream (0) or downstream (1) of the wastewater source:

sample path site downstream
15 15_trinity.kallisto site_4 1
17 17_trinity.kallisto site_4 1
21 21_trinity.kallisto site_4 1
23 23_trinity.kallisto site_4 1
29 29_trinity.kallisto site_2 0
30 30_trinity.kallisto site_2 0
34 34_trinity.kallisto site_2 0
35 35_trinity.kallisto site_2 0
40 40_trinity.kallisto site_1 0
43 43_trinity.kallisto site_1 0
4 4_trinity.kallisto site_5 1
50 50_trinity.kallisto site_1 0
57 57_trinity.kallisto site_1 0
59 59_trinity.kallisto site_3 1
62 62_trinity.kallisto site_3 1
68 68_trinity.kallisto site_3 1
6 6_trinity.kallisto site_5 1
70 70_trinity.kallisto site_3 1
7 7_trinity.kallisto site_5 1
8 8_trinity.kallisto site_5 1


If you used a reference genome instead of Trinity, just omit the "_trinity" from the path values.

First, let's collate a Transcripts Per Million (TPM) file for all transcripts as produced by Kallisto for each sample. This is a normalized measure of transcript expression. Start with generating the header line:

perl -e 'print "target_id\t",join("\t",map {/(.*)\//;$1} @ARGV),"\n";' *_trinity.kallisto/abundance.tsv > all_trinity_abundance.tsv

Then append the TPM values (each abundance.tsv has five columns: target_id, length, eff_length, est_counts and tpm, so after the paste the TPM values sit at 0-based field indices 4, 9, 14, etc., which is what the /[49]$/ check on the loop index picks out):

paste *_trinity.kallisto/abundance.tsv | perl -ane 'print $F[0];for (1..$#F){print "\t$F[$_]" if /[49]$/}print "\n"' | tail -n +2 >> all_trinity_abundance.tsv

Alternatively, if you want to use FPKM instead of TPM:


paste *.kallisto/abundance.tsv | tail -n +2 | perl -ane 'print $F[0];for (1..$#F){print "\t",$F[$_]/$F[$_-2]*1000 if /[49]$/}print "\n"' >> all_abundance_fpkm.tsv

Now we can start R, load the data and get to some statistics!

meta <- read.table("metadata_trinity.tab", header=TRUE)
abundance_full <- read.table("all_trinity_abundance.tsv", header=TRUE)


Let's only keep the transcripts that have an average of 1 TPM or more, by filtering out rows whose sum across the 20 samples is less than 20. In a previous analysis of this dataset, most of Kallisto's bootstrap-based technical noise estimates were below 1 TPM, so anything under that average is within the noise.

abundance <- abundance_full[rowSums(abundance_full[2:21])>20, 2:21]
rownames(abundance) <- abundance_full[rowSums(abundance_full[2:21])>20, 1]


Now, for each transcript let's calculate the log2 fold change between the downstream sites and the base level of the factor, upstream.

log2fc <- function(x){log2(sum(abundance[x,meta$downstream==1])/sum(abundance[x,meta$downstream==0]))}
l <- c()
for(i in 1:length(abundance[,1])){l <- append(l,log2fc(i))}
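
The explicit loop works fine; if you prefer, the same thing can be written in one line (equivalent result):

l <- sapply(1:nrow(abundance), log2fc)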


Now let's run our Wilcoxon-Mann-Whitney U test for each transcript.  The test sorts the TPMs numerically and checks if there's a non-random grouping of the downstream samples' TPM values in the rankings. If all downstream samples have the lowest or highest TPMs for a transcript, that yields the smallest possible p-value.

wil <- function(x){df <- data.frame(tpm=t(abundance[x,])[,1], downstream=meta$downstream); w <- wilcox.test(tpm ~ downstream, data=df); w$p.value}
pvals <- c()
for(i in 1:length(abundance[,1])){pvals <- append(pvals,wil(i))}
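
One practical note: tied TPMs (e.g. zeros in several samples) make wilcox.test warn that exact p-values can't be computed. That's harmless here, but if the warnings bother you, you can ask for the normal approximation explicitly (same function as above, just with exact=FALSE):

wil <- function(x){df <- data.frame(tpm=t(abundance[x,])[,1], downstream=meta$downstream); w <- wilcox.test(tpm ~ downstream, data=df, exact=FALSE); w$p.value}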


Of course, if you measure enough transcripts, eventually all the downstream samples will rank together for some transcript purely by chance.  Let's apply a multiple testing correction to the p-values.  The False Discovery Rate (FDR) correction, a.k.a. Benjamini-Hochberg, is quite popular because for most experiments it's neither too conservative nor too lax.

q <- p.adjust(pvals, method="fdr")

Let's decorate the abundance information with the fold change and U test stats, then note those with adjusted p-value (a.k.a. q-value) < 0.05.

abundance[,"log2fc"] <- l
abundance[,"qval"] <- q

qsig <- which(q < 0.05)
length(qsig)

[1] 1073

Hey, that's a lot more than we got using the likelihood ratio test for each individual site, where only two transcripts were significant across all the downstream sites! Strength in replicate sample numbers.  The q-values won't blow you away with an 8:12 split (they're in the (0.01, 0.05) range), which is why I suggest this technique when you have 20 or more samples. You can try with fewer, but you're most likely just spinning your wheels.

The log2 fold change isn't super useful on its own, since we have no assumption of normality amongst replicates (the downstream sites have different and unknown pollution exposure levels). BUT we can use it to pull out the set union of transcripts with a fairly strong effect at one site or consistent expression changes across sites. This could be useful for pollution biomarker discovery.

length(which(abs(abundance[qsig,"log2fc"]) > 1))
[1] 118

That's still quite a bit to work with!  Let's write out all the transcripts that passed the q-value filter to a file:

write.csv(abundance[qsig,], file="up_downstream_wilcox_q_lt_0.05.csv")

Not all of these are going to be real, but a subsequent analysis in IPA shows a statistically significant number of observations in the RNASeq that are consistent with the literature, so we're probably on the right track.  To be extra sure, let's randomly label the samples as upstream or downstream and redo the analysis. Hopefully we don't get any significant q-values.

meta$downstream <- sapply(1:length(meta$downstream), function(x) sample(c(0,1),1))
abundance <- abundance_full[rowSums(abundance_full[2:21])>20,2:21]
pvals <- c()
for(i in 1:length(abundance[,1])){pvals <- append(pvals,wil(i))}
q_random <- p.adjust(pvals, method="fdr")
qsig_random <- which(q_random < 0.05)
length(qsig_random)
[1] 0


*mic drop*

Hold on, that's a bit premature.  Let's pick up the mic again, because we may not be doing a fair random comparison.  In the real configuration, the downstream status is consistent with the site status for each sample.  How many ways could we assign 2 upstream and 3 downstream site designations?

library(combinat)
downstream_options <- unique(permn(c(0,0,1,1,1)))

Which gives us the ten unique permutations of the sites assigned binary downstream values.

downstream_options
[[1]]
[1] 0 0 1 1 1

[[2]]
[1] 0 1 0 1 1

[[3]]
[1] 1 0 0 1 1

[[4]]
[1] 0 1 1 0 1

[[5]]
[1] 1 0 1 0 1

[[6]]
[1] 1 1 0 0 1

[[7]]
[1] 1 0 1 1 0

[[8]]
[1] 1 1 0 1 0

[[9]]
[1] 0 1 1 1 0

[[10]]
[1] 1 1 1 0 0
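
(If you'd rather not install combinat, a base R sketch with combn generates the same ten assignments by choosing which 2 of the 5 sites are upstream; the ordering will differ from permn's listing above, so check which entry matches the real site assignment.)

ups <- combn(5, 2)  # each column gives the two sites treated as upstream
downstream_options <- lapply(1:ncol(ups), function(i){v <- rep(1,5); v[ups[,i]] <- 0; v})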

The first combination is the one we used already (the real one, where site_1 and site_2 are upstream).  Let's run the analysis again, but for each of the 10 possibilities.  Ideally only the first (real) combination gives significant results.

meta$site <- substr(as.character(meta$site), 6, 6) # strip the site name down to just its number, which doubles as an index into the downstream_options combinations
combo_qvals <- function(combo){meta$downstream <<- apply(meta, 1, function(x){combo[as.numeric(x[3])]}); pvals <<- c(); for(i in 1:length(abundance[,1])){pvals <<- append(pvals,wil(i))}; q_random <- p.adjust(pvals, method="fdr"); return(length(which(q_random < 0.05)))}

result <- lapply(downstream_options, combo_qvals)

result
[[1]]
[1] 1073

[[2]]
[1] 0

[[3]]
[1] 0

[[4]]
[1] 0

[[5]]
[1] 0

[[6]]
[1] 0

[[7]]
[1] 0

[[8]]
[1] 0

[[9]]
[1] 0

[[10]]
[1] 0

*mic drop* [for real this time]



February 13, 2017

Modelling RNASeq like a car in traffic: using hybrid qualitative and quantitative factors in Sleuth to find elusive significant contrasts


tl;dr If you are using gene knockouts or knockdowns (RNAi) as experimental factors in RNASeq experiments, sensitivity to detect differential expression for ALL genes can be improved by simultaneously modelling both the qualitative (knockout -/+) and quantitative (transcripts per million) values for the gene(s) being manipulated. When adding qualitative factors, optimize the regression model by minimizing the residual sum of squares of the model fit. In the example below, we went from zero mRNA transcripts with significant contrasts for factor interactions (e.g. shRNA gene knockdown+treatment) to hundreds by using this hybrid factor modelling.

The best analogy I can think of is trying to determine several metrics of a motor vehicle (such as the fuel consumption rate, RPM, and oil temperature) at any given moment, given only snapshots of the vehicle's speed (quantitative) and the type of road it is on (highway/city, qualitative). Either known factor by itself isn't a great indicator of the desired metrics, since you could be stuck in rush hour traffic on the highway, or racing stop-and-go in the city.  Together, they provide a more predictive context for the other metrics (i.e. implicitly modelling the gear the car is in, which affects fuel economy and RPM at a given speed). In the same way, the reaction of other genes to the levels of p53 and the shRNA target are likely to be better implicitly modelled by knowing both whether there is a knockout/knockdown (qualitative) AND what the level of that muting is (quantitative).

____________

Recently our centre sequenced and analyzed a three-factor RNASeq experiment done in triplicate, meaning there were 24 samples.  The first factor was two cell lines (one a p53 knockout, the other wild type, i.e. minus and plus), the second factor was an shRNA knockdown of a gene of interest or of GFP (control), and the third was a drug treatment. At first, I did a straightforward Kallisto/Sleuth analysis of the factors and their interactions. There were a bunch of p53 knockout-specific changes, and shRNA-specific changes, but not really anything else (including no treatment responses).  It was time to break out the sample distance matrix analysis using an orthogonal count method.  Here's the matrix:



There are a few things to point out here.  First, there is a batch effect, evident from the grouping (clustering tree on the left margin) of the samples at the bottom of the matrix by sample extraction date (D1, D2, D3). So I added the date to the model, as well as its interactions with the other factors (this has worked out well previously).  My metadata table looks like this:


sample path p53 sh treatment date
HCT116_p53_minus_shATM_DMSO_Dec_1 HCT116_p53_minus_shATM_DMSO_Dec_1.kallisto minus shATM ctl D2
HCT116_p53_minus_shATM_DMSO_Dec_2 HCT116_p53_minus_shATM_DMSO_Dec_2.kallisto minus shATM ctl D3
HCT116_p53_minus_shATM_DMSO_Nov_30 HCT116_p53_minus_shATM_DMSO_Nov_30.kallisto minus shATM ctl D1
HCT116_p53_minus_shATM_PARPi_Dec_1 HCT116_p53_minus_shATM_PARPi_Dec_1.kallisto minus shATM exp D2
HCT116_p53_minus_shATM_PARPi_Dec_2 HCT116_p53_minus_shATM_PARPi_Dec_2.kallisto minus shATM exp D3
HCT116_p53_minus_shATM_PARPi_Nov_30 HCT116_p53_minus_shATM_PARPi_Nov_30.kallisto minus shATM exp D1
HCT116_p53_minus_shGFP_DMSO_Dec_1 HCT116_p53_minus_shGFP_DMSO_Dec_1.kallisto minus GFP ctl D2
HCT116_p53_minus_shGFP_DMSO_Dec_2 HCT116_p53_minus_shGFP_DMSO_Dec_2.kallisto minus GFP ctl D3
HCT116_p53_minus_shGFP_DMSO_Nov_30 HCT116_p53_minus_shGFP_DMSO_Nov_30.kallisto minus GFP ctl D1
HCT116_p53_minus_shGFP_PARPi_Dec_1 HCT116_p53_minus_shGFP_PARPi_Dec_1.kallisto minus GFP exp D2
HCT116_p53_minus_shGFP_PARPi_Dec_2 HCT116_p53_minus_shGFP_PARPi_Dec_2.kallisto minus GFP exp D3
HCT116_p53_minus_shGFP_PARPi_Nov_30 HCT116_p53_minus_shGFP_PARPi_Nov_30.kallisto minus GFP exp D1
HCT116_p53_plus_shATM_DMSO_Dec_1 HCT116_p53_plus_shATM_DMSO_Dec_1.kallisto plus shATM ctl D2
HCT116_p53_plus_shATM_DMSO_Dec_2 HCT116_p53_plus_shATM_DMSO_Dec_2.kallisto plus shATM ctl D3
HCT116_p53_plus_shATM_DMSO_Nov_30 HCT116_p53_plus_shATM_DMSO_Nov_30.kallisto plus shATM ctl D1
HCT116_p53_plus_shATM_PARPi_Dec_1 HCT116_p53_plus_shATM_PARPi_Dec_1.kallisto plus shATM exp D2
HCT116_p53_plus_shATM_PARPi_Dec_2 HCT116_p53_plus_shATM_PARPi_Dec_2.kallisto plus shATM exp D3
HCT116_p53_plus_shATM_PARPi_Nov_30 HCT116_p53_plus_shATM_PARPi_Nov_30.kallisto plus shATM exp D1
HCT116_p53_plus_shGFP_DMSO_Dec_1 HCT116_p53_plus_shGFP_DMSO_Dec_1.kallisto plus GFP ctl D2
HCT116_p53_plus_shGFP_DMSO_Dec_2 HCT116_p53_plus_shGFP_DMSO_Dec_2.kallisto plus GFP ctl D3
HCT116_p53_plus_shGFP_DMSO_Nov_30 HCT116_p53_plus_shGFP_DMSO_Nov_30.kallisto plus GFP ctl D1
HCT116_p53_plus_shGFP_PARPi_Dec_1 HCT116_p53_plus_shGFP_PARPi_Dec_1.kallisto plus GFP exp D2
HCT116_p53_plus_shGFP_PARPi_Dec_2 HCT116_p53_plus_shGFP_PARPi_Dec_2.kallisto plus GFP exp D3
HCT116_p53_plus_shGFP_PARPi_Nov_30 HCT116_p53_plus_shGFP_PARPi_Nov_30.kallisto plus GFP exp D1


And the analysis looks like so:


library(sleuth)
meta_qual <- read.table("meta_qual.tab", header=TRUE)
meta_qual$path <- as.character(meta_qual$path)
so_qual <- sleuth_prep(meta_qual, ~p53*sh+treatment+p53:treatment+p53:date+sh:date+sh:treatment+treatment:date+date)
reading in kallisto results
........................
normalizing est_counts
29870 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps

> so_qual <- sleuth_fit(so_qual)
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas

I'll exclude the Sleuth messages from here on for brevity. After running through a typical likelihood ratio test workflow for the factors, let's see how many transcripts are differentially expressed in each factor contrast (3) and their interactions (also 3, for a total of 6).  We modelled the date effects but aren't interested in reporting them, just in accounting for them to sharpen the factors of interest.
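
For reference, the "typical workflow" for a single factor (treatment, say) looks roughly like this: fit a reduced model that drops the factor and its interactions, run the LRT against the full model, and keep the q < 0.05 target IDs. This is just a sketch; the real session repeated it for all six contrasts.

> so_qual <- sleuth_fit(so_qual, ~p53*sh+p53:date+sh:date+date, "no_treatment")
> so_qual <- sleuth_lrt(so_qual, 'no_treatment', 'full')
> lrt_treatment <- sleuth_results(so_qual, 'no_treatment:full', test_type = 'lrt')
> lrt_treatment.sig_ids <- lrt_treatment$target_id[which(lrt_treatment$qval < 0.05)]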

> length(lrt_p53.sig_ids)
[1] 19472
> length(lrt_sh.sig_ids)
[1] 17970
> length(lrt_treatment.sig_ids)
[1] 28
> length(lrt_p53_sh_int.sig_ids)
[1] 9579
> length(lrt_p53_treatment_int.sig_ids)
[1] 4
> length(lrt_sh_treatment_int.sig_ids)
[1] 0


Okay, this is looking somewhat better.  We do have some treatment-specific changes now, and a bunch of p53:sh interaction changes.  Inspecting these, about half are mirror images of the p53 changes (gene +X change in p53 plus, -X in shATM), which suggests that we have confounding factors we haven't modelled yet. A closer inspection of the distance matrix produced earlier shows that there are four p53 minus samples that cluster quite closely with the p53 plus samples, including a mix of treated and untreated, and the rest of the shATM knockdowns are peppered semi-randomly in the p53 minus part of the tree. What are we missing?

To gauge how well we've modelled the driving factors of the gene expression, we can look at the residual sum of squares (RSS) of the regression: the smaller the total RSS, the better the fit.  Where can we find this?  A little digging shows that the so object has a fits member, which itself holds the named regression models.  By default, the model we first fitted is called "full". Each regression model contains a summary data frame.  Let's check out the first row of that data frame:

> so_qual$fits$full$summary[1,]
           x_group      rss sigma_sq sigma_q_sq mean_obs  var_obs target_id[...]
1 (-0.000966,0.01] 29.05466 2.021473   1.206822 1.246232 3.421352 NM_000195[...]

That second column, "rss", is the Residual Sum of Squares we're after. Let's look at the sum across all the transcripts modelled:
> sum(so_qual$fits$full$summary[,2])
[1] 232757.7
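
Since we'll be comparing this number across several fits below, a tiny helper keeps things readable (assuming, as above, that every fit's summary keeps rss as its second column):

> total_rss <- function(so, fit = "full") sum(so$fits[[fit]]$summary[,2])
> total_rss(so_qual)
[1] 232757.7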

If we're going to improve the regression model, which leads to better statistical power to pick up the interaction terms, we'll want to get a total RSS below 232758. RNAi knockdown is infamous for having highly variable efficiency, so maybe we should not be using a binary on/off factor to model it, but rather use the actual expression value, since we have that from the RNASeq data itself.  My Kallisto reference file uses RefSeq mRNA models, so I look up the main NM_####### ID for ATM, which can be found by clicking the RefSeq mRNA link on the right hand sidebar of the ATM RefGene page.  It's NM_000051.  The transcripts per million value is readily available as the last column in the Kallisto abundance file in each sample's output directory.


-bash-4.2$ grep NM_000051 *.kallisto/abundance.tsv
HCT116_p53_minus_shATM_DMSO_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 1882.6 4.93759
HCT116_p53_minus_shATM_DMSO_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 2618.66 5.9662
HCT116_p53_minus_shATM_DMSO_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 2392.05 4.00764
HCT116_p53_minus_shATM_PARPi_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 2212.67 4.54057
HCT116_p53_minus_shATM_PARPi_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 2345.31 5.07381
HCT116_p53_minus_shATM_PARPi_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 3151.77 5.03178
HCT116_p53_minus_shGFP_DMSO_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 3851.38 7.11942
HCT116_p53_minus_shGFP_DMSO_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 1981.88 4.0778
HCT116_p53_minus_shGFP_DMSO_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 2621.98 8.19509
HCT116_p53_minus_shGFP_PARPi_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 2458.8 3.66504
HCT116_p53_minus_shGFP_PARPi_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 2710.64 4.47078
HCT116_p53_minus_shGFP_PARPi_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 2243.22 5.37964
HCT116_p53_plus_shATM_DMSO_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 2441.1 4.88496
HCT116_p53_plus_shATM_DMSO_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 3736.36 4.75133
HCT116_p53_plus_shATM_DMSO_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 3259.82 6.86958
HCT116_p53_plus_shATM_PARPi_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 3175.35 4.85537
HCT116_p53_plus_shATM_PARPi_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 1790.03 4.44556
HCT116_p53_plus_shATM_PARPi_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 3625.5 6.01011
HCT116_p53_plus_shGFP_DMSO_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 1662.14 4.27169
HCT116_p53_plus_shGFP_DMSO_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 2445.64 5.1909
HCT116_p53_plus_shGFP_DMSO_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 2185.2 4.63558
HCT116_p53_plus_shGFP_PARPi_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 1640.12 4.15573
HCT116_p53_plus_shGFP_PARPi_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 2548.95 4.62631
HCT116_p53_plus_shGFP_PARPi_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 1864.82 4.69346
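
To turn that grep output into a two-column sample/TPM table you can paste into the metadata, something like this works (a hypothetical one-liner; adjust the ID and output filename to taste):

grep NM_000051 *.kallisto/abundance.tsv | sed 's/.kallisto\/abundance.tsv:/ /' | awk '{print $1"\t"$NF}' > atm_tpm.tab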


I did the same for the p53 mRNA level (NM_000546), since there is no guarantee that p53 is evenly expressed in each of the samples. Now my sample metadata file looks like (meta_quant.tab):

sample path p53 sh treatment date
HCT116_p53_minus_shATM_DMSO_Dec_1 HCT116_p53_minus_shATM_DMSO_Dec_1.kallisto 1.07691 4.93759 ctl D2
HCT116_p53_minus_shATM_DMSO_Dec_2 HCT116_p53_minus_shATM_DMSO_Dec_2.kallisto 4.67624 5.9662 ctl D3
HCT116_p53_minus_shATM_DMSO_Nov_30 HCT116_p53_minus_shATM_DMSO_Nov_30.kallisto 1.17641 4.00764 ctl D1
HCT116_p53_minus_shATM_PARPi_Dec_1 HCT116_p53_minus_shATM_PARPi_Dec_1.kallisto 5.94978 4.54057 exp D2
HCT116_p53_minus_shATM_PARPi_Dec_2 HCT116_p53_minus_shATM_PARPi_Dec_2.kallisto 5.98334 5.07381 exp D3
HCT116_p53_minus_shATM_PARPi_Nov_30 HCT116_p53_minus_shATM_PARPi_Nov_30.kallisto 7.44437 5.03178 exp D1
HCT116_p53_minus_shGFP_DMSO_Dec_1 HCT116_p53_minus_shGFP_DMSO_Dec_1.kallisto 0.53949 7.11942 ctl D2
HCT116_p53_minus_shGFP_DMSO_Dec_2 HCT116_p53_minus_shGFP_DMSO_Dec_2.kallisto 8.92308 4.0778 ctl D3
HCT116_p53_minus_shGFP_DMSO_Nov_30 HCT116_p53_minus_shGFP_DMSO_Nov_30.kallisto 5.6365 8.19509 ctl D1
HCT116_p53_minus_shGFP_PARPi_Dec_1 HCT116_p53_minus_shGFP_PARPi_Dec_1.kallisto 10.7367 3.66504 exp D2
HCT116_p53_minus_shGFP_PARPi_Dec_2 HCT116_p53_minus_shGFP_PARPi_Dec_2.kallisto 10.1989 4.47078 exp D3
HCT116_p53_minus_shGFP_PARPi_Nov_30 HCT116_p53_minus_shGFP_PARPi_Nov_30.kallisto 16.878 5.37964 exp D1
HCT116_p53_plus_shATM_DMSO_Dec_1 HCT116_p53_plus_shATM_DMSO_Dec_1.kallisto 0.10751 4.88496 ctl D2
HCT116_p53_plus_shATM_DMSO_Dec_2 HCT116_p53_plus_shATM_DMSO_Dec_2.kallisto 0.55286 4.75133 ctl D3
HCT116_p53_plus_shATM_DMSO_Nov_30 HCT116_p53_plus_shATM_DMSO_Nov_30.kallisto 0.38625 6.86958 ctl D1
HCT116_p53_plus_shATM_PARPi_Dec_1 HCT116_p53_plus_shATM_PARPi_Dec_1.kallisto 0.77159 4.85537 exp D2
HCT116_p53_plus_shATM_PARPi_Dec_2 HCT116_p53_plus_shATM_PARPi_Dec_2.kallisto 0.82322 4.44556 exp D3
HCT116_p53_plus_shATM_PARPi_Nov_30 HCT116_p53_plus_shATM_PARPi_Nov_30.kallisto 0.59059 6.01011 exp D1
HCT116_p53_plus_shGFP_DMSO_Dec_1 HCT116_p53_plus_shGFP_DMSO_Dec_1.kallisto 0.52125 4.27169 ctl D2
HCT116_p53_plus_shGFP_DMSO_Dec_2 HCT116_p53_plus_shGFP_DMSO_Dec_2.kallisto 0.72088 5.1909 ctl D3
HCT116_p53_plus_shGFP_DMSO_Nov_30 HCT116_p53_plus_shGFP_DMSO_Nov_30.kallisto 0.39565 4.63558 ctl D1
HCT116_p53_plus_shGFP_PARPi_Dec_1 HCT116_p53_plus_shGFP_PARPi_Dec_1.kallisto 0.71229 4.15573 exp D2
HCT116_p53_plus_shGFP_PARPi_Dec_2 HCT116_p53_plus_shGFP_PARPi_Dec_2.kallisto 0.68656 4.62631 exp D3
HCT116_p53_plus_shGFP_PARPi_Nov_30 HCT116_p53_plus_shGFP_PARPi_Nov_30.kallisto 0.65503 4.69346 exp D1


Note that the p53 expression values are all over the place in the nominally p53 "minus" samples, which gives us some hope that quantitative modelling will help. Let's rerun the analysis, except this time the regression model will treat p53 and sh as dosage effects, which is the default treatment for any numeric column in R.

> library(sleuth)
> meta_quant <- read.table("meta_quant.tab", header=TRUE)
> meta_quant$path <- as.character(meta_quant$path)
> so_quant <- sleuth_prep(meta_quant, 
~p53*sh+treatment+p53:treatment+sh:treatment+p53:date+sh:date+treatment:date+date)
> so_quant <- sleuth_fit(so_quant)

Let's check if we improved the regression model by using the quantitative factors instead of qualitative.

> sum(so_quant$fits$full$summary[,2])
[1] 325008.8

Ouch, that's a lot more than our original 232757.7 RSS!  At this point I tried a few mathematical manipulations of the expression values to see if that'd help.  After all, maybe the dosage response is logarithmic rather than linear?  No need to run sleuth_prep() again, we can just try as many extra models as we like using sleuth_fit().

> so_quant <- sleuth_fit(so_quant, ~I(log(p53))*sh+treatment+I(log(p53)):treatment+I(log(p53)):date+sh:date+sh:treatment+treatment:date+date, "full_log_p53")


You'll note that I replaced p53 with I(log(p53)).  I() is a function that Inhibits Interpretation: the expression inside it is evaluated arithmetically for each sample and used as the covariate, rather than being parsed as part of the model formula (log() alone would actually work here, but the wrapper is needed for arithmetic like the 1/sh below, and it keeps the intent explicit).  Did that help?

> sum(so_quant$fits$full_log_p53$summary[,2])
[1] 300397.4

A bit, but not close to the original yet.  Since a lot of the p53 values lie between 0.108 and 1, we're getting a much bigger spread in the logarithm for small values (-2.2, 0) than for large ones, so let's round up with ceiling(), which effectively gives samples with p53 < 1 a dosage value of 0 (because log(1) = 0). Let's also invert the sh quantity, which gives greater weight to small values, i.e. better knockdown.

> so_quant <- sleuth_fit(so_quant, ~I(log(ceiling(p53)))*I(1/sh)+treatment+I(log(ceiling(p53))):treatment+I(log(ceiling(p53))):date+I(1/sh):date+I(1/sh):treatment+treatment:date+date, "full_p53_log_and_sh_reciprocal")

> sum(so_quant$fits$full_p53_log_and_sh_reciprocal$summary[,2])
[1] 274487.9

Better, but still not as good as the original qualitative-only model.  This is reflected in the LRT results if we bother to do the normal downstream workflow: 

> length(lrt_p53.sig_ids)
[1] 13282
> length(lrt_sh.sig_ids)
[1] 14224
> length(lrt_treatment.sig_ids)
[1] 4
> length(lrt_treatment_sh_int.sig_ids)
[1] 2
> length(lrt_treatment_p53_int.sig_ids)
[1] 0
> length(lrt_sh_p53_int.sig_ids)
[1] 1


Various other manipulations didn't help much.  It then struck me that modelling both the qualitative and quantitative values simultaneously could help.  Let's combine the metadata and see, first modelling just the main p53 effect seen in the distance matrix.


> meta_hybrid <- meta_qual
> meta_hybrid$p53_quant <- meta_quant$p53

> meta_hybrid$sh_quant <- meta_quant$sh
> so_hybrid <- sleuth_prep(meta_hybrid, ~p53*sh+treatment+p53:treatment+p53:date+sh:date+sh:treatment+treatment:date+date+p53_quant)
> so_hybrid <- sleuth_fit(so_hybrid)
> sum(so_hybrid$fits$full$summary[,2])
[1] 193880.4

Woohoo!  That's about a 17% drop in the residual sum of squares, and already the additional p53_quant factor is removing enough systematic bias to start making the interaction effects of the 3 factors significant.

> so_hybrid <- sleuth_fit(so_hybrid, ~p53*sh+treatment+p53:treatment+p53:date+sh:date+sh:treatment+treatment:date+date+I(log(ceiling(p53_quant))), "hybrid_log_p53")
[a bunch of boring code goes here, see below...]

> length(lrt_p53.sig_ids)
[1] 22346
> length(lrt_sh.sig_ids)
[1] 2634
> length(lrt_treatment.sig_ids)
[1] 9
> length(lrt_sh_treatment_int.sig_ids)
[1] 64
> length(lrt_p53_treatment_int.sig_ids)
[1] 17
> length(lrt_sh_p53_int.sig_ids)
[1] 2511

Let's add in the sh quantitative effect.

> so_hybrid <- sleuth_fit(so_hybrid, ~p53*sh+treatment+p53:treatment+p53:date+sh:date+sh:treatment+treatment:date+date+p53_quant+sh_quant, "full_sh")
> sum(so_hybrid$fits$full_sh$summary[,2])
[1] 150439.9

Excellent! We've managed to reduce the residuals by about a third (~233K to ~150K).  Further mathematical manipulations of the quantitative factors didn't help, so let's assume this is as good as it gets and continue with a standard LRT differential expression workflow from here.

> so_hybrid <- sleuth_fit(so_hybrid, ~sh+treatment+sh:date+sh:treatment+treatment:date+date+sh_quant, "no_p53")

> so_hybrid <- sleuth_fit(so_hybrid, ~p53+treatment+p53:date+p53:treatment+treatment:date+date+p53_quant, "no_sh")

> so_hybrid <- sleuth_fit(so_hybrid, ~p53*sh+p53:date+sh:date+date+p53_quant+sh_quant, "no_treatment")

> so_hybrid <- sleuth_fit(so_hybrid, ~p53*sh+treatment+p53:date+sh:date+p53:treatment+treatment:date+date+p53_quant+sh_quant, "no_sh_treatment_int")

> so_hybrid <- sleuth_fit(so_hybrid, ~p53*sh+treatment+p53:date+sh:date+sh:treatment+treatment:date+date+p53_quant+sh_quant, "no_p53_treatment_int")

> so_hybrid <- sleuth_fit(so_hybrid, ~p53+sh+treatment+p53:date+sh:date+p53:treatment+sh:treatment+treatment:date+date+p53_quant+sh_quant, "no_sh_p53_int")

> so_hybrid <- sleuth_lrt(so_hybrid, 'no_p53', 'full_sh')
> so_hybrid <- sleuth_lrt(so_hybrid, 'no_sh', 'full_sh')
> so_hybrid <- sleuth_lrt(so_hybrid, 'no_treatment', 'full_sh')
> so_hybrid <- sleuth_lrt(so_hybrid, 'no_p53_treatment_int' , 'full_sh')
> so_hybrid <- sleuth_lrt(so_hybrid, 'no_sh_treatment_int', 'full_sh')
> so_hybrid <- sleuth_lrt(so_hybrid, 'no_sh_p53_int', 'full_sh')
> lrt_p53 <- sleuth_results(so_hybrid, 'no_p53:full_sh', test_type = 'lrt')
> lrt_sh <- sleuth_results(so_hybrid, 'no_sh:full_sh', test_type = 'lrt')
> lrt_treatment <- sleuth_results(so_hybrid, 'no_treatment:full_sh', test_type = 'lrt')
> lrt_p53_treatment_int <- sleuth_results(so_hybrid, 'no_p53_treatment_int:full_sh', test_type = 'lrt')
> lrt_sh_treatment_int <- sleuth_results(so_hybrid, 'no_sh_treatment_int:full_sh', test_type = 'lrt')
> lrt_sh_p53_int <- sleuth_results(so_hybrid, 'no_sh_p53_int:full_sh', test_type = 'lrt')
> lrt_p53.sig_ids <- lrt_p53$target_id[which(lrt_p53$qval < 0.05)]
> lrt_sh.sig_ids <- lrt_sh$target_id[which(lrt_sh$qval < 0.05)]
> lrt_treatment.sig_ids <- lrt_treatment$target_id[which(lrt_treatment$qval < 0.05)]
> lrt_sh_treatment_int.sig_ids <- lrt_sh_treatment_int$target_id[which(lrt_sh_treatment_int$qval < 0.05)]
> lrt_p53_treatment_int.sig_ids <- lrt_p53_treatment_int$target_id[which(lrt_p53_treatment_int$qval < 0.05)]
> lrt_sh_p53_int.sig_ids <- lrt_sh_p53_int$target_id[which(lrt_sh_p53_int$qval < 0.05)]

Now, let's look how this improved model has affected the factor contrast lists under the LRT test.

> length(lrt_p53.sig_ids)
[1] 23099
> length(lrt_sh.sig_ids)
[1] 13483
> length(lrt_treatment.sig_ids)
[1] 164
> length(lrt_sh_treatment_int.sig_ids)
[1] 291
> length(lrt_p53_treatment_int.sig_ids)
[1] 161
> length(lrt_sh_p53_int.sig_ids)
[1] 5515

This is a lot more in line with what we know from previous experiments, Western blots, etc.  It's important to note that we got significant changes in our original qualitative factors (p53 minus/plus and shGFP/shATM) by reducing the confounding expression "noise" with the quantitative covariates. Note that if you run an LRT on a quantitative factor in Sleuth, it makes an assumption of normally distributed abundance values across the samples, which is unlikely to be true unless you've very carefully chosen your samples. Let's go on and run the Wald tests (against the full_sh fit, since that's the model we settled on) so we can get the betas (~natural logarithm fold change), which IS something we can model for the quantitative traits on a per-quantitative-unit basis if we picked the right quantitative effect (linear, log, etc.).

> so_hybrid <- sleuth_wt(so_hybrid, 'p53plus', 'full_sh')
> so_hybrid <- sleuth_wt(so_hybrid, 'shshATM', 'full_sh')
> so_hybrid <- sleuth_wt(so_hybrid, 'treatmentexp', 'full_sh')
> so_hybrid <- sleuth_wt(so_hybrid, 'p53_quant', 'full_sh')
> so_hybrid <- sleuth_wt(so_hybrid, 'sh_quant', 'full_sh')
> so_hybrid <- sleuth_wt(so_hybrid, 'shshATM:treatmentexp', 'full_sh')
> so_hybrid <- sleuth_wt(so_hybrid, 'p53plus:shshATM', 'full_sh')
> so_hybrid <- sleuth_wt(so_hybrid, 'p53plus:treatmentexp', 'full_sh')
> wt_p53 <- sleuth_results(so_hybrid, 'p53plus', test_type = 'wt', which_model = 'full_sh')
> wt_sh <- sleuth_results(so_hybrid, 'shshATM', test_type = 'wt', which_model = 'full_sh')
> wt_treatment <- sleuth_results(so_hybrid, 'treatmentexp', test_type = 'wt', which_model = 'full_sh')
> wt_p53_quant <- sleuth_results(so_hybrid, 'p53_quant', test_type = 'wt', which_model = 'full_sh')
> wt_sh_quant <- sleuth_results(so_hybrid, 'sh_quant', test_type = 'wt', which_model = 'full_sh')
> wt_sh_treatment_int <- sleuth_results(so_hybrid, 'shshATM:treatmentexp', test_type = 'wt', which_model = 'full_sh')
> wt_p53_sh_int <- sleuth_results(so_hybrid, 'p53plus:shshATM', test_type = 'wt', which_model = 'full_sh')
> wt_p53_treatment_int <- sleuth_results(so_hybrid, 'p53plus:treatmentexp', test_type = 'wt', which_model = 'full_sh')


Now let's keep the Wald test results that are also in the LRT results, and reorder them so that every table holds results for the same targets (RefSeq mRNA IDs) in alphabetical order, then print the important result bits (ID, Wald and LRT q-values, fold changes for each factor & interaction) to a file.


> lrt_ids <- unique(c(lrt_p53.sig_ids, lrt_sh.sig_ids, lrt_treatment.sig_ids, lrt_sh_treatment_int.sig_ids, lrt_p53_treatment_int.sig_ids, lrt_sh_p53_int.sig_ids))
> lrt_shared_p53 <- lrt_p53[lrt_p53$target_id %in% lrt_ids,]
> lrt_shared_sh <- lrt_sh[lrt_sh$target_id %in% lrt_ids,]
> lrt_shared_treatment <- lrt_treatment[lrt_treatment$target_id %in% lrt_ids,]
> lrt_shared_sh_treatment_int <- lrt_sh_treatment_int[lrt_sh_treatment_int$target_id %in% lrt_ids,]
> lrt_shared_p53_treatment_int <- lrt_p53_treatment_int[lrt_p53_treatment_int$target_id %in% lrt_ids,]
> lrt_shared_sh_p53_int <- lrt_p53_sh_int[lrt_p53_sh_int$target_id %in% lrt_ids,]
> lrt_shared_p53 <- lrt_shared_p53[sort.list(lrt_shared_p53[,1]),]
> lrt_shared_sh <- lrt_shared_sh[sort.list(lrt_shared_sh[,1]),]
> lrt_shared_treatment <- lrt_shared_treatment[sort.list(lrt_shared_treatment[,1]),]
> lrt_shared_sh_treatment_int <- lrt_shared_sh_treatment_int[sort.list(lrt_shared_sh_treatment_int[,1]),]
> lrt_shared_p53_treatment_int <- lrt_shared_p53_treatment_int[sort.list(lrt_shared_p53_treatment_int[,1]),]
> lrt_shared_sh_p53_int <- lrt_shared_sh_p53_int[sort.list(lrt_shared_sh_p53_int[,1]),]

> wt_shared_p53 <- wt_p53[wt_p53$target_id %in% lrt_ids,]
> wt_shared_sh <- wt_sh[wt_sh$target_id %in% lrt_ids,]
> wt_shared_treatment <- wt_treatment[wt_treatment$target_id %in% lrt_ids,]
> wt_shared_p53_quant <- wt_p53_quant[wt_p53_quant$target_id %in% lrt_ids,]
> wt_shared_sh_quant <- wt_sh_quant[wt_sh_quant$target_id %in% lrt_ids,]
> wt_shared_sh_treatment_int <- wt_sh_treatment_int[wt_sh_treatment_int$target_id %in% lrt_ids,]
> wt_shared_p53_treatment_int <- wt_p53_treatment_int[wt_p53_treatment_int$target_id %in% lrt_ids,]
> wt_shared_sh_p53_int <- wt_p53_sh_int[wt_p53_sh_int$target_id %in% lrt_ids,]
> wt_shared_p53 <- wt_shared_p53[sort.list(wt_shared_p53[,1]),]
> wt_shared_sh <- wt_shared_sh[sort.list(wt_shared_sh[,1]),]
> wt_shared_treatment <- wt_shared_treatment[sort.list(wt_shared_treatment[,1]),]
> wt_shared_p53_quant <- wt_shared_p53_quant[sort.list(wt_shared_p53_quant[,1]),]
> wt_shared_sh_quant <- wt_shared_sh_quant[sort.list(wt_shared_sh_quant[,1]),]
> wt_shared_sh_treatment_int <- wt_shared_sh_treatment_int[sort.list(wt_shared_sh_treatment_int[,1]),]
> wt_shared_p53_treatment_int <- wt_shared_p53_treatment_int[sort.list(wt_shared_p53_treatment_int[,1]),]
> wt_shared_sh_p53_int <- wt_shared_sh_p53_int[sort.list(wt_shared_sh_p53_int[,1]),]
> combined_result <- data.frame(lrt_shared_p53[,"target_id"],lrt_shared_p53[,"qval"],wt_shared_p53[,"b"],wt_shared_p53[,"qval"],wt_shared_p53_quant[,"b"],wt_shared_p53_quant[,"qval"],lrt_shared_sh[,"qval"],wt_shared_sh[,"b"],wt_shared_sh[,"qval"],wt_shared_sh_quant[,"b"],wt_shared_sh_quant[,"qval"],lrt_shared_treatment[,"qval"],wt_shared_treatment[,"b"],wt_shared_treatment[,"qval"],lrt_shared_sh_p53_int[,"qval"],wt_shared_sh_p53_int[,"b"],wt_shared_sh_p53_int[,"qval"],lrt_shared_p53_treatment_int[,"qval"],wt_shared_p53_treatment_int[,"b"],wt_shared_p53_treatment_int[,"qval"],lrt_shared_sh_treatment_int[,"qval"],wt_shared_sh_treatment_int[,"b"],wt_shared_sh_treatment_int[,"qval"])
> write.csv(combined_result, file="hct_hybrid_dge_lrt_q_0.05.csv")

The results are encouraging. Most of the sh:p53 interactions that were mirroring the p53 changes have gone away, and most of the LRT-significant changes in the interaction terms aren't mirrors either. We've dealt with the batch effects and with the variability of gene expression in knockdowns and knockouts pretty effectively.  You'll note that the LRT q-values are better than the Wald q-values in situations like this, with so many factors in the regression model, because the LRT has greater power.  Not all the genes listed as significant by the LRT are going to be real, but it's certainly a good starting point for pathway enrichment analysis, etc., and the Wald test results with qval < 0.05 can be taken as more of a sure thing.


February 9, 2017

Debugging RNASeq: sample swaps and batch effects

tl;dr If a standard differential expression analysis yields nothing for an experiment, look at the distance matrix between samples in an RNASeq experiment and see if anything looks out of place. Sometimes you can validate sample swaps when different cell lines are involved, and even strong batch effects can be modelled effectively to reveal the underlying factor effects you're expecting.

______________
There's a particular sinking feeling you get when you do a differential gene expression analysis of a carefully planned RNASeq experiment and get no significant changes. I got that recently, when analyzing a two-factor Latin Square experiment done in triplicate.  The experiment involved two different cell lines (factor 1, with levels E545K and EV), and a treatment (factor 2, with levels plus and minus).  We had good reason to believe that there should be an interaction effect (factor 3) between one of the cell lines and the treatment.  I ran a Kallisto + Sleuth analysis with a linear model of ~cell_line+treatment+cell_line:treatment, and an experiment metadata file like so:

sample                path                           cell_line treatment 
E545K_minus_CP_Dec_1  E545K_minus_CP_Dec_1.kallisto  K         ctl      
E545K_minus_CP_Dec_2  E545K_minus_CP_Dec_2.kallisto  K         ctl       
E545K_minus_CP_Nov_30 E545K_minus_CP_Nov_30.kallisto K         ctl       
E545K_plus_CP_Dec_1   E545K_plus_CP_Dec_1.kallisto   K         exp      
E545K_plus_CP_Dec_2   E545K_plus_CP_Dec_2.kallisto   K         exp       
E545K_plus_CP_Nov_30  E545K_plus_CP_Nov_30.kallisto  K         exp      
EV_minus_CP_Dec_1     EV_minus_CP_Dec_1.kallisto     EV        ctl      
EV_minus_CP_Dec_2     EV_minus_CP_Dec_2.kallisto     EV        ctl       
EV_minus_CP_Nov_30    EV_minus_CP_Nov_30.kallisto    EV        ctl      
EV_plus_CP_Dec_1      EV_plus_CP_Dec_1.kallisto      EV        exp      
EV_plus_CP_Dec_2      EV_plus_CP_Dec_2.kallisto      EV        exp      
EV_plus_CP_Nov_30     EV_plus_CP_Nov_30.kallisto     EV        exp      

Note that I called the E545K samples "K", because I'm lazy and want to use the default behaviour of R, which is to make the alphabetically first factor value the base level, so reported ratios will be K:EV. As I said before, I got nothing significant...so I decided to run a bwa + DESeq2 analysis instead, with the same model. That analysis generates a distance matrix like so:



[Note: you can glean similar information from the Sleuth PCA plots in the sleuth_live interface, but I often find them too cluttered] Probably the most important thing to look at on this chart is the cluster tree on the left hand side. First, most of the EV cell line samples cluster together at the bottom, with the one exception of EV_minus_CP_Dec_2, which sits way up top with the first sample, E545K_plus_CP_Dec_2. A single E545K sample, E545K_minus_CP_Dec_2, clusters with the EV samples.  Maybe these two are swapped?
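
For reference, a distance matrix along these lines comes out of a fairly standard DESeq2 recipe, roughly like this (a minimal sketch, assuming a gene-level count matrix called counts with columns in the same order as the metadata table meta; the actual plot above may have used different options):

library(DESeq2)
library(pheatmap)
dds <- DESeqDataSetFromMatrix(countData = counts, colData = meta, design = ~cell_line+treatment)
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)  # stabilize variance before clustering
sampleDists <- dist(t(assay(vsd)))                           # Euclidean distances between samples
pheatmap(as.matrix(sampleDists),
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists)             # clustered heatmap with the sample tree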

For supporting evidence, we can look in the mapped reads for the G>A variant that's specific to cell line E545K.  That's something you can't do with a gene expression microarray! (Talk about kicking a platform when it's down.)
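# note: each BAM needs to be coordinate-sorted and indexed (samtools index) for the -r region lookup to work; add -f with the reference FASTA if you want the actual reference base reported instead of N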
for b in E*.bam; do
  echo $b;
  samtools mpileup -r chr3:178936091-178936091 $b 2> /dev/null;
done

E545K_minus_CP_Dec_1.bam
chr3 178936091 N 64 ggggggggggggggggggggggaggggggggggggggggggggggggggggggggggggg^Wg^Wg^Wg^Wg AEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEE/EEEEEEE/EEEEEEEEEEEEEEEEEEEE

E545K_minus_CP_Dec_2.bam
chr3 178936091 N 20 gggggggggggggggggggg EEEAEEEEEEEEEE/EE<EE

E545K_minus_CP_Nov_30.bam
chr3 178936091 N 77 ggggggggggggggggggggggggggggggggggggggggggggggggggggggaggggggggggggggggggggg^Wg AAAEEEEEEAEEA/EEEEEEE/EEEAEAEEEEEEEEEEEEAAEEEE/EEEEAAEEEEEEEEEEEEEE/E<<EAEEEE

E545K_plus_CP_Dec_1.bam
chr3 178936091 N 33 gggggggggaggggggggggggagggggggggg AAAEEEEEEE<EEEE<E/EEEEEEEEAEE/EEE

E545K_plus_CP_Dec_2.bam
chr3 178936091 N 56 ggggggggAgggggggggggggggggggggggggggggggggggggggaggggggg AEEEEEEEEEAEAEE/EEEEEEEEEEEEEAEAEEEEE/AE/EE<E/EEAEEEEE<E

E545K_plus_CP_Nov_30.bam
chr3 178936091 N 36 ggagggggggaggggggggggggggggggggggggg E6EEEE/EEEE/AEEEEEAEEEE/EEAAEEEEEE//

EV_minus_CP_Dec_1.bam
chr3 178936091 N 30 g$ggggggggggggggggggggggggggggg AAAAEEEEEEEEEE/EEEEEAEEE/EEEA<

EV_minus_CP_Dec_2.bam
chr3 178936091 N 47 gggggggggagggggggaggggggggggaggggggggggggggggg^Wg AEAEEEEEEEEEEAEEEEEEEEEE/EEAEEEEEEEEAE6EEEE/AEE

EV_minus_CP_Nov_30.bam
chr3 178936091 N 52 g$ggggggggggggggggggggggggggggggggggggggggggggggggggg AAEEEAEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEE6EEEEAEEEE<EA/

EV_plus_CP_Dec_1.bam
chr3 178936091 N 31 gggggggggggggggggggggggggggggg^Wg AEEEEEE/EEEEEEEAEAEEEEEEE<A/EEE

EV_plus_CP_Dec_2.bam
chr3 178936091 N 30 gggggggggggggggggggggggggggggg AEEEEAEEEEEEEEAEEEEEEEEEE6<AAE

EV_plus_CP_Nov_30.bam
chr3 178936091 N 34 ggggggggGggggggggggggggggggggggggg AAAEEEE/EEEEEEEEA/EE/EEEEEEEEEEEA<


I've added in the DNA base highlights manually to show which samples have the variant (the G>A reads show up as "a" or "A" among the "g"s in each pileup string). We'd expect it to be the first 6 samples normally (all the E545K... samples). The two that we suspect are swapped (also highlighted) defy the expectation: the EV sample has the G>A reads, and the E545K sample doesn't. Had the swap been between samples of the same cell line we wouldn't be able to use this evidence, but we're lucky I guess.
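
If you'd rather count than eyeball, the read-base column of the pileup (column 5) can be tallied directly; a quick sketch along the lines of the loop above prints how many reads support the A allele in each BAM:

for b in E*.bam; do
  printf "%s\t" "$b";
  samtools mpileup -r chr3:178936091-178936091 $b 2> /dev/null | awk '{print gsub(/[aA]/,"",$5)" of "$4" reads support A"}';
done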

Rerunning the analysis with just the samples swapped (and nothing else changed) still yields almost nothing.

The second thing to notice about the distance matrix is that the E545K samples (given the sample swap) pair off in the tree closely based on the date the sample was prepared, but the EV samples do not (treatment vs non is the primary clustering factor for those).  This suggests a strong batch effect that's specific to E545K cells, and is worse at later dates.  Let's swap the samples in the metadata file, and model the batch effect in a new Sleuth analysis in order to recover the real differential expression signal for the factors of interest: cell line, treatment and cell line/treatment interaction.  This means that our new metadata file (e_meta_swapped.tab) should look like this:

sample                path                           cell_line treatment date
E545K_minus_CP_Dec_1  E545K_minus_CP_Dec_1.kallisto  K         ctl       D1
E545K_minus_CP_Dec_2  EV_minus_CP_Dec_2.kallisto     K         ctl       D2
E545K_minus_CP_Nov_30 E545K_minus_CP_Nov_30.kallisto K         ctl       D0
E545K_plus_CP_Dec_1   E545K_plus_CP_Dec_1.kallisto   K         exp       D1
E545K_plus_CP_Dec_2   E545K_plus_CP_Dec_2.kallisto   K         exp       D2
E545K_plus_CP_Nov_30  E545K_plus_CP_Nov_30.kallisto  K         exp       D0
EV_minus_CP_Dec_1     EV_minus_CP_Dec_1.kallisto     EV        ctl       D1
EV_minus_CP_Dec_2     E545K_minus_CP_Dec_2.kallisto  EV        ctl       D2
EV_minus_CP_Nov_30    EV_minus_CP_Nov_30.kallisto    EV        ctl       D0
EV_plus_CP_Dec_1      EV_plus_CP_Dec_1.kallisto      EV        exp       D1
EV_plus_CP_Dec_2      EV_plus_CP_Dec_2.kallisto      EV        exp       D2
EV_plus_CP_Nov_30     EV_plus_CP_Nov_30.kallisto     EV        exp       D0

As I mentioned earlier, the batch effect seems worse (cluster tree separation is higher) at later dates, so we set Nov 30 as the base level for the date factor by giving it the alphabetically first factor level name, D0. Don't use just 0, 1, 2 unless you want to treat the date as a dosage effect (i.e. the hypothetical effect at Dec 2 would be exactly twice that at Dec 1).

I tried modelling just a straight date batch effect by adding date to the model.  This yielded 1000+ significant changes, but they were almost mirrors of the E545K effect (e.g. when E545K cells had a change of X for a transcript, the date factor usually had a change of -X).  This indicates that those E545K changes are artifactual, so we need to model the interaction of the date and cell line (date:cell_line). On with a new (correct) Sleuth analysis!

library(sleuth)
meta <- read.table("e_meta_swapped.tab", header=TRUE)
meta$path <- as.character(meta$path)

so <- sleuth_prep(meta, ~cell_line*treatment+date:cell_line+date)
so <- sleuth_fit(so)

Note that cell_line*treatment is R syntactic sugar for cell_line+treatment+cell_line:treatment. This gives us a model where gene expression is affected by 3 factors and two factor interactions.*
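
If you want to see exactly which coefficients Sleuth will estimate from this formula (and therefore which names to hand to sleuth_wt later), you can peek at the design matrix R builds from the same metadata:

colnames(model.matrix(~cell_line*treatment+date:cell_line+date, data=meta))
# should list: "(Intercept)", "cell_lineK", "treatmentexp", "dateD1", "dateD2",
#              "cell_lineK:treatmentexp", "cell_lineK:dateD1", "cell_lineK:dateD2"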

Now, one by one, we need to exclude each of the factors of the experiment from the linear model to test what transcripts are affected by the factor's exclusion.

so <- sleuth_fit(so, ~treatment+date, "no_cell_line")
so <- sleuth_fit(so, ~cell_line*treatment, "no_date")
so <- sleuth_fit(so, ~cell_line+date:cell_line+date, "no_treatment")
so <- sleuth_fit(so, ~cell_line+treatment+date:cell_line+date, "no_interaction")
so <- sleuth_fit(so, ~cell_line*treatment+date, "no_date_interaction")

The significance test of choice here is the Likelihood Ratio Test (LRT). Run it.

so <- sleuth_lrt(so, 'no_cell_line', 'full')
so <- sleuth_lrt(so, 'no_treatment', 'full')
so <- sleuth_lrt(so, 'no_interaction', 'full')
so <- sleuth_lrt(so, 'no_date', 'full')
so <- sleuth_lrt(so, 'no_date_interaction', 'full')

Grab the results, and filter to those with a qval (multiple-testing corrected p-value) of <.05.

results_table_cell_line_lrt <- sleuth_results(so, 'no_cell_line:full', test_type = 'lrt')
results_table_treatment_lrt <- sleuth_results(so, 'no_treatment:full', test_type = 'lrt')
results_table_interaction_lrt <- sleuth_results(so, 'no_interaction:full', test_type = 'lrt')
results_table_date_lrt <- sleuth_results(so, 'no_date:full', test_type = 'lrt')
results_table_date_interaction_lrt <- sleuth_results(so, 'no_date_interaction:full', test_type = 'lrt')

cell_line.lrt.sig_ids <- results_table_cell_line_lrt$target_id[which(results_table_cell_line_lrt$qval < 0.05)]
treatment.lrt.sig_ids <- results_table_treatment_lrt$target_id[which(results_table_treatment_lrt$qval < 0.05)]
interaction.lrt.sig_ids <- results_table_interaction_lrt$target_id[which(results_table_interaction_lrt$qval < 0.05)]
date.lrt.sig_ids <- results_table_date_lrt$target_id[which(results_table_date_lrt$qval < 0.05)]
date_interaction.lrt.sig_ids <- results_table_date_interaction_lrt$target_id[which(results_table_date_interaction_lrt$qval < 0.05)]


Now, let's make a (non-redundant) list of all of the target IDs that pass the LRT significance filter. I'm excluding the LRT test results for date and date:cell_line because they're huge, and I don't really care to report the specifics of those batch effects unless they are in genes that are also affected by the original factors of interest (cell line, treatment, and their interaction).

lrt_ids <- unique(c(cell_line.lrt.sig_ids, treatment.lrt.sig_ids, interaction.lrt.sig_ids))

The LRT test doesn't generate fold-change estimates for transcripts since it only considers the factors' presence, not their levels. We need to run a Wald test for each factor level (7 total since the date factor has 2 levels to contrast with base level D0).

so <- sleuth_wt(so, 'dateD1')
so <- sleuth_wt(so, 'dateD2')
so <- sleuth_wt(so, 'cell_lineK')
so <- sleuth_wt(so, 'cell_lineK:treatmentexp')
so <- sleuth_wt(so, 'treatmentexp')
so <- sleuth_wt(so, 'cell_lineK:dateD1')
so <- sleuth_wt(so, 'cell_lineK:dateD2')

Grab the results.

results_table_date2_wt <- sleuth_results(so, 'dateD2')
results_table_date1_wt <- sleuth_results(so, 'dateD1')
results_table_date1_int_wt <- sleuth_results(so, 'cell_lineK:dateD1')
results_table_date2_int_wt <- sleuth_results(so, 'cell_lineK:dateD2')
results_table_cell_line_wt <- sleuth_results(so, 'cell_lineK')
results_table_interaction_wt <- sleuth_results(so, 'cell_lineK:treatmentexp')
results_table_treatment_wt <- sleuth_results(so, 'treatmentexp')

Let's report out only the Wald test results of each factor contrast that correspond to the LRT-passing targets (transcripts) identified earlier in the list lrt_ids.

shared_results_cell_line <- results_table_cell_line_wt[results_table_cell_line_wt$target_id %in% lrt_ids,]
shared_results_treatment <- results_table_treatment_wt[results_table_treatment_wt$target_id %in% lrt_ids,]
shared_results_interaction <- results_table_interaction_wt[results_table_interaction_wt$target_id %in% lrt_ids,]
shared_results_date2_int <- results_table_date2_int_wt[results_table_date2_int_wt$target_id %in% lrt_ids,]
shared_results_date1_int <- results_table_date1_int_wt[results_table_date1_int_wt$target_id %in% lrt_ids,]
shared_results_date1 <- results_table_date1_wt[results_table_date1_wt$target_id %in% lrt_ids,]
shared_results_date2 <- results_table_date2_wt[results_table_date2_wt$target_id %in% lrt_ids,]

Since we want to report out all the factors together (i.e. influence of each factor for a transcript are reported on the same row), we need to sort each factor's Wald test results alphabetically by ID for consistency.

shared_results_cell_line <- shared_results_cell_line[sort.list(shared_results_cell_line[,1]),]
shared_results_treatment <- shared_results_treatment[sort.list(shared_results_treatment[,1]),]
shared_results_interaction <- shared_results_interaction[sort.list(shared_results_interaction[,1]),]
shared_results_date2_int <- shared_results_date2_int[sort.list(shared_results_date2_int[,1]),]
shared_results_date1_int <- shared_results_date1_int[sort.list(shared_results_date1_int[,1]),]
shared_results_date1 <- shared_results_date1[sort.list(shared_results_date1[,1]),]
shared_results_date2 <- shared_results_date2[sort.list(shared_results_date2[,1]),]

Write out all the LRT-significant data to a file. I could get fancy and splice out specific columns to keep, but that's a lot of syntax and it's easier to do that in Excel or with UNIX commands like cut.

write.csv(c(shared_results_cell_line,shared_results_treatment,shared_results_interaction,shared_results_date2_int,shared_results_date1_int,shared_results_date1,shared_results_date2), "e_kallisto_swapped_lrt_q0.05_max.csv")
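
For example, something along these lines (the column numbers here are arbitrary placeholders; check the header first to pick the ones you actually want):

head -1 e_kallisto_swapped_lrt_q0.05_max.csv | tr ',' '\n' | cat -n   # number the columns so you can pick
cut -d, -f1,3,4 e_kallisto_swapped_lrt_q0.05_max.csv > e_kallisto_swapped_slim.csv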

This looks a lot better! Now we have ~330 genes that show an interaction effect for the cell line and treatment, whereas before the swap and the cell-line-specific batch effect modelling we had nothing.  It took some work for sure, but a lot less than redoing the experiment and resequencing everything. This was a quick run to get some directions for grant applications based on pathway analysis etc., so overall patterns are what matter. They'll want to run follow-up experiments and qPCR on salient transcripts in the existing samples to verify the results.

_____________

*If you tried to include the treatment:date interaction and other factor combinations in the model with just these 12 samples, you're likely to get a nasty message like:
Error in solve.default(t(X) %*% X) : 

  Lapack routine dgesv: system is exactly singular: U[7,7] = 0

Which in essence means that you only have one sample for each combination of factors. That'd leave zero degrees of freedom for statistical tests. That's no good, to put it lightly. There may be an interaction effect of the treatment and date, but as per the distance matrix shown earlier, any such effect is negligible compared to the cell line batch effect, so we'll make do without modelling it.