February 13, 2017

Modelling RNASeq like a car in traffic: using hybrid qualitative and quantitative factors in Sleuth to find elusive significant contrasts


tl;dr If you are using gene knockouts or knockdowns (RNAi) as experimental factors in RNASeq experiments, sensitivity to detect differential expression for ALL genes can be improved by simultaneously modelling both the qualitative (knockout -/+) and quantitative (transcripts per million) values for the gene(s) being manipulated. When adding qualitative factors, optimize the regression model by minimizing the residual sum of squares of the model fit. In the example below, we went from zero mRNA transcripts with significant contrasts for factor interactions (e.g. shRNA gene knockdown+treatment) to hundreds by using this hybrid factor modelling.

The best analogy that I can think of is to imagine that you are trying to determine several metrics of a motor vehicle (such as the fuel consumption rate, RPMs, oil temperature) of a car at any given moment, given only snapshots of the motor's speed (quantitative), and the type of road the vehicle is on (highway/city, qualitative). Either known factor by itself isn't a great indicator of the desired metrics since you could be stuck in rush hour traffic on the highway, or racing stop-and-go in the city.  Together, they provide a more predictive context for the other metrics (i.e. implicitly modelling the gear the car is in, which affects fuel economy and RPMs for given speeds). In the same way, the reaction of other genes to the levels of p53 and the shRNA are likely to be better implicitly modelled by knowing if there is a knockout/down (qualitative) AND what the level of that muting is (quantitative). 

____________

Recently our centre sequenced and analyzed a three factor RNASeq experiment done in triplicate, meaning there were 24 samples.  The first factor was two cell lines (one a p53 knockout, the other wild type, i.e. minus and plus), the second factor was a shRNA knockdown of a gene of interest or GFP, and the third was a drug treatment. At first, I did a straightforward Kallisto/Sleuth analysis of the factors and their interactions. There were a bunch of p53 knockout-specific changes, and shRNA-specific changes, but not really anything else (including no treatment responses).  It was time to break out the the sample distance matrix analysis using an orthogonal count method.  Here's the matrix:



There are a few things to point out here.  First, there is a batch effect, which is evident by the grouping (clustering tree on the left margin) of samples in bottom of the matrix by sample extraction date (D1, D2, D3). So I added the date to the model, as well as its interaction with other factors (this has worked out well previously).  So my metadata table looks like this:


sample path p53 sh treatment date
HCT116_p53_minus_shATM_DMSO_Dec_1 HCT116_p53_minus_shATM_DMSO_Dec_1.kallisto minus shATM ctl D2
HCT116_p53_minus_shATM_DMSO_Dec_2 HCT116_p53_minus_shATM_DMSO_Dec_2.kallisto minus shATM ctl D3
HCT116_p53_minus_shATM_DMSO_Nov_30 HCT116_p53_minus_shATM_DMSO_Nov_30.kallisto minus shATM ctl D1
HCT116_p53_minus_shATM_PARPi_Dec_1 HCT116_p53_minus_shATM_PARPi_Dec_1.kallisto minus shATM exp D2
HCT116_p53_minus_shATM_PARPi_Dec_2 HCT116_p53_minus_shATM_PARPi_Dec_2.kallisto minus shATM exp D3
HCT116_p53_minus_shATM_PARPi_Nov_30 HCT116_p53_minus_shATM_PARPi_Nov_30.kallisto minus shATM exp D1
HCT116_p53_minus_shGFP_DMSO_Dec_1 HCT116_p53_minus_shGFP_DMSO_Dec_1.kallisto minus GFP ctl D2
HCT116_p53_minus_shGFP_DMSO_Dec_2 HCT116_p53_minus_shGFP_DMSO_Dec_2.kallisto minus GFP ctl D3
HCT116_p53_minus_shGFP_DMSO_Nov_30 HCT116_p53_minus_shGFP_DMSO_Nov_30.kallisto minus GFP ctl D1
HCT116_p53_minus_shGFP_PARPi_Dec_1 HCT116_p53_minus_shGFP_PARPi_Dec_1.kallisto minus GFP exp D2
HCT116_p53_minus_shGFP_PARPi_Dec_2 HCT116_p53_minus_shGFP_PARPi_Dec_2.kallisto minus GFP exp D3
HCT116_p53_minus_shGFP_PARPi_Nov_30 HCT116_p53_minus_shGFP_PARPi_Nov_30.kallisto minus GFP exp D1
HCT116_p53_plus_shATM_DMSO_Dec_1 HCT116_p53_plus_shATM_DMSO_Dec_1.kallisto plus shATM ctl D2
HCT116_p53_plus_shATM_DMSO_Dec_2 HCT116_p53_plus_shATM_DMSO_Dec_2.kallisto plus shATM ctl D3
HCT116_p53_plus_shATM_DMSO_Nov_30 HCT116_p53_plus_shATM_DMSO_Nov_30.kallisto plus shATM ctl D1
HCT116_p53_plus_shATM_PARPi_Dec_1 HCT116_p53_plus_shATM_PARPi_Dec_1.kallisto plus shATM exp D2
HCT116_p53_plus_shATM_PARPi_Dec_2 HCT116_p53_plus_shATM_PARPi_Dec_2.kallisto plus shATM exp D3
HCT116_p53_plus_shATM_PARPi_Nov_30 HCT116_p53_plus_shATM_PARPi_Nov_30.kallisto plus shATM exp D1
HCT116_p53_plus_shGFP_DMSO_Dec_1 HCT116_p53_plus_shGFP_DMSO_Dec_1.kallisto plus GFP ctl D2
HCT116_p53_plus_shGFP_DMSO_Dec_2 HCT116_p53_plus_shGFP_DMSO_Dec_2.kallisto plus GFP ctl D3
HCT116_p53_plus_shGFP_DMSO_Nov_30 HCT116_p53_plus_shGFP_DMSO_Nov_30.kallisto plus GFP ctl D1
HCT116_p53_plus_shGFP_PARPi_Dec_1 HCT116_p53_plus_shGFP_PARPi_Dec_1.kallisto plus GFP exp D2
HCT116_p53_plus_shGFP_PARPi_Dec_2 HCT116_p53_plus_shGFP_PARPi_Dec_2.kallisto plus GFP exp D3
HCT116_p53_plus_shGFP_PARPi_Nov_30 HCT116_p53_plus_shGFP_PARPi_Nov_30.kallisto plus GFP exp D1


And the analysis looks like so:


library(sleuth)
meta_qual <- read.table("meta_qual.tab", header=TRUE)
meta_qual$path <- as.character(meta$path)
so_qual <- sleuth_prep(meta_qual, ~p53*sh+treatment+p53:treatment+p53:date+sh:date+sh:treatment+treatment:date+date)
reading in kallisto results
........................
normalizing est_counts
29870 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps

> so_qual <- sleuth_fit(so_qual)
fitting measurement error models
shrinkage estimation
Adding missing grouping variables: `x_group`
computing variance of betas

I'll exclude the Sleuth messages from hereon in for brevity. After running through a typical likelihood ratio test workflow for the factors, let's see how many genes are differentially expressed in each factor contrast (3) and their interactions (also 3, for a total of 6).  We modelled the date effect but aren't interested in reporting them, just accounting for them to improve the factors of interest.

> length(lrt_p53.sig_ids)
[1] 19472
> length(lrt_sh.sig_ids)
[1] 17970
> length(lrt_treatment.sig_ids)
[1] 28
> length(lrt_p53_sh_int.sig_ids)
[1] 9579
> length(lrt_p53_treatment_int.sig_ids)
[1] 4
> length(lrt_sh_treatment_int.sig_ids)
[1] 0


Okay, this is looking somewhat better.  We do have some treatment-specific changes now, and a bunch of p53:sh interaction changes.  Inspecting these, about half are mirror images of the p53 changes (gene +X change in p53 plus, -X in shATM), which suggests that we have confounding factors we haven't modelled yet. A closer inspection of the distance matrix produced earlier shows that there are four p53 minus samples that cluster quite closely with the p53 plus samples, including a mix of treated and untreated, and the rest of the shATM knockdowns are peppered semi-randomly in the p53 minus part of the tree. What are we missing?

To gauge how well we've modelled the driving factors of the gene expression, we can look at the residual sum of squares for the regression.  Minimization is optimization for this sum. Where can we find this?  A little digging shows that the so object has a fits member, which itself has named regression models.  By default, the model we first fitted is called "full". Each regression model contains a summary data frame.  Let's check it out the first row of that data frame:

> so_qual$fits$full$summary[1,]
           x_group      rss sigma_sq sigma_q_sq mean_obs  var_obs target_id[...]

1 (-0.000966,0.01] 29.05466 2.021473   1.206822 1.246232 3.421352 NM_000195[...]

That second column, "rss" is the Residual Sum of Squares we're after. Let's look at the sum across all the transcripts modelled:
> sum(so_qual$fits$full$summary[,2])
[1] 232757.7

If we're going to improve the regression model, which leads to better statistical power to pick up the interaction terms, we'll want to get a total RSS of less than 232758. RNAi knockdown is infamous for having highly variable efficiency, so maybe we should not be using a binary on/off factor to model it, but rather model using the actual expression value since we have that from the RNAseq data itself.  My Kallisto reference file uses RefSeq MRNA models, so I look up the main NM_####### ID for ATM, which can be found by clicking the RefSeq mRNA link on the right hand sidebar of the ATM RefGene page.  It's NM_000051.  The transcripts per million information is readily available as the last column in the Kallisto abundance files in each sample's output directory.


-bash-4.2$ grep NM_000051 *.kallisto/abundance.tsv
HCT116_p53_minus_shATM_DMSO_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 1882.6 4.93759
HCT116_p53_minus_shATM_DMSO_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 2618.66 5.9662
HCT116_p53_minus_shATM_DMSO_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 2392.05 4.00764
HCT116_p53_minus_shATM_PARPi_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 2212.67 4.54057
HCT116_p53_minus_shATM_PARPi_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 2345.31 5.07381
HCT116_p53_minus_shATM_PARPi_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 3151.77 5.03178
HCT116_p53_minus_shGFP_DMSO_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 3851.38 7.11942
HCT116_p53_minus_shGFP_DMSO_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 1981.88 4.0778
HCT116_p53_minus_shGFP_DMSO_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 2621.98 8.19509
HCT116_p53_minus_shGFP_PARPi_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 2458.8 3.66504
HCT116_p53_minus_shGFP_PARPi_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 2710.64 4.47078
HCT116_p53_minus_shGFP_PARPi_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 2243.22 5.37964
HCT116_p53_plus_shATM_DMSO_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 2441.1 4.88496
HCT116_p53_plus_shATM_DMSO_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 3736.36 4.75133
HCT116_p53_plus_shATM_DMSO_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 3259.82 6.86958
HCT116_p53_plus_shATM_PARPi_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 3175.35 4.85537
HCT116_p53_plus_shATM_PARPi_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 1790.03 4.44556
HCT116_p53_plus_shATM_PARPi_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 3625.5 6.01011
HCT116_p53_plus_shGFP_DMSO_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 1662.14 4.27169
HCT116_p53_plus_shGFP_DMSO_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 2445.64 5.1909
HCT116_p53_plus_shGFP_DMSO_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 2185.2 4.63558
HCT116_p53_plus_shGFP_PARPi_Dec_1.kallisto/abundance.tsv:NM_000051 13147 12968 1640.12 4.15573
HCT116_p53_plus_shGFP_PARPi_Dec_2.kallisto/abundance.tsv:NM_000051 13147 12968 2548.95 4.62631
HCT116_p53_plus_shGFP_PARPi_Nov_30.kallisto/abundance.tsv:NM_000051 13147 12968 1864.82 4.69346


I did the same for the p53 mRNA level (NM_000546), since there is no guarantee that p53 is evenly expressed in each of the samples. Now my sample metadata file looks like (meta_quant.tab):

sample path p53 sh treatment date
HCT116_p53_minus_shATM_DMSO_Dec_1 HCT116_p53_minus_shATM_DMSO_Dec_1.kallisto 1.07691 4.93759 ctl D2
HCT116_p53_minus_shATM_DMSO_Dec_2 HCT116_p53_minus_shATM_DMSO_Dec_2.kallisto 4.67624 5.9662 ctl D3
HCT116_p53_minus_shATM_DMSO_Nov_30 HCT116_p53_minus_shATM_DMSO_Nov_30.kallisto 1.17641 4.00764 ctl D1
HCT116_p53_minus_shATM_PARPi_Dec_1 HCT116_p53_minus_shATM_PARPi_Dec_1.kallisto 5.94978 4.54057 exp D2
HCT116_p53_minus_shATM_PARPi_Dec_2 HCT116_p53_minus_shATM_PARPi_Dec_2.kallisto 5.98334 5.07381 exp D3
HCT116_p53_minus_shATM_PARPi_Nov_30 HCT116_p53_minus_shATM_PARPi_Nov_30.kallisto 7.44437 5.03178 exp D1
HCT116_p53_minus_shGFP_DMSO_Dec_1 HCT116_p53_minus_shGFP_DMSO_Dec_1.kallisto 0.53949 7.11942 ctl D2
HCT116_p53_minus_shGFP_DMSO_Dec_2 HCT116_p53_minus_shGFP_DMSO_Dec_2.kallisto 8.92308 4.0778 ctl D3
HCT116_p53_minus_shGFP_DMSO_Nov_30 HCT116_p53_minus_shGFP_DMSO_Nov_30.kallisto 5.6365 8.19509 ctl D1
HCT116_p53_minus_shGFP_PARPi_Dec_1 HCT116_p53_minus_shGFP_PARPi_Dec_1.kallisto 10.7367 3.66504 exp D2
HCT116_p53_minus_shGFP_PARPi_Dec_2 HCT116_p53_minus_shGFP_PARPi_Dec_2.kallisto 10.1989 4.47078 exp D3
HCT116_p53_minus_shGFP_PARPi_Nov_30 HCT116_p53_minus_shGFP_PARPi_Nov_30.kallisto 16.878 5.37964 exp D1
HCT116_p53_plus_shATM_DMSO_Dec_1 HCT116_p53_plus_shATM_DMSO_Dec_1.kallisto 0.10751 4.88496 ctl D2
HCT116_p53_plus_shATM_DMSO_Dec_2 HCT116_p53_plus_shATM_DMSO_Dec_2.kallisto 0.55286 4.75133 ctl D3
HCT116_p53_plus_shATM_DMSO_Nov_30 HCT116_p53_plus_shATM_DMSO_Nov_30.kallisto 0.38625 6.86958 ctl D1
HCT116_p53_plus_shATM_PARPi_Dec_1 HCT116_p53_plus_shATM_PARPi_Dec_1.kallisto 0.77159 4.85537 exp D2
HCT116_p53_plus_shATM_PARPi_Dec_2 HCT116_p53_plus_shATM_PARPi_Dec_2.kallisto 0.82322 4.44556 exp D3
HCT116_p53_plus_shATM_PARPi_Nov_30 HCT116_p53_plus_shATM_PARPi_Nov_30.kallisto 0.59059 6.01011 exp D1
HCT116_p53_plus_shGFP_DMSO_Dec_1 HCT116_p53_plus_shGFP_DMSO_Dec_1.kallisto 0.52125 4.27169 ctl D2
HCT116_p53_plus_shGFP_DMSO_Dec_2 HCT116_p53_plus_shGFP_DMSO_Dec_2.kallisto 0.72088 5.1909 ctl D3
HCT116_p53_plus_shGFP_DMSO_Nov_30 HCT116_p53_plus_shGFP_DMSO_Nov_30.kallisto 0.39565 4.63558 ctl D1
HCT116_p53_plus_shGFP_PARPi_Dec_1 HCT116_p53_plus_shGFP_PARPi_Dec_1.kallisto 0.71229 4.15573 exp D2
HCT116_p53_plus_shGFP_PARPi_Dec_2 HCT116_p53_plus_shGFP_PARPi_Dec_2.kallisto 0.68656 4.62631 exp D3
HCT116_p53_plus_shGFP_PARPi_Nov_30 HCT116_p53_plus_shGFP_PARPi_Nov_30.kallisto 0.65503 4.69346 exp D1


Note that the p53 expression values are all over the place in the nominally p53 "minus" samples, which gives us some hope that quantitative modelling will help. Let's rerun the analysis, except this time the regression model will treat p53 and sh as dosage effects, which is the default for any numeric factor column in R.

> library(sleuth)
> meta_quant <- read.table("meta_quant.tab", header=TRUE)
> meta_quant$path <- as.character(meta$path)
> so_quant <- sleuth_prep(meta_quant, 
~p53*sh+treatment+p53:treatment+sh:treatment+p53:date+sh:date+treatment:date+date)
> so_quant <- sleuth_fit(so_quant)

Let's check if we improved the regression model by using the quantitative factors instead of qualitative.

> sum(so_quant$fits$full$summary[,2])
[1] 325008.8

Ouch, that's a lot more than our original 232757.7 RSS!  At this point I tried a few mathematical manipulations of the expression values to see if that'd help.  After all, maybe the dosage response is logarithmic rather than linear?  No need to run sleuth_prep() again, we can just try as many extra models as we like using sleuth_fit().

> so_quant <- sleuth_fit(so_quant, ~I(log(p53))*sh+treatment+I(log(p53)):treatment+I(log(p53)):date+sh:date+sh:treatment+treatment:date+date, "full_log_p53")


You'll note that I replaced p53 with I(log(p53)).  The I is a function that Inhibits Interpretation, allowing the logarithm function to be passed to the model rather than evaluated in the command call itself.  Did that help?

> sum(so_quant$fits$full_log_p53$summary[,2])
[1] 300397.4

A bit, but not close to the original yet.  Since their are a lot of values for p53 range from 0.108 to 1, we're getting a much big spread (-2.2, 0) in the logarithm of small values than in large values, so let's round up to 1, which will effectively make samples with <1 have a dosage value of 0 (because log(1) = 0). Let's also invert the sh quantity, which gives greater weight to small values, i.e. better knockdown.

> so_quant <- sleuth_fit(so, ~I(log(ceiling(p53)))*I(1/sh)+treatment+I(log(ceiling(p53))):treatment+I(log(ceiling(p53))):date+I(1/sh):date+I(1/sh):treatment+treatment:date+date, "full_p53_log_and_sh_reciprocal")

> sum(so$fits$full_p53_log_and_sh_reciprocal$summary[,2])
[1] 274487.9

Better, but still not as good as the original qualitative-only model.  This is reflected in the LRT results if we bother to do the normal downstream workflow: 

> length(lrt_p53.sig_ids)
[1] 13282
> length(lrt_sh.sig_ids)
[1] 14224
> length(lrt_treatment.sig_ids)
[1] 4
> length(lrt_treatment_sh_int.sig_ids)
[1] 2
> length(lrt_treatment_p53_int.sig_ids)
[1] 0
> length(lrt_sh_p53_int.sig_ids)
[1] 1


Various other manipulation didn't help much.  It then struck me that modelling both the qualitative and quantitative values simultaneously could help.  Let's combine the metadata and see, first modelling just the main p53 effect seen in the distance matrix.


> meta_hybrid <- meta_qual
> meta_hybrid$p53_quant <- meta_quant$p53

> meta_hybrid$sh_quant <- meta_quant$sh
> so_hybrid <- sleuth_prep(meta_hybrid, ~p53*sh+treatment+p53:treatment+p53:date+sh:date+sh:treatment+treatment:date+date+p53_quant)
> so_hybrid <- sleuth_fit(so_hybrid)
> sum(so_hybrid$fits$full$summary[,2])
[1] 193880.4

Woohoo!  That around a 20% drop in the residual sum of squares, and already we're seeing that the additional p53_quant factor is removing enough systematic bias to start making the interaction effects of the 3 factors significant.

> so_hybrid <- sleuth_fit(so_hybrid, ~p53*sh+treatment+p53:treatment+p53:date+sh:date+sh:treatment+treatment:date+date+I(log(ceiling(p53_quant))), "hybrid_log_p53")
[a bunch of boring code goes here, see below...]

> length(lrt_p53.sig_ids)
[1] 22346
> length(lrt_sh.sig_ids)
[1] 2634
> length(lrt_treatment.sig_ids)
[1] 9
> length(lrt_sh_treatment_int.sig_ids)
[1] 64
> length(lrt_p53_treatment_int.sig_ids)
[1] 17
> length(lrt_sh_p53_int.sig_ids)
[1] 2511

Let's add in the sh quantitative effect.

> so_hybrid <- sleuth_fit(so_hybrid, ~p53*sh+treatment+p53:treatment+p53:date+sh:date+sh:treatment+treatment:date+date+p53_quant+sh_quant, "full_sh")
> sum(so_hybrid$fits$full_sh$summary[,2])
[1] 150439.9

Excellent! We've managed to reduce the residuals by almost a third (~23K to ~15K).  Further mathematical manipulations of the quantitative factors didn't help, so let's assume this is as good as it gets continue with a standard LRT differential expression workflow from here.

> so_hybrid <- sleuth_fit(so_hybrid, ~sh+treatment+sh:date+sh:treatment+treatment:date+date+sh_quant, "no_p53")

> so_hybrid <- sleuth_fit(so_hybrid, ~p53+treatment+p53:date+p53:treatment+treatment:date+date+p53_quant, "no_sh")

> so_hybrid <- sleuth_fit(so_hybrid, ~p53*sh+p53:date+sh:date+date+p53_quant+sh_quant, "no_treatment")

> so_hybrid <- sleuth_fit(so_hybrid, ~p53*sh+treatment+p53:date+sh:date+p53:treatment+treatment:date+date+p53_quant+sh_quant, "no_sh_treatment_int")

> so_hybrid <- sleuth_fit(so_hybrid, ~p53*sh+treatment+p53:date+sh:date+sh:treatment+treatment:date+date+p53_quant+sh_quant, "no_p53_treatment_int")

> so_hybrid <- sleuth_fit(so_hybrid, ~p53+sh+treatment+p53:date+sh:date+p53:treatment+sh:treatment+treatment:date+date+p53_quant+sh_quant, "no_sh_p53_int")

> so_hybrid <- sleuth_lrt(so_hybrid, 'no_p53', 'full_sh')
> so_hybrid <- sleuth_lrt(so_hybrid, 'no_sh', 'full_sh')
> so_hybrid <- sleuth_lrt(so_hybrid, 'no_treatment', 'full_sh')
> so_hybrid <- sleuth_lrt(so_hybrid, 'no_p53_treatment_int' , 'full_sh')
> so_hybrid <- sleuth_lrt(so_hybrid, 'no_sh_treatment_int', 'full_sh')
> so_hybrid <- sleuth_lrt(so_hybrid, 'no_sh_p53_int', 'full_sh')
> lrt_p53 <- sleuth_results(so_hybrid, 'no_p53:full_sh', test_type = 'lrt')
> lrt_sh <- sleuth_results(so_hybrid, 'no_sh:full_sh', test_type = 'lrt')
> lrt_treatment <- sleuth_results(so_hybrid, 'no_treatment:full_sh', test_type = 'lrt')
> lrt_p53_treatment_int <- sleuth_results(so_hybrid, 'no_p53_treatment_int:full_sh', test_type = 'lrt')
> lrt_sh_treatment_int <- sleuth_results(so_hybrid, 'no_sh_treatment_int:full_sh', test_type = 'lrt')
> lrt_sh_p53_int <- sleuth_results(so_hybrid, 'no_sh_p53_int:full_sh', test_type = 'lrt')
> lrt_p53.sig_ids <- lrt_p53$target_id[which(lrt_p53$qval < 0.05)]
> lrt_sh.sig_ids <- lrt_sh$target_id[which(lrt_sh$qval < 0.05)]
> lrt_treatment.sig_ids <- lrt_treatment$target_id[which(lrt_treatment$qval < 0.05)]
> lrt_sh_treatment_int.sig_ids <- lrt_sh_treatment_int$target_id[which(lrt_sh_treatment_int$qval < 0.05)]
> lrt_p53_treatment_int.sig_ids <- lrt_p53_treatment_int$target_id[which(lrt_p53_treatment_int$qval < 0.05)]
> lrt_sh_p53_int.sig_ids <- lrt_sh_p53_int$target_id[which(lrt_sh_p53_int$qval < 0.05)]

Now, let's look how this improved model has affected the factor contrast lists under the LRT test.

> length(lrt_p53.sig_ids)
[1] 23099
> length(lrt_sh.sig_ids)
[1] 13483
> length(lrt_treatment.sig_ids)
[1] 164
> length(lrt_sh_treatment_int.sig_ids)
[1] 291
> length(lrt_p53_treatment_int.sig_ids)
[1] 161
> length(lrt_sh_p53_int.sig_ids)
[1] 5515

This is a lot more in line with what we know from previous experiments and Western blots, etc.  Important to note is that we got significant changes in our original qualitative factors (p53 minus/plus and shGFP/ATM) by reducing the confounding expression "noise" with the quantitative models. Note that if you run an LRT test on a quantitative factor in Sleuth, it makes an assumption of normally distributed values for abundance in each sample, which is unlikely to be true unless you've very carefully chosen your samples. Let's go on and run the Wald tests so we can get the betas (~natural logarithm fold change), which IS something we can model for the quantitative traits on a per-quantitative-unit basis if we picked the right quantitative effect (linear, log, etc.).

> so_hybrid <- sleuth_wt(so_hybrid, 'p53plus')
> so_hybrid <- sleuth_wt(so_hybrid, 'shshATM')
> so_hybrid <- sleuth_wt(so_hybrid, 'treatmentexp')
> so_hybrid <- sleuth_wt(so_hybrid, 'p53_quant')
> so_hybrid <- sleuth_wt(so_hybrid, 'sh_quant')
> so_hybrid <- sleuth_wt(so_hybrid, 'shshATM:treatmentexp')
> so_hybrid <- sleuth_wt(so_hybrid, 'p53plus:shshATM')
> so_hybrid <- sleuth_wt(so_hybrid, 'p53plus:treatmentexp')
> wt_p53 <- sleuth_results(so_hybrid, 'p53plus')
> wt_sh <- sleuth_results(so_hybrid, 'shshATM')
> wt_treatment <- sleuth_results(so_hybrid, 'treatmentexp')
> wt_p53_quant <- sleuth_results(so_hybrid, 'p53_quant')
> wt_sh_quant <- sleuth_results(so_hybrid, 'sh_quant')
> wt_sh_treatment_int <- sleuth_results(so_hybrid, 'shshATM:treatmentexp')
> wt_p53_sh_int <- sleuth_results(so_hybrid, 'p53plus:shshATM')
> wt_p53_treatment_int <- sleuth_results(so_hybrid, 'p53plus:treatmentexp')


Now let's keep the Wald test results that are also in the LRT results, and reorder them so everyone's got the their results for the same targets (RefSeq mRNA IDs) alphabetically, then print the important result bits (ID, WT and LRT q-values, fold changes for each factor & interaction) to a file.


> lrt_ids <- unique(lrt_p53.sig_ids, lrt_sh.sig_ids, lrt_treatment.sig_ids, lrt_sh_treatment_int.sig_ids, lrt_p53_treatment_int.sig_ids, lrt_sh_p53_int.sig_ids)
> lrt_shared_p53 <- lrt_p53[lrt_p53$target_id %in% lrt_ids,]
> lrt_shared_sh <- lrt_sh[lrt_sh$target_id %in% lrt_ids,]
> lrt_shared_treatment <- lrt_treatment[lrt_treatment$target_id %in% lrt_ids,]
> lrt_shared_p53_quant <- lrt_treatment[lrt_p53_quant$target_id %in% lrt_ids,]
> lrt_shared_sh_quant <- lrt_treatment[lrt_sh_quant$target_id %in% lrt_ids,]
> lrt_shared_sh_treatment_int <- lrt_sh_treatment_int[lrt_sh_treatment_int$target_id %in% lrt_ids,]
> lrt_shared_p53_treatment_int <- lrt_sh_treatment_int[lrt_p53_treatment_int$target_id %in% lrt_ids,]
> lrt_shared_sh_p53_int <- lrt_p53_sh_int[lrt_p53_sh_int$target_id %in% lrt_ids,]
> lrt_shared_p53 <- lrt_shared_p53[sort.list(lrt_shared_p53[,1]),]
> lrt_shared_sh <- lrt_shared_sh[sort.list(lrt_shared_sh[,1]),]
> lrt_shared_treatment <- lrt_shared_treatment[sort.list(lrt_shared_treatment[,1]),]
> lrt_shared_p53_quant <- lrt_shared_p53_quant[sort.list(lrt_shared_p53_quant[,1]),]
> lrt_shared_sh_quant <- lrt_shared_sh_quant[sort.list(lrt_shared_sh_quant[,1]),]
> lrt_shared_sh_treatment_int <- lrt_shared_sh_treatment_int[sort.list(lrt_shared_sh_treatment_int[,1]),]
> lrt_shared_p53_treatment_int <- lrt_shared_p53_treatment_int[sort.list(lrt_shared_p53_treatment_int[,1]),]
> lrt_shared_sh_p53_int <- lrt_shared_sh_p53_int[sort.list(lrt_shared_sh_p53_int[,1]),]

> wt_shared_p53 <- wt_p53[wt_p53$target_id %in% lrt_ids,]
> wt_shared_sh <- wt_sh[wt_sh$target_id %in% lrt_ids,]
> wt_shared_treatment <- wt_treatment[wt_treatment$target_id %in% lrt_ids,]
> wt_shared_p53_quant <- wt_treatment[wt_p53_quant$target_id %in% lrt_ids,]
> wt_shared_sh_quant <- wt_treatment[wt_sh_quant$target_id %in% lrt_ids,]
> wt_shared_sh_treatment_int <- wt_sh_treatment_int[wt_sh_treatment_int$target_id %in% lrt_ids,]
> wt_shared_p53_treatment_int <- wt_sh_treatment_int[wt_p53_treatment_int$target_id %in% lrt_ids,]
> wt_shared_sh_p53_int <- wt_p53_sh_int[wt_p53_sh_int$target_id %in% lrt_ids,]
> wt_shared_p53 <- wt_shared_p53[sort.list(wt_shared_p53[,1]),]
> wt_shared_sh <- wt_shared_sh[sort.list(wt_shared_sh[,1]),]
> wt_shared_treatment <- wt_shared_treatment[sort.list(wt_shared_treatment[,1]),]
> wt_shared_p53_quant <- wt_shared_p53_quant[sort.list(wt_shared_p53_quant[,1]),]
> wt_shared_sh_quant <- wt_shared_sh_quant[sort.list(wt_shared_sh_quant[,1]),]
> wt_shared_sh_treatment_int <- wt_shared_sh_treatment_int[sort.list(wt_shared_sh_treatment_int[,1]),]
> wt_shared_p53_treatment_int <- wt_shared_p53_treatment_int[sort.list(wt_shared_p53_treatment_int[,1]),]
> wt_shared_sh_p53_int <- wt_shared_sh_p53_int[sort.list(wt_shared_sh_p53_int[,1]),]
> combined_result <- data.frame(lrt_shared_p53[,"target_id"],lrt_shared_p53[,"qval"],wt_shared_p53[,"b"],wt_shared_p53[,"qval"],wt_shared_p53_quant[,"b"],wt_shared_p53_quant[,"qval"],lrt_shared_sh[,"qval"],wt_shared_sh[,"b"],wt_shared_sh[,"qval"],wt_shared_sh_quant[,"b"],wt_shared_sh_quant[,"qval"],lrt_shared_treatment[,"qval"],wt_shared_treatment[,"b"],wt_shared_treatment[,"qval"],lrt_shared_sh_p53_int[,"qval"],wt_shared_sh_p53_int[,"b"],wt_shared_sh_p53_int[,"qval"],lrt_shared_p53_treatment_int[,"qval"],wt_shared_p53_treatment_int[,"b"],wt_shared_p53_treatment_int[,"qval"],lrt_shared_sh_treatment_int[,"qval"],wt_shared_sh_treatment_int[,"b"],wt_shared_sh_treatment_int[,"qval"])
> write.csv(combined_result, file="hct_hybrid_dge_lrt_q_0.05.csv")

The results are encouraging. Most of the sh:p53 interactions that were mirroring the p53 changes have gone away, and most of the LRT-significant changes in the interaction terms aren't mirrors either. We've dealt with the batch effects, and the variability of gene expression in knockdowns and knockouts pretty effectively.  You'll note that the q-values for the LRT tests are better than the Wald test in situations like this with so many factors in the regression model, because the LRT test has greater power.  Not all the genes listed as significant in the LRT are going to be real, but it's certainly a good starting point for pathway enrichment analysis, etc., and the Wald test results with qval < 0.05 can be taken as more of a sure thing.


No comments:

Post a Comment