tl;dr If a standard differential expression analysis yields nothing for an experiment, look at the distance matrix between samples in an RNASeq experiment and see if anything looks out of place. Sometimes you can validate sample swaps when different cell lines are involved, and even strong batch effects can be modelled effectively to reveal the underlying factor effects you're expecting.

______________

There's a particular sinking feeling you get when you do a differential gene expression analysis of a carefully planned RNASeq experiment and you get no significant changes. I got that recently, when analyzing a two-factor Latin Square experiment done in triplicate. The experiment involved two different cell lines (factor 1, with levels E545K and EV), and a treatment (factor 2, with level plus and minus). We had good reason to believe that there should be an interaction effect (factor 3) between the one of the cell lines and the treatment. I ran a Kallisto + Sleuth analysis with a linear model of ~cell_line+treatment+cell_line:treatment, and an experiment metadata file like so:sample path cell_line treatment

E545K_minus_CP_Dec_1 E545K_minus_CP_Dec_1.kallisto K ctl

E545K_minus_CP_Dec_2 E545K_minus_CP_Dec_2.kallisto K ctl

E545K_minus_CP_Nov_30 E545K_minus_CP_Nov_30.kallisto K ctl

E545K_plus_CP_Dec_1 E545K_plus_CP_Dec_1.kallisto K exp

E545K_plus_CP_Dec_2 E545K_plus_CP_Dec_2.kallisto K exp

E545K_plus_CP_Nov_30 E545K_plus_CP_Nov_30.kallisto K exp

EV_minus_CP_Dec_1 EV_minus_CP_Dec_1.kallisto EV ctl

EV_minus_CP_Dec_2 EV_minus_CP_Dec_2.kallisto EV ctl

EV_minus_CP_Nov_30 EV_minus_CP_Nov_30.kallisto EV ctl

EV_plus_CP_Dec_1 EV_plus_CP_Dec_1.kallisto EV exp

EV_plus_CP_Dec_2 EV_plus_CP_Dec_2.kallisto EV exp

EV_plus_CP_Nov_30 EV_plus_CP_Nov_30.kallisto EV exp

Note that I called the E545K samples "K", because I'm lazy and want to use the default behaviour of R which is to make the alphabetically first factor value the base value, so reported ratios will be K:EV. As I said before, I got nothing significant...so I decided to run a bwa + DESeq2 analysis instead with the same model. That analysis generates a distance matrix like so:

[Note: you can glean similar information from the Sleuth PCA plots in the sleuth_live interface, but I often find them too cluttered] Probably the most important thing to look at on this chart is the cluster tree on the left hand side. First, most of the EV cell line samples cluster together at the bottom, with the one exception of EV_minus_CP_Dec2 way up with the first sample, E545K_plus_CP_Dec2. A single E545K sample, E545K_minus_CP_Dec2, clusters with the EV samples. Maybe these two are swapped?

for b in E*.bam; do

echo $b;

samtools mpileup -r chr3:178936091-178936091 $b 2> /dev/null;

done

E545K_minus_CP_Dec_1.bam

chr3 178936091 N 64 ggggggggggggggggggggggaggggggggggggggggggggggggggggggggggggg^Wg^Wg^Wg^Wg AEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEE/EEEEEEE/EEEEEEEEEEEEEEEEEEEE

**E545K_minus_CP_Dec_2.bam**

chr3 178936091 N 20 gggggggggggggggggggg EEEAEEEEEEEEEE/EE<EE

E545K_minus_CP_Nov_30.bam

chr3 178936091 N 77 ggggggggggggggggggggggggggggggggggggggggggggggggggggggaggggggggggggggggggggg^Wg AAAEEEEEEAEEA/EEEEEEE/EEEAEAEEEEEEEEEEEEAAEEEE/EEEEAAEEEEEEEEEEEEEE/E<<EAEEEE

E545K_plus_CP_Dec_1.bam

chr3 178936091 N 33 gggggggggaggggggggggggagggggggggg AAAEEEEEEE<EEEE<E/EEEEEEEEAEE/EEE

E545K_plus_CP_Dec_2.bam

chr3 178936091 N 56 ggggggggAgggggggggggggggggggggggggggggggggggggggaggggggg AEEEEEEEEEAEAEE/EEEEEEEEEEEEEAEAEEEEE/AE/EE<E/EEAEEEEE<E

E545K_plus_CP_Nov_30.bam

chr3 178936091 N 36 ggagggggggaggggggggggggggggggggggggg E6EEEE/EEEE/AEEEEEAEEEE/EEAAEEEEEE//

EV_minus_CP_Dec_1.bam

chr3 178936091 N 30 g$ggggggggggggggggggggggggggggg AAAAEEEEEEEEEE/EEEEEAEEE/EEEA<

**EV_minus_CP_Dec_2.bam**

chr3 178936091 N 47 gggggggggagggggggaggggggggggaggggggggggggggggg^Wg AEAEEEEEEEEEEAEEEEEEEEEE/EEAEEEEEEEEAE6EEEE/AEE

EV_minus_CP_Nov_30.bam

chr3 178936091 N 52 g$ggggggggggggggggggggggggggggggggggggggggggggggggggg AAEEEAEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEE6EEEEAEEEE<EA/

EV_plus_CP_Dec_1.bam

chr3 178936091 N 31 gggggggggggggggggggggggggggggg^Wg AEEEEEE/EEEEEEEAEAEEEEEEE<A/EEE

EV_plus_CP_Dec_2.bam

chr3 178936091 N 30 gggggggggggggggggggggggggggggg AEEEEAEEEEEEEEAEEEEEEEEEE6<AAE

EV_plus_CP_Nov_30.bam

chr3 178936091 N 34 ggggggggGggggggggggggggggggggggggg AAAEEEE/EEEEEEEEA/EE/EEEEEEEEEEEA<

Rerunning the analysis with the samples swapped yields

Note that cell_line*treatment is R syntactic sugar for cell_line+treatment+cell_line:treatment. This gives us a model where gene expression is affected by 3 factors and two factor interactions.*

Now, one by one, we need to exclude each of the factors of the experiment from the linear model to test what transcripts are affected by the factor's exclusion.

The significance test of choice here is the Likelihood Ratio Test (LRT). Run it.

Grab the results, and filter to those with a qval (multiple-testing corrected p-value) of <.05.

date_interaction.lrt.sig_ids <- results_table_date_lrt$target_id[which(results_table_date_lrt$qval < 0.05)]

Now, let's make a (non-redundant) list of all of the target IDs that pass the LRT significance filter. I'm excluding the LRT test results for date and date:cell_line because they're huge, and I don't really care to report the specifics of those batch effects unless they are in genes that are also affected by the original factors of interest (cell line, treatment, and their interaction).

The LRT test doesn't generate fold-change estimates for transcripts since it only considers the factors' presence, not their levels. We need to run a Wald test for each factor level (7 total since the date factor has 2 levels to contrast with base level D0).

Grab the results.

Let's report out only the Wald test results of each factor contrast that correspond to the LRT-passing targets (transcripts) identified earlier in the list "ids".

This looks a lot better! Now we have ~330 genes that show an interaction effect for the cell line and treatment whereas before the swap and cell line specific batch effect modelling we had nothing. It took some work for sure, but a lot less than redoing the experiment and resequencing everything. This was a quick run to get some directions for grant applications based on pathway analysis etc., so overall patterns are what matter. They'll want to run follow-up experiments and qPCR on salient transcripts in the existing samples to verify the results.

_____________

*If you tried to include the treatment:date interaction and other factor combinations in the model with these just 12 samples, you're likely to get a nasty message like:

Which in essence means that you only have one sample for each combination of factors. That'd leave zero degrees of freedom for statistical tests. That's no good, to put it lightly. There may be an interaction effect of the treatment and date, but as per the distance matrix shown earlier, any such effect is negligible compared to the cell line batch effect, so we'll make do without modelling it.

*almost*nothing.
The second thing to notice about the distance matrix is that the E545K samples (given the sample swap) pair off in the tree closely based on the date the sample was prepared, but the EV samples do not (treatment vs non is the primary clustering factor for those). This suggests a strong batch effect that's specific to E545K cells, and is worse at later dates. Let's swap the samples in the metadata file, and model the batch effect in a new Sleuth analysis in order to recover the real differential expression signal for the factors of interest: cell line, treatment and cell line/treatment interaction. This means that our new metadata file (e_meta_swapped.tab) should look like this:

sample path cell_line treatment date

E545K_minus_CP_Dec_1 E545K_minus_CP_Dec_1.kallisto K ctl D1

E545K_minus_CP_Dec_2 EV_minus_CP_Dec_2.kallisto K ctl D2

E545K_minus_CP_Nov_30 E545K_minus_CP_Nov_30.kallisto K ctl D0

E545K_plus_CP_Dec_1 E545K_plus_CP_Dec_1.kallisto K exp D1

E545K_plus_CP_Dec_2 E545K_plus_CP_Dec_2.kallisto K exp D2

E545K_plus_CP_Nov_30 E545K_plus_CP_Nov_30.kallisto K exp D0

EV_minus_CP_Dec_1 EV_minus_CP_Dec_1.kallisto EV ctl D1

EV_minus_CP_Dec_2 E545K_minus_CP_Dec_2.kallisto EV ctl D2

EV_minus_CP_Nov_30 EV_minus_CP_Nov_30.kallisto EV ctl D0

EV_plus_CP_Dec_1 EV_plus_CP_Dec_1.kallisto EV exp D1

EV_plus_CP_Dec_2 EV_plus_CP_Dec_2.kallisto EV exp D2

EV_plus_CP_Nov_30 EV_plus_CP_Nov_30.kallisto EV exp D0

As I mentioned earlier, the batch effect seems worse (cluster tree separation is higher) at later dates, so we set Nov 30 as the base level for the date factor by giving it the alphabetically first factor level name, D0. Don't use just 0, 1, 2 unless you want to treat the date as a dosage effect (i.e. the hypotheical effect at Dec 2 should be exactly twice as much as at Dec 1).

I tried modelling just a straight date batch effect by adding date to the model. This yielded 1000+ significant changes, but they were almost mirrors of the E545K effect (e.g. when E545K cells had a change of X for a transcript, the date factor usually had a change of -X). This indicates that the E545K changes are artifactual, so we need to model the interaction of the date and cell line (date:cell_line). On with the a new (correct) Sleuth analysis!

library(sleuth)

E545K_minus_CP_Dec_1 E545K_minus_CP_Dec_1.kallisto K ctl D1

E545K_minus_CP_Dec_2 EV_minus_CP_Dec_2.kallisto K ctl D2

E545K_minus_CP_Nov_30 E545K_minus_CP_Nov_30.kallisto K ctl D0

E545K_plus_CP_Dec_1 E545K_plus_CP_Dec_1.kallisto K exp D1

E545K_plus_CP_Dec_2 E545K_plus_CP_Dec_2.kallisto K exp D2

E545K_plus_CP_Nov_30 E545K_plus_CP_Nov_30.kallisto K exp D0

EV_minus_CP_Dec_1 EV_minus_CP_Dec_1.kallisto EV ctl D1

EV_minus_CP_Dec_2 E545K_minus_CP_Dec_2.kallisto EV ctl D2

EV_minus_CP_Nov_30 EV_minus_CP_Nov_30.kallisto EV ctl D0

EV_plus_CP_Dec_1 EV_plus_CP_Dec_1.kallisto EV exp D1

EV_plus_CP_Dec_2 EV_plus_CP_Dec_2.kallisto EV exp D2

EV_plus_CP_Nov_30 EV_plus_CP_Nov_30.kallisto EV exp D0

As I mentioned earlier, the batch effect seems worse (cluster tree separation is higher) at later dates, so we set Nov 30 as the base level for the date factor by giving it the alphabetically first factor level name, D0. Don't use just 0, 1, 2 unless you want to treat the date as a dosage effect (i.e. the hypotheical effect at Dec 2 should be exactly twice as much as at Dec 1).

I tried modelling just a straight date batch effect by adding date to the model. This yielded 1000+ significant changes, but they were almost mirrors of the E545K effect (e.g. when E545K cells had a change of X for a transcript, the date factor usually had a change of -X). This indicates that the E545K changes are artifactual, so we need to model the interaction of the date and cell line (date:cell_line). On with the a new (correct) Sleuth analysis!

library(sleuth)

meta <- read.table("e_meta_swapped.tab", header=TRUE)

meta$path <- as.character(meta$path)

so <- sleuth_prep(meta, ~cell_line*treatment+date:cell_line+date)

so <- sleuth_fit(so)

Now, one by one, we need to exclude each of the factors of the experiment from the linear model to test what transcripts are affected by the factor's exclusion.

so <- sleuth_fit(so, ~treatment+date, "no_cell_line")

so <- sleuth_fit(so, ~cell_line*treatment, "no_date")

so <- sleuth_fit(so, ~cell_line+date:cell_line+date, "no_treatment")

so <- sleuth_fit(so, ~cell_line+treatment+date:cell_line+date, "no_interaction")

so <- sleuth_fit(so, ~cell_line*treatment+date, "no_date_interaction")

The significance test of choice here is the Likelihood Ratio Test (LRT). Run it.

so <- sleuth_lrt(so, 'no_cell_line', 'full')

so <- sleuth_lrt(so, 'no_treatment', 'full')

so <- sleuth_lrt(so, 'no_interaction', 'full')

so <- sleuth_lrt(so, 'no_date', 'full')

so <- sleuth_lrt(so, 'no_date_interaction', 'full')

results_table_cell_line_lrt <- sleuth_results(so, 'no_cell_line:full', test_type = 'lrt')

results_table_treatment_lrt <- sleuth_results(so, 'no_treatment:full', test_type = 'lrt')

results_table_interaction_lrt <- sleuth_results(so, 'no_interaction:full', test_type = 'lrt')

results_table_date_lrt <- sleuth_results(so, 'no_date:full', test_type = 'lrt')

results_table_date_interaction_lrt <- sleuth_results(so, 'no_date_interaction:full', test_type = 'lrt')

cell_line.lrt.sig_ids <- results_table_cell_line_lrt$target_id[which(results_table_cell_line_lrt$qval < 0.05)]

treatment.lrt.sig_ids <- results_table_treatment_lrt$target_id[which(results_table_treatment_lrt$qval < 0.05)]

interaction.lrt.sig_ids <- results_table_interaction_lrt$target_id[which(results_table_interaction_lrt$qval < 0.05)]

date.lrt.sig_ids <- results_table_date_lrt$target_id[which(results_table_date_lrt$qval < 0.05)]date_interaction.lrt.sig_ids <- results_table_date_lrt$target_id[which(results_table_date_lrt$qval < 0.05)]

lrt_ids <- unique(c(cell_line.lrt.sig_ids, treatment.lrt.sig_ids, interaction.lrt.sig_ids))

so <- sleuth_wt(so, 'dateD1')

so <- sleuth_wt(so, 'dateD2')

so <- sleuth_wt(so, 'cell_lineK')

so <- sleuth_wt(so, 'cell_lineK:treatmentexp')

so <- sleuth_wt(so, 'treatmentexp')

so <- sleuth_wt(so, 'cell_lineK:dateD1')

so <- sleuth_wt(so, 'cell_lineK:dateD2')

results_table_date2_wt <- sleuth_results(so, 'dateD2')

results_table_date1_wt <- sleuth_results(so, 'dateD1')

results_table_date1_int_wt <- sleuth_results(so, 'cell_lineK:dateD1')

results_table_date2_int_wt <- sleuth_results(so, 'cell_lineK:dateD2')

results_table_cell_line_wt <- sleuth_results(so, 'cell_lineK')

results_table_interaction_wt <- sleuth_results(so, 'cell_lineK:treatmentexp')

results_table_treatment_wt <- sleuth_results(so, 'treatmentexp')

shared_results_cell_line <- results_table_cell_line_wt[results_table_cell_line_wt$target_id %in% lrt_ids,]

shared_results_treatment <- results_table_treatment_wt[results_table_treatment_wt$target_id %in% lrt_ids,]

shared_results_interaction <- results_table_interaction_wt[results_table_interaction_wt$target_id %in% lrt_ids,]

shared_results_date2_int <- results_table_date2_int_wt[results_table_date2_int_wt$target_id %in% lrt_ids,]

shared_results_date1_int <- results_table_date1_int_wt[results_table_date1_int_wt$target_id %in% lrt_ids,]

shared_results_date1 <- results_table_date1_wt[results_table_date1_wt$target_id %in% lrt_ids,]

shared_results_date2 <- results_table_date2_wt[results_table_date2_wt$target_id %in% lrt_ids,]

Since we want to report out all the factors together (i.e. influence of each factor for a transcript are reported on the same row), we need to sort each factor's Wald test results alphabetically by ID for consistency.
shared_results_cell_line <- shared_results_cell_line[sort.list(shared_results_cell_line[,1]),]

shared_results_treatment <- shared_results_treatment[sort.list(shared_results_treatment[,1]),]

shared_results_interaction <- shared_results_interaction[sort.list(shared_results_interaction[,1]),]

shared_results_date2_int <- shared_results_date2_int[sort.list(shared_results_date2_int[,1]),]

shared_results_date1_int <- shared_results_date1_int[sort.list(shared_results_date1_int[,1]),]

shared_results_date1 <- shared_results_date1[sort.list(shared_results_date1[,1]),]

shared_results_date2 <- shared_results_date2[sort.list(shared_results_date2[,1]),]

Write out all the LRT-significant data to a file. I could get fancy and splice out specific columns to keep, but that's a lot of syntax and it's easier to do that in Excel or with UNIX commands like cut.

write.csv(c(shared_results_cell_line,shared_results_treatment,shared_results_interaction,shared_results_date2_int,shared_results_date1_int,shared_results_date1,shared_results_date2), "e_kallisto_swapped_lrt_q0.05_max.csv")

This looks a lot better! Now we have ~330 genes that show an interaction effect for the cell line and treatment whereas before the swap and cell line specific batch effect modelling we had nothing. It took some work for sure, but a lot less than redoing the experiment and resequencing everything. This was a quick run to get some directions for grant applications based on pathway analysis etc., so overall patterns are what matter. They'll want to run follow-up experiments and qPCR on salient transcripts in the existing samples to verify the results.

_____________

*If you tried to include the treatment:date interaction and other factor combinations in the model with these just 12 samples, you're likely to get a nasty message like:

Error in solve.default(t(X) %*% X) :

Lapack routine dgesv: system is exactly singular: U[7,7] = 0

## No comments:

## Post a Comment