February 9, 2017

Debugging RNASeq: sample swaps and batch effects

tl;dr If a standard differential expression analysis yields nothing for an experiment, look at the distance matrix between samples in an RNASeq experiment and see if anything looks out of place. Sometimes you can validate sample swaps when different cell lines are involved, and even strong batch effects can be modelled effectively to reveal the underlying factor effects you're expecting.

There's a particular sinking feeling you get when you do a differential gene expression analysis of a carefully planned RNASeq experiment and you get no significant changes. I got that recently, when analyzing a two-factor Latin Square experiment done in triplicate.  The experiment involved two different cell lines (factor 1, with levels E545K and EV), and a treatment (factor 2, with level plus and minus).  We had good reason to believe that there should be an interaction effect (factor 3) between the one of the cell lines and the treatment.  I ran a Kallisto + Sleuth analysis with a linear model of ~cell_line+treatment+cell_line:treatment, and an experiment metadata file like so:

sample                path                           cell_line treatment 
E545K_minus_CP_Dec_1  E545K_minus_CP_Dec_1.kallisto  K         ctl      
E545K_minus_CP_Dec_2  E545K_minus_CP_Dec_2.kallisto  K         ctl       
E545K_minus_CP_Nov_30 E545K_minus_CP_Nov_30.kallisto K         ctl       
E545K_plus_CP_Dec_1   E545K_plus_CP_Dec_1.kallisto   K         exp      
E545K_plus_CP_Dec_2   E545K_plus_CP_Dec_2.kallisto   K         exp       
E545K_plus_CP_Nov_30  E545K_plus_CP_Nov_30.kallisto  K         exp      
EV_minus_CP_Dec_1     EV_minus_CP_Dec_1.kallisto     EV        ctl      
EV_minus_CP_Dec_2     EV_minus_CP_Dec_2.kallisto     EV        ctl       
EV_minus_CP_Nov_30    EV_minus_CP_Nov_30.kallisto    EV        ctl      
EV_plus_CP_Dec_1      EV_plus_CP_Dec_1.kallisto      EV        exp      
EV_plus_CP_Dec_2      EV_plus_CP_Dec_2.kallisto      EV        exp      
EV_plus_CP_Nov_30     EV_plus_CP_Nov_30.kallisto     EV        exp      

Note that I called the E545K samples "K", because I'm lazy and want to use the default behaviour of R which is to make the alphabetically first factor value the base value, so reported ratios will be K:EV. As I said before, I got nothing significant...so I decided to run a bwa + DESeq2 analysis instead with the same model. That analysis generates a distance matrix like so:

[Note: you can glean similar information from the Sleuth PCA plots in the sleuth_live interface, but I often find them too cluttered] Probably the most important thing to look at on this chart is the cluster tree on the left hand side. First, most of the EV cell line samples cluster together at the bottom, with the one exception of EV_minus_CP_Dec2 way up with the first sample, E545K_plus_CP_Dec2. A single E545K sample, E545K_minus_CP_Dec2, clusters with the EV samples.  Maybe these two are swapped?

For supporting evidence, we can look in the mapped reads for the G>A variant that's specific to cell line E545K.  There's something you can't do with a gene expression microarray! (Talk about kicking a platform when it's down).
for b in E*.bam; do
  echo $b;
  samtools mpileup -r chr3:178936091-178936091 $b 2> /dev/null;

chr3 178936091 N 64 ggggggggggggggggggggggaggggggggggggggggggggggggggggggggggggg^Wg^Wg^Wg^Wg AEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEE/EEEEEEE/EEEEEEEEEEEEEEEEEEEE

chr3 178936091 N 20 gggggggggggggggggggg EEEAEEEEEEEEEE/EE<EE

chr3 178936091 N 77 ggggggggggggggggggggggggggggggggggggggggggggggggggggggaggggggggggggggggggggg^Wg AAAEEEEEEAEEA/EEEEEEE/EEEAEAEEEEEEEEEEEEAAEEEE/EEEEAAEEEEEEEEEEEEEE/E<<EAEEEE

chr3 178936091 N 33 gggggggggaggggggggggggagggggggggg AAAEEEEEEE<EEEE<E/EEEEEEEEAEE/EEE

chr3 178936091 N 56 ggggggggAgggggggggggggggggggggggggggggggggggggggaggggggg AEEEEEEEEEAEAEE/EEEEEEEEEEEEEAEAEEEEE/AE/EE<E/EEAEEEEE<E

chr3 178936091 N 36 ggagggggggaggggggggggggggggggggggggg E6EEEE/EEEE/AEEEEEAEEEE/EEAAEEEEEE//

chr3 178936091 N 30 g$ggggggggggggggggggggggggggggg AAAAEEEEEEEEEE/EEEEEAEEE/EEEA<

chr3 178936091 N 47 gggggggggagggggggaggggggggggaggggggggggggggggg^Wg AEAEEEEEEEEEEAEEEEEEEEEE/EEAEEEEEEEEAE6EEEE/AEE

chr3 178936091 N 52 g$ggggggggggggggggggggggggggggggggggggggggggggggggggg AAEEEAEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEE6EEEEAEEEE<EA/

chr3 178936091 N 31 gggggggggggggggggggggggggggggg^Wg AEEEEEE/EEEEEEEAEAEEEEEEE<A/EEE

chr3 178936091 N 30 gggggggggggggggggggggggggggggg AEEEEAEEEEEEEEAEEEEEEEEEE6<AAE

chr3 178936091 N 34 ggggggggGggggggggggggggggggggggggg AAAEEEE/EEEEEEEEA/EE/EEEEEEEEEEEA<

I've added in the DNA base highlights manually to show which samples have the variant. We'd expect it to be the first 6 normally (all the E545K... samples). The two that we suspect are swapped (also highlighted) defy the expectation: the EV sample has G>A, the E545K sample doesn't. Had the swap been between sample in the same cell line we wouldn't be able to use this evidence, but we're lucky I guess.

Rerunning the analysis with the samples swapped yields almost nothing.

The second thing to notice about the distance matrix is that the E545K samples (given the sample swap) pair off in the tree closely based on the date the sample was prepared, but the EV samples do not (treatment vs non is the primary clustering factor for those).  This suggests a strong batch effect that's specific to E545K cells, and is worse at later dates.  Let's swap the samples in the metadata file, and model the batch effect in a new Sleuth analysis in order to recover the real differential expression signal for the factors of interest: cell line, treatment and cell line/treatment interaction.  This means that our new metadata file (e_meta_swapped.tab) should look like this:

sample                path                           cell_line treatment date
E545K_minus_CP_Dec_1  E545K_minus_CP_Dec_1.kallisto  K         ctl       D1
E545K_minus_CP_Dec_2  EV_minus_CP_Dec_2.kallisto     K         ctl       D2
E545K_minus_CP_Nov_30 E545K_minus_CP_Nov_30.kallisto K         ctl       D0
E545K_plus_CP_Dec_1   E545K_plus_CP_Dec_1.kallisto   K         exp       D1
E545K_plus_CP_Dec_2   E545K_plus_CP_Dec_2.kallisto   K         exp       D2
E545K_plus_CP_Nov_30  E545K_plus_CP_Nov_30.kallisto  K         exp       D0
EV_minus_CP_Dec_1     EV_minus_CP_Dec_1.kallisto     EV        ctl       D1
EV_minus_CP_Dec_2     E545K_minus_CP_Dec_2.kallisto  EV        ctl       D2
EV_minus_CP_Nov_30    EV_minus_CP_Nov_30.kallisto    EV        ctl       D0
EV_plus_CP_Dec_1      EV_plus_CP_Dec_1.kallisto      EV        exp       D1
EV_plus_CP_Dec_2      EV_plus_CP_Dec_2.kallisto      EV        exp       D2
EV_plus_CP_Nov_30     EV_plus_CP_Nov_30.kallisto     EV        exp       D0

As I mentioned earlier, the batch effect seems worse (cluster tree separation is higher) at later dates, so we set Nov 30 as the base level for the date factor by giving it the alphabetically first factor level name, D0. Don't use just 0, 1, 2 unless you want to treat the date as a dosage effect (i.e. the hypotheical effect at Dec 2 should be exactly twice as much as at Dec 1).

I tried modelling just a straight date batch effect by adding date to the model.  This yielded 1000+ significant changes, but they were almost mirrors of the E545K effect (e.g. when E545K cells had a change of X for a transcript, the date factor usually had a change of -X).  This indicates that the E545K changes are artifactual, so we need to model the interaction of the date and cell line (date:cell_line). On with the a new (correct) Sleuth analysis!

meta <- read.table("e_meta_swapped.tab", header=TRUE)
meta$path <- as.character(meta$path)

so <- sleuth_prep(meta, ~cell_line*treatment+date:cell_line+date)
so <- sleuth_fit(so)

Note that cell_line*treatment is R syntactic sugar for cell_line+treatment+cell_line:treatment. This gives us a model where gene expression is affected by 3 factors and two factor interactions.*

Now, one by one, we need to exclude each of the factors of the experiment from the linear model to test what transcripts are affected by the factor's exclusion.

so <- sleuth_fit(so, ~treatment+date, "no_cell_line")
so <- sleuth_fit(so, ~cell_line*treatment, "no_date")
so <- sleuth_fit(so, ~cell_line+date:cell_line+date, "no_treatment")
so <- sleuth_fit(so, ~cell_line+treatment+date:cell_line+date, "no_interaction")
so <- sleuth_fit(so, ~cell_line*treatment+date, "no_date_interaction")

The significance test of choice here is the Likelihood Ratio Test (LRT). Run it.

so <- sleuth_lrt(so, 'no_cell_line', 'full')
so <- sleuth_lrt(so, 'no_treatment', 'full')
so <- sleuth_lrt(so, 'no_interaction', 'full')
so <- sleuth_lrt(so, 'no_date', 'full')
so <- sleuth_lrt(so, 'no_date_interaction', 'full')

Grab the results, and filter to those with a qval (multiple-testing corrected p-value) of <.05.

results_table_cell_line_lrt <- sleuth_results(so, 'no_cell_line:full', test_type = 'lrt')
results_table_treatment_lrt <- sleuth_results(so, 'no_treatment:full', test_type = 'lrt')
results_table_interaction_lrt <- sleuth_results(so, 'no_interaction:full', test_type = 'lrt')
results_table_date_lrt <- sleuth_results(so, 'no_date:full', test_type = 'lrt')
results_table_date_interaction_lrt <- sleuth_results(so, 'no_date_interaction:full', test_type = 'lrt')

cell_line.lrt.sig_ids <- results_table_cell_line_lrt$target_id[which(results_table_cell_line_lrt$qval < 0.05)]
treatment.lrt.sig_ids <- results_table_treatment_lrt$target_id[which(results_table_treatment_lrt$qval < 0.05)]
interaction.lrt.sig_ids <- results_table_interaction_lrt$target_id[which(results_table_interaction_lrt$qval < 0.05)]
date.lrt.sig_ids <- results_table_date_lrt$target_id[which(results_table_date_lrt$qval < 0.05)]
date_interaction.lrt.sig_ids <- results_table_date_lrt$target_id[which(results_table_date_lrt$qval < 0.05)]

Now, let's make a (non-redundant) list of all of the target IDs that pass the LRT significance filter. I'm excluding the LRT test results for date and date:cell_line because they're huge, and I don't really care to report the specifics of those batch effects unless they are in genes that are also affected by the original factors of interest (cell line, treatment, and their interaction).

lrt_ids <- unique(c(cell_line.lrt.sig_ids, treatment.lrt.sig_ids, interaction.lrt.sig_ids))

The LRT test doesn't generate fold-change estimates for transcripts since it only considers the factors' presence, not their levels. We need to run a Wald test for each factor level (7 total since the date factor has 2 levels to contrast with base level D0).

so <- sleuth_wt(so, 'dateD1')
so <- sleuth_wt(so, 'dateD2')
so <- sleuth_wt(so, 'cell_lineK')
so <- sleuth_wt(so, 'cell_lineK:treatmentexp')
so <- sleuth_wt(so, 'treatmentexp')
so <- sleuth_wt(so, 'cell_lineK:dateD1')
so <- sleuth_wt(so, 'cell_lineK:dateD2')

Grab the results.

results_table_date2_wt <- sleuth_results(so, 'dateD2')
results_table_date1_wt <- sleuth_results(so, 'dateD1')
results_table_date1_int_wt <- sleuth_results(so, 'cell_lineK:dateD1')
results_table_date2_int_wt <- sleuth_results(so, 'cell_lineK:dateD2')
results_table_cell_line_wt <- sleuth_results(so, 'cell_lineK')
results_table_interaction_wt <- sleuth_results(so, 'cell_lineK:treatmentexp')
results_table_treatment_wt <- sleuth_results(so, 'treatmentexp')

Let's report out only the Wald test results of each factor contrast that correspond to the LRT-passing targets (transcripts) identified earlier in the list "ids".

shared_results_cell_line <- results_table_cell_line_wt[results_table_cell_line_wt$target_id %in% lrt_ids,]
shared_results_treatment <- results_table_treatment_wt[results_table_treatment_wt$target_id %in% lrt_ids,]
shared_results_interaction <- results_table_interaction_wt[results_table_interaction_wt$target_id %in% lrt_ids,]
shared_results_date2_int <- results_table_date2_int_wt[results_table_date2_int_wt$target_id %in% lrt_ids,]
shared_results_date1_int <- results_table_date1_int_wt[results_table_date1_int_wt$target_id %in% lrt_ids,]
shared_results_date1 <- results_table_date1_wt[results_table_date1_wt$target_id %in% lrt_ids,]
shared_results_date2 <- results_table_date2_wt[results_table_date2_wt$target_id %in% lrt_ids,]

Since we want to report out all the factors together (i.e. influence of each factor for a transcript are reported on the same row), we need to sort each factor's Wald test results alphabetically by ID for consistency.

shared_results_cell_line <- shared_results_cell_line[sort.list(shared_results_cell_line[,1]),]
shared_results_treatment <- shared_results_treatment[sort.list(shared_results_treatment[,1]),]
shared_results_interaction <- shared_results_interaction[sort.list(shared_results_interaction[,1]),]
shared_results_date2_int <- shared_results_date2_int[sort.list(shared_results_date2_int[,1]),]
shared_results_date1_int <- shared_results_date1_int[sort.list(shared_results_date1_int[,1]),]
shared_results_date1 <- shared_results_date1[sort.list(shared_results_date1[,1]),]
shared_results_date2 <- shared_results_date2[sort.list(shared_results_date2[,1]),]

Write out all the LRT-significant data to a file. I could get fancy and splice out specific columns to keep, but that's a lot of syntax and it's easier to do that in Excel or with UNIX commands like cut.

write.csv(c(shared_results_cell_line,shared_results_treatment,shared_results_interaction,shared_results_date2_int,shared_results_date1_int,shared_results_date1,shared_results_date2), "e_kallisto_swapped_lrt_q0.05_max.csv")

This looks a lot better! Now we have ~330 genes that show an interaction effect for the cell line and treatment whereas before the swap and cell line specific batch effect modelling we had nothing.  It took some work for sure, but a lot less than redoing the experiment and resequencing everything. This was a quick run to get some directions for grant applications based on pathway analysis etc., so overall patterns are what matter. They'll want to run follow-up experiments and qPCR on salient transcripts in the existing samples to verify the results.


*If you tried to include the treatment:date interaction and other factor combinations in the model with these just 12 samples, you're likely to get a nasty message like:
Error in solve.default(t(X) %*% X) : 

  Lapack routine dgesv: system is exactly singular: U[7,7] = 0

Which in essence means that you only have one sample for each combination of factors. That'd leave zero degrees of freedom for statistical tests. That's no good, to put it lightly. There may be an interaction effect of the treatment and date, but as per the distance matrix shown earlier, any such effect is negligible compared to the cell line batch effect, so we'll make do without modelling it.

No comments:

Post a Comment