February 2, 2016

Are you using DESeq2 correctly for Illumina RNASeq?

tl;dr If you have single end Illumina RNASeq data the reads are antisense. You cannot use DESeq2 to analyze the alignment files (BAMs) with summarizeOverlaps() as described in the vignette. You need to use htseq-count with the "stranded=reverse" option and use DESeqDataSetFromHTSeqCount(), or you need to use a preprocessing function for the reads in R that is new in the GenomicRanges package starting in version 1.23.26 (April 12th, 2016). Full analysis walkthrough starts halfway this page.

Like many people I've been moving away from Cufflinks based RNASeq gene expression analysis to DESeq2, largely due to its more robust statistics on tough samples.  There are great tutorials (e.g. http://www.bioconductor.org/help/workflows/rnaseqGene/) that show you how to get started making pretty graphs from your data with DESeq2 and related R/Bioconductor packages. This post is to warn you that it can all go horribly wrong but look okay on the surface!

If you're new to it, unless you know to do some very specific googling or your wet lab is right on top of these things and tells you, you might not realize that the single end Illumina RNASeq protocol generates antisense reads, not sense read.  Here's an image buried deep in the Illumina login-only Web site.

I suspect that the GenomicAlignments and/or DESeq2 package authors haven't worked with such datasets much, because there is no way in summarizeOverlaps() method to count only antisense reads, and the DESeq2 tutorial linked above doesn't mention this possible snag. Setting ignore.strand=TRUE in summarizeOverlaps(), will do in a pinch, but you lose all the advantages of a stranded sequencing protocol...most prominently the gene resolution of mRNAs with overlapping exons.

The most annoying bit of this whole thing is that it's not glaringly obvious running through the DESeq2 tutorial that you've got the wrong data, since there is enough seep through to the sense strand (~2% typically but I've seen up to 15%) that the graphs and stats look okay prima facie.  To make things worse, the seep through affects certain genes much more than others so it's not like just subsampling the correct strand's data. Here's a PCA plot of samples using the sense strand, where the samples still cluster reasonably well by factor groups:

The variance is a bit fat at the low end, but not alarmingly so, and the range is normal:

And you do get a list of statistically significant changes for the various factor contrasts using DESeq2. But it's the wrong data, we should be using the antisense counts! If you want to determine from the aligned data if you have sense or antisense, here's a quick command I whipped up to generate a BED file with percentage strand concordance for ungapped/clipped 75bp reads in myfile.bam for each gene in a UCSC refFlat file:

cut -f 1-6 mm10_refFlat.bed | uniq | perl -ane '$t=$r=0;for $hit (split /\n/, `samtools view myfile.bam $F[0]:$F[1]-$F[2]`){my @a=split /\t/,$hit; next unless $hit =~ /7\dM/; $t++; $r++ if $a[1] & 16}print join("\t", @F[0..3],int(100*($F[5] eq "-" ? $r/$t:($t-$r)/$t)),$F[5],$t),"\n" if $t' > myfile.strand.bed

You might want to stick that in a shell script file for reuse. Changing the /7\dM/ regex for longer reads is left as an exercise for the reader ;-).

Looking at the results, it was obvious that the % strand concordance (5th column) is near zero for the vast majority of genes, at any read depth (7th column).  Without any a priori knowledge of a sample's wet lab prep that's how you know you've got to deviate from the DESeq2 tutorial and use htseq-count:

htseq-count -s reverse -f bam in.bam mm10_refGene_w_genenames.gtf > out.gene_counts.tab

This has the added advantage of giving you gene names in the row labels for DESeq2 output rather than Entrez GeneID as per the DESeq2 tutorial. If you want to generate a UCSC gene model set, here's a script that adds the gene_ids from a refFlat BED to a refGene GTF table retrieved using the UCSC Table Browser. Make sure you run cpanm Set::IntervalTree if you don't have this Perl module installed already. It handles most idiosyncrasies of the UCSC gene models, like weird miRNAs that span several genes. It can also make a BED by the way.

Unfortunately, HTSeq is pretty slow -- 2 or 3x slower than Bioconductor's summarizeOverlaps() -- and then you need to build an extra correspondence table to use with DESeqDataSetFromHTSeqCount(), and you can't take advantage of the nice automated read counting parallelization in R.  Now as of April 2016 you can use a built-in function when you go to load your read counts as an extra argument, preprocess.reads=invertStrand like so to correctly use strand-specific single end Illumina RNASeq data:

se <- summarizeOverlaps(features=ebg, reads=bamfiles, mode="Union", singleEnd=TRUE, ignore.strand=FALSE, preprocess.reads=invertStrand)

Here's a slightly more foolproof version of the DESeq2 tutorial commands if you're starting with a bunch of BAM files from Illumina RNASeq data of mice (Mus musculus), and a sample design file that includes the BAM file names to mitigate file sorting or subsetting issues (sample_meta.tab):

meta <- read.csv("sample_meta.tab")
bamfiles <- BamFileList(as.character(meta$BAM_filename))
ebg <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by="gene")
multicoreParam <- MulticoreParam(workers = 8) // If you have an 8-core CPU
register(multicoreParam)                      // ditto
invertStrand <- function(x){GenomicRanges:::invertRleStrand(x)} // if you don't have the latest GenomicRanges version
se <- summarizeOverlaps(features=ebg, reads=bamfiles, mode="Union", singleEnd=TRUE, ignore.strand=FALSE, preprocess.reads=invertStrand)
At this point let's say it's a knockout mouse (e.g. Adh1) being subjected to some treatment and you want to just do a quick sanity check for the number of reads mapping to the knockout gene in each sample (normalized just for the total mapped read depth of each sample).  You can do the following to get the right index for that gene (the knownGene database is indexed on Entrez Gene ID):

counts <- assay(se, "counts")

gene_symbols <- select(org.Mm.eg.db, rownames(counts), "SYMBOL", "ENTREZID")
knockout_gene_id <- gene_symbols[which(gene_symbols$SYMBOL == "Adh1"),]$ENTREZID
Assuming relative abundance for the knockout in each sample makes sense, we can go on and do all the stuff in the DESeq2 tutorial. As an aside you could also generate an FPKM file for use in other methods that need it:

transcripts_per_million <- counts/colSums(counts)*1000000
gene_lengths_in_k <- sum(width(reduce(ebg)))/1000
fpkm <- transcripts_per_million/gene_lengths_in_k
fpkm_symbol_df <- data.frame(SYMBOL=gene_symbols, fpkm)
write.csv(fpkm_symbol_df, file="gene_fpkm.csv")

Now on with the DESeq2 tutorial
colData(se) <- DataFrame(meta)
dds <- DESeqDataSet(se, design = ~ Treatment + Knockout + Treatment:Knockout)
dds_redux <- dds[rowSums(counts(dds)) > 1,]
rld <- rlog(dds_redux, blind=FALSE)
plot(assay(rld)[,1:2], pch=16, cex=0.3)
Ah, that looks nicer! That fanning at the left edge has gone away, and we have a lot more highly expressed genes (upper right). Quasi-homoskedastic for distance analysis, etc...

poisd <- PoissonDistance(t(counts(dds_redux)))
samplePoisDistMatrix <- as.matrix( poisd$dd )
rownames(samplePoisDistMatrix) <- paste(meta$Treatment, meta$Knockout, meta$Sample_ID, sep=":")
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(samplePoisDistMatrix, clustering_distance_rows=poisd$dd, clustering_distance_cols=poisd$dd, col=colors)

I've labelled the Y axis with Treatment:Knockout:Sample_ID from sample_meta.tab. The first two samples (14_2W6 and 5_2i3, i.e. samples #5 and #17 per the X axis) look like outliers, I might have to revisit those if the differential expression analysis turns out badly. Otherwise the tree clustering on the left reflects the TRUE/FALSE experiment/control breakdown of the samples nicely, and most of the 0 Knockout (i.e. wildtype) are grouped in the top half. This graph looked quite different before the antisense.strand=TRUE patch! Let's check out the PCA plot to compare to the original:

plotPCA(rld, intgroup = c("Treatment", "Knockout"))

Whoa! Those two are really outliers (not nearly as obvious in the wrong-strand graph). Note that by default you can't see the individual sample labels in the plot. Since it's just a standard ggplot2 object we can add the labels once we have the PCA data. The PCA data can be grabbed by calling the underlying plot function directly with a special argument, returnData.
rld_plotdata <- DESeq2:::plotPCA.DESeqTransform(rld, intgroup=c("Treatment", "Knockout"), returnData=TRUE) library(ggplot2) plotPCA(rld, intgroup=c("Treatment", "Knockout")) + annotate("text", x=rld_plotdata$PC1, y=rld_plotdata$PC2, label=rownames(rld_plotdata))
Yup, it's those two samples. Probably want to go back and exclude those two samples from the analysis, by splicing out the relevant bits (indices 5 & 17) of the se object.

se_redux <- se[,c(seq(1,4),seq(6,16),seq(18,20))]
dds <- DESeqDataSet(se_redux, design = ~ Treatment + Knockout + Treatment:Knockout)
dds_redux <- dds[rowSums(counts(dds)) > 1,]
rld <- rlog(dds_redux, blind=FALSE)

I reran the charts and they cluster much better. Now on to the differential expression:

deseq <- DESeq(dds_redux)

Here's the less obvious part: I have several groups I can compare in my model: Treatment vs. Control, or Knockout zygosity, or the interaction of the two.  Let's look at how we address these in Bioconductor:

[1] "Intercept"     "TreatmentTRUE"     "Knockout"     "TreatmentTRUE.Knockout"

So now we can proceed to retrieve the differential expression (FDR=0.05) for each "contrast":

result_treatment <- results(deseq, contrast=list("TreatmentTRUE"), alpha=.05)
result_knockout <- results(deseq, contrast=list("Knockout"), alpha=.05)
result_interaction <- results(deseq, contrast=list("TreatmentTRUE.Knockout"), alpha=.05)

Now let's check the number of entries with multiple testing adjusted p-value <.05 and see if it's manageable. If not, proceed to 0.01 or change your model, etc.:

result_knockout.sig <- which(result_knockout$padj < 0.05)
[1] 320

I can deal with 320. Finally, print the relevant columns to a file along with the corresponding gene names (repeat this for each of the three results):

result_knockout.sig_symbol_df <- data.frame(result_knockout[result_knockout.sig,], SYMBOL=gene_symbols$SYMBOL[result_knockout.sig])
write.csv(result_knockout.sig_symbol_df[c("SYMBOL", "baseMean", "log2FoldChange", "padj")], file="knockout_dge.csv")


  1. No need to patch, see the preprocess.reads or mode (which can be a function) argument; simply manipulate strand as appropriate. A more complete answer to your query to Maintainer at bioconductor.org is in progress; post to support.bioconductor.org if you'd like more public discussion.

    1. HI Martin,

      Thanks, I've updated the post to do a walkthrough using preprocess.reads instead of a patch. Hopefully this post will still be useful to folks though as it's not easy to find this info unless you know exactly what you're looking for, or indeed that Illumina reads are antisense in the first place.