October 6, 2016

Using Kallisto & Sleuth for RNASeq analysis

Kallisto is a fast pseudo-mapping tool for RNAseq data that has some advantages over more established mapping and quantification methods like bwa and htseq-count. In very broad strokes, it doesn't do a traditional spliced alignment to a reference genome, but rather assigns read to a reference set of transcripts based on kmer counts. This works when you have well-annotated genome like human or mouse. Where the kmer transcript assignment is ambiguous, it uses bootstrap analysis to estimate the probability of correct assignment to a transcript. It can bootstrap because kmer counting is super fast, and takes little memory, making it ideal for running analysis on a laptop for instance.

I've been holding back because the stats software to interpret differential gene expression from the bootstrapped values has been slowly evolving. In June a preprint about this analysis software, called Sleuth, was put up on bioRXiv. So it was probably time I checked it out more thoroughly.

In particular I was drawn by microdissected tissue RNASeq experiments where DESeq2 was only showing very modest differential expression of genes, but a lot of it (1000's of genes). In theory, Sleuth should work better at resolving splicing variation in genes which might be going on, or even maybe compensating for tissue impurity insofar as it causes systematic underreporting of differential expression ratios.

Here's a brief rundown on running a combined Kallisto + Sleuth analysis.

Create a Kallisto searchable database of transcripts for your species, if you don't already have one. Personally I like RefSeq, but you could use Ensembl transcripts, etc.  Here's how to get the RefSeq fastas, while stripping the GIs and leaving the NM_ or NR_ ID as the one Kallisto will report.

wget ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.*.rna.fna.gz

gzip -cd mouse.*.gz | perl -i -pe 's/^.*?\|ref\|(.*?)\|/>$1/' > mm10.refMrna.fa

Index the reference file with Kallisto. It'll take a few minutes.

kallisto index -i /path/to/dbdir/mm10.refMrna mm10.refMrna.fa

Run kallisto for each of your samples, here I have single end RNASeq.  The 180/20 is a reasonable insert size/std dev estimate for TruSeq RNA v2 library preps, probably the most common type of sequencing these days. 190bp might be good to as per our own lab's TapeStation stats over many runs, but in practice the effect of changing this value by ten is negligible. The 100 specifies the number of bootstraps to run.

kallisto quant -i /path/to/dbdir/mm10.refMrna -o sample_id.kallisto -b 100 --single -l 180 -s 20 <(gzip -cd /path/to/unique_sample_id_prefix*.fastq.gz)

Now let's run some differential expression analysis. First, create a tab delimited metadata file for the experiment so we have the factors for each sample ID (header "sample"), and the file path (header "path") to the kallisto output for each sample.  The file should look something like this:

sample      path               time
mysample1   mysample1.kallisto D0
mysample2   mysample2.kallisto D0
mysample3   mysample3.kallisto D5
mysample4   mysample4.kallisto D5

I've put D0, D5 here so we have a factor with two values, and by default the alphabetically first factor value is considered the control in R (i.e. ratios will be reported as exp:ctl, so D5:D0).  You could have put 0, and 5 in the column instead, but then you'd have to coerce the numbers it into factor levels, so why bother?  Putting just numbers only make sense if you have a range of numeric values and are expecting a dosage response in the regression model.

Note that the tutorial on the Sleuth Web site uses a somewhat convoluted method to get the right metadata table together. Here, I've simplified it, assuming you are running R from the directory where all the kallisto quant output directories reside.

meta <- read.table("meta.tab", header=TRUE)

meta$path <- as.character(meta$path)

Make a regression model for the factor(s) in the metadata file.  Here we only have one, called "time" in the metadata file prepared earlier.

so <- sleuth_prep(meta, ~ time)

Model the gene expression as responding to the time factor.

so <- sleuth_fit(so)

Create another model where the gene expression is not dependent on any factor.

so <- sleuth_fit(so, ~1, "reduced")

Run a likelihood ratio test between the two models to see what transcripts appear to really be affected by the time factor value.

so <- sleuth_lrt(so, 'reduced', 'full')
results_table_lrt <- sleuth_results(so, 'reduced:full', test_type = 'lrt')

Run a Wald test to check for differential expression according to the time factor. Wait...didn't we just do that with the LRT?  They are different ways of skinning the same cat, and generally the LRT is considered a better test (see a nice explanation here). We need to run both, because only the Wald test will report the fold change*. My factor was "time", and the non-control value is D5, so the test is called "timeD5" as R is want to do with concatenating factors and values. Alternatively, you can list the valid Wald tests with the so$fits[["full"]]$design_matrix command.  On to the Wald test of the time factor...

so <- sleuth_wt(so, 'timeD5')
results_table_wt <- sleuth_results(so, 'timeD5')

We want the Wald test results for those genes that also passed the more stringent (when there's only two factor levels) LRT.  Let's get the names of the transcripts that are in both lists (with multiple testing corrected p-value, a.k.a. q-value, less than 0.05).

d5.lrt.sig_ids <- results_table_lrt$target_id[which(results_table_lrt$qval < 0.05)]
d5.wt.sig_ids <- results_table_wt$target_id[which(results_table_wt$qval < 0.05)]
shared_ids <- d5.wt.sig_ids[d5.wt.sig_ids %in% d5.lrt.sig_ids]

Grab just the rows of the Wald test that correspond to the shared transcripts, then write them to a file for posterity, downstream analysis in another program, etc..

shared_results <- results_table_wt[results_table_wt$target_id %in% shared_ids,]
write.csv(shared_results, file="kallisto_wald_test_lrt_passed_d5.csv")

To make sure we aren't throwing the baby out with the bathwater, let's write out the Wald-pass-but-LRT-failed genes to another file in case there's something interesting there.

wald_only_ids <- d5.wt.sig_ids[!(d5.wt.sig_ids %in% d5.lrt.sig_ids)]
wald_only_results <- results_table_wt[results_table_wt$target_id %in% wald_only_ids,]
write.csv(wald_only_results, file="kallisto_wald_test_only_passed_d5.csv")

Interestingly, unlike the samples used in the Sleuth preprint, both the Wald and LRT tests report fewer differentially expressed genes/transcripts than DESeq2's Wald test when used in conjunction with bwa and htseq-count on my local microdissection samples mentioned earlier. But, the differential expression ratios are higher.  This is the point at which I should explain the asterisk I put after "fold change" earlier on this page.

*The statistical model that Kallisto uses includes a term Beta.  In the output tables for the Wald test this is the "b" field and is the closest thing you can think of to the log2 fold-change you might be used to from other programs.  It's not really a fold change though, it's a bias estimator.  And it's the natural logarithm, not log2. It's roughly the extent to which the estimated count for a transcript is affected by the named factor rather than biological and technical variability (library prep, mapping, etc.).  So if there is very little variability and noise for a transcript (based on bootstrap analysis) it's pretty close to being a fold change measure.  If there is high technical variability for the transcript though, Beta tends to account for less of the overall estimated count even though Beta will have a higher magnitude.

If you've had the patience to read this far and understand Beta, why don't we look at some pretty graphs to see what our data look like (e.g. segregation on a Principle Component Analysis plot, volcano plots, etc.). Sleuth uses a Web interface called Shiny to allow interactive plotting, and you went through the installation instructions for both Kallisto and Sleuth you're set to go, right? The following will launch the Web interface, or at least give you a URL to paste into your browser.



  1. Hi Paul, nice summary and thanks for all other sleuth blogs. Just a quick question, is there a simple way to do pairwise comparison for samples with multi-level condition? For example, we have samples from D0, D5, and D10 groups instead of just D0 and D5 as in your example, and we want to compare D5 vs D0, D5 vs D10, and D10 vs D0. Thanks a lot, Sean

    1. hey,
      did you manage to do the multi-level condition comparison?

  2. This comment has been removed by the author.

    1. Hi Yifang,

      You just need a text file the two tab delimited columns as listed in the blog post: sample and path, then whatever factor columns you have for your experiment. Hope that helps!