Assuming you have already run a bunch of sample with Kallisto against a relevant transcript database, and have the outputs in folders called samplename.kallisto, run this simple script to generate the FPKM data at the transcript and gene levels (in this case, human):
$ kallisto_to_fpkm refseq2gene_human
Start R, and first load the experiment metadata:
> meta <- read.table("meta.tab", header=TRUE)
> meta$path <- as.character(meta$path)
> meta
sample path tgfb pfd
1 Library1 Library1.kallisto 0 0
2 Library2 Library2.kallisto 0 0
3 Library3 Library3.kallisto 0 0
4 Library4 Library4.kallisto 1 0
5 Library5 Library5.kallisto 1 0
6 Library6 Library6.kallisto 1 0
7 Library7 Library7.kallisto 0 1
8 Library8 Library8.kallisto 0 1
9 Library9 Library9.kallisto 0 1
10 Library10 Library10.kallisto 1 1
11 Library11 Library11.kallisto 1 1
12 Library12 Library12.kallisto 1 1
You'll see here that I have two treatments (TGFB and PFD), run independently and in combination, as well as wild type. I'm going to manually assign reasonable factor level names (e.g. "wt" for wild type) that will be used on the graph later.
> sample2category <- hashmap(meta$path, c(rep("wt",3), rep("tgfb", 3), rep("pfd", 3), rep("tgfb+pfd", 3)))
> sample2category
## (character) => (character)
## [Library4.kallisto] => [tgfb]
## [Library2.kallisto] => [wt]
## [Library11.kallisto] => [tgfb+pfd]
## [Library12.kallisto] => [tgfb+pfd]
## [Library5.kallisto] => [tgfb]
## [Library3.kallisto] => [wt]
## [...] => [...]
Looks good. Let's load up the gene level FPKM data we generated at the very start:
> gene_fpkm <- read.table("gene_fpkm.txt", header=TRUE, row.names=1)
Suppose we have a subset of genes that are of particular interest, let's load them. It's a simple text file with one gene name per line, in this case, 30 genes.
> cancer <- read.table("cancer.txt", colClasses=c("character"))
Let's just work with the subset of FPKM values from the genes of interest. For the sake of plotting a reasonable vertical axis range, I'm turning the FPKM values into log2(FPKM+1).
> gene_fpkm_cancer <- t(log1p(gene_fpkm[ecm_cancer$V1,], base=2))
What we need to do to generate the boxplots is turn the (gene, sample) matrix into a flatter long table where we have multiple gene -> value instances for each experiment factor combinations. The flattening is easy, using the function melt().
> library(reshape2)
> d <- melt(gene_fpkm_cancer)
> d[1,]
X1 X2 value
1 Library10.kallisto MMP11 4.196717
As you can see from just printing the first row of the melt()ed table, X1 is the library, X2 is the gene name, value if the log transformed FPKM value. Let's add the category labels we generated earlier.
> d$category <- sample2category[[d$X1]]
> d[1,]
X1 X2 value category
1 Library10.kallisto MMP11 4.273322 tgfb+pfd
Nice, now we are ready to plot, with some fancy options to make it pretty.
> ggplot(d)+ \
geom_boxplot(aes(category, y=value)) + \ # boxplot for each category
facet_grid(.~X2) + \ # boxplot pane for each gene
theme(axis.text.x=element_text(angle=90, hjust=1)) + \ # vertical category labels
scale_x_discrete(limits=c("wt","tgfb","pfd","tgfb+pfd")) # reorder category labels