October 23, 2018

Graphing Kallisto RNASeq results: Pretty box plotting genes by experiment factor levels


Assuming you have already run a bunch of sample with Kallisto against a relevant transcript database, and have the outputs in folders called samplename.kallisto, run this simple script to generate the FPKM data at the transcript and gene levels (in this case, human):

$ kallisto_to_fpkm refseq2gene_human

Start R, and first load the experiment metadata:

> meta <- read.table("meta.tab", header=TRUE)

> meta$path <- as.character(meta$path)
> meta
      sample               path tgfb pfd
1   Library1  Library1.kallisto    0   0
2   Library2  Library2.kallisto    0   0
3   Library3  Library3.kallisto    0   0
4   Library4  Library4.kallisto    1   0
5   Library5  Library5.kallisto    1   0
6   Library6  Library6.kallisto    1   0
7   Library7  Library7.kallisto    0   1
8   Library8  Library8.kallisto    0   1
9   Library9  Library9.kallisto    0   1
10 Library10 Library10.kallisto    1   1
11 Library11 Library11.kallisto    1   1
12 Library12 Library12.kallisto    1   1

You'll see here that I have two treatments (TGFB and PFD), run independently and in combination, as well as wild type.  I'm going to manually assign reasonable factor level names (e.g. "wt" for wild type) that will be used on the graph later.

> sample2category <- hashmap(meta$path, c(rep("wt",3), rep("tgfb", 3), rep("pfd", 3), rep("tgfb+pfd", 3)))

> sample2category
##          (character) => (character)
##  [Library4.kallisto] => [tgfb]     
##  [Library2.kallisto] => [wt]       
## [Library11.kallisto] => [tgfb+pfd] 
## [Library12.kallisto] => [tgfb+pfd] 
##  [Library5.kallisto] => [tgfb]     
##  [Library3.kallisto] => [wt]       
##                [...] => [...]      

Looks good.  Let's load up the gene level FPKM data we generated at the very start:

gene_fpkm <- read.table("gene_fpkm.txt", header=TRUE, row.names=1)


Suppose we have a subset of genes that are of particular interest, let's load them.  It's a simple text file with one gene name per line, in this case, 30 genes.

> cancer <- read.table("cancer.txt", colClasses=c("character"))

Let's just work with the subset of FPKM values from the genes of interest.  For the sake of plotting a reasonable vertical axis range, I'm turning the FPKM values into log2(FPKM+1).

gene_fpkm_cancer <- t(log1p(gene_fpkm[ecm_cancer$V1,], base=2))


What we need to do to generate the boxplots is turn the (gene, sample) matrix into a flatter long table where we have multiple gene -> value instances for each experiment factor combinations.  The flattening is easy, using the function melt().

> library(reshape2)
d <- melt(gene_fpkm_cancer)


> d[1,]

                  X1    X2    value
1 Library10.kallisto MMP11 4.196717

As you can see from just printing the first row of the melt()ed table, X1 is the library, X2 is the gene name, value if the log transformed FPKM value. Let's add the category labels we generated earlier.


> d$category <- sample2category[[d$X1]]
> d[1,]

                  X1    X2    value category
1 Library10.kallisto MMP11 4.273322 tgfb+pfd

Nice, now we are ready to plot, with some fancy options to make it pretty.

> ggplot(d)+ \
  geom_boxplot(aes(category, y=value)) + \                 # boxplot for each category
  facet_grid(.~X2) + \                                     # boxplot pane for each gene
  theme(axis.text.x=element_text(angle=90, hjust=1)) + \   # vertical category labels
  scale_x_discrete(limits=c("wt","tgfb","pfd","tgfb+pfd")) # reorder category labels




No comments:

Post a Comment