CHGI Bioinformatics: 2018

October 23, 2018

Graphing Kallisto RNASeq results: Pretty box plotting genes by experiment factor levels

Assuming you have already run a bunch of samples with Kallisto against a relevant transcript database, and have the outputs in folders called samplename.kallisto, run this simple script to generate the FPKM data at the transcript and gene levels (in this case, human):

$ kallisto_to_fpkm refseq2gene_human

Start R, and first load the experiment metadata:

> meta <- read.table("meta.tab", header=TRUE)

> meta$path <- as.character(meta$path)

> meta

      sample               path tgfb pfd
1   Library1  Library1.kallisto    0   0
2   Library2  Library2.kallisto    0   0
3   Library3  Library3.kallisto    0   0
4   Library4  Library4.kallisto    1   0
5   Library5  Library5.kallisto    1   0
6   Library6  Library6.kallisto    1   0
7   Library7  Library7.kallisto    0   1
8   Library8  Library8.kallisto    0   1
9   Library9  Library9.kallisto    0   1
10 Library10 Library10.kallisto    1   1
11 Library11 Library11.kallisto    1   1
12 Library12 Library12.kallisto    1   1

You'll see here that I have two treatments (TGFB and PFD), run independently and in combination, as well as wild type. I'm going to manually assign reasonable factor level names (e.g. "wt" for wild type) that will be used on the graph later.

> library(hashmap)

> sample2category <- hashmap(meta$path, c(rep("wt",3), rep("tgfb", 3), rep("pfd", 3), rep("tgfb+pfd", 3)))

> sample2category

## (character) => (character)

## [Library4.kallisto] => [tgfb]

## [Library2.kallisto] => [wt]

## [Library11.kallisto] => [tgfb+pfd]

## [Library12.kallisto] => [tgfb+pfd]

## [Library5.kallisto] => [tgfb]

## [Library3.kallisto] => [wt]

## [...] => [...]
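If you would rather not type the labels by hand, the same mapping can be derived from the 0/1 treatment columns. This is only a sketch in base R: the meta data frame is rebuilt here from the table printed above so the snippet runs on its own, and sample2category_vec is a hypothetical name for a plain named vector standing in for the hashmap.

```r
# Toy copy of the metadata printed above (only the columns needed here).
meta <- data.frame(
  path = paste0("Library", 1:12, ".kallisto"),
  tgfb = rep(c(0, 1, 0, 1), each = 3),
  pfd  = rep(c(0, 0, 1, 1), each = 3),
  stringsAsFactors = FALSE
)

# Build one label per sample from the two 0/1 treatment columns.
label <- ifelse(meta$tgfb == 1 & meta$pfd == 1, "tgfb+pfd",
         ifelse(meta$tgfb == 1, "tgfb",
         ifelse(meta$pfd == 1, "pfd", "wt")))

# A named character vector works as a simple lookup table in base R.
sample2category_vec <- setNames(label, meta$path)
sample2category_vec[c("Library2.kallisto", "Library11.kallisto")]
```

Indexing the named vector by path returns the label for each sample, which is all the later steps need.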

Looks good. Let's load up the gene level FPKM data we generated at the very start:

> gene_fpkm <- read.table("gene_fpkm.txt", header=TRUE, row.names=1)

Suppose we have a subset of genes that are of particular interest. Let's load them from a simple text file with one gene name per line (in this case, 30 genes).

> cancer <- read.table("cancer.txt", colClasses=c("character"))

Let's just work with the subset of FPKM values from the genes of interest. For the sake of plotting a reasonable vertical axis range, I'm turning the FPKM values into log2(FPKM+1).

> gene_fpkm_cancer <- t(log2(gene_fpkm[cancer$V1,] + 1))
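To see what the log2(FPKM+1) transform does to the axis range, here is a small sketch on made-up FPKM values (the numbers are purely illustrative, not from the dataset). It also shows the equivalent form using log1p(), which computes the natural log of 1+x and has no base argument.

```r
# Hypothetical FPKM values, for illustration only.
fpkm <- c(0, 1, 3, 1023)

# log2(FPKM + 1): zero stays at zero, large values are compressed.
log2_fpkm <- log2(fpkm + 1)
log2_fpkm

# Same transform via log1p(), converting natural log to base 2.
log1p(fpkm) / log(2)
```

The +1 keeps unexpressed genes at 0 instead of -Inf, so they still show up on the plot.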

What we need to do to generate the boxplots is turn the (sample, gene) matrix into a flatter long table where we have multiple gene -> value instances for each experiment factor combination. The flattening is easy, using the function melt().

> library(reshape2)

> d <- melt(gene_fpkm_cancer, varnames=c("X1","X2"))

> d[1,]

X1 X2 value

1 Library10.kallisto MMP11 4.196717

As you can see from just printing the first row of the melt()ed table, X1 is the library, X2 is the gene name, and value is the log-transformed FPKM value. Let's add the category labels we generated earlier.

> d$category <- sample2category[[as.character(d$X1)]]

> d[1,]

X1 X2 value category

1 Library10.kallisto MMP11 4.273322 tgfb+pfd
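If it helps to see what the flatten-and-label steps are doing, here is a rough base R equivalent on a tiny made-up matrix. The values, the gene name GENE_B and the lookup vector are all hypothetical, and as.data.frame(as.table()) is used here as a stand-in for melt().

```r
# Hypothetical 2-sample x 2-gene matrix of log2(FPKM+1) values.
m <- matrix(c(4.2, 0.0, 1.5, 3.3), nrow = 2,
            dimnames = list(c("Library1.kallisto", "Library10.kallisto"),
                            c("MMP11", "GENE_B")))

# as.table() + as.data.frame() yields one row per (sample, gene) cell.
d <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(d) <- c("X1", "X2", "value")

# Named vector used as a lookup table for the category labels.
lookup <- c("Library1.kallisto" = "wt", "Library10.kallisto" = "tgfb+pfd")
d$category <- unname(lookup[d$X1])
d
```

Each cell of the matrix becomes its own row, carrying along the sample, gene and category it belongs to.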

Nice, now we are ready to plot, with some fancy options to make it pretty.

> library(ggplot2)

> ggplot(d) +

geom_boxplot(aes(x=category, y=value)) + # boxplot for each category

facet_grid(.~X2) + # boxplot pane for each gene

theme(axis.text.x=element_text(angle=90, hjust=1)) + # vertical category labels

scale_x_discrete(limits=c("wt","tgfb","pfd","tgfb+pfd")) # reorder category labels
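Another way to control the category order, as far as I know, is to store it in the data itself as factor levels, which discrete axes generally follow. A minimal base R sketch on a made-up label vector:

```r
# Hypothetical category labels in arbitrary order.
category <- c("tgfb", "wt", "pfd", "tgfb+pfd", "wt")

# Explicit levels fix the display order instead of the default sorted order.
ordered_category <- factor(category, levels = c("wt", "tgfb", "pfd", "tgfb+pfd"))
levels(ordered_category)

# The underlying integer codes follow the chosen level order.
as.integer(ordered_category)
```

This keeps the ordering with the data, which is handy if the same table feeds more than one plot.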

February 8, 2018

Analyzing Bisulfite Treated Genome Data (differential methylation detection)

tl;dr Bismark+DSS+ChIPpeakAnno+BioMart is a good way to find differentially methylated regions of genomes that have been bisulphite treated and sequenced, whether you have one or more experimental factors. There are some tricks to getting the annotations exported due to the non-one-to-one nature of diff methyl regions and nearby genes.
___________________

DSS is a good choice if you are trying to ensure very few false positives (i.e. specificity) while still getting decent recall/sensitivity (according to this nice comparison paper).

mkdir /export/common/dbs/bismark/danRer10

cp reference_genome.fa /export/common/dbs/bismark/danRer10

Ensure that the bowtie2 and samtools executables are in your PATH, e.g. install BioBuilds.

bismark_genome_preparation /export/common/dbs/bismark/danRer10

Then do the following for each of your samples to count the bisulphite converted sites:

bismark -p 4 -un --ambiguous -N 1 -L 28 --non_bs_mm /export/common/dbs/bismark/danRer10/ path/to/my_ctl_1.fastq.gz -B my_ctl_1

bismark_methylation_extractor --scaffolds --bedGraph --parallel 5 --comprehensive --single-end --merge_non_CpG my_ctl_1.bam

Install DSS if you don't already have it, by starting R and then running:

source("http://bioconductor.org/biocLite.R")

biocLite("DSS");

biocLite("org.Dr.eg.db");

Create an experiment metadata file for the sample factors:

library(DSS)
require(bsseq)

meta <- read.table("meta.tab", header=TRUE)

February 1, 2018

Are you losing important transcripts in your Kallisto/Sleuth RNASeq analysis?

tl;dr The default transcript filter function parameters in Sleuth are suitable for a single factor, two level contrast RNASeq experiment. If you are running a two-factor experiment (e.g. knock out vs. wild type, plus control vs. treatment), or an experiment with multiple factor levels (e.g. time series), you should probably use a filter function such as the one described below. You will retain more true positive differentially expressed genes, without generating too many new false positives.

________________________________

I've been a heavy user of Kallisto and Sleuth for RNASeq analysis for some time, and was used to seeing output similar to the following when loading up a dataset:

> so <- sleuth_prep(meta, ~ condition+cell_line+condition:cell_line)

reading in kallisto results

............

normalizing est_counts

26036 targets passed the filter

normalizing tpm

merging in metadata

normalizing bootstrap samples

summarizing bootstraps

I hadn't given much consideration to how the "filter" statistic was generated, until I had a 5 time point series experiment where we had a priori knowledge of the activation of a transcript only at the last two timepoints. This transcript did not show up in the Sleuth analysis with any p-value, let alone a significant one. A few days later in a two-factor experiment (growth condition and cell line), there were also some missing known transcripts.

The default filtering function in Sleuth (called basic_filter) requires at least 5 mapped reads to a transcript in at least 47% of the samples. This reduces spurious identification of differential expression in near-zero abundance transcripts, but retains genes that are moderately but consistently expressed in one of two factor levels (e.g. expressed-in-control-only transcripts, or expressed-in-treatment-only transcripts).
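To make the 47% threshold concrete, here is a sketch of the rule as described above. basic_filter_sketch is a hypothetical stand-in written from that description, not the Sleuth source, and the counts are invented for a 12-sample experiment.

```r
# Stand-in for the rule described above (an assumption, not sleuth's code):
# keep a transcript if at least min_prop of samples have >= min_reads.
basic_filter_sketch <- function(row, min_reads = 5, min_prop = 0.47) {
  mean(row >= min_reads) >= min_prop
}

# Hypothetical estimated counts for one transcript across 12 samples.
six_of_twelve  <- c(rep(20, 6), rep(0, 6))  # 6/12 = 0.50 of samples expressed
five_of_twelve <- c(rep(20, 5), rep(0, 7))  # 5/12 = 0.42 of samples expressed

basic_filter_sketch(six_of_twelve)
basic_filter_sketch(five_of_twelve)
```

With 12 samples, 47% works out to needing at least 6 samples with 5 or more reads, so the first transcript is kept and the second is dropped.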

If I have two factors in my RNASeq experiment, each with two levels (3 replicates per combination is typical, for 12 samples), this filter would eliminate transcripts only expressed in the interaction term, such as condition:cell_line in the above example. Here's the metadata:

> meta

        sample                 path condition cell_line
1    NSC_Ctl_1   NSC_Ctl_1.kallisto       NSC       Ctl
2    NSC_Ctl_2   NSC_Ctl_2.kallisto       NSC       Ctl
3    NSC_Ctl_3   NSC_Ctl_3.kallisto       NSC       Ctl
4     NSC_KO_1    NSC_KO_1.kallisto       NSC        KO
5     NSC_KO_2    NSC_KO_2.kallisto       NSC        KO
6     NSC_KO_3    NSC_KO_3.kallisto       NSC        KO
7  Odiff_Ctl_1 Odiff_Ctl_1.kallisto        OD       Ctl
8  Odiff_Ctl_2 Odiff_Ctl_2.kallisto        OD       Ctl
9  Odiff_Ctl_3 Odiff_Ctl_3.kallisto        OD       Ctl
10  Odiff_KO_1  Odiff_KO_1.kallisto        OD        KO
11  Odiff_KO_2  Odiff_KO_2.kallisto        OD        KO
12  Odiff_KO_3  Odiff_KO_3.kallisto        OD        KO

The condition:cell_line term gleans data from only 3 (25%) of the samples (i.e. those that are OD:KO). Let's change the filter to only require >=5 reads in 25% of the samples...

> so <- sleuth_prep(meta, ~ condition+cell_line+condition:cell_line,

filter_fun=function(x){basic_filter(x, 5, 0.25)})

reading in kallisto results

............

normalizing est_counts

36320 targets passed the filter

normalizing tpm

merging in metadata

normalizing bootstrap samples

summarizing bootstraps

Whoa! We just increased the number of transcripts passing the filter by about 40% (from 26,036 to 36,320), which leads to a huge inflation of false positives in the differential expression, and just as importantly, detrimentally affects the q-values for the genes in our original, default-filtered analysis. A smarter filter might be to require 100% of samples with any present factor level to have at least 5 reads, i.e. keep any transcript where all replicate samples for a factor moderately express it.

[Puts on thinking cap, writes several failed attempts, then...]

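> design_filter <- function(design, row, min_reads=5, min_prop = 0.47){

# test each design matrix column, treating its values as factor levels

sum(apply(design, 2, function(x){

y <- as.factor(x);

# pass if all samples of some level have >= min_reads...

return(max(tapply(row, y, function(f){sum(f >= min_reads)})/

tapply(row, y, length)) == 1

# ...or if the default filter would have passed anyway

|| basic_filter(row, min_reads, min_prop)

)

})) > 0}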
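Here is a toy run of the filter on invented counts, to show the behaviour it is meant to have. To keep the snippet self-contained, basic_filter is redefined as a stand-in written from the description earlier (an assumption, not the Sleuth source), the design matrix is built with model.matrix() from a cut-down copy of the metadata, and both count vectors are hypothetical.

```r
# Stand-in for sleuth's basic_filter(), written from the description above.
basic_filter <- function(row, min_reads = 5, min_prop = 0.47) {
  mean(row >= min_reads) >= min_prop
}

# Same logic as the design_filter() defined above.
design_filter <- function(design, row, min_reads = 5, min_prop = 0.47) {
  sum(apply(design, 2, function(x) {
    y <- as.factor(x)
    max(tapply(row, y, function(f) sum(f >= min_reads)) /
          tapply(row, y, length)) == 1 ||
      basic_filter(row, min_reads, min_prop)
  })) > 0
}

# Cut-down copy of the 12-sample metadata: 2 conditions x 2 cell lines x 3 reps.
meta <- data.frame(
  condition = rep(c("NSC", "OD"), each = 6),
  cell_line = rep(rep(c("Ctl", "KO"), each = 3), times = 2)
)
design <- model.matrix(~ condition + cell_line + condition:cell_line, meta)

# Hypothetical counts: expressed in all 3 OD:KO replicates vs. 3 unrelated samples.
od_ko_only <- c(rep(0, 9), rep(20, 3))
scattered  <- c(20, 0, 0, 0, 20, 0, 0, 0, 0, 20, 0, 0)

design_filter(design, od_ko_only)   # all replicates of the interaction level
design_filter(design, scattered)    # no level fully covered
basic_filter(scattered, 5, 0.25)    # the loosened 25% filter for comparison
```

Both toy transcripts have reads in 3 of 12 samples, so the loosened 25% filter cannot tell them apart, while the design-aware filter keeps only the one whose reads cover every replicate of a level.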

To pass in the design matrix that my new filter requires, I can just reuse the one my first call to sleuth_prep() generated, rather than making it myself. Probably not a bad idea to do it this way in any case, so we can then compare how many transcripts pass this new filter vs. the default filter.

> so_redux <- sleuth_prep(meta, ~cell_line*condition,

filter_fun=function(x){design_filter(so$design,x)})

reading in kallisto results

............

normalizing est_counts

26370 targets passed the filter

normalizing tpm

merging in metadata

normalizing bootstrap samples

summarizing bootstraps

Although for this dataset the new filter also requires ~25% of samples to have moderate expression, the added constraint that those 25% cover all replicates of some factor level means adding just 334 transcripts to the analysis instead of more than 10,000. This seems much more reasonable to me, and my known true positive transcript suddenly appeared. #winning

Note that the design_filter() should work for any set of nominal factors, but not quantitative factors. A column slice of the design matrix could be passed in accordingly in the so_redux code above.
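As a sketch of what such a column slice could look like: the metadata below is invented, with dose as a hypothetical quantitative covariate. The filter treats each distinct value in a column as a factor level, so a numeric column would be split into many small levels, and dropping it before filtering avoids that.

```r
# Hypothetical metadata with one nominal factor and one quantitative covariate.
meta <- data.frame(
  cell_line = rep(c("Ctl", "KO"), each = 3),
  dose      = c(0.5, 1, 2, 0.5, 1, 2)
)
design <- model.matrix(~ cell_line + dose, meta)
colnames(design)

# Keep only the columns that encode nominal factor levels (drop "dose").
nominal_design <- design[, colnames(design) != "dose", drop = FALSE]
colnames(nominal_design)
```

The sliced matrix can then be passed to the filter in place of the full design.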
