February 8, 2018

Analyzing Bisulfite Treated Genome Data (differential methylation detection)


tl;dr Bismark+DSS+ChIPpeakAnno+BioMart is good way to find differentially methylated regions of  genomes that have been bisulphite treated and sequenced, whether you have one or more experimental factors. There are some tricks to getting the annotations exported due to the non-one-to-one nature of diff methyl regions and nearby genes.
___________________

DSS is a good choice if you are trying to ensure very few false positives (i.e. specificity) while still getting decent recall/sensitivity (according to this nice comparison paper).

mkdir /export/common/dbs/bismark/danRer10

cp reference_genome.fa /export/common/dbs/bismark/danRer10

Ensure that the bowtie2 and samtools executables are in your PATH, e.g. install BioBuilds

bismark_genome_preparation /export/common/dbs/bismark/danRer10

Then do the following for each of your samples to count the bisulphite converted sites:

bismark -p 4 -un --ambiguous -N 1 -L 28 --non_bs_mm /export/common/dbs/bismark/danRer10/ path/to/my_ctl_1.fastq.gz -B my_ctl_1

bismark_methylation_extractor --scaffolds --bedGraph --parallel 5 --comprehensive --single-end --merge_non_CpG my_ctl_1.bam

Install DSS if you don't already have it, by starting R then:


source("http://bioconductor.org/biocLite.R")
biocLite("DSS");

biocLite("org.Dr.eg.db");

Create an experiment metadata file for the sample factors:

library(DSS)
require(bsseq)

meta <- read.table("meta.tab", header=TRUE)
meta

      sample site sex
1  CUSH_27_M CUSH   M
2  CUSH_30_F CUSH   F
3  CUSH_35_F CUSH   F
4  CUSH_37_M CUSH   M
5  CUSH_38_F CUSH   F
6  CUSH_39_M CUSH   M
7  GLEN_60_M GLEN   M
8  GLEN_64_M GLEN   M
9  GLEN_65_F GLEN   F
10 GLEN_69_M GLEN   M
11 GLEN_72_F GLEN   F
12 GLEN_76_F GLEN   F

Transform the Bismark output into the format that DSS requires. With replicate whole genome bisulphite data, this can take a while and use a lot of memory (20Gb+).

data <- lapply(meta$sample, function(x){t <- read.table(paste0(x,".bismark.cov"), colClasses = c("character", "integer", "NULL", "NULL", "integer", "integer")); colnames(t) <- c("chr", "pos", "X", "N"); t[,"N"] <- t[,"N"]+t[,"X"]; t})

BSobj <- makeBSseqData(data, meta$sample)

I've got multiple factors in this experiment, so I'll use the more complex fit function in DSS to find loci.

DMLfit <- DMLfit.multiFactor(BSobj, design=meta, formula=~site*sex)

DMLtest.site <- DMLtest.multiFactor(DMLfit, coef=2)

Now merge those loci into regions with more powerful statistics:

DMRtest.site <- callDMR(DMLtest.site, p.threshold=0.05)

ix <- sort(DMLtest.site[,"pvals"], index.return=TRUE)$ix

Inspecting these sorted results I see that the effect of sex is very minimal, so I can go ahead and just consider the site factor in a dichtomous model that includes smoothing for better dispersion estimation in the underlying beta-binomial model of DSS.

dmlTest.site <- DMLtest(BSobj, group1=as.character(meta$sample[1:6]), group2=as.character(meta$sample[7:12]), smoothing=TRUE)

Grab our site-dependent peak loci and peak regions:
dmls <- callDML(dmlTest.site, p.threshold=0.001)
dmrs <- callDMR(dmlTest.site, p.threshold=0.01)

Visually have a look at the most significant differentially methylated region, for sanity's sake:

showOneDMR(dmrs[1,], BSobj)

Then write them to a file:

write.csv(dmrs, "dds_dmr_site_only_fdr_lt_0.01.csv")

Annotated peaks with information from nearby genes:

 library(ChIPpeakAnno)

Unfortunately for me, danRer10 isn't a default in this module, so we need to roll our own data source from BioMart:


library(biomaRt);

mart <- useMart(biomart="ensembl", dataset="drerio_gene_ensembl")

danRer10_anno <- getAnnotation(mart, featureType="TSS")

dmr_gRanges <- GRanges(seqnames=Rle(dmrs$chr), ranges=IRanges(start=dmrs$start, end=dmrs$end))

anno <- annotatePeakInBatch(dmr_gRanges, AnnotationData=danRer10_anno)

G_list <- getBM(filters= "ensembl_gene_id", attributes= c("ensembl_gene_id", "entrezgene", "description"),values=anno$feature,mart= mart)


We also need to get rid of the leading zeros in the numbers for the peaks that ChIPpeakAnno annoyingly adds. Now you have three tables with different number of rows since some dmrs map to more than one feature, and not all features have annotations. Blergh. Hash tables to the rescue!

library(hash)
e2g <- hash(G_list$ensembl_gene_id, G_list$entrezgene)

e2d <- hash(G_list$ensembl_gene_id, G_list$description)

p2dmr <- hash(1:dim(dmrs)[1], lapply(1:dim(dmrs)[1], function(x){as.vector(c(dmrs$nCG[x], dmrs$meanMethy1[x], dmrs$meanMethy2[x], dmrs$diff.Methy[x], dmrs$areaStat[x]))}))


good_annos <- anno[which(!is.na(anno$feature)),]

h2v <- function(hsh, ks){sapply(ks, function(x){values(hsh[x])})}


Nice! We also need to get rid of the leading zeros in the numbers for the peaks that ChIPpeakAnno annoyingly adds. 

df <- data.frame(good_annos, h2v(e2g, good_annos$feature), h2v(e2d, good_annos$feature), t(h2v(p2dmr, as.character(as.numeric(good_annos$peak)))))

write.csv(df, "dss_site_only_dmr_fdr_lt_0.01_gene_annotation.csv")

Nice! Now we have a list of differentially methylated regions and the transcription start sites that they are closest too. Now you might load this list into Ingenuity Pathway Analyst, DAVID or MatInspector to look for higher level biological patterns linking your genes.





2 comments:

  1. This comment has been removed by a blog administrator.

    ReplyDelete
  2. This comment has been removed by a blog administrator.

    ReplyDelete