CHGI Bioinformatics: November 2017

Here's a quick run down of how to generate a log ratio plot for brdU IP pulldown vs. whole cell extract run with NGS DNA sequencing. This type of experiment is done to find how commonly used origins of replication are in yeast genomes for example, as the brdU is incorporated in soon-to-be daughter cell DNA during cell division. For the purposes of this walkthrough, you can use https://www.ebi.ac.uk/ena/data/view/SRR1284645 (wt1) and https://www.ebi.ac.uk/ena/data/view/SRR1284649 (wcewt1).

1) Map the reads to the genome for both whole cell extract and the experimental brdu IP capture (e.g. wild type strain used here):

/export/common/programs/bwa-0.7.15 mem -t 8 /export/common/dbs/bwa/sacCer3.fa wcewt1.fastq.gz | samtools sort -O bam -T wcewt1 > wcewt1.genome.bam

/export/common/programs/bwa-0.7.15 mem -t 8 /export/common/dbs/bwa/sacCer3.fa wt1.fastq.gz | samtools sort -O bam -T wt1 > wt1.genome.bam

Note that if you have paired end data, specify both on the command line for each bwa command.

2) Calculate the coverage across the genome for each sample (sacCer3.genome has the name and length of each chromosome tab separated, one chr per line):

bedtools genomecov -ibam wt1.genome.bam -g sacCer3.genome -d > wt1.genome.bdg

bedtools genomecov -ibam wcewt1.genome.bam -g sacCer3.genome -d > wcewt1.genome.bdg

Note that if you have paired end data, ad the flag -pc to the bedtools command.

3) Start R and calculate a 1000 base pair window to compare coverage of the WCE to the brdU pulldown, stepping 100bp for each window recalculation so we have 100bp "buckets" of same coverage rather than base pair resolution.

wt1 <- read.table("wt1.genome.bdg")

wcewt1 <- read.table("wcewt1.genome.bdg")

window <- function(x, i, w){mean(x[max(1,i-w/2):min(length(x),i+w/2)])}

buckets <- seq(1, length(wt1$V3), 100)

Avoid annoying scientific notation parsing issues in IGV later with this magic:

options(scipen=999)

Before we actually apply the windowing, figure out how much more overall sequence their is in one sample vs the other, as we need to normalize for that. Also, of course, add 1 to the totals so we aren't taking the log of zero (which causes R to have conniptions).

norm_factor <- sum(as.numeric(wt1$V3))/sum(as.numeric(wcewt1$V3))

log_ratio <- sapply(buckets, function(x){log(window(wt1$V3, x, 1000)/norm_factor+1)/log(window(wcewt1$V3, x, 1000)+1)})

write.table(data.frame(wt1$V1[buckets], wt1$V2[buckets], wt1$V2[buckets]+99, log_ratio), sep="\t", file="log_ratio_wt1_v_wcewt1.bed")

4) I'm lazy and reformat the output in Perl rather than putting all the right options into the R write.table() command. It also excludes chrM as that seems to raise the ceiling too much for the data range and makes everything else look small by default when graphed (e.g. 0-3 for normal but up to 16 for chrM).

perl -ane '$F[1]=~tr/"//d; print join("\t", @F[1..4]),"\n" unless $. == 1 or $F[1] eq "chrM"' log_ratio_wt1_v_wcewt1.bed > log_ratio_wt1_v_wcewt1.bdg

You're done! Now you can load your BedGraph file into IGV, etc. and look at peak heights...more on comparing samples using quantile normalization, etc. to come.

CHGI Bioinformatics

Blog Archive

November 28, 2017

Basic brdU analysis