1) Map the reads to the genome for both whole cell extract and the experimental brdu IP capture (e.g. wild type strain used here):
/export/common/programs/bwa-0.7.15 mem -t 8 /export/common/dbs/bwa/sacCer3.fa wcewt1.fastq.gz | samtools sort -O bam -T wcewt1 > wcewt1.genome.bam
/export/common/programs/bwa-0.7.15 mem -t 8 /export/common/dbs/bwa/sacCer3.fa wt1.fastq.gz | samtools sort -O bam -T wt1 > wt1.genome.bam
Note that if you have paired end data, specify both on the command line for each bwa command.
2) Calculate the coverage across the genome for each sample (sacCer3.genome has the name and length of each chromosome tab separated, one chr per line):
bedtools genomecov -ibam wt1.genome.bam -g sacCer3.genome -d > wt1.genome.bdg
bedtools genomecov -ibam wcewt1.genome.bam -g sacCer3.genome -d > wcewt1.genome.bdg
Note that if you have paired end data, ad the flag -pc to the bedtools command.
3) Start R and calculate a 1000 base pair window to compare coverage of the WCE to the brdU pulldown, stepping 100bp for each window recalculation so we have 100bp "buckets" of same coverage rather than base pair resolution.
wt1 <- read.table("wt1.genome.bdg")
wcewt1 <- read.table("wcewt1.genome.bdg")
window <- function(x, i, w){mean(x[max(1,i-w/2):min(length(x),i+w/2)])}
buckets <- seq(1, length(wt1$V3), 100)
Avoid annoying scientific notation parsing issues in IGV later with this magic:
options(scipen=999)
Before we actually apply the windowing, figure out how much more overall sequence their is in one sample vs the other, as we need to normalize for that. Also, of course, add 1 to the totals so we aren't taking the log of zero (which causes R to have conniptions).
norm_factor <- sum(as.numeric(wt1$V3))/sum(as.numeric(wcewt1$V3))
log_ratio <- sapply(buckets, function(x){log(window(wt1$V3, x, 1000)/norm_factor+1)/log(window(wcewt1$V3, x, 1000)+1)})
write.table(data.frame(wt1$V1[buckets], wt1$V2[buckets], wt1$V2[buckets]+99, log_ratio), sep="\t", file="log_ratio_wt1_v_wcewt1.bed")
4) I'm lazy and reformat the output in Perl rather than putting all the right options into the R write.table() command. It also excludes chrM as that seems to raise the ceiling too much for the data range and makes everything else look small by default when graphed (e.g. 0-3 for normal but up to 16 for chrM).
perl -ane '$F[1]=~tr/"//d; print join("\t", @F[1..4]),"\n" unless $. == 1 or $F[1] eq "chrM"' log_ratio_wt1_v_wcewt1.bed > log_ratio_wt1_v_wcewt1.bdg
You're done! Now you can load your BedGraph file into IGV, etc. and look at peak heights...more on comparing samples using quantile normalization, etc. to come.