December 13, 2019

Nanopore bacterial genome assemblies

tl;dr You should be able to get 99.95%+ DNA genome assembly accuracy from an Oxford Nanopore Technologies MinION device using the combination of Guppy modified base high accuracy base calling + Flye genome assembly w/ 2 rounds of polishing + 4 rounds of racon polishing + one round of medaka polishing. Steps to make a hybrid assembly with Illumina (SPAdes + pilon) are also included.

On a properly kitted Linux desktop computer with a good graphics card, this should take about an hour per genome, and does *not* require administrator (root or sudo) privileges. All the software in this tutorial can be installed in your local account using standard *nix commands and conda.

There may be similar scripts to do all this elsewhere, but I'll try to explain every step a bit here so you don't feel that you're cargo culting nanopore bioinformatics.

______

Handheld nanopore sequencing with the MinION and its ilk has put the power to generate sequence data into every lab, but the informatics can still be a bit intimidating for folks, as there are so many options. This is one way to generate bacterial genome assemblies from nanopore data that has worked pretty well for us.

SETUP

I am assuming, at minimum, that you are running Linux on a computer with at least 4 CPU cores and an NVIDIA GPU. Pretty much any gaming laptop would suffice, or a ~$2000 desktop. You should know the basics of launching a terminal window and navigating the file system from the command line. If you don't know these essentials, please spend 45 minutes at Software Carpentry to learn them... trust me, you won't regret it!

I am also assuming that you have some R9.4.1 flow cell data, which you've generated on the MinION, Flongle or PromethION and placed on the Linux machine that you will use to run this tutorial.

If you don't already have a conda software manager on the Linux box, install one with all the defaults:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh 
bash Miniconda3-latest-Linux-x86_64.sh 

Complete the installation, then log out of and back into your terminal session so it's all set up.

Add conda "channels" (repositories) that might be useful for finding packages we're going to use.

conda config --add channels conda-forge 
conda config --add channels defaults 
conda config --add channels bioconda 

Create an environment that will house all the programs we're using in this tutorial, then activate it:

conda create -n nanopore_assembly 
conda activate nanopore_assembly 

All of the conda commands in this tutorial only need to be run the very first time you try this tutorial, except the conda activate and deactivate commands, which should be run each time.
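
So in a later terminal session, picking up where you left off is just:

conda activate nanopore_assembly
# ...run whatever steps you need...
conda deactivate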

NANOPORE ASSEMBLY

Now, let's install the Oxford Nanopore Technologies base caller Guppy (for the latest version, update the address by visiting the Community Page):

wget https://mirror.oxfordnanoportal.com/software/analysis/ont-guppy_3.4.3_linux64.tar.gz 
tar zxvf ont-guppy_3.4.3_linux64.tar.gz 

Run the Guppy base caller on your data, using the high accuracy model with base modification (methylation) calls:

ont-guppy/bin/guppy_basecaller --input_path /path/to/your/directory/of/fast5_pass --save_path /path/where/you/want/to/write/results --num_callers 4 --config dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg --device cuda:all:100%

The output should look something like this if successful:

ONT Guppy basecalling software version 3.4.3+f4fc735
config file:        /opt/ont/guppy/data/dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg
model file:         /opt/ont/guppy/data/template_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.jsn
input path:         /localhome/gordonp/For Paul/19-11-19 pooled 6-10/pooled_6-10/pooled/20191119_1839_MN30852_FAL04937_2e9551ca/fast5/
save path:          guppy
chunk size:         1000
chunks per runner:  512
records per file:   4000
num basecallers:    8
gpu device:         cuda:all:100%
kernel path:        
runners per device: 8

Found 109 fast5 files to process.
Init time: 1951 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 1639062 ms, Samples called: 23534183594, samples/s: 1.43583e+07
Finishing up any open output files.

Basecalling completed successfully.

There are actually a lot of tweaks you can make to the previous command to accelerate the base calling, depending on the NVIDIA GPU on board your computer. This is left as an exercise for the reader. If the command fails, try removing the "--num_callers 4" option, and run nvidia-smi to make sure nothing else is using the GPU except the usual /usr/bin system processes.
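
For example (a sketch only, not gospel: the right values depend on your GPU's memory, and guppy_basecaller --help lists the full set of options for your version), you could try raising the chunks per runner and runners per device from the values shown in the log above:

ont-guppy/bin/guppy_basecaller --input_path /path/to/your/directory/of/fast5_pass --save_path guppy --config dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg --device cuda:all:100% --chunks_per_runner 1024 --gpu_runners_per_device 8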

Your results are in the guppy subdirectory, so let's go there:

cd guppy

*If* you barcoded a number of samples into one run (since a MinION run has the capacity to do multiple 2-5Mbp bacterial genomes), you will need to demultiplex the FASTQ files generated by Guppy, using the ONT software qcat:

conda install qcat
conda install future
cat *.fastq | qcat --trim -b demultiplexed

Hopefully you see an output that corresponds to the barcodes that you actually used! There will be a FASTQ file for each sample, guppy/demultiplexed/barcode##.fastq, plus maybe a few false positive barcode IDs that you can safely ignore.

Adapters detected in 336008 of 432735 reads
  NBD104/NBD114 336008: |      ############### |  77.65 %
           none  85909: |                  ### |  19.85 %
Barcodes detected in 336008 of 432735 adapters
      barcode01     16: |                      |   0.00 %
      barcode02      1: |                      |   0.00 %
      barcode04      5: |                      |   0.00 %
      barcode05      9: |                      |   0.00 %
      barcode06 105209: |                 #### |  24.31 %
      barcode07  63486: |                   ## |  14.67 %
      barcode08  67515: |                  ### |  15.60 %
      barcode09  57684: |                   ## |  13.33 %
      barcode10  42037: |                    # |   9.71 %
      barcode12      1: |                      |   0.00 %
      barcode13      5: |                      |   0.00 %
      barcode14      9: |                      |   0.00 %
      barcode15      7: |                      |   0.00 %
      barcode16      1: |                      |   0.00 %
      barcode17      1: |                      |   0.00 %
      barcode18      6: |                      |   0.00 %
      barcode19      5: |                      |   0.00 %
      barcode21      4: |                      |   0.00 %
      barcode22      6: |                      |   0.00 %
      barcode24      1: |                      |   0.00 %
           none  85909: |                  ### |  19.85 %
10818 reads were skipped due to the min. length filter.

Demultiplexing finished in 308.34s
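
If you'd like a second opinion on those per-barcode yields, a FASTQ record is four lines, so a quick count over the demultiplexed files (assuming the barcode##.fastq naming above) does the trick:

for f in demultiplexed/barcode*.fastq; do
    echo "$f: $(($(wc -l < "$f") / 4)) reads"
done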

Now we are almost ready to assemble the reads from a sample into a draft genome. If you multiplexed, you'll need to go into the subdirectory with the results of the demultiplexing:

cd demultiplexed

For everything that follows in this tutorial, naturally you will run the commands for each barcode##.fastq file. *Alternatively*, if you only had one sample, concatenate all the Guppy output into one FASTQ file called 'all.fastq':

cat *_0.fastq > all.fastq

We are going to use the Flye assembler to generate a draft genome for a given barcode (or all.fastq if you only had one sample on the run). In this case, I'm assembling the sample labelled with barcode06, and I suspect it might have plasmids, so we'll check for those. The genome should be about 3Mbp, hence the '-g 3m' option, but don't worry, this need only be a rough estimate.

conda install flye 
flye --nano-raw barcode06.fastq --threads 8 --iterations 2 --plasmids -g 3m --out-dir barcode06

In my limited experience, asking for more than 2 iterations of polishing from Flye can be counterproductive, but your mileage may vary. The output is in the directory barcode06, so let's go there:

cd barcode06

The Flye output is in assembly.fasta, but it's kind of rough. Let's now apply a nice general genome polisher called racon, allowing up to 15% error in the read overlaps it uses (-e 0.15) and a 500 bp correction window (-w 500). We'll do this over four iterations to fix up the consensus sequence (with the final result in racon4.fasta):

conda install racon
conda install minimap2
ln -s assembly.fasta racon0.fasta
for n in 1 2 3 4; do
    # overlap the reads with the previous round's consensus (minimap2 writes PAF by default)
    minimap2 racon$((n-1)).fasta ../barcode06.fastq > minimap.paf
    # -m/-x/-g are the match/mismatch/gap scores; the output becomes the next round's input
    racon ../barcode06.fastq minimap.paf racon$((n-1)).fasta -e 0.15 -t 8 -m 8 -x -6 -g -8 -w 500 > racon$n.fasta
done

Why four iterations? That's what the tool in the next step was trained using, so we're feeding it what it's seen before. There is also a fancier version of racon that uses the GPU, but if you're using this tutorial because you're a novice, maybe you're best to explore racon-gpu once you have more experience.

At this point, we'd like to run the ONT-developed medaka polisher, which helps resolve a lot of the homopolymer consensus errors typical of nanopore signal (e.g. deciding between 5 A's in a row vs. 7). *But*, its dependencies are incompatible with other tools we've used so far, so we'll need to switch to a new conda environment (this is one of the key strengths of conda: putting a fence between incompatible software).

conda deactivate
conda create -n nanopore_polishing
conda activate nanopore_polishing
conda install medaka 

Now we're ready to run the medaka polisher:

medaka_consensus -i ../barcode06.fastq -d racon4.fasta -o medaka -t 8 

Your final nanopore-only genome is in medaka/consensus.fasta. Most of the remaining ~0.05% of errors will be in homopolymer stretches.
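
If you'd like a quick sanity check of how many contigs you ended up with and how big they are, indexing the FASTA works nicely (assuming samtools is available in this environment; if not, conda install samtools):

samtools faidx medaka/consensus.fasta
cut -f1,2 medaka/consensus.fasta.fai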

HYBRID ASSEMBLY (OPTIONAL)

If you also have Illumina paired-end short read data for the same biological sample, these can be used to correct the assembly by first de novo assembling the short reads with SPAdes (using the nanopore long read assembly as a trusted guide to help get through the repetitive regions that are troublesome in short read assembly):

conda install spades
spades.py -o spades --trusted-contigs medaka/consensus.fasta -1 /path/to/illumina/sample_R1_001.fastq.gz  -2 /path/to/illumina/sample_R2_001.fastq.gz

Now let's polish the short read assembly with Pilon, which first requires mapping the Illumina reads to our new SPAdes assembly:

conda install pilon samtools
minimap2 -ax sr spades/contigs.fasta /path/to/illumina/sample_R1_001.fastq.gz /path/to/illumina/sample_R2_001.fastq.gz | samtools sort -T tmp -O bam - > illumina.bam
samtools index illumina.bam
pilon --genome spades/contigs.fasta --frags illumina.bam

Then we'll take note of where the SPAdes contigs are positioned within the nanopore consensus genome.

minimap2 -a medaka/consensus.fasta pilon.fasta > pilon_vs_trusted.sam

Now we need to overlay the high quality (read: much fewer homopolymer problems) short-read assembly on the trusted, contiguous long-read assembly (read: resolves long repeats properly). I've written a script for that, so we'll download it and change its file mode to make it executable.

wget -O trusted_pilon_overlay https://owncloud.westgrid.ca/index.php/s/S4BV5FpEQao7Kbu/download
chmod u+x trusted_pilon_overlay

Now let's run the overlay script:

./trusted_pilon_overlay medaka/consensus.fasta pilon_vs_trusted.sam pilon_overlay_on_trusted.fasta

Your hybrid polished genome is in 'pilon_overlay_on_trusted.fasta'. Enjoy!  Note that if you want to polish any plasmids with the short reads as well, there are additional steps to be taken, but I will do a separate post about that since there are a number of caveats.




July 2, 2019

Hybrid PacBio + Illumina assembly: turning Unicycler on its head


tl;dr Unicycler is an amazing program that does a great job of assembling microbial genomes from short and long reads. In hybrid assembly mode, short reads are assembled into contigs, then these are spanned using long reads such as those from PacBio or Oxford Nanopore Technologies. This is an oversimplified version, but is the gist of it.

Occasionally though, I find that highly repetitive genome structures are better resolved (even vs. Unicycler's 'bold' mode) by doing the opposite: assemble the long reads, then use them as a trusted guide for short read assembly, and then use the long reads to fill the gaps in the guide short assembly.
___________

Note: All of the tools described below are available in BioConda, and can be installed using commands like "conda install lordec". Links to the original tools pages are provided for reference only.
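
For example, the whole toolchain used below could be pulled in with something along these lines (package names as they appear in BioConda at the time of writing):

$ conda install lordec canu spades minimap2 samtools pilon blast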

Step 1. Employ LoRDEC to correct the PacBio reads using the short reads:

$ lordec-correct -2 all_illumina.fastq -k 19 -s 3 -i PacBio.filtered_subreads.fastq -o pacbio-corrected.fastq

Step 2. Run Canu on the long reads to get a draft assembly (this is about a 3Mbp genome in my example):


$ canu -p my_isolate_name -d canu genomeSize=3.0m -pacbio-raw pacbio-corrected.fastq

Step 3a. Pick the reasonable contigs from the canu output directory (which we imaginatively called "canu"):


$ cd canu
$ grep '>' my_isolate_name.contigs.fasta 
>tig00000001 len=2709880 reads=3198 covStat=2122.56 gappedBases=no class=contig suggestRepeat=no suggestCircular=no

If more than one contig, and one is a lot smaller than the other, maybe you have plasmids.
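
A quick way to eyeball the contig sizes is to pull the len= fields out of the canu headers shown above and sort them, largest first:

$ grep -o 'len=[0-9]*' my_isolate_name.contigs.fasta | sort -t= -k2 -rn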

Step 3b. Run SPAdes on the short reads with these as trusted contigs:

$ spades.py -o spades.2002_S45 --trusted-contigs GD2002.contigs.fasta -1 ../2002_S45_L001_R1_001.fastq -2 ../2002_S45_L001_R2_001.fastq --cov-cutoff auto

And if you suspect you have a plasmid, run the same thing again but with a new output directory and the --plasmid flag:

$ spades.py -o spades.2002_S45.plasmid --trusted-contigs GD2002.contigs.fasta -1 ../2002_S45_L001_R1_001.fastq -2 ../2002_S45_L001_R2_001.fastq --cov-cutoff auto --plasmid

Step 4a. Run minimap2 piped into samtools, for the short reads against the SPAdes output (needed for Pilon in Step 5):

$ cd spades
$ minimap2 -ax sr contigs.fasta ../../1042_S33_L001_R1_001.fastq ../../1042_S33_L001_R2_001.fastq | samtools sort -T tmp -O bam > short_reads_to_asm.bam

Step 4b. Sanity check that most reads actually mapped, and index the BAM file:

$ samtools index short_reads_to_asm.bam
$ samtools flagstat short_reads_to_asm.bam
843188 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
5026 + 0 supplementary
0 + 0 duplicates
839119 + 0 mapped (99.52% : N/A)
838162 + 0 paired in sequencing
419081 + 0 read1
419081 + 0 read2
825398 + 0 properly paired (98.48% : N/A)
830940 + 0 with itself and mate mapped
3153 + 0 singletons (0.38% : N/A)
624 + 0 with mate mapped to a different chr
543 + 0 with mate mapped to a different chr (mapQ>=5)

Step 5. Run Pilon to polish the SPAdes assembly.

$ pilon --genome contigs.fasta --frags short_reads_to_asm.bam


Step 6. Map the Pilon assembly to the canu long reads assembly:

$ minimap2 -a ../*.contigs.fasta pilon.fasta > pilon_vs_trusted.sam

Step 7. Use this little script (rename to trusted_pilon_overlay and make executable on Linux) to overlay the Pilon polished Illumina assembly contigs (higher per base quality) onto the long reads canu assembly (lower per base quality) for the final genome:

$ trusted_pilon_overlay ../*.contigs.fasta pilon_vs_trusted.sam pilon_overlay_on_trusted.fasta

Step 8a: Check that the plasmids are real with BLAST matches to existing plasmids:

blastn -db nt -query pilon_overlay_on_trusted.fasta -out pilon_overlay_on_trusted.fasta.blastn_nt.txt -remote

Step 8b: Check that you didn't assemble the Illumina phiX spike-in control sequence into your genome! That would be embarrassing (and all too common in GenBank).


$ makeblastdb -dbtype nucl -in phiX.fna
$ blastn -db phiX.fna -query pilon_overlay_on_trusted.fasta -out pilon_overlay_on_trusted.phiX.txt

The result should be "No hits found".  Woohoo, you're done!

Step 9 (optional).  If you still have multiple contigs, start exploring the assembly graph (the .gfa file in your outputs) visually in Bandage to see what might be the cause. Loading a Unicycler assembly and overlaying the pilon_overlay_on_trusted.fasta file as BLAST matches can be informative too, like below, where the bubbles represent multiple nearly identical, partially integrated phages in a genome. In the case below, my method builds two contigs, displayed as green and blue BLAST hits over the 22 contigs of Unicycler's 'bold' assembly.






October 23, 2018

Graphing Kallisto RNASeq results: Pretty box plotting genes by experiment factor levels


Assuming you have already run a bunch of samples with Kallisto against a relevant transcript database, and have the outputs in folders called samplename.kallisto, run this simple script to generate the FPKM data at the transcript and gene levels (in this case, human):

$ kallisto_to_fpkm refseq2gene_human

Start R, and first load the experiment metadata:

> meta <- read.table("meta.tab", header=TRUE)

> meta$path <- as.character(meta$path)
> meta
      sample               path tgfb pfd
1   Library1  Library1.kallisto    0   0
2   Library2  Library2.kallisto    0   0
3   Library3  Library3.kallisto    0   0
4   Library4  Library4.kallisto    1   0
5   Library5  Library5.kallisto    1   0
6   Library6  Library6.kallisto    1   0
7   Library7  Library7.kallisto    0   1
8   Library8  Library8.kallisto    0   1
9   Library9  Library9.kallisto    0   1
10 Library10 Library10.kallisto    1   1
11 Library11 Library11.kallisto    1   1
12 Library12 Library12.kallisto    1   1

You'll see here that I have two treatments (TGFB and PFD), run independently and in combination, as well as wild type.  I'm going to manually assign reasonable factor level names (e.g. "wt" for wild type) that will be used on the graph later.

> library(hashmap)
> sample2category <- hashmap(meta$path, c(rep("wt",3), rep("tgfb", 3), rep("pfd", 3), rep("tgfb+pfd", 3)))

> sample2category
##          (character) => (character)
##  [Library4.kallisto] => [tgfb]     
##  [Library2.kallisto] => [wt]       
## [Library11.kallisto] => [tgfb+pfd] 
## [Library12.kallisto] => [tgfb+pfd] 
##  [Library5.kallisto] => [tgfb]     
##  [Library3.kallisto] => [wt]       
##                [...] => [...]      

Looks good.  Let's load up the gene level FPKM data we generated at the very start:

gene_fpkm <- read.table("gene_fpkm.txt", header=TRUE, row.names=1)


Suppose we have a subset of genes that are of particular interest; let's load them.  It's a simple text file with one gene name per line, in this case 30 genes.

> cancer <- read.table("cancer.txt", colClasses=c("character"))

Let's just work with the subset of FPKM values from the genes of interest.  For the sake of plotting a reasonable vertical axis range, I'm turning the FPKM values into log2(FPKM+1).

gene_fpkm_cancer <- t(log2(gene_fpkm[cancer$V1,] + 1))


What we need to do to generate the boxplots is turn the (gene, sample) matrix into a flatter, long table where we have multiple gene -> value instances for each experiment factor combination.  The flattening is easy, using the function melt().

> library(reshape2)
d <- melt(gene_fpkm_cancer)


> d[1,]

                  X1    X2    value
1 Library10.kallisto MMP11 4.196717

As you can see from just printing the first row of the melt()ed table, X1 is the library, X2 is the gene name, and value is the log-transformed FPKM value. Let's add the category labels we generated earlier.


> d$category <- sample2category[[d$X1]]
> d[1,]

                  X1    X2    value category
1 Library10.kallisto MMP11 4.273322 tgfb+pfd

Nice, now we are ready to plot, with some fancy options to make it pretty.

> library(ggplot2)
> ggplot(d) +
    geom_boxplot(aes(category, y=value)) +                 # boxplot for each category
    facet_grid(.~X2) +                                     # boxplot pane for each gene
    theme(axis.text.x=element_text(angle=90, hjust=1)) +   # vertical category labels
    scale_x_discrete(limits=c("wt","tgfb","pfd","tgfb+pfd")) # reorder category labels




February 8, 2018

Analyzing Bisulfite Treated Genome Data (differential methylation detection)


tl;dr Bismark+DSS+ChIPpeakAnno+BioMart is a good way to find differentially methylated regions of genomes that have been bisulphite treated and sequenced, whether you have one or more experimental factors. There are some tricks to getting the annotations exported, due to the non-one-to-one nature of differentially methylated regions and nearby genes.
___________________

DSS is a good choice if you are trying to ensure very few false positives (i.e. specificity) while still getting decent recall/sensitivity (according to this nice comparison paper).

mkdir /export/common/dbs/bismark/danRer10

cp reference_genome.fa /export/common/dbs/bismark/danRer10

Ensure that the bowtie2 and samtools executables are in your PATH, e.g. install BioBuilds
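
A quick way to confirm the shell can find both before kicking off a long run:

which bowtie2 samtools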

bismark_genome_preparation /export/common/dbs/bismark/danRer10

Then do the following for each of your samples to count the bisulphite converted sites:

bismark -p 4 -un --ambiguous -N 1 -L 28 --non_bs_mm /export/common/dbs/bismark/danRer10/ path/to/my_ctl_1.fastq.gz -B my_ctl_1

bismark_methylation_extractor --scaffolds --bedGraph --parallel 5 --comprehensive --single-end --merge_non_CpG my_ctl_1.bam

Install DSS if you don't already have it, by starting R then:


source("http://bioconductor.org/biocLite.R")
biocLite("DSS");

biocLite("org.Dr.eg.db");

Create an experiment metadata file for the sample factors:

library(DSS)
require(bsseq)

meta <- read.table("meta.tab", header=TRUE)
meta

      sample site sex
1  CUSH_27_M CUSH   M
2  CUSH_30_F CUSH   F
3  CUSH_35_F CUSH   F
4  CUSH_37_M CUSH   M
5  CUSH_38_F CUSH   F
6  CUSH_39_M CUSH   M
7  GLEN_60_M GLEN   M
8  GLEN_64_M GLEN   M
9  GLEN_65_F GLEN   F
10 GLEN_69_M GLEN   M
11 GLEN_72_F GLEN   F
12 GLEN_76_F GLEN   F

Transform the Bismark output into the format that DSS requires. With replicate whole genome bisulphite data, this can take a while and use a lot of memory (20 GB+).

data <- lapply(meta$sample, function(x){
    # Bismark .cov columns: chr, start, end, %methylation, count methylated, count unmethylated;
    # skip the end and %methylation columns, then rename to what DSS expects
    t <- read.table(paste0(x, ".bismark.cov"),
                    colClasses = c("character", "integer", "NULL", "NULL", "integer", "integer"))
    colnames(t) <- c("chr", "pos", "X", "N")
    t[,"N"] <- t[,"N"] + t[,"X"]  # N becomes the total read count at the position
    t
})

BSobj <- makeBSseqData(data, meta$sample)

I've got multiple factors in this experiment, so I'll use the more complex fit function in DSS to find loci.

DMLfit <- DMLfit.multiFactor(BSobj, design=meta, formula=~site*sex)

DMLtest.site <- DMLtest.multiFactor(DMLfit, coef=2)

Now merge those loci into regions with more powerful statistics:

DMRtest.site <- callDMR(DMLtest.site, p.threshold=0.05)

ix <- sort(DMLtest.site[,"pvals"], index.return=TRUE)$ix
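
To actually look at the top of that sorted list (DMLtest.multiFactor returns a data frame, so plain row indexing works):

head(DMLtest.site[ix,])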

Inspecting these sorted results, I see that the effect of sex is very minimal, so I can go ahead and just consider the site factor in a dichotomous model that includes smoothing for better dispersion estimation in the underlying beta-binomial model of DSS.

dmlTest.site <- DMLtest(BSobj, group1=as.character(meta$sample[1:6]), group2=as.character(meta$sample[7:12]), smoothing=TRUE)

Grab our site-dependent peak loci and peak regions:
dmls <- callDML(dmlTest.site, p.threshold=0.001)
dmrs <- callDMR(dmlTest.site, p.threshold=0.01)

Visually have a look at the most significant differentially methylated region, for sanity's sake:

showOneDMR(dmrs[1,], BSobj)

Then write them to a file:

write.csv(dmrs, "dds_dmr_site_only_fdr_lt_0.01.csv")

Annotate peaks with information from nearby genes:

library(ChIPpeakAnno)

Unfortunately for me, danRer10 isn't a default in this module, so we need to roll our own data source from BioMart:


library(biomaRt);

mart <- useMart(biomart="ensembl", dataset="drerio_gene_ensembl")

danRer10_anno <- getAnnotation(mart, featureType="TSS")

dmr_gRanges <- GRanges(seqnames=Rle(dmrs$chr), ranges=IRanges(start=dmrs$start, end=dmrs$end))

anno <- annotatePeakInBatch(dmr_gRanges, AnnotationData=danRer10_anno)

G_list <- getBM(filters= "ensembl_gene_id", attributes= c("ensembl_gene_id", "entrezgene", "description"),values=anno$feature,mart= mart)


Now you have three tables with different numbers of rows, since some DMRs map to more than one feature and not all features have annotations. Blergh. Hash tables to the rescue!

library(hash)
e2g <- hash(G_list$ensembl_gene_id, G_list$entrezgene)

e2d <- hash(G_list$ensembl_gene_id, G_list$description)

p2dmr <- hash(1:dim(dmrs)[1], lapply(1:dim(dmrs)[1], function(x){as.vector(c(dmrs$nCG[x], dmrs$meanMethy1[x], dmrs$meanMethy2[x], dmrs$diff.Methy[x], dmrs$areaStat[x]))}))


good_annos <- anno[which(!is.na(anno$feature)),]

h2v <- function(hsh, ks){sapply(ks, function(x){values(hsh[x])})}  # look up a vector of keys ks in hash hsh


Nice! We also need to get rid of the leading zeros that ChIPpeakAnno annoyingly adds to the peak numbers, which the as.character(as.numeric(...)) call below takes care of.

df <- data.frame(good_annos, h2v(e2g, good_annos$feature), h2v(e2d, good_annos$feature), t(h2v(p2dmr, as.character(as.numeric(good_annos$peak)))))

write.csv(df, "dss_site_only_dmr_fdr_lt_0.01_gene_annotation.csv")

Nice! Now we have a list of differentially methylated regions and the transcription start sites that they are closest to. You might load this list into Ingenuity Pathway Analysis, DAVID or MatInspector to look for higher level biological patterns linking your genes.





February 1, 2018

Are you losing important transcripts in your Kallisto/Sleuth RNASeq analysis?


tl;dr The default transcript filter function parameters in Sleuth are suitable for a single factor, two level contrast RNASeq experiment. If you are running a two-factor experiment (e.g. knock out vs. wild type, plus control vs. treatment), or an experiment with multiple factor levels (e.g. time series), you should probably use a filter function such as the one described below. You will retain more true positive differentially expressed genes, without generating too many new false positives.

________________________________

I've been a heavy user of Kallisto and Sleuth for RNASeq analysis for some time, and was used to seeing output similar to the following when loading up a dataset:

> so <- sleuth_prep(meta, ~ condition+cell_line+condition:cell_line)
reading in kallisto results
............
normalizing est_counts
26036 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps

I hadn't given much consideration to how the "filter" statistic was generated, until I had a 5 time point series experiment where we had a priori knowledge of the activation of a transcript only at the last two timepoints. This transcript did not show up in the Sleuth analysis with any p-value, let alone a significant one. A few days later in a two-factor experiment (growth condition and cell line), there were also some missing known transcripts.  

The default filtering function in Sleuth (called basic_filter) requires at least 5 mapped reads to a transcript in at least 47% of the samples. This reduces spurious identification of differential expression in near-zero abundance transcripts, but retains genes that are moderately but consistently expressed in one of two factor levels (e.g. expressed-in-control-only transcripts, or expressed-in-treatment-only transcripts).
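
For reference, basic_filter is essentially a one-liner (paraphrasing the sleuth source; row is the vector of estimated counts for one transcript across all samples):

> basic_filter <- function(row, min_reads = 5, min_prop = 0.47){
    mean(row >= min_reads) >= min_prop}

So with 12 samples, a transcript needs at least 5 reads in at least 6 of them to survive.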

If I have two factors in my RNASeq experiment (3 replicates is typical, for 12 samples), this filter would eliminate transcripts only expressed in the interaction term, such as condition:cell_line in the above example.  Here's the metadata:

> meta

        sample                 path condition cell_line
1    NSC_Ctl_1   NSC_Ctl_1.kallisto       NSC       Ctl
2    NSC_Ctl_2   NSC_Ctl_2.kallisto       NSC       Ctl
3    NSC_Ctl_3   NSC_Ctl_3.kallisto       NSC       Ctl
4     NSC_KO_1    NSC_KO_1.kallisto       NSC        KO
5     NSC_KO_2    NSC_KO_2.kallisto       NSC        KO
6     NSC_KO_3    NSC_KO_3.kallisto       NSC        KO
7  Odiff_Ctl_1 Odiff_Ctl_1.kallisto        OD       Ctl
8  Odiff_Ctl_2 Odiff_Ctl_2.kallisto        OD       Ctl
9  Odiff_Ctl_3 Odiff_Ctl_3.kallisto        OD       Ctl
10  Odiff_KO_1  Odiff_KO_1.kallisto        OD        KO
11  Odiff_KO_2  Odiff_KO_2.kallisto        OD        KO
12  Odiff_KO_3  Odiff_KO_3.kallisto        OD        KO

The condition:cell_line term gleans data from only 3 (25%) of the samples (i.e. those that are OD:KO). Let's change the filter to only require >=5 reads in 25% of the samples...

> so <- sleuth_prep(meta, ~ condition+cell_line+condition:cell_line,
                          filter_fun=function(x){basic_filter(x, 5, 0.25)})
reading in kallisto results
............
normalizing est_counts
36320 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps

Whoa! We just increased the number of transcripts passing the filter by about 40% (26,036 -> 36,320), which leads to a huge inflation of false positives in the differential expression, and just as importantly, detrimentally affects the q-values for the genes in our original, default-filtered analysis.  A smarter filter might be to require 100% of the samples within some present factor level to have at least 5 reads, i.e. keep any transcript where all replicate samples for a factor level moderately express it.

[Puts on thinking cap, writes several failed attempts, then...]

> design_filter <- function(design, row, min_reads=5, min_prop = 0.47){
    # keep a transcript if, for any column of the design matrix, some factor
    # level has ALL of its samples at >= min_reads, or if the default filter passes
    sum(apply(design, 2, function(x){
        y <- as.factor(x);
        return(max(tapply(row, y, function(f){sum(f >= min_reads)})/
                   tapply(row, y, length)) == 1 
                   || basic_filter(row, min_reads, min_prop)
              )
    })) > 0}

To pass in the design matrix that my new filter requires, I can just reuse the one my first call to sleuth_prep() generated, rather than making it myself.  Probably not a bad idea to do it this way in any case, so we can then compare how many transcripts pass this new filter vs. the default filter.

> so_redux <- sleuth_prep(meta, ~cell_line*condition, 
                          filter_fun=function(x){design_filter(so$design,x)})
reading in kallisto results
............
normalizing est_counts
26370 targets passed the filter
normalizing tpm
merging in metadata
normalizing bootstrap samples
summarizing bootstraps

Although for this dataset the new filter also requires ~25% of samples to have moderate expression, the added constraint that those 25% cover all replicates of some factor level means adding just 334 transcripts to the analysis instead of more than 10,000.  This seems much more reasonable to me, and my known true positive transcript suddenly appeared. #winning

Note that the design_filter() should work for any set of nominal factors, but not quantitative factors. A column slice of the design matrix could be passed in accordingly in the so_redux code above.
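
For example, if the design matrix happened to include a hypothetical quantitative covariate called "age", you could slice it out before handing the matrix to the filter:

> nominal_cols <- colnames(so$design) != "age"  # "age" is a hypothetical quantitative covariate
> so_redux <- sleuth_prep(meta, ~cell_line*condition,
                          filter_fun=function(x){design_filter(so$design[, nominal_cols], x)})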