February 9, 2017

Debugging RNASeq: sample swaps and batch effects

tl;dr If a standard differential expression analysis yields nothing for an experiment, look at the between-sample distance matrix of your RNASeq samples and see if anything looks out of place. Sometimes you can validate sample swaps when different cell lines are involved, and even strong batch effects can be modelled effectively to reveal the underlying factor effects you're expecting.

______________
There's a particular sinking feeling you get when you do a differential gene expression analysis of a carefully planned RNASeq experiment and get no significant changes. I got that recently, when analyzing a two-factor Latin Square experiment done in triplicate.  The experiment involved two different cell lines (factor 1, with levels E545K and EV) and a treatment (factor 2, with levels plus and minus).  We had good reason to believe that there should be an interaction effect between one of the cell lines and the treatment.  I ran a Kallisto + Sleuth analysis with a linear model of ~cell_line+treatment+cell_line:treatment, and an experiment metadata file like so:

sample                path                           cell_line treatment 
E545K_minus_CP_Dec_1  E545K_minus_CP_Dec_1.kallisto  K         ctl      
E545K_minus_CP_Dec_2  E545K_minus_CP_Dec_2.kallisto  K         ctl       
E545K_minus_CP_Nov_30 E545K_minus_CP_Nov_30.kallisto K         ctl       
E545K_plus_CP_Dec_1   E545K_plus_CP_Dec_1.kallisto   K         exp      
E545K_plus_CP_Dec_2   E545K_plus_CP_Dec_2.kallisto   K         exp       
E545K_plus_CP_Nov_30  E545K_plus_CP_Nov_30.kallisto  K         exp      
EV_minus_CP_Dec_1     EV_minus_CP_Dec_1.kallisto     EV        ctl      
EV_minus_CP_Dec_2     EV_minus_CP_Dec_2.kallisto     EV        ctl       
EV_minus_CP_Nov_30    EV_minus_CP_Nov_30.kallisto    EV        ctl      
EV_plus_CP_Dec_1      EV_plus_CP_Dec_1.kallisto      EV        exp      
EV_plus_CP_Dec_2      EV_plus_CP_Dec_2.kallisto      EV        exp      
EV_plus_CP_Nov_30     EV_plus_CP_Nov_30.kallisto     EV        exp      

Note that I called the E545K samples "K" because I'm lazy and want to use R's default behaviour of making the alphabetically first factor level the base level, so reported ratios will be K:EV. As I said, I got nothing significant...so I decided to run a bwa + DESeq2 analysis with the same model instead. That analysis generates a distance matrix like so:

[Figure: heatmap of the sample-to-sample distance matrix, with the clustering dendrogram on the left.]
[Note: you can glean similar information from the Sleuth PCA plots in the sleuth_live interface, but I often find them too cluttered.] The most important thing to look at on this chart is the cluster tree on the left-hand side. First, most of the EV cell line samples cluster together at the bottom, with the one exception of EV_minus_CP_Dec_2, which sits way up with the first sample, E545K_plus_CP_Dec_2. Conversely, a single E545K sample, E545K_minus_CP_Dec_2, clusters with the EV samples.  Maybe these two are swapped?

For supporting evidence, we can look in the mapped reads for the G>A variant that's specific to cell line E545K.  That's something you can't do with a gene expression microarray! (Talk about kicking a platform when it's down.)
for b in E*.bam; do
  echo $b;
  samtools mpileup -r chr3:178936091-178936091 $b 2> /dev/null;
done

E545K_minus_CP_Dec_1.bam
chr3 178936091 N 64 ggggggggggggggggggggggaggggggggggggggggggggggggggggggggggggg^Wg^Wg^Wg^Wg AEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEE/EEEEEEE/EEEEEEEEEEEEEEEEEEEE

E545K_minus_CP_Dec_2.bam
chr3 178936091 N 20 gggggggggggggggggggg EEEAEEEEEEEEEE/EE<EE

E545K_minus_CP_Nov_30.bam
chr3 178936091 N 77 ggggggggggggggggggggggggggggggggggggggggggggggggggggggaggggggggggggggggggggg^Wg AAAEEEEEEAEEA/EEEEEEE/EEEAEAEEEEEEEEEEEEAAEEEE/EEEEAAEEEEEEEEEEEEEE/E<<EAEEEE

E545K_plus_CP_Dec_1.bam
chr3 178936091 N 33 gggggggggaggggggggggggagggggggggg AAAEEEEEEE<EEEE<E/EEEEEEEEAEE/EEE

E545K_plus_CP_Dec_2.bam
chr3 178936091 N 56 ggggggggAgggggggggggggggggggggggggggggggggggggggaggggggg AEEEEEEEEEAEAEE/EEEEEEEEEEEEEAEAEEEEE/AE/EE<E/EEAEEEEE<E

E545K_plus_CP_Nov_30.bam
chr3 178936091 N 36 ggagggggggaggggggggggggggggggggggggg E6EEEE/EEEE/AEEEEEAEEEE/EEAAEEEEEE//

EV_minus_CP_Dec_1.bam
chr3 178936091 N 30 g$ggggggggggggggggggggggggggggg AAAAEEEEEEEEEE/EEEEEAEEE/EEEA<

EV_minus_CP_Dec_2.bam
chr3 178936091 N 47 gggggggggagggggggaggggggggggaggggggggggggggggg^Wg AEAEEEEEEEEEEAEEEEEEEEEE/EEAEEEEEEEEAE6EEEE/AEE

EV_minus_CP_Nov_30.bam
chr3 178936091 N 52 g$ggggggggggggggggggggggggggggggggggggggggggggggggggg AAEEEAEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEE6EEEEAEEEE<EA/

EV_plus_CP_Dec_1.bam
chr3 178936091 N 31 gggggggggggggggggggggggggggggg^Wg AEEEEEE/EEEEEEEAEAEEEEEEE<A/EEE

EV_plus_CP_Dec_2.bam
chr3 178936091 N 30 gggggggggggggggggggggggggggggg AEEEEAEEEEEEEEAEEEEEEEEEE6<AAE

EV_plus_CP_Nov_30.bam
chr3 178936091 N 34 ggggggggGggggggggggggggggggggggggg AAAEEEE/EEEEEEEEA/EE/EEEEEEEEEEEA<


I've added the DNA base highlights manually to show which samples have the variant. Normally we'd expect it in the first six (all the E545K samples). The two that we suspect are swapped (also highlighted) defy that expectation: the EV sample has the G>A, and the E545K sample doesn't. Had the swap been between samples of the same cell line we wouldn't be able to use this evidence, but we're lucky, I guess.
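If you'd rather tally the variant fraction programmatically than eyeball the pileup strings, here's a minimal Python sketch (my own, not part of samtools) that strips mpileup's read-start/read-end markers and counts base calls. Since the reference column above was reported as "N", each base letter can be counted directly; indel notation isn't handled.

```python
# Minimal sketch: tally base calls in an mpileup base-call string.
# "^" is a read-start marker followed by a mapping-quality character,
# "$" marks a read end; both must be stripped before counting.
import re
from collections import Counter

def base_counts(pileup_bases: str) -> Counter:
    cleaned = re.sub(r'\^.', '', pileup_bases)   # drop ^ plus quality char
    cleaned = cleaned.replace('$', '')            # drop read-end marks
    return Counter(cleaned.upper())

# The suspected-swap EV sample shows the G>A variant:
counts = base_counts('gggggggggagggggggaggggggggggaggggggggggggggggg^Wg')
print(counts['A'], counts['G'])
```

Running this over each sample's pileup line gives the A:G ratio per sample, which is enough to confirm which libraries carry the E545K variant.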

Rerunning the analysis with the samples swapped still yields almost nothing.

The second thing to notice about the distance matrix is that the E545K samples (after correcting the sample swap) pair off closely in the tree based on the date the sample was prepared, but the EV samples do not (treatment vs. control is the primary clustering factor for those).  This suggests a strong batch effect that's specific to E545K cells and worse at later dates.  Let's swap the samples in the metadata file, and model the batch effect in a new Sleuth analysis in order to recover the real differential expression signal for the factors of interest: cell line, treatment, and the cell line/treatment interaction.  This means that our new metadata file (e_meta_swapped.tab) should look like this:

sample                path                           cell_line treatment date
E545K_minus_CP_Dec_1  E545K_minus_CP_Dec_1.kallisto  K         ctl       D1
E545K_minus_CP_Dec_2  EV_minus_CP_Dec_2.kallisto     K         ctl       D2
E545K_minus_CP_Nov_30 E545K_minus_CP_Nov_30.kallisto K         ctl       D0
E545K_plus_CP_Dec_1   E545K_plus_CP_Dec_1.kallisto   K         exp       D1
E545K_plus_CP_Dec_2   E545K_plus_CP_Dec_2.kallisto   K         exp       D2
E545K_plus_CP_Nov_30  E545K_plus_CP_Nov_30.kallisto  K         exp       D0
EV_minus_CP_Dec_1     EV_minus_CP_Dec_1.kallisto     EV        ctl       D1
EV_minus_CP_Dec_2     E545K_minus_CP_Dec_2.kallisto  EV        ctl       D2
EV_minus_CP_Nov_30    EV_minus_CP_Nov_30.kallisto    EV        ctl       D0
EV_plus_CP_Dec_1      EV_plus_CP_Dec_1.kallisto      EV        exp       D1
EV_plus_CP_Dec_2      EV_plus_CP_Dec_2.kallisto      EV        exp       D2
EV_plus_CP_Nov_30     EV_plus_CP_Nov_30.kallisto     EV        exp       D0

As I mentioned earlier, the batch effect seems worse (cluster tree separation is higher) at later dates, so we set Nov 30 as the base level for the date factor by giving it the alphabetically first factor level name, D0. Don't use just 0, 1, 2 unless you want to treat the date as a dosage effect (i.e. the hypothetical effect at Dec 2 would be exactly twice that at Dec 1).
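To make the categorical-vs-dosage distinction concrete, here's a small numpy sketch (with made-up expression values, mirroring what R's model.matrix does): dummy coding gives each date its own coefficient, while a numeric 0/1/2 covariate forces the Dec 2 effect to be exactly twice the Dec 1 effect.

```python
import numpy as np

# Six hypothetical samples, two per prep date
dates = ['D0', 'D1', 'D2', 'D0', 'D1', 'D2']
y = np.array([0.0, 5.0, 3.0, 0.0, 5.0, 3.0])   # made-up expression values

# Categorical coding (what R does with a factor): intercept + dateD1 + dateD2;
# base level is D0 because it sorts first alphabetically.
X_cat = np.array([[1.0, d == 'D1', d == 'D2'] for d in dates], dtype=float)

# Numeric coding: a single slope, so the D2 effect must be twice the D1 effect.
X_num = np.array([[1.0, float(d[1])] for d in dates])

beta_cat, *_ = np.linalg.lstsq(X_cat, y, rcond=None)
beta_num, *_ = np.linalg.lstsq(X_num, y, rcond=None)
print(np.allclose(X_cat @ beta_cat, y))   # categorical recovers both effects
print(np.allclose(X_num @ beta_num, y))   # linear "dosage" cannot
```

Here the D1 effect (5) isn't twice the D0-to-D2 step (3), so only the categorical coding fits exactly.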

I tried modelling just a straight date batch effect by adding date to the model.  This yielded 1000+ significant changes, but they were almost mirror images of the E545K effect (e.g. when E545K cells had a change of X for a transcript, the date factor usually had a change of -X).  This indicates that those E545K changes are artifactual, so we need to model the interaction of the date and cell line (date:cell_line). On with a new (correct) Sleuth analysis!

library(sleuth)
meta <- read.table("e_meta_swapped.tab", header=TRUE)
meta$path <- as.character(meta$path)

so <- sleuth_prep(meta, ~cell_line*treatment+date:cell_line+date)
so <- sleuth_fit(so)

Note that cell_line*treatment is R syntactic sugar for cell_line+treatment+cell_line:treatment. This gives us a model where gene expression is affected by three factors and two factor interactions.*

Now, one by one, we need to exclude each of the factors of the experiment from the linear model to test which transcripts are affected by the factor's exclusion.

so <- sleuth_fit(so, ~treatment+date, "no_cell_line")
so <- sleuth_fit(so, ~cell_line*treatment, "no_date")
so <- sleuth_fit(so, ~cell_line+date:cell_line+date, "no_treatment")
so <- sleuth_fit(so, ~cell_line+treatment+date:cell_line+date, "no_interaction")
so <- sleuth_fit(so, ~cell_line*treatment+date, "no_date_interaction")

The significance test of choice here is the Likelihood Ratio Test (LRT). Run it.

so <- sleuth_lrt(so, 'no_cell_line', 'full')
so <- sleuth_lrt(so, 'no_treatment', 'full')
so <- sleuth_lrt(so, 'no_interaction', 'full')
so <- sleuth_lrt(so, 'no_date', 'full')
so <- sleuth_lrt(so, 'no_date_interaction', 'full')

Grab the results, and filter to those with a qval (multiple-testing corrected p-value) of <.05.

results_table_cell_line_lrt <- sleuth_results(so, 'no_cell_line:full', test_type = 'lrt')
results_table_treatment_lrt <- sleuth_results(so, 'no_treatment:full', test_type = 'lrt')
results_table_interaction_lrt <- sleuth_results(so, 'no_interaction:full', test_type = 'lrt')
results_table_date_lrt <- sleuth_results(so, 'no_date:full', test_type = 'lrt')
results_table_date_interaction_lrt <- sleuth_results(so, 'no_date_interaction:full', test_type = 'lrt')

cell_line.lrt.sig_ids <- results_table_cell_line_lrt$target_id[which(results_table_cell_line_lrt$qval < 0.05)]
treatment.lrt.sig_ids <- results_table_treatment_lrt$target_id[which(results_table_treatment_lrt$qval < 0.05)]
interaction.lrt.sig_ids <- results_table_interaction_lrt$target_id[which(results_table_interaction_lrt$qval < 0.05)]
date.lrt.sig_ids <- results_table_date_lrt$target_id[which(results_table_date_lrt$qval < 0.05)]
date_interaction.lrt.sig_ids <- results_table_date_interaction_lrt$target_id[which(results_table_date_interaction_lrt$qval < 0.05)]


Now, let's make a (non-redundant) list of all the target IDs that pass the LRT significance filter. I'm excluding the LRT results for date and date:cell_line because they're huge, and I don't really care to report the specifics of those batch effects unless they occur in genes that are also affected by the original factors of interest (cell line, treatment, and their interaction).

lrt_ids <- unique(c(cell_line.lrt.sig_ids, treatment.lrt.sig_ids, interaction.lrt.sig_ids))

The LRT doesn't generate fold-change estimates for transcripts, since it only considers the factors' presence, not their levels. We need to run a Wald test for each model coefficient (7 total, since the date factor contributes 2 levels to contrast with the base level D0).

so <- sleuth_wt(so, 'dateD1')
so <- sleuth_wt(so, 'dateD2')
so <- sleuth_wt(so, 'cell_lineK')
so <- sleuth_wt(so, 'cell_lineK:treatmentexp')
so <- sleuth_wt(so, 'treatmentexp')
so <- sleuth_wt(so, 'cell_lineK:dateD1')
so <- sleuth_wt(so, 'cell_lineK:dateD2')

Grab the results.

results_table_date2_wt <- sleuth_results(so, 'dateD2')
results_table_date1_wt <- sleuth_results(so, 'dateD1')
results_table_date1_int_wt <- sleuth_results(so, 'cell_lineK:dateD1')
results_table_date2_int_wt <- sleuth_results(so, 'cell_lineK:dateD2')
results_table_cell_line_wt <- sleuth_results(so, 'cell_lineK')
results_table_interaction_wt <- sleuth_results(so, 'cell_lineK:treatmentexp')
results_table_treatment_wt <- sleuth_results(so, 'treatmentexp')

Let's report out only the Wald test results of each factor contrast that correspond to the LRT-passing targets (transcripts) identified earlier in the list lrt_ids.

shared_results_cell_line <- results_table_cell_line_wt[results_table_cell_line_wt$target_id %in% lrt_ids,]
shared_results_treatment <- results_table_treatment_wt[results_table_treatment_wt$target_id %in% lrt_ids,]
shared_results_interaction <- results_table_interaction_wt[results_table_interaction_wt$target_id %in% lrt_ids,]
shared_results_date2_int <- results_table_date2_int_wt[results_table_date2_int_wt$target_id %in% lrt_ids,]
shared_results_date1_int <- results_table_date1_int_wt[results_table_date1_int_wt$target_id %in% lrt_ids,]
shared_results_date1 <- results_table_date1_wt[results_table_date1_wt$target_id %in% lrt_ids,]
shared_results_date2 <- results_table_date2_wt[results_table_date2_wt$target_id %in% lrt_ids,]

Since we want to report out all the factors together (i.e. the influence of each factor on a transcript is reported on the same row), we need to sort each factor's Wald test results alphabetically by ID for consistency.

shared_results_cell_line <- shared_results_cell_line[sort.list(shared_results_cell_line[,1]),]
shared_results_treatment <- shared_results_treatment[sort.list(shared_results_treatment[,1]),]
shared_results_interaction <- shared_results_interaction[sort.list(shared_results_interaction[,1]),]
shared_results_date2_int <- shared_results_date2_int[sort.list(shared_results_date2_int[,1]),]
shared_results_date1_int <- shared_results_date1_int[sort.list(shared_results_date1_int[,1]),]
shared_results_date1 <- shared_results_date1[sort.list(shared_results_date1[,1]),]
shared_results_date2 <- shared_results_date2[sort.list(shared_results_date2[,1]),]

Write out all the LRT-significant data to a file. I could get fancy and splice out specific columns to keep, but that's a lot of syntax and it's easier to do that in Excel or with UNIX commands like cut.

write.csv(c(shared_results_cell_line,shared_results_treatment,shared_results_interaction,shared_results_date2_int,shared_results_date1_int,shared_results_date1,shared_results_date2), "e_kallisto_swapped_lrt_q0.05_max.csv")

This looks a lot better! Now we have ~330 genes that show an interaction effect for the cell line and treatment, whereas before the swap and cell-line-specific batch effect modelling we had nothing.  It took some work for sure, but a lot less than redoing the experiment and resequencing everything. This was a quick run to get some directions for grant applications based on pathway analysis etc., so overall patterns are what matter. They'll want to run follow-up experiments and qPCR on salient transcripts in the existing samples to verify the results.

_____________

*If you tried to include the treatment:date interaction and other factor combinations in the model with just these 12 samples, you'd likely get a nasty message like:
Error in solve.default(t(X) %*% X) : 

  Lapack routine dgesv: system is exactly singular: U[7,7] = 0

In essence this means that you only have one sample for each combination of factors, which leaves zero degrees of freedom for the statistical tests. That's no good, to put it lightly. There may be an interaction effect of the treatment and date, but as per the distance matrix shown earlier, any such effect is negligible compared to the cell line batch effect, so we'll make do without modelling it.
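To see the degrees-of-freedom problem concretely, here's a numpy sketch (hypothetical, outside sleuth) of the fully saturated design for this experiment: with every factor combination included, you spend 12 coefficients on 12 samples, leaving nothing to estimate noise from, and it then takes very little (e.g. a redundant column) for t(X) %*% X to become exactly singular.

```python
import numpy as np
from itertools import product

# One sample per condition: 2 cell lines x 2 treatments x 3 dates = 12 samples
samples = list(product(['EV', 'K'], ['ctl', 'exp'], ['D0', 'D1', 'D2']))

def dummies(values, levels):
    # Treatment (dummy) coding: one column per non-base level
    return [[float(v == lv) for lv in levels[1:]] for v in values]

cl = dummies([s[0] for s in samples], ['EV', 'K'])
tr = dummies([s[1] for s in samples], ['ctl', 'exp'])
dt = dummies([s[2] for s in samples], ['D0', 'D1', 'D2'])

# Fully saturated model: intercept + main effects + all interactions
rows = []
for i in range(len(samples)):
    cltr   = [a * b for a in cl[i] for b in tr[i]]
    cldt   = [a * b for a in cl[i] for b in dt[i]]
    trdt   = [a * b for a in tr[i] for b in dt[i]]
    cltrdt = [a * b * c for a in cl[i] for b in tr[i] for c in dt[i]]
    rows.append([1.0] + cl[i] + tr[i] + dt[i] + cltr + cldt + trdt + cltrdt)
X = np.array(rows)

print(X.shape)                                    # 12 observations, 12 coefficients
print(X.shape[0] - np.linalg.matrix_rank(X))      # residual degrees of freedom: 0
```

With zero residual degrees of freedom the model interpolates every sample perfectly, so there's no variance estimate left for any test statistic.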

February 8, 2017

Transcriptome analysis without an annotated genome: pretty fast and slightly dirty

tl;dr If you're doing RNASeq for a species without a well-annotated, contiguous genome, and want useful differential expression results "quickly", run Trinity in genome-guided mode, then Kallisto+Sleuth. Annotate functionally using Diamond and a few Perl one-liners (below). It'll take less than a week, mostly hands-off Trinity run time.

_________________

An interesting RNA sequencing project came across my desk recently, where a group was looking at a species of fish (with a published genome) they'd corralled and raised at multiple sites upstream and downstream of a riparian urban center. The idea is to look for gene expression signatures of water quality stress in the fish.

These days there are a lot of genomes out there that have been generated by groups that basically got funding to run a few HiSeq lanes, and they push out assemblies with thousands (if not more) contigs to get a paper out of it. I don't blame them, as getting a nice genome of contiguous chromosomes is a lot of work, though becoming easier with long reads from PacBio and Oxford Nanopore instruments. The academic rewards these days are not that high for such an endeavour, especially if your bailiwick is molecular biology, ecology, physiology, etc..  Even if you get a nice genome, gene structure and functional annotation in eukaryotes (except maybe fungal genomes, which I was involved in automating) takes a lot of manpower unless you can convince Ensembl your assembly is ripe for them to take on. The fish genome in question is in this dead space: no gene models, lots of contigs.

That being said, as soon as a "genome" sequence is available, researchers often assume you can now easily do RNASeq experiments with good functional analysis and the whole nine yards. I've done this before with Tophat + Cufflinks + Cuffdiff in de novo gene modelling mode for other organisms, which is okay but has issues with genes split across contigs, misassemblies, isoform detection, chimeric genes, etc. Since those tools were in vogue there have been a few advances in RNASeq processing, so I thought it worth seeing what the alternatives can do.

First off, don't map back to NCBI's UniGene. I tried this since it's easy and got nothing, due to several factors: poor coverage of genes expressed in the tissue of interest (liver), and the fact that UniGene gives you the best representative sequence for each gene, not a consensus transcript. The representative is the longest good-quality read in dbEST for a gene, and because these mostly come from (poly-A-tail amplified) ESTs, they tend to have a 3' mRNA bias, which leads to a whole host of issues for differential expression analysis. That's especially true in fish, which often have 8 copies of genes (they went through an extra whole genome duplication, so ohnologs that have 4 copies in most vertebrates have 8 in teleosts, i.e. most fish).

Given the volume of tissue specific transcript information you get these days for cheap, why not do a de novo transcript assembly, then run a quick differential expression workflow like I blogged in Using Kallisto & Sleuth for RNASeq analysis?  In my case, this yielded about 50 genes in logical functional categories that were candidates for further study in water quality stress response. Here are the details of how to proceed.

I downloaded the reference genome from the NCBI for the fish in question.

wget  ftp://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/JN/CD/JNCD01/JNCD01.1.fsa_nt.gz 

Did the same for FastA parts 2 and 3. Concatenated the files to get a final FastA genome.

cat JNCD01.*.fsa_nt.gz | gzip -cd - > genome.fna

Indexed the genome with BWA (version 0.7+). If you need samtools, bwa, Trinity, etc., you may find it easiest to just install BioBuilds. Then...

bwa index genome.fna

Ran BWA against the genome for all the samples, in this case X = {site0, site1, site2, site3, site4}, and Y = {1,2,3,4} since there were four fish sampled from each site.

bwa mem -t 10 genome.fna <(gzip -cd siteX_sampleY.R1.fastq.gz) <(gzip -cd siteX_sampleY.R2.fastq.gz) | samtools sort -@ 4 -T siteX_sampleY -O bam > siteX_sampleY.bam

Merged all the BAMs into one big one (using 16 threads):

samtools merge -r -@ 16 -O BAM all_samples.bam *.bam

Ran Trinity in genome-guided mode on these data to generate consensus transcripts.  NOTE: unlike most genome-guided methods, Trinity only uses the genome to pre-cluster the reads for the de novo assembly, so it doesn't inherit all the gaps and flaws of the "genome".

I have a lot of threads available on my machine, so I let it use 80.  YMMV.


Trinity --genome_guided_bam all_samples.bam   --genome_guided_max_intron 10000 --max_memory 450G --CPU 80 

If you didn't have a reference genome, you would have just skipped the bwa bit, the samtools bit and previous line, going straight to this instead:

Trinity --seqType fq --max_memory 800G --single <(gzip -cd myreads.fastq.gz) --CPU 80 --trimmomatic

At this point, book a last-minute long weekend getaway. Trinity finished after 5 days, generating 510738 transcript contigs in trinity_output_dir. A purely de novo Trinity run on this volume of data would normally take 2 weeks or more. #winning

I ran these against the NCBI non-redundant protein database with Diamond (blastx mode).

Download the non-redundant protein dataset from the NCBI.

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz


Uncompress it, then index it with Diamond.

diamond makedb --in nr -d nr

Run Diamond on the Trinity results (FastA).  It's hard to overstate how awesomely fast Diamond is at finding good protein homology for DNA sequences.

diamond blastx -d nr -q Trinity-GG.fasta -p 80 -a fathead.nr

Then run MEGAN6's taxonomic classification algorithm (after being sure to install the "tools" during the setup script, and downloading the latest NCBI acc2tax file from the same site):


/export/common/programs/megan6/tools/blast2lca -i fathead.nr.daa -f DAA -m BlastX -o fathead.nr.diamond.tax -a2t /export/common/dbs/megan6/prot_acc2tax-Oct2017X1.abin

This yielded 142539 contigs with protein matches (the remaining 370K are mostly rRNA, short nonsense fragments/assembly chimeras, low quality, etc.).  Of these, 136764 match teleost proteins (the rest are Burkholderia, other contaminants, or unclassified). In my case the fish was evolutionarily fairly close to zebrafish, so most of the top scores had e-values < 10e-35. But you might want to pimp up the one-liner below with an e-value or % ID filter if you've got an organism with no reasonable model organism in the evolutionary 'hood.

perl -F\; -ane 'print "$F[0]\n" if /Actinopteri/' fathead.nr.diamond.tax > fathead.nr.diamond.tax.fish_matches.txt

perl -ne 'BEGIN{%keep=split /(\s)/s, `cat fathead.nr.diamond.tax.fish_matches.txt`; $/=">"}print if /^>?(\S+)/ and $keep{$1}' Trinity-GG.fasta > Trinity-GG.fish_keep.fna
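The same keep-list filtering as the two Perl one-liners above, sketched in Python for readability (the taxonomy parsing is simplified and the function names are my own):

```python
# Build a set of contig IDs whose classification mentions Actinopteri,
# then keep only those records from the Trinity FastA.
def fish_ids(tax_path):
    keep = set()
    with open(tax_path) as fh:
        for line in fh:
            if 'Actinopteri' in line:
                # first ;-delimited field, first whitespace token = contig ID
                keep.add(line.split(';')[0].split()[0])
    return keep

def filter_fasta(fasta_path, keep, out_path):
    with open(fasta_path) as fh, open(out_path, 'w') as out:
        writing = False
        for line in fh:
            if line.startswith('>'):
                # header IDs may carry extra fields; match on the first token
                writing = line[1:].split()[0] in keep
            if writing:
                out.write(line)
```

Usage would be `filter_fasta('Trinity-GG.fasta', fish_ids('fathead.nr.diamond.tax'), 'Trinity-GG.fish_keep.fna')`.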

Index the kept contigs with kallisto:

kallisto index -i fishname.kallisto_index Trinity-GG.fish_keep.fna

Run kallisto against the fishy (that's a good thing) contigs for each sample.

kallisto quant -i fishname.kallisto_index -b 100 -l 180 -s 20 -o siteX_sampleY.kallisto <(gzip -cd siteX_sampleY.R1.fastq.gz) <(gzip -cd siteX_sampleY.R2.fastq.gz)

Kallisto finished running the reads against the Trinity assembled transcripts.  


I then mapped the Trinity transcript assembly IDs to the best Diamond matches in the NCBI (RefSeq NP_ preferred, then XP_, then whatever):



perl -ane 'if(/(NP_.*?)\s/ and not $p{$F[0]}++){$f{$F[0]} = "$1\t$F[10]"}elsif(/(XP_.*?)\s/ and not $p{$F[0]} and not $x{$F[0]}++){$f{$F[0]} = "$1\t$F[10]"}elsif(not $p{$F[0]} and not $x{$F[0]}){$f{$F[0]} = "$F[1]\t$F[10]"}END{for(keys %f){print "$_\t$f{$_}\n"}}' fathead.nr.diamond.tab > trinity2refid
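The precedence logic in that one-liner (first NP_ hit per query, else first XP_ hit, else the first hit of any kind) is easier to see spelled out. A hedged Python restatement, assuming simplified (query, subject, e-value) rows in BLAST tabular hit order:

```python
# Keep the best reference per query: NP_ accessions beat XP_, which
# beat anything else; within a tier, the first (top-ranked) hit wins.
def best_refs(rows):
    best = {}   # query -> (tier, subject, evalue); lower tier wins
    for query, subject, evalue in rows:
        if subject.startswith('NP_'):
            tier = 0
        elif subject.startswith('XP_'):
            tier = 1
        else:
            tier = 2
        # strict < keeps the first hit seen within each tier
        if query not in best or tier < best[query][0]:
            best[query] = (tier, subject, evalue)
    return {q: (s, e) for q, (t, s, e) in best.items()}
```

So a query whose top hit is a predicted XP_ protein still gets upgraded to a curated NP_ match further down its hit list, if one exists.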

Because it's quite likely that we only have partial coverage of genes, I want to consolidate our assembled transcripts to the gene symbol level as much as possible before differential expression analysis. To this end, I'll map the RefSeq IDs to gene names in IPA (you could do this with DAVID too, for example, if you don't have an IPA license) and save to refid2symbol_ipa.txt. I then mapped the IDs using the Entrez Gene database.

wget ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz
gzip -cd gene2refseq.gz | perl -F\\t -ane 'print "$F[5]\t$F[15]\n"' > refid2symbol_entrez.txt

I also mapped the RefSeq IDs to UniProtKB here, then extracted the UniProtKB IDs to gene names (saved together as refid2symbol_uniprot.txt).


perl -F\\t -ane 'print "$F[0]\t$1\n" if $F[6] =~ /^(\S+)/' uniprot-yourlist.tab > refid2symbol_uniprot.txt

In the end I had a file of Trinity IDs and their corresponding gene (if available from IPA, Entrez Gene or UniProt) or RefSeq transcript by running this:


perl -ane 'BEGIN{%r2i=split /\s/s,`cat refid2symbol_ipa.txt`; %r2e=split /\s/s,`cat refid2symbol_entrez.txt`; %r2u=split /\s/s,`cat refid2symbol_uniprot.txt`} print "$F[0]\t", ($r2i{$F[1]} || $r2e{$F[1]} || $r2u{$F[1]} || $F[1]), "\n"' trinity2refid > trinity2symbol.txt

I ran the following Sleuth (R) analysis, which generates a table of the expression values for all sites where there is a significant likelihood ratio test qval (multiple-testing corrected p-value) for the factor.  Most of the code from here down is a variation on my earlier post Using Kallisto & Sleuth for RNASeq analysis.

library(sleuth)
meta <- read.table("metadata.tab", header=TRUE)
trinity2symbol <- read.table("trinity2symbol.txt")
meta$path <- as.character(meta$path)
so <- sleuth_prep(meta, ~site, target_mapping=trinity2symbol, use_extra_bootstraps=TRUE)
so <- sleuth_fit(so)


so <- sleuth_fit(so, ~1, "reduced")
so <- sleuth_lrt(so, 'reduced', 'full')
results_table_lrt <- sleuth_results(so, 'reduced:full', test_type = 'lrt')
results_table_lrt <- results_table_lrt[sort.list(results_table_lrt[,1]),]
lrt.sig_ids <- results_table_lrt$target_id[which(results_table_lrt$qval < 0.05)]

Now we want to perform the Wald test for each site vs. site0 (our factor base level) since this will give us the fold-changes and site-specific significance of the differential expression.

so <- sleuth_wt(so, 'site1')
so <- sleuth_wt(so, 'site2')
so <- sleuth_wt(so, 'site3')
so <- sleuth_wt(so, 'site4')
results_table_wt_1 <- sleuth_results(so, 'site1')
results_table_wt_2 <- sleuth_results(so, 'site2')
results_table_wt_3 <- sleuth_results(so, 'site3')
results_table_wt_4 <- sleuth_results(so, 'site4')

We'll want to report the beta (~fold change) and wald test q-value for each sample for each ID that passed the original log-ratio-test.

shared_results_1 <- results_table_wt_1[results_table_wt_1$target_id %in% lrt.sig_ids,]
shared_results_2 <- results_table_wt_2[results_table_wt_2$target_id %in% lrt.sig_ids,]
shared_results_3 <- results_table_wt_3[results_table_wt_3$target_id %in% lrt.sig_ids,]
shared_results_4 <- results_table_wt_4[results_table_wt_4$target_id %in% lrt.sig_ids,]

It's important at this point to make sure we have all the rows in each sample in the same index position for collating the output table, and the easiest way to do this is to sort all the results by target_id (transcript name) alphabetically, like I snuck in earlier for the LRT test results in case you weren't paying attention.

shared_results_1 <- shared_results_1[sort.list(shared_results_1[,1]),]
shared_results_2 <- shared_results_2[sort.list(shared_results_2[,1]),]
shared_results_3 <- shared_results_3[sort.list(shared_results_3[,1]),]
shared_results_4 <- shared_results_4[sort.list(shared_results_4[,1]),]

Collate the results and write to a file. The first columns are the significant (q-value < .05) likelihood ratio test score information for the factor for each gene across all the sites, while the subsequent columns are the Wald scores and log fold-changes (kinda) for each specific site.  Note that because the factor has multiple levels, a gene can pass the LRT but have insufficient replicates at any given site to have a good (qval < .05) Wald score. This is the opposite of the binary factors people usually test with RNASeq (e.g., treatment vs. control). Sequencing more replicates is left as an exercise for the reader.

combined_result <- data.frame(results_table_lrt[results_table_lrt$target_id %in% lrt.sig_ids,],
                              site1_qval=shared_results_1[,"qval"], site1_b=shared_results_1[,"b"],
                              site2_qval=shared_results_2[,"qval"], site2_b=shared_results_2[,"b"],
                              site3_qval=shared_results_3[,"qval"], site3_b=shared_results_3[,"b"],
                              site4_qval=shared_results_4[,"qval"], site4_b=shared_results_4[,"b"])
write.table(combined_result, "all_sites_lrt_passed_qval.05.txt", sep="\t")
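The sort-then-collate trick above can be sketched generically (toy Python with made-up values): once every table is filtered to the same ID set and sorted by ID, row i refers to the same transcript in every table, so the columns can simply be pasted side by side.

```python
# Two per-site result "tables" keyed by target_id, with made-up betas
site1 = {'tA': 1.2, 'tC': -0.4, 'tB': 0.9}
site2 = {'tB': 0.1, 'tA': -2.0, 'tC': 0.7}

ids = sorted(site1)                  # alphabetical, like sort.list on column 1
assert sorted(site2) == ids          # both tables must cover the same IDs

# Row i now refers to the same transcript in every table
combined = [(t, site1[t], site2[t]) for t in ids]
print(combined[0])
```

If the ID sets ever differ (say, a transcript filtered out of one fit), the paste silently misaligns, which is why the filtering to lrt.sig_ids happens first.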

Now that we have the diff exp table, let's annotate the transcripts that had protein hits (note the literal ctrl-A in the command):

perl -ne 'BEGIN{%t=reverse(split /\s/s, `cut -f 2,1 trinity2refid`); $/=">"}print "$1\t$2\n" if /^(\S+)\s+(.*?)^A/ and exists $t{$1}' nr > trinity_descs.txt



sort tids | perl -ane 'BEGIN{%d=split /[\t\n]/s, `cut -f 1,2 id_desc.txt`}print $d{$F[0]},"\n"' | perl -ane 'BEGIN{%d=split /[\t\n]/s, `cut -f 2,3 trinity_descs.txt`}print $d{$F[0]},"\n"' > descs

Let's also append the e-value of the protein hit used for the annotation:

perl -F\\t -lane 'BEGIN{%e=split /\s/s, `cut -f 1,3 trinity2refid`}print $_,"\t",$e{$F[0]}' all_sites_lrt_passed_qval_desc.05.txt > lrt_any_factor_qval_lt_0.05_desc_diamond_evalues.txt



That's it!  Now you can go down the rabbit hole of functional analysis, a blog post I'll get to writing eventually...






October 13, 2016

The Mothur of all eukaryotes

While microbial small subunit RNA (16S) analysis has been commonplace for a number of years now as a super-cheap-per-sample way to profile microbial communities, and there's an Illumina MiSeq lurking in all corners of universities these days, around here we're seeing an uptick in its use for eukaryotic profiling (18S). Fun stuff like dietary analysis of wild animal feces, or sequencing river water to get fish diversity information.

I'm not so silly as to get involved in the Qiime vs. Mothur methodology debate, but let's suppose you've settled on Mothur. Mothur's kind of nice insofar as it's a single binary (Qiime is typically installed as a virtual machine), and the recommended Silva reference files can be downloaded in a bundle from the Mothur Web site.  There are two issues with using the Mothur-provided reference files for eukaryotic analysis: 1) the taxonomy for eukaryotes is truncated the same way as for prokaryotes (6 levels) even though the tree is deeper, so you quite often end up with only an order-, class- or family-level designation, and 2) the prebuilt DB is Silva 123 from July 2015.  Silva is now at version 128 as of this writing, with 577,832 quality entries (vs. 526,361 in version 123) and 18,213 Eukarya represented (vs. 16,209).  Here's how to address both the eukaryotic and latest-version issues:

Download and unpack (as newer versions come out, adjust the URLs accordingly):


wget -O SSURef_NR99_latest_opt.arb.gz https://www.arb-silva.de/fileadmin/silva_databases/current/ARB_files/SSURef_128_SILVA_23_09_16_opt.arb.gz

gzip -d SSURef_NR99_latest_opt.arb.gz

wget -O tax_slv_ssu_latest.txt https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_ssu_128.txt

perl -pe 's/ \(Animalia\)//;s/ /_/g' tax_slv_ssu_latest.txt > tax_slv_ssu_latest.fixed.txt

Export the FastA files:

Launch arb:

arb SSURef_NR99_latest_opt.arb

Export as per http://blog.mothur.org/2015/12/03/SILVA-v123-reference-files/, saving the file as silva.full_latest.fasta, then quit arb.

Note that I compiled ARB from source since I'm on CentOS 7 and they only have precompiled binaries for earlier operating systems.  If you're in the same boat, you'll need to root/sudo the following, which are not all documented in the installation instructions.

yum install libxml2 transfig libXp libtiff gnuplot-common xorg-x11-xbitmaps Xaw3d xorg-x11-fonts-misc xfig xfig-common motif gnuplot libtiff-devel libxml2-devel libxml2-python lynx glib2-devel imake libXmu-devel libXp-devel motif-devel

Format the sequence:

Here we deviate a bit from the Mothur README for simplicity, and to get the right taxonomic labels for eukaryotes.

mothur "#screen.seqs(fasta=silva.full_latest.fasta, start=1044, end=43116, maxambig=5, processors=8); pcr.seqs(start=1044, end=43116, keepdots=T); degap.seqs(); unique.seqs();"

grep ">" silva.full_latest.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- > silva.full_latest.good.pcr.ng.unique.accnos

mothur "#get.seqs(fasta=silva.full_latest.good.pcr.fasta, accnos=silva.full_latest.good.pcr.ng.unique.accnos)"

mv silva.full_latest.good.pcr.pick.fasta silva.full_latest.align

Run my eukaryotic labelling adjustment script (modified from https://raw.githubusercontent.com/rec3141/diversity-scripts/master/convert_silva_taxonomy.r, since that one didn't work for me, and wasn't parameterized)


Rscript convert_silva_taxonomy.r tax_slv_ssu_latest.txt silva.full_latest.align silva.full_latest.tax

Now you're good to go for any type of SSU analysis (16S or 18S); follow something like the ever-popular MiSeq SOP.

If the above instructions failed for you, download my SILVA 128 tax file here, and the fasta and align.

Update: there seems to be a problem with the taxonomic classifications of 4 Ralstonia sequences in the current SILVA release: they have only two levels of classification.  You'll need to manually fix those in the output taxonomy file to get it to work properly.
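The manual fix amounts to padding those short lineages out to the expected taxonomic depth. A minimal Python sketch, assuming mothur's usual two-column tax format (accession, tab, semicolon-terminated lineage) and six expected levels; whether "unclassified" is the right filler for your downstream tools is an assumption to check:

```python
# Pad lineages that have fewer than the expected number of taxonomic
# levels, e.g. the Ralstonia entries with only two. Assumes mothur's
# two-column format: accession<TAB>Level1;Level2;...;
def pad_lineage(line, depth=6):
    acc, tax = line.rstrip("\n").split("\t")
    levels = [lvl for lvl in tax.split(";") if lvl]
    levels += ["unclassified"] * (depth - len(levels))
    return acc + "\t" + ";".join(levels) + ";"

print(pad_lineage("AB012345\tBacteria;Proteobacteria;"))
```

Lineages that already have six levels pass through unchanged.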

October 6, 2016

Using Kallisto & Sleuth for RNASeq analysis

Kallisto is a fast pseudo-alignment tool for RNASeq data that has some advantages over more established mapping and quantification methods like bwa and htseq-count. In very broad strokes, it doesn't do a traditional spliced alignment to a reference genome, but rather assigns reads to a reference set of transcripts based on shared kmers. This works well when you have a well-annotated genome like human or mouse. Where the kmer-based transcript assignment is ambiguous, it uses bootstrap analysis to estimate the uncertainty in the transcript abundance estimates. It can afford to bootstrap because kmer counting is super fast and takes little memory, making it ideal for running an analysis on a laptop, for instance.
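To make the kmer idea concrete, here's a toy Python sketch of the intersection step at the heart of pseudo-alignment. Kallisto's real machinery (equivalence classes, EM, bootstraps) is far more sophisticated, and k is 5 here instead of kallisto's default 31 just to keep the example legible:

```python
# Toy sketch of kmer-based read assignment, the core idea behind
# pseudo-alignment: a read is compatible with any transcript that
# contains all of the read's kmers.
from collections import defaultdict

def kmers(seq, k=5):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(transcripts):
    # Map each kmer to the set of transcripts containing it.
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for km in kmers(seq):
            index[km].add(name)
    return index

def assign(read, index):
    # Intersect the transcript sets of the read's kmers.
    compatible = None
    for km in kmers(read):
        hits = index.get(km, set())
        compatible = hits if compatible is None else compatible & hits
        if not compatible:
            break
    return compatible or set()

transcripts = {"tx1": "ACGTACGTGGA", "tx2": "ACGTACGTTTA"}
index = build_index(transcripts)
print(assign("ACGTACGT", index))  # shared prefix: ambiguous, {'tx1', 'tx2'}
print(assign("CGTGGA", index))    # unique to tx1: {'tx1'}
```

The ambiguous first case is exactly the situation where kallisto's bootstrapping earns its keep.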

I've been holding back on using it because the stats software to interpret differential gene expression from the bootstrapped values has been slowly evolving. In June a preprint about this analysis software, called Sleuth, was put up on bioRxiv. So it was probably time I checked it out more thoroughly.

In particular I was drawn to microdissected tissue RNASeq experiments where DESeq2 was only showing very modest differential expression of genes, but a lot of them (thousands of genes). In theory, Sleuth should be better at resolving any splicing variation in genes that might be going on, and maybe even at compensating for tissue impurity insofar as it causes systematic underreporting of differential expression ratios.

Here's a brief rundown on running a combined Kallisto + Sleuth analysis.

Create a Kallisto searchable database of transcripts for your species, if you don't already have one. Personally I like RefSeq, but you could use Ensembl transcripts, etc.  Here's how to get the RefSeq fastas, while stripping the GIs and leaving the NM_ or NR_ ID as the one Kallisto will report.


wget ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.*.rna.fna.gz

gzip -cd mouse.*.gz | perl -pe 's/^.*?\|ref\|(.*?)\|/>$1/' > mm10.refMrna.fa

Index the reference file with Kallisto. It'll take a few minutes.


kallisto index -i /path/to/dbdir/mm10.refMrna mm10.refMrna.fa

Run kallisto for each of your samples; here I have single-end RNASeq.  The 180/20 is a reasonable insert size/std dev estimate for TruSeq RNA v2 library preps, probably the most common type of sequencing these days. 190 bp might be good too, as per our own lab's TapeStation stats over many runs, but in practice the effect of changing this value by ten is negligible. The -b 100 specifies the number of bootstraps to run.

kallisto quant -i /path/to/dbdir/mm10.refMrna -o sample_id.kallisto -b 100 --single -l 180 -s 20 <(gzip -cd /path/to/unique_sample_id_prefix*.fastq.gz)

Now let's run some differential expression analysis. First, create a tab delimited metadata file for the experiment so we have the factors for each sample ID (header "sample"), and the file path (header "path") to the kallisto output for each sample.  The file should look something like this:

sample      path               time
mysample1   mysample1.kallisto D0
mysample2   mysample2.kallisto D0
mysample3   mysample3.kallisto D5
mysample4   mysample4.kallisto D5

I've put D0 and D5 here so we have a factor with two values, and by default the alphabetically first factor value is considered the control in R (i.e. ratios will be reported as exp:ctl, so D5:D0).  You could have put 0 and 5 in the column instead, but then you'd have to coerce the numbers into factor levels, so why bother?  Putting just numbers only makes sense if you have a range of numeric values and are expecting a dosage response in the regression model.
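For more than a handful of samples, a metadata table like the one above can be generated programmatically. A hedged Python sketch, assuming the .kallisto directory suffix used above; the parsing of the time point out of the sample name is entirely hypothetical and depends on your naming scheme:

```python
# Build meta.tab by scanning a directory for kallisto output
# directories named <sample>.kallisto.
import os

def write_meta(root=".", out="meta.tab"):
    rows = []
    for d in sorted(os.listdir(root)):
        if d.endswith(".kallisto") and os.path.isdir(os.path.join(root, d)):
            sample = d[: -len(".kallisto")]
            time = "D5" if "_D5" in sample else "D0"  # hypothetical naming scheme
            rows.append((sample, d, time))
    with open(os.path.join(root, out), "w") as fh:
        fh.write("sample\tpath\ttime\n")
        for sample, path, time in rows:
            fh.write(f"{sample}\t{path}\t{time}\n")
    return rows
```

Run it from the directory holding the kallisto quant outputs and it writes a tab-delimited meta.tab with the sample, path, and time columns.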

Note that the tutorial on the Sleuth Web site uses a somewhat convoluted method to get the right metadata table together. Here, I've simplified it, assuming you are running R from the directory where all the kallisto quant output directories reside.

library(sleuth)
meta <- read.table("meta.tab", header=TRUE)

meta$path <- as.character(meta$path)

Make a regression model for the factor(s) in the metadata file.  Here we only have one, called "time" in the metadata file prepared earlier.

so <- sleuth_prep(meta, ~ time)

Model the gene expression as responding to the time factor.

so <- sleuth_fit(so)

Create another model where the gene expression is not dependent on any factor.

so <- sleuth_fit(so, ~1, "reduced")

Run a likelihood ratio test between the two models to see what transcripts appear to really be affected by the time factor value.


so <- sleuth_lrt(so, 'reduced', 'full')
results_table_lrt <- sleuth_results(so, 'reduced:full', test_type = 'lrt')

Run a Wald test to check for differential expression according to the time factor. Wait...didn't we just do that with the LRT?  They are different ways of skinning the same cat, and generally the LRT is considered the better test (see a nice explanation here). We need to run both because only the Wald test will report the fold change*. My factor was "time" and the non-control value is D5, so the test is called "timeD5", as R is wont to do when concatenating factors and values. Alternatively, you can list the valid Wald tests with the so$fits[["full"]]$design_matrix command.  On to the Wald test of the time factor...

so <- sleuth_wt(so, 'timeD5')
results_table_wt <- sleuth_results(so, 'timeD5')

We want the Wald test results for those genes that also passed the more stringent (when there are only two factor levels) LRT.  Let's get the names of the transcripts that are in both lists (with a multiple-testing corrected p-value, a.k.a. q-value, less than 0.05).

d5.lrt.sig_ids <- results_table_lrt$target_id[which(results_table_lrt$qval < 0.05)]
d5.wt.sig_ids <- results_table_wt$target_id[which(results_table_wt$qval < 0.05)]
shared_ids <- d5.wt.sig_ids[d5.wt.sig_ids %in% d5.lrt.sig_ids]
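For the curious, the q-values being thresholded here are Benjamini-Hochberg style FDR-adjusted p-values. A minimal Python sketch of that correction, illustrative only (Sleuth computes its q-values internally):

```python
# Minimal Benjamini-Hochberg FDR correction: each p-value is scaled by
# m/rank, then a running minimum from the largest p-value downward
# enforces monotonicity of the resulting q-values.
def bh_qvalues(pvals):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    qvals = [0.0] * m
    running_min = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end  # 1-based rank of pvals[i]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals
```

For example, bh_qvalues([0.01, 0.04, 0.03, 0.5]) returns roughly [0.04, 0.053, 0.053, 0.5]: the raw 0.04 and 0.03 both get pushed above the 0.05 line once the correction is applied.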

Grab just the rows of the Wald test that correspond to the shared transcripts, then write them to a file for posterity, downstream analysis in another program, etc.

shared_results <- results_table_wt[results_table_wt$target_id %in% shared_ids,]
write.csv(shared_results, file="kallisto_wald_test_lrt_passed_d5.csv")

To make sure we aren't throwing the baby out with the bathwater, let's write out the Wald-pass-but-LRT-failed genes to another file in case there's something interesting there.

wald_only_ids <- d5.wt.sig_ids[!(d5.wt.sig_ids %in% d5.lrt.sig_ids)]
wald_only_results <- results_table_wt[results_table_wt$target_id %in% wald_only_ids,]
write.csv(wald_only_results, file="kallisto_wald_test_only_passed_d5.csv")

Interestingly, unlike the samples used in the Sleuth preprint, both the Wald and LRT tests report fewer differentially expressed genes/transcripts than DESeq2's Wald test when used in conjunction with bwa and htseq-count on my local microdissection samples mentioned earlier. But the differential expression ratios are higher.  This is the point at which I should explain the asterisk I put after "fold change" earlier on this page.

*The statistical model that Sleuth fits includes a term beta.  In the output tables for the Wald test this is the "b" field, and it is the closest thing to the log2 fold change you might be used to from other programs.  It's not really a fold change though; it's a biased estimator, and it's on the natural logarithm scale, not log2. It's roughly the extent to which the estimated count for a transcript is affected by the named factor rather than by biological and technical variability (library prep, mapping, etc.).  So if there is very little variability and noise for a transcript (based on the bootstrap analysis), it's pretty close to being a fold change measure.  If there is high technical variability for the transcript though, beta tends to account for less of the overall estimated count, even though beta will have a higher magnitude.
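One practical consequence: if you want to compare b against the log2 fold changes reported by other tools, divide by ln(2). A trivial Python helper (the function name is mine, not Sleuth's):

```python
# Convert a natural-log-scale effect size (sleuth's b) to log2 scale.
import math

def beta_to_log2fc(b):
    return b / math.log(2)
```

So a b of about 0.69 (that is, ln 2) corresponds to a log2 fold change of about 1, i.e. roughly a two-fold effect, subject to the biased-estimator caveat above.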

If you've had the patience to read this far and understand beta, why don't we look at some pretty graphs to see what our data look like (e.g. segregation on a Principal Component Analysis plot, volcano plots, etc.)? Sleuth uses Shiny, an R framework for interactive Web applications, to do its plotting, and if you went through the installation instructions for both Kallisto and Sleuth you're set to go, right? The following will launch the Web interface, or at least give you a URL to paste into your browser.

sleuth_live(so)


March 1, 2016

Cleaning up GATK's garbage on many-CPU servers

Every once in a while when I run GATK genotyping commands, I observe Java processes eating up crazy amounts of CPU (as seen with the "top" command) for a long while, without accomplishing anything in terms of analysis. Poking around, I figured out this was due to the Java Garbage Collector (GC), which the really keen can explore in depth here, with Tenures and Edens and all kinds of fun low-level computer science stuff. The long and the short of it is that on larger systems with many CPUs, the default GC mode is both too aggressive and too sporadic to work effectively for GATK processes like the HaplotypeCaller.  Playing around, I found that the best GC override settings for our system can be invoked as follows:


java -Xmx24G -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=4 -jar /export/achri_data/programs/GenomeAnalysisTK.jar -I in.bam -T HaplotypeCaller -o out.vcf  -L my.interval -nct 2...


In the case of a 32-CPU server, one pathological HaplotypeCaller process used 10% of its normal CPU time and 25% of its normal wall time with the new settings, but of course YMMV. If you are using Queue for your job control, there is an old thread that discusses this issue in part (threading, but not concurrency of GC). It's also worth noting that the four GC threads are never all going full bore at the same time, so in effect you use one extra CPU on average with these settings.