November 28, 2017

Basic brdU analysis

Here's a quick rundown of how to generate a log ratio plot for a brdU IP pulldown vs. whole cell extract, each run with NGS DNA sequencing. This type of experiment is done, for example, to find how commonly used origins of replication are in yeast genomes, as the brdU is incorporated into soon-to-be daughter cell DNA during cell division.  For the purposes of this walkthrough, you can use https://www.ebi.ac.uk/ena/data/view/SRR1284645 (wt1) and https://www.ebi.ac.uk/ena/data/view/SRR1284649 (wcewt1).

1) Map the reads to the genome for both the whole cell extract and the experimental brdU IP capture (e.g. the wild type strain used here):


/export/common/programs/bwa-0.7.15 mem -t 8 /export/common/dbs/bwa/sacCer3.fa wcewt1.fastq.gz | samtools sort -O bam -T wcewt1 > wcewt1.genome.bam 

/export/common/programs/bwa-0.7.15 mem -t 8 /export/common/dbs/bwa/sacCer3.fa wt1.fastq.gz | samtools sort -O bam -T wt1 > wt1.genome.bam 

Note that if you have paired end data, specify both on the command line for each bwa command.

2) Calculate the coverage across the genome for each sample (sacCer3.genome has the name and length of each chromosome tab separated, one chr per line):

bedtools genomecov -ibam wt1.genome.bam -g sacCer3.genome -d > wt1.genome.bdg
bedtools genomecov -ibam wcewt1.genome.bam -g sacCer3.genome -d > wcewt1.genome.bdg

Note that if you have paired end data, add the flag -pc to the bedtools command.
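
If you don't already have a sacCer3.genome file, one quick way to build it is from the samtools FASTA index (in the shell, cut -f1,2 sacCer3.fa.fai would do the same job). Here's a minimal R sketch, assuming samtools faidx has already produced sacCer3.fa.fai for the same reference used in the bwa step:

# .fai columns are: name, length, offset, linebases, linewidth
fai <- read.table("sacCer3.fa.fai", sep="\t",
                  col.names=c("name", "length", "offset", "linebases", "linewidth"))
# keep just the chromosome name and length, tab separated, no header or quotes
write.table(fai[, c("name", "length")], file="sacCer3.genome",
            sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)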

3) Start R and calculate coverage in a 1000 base pair window to compare the WCE to the brdU pulldown, stepping 100bp for each window recalculation so we have 100bp "buckets" sharing the same windowed coverage value rather than base pair resolution.

wt1 <- read.table("wt1.genome.bdg")
wcewt1 <- read.table("wcewt1.genome.bdg")
window <- function(x, i, w){mean(x[max(1,i-w/2):min(length(x),i+w/2)])}
buckets <- seq(1, length(wt1$V3), 100)

Avoid annoying scientific notation parsing issues in IGV later with this magic:

options(scipen=999)

Before we actually apply the windowing, figure out how much more overall sequence there is in one sample vs. the other, as we need to normalize for that. Also, of course, add 1 to the totals so we aren't taking the log of zero (which causes R to have conniptions).

norm_factor <- sum(as.numeric(wt1$V3))/sum(as.numeric(wcewt1$V3))


log_ratio <- sapply(buckets, function(x){log(window(wt1$V3, x, 1000)/norm_factor+1)/log(window(wcewt1$V3, x, 1000)+1)})


write.table(data.frame(wt1$V1[buckets], wt1$V2[buckets], wt1$V2[buckets]+99, log_ratio), sep="\t", file="log_ratio_wt1_v_wcewt1.bed")

4) I'm lazy and reformat the output in Perl rather than putting all the right options into the R write.table() command. It also excludes chrM, as that seems to raise the ceiling of the data range too much and makes everything else look small by default when graphed (e.g. 0-3 for normal chromosomes but up to 16 for chrM).

perl -ane '$F[1]=~tr/"//d; print join("\t", @F[1..4]),"\n" unless $. == 1 or $F[1] eq "chrM"' log_ratio_wt1_v_wcewt1.bed > log_ratio_wt1_v_wcewt1.bdg
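
If you'd rather stay in R, roughly the same BedGraph can be written directly by putting the right options into write.table() and dropping chrM there instead. A sketch, reusing the objects from step 3:

keep <- wt1$V1[buckets] != "chrM"
bdg <- data.frame(wt1$V1[buckets], wt1$V2[buckets], wt1$V2[buckets]+99, log_ratio)[keep,]
write.table(bdg, file="log_ratio_wt1_v_wcewt1.bdg",
            sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)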

You're done!  Now you can load your BedGraph file into IGV, etc. and look at peak heights...more on comparing samples using quantile normalization, etc. to come.

May 29, 2017

The ungoogleable: Authenticating a Galaxy Portal with Active Directory

As a rule, Google is very useful for finding examples of how to configure software, especially when you're doing Unix-y stuff and using well worn programs like the Apache Web Server.  A recent configuration task I needed to perform was the exception that proved the rule: there is no working, non-trivial example on the Web of how to configure Apache to authenticate a user against an LDAP server that requires authentication to search (you don't want to hard code in a user name and password in the httpd.conf file, of course). Windows Active Directories often fall into this category, as they do not allow an anonymous search to start (Apache's default is an anon search, then bind with a user name, then search for attributes).

Specifically, I was configuring the Galaxy Web server, a popular Python based bioinformatics package.  It uses Paster to serve up the Web pages, but expects you to put a user authentication proxy in front of it if you want to use CAS, LDAP, etc. to manage the portal's user accounts. Galaxy's Paster process then just blindly accepts the REMOTE_USER variable sent along by the proxy with each HTTP request. Ideally, REMOTE_USER is a fully qualified e-mail address so that Galaxy can send e-mail notifications for features that have that capability, e.g. notifying a user when their analysis is complete.



The keys to getting this to work are Apache's AuthLDAPInitialBindPattern and AuthLDAPRemoteUserAttribute directives. Feast your eyes on the following Apache config, which:
  1. takes a user name like "work_group_name\user_login_name"
  2. authenticates against an Active Directory with a long, complex binding DN like "CN=user_login_name,OU=work_group_name,OU=Users,OU=Accounts,DC=..."
  3. queries the directory for the user's e-mail-address-containing attribute (here userPrincipalName)
  4. forwards that e-mail address on to the Paster Web server in the REMOTE_USER variable
Enjoy!

# If the LDAP server cert is not chained with any of your root certs on the box and you can't import it
LDAPVerifyServerCert off
<VirtualHost my_proxy_site:443>
        ServerName my_proxy_site
        # Require SSL since the users will be entering their org credentials so we shouldn't accept them as plain-text
        SSLEngine on
        SSLCertificateFile /etc/ssl/certs/my_proxy_site.crt
        SSLCertificateKeyFile /etc/ssl/certs/my_proxy_site.key

        RewriteEngine On
        # Serve up static content directly rather than proxying, for efficiency
        RewriteRule ^/static/style/(.*) /opt/gls/galaxy/static/june_2007_style/blue/$1 [L]
        RewriteRule ^/static/scripts/(.*) /opt/galaxy/static/scripts/packed/$1 [L]
        RewriteRule ^/static/(.*) /opt/galaxy/static/$1 [L]
        RewriteRule ^/favicon.ico /opt/galaxy/static/favicon.ico [L]
        RewriteRule ^/robots.txt /opt/galaxy/static/robots.txt [L]
        # Send everything else through the proxy
        RewriteRule ^(.*) "http://the_web_server_expecting_remote_user_set:8080/$1" [P]

        <location "/">
        Order deny,allow
        Deny from all
        # We're picking specific subnets that are allowed to connect and authenticate
        Allow from 192.168.
        Allow from 10.0.0.

        # Authenticate against the org's Active Directory (one of three in this case)
        AuthName "Please login using your Organization_Name_Here credentials"
        AuthType Basic
        AuthBasicProvider ldap
        AuthLDAPURL "ldaps://ldap_server_name1 ldap_server_name2 ldap_server_name3/DC=my_suborg,DC=my_org,DC=my_tld?sAMAccountName,userPrincipalName?sub"
        AuthLDAPInitialBindAsUser on
        AuthLDAPInitialBindPattern (.+)\\(.+) CN=$2,OU=$1,OU=Users,OU=Accounts,DC=my_suborg,DC=my_org,DC=my_tld
        AuthLDAPRemoteUserAttribute userPrincipalName
        Require valid-user

        # Pass the LDAP account name we just authenticated on to the final destination (the Web server process at the_web_server_expecting_remote_user_set:8080) with the REMOTE_USER header set
        RewriteEngine On
        RewriteCond %{LA-U:REMOTE_USER} (.+)
        RewriteRule .* - [E=RU:%1]
        RequestHeader set REMOTE_USER %{RU}e

        </location>
</VirtualHost>

April 18, 2017

Data for free: viral metagenomics in human RNASeq data

tl;dr Even with a polydT bead RNASeq extraction, you can find interesting metagenomic data, as many single stranded RNA viruses have polyadenylated genomes/transcripts. In an experiment involving exposure of patients to rhinovirus, measuring the viral load in the samples (Diamond + MEGAN5) yielded a quantitative factor that 1) increased the number of significantly differentially expressed genes, and 2) allowed me to exclude a sample due to evidence of patient infection before the experimental inoculation.

Metagenomic read assignment to taxonomy is not super accurate. I discuss a few ways to evaluate whether a signal is a true or false positive in the context of a human RNASeq experiment.
___________


I had an RNASeq project that was quite resistant to yielding significant results, based on a patient medical factor and experimental exposure to a cold virus (a clinically approved Rhinovirus A strain). I suspected that the binary (yes/no) exposure factor wasn't useful in my Sleuth gene expression model because the virus wasn't spreading the same way, or maybe at all, in some patients and sampled tissues.

Because single stranded RNA rhinoviruses have polyadenylated genomes/transcripts, our lab's regular polydT bead extraction protocol should theoretically pull these out with the normally targeted human polyA-tailed transcripts.  I adapted some of my non-reference Kallisto+Sleuth analysis steps to look for the virus in my data (Lane1 subset) like so, assuming you have your sample names in samplenames.txt (important: in the same order as the meta.tab file used later):

for f in `cat samplenames.txt`; do kallisto quant -i /export/common/dbs/kallisto/hg19.refMrna -o $f.kallisto -b 100 --single -l 180 -s 20 <(gzip -cd myrunid/Data/Intensities/BaseCalls/$f/*L001*.fastq.gz) & done

Note that I launched these in parallel in the background. If you don't have as many cores as you have samples, replace '&' with ';' before "done" to run them in series, or use GNU parallel, etc. Wait for these jobs to finish, then do a quick sanity check to see if the samples all map decently (typically 80%+ map to the RefSeq human transcriptome)...

for f in `cat samplenames.txt`; do perl -ane '$t+=$F[3];END{$ARGV=~s/abundance.tsv/run_info.json/;print $ARGV,"\t", $t/$1,"\n" if `cat $ARGV`=~/n_processed": (\d+)/s;}' $f.kallisto/abundance.tsv; done
N1pre.kallisto/run_info.json 0.8336
N1time1.kallisto/run_info.json 0.7829
...

Maybe exclude any really bad ones (e.g. 40% mapping a.k.a. < 0.4) from further analysis for now and check what the input quantity and quality (RIN score) was. Then run the Diamond program in BLASTX mode:

for f in `cat samplenames.txt`; do /export/common/programs/diamond-0.7.9 blastx -d /export/common/dbs/diamond/nr -a $f.nr -q myrunid/Data/Intensities/BaseCalls/ProjectName/$f/*_L001*.fastq.gz -p 1 & done


Same waiting caveat as above, then...

for f in `cat samplenames.txt`; do diamond view -a $f.nr.daa -o $f.nr.diamond.tab -f tab ; done
for f in `cat samplenames.txt`; do /export/common/programs/megan5/tools/blast2lca -v -i $f.nr.diamond.tab -o $f.nr.diamond.tax -g2t /export/common/dbs/megan5/gi_taxid-March2015X.bin; done

In the 60 samples, there were 157 million sequences aligned at the protein level. I aligned these against all 75 million known protein sequences in less than a day on a single machine.  Let that sink in and realize how awesome Diamond is. Note that you could use Kraken to do this analysis at the DNA level instead; please let me know how you fare.

Let's first do a sanity check and see how often species are identified at 100% (no matches outside the species lowest common ancestor for all hits) by MEGAN5, which should have the lowest false positive rate for such analyses:

perl -ne '$c{$1}++ if /\[Species\] (.*?); 100/;END{for(keys %c){print "$c{$_}\t$_\n"}}' *.tax | sort -n -k 1

1 Abacion magnum
1 Abida secale
1 Acalymma trivittatum
1 Acanthamoeba culbertsoni
1 Acanthascus dawsoni
1 Acanthis flammea
1 Acanthocheilonema odendhali
1 Acanthochromis polyacanthus
1 Acanthocyclops vernalis
1 Acanthosaura crucigera
...
40808 Camelus ferus
45634 Rhinovirus A
47346 Saimiri boliviensis
49148 Rattus norvegicus
58556 Tupaia chinensis
62884 Cricetulus griseus
70044 Callithrix jacchus
74475 Papio anubis
88769 Mus musculus
98365 Ostrea edulis
101606 Chlorocebus sabaeus
119838 Pan paniscus
132084 Rhinopithecus roxellana
143348 Chlamydia psittaci
176297 Pongo abelii
278318 Pan troglodytes
333650 Macaca fascicularis
378764 Gorilla gorilla
472266 Macaca mulatta
889709 synthetic construct
15692650 Homo sapiens


There are 19978079 sequences with 100% at the species level, so we have a 78.5% true positive rate if we assume nearly all the actual sample is of human origin. A little Linnaean naming knowledge quickly shows that the most frequent hits besides human are to synthetic constructs or primates like macaques, gorillas, and orangutans (17.7%). These are often due to high protein-level sequence similarity and random sequencing errors colliding. The same goes for other mammals, but to a lesser extent.

The Rhinovirus A count of 45634 stands out, though. We should expect this to be picked up as an artifact much less frequently than a New World monkey, based on evolutionary distance...even if we assume convergent evolution with the human host at the AA level, the fact is that rhinovirus C has only 3 hits. Rhinovirus A has 4,327 GenBank protein entries and rhinovirus C has 2,580, so database bias can't account for the four order of magnitude discrepancy. Rhinovirus C is actually completely in line with the mean of 4.2 observations for species assignments under a Poisson distribution for this dataset, i.e. it's part of the (low count) random false positive observations.  That's 4.2/60 samples = 0.07 false observations/sample for any given species to be observed at least once with 100% confidence. I conclude that the Rhinovirus A detection isn't likely to be artifactual, even when seen in control samples.
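
To make that false positive argument concrete, here's a quick back-of-the-envelope check in R, assuming (as above) that chance species assignments follow a Poisson distribution with mean 4.2 per species in this dataset:

# chance of 3 or fewer hits for any given species (the rhinovirus C situation): unremarkable
ppois(3, lambda=4.2)        # roughly 0.4
# chance of 45634 or more hits (the Rhinovirus A count): effectively zero
ppois(45633, lambda=4.2, lower.tail=FALSE)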

We can now add that to our default sample metadata file and model the viral load effect (log probably makes sense) as part of the expression regression model.

echo viral_load > virus_count.txt

for f in *.tax; do echo -n "$f "; grep -i "Rhinovirus A; 100;" $f | wc -l >> virus_count.txt; done

paste meta.tab virus_count.txt > meta_with_viral_load.txt

Be sure to add 1 to the load so you aren't taking the log of zero, which will cause R to choke.

> library(sleuth)
> meta_with_viral_load <- read.table("meta_with_viral_load.txt");
> meta_with_viral_load
    sample             path patient_group_factor time viral_load
1    N1pre   N1pre.kallisto                   no    0          0
2  N1time1 N1time1.kallisto                   no    1         29
3  N1time2 N1time2.kallisto                   no    2       1187
4    N2pre   N2pre.kallisto                   no    0          6
5  N2time1 N2time1.kallisto                   no    1         29
6  N2time2 N2time2.kallisto                   no    2         17
...
16   S1pre   S1pre.kallisto                  yes    0          0
17 S1time1 S1time1.kallisto                  yes    1       2006
18 S1time2 S1time2.kallisto                  yes    2       7292
...

> so <- sleuth_prep(meta_with_viral_load, ~patient_group_factor*time*I(log(viral_load+1)))

...

Note that this sequence data was generated on an Illumina NextSeq 500, which does not have the "spread-of-signal" problem to the same degree that has recently been characterized for Illumina's patterned flow cell devices. If you are using a HiSeq 4000 or similar, you'll need to do extra cleanup steps to remove free primers during library prep before putting samples onto the machine (I assume that Illumina is working fervently to fix the protocol right now). Otherwise a significant amount (~5-10%) of post-infection sample data might carry pre-infection sample barcodes, making the kind of analysis done here moot.

Spread-of-signal exists for the NextSeq 500 and MiSeq too, probably more in the 1% range, so judge your detection threshold accordingly. In the experiment described here, one of the patients appears to have caught a slightly different Rhinovirus A strain before the experimental inoculation and was already in immunity overdrive (i.e. N2pre has a viral load, and inoculation didn't cause a big jump in load at times 1 and 2). I didn't bore you with all the strain level information, but trust me, it's there. If the N2pre load were artifactual from some spread-of-signal, we'd expect the Poisson distance matrix to still look decent, but in fact this sample stands out, clustering with the post-infection samples. I therefore excluded it from the analysis, not by cherry picking but based on metagenomic/metatranscriptomic (ssRNA viral genome) evidence that makes the sample unsuitable for the study.

The sample exclusion and the quantitative modelling increased the number of significantly differentially expressed genes from a handful to hundreds, using data freely available but unmapped by Kallisto in a normal human gene expression analysis.

March 31, 2017

Verifying sample pairings in clinical RNASeq: using Kallisto as a Bloom filter to calculate a GenomeID

tl;dr Before going into a full analysis, you can quickly check whether paired samples in an RNASeq experiment involving human subjects are properly paired by checking that their GenomeIDs roughly match.  The quickest way to generate a GenomeID when starting with FASTQ files is to run Kallisto with a specific, small set of transcripts as the database, then pipe the pseudoaligned reads to BWA. You can process a billion reads in less than ten minutes this way on a multicore server. I used this to identify a sample with an inconsistent GenomeID that you would never otherwise suspect was problematic based on its overall expression profile, but excluding it nearly doubled (10->17) the number of differentially expressed genes found for the factor of interest.

This Kallisto-as-BWA-filter procedure can be used for any human FASTQ files to help uniquely identify the sample origin. Although originally designed for exomes and genomes, it works well enough to be useful for RNASeq.
__________________

For a few years, I've been using a scheme I made up called GenomeID.  This is a handy 12 character ID that can be calculated from the zygosity of highly polymorphic SNPs in coding regions of 58 genes throughout the human genome. Because of the details of the schema, even close relatives have completely unrelated GenomeIDs, and the odds of ID collisions are tiny. This lets me quickly check, in a fraction of a second, the intrinsic ID of a genome or exome BAM or VCF file for tracking purposes, regardless of the labels different sources may give it (e.g. the genome in a bottle is NA12878, CEPH1463.02, or 623, but always has GenomeID ABL0N0yhmh9B). Feel free to use the code, port it to Python, etc.

I never intended it to be used for identifying transcript datasets, but luckily for me, 14 of the 58 genes involved in generating the ID are ubiquitously expressed (see the PDF in GitHub for details and references).  So for any given cell type/condition we'll get some consistent dropout of markers, leading to more GenomeID letters being shared between samples than you'd see for a genome or exome.  There is still enough information entropy in those 14+ genes, though, to distinguish between samples at the scale of an RNASeq experiment. I can use this to check, to a first approximation, that multiple samples in the experiment that are supposed to have come from the same human actually do.

Before I can run my GenomeID script, I need a VCF or BAM file as input.  That's fine, I'll launch a parallel BWA MEM process for each of the 24 samples (~50 million reads each), and I have a 32-core server so it should be quick, right?  Unfortunately, after getting used to running Kallisto for RNASeq pseudomapping, waiting for BWA to grind through these 1.2 billion reads feels interminable.


Who has hours to spare? They say impatience is one of the chief virtues of a programmer. I thought to myself, why don't I combine the best of both worlds and use Kallisto to quickly identify the subset of reads pseudomapping to the genes involved in the GenomeID, then only map those reads to the genome using BWA MEM? So, I quickly generated a subset of the RefSeq transcripts (150 of the 52606 I typically use) corresponding to the 58 GenomeID genes, indexed it as hg19_genomeid_bloom, then ran a test of the combined strategy:

bash-4.2$ time kallisto quant -i hg19_genomeid_bloom --pseudobam <(gzip -cd sampleA1/sampleA1_L001_R1_001.fastq.gz) --single -l 180 -s 20 -o testbloom | perl -ane 'print "\@$F[0]\n$F[9]\n+\n$F[10]\n"' | bwa mem hg19.fa - | samtools sort -O bam > testbloom.bam

[jumble of bwa and kallisto progress messages...]
[quant] processed 4,131,919 reads, 24,897 reads pseudoaligned
[more jumble...]
real 1m00.809s
user 0m59.875s
sys 0m5.965s

The intervening perl one-liner converts the SAM output from Kallisto into FASTQ for BWA. As the quant message shows, we only passed 0.6% of the reads on to BWA for processing, saving us oodles of CPU time.  This follows in the vein of a Bloom filter, a quick hashing structure that is guaranteed to include all true positive entries but may also include some false positives. A Bloom filter saves a lot of effort when the retrieval operation is expensive compared to the Bloom hashing operation, and we approximate* this here with Kallisto pseudomapping being much less expensive per sequence than real full alignment in BWA. The few false positives Kallisto generates will be mapped elsewhere in the genome by BWA, mitigating their potential effect on the GenomeID result.

Given an Illumina sequencing experiment output, run bcl2fastq, then:

bash-4.2$ cd Data/Intensities/BaseCalls/myExperiment
bash-4.2$ for d in *; do kallisto quant -i hg19_genomeid_bloom --pseudobam <(gzip -cd $d/*.gz) --single -l 180 -s 20 -o $d.bloom | perl -ane 'print "\@$F[0]\n$F[9]\n+\n$F[10]\n"' | bwa mem hg19.fa - | samtools sort -O bam > $d.genomeid.bam & done
bash-4.2$ for b in *.genomeid.bam; do samtools index $b & done

That took less than 10 minutes on a 32 core server, and it never broke the 50% utilization mark.  It's Kallisto heavy in the first few minutes, then shifts towards BWA at the end as it can't quite keep up with our 0.6% passthrough rate.  Now we can generate the GenomeIDs with a script from the repository:

bash-4.2$ generate_genome_id *.genomeid.bam > genomeids

Append it to the metadata file we'll use for Kallisto, with factor columns removed for readability here:
bash-4.2$ paste meta.tab genomeids

sample path GenomeID
N1d_003_screen N1d_003_screen.kallisto /CBCJYBAKABB
N1e_003_V8 N1e_003_V8.kallisto /CBCJABAKABB
N2d_010_screen N2d_010_screen.kallisto /CQGBZIBCNBB
N2e_010_V8 N2e_010_V8.kallisto /CQGBYIBCdBB
N3d_018_screen N3d_018_screen.kallisto /CAaBoJAKIBB
N3e_018_V8 N3e_018_V8.kallisto /CASBoJAKJBB
N4d_026_screen N4d_026_screen.kallisto /CQGdwJRKIBB
N4e_026_V_V8 N4e_026_V_V8.kallisto /CQOZwJRKYBB
N5d_035_screen N5d_035_screen.kallisto /CACJRIBCIBB
N5e_035_V8 N5e_035_V8.kallisto /CASZRIBCIBB
N6d_043_screen N6d_043_screen.kallisto /CAeB4JRKYBB
N6e_043_V8 N6e_043_V8.kallisto /CAeB5JRKIBB
S1d_020_screen S1d_020_screen.kallisto /CQGJgIAKNBB
S1e_020_V8 S1e_020_V8.kallisto /CQGJoIAKNBB
S2d_049_screen S2d_049_screen.kallisto /CRSJYIACIBB
S2e_049_V8 S2e_049_V8.kallisto /CRSJYIACIBB
S3d_050_screen S3d_050_screen.kallisto /AQGJQJSCJBB
S3e_050_V8 S3e_050_V8.kallisto /AAOJQJACIBB
S4d_058_screen S4d_058_screen.kallisto /CQaBYJAKIBB
S4e_058_V8 S4e_058_V8.kallisto /CQSBZJAKKBB
S5d_061_screen S5d_061_screen.kallisto /CZaNgJAKEBB
S5e_061_V8 S5e_061_V8.kallisto /CRCJhJAKcBB
S6d_065_screen S6d_065_screen.kallisto /CRKRQJBLABB
S6e_065_V8 S6e_065_V8.kallisto /CRKVQJBKIBB

N# and S# are subject identifiers and d/e are pre and post experiment, so since the list is alphabetical we expect matching pairs like (1,2), (3,4), etc.  For each pair we're only off by a letter or two, except for subject S5 (S5d vs. S5e above). I use this information in two ways: first, after I generate the Poisson distance matrix (dark=close) for the samples, I'll look to see if S5e and S5d are close to each other.

Yes, they are right next to each other in the cluster tree (see left margin), so the difference in IDs is unlikely to be explained by a difference in which ID-generating genes are expressed in each sample.^ I might exclude this sample later if the differential expression stats are all wonky.  As it turns out, excluding this sample nearly doubles (10 -> 17) the number of significantly differentially expressed genes for the factor of interest.  Looking at the distance matrix alone, you would never suspect that this sample was problematic.

Second, N5d is quite the outlier in the matrix...is it mislabelled, maybe from another patient?  No, despite having a significantly outlying expression profile, the GenomeID of the sample is nearly identical to N5e's.  If something went wrong with this sample, it's nearly impossible for us to distinguish it from plain old biological variability amongst humans in the treatment response. I'm unlikely to exclude this from the analysis until I've tried a bunch of other remedies like non-parametric tests.

Is this a definitive way to check for mistaken sample labelling? No. But it's surely better than blindly trusting that the inputs to the analysis are correctly labelled, and worth a 10 minute effort that could save you major headaches down the road.  Don't forget to do a sex check too.
__________
*I say approximately because Kallisto will fail to map some reads BWA would, so we can't guarantee no false negatives. And some reads end up in poorly resolved equivalence classes without bootstrapping, but that's negligible for the purposes of the GenomeID calculation.

^By the way, don't use the pvalues script in GitHub on RNASeq derived identifiers. It assumes diploidy, which we can't assume for RNASeq derived genotypes due to imprinting, etc.

March 21, 2017

Quick sample sex swap sanity check for RNASeq

I've noted before how you might detect RNASeq sample swaps in engineered cell line samples.  But what if we don't have a priori genotype knowledge?

To quickly catch some possible sample swap problems in an RNASeq experiment involving clinical samples with sex metadata, you can look at the expression level of SRY, a gene found on the Y chromosome and hence one that should only be expressed in males of any age^. Females have shown zero signal in all experiments we've run here. If using Kallisto, concatenate the abundance (transcripts per million) files first:

$ perl -e 'print "target_id\t",join("\t",map {/(.*)\//;$1} @ARGV),"\n";' *.kallisto/abundance.tsv > all_abundance.tsv

$ paste *.kallisto/abundance.tsv | perl -ane 'print $F[0];for (1..$#F){print "\t$F[$_]" if /[49]$/}print "\n"' | tail -n +2 >> all_abundance.tsv

Then the following adds a column to the metadata tabular file with the NM_003140 identifier for SRY (I used RefSeq as a Kallisto reference):

$ grep NM_003140 all_abundance.tsv | perl -ane 'print join("\n",@F),"\n"' | paste meta.tab -
sample path sex age NM_003140
A A.kallisto M 50 0.603562
B B.kallisto M 75 0.540668
C C.kallisto M 27 0.519294
D D.kallisto F 35 0
E E.kallisto M 46 0
F F.kallisto M 74 0.970973
G G.kallisto M 41 0.57206
H H.kallisto F 30 0
I I.kallisto M 19 0.246618
J J.kallisto F 39 0.381072
K K.kallisto F 61 0
L L.kallisto M 37 0.304948
M M.kallisto M 65 0
N N.kallisto F 78 0
O O.kallisto F 57 0
P P.kallisto F 53 0
Q Q.kallisto F 52 0
R R.kallisto F 73 0

Three samples stand out where the SRY information doesn't jibe with the sample metadata: two males without SRY expression (E and M), and one female with SRY expression (J). Time to track backwards to where the errors might have occurred! Note that in my experience, no females have any detectable SRY expression, but not all males are detectable (i.e. occasionally male samples have 0 too).

Update: Even more reliable is the presence of XIST, a long non-coding RNA (but polyadenylated, so usually captured in mammalian RNASeq protocols), in female samples but not male.  Its RefSeq ID is NM_001564. This wouldn't be reliable in Turner Syndrome subjects, but hopefully your study does not include these; if it does, use the combination of XIST absence and SRY presence.
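
A minimal sketch of the XIST version of the check in R, reusing the all_abundance.tsv and meta.tab files from above (NM_001564 is assumed to be how XIST appears in the RefSeq-based index; adjust if your target IDs carry version suffixes):

ab <- read.table("all_abundance.tsv", header=TRUE)
meta <- read.table("meta.tab", header=TRUE)
# find the XIST row; grep tolerates version-suffixed IDs like NM_001564.4
xist_row <- grep("NM_001564", ab$target_id)
# append the per-sample XIST TPMs; assumes the sample order matches, as before
meta$XIST_NM_001564 <- unlist(ab[xist_row, -1])
meta   # females should show clear XIST signal, males ~0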
_________
^Sure there are rare exceptions to nominal male=Y, I'm ignoring them here.

March 13, 2017

Combining data from multiple RNASeq experiments: release the Kruskal! (...Wallis test)

tl;dr When you combine RNASeq experiments, the assumption of normality across samples with the same nominal factors goes out the window. The standard tests give only 16 transcripts, even when the experiment batch effect is modelled fully. For transcript-level (e.g. Kallisto output) abundance data with at least ~25 samples, we can test two factors in a non-parametric (expression rank) way by using a Kruskal-Wallis test.

A post-hoc U-test for each transcript passing Kruskal-Wallis will tell us which factors (or their interaction) are significantly changing the expression level. This is despite a global U-test of transcripts turning up nothing. In the example below, we get 760 candidates in the combined analysis, with 71 in the factor interaction term. Many of those 71 are enriched in specific pathways, and do not overlap with the 16 found via parametric tests.
_________________________

Recently, I analyzed an RNASeq experiment redo.  The experiment was redone because the original yielded no useful significant transcript changes across the experimental factor contrast of interest (exercise vs. non-exercise).  The redone experiment did yield some significant contrasts for the exercise factor, but I thought to myself: can't we combine the two experiments and get even better statistical power? The short answer is no if you're using the standard tests like the Wald and likelihood ratio tests, because they assume replicates that are normally distributed.  We shouldn't assume that the transcript expression levels for replicates across the two experiments are normally distributed.

Here's a three step process to model the two-factor experiment non-parametrically with the Kruskal-Wallis test: 1) check for outliers, 2) run Kruskal-Wallis, 3) run post-hoc pairwise U-tests.

Step 1: Check for outliers

Note that since we are going to be using specific commands to gather the raw abundance data from Kallisto later, it's important that the samples be in alphabetical order in the metadata file.

meta <- read.table("meta_kallisto_combined_sorted.tab", header=TRUE)
meta$path <- as.character(meta$path)

We'll concatenate the two factors' values to yield four distinct factor level combinations, giving us a base level (ctl_five), the individual factor effects (ctl_ten for time, exp_five for condition) and the interaction effect (exp_ten).

meta$groups <- as.factor(paste(meta$condition, meta$time, sep="_"))
meta
            sample                     path condition time   groups
1    D10_control_2   D10_control_2.kallisto       ctl  ten  ctl_ten
2    D10_control_3   D10_control_3.kallisto       ctl  ten  ctl_ten
3    D10_control_4   D10_control_4.kallisto       ctl  ten  ctl_ten
4    D10_control_5   D10_control_5.kallisto       ctl  ten  ctl_ten
5     D5_control_1    D5_control_1.kallisto       ctl five ctl_five
6     D5_control_2    D5_control_2.kallisto       ctl five ctl_five
7     D5_control_5    D5_control_5.kallisto       ctl five ctl_five
8     D5_control_6    D5_control_6.kallisto       ctl five ctl_five
9   Day10_Control1  Day10_Control1.kallisto       ctl  ten  ctl_ten
10  Day10_Control2  Day10_Control2.kallisto       ctl  ten  ctl_ten
11  Day10_Control3  Day10_Control3.kallisto       ctl  ten  ctl_ten
12 Day10_Exercise1 Day10_Exercise1.kallisto       exp  ten  exp_ten
13 Day10_Exercise2 Day10_Exercise2.kallisto       exp  ten  exp_ten
14 Day10_Exercise3 Day10_Exercise3.kallisto       exp  ten  exp_ten
15   Day5_Control1   Day5_Control1.kallisto       ctl five ctl_five
16   Day5_Control2   Day5_Control2.kallisto       ctl five ctl_five
17   Day5_Control3   Day5_Control3.kallisto       ctl five ctl_five
18  Day5_Exercise1  Day5_Exercise1.kallisto       exp five exp_five
19  Day5_Exercise2  Day5_Exercise2.kallisto       exp five exp_five
20  Day5_Exercise3  Day5_Exercise3.kallisto       exp five exp_five
21     Exercise_12     Exercise_12.kallisto       exp five exp_five
22     Exercise_13     Exercise_13.kallisto       exp  ten  exp_ten
23     Exercise_17     Exercise_17.kallisto       exp  ten  exp_ten
24     Exercise_18     Exercise_18.kallisto       exp  ten  exp_ten
25     Exercise_19     Exercise_19.kallisto       exp  ten  exp_ten
26      Exercise_1      Exercise_1.kallisto       exp five exp_five
27      Exercise_2      Exercise_2.kallisto       exp five exp_five
28      Exercise_7      Exercise_7.kallisto       exp five exp_five

Now let's read in the transcript abundance data gathered by running Kallisto.  The commands to put this file together are described in a previous blog post.

> abundance_full <- read.table("all_abundance.tsv", header=TRUE)

Now let's make a distance matrix for the samples based on the abundance data, in order to look for outliers that may interfere with our analysis (outliers are probably why the first experiment, with samples named Day*, yielded no results).
> library("pheatmap")
> library("RColorBrewer")
> library("PoiClaClu")
> poisd <- PoissonDistance(t(abundance_full))
> samplePoisDistMatrix <- as.matrix( poisd$dd )
> rownames(samplePoisDistMatrix) <- colnames(abundance_full)
> colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
> pheatmap(samplePoisDistMatrix,  clustering_distance_rows=poisd$dd,clustering_distance_cols=poisd$dd, col=colors)


What immediately leaps out to me is that the main grouping factor is the experiment, i.e. all the DayXXX samples are grouped together. The second biggest grouping factor within each experiment is the day (5 vs. 10). Also, good contrasts yield a "checkerboard" effect in the distance matrix.  This is much more like a mosaic.

Does this mean that it'll be tough to pull out the exercise factor effect from the combined dataset? This is where a non-parametric test helps out, because it does not assume a normal distribution of the effect amongst "replicate" samples, just that the direction of the effect is the same (I'm glossing over details here, but it's generally true). Let's look at the residual sum of squares (RSS) of the model fit to get a relative measure of how well we are modelling the data with the known factors.

so <- sleuth_prep(meta, ~condition*time)
so <- sleuth_fit(so)
sum(so$fits$full$summary[,2])
[1] 1159405

Getting back to the distance matrix and outlier analysis, do you see the light (distant) streak running through the matrix for samples 9 and 12 (X-axis label)? They are all off on their own (cluster tree on the left margin), and don't relate nicely via the general factor trends. Let's try excluding them from the model and see how it affects our RSS.

meta_no_outliers <- meta[c(1:8,10:11,13:28),]
so_no_outliers <- sleuth_prep(meta_no_outliers, ~condition*time)
so_no_outliers <- sleuth_fit(so_no_outliers)
sum(so_no_outliers$fits$full$summary[,2])
[1] 1052599

So we reduced the RSS by over 9% by excluding 7% of the samples (2 of 28). Generally, excluding a good replicate will reduce the RSS by only up to half its sample weight (i.e. we'd expect an RSS reduction of <5% for these two samples, but we got 9%, so they really are outliers).  Maybe these samples were prepared separately, etc.; we'd have to go back to the lab book to look for an explanation.

Sample 27 from the second experiment run also looks a bit fishy, all off on its own in the cluster tree.  Let's try dropping it too.

meta_no_outliers_redux <- meta[c(1:8,10:11,13:26,28),]
so_redux <- sleuth_prep(meta_no_outliers_redux, ~condition*time)
so_redux <- sleuth_fit(so_redux)
sum(so_redux$fits$full$summary[,2])
[1] 1048939

The exclusion of Sample 27 (Exercise_2) dropped the RSS by a measly incremental 0.35%, even though we dropped 3.8% of the samples (1 of 26). We should keep it, because it isn't an outlier contributing to a bad model fit.

While it's fresh in our minds, let's assign an experiment factor to the metadata, which, as you can see in the bottom right of the matrix, corresponds to samples 10-20 once we eliminate the two big outliers. This will help us reduce the RSS later.

> meta_no_outliers$which_experiment <- as.factor(c(rep.int(0,8),rep.int(1,10),rep(0,8)))
> meta_no_outliers
            sample                     path condition time   groups which_experiment
1    D10_control_2   D10_control_2.kallisto      ctl  ten  ctl_ten                0
2    D10_control_3   D10_control_3.kallisto      ctl  ten  ctl_ten                0
3    D10_control_4   D10_control_4.kallisto      ctl  ten  ctl_ten                0
4    D10_control_5   D10_control_5.kallisto      ctl  ten  ctl_ten                0
5     D5_control_1    D5_control_1.kallisto      ctl five ctl_five                0
6     D5_control_2    D5_control_2.kallisto      ctl five ctl_five                0
7     D5_control_5    D5_control_5.kallisto      ctl five ctl_five                0
8     D5_control_6    D5_control_6.kallisto      ctl five ctl_five                0
10  Day10_Control2  Day10_Control2.kallisto      ctl  ten  ctl_ten                1
11  Day10_Control3  Day10_Control3.kallisto      ctl  ten  ctl_ten                1
13 Day10_Exercise2 Day10_Exercise2.kallisto      exp  ten  exp_ten                1
14 Day10_Exercise3 Day10_Exercise3.kallisto      exp  ten  exp_ten                1
15   Day5_Control1   Day5_Control1.kallisto      ctl five ctl_five                1
16   Day5_Control2   Day5_Control2.kallisto      ctl five ctl_five                1
17   Day5_Control3   Day5_Control3.kallisto      ctl five ctl_five                1
18  Day5_Exercise1  Day5_Exercise1.kallisto      exp five exp_five                1
19  Day5_Exercise2  Day5_Exercise2.kallisto      exp five exp_five                1
20  Day5_Exercise3  Day5_Exercise3.kallisto      exp five exp_five                1
21     Exercise_12     Exercise_12.kallisto      exp five exp_five                0
22     Exercise_13     Exercise_13.kallisto      exp  ten  exp_ten                0
23     Exercise_17     Exercise_17.kallisto      exp  ten  exp_ten                0
24     Exercise_18     Exercise_18.kallisto      exp  ten  exp_ten                0
25     Exercise_19     Exercise_19.kallisto      exp  ten  exp_ten                0
26      Exercise_1      Exercise_1.kallisto      exp five exp_five                0
27      Exercise_2      Exercise_2.kallisto      exp five exp_five                0
28      Exercise_7      Exercise_7.kallisto      exp five exp_five                0


Step 2: Run the non-parametric tests

Let's first try the non-parametric U-test (a.k.a. Wilcox or Mann-Whitney test) to check for difference between the exercise and non-exercise samples.

> abundance <- abundance_full[rowSums(abundance_full)>length(meta_no_outliers$sample),c(1:8,10:11,13:28)]

> wil <- function(x){df <- data.frame(tpm=t(abundance[x,])[,1], fctr=m); w <- wilcox.test(tpm ~ fctr, data=df); w$p.value}  #[ed. m will be populated with the factor values for each sample in the next function]

> calc_wilcox_qvals <- function(combo){m <<- as.factor(combo); pvals <<- c(); for(i in 1:length(abundance[,1])){pvals <<- append(pvals,wil(i))}; return(p.adjust(pvals, method="fdr"))}

> condition_wilcox_qvals <- calc_wilcox_qvals(meta_no_outliers$condition)
There were 50 or more warnings (use warnings() to see the first 50) #[ed. This is due to rank ties in the U-test, you can generally ignore these]

> table(condition_wilcox_qvals < .05)

FALSE 
23953

D'oh! We don't have any transcript differential expression picked up by the U-test that survives the False Discovery Rate (Benjamini-Hochberg in this case) multiple testing correction.  We know from previous experience that we might pick up a two-level contrast once we have at least 15 Kallisto samples.  We can also show that we have the power to do so in this experiment by looking at one of the primary grouping factors in the distance matrix: the time levels of Day5 vs. Day10.

> time_wilcox_qvals <- calc_wilcox_qvals(meta_no_outliers$time)
There were 50 or more warnings (use warnings() to see the first 50)

> table(time_wilcox_qvals < .05)

FALSE  TRUE 
21176  2777 

If exercise-vs.-non isn't separating by rank nicely across the two experiments, this might be because at Day 5 the effect of exercise isn't very strong, so those exercise samples mix ranks with the non-exercise samples.  The way to improve our power to detect the Day 10 exercise effect would be to look at the interaction (synergistic) effect of the day and exercise factors, as per the "groups" metadata factor we created earlier that has four levels. The U-test only works with two factor levels.  The generalization of it for more than two factor levels is the Kruskal-Wallis test, which we'll employ here.

> kru <- function(x){d <- data.frame(tpm=t(abundance[x,])[,1], groups=m); k <- kruskal.test(tpm ~ groups, data=d); k$p.value}


> kru_qvals <- function(combo){m <<- as.factor(combo); pvals <<- c(); for(i in 1:length(abundance[,1])){pvals <<- append(pvals,kru(i))}; return(p.adjust(pvals, method="fdr"))}
> k <- kru_qvals(meta_no_outliers$groups)
> table(k < .05)

FALSE  TRUE 
23193   760 

[Aside: An astute reader will note that we have fewer than 8 samples in each of the four factor levels, yet we are picking out significant contrasts when the simpler U-test is underpowered at <8 samples per factor level.  Because we have four factor levels, there are more unique permutations of the ranks possible than when there are only two, hence the better performance here.  Thinking of it another way, for any fixed string length, the number of unique ways to write 1's and 0's is << the number of ways to write 0's, 1's, 2's, and 3's. More unique ranking choices means more information entropy, which means greater statistical power for any given factor level. See the quick calculation below.]
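
As a rough illustration, here are the counts of distinct labellings of this experiment's 26 samples for the two-level condition split (13 vs. 13) versus the four-level groups split (7, 6, 7, 6):

# distinct ways to label 26 samples with two levels of 13 each
choose(26, 13)
# [1] 10400600
# distinct ways to label them with the four "groups" levels (7, 6, 7, 6)
factorial(26) / (factorial(7) * factorial(6) * factorial(7) * factorial(6))
# roughly 3e13, i.e. about three million times more possible labellings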

Let's note the Kruskal-Wallis significant transcripts for later.  

> kru_sig_abundance <- abundance[which(k < .05),]

> sig_ks <- k[which(k < .05)]

Let's do a sanity check and see if a random labelling of the samples gives any significant changes (it shouldn't).

> k_random <- kru_qvals(sample(meta_no_outliers$groups))
> table(k_random < .05)

FALSE 
23953 

Phew! You could run the random sampling a bunch of times; that's left as an exercise for the reader.


Step 3: Post-hoc significance testing with many factor levels

Now, we know that 760 transcripts have factor levels with significantly different rankings according to our groups label, but which of the groups factor's four levels are the significant ones?  We need to contrast each pair of factor levels. This page from the R Companion provides a nice summary of ways to do the post-hoc analysis of the differing groups. Let's use the U-test, since it behaves nicely even without an equal number of items in each category (remember that we removed two samples, and if you don't believe me...


> table(meta_no_outliers$groups)

ctl_five  ctl_ten exp_five  exp_ten 
       7        6        7        6 
).

Not only do we want to use the U-test (a.k.a. Mann-Whitney or Wilcox) for each pair of factor levels and do multiple testing correction (FDR again), but we also need to load a couple of libraries that'll give us useful functions to print out the group comparisons later.

> library(multcompView)
> library(rcompanion)


> categorize <- function(x){pt <- pairwise.wilcox.test(as.numeric(x), meta_no_outliers$groups, p.adjust.method="fdr"); pt$p.value[which(is.nan(pt$p.value))] <- 1; multcompLetters(fullPTable(pt$p.value), compare="<", threshold=0.05, Letters=letters, reverse=FALSE)$Letters}
> kru_sig_diff_groups <- apply(kru_sig_abundance, 1, categorize)


This will give us for each of the 760 transcripts a compact representation of the significant grouping information for the 4 factor levels, e.g. 

> kru_sig_diff_groups[,1]
ctl_five ctl_ten exp_five exp_ten
"ab"     "c"     "a"      "bc"

This means that ctl_five is statistically indistinguishable from exp_five (they share 'a') and from exp_ten (they share 'b'), ctl_ten is indistinguishable from exp_ten (they share 'c'), exp_five and exp_ten are distinct (they share no letters), etc.  Let's write a function to keep only the U-test results that have exp_ten (our interaction term, index 4) distinct from both non-exercise control time points (indices 1 & 2).


> exercise_only <- function(x){!grepl(x[1],x[4]) && !grepl(x[4],x[1]) && !grepl(x[2],x[4]) && !grepl(x[4],x[2])}
> eo <- apply(kru_sig_diff_groups, 2, exercise_only)
> table(eo)
eo
FALSE  TRUE 
  689    71 
> kru_exercise_sigids <- colnames(kru_sig_diff_groups[,eo])

Eureka!  We have 71 significant Day 10 exercise differentially expressed transcripts across both experiments.  This was done based solely on expression rank of transcripts across samples, so we didn't need the experiments to be true replicates, just moving in the same general direction.  We could write a function to calculate the log2 fold changes, but it'd get complicated as we have multiple factor levels, and it wouldn't include the estimates of technical variability, etc. that Sleuth provides.  Even if the parametric test results from Sleuth aren't going to be significant due to the non-normality of the samples, the fold change information is useful, so let's run it.

> so <- sleuth_prep(meta_no_outliers, ~condition*time+which_experiment+condition:which_experiment+time:which_experiment)
> sum(so$fits$full$summary[,2])
[1] 864793.9

Note that this is a much better RSS than originally (~1160K). We did two things: excluded the outliers (~9% drop in RSS), and explicitly modelled the experiments and their effect on the other factors (a further ~18% drop in RSS). This will help the Sleuth results for fold-change* be more realistic. Run the Wald and likelihood ratio (LRT) tests.

> so <- sleuth_fit(so, ~time+which_experiment+time:which_experiment, "no_condition")
...
> so <- sleuth_fit(so, ~condition+which_experiment+condition:which_experiment, "no_time")
...
> so <- sleuth_fit(so, ~condition+time+which_experiment+condition:which_experiment+time:which_experiment, "no_interaction")
...
> so <- sleuth_lrt(so, 'no_condition', 'full')
> so <- sleuth_lrt(so, 'no_time', 'full')
> so <- sleuth_lrt(so, 'no_interaction', 'full')
> lrt_condition <- sleuth_results(so, 'no_condition:full', test_type = 'lrt')
> lrt_time <- sleuth_results(so, 'no_time:full', test_type = 'lrt')
> lrt_interaction <- sleuth_results(so, 'no_interaction:full', test_type = 'lrt')
> so <- sleuth_wt(so, 'conditionexp')
> so <- sleuth_wt(so, 'timeten')
> so <- sleuth_wt(so, 'conditionexp:timeten')
> so <- sleuth_wt(so, 'conditionexp:which_experiment1')
> so <- sleuth_wt(so, 'timeten:which_experiment1')
> wt_condition <- sleuth_results(so, 'conditionexp')
> wt_time <- sleuth_results(so, 'timeten')
> wt_interaction <- sleuth_results(so, 'conditionexp:timeten')
> wt_condition1 <- sleuth_results(so, 'conditionexp:which_experiment1')
> wt_time1 <- sleuth_results(so, 'timeten:which_experiment1')

Did the parametric tests show up anything significant?

> table(wt_condition$qval < .05)

FALSE  TRUE 
33456    16 
> table(lrt_condition$qval < .05)

FALSE  TRUE 
33466     6 
> table(lrt_interaction$qval < .05)

A few, but not as many as our non-parametric Kruskal-Wallis result (71).  The LRT results (6) are a subset of the Wald results (16).  Let's put the non-parametric and parametric results together for reporting.

> all_sigids <- unique(c(wt_condition_sigids,kru_exercise_sigids))
> length(all_sigids)
[1] 87

87 = 71 + 16, so there is no overlap in the two tests' results.  Let's define a function that provides the post-hoc pairwise U-test results when available, or NAs when not, for reporting purposes.

> kru_data <- function(id){if(id %in% colnames(kru_sig_diff_groups)){c(kru_sig_diff_groups[,which(colnames(kru_sig_diff_groups) == id)],sig_ks[which(colnames(kru_sig_diff_groups)==id)])}else{rep(NA,5)}}

Gather the Wald test results (which includes the fold-change*) and the pairwise U-test results for each of the 87 significantly differentially expressed transcripts, then transpose the rows and columns, then write to a file.

> dl <- vapply(all_sigids, function(id){unlist(c(wt_condition[wt_condition$target_id == id,],wt_condition1[wt_condition1$target_id == id,],wt_time[wt_time$target_id == id,],kru_data(id)))}, FUN.VALUE=rep("1", 38))

> write.csv(t(dl), "kruskal_wallis_q_lt_0.05_u_test_fdr_0.05.csv")

I'll load this into Excel, remove extraneous Wald test result columns that just confuse most people reading the analysis, and convert the beta values to log2 fold-changes* with an Excel formula for the users' didactic purposes.**
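
If you'd rather do that base conversion in R than in Excel, it's one line (a sketch; assumes the b column returned by sleuth_results() holds the natural-log effect size, as per the first footnote below):

# divide a natural-log beta by log(2) to express it as a log2 "fold-change"*
wt_condition$b_log2 <- wt_condition$b / log(2)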



If you look at the exp_five column, you'll see that it is often indistinguishable from (shares a letter with) one or both of the controls. This explains why the two-factor level U-test failed, and why using the Kruskal-Wallis test was key to seeing the interaction effect.  As per usual, take any individual transcript with a grain of salt, but look for pathway enrichment in IPA, DAVID, etc.


________
* Not really fold-change, but the natural logarithm of the bias estimator.  Read the Sleuth manual for details. It's worth it.

** A clever reader will note that you can change the log base by dividing by a constant. Once again, I used the Excel formula for didactic purposes, making the base conversion more explicit.