March 31, 2017

Verifying sample pairings in clinical RNASeq: using Kallisto as a Bloom filter to calculate a GenomeID

tl;dr Before going into a full analysis, you can quickly check that paired samples in an RNASeq experiment involving human subjects really come from the same person by checking that their GenomeIDs roughly match.  The quickest way to generate a GenomeID when starting with FASTQ files is to run Kallisto with a specific, small set of transcripts as the database, then pipe the pseudoaligned reads to BWA. You can process a billion reads in less than ten minutes this way with a multicore server. I used this to identify a sample with an inconsistent GenomeID that you would never otherwise have suspected was problematic based on its overall expression profile, but excluding it nearly doubled (10 -> 17) the number of differentially expressed genes found for the factor of interest.

This Kallisto-as-BWA-filter procedure can be used for any human FASTQ files to help uniquely identify the sample origin. Although originally designed for exomes and genomes, it works well enough to be useful for RNASeq.
__________________

For a few years, I've been using a scheme I made up called GenomeID.  This is a handy 12-character ID that can be calculated from the zygosity of highly polymorphic SNPs in coding regions of 58 genes throughout the human genome. Because of the details of the schema, even close relatives have completely unrelated GenomeIDs, and the odds of ID collisions are tiny. This lets me check the intrinsic ID of a genome or exome BAM or VCF file in a fraction of a second for tracking purposes, regardless of the labels different sources may give me (e.g. the Genome in a Bottle sample is variously labelled NA12878, CEPH1463.02, or 623, but always has GenomeID ABL0N0yhmh9B). Feel free to use the code, port it to Python, etc.
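
To give a flavour of the idea (this toy sketch is not the actual GenomeID encoding; see the GitHub repo for the real scheme), you could pack the zygosity calls at a fixed, ordered set of marker sites into a short base64-style string, something like this in R:

# toy illustration only: 0 = hom ref, 1 = het, 2 = hom alt at each marker site
b64 <- c(LETTERS, letters, 0:9, "+", "/")
toy_id <- function(zygosity){
  bits <- unlist(lapply(zygosity, function(z){as.integer(intToBits(as.integer(z)))[1:2]}))  # 2 bits per marker
  bits <- c(bits, rep(0L, (6 - length(bits) %% 6) %% 6))                                    # pad to a multiple of 6
  paste(apply(matrix(bits, nrow=6), 2, function(b){b64[sum(b * 2^(0:5)) + 1]}), collapse="")
}
toy_id(c(0, 1, 2, 1, 0, 2, 2, 1, 0))   # nine markers -> a three-character toy ID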

I never intended it to be used for identifying transcript datasets, but luckily for me, 14 of the 58 genes involved in generating the ID are ubiquitously expressed (see the PDF in GitHub for details and references).  So for any given cell type/condition we'll get some consistent dropout of markers, leading to GenomeIDs that share more letters between samples than genome- or exome-derived IDs would.  There is still enough information entropy in those 14+ genes though to distinguish between samples at the scale of an RNASeq experiment. I can use this to check to a first approximation that multiple samples in the experiment that are supposed to have come from the same human actually do.

Before I can run my GenomeID script, I need a VCF or BAM file as input.  That's fine, I'll launch a parallel BWA MEM process for each of the 24 samples (~50 million reads each), and I have a 32-core server so it should be quick, right?  Unfortunately, after getting used to running Kallisto for RNASeq pseudomapping, waiting for BWA to finish on these 1.2 billion reads is like:


Who has hours to spare? They say impatience is one of the chief virtues of a programmer. I thought to myself, why don't I combine the best of both worlds and use Kallisto to quickly identify the subset of reads pseudomapping to the genes involved in the GenomeID, then only map those reads to the genome using BWA MEM? So, I quickly generated a subset of the RefSeq transcripts (150 of the 52606 I typically use) corresponding to the 58 GenomeID genes, indexed it as hg19_genomeid_bloom, then ran a test of the combined strategy:

bash-4.2$ time kallisto quant -i hg19_genomeid_bloom --pseudobam <(gzip -cd sampleA1/sampleA1_L001_R1_001.fastq.gz) --single -l 180 -s 20 -o testbloom | perl -ane 'print "\@$F[0]\n$F[9]\n+\n$F[10]\n"' | bwa mem hg19.fa - | samtools sort -O bam > testbloom.bam

[jumble of bwa and kallisto progress messages...]
[quant] processed 4,131,919 reads, 24,897 reads pseudoaligned
[more jumble...]
real 1m00.809s
user 0m59.875s
sys 0m5.965s

The intervening perl one-liner converts the SAM output from Kallisto into FASTQ for BWA. If you look at the [quant] message, you'll see we only passed 0.6% of the reads on to BWA for processing, saving us oodles of CPU time.  This follows in the vein of a Bloom filter: a quick membership test that is guaranteed to pass all true positive entries, but may also pass some false positives. A Bloom filter saves a lot of effort when the retrieval operation is expensive compared to the Bloom hashing operation, and we approximate* this here with Kallisto pseudomapping being much less expensive per sequence than real full alignment in BWA. The few false positives Kallisto generates will be mapped elsewhere in the genome by BWA, mitigating their potential effect on the GenomeID result.

Given an Illumina sequencing experiment output, run bcl2fastq, then:

bash-4.2$ cd Data/Intensities/BaseCalls/myExperiment
bash-4.2$ for d in *; do kallisto quant -i hg19_genomeid_bloom --pseudobam <(gzip -cd $d/*.gz) --single -l 180 -s 20 -o $d.bloom | perl -ane 'print "\@$F[0]\n$F[9]\n+\n$F[10]\n"' | bwa mem hg19.fa - | samtools sort -O bam > $d.genomeid.bam & done
bash-4.2$ for b in *.genomeid.bam; do samtools index $b & done

That took less than 10 minutes on a 32 core server, and it never broke the 50% utilization mark.  It's Kallisto heavy in the first few minutes, then shifts towards BWA at the end as it can't quite keep up with our 0.6% passthrough rate.  Now we can generate the GenomeIDs with a script from the repository:

bash-4.2$ generate_genome_id *.genomeid.bam > genomeids

Append it to the metadata file we'll use for Kallisto, with factor columns removed for readability here:
bash-4.2$ paste meta.tab genomeids

sample path GenomeID
N1d_003_screen N1d_003_screen.kallisto /CBCJYBAKABB
N1e_003_V8 N1e_003_V8.kallisto /CBCJABAKABB
N2d_010_screen N2d_010_screen.kallisto /CQGBZIBCNBB
N2e_010_V8 N2e_010_V8.kallisto /CQGBYIBCdBB
N3d_018_screen N3d_018_screen.kallisto /CAaBoJAKIBB
N3e_018_V8 N3e_018_V8.kallisto /CASBoJAKJBB
N4d_026_screen N4d_026_screen.kallisto /CQGdwJRKIBB
N4e_026_V_V8 N4e_026_V_V8.kallisto /CQOZwJRKYBB
N5d_035_screen N5d_035_screen.kallisto /CACJRIBCIBB
N5e_035_V8 N5e_035_V8.kallisto /CASZRIBCIBB
N6d_043_screen N6d_043_screen.kallisto /CAeB4JRKYBB
N6e_043_V8 N6e_043_V8.kallisto /CAeB5JRKIBB
S1d_020_screen S1d_020_screen.kallisto /CQGJgIAKNBB
S1e_020_V8 S1e_020_V8.kallisto /CQGJoIAKNBB
S2d_049_screen S2d_049_screen.kallisto /CRSJYIACIBB
S2e_049_V8 S2e_049_V8.kallisto /CRSJYIACIBB
S3d_050_screen S3d_050_screen.kallisto /AQGJQJSCJBB
S3e_050_V8 S3e_050_V8.kallisto /AAOJQJACIBB
S4d_058_screen S4d_058_screen.kallisto /CQaBYJAKIBB
S4e_058_V8 S4e_058_V8.kallisto /CQSBZJAKKBB
S5d_061_screen S5d_061_screen.kallisto /CZaNgJAKEBB
S5e_061_V8 S5e_061_V8.kallisto /CRCJhJAKcBB
S6d_065_screen S6d_065_screen.kallisto /CRKRQJBLABB
S6e_065_V8 S6e_065_V8.kallisto /CRKVQJBKIBB

N# and S# are subject identifiers and d/e are pre- and post-experiment, so since the table is alphabetical we expect matching pairs like (1,2), (3,4), etc.  For each pair we're only off by a letter or two, except for subject S5 (the S5d/S5e rows above). I use this information in two ways: first, after I generate the Poisson distance matrix (dark=close) for the samples, I'll look to see if S5e and S5d are close to each other.
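
For reference, the distance matrix here is built the same way as in the combined-experiment post further down the page; a minimal sketch in R, assuming the collated all_abundance.tsv table of Kallisto TPMs described in the March 21 post:

> library("pheatmap")
> library("RColorBrewer")
> library("PoiClaClu")
> abundance <- read.table("all_abundance.tsv", header=TRUE, row.names=1)
> poisd <- PoissonDistance(t(abundance))   # PoissonDistance wants samples as rows
> sampleDistMatrix <- as.matrix(poisd$dd)
> rownames(sampleDistMatrix) <- colnames(abundance)
> colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
> pheatmap(sampleDistMatrix, clustering_distance_rows=poisd$dd, clustering_distance_cols=poisd$dd, col=colors)   # dark = close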

Yes, they are right next to each other in the cluster tree (see left margin), so the difference in IDs is unlikely to be explained by differences in which of the marker genes were observable in each sample.^ I might exclude this sample later if the differential expression stats are all wonky.  As it turns out, excluding this sample nearly doubles (10 -> 17) the number of significantly differentially expressed genes for the factor of interest.  Looking at the distance matrix alone you would never suspect that this sample was problematic.

Second, N5d is quite the outlier in the matrix...is it mislabelled, maybe from another patient?  No, despite having a significantly outlying expression profile, the GenomeID of the sample is nearly identical to N5e.  If something went wrong with this sample, it's nearly impossible for us to distinguish it from plain old biological variability amongst humans in the treatment response. I'm unlikely to exclude this from the analysis until I've tried a bunch of other remedies like non-parametric tests.

Is this a definitive way to check for mistaken sample labelling? No. But it's surely better than blindly trusting that the inputs to the analysis are correctly labelled, and worth a 10 minute effort that could save you major headaches down the road.  Don't forget to do a sex check too.
__________
*I say approximately because Kallisto will fail to map some reads that BWA would, so we can't guarantee no false negatives. Some reads also end up in poorly resolved equivalence classes without bootstrapping, but the effect is negligible for the purposes of the GenomeID calculation.

^By the way, don't use the pvalues script in GitHub on RNASeq derived identifiers. It assumes diploidy, which we can't assume for RNASeq derived genotypes due to imprinting, etc.

March 21, 2017

Quick sample sex swap sanity check for RNASeq

I've noted before how you might detect RNASeq sample swaps in engineered cell line samples.  But what if we don't have a priori genotype knowledge?

To quickly catch some possible sample swap problems in an RNASeq experiment involving clinical samples with sex metadata, you can look at the expression level of SRY, a gene on the Y chromosome that should therefore only be expressed in males of any age^. Females have shown zero signal in all experiments we've run here. If using Kallisto, concatenate the abundance (transcripts per million) files first:

$ perl -e 'print "target_id\t",join("\t",map {/(.*)\//;$1} @ARGV),"\n";' *.kallisto/abundance.tsv > all_abundance.tsv

$ paste *.kallisto/abundance.tsv | perl -ane 'print $F[0];for (1..$#F){print "\t$F[$_]" if /[49]$/}print "\n"' | tail -n +2 >> all_abundance.tsv

Then the following adds a column to the metadata tabular file with the NM_003140 identifier for SRY (I used RefSeq as a Kallisto reference):

$ grep NM_003140 all_abundance.tsv | perl -ane 'print join("\n",@F),"\n"' | paste meta.tab -
sample path sex age NM_003140
A A.kallisto M 50 0.603562
B B.kallisto M 75 0.540668
C C.kallisto M 27 0.519294
D D.kallisto F 35 0
E E.kallisto M 46 0
F F.kallisto M 74 0.970973
G G.kallisto M 41 0.57206
H H.kallisto F 30 0
I I.kallisto M 19 0.246618
J J.kallisto F 39 0.381072
K K.kallisto F 61 0
L L.kallisto M 37 0.304948
M M.kallisto M 65 0
N N.kallisto F 78 0
O O.kallisto F 57 0
P P.kallisto F 53 0
Q Q.kallisto F 52 0
R R.kallisto F 73 0

I've highlighted three samples (E, M, and J) where the SRY information doesn't jibe with the sample metadata: two males without SRY expression, and one female with SRY expression. Time to track backwards to where the errors might have occurred! Note that in my experience, no females have any detectable SRY expression, but not all males show it (i.e. occasionally male samples have 0).

Update: Even more reliable is the presence of the XIST long non-coding RNA (polyadenylated, so usually captured in mammalian RNASeq protocols) in female samples but not male.  Its RefSeq ID is NM_001564. This wouldn't be reliable in Turner syndrome subjects, but hopefully your study does not include these; if it does, use the combination of XIST absence and SRY presence.
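
If you'd rather have the check be automatic than eyeballed, here's a minimal R sketch of the same idea. It assumes the meta.tab and all_abundance.tsv files built above, that their samples are in the same (alphabetical) order, and the 1 TPM XIST cutoff is just a guess to adjust to your own data:

meta <- read.table("meta.tab", header=TRUE)
tpm  <- read.table("all_abundance.tsv", header=TRUE, row.names=1)
sry  <- unlist(tpm[grep("NM_003140", rownames(tpm))[1], ])   # SRY TPM per sample
xist <- unlist(tpm[grep("NM_001564", rownames(tpm))[1], ])   # XIST TPM per sample
# flag disagreements: any SRY signal in a nominal female, or clear XIST signal in a nominal male
suspect <- (meta$sex == "F" & sry > 0) | (meta$sex == "M" & xist > 1)
meta$sample[suspect]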
_________
^Sure there are rare exceptions to nominal male=Y, I'm ignoring them here.

March 13, 2017

Combining data from multiple RNASeq experiments: release the Kruskal! (...Wallis test)

tl;dr When you combine RNASeq experiments, the assumption of normality across samples with the same nominal factors goes out the window. The standard parametric tests give only 16 transcripts even when the experiment batch effect is modelled fully. For transcript-level (e.g. Kallisto output) abundance data with at least ~25 samples, we can test two factors in a non-parametric (expression rank) way by using a Kruskal-Wallis test.

A post-hoc U-Test for each transcript passing Kruskal-Wallis will tell us what factors (or their interaction) are significantly changing the expression level. This is despite a global U-Test of transcripts turning up nothing. In the example below, we get 776 candidates in the combined analysis, with 71 in the factor interaction term. Many of those 71 are enriched in specific pathways, and do not overlap with the 16 found via parametric tests.
_________________________

Recently, I analyzed an RNASeq experiment redo.  The experiment was redone because the original yielded no useful significant transcript changes across the experimental factor contrast of interest (exercise vs. non-exercise).  The redone experiment did yield some significant contrasts for the exercise factor, but I thought to myself: can't we combine the two experiments and get even better statistical power? The short answer is no if you're using the standard tests like the Wald and Likelihood Ratio tests, because they assume normally distributed replicates.  We shouldn't assume that the transcript expression levels for replicates across the two experiments are normally distributed.

Here's a three-step process to model the two-factor experiment non-parametrically with the Kruskal-Wallis test: 1) check for outliers, 2) run Kruskal-Wallis, 3) run post-hoc pairwise U-tests.

Step 1: Check for outliers

Note that since we are going to be using specific commands to gather the raw abundance data from Kallisto later, it's important that the samples be in alphabetical order in the metadata file.

meta <- read.table("meta_kallisto_combined_sorted.tab", header=TRUE)
meta$path <- as.character(meta$path)

We'll concatenate the two factors' values to yield four distinct factor-level combinations, giving us a base level (ctl_five), the individual factor effects (ctl_ten for time, exp_five for condition) and the interaction effect (exp_ten).

meta$groups <- as.factor(paste(meta$condition, meta$time, sep="_"))
meta
            sample                     path condition time   groups
1    D10_control_2   D10_control_2.kallisto       ctl  ten  ctl_ten
2    D10_control_3   D10_control_3.kallisto       ctl  ten  ctl_ten
3    D10_control_4   D10_control_4.kallisto       ctl  ten  ctl_ten
4    D10_control_5   D10_control_5.kallisto       ctl  ten  ctl_ten
5     D5_control_1    D5_control_1.kallisto       ctl five ctl_five
6     D5_control_2    D5_control_2.kallisto       ctl five ctl_five
7     D5_control_5    D5_control_5.kallisto       ctl five ctl_five
8     D5_control_6    D5_control_6.kallisto       ctl five ctl_five
9   Day10_Control1  Day10_Control1.kallisto       ctl  ten  ctl_ten
10  Day10_Control2  Day10_Control2.kallisto       ctl  ten  ctl_ten
11  Day10_Control3  Day10_Control3.kallisto       ctl  ten  ctl_ten
12 Day10_Exercise1 Day10_Exercise1.kallisto       exp  ten  exp_ten
13 Day10_Exercise2 Day10_Exercise2.kallisto       exp  ten  exp_ten
14 Day10_Exercise3 Day10_Exercise3.kallisto       exp  ten  exp_ten
15   Day5_Control1   Day5_Control1.kallisto       ctl five ctl_five
16   Day5_Control2   Day5_Control2.kallisto       ctl five ctl_five
17   Day5_Control3   Day5_Control3.kallisto       ctl five ctl_five
18  Day5_Exercise1  Day5_Exercise1.kallisto       exp five exp_five
19  Day5_Exercise2  Day5_Exercise2.kallisto       exp five exp_five
20  Day5_Exercise3  Day5_Exercise3.kallisto       exp five exp_five
21     Exercise_12     Exercise_12.kallisto       exp five exp_five
22     Exercise_13     Exercise_13.kallisto       exp  ten  exp_ten
23     Exercise_17     Exercise_17.kallisto       exp  ten  exp_ten
24     Exercise_18     Exercise_18.kallisto       exp  ten  exp_ten
25     Exercise_19     Exercise_19.kallisto       exp  ten  exp_ten
26      Exercise_1      Exercise_1.kallisto       exp five exp_five
27      Exercise_2      Exercise_2.kallisto       exp five exp_five
28      Exercise_7      Exercise_7.kallisto       exp five exp_five

Now let's read in the transcript abundance data gathered by running Kallisto.  The commands to put this file together are described in a previous blog post.

> abundance_full <- read.table("all_abundance.tsv", header=TRUE)

Now let's make a distance matrix for the samples based on the abundance data, in order to look for outliers that may interfere with our analysis; outliers are probably why the first experiment (samples named Day*) yielded no results.
> library("pheatmap")
> library("RColorBrewer")
> library("PoiClaClu")
> poisd <- PoissonDistance(t(abundance_full))
> samplePoisDistMatrix <- as.matrix( poisd$dd )
> rownames(samplePoisDistMatrix) <- colnames(abundance_full)
> colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
> pheatmap(samplePoisDistMatrix,  clustering_distance_rows=poisd$dd,clustering_distance_cols=poisd$dd, col=colors)


What immediately leaps out to me is that the main grouping factor is the experiment, i.e. all the DayXXX samples are grouped together. The second biggest grouping factor within each experiment is the day (5 vs. 10). Also, good contrasts yield a "checkerboard" effect in the distance matrix.  This is much more like a mosaic.

Does this mean that it'll be tough to pull out the exercise factor effect from the combined dataset? This is where a non-parametric test helps out, because it does not assume a normal distribution of the effect amongst "replicate" samples, just that the direction of the effect is the same (I'm glossing over details here, but it's generally true). Let's look at the residual sum of squares (RSS) of the model fit to get a relative measure of how well we are modelling the data with the known factors.

so <- sleuth_prep(meta, ~condition*time)
so <- sleuth_fit(so)
sum(so$fits$full$summary[,2])
[1] 1159405

Getting back to the distance matrix and outlier analysis, do you see the light (distant) streak running through the matrix for samples 9 and 12 (X-axis label)? They are all off on their own (cluster tree on the left margin), and don't relate nicely via the general factor trends. Let's try excluding them from the model and see how it affects our RSS.

meta_no_outliers <- meta[c(1:8,10:11,13:28),]
so_no_outliers <- sleuth_prep(meta_no_outliers, ~condition*time)
so_no_outliers <- sleuth_fit(so_no_outliers)
sum(so_no_outliers$fits$full$summary[,2])
[1] 1052599

So we reduced the RSS by over 9% by excluding 7% of the samples (2 of 28). Generally, excluding a good replicate will reduce the RSS by only up to half its sample weight (i.e. we'd expect an RSS reduction of <5% for these two samples, but we got 9%, so they really are outliers).  Maybe these samples were prepared separately, etc.; we'd have to go back to the lab book to look for an explanation.
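
To make that rule of thumb concrete, a quick back-of-the-envelope using the RSS values above:

> 2/28                      # the two excluded samples are ~7.1% of the experiment
> (2/28)/2                  # so a pair of good replicates should cost at most ~3.6% of the RSS
> 1 - 1052599/1159405       # the observed drop is ~9.2%, well beyond that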

Sample 27 from the second experiment run also looks a bit fishy, all off on its own in the cluster tree.  Let's try dropping it too.

meta_no_outliers_redux <- meta[c(1:8,10:11,13:26,28),]
so_redux <- sleuth_prep(meta_no_outliers_redux, ~condition*time)
so_redux <- sleuth_fit(so_redux)
sum(so_redux$fits$full$summary[,2])
[1] 1048939

The exclusion of Sample 27 (Exercise_2) dropped the RSS by a measly incremental 0.35% even though we dropped 3.8% of the samples (1 of 26). We should keep it because it isn't an outlier contributing to a bad model fit.

While it's fresh in our minds, let's assign an experiment factor to the metadata; as you can see in the bottom right of the matrix, the second experiment comprises samples 10-20 once we eliminate the two big outliers. This will help us reduce the RSS later.

> meta_no_outliers$which_experiment <- as.factor(c(rep.int(0,8),rep.int(1,10),rep(0,8)))
> meta_no_outliers
            sample                     path condition time   groups which_experiment
1    D10_control_2   D10_control_2.kallisto      ctl  ten  ctl_ten                0
2    D10_control_3   D10_control_3.kallisto      ctl  ten  ctl_ten                0
3    D10_control_4   D10_control_4.kallisto      ctl  ten  ctl_ten                0
4    D10_control_5   D10_control_5.kallisto      ctl  ten  ctl_ten                0
5     D5_control_1    D5_control_1.kallisto      ctl five ctl_five                0
6     D5_control_2    D5_control_2.kallisto      ctl five ctl_five                0
7     D5_control_5    D5_control_5.kallisto      ctl five ctl_five                0
8     D5_control_6    D5_control_6.kallisto      ctl five ctl_five                0
10  Day10_Control2  Day10_Control2.kallisto      ctl  ten  ctl_ten                1
11  Day10_Control3  Day10_Control3.kallisto      ctl  ten  ctl_ten                1
13 Day10_Exercise2 Day10_Exercise2.kallisto      exp  ten  exp_ten                1
14 Day10_Exercise3 Day10_Exercise3.kallisto      exp  ten  exp_ten                1
15   Day5_Control1   Day5_Control1.kallisto      ctl five ctl_five                1
16   Day5_Control2   Day5_Control2.kallisto      ctl five ctl_five                1
17   Day5_Control3   Day5_Control3.kallisto      ctl five ctl_five                1
18  Day5_Exercise1  Day5_Exercise1.kallisto      exp five exp_five                1
19  Day5_Exercise2  Day5_Exercise2.kallisto      exp five exp_five                1
20  Day5_Exercise3  Day5_Exercise3.kallisto      exp five exp_five                1
21     Exercise_12     Exercise_12.kallisto      exp five exp_five                0
22     Exercise_13     Exercise_13.kallisto      exp  ten  exp_ten                0
23     Exercise_17     Exercise_17.kallisto      exp  ten  exp_ten                0
24     Exercise_18     Exercise_18.kallisto      exp  ten  exp_ten                0
25     Exercise_19     Exercise_19.kallisto      exp  ten  exp_ten                0
26      Exercise_1      Exercise_1.kallisto      exp five exp_five                0
27      Exercise_2      Exercise_2.kallisto      exp five exp_five                0
28      Exercise_7      Exercise_7.kallisto      exp five exp_five                0


Step 2: Run the non-parametric tests

Let's first try the non-parametric U-test (a.k.a. Wilcox or Mann-Whitney test) to check for difference between the exercise and non-exercise samples.

> abundance <- abundance_full[rowSums(abundance_full)>length(meta_no_outliers$sample),c(1:8,10:11,13:28)]

> wil <- function(x){df <- data.frame(tpm=t(abundance[x,])[,1], fctr=m); w <- wilcox.test(tpm ~ fctr, data=df); w$p.value}  #[ed. m will be populated with the factor values for each sample in the next function]

> calc_wilcox_qvals <- function(combo){m <<- as.factor(combo); pvals <<- c(); for(i in 1:length(abundance[,1])){pvals <<- append(pvals,wil(i))}; return(p.adjust(pvals, method="fdr"))}

> condition_wilcox_qvals <- calc_wilcox_qvals(meta_no_outliers$condition)
There were 50 or more warnings (use warnings() to see the first 50) #[ed. This is due to rank ties in the U-test, you can generally ignore these]

> table(condition_wilcox_qvals < .05)

FALSE 
23953

D'oh! We don't have any transcript differential expression picked up by the U-test that survives the False Discovery Rate (Benjamini-Hochberg in this case) multiple testing correction.  We know from previous experience that we might pick up a two-level contrast once we have at least 15 Kallisto samples.  We can also show that we have the power to do so in this experiment by looking at one of the primary grouping factors in the distance matrix: the time levels of Day5 vs Day10.

> time_wilcox_qvals <- calc_wilcox_qvals(meta_no_outliers$time)
There were 50 or more warnings (use warnings() to see the first 50)

> table(time_wilcox_qvals < .05)

FALSE  TRUE 
21176  2777 

If exercise-vs.-non isn't separating by rank nicely across the two experiments, this might be because at Day 5 the effect of exercise isn't very strong, so those exercise samples mix ranks with the non-exercise samples.  The way to improve our power to detect the Day10 exercise effect is to look at the interaction (synergistic) effect of the day and exercise factors, as per the "groups" metadata factor we created earlier that has four levels. The U-test only works with two factor levels.  The generalization of it to more than two factor levels is the Kruskal-Wallis test, which we'll employ here.

> kru <- function(x){d <- data.frame(tpm=t(abundance[x,])[,1], groups=m); k <- kruskal.test(tpm ~ groups, data=d); k$p.value}


> kru_qvals <- function(combo){m <<- as.factor(combo); pvals <<- c(); for(i in 1:length(abundance[,1])){pvals <<- append(pvals,kru(i))}; return(p.adjust(pvals, method="fdr"))}
> k <- kru_qvals(meta_no_outliers$groups)
> table(k < .05)

FALSE  TRUE 
23193   760 

[Aside: An astute reader will note that we have fewer than 8 samples in each of the four factor levels, yet we are picking out significant contrasts when the simpler U-test is underpowered at <8 samples per factor level.  Because we have four factor levels, there are more unique permutations possible of the ranks than when there are only two, hence the better performance here.  Thinking of it another way, for any fixed string length, the number of unique ways to write 1's and 0's is << the number of ways to write 0, 1, 2, and 3. More unique ranking choices means more information entropy, which means greater statistical power for a given number of samples.]
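
A rough way to see this in R (a back-of-the-envelope count of labellings, not a formal power calculation): compare the number of distinct ways to assign a balanced two-level label to the 26 samples versus the four-level 7/6/7/6 grouping we actually have.

> choose(26, 13)                                                         # ~1.0e7 two-group labellings
> factorial(26)/(factorial(7)*factorial(6)*factorial(7)*factorial(6))    # ~3e13 four-group labellings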

Let's note the Kruskal-Wallis significant transcripts for later.  

> kru_sig_abundance <- abundance[which(k < .05),]

> sig_ks <- k[which(k < .05)]

Let's do a sanity check and see if a random labelling of the samples gives any significant changes (it shouldn't).

k_random <- kru_qvals(sample(meta_no_outliers$groups))
> table(k_random < .05)

FALSE 

23953 

Phew! You could run the random sampling a bunch of times, that's left as an exercise to the reader.


Step 3: Post-hoc significance testing with many factor levels

Now, we know that 760 transcripts have factor levels with significantly different rankings according to our groups label, but which of the groups factor's four levels are the ones that differ significantly?  We need to contrast each pair of factor levels. This page from the R Companion provides a nice summary of ways to do the post-hoc analysis of the differing groups. Let's use the U-Test since it behaves well even without an equal number of items in each category (remember that we removed two samples, and if you don't believe me...


> table(meta_no_outliers$groups)

ctl_five  ctl_ten exp_five  exp_ten 

       7        6        7        6 
).

Not only do we want to use the U-test (a.k.a. Mann-Whitney or Wilcox) for each pair of factor levels and do multiple testing correction (FDR again), but we need to load a couple of libraries that'll give us useful functions to print out the group comparisons later.

> library(multcompView)
> library(rcompanion)


> categorize <- function(x){pt <- pairwise.wilcox.test(as.numeric(x), meta_no_outliers$groups, p.adjust.method="fdr"); pt$p.value[which(is.nan(pt$p.value))] <- 1; multcompLetters(fullPTable(pt$p.value), compare="<", threshold=0.05, Letters=letters, reverse=FALSE)$Letters}
> kru_sig_diff_groups <- apply(kru_sig_abundance, 1, categorize)


This will give us for each of the 760 transcripts a compact representation of the significant grouping information for the 4 factor levels, e.g. 

> kru_sig_diff_groups[,1]
ctl_five ctl_ten exp_five exp_ten
"ab"     "c"     "a"      "bc"

This means that ctl_five is statistically indistinguishable from exp_five (they share 'a') and exp_ten (they share 'b'), ctl_ten is indistinguishable from exp_ten (they share 'c'), exp_five and exp_ten are distinct (they share no letters), etc.  Let's write a function to keep only the U-test results that have exp_ten (our interaction term, index 4) distinct from both non-exercise control time points (indices 1 & 2).


> exercise_only <- function(x){!grepl(x[1],x[4]) && !grepl(x[4],x[1]) && !grepl(x[2],x[4]) && !grepl(x[4],x[2])}
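#[ed. note: grepl does substring matching, so a pair of letter codes like "ab" and "bc" that share only 'b' would still be counted as distinct here; a stricter version would intersect the individual letters of the two codes]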
> eo <- apply(kru_sig_diff_groups, 2, exercise_only)
> table(eo)
eo
FALSE  TRUE 

  689    71 
> kru_exercise_sigids <- colnames(kru_sig_diff_groups[,eo])

Eureka!  We have 71 significant Day10 exercise differentially expressed transcripts across both experiments.  This was done based solely on expression rank of transcripts across samples, so we didn't need the experiments to be true replicates, just to move in the same general direction.  We could write a function to calculate the log2 fold changes, but it'd get complicated as we have multiple factor levels, and it won't include the estimates of technical variability, etc. that Sleuth provides.  Even if the parametric test results from Sleuth aren't going to be significant due to the non-normality of the samples, the fold change information is useful, so let's run it.

> so <- sleuth_prep(meta_no_outliers, ~condition*time+which_experiment+condition:which_experiment+time:which_experiment)
> sum(so$fits$full$summary[,2])
[1] 864793.9

Note that this is a much better RSS than originally (~1160K). We did two things: exclude the outliers (~9% drop in RSS), and we explicitly modelled the experiments and their effect on the other factors (~18% drop in RSS). This will help the Sleuth results for fold-change* be more realistic. Run the Wald and Likelihood Ratio (LRT) tests.

> so <- sleuth_fit(so, ~time+which_experiment+time:which_experiment, "no_condition")
...
> so <- sleuth_fit(so, ~condition+which_experiment+condition:which_experiment, "no_time")
...
> so <- sleuth_fit(so, ~condition+time+which_experiment+condition:which_experiment+time:which_experiment, "no_interaction")
...
> so <- sleuth_lrt(so, 'no_condition', 'full')
> so <- sleuth_lrt(so, 'no_time', 'full')
> so <- sleuth_lrt(so, 'no_interaction', 'full')
> lrt_condition <- sleuth_results(so, 'no_condition:full', test_type = 'lrt')
> lrt_time <- sleuth_results(so, 'no_time:full', test_type = 'lrt')
> lrt_interaction <- sleuth_results(so, 'no_interaction:full', test_type = 'lrt')
> so <- sleuth_wt(so, 'conditionexp')
> so <- sleuth_wt(so, 'timeten')
> so <- sleuth_wt(so, 'conditionexp:timeten')
> so <- sleuth_wt(so, 'conditionexp:which_experiment1')
> so <- sleuth_wt(so, 'timeten:which_experiment1')
> wt_condition <- sleuth_results(so, 'conditionexp')
> wt_time <- sleuth_results(so, 'timeten')
> wt_interaction <- sleuth_results(so, 'conditionexp:timeten')
> wt_condition1 <- sleuth_results(so, 'conditionexp:which_experiment1')
> wt_time1 <- sleuth_results(so, 'timeten:which_experiment1')

Did the parametric tests show up anything significant?

> table(wt_condition$qval < .05)

FALSE  TRUE 
33456    16 
> table(lrt_condition$qval < .05)

FALSE  TRUE 
33466     6 
> table(lrt_interaction$qval < .05)

A few, but not as many as our non-parametric Kruskal-Wallis (71).  The LRT (6) results are a subset of the Wald results (16).  Let's put the non-parametric and parametric results together for reporting.
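
(One housekeeping detail: wt_condition_sigids below isn't defined anywhere above; presumably it's the Wald-significant target IDs, gathered something like this:)

> wt_condition_sigids <- wt_condition$target_id[which(wt_condition$qval < .05)]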

> all_sigids <- unique(c(wt_condition_sigids,kru_exercise_sigids))
> length(all_sigids)
[1] 87

87 = 71 + 16, so there is no overlap in the two tests' results.  Let's define a function that provides the post-hoc pairwise U-test results when available, or NAs when not available, for reporting purposes.

> kru_data <- function(id){if(id %in% colnames(kru_sig_diff_groups)){c(kru_sig_diff_groups[,which(colnames(kru_sig_diff_groups) == id)],sig_ks[which(colnames(kru_sig_diff_groups)==id)])}else{rep(NA,5)}}

Gather the Wald test results (which includes the fold-change*) and the pairwise U-test results for each of the 87 significantly differentially expressed transcripts, then transpose the rows and columns, then write to a file.

> dl <- vapply(all_sigids, function(id){unlist(c(wt_condition[wt_condition$target_id == id,],wt_condition1[wt_condition1$target_id == id,],wt_time[wt_time$target_id == id,],kru_data(id)))}, FUN.VALUE=rep("1", 38))

> write.csv(t(dl), "kruskal_wallis_q_lt_0.05_u_test_fdr_0.05.csv")

I'll load this into Excel and remove extraneous Wald test result columns that just confuse most people reading the analysis, and convert the beta values to log2 fold-changes* with an Excel formula for users'  didactic purposes.**



If you look at the exp_five column, you'll see that it is often indistinguishable from (shares a letter with) one or both of the controls. This explains why the two-factor level U-test failed, and why using the Kruskal-Wallis test was key to seeing the interaction effect.  As per usual, take any individual transcript with a grain of salt, but look for pathway enrichment in IPA, DAVID, etc.


________
* Not really fold-change, but the natural logarithm of the bias estimator.  Read the Sleuth manual for details. It's worth it.

** A clever reader will note that you can convert between log bases by multiplying by a constant. Once again, I used this formula for didactic purposes, making the base conversion more explicit.

March 1, 2017

This is not normal: dealing with outliers, correlated factors, and skewed factor effects in RNASeq (a Kallisto/Sleuth/DESeq2/U-Test/skewness mash-up)



tl;dr There are RNASeq experiments where, despite reasonable sample selection, no significant changes will be found. Especially in patient-centered studies, sometimes you just can't get true "replicates".

In three stages we can correct for outliers, spurious correlations, and a quantitative factor like patient age that has a skewed (as opposed to Gaussian) dosage effect on transcripts. This process gets us from 2 significantly regulated transcripts to 285 (q-value < 0.05). This involves using Kallisto, Sleuth, DESeq2, and some home brewed code around R's built-in Wilcox-Mann-Whitney U Test and skewness measure.
__________________________

Let's start with an experiment designed with a reasonable balance of males and females in each of three age categories, under 40, over 60, and in between.  I noticed on the sample submission form that the input RNA concentrations were unusually divergent, so I've included those values in case they were informative (might affect PCR duplicate levels).

library(sleuth)
meta <- read.table("meta.tab", header=TRUE)
meta$path <- as.character(meta$path)

meta$age_group <- as.factor(meta$age_group)
meta
      sample               path sex age_group age conc
1  57_50Y_P2 57_50Y_P2.kallisto   M         1  50   55
2  63_75Y_P2 63_75Y_P2.kallisto   M         2  75   99
3    SATP100   SATP100.kallisto   M         0  27  203
4    SATP105   SATP105.kallisto   F         0  35  136
5    SATP116   SATP116.kallisto   M         1  46  131
6    SATP119   SATP119.kallisto   M         2  74  511
7    SATP120   SATP120.kallisto   M         1  41  466
8    SATP125   SATP125.kallisto   F         0  30  697
9    SATP137   SATP137.kallisto   M         0  19  183
10   SATP140   SATP140.kallisto   F         0  39  264
11   SATP154   SATP154.kallisto   F         2  61  256
12    SATP68    SATP68.kallisto   M         0  37  303
13    SATP75    SATP75.kallisto   M         2  65  112
14    SATP82    SATP82.kallisto   F         2  78   21
15    SATP91    SATP91.kallisto   F         1  57  736
16    SATP92    SATP92.kallisto   F         1  53  136
17    SATP95    SATP95.kallisto   F         1  52   53
18    SATP98    SATP98.kallisto   F         2  73   83
> so <- sleuth_prep(meta, ~sex*age_group+age+sex:age+conc)
...
> so <- sleuth_fit(so)
...
> sum(so$fits$full$summary[,2])
[1] 347228.1

The last line gives us the residual sum of squares of the model fit.  The lower the better. Let's continue on as per half way down a previous blog post, calculating the Likelihood Ratio Test value for each factor in each transcript.


so <- sleuth_fit(so, ~sex+conc, "no_age")
so <- sleuth_fit(so, ~age_group+age+conc, "no_sex")
so <- sleuth_fit(so, ~sex+age_group+age+conc, "no_int")
so <- sleuth_fit(so, ~sex*age_group+sex:age+age+conc, "no_conc")
so <- sleuth_lrt(so, 'no_sex', 'full')

so <- sleuth_lrt(so, 'no_age', 'full')
so <- sleuth_lrt(so, 'no_int', 'full')
so <- sleuth_lrt(so, 'no_conc', 'full')
lrt_sex <- sleuth_results(so, 'no_sex:full', test_type = 'lrt')
lrt_age <- sleuth_results(so, 'no_age:full', test_type = 'lrt')
lrt_int <- sleuth_results(so, 'no_int:full', test_type = 'lrt')
lrt_conc <- sleuth_results(so, 'no_conc:full', test_type = 'lrt')
lrt_age.sig_ids <- lrt_age$target_id[which(lrt_age$qval < 0.05)]
lrt_sex.sig_ids <- lrt_sex$target_id[which(lrt_sex$qval < 0.05)]
lrt_int.sig_ids <- lrt_int$target_id[which(lrt_int$qval < 0.05)]
lrt_conc.sig_ids <- lrt_conc$target_id[which(lrt_conc$qval < 0.05)]

If we inspect the *.sig_ids variables, we see only two genes or sometimes none.  And it's the same two genes popping up in the various lists.  Not good.  There are three main potential culprits: outlier samples, spurious factor correlation, or non-normal distributions (the LRT test assumes normally distributed values in each factor level).  The first and second can happen in any experiment, and the third can especially happen with a factor whose effect is as non-linear as the age of the study participants from which the cells are derived, and our selection of 40 and 60 to create the age_group levels is somewhat arbitrary.  

Let's start by identifying outliers.

Stage 1: Identifying outlier samples

First, we need to collate the normalized TPM transcript abundance data as per my previous blog post. You could probably do it with the so info you've already got from the above analysis, but it's trickier I find. Let's load all the transcripts with expression values using DESeq2.

abundance_full <- read.table("all_abundance.tsv", header=TRUE, row.names=1)
abundance_full_integer <- apply(abundance_full, c(1,2), function(x){as.integer(x)})
library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData=abundance_full_integer, colData=data.frame(meta), design=~sex*age_group+age+sex:age+conc)
dds_redux <- dds[rowSums(counts(dds)) > 1,]

Normalize the count data so that the variance is similar across the range of expression values (necessary for a good Poisson distance measure). This is a nice, easy to use feature of DESeq2. Then visualize the distance matrix.

rld <- rlog(dds_redux, blind=FALSE)
library("pheatmap")
library("RColorBrewer")
library("PoiClaClu")
poisd <- PoissonDistance(t(abundance_full))
samplePoisDistMatrix <- as.matrix( poisd$dd )
rownames(samplePoisDistMatrix) <- paste(meta$sex, meta$age, sep=":")
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(samplePoisDistMatrix, clustering_distance_rows=poisd$dd, 
clustering_distance_cols=poisd$dd, col=colors)


Let's also check that closely related samples (such as 8 & 12 in the lower right corner) show fairly even variance across the range of expression values; otherwise our nice matrix isn't very meaningful.

plot(assay(rld)[,c(8,12)], xlim=c(0,16), ylim=c(0,16), pch=16, cex=0.3)




Looks good, i.e. it's fairly evenly thick above zero, and there's no feathering in the lower left. Using just a log 2 transform on the Kallisto TPM data yields a different tree, and the scatter plot above is a lot fatter at the bottom, so it was worth using DESeq2's normalization.

Back to the outliers. In the matrix, you'll see that there are three samples, with indices 4, 5, and 14 along the bottom that are strong outliers.  Not only are they strong outliers, but they are closely related to each other despite not sharing either a sex or age factor level.  This gives us a strong suspicion that there's something wrong here like cross-contamination, prep batch effect, etc. but it's not obvious what.  Let's exclude these samples and see how much the model of known factors improves.

> meta_no_outliers <- meta[c(1:3,6:13,15:18),]
> so_no_outliers <- sleuth_prep(meta_no_outliers, ~sex*age_group+age+sex:age+conc)
...
> so_no_outliers <- sleuth_fit(so_no_outliers)
...
> sum(so_no_outliers$fits$full$summary[,2])
[1] 193932.1

The 194K residual sum of squares (RSS) is a lot better than the 347K we had in the original analysis with all 18 samples! If we work through the original analysis with just the 15, we get...

> length(lrt_age.sig_ids)

[1] 8
> length(lrt_sex.sig_ids)
[1] 8
> length(lrt_int.sig_ids)
[1] 4
> length(lrt_conc.sig_ids)
[1] 71

D'oh! Looks like the concentration of input RNA is explaining most of the variance. Collectively this is 84 genes, which we will write out to a file, even if the concentration effect is suspect.

Stage 2: Spurious factor correlations

Getting back to our potential sources of failure in the analysis, let's check for spurious correlation between the concentration and the age factor (the main factor of interest). Plot the data points and a trend line.

> plot(data.frame(meta$age,meta$conc))
> abline(lm(meta$conc ~ meta$age))

Ouch, the concentration as a dosage factor is probably absorbing the age dosage effect. The same plot for the outlier-less metadata set eliminates the lower right dot, leaving us with two really low concentrations that are centered in the age distribution (50 & 52). That's lucky for us, because it means that we can change the concentration numeric factor into a binary one to capture just those two points without disturbing the balance of the age factor (e.g. if we'd set the low threshold to 100 it would capture mostly older samples and could still confound the age factor modelling). It's the best we can do with the cards we've been dealt.

> plot(data.frame(meta$age,meta$conc))
> meta$conc_low <- meta$conc <= 55
> meta$conc_low
 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE  TRUE FALSE FALSE  TRUE FALSE

In case you're wondering how just excluding the outlier samples and the concentration from the model works out, you get just one significant change, only in age_group1 (between 40yo and 60yo), and it's a transcript that showed up repeatedly in the previous analyses. You'll see shortly how the binary concentration modelling works out a lot better.

Stage 3: Rationally defining a factor level threshold for a skewed factor effect (numeric factor -> two-level factor)

Let's keep only the transcripts with a TPM of at least 1 on average (meta_clean below is the outlier-free metadata from Stage 1, i.e. meta minus samples 4, 5 and 14, with the conc_low column carried along).

> abundance <- abundance_full[rowSums(abundance_full)>length(meta_clean$age),]

Let's define all the combinations of young/old possible by setting the young/old threshold to each observed value in turn.

> oldness_treshold_combos <- t(sapply(meta_clean$age, function(x){meta_clean$age > x}))

Let's define a function to run the Wilcox-Mann-Whitney U Test, as I have in a previous post. As an aside, I also tried using the Kruskal-Wallis H test which extends the U Test notion to three or more factor levels (using the 0/1/2 age_group factor from earlier), but it's underpowered with this number of samples. As you'll see below, we'll have enough trouble with Wilcox already.  We'll stick the distribution of old/young across samples in a global variable called "m" later.

> wil <- function(x){d <- data.frame(tpm=t(abundance[x,])[,1], old=m); w <- wilcox.test(tpm ~ old, data=d); w$p.value}

Now define the function that assigns a young/old combination to "m", then runs the Wilcox test and finally applies the FDR (Benjamini-Hochberg) p-value multiple testing correction to give us "q-values". The initial check for the sum combo below is a special condition for the oldest sample, which gives no age contrast if used as a threshold and buggers up the whole process. 

perm_qvals <- function(combo){if(!sum(combo)){return(rep_len(c(NA), length(abundance[,1])))}; m <<- as.factor(combo); pvals <<- c(); for(i in 1:length(abundance[,1])){pvals <<- append(pvals,wil(i))}; return(p.adjust(pvals, method="fdr"))}

Run the qval test on each combination we calculated earlier.

> combo_results <- apply(oldness_treshold_combos, 1, perm_qvals)
There were 50 or more warnings (use warnings() to see the first 50) [you can ignore these, it's complaining about ties in the rankings for the Wilcox test for certain transcripts, in which case it can't calculate an exact p-value]

> apply(combo_results, 2, function(x){sum(as.integer(x<.05))})
 [1] 285  NA   0   0   0   0   0   0   0   0   0   0   0   0   0

> res <- combo_results[,1]
> names(res) <- rownames(abundance)

It may seem strange that only the first threshold (which is old = age > 50 if you look back at the order of samples in meta.tab) has significant changes, but you have to keep in mind, as we determined in a previous post, that you need at least eight samples in each factor level to make the U Test p-value survive FDR multiple testing correction in Kallisto-based RNASeq where you'll have 18,000+ tests (i.e. expressed mRNA isoforms in a tissue). Only an 8/7 split comes close to satisfying that when we have 15 samples, like we do here.
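
A quick way to convince yourself of the eight-per-level rule of thumb (a back-of-the-envelope, assuming no ties): the smallest two-sided exact p-value a Mann-Whitney test can ever produce is 2/choose(n1+n2, n1), and that floor has to survive FDR correction over roughly 20,000 transcripts.

> min_p <- function(n1, n2){2/choose(n1 + n2, n1)}
> min_p(7, 8)    # ~3.1e-4, the best a 15-sample 7/8 split can ever do
> min_p(4, 11)   # ~1.5e-3, a lopsided split is even worse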

Let's try the analysis again, but see what happens if we don't exclude the outliers, so we have 18 samples, which will give us multiple 8+ split options (8/10, 9/9,10/8):

> abundance <- abundance_full[rowSums(abundance_full)>length(meta$age),]
> rownames(abundance) <- rownames(abundance_full[rowSums(abundance_full)>length(meta$age),])
> oldness_treshold_combos <- t(sapply(meta$age, function(x){meta$age > x}))
> combo_results_full <- apply(oldness_treshold_combos, 1, perm_qvals)
apply(combo_results_full, 2, function(x){sum(as.integer(x<.05))})
 [1]  0  0  0  0  0  0  0  0  0  0  0  0  0 NA  0  0  0  0

It would seem in this case that the benefit of added samples for the U Test is outweighed by the outlying nature of the expression values in those 3 extra samples, as we get no significant changes. The three samples are not just overall outliers, but also specifically in their response to the experimental factor of interest (age). That is to say, they disrupt the rank of genes grouped as young/old, not just the replicate normality assumption of the Wald and LRT tests on the TPM values. Had we gotten good results in the combo_results_full vector, we might have included the outliers and proceeded with caution in comparing the betas from Sleuth with either 15 or 18 samples.

Let's note the significant IDs from the 15 sample test for later when we want to print out results to a file. I'm doing this in two steps, for clarity.


> sig_indices <- res < .05
> ids <- sort(rownames(abundance)[sig_indices])

Let's run with the >50 cutoff in the 15 samples and redo the Sleuth analysis with both the binary age factor and the quantitative age factor.  The beta values are more useful than the raw fold-changes I could calculate directly from the abundance matrix variable because 1) the effect of other factors are subtracted from the data, and 2) especially for low abundance transcripts, beta more accurately reflects the effect of the age factor since it properly accounts for measurement error through the bootstrap analysis Sleuth does.


> meta_clean$old <- meta_clean$age > 50
> meta_clean
      sample               path sex age_group age conc conc_low   old
1  57_50Y_P2 57_50Y_P2.kallisto   M         1  50   55     TRUE FALSE
2  63_75Y_P2 63_75Y_P2.kallisto   M         2  75   99    FALSE  TRUE
3    SATP100   SATP100.kallisto   M         0  27  203    FALSE FALSE
6    SATP119   SATP119.kallisto   M         2  74  511    FALSE  TRUE
7    SATP120   SATP120.kallisto   M         1  41  466    FALSE FALSE
8    SATP125   SATP125.kallisto   F         0  30  697    FALSE FALSE
9    SATP137   SATP137.kallisto   M         0  19  183    FALSE FALSE
10   SATP140   SATP140.kallisto   F         0  39  264    FALSE FALSE
11   SATP154   SATP154.kallisto   F         2  61  256    FALSE  TRUE
12    SATP68    SATP68.kallisto   M         0  37  303    FALSE FALSE
13    SATP75    SATP75.kallisto   M         2  65  112    FALSE  TRUE
15    SATP91    SATP91.kallisto   F         1  57  736    FALSE  TRUE
16    SATP92    SATP92.kallisto   F         1  53  136    FALSE  TRUE
17    SATP95    SATP95.kallisto   F         1  52   53     TRUE  TRUE
18    SATP98    SATP98.kallisto   F         2  73   83    FALSE  TRUE

Luckily for us, the age threshold of >50 puts one low concentration sample in the young category and one in the old, so both factors can be included in the final model.

> so_final <- sleuth_prep(meta_clean, ~sex*old+age:old+conc_low)
> so_final <- sleuth_fit(so_final)
> sum(so_final$fits$full$summary[,2])
[1] 266102.6

By the way, did our spurious correlation removal from Stage 2 help?  Let's see what the RSS is without it in the model.

> so_final <- sleuth_fit(so_final, ~sex*old+age:old, "no_conc")
> sum(so_final$fits$no_conc$summary[,2])
[1] 300209.5

Nice, it reduces the noise by over 10%.  It's not as good an RSS as we had with the quantitative concentration factor, but it's much less biased. Let's go on with the inclusive model and calculate the betas/fold-changes with a Wald test.

> so_final <- sleuth_wt(so_final, "sexM")
> so_final <- sleuth_wt(so_final, "oldTRUE")
> so_final <- sleuth_wt(so_final, "sexM:oldTRUE")
> so_final <- sleuth_wt(so_final, "oldTRUE:age")
> so_final <- sleuth_wt(so_final, "oldFALSE:age")
> so_final <- sleuth_wt(so_final, "conc_lowTRUE")
> wt_sex <- sleuth_results(so_final, 'sexM')
> wt_old <- sleuth_results(so_final, 'oldTRUE')
> wt_int <- sleuth_results(so_final, 'sexM:oldTRUE')
> wt_old_age <- sleuth_results(so_final, 'oldTRUE:age')
> wt_young_age <- sleuth_results(so_final, 'oldFALSE:age')
> wt_conc_low <- sleuth_results(so_final, 'conc_lowTRUE')
> table(wt_sex[,"qval"] <.05)

FALSE  TRUE 
28142     2 
> table(wt_old[,"qval"] <.05)

FALSE  TRUE 
28142     2 
> table(wt_int[,"qval"] <.05)

FALSE  TRUE 
28142     2 
> table(wt_conc_low[,"qval"] <.05)

FALSE  TRUE 
28133    11 
> table(wt_old_age[,"qval"] <.05)

FALSE  TRUE 
28142     2 
> table(wt_young_age[,"qval"] <.05)

FALSE  TRUE 
28143     1 

So we have very few significant q values, but that's to be expected, because the Wald test assumes you have a normal distribution and an accurate assessment of variance.  We have neither, but we want the beta values that the test generates.  These will still be our most accurate assessment of factor driven fold-change in transcript abundance.

wt_shared_sex <- wt_sex[wt_sex$target_id %in% ids,]
wt_shared_old <- wt_old[wt_old$target_id %in% ids,]
wt_shared_int <- wt_int[wt_int$target_id %in% ids,]
wt_shared_old_age <- wt_old_age[wt_old_age$target_id %in% ids,]
wt_shared_young_age <- wt_young_age[wt_young_age$target_id %in% ids,]
wt_shared_conc_low <- wt_conc_low[wt_conc_low$target_id %in% ids,]
wt_shared_sex <- wt_shared_sex[sort.list(wt_shared_sex[,1]),]
wt_shared_old <- wt_shared_old[sort.list(wt_shared_old[,1]),]
wt_shared_int <- wt_shared_int[sort.list(wt_shared_int[,1]),]
wt_shared_young_age <- wt_shared_young_age[sort.list(wt_shared_young_age[,1]),]
wt_shared_old_age <- wt_shared_old_age[sort.list(wt_shared_old_age[,1]),]
wt_shared_conc_low <- wt_shared_conc_low[sort.list(wt_shared_conc_low[,1]),]

sig_abundance <- sig_abundance[sort.list(rownames(sig_abundance)),]

If the sample values going into the fold-changes for the age factor level (old/young) are not normally distributed, is there something useful we can say about their distribution? One summary statistic we can grasp fairly easily is skewness, which when negative indicates that the mean is less than the median, and conversely when positive indicates the mean is greater than the median (see the illustration in the Wikipedia entry on skewness).

In our data, for example, a negative skew would mean that the effect of the fold-change is more intense in the older people within the old group. A positive skew would mean that the FC values for the old-designated samples pile up closer to the old cutoff (>50yo), so increased age isn't as big a factor. Let's see if the skew of the abundance of the 285 significant transcripts is quantitatively different from the general transcript abundance.
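
As a quick toy illustration of the sign convention, using the same moments package we load for the real data just below:

> library(moments)
> set.seed(1)
> skewness(rexp(1000))    # long right tail: positive skew, mean > median
> skewness(-rexp(1000))   # long left tail: negative skew, mean < median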

> library(moments)
> sig_abundance <- abundance[sig_indices,]
> sk <- apply(sig_abundance, 1, function(x){skewness(as.vector(x[meta_clean$old]))})
> summary(sk)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-1.72200 -0.29110  0.09093  0.07834  0.44870  1.50800 
> sk0 <- apply(abundance, 1, function(x){skewness(as.vector(x[meta_clean$old]))})
> summary(sk0)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-2.0620 -0.1577  0.2699  0.3049  0.7294  2.2680      62 

It looks like the skewness of the 285 transcripts is shifted to the left (more negative) relative to the whole set of transcripts. Let's check more thoroughly.  First, are both distributions (of skewness) normal? Run the Shapiro-Wilk test to find out.

> shapiro.test(sk)
...
W = 0.9934, p-value = 0.2449

> shapiro.test(sample(sk0, length(ids)))
...
W = 0.99399, p-value = 0.3266

Yes.  Do they have the same variance? Run an F-test, or you could try Levene's test, Bartlett's test, etc.; they all reject equal variances for this dataset.

> var.test(sk, sk0)

F test to compare two variances

data:  sk and sk0
F = 0.699, num df = 284, denom df = 19418, p-value = 7.016e-05
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.5961302 0.8307902
sample estimates:
ratio of variances 
         0.6990001 

Which makes running a t-test moot, since it assumes equality of variance in each sample group. What we can say easily is that the skewness is further left and tighter (roughly 30% less variance) in the significantly changed transcripts than in the transcript set as a whole. That means many transcripts are affected in the really old a lot more than the moderately old, and luckily we don't have to pick a specific threshold (like 60 in the original analysis) to catch that phenomenon, nor did the effect have to be normally distributed in our samples.

> sum(sk)
[1] 22.32632
> sum(sample(sk0,length(sk)), na.rm=TRUE)
[1] 78.9279

> mean(sk)
[1] 0.07833795
> mean(sk0, na.rm=TRUE)
[1] 0.3049191

For all the 285 U-test passing transcripts let's write out all the factor contrasts we've calculated so far, and the old samples skew stat adjusted for sk0 (to help account for the effect of the specific distribution of sample ages we have in the analysis).

sk_adjusted <- sk - (mean(sk0, na.rm=TRUE)-mean(sk))

> d <- data.frame(ids,avg_tpm=rowMeans(sig_abundance[ids,]),u_test_qval=res[ids],skew=sk_adjusted,wt_shared_sex[,"b"],wt_shared_sex[,"qval"],wt_shared_old[,"b"],wt_shared_old[,"qval"],wt_shared_int[,"qval"],wt_shared_int[,"b"],wt_shared_young_age[,"qval"],wt_shared_young_age[,"b"],wt_shared_old_age[,"qval"],wt_shared_old_age[,"b"])


> write.csv(d, "u_test_no_outliers_q_lt_0.05.csv")

Pondering the merits and validity of including a kurtosis measure is left as an exercise for the reader.

So, that's one way to analyze RNASeq when the expected expression values aren't necessarily normally distributed and don't have a reasonable estimate of variance. Now you have two gene lists to play with: those that appear normally distributed amongst the predefined age groups (84, mostly attributable to input RNA concentration conflated with age), and those that show a less-than/greater-than-50-years-old dichotomy (285). Some transcripts on the former list are also in the latter list. While the beta (~fold change) values in the latter aren't super meaningful, taken together with the skewness value for the transcript you can focus on transcript changes that are skewed to the really old, or more middle aged, etc.

P.S.: significant changes were corroborated with consistency scores in pathway analysis (IPA) from both the 84 and 285 item lists, albeit different pathways, therefore it's probably worth following up on both.