tl;dr Before going into a full analysis, you can quickly check whether paired samples in an RNASeq experiment involving human subjects are properly paired by checking that their GenomeIDs roughly match. The quickest way to generate a GenomeID when starting from FASTQ files is to run Kallisto with a specific, small set of transcripts as the database, then pipe the pseudoaligned reads to BWA. On a multicore server you can process a billion reads in less than ten minutes this way. I used this to identify a sample with an inconsistent GenomeID that you would never otherwise suspect was problematic based on its overall expression profile, but excluding it nearly doubled (10 -> 17) the number of differentially expressed genes found for the factor of interest.
This Kallisto-as-BWA-filter procedure can be used on any human FASTQ files to help uniquely identify the sample origin. Although GenomeID was originally designed for exomes and genomes, it works well enough on RNASeq data to be useful.
__________________
For a few years, I've been using a scheme I made up called GenomeID. This is a handy 12-character ID that can be calculated from the zygosity of highly polymorphic SNPs in the coding regions of 58 genes spread throughout the human genome. Because of the details of the scheme, even close relatives have completely unrelated GenomeIDs, and the odds of ID collisions are tiny. This lets me check, in a fraction of a second, the intrinsic ID of a genome or exome BAM or VCF file for tracking purposes, regardless of the labels different sources may give it (e.g. the genome in a bottle is NA12878, CEPH1463.02, or 623, but always has GenomeID ABL0N0yhmh9B). Feel free to use the code, port it to Python, etc.
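To give a flavour of the idea, here is an illustrative sketch only, not the actual GenomeID code (that lives in the GitHub repo); the marker BED file name and the packing scheme below are made up. The gist: encode the zygosity call at each marker SNP as a small number and pack a few markers per printable character, so a dropped-out marker only perturbs a letter or two of the ID.
# Hypothetical sketch: zygosity per marker SNP, packed 3 markers per character.
ALPHABET=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
bcftools query -f '[%GT]\n' -R genomeid_markers.bed sample.vcf.gz |
awk -v a="$ALPHABET" '
    { z = ($0 == "0/0") ? 0 : ($0 == "1/1") ? 2 : 1 }          # hom ref / hom alt / het (or missing)
    { v = v * 3 + z; n++ }
    n == 3 { printf "%s", substr(a, v + 1, 1); v = 0; n = 0 }   # pack 3 markers into 1 character
    END { if (n) printf "%s", substr(a, v + 1, 1); printf "\n" }
'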
I never intended it to be used for identifying transcript datasets, but luckily for me, 14 of the 58 genes involved in generating the ID are ubiquitously expressed (see the PDF in GitHub for details and references). For any given cell type/condition we'll get some consistent dropout of markers, so GenomeIDs derived from the same experiment will look more similar to each other than genome- or exome-derived IDs would. There is still enough information entropy in those 14+ genes, though, to distinguish between samples at the scale of an RNASeq experiment. I can use this to check, to a first approximation, that multiple samples in the experiment that are supposed to have come from the same human actually do.
Before I can run my GenomeID script, I need a VCF or BAM file as input. That's fine: I'll launch a parallel BWA MEM process for each of the 24 samples (~50 million reads each), and I have a 32-core server, so it should be quick, right? Unfortunately, after getting used to Kallisto's pseudomapping speed for RNASeq, waiting for BWA to grind through all 1.2 billion reads feels interminable.
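The brute-force plan, roughly sketched (sample directory layout and thread count here are hypothetical):
# One background BWA MEM job per sample, aligning every read, then coordinate sorting.
for d in sample*/; do
    bwa mem -t 8 hg19.fa <(gzip -cd "$d"/*.fastq.gz) |
        samtools sort -O bam -o "${d%/}.full.bam" &
done
wait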
Who has hours to spare? They say impatience is one of the chief virtues of a programmer. I thought to myself, why don't I combine the best of both worlds and use Kallisto to quickly identify the subset of reads pseudomapping to the genes involved in the GenomeID, then only map those reads to the genome using BWA MEM? So, I quickly generated a subset of the RefSeq transcripts (150 of the 52606 I typically use) corresponding to the 58 GenomeID genes, indexed it as hg19_genomeid_bloom, then ran a test of the combined strategy:
bash-4.2$ time kallisto quant -i hg19_genomeid_bloom --pseudobam <(gzip -cd sampleA1/sampleA1_L001_R1_001.fastq.gz) --single -l 180 -s 20 -o testbloom | perl -ane 'print "\@$F[0]\n$F[9]\n+\n$F[10]\n"' | bwa mem hg19.fa - | samtools sort -O bam > testbloom.bam
[jumble of bwa and kallisto progress messages...]
[quant] processed 4,131,919 reads, 24,897 reads pseudoaligned
[more jumble...]
real 1m00.809s
user 0m59.875s
sys 0m5.965s
The intervening perl one-liner converts the SAM output from Kallisto into FASTQ for BWA. Notice the quant message: we only passed 0.6% of the reads on to BWA for processing, saving us oodles of CPU time. This follows in the vein of a Bloom filter, a quick hashing structure that is guaranteed to report every true member of a set but may also report some false positives. A Bloom filter saves a lot of effort when the retrieval operation is expensive compared to the hashing operation, and we approximate* that here: Kallisto pseudomapping is much cheaper per read than a real, full alignment in BWA. The few false positives Kallisto lets through will be mapped elsewhere in the genome by BWA, mitigating their potential effect on the GenomeID result.
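For completeness, building that small index might look something like this (file names here are placeholders; adjust to however you store your RefSeq transcripts):
# Pull the ~150 RefSeq transcripts covering the 58 GenomeID genes out of the full
# transcript FASTA, then index only those with kallisto.
samtools faidx refseq_hg19_transcripts.fa $(cat genomeid_refseq_accessions.txt) > hg19_genomeid_transcripts.fa
kallisto index -i hg19_genomeid_bloom hg19_genomeid_transcripts.fa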
Given an Illumina sequencing experiment output, run bcl2fastq, then:
bash-4.2$ cd Data/Intensities/BaseCalls/myExperiment
bash-4.2$ for d in *; do kallisto quant -i hg19_genomeid_bloom --pseudobam <(gzip -cd $d/*.gz) --single -l 180 -s 20 -o $d.bloom | perl -ane 'print "\@$F[0]\n$F[9]\n+\n$F[10]\n"' | bwa mem hg19.fa - | samtools sort -O bam > $d.genomeid.bam & done
bash-4.2$ for b in *.genomeid.bam; do samtools index $b & done
That took less than 10 minutes on a 32-core server, and it never broke the 50% utilization mark. It's Kallisto-heavy in the first few minutes, then shifts towards BWA at the end, since BWA can't quite keep up even with our 0.6% passthrough rate. Now we can generate the GenomeIDs with a script from the repository:
bash-4.2$ generate_genome_id *.genomeid.bam > genomeids
Append it to the metadata file we'll use for Kallisto, with factor columns removed for readability here:
bash-4.2$ paste meta.tab genomeids
sample path GenomeID
N1d_003_screen N1d_003_screen.kallisto /CBCJYBAKABB
N1e_003_V8 N1e_003_V8.kallisto /CBCJABAKABB
N2d_010_screen N2d_010_screen.kallisto /CQGBZIBCNBB
N2e_010_V8 N2e_010_V8.kallisto /CQGBYIBCdBB
N3d_018_screen N3d_018_screen.kallisto /CAaBoJAKIBB
N3e_018_V8 N3e_018_V8.kallisto /CASBoJAKJBB
N4d_026_screen N4d_026_screen.kallisto /CQGdwJRKIBB
N4e_026_V_V8 N4e_026_V_V8.kallisto /CQOZwJRKYBB
N5d_035_screen N5d_035_screen.kallisto /CACJRIBCIBB
N5e_035_V8 N5e_035_V8.kallisto /CASZRIBCIBB
N6d_043_screen N6d_043_screen.kallisto /CAeB4JRKYBB
N6e_043_V8 N6e_043_V8.kallisto /CAeB5JRKIBB
S1d_020_screen S1d_020_screen.kallisto /CQGJgIAKNBB
S1e_020_V8 S1e_020_V8.kallisto /CQGJoIAKNBB
S2d_049_screen S2d_049_screen.kallisto /CRSJYIACIBB
S2e_049_V8 S2e_049_V8.kallisto /CRSJYIACIBB
S3d_050_screen S3d_050_screen.kallisto /AQGJQJSCJBB
S3e_050_V8 S3e_050_V8.kallisto /AAOJQJACIBB
S4d_058_screen S4d_058_screen.kallisto /CQaBYJAKIBB
S4e_058_V8 S4e_058_V8.kallisto /CQSBZJAKKBB
S5d_061_screen S5d_061_screen.kallisto /CZaNgJAKEBB
S5e_061_V8 S5e_061_V8.kallisto /CRCJhJAKcBB
S6d_065_screen S6d_065_screen.kallisto /CRKRQJBLABB
S6e_065_V8 S6e_065_V8.kallisto /CRKVQJBKIBB
N# and S# are subject identifiers, and d/e are pre- and post-experiment samples, so with the rows sorted alphabetically we expect adjacent pairs (rows 1 and 2, rows 3 and 4, etc.) to match. For each pair the GenomeIDs differ by only a letter or two, except for subject S5 (the S5d/S5e rows above). I use this information in two ways: first, after I generate the Poisson distance matrix (dark = close) for the samples, I'll look to see whether S5e and S5d are close to each other.
Yes, they are right next to each other in the cluster tree of the distance matrix, so the difference in their IDs is unlikely to be explained by differences in which genes were observable for generating the ID in each sample.^ I might exclude this sample later if the differential expression stats are all wonky. As it turns out, excluding this sample nearly doubles (10 -> 17) the number of significantly differentially expressed genes for the factor of interest. Looking at the distance matrix alone, you would never suspect that this sample was problematic.
Second, N5d is quite the outlier in the matrix... is it mislabelled, maybe from another patient? No: despite its markedly outlying expression profile, the GenomeID of the sample is nearly identical to N5e's. If something did go wrong with this sample, it's nearly impossible for us to distinguish that from plain old biological variability in the treatment response amongst humans. I'm unlikely to exclude it from the analysis until I've tried a bunch of other remedies like non-parametric tests.
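At this scale eyeballing the table is fine, but the pair check is easy to script. A rough sketch, assuming the paste output above (header line, paired samples on adjacent rows, GenomeID in the last column):
# Count how many GenomeID letters differ within each pre/post pair.
paste meta.tab genomeids | tail -n +2 | awk '
NR % 2 == 1 { prev_id = $NF; prev_sample = $1; next }   # remember the "d" sample of the pair
{
    diff = 0
    for (i = 1; i <= length($NF); i++)
        if (substr($NF, i, 1) != substr(prev_id, i, 1)) diff++
    print prev_sample, $1, diff " letter(s) differ"
}'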
Is this a definitive way to check for mistaken sample labelling? No. But it's surely better than blindly trusting that the inputs to the analysis are correctly labelled, and it's worth a 10-minute effort that could save you major headaches down the road. Don't forget to do a sex check too.
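A minimal sex-check sketch from the full kallisto quantifications (the grep pattern assumes your transcript IDs include gene symbols; if they are plain RefSeq accessions, substitute the XIST and RPS4Y1 accessions instead):
# Report XIST (female-expressed) and RPS4Y1 (Y-linked) TPMs per sample, assuming one
# kallisto output directory per sample named *.kallisto as in the metadata table above.
for d in *.kallisto; do
    printf '%s\t' "$d"
    grep -E 'XIST|RPS4Y1' "$d"/abundance.tsv | awk '{printf "%s=%s ", $1, $5}'
    echo
done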
__________
*I say approximate because Kallisto will fail to map some reads that BWA would, so we can't guarantee there are no false negatives. Also, without bootstrapping some reads end up in poorly resolved equivalence classes, but that's negligible for the purposes of the GenomeID calculation.
^By the way, don't use the pvalues script in GitHub on RNASeq-derived identifiers. It assumes diploidy, which we can't assume for RNASeq-derived genotypes due to imprinting, etc.