March 21, 2017

Quick sample sex swap sanity check for RNASeq

I've noted before how you might detect RNASeq sample swaps in engineered cell line samples.  But what if we don't have a priori genotype knowledge?

To quickly catch some possible sample swap problems in an RNASeq experiment involving clinical samples with sex metadata, you can look at the expression level of SRY, a gene found on the Y chromosome that should therefore only be expressed in males, at any age^. Females have shown zero signal in all experiments we've run here. If using Kallisto, concatenate the abundance (transcripts per million) files first:

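# Header line: target_id, then one column per sample named after its Kallisto output directory
$ perl -e 'print "target_id\t",join("\t",map {/(.*)\//;$1} @ARGV),"\n";' *.kallisto/abundance.tsv > all_abundance.tsv

# Each abundance.tsv has five columns, so after paste the tpm values sit at 0-based indices 4, 9, 14, ... (those ending in 4 or 9)
$ paste *.kallisto/abundance.tsv | perl -ane 'print $F[0];for (1..$#F){print "\t$F[$_]" if /[49]$/}print "\n"' | tail -n +2 >> all_abundance.tsv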
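The paste step joins files line by line, so it only gives a correct table if every abundance.tsv lists the same transcripts in the same order. That should hold when all samples were quantified against the same Kallisto index, but it is cheap to check. A minimal sketch, where the function name check_target_order is my own invention and not part of Kallisto:

```shell
# Hypothetical helper (not part of Kallisto): verify that all given
# abundance.tsv files have an identical first column (target_id),
# in the same order. Returns non-zero if any file differs from the first.
check_target_order() {
    # Checksum of the first file's target_id column serves as the reference
    ref_sum=$(cut -f1 "$1" | cksum)
    rc=0
    for f in "$@"; do
        if [ "$(cut -f1 "$f" | cksum)" != "$ref_sum" ]; then
            echo "target_id column differs from $1: $f" >&2
            rc=1
        fi
    done
    return $rc
}
```

Something like `check_target_order *.kallisto/abundance.tsv && echo OK` could then be run before the paste command.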

Then the following prints the tab-delimited metadata file with the SRY abundances appended as a new column, headed by SRY's RefSeq identifier NM_003140 (I used RefSeq as a Kallisto reference):

$ grep NM_003140 all_abundance.tsv | perl -ane 'print join("\n",@F),"\n"' | paste meta.tab -
sample path sex age NM_003140
A A.kallisto M 50 0.603562
B B.kallisto M 75 0.540668
C C.kallisto M 27 0.519294
D D.kallisto F 35 0
E E.kallisto M 46 0
F F.kallisto M 74 0.970973
G G.kallisto M 41 0.57206
H H.kallisto F 30 0
I I.kallisto M 19 0.246618
J J.kallisto F 39 0.381072
K K.kallisto F 61 0
L L.kallisto M 37 0.304948
M M.kallisto M 65 0
N N.kallisto F 78 0
O O.kallisto F 57 0
P P.kallisto F 53 0
Q Q.kallisto F 52 0
R R.kallisto F 73 0

Three samples have SRY information that doesn't jibe with the sample metadata: two males without SRY expression (E and M), and one female with SRY expression (J). Time to track backwards to where the errors might have occurred! Note that in my experience, no female samples have any detectable SRY expression, but not all male samples do (i.e. occasionally male samples have 0).
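With more than a handful of samples, eyeballing the table gets tedious. Here is a rough sketch of how the comparison could be automated with awk. It assumes the table is tab-delimited with a header row and the same column order as above (sample, path, sex, age, SRY abundance), and the function name flag_sry_mismatches is made up for illustration:

```shell
# Hypothetical helper: read the pasted table on stdin (tab-delimited, with a
# header row; columns assumed to be sample, path, sex, age, SRY TPM) and
# print the samples whose recorded sex disagrees with SRY expression.
flag_sry_mismatches() {
    awk -F'\t' '
        NR == 1 { next }   # skip the header row
        # Recorded female but SRY detected: strong hint of a swap
        $3 == "F" && ($5 + 0) > 0  { print $1 "\tF\tSRY detected (TPM " $5 ")" }
        # Recorded male but no SRY: weaker evidence, some males show 0
        $3 == "M" && ($5 + 0) == 0 { print $1 "\tM\tno SRY detected" }
    '
}
```

Piping the output of the grep/paste command above into flag_sry_mismatches should list samples E, J and M. Keep in mind the caveat about males: a female sample with SRY signal is a much stronger red flag than a male sample without it.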

Update: Even more reliable is the presence of the XIST long non-coding RNA (polyadenylated, so usually captured in mammalian RNASeq protocols) in female samples but not male. Its RefSeq ID is NR_001564. This wouldn't be reliable in Turner Syndrome subjects, but hopefully your study does not include these, or if it does, you can use the combination of XIST absence and SRY presence.
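If an XIST column is added to the table the same way as the SRY column, the two markers can be combined into a single call per sample. The sketch below is only one way to do it and rests on assumptions of mine: a four-column tab-delimited input (sample, recorded sex, SRY TPM, XIST TPM), a detection threshold of anything above zero, and the function name call_sex_from_markers. Samples where both or neither marker is detected are labelled ambiguous rather than guessed at.

```shell
# Hypothetical helper: stdin is tab-delimited with a header row and columns
# sample, recorded sex, SRY TPM, XIST TPM (this layout is an assumption).
# Appends a column with the sex inferred from the two markers.
call_sex_from_markers() {
    awk -F'\t' '
        NR == 1 { print $0 "\tinferred"; next }
        {
            sry  = (($3 + 0) > 0)   # 1 if any SRY signal
            xist = (($4 + 0) > 0)   # 1 if any XIST signal
            if (xist && !sry)      inferred = "F"
            else if (sry && !xist) inferred = "M"
            else                   inferred = "ambiguous"
            print $0 "\t" inferred
        }
    '
}
```

Comparing the inferred column with the recorded sex column then points to the samples worth tracing back.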
_________
^Sure, there are rare exceptions to nominal male=Y; I'm ignoring them here.
