January 29, 2016

Combining gzip files

Many files in bioinformatics are gzip compressed to save space. Sure you can send decompression streams to programs that need raw FASTQ or whatever using gzip -cd file.gz | , but what if you need to merge a bunch of FASTQ files (or any other files without headers)? In a stroke of genius the gzip designers made each compression block independent, so you can concatenate without decompression. For example to merge the lane files of a paired-end Illumina NextSeq 500 run:

gzip *_R1*.fastq.gz > outdir/R1.fastq.gz
gzip *_R2*.fastq.gz > outdir/R2.fastq.gz

It's as simple as that!

January 27, 2016

Locating Named Short Tandem Repeats in the Human Genome



Short Tandem Repeats (STRs) are a staple of forensic science (think of how many times they mention "CODIS markers" on the TV series CSI). Strangely enough, it's tough to Google a list of the genomic location of the thousands of STRs in the human reference sequence beyond the handful used in forensic contexts.  You see, STRs are used in other applications, like cytogenetic microarrays and you might want to look at the location of markers that pop up on an array result.  Clearly this info is buried deep in the design and annotation files of Affymetrix arrays, but what about something we can just look up in a tab delimited file?  I came across this issue and here's how you can generate such a table, in BED file format actually, using the dbSTS (Sequence Tagged Sites) section of Genbank and the mapping to hg19 (a.k.a. GRCh37):

$ wget ftp://ftp.ncbi.nlm.nih.gov/repository/dbSTS/UniSTS_human.sts

$ perl -ane 'print "$F[0]\t$F[4]\n"' UniSTS_human.sts > stsid2name

$ wget ftp://ftp.ncbi.nlm.nih.gov/repository/dbSTS/UniSTS_ePCR.Reports/Homo_sapiens/hs_geno.epcr

$ perl -ne 'BEGIN{%i2n=split /\s/s, `cat stsid2name`}if(/CM0006([678]\d)\.1\|\t(\d+)..(\d+)\t(\d+)/ and $i2n{$4}){my $chr = $1-62; print join("\t", "chr".($chr == 24 ? "Y" : $chr == 23 ? "X" : $chr), $2, $3, $i2n{$4}),"\n"}' hs_geno.epcr > dbsts_hg19.bed

Note that some STSs don't map uniquely, that's just the way the cookie crumbles. Now you can index the bed file with tabix to get STSs in a range, or grep it to retrieve the location of almost any STR, such as D13S800:

$ grep D13S800 dbsts_hg19.bed
chr13 73874692 73874987 D13S800