Many files in bioinformatics are gzip compressed to save space. Sure you can send decompression streams to programs that need raw FASTQ or whatever using gzip -cd file.gz | , but what if you need to merge a bunch of FASTQ files (or any other files without headers)? In a stroke of genius the gzip designers made each compression block independent, so you can concatenate without decompression. For example to merge the lane files of a paired-end Illumina NextSeq 500 run:
gzip *_R1*.fastq.gz > outdir/R1.fastq.gz
gzip *_R2*.fastq.gz > outdir/R2.fastq.gz
It's as simple as that!
This blog provides updates on happenings at the Bioinformatics Support Services of the University of Calgary's Cumming School of Medicine, Centre for Health Genomics and Informatics. This includes a mix of technical information of use to other bioinformaticians, and practical information about services for both bench researchers and clinicians. Opinions expressed here are solely my (Paul Gordon) own, and should not be construed as AHS or UofC policy.
Blog Archive
January 29, 2016
January 27, 2016
Locating Named Short Tandem Repeats in the Human Genome
Short Tandem Repeats (STRs) are a staple of forensic science (think of how many times they mention "CODIS markers" on the TV series CSI). Strangely enough, it's tough to Google a list of the genomic location of the thousands of STRs in the human reference sequence beyond the handful used in forensic contexts. You see, STRs are used in other applications, like cytogenetic microarrays and you might want to look at the location of markers that pop up on an array result. Clearly this info is buried deep in the design and annotation files of Affymetrix arrays, but what about something we can just look up in a tab delimited file? I came across this issue and here's how you can generate such a table, in BED file format actually, using the dbSTS (Sequence Tagged Sites) section of Genbank and the mapping to hg19 (a.k.a. GRCh37):
$ wget ftp://ftp.ncbi.nlm.nih.gov/repository/dbSTS/UniSTS_human.sts $ perl -ane 'print "$F[0]\t$F[4]\n"' UniSTS_human.sts > stsid2name $ wget ftp://ftp.ncbi.nlm.nih.gov/repository/dbSTS/UniSTS_ePCR.Reports/Homo_sapiens/hs_geno.epcr $ perl -ne 'BEGIN{%i2n=split /\s/s, `cat stsid2name`}if(/CM0006([678]\d)\.1\|\t(\d+)..(\d+)\t(\d+)/ and $i2n{$4}){my $chr = $1-62; print join("\t", "chr".($chr == 24 ? "Y" : $chr == 23 ? "X" : $chr), $2, $3, $i2n{$4}),"\n"}' hs_geno.epcr > dbsts_hg19.bed
Note that some STSs don't map uniquely, that's just the way the cookie crumbles. Now you can index the bed file with tabix to get STSs in a range, or grep it to retrieve the location of almost any STR, such as D13S800:
$ grep D13S800 dbsts_hg19.bed chr13 73874692 73874987 D13S800
Subscribe to:
Posts (Atom)