October 13, 2016

The Mothur of all eukaryotes

While microbial small subunit rRNA analysis (16S) has been commonplace for a number of years now as a cheap per-sample way to profile microbial communities, and there's an Illumina MiSeq lurking in every corner of universities these days, around here we're seeing an uptick in its use for eukaryotic profiling (18S). Fun stuff like dietary analysis of wild animal feces, or sequencing river water to get fish diversity information.

I'm not so silly as to get involved in the Qiime vs. Mothur methodology debate, but let's suppose you've settled on Mothur. Mothur's kind of nice insofar as it's a single binary (Qiime is typically installed as a virtual machine), and the recommended Silva reference files can be downloaded in a bundle from the Mothur Web site.  There are two issues with using the Mothur-provided reference files for eukaryotic analysis: 1) the eukaryotic taxonomy is truncated the same way as the prokaryotic one (6 levels) even though the eukaryotic tree is deeper, so quite often you end up with only an order, class or family level designation, and 2) the prebuilt DB is Silva 123 from July 2015.  Silva is at version 128 as of this writing, with 577,832 quality entries vs. 526,361 in the old release, and 18,213 Eukarya represented vs. 16,209 in version 123.  Here's how to address both the eukaryotic labelling and the stale version issues:

Download and unpack (as newer versions come out, adjust the URLs accordingly):


wget -O SSURef_NR99_latest_opt.arb.gz https://www.arb-silva.de/fileadmin/silva_databases/current/ARB_files/SSURef_128_SILVA_23_09_16_opt.arb.gz

gzip -d SSURef_NR99_latest_opt.arb.gz

wget -O tax_slv_ssu_latest.txt https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/tax_slv_ssu_128.txt

perl -pe 's/ \(Animalia\)//;s/ /_/g' tax_slv_ssu_latest.txt > tax_slv_ssu_latest.fixed.txt

Export the FastA files:

Launch arb:

arb SSURef_NR99_latest_opt.arb

Export as per http://blog.mothur.org/2015/12/03/SILVA-v123-reference-files/, saving the file as silva.full_latest.fasta, then quit arb.

Note that I compiled ARB from source since I'm on CentOS 7 and they only have precompiled binaries for earlier operating systems.  If you're in the same boat, you'll need to install the following packages as root/sudo; not all of them are documented in the installation instructions.

yum install libxml2 transfig libXp libtiff gnuplot-common xorg-x11-xbitmaps Xaw3d xorg-x11-fonts-misc xfig xfig-common motif gnuplot libtiff-devel libxml2-devel libxml2-python lynx glib2-devel imake libXmu-devel libXp-devel motif-devel

Format the sequence:

Here we deviate a bit from the Mothur README for simplicity, and to get the right taxonomic labels for eukaryotes.

mothur "#screen.seqs(fasta=silva.full_latest.fasta, start=1044, end=43116, maxambig=5, processors=8); pcr.seqs(start=1044, end=43116, keepdots=T); degap.seqs(); unique.seqs();"

grep ">" silva.full_latest.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- > silva.full_latest.good.pcr.ng.unique.accnos

mothur "#get.seqs(fasta=silva.full_latest.good.pcr.fasta, accnos=silva.full_latest.good.pcr.ng.unique.accnos)"

mv silva.full_latest.good.pcr.pick.fasta silva.full_latest.align

Run my eukaryotic labelling adjustment script (modified from https://raw.githubusercontent.com/rec3141/diversity-scripts/master/convert_silva_taxonomy.r, since that one didn't work for me and wasn't parameterized):


Rscript convert_silva_taxonomy.r tax_slv_ssu_latest.fixed.txt silva.full_latest.align silva.full_latest.tax

Now you're good to go for any type of SSU analysis (16S or 18S); just follow something like the ever-popular MiSeq SOP.

If the above instructions failed for you, download my SILVA 128 tax file here, along with the fasta and align files.

**Update: there seems to be a problem with the taxonomic classifications of 4 Ralstonia sequences in the current SILVA release: they have only two levels of classification. You'll need to manually fix those in the output taxonomy file to get it to work properly. A quick way to find them is sketched below.
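Here's a minimal R sketch for spotting the truncated records, assuming the mothur-style taxonomy file generated above (silva.full_latest.tax: accession, then a semicolon-delimited lineage, one per line):

tax <- read.table("silva.full_latest.tax", sep="\t", quote="", comment.char="", stringsAsFactors=FALSE, col.names=c("accession", "lineage"))
depths <- sapply(strsplit(tax$lineage, ";"), length)  # count taxonomy levels per entry
tax[depths < 3,]  # the problem Ralstonia entries show up here; fix their lineages by hand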

October 6, 2016

Using Kallisto & Sleuth for RNASeq analysis

Kallisto is a fast pseudo-mapping tool for RNAseq data that has some advantages over more established mapping and quantification methods like bwa and htseq-count. In very broad strokes, it doesn't do a traditional spliced alignment to a reference genome, but rather assigns reads to a reference set of transcripts based on kmer counts. This works when you have a well-annotated genome like human or mouse. Where the kmer transcript assignment is ambiguous, it uses bootstrap analysis to estimate the probability of correct assignment to a transcript. It can afford to bootstrap because kmer counting is super fast and takes little memory, making it ideal for running analysis on a laptop, for instance.

I've been holding back because the stats software to interpret differential gene expression from the bootstrapped values has been slowly evolving. In June a preprint about this analysis software, called Sleuth, was put up on bioRxiv. So it was probably time I checked it out more thoroughly.

In particular I was drawn in by microdissected tissue RNASeq experiments where DESeq2 was only showing very modest differential expression of genes, but a lot of it (thousands of genes). In theory, Sleuth should be better at resolving any splicing variation within genes that might be going on, and maybe even at compensating for tissue impurity insofar as it causes systematic underreporting of differential expression ratios.

Here's a brief rundown on running a combined Kallisto + Sleuth analysis.

Create a Kallisto searchable database of transcripts for your species, if you don't already have one. Personally I like RefSeq, but you could use Ensembl transcripts, etc.  Here's how to get the RefSeq fastas, while stripping the GIs and leaving the NM_ or NR_ ID as the one Kallisto will report.


wget ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.*.rna.fna.gz

gzip -cd mouse.*.gz | perl -pe 's/^.*?\|ref\|(.*?)\|/>$1/' > mm10.refMrna.fa

Index the reference file with Kallisto. It'll take a few minutes.


kallisto index -i /path/to/dbdir/mm10.refMrna mm10.refMrna.fa

Run kallisto for each of your samples; here I have single-end RNASeq.  The 180/20 is a reasonable insert size/std dev estimate for TruSeq RNA v2 library preps, probably the most common type of sequencing these days. 190bp might be good too, as per our own lab's TapeStation stats over many runs, but in practice the effect of changing this value by ten is negligible. The 100 specifies the number of bootstraps to run.

kallisto quant -i /path/to/dbdir/mm10.refMrna -o sample_id.kallisto -b 100 --single -l 180 -s 20 <(gzip -cd /path/to/unique_sample_id_prefix*.fastq.gz)

Now let's run some differential expression analysis. First, create a tab delimited metadata file for the experiment so we have the factors for each sample ID (header "sample"), and the file path (header "path") to the kallisto output for each sample.  The file should look something like this:

sample      path               time
mysample1   mysample1.kallisto D0
mysample2   mysample2.kallisto D0
mysample3   mysample3.kallisto D5
mysample4   mysample4.kallisto D5

I've put D0 and D5 here so we have a factor with two values, and by default the alphabetically first factor value is considered the control in R (i.e. ratios will be reported as exp:ctl, so D5:D0).  You could have put 0 and 5 in the column instead, but then you'd have to coerce the numbers into factor levels yourself, so why bother?  Putting just numbers only makes sense if you have a range of numeric values and are expecting a dosage response in the regression model.
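For the record, the coercion is a one-liner; a minimal sketch assuming the numeric column is still called "time" as above:

meta$time <- factor(meta$time)  # make sleuth treat time as discrete groups, not a numeric covariate
levels(meta$time)               # the first level listed ("0") would be treated as the control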

Note that the tutorial on the Sleuth Web site uses a somewhat convoluted method to get the right metadata table together. Here, I've simplified it, assuming you are running R from the directory where all the kallisto quant output directories reside.

library(sleuth)
meta <- read.table("meta.tab", header=TRUE)

meta$path <- as.character(meta$path)

Make a regression model for the factor(s) in the metadata file.  Here we only have one, called "time" in the metadata file prepared earlier.

so <- sleuth_prep(meta, ~ time)

Model the gene expression as responding to the time factor.

so <- sleuth_fit(so)

Create another model where the gene expression is not dependent on any factor.

so <- sleuth_fit(so, ~1, "reduced")

Run a likelihood ratio test between the two models to see what transcripts appear to really be affected by the time factor value.


so <- sleuth_lrt(so, 'reduced', 'full')
results_table_lrt <- sleuth_results(so, 'reduced:full', test_type = 'lrt')

Run a Wald test to check for differential expression according to the time factor. Wait...didn't we just do that with the LRT?  They are different ways of skinning the same cat, and generally the LRT is considered the better test (see a nice explanation here). We need to run both because only the Wald test will report the fold change*. My factor was "time" and the non-control value is D5, so the test is called "timeD5", as R is wont to do when concatenating factors and values. Alternatively, you can list the valid Wald tests with the so$fits[["full"]]$design_matrix command.  On to the Wald test of the time factor...

so <- sleuth_wt(so, 'timeD5')
results_table_wt <- sleuth_results(so, 'timeD5')

We want the Wald test results for those genes that also passed the more stringent (when there are only two factor levels) LRT.  Let's get the names of the transcripts that are in both lists (with multiple testing corrected p-value, a.k.a. q-value, less than 0.05).

d5.lrt.sig_ids <- results_table_lrt$target_id[which(results_table_lrt$qval < 0.05)]
d5.wt.sig_ids <- results_table_wt$target_id[which(results_table_wt$qval < 0.05)]
shared_ids <- d5.wt.sig_ids[d5.wt.sig_ids %in% d5.lrt.sig_ids]

Grab just the rows of the Wald test that correspond to the shared transcripts, then write them to a file for posterity, downstream analysis in another program, etc.

shared_results <- results_table_wt[results_table_wt$target_id %in% shared_ids,]
write.csv(shared_results, file="kallisto_wald_test_lrt_passed_d5.csv")

To make sure we aren't throwing the baby out with the bathwater, let's write out the Wald-pass-but-LRT-failed genes to another file in case there's something interesting there.

wald_only_ids <- d5.wt.sig_ids[!(d5.wt.sig_ids %in% d5.lrt.sig_ids)]
wald_only_results <- results_table_wt[results_table_wt$target_id %in% wald_only_ids,]
write.csv(wald_only_results, file="kallisto_wald_test_only_passed_d5.csv")

Interestingly, unlike the samples used in the Sleuth preprint, both the Wald and LRT tests report fewer differentially expressed genes/transcripts than DESeq2's Wald test when used in conjunction with bwa and htseq-count on my local microdissection samples mentioned earlier. But, the differential expression ratios are higher.  This is the point at which I should explain the asterisk I put after "fold change" earlier on this page.

*The statistical model that Sleuth fits includes a term beta.  In the output tables for the Wald test this is the "b" field, and it's the closest thing to the log2 fold change you might be used to from other programs.  It's not really a fold change though; it's a bias estimator.  And it's the natural logarithm, not log2. It's roughly the extent to which the estimated count for a transcript is affected by the named factor rather than by biological and technical variability (library prep, mapping, etc.).  So if there is very little variability and noise for a transcript (based on the bootstrap analysis), it's pretty close to being a fold change measure.  If there is high technical variability for the transcript though, beta tends to account for less of the overall estimated count even though beta will have a higher magnitude.
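If you just want b on the more familiar log2 scale, the conversion is trivial; a sketch on the Wald results table from above (keeping in mind the caveats about b not being a true fold change):

results_table_wt$b_log2 <- results_table_wt$b / log(2)  # b is natural-log scale; divide by ln(2) for a log2-scale value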

If you've had the patience to read this far and understand beta, why don't we look at some pretty graphs to see what our data look like (e.g. segregation on a Principal Component Analysis plot, volcano plots, etc.). Sleuth uses a Web interface built on Shiny to allow interactive plotting, and if you went through the installation instructions for both Kallisto and Sleuth you're set to go, right? The following will launch the Web interface, or at least give you a URL to paste into your browser.

sleuth_live(so)


March 1, 2016

Cleaning up GATK's garbage on many-CPU servers

Every once in a while when I run GATK genotyping commands, I observe Java processes eating up crazy amounts of CPU (per the "top" command) for a long while without accomplishing anything in terms of analysis. Poking around, I figured out this was due to the Java Garbage Collector (GC), which the really keen can explore in depth here, with Tenures and Edens and all kinds of fun low-level computer science stuff. The long and the short of it is that on larger systems with many CPUs, the default GC mode is both too aggressive and too sporadic to work effectively for GATK processes like the HaplotypeCaller.  Playing around, I found that the best override settings for the GC on our system can be invoked as follows:


java -Xmx24G -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=4 -jar /export/achri_data/programs/GenomeAnalysisTK.jar -I in.bam -T HaplotypeCaller -o out.vcf  -L my.interval -nct 2...


In the case of a 32-CPU server, one pathological HaplotypeCaller process used 10% of its normal CPU time and 25% of its wall time with the new settings, but of course YMMV. If you are using Queue for your job control, there is an old thread that discusses this issue in part (threading, but not concurrency of GC). Also worth noting that the four GC threads are never all going full bore at the same time, so in effect you use one extra CPU on average with these settings.

February 2, 2016

Are you using DESeq2 correctly for Illumina RNASeq?

tl;dr If you have single-end Illumina RNASeq data, the reads are antisense. You cannot use DESeq2 to analyze the alignment files (BAMs) with summarizeOverlaps() as described in the vignette. You need to either use htseq-count with the "stranded=reverse" option and DESeqDataSetFromHTSeqCount(), or use a preprocessing function for the reads in R that is new in the GenomicRanges package starting in version 1.23.26 (April 12th, 2016). The full analysis walkthrough starts halfway down this page.

Like many people I've been moving away from Cufflinks based RNASeq gene expression analysis to DESeq2, largely due to its more robust statistics on tough samples.  There are great tutorials (e.g. http://www.bioconductor.org/help/workflows/rnaseqGene/) that show you how to get started making pretty graphs from your data with DESeq2 and related R/Bioconductor packages. This post is to warn you that it can all go horribly wrong but look okay on the surface!

If you're new to it, unless you know to do some very specific googling or your wet lab is right on top of these things and tells you, you might not realize that the single-end Illumina RNASeq protocol generates antisense reads, not sense reads.  Here's an image buried deep in the Illumina login-only Web site.



I suspect that the GenomicAlignments and/or DESeq2 package authors haven't worked with such datasets much, because there is no way to make the summarizeOverlaps() method count only antisense reads, and the DESeq2 tutorial linked above doesn't mention this possible snag. Setting ignore.strand=TRUE in summarizeOverlaps() will do in a pinch, but you lose all the advantages of a stranded sequencing protocol...most prominently the ability to resolve mRNAs with overlapping exons to the right gene.

The most annoying bit of this whole thing is that it's not glaringly obvious when running through the DESeq2 tutorial that you've got the wrong data, since there is enough seep-through to the sense strand (~2% typically, though I've seen up to 15%) that the graphs and stats look okay prima facie.  To make things worse, the seep-through affects certain genes much more than others, so it's not like just subsampling the correct strand's data. Here's a PCA plot of samples using the sense strand, where the samples still cluster reasonably well by factor groups:



The variance is a bit fat at the low end, but not alarmingly so, and the range is normal:

And you do get a list of statistically significant changes for the various factor contrasts using DESeq2. But it's the wrong data; we should be using the antisense counts! If you want to determine from the aligned data whether you have sense or antisense reads, here's a quick command I whipped up to generate a BED file with percentage strand concordance for ungapped/unclipped 75bp reads in myfile.bam for each gene in a UCSC refFlat file:

cut -f 1-6 mm10_refFlat.bed | uniq | perl -ane '$t=$r=0;for $hit (split /\n/, `samtools view myfile.bam $F[0]:$F[1]-$F[2]`){my @a=split /\t/,$hit; next unless $hit =~ /7\dM/; $t++; $r++ if $a[1] & 16}print join("\t", @F[0..3],int(100*($F[5] eq "-" ? $r/$t:($t-$r)/$t)),$F[5],$t),"\n" if $t' > myfile.strand.bed

You might want to stick that in a shell script file for reuse. Changing the /7\dM/ regex for longer reads is left as an exercise for the reader ;-).

Looking at the results, it was obvious that the % strand concordance (5th column) was near zero for the vast majority of genes, at any read depth (7th column).  Without any a priori knowledge of a sample's wet lab prep, that's how you know you've got to deviate from the DESeq2 tutorial and use htseq-count:

htseq-count -s reverse -f bam in.bam mm10_refGene_w_genenames.gtf > out.gene_counts.tab

This has the added advantage of giving you gene names in the row labels for DESeq2 output rather than Entrez GeneID as per the DESeq2 tutorial. If you want to generate a UCSC gene model set, here's a script that adds the gene_ids from a refFlat BED to a refGene GTF table retrieved using the UCSC Table Browser. Make sure you run cpanm Set::IntervalTree if you don't have this Perl module installed already. It handles most idiosyncrasies of the UCSC gene models, like weird miRNAs that span several genes. It can also make a BED by the way.

Unfortunately, HTSeq is pretty slow -- 2 or 3x slower than Bioconductor's summarizeOverlaps() -- and then you need to build an extra correspondence table to use with DESeqDataSetFromHTSeqCount(), and you can't take advantage of the nice automated read counting parallelization in R.  As of April 2016, though, you can pass a built-in function as an extra argument when you go to load your read counts, preprocess.reads=invertStrand, like so, to correctly use strand-specific single-end Illumina RNASeq data:

se <- summarizeOverlaps(features=ebg, reads=bamfiles, mode="Union", singleEnd=TRUE, ignore.strand=FALSE, preprocess.reads=invertStrand)
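For completeness, if you do take the htseq-count route instead, the correspondence table mentioned above is just a data frame pairing sample names with count files; here's a minimal sketch, where the Sample_ID column and the .gene_counts.tab file naming are assumptions on my part:

library("DESeq2")
meta <- read.csv("sample_meta.tab")
# one row per sample: name, htseq-count output file, then the experimental covariates
sampleTable <- data.frame(sampleName=meta$Sample_ID, fileName=paste0(meta$Sample_ID, ".gene_counts.tab"), Treatment=meta$Treatment, Knockout=meta$Knockout)
dds_htseq <- DESeqDataSetFromHTSeqCount(sampleTable, directory=".", design= ~ Treatment + Knockout)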


Here's a slightly more foolproof version of the DESeq2 tutorial commands if you're starting with a bunch of BAM files from Illumina RNASeq data of mice (Mus musculus), and a sample design file that includes the BAM file names to mitigate file sorting or subsetting issues (sample_meta.tab):

library("GenomicFeatures")
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library("GenomicAlignments")
library("BiocParallel")
library("Rsamtools")
library("DESeq2")
library("PoiClaClu")
library("pheatmap")
library("RColorBrewer")
library(org.Mm.eg.db)
meta <- read.csv("sample_meta.tab")
bamfiles <- BamFileList(as.character(meta$BAM_filename))
ebg <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by="gene")
multicoreParam <- MulticoreParam(workers = 8) # if you have an 8-core CPU
register(multicoreParam)                      # ditto
invertStrand <- function(x){GenomicRanges:::invertRleStrand(x)} # if you don't have the latest GenomicRanges version
se <- summarizeOverlaps(features=ebg, reads=bamfiles, mode="Union", singleEnd=TRUE, ignore.strand=FALSE, preprocess.reads=invertStrand)

At this point let's say it's a knockout mouse (e.g. Adh1) being subjected to some treatment, and you want to do a quick sanity check on the number of reads mapping to the knocked-out gene in each sample (normalized just for the total mapped read depth of each sample).  You can do the following to get the right index for that gene (the knownGene database is indexed on Entrez Gene ID):

counts <- assay(se, "counts")

gene_symbols <- select(org.Mm.eg.db, rownames(counts), "SYMBOL", "ENTREZID")
knockout_gene_id <- gene_symbols[which(gene_symbols$SYMBOL == "Adh1"),]$ENTREZID
counts[knockout_gene_id,]/colSums(counts)

Assuming the relative abundance of the knockout in each sample makes sense, we can go on and do all the stuff in the DESeq2 tutorial. As an aside, you could also generate an FPKM file for use in other methods that need it:

transcripts_per_million <- sweep(counts, 2, colSums(counts), "/")*1000000 # divide each sample (column) by its total mapped reads
gene_lengths_in_k <- sum(width(reduce(ebg)))/1000
fpkm <- transcripts_per_million/gene_lengths_in_k
fpkm_symbol_df <- data.frame(SYMBOL=gene_symbols$SYMBOL, fpkm)
write.csv(fpkm_symbol_df, file="gene_fpkm.csv")

Now on with the DESeq2 tutorial:

colData(se) <- DataFrame(meta)
dds <- DESeqDataSet(se, design = ~ Treatment + Knockout + Treatment:Knockout)
dds_redux <- dds[rowSums(counts(dds)) > 1,]
rld <- rlog(dds_redux, blind=FALSE)
plot(assay(rld)[,1:2], pch=16, cex=0.3)

Ah, that looks nicer! That fanning at the left edge has gone away, and we have a lot more highly expressed genes (upper right). Quasi-homoskedastic for distance analysis, etc...

poisd <- PoissonDistance(t(counts(dds_redux)))
samplePoisDistMatrix <- as.matrix( poisd$dd )
rownames(samplePoisDistMatrix) <- paste(meta$Treatment, meta$Knockout, meta$Sample_ID, sep=":")
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(samplePoisDistMatrix, clustering_distance_rows=poisd$dd, clustering_distance_cols=poisd$dd, col=colors)

I've labelled the Y axis with Treatment:Knockout:Sample_ID from sample_meta.tab. The first two samples (14_2W6 and 5_2i3, i.e. samples #5 and #17 per the X axis) look like outliers; I might have to revisit those if the differential expression analysis turns out badly. Otherwise the tree clustering on the left reflects the TRUE/FALSE experiment/control breakdown of the samples nicely, and most of the 0 Knockout (i.e. wildtype) samples are grouped in the top half. This graph looked quite different before the preprocess.reads=invertStrand fix! Let's check out the PCA plot to compare to the original:


plotPCA(rld, intgroup = c("Treatment", "Knockout"))


Whoa! Those two are really outliers (not nearly as obvious in the wrong-strand graph). Note that by default you can't see the individual sample labels in the plot. Since it's just a standard ggplot2 object we can add the labels once we have the PCA data. The PCA data can be grabbed by calling the underlying plot function directly with a special argument, returnData.

rld_plotdata <- DESeq2:::plotPCA.DESeqTransform(rld, intgroup=c("Treatment", "Knockout"), returnData=TRUE)
library(ggplot2)
plotPCA(rld, intgroup=c("Treatment", "Knockout")) + annotate("text", x=rld_plotdata$PC1, y=rld_plotdata$PC2, label=rownames(rld_plotdata))

Yup, it's those two samples. We probably want to go back and exclude them from the analysis, by splicing out the relevant bits (indices 5 & 17) of the se object.

se_redux <- se[,c(seq(1,4),seq(6,16),seq(18,20))]
dds <- DESeqDataSet(se_redux, design = ~ Treatment + Knockout + Treatment:Knockout)
dds_redux <- dds[rowSums(counts(dds)) > 1,]
rld <- rlog(dds_redux, blind=FALSE)

I reran the charts and they cluster much better. Now on to the differential expression:

deseq <- DESeq(dds_redux)

Here's the less obvious part: I have several groups I can compare in my model: Treatment vs. Control, or Knockout zygosity, or the interaction of the two.  Let's look at how we address these in Bioconductor:


resultsNames(deseq)
[1] "Intercept"     "TreatmentTRUE"     "Knockout"     "TreatmentTRUE.Knockout"

So now we can proceed to retrieve the differential expression (FDR=0.05) for each "contrast":


result_treatment <- results(deseq, contrast=list("TreatmentTRUE"), alpha=.05)
result_knockout <- results(deseq, contrast=list("Knockout"), alpha=.05)
result_interaction <- results(deseq, contrast=list("TreatmentTRUE.Knockout"), alpha=.05)

Now let's check the number of entries with multiple testing adjusted p-value <.05 and see if it's manageable. If not, proceed to 0.01 or change your model, etc.:

result_knockout.sig <- which(result_knockout$padj < 0.05)
length(result_knockout.sig)
[1] 320

I can deal with 320. Finally, print the relevant columns to a file along with the corresponding gene names (repeat this for each of the three results; a loop version is sketched below):

result_knockout.sig_symbol_df <- data.frame(result_knockout[result_knockout.sig,], SYMBOL=gene_symbols$SYMBOL[result_knockout.sig])
write.csv(result_knockout.sig_symbol_df[c("SYMBOL", "baseMean", "log2FoldChange", "padj")], file="knockout_dge.csv")
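
Here's a minimal sketch of that repetition as a loop, reusing the three results objects from above (the output file names are just examples):

# export the significant rows for all three contrasts the same way
results_list <- list(treatment=result_treatment, knockout=result_knockout, interaction=result_interaction)
for (contrast in names(results_list)) {
  res <- results_list[[contrast]]
  sig <- which(res$padj < 0.05)
  sig_df <- data.frame(res[sig,], SYMBOL=gene_symbols$SYMBOL[sig])
  write.csv(sig_df[c("SYMBOL", "baseMean", "log2FoldChange", "padj")], file=paste0(contrast, "_dge.csv"))
}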




January 29, 2016

Combining gzip files

Many files in bioinformatics are gzip compressed to save space. Sure, you can send decompression streams to programs that need raw FASTQ or whatever using gzip -cd file.gz | , but what if you need to merge a bunch of FASTQ files (or any other files without headers)? In a stroke of genius, the gzip designers made each compressed member of a file independent, so you can concatenate gzip files without decompressing anything; plain old cat does the job. For example, to merge the lane files of a paired-end Illumina NextSeq 500 run:

cat *_R1*.fastq.gz > outdir/R1.fastq.gz
cat *_R2*.fastq.gz > outdir/R2.fastq.gz

It's as simple as that!

January 27, 2016

Locating Named Short Tandem Repeats in the Human Genome



Short Tandem Repeats (STRs) are a staple of forensic science (think of how many times they mention "CODIS markers" on the TV series CSI). Strangely enough, it's tough to Google a list of the genomic locations of the thousands of STRs in the human reference sequence beyond the handful used in forensic contexts.  You see, STRs are used in other applications, like cytogenetic microarrays, and you might want to look up the location of markers that pop up in an array result.  Clearly this info is buried deep in the design and annotation files of Affymetrix arrays, but what about something we can just look up in a tab delimited file?  I came across this issue, and here's how you can generate such a table, in BED format actually, using the dbSTS (Sequence Tagged Sites) section of Genbank and its mapping to hg19 (a.k.a. GRCh37):

$ wget ftp://ftp.ncbi.nlm.nih.gov/repository/dbSTS/UniSTS_human.sts

$ perl -ane 'print "$F[0]\t$F[4]\n"' UniSTS_human.sts > stsid2name

$ wget ftp://ftp.ncbi.nlm.nih.gov/repository/dbSTS/UniSTS_ePCR.Reports/Homo_sapiens/hs_geno.epcr

$ perl -ne 'BEGIN{%i2n=split /\s/s, `cat stsid2name`}if(/CM0006([678]\d)\.1\|\t(\d+)..(\d+)\t(\d+)/ and $i2n{$4}){my $chr = $1-62; print join("\t", "chr".($chr == 24 ? "Y" : $chr == 23 ? "X" : $chr), $2, $3, $i2n{$4}),"\n"}' hs_geno.epcr > dbsts_hg19.bed

Note that some STSs don't map uniquely; that's just the way the cookie crumbles. Now you can index the BED file with tabix to get STSs in a range, or grep it to retrieve the location of almost any STR, such as D13S800:

$ grep D13S800 dbsts_hg19.bed
chr13 73874692 73874987 D13S800