March 1, 2016

Cleaning up GATK's garbage on many-CPU servers

Every once in a while when I run GATK genotyping commands, I see (using the "top" command) Java processes eating up crazy amounts of CPU for a long while without accomplishing anything in terms of analysis. Poking around, I figured out this was due to the Java Garbage Collector (GC), which the really keen can explore in depth, with Tenures and Edens and all kinds of fun low-level computer science stuff. The long and the short of it is that on larger systems with many CPUs, the default GC mode is both too aggressive and too sporadic to work effectively for GATK processes like the HaplotypeCaller. Playing around, I found that the best GC override settings for our system can be invoked as follows:


java -Xmx24G -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=4 -jar /export/achri_data/programs/GenomeAnalysisTK.jar -I in.bam -T HaplotypeCaller -o out.vcf -L my.interval -nct 2 ...
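
If you launch a lot of GATK jobs, it can be handy to keep the GC overrides in one place rather than retyping them. Here is a minimal sketch, assuming bash. The names GC_OPTS, GATK_JAR and gatk_cmd are made up for illustration (they are not part of GATK), the jar path is a placeholder, and the function only prints the command so you can eyeball it before running anything:

```shell
#!/usr/bin/env bash
# Hypothetical helper: the names below are illustrative, not part of GATK.
GC_OPTS="-XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=4"   # GC overrides from the command above
GATK_JAR="${GATK_JAR:-GenomeAnalysisTK.jar}"                # placeholder path, adjust for your install

# Print (do not run) a java command line with the GC overrides applied.
# All arguments are passed through to GATK unchanged.
gatk_cmd() {
    echo "java -Xmx24G ${GC_OPTS} -jar ${GATK_JAR} $*"
}

gatk_cmd -T HaplotypeCaller -I in.bam -o out.vcf -L my.interval -nct 2
```

Once the printed command looks right, you could run it by wrapping the call as eval "$(gatk_cmd ...)", or just paste it into your job script.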


In the case of a 32-CPU server, one pathological HaplotypeCaller process used 10% of the CPU time and 25% of the wall time it needed under the default settings, but of course YMMV. If you are using Queue for your job control, there is an old thread that discusses this issue in part (threading, but not concurrency of GC). It is also worth noting that the four GC threads never all go full bore at the same time, so in effect you use about one extra CPU on average with these settings.
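
Taking that "one extra CPU" observation at face value, you can do a rough back-of-envelope for how many such jobs fit on a server at once. This is only a sketch under loud assumptions: each job keeps its -nct threads busy, the GC overhead really does average about one CPU for your workload, and memory is not the limit (with -Xmx24G per job it may well be). The variable names are just for illustration:

```shell
#!/usr/bin/env bash
# Back-of-envelope only: assumes each job keeps its -nct threads busy
# and the GC costs about one extra CPU on average (as observed above).
total_cpus=32      # CPUs on the server
nct=2              # -nct value from the command above
gc_overhead=1      # rough average extra CPUs used by the four GC threads

# Integer division: how many jobs fit without oversubscribing the CPUs.
max_jobs=$(( total_cpus / (nct + gc_overhead) ))
echo "$max_jobs"   # prints 10
```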