January 29, 2016

Combining gzip files

Many files in bioinformatics are gzip-compressed to save space. Sure, you can stream decompressed data to programs that need raw FASTQ (or whatever) using gzip -cd file.gz | program, but what if you need to merge a bunch of FASTQ files (or any other files without headers)? In a stroke of genius, the gzip designers allowed a .gz file to hold several independently compressed members back to back, so you can concatenate gzip files without decompressing them. For example, to merge the lane files of a paired-end Illumina NextSeq 500 run:

cat *_R1*.fastq.gz > outdir/R1.fastq.gz
cat *_R2*.fastq.gz > outdir/R2.fastq.gz

It's as simple as that!
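
If you want to convince yourself, here is a minimal sanity check. It is only a sketch: the read contents, the sample_L00*_R1 file names, and the temporary directory are all made up for illustration. It compresses two tiny FASTQ files separately, merges the compressed files with cat, and compares the decompressed result with the plain concatenation of the originals.

```shell
# Hypothetical toy data in a scratch directory (names are illustrative only).
tmp=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$tmp/sample_L001_R1.fastq"
printf '@read2\nTTGA\n+\nIIII\n' > "$tmp/sample_L002_R1.fastq"

# Compress each lane file on its own, keeping the uncompressed originals.
for f in "$tmp"/*_R1.fastq; do
    gzip -c "$f" > "$f.gz"
done

# Merge the compressed files directly, with no decompression step.
cat "$tmp"/*_R1.fastq.gz > "$tmp/R1.fastq.gz"

# Decompress the merged file and compare it with the concatenated originals.
merged=$(gzip -cd "$tmp/R1.fastq.gz")
expected=$(cat "$tmp"/*_R1.fastq)
if [ "$merged" = "$expected" ]; then
    echo "merged output matches"
else
    echo "merged output differs"
fi
```

Running gzip -t on the merged file is another quick way to check that it is still a valid gzip file.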
