Many files in bioinformatics are gzip compressed to save space. Sure, you can stream decompressed data to programs that need raw FASTQ or whatever using gzip -cd file.gz | some_program, but what if you need to merge a bunch of FASTQ files (or any other files without headers)? In a stroke of genius, the gzip designers allowed a .gz file to consist of several independent compressed members, one after another, and decompressors read them in sequence. That means you can concatenate the compressed files directly, without decompressing and recompressing. For example, to merge the lane files of a paired-end Illumina NextSeq 500 run:
cat *_R1*.fastq.gz > outdir/R1.fastq.gz
cat *_R2*.fastq.gz > outdir/R2.fastq.gz
It's as simple as that!
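If you want to convince yourself before trusting this on real data, here is a minimal sanity check. It is only a sketch: the two tiny FASTQ records and the lane1/lane2 file names are made up for illustration, and it assumes standard gzip, cat and mktemp are available. It compresses two files separately, concatenates the compressed files, and confirms that the merged file decompresses to the same text as the two inputs decompressed one after the other.

```shell
# Work in a throwaway directory so nothing real gets overwritten.
tmpdir=$(mktemp -d)

# Two hypothetical single-record lane files, compressed independently.
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$tmpdir/lane1_R1.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip -c > "$tmpdir/lane2_R1.fastq.gz"

# Merge the compressed files directly, no decompression involved.
cat "$tmpdir"/lane*_R1.fastq.gz > "$tmpdir/R1.fastq.gz"

# Check the merged file's integrity (exit status 0 means it is valid gzip).
gzip -t "$tmpdir/R1.fastq.gz"
integrity=$?

# Decompressing the merged file should give the same text as
# decompressing each lane file in turn.
expected=$(gzip -cd "$tmpdir/lane1_R1.fastq.gz"; gzip -cd "$tmpdir/lane2_R1.fastq.gz")
merged=$(gzip -cd "$tmpdir/R1.fastq.gz")

if [ "$merged" = "$expected" ]; then
    echo "merged file matches the concatenated inputs"
else
    echo "mismatch between merged file and inputs"
fi

rm -rf "$tmpdir"
```

One thing to keep in mind for paired-end data: the shell expands the *_R1* and *_R2* globs in the same sorted order only if the lane file names differ in nothing but the R1/R2 part, so it is worth checking that the read order stays in sync between the two merged files.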
This blog provides updates on happenings at the Bioinformatics Support Services of the University of Calgary's Cumming School of Medicine, Centre for Health Genomics and Informatics. This includes a mix of technical information of use to other bioinformaticians, and practical information about services for both bench researchers and clinicians. Opinions expressed here are solely my (Paul Gordon) own, and should not be construed as AHS or UofC policy.