Not All Compression Formats Are Created Equal: Use .xz instead of .zip or .gz for your compression needs.
on November 12, 2017
If you’re still compressing your files in .zip format, please stop. Or, if you are a Linux person like me, you might still be using the .tar.gz format, which has long been the default in most mainstream distros (Fedora, Ubuntu, etc.).
You should just use the .tar.xz format instead. It is usually far more efficient than a normal zip, and in my experience it also beats .tar.gz (which already beats the zip format).
What prompted this blog post was seeing the stark difference between zip and tar.xz, yet again, earlier today as I worked on one of my research projects.
There I was, collecting CSV files derived from versions of the chromium HSTS preload list (it’s a cool research topic I’ll write about in the future). This is what the folder looked like:
So it has CSV files – a lot more than that screenshot suggests: 365 CSV files total. They aren’t all 1MB, so the total size of the folder isn’t that bad: 83.6MB.
I needed to compress it so I could send it through email, and this is how it went, using four different formats (.zip, .7z, .tar.gz, and .tar.xz):
In 4th (and last) place is .zip @ 19.3MB.
In 3rd place is .tar.gz @ 18.8MB.
In a virtual tie: .7z and .tar.xz, with xz outperforming 7z by just 1kB.
Effectively, the zip and gz files are each over 40x bigger than the xz archive.
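That example is close to a worst case for zip and gz. The CSV files are very similar to each other (each one is an updated version of the one before it). Zip compresses every file separately, and gzip’s DEFLATE only looks back 32kB, so neither can notice that the next file is nearly a copy of the previous one. xz and 7z use LZMA with a dictionary that is megabytes in size, and by default both compress the files as one continuous stream, so they can exploit that repetition.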
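Here is a rough way to see that effect without my actual data. This is only a sketch: the "CSV versions" are synthetic (random hostnames I made up, each file being the previous one plus one new row), and the helper names are my own. It packs the same files as .zip, .tar.gz and .tar.xz in memory, using Python's standard library, and prints the sizes.

```python
import io
import random
import tarfile
import zipfile


def make_versions(count=20, rows=2000, seed=0):
    """Synthetic stand-in for the CSV folder: each file is the previous one plus a row."""
    rng = random.Random(seed)

    def row():
        return f"host{rng.getrandbits(64):016x}.example,{rng.randint(0, 1)}"

    lines = [row() for _ in range(rows)]
    versions = {}
    for i in range(count):
        lines.append(row())
        versions[f"list_{i:03d}.csv"] = "\n".join(lines).encode()
    return versions


def zip_size(files):
    """Size of a .zip holding `files`; every member is deflated on its own."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


def tar_size(files, mode):
    """Size of a compressed tarball; mode is "w:gz" or "w:xz"."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


files = make_versions()
sizes = {
    "zip": zip_size(files),
    "tar.gz": tar_size(files, "w:gz"),
    "tar.xz": tar_size(files, "w:xz"),
}
for name, size in sizes.items():
    print(f"{name:8} {size:>9,} bytes")
```

Each synthetic file is bigger than gzip's 32kB window, so by the time gzip reaches the next file the previous copy is already out of reach. The exact numbers will differ from my real folder, but the xz archive should come out many times smaller than the other two.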
Here’s a more reasonable and representative example – compressing a folder with files that don’t really have that much to do with each other. Here, I’ll be compressing a 6.4MB folder containing a PHP project (cobalt [link], of course). Here’s the result between .zip and .xz:
I already had a .tar.xz and a .zip file, so I just created the 7z and gz versions now (hence the timestamp difference). As you can see, 7z and xz are still virtually tied @ 1.6MB (that’s a rounded-off size reading), gz is still second best @ 2.5MB, and zip is still last at 2.8MB.
It’s not quite the same bloodbath as before (40x worse) but xz and 7z are still clearly much better.
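If you want to run the same kind of comparison on one of your own folders, something like the following should do. It is a minimal sketch using Python's `shutil.make_archive`; the function name `compare_formats` is mine, and it leaves out .7z because the standard library has no writer for it.

```python
import os
import shutil
import tempfile


def compare_formats(folder):
    """Archive `folder` as .zip, .tar.gz and .tar.xz; return {extension: size in bytes}."""
    formats = {"zip": ".zip", "gztar": ".tar.gz", "xztar": ".tar.xz"}
    sizes = {}
    # Archives go into a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for fmt, ext in formats.items():
            archive = shutil.make_archive(
                os.path.join(tmp, "sample"), fmt, root_dir=folder
            )
            sizes[ext] = os.path.getsize(archive)
    return sizes


# Example (hypothetical path):
# for ext, size in compare_formats("path/to/folder").items():
#     print(f"{ext:8} {size / 1e6:.2f} MB")
```

The sizes will not match a desktop archive manager byte for byte, since compression levels and metadata differ between tools, but the ranking should look familiar.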
Why am I recommending xz over 7z then? Mostly because I’m a Linux user. I can create xz files easily, out of the box, and on Windows both 7-Zip and WinRAR unpack them just fine. On macOS, I do have to resort to the terminal (tar xvfJ myfile.tar.xz), but that happens rarely, so I don’t mind much. There’s probably a free application for it in the App Store; I just haven’t bothered looking.
If I were primarily a Windows user, I’d probably recommend 7z for the same pragmatic reason – I don’t really know of a tool that handles both packing and unpacking the xz format on Windows, but one exists for 7z.
So if you’re a Linux user, you should be making xz your compression format of choice. If you’re on Windows, 7z. For macOS users, both are probably supported just fine after installing a third-party utility, so go with whatever you seem to have.
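If a machine has Python 3 on it but no convenient archive tool, the standard `tarfile` module can also pack and unpack .tar.xz. This is a small sketch, not the only way to do it; `pack_xz` and `unpack_xz` are names I made up. The `filter="data"` argument only exists on newer Python versions, so the code checks for it first.

```python
import os
import tarfile


def pack_xz(folder, archive_path):
    """Create a .tar.xz archive containing `folder` (stored under its own name)."""
    with tarfile.open(archive_path, "w:xz") as tf:
        tf.add(folder, arcname=os.path.basename(os.path.normpath(folder)))


def unpack_xz(archive_path, dest):
    """Extract a .tar.xz archive into the directory `dest`."""
    with tarfile.open(archive_path, "r:xz") as tf:
        if hasattr(tarfile, "data_filter"):
            # Newer Pythons: refuse members that would escape `dest`.
            tf.extractall(dest, filter="data")
        else:
            # Older Pythons: only do this with archives you trust.
            tf.extractall(dest)
```

For example, `pack_xz("myproject", "myproject.tar.xz")` followed by `unpack_xz("myproject.tar.xz", "restored")` should leave a copy of the folder under `restored/myproject`.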