How Deflater works
On this page we look briefly at how the compression algorithm which underlies Java's
Deflater works. As mentioned in the introduction, Deflater is actually
a wrapper around the zlib library, which implements the deflate
algorithm.
The deflate algorithm is essentially a combination of two mechanisms:
- Firstly, dictionary-based compression is applied; this tries
to replace recurring patterns in the data with a reference to the first (or a previous)
occurrence of the pattern;
- Then, Huffman encoding is applied: this encodes a byte as a
variable number of bits, with more frequently-occurring bytes being given a
correspondingly shorter code.
Dictionary compression
The compressor works within a "window" of data. It looks at the upcoming data
to be compressed, and looks to see if it already encountered a matching sequence
in the data recently processed. For example, consider the following sequence,
where the character in bold is the character currently being processed, and
those to the left have already been processed:
A B A B C D A B A B C D A...
The compressor sees that it has a sequence "AB" to compress, and that that
sequence occurred 2 characters ago. So it can encode a reference looking something
like (2, 2) to mean "same sequence as two characters ago, length 2 characters"
(of course, that in turn was also probably encoded as a repetition of the earlier
AB sequence).
However, if it is configured to search deeper, it might also find that
"ABC"– or even "ABCD"– occurred 6 characters ago.
As you can see, how exactly the compressor matches upcoming input against
previous sequences is something that is up to the Deflater (in fact
the zlib) implementation, and it turns out that it can be
configured to some extent. Specifically, we can configure a
Deflater's compression level to indicate how far the deflater looks for
matching sequences, resulting in a tradeoff between CPU and compression ratio.
If a character occurs which the compressor decides doesn't match any previous sequence
(possibly because it doesn't look far enough), then it is simply encoded as itself.
As you can see from this description, the deflate algorithm will generally favour
streams with repeating sequences of bytes. (Note that we've used letters for the sake
of illustration, but all byte values are treated equally by the compressor: it doesn't
specifically favour those representing printable characters.)
Huffman encoding
After applying dictionary compression, Huffman encoding is applied to the
output. On the next page, we look at Huffman encoding in
the DEFLATE algorithm.
If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants.
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.