I’ve been doing a lot of work in the PDF space lately. While implementing Binary Cross Reference Streams I was surprised to see that they can be encoded with PNG Predictors. That seemed strange to me because cross reference streams aren’t images; they are byte tables:

[table of sample cross reference stream data]

While the values vary, they are often within the same ranges. The first byte can only contain 0, 1 or 2 and is often the same from row to row, the middle two bytes are often the same, and the last byte generally increases by one each time. Knowing this about the data, you can develop a lossless algorithm that normalizes it and makes PKZIP/GZIP/FLATE compression work much better.
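
To make the shape of this data concrete, here is a small Python sketch that packs some made-up entries into fixed-width rows, assuming a layout of one type byte, a two-byte middle field and a one-byte trailing index (the values are illustrative, not taken from a real file):

    import struct

    # Illustrative entries only: (type, middle field, index), using the
    # field widths described above (1 byte, 2 bytes, 1 byte).
    entries = [(2, 2, 0), (2, 2, 1), (2, 2, 2), (2, 2, 3)]

    for type_field, middle, index in entries:
        row = struct.pack(">BHB", type_field, middle, index)
        print(row.hex(" "))    # 02 00 02 00, 02 00 02 01, ...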

For cross reference streams in PDF documents, the most common predictor algorithm is the mind-blowingly simple Up filter:

Up(x) = Raw(x) - Prior(x)

This means each encoded byte is simply its own value minus the value of the byte directly above it, modulo 256. A complete list of the different PNG predictors is available in the PNG spec. Let’s take a look at what this very simple algorithm does to a small sample table:

02 0002 00
02 0002 01
02 0002 02
02 0002 03

The last column simply increases by one in each row, since it is an index. Running each row through the Up filter (the first row has nothing above it, so it comes through unchanged) turns the table into:

02 0002 00
00 0000 01
00 0000 01
00 0000 01

This example may look contrived, but it comes right out of the PDF 1.7 spec. By subtracting values like this you shrink the vocabulary of byte patterns the FLATE encoder has to represent, which significantly reduces the encoded stream size.
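
To make that concrete, here is a rough sketch in Python of the Up filter and the effect it has on FLATE output size; the row contents and row count are made up for illustration:

    import zlib

    def up_filter(rows):
        # PNG Up filter: each byte minus the byte directly above it, modulo 256.
        # The row above the first row is treated as all zeros.
        prior = bytes(len(rows[0]))
        out = []
        for row in rows:
            out.append(bytes((r - p) % 256 for r, p in zip(row, prior)))
            prior = row
        return out

    # Rows shaped like the table above: constant columns plus an incrementing index.
    rows = [bytes([2, 0, 2, i % 256]) for i in range(10000)]

    raw = b"".join(rows)
    filtered = b"".join(up_filter(rows))

    print(len(zlib.compress(raw)), "compressed bytes without the predictor")
    print(len(zlib.compress(filtered)), "compressed bytes with the Up predictor")

Decoding is the mirror image: a PDF reader runs FlateDecode first and then undoes the predictor with Raw(x) = Up(x) + Prior(x) mod 256. As I understand the spec, when /Predictor is 12 each row of the decoded data is also prefixed with a PNG filter-type byte (2 for Up), so a reader strips that as it goes. A rough sketch:

    def unpredict_up(data, row_len):
        # Undo the Up filter on FlateDecoded bytes; each row starts with a
        # filter-type byte, which should be 2 (Up) and is stripped here.
        prior = bytes(row_len)
        rows = []
        for i in range(0, len(data), row_len + 1):
            if data[i] != 2:
                raise ValueError(f"unexpected filter type {data[i]}")
            row = data[i + 1 : i + 1 + row_len]
            prior = bytes((r + p) % 256 for r, p in zip(row, prior))
            rows.append(prior)
        return rows

    # The four filtered rows from the example, each prefixed with its filter byte.
    decoded = bytes.fromhex("0202000200" "0200000001" "0200000001" "0200000001")
    for row in unpredict_up(decoded, row_len=4):
        print(row.hex(" "))    # recovers 02 00 02 00 through 02 00 02 03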

A particularly great example comes right from the libpng docs:

If you make a 24-bit 4096 x 4096 RGB image that contains one pixel of each possible color, it is 48 MB as raw data, 36 MB with normal GZIP compression, and an insane 59,852 bytes with PNG predictor filtering.
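
I haven’t tried to reproduce libpng’s exact numbers, but here is a rough (and, in pure Python, slow) sketch of the same kind of experiment. The pixel layout and the choice of the horizontal Sub filter are my assumptions, not libpng’s actual test, so the sizes it prints will differ from the figures above:

    import zlib

    WIDTH = HEIGHT = 4096
    ROW = WIDTH * 3    # bytes per scanline of 24-bit RGB

    # One pixel of every 24-bit color, laid out as a simple incrementing counter.
    raw = b"".join(color.to_bytes(3, "big") for color in range(WIDTH * HEIGHT))

    # PNG Sub filter: each byte minus the corresponding byte of the pixel to its left.
    filtered = bytearray()
    for start in range(0, len(raw), ROW):
        row = raw[start:start + ROW]
        shifted = bytes(3) + row[:-3]
        filtered += bytes((a - b) % 256 for a, b in zip(row, shifted))

    print(len(raw), "bytes raw")
    print(len(zlib.compress(raw)), "bytes with plain FLATE")
    print(len(zlib.compress(bytes(filtered))), "bytes with Sub filtering before FLATE")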

 

The real moral of the story here is that images are just a subclass of multidimensional byte tables, and the same kinds of techniques can be used on both to achieve much better compression ratios. That is, if you have a priori knowledge of both the data and the compression algorithm.