v1.1.2 / chapter 1 of 3 / 01 mar 07 / greg goebel / public domain
* This chapter provides an introduction to data compression concepts, and outlines techniques for "lossless" data compression.
* The essential figure of merit for data compression is the "compression ratio", or ratio of the size of a compressed file to the original uncompressed file. For example, suppose a data file takes up 100 kilobytes (KB). Using data compression software, that file could be reduced in size to, say, 50 KB, making it easier to store on disk and faster to transmit over an Internet connection. In this specific case, the data compression software reduces the size of the data file by a factor of two, or results in a "compression ratio" of 2:1.
There are "lossless" and "lossy" forms of data compression. Lossless data compression is used when the data has to be uncompressed exactly as it was before compression. Text files are stored using lossless techniques, since losing a single character can in the worst case make the text dangerously misleading. Archival storage of master sources for images, video data, and audio data generally needs to be lossless as well. However, there are strict limits to the amount of compression that can be obtained with lossless compression. Lossless compression ratios are generally in the range of 2:1 to 8:1.
Lossy compression, in contrast, works on the assumption that the data doesn't have to be stored perfectly. Much information can be simply thrown away from images, video data, and audio data, and when uncompressed such data will still be of acceptable quality. Compression ratios can be an order of magnitude greater than those available from lossless methods.
The question of which is "better", lossless or lossy techniques, is pointless. Each has its own uses, with lossless techniques better in some cases and lossy techniques better in others. In fact, as this document will show, lossless and lossy techniques are often used together to obtain the highest compression ratios.
Even given a specific type of file, the contents of the file, particularly the orderliness and redundancy of the data, can strongly influence the compression ratio. In some cases, using a particular data compression technique on a data file where there isn't a good match between the two can actually result in a bigger file.
* A few little comments on terminology before we proceed:
Many of the examples in this document refer to compressing "characters", simply because a text file is very familiar to most readers. However, in general, compression algorithms are not restricted to compressing text files. Data bytes are data bytes, regardless of whether they define text characters, or graphics data, or measurement data being returned from a space probe.
This chapter discusses fundamental lossless data compression techniques in detail. The discussions will focus on compressing text files, but to repeat, they will work for other types of files as well.
* One of the simplest forms of data compression is known as "run length encoding (RLE)", which is sometimes known as "run length limiting (RLL)".
Suppose you have a text file in which the same characters are often repeated,
one after another. This redundancy provides an opportunity for compressing
the file. Compression software can scan through the file, find these
redundant strings of characters, and then store them using an escape
character (ASCII 27), followed by the character and a binary count of the
number of times it is repeated. For example, the 50 character sequence:
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX that's all, folks!
-- can be converted to:
<ESC>X<31> that's all, folks!
This eliminates 28 characters, compressing the text by more than a factor of
two. Of course, the compression software must be smart enough not to
compress strings of two or three repeated characters, since for three
characters run length encoding would have no advantage, and for two it would
actually increase the size of the output file.
As described, this scheme has two potential problems. First, an escape character may actually occur in the file. The answer is to use two escape characters in that case, which can, once again, actually make the output file bigger if the uncompressed input file includes lots of escape characters.
The second problem is that a single byte cannot specify run lengths greater than 256. This difficulty can be dealt with by using multiple escape sequences to compress one very long string.
Run length encoding is really not very useful for compressing text files, since a typical text file doesn't have a lot of long, repetitive character strings. It is very useful, however, for compressing bytes of a monochrome image file, which normally consists of solid black picture bits, or "pixels", in a sea of white pixels, or the reverse. It can also be used effectively with color graphics files that consist of large simple blocks of a single color.
Run-length encoding is often used as a preprocessor for other compression algorithms. As the next chapter explains, for example, it is used as one of the many pieces of the JPEG image compression scheme.
* A more sophisticated and efficient lossless compression technique is known as "Huffman coding", in which the characters in a data file are converted to a binary code, where the most common characters in the file have the shortest binary codes, and the least common have the longest.
To see how Huffman coding works, assume that a text file is to be compressed,
and that the characters in the file have the following frequencies:
A: 29
B: 64
C: 32
D: 12
E: 9
F: 66
G: 23
In practice, we need the frequencies for all the characters used in the
text, including all letters, digits, and punctuation, but to keep the example
simple we'll just stick to the characters from A to G.
The first step in building a Huffman code is to order the characters from
highest to lowest frequency of occurrence as follows:
66 64 32 29 23 12 9
F B C A G D E
First, the two least-frequent characters are selected, logically grouped
together, and their frequencies added. In this example, the D and E
characters have a combined frequency of 21:
:
.......
: 21 :
: :
66 64 32 29 23 12 9
F B C A G D E
This begins the construction of a "binary tree" structure. We now again
select the two elements the lowest frequencies, regarding the D-E combination
as a single element. In this case, the two elements selected are G and the
D-E combination. We group them together and add their frequencies. This new
combination has a frequency of 44:
:
..........
: 44 :
: :
: .......
: : 21 :
: : :
66 64 32 29 23 12 9
F B C A G D E
We continue in the same way to select the two elements with the lowest
frequency, group them together, and add their frequencies, until we run out
of elements. In the third iteration, the lowest frequencies are C and A:
:
..........
: 44 :
: : :
....... : .......
: 61 : : : 21 :
: : : : :
66 64 32 29 23 12 9
F B C A G D E
The next iterations give:
:
..............
: 105 :
: :
: ..........
: : 44 :
: : :
....... : .......
: : : 61 : : : 21 :
: : : : : : :
66 64 32 29 23 12 9
F B C A G D E
:
..............
: 105 :
: :
: ..........
: : 44 :
: : : :
....... ....... : .......
: 130 : : 61 : : : 21 :
: : : : : : :
66 64 32 29 23 12 9
F B C A G D E
:
....................
: 235 :
: :
: ..............
: : 105 :
: : :
: : ..........
: : : 44 :
: : : :
....... ....... : .......
: 130 : : 61 : : : 21 :
: : : : : : :
66 64 32 29 23 12 9
F B C A G D E
The result is known as a "Huffman tree". To obtain the Huffman code itself,
each branch of the tree is labeled with a 1 or 0. It doesn't matter how the
1s and 0s are assigned, though a consistent scheme obviously is easier to
deal with:
:
....................
:0 :1
: :
: ...............
: :0 :1
: : :
: : ...........
: : :0 :1
: : : :
....... ....... : .......
:0 :1 :0 :1 : :0 :1
: : : : : : :
F B C A G D E
Tracing down the tree gives the "Huffman codes", with the shortest codes
assigned to the characters with the greatest frequency:
F: 00
B: 01
C: 100
A: 101
G: 110
D: 1110
E: 1111
A Huffman coder will go through the source text file, convert each character
into its appropriate binary Huffman code, and dump the resulting bits to the
output file. The Huffman codes won't get mixed up in decoding. The best way
to see that this is so is to envision the decoder cycling through the tree
structure, guided by the encoded bits it reads, moving from top to bottom and
then back to the top. As long as bits constitute legitimate Huffman codes
and a bit doesn't get scrambled or lost, the decoder will never get lost
either.
* There is an alternate algorithm for generating these codes, known as "Shannon-Fano coding". In fact, it preceded Huffman coding and one of the first data compression schemes to be devised, back in the 1950s. It was the work of the well-known Claude Shannon, working with R.M. Fano. David Huffman published a paper in 1952 that modified it slightly to create Huffman coding.
Shannon-Fano coding achieves the same results as Huffman coding, but works
from the top down, instead of the bottom up. Using the same set of example
characters as above, we arrange them in order of frequency:
F B C A G D E
66 64 32 29 23 12 9
Now we divide the set into two parts, as close to equal as possible. The sum
of the frequencies of F and B is 130, while the sum of the rest of the
characters is 105, and we can't break the set more closely than that:
130 : 105
....................
: :
: :
F B C A G D E
66 64 32 29 23 12 9
We continue breaking down each branch until they can't be broken down
further. Breaking down the branch with the F and B is easy, since there's
only two characters there. For the other branch, the frequencies of C and A
add up to 61, while the frequencies of the rest of the characters in that
branch add up to 44, which is again as close to equal as we can get:
130 : 105
....................
: :
: 61 : 44
....... ...............
: : : :
: : : :
F B C A G D E
66 64 32 29 23 12 9
We can complete the tree using the same procedure, and then assign 0s and 1s
to the branches as before:
130 : 105
.....................
: :
: 61 : 44
: ...............
: : :
: : 23 : 21
....... ....... ...........
: : : : : :
: : : : : :
F B C A G D E
66 64 32 29 23 12 9
130 : 105
.....................
:0 :1
: 61 : 44
: ...............
: :0 :1
: : 23 : 21
: : ...........
: : :0 :1
: : : :
....... ....... : .......
:0 :1 :0 :1 : :0 :1
: : : : : : :
F B C A G D E
66 64 32 29 23 12 9
The codes that result are exactly the same as would be obtained with Huffman
coding.
* Of course, Huffman codes can be used on any type of data, such as bytes in a graphics file. They are not very effective if the data in the file is strongly random, since the frequencies of different characters or bytes would be close to the same, but such "random" data files are hard to compress by any technique since they lack any redundancy to be exploited.
Fax machines, which transmit a graphics image, use a combination of RLE and Huffman compression to achieve compression ratios of about 10:1. Huffman coding techniques are also used in conjunction with other compression schemes to improve compression ratios.
* Since the average frequencies of letters in every language are distinctive and well known, a fact which is incidentally of great importance to codebreakers, it is possible to devise a Huffman code table for text files written in a specific language. A receiver can use the same "default" or "canonical" Huffman codes to decompress the message.
However, average frequencies are just that, averages, and individual messages will differ from that average to a greater or lesser degree. That means that better compressions can be obtained by analyzing a specific file and building a "custom" Huffman code for that file. This requires that the custom table of Huffman codes be output to the file along with the compressed data stream, to allow the decoder to uncompress the data.
Another variation on this idea is known as "adaptive" Huffman coding. In this approach, both the coder and decoder start with a predefined Huffman table, but both keep statistics on the frequency of different characters in the compressed data stream, and modify the Huffman codes on a continuous basis by adjusting the Huffman tree as needed. Since both the coder and decoder use the same modification algorithm, the decoder will adjust its Huffman table in the same way as the coder, and the two will remain in sync.
* Huffman coding looks pretty slick, and it is, but there's a way to improve on it, known as "arithmetic coding". The idea is subtle and best explained by example.
Suppose we have a message that only contains the characters A, B, and C, with
the following frequencies, expressed as fractions:
A: 0.5
B: 0.2
C: 0.3
To show how arithmetic compression works, we first set up a table, listing
characters with their probabilities along with the cumulative sum of those
probabilities. The cumulative sum defines "intervals", ranging from the
bottom value to less than, but not equal to, the top value. The order in
which characters are listed in the table does not seem to be important,
except to the extent that both the coder and decoder have to know what the
order is.
letter probability interval
________________________________
C: 0.3 0.0 : 0.3
B: 0.2 0.3 : 0.5
A: 0.5 0.5 : 1.0
________________________________
Now each character can be coded by the shortest binary fraction whose value
falls in the character's probability interval:
letter probability interval binary fraction
_______________________________________________________
C: 0.3 0.0 : 0.3 0
B: 0.2 0.3 : 0.5 0.011 = 3/8 = 0.375
A: 0.5 0.5 : 1.0 0.1 = 1/2 = 0.5
_______________________________________________________
This shows how single characters can be assigned minimum-length binary codes.
However, arithmetic coding doesn't stop there and simply translate the
individual characters in a message as these binary codes. It takes a subtler
approach, assigning binary fractions to complete messages.
To start, let's consider sending messages consisting of all possible
permutations of two of these three characters. We determine the probability
of the two-character strings by multiplying the probabilities of the two
characters, and then set up a series of intervals using those probabilities.
string probability interval binary fraction
_____________________________________________________________
CC: 0.09 0.00 : 0.09 0.0001 = 1/16 = 0.0625
CB: 0.06 0.09 : 0.15 0.001 = 1/8 = 0.125
CA: 0.15 0.15 : 0.30 0.01 = 1/4 = 0.25
BC: 0.06 0.30 : 0.36 0.0101 = 5/16 = 0.3125
BB: 0.04 0.36 : 0.40 0.011 = 3/8 = 0.375
BA: 0.10 0.40 : 0.50 0.0111 = 7/16 = 0.4375
AC: 0.15 0.50 : 0.65 0.1 = 1/2 = 0.5
AB: 0.10 0.65 : 0.75 0.1011 = 11/16 = 0.6875
AA: 0.25 0.75 : 1.00 0.11 = 3/4 = 0.75
_____________________________________________________________
The higher the probability of the string, in general the shorter the binary
fraction needed to represent it.
Let's build a similar table for three characters now:
string probability interval binary fraction
______________________________________________________________________
CCC 0.027 0.000 : 0.027 0.000001 = 1/64 = 0.015625
CCB 0.018 0.027 : 0.045 0.00001 = 1/32 = 0.03125
CCA 0.045 0.045 : 0.090 0.0001 = 1/16 = 0.0625
CBC 0.018 0.090 : 0.108 0.00011 = 3/32 = 0.09375
CBB 0.012 0.108 : 0.120 0.000111 = 7/64 = 0.109375
CBA 0.03 0.120 : 0.150 0.001 = 1/8 = 0.125
CAC 0.045 0.150 : 0.195 0.0011 = 3/16 = 0.1875
CAB 0.03 0.195 : 0.225 0.00111 = 7/32 = 0.21875
CAA 0.075 0.225 : 0.300 0.01 = 1/4 = 0.25
BCC 0.018 0.300 : 0.318 0.0101 = 5/16 = 0.3125
BCB 0.012 0.318 : 0.330 0.010101 = 21/64 = 0.328125
BCA 0.03 0.330 : 0.360 0.01011 = 11/32 = 0.34375
BBC 0.012 0.360 : 0.372 0.0101111 = 47/128 = 0.3671875
BBB 0.008 0.372 : 0.380 0.011 = 3/8 = 0.375
BBA 0.02 0.380 : 0.400 0.011001 = 25/64 = 0.390625
BAC 0.03 0.400 : 0.430 0.01101 = 13/32 = 0.40625
BAB 0.02 0.430 : 0.450 0.0111 = 7/16 = 0.4375
BAA 0.05 0.450 : 0.500 0.01111 = 15/32 = 0.46875
ACC 0.045 0.500 : 0.545 0.1 = 1/2 = 0.5
ACB 0.03 0.545 : 0.575 0.1001 = 9/16 = 0.5625
ACA 0.075 0.575 : 0.650 0.101 = 5/8 = 0.625
ABC 0.03 0.650 : 0.680 0.10101 = 21/32 = 0.65625
ABB 0.02 0.680 : 0.700 0.1011 = 11/16 = 0.6875
ABA 0.05 0.700 : 0.750 0.10111 = 23/32 = 0.71875
AAC 0.075 0.750 : 0.825 0.11 = 3/4 = 0.75
AAB 0.05 0.825 : 0.875 0.11011 = 27/32 = 0.84375
AAA 0.125 0.875 : 1.000 0.111 = 7/8 = 0.875
______________________________________________________________________
Obviously this same procedure can be followed for more characters, resulting
in a longer binary fractional value. What arithmetic coding does is find the
probability value of an entire message, and arrange it as part of a numerical
order that allows its unique identification.
* Let's stop here and send one of the binary strings defined in the table above to a decoder. We'll arbitrarily select the binary string with the decimal value of 0.21875 from the table above.
This value was obtained using the probability values and intervals defined
earlier:
string probability interval
________________________________
C: 0.3 0.0 : 0.3
B: 0.2 0.3 : 0.5
A: 0.5 0.5 : 1.0
________________________________
The value 0.21875 clearly falls into the interval for "C", so "C" must be the
first character. We can then "zoom in" on the characters that follow the "C"
by subtracting the bottom value of the interval for "C", which happens to be
0, and dividing the result by the width of the probability interval for "C",
which is 0.3:
(0.21875 - 0) / 0.3 = 0.72917
This is a simple shift and scaling operation.
The result falls into the probability interval for "A", and so the second
character must be "A". We can then zoom in on the next character by the same
approach as before, subtracting the bottom value of the interval for "A",
which is 0.5, and dividing the result by the width of the probability
interval for "A", which is also 0.5:
(0.72917 - 0.5) / 0.5 = 0.4583
This clearly falls into the probability interval for "B", and so the string
has been correctly uncompressed to "CAB", which is the correct answer.
Unfortunately, this leaves behind a remainder that can be decoded into an indefinitely long string of bogus characters. This is an artifact of using decimal floating-point math to perform the calculations in this example. In practice, arithmetic coding is based on binary fixed-point math, which avoids this problem.
One other problem is the fact that the binary fraction that is output by the arithmetic coder is of indefinite length, and the decoder has no idea of where the string ends if it isn't told. In practice, a length header can be sent to indicate how long the fraction is, or an end-of-transmission symbol of some sort can be used to tell the decoder where the end of the fraction is.
As with Huffman coding, arithmetic coding can also be performed using an adaptive algorithm, with the coder and decoder starting with a predetermined character probability interval table, tallying characters for their actual frequencies as they are encoded or decoded, and then adjusting the probability interval table accordingly.
* The cool thing about arithmetic coding is that by "smushing" a complete message into a single probability interval value, individual characters can be encoded with the equivalent of fractional values of bits. Huffman coding requires an integer number of bits for each character, and so this is one of the reasons that arithmetic coding is in general more efficient than Huffman coding.
Another reason for the efficiency of arithmetic coding is that more "probable" messages compress to shorter binary strings. By the way, at the risk of belaboring the obvious, the probability only depends on the numbers of different characters in the uncompressed file; their order is not important, since the result of a multiplication does not depend on the order of its factors.
The problem with arithmetic coding is that it is very computation-intensive and so it is slow. Huffman and arithmetic coding are sometimes referred to as forms of "statistical coding" or "entropy coding". The term "VLW (Variable Length Word)" also seems to pop up on occasion when discussing such coding techniques, though its exact definition seems a bit unclear.
* Good as they are, Huffman and arithmetic coding are not perfect for encoding text because they don't capture the higher-order relationships between words and phrases. There is a simple, clever, and effective approach to compressing text known as "LZ-77", which uses the redundant nature of text to provide compression. This technique was invented by two Israeli computer scientists, Abraham Lempel and Jacob Ziv, in 1977.
LZ-77 exploits the fact that words and phrases within a text file are likely to be repeated. When they do repeat, they can be encoded as a pointer to an earlier occurrence, with the pointer accompanied by the number of characters to be matched.
Pointers and uncompressed characters are distinguished by a leading flag bit, with a "0" indicating a pointer and a "1" indicating an uncompressed character. This means that uncompressed characters are extended from 8 to 9 bits, working against compression a little.
Key to the operation of LZ-77 is a sliding history buffer, also known as a "sliding window", which stores the text most recently transmitted. When the buffer fills up, its oldest contents are discarded. The size of the buffer is important. If it is too small, finding string matches will be less likely. If it is too large, the pointers will be larger, working against compression.
For an example, consider the phrase:
the_rain_in_Spain_falls_mainly_in_the_plain
-- where the underscores ("_") indicate spaces. This uncompressed message
is 43 bytes, or 344 bits, long.
At first, LZ-77 simply outputs uncompressed characters, since there are no
previous occurrences of any strings to refer back to. In our example, these
characters will not be compressed:
the_rain_
The next chunk of the message:
in_
-- has occurred earlier in the message, and can be represented as a pointer
back to that earlier text, along with a length field. This gives:
the_rain_<3,3>
-- where the pointer syntax means "look back three characters and take three
characters from that point." There are two different binary formats for the
pointer:
As noted, a flag bit with a value of 0 indicates a pointer. This is followed
by a second flag bit giving the size of the pointer, with a 0 indicating an
8-bit pointer, and a 1 indicating a 12-bit pointer. So, in binary, the
pointer <3,3> would look like this:
00 00000011 0011
The first two bits are the flag bits, indicating a pointer that is 8 bits
long. The next 8 bits are the pointer value, while the last four bits are
the length value.
After this comes:
Sp
-- which has to be output uncompressed:
the_rain_<3,3>Sp
However, the characters "ain_" have already been sent, so they are encoded
with a pointer:
the_rain_<3,3>Sp<9,4>
Notice here, once again at the risk of belaboring the obvious, that the
pointers refer to offsets in the uncompressed message. As the decoder
receives the compressed data, it uncompresses it, so it has access to the
parts of the uncompressed message that the pointers reference.
The characters "falls_m" are output uncompressed, but "ain" has been used
before in "rain" and "Spain", so once again it is encoded with a pointer:
the_rain_<3,3>Sp<9,4>falls_m<11,3>
Notice that this refers back to the "ain" in "Spain", and not the earlier
"rain". This ensures a smaller pointer.
The characters "ly" are output uncompressed, but "in_" and "the_" were output
earlier, and so they are sent as pointers:
the_rain_<3,3>Sp<9,4>falls_m<11,3>ly_<16,3><34,4>
Finally, the characters "pl" are output uncompressed, followed by another
pointer to "ain". Our original message:
the_rain_in_Spain_falls_mainly_in_the_plain
-- has now been compressed into this form:
the_rain_<3,3>Sp<9,4>falls_m<11,3>ly_<16,3><34,4>pl<15,3>
This gives 23 uncompressed characters at 9 bits apiece, plus six 14-bit
pointers, for a total of 291 bits as compared to the uncompressed text of 344
bits. This is not bad compression for such a short message, though this
message is repetitive by design, and of course compression gets better as the
buffer fills up, allowing more matches.
LZ-77 will typically compress text to a third or less of its original size. The hardest part to implement is an efficient search for matches in the buffer. Implementations use binary trees or hash tables to ensure a fast match.
There are a several variations on the LZ-77, the best known being "LZSS", which was published by Storer and Symanski in 1982. The differences between the two are unclear from the sources I have access to.
More drastic modifications of LZ-77 include a second level of compression on the output of the LZ-77 coder. "DEFLATE", for example, performs the second level of compression using Huffman coding and is used in the "PKZIP" archiver. "LZH" uses much the same approach and is used in the "LHARC" archiver, which was once in common use in the West but now most restricted to Japan, where it originated. The "ZIP" algorithm, which is now the standard archiver and is available in a number of implementations, uses Shannon-Fano coding for the second level of compression.
* LZ-77 is an example of what is known as "substitutional coding". There are other schemes in this class of coding algorithms. Lempel and Ziv came up with an improved scheme in 1978, appropriately named "LZ-78", and it was refined by a Mr. Terry Welch in 1984, making it "LZW".
As illustrated in the previous section, LZ-77 uses pointers to previous words or parts of words in a file to obtain compression. LZW takes that scheme one step further, actually constructing a "dictionary" of words or parts of words in a message, and then using pointers to the words in the dictionary.
Let's go back to the example message used in the previous section:
the_rain_in_Spain_falls_mainly_in_the_plain
The LZW algorithm stores strings in a "dictionary" with entries for 4,096
variable-length strings. The first 255 entries are used to contain the
values for individual bytes, so the actual first string index is 256. As the
string is compressed, the dictionary is built up to contain every possible
string combination that can be obtained from the message, starting with two
characters, then three characters, and so on.
For example, we scan through the message to build up dictionary entries as
follows:
256 -> th < th > e_rain_in_Spain_falls_mainly_in_the_plain
257 -> he t < he > _rain_in_Spain_falls_mainly_in_the_plain
258 -> e_ th < e_ > rain_in_Spain_falls_mainly_in_the_plain
259 -> _r the < _r > ain_in_Spain_falls_mainly_in_the_plain
260 -> ra the_ < ra > in_in_Spain_falls_mainly_in_the_plain
261 -> ai the_r < ai > n_in_Spain_falls_mainly_in_the_plain
262 -> in the_ra < in > _in_Spain_falls_mainly_in_the_plain
263 -> n_ the_rai < n_ > in_Spain_falls_mainly_in_the_plain
264 -> _i the_rain < _i > n_Spain_falls_mainly_in_the_plain
The next two-character string in the message is "in", but this has already
been included in the dictionary in entry 262. This means we now set up the
three-character string "in_" as the next dictionary entry, and then go
back to adding two-character strings:
265 -> in_ the_rain_ < in_ > Spain_falls_mainly_in_the_plain
266 -> _S the_rain_in < _S > pain_falls_mainly_in_the_plain
267 -> Sp the_rain_in_ < Sp > ain_falls_mainly_in_the_plain
268 -> pa the_rain_in_S < pa > in_falls_mainly_in_the_plain
The next two-character string is "ai", but that's already in the
dictionary at entry 261, so we now add an entry for the three-character
string "ain":
269 -> ain the_rain_in_Sp < ain > _falls_mainly_in_the_plain
Since "n_" is already stored in dictionary entry 263, we now add an entry
for "n_f":
270 -> n_f the_rain_in_Spai < n_f > alls_mainly_in_the_plain
271 -> fa the_rain_in_Spain_ < fa > lls_mainly_in_the_plain
272 -> al the_rain_in_Spain_f < al > ls_mainly_in_the_plain
273 -> ll the_rain_in_Spain_fa < ll > s_mainly_in_the_plain
274 -> ls the_rain_in_Spain_fal < ls > _mainly_in_the_plain
275 -> s_ the_rain_in_Spain_fall < s_ > mainly_in_the_plain
276 -> _m the_rain_in_Spain_falls < _m > ainly_in_the_plain
277 -> ma the_rain_in_Spain_falls_ < ma > inly_in_the_plain
Since "ain" is already stored in entry 269, we add an entry for the
four-character string "ainl":
278 -> ainl the_rain_in_Spain_falls_m < ainl > y_in_the_plain
279 -> ly the_rain_in_Spain_falls_main < ly > _in_the_plain
280 -> y_ the_rain_in_Spain_falls_mainl < y_ > in_the_plain
Since the string "_i" is already stored in entry 264, we add an entry for
the string "_in":
281 -> _in the_rain_in_Spain_falls_mainly < _in > _the_plain
Since "n_" is already stored in dictionary entry 263, we add an entry for
"n_t":
282 -> n_t the_rain_in_Spain_falls_mainly_i < n_t > he_plain
Since "th" is already stored in dictionary entry 256, we add an entry for
"the":
283 -> the the_rain_in_Spain_falls_mainly_in_ < the > _plain
Since "e_" is already stored in dictionary entry 258, we add an entry for
"e_p":
284 -> e_p the_rain_in_Spain_falls_mainly_in_th < e_p > lain
285 -> pl the_rain_in_Spain_falls_mainly_in_the_ < pl > ain
286 -> la the_rain_in_Spain_falls_mainly_in_the_p < la > in
The remaining characters form a string already contained in entry 269, so
there is no need to put it in the dictionary.
We now have a dictionary containing the following strings:
256 -> th
257 -> he
258 -> e_
259 -> _r
260 -> ra
261 -> ai
262 -> in
263 -> n_
264 -> _i
265 -> in_
266 -> _S
267 -> Sp
268 -> pa
269 -> ain
270 -> n_f
271 -> fa
272 -> al
273 -> ll
274 -> ls
275 -> s_
276 -> _m
277 -> ma
278 -> ainl
279 -> ly
280 -> y_
281 -> _in
282 -> n_t
283 -> the
284 -> e_p
285 -> pl
286 -> la
Please remember the dictionary is a means to an end, not an end in itself.
The LZW coder simply uses it as a tool to generate a compressed output. The
coder does not output the dictionary to the compressed output file; the
decoder doesn't need it. While the coder is building up the dictionary, it
sends characters to the compressed data output until it hits a string that's
in the dictionary. It outputs an index into the dictionary for that string,
and then continues output of characters until it hits another string in the
dictionary, causing it to output another index, and so on. That means that
the compressed output for our example message looks like this:
the_rain_<262>_Sp<261><263>falls_m<269>ly<264><263><256><258>pl<269>
The decoder constructs the dictionary as it reads and uncompresses the
compressed data, building up dictionary entries from the uncompressed
characters and dictionary entries it has already established.
* One puzzling thing about LZW is why the first 255 entries in the 4K buffer are initialized to single-character strings. There would be no point in setting pointers to single characters, since the pointers would be longer than the characters, and in practice that's not done anyway. I speculate that the single characters are put in the buffer just to simplify searching the buffer.
As this example of compressed output shows, as the message is compressed, the dictionary grows more complete, and the number of "hits" against it increases. Longer strings are also stored in the dictionary, and on the average the pointers substitute for longer strings. This means that up to a limit, the longer the message, the better the compression.
This limit is imposed in the original LZW implementation by the fact that once the 4K dictionary is complete, no more strings can be added. Defining a larger dictionary of course results in greater string capacity, but also longer pointers, reducing compression for messages that don't fill up the dictionary.
A variant of LZW known as "LZC" is used in the old UN*X "compress" data compression program. LZC uses variable length pointers up to a certain maximum size. It also monitors the compression of the output stream, and if the compression ratio goes down, it flushes the dictionary and rebuilds it, on the assumption that the new dictionary will be better "tuned" to the current text.
Another refinement of LZW keep track of string "hits" for each dictionary entry, and overwrites "least recently used" entries when the dictionary fills up. Refinements of LZW provide the core of GIF and TIFF image compression as well.
* LZW is also used in some modem communication schemes, such as the "V.42bis" protocol. V.42bis has an interesting flexibility. Since not all data files compress well given any particularly encoding algorithm, the V.42bis protocol monitors the data to see how well it compresses. It sends it "as is" if it compresses poorly, but will switch to LZW compression using an escape code if it compresses well. Both the transmitter and receiver continue to build up an LZW dictionary even while the transmitter is sending the file uncompressed and switch over transparently if the escape character is sent.
One of the interesting idiosyncracies of V.42bis is that it uses a "rolling"
escape character. The escape character starts out as a "null" (0) byte, but
is incremented by 51 every time it is sent, "wrapping around" when it exceeds
256:
0 51 102 153 204 255 --> 50 101 152 203 254 --> 49 100 ...
If a data byte that matches the escape character is sent, it is sent twice.
The escape character is incremented to ensure that the protocol doesn't get
bogged down if it runs into an extended string of bytes that match the escape
character.