The topic of this project was Byte Pair Encoding (BPE). The BPE method was taken from Philip Gage's article 'A New Algorithm for Data Compression', which appeared in the February 1994 edition of 'The C Users Journal'.
An excerpt from the article describes BPE: "...a simple general-purpose data compression algorithm, called Byte Pair Encoding (BPE), provides almost as much compression as the popular Lempel, Ziv, and Welch (LZW) model. BPE's compression speed is somewhat slower than LZW's, but BPE's expansion is faster".
The algorithm compresses data by finding the most frequently occurring pairs of adjacent bytes in the data and replacing all instances of the pair with a byte that was not in the original data. The algorithm repeats this process until no further compression is possible, either because there are no more frequently occurring pairs or there are no more unused bytes to represent pairs. The algorithm writes out the table of pair substitutions before the packed data.
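The process described above can be sketched in a few lines of code. The following is an illustrative Python rendering, not Gage's C implementation: it operates on a whole buffer at once rather than on fixed-size blocks, but it shows the same loop of counting pairs, picking the most frequent one, and substituting an unused byte value.

```python
from collections import Counter

def bpe_compress(data: bytes):
    """Repeatedly replace the most frequent adjacent byte pair with a
    byte value unused in the data, recording each substitution.
    Returns the packed data and the pair table (code -> pair)."""
    buf = bytearray(data)
    table = {}
    while True:
        counts = Counter(zip(buf, buf[1:]))
        if not counts:
            break
        pair, n = counts.most_common(1)[0]
        if n < 2:
            break  # no pair repeats: no further compression possible
        used = set(buf) | set(table)
        unused = next((b for b in range(256) if b not in used), None)
        if unused is None:
            break  # no spare byte value left to stand for a pair
        out = bytearray()
        i = 0
        while i < len(buf):
            if i + 1 < len(buf) and (buf[i], buf[i + 1]) == pair:
                out.append(unused)
                i += 2
            else:
                out.append(buf[i])
                i += 1
        buf = out
        table[unused] = pair
    return bytes(buf), table

def bpe_expand(packed: bytes, table) -> bytes:
    """Invert the substitutions; pair codes may nest arbitrarily deep."""
    out = bytearray()
    stack = list(reversed(packed))
    while stack:
        b = stack.pop()
        if b in table:
            first, second = table[b]
            stack.extend((second, first))  # first byte ends up on top
        else:
            out.append(b)
    return bytes(out)
```

Note how expansion is simply a table lookup with a stack, which is why BPE's expansion is faster than its compression.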
This algorithm is fundamentally multi-pass and requires that all the data be stored in memory. This requirement causes two potential problems: the algorithm cannot handle streams, and some files may be too large to fit in memory. Also, large binary files might contain no unused character to represent pair substitutions. Buffering small blocks of data and compressing each block separately solves these problems.
The algorithm reads and stores data until the buffer is full or only a minimum number of unused characters remain. The algorithm then compresses the buffer and outputs the compressed buffer along with its pair table. Using a different pair table for each data block also provides local adaptation to varying data and improves overall compression.
Once the algorithm finds the most frequently occurring pair, it must replace the pair throughout the data buffer with an unused character. The algorithm performs this replacement in place with a single buffer. As it replaces each pair, the algorithm updates the hash table's pair counts. This method of updating the hash table is faster than rebuilding the entire hash table after each pair substitution.
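The incremental count update can be sketched as follows. This is an illustrative Python version using a Counter in place of Gage's open hash table: as each pair is overwritten, the count of the destroyed pair is decremented and the counts of the new pairs formed with the left and right neighbours are patched, so the table never has to be rebuilt.

```python
from collections import Counter

def replace_pair(buf: bytearray, counts: Counter, pair, code) -> None:
    """Replace every occurrence of `pair` in `buf` with `code`, compacting
    the buffer in place while patching the affected pair counts, instead
    of recounting the whole buffer afterwards."""
    a, b = pair
    w = r = 0                                  # write and read positions
    while r < len(buf):
        if r + 1 < len(buf) and buf[r] == a and buf[r + 1] == b:
            counts[pair] -= 1                  # this pair disappears
            if w > 0:                          # pair with the left neighbour changes
                counts[(buf[w - 1], a)] -= 1
                counts[(buf[w - 1], code)] += 1
            if r + 2 < len(buf):               # pair with the right neighbour changes
                counts[(b, buf[r + 2])] -= 1
                counts[(code, buf[r + 2])] += 1
            buf[w] = code
            r += 2
        else:
            buf[w] = buf[r]
            r += 1
        w += 1
    del buf[w:]                                # drop the compacted tail
```

After the call, the surviving positive entries in `counts` match a fresh recount of the shortened buffer, at a fraction of the cost.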
One significant advantage of the BPE algorithm is that compression never increases the data size. This guarantee makes BPE suitable for real-time applications where the type of data to be compressed may be unknown. If no compression can be performed, BPE passes the data through unchanged except for the addition of a few header bytes to each block of data.
Some algorithms, including LZW, can greatly inflate the size of certain data sets, such as randomized data or precompressed fields. LZW compression adapts linearly to frequently occurring patterns, building up strings one character at a time. The BPE algorithm adapts exponentially to patterns, since both bytes in a pair can represent previously defined pair codes. The previously defined pair codes can themselves contain nested codes and can expand into long strings.
This difference between LZW and BPE provides better compression for BPE in some cases. For example, under BPE a run of 1024 identical bytes is reduced to a single byte after only ten pair substitutions, since each substitution halves the run (2^10 = 1024). This nesting of pair codes is the real power of the algorithm.
The following example illustrates this process:
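Gage's original worked example is not reproduced here; as an illustrative stand-in, the following Python snippet shows ten nested substitutions collapsing a run of 1024 identical bytes, and the recursive expansion that recovers them (byte values 1 through 10 are assumed unused in the data):

```python
# A run of 1024 identical bytes: each substitution round defines a new
# code for a pair of the previous round's codes, halving the data.
data = b"a" * 1024
table = {}
rounds = 0
for code in range(1, 256):                   # byte values assumed unused
    if len(data) == 1:
        break
    table[code] = (data[0], data[1])         # the only pair present
    data = bytes([code]) * (len(data) // 2)  # 1024 -> 512 -> ... -> 1
    rounds += 1
print(rounds, len(data))                     # ten rounds leave one byte

def expand(b: int) -> bytes:
    """Recursively expand nested pair codes back to the original bytes."""
    if b in table:
        x, y = table[b]
        return expand(x) + expand(y)
    return bytes([b])

assert expand(data[0]) == b"a" * 1024
```

The final byte expands through ten levels of nested pair codes, which is exactly the exponential adaptation described above.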
The focus in using the Byte Pair Encoding process was to achieve the best compression possible by varying the program's input parameters. The four input parameters were BLOCKSIZE, HASHSIZE, MAXCHARS and THRESHOLD. Of these four, only the first three were varied.
HASHSIZE is the size of the hash table and must be at least two; it should also not be much smaller than BLOCKSIZE. HASHSIZE values were varied between 2^12 (4096) and 2^14 (16384). The author indicates that a smaller HASHSIZE is better for binary files and, conversely, a larger one is better for text files.
BLOCKSIZE values were set 0%, 5%, 10%, 15% and 20% greater than HASHSIZE, so BLOCKSIZE varied along with HASHSIZE.
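The parameter grid described above can be enumerated as follows. This is a sketch of the test matrix only: the exact intermediate HASHSIZE values used between 2^12 and 2^14 are not stated in this report, so powers of two are assumed here.

```python
from itertools import product

# Assumed HASHSIZE steps (the report gives only the 2^12..2^14 endpoints).
hash_sizes = [2**12, 2**13, 2**14]
block_pcts = [0, 5, 10, 15, 20]      # BLOCKSIZE is 0-20% greater than HASHSIZE
max_chars = range(130, 181, 5)       # MAXCHARS range chosen for the main run

grid = [
    {"HASHSIZE": h,
     "BLOCKSIZE": h + h * p // 100,
     "MAXCHARS": m,
     "THRESHOLD": 3}                 # THRESHOLD fixed at 3 throughout
    for h, p, m in product(hash_sizes, block_pcts, max_chars)
]
print(len(grid))                     # 3 * 5 * 11 = 165 combinations
```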
MAXCHARS is a parameter that is largely ignored in the article. It governs the character set per block and appears to have been set arbitrarily (possibly reflecting an incomplete understanding of Mr Gage's process). Limited testing of the MAXCHARS parameter, over a considered range of 30 to 250, led to a range of 130 to 180 with a step of 5 being chosen for the main run.
The THRESHOLD parameter affects only compression speed. As Philip Gage's article states: "... by changing THRESHOLD from 3 to 10. This change caused the program to skip pairs with less than 10 occurrences. Since the program compresses most frequently occurring pairs first, skipping low-frequency pairs near the end of block processing has little effect on the amount of compression but can significantly improve speed. This change reduced the compression time from 55 to 30 seconds." (The timings are from Philip Gage's own tests.) As this project was not concerned with speed, the THRESHOLD value was set to 3.
In the interests of running a credible project with reproducible results, the well-known Calgary Corpus was chosen as the compression target. This corpus of 21 files totaling 3,255,838 bytes is a standard benchmark for compressor performance. Jeff Gilchrist maintains a list of compression comparisons on the Calgary Corpus, and the zipped corpus is linked from his site. Mr Gage, in comparing BPE to LZW (a comparison not reproduced here), gives no indication of which file he compresses.