School of Computer Science and Software Engineering Monash University
Bachelor of Science (Computer Science) with Honours
Clayton Campus
Honours Thesis - 2000
"Image Compression via Context Coding of Error Bitplanes"
Minh-Binh Tran
ID: 12275492
Supervisor: Dr R. T. Worley
Declaration
I Minh-Binh Tran declare that this thesis is my own work and has not been submitted in any form for another degree or diploma at any university or other institute of tertiary education. Information derived from the published and unpublished work of others has been acknowledged in the text and a list of references is given in the bibliography
----------------------
Minh-Binh Tran
Acknowledgements
I would like to thank my supervisor for his help and guidance.
Abstract
The aim of this research is to investigate whether redundancy in error images can be further reduced by extending current methods, involving context coding of error bitplanes. Context coding techniques under investigation involve context growing, which reduces the effects of using a large number of initially inaccurate context probabilities. Fading memory, which allows the context probabilities to adjust according to changes in the data. Finally, increasing probability convergence, which reduces the entropy only on biased bitplanes. In addition, in an attempt to utilise the already encoded / decoded error bitplanes, the research develops 1s dropping and 3d contexts, which are then compared against the traditional 2d contexts and proven to be superior. The investigation concludes by presenting a final context coding scheme for a particular 3d template with 10 neighbours.
The algorithm developed is a combination of adaptive differential pulse code modulation (ADPCM) [1], context coding [2] and arithmetic coding [3]. The ADPCM algorithm is used as the predictor, to produce an error image. The error image is then converted to sign/magnitude/bitplanes (SMB). Context coding, with the addition of various improvements already mentioned, is then applied to the SMBs. Thus, producing probabilities that can then be passed to an Arithmetic coder to encode the data.
Table of Contents
List of Tables and Figures
1. Introduction 8
2. Background 9
2.1 Decorrelation Method 9
2.2 Context Coding 11
2.3 Encoding 13
3. Description of the Problem 14
4. Alternative Solutions: 14
5. System Design and Implementation 15
5.1 Adaptive Differential Pulse Code Modulation: 15
5.2 Error Image to Sign Magnitude Bitplanes 16
5.2.1 Sign Magnitude Representation 16
5.2.2 Sign Magnitude Bitplanes 17
5.3 Context Coding 17
5.3.1 Two Dimensional Contexts 18
5.3.2 Context growing 19
5.3.3 Fading memory 21
5.3.4 Increasing Convergence 21
5.3.5 Dropping Ones 22
5.3.6 Three Dimensional Context 22
5.4 Entropy Estimation 23
6. Results and Discussion: 24
6.1 Effect of Context Growing 25
6.2 Effect of Fading memory 26
6.3 Effect of Increasing Convergence 27
6.4 Effect of dropping 1s 29
6.5 Close Neighbours vs. Distant Neighbours 30
6.6 Analysing the Effectiveness of 3d Neighbours 31
6.7 Optimal Context Model For 10 Neighbours 32
7. Conclusion 33
8. Bibliography 35
List of Tables and Figures
Fig. 1. The 10 neighbour context used by ADPCM, to find s(a), s(b), s(c) and s(d), where a, b, c and d represent the lines in all 4 directions.
Table 1: An example of a variant of sign magnitude representation
Fig. 2. shows a 2d template with 10 neighbours, where the letter represent the members of the context and the ? is the binary event to be encoded.
Fig. 3. A hierarchy of templates with 0 neighbours up to n neighbours. The number of binary numbers in the squares represents the number of neighbours in the template. In addition, the actual sequence of binary numbers represents a particular context pattern
Fig. 4. A 3d Context with neighbours on the current bitplane being encoded / decoded and neighbours on previously encoded / decoded bitplanes, situated above areas not possible with a 2d contexts.
Fig. 5. A diagram of the template 5 and 10. Both diagrams show the order in which the contexts are grown in.
Table 3: Compares the entropy between a 2d context with 5 neighbours, with no context growing "2d_fixed" and a 2d context with context growing "2d_grow".
Table 4: Compares the entropy between a 2d context with 10 neighbours, with no context growing "2d_fixed" and a 2d context with context growing "2d_grow".
Table 5: Estimated bits required using when using 10 neighbours. In addition, bits saved, shows the difference between using "2d_fixed" and "2d_grow" contexts.
Table 6: Compares the entropy between two 2d contexts with 5 neighbours, one with no fading memory "2d_g" and the other with fading memory "2d_fade".
Table 7: Compares the entropy between two 2d contexts with 10 neighbours, one with no fading memory "2d_g" and the other with fading memory "2d_fade".
Table 8: Compares the entropy between two 2d contexts with 5 neighbours. "2d_gf" does not implement convergence increase while "2d_conv" does.
Table 9: Compares the entropy between two 2d contexts with 10 neighbours. "2d_gf" does not implement convergence increase while "2d_conv" does.
Table 10: Estimated bits required using when using 10 neighbours. In addition, bits saved, shows the difference between using "2d_gf" and "2d_conv" contexts.
Table 11: Compares the entropy between three 2d contexts with 5 neighbours. "2d_gf" does not implement convergence increase while "2d_conv" does and "2d_c2" is an improved version of "2d_conv".
Table 12: Compares the entropy between three 2d contexts with 10 neighbours. "2d_gf" does not implement convergence increase while "2d_conv" does and "2d_c2" is an improved version of "2d_conv".
Table 13: Compares the entropy between two 2d contexts with 5 neighbours. "2d_gfc2" does not implement dropping 1s while "2d_drop" does.
Table 14: Compares the entropy between two 2d contexts with 10 neighbours. "2d_gfc2" does not implement dropping 1s while "2d_drop" does.
Fig. 6. A diagram of the template 5 and 10 for the 2d context and the 3d context. All 4 diagrams show the order in which the contexts are grown in.
Table 15: Compares the entropy between a 2d context and a 3d context with 5 neighbours.
Table 16: Compares the entropy between a 2d context and a 3d context with 10 neighbours.
Fig. 7. Three examples from a 10 step process, where each neighbour on the current bitplane are moved to the upper bitplane, in a position as close to the current position as possible.
Fig. 8. The entropy of the "3d_gfc2d" as all its neighbours begin on the current bitplane and are then moved to the upper bitplane.
Fig. 9. A diagram of the optimal template for a context involving 10 neighbours.
Table 17: For template 10, a comparison of the entropy for the full_count, the basic 2d context with no additions and the final 3d context utilising all the context techniques mentioned.
1. Introduction
Communication is a key focus in our current society. The Internet allows people to communicate ideas and learn from one another. It is the flow of information from source to destination. Whether that is from one side of the world to the other or to someone just across the room. Information can come in many forms, for example, visual, sound or text. The faster this information can be sent the more of it the destination can receive in the same period. One way to increase the amount of data sent, in the same period, is to compress the information. Compression techniques decrease the amount of information needed to express the same idea.
Consider the information being in the form of a grey scale image. There are two types of compression techniques to decrease the image size, which is the amount of information required to store the image. One technique is lossy compression, where some information is discarded and an approximate image is stored instead. The discarded information is chosen as to not greatly affect the images' appearance. However, some of the original image is lost and cannot be perfectly recovered.
The other technique is lossless compression, where the entire image is kept in a smaller form and the exact image can be reproduced. This technique is preferred over lossy compression in situations where the photos are too expensive to throw away information, such as satellite photos or when the image may contain vital information that cannot be thrown away, e.g. medical images.
A lossless compression scheme can be implemented by using ADPCM as the predictor, to produce an error image. The error image can then be converted to SMBs. Context coding, can be applied to the SMBs. Thus, producing probabilities that can then be passed to an arithmetic coder to encode the data. The problem with this procedure is 'how to provide accurate probabilities for the data', because these probabilities are passed to the arithmetic coder that will determine the final image size. Hence, the aim of this research is to investigate whether probabilities that are more accurate can be maintained in context coding.
The context coding techniques investigated are context growing, fading memory and increasing probability convergence. In addition, this research develops 1s dropping and 3d contexts which utilise the already encoded / decoded error bitplanes in order to maintain more accurate probabilities for the data. On completion, a final context coding scheme is produced for a particular 3d template with 10 neighbours.
The thesis is organised into sections. Section 1 was introduction, section 2 contains the background information on image compression. Section 3 contains the description of the problem. Section 4 considers alternative solutions. Section 5 describes the system design and implementation. This section is broken up into 4 subsections, where 5.1 explains how ADPCM works. 5.2 converts an error image into SMBs. 5.3 introduces context coding with various subsections describing each context coding technique. 5.4 describes the entropy estimation procedure. Section 6 contains the results and discussion, which is broken up into 7 subsections each describing the effects of each context coding technique mentioned. Finally, section 7 contains the conclusion.
2. Background
2.1 Decorrelation Method
Correlation in an image means that adjacent pixels are providing some degree of information about the current pixel. An image with a high degree of correlation between pixels can be compressed, resulting in less data that needs to be stored. Hence, the redundancy in an image is related to the amount of correlation between the pixels.
Image compression techniques attempt to reduce redundancy by decreasing the amount of correlation in the image. This is done by using a decorrelation algorithm and then encoding the resultant image. Compression on an image, with some degree of correlation, produces a resultant image that has a histogram peaking at a certain value, eg zero. Variable length coding (VLC) algorithms, such as Huffman coding or arithmetic encoding, are then applied to the resultant image. VLC algorithms assign short code words to highly frequent pixel values and longer code words to less frequent pixel values. This effectively reduces the amount of information required to be stored. However, the effectiveness of a VLC algorithm, is dependent on whether the decorrelation method applied was able to produce a sharply peaked histogram for the resultant image.
Different decorrelation methods produce different histograms and vary in their effectiveness for image compression. The three main types of decorrelation methods are predictive, transform and multi-resolution decorrelation. Predictive decorrelation involves methods, such as DPCM and adaptive DPCM. DPCM uses the correlation coefficient r in the auto covariance of the image to determine the contribution of the three surrounding pixels, located top, left and top-left of the current pixel. Adaptive DPCM, on the other hand, takes into account that the parameter r in the auto covariance of the image is not necessarily constant. Contributing pixels, located left, top-left, top and top-right are given weights determined by detecting the magnitude of the differences in the four directions. Hence, adaptive DPCM is able to assign coefficients to contributing pixels, depending on the immediate surrounding pixels.
Secondly, transform decorrelation can be used to create the resultant image. This method includes KL-Transform [9], Discrete Cosine Transform (DCT) [10], Subband Coding (SBC) [11] and Walsh-Hadamard Transform [12]. Transform methods typically take an image and computes new values, by applying a transform that effectively concentrates important information portrayed by the new values. These values are quantised, resulting in quantisation errors. Since lossless compression requires the original image to be perfectly reproduced, DCT, SBC and other similar methods, increase the quantisation accuracy to reduce the errors. However, this causes the compression efficiency to fall dramatically when attempting to produce zero quantisation errors. Instead of eliminating quantisation errors, an error image with a small enough entropy is also stored along with the transformed values. This solution requires an optimum value between quantisation accuracy and error entropy size to be calculated.
Thirdly, multi-resolution decorrelation, is a process where the lowest resolution of the image is initially stored and then more data is added progressively, in levels, to eventually produce finer and finer detail in the image. This technique forms the basis of the following methods, Laplacian Pyramid [13] and Hierarchical Interpolation (HINT) [14]. Laplacian Pyramid eventually finishes 4/3 times larger then the original image. However, this increase may not be detrimental if it is compensated by a high level of decorrelation produced by the algorithm. Another method using multi-resolution decorrelation is HINT, where by the image is sampled while skipping a number of columns and rows between each sample. The 'low resolution' sample is transmitted using DPCM. The original values in-between the sampled values are determined through interpolation and addition to the received error.
A comparison of individual methods, categorised by predictive, transform or multi-resolution decorrelation, done by Roos et al in [1] shows that Laplacian Pyramid was found to be too burdened by the additional storage space needed. When considering transform methods, adapting the algorithm to suit lossless compression causes methods such as DCT and SBC, to be impractical. Both predictive decorrelation methods DPCM and Adaptive DPCM performed well, with Adaptive DPCM performing slightly better on noisy images. However, their results concluded that HINT was the best decorrelation method. It also had additional advantages of simplicity, the absence of parameters and its low sensitivity to noise and channel errors.
Despite the recommendation made by Roos et al, that HINT "is the most appropriate reversible intraframe method for image compression", the decorrelation method employed in this research is Adaptive DPCM. Adaptive DPCM was preferred over HINT because it produced comparable results, while being slightly more intuitive. Hence, allowing more attention to be focused on the main aspect of the research, which is context coding of error bitplanes. However, further investigations can be done to analyse the effect of different decorrelation methods in conjunction with context coding of error bitplanes.
2.2 Context Coding
A strong decorrelation algorithm creates a sharply peaked histogram, which centres around, for example, zero. This allows a VLC to assign shorter code words to more frequent pixel values and hence, compress the image. However, a more sharply peaked distribution can be achieved by applying context coding on the error image and then using a VLC to encode the data. Context coding allows the current pixel to be encoded with the probability of occurring, depending on its neighbouring pixels. As shown in [15], the storage requirements to keep counters on every possible error value in every possible combination of contexts becomes 'prohibitively expensive' for large contexts. However, the number of counters required can be reduced by separating the error image into a series of error bitplanes. Where all the most significant bits of the error pixels exist in the first bitplane and all the second most significant bits are in the second bitplane etc. This is known as the bitplane approach.
By separating the error image into several bitplanes, the number of counters required can be dramatically decreased. Each bitplane now only contains either a zero or a one, instead of a range of pixel values. In addition, the binary data passed to an adaptive Arithmetic coder, from each bitplane, adapts quickly. Thus, resulting in low start-up costs. However, a problem exists in the bit representation of error values that have roughly the same frequency but different bit patterns. For example, 31 and 32 are almost equally likely to occur in an error image, but have significant differences in their bit patterns. Hence, this causes a loss in compression using the bitplane approach. However, investigations made by [16] and [17] attempt to give the error values, such as 31 and 32, roughly the same probability by Gray coding them.
Gray coding assigns bit patterns that differ by only one bit to consecutive error values. An investigation made by [16] to find out whether the compression efficiency of Gray encoded bitplanes would be improved, resulted in an approximately 3 per cent increase in the average code length, compared to the ideal code length. The investigation showed that the benefits of using Gray coding, which allowed the use of a very large number of contexts, was offset by the loss in compression efficiency. However, the investigation also observed that the use of different types of Gray codes, resulted in different compression results. These results were left open, which raises the question, "Is there an optimal Gray code for use with bitplanes" [15].
Another type of encoding method, based on work done by Langdon in [18] and further examined by [15], was the sign/magnitude/bitplane (SMB) approach. This method converted the error values into a sign/magnitude representation, which required another bitplane to represent the sign of the error value. The current research being investigated is based on the SMB approach. However, in an attempt to improve compression, contexts are not only dependent on the current bitplane, but are also dependent on the other bitplanes.
Initially probabilities in large contexts do not occur often enough to provide an accurate representation of the data. In these cases, a smaller context is preferred over a larger one. However, after a certain time, it would then be more desirable to use a larger context, since larger contexts represent a higher order Markov model, which provides more accurate conditional probabilities for non independent data. Work done by Rissanen and Weinberger [31] [32] developed the "universal context modelling" algorithm to build a context tree and choose a context from that tree based on "the minimum sum of predicted code lengths for its next symbol". Alternatively, context selection could be based on the gradient estimates of image pixels [33]. However, in this investigation, a simplistic approach was preferred, where a larger context was used as soon as it had a frequency value greater then a pre - determined value. In addition, the context probabilities can be further improved by implementing an adaptive context.
An adaptive context is able to alter the probabilities in the context to give more weight to recent data. This process is known as scaling. Common techniques of scaling involve, periodically throwing away all the past information and resetting the probabilities [34], maintaining a fixed window size [35] of past data and as new data enters the window, the contribution of the old data leaving the window to taken out, exponential aging [36] where past events are given exponentially increasing weights and finally, the method employed in this investigation, periodic scaling / fading memory [37] where after a certain frequency value is reached the probabilities in the context are halved.
2.3 Encoding
After as much redundancy in an image has been removed, through decorrelation, the resultant image is then encoded. Encoding attempts to utilise the frequency of occurring pixel values. Types of encoding methods include Huffman coding, LZ coding and Arithmetic coding. Huffman coding is a method of encoding discrete symbols into code words of variable length. Given that the code words are not fixed in length, the decoder needs a method of identifying when a code word starts or stops. Hence, all the code words have a prefix property, where no codeword exists as the start of another longer codeword. Since Huffman coding is capable of generating variable length code words, symbols with a high probability of occurring can then be assigned shorter code words and symbols with a low probability of occurring can be assigned longer code words. However, the drawbacks to Huffman coding are that, for binary data, it must assign code words that are at least one bit long. Hence, even if the probability of a binary 0 is 0.99, it still must assign 1 bit to represent the binary 0 and 1 bit to represent the binary 1. This prevents Huffman coding from achieving any compression, even though the probabilities are heavily biased to the binary 0.
One alternative to Huffman coding is LZ coding, where the encoder generates fixed size code words that correspond to variable symbol lengths. This encoding technique can be seen as a fixed-sized dictionary of variable length words. The LZ coding scheme was then modified by Welch to become Lempel-Ziv-Welch (LZW) coding [19]. This modified version of LZ uses a variable-length dictionary with every new string appended to the dictionary. However, through experimentation by [20], it was found that LZW performed worse then Huffman coding, when used as a post encoder.
Another method of encoding, is called Arithmetic coding. This technique involves using input symbols as magnitudes and it began from an idea conceived by a man named Shannon [21]. Shannon purposed a method of representing a string of symbols as the sum of their scaled probabilities. It was then introduced by Rissanen [22], who created a 'practicable derivative' of enumerative codes written by Lynch [23], Davisson [24], Schalkwijk [25] and Cover [26]. Rissanen's code used a last-in first-out (LIFO) strategy. More importantly, however, with the establishment of this technique, the precision problem encountered with enumerative codes, such as Huffman coding, was solved. Soon after the first arithmetic code was developed, Pasco [27] used Elias' code [28] to create his own version of arithmetic coding, where he incorporated a first-in first-out (FIFO) approach. Through the use of FIFO codes, decoding could then be done instantaneously. Since, the decoding process could begin, whenever it receives the next portion of the code. The FIFO method spawned other different variants of arithmetic coding, written by Jones [29] and Martin [30]. Due to the great flexibility and efficiency of Arithmetic coding, it now forms the basis of a wide variety of codes that all treat input symbols as magnitudes.
3. Description of the Problem
Context coding can be used to produce higher order Markov models for the data. These models provide probabilities that can be encoded by an arithmetic encoder to produce a compressed image. However, the main problem is providing accurate probabilities. The techniques under investigation to provide and maintain these probabilities are context growing, which reduces the effects of using a large number of initially inaccurate context probabilities. Fading memory, which allows the context probabilities to adjust according to changes in the data. Finally, increasing probability convergence, which reduces the entropy only on biased bitplanes. In addition, in an attempt to utilise the already encoded / decoded error bitplanes, the research develops 1s dropping and 3d contexts, which are then compared against the traditional 2d contexts and proven to be superior. The investigation concludes by presenting a final context coding scheme for a particular 3d template with 10 neighbours.
4. Alternative Solutions:
Lossless compression algorithms, such as Huffman coding [4], Lempel-Ziv coding (LZ) [5], ADPCM, Context coding, Arithmetic coding etc. all involve reducing the redundancy in an image. However, greater compression efficiencies can be achieve through the use of combining several techniques. For example, Bramble used Differential Pulse Code Modulation (DPCM) combined with Huffman and LZ coding [6]. Blume and Fand applied S-Transform also followed by Huffman and LZ coding [7]. Huang et al. also used DPCM but with Huffman and run-length coding [8]. These methods all tried to achieve greater compression by firstly decorrelating the original image and then relying on an encoder, such as Huffman coding, to take advantage of the frequency of occurring pixels.
5. System Design and Implementation
The following sections step through the process of compressing a grey scale image, where the main focus is on the various techniques used to maintain accurate probabilities when context coding the bitplanes.
Beginning with a grey scale image, the error image is produced using a predictor function called ADPCM. The error image is then converted to a variant of sign magnitude representation. The modified error image is divided into bitplanes and context coding is then applied to produce probabilities of a binary event. Various context coding techniques such as context growing, fading memory, increasing probability convergence and dropping 1s, were applied to produce more accurate probabilities of a binary event occurring. Maintaining accurate probabilities of a binary event is vital because, it is used to calculate the entropy of the bitplanes. Hence, the probabilities are used to calculate the amount of bits required to represent each pixel.
5.1 Adaptive Differential Pulse Code Modulation:
ADPCM, developed by Roos [1], was used to predict the current pixel and produce an error pixel, given by:
(1)
E(i, j) is the error pixel, f(i, j) is the actual pixel value and F(i, j) is the predicted pixel value at location (i, j). The predicted value pixel F(i, j) can be defined by the sum of a portion of it's neighbouring pixel values.
![]()
(2)
where s(a), s(b), s(c) and s(d) determine the contribution of each neighbouring pixel value. The diagram below shows how each pixel contribution is calculated by using a context involving 10 neighbouring pixels.
The values of s(a), s(b), s(c) and s(d) are normalized to sum to zero, which minimises the amplification of noise. These normalized terms allows ADPCM to alter the contributions from neighbouring pixels, depending on the differences between pixels, within the context, from:
- east to west given by a,
- north to south given by b,
- southwest to northeast given by c,
- southeast to northwest given by d.
The contribution of each neighbouring pixel is given by:
(3)
where k = a, b, c, d and ra = 7, rb = 6, rc = 5 and rd = 4. The purpose of γ is to magnify large differences between pixel values and δ is added to avoid division by zero. Work done by Roos [1], experimented on a large number of images, with various values of γ and δ. Their findings indicate that the optimal value for γ is 3 and for δ is 8.
5.2 Error Image to Sign Magnitude Bitplanes
The process of converting an error image from 2 complement notation to SMBs will be described using an 8 bit image as an examples. However, the process does not change when converting images of other sizes, eg 12 bit, 16 bit etc.
5.2.1 Sign Magnitude Representation
Each error pixel, in the error image, was firstly converted into a variant of the sign magnitude representation. In this variant, the initial bit was used to represent the sign of the error pixel, with the remaining bits representing the magnitude of the error. However, for negative numbers the magnitude is decreased by 1 first, this prevents having two representations for zero, namely +0 and -0. The sign magnitude bit patterns for various eight bit error pixels are shown in table 1.
Table 1: An example of a variant of sign magnitude representation
|
Sign Bit |
Magnitude |
No. |
|||||||
|
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
0 |
+ 2 |
|
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
+ 1 |
|
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
+ 0 |
|
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
- 1 |
|
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
- 2 |
Table 1 shows the sign magnitude representation of 8 bit error pixels, where 9 bits are actually used, 1 for the sign and another 8 for the magnitude. The magnitude of the prediction error required 8 bits because it has a range between -255 to +255. Eg, the actual pixel value could be 0 but the predicted value could be 255, which results in an error of -255 and vice versa for +255. Thus, the actual magnitude of the prediction error would never be larger then 255, which means the magnitude needs at least 8 bits to be represented.
5.2.2 Sign Magnitude Bitplanes
An array of 9 binary images was created to store the error pixels. The 9th binary image contained only the sign bits of all the error pixels. The 8th binary image contained the most significant bit of the magnitude of the error pixel. Similarly, the 7th binary image contained the 2nd most significant bit etc.
The binary images are also known as bitplanes. Using table 1 as an example, each column in the table could be considered as a bitplane. The rest of this thesis terms the first column, the sign column, the 8th bitplane, the 2nd column the 7th bitplane and the last column, of the magnitude, would be the 0th bitplane. Hence, for an 8 bit image, 9 bitplanes were used, ranging from bitplane 8 to bitplane 0.
5.3 Context Coding
Normally each bit from each bitplane can be feed into an arithmetic encoder, as a series of 1's and 0's. The arithmetic encoder would then increment frequency counts for the number of 0's and 1's that have occurred and assign the appropriate code lengths, depending on the probability of a 0 or a 1 occurring. The probability of a binary event occurring is given by:
(4)
A binary event with a high probability of occurring would be assigned a shorter code word, compared to one with a lower probability of occurring. This indicates that shorter code lengths can be assigned if more accurate probabilities could be used. Context coding provides a way of assigning probabilities depending on the binary numbers surrounding the 1 or 0 to be encoded / decoded. The probability of a binary event, given the current context, is given by the conditional probability:
(5)
where s is a context represented by a unique pattern of binary numbers surrounding the current binary pixel. The accuracy of the probability used to encode the next binary pixel depends on the context used to provide the probability.
5.3.1 Two Dimensional Contexts
The term 2D context will be used to describe context layouts that involve binary pixels on one bitplane only. Fig 2 shows a 2D context template.
|
i |
f |
j |
|||||||||||||||
|
h |
c |
b |
d |
g |
|||||||||||||
|
e |
a |
? |
Fig. 2. shows a 2d template with 10 neighbours, where the letter represent the members of the context and the ? is the binary event to be encoded.
where ? is the binary event to be encoded, a 0 or 1 and a, b, c ... g are 10 binary pixels used to identify the context. A context involving 10 binary pixels creates 210 = 1024 possible context combinations. Having such a large number of contexts to maintain introduces one of two main problems with context coding. The first problem was partially solved by using bitplanes, which reduced the possible context combinations from an error value of 51210 to a binary value of 210. Hence, significantly reducing the storage space required to maintain the frequency distribution of 0s and 1s for each context, while still allowing a large context template to be used.
The second problem, with large context templates, is that the contexts require some time to begin to converge on the true probability of a binary event, given a particular context. Hence, initially all 210 contexts would provide inaccurate probabilities, causing unnecessarily longer code words to be assigned. This problem can be reduced by using a process called context growing.
5.3.2 Context Growing
In context growing, the probabilities from the smaller contexts are used until the probabilities from the larger contexts have been given enough time to become more accurate. This process is illustrated in fig. 3.
Fig. 3. A hierarchy of templates with 0 neighbours up to n neighbours. The number of binary numbers in the squares represent the number of neighbours in the template. In addition, the actual sequence of binary numbers represent a particular context pattern
As each binary number is being encoded, the probabilities for a context with 0 neighbours, 1 neighbour ... 10 neighbours are updated. The context with 0 neighbours will be called template 0 and template 10 is the context with 18 neighbours. The criteria used to choose between each template, which provides the probabilities to encode the current binary number, is based on the context with the largest number of neighbours and a frequency larger than a predetermined threshold. For example, by using context growing, if the threshold value was 100 then as each binary pixel is encoded, the frequency of the largest template, template 10, is tested to see if it has occurred more then 100 times. If so, then its probabilities are used to encode the binary event. If not, the next largest template, template 9, is tested.
This process continues until a context with a frequency value greater than the threshold is found, or if template 0 is reached and regardless of the threshold value its probabilities are used. Hence, for a given binary event, a context involving 10 neighbours, in template 10, may be used and for the next binary event, a smaller template, for example template 4, may be used, due to inefficient occurrences of a particular context with 10 - 5 neighbours surrounding the binary pixel.
The threshold value can be found using a 'brute force' method, where the threshold value is incremented from 2, because the probability of a 0 or a 1 begins at 1/2, hence the frequency starts at 2, to the number of binary pixels in the bitplane. Hence, the optimum threshold for each bitplane can be found by the encoder. However, this method would lead to a small over head, since 9 threshold values need to be sent for each bitplane, given a 8 bit image. In addition, the actual process of finding the optimum threshold value and then sending it to the decoder, proves to be time consuming and inefficient. The process is time consuming because, even if the encoder required 1 second to run, then it must be run for all threshold values in each bitplane, which requires 1(sec) x 266144(pixels) x 9(bitplanes) = 27 days. In addition, the process is inefficient because the actual gain between the optimum threshold value and the worst threshold value, with no context growing at all, results in roughly 6, 000 bits for a context with 10 neighbours , as shown in table 5.
Alternatively, an average threshold for each bitplane with an entropy less then, equal to or greater then a certain value, can be found for several training images and hard coded into the encoder and decoder. Thus, both the encoder and decoder know when to switch to a larger template. This method is implemented on the assumption that the sign bitplane, bitplane 8, is quite random and bitplane 7 to bitplane 0, moves from biased data to random data. The process works by finding the entropy of each bitplane, using template 0. If the entropy is less then 0.002, then it is a good indication that the bitplane is heavily biased towards 0 and all bitplanes above the current bitplane have entropy values less then 0.002 as well. For these particular bitplanes, a small threshold value of 10 is used. The next following bitplanes with entropy values greater then 0.002 and less then 0.65 are assigned threshold values of 50. The remaining bitplanes with entropy values greater then 0.65 are given threshold values of 300, except for the sign bitplane and the first bitplane to have an entropy greater then 0.65, these 2 bitplanes are given threshold values of 100. The threshold values used at two entropy markers are shown below, for a 8 bit image.
Table 2: Entropy indicates when to switch threshold values. Decoder needs to receive the positions of two 1s to be able to decode the data.
|
Bitplane |
8 |
7 |
6 |
5 |
4 |
3 |
2 |
1 |
0 |
|
Threshold |
100 |
10 |
10 |
50 |
50 |
50 |
100 |
300 |
300 |
|
Entropy |
- |
- |
- |
0.002 |
- |
- |
0.65 |
- |
- |
|
Send bits |
- |
0 |
0 |
1 |
0 |
0 |
1 |
0 |
0 |
Table 2 shows that the encoder can pre-scan the bitplanes and send 8 binary numbers to indicate when the decoder should switch to a larger hard coded threshold value. Note that the sign bitplane is always assumed by the decoder to be quite random and has a fixed threshold value of 100. In addition, the first binary 1 indicates to the decoder that it should start using the next fixed threshold value of 50. The second binary 1 indicates that the decoder should use 100 as a threshold value, only for that bitplane and then use 300 for the remaining bitplanes.
This method of context growing attempts to save bits by allowing larger templates to occur more often and, hopefully, they are given sufficient time to more accurately represent the data. In addition, even if the contexts are given enough time to allow the probabilities for the data to become more accurate, if the data changes then the probabilities no longer match the data and it would take a while for the probabilities to readjust itself. Hence, a context coding technique to allow the probabilities, for each context, to quickly adjust to changes in the data is a process called fading memory.
5.3.3 Fading memory
In this process when the total frequency of a particular context reaches a pre-determined value, then the frequency of 1s and 0s are halved and rounded up to the nearest integer. Any frequency values dropping to 0 are reset to 1. The pre-determined value used was 1000, which was found to work well with the 6 images. However, other implementations may use 8192 as in [38].
This process of scaling the probabilities, allows the more recent incoming binary events to carry more weight against the history of binary events. However, the process may become counter productive on data that does not change much. This happens because the probabilities of a long obvious trend in the past are being reduced, producing less accurate probabilities. However, more often then not, an image is typically not completely constant. Hence, fading memory allows the recent data to have more significance on the probability of a context, allowing the context to adapt to changes in the data. In addition, another method that can be used in conjunction with fading memory, is to increase the convergence rate.
5.3.4 Increasing Convergence
Normally, as each binary event is encountered, the frequencies for the number of 0s and the number of 1s are increased by one. However, by increasing the frequency by three, which was chosen because it produced the best improvement on the entropy of the 6 images, then every time a binary event is encoded /decoded, the results show that the true probability for each context can be reached at a faster rate. Hence, smaller and less accurate probabilities for each context are not used as often. This then leads to reducing the number of bits required.
5.3.5 Dropping Ones
To make further use of the bitplanes that have already been encoded / decode, 1s on the upper bitplanes are 'dropped down' to the current bitplane. This means that even if a neighbour is currently a '0', in the context, if any binary pixels above its current position is a '1' then it will be considered as a '1' as well. By dropping 1's down the bitplane, the context information recorded, changes from just whether there was a 1 or a 0, for example, in the left neighbour, to whether the left neighbour had an error value larger then another neighbour with a binary pixel value of 0. Hence, each context now contains probabilities that are depended on the magnitude, whether it was larger or smaller, and the location of large error values, compared to the 2d case where it just recorded reoccurring patterns of 1s and 0s.
5.3.6 Three Dimensional Context
The term 3d context is used to describe context templates that involve neighbours from more then 1 bitplane. More specifically, the 3d context involves neighbours in regions where it is impossible for a 2d context to involve. For example, in the diagram below, the bottom grid shows a template lying on the current bitplane, bitplane 5, which is being encoded / decoded, while the top grid shows a template lying on bitplane 6, that has recently been completely encoded / decoded.
Fig. 4. A 3d Context with neighbours on the current bitplane being encoded / decoded and neighbours on previously encoded / decoded bitplanes, situated above areas not possible with a 2d contexts.
Fig 4 is an example of a context template consisting of 21 neighbours. The context contains 10 neighbours positioned on the already encoded / decoded area on bitplane 5. However the remaining 11 neighbours, on bitplane 6, are positioned directly above the section yet to be encoded / decoded, on bitplane 5. This allows a 3d context to have neighbours attempting to represent all the actual neighbours existing around the current binary pixel. Thus, it will be shown that these new neighbours provide a comparable amount of information about the current pixel, on bitplane 5, compared to neighbours that are further away from the current pixel, on the same bitplane.
5.4 Entropy Estimation
The probabilities for each binary pixel, in a context, are used to estimate the number of bits per bit required, given by
(6)
The sum of the total number of bits required to represent every bit in the SMBs is divided by the total number of pixels. Hence, producing the entropy, which predicts the number of bits per pixel required to represent the image, if it was actually encoded by an arithmetic encoder. This method of calculating the entropy allows the number of bits required to be given adaptive probabilities that become more accurate as more binary events are encoded / decoded. Finally, the number of bits per pixel required, using this adaptive method, is compared to the average entropy of the original error image. The average entropy, H, of the original error image can be calculated using the following formula.
(7)
Where K equals the number of possible error values and pi is the probability of a particular grey level. The average entropy calculated from the original error image, assumes that the arithmetic encoder was initially given all the probabilities of the possible error values. Hence, the actual bits per pixel of the original image would be slightly larger, since the probability information of the error values must be initially sent. However, this comparison still gives a good measure of the performance of context coding of SMBs.
6. Results and Discussion:
Experiments were conducted on 6 test images to find the usefulness of already encoded / decoded bitplanes, when determining the probability of a binary event on a bitplane yet to be fully encoded / decoded. The test images consisted of two 8 bit and four 12 bit images. The 8 bit images were the Lenna and Mandrill images, both of size 512 x 512. These 8 bit images were chosen because the Lenna image is a popular benchmark in image processing and the mandrill image had a high entropy for its error image, meaning it was hard to predict. The 12 bit images were medical x-ray images, two were of a skull and the remaining two were of a spine. The skull images had dimensions of 1755 x 1463 and the spine images were about two times larger at 2487 x 2048.
The entropy of the SMBs, for the 6 test images, was calculated from the probabilities given different contexts. The results are divided up into stages, where the entropy improvements are analysed when using two templates, of size 5 and 10, in conjunction with context coding techniques such as, context growing, convergence increase, scaling and dropping 1s. Shown below are the two initial templates used.
5 Neighbours 10 Neighbours
|
i |
f |
j |
|||||||||||||||
|
c |
b |
d |
h |
c |
b |
d |
g |
||||||||||
|
a |
? |
e |
a |
? |
|||||||||||||
Fig. 5. A diagram of the template 5 and 10. Both diagrams show the order in which the contexts are grown in.
When an optimised solution is found for each context coding technique, the parameters are then used to find the difference between using a 2d context and a 3d context.
6.1 Effect of Context Growing
Tables 3 and 4 show the entropy, of the 6 test images, when the 2d context is not grown, labelled "2d_fixed" and when it is grown, labelled "2d_grow". "2d_grow" used threshold values that are shown in table 2.
Table 3: Compares the entropy between a 2d context with 5 neighbours, with no context growing "2d_fixed" and a 2d context with context growing "2d_grow".
|
Lenna |
Mandrill |
Skull 1 |
Skull 2 |
Spine 1 |
Spine 2 |
|
|
2d_fixed |
4.301 |
6.177 |
5.967 |
6.075 |
6.023 |
6.081 |
|
2d_grow |
4.306 |
6.176 |
5.967 |
6.075 |
6.023 |
6.082 |
Table 4: Compares the entropy between a 2d context with 10 neighbours, with no context growing "2d_fixed" and a 2d context with context growing "2d_grow".
|
Lenna |
Mandrill |
Skull 1 |
Skull 2 |
Spine 1 |
Spine 2 |
|
|
2d_fixed |
4.273 |
6.152 |
5.945 |
6.044 |
5.992 |
6.062 |
|
2d_grow |
4.267 |
6.121 |
5.942 |
6.042 |
5.991 |
6.061 |
As expected the template 10 produces better results then the template 5, because it uses a higher Markov model for the data. However, these results indicate that although there is virtually no difference between "2d_fixed" and "2d_grow", when using 5 neighbours, there is a noticeable difference when using 10 neighbours. The reason why the entropy results for table 3 were so close is because template 5 is small enough to be able to quickly converge on its final probabilities. Hence, template 5 avoids having to use a large number of contexts that have inaccurate probabilities, as in the case for template 10.
Shown in table 4, "2d_grow" is able to achieve better entropy values then "2d_fixed" because it initially uses smaller context templates and waits for the larger context templates to occur more often. Hence, this reduces the penalty of having a large number of contexts that initially have inaccurate probabilities representing the data. In addition, for the 6 images, shown in table 5 using template 10, analysis of the amount of bits required to represent each bitplane indicates that the context growing technique, reduces the total amount of bits required at an almost constant value for all the images. The bits saved column shows that this particular method of context growing saves on average about 6, 000 bits. This establishes that the entropy improvements, in table 4 between the 12 bit images, were over shadowed by the massive amount of pixel they contained. Hence, the penalty incurred for beginning with a large fixed template was dwarfed by the huge amount of data in the 12 bit images.
Table 5: Estimated bits required using when using 10 neighbours. Also, bits saved, shows the difference between using "2d_fixed" and "2d_grow" contexts.
|
2d_fixed |
2d_grow |
Bits saved |
|
|
Lenna |
1, 120, 217 |
1, 118, 492 |
1, 725 |
|
Mandrill |
1, 612, 590 |
1, 604, 541 |
8, 049 |
|
Skull 1 |
15, 264, 703 |
15, 257, 010 |
7, 693 |
|
Skull 2 |
15, 519, 002 |
15, 512, 678 |
6, 324 |
|
Spine 1 |
30, 519, 820 |
30, 513, 224 |
6, 596 |
|
Spine 2 |
30, 877, 059 |
30, 870, 045 |
7, 014 |
The context growing technique used is crude and provides small improvements. However, it serves the purpose of demonstrating that higher order Markov models, such as template 10, require time to adjust to the data and when they do, they provide more accurate probabilities and reduce the number of bits required. Another context coding technique, to provide more accurate probabilities, is to use fading memory. The following results show the effects of fading memory on the entropy.
6.2 Effect of Fading memory
Tables 6 and 7 show the entropy results when the 2d context is run with only the context growing technique, "2d_g", and then with the context growing technique in conjunction with the fading memory technique, "2d_fade".
Table 6: Compares the entropy between two 2d contexts with 5 neighbours, one with no fading memory "2d_g" and the other with fading memory "2d_fade".
|
Lenna |
Mandrill |
Skull 1 |
Skull 2 |
Spine 1 |
Spine 2 |
|
|
2d_g |
4.306 |
6.176 |
5.967 |
6.075 |
6.023 |
6.082 |
|
2d_fade |
4.296 |
6.132 |
5.940 |
6.052 |
6.004 |
6.064 |
Table 7: Compares the entropy between two 2d contexts with 10 neighbours, one with no fading memory "2d_g" and the other with fading memory "2d_fade".
|
Lenna |
Mandrill |
Skull 1 |
Skull 2 |
Spine 1 |
Spine 2 |
|
|
2d_g |
4.267 |
6.121 |
5.942 |
6.042 |
5.991 |
6.061 |
|
2d_fade |
4.263 |
6.104 |
5.924 |
6.024 |
5.977 |
6.048 |
Tables 6 and 7 indicate that the fading memory technique reduces the entropy required, compared to just using the 2d context with context growing. Improvements were achieved in both template 5 and template 10 indicating that the nature to the data changes as more information is encoded / decoded. However, only small improvements were gained when applied on the Lenna image, this must have been due to the consistent nature of the data.
The remaining 5 images had large improvements because of the changing nature of the data from, for example, the dark borders, in the skull images, to the smooth skull itself and eventually to the more rough neck / spine areas. Due to the vast amount of context information gained in the upper half of the image, the border and skull areas, when the image eventually changes to the neck/spine areas, without the use of the fading memory technique, the context spends a lot of time trying to force the probabilities to change. The changing context would have to occur frequently to make a significant change to its the probabilities. During this period of adjustment, inaccurate probabilities would be used, which increases the entropy. However, by using the fading memory technique, the changing context would not have to occur as much in order to replace the probabilities of the context with more accurate representations for the current data. The rate in which the context changes plays an important role in building accurate probabilities quickly. Hence, the next section analyses the effect of increasing the convergence of a context.
6.3 Effect of Increasing Convergence
Coupled with both the context growing and fading memory techniques, the effects of increasing convergence are investigated. In Table 8 and 9, the 2d context with context growing and fading memory "2d_gf" are compared against another 2d context "2d_conv" also with context growing, fading memory and the addition of increasing convergence by 3.
Table 8: Compares the entropy between two 2d contexts with 5 neighbours. "2d_gf" does not implement convergence increase while "2d_conv" does.
|
Lenna |
Mandrill |
Skull 1 |
Skull 2 |
Spine 1 |
Spine 2 |
|
|
2d_gf |
4.296 |
6.132 |
5.940 |
6.052 |
6.004 |
6.064 |
|
2d_conv |
4.292 |
6.134 |
5.936 |
6.048 |
6.001 |
6.061 |
Table 9: Compares the entropy between two 2d contexts with 10 neighbours. "2d_gf" does not implement convergence increase while "2d_conv" does.
|
Lenna |
Mandrill |
Skull 1 |
Skull 2 |
Spine 1 |
Spine 2 |
|
|
2d_gf |
4.263 |
6.104 |
5.924 |
6.024 |
5.977 |
6.048 |
|
2d_conv |
4.264 |
6.110 |
5.923 |
6.023 |
5.978 |
6.047 |
Table 8 shows that increasing convergence has a positive effect on the probabilities of the contexts for most of the images. However, shown in table 9, increasing the convergence rate fails to improve the entropy for the Lenna, Mandrill and Spine 1 images. The entropy increase in Lenna and Spine 1 were minor, compared to the significant increase in the Mandrill image. This was because increasing the convergence exploits heavily biased bitplanes. However, the Mandrill image has a large entropy for a 8 bit image, which means the upper bitplanes were not as biased, compared with the other images. For example, the 12 bit images have many upper planes heavily biased to zero. Hence, by increasing the convergence rate, more accurate probabilities of the biased nature of the bitplanes can be reached faster. However, moving down the bitplanes, as the binary events become more random, then the increased convergence rate become detrimental. Table 10 shows a closer analysis of the number of bits needed to represent each bitplane for Spine 2, using template 10.
Table 10: Estimated bits required using when using 10 neighbours. Also, bits saved, shows the difference between using "2d_gf" and "2d_conv" contexts.
|
Threshold |
2d_gf |
2d_conv |
Bits saved |
|
|
Biplane 13 (sign) |
100 |
5, 005, 449 |
5, 005, 747 |
-298 |
|
Bitplane 12 |
10 |
42 |
16 |
26 |
|
Bitplane 11 |
10 |
42 |
16 |
26 |
|
Bitplane 10 |
10 |
1, 424 |
1, 366 |
58 |
|
Bitplane 9 |
10 |
5, 117 |
5, 066 |
51 |
|
Bitplane 8 |
50 |
16, 429 |
16, 203 |
226 |
|
Bitplane 7 |
50 |
42, 455 |
41, 897 |
558 |
|
Bitplane 6 |
50 |
234, 267 |
221, 342 |
12, 925 |
|
Bitplane 5 |
50 |
1, 564, 069 |
1, 552, 208 |
11, 861 |
|
Bitplane 4 |
100 |
3, 892, 282 |
3, 895, 884 |
-3, 602 |
|
Bitplane 3 |
300 |
4, 853, 551 |
4, 858, 240 |
-4, 689 |
|
Bitplane 2 |
300 |
5, 028, 083 |
5, 033, 162 |
-5, 079 |
|
Bitplane 1 |
300 |
5, 073, 558 |
5, 078, 608 |
-5, 050 |
|
Bitplane 0 |
300 |
5, 087, 119 |
5, 092, 154 |
-5, 035 |
In table 10, the increased convergence rate had a large effect on bitplane 6 and 5, showing that the contexts used in "2d_gf" were not able to converge fast enough to represent the true probabilities for each context. In addition, it is interesting to see that as soon as the threshold value for context growing increased to 100, there was an immediate drop in the performance of increasing the convergence rate. This drop in performance was not caused by an increase in the threshold level, but by that fact that the threshold hold level was increased because the encoder pre-sent entropy information about the bitplanes, indicating that the entropy of bitplane 4 was greater then 0.65. Hence, when the bitplanes start to become more random, an increase in the convergence rate by 3 produces a negative effect.
With the knowledge that increasing the convergence rate produces a negative result on bitplanes that are not heavily biased, then tables 11 and 12 re-present the data from tables 8 and 9 with the addition of another version of increasing the conversion rate, labelled "2d_c2". "2d_c2" reduces the convergence rate back to 1 for the sign plane and when it reaches the first bitplane to have an entropy greater then 0.65. Note, the 8 bits of information initially sent by the encoder to let the decoder determine the threshold values are being reused to determine which bitplanes should have the convergence rate increased.
Table 11: Compares the entropy between three 2d contexts with 5 neighbours. "2d_gf" does not implement convergence increase while "2d_conv" does and "2d_c2" is an improved version of "2d_conv".
|
Lenna |
Mandrill |
Skull 1 |
Skull 2 |
Spine 1 |
Spine 2 |
|
|
2d_gf |
4.296 |
6.132 |
5.940 |
6.052 |
6.004 |
6.064 |
|
2d_conv |
4.292 |
6.134 |
5.936 |
6.048 |
6.001 |
6.061 |
|
2d_c2 |
4.294 |
6.131 |
5.934 |
6.045 |
5.999 |
6.057 |
Table 12: Compares the entropy between three 2d contexts with 10 neighbours. "2d_gf" does not implement convergence increase while "2d_conv" does and "2d_c2" is an improved version of "2d_conv".
|
Lenna |
Mandrill |
Skull 1 |
Skull 2 |
Spine 1 |
Spine 2 |
|
|
2d_gf |
4.263 |
6.104 |
5.924 |
6.024 |
5.977 |
6.048 |
|
2d_conv |
4.264 |
6.110 |
5.923 |
6.023 |
5.978 |
6.047 |
|
2d_c2 |
4.262 |
6.104 |
5.920 |
6.020 |
5.973 |
6.043 |
The modified version "2d_c2", which implements increasing convergence, shows a small improvement over "2d_gf" and "2d_conv", especially for the mandrill image, which contained many unbiased, almost random, bitplanes that caused "2d_conv" to drastically increase the entropy of the Mandrill image unnecessarily, in table 9. However, "2d_c2" does not attempt to increase the convergence rate on the almost random bitplanes, which allows it to produce results at least as good as "2d_gf" or better in the situation where it is able to exploit significantly biased bitplanes.
6.4 Effect of dropping 1s
Table 13 and 14 present the entropy achieved by "2d_gfc2" and "2d_drop" where both 2d contexts utilise all the context coding techniques up to this point, which are context growing, fading memory and a modified version of increasing convergence. However, "2d_drop" has the added advantage of the dropping 1s technique. Hence, it stores its context probabilities based on the information from the current bitplane and all other magnitude bitplanes that have already been encoded / decoded.
Table 13: Compares the entropy between two 2d contexts with 5 neighbours. "2d_gfc2" does not implement dropping 1s while "2d_drop" does.
|
Lenna |
Mandrill |
Skull 1 |
Skull 2 |
Spine 1 |
Spine 2 |
|
|
2d_gfc2 |
4.294 |
6.131 |
5.934 |
6.045 |
5.999 |
6.057 |
|
2d_drop |
4.255 |
6.087 |
5.917 |
6.021 |
5.970 |
6.037 |
Table 14: Compares the entropy between two 2d contexts with 10 neighbours. "2d_gfc2" does not implement dropping 1s while "2d_drop" does.
|
Lenna |
Mandrill |
Skull 1 |
Skull 2 |
Spine 1 |
Spine 2 |
|
|
2d_gfc2 |
4.262 |
6.104 |
5.920 |
6.020 |
5.973 |
6.043 |
|
2d_drop |
4.226 |
6.061 |
5.900 |
5.990 |
5.947 |
6.020 |
Table 13 and 14 both show large improvements to the entropy values when the 1s are dropped down the bitplane. This confirms that a context that also takes into account the magnitude of its neighbours, out performs a context that merely recognises re-occurring patterns of 0s and 1s that become increasingly random in the lower bitplanes. For example, a context which implements dropping 1s would provide probabilities for a binary event based on the knowledge that it is surrounded by large errors. In addition, by creating a 2d context with the same template as the context used in ADPCM, which is similar to template 10, and also using dropping 1s then this context may be able to record the number of times ADPCM used pixel values, for its prediction, that were wrongly predicted with either a large or a small error compared to the other neighbours.
6.5 Close Neighbours vs. Distant Neighbours
One problem with a 2d context is that it can only involve neighbours that have already been encoded / decoded on the same bitplane. These neighbours lie to the left and top of the current binary pixel position, as shown in fig 6. Hence, the neighbours can only half surround the current position. However, a 3d context may involved neighbours that completely surround the current position, where the neighbours to the right and below, of the current position, exist on the previously encoded / decoded bitplane. Both the 2d and the 3d contexts are shown in the diagram below:
2d Context 3d Context
5 Neighbours
|
c |
b |
d |
e |
b |
|||||||||||||
|
a |
? |
a |
?/C |
E |
|||||||||||||
|
D |
10 Neighbours
|
i |
f |
j |
|||||||||||||||
|
h |
c |
b |
d |
g |
f |
b |
g |
||||||||||
|
e |
a |
? |
j |
a |
?C |
E |
|||||||||||
|
I |
D |
H |
Fig. 6. A diagram of the template 5 and 10 for the 2d context and the 3d context. All 4 diagrams show the order in which the contexts are grown in.
In Fig 6, the letters represent the order in which the contexts are grown in. In addition, the upper case letters represent binary numbers on the previously encoded / decoded bitplane, (the upper bitplane). The lower case letters are binary pixels on the current bitplane which is being encoded / decoded.
The test was run on all 6 images producing tables 15 and 16. Again, both contexts "2d_ gfc2d" and "3d_ gfc2d" implement all the context coding techniques up to this point, which are context growing, fading memory, increasing convergence and dropping 1s.
Table 15: Compares the entropy between a 2d context and a 3d context with 5 neighbours.
|
Lenna |
Mandrill |
Skull 1 |
Skull 2 |
Spine 1 |
Spine 2 |
|
|
2d_ gfc2d |
4.255 |
6.087 |
5.917 |
6.021 |
5.970 |
6.037 |
|
3d_ gfc2d |
4.268 |
6.098 |
5.912 |
6.010 |
5.971 |
6.031 |
Table 16: Compares the entropy between a 2d context and a 3d context with 10 neighbours.
|
Lenna |
Mandrill |
Skull 1 |
Skull 2 |
Spine 1 |
Spine 2 |
|
|
2d_ gfc2d |
4.226 |
6.061 |
5.900 |
5.990 |
5.947 |
6.020 |
|
3d_ gfc2d |
4.228 |
6.057 |
5.894 |
5.987 |
5.939 |
6.008 |
Table 16, indicates that even though "3d_ gfc2d" has neighbours existing on different bitplanes, these neighbours are more relevant, to the current binary pixel, then the local distant neighbours, namely h, i, f , j, that are on the 2d context. However, the entropy achieved by using the "3d_ gfc2d" in table 15, did not produce consistent results. The next section reveals that the inconsistent results were caused by a majority of neighbours on the upper bitplane. However, in conjunction with table 16, it does prove that there is a potential for using 3d contexts. Hence, the 3d context needs to be re-assessed to determine the effect, on the entropy, when the number of neighbours that exist on the upper bitplane are increased.
6.6 Analysing the Effectiveness of 3d Neighbours
Fig. 8 shows what happens to the entropy of the Mandrill image, using template 10, when the number of neighbours existing on the current bitplane, for the 2d context, are moved to the upper bitplane, one by one. Initially, at x = 0, there are no neighbours on the upper bitplane and the context is exactly the same as the 2d context, shown in Fig. 6. However, at x = 1, the furthest member, from the current binary pixel location, is moved to the upper bitplane in the closest available position to the current position. This process continues until all the neighbours positioned on the current bitplane are moved to the upper bitplane. The three examples, in fig. 7, show the template configurations at x = 0, x = 5 and x = 10.
|
x |
= |
0 |
x |
= |
5 |
x |
= |
10 |
|||||||||
|
i |
f |
j |
|||||||||||||||
|
h |
c |
b |
d |
g |
c |
b |
d |
||||||||||
|
e |
a |
? |
e |
a |
?/F |
G |
K |
?/A |
B |
F |
|||||||
|
J |
H |
I |
H |
E |
C |
D |
I |
||||||||||
|
G |
J |
Fig. 7. Three examples from a 10 step process, where each neighbour on the current bitplane are moved to the upper bitplane, in a position as close to the current position as possible.
Fig. 8. The entropy of the "3d_gfc2d" as all its neighbours begin on the current bitplane and are then moved to the upper bitplane.
Fig. 8 shows that there is a slight improvement when there are 7 neighbours in the current bitplane and 3 neighbours in the upper bitplane. From x = 3 the results steadily rise to the "2d_gfc2d" line. However, after x = 6 the entropy increases sharply. Thus, by having the majority of neighbours on the upper bitplane, the context no longer accurately represents the binary number on the current bitplane. It actually begins to reduce its knowledge of the current bitplane and eventually at x = 10, it has virtually no idea what happening on the current bitplane. However, the entropy at x = 10 is still within a reasonable range of the best performance because of the dropping 1s technique, allowing it to have some idea of the magnitude of errors occurring around the binary pixel location and when it receives the actual binary event itself.
6.7 Optimal Context Model For 10 Neighbours
Fig. 9, shows the optimal context template that incorporations, context growing, fading memory, a modified version of increasing convergence and a reduced 3d context layout, with only 3 neighbours on the upper bitplane.
|
f |
|||||||||||||||||
|
c |
b |
d |
g |
||||||||||||||
|
e |
a |
?/H |
I |
||||||||||||||
|
J |
Fig. 9. A diagram of the optimal template for a context involving 10 neighbours.
Table 17: For template 10, a comparison of the entropy for the full_count, the basic 2d context with no additions and the final 3d context utilising all the context techniques mentioned.
|
Lenna |
Mandrill |
Skull 1 |
Skull 2 |
Spine 1 |
Spine 2 |
|
|
full_count |
4.418 |
6.325 |
6.009 |
6.113 |
6.069 |
6.112 |
|
2d_basic |
4.273 |
6.152 |
5.945 |
6.044 |
5.992 |
6.062 |
|
3d_final |
4.225 |
6.056 |
5.893 |
5.982 |
5.938 |
6.008 |
In table 17, the "full_count" entropy should be slightly larger because the decoder requires the probabilities of the error pixels to be initially sent. However, the basic 2d context is still able to reduce the "full_count" entropy by 0.1 bits per pixel on average. The "3d_final" context coding model produces an average improvement of 0.16 bits per pixel over the "full_count" entropy and 0.06 bits per pixel over the "2d_basic" context implementation.
7. Conclusion
In conclusion, it has been shown that a basic 2d context can be drastically improved by implementing context growing, fading memory, increasing convergence, dropping 1s and a 3d template model. The context growing technique produced on average a savings of 6000 bits, for a template containing 10 neighbours. However, this did not have a significant affect on the entropy of the large 12 bit images, since the improvements were dwarfed by the sheer volume of data contained in these images. However, for the much smaller 8 bit images, a savings of a few thousand bits can reduce the entropy by 0.02 bits per pixel.
The effect of fading memory, on the other hand, was quite impressive. Not only was it able to reduce the entropy for all 6 images. However, it also provides valuable insight to the nature of the data. This technique can be used to detect changes in the data, which may be a good indication that the predictive scheme should change accordingly.
The increasing convergence technique did not perform as well as the fading memory technique and initially its performance was inconsistent. However, analysis of the bitplanes showed that it produced negative results on the lower bitplanes that were becoming increasingly random. Hence, the increased convergence technique was only used on the heavily biased bitplanes, allowing it to produce positive results.
The dropping 1s technique had the greatest effect on the entropy. This technique utilised higher bitplanes that allowed the contexts to store conditional probabilities based on a larger or smaller error value and the location of its surrounding neighbours. This out performs a context that just bases the probabilities on re-occurring patterns of 0s and 1s that become increasingly random in the lower bitplanes.
The final sections analysed the effect of having neighbours that virtually surrounded the binary pixel, where neighbours lying on the yet to be encoded / decoded areas are situated on the upper bitplane. It was found that as long as a majority of the neighbours were present on the current bitplane then the 3d context was able to slightly improve the entropy. After finding the optimal 3d context template, the product produced was a context coding scheme, optimised for a particular 3d template with 10 neighbours. The context coding scheme was able to produced, on average, an improved entropy result of 0.16 bits per pixel over the "full_count" entropy and 0.06 bits per pixel compared to what the basic 2d context could achieve.
Future developments may explore the effects of simultaneously context coding the error bitplanes as the error pixels are produced by the predictor. This could mean that the information gathered through context coding, could be used in someway to adaptively alter the predictor model to account for changes in the nature of the data. For example, when a predictor is attempting to predict the 1000th pixel, for an 8 bit image, it already has 9 unfinished SMBs. Context coding these unfinished bitplanes may provide valuable information to the predictor.
8. Bibliography
[1] P. Roos, M. A. Viergever, M. C. A. Van Dijke, and J. H. Peters. Reversible intra frame compression of medical images. IEEE Trans. Medical Imaging, 7:328-331, 1988.
[2] S. Todd, G. G. Langdon, and J. Rissanen. Parameter reduction and context selection for compression of greyscale images. IBM Research Journal, 29:188-193, 1985.
[3] G. G. Langdon. An introduction to arithmetic coding. IBM Research Journal, 28:135-149, 1984.
[4] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proc. IRE, 40:1098-1101, 1952.
[5] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory, 23:337-343, 1977.
[6] J. H. Bramble. Comparison of information-preserving and information-losing data-compression algorithms for ct images. Radiology, 170:453-455, 1989.
[7] H. Blume and A. Fand. Reversible and irreversible image data compression using the s-transform and lempel-ziv coding. Proc. SPIE Med. Imaging III, 1091:2-18, 1989.
[8] H. K. Huang, S. B. C. Lo, B. K. Ho, and S. L. Lou. Radiological image compression using error-free and irreversible and two-dimensional discrete-cosine-transform coding techniques. J. Opt. Soc. Amer., 4:984-992, 1987.
[9] A. N. Netravali and J. O. Limb. Picture coding: A review. Proc. IEEE, 68:366-406, 1980.
[10] R. C. Gonzalez and P. A. Wintz. Digital image processing. Addison-Wesley, 1987.
[11] J. W. Woods and S. D. O’Neil. Subband coding of images. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-34:1278-1288, 1986.
[12] N. Ahmed and K. R. Rao. Orthogonal Transforms for Digital Signal Processing. Springer-Verlag, 1975.
[13] P. J. Burt and E. H. Adelson. The laplacian pyramid as a compact image code. IEEE Trans. Commun., COM-32:532-540, 1983.
[14] T. Endoh and Y. Yamazaki. Progressive coding scheme for multilevel images. Proc. Picture Coding Symp., pages 21-22, 1986.
[15] R. T. Worley and P. E. Tischer. An alternative to gray coding for bit-plane compression. Australian Computer Science Communications, 16, 1994.
[16] A. J. Maeder, P. E. Tischer, and R. T. Worley. Adaptive context compression for lossless compression of medical grayscale images. Medical Imaging 1993: Image Capture Formatting and Display, pages 248-258, 1993.
[17] M. Rabbani and P. Jones. Digital Image Compression Techniques. Bellingham, 1991.
[18] G. Langdon. On the jpeg model for lossless image compression. Preprint of paper accepted for Data Compression Conference, 1992.
[19] T. A. Welch. A technique for high-performance data compression. Comput.,17(6):8-19, 1984.
[20] Gopinath R. Kuduvalli and Rangaraj M Rangayyan. Performance analysis of reversible image compression. IEEE Trans. Medical Imaging, 11:430-445, 1992.
[21] C. E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:379-423, 1948.
[22] J. Rissanen. Generalized kraft inequality and arithmetic coding. IBM J. Res. Develop., 20:197-300, 1976.
[23] T. J. Lynch. Sequence time coding for data compression. IEEE, 54:1490-1491, 1966.
[24] L. D. Davisson. Comments on sequence time coding for data compression. Proc. IEEE, 54:2010, 1966.
[25] J. P. M. Schalkwijk. An algorithm for source coding. IEEE Trans. Inform. Theory, 3(IT-19):385-398, 1972.
[26] T. M. Cover. Enumerative source encoding. IEEE Trans. Inform. Theory, IT-19(1):73-77, 1973.
[27] R. Pasco. Source coding algorithms for fast data compression. PhD thesis, Ph.D dissertation, Dept. of Elec. Eng., Stanford University, 1976.
[28] N. Abramson. Information theory and coding. McGraw-Hill, 1969.
[29] C. B. Jones. An efficient coding system for long source sequences. Submitted to IEEE Trans. On Inform. Theory, 1979.
[30] G. N. N. Martin. Range encoding: An algorithm for removing redundancy from a digitised message. Video and Data Recording Conf., 1979.
[31] J. Rissanen. A universal data compression system, IEEE Trans. Inform, Theory, vol. 29,no. 5, pp. 656-664, 1983.
[32] M. J. Weinberger, J. Rissanen, and M. Feder. A universal finite memory source, IEEE Trans. Inform. Theory, vol 29, no. 5, pp. 643-652, 1995.
[33] T. V. Ramabadran and K. Chen. The use of contextual information in the reversible compression of medical images, IEEE Trans. Med. Imaging, Vol. MI-11, No 2, pp. 185-195, 1992.
[34] G. V. Cormack, R. N. Horspool. Data compression using dynamic Markov modelling. Computer Journal, 30:541-540, 1987.
[35] D. E. Knuth. Dynamic Huffman coding. Journal of Algorithms, 6:163-180, 1985.
[36] K. Mohiuddin, J. J. Rissanen, M. Wax. Adaptive model of nonstationary sources. IBM Technical Disclosure Bulletin, 28:4798-4800; 1986.
[37] I. H. Witten, R. M. Neal, J. G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30:520-540, 1987.
[38] P. G. Howard, J. S. Vitter, Analysis of arithmetic coding for data compression. Information Processing & Management, vol. 28, No 6, pp. 749-763, 1992.