I will prefix this post by saying that none of the ideas here are new and its all well covered ground by now. Like much of this blog, I am mostly writing to clarify ideas and to serve as a reference for the future-me.
Entropy and Prediction
Entropy as considered in information theory, can be thought of as the number of possible values the each symbol can take on. A stream of perfectly random bits has maximal entropy because each symbol is equally likely to be a 1 or 0. If however we impose rules on such a binary stream, e.g. after 3x1s there must be a 0, then after a run of 3x1s, the next symbol must be a 0, and so the entropy of such a stream is reduced. This change in entropy can be considered information.
Viewed through this lens, entropy and our ability to predict the next symbol, based on what has come before, are related concepts. If we can perfectly predict the next symbol every time, then the entropy of the a data stream is the entropy of the predictor, i.e. how many bits do we need to make the predictions.
Compression and Predictors
To compress a data stream is to reduce the number of bits required to represent towards the limit dictated by the entropy of the data stream. This can be done by looking for repeated symbols and using more efficient encoding for those symbols as is typically done with Huffman codes. A further refinement of this technique is to use a predictor to predict the next symbol and encode the difference between the actual symbol that occurs and the predicted symbol.
To see why this is an improvement, consider a perfect predictor - the only symbol we would need to encode is 0 which makes for very high compression ratios. In practice this rarely happens, but predictors do generally reduce the number
of possible symbols. Consider the sequence [10, 9, 8, 5, 3, 1]
. This has 6 unique symbols we need to compress. However if we use a simple predictor that says "the next symbol is the same as the current symbol", then the difference is [10, -1, -1, -3, -2, -2]
, where we assume the first prediction is 0. This sequence of difference from prediction only uses 4 symbols and is easier to compress.
The use of predictors allows us to exploit correlation between symbols to achieve better compressions than simply using more efficient coding. Predictors can be continuously applied with a different predictor each time, to potentially exploit high levels of correlation. The use of predictors can be said to "decorrelate" the data because the ability to predict the next symbol in difference-from-prediction data stream is less than before. e.g. if we were apply the same predictor as before to [10, -1, -1, -3, -2, -2]
we get [10, -11, 0, -2, 1, 0]
, and now we have 5 symbols instead of the 4 we started with, i.e. use of the predictor was detrimental. Where as the original data had elements that correlated to each other by difference of 1 or 2, the difference-from-prediction does not have this property. This also illustrates that the choice of predictor matters as a bad predictor can make things worse by trying to force correlations where non-exists.
Predictors can be specified ahead of time, or "trained" on the data and transmitted along with the compressed data so the receiver can decompress it.