#clear the workspace
rm(list = ls())
#Black and White probs
p.White = .995
p.Black = .005
Question 1)
A purely brute-force approach to encoding our 100-pixel batches would ignore the statistical structure of the source and simply use one bit per pixel, allowing for 2^100 possible sequences. We can greatly reduce the number of bits required if we only encode sequences with three or fewer black pixels and accept an error otherwise. We compute the number of sequences with 3 or fewer black pixels as a sum of binomial coefficients, choose(100,0) + choose(100,1) + choose(100,2) + choose(100,3).
#Count the sequences with 0, 1, 2, or 3 black pixels
num.codewords = sapply(0:3, FUN=function(x) {choose(100,x)})
num.codewords = sum(num.codewords)
num.codewords
## [1] 166751
The number of possible sequences with 3 or fewer black pixels is 166,751, which is a great deal less than the 2^100 possible combinations of white and black pixels.
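As a quick sanity check (our own addition, not required by the question), we can compare the two lexicon sizes directly:
#Compare the full lexicon of 100-pixel batches against the reduced one
2^100                  #all possible black/white sequences
num.codewords / 2^100  #fraction of sequences we still tabulate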
Question 2)
The number 166,751 corresponds to our sample space, or ‘lexicon’, of codewords with three or fewer black pixels. If we read this source in 100-pixel batches, the Hartley measure of uncertainty is log2(cardinalityOfLexicon). This is the raw bit content needed to encode a single 100-pixel batch without any compression beyond the initial sample-space reduction we made (3 or fewer black pixels).
#Hartley measure
log2(num.codewords)
## [1] 17.34734
#Number of whole bits needed to index any codeword in our reduced lexicon
ceiling(log2(num.codewords))
## [1] 18
Our theoretical minimum is N * H(X), the block length times the per-pixel entropy:
#Per-batch entropy: N = 100 pixels times the per-pixel entropy H(X)
p.White = .995
p.Black = .005
H.x = (p.Black * log2(1/p.Black) + p.White * log2(1/p.White)) * 100
H.x
## [1] 4.541469
ceiling(H.x)
## [1] 5
Our theoretical minimum is a great deal less than the Hartley measure. According to the Source Coding Theorem, for large N there is a much smaller “typical” class of sequences worth caring about, so if we were to increase N we should expect the number of bits needed per pixel to approach this theoretical minimum.
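To illustrate this (an extra sketch of our own, not part of the question; delta = 0.002 is an arbitrary small error tolerance chosen for illustration), we can repeat the lexicon-reduction idea at larger block lengths and watch the Hartley bits per pixel drift toward the per-pixel entropy:
#Sketch: Hartley bits per pixel for a reduced lexicon covering all but delta of the
#probability, at growing block lengths N (delta = 0.002 is an arbitrary tolerance)
delta = 0.002
bits.per.pixel = sapply(c(100, 1000, 10000), FUN=function(N) {
  k = qbinom(1 - delta, N, p.Black)   #smallest black-pixel count covering 1 - delta
  log2(sum(choose(N, 0:k))) / N       #Hartley measure of that reduced set, per pixel
})
rbind(N = c(100, 1000, 10000), bits.per.pixel, H.per.pixel = H.x / 100)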
Question 3)
We can improve our performance by encoding only the sequences we care most about, i.e. those most likely to occur (the typical set). We identify those sequences with the formula below (4.29 on p. 80). Essentially, we pick out every sequence for which the absolute difference between its surprisal divided by N and the entropy of our source is less than our beta parameter (which defines how close the sequence’s probability has to be to 2^-NH). We expect the size of our typical set to be about 2^NH.
\[T_{N\beta} \equiv \left\{ x \in \mathcal{A}_{X}^{N} : \left| \frac{1}{N}\log_2 \frac{1}{P(x)} - H \right| < \beta \right\}\]
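As a small illustration of what this condition compares (our own addition), the per-pixel surprisal (1/N) * log2(1/P(x)) of a batch containing k black pixels can be computed directly and set beside the per-pixel entropy:
#Per-pixel surprisal of a batch with k black pixels versus the per-pixel entropy
#(k = 0:4 chosen only for illustration; H.x / 100 is the per-pixel entropy from above)
surprisal.per.pixel = sapply(0:4, FUN=function(k) {
  (k * log2(1/p.Black) + (100 - k) * log2(1/p.White)) / 100
})
rbind(k = 0:4, surprisal.per.pixel, H.per.pixel = H.x / 100)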
typ.set.card = 2^(H.x)
ceiling(typ.set.card)
## [1] 24
So we’d expect approximately 24 members in our typical set. The probability of each of these members should be close to 2^-NH (and all roughly equal, because of the asymptotic equipartition principle):
2^-(H.x)
## [1] 0.04294193
#the surprisal of one typical member, log2(1/P(x)), recovers the batch entropy N*H
log2(2^(H.x))
## [1] 4.541469
Practically we could approximate this typical set by sampling from our source and simply selecting the most common sequences (as in Figure 4.10 on p.78).
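Here is a minimal simulation sketch of that idea (our own illustration, not from the text): since every sequence with the same number of black pixels is equally probable, we can draw many batches and tabulate the black-pixel counts that actually occur; the few counts that dominate stand in for the typical behaviour of the source.
#Simulate 100,000 batches of 100 pixels and tabulate the black-pixel counts;
#sequences sharing a count are equiprobable, so counts proxy for whole sequences
set.seed(1)
black.counts = rbinom(100000, size = 100, prob = p.Black)
sort(table(black.counts), decreasing = TRUE) / 100000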
Question 4)
We can find the probability of a batch with 4 or more black pixels using the binomial distribution:
#Probability of exactly k black pixels in a batch of n
binomial = function(n, k) {
  return(choose(n, k) * p.White^(n - k) * p.Black^k)
}
#P(x black pixels) for x = 0..100 (0:100 yields 101 values, one per possible count)
seq.overall = sapply(0:100, FUN=function(x){binomial(100,x)})
plot(0:100, rev(seq.overall), pch=10, col="blue", main="Pixel sequence probability",
     xlab="Occurrences of white pixels in batch", ylab="P(x)")
#P(x black pixels) for x = 4..100
seq.four.plus = sapply(4:100, FUN=function(x){binomial(100,x)})
#Risk of an untabulated batch occurring (delta)
sum(seq.four.plus)
## [1] 0.001673268
So we have roughly a 0.17% chance of encountering an untabulated sequence if we reduce our set of possible sequences to those with 3 or fewer black pixels.
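As a cross-check (our own addition), R’s built-in binomial CDF gives the same tail probability:
#Same delta via the binomial CDF: P(4 or more black pixels in a batch of 100)
1 - pbinom(3, size = 100, prob = p.Black)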