So, far we have studied three coding strategies.
These are Shannon coding, Fano coding and Huffman coding. We have proved that Huffman
code is optimal in the sense; that the expected or average code length for the Huffman code
is minimum. Now, let us discuss a bit more about the code word length of Huffman code
and Shannon code for some specific source alphabet. Huffman code and Shannon code, we will look
at this two codes and look at their word length. Now, using code word lengths of log 1 by P
I, which is called Shannon coding. This may be sometime much worse than the optimal code
for some particular symbol. Let us take an example that I have a source a. Binary source
consisting of two symbols S 1 and S 2 with the probabilities given as probability of
S 1 is equal to 0.9999 and probability of S 2 given as 0.0001.
Now, if we use the Shannon coding strategy, then we can calculate the number of bits which
we should allot to the source symbol S 1 as log of upper flow of log of 1 by P 1 and that
will come out to be 1 bit. For this the upper flow of log of 1 by P 2 will be a code to
14 bits. Optimal code word length is obviously 1 bit for both the symbols S 1 and S 2, hence
the code for the infrequent symbol is much longer in the Shannon code than in the optimal
code like Huffman code. Now, is it true that, code word length for an optimal code are always
less than upper flow of log 1 by P i? Let us answer this question with an example of
an illustration. Let us assume that I have a source consisting
of four symbols with the probabilities given as one-third probability of a S 2 equal to
one-third probability of S 3 equal to one-fourth and probability of S 4 equal to 1 by 12. Now,
for this source we can design an Huffman code and the code word length, which will result
from the design of the Huffman code will be 2 2 2 2 or 1 2 3 3 depending on where one
puts the merge. Probabilities in the process of Huffman coding
to get this, we follow the exactly same procedure, which we had discussed in the last class.
Now, both this Huffman codes achieve the same expected code word length in the second code;
that is this the third symbol has length 3, which is greater than log of 1 by P 3. Thus
the code word length for a Shannon code which would be decided by this could be less than
the code word length of the corresponding symbol of an optimal Huffman code. Huffman
code is optimal in that, it has a minimum expected length, but what does that say about
its performance on any particular sequence? For example, is it always better than any
other code for all sequencing, obviously not. Since, there are codes, which assign short
code words to infrequent source symbols. Now, such code will be better than the Huffman
code on those source symbol is t would be interesting to relate the code word lengths.
At least in probabilistic sense of say Huffman code or Shannon code to the code word length
of some other uniquely decodable code. Now, for notational convenience and this in turn
will result in the simplification of understanding of the concept to follow. We will follow the
following framework, we will
denote the output of a source as a random variable. So, the output of a source will be denoted
by a random variable x with the source alphabet given by this. We will denote the probability
mass function probability mass function p x; that is equal to probability of the random
variable x equal to x where x belongs to the alphabet. We will also denote this probability
mass function p x by we are denoting this by rather than
by this notation just for convenience. So,
if you want to denote the probability mass function for random variable x, we should
denote it by this term. But just for the sake of convenience, we will denote it by a simplification
simplified form as just p x. Now, dealing with Huffman code is difficult,
since there is no explicit explanation for the code word length instead. We will consider
the Shannon code because for the Shannon code word lengths i l x can be specified as equal
to log of 1 by p x such explicit explanation for the code word length of Huffman length
is not possible. So, it will be difficult for us to consider Huffman code, so we consider
the Shannon code and for the Shannon code we have a following important theorem. Let l x be the code word length associated
with the Shannon code
and let l dash x be the code word lengths associated with any other code. Then the theorem
says that probability of l x is greater than equal to plus a constant c is less than equal
to 1 by 2 c minus 1. what it means is that the probability that l dash x is say 5 or
more bits shorter. Then l x is less than 1 by 16, so the interpretation of this is given
as explained. Now, let us look into the proof of this. So, probability of random variable greater
than l dash x plus C is nothing but probability of log of 1 by P greater than equal to this,
this implies this, because the length is given by the full flow 1 by p x for the Shannon
code. Now, this quantity is less than equal to probability of log of 1 by p of x greater
than equal to length dash l dash x plus c minus 1. I can write this expression because
log of is less than equal to log of 1 by p x plus 1. Because of this I can write this
expression. Now, this interpretation in terms of probability
is nothing but probability of x less than or equal to 2 minus l dash x minus c plus
1. Now, probability of this random variable is nothing but a summation of the individual
probabilities, where I have to sum up for all x which satisfies p of x less than or
equal to 2 minus c plus 1. So, this is equivalent to summing up the probabilities of x where
x is all those x for which probability of x is less than equal to this quantity. Now,
I am summing this up for all p x less than this quantity. So, what it implies that this
is this quantity itself is less than or equal to summation 2 minus l dash x minus c plus
1 for this is the upper bound of p x. So, that is why it becomes less than or equal
to… Now, this is obviously less than where i sum up for all x. So, 2 minus l dash x minus
c minus 1, so this is equal to… Now, this quantity is the craft inequality, so summation
of 2 minus l dash x. If I assume that the code which I am considering is a uniquely
decodable, then this should satisfy Macmillan theory of inequality. This is less than equal
to 1, so what it implies? That this quantity is less than equal to 2 minus c minus 1, so
we have proved that no other code can do much better than the Shannon code most of the time.
So, it is a very important result. Now, pertaining to a channel four, there is another important
theorem, which we will have a look at it. So, suppose I have the code word length l
x for the Shannon code and l dash x as the code word length for any other uniquely decodable
code. Then what one would like to ensure is that l x is less than l dash x, more often
than l x is greater than l dash x. So, this is something which is desirable that I want
l x, which pertains to the Shannon code is less than l dash x, which pertains to any
other uniquely decodable code. This should happen more often than this, so that is probability
of this should be more than this. So, let us try to prove that, but before we
try to prove that. Let me define, what is the dyadic probability mass function p x?
Now, probability mass function p x is dyadic, if log of 1 by p x is an integer for all x.
x comes from is a discrete and it comes from the alphabet this x. So, if log of 1 by p
x is an integer for all x, then probability mass function which is given by p x is known
as dyadic. Now, with this definition let me study another
important theorem. The theorem says that for a dyadic probability mass function
p x. Let l x be equal to log 1 by p x be the
code word length of the binary
Shannon code for the source
and let l dash x be the lengths the code word length of any other uniquely decodable binary
code for the same source. Then the theorem says that probability of l x being less than
l dash x is greater than probability of l x being greater than l dash x.
This is with equality
if and only if l dash x is equal to l x for all x. Now, we can say thus the code word
length assignment of the form l x equal to log of 1 by p x; that is the Shannon’s strategy.
If uniquely competitively optimal, so we have proved that Huffman code is optimal in the
sense that the average expected code length is minimal. But now, what we are going to
prove is that, Shannon code is uniquely competitively optimal and by that we mean?
That probability of l x being less than l dash x the probability of code word length
for Shannon code being less than for any other uniquely decodable code, is always greater
than probability of code word length of Shannon code being greater than the code. Word length
of a any other uniquely decodable code a very important result though this result. We are
trying to prove it for the dyadic probability mass function. Then we will try to relax this
condition and see what happens so the proof follows as follows. Now, before we have a look at the proof, let
me define one function that is the signum function. Define the function
signum t as follows signum t is defined as,
it is equal to 1 is tis greater than 1 is equal to 0. If t is equal to 0 and it is equal
to minus 1, if t is less than 0. Then based on this definition, it is easy to see from
the following figure that
signum t will be always be less than or equal to 2 raise to t minus 1 for t equal to 0 plus
minus 1 plus minus 2. So, if we draw the plot for these two functions, they will appear
something like this t and i plot my function.
So, this is my signum t and this function is 2 raise to t minus 1. So,
from this is a plot for 2 raise to t minus 1 and this is a plot for the function signum
t form this plot. It is very clear that signum t is always less than or equal to 2 raise
to t minus 1, but very important to note that this is valid for t equal to 0 plus minus
1 plus minus 2. So, it is for all the integer values this is true and 2 t minus 1 when t
tends to negative is bounded by minus 1. So, this result is always valid. Now, based on
this result, we will try to prove the theorem for this is not valid for all t, but valid
for integer values of t. So, based on this we can now write probability
of l dash x being less than l x minus probability of l dash x being less than l x minus probability
of l dash x being greater than l x. This I can write this probability, I can write as
probability of x where my I consider only those x for which l dash x is less than l
x, the probability of the random variable. So, by the definition it is this and this
is nothing but the probability of the symbol for which l dash x will be greater than l
x. Now, this itself very it is not very difficult that this I can write as summation of p x
this is signum of l x minus l dash x for overall x. It is simple from here to here.
Now, this by definition is nothing but expectation of signum function l x minus l dash x. This
follows from the definition of expectation now because we have seen that signum of t
is always less than equal to 2 raise to t minus 1 for t equal to 0 plus minus 1 plus
minus 2. Since, our difference between this two random variables are always integers,
we can use this relationship to write as summation of x p x 2 of raise to l x minus l dash x
minus 1. So, let me just denote this relationship out here by a we will come back to this little
later. So, I denote this step out here by a. Now, based on this we have come from here
to here. Now, this I can simplify because my p x because
a dyadic probability mass function. So, p x is nothing but two raise to minus l x. So,
I can write this as summation of x 2 raise to minus l x minus 1 and this we can simplify
as x 2 raise to minus l dash x minus 2 raise to minus l of x. Now, this quantity is nothing
but the Macmillan’s graph inequality. So, if your l dash x corresponds to the code word
length of another uniquely decodable code, then it implies that this quantity has to
be less than equal to 1. So, I can say that this is equal to less than
or equal to 1 and this is obviously 1, because a dyadic probability mass function. So, summation
again is equal to 1. So, let me call this relationship as b. So, this quantity out here
is going to be less than equal to 1, if I can write this as less than equal to 1 minus
1 and if equal to 0, so what we have shown? Now, that this is less than equal to 0, where
this step a follows from this and b here. We have used the Macmillan graph inequality.
Now, we have an equality in the bound at a and b. So, if we have equality in the bound for signum,
which is occurred only for t equal to 0 or 1 in this case, then if p if you want to have
a equality here, then what it means? That l x is equal to l dash x or l x is equal to
l dash x plus 1. This is I can write because the signum function. If we look at the signum
function, the equality holds for t equal to 0 and t equal to 1, so in our case here if
I assume that the equality at this point, then t equal 0 and t equal to 1 corresponds
to l x is equal to l dash x and or l x is equal to l dash x plus 1. Another equality
is if you want the equality then this at point b, this should be equal to 1, which implies
that the graph inequality is satisfied and combining these two facts we will get l dash
x is equal to l x for all x. So, this we get from equality a, I mean when
we assume the equality a when we assume this condition that t is equal to 0 or 1 at this
point. Here we assume that this also satisfies the equality of Macmillan graph equality.
Then from both this condition, we get l dash x is equal to l x for all x. So, we have proved
the theorem, so we have proved that this probability is always greater than this for a Shannon
code. So, after having proved that the Shannon code is uniquely competitively optimal, so
after having proved that the Shannon code is uniquely competitively optimal for a dyadic
probability mass function. For non dyadic probability mass function there
is a corollary. I just take the corollary for the sake of completion without the proof.
But the proof is followed along the same line as the proof, which we saw for the dyadic
probability mass function for the corollary. For the non-dyadic probability mass function
would be for non-dyadic probability mass function, expectation of signum l x minus l dash x minus
1 is less than or equal to 0, where l x is upper flow of log of 1 by p x and l dash x
corresponds to the code word length for the source, but this is another uniquely decodable
code. So, l dash x is any other code word lengths the uniquely decodable code for the source.
So, a important result, what it says that on the average the Shannon code when provided
the code word length which does not differ from the code word lengths of any other uniquely
decodable code by 1. So, important relationship after having studied this, not let us go back
to the Huffman code. Now, we had looked into the procedure for the design of a Huffman
code where our code alphabet was a binary in nature. Now, let us look at the procedure
to extend the binary a Huffman coding to a non-binary Huffman coding strategy. So, in
the non-binary Huffman coding, what we mean is that the code alphabet is size is not 2,
but greater than 2. So, the binary Huffman coding procedure can be easily extended to
the non binary case where the code elements form an r array alphabet. So, let us study non binary Huffman code,
because that we had obtained the huffman algorithm based on the two observations; that first
that the symbols that occur more frequently or have a higher probability of occurrence.
We will get from 6 letters alphabet. We will have a shorter code word than symbols that
occur less frequently and the second observation was that the two symbols that occur least
frequently will have the same length. We added an additional requirement that the two symbols
with the lowest probability differ only in the last portion. Now, we can obtain a non
binary Huffman code in almost exactly the same way.
The obvious thing to do would be to modify the second observation, which would mean…
Now, that the r symbols, r denotes the size of the code alphabets. So, the r symbols which
occur least frequently will have the same length that is the modification of the second
observation. We will also modify the additional requirement to reduce the r symbols with the
lowest probability will differ only in the last position. However, we run into a small
problem with this approach consider design of a ternary Huffman code for a third with
6 letter alphabet. Now, is we use the rule described. Now, then
in this case, we would do is we would first combine the 3 letters with the lowest probability
into a composite letter. So, that we would be doing at the first stage, so after the
first reduction, we will move over to 4 letter alphabet. Now, once we move over to the 4
letter alphabet and if we follow same procedure of combining 3 letters into a composite letter,
then at the end of the second stage of reduction, I would be getting 2 letters. So, finally,
I have a 2 letters reduced alphabet and I have three symbols to be allotted, so this
is the one strategy which I could have followed. Another strategy, which I could have followed
was that to start initially, I could have combined 2 letters, so when I combine 2 letters
then at the first stage of my reduction, I would get 5 letters reduced alphabet. Now,
once I get 5 letter reduced alphabet, then I can combine again 3 letters into a composite
letter. When I do that at the second reduction, I would get 3 letters. I have three symbols
from the code alphabets to be allotted to the three reduced alphabet, so no problem.
Now, I could have followed another strategy, another strategy could have been that initially
I combine 3 letters. So, at the first stage of my reduction I get 4 letters.
At this stage instead of combining 3 letters, I would combine 2 letters. Finally, I would
get 3 letters. Now, there are three alternatives, which I have suggested. So, the question is,
which of these alternatives should be followed? Now, recall that the symbol with the lowest
probability will always have the longest code word. Furthermore, all the symbols that we
combine together into a composite symbol will have code words of the same length. What this
mean? That all the letters we combine together at the very first stage will have code word
that have the same length. This code words will be the longest of all the code words,
so this being the case if at some stage we have allowed to combine less than r symbols.
Then obviously the logical place to do this would be in the very first stage. So, in order
to achieve this, we can add some dummy source symbols with probabilities 0. Then combine
r symbols at a time at each stage of the reduction indirectly by this procedure. What we are
doing? That the very first stage the symbols, we are combining are less than r, because
the other symbols, which we added as a dummy symbol are having probability equal to 0.
Why we do this is, because we can just like we prove for the binary Huffman code; that
if we want the original tree to be optimal, then this necessary that the reduced tree.
We develop a code for the reduced tree also which is optimum optimal. Now, in order to
do that, we have to always combine r symbols at a time. So, if wish to form an r array
contact the optimal code that an any particular step in the Huffman coding algorithm. We shall
combine the source symbols r at a time in order to form one symbol in the reduced source
in the sequence. Now, we would like the last source in the sequence to have exactly r symbols.
This will allow us to construct the trivial optimal code for this source by allotting
each code symbol to the source symbol. The last source will have exactly r symbols, if
and only if the original source has r plus alpha r minus one symbols, where alpha is
an positive integer. So, only the original symbols satisfy this
relationship where r denotes the size of the code alphabet and alpha denotes the integer,
if my original size of my source alphabet is of the form this. Then I can combine r
symbols at a time at ease reduction of the source and this will provide me the optimal
code, when I move backwards. Therefore, if the original source does not have this many
number of symbols, we add dummy symbols to the source until this number is reached. Now,
it is not very difficult to show this, take a simple example. Suppose, I start with three
source symbols then three source symbols alpha is equal to 0 and it satisfies this condition.
Now, if I have my six source symbols, so if take let us take an example. I have six source
symbols to start with and I want to do the coding using code alphabet of size r equal
to 3. So, if you look at this expression we get 3 plus alpha 2, which is of this 6 is
not of this form. So, what it means that to make of this form I should add some dummy
variable. So, if add one dummy variable here that is say, make it 7, then 7 is of the form
3 plus 2 into 2. So, if add one dummy variable, which is has a probability of 0, then I get
this relationship satisfied. This also tells me that if I have six source symbols, we try
to understand the non binary Huffman coding procedure, which we have discussed today with
the help of an example and that we will do in the next class.

Categories: ArticlesBlog

### 1 Comment

#### Rishabh Jha · February 6, 2019 at 8:47 am

Plagiarized word-to-word and equation-by-equation from the book "Elements of Information Theory" [2nd Ed.] by Cover & Thomans. Not even a single line of extra explanation has been added. I came here for the proof for non-dyadic probability mass functions as I was not able to fill in the blanks since the proof has been omitted in the aforementioned book, he too skipped it in the video.