Ah, the joys of the
modern digital age. Now as far as ages
go, we’ve clearly entered into the Cambrian
explosion of digital content. You see, we have
plethoras of data out there on the internet–
books, music, videos, even your favorite TV shows
are all available at the touch of a button or a
finger to any device you want, anywhere in the world. Now think about
this for a minute. I mean, we’re talking about
yottabytes upon exabytes upon googols of information. All flowing through
the internet, everywhere in the world,
every second of the day. Now since the late ’70s,
information theorists and computer scientists have
been utilizing compression algorithms to effectively
reduce the size of this content, to keep the internet from
coming to a standstill. And this is why today we
start hearing more and more about the connection between
conversion rates and data sizes for applications and websites. So for example, a user may
browse away from a website if it hasn’t loaded
in three seconds or may not download an
application because it can incur extra charges from their carrier. Now savvy developers need to
take control of this fact. And that means bucking
the trend of just using standard compression
algorithms out there. Instead, you need to
start digging in deep, understanding your data,
and utilizing compression in new and interesting ways
to make your users happy. But fear not, young developer. I’m here to help
with that process. My name is Colt McAnlis,
and this is Compressor Head. In 1836, an American inventor
named Samuel F. Morse concocted a way to
send electrical pulses through a wire. Now, at the end of
that wire was attached an electromagnet. He, along with other physicists, realized that if they could separate out the pulses of electricity on the wire to control the electromagnet, they could ring a bell every
time a pulse came through. And thus was born the
electrical telegraph system. You see, the entire
intent of this operation was to find a way to
send natural language over great distances, using a pretty robust medium. Now of course, they needed to devise a scheme where they could take an English character, a, b, or c, and turn it into electrical pulses: dot dash dot dot dash. And thus was born Morse Code. The way they were able
to assign this code was actually based on
the inverse frequency of a symbol occurring
in the English language. So for example, e, the most commonly occurring letter in English, was assigned the smallest code. The reason for this was
you have to remember, they have some guy
sitting there banging on the electrical magnet,
sending pulses through a wire. And they wanted to minimize
the amount of work that needed to be done to
communicate a message. And of course, all of
the other symbols here are assigned based on their probability, getting longer the less probable they are to occur. Now in modern
computing, we actually use a different sort of form. And that’s actually
called binary code. Now binary code is not a
variable length symbol. Instead what we
say is that you’re going to use a set
number of bits, and then we’re able to
assign a symbol to a given number of bits using
that static amount. So for example, ASCII
assigns eight bits to a single character. So when we get a stream
of these bits on the wire, we know we can read in
eight bits at a time and determine what symbol
we’re actually using. Now however, if we decided
Now however, if we decide to move in the other direction and vary the code length, much like Morse did, we can actually get to something a little bit better. Now what happens here is
if we take into account the probability or frequency
occurrence of the character and use that to assign
a variable length bit pattern to it, we
can actually start getting some interesting things. So for example, c, occurring in a given symbol stream 80% of the time, gets assigned the shortest code word, a single 0. Now what this means is that instead of assigning eight bits per character for our entire stream, we can assign a variable number of bits per character and, in this example, actually end up with 81% savings. This is the definition
of variable length codes. And it’s what we’re going
to talk about more of today. Now it would seem that you
can’t really have a discussion about compression without
actually going back and talking about information theory itself. And when you’re talking
about information theory, you can’t talk about
it without introducing this guy, Claude Shannon. Now Claude, who besides being awesome at parties (really good with keg stands), pretty much invented what we call modern information theory. Now what Claude did
effectively was he found that there was a
correlation between information content and the probability
of that symbol occurring in the stream. Now he was able to measure this correlation with the logarithm function and get a result in a number of bits. So effectively,
he could find out what the information content
was in terms of binary data. Now this itself was
a pretty bold move that actually defined how
information theory works. But for our needs, what we're more interested in is his definition of entropy. See, when you take the information content of every symbol in a stream, weight each one by its probability, and sum them together, you end up with the formula for entropy.
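For reference, here is a standard way to write down what Colt is describing: the information content of a symbol, and entropy as the probability-weighted sum of that content. (This is the usual textbook notation; the exact on-screen formula isn't reproduced in this transcript.)

```latex
% Information content of a symbol s_i with probability p_i, in bits:
I(s_i) = -\log_2 p_i

% Entropy: the probability-weighted sum of information content, in bits per symbol:
H = \sum_i p_i \, I(s_i) = -\sum_i p_i \log_2 p_i
```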
Now besides looking a little gnarly, what this really tells us is that entropy is an estimate of the minimum number of bits, on average, that each symbol has to have to represent the stream. Don't worry, we'll take
a look at an example of that real quick. So let’s take a look
at this in practice. So let’s say we
have a stream that only has two symbols in
it, whose probabilities are represented by P1 and P2. So you can see here that while
the probability is 0.5 and 0.5 between these two symbols,
the entropy value actually gives us 1. Basically saying that
we need 1 bit per symbol to represent all symbols
in this stream, right? So two symbols, we have 0 or 1. That’s pretty good. Now as we start skewing
the probability, effectively one
symbol starts becoming more dominant in the stream
than the other symbol. So we start seeing
the probability of P1 start rising
and P2 start falling. We can see that the entropy
value ends up at about 0.88. Roughly meaning that on average,
we need 0.88 bits per symbol to represent all the
symbols in the stream. And this is good, right? This is where we come back
to variable length codes. Some codes are bigger,
some codes are shorter. To push this point further, you can see that when we're really skewed down here at 0.99 and 0.01, the entropy gets about as close to 0 as it can, meaning we're going to see a very short code for the dominant symbol P1 and a longer code for the rare symbol P2.
Let's look at this a bit more. So let's say we have something very simplistic here. We've got four symbols, right? And each one has an equal
probability of 0.25. Now in this environment,
the entropy is actually 2. So we have to have
minimum 2 bits per symbol to represent all the
symbols in this stream. Which actually boils down
to something that looks like this– 00, 01, 10, and 11. This is pretty much the
basis of how binary works. If you know the number
of bits per symbol, you can read it pretty easily. Now to look at entropy
a little deeper. If we actually start changing
the probability of the symbols in the stream, we get
some different results. So let’s say we
have a set-up here, where we’ve got four
symbols, and each one has a different probability,
0.49, 0.25, 0.25, and 0.01. When you actually run this through the entropy equation, you end up with about 1.57 bits per symbol. Again, that's an average.
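Spelling that calculation out with the entropy formula from earlier:

```latex
H = -\left(0.49\log_2 0.49 + 0.25\log_2 0.25 + 0.25\log_2 0.25 + 0.01\log_2 0.01\right)
  \approx 0.504 + 0.5 + 0.5 + 0.066
  \approx 1.57 \text{ bits per symbol}
```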
Now this means that we can start assigning variable length codes to the stream to start achieving compression. Because we really can
only get compression once we have differing probabilities. So in this case, A, being
the most dominant symbol, actually gets the smallest
code word of 1 bit per symbol. And over time, we’re going
to end up with compression. Now remember that
entropy actually represents the best estimate
of the minimum number of bits required to
represent the stream. And of course, that also
brings us into the next point. If you’re going to be
using variable length codes and get compression
out of them, you need to make sure that
you assign the shortest codes to the most frequent
symbols in your stream. If you don’t follow
this principle, you’re not going
to get compression, and you’re really not
going to have a good time. So now let’s take a
look at how you actually use these variable length
codes in your software to start encoding and
decoding your data stream. So let’s say we start with
our symbol stream A, B, C. And of course, we also
have a symbol table. Now a symbol table is going to
map your given symbols– A, B, and C– to your given variable
length code words, VLCs. So A goes to 0, B goes to 10,
and C of course goes to 11. Now encoding is actually
extremely straightforward. All we have to do is read
a symbol off the stream, figure out what code word
it goes to, and then output that code word to
the bit stream. So let’s see what
that looks like. We take an A off the stream. And of course we find that
A just goes to a single 0. So we can simply output a
single 0 to the bit stream. Now of course we take B
off the symbol stream. Look through our
symbol table and find that B corresponds to 1 and 0,
which is pretty easy for us. We again put up a 1 and
then put up a 0 as well. And of course C, the final
one here, goes to 1 1, which again is
pretty easy to do. Now the encoding process here
is actually not that difficult, as I said. We start with symbols, we move to code words, and we're done.
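Here's what that encoding loop might look like in code, a minimal sketch of my own using the A/B/C table from the example:

```python
# Minimal sketch of VLC encoding: look up each symbol's code word and
# append it to the output bit stream.
SYMBOL_TABLE = {"A": "0", "B": "10", "C": "11"}

def encode(symbols, table=SYMBOL_TABLE):
    return "".join(table[s] for s in symbols)

print(encode("ABC"))  # -> "01011"
```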
Now decoding is a little bit trickier though. Because in decoding,
we effectively have to start with
our bit stream, and then read
through and determine what symbols we’re
working towards. Now here’s the tricky part. Because our code words are
actually variable length, we don’t know how
many bits to read. Let’s take a look
at what that looks like. So we start with our input bit stream, 0 1 0 1 1. Now we read a single
bit off of the stream. We check this against
our symbol table. And we realize that the first
symbol is actually an A, right? A is a single 0, we got a
single 0 off the stream, so we can actually put
A in the output stream. Now we read the next
bit off the stream. And it’s a 1. Here is where things get tricky. We don’t know if this is going
to be a B or C, because you could see that both of their
code words start with a 1. This means we actually have
to read another bit off of the stream before
we can actually make a decision about
which code this is. So we read a 1 and a 0. And we can see that it actually
corresponds to the letter B. That means we read those off,
we output a B to the stream. Now we do this again
for the next one. We read one bit again, because we always have to start by reading the smallest possible code word length, a single bit. And we realize that
that could either be B or C. We read in a secondary
bit and use that pair. And we realize that we’ve
just read the letter C off of the stream.
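And here's a matching sketch of the decoding loop, again my own illustration: keep collecting bits until they match a code word, emit the symbol, and start over.

```python
# Minimal sketch of VLC decoding: accumulate bits until they match a code
# word, emit the symbol, and reset. This only works cleanly if the code
# table obeys the prefix property discussed below.
CODE_TABLE = {"0": "A", "10": "B", "11": "C"}  # code word -> symbol

def decode(bits, table=CODE_TABLE):
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in table:          # a complete code word has been read
            symbols.append(table[current])
            current = ""
    if current:
        raise ValueError("bit stream ended in the middle of a code word")
    return "".join(symbols)

print(decode("01011"))  # -> "ABC"
```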
Now this is all fine and dandy when things are working correctly. But the reality is
that it’s really hard to design and create the
proper variable length codes. In fact, a lot can
go wrong in how you’re choosing what code words
are assigned to what symbols. And this is where things can
actually go really wrong. So let's say we end up here. We end up with the same input
bit stream that we saw before. But notice that the code
word for the symbol C has actually changed. It’s now 1, 0, 1
instead of 1, 1. So let’s start decoding
and see what happens. We read the 0 off
of the bit stream, and we notice that
that’s an A. Fantastic. An A goes out to the stream,
and all unicorns are happy. However, when we get to the next one, we read a 1, and we realize that it could be either B or C. Fantastic. So far, everything is still working normally. However, when we
read the next 0, we’re still not clear if this
is a B or C. We can’t actually make a decision. It looks like a 1,
0 should go to a B. But this might actually
be a C. Because if we read the third bit, we could
actually make that decision. So what we’re looking at
here is because of the way we’ve encoded our VLCs, we
don’t know if this 1 0 1 0 bit pattern is actually
two Bs or a C and an A. This
actually gives rise to something you need
to follow when designing your variable length codes. And that is known as
the prefix property. Or more specifically,
once a code has been assigned
to a given symbol, no other code can start
with that bit pattern. So if B is assigned 1 0, we should not be able to start the code for C with 1 0. That's going to make us have a really bad time.
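A quick way to sanity-check a code table against the prefix property, as a small sketch of my own:

```python
# Returns True if no code word is a prefix of another code word.
def is_prefix_free(code_words):
    for a in code_words:
        for b in code_words:
            if a != b and b.startswith(a):
                return False
    return True

print(is_prefix_free(["0", "10", "11"]))   # True  -> unambiguously decodable
print(is_prefix_free(["0", "10", "101"]))  # False -> "10" is a prefix of "101"
```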
Of course, at this point you have to ask yourself where do
variable length codes come from? I mean they don’t
just drop from the sky into your computer for
your encoding purposes. For this, we actually
have to introduce someone named Peter Elias. Now besides being a renowned
tap dance fusion instructor, he also invented beta codes. Or what we call today
modern binary codes. Now as we talked about,
binary codes actually aren’t variable length
codes, because they don’t obey the prefix property. But Dr. Elias didn’t stop there. You see, he actually constructed
an entire fleet of compression codes that he used
for various patterns. Now, in the early days
of creating these codes, you could actually take an integer value, such as 0 through 5, and use it to drive the generation of the code word itself. So for instance, this is
known as the unary code. Probably the most simplistic
variable length code that obeys the prefix
property that we have. The way that unary codes are constructed is, given a specific value n, you prepend n minus 1 1's before a terminating 0. So for example
down here at 5, you can see we’ve appended
four 1’s and a 0. 4 gives us three 1’s and a 0. 3 gives us two 1’s and so on. Of course, because of the
way this is constructed, 0 actually doesn’t have a
value representation here. Now what’s important
Now what's important about the way that Dr. Elias actually
constructed his codes, was that they all were built
around a specific probability distribution. So unary codes are
actually optimal when all of your symbols follow the probability distribution P(n) = 2^(-n). Now if your symbols
don’t actually follow this probability,
you’re going to get a little bit
of bloat in your codes later on in the stream. And you’re actually not
going to hit entropy. But unary codes aren’t
the only thing he created. Let’s see here. We’ve also got gamma
codes, which of course you see have a completely
different construction process. That’s mainly because they
have a different probability distribution down
here in the bottom.
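The episode doesn't walk through the gamma construction, but as a rough sketch, the standard Elias gamma code writes a value in binary and prefixes it with one zero for every bit after the first, so the decoder knows how many bits to read:

```python
# Standard Elias gamma code (not spelled out in the episode): for n >= 1,
# emit floor(log2(n)) zeros, then the binary representation of n.
def elias_gamma_encode(n: int) -> str:
    assert n >= 1
    binary = bin(n)[2:]                     # binary representation, no leading zeros
    return "0" * (len(binary) - 1) + binary

for n in range(1, 6):
    print(n, elias_gamma_encode(n))
# 1 -> 1, 2 -> 010, 3 -> 011, 4 -> 00100, 5 -> 00101
```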
We've also got delta codes, a variant built on top of the gamma codes, which are constructed quite differently and have this monster of a probability distribution at the bottom. And of course we have
the fabled omega code, which is known to many far and
wide for having one of the more complex probability distributions, one so complex that you can't actually write it out. It's kind of like saying Voldemort out loud. Anyhow. Back in the day,
the way that you would choose what variable
length code to actually encode with is you would actually look
at the probability of your set and sort of determine how
many symbols you would have, against which code is
actually going to give you the optimal encoding
for that set. So for example, if you have
somewhere between 64 and 88 unique symbols that needed code words, you could actually find that
if the probability matched properly, you could
use omega codes and get the best bit
pattern you needed. Of course, if you
were lower than that, you may need to choose gammas. Or somewhere in the middle,
you may need deltas. Of course this brings us
to an interesting question. If you have a set of
symbols that don’t actually follow any of these known
probability distribution functions, what do you do? Well of course the issue
with that– let’s see here– is that there are hundreds
of variable length codes out there. You see, we’ve got Schalkwijk
codes, Phased-In codes, Punctured codes, Stout
codes, Fibonacci codes, and really there’s a
lot to choose from. Or you could do something
a little bit different and actually construct
your VLCs based upon your probability directly. Let’s take a look at that. In 1951, while David Huffman
was still a student at MIT, he was taking an electrical
engineering class where his professor
gave him an opportunity. He could either
write a term paper solving a very
difficult problem, or he could study for his final. Of course David, being a little
bit optimized at the time, decided that he would
rather solve a problem than study for his exams. Now the problem that the
professor gave him was try to find the
most efficient way to represent letters,
numbers and other symbols using binary code. Of course, David decided to
chomp into this full bore. And after months and months
of chewing on this problem, he was about to give up. He had effectively given up and was at the point of throwing away all of his notes, when it hit him. He would later say that
it was like lightning striking his brain as soon as
the papers hit the trash can. So what he developed
at that time would later be recognized by
Donald Knuth as one of the most fundamental methods of computer
science and data transmission that we’d ever discovered. See, what happened
was David Huffman found a way to represent and
create variable length codes for any given probability in
a very simplistic and compact binary tree form. Now later he went on,
of course, to figure out how to levitate
things with his mind. But until then, he
just used this code, which we are going to
take a look at now. Now the brilliance of
what Huffman discovered was actually a connection
between assigning the symbol to the code word
using the table. Now what he was
able to do here was find that if he
used a binary tree, he could effectively use the
probabilities from the symbol table alongside the
branches of the binary tree to create an
optimal binary code. Well let’s build a Huffman tree. So typically what
you’re given is a set of symbols with
their probabilities. And the first step
in the process is to actually sort them such
that the highest probability symbol’s at the
front of the array, and the least probable symbol
is at the back of the array. From here, what we’re
actually going to do is pop the two least probable
symbols off the array and actually make them children of a new node that's pushed back in. We actually continue
this process all the way down the line. Comparing the two last
elements of the array together and making them children of a new node, which is then pushed onto the array. Now at this point we have a
fully created binary tree. And to move from this tree to
the generation of Huffman codes is pretty simplistic. All we have to do is
assign a 0 and a 1 to the left and right
branches accordingly. To actually generate the
VLC for a given symbol, all we have to do is walk
from that node up to the root and record all the 0’s
and 1's that we trace. So for example, A, moving from its node to the root, gives us 0. B, moving from its node to the root, gives us 0 and then 1, which read from the root down is 1 0. And moving from C to
is the generation of our variable length codes
using the Huffman generation algorithm. And now that is a Huffman tree.
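Here's a compact sketch of that whole construction, my own illustration of the algorithm Colt described (using a priority queue instead of a manually re-sorted array, and example probabilities of my own since the episode doesn't give exact numbers for A, B, and C):

```python
import heapq

# Minimal Huffman sketch: repeatedly merge the two least probable nodes,
# then walk the tree from the root, appending 0 for one branch and 1 for
# the other.
def huffman_codes(probabilities):
    # Heap entries: (probability, tiebreaker, tree). A leaf is a symbol;
    # an internal node is a (left, right) pair.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # least probable node
        p2, _, right = heapq.heappop(heap)  # second least probable node
        heapq.heappush(heap, (p1 + p2, count, (left, right)))
        count += 1

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node
            walk(node[0], prefix + "0")      # left branch gets a 0
            walk(node[1], prefix + "1")      # right branch gets a 1
        else:
            codes[node] = prefix or "0"      # single-symbol edge case
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"A": 0.5, "B": 0.25, "C": 0.25}))
# -> {'A': '0', 'B': '10', 'C': '11'}; the code lengths always match the
#    probabilities, though which symbol gets which exact pattern can vary
#    between implementations.
```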
It's pretty easy to see that a little bit of compression
as statistical encoding, you can use variable
length codes to actually shrink
your data set. Of course, the next thing
you really want to talk about is what other
compression algorithms are out there that you can
apply to your data stream before applying these variable
length codes to actually get you below entropy itself. Of course, that’s a
topic for a different show. My name is Colt McAnlis. Thanks for watching. [MUSIC PLAYING]