Last updated on
Lab 4: Huffman Coding
In this lab, you will apply your knowledge to a practical scenario: file compression. By the end of the lab, you’ll be able to compress real-world files using a custom format (.huf
).
Senior track
As usual, here are instructions if you would like to complete a less-guided version of this lab. If you’d prefer the normal experience, skip ahead to the next section!
Your colleague has started a project to implement an in-house alternative to zip, gzip, and other compression formats, based on Huffman coding. As a strong advocate of the test-driven development (TDD) approach, they have written unit tests and added docstrings. This was a lot of work, and so now they’ve gone on vacation. Unfortunately, the code is due in 10 days, so you’re going to have to step in and write the actual implementation for them. With tests and documentation to guide you, what could go wrong?
Check out your colleague’s work following § Logistics. You can find the code and documentation they have already written in HuffmanBase.scala
and HuffmanChar.scala
. The core of the program is yet to be implemented, in HuffmanImpl.scala
.
If you want to follow this week’s senior track, skip the next sections and try implementing the lab using only the given codebase, documentation, and tests. You will most likely need to find information online on how Huffman coding works, since your colleague didn’t bother to document that!
Once you’re finished, don’t forget to check the Bonus section.
Introduction
Huffman coding is a lossless (i.e., keeping all information) compression algorithm that can be used to compress lists of symbols. It’s widely used in data compression tasks such as file archiving. For example, huffman coding is used in Gzip.
Usually, for a normal, uncompressed text (i.e., a string), each character is a symbol represented by the same number of bits (usually eight).
For the text “ABEACADABEA”, using ASCII code for each character, the encoded text has $11 \times 8 = 88$ bits.
Huffman coding is designed to optimize the length of the encoded data. In Huffman coding, each symbol can have a bit representation of different length, depending on the frequencies of symbols: the symbols that often appear in a text are represented by a shorter bit sequence than those being used more rarely.
For the text “ABEACADABEA”:
Symbol | Frequency | Huffman Code |
---|---|---|
A | 5 | 0 |
B | 2 | 100 |
C | 1 | 110 |
D | 1 | 111 |
E | 2 | 101 |
The huffman encoded text has $5 \times 1 + (2 + 1 + 1 + 2) \times 3 = 23$ bits, a significant reduction from the original $88$ bits!
In this lab, your task is to implement the Huffman coding algorithm.
This assignment aims to exercise the following concepts and techniques:
- Higher-order functions
- Pattern matching
- Lists
- Parametric polymorphism
- Specifications
- Testing
Logistics
-
Please review our usual course policies about lab assignments, plagiarism, grading, etc.
-
You can get the materials for this lab by running:
git clone https://gitlab.epfl.ch/lamp/cs-214/huffman-coding.git
-
Once you are done, submit the following files to Moodle:
src/main/scala/huffman/HuffmanImpl.scala src/main/scala/huffman/HuffmanChar.scala src/main/scala/HowManyHoursISpentOnThisLab.scala
📝 Important:
don’t forget to fill in how many hours you spent on this lab in
HowManyHoursISpentOnThisLab.scala
to pass the HoursTest
.
Theory of Huffman Coding
For a given text, each huffman code defines a specific bit sequence to represent a symbol. The core idea behind Huffman coding is rooted in two principles:
- Symbols that appear more frequently should have shorter codes;
- No code is a prefix of another (known as the prefix property).
Why is prefix property important?
Consider encoding text AB
where A
is encoded as $10$ and B
as $1010$. The encoded text would be $101010$. Now, when trying to decode, it’s ambiguous where one code starts and the other ends.
This ambiguity is eliminated with prefix property, which ensures that a sequence of codes is immediately and uniquely decodable without requiring any special delimiters or markers between individual codes.
The huffman codes are represented by a code tree, which is a binary tree with leaves representing the symbols.
Constructing Huffman Code Trees
Use text ABEACADABEA
with symbols A
to E
as the example. The code tree construction has 2 steps:
-
Create a leaf node for each symbol, associated with a weight which denotes the frequency of appearance of that symbol.
Show Code Trees
A(5) B(2) C(1) D(1) E(2)
-
While there’s more than one code tree, remove the two trees with the lowest root weight and merge them into a new tree by creating a new branching node. A branching node can be thought of as a set containing the symbols present in the leaves below it, with its weight being the total weight of those leaves.
Show Code Trees Step By Step
Merge
C(1)
andD(1)
:A(5) B(2) E(2) CD(2) / \ C(1) D(1)
Merge
B(2)
andE(2)
:A(5) BE(4) CD(2) / \ / \ B(2) E(2) C(1) D(1)
Merge
BE(4)
andCD(2)
:A(5) BECD(6) / \ BE(4) CD(2) / \ / \ B(2) E(2) C(1) D(1)
Merge
A(5)
andBECD(6)
:ABECD(11) / \ A(5) BECD(6) / \ BE(4) CD(2) / \ / \ B(2) E(2) C(1) D(1)
Encoding
For a given Huffman tree, one can obtain the encoded representation of a symbol by traversing from the root of the tree to the leaf containing the symbol. Along the way, when a left branch is chosen, a 0
is added to the representation, and when a right branch is chosen, 1
is added to the representation.
What’s the huffman code for symbol D
in the below code tree?
Solution
1011Decoding
Decoding also starts at the root of the tree. Given a sequence of bits to decode, we successively read the bits, and for each 0
, we choose the left branch, and for each 1
we choose the right branch.
When we reach a leaf, we decode the corresponding symbol and then start again at the root of the tree.
Given the Huffman code tree of the last check, what does the sequence of bits 10001010
correponds to?
Solution
BACHuffman coding ensures that the average length of the encoded symbols is minimized.
… with Different Types of Symbols
Huffman coding can handle different types of symbols. The previous examples shows that a symbol can be a character.
Image Compression
We can compress an image where each symbol is a color represented as a RGB tuple (Red, Green, Blue)
. Each symbol can be something like (255, 0, 0)
for pure red, (0, 255, 0)
for pure green, etc.
DNA Sequences
DNA sequences are composed of four primary nucleotides: Adenine (A
), Cytosine (C
), Guanine (G
), and Thymine (T
). Besides A
, C
, G
, T
, common codons like ATG
or TAA
can also be treated as individual symbols to make the compression more efficient.
Binary Data
For files that aren’t text-based, such as executables, Huffman coding can operate on bytes (sequences of 8 bits) or even larger chunks of bits.
In general, any data that can be broken down into discrete, countable units can be compressed using Huffman coding. This includes things mentioned in above examples or sensor readings, log entries, and more.
Implementation Guide
When implementing the algorithm in HuffmanImpl.scala
, open HuffmanBase.scala
on the side to read the documentation.
Since Huffman coding can handle different types of symbols, we use a generic abstract class CodeTree
with a type parameter T
to represent Huffman code trees, where T
represents the type of the symbols. CodeTree[T]
has two subtypes: Leaf[T]
and Fork[T]
:
enum CodeTree[T]:
case Leaf(symbol: T, weight: Int)
case Fork(left: CodeTree[T], right: CodeTree[T], symbols: List[T], weight: Int)
src/main/scala/huffman/HuffmanBase.scala
Structure
The relationships between the traits we’ll use:
HuffmanBase[T]
↑
HuffmanImpl[T] --(T = Char)--> HuffmanImpl[Char]
↑
HuffmanChar
Where
A
↑
B
means trait B
extends trait A
.
To begin, implement the following two (hint: very simple) functions on the code tree:
weight
which returns the weight of a given Huffman code tree (i.e., the weight of the root node).
def weight(tree: CodeTree[T]): Int =
???
src/main/scala/huffman/HuffmanImpl.scala
symbols
which returns the list of symbols defined in a given Huffman tree.
def symbols(tree: CodeTree[T]): List[T] =
???
src/main/scala/huffman/HuffmanImpl.scala
Using weight
and symbols
, we can define the function makeCodeTree
which takes the left and right subtrees, then builds a new tree by adding a parent node on top of these two trees:
def makeCodeTree(left: CodeTree[T], right: CodeTree[T]): CodeTree[T] =
Fork(left, right, symbols(left) ++ symbols(right), weight(left) + weight(right))
src/main/scala/huffman/HuffmanImpl.scala
makeCodeTree
automatically calculates the list of symbols and the weight when creating a node. Therefore, code trees can be constructed manually in the following way:
val sampleTree = makeCodeTree(
makeCodeTree(Leaf('x', 1), Leaf('e', 1)),
Leaf('t', 2)
)
Constructing Huffman Code Trees
The goal of this section is to define a function createCodeTree
, which builds a code tree for the given text symbols
(i.e., a list of symbols):
def createCodeTree(symbols: List[T]): CodeTree[T] =
???
src/main/scala/huffman/HuffmanImpl.scala
We break this assignment into smaller parts:
-
Define a function
symbolFreqs
which calculates the frequency of each symbol in the text.def symbolFreqs(symbols: List[T]): List[(T, Int)] = ???
src/main/scala/huffman/HuffmanImpl.scala
-
Write a function
makeOrderedLeafList
which generates a list containing all the leaves of the Huffman code tree to be constructed (the caseLeaf[T]
of the algebraic datatypeCodeTree[T]
). The list should be ordered by ascending weights where the weight of a leaf is the number of times (i.e., the frequency) it appears in the given text.def makeOrderedLeafList(freqs: List[(T, Int)]): List[Leaf[T]] = ???
src/main/scala/huffman/HuffmanImpl.scala
-
Write a simple function
isSingleton
which checks whether a list of code trees contains only one single tree.def isSingleton(trees: List[CodeTree[T]]): Boolean = ???
src/main/scala/huffman/HuffmanImpl.scala
-
Write a function
combine
which (1) removes the two trees with the lowest weight from the list constructed in the previous step, and (2) merges them by creating a new node of typeFork[T]
. Add this new tree to the list - which is now one element shorter - while preserving the order (by weight).def combine(trees: List[CodeTree[T]]): List[CodeTree[T]] = ???
src/main/scala/huffman/HuffmanImpl.scala
-
Write a function
until
which calls the two functions defined above until this list contains only a single tree. This tree is the final coding tree. The functionuntil
can be used in the following way:until(isSingleton, combine)(trees)
where
isSingleton
andcombine
refer to the two functions defined above.def until( isDone: List[CodeTree[T]] => Boolean, merge: List[CodeTree[T]] => List[CodeTree[T]] )(trees: List[CodeTree[T]]): List[CodeTree[T]] = ???
src/main/scala/huffman/HuffmanImpl.scala
-
Finally, use the functions defined above to implement the function
createCodeTree
.
Decoding
Your decoder implementation may pass the tests on your machine but trigger a
StackOverflowError
on the autograder, because the autograder may
have a smaller call stack size than your setup.
stack overflow: According to Wikipedia,
When a program attempts to use more space than is available on the call stack […], the stack is said to overflow, typically resulting in a program crash. […] The most-common cause of stack overflow is excessively deep or infinite recursion, in which a function calls itself so many times that the space needed to store the variables and information associated with each call is more than can fit on the stack.
It’s best not to wait until the last minute to submit your code, so you have enough time to address any issues that may arise!
In this lab, type Bit
is an alias of Int
.
Define a function decodeOne
which decodes one symbol from the bit sequence using the given code tree. The function returns an option. If succeed, it returns Some(sym, remainingBits)
, where sym
is the decoded symbol and remainingBits
is the rest of bit sequence that remains unused. Otherwise, return None
if it fails to decode.
def decodeOne(tree: CodeTree[T], bits: List[Bit]): Option[(T, List[Bit])] =
???
src/main/scala/huffman/HuffmanImpl.scala
Using decodeOne
, now we can define the function decode
which decodes a list of bits (which were already encoded using a Huffman code tree), given the corresponding code tree.
def decode(tree: CodeTree[T], bits: List[Bit]): List[T] =
???
src/main/scala/huffman/HuffmanImpl.scala
In file HuffmanChar.scala
, HuffmanChar
restricts the type of symbols to be Char
.
trait HuffmanChar extends HuffmanImpl[Char]
In HuffmanChar
, we can use everything implemented in HuffmanImpl[T]
as T
is instantiated to Char
.
Use the decode
function and the frenchCode
code tree to decode the bit sequence in secret
, and store the resulting character sequence in decodedSecret
:
def decodedSecret: List[Char] =
???
src/main/scala/huffman/HuffmanChar.scala
Encoding
This section deals with the Huffman encoding of a sequence of symbols into a sequence of bits.
…Using a Huffman Tree
Define the function encode
which encodes a list of characters using Huffman coding, given a code tree.
def encode(tree: CodeTree[T])(text: List[T]): List[Bit] =
???
src/main/scala/huffman/HuffmanImpl.scala
Your implementation must traverse the coding tree for each symbol, which is a task that should be done using a helper function.
Hint
flatMap
may be useful here! You can check this week’s flatMap
exercises for practice.
…Using a Coding Table
The previous function is simple, but very inefficient. The goal is now to define quickEncode
which encodes an equivalent representation, but more efficiently.
def quickEncode(tree: CodeTree[T])(text: List[T]): List[Bit] =
???
src/main/scala/huffman/HuffmanImpl.scala
Your implementation will build a coding table once which, for each possible symbol, gives the list of bits of its code. The simplest way - but not the most efficient - is to encode the table of symbols as a list of pairs.
type CodeTable = List[(T, List[Bit])]
The encoding must then be done by accessing the table, via a function codeBits
which returns the bit sequence representing the given symbol in the code table:
def codeBits(table: CodeTable)(symbol: T): List[Bit] =
???
src/main/scala/huffman/HuffmanImpl.scala
The creation of the table is defined by convert
which traverses the coding tree and constructs the character table, using the mergeCodeTables
defined below:
def convert(tree: CodeTree[T]): CodeTable =
???
src/main/scala/huffman/HuffmanImpl.scala
Hint
Think of a recursive solution: every subtree of the code tree tree
is itself a valid code tree that can be represented as a code table. Using the code tables of the subtrees, think of how to build the code table for the entire tree (that’s why we need mergeCodeTables
).
def mergeCodeTables(a: CodeTable, b: CodeTable): CodeTable =
???
src/main/scala/huffman/HuffmanImpl.scala
Now we can implement quickEncode
using convert
.
Try It Yourself!
Main.scala
provides a command-line interface for the HuffmanChar
program. You do not need to understand how it works yet.
In sbt, use run encode ./assets/test.txt
to encode the test.txt
file in assets
directory. The generated code tree and encoded bit sequences are stored in the newly-generated compressed file test.huf
in the assets
directory.
You’ll find the outputed “Compression Ratio” ($\frac{\textit{original text size}}{\textit{compressed size}}$) is very small. It’s because test.txt
is very small, and the code tree and encoded text is relatively large.
To decode test.huf
, use run decode ./assets/test.huf
. A file test-decoded.txt
will be generated, storing the decoded text. Compare test.txt
and test-decoded.txt
to see if they’re identical.
There are two much-larger files image-base64.txt
(76.83 kB) and alice_in_wonderland.txt
(148.6 kB) in assets
. Try run encode ./assets/image-base64.txt
or run encode ./assets/alice_in_wonderland.txt
to see the compression ratio!
Stack overflow when decoding
If you get a stack overflow error when trying to decode ./assets/image-base64.huf
or ./assets/alice_in_wonderland.huf
, check “Passing the last test” in the bonus section!
Generating an Executable
Similar to the Find lab, you can run pack
in sbt to generate an executable file huffman
in ./target/pack/bin
. In the terminal, you could run commands such as ./target/pack/bin/huffman encode ./assets/alice_in_wonderland.txt
.
image-base64.txt
image-base64.txt
stores the base64 encoding of an image… but which one? Find a base64 decoder that saves its output as a binary file to find out!
Bonus section
Passing the last test (Graded)
You might have stack overflow error when trying to pass the last test, Encode and Decode alice_in_wonderland.txt
. The error message looks like:
If you’ve encountered this issue, do you know why the decode
function leads to stack overflow?
Hint
The reason is likely that your decode
function is not tail recursive.
In a standard recursive function, each recursive call adds a new layer to the call stack. If the recursion is deep, this can quickly consume all the available stack space, leading to a stack overflow.
Tail recursion, on the other hand, is a form of recursion where the recursive call is the last operation in the function, allowing the Scala compiler to optimize it and use a constant amount of stack space. You may check this week’s tail recursion exercises for practice.
Writing Pre-/Post-conditions (Non-graded)
🔜 Our next SE lecture will be on specifications: mathematical descriptions of what a function does. But in this week’s lecture you have already seen a bit of this, in the form of monitors: executable pre- and postconditions added to a function.
Try to come up with good pre- and postconditions to add to your implementation of makeOrderedLeafList
, combine
, until
. If you have any doubts, ask us in help sessions!
We will have many specs exercises in the next exercise set.