Last updated on

Lab 4: Huffman Coding

In this lab, you will apply your knowledge to a practical scenario: file compression. By the end of the lab, you’ll be able to compress real-world files using a custom format (.huf).

Senior track

As usual, here are instructions if you would like to complete a less-guided version of this lab. If you’d prefer the normal experience, skip ahead to the next section!

Your colleague has started a project to implement an in-house alternative to zip, gzip, and other compression formats, based on Huffman coding. As a strong advocate of the test-driven development (TDD) approach, they have written unit tests and added docstrings. This was a lot of work, and so now they’ve gone on vacation. Unfortunately, the code is due in 10 days, so you’re going to have to step in and write the actual implementation for them. With tests and documentation to guide you, what could go wrong?

Check out your colleague’s work following § Logistics. You can find the code and documentation they have already written in HuffmanBase.scala and HuffmanChar.scala. The core of the program is yet to be implemented, in HuffmanImpl.scala.

If you want to follow this week’s senior track, skip the next sections and try implementing the lab using only the given codebase, documentation, and tests. You will most likely need to find information online on how Huffman coding works, since your colleague didn’t bother to document that!

Once you’re finished, don’t forget to check the Bonus section.

Introduction

Huffman coding is a lossless (i.e., keeping all information) compression algorithm that can be used to compress lists of symbols. It’s widely used in data compression tasks such as file archiving. For example, huffman coding is used in Gzip.

Usually, for a normal, uncompressed text (i.e., a string), each character is a symbol represented by the same number of bits (usually eight).

For the text “ABEACADABEA”, using ASCII code for each character, the encoded text has $11 \times 8 = 88$ bits.

Huffman coding is designed to optimize the length of the encoded data. In Huffman coding, each symbol can have a bit representation of different length, depending on the frequencies of symbols: the symbols that often appear in a text are represented by a shorter bit sequence than those being used more rarely.

For the text “ABEACADABEA”:

SymbolFrequencyHuffman Code
A50
B2100
C1110
D1111
E2101

The huffman encoded text has $5 \times 1 + (2 + 1 + 1 + 2) \times 3 = 23$ bits, a significant reduction from the original $88$ bits!

In this lab, your task is to implement the Huffman coding algorithm.

This assignment aims to exercise the following concepts and techniques:

  • Higher-order functions
  • Pattern matching
  • Lists
  • Parametric polymorphism
  • Specifications
  • Testing

Logistics

📝 Important: don’t forget to fill in how many hours you spent on this lab in HowManyHoursISpentOnThisLab.scala to pass the HoursTest.

Theory of Huffman Coding

For a given text, each huffman code defines a specific bit sequence to represent a symbol. The core idea behind Huffman coding is rooted in two principles:

Why is prefix property important?

Consider encoding text AB where A is encoded as $10$ and B as $1010$. The encoded text would be $101010$. Now, when trying to decode, it’s ambiguous where one code starts and the other ends.

This ambiguity is eliminated with prefix property, which ensures that a sequence of codes is immediately and uniquely decodable without requiring any special delimiters or markers between individual codes.

The huffman codes are represented by a code tree, which is a binary tree with leaves representing the symbols.

Constructing Huffman Code Trees

Use text ABEACADABEA with symbols A to E as the example. The code tree construction has 2 steps:

  1. Create a leaf node for each symbol, associated with a weight which denotes the frequency of appearance of that symbol.

    Show Code Trees
    A(5)   B(2)   C(1)   D(1)   E(2)
    
  2. While there’s more than one code tree, remove the two trees with the lowest root weight and merge them into a new tree by creating a new branching node. A branching node can be thought of as a set containing the symbols present in the leaves below it, with its weight being the total weight of those leaves.

    Show Code Trees Step By Step

    Merge C(1) and D(1):

    A(5)   B(2)   E(2)   CD(2)
                        /    \
                      C(1)  D(1)
    

    Merge B(2) and E(2):

    A(5)    BE(4)       CD(2)
           /    \      /    \
          B(2)  E(2)  C(1)  D(1)
    

    Merge BE(4) and CD(2):

    A(5)          BECD(6)
                 /      \
               BE(4)     CD(2)
              /   \     /   \
            B(2) E(2)  C(1) D(1)
    

    Merge A(5) and BECD(6):

        ABECD(11)
       /         \
      A(5)       BECD(6)
                /      \
               BE(4)    CD(2)
              /   \     /   \
            B(2) E(2)  C(1) D(1)
    
Every subtree is itself a valid code tree for a smaller set of symbols.

Encoding

For a given Huffman tree, one can obtain the encoded representation of a symbol by traversing from the root of the tree to the leaf containing the symbol. Along the way, when a left branch is chosen, a 0 is added to the representation, and when a right branch is chosen, 1 is added to the representation.

What’s the huffman code for symbol D in the below code tree?

Example Huffman tree

Solution 1011

Decoding

Decoding also starts at the root of the tree. Given a sequence of bits to decode, we successively read the bits, and for each 0, we choose the left branch, and for each 1 we choose the right branch.

When we reach a leaf, we decode the corresponding symbol and then start again at the root of the tree.

Given the Huffman code tree of the last check, what does the sequence of bits 10001010 correponds to?

Solution BAC

Huffman coding ensures that the average length of the encoded symbols is minimized.

… with Different Types of Symbols

Huffman coding can handle different types of symbols. The previous examples shows that a symbol can be a character.

Image Compression

We can compress an image where each symbol is a color represented as a RGB tuple (Red, Green, Blue). Each symbol can be something like (255, 0, 0) for pure red, (0, 255, 0) for pure green, etc.

DNA Sequences

DNA sequences are composed of four primary nucleotides: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). Besides A, C, G, T, common codons like ATG or TAA can also be treated as individual symbols to make the compression more efficient.

Binary Data

For files that aren’t text-based, such as executables, Huffman coding can operate on bytes (sequences of 8 bits) or even larger chunks of bits.

In general, any data that can be broken down into discrete, countable units can be compressed using Huffman coding. This includes things mentioned in above examples or sensor readings, log entries, and more.

Implementation Guide

When implementing the algorithm in HuffmanImpl.scala, open HuffmanBase.scala on the side to read the documentation.

Since Huffman coding can handle different types of symbols, we use a generic abstract class CodeTree with a type parameter T to represent Huffman code trees, where T represents the type of the symbols. CodeTree[T] has two subtypes: Leaf[T] and Fork[T]:

enum CodeTree[T]:
  case Leaf(symbol: T, weight: Int)
  case Fork(left: CodeTree[T], right: CodeTree[T], symbols: List[T], weight: Int)

src/main/scala/huffman/HuffmanBase.scala

Structure

The relationships between the traits we’ll use:

  HuffmanBase[T]
      ↑
HuffmanImpl[T]  --(T = Char)--> HuffmanImpl[Char]
                                     ↑
                                HuffmanChar

Where

 A
 ↑
 B

means trait B extends trait A.

To begin, implement the following two (hint: very simple) functions on the code tree:

  1. weight which returns the weight of a given Huffman code tree (i.e., the weight of the root node).
def weight(tree: CodeTree[T]): Int =
  ???

src/main/scala/huffman/HuffmanImpl.scala

  1. symbols which returns the list of symbols defined in a given Huffman tree.
def symbols(tree: CodeTree[T]): List[T] =
  ???

src/main/scala/huffman/HuffmanImpl.scala

Using weight and symbols, we can define the function makeCodeTree which takes the left and right subtrees, then builds a new tree by adding a parent node on top of these two trees:

def makeCodeTree(left: CodeTree[T], right: CodeTree[T]): CodeTree[T] =
  Fork(left, right, symbols(left) ++ symbols(right), weight(left) + weight(right))

src/main/scala/huffman/HuffmanImpl.scala

makeCodeTree automatically calculates the list of symbols and the weight when creating a node. Therefore, code trees can be constructed manually in the following way:

val sampleTree = makeCodeTree(
  makeCodeTree(Leaf('x', 1), Leaf('e', 1)),
  Leaf('t', 2)
)

Constructing Huffman Code Trees

The goal of this section is to define a function createCodeTree, which builds a code tree for the given text symbols (i.e., a list of symbols):

def createCodeTree(symbols: List[T]): CodeTree[T] =
  ???

src/main/scala/huffman/HuffmanImpl.scala

We break this assignment into smaller parts:

  1. Define a function symbolFreqs which calculates the frequency of each symbol in the text.

    def symbolFreqs(symbols: List[T]): List[(T, Int)] =
      ???
    

    src/main/scala/huffman/HuffmanImpl.scala

  2. Write a function makeOrderedLeafList which generates a list containing all the leaves of the Huffman code tree to be constructed (the case Leaf[T] of the algebraic datatype CodeTree[T]). The list should be ordered by ascending weights where the weight of a leaf is the number of times (i.e., the frequency) it appears in the given text.

    def makeOrderedLeafList(freqs: List[(T, Int)]): List[Leaf[T]] =
      ???
    

    src/main/scala/huffman/HuffmanImpl.scala

  3. Write a simple function isSingleton which checks whether a list of code trees contains only one single tree.

    def isSingleton(trees: List[CodeTree[T]]): Boolean =
      ???
    

    src/main/scala/huffman/HuffmanImpl.scala

  4. Write a function combine which (1) removes the two trees with the lowest weight from the list constructed in the previous step, and (2) merges them by creating a new node of type Fork[T]. Add this new tree to the list - which is now one element shorter - while preserving the order (by weight).

    def combine(trees: List[CodeTree[T]]): List[CodeTree[T]] =
      ???
    

    src/main/scala/huffman/HuffmanImpl.scala

  5. Write a function until which calls the two functions defined above until this list contains only a single tree. This tree is the final coding tree. The function until can be used in the following way:

    until(isSingleton, combine)(trees)
    

    where isSingleton and combine refer to the two functions defined above.

    def until(
        isDone: List[CodeTree[T]] => Boolean,
        merge: List[CodeTree[T]] => List[CodeTree[T]]
    )(trees: List[CodeTree[T]]): List[CodeTree[T]] =
      ???
    

    src/main/scala/huffman/HuffmanImpl.scala

  6. Finally, use the functions defined above to implement the function createCodeTree.

Decoding

Your decoder implementation may pass the tests on your machine but trigger a StackOverflowError on the autograder, because the autograder may have a smaller call stack size than your setup.

stack overflow: According to Wikipedia,

When a program attempts to use more space than is available on the call stack […], the stack is said to overflow, typically resulting in a program crash. […] The most-common cause of stack overflow is excessively deep or infinite recursion, in which a function calls itself so many times that the space needed to store the variables and information associated with each call is more than can fit on the stack.

It’s best not to wait until the last minute to submit your code, so you have enough time to address any issues that may arise!

In this lab, type Bit is an alias of Int.

Define a function decodeOne which decodes one symbol from the bit sequence using the given code tree. The function returns an option. If succeed, it returns Some(sym, remainingBits), where sym is the decoded symbol and remainingBits is the rest of bit sequence that remains unused. Otherwise, return None if it fails to decode.

def decodeOne(tree: CodeTree[T], bits: List[Bit]): Option[(T, List[Bit])] =
  ???

src/main/scala/huffman/HuffmanImpl.scala

Using decodeOne, now we can define the function decode which decodes a list of bits (which were already encoded using a Huffman code tree), given the corresponding code tree.

def decode(tree: CodeTree[T], bits: List[Bit]): List[T] =
  ???

src/main/scala/huffman/HuffmanImpl.scala

In file HuffmanChar.scala, HuffmanChar restricts the type of symbols to be Char.

trait HuffmanChar extends HuffmanImpl[Char]

In HuffmanChar, we can use everything implemented in HuffmanImpl[T] as T is instantiated to Char.

Use the decode function and the frenchCode code tree to decode the bit sequence in secret, and store the resulting character sequence in decodedSecret:

def decodedSecret: List[Char] =
  ???

src/main/scala/huffman/HuffmanChar.scala

Encoding

This section deals with the Huffman encoding of a sequence of symbols into a sequence of bits.

…Using a Huffman Tree

Define the function encode which encodes a list of characters using Huffman coding, given a code tree.

def encode(tree: CodeTree[T])(text: List[T]): List[Bit] =
  ???

src/main/scala/huffman/HuffmanImpl.scala

Your implementation must traverse the coding tree for each symbol, which is a task that should be done using a helper function.

Hint

flatMap may be useful here! You can check this week’s flatMap exercises for practice.

…Using a Coding Table

The previous function is simple, but very inefficient. The goal is now to define quickEncode which encodes an equivalent representation, but more efficiently.

def quickEncode(tree: CodeTree[T])(text: List[T]): List[Bit] =
  ???

src/main/scala/huffman/HuffmanImpl.scala

Your implementation will build a coding table once which, for each possible symbol, gives the list of bits of its code. The simplest way - but not the most efficient - is to encode the table of symbols as a list of pairs.

type CodeTable = List[(T, List[Bit])]

The encoding must then be done by accessing the table, via a function codeBits which returns the bit sequence representing the given symbol in the code table:

def codeBits(table: CodeTable)(symbol: T): List[Bit] =
  ???

src/main/scala/huffman/HuffmanImpl.scala

The creation of the table is defined by convert which traverses the coding tree and constructs the character table, using the mergeCodeTables defined below:

def convert(tree: CodeTree[T]): CodeTable =
  ???

src/main/scala/huffman/HuffmanImpl.scala

Hint

Think of a recursive solution: every subtree of the code tree tree is itself a valid code tree that can be represented as a code table. Using the code tables of the subtrees, think of how to build the code table for the entire tree (that’s why we need mergeCodeTables).

def mergeCodeTables(a: CodeTable, b: CodeTable): CodeTable =
  ???

src/main/scala/huffman/HuffmanImpl.scala

Now we can implement quickEncode using convert.

Try It Yourself!

Main.scala provides a command-line interface for the HuffmanChar program. You do not need to understand how it works yet.

In sbt, use run encode ./assets/test.txt to encode the test.txt file in assets directory. The generated code tree and encoded bit sequences are stored in the newly-generated compressed file test.huf in the assets directory.

You’ll find the outputed “Compression Ratio” ($\frac{\textit{original text size}}{\textit{compressed size}}$) is very small. It’s because test.txt is very small, and the code tree and encoded text is relatively large.

To decode test.huf, use run decode ./assets/test.huf. A file test-decoded.txt will be generated, storing the decoded text. Compare test.txt and test-decoded.txt to see if they’re identical.

There are two much-larger files image-base64.txt(76.83 kB) and alice_in_wonderland.txt(148.6 kB) in assets. Try run encode ./assets/image-base64.txt or run encode ./assets/alice_in_wonderland.txt to see the compression ratio!

Stack overflow when decoding

If you get a stack overflow error when trying to decode ./assets/image-base64.huf or ./assets/alice_in_wonderland.huf, check “Passing the last test” in the bonus section!

Generating an Executable

Similar to the Find lab, you can run pack in sbt to generate an executable file huffman in ./target/pack/bin. In the terminal, you could run commands such as ./target/pack/bin/huffman encode ./assets/alice_in_wonderland.txt.

image-base64.txt

image-base64.txt stores the base64 encoding of an image… but which one? Find a base64 decoder that saves its output as a binary file to find out!

Bonus section

Passing the last test (Graded)

You might have stack overflow error when trying to pass the last test, Encode and Decode alice_in_wonderland.txt. The error message looks like:

StackOverflow

If you’ve encountered this issue, do you know why the decode function leads to stack overflow?

Hint

The reason is likely that your decode function is not tail recursive.

In a standard recursive function, each recursive call adds a new layer to the call stack. If the recursion is deep, this can quickly consume all the available stack space, leading to a stack overflow.

Tail recursion, on the other hand, is a form of recursion where the recursive call is the last operation in the function, allowing the Scala compiler to optimize it and use a constant amount of stack space. You may check this week’s tail recursion exercises for practice.

Writing Pre-/Post-conditions (Non-graded)

🔜 Our next SE lecture will be on specifications: mathematical descriptions of what a function does. But in this week’s lecture you have already seen a bit of this, in the form of monitors: executable pre- and postconditions added to a function.

Try to come up with good pre- and postconditions to add to your implementation of makeOrderedLeafList, combine , until. If you have any doubts, ask us in help sessions!

We will have many specs exercises in the next exercise set.