Math 220/317 Assignment 7: Huffman Encoding

Reading:

Programming:

In order to gain access to files, this Java program will run as a stand-alone application rather than in applet. An example program including an application shell for an applet, file input output, and threading is given.

Modify this application to encode/decode using the Huffman algorithm given in Algorithms in C++. An encoded file has format the literal string HuffmanEncoded alone on the first line, followed by a representation on the Huffman tree on subsequent lines, followed by a newline, followed by the encoded data in 8-bit binary. If the program sees that the named file has first line matching this format, the program assumes it should decode the file. Else the program encodes the file, which can be assumed to be ASCII.

Hint
Do this program in the following phases:

  1. Phase I: Input a file, frequency count the letters, output the frequency count to a file.
  2. Phase II: Read and write a Trie in human/machine readable format (see below).
  3. Phase III: Input a file, frequencey count the letters, create the Huffman Trie and output it to a file. Check carefully that all is well.
  4. Phase IV: Continue the program in Phase III to rescan the input file, encoding it into a second output file.
  5. Phase V: Write a companion program (begin by making a duplicate of the current program) which reads the Huffman Trie from its file, and reads the encoded data from its file, and decodes the file into a third file.
  6. Phase VI: Finish off the nice stuff: combine the Trie and encoded data files into a single file; combine the two programs into one, selecting the action based on the first line of the file read.

A tree can be put into human/machine readable format using the formula:

(left-sub-tree right-sub-tree)
and a one leaf, with ab as content is format ab.
Example:
((ef)((a0)(4f)))
looks like this:
                +
               / \
             ef   +
                 / \
                a0  4f
If we assume that leaf contents will be hexidecimal, then we can ignore any characters in the input besides:
)(0123456789abcdef
For example, newlines, tabs and blanks on output are used to keep the output lines short but are completely ignored when reading back the tree.

Rather than go into it too much here, another html page has the details about this tree to string description conversion.