Math 220/317 Assignment 7: Huffman Encoding
Reading:
- Exploring Java, Chapter 8 on Input/Output.
- Algorithms in C++, the chapter on Huffman Encoding
Programming:
In order to gain access to files, this Java program will run as a
stand-alone application
rather than in applet. An example
program including an application shell for an applet, file input
output, and threading is given.
Modify this application to encode/decode using the Huffman algorithm
given in Algorithms in C++.
An encoded file has format the literal string HuffmanEncoded
alone on the first line, followed by a representation on the
Huffman tree on subsequent lines, followed by a newline, followed
by the encoded data in 8-bit binary.
If the program sees that the named file has first line
matching this format, the program assumes it should decode
the file. Else the program encodes the file, which can be assumed
to be ASCII.
Hint
Do this program in the following phases:
- Phase I: Input a file, frequency count the letters,
output the frequency count to a file.
- Phase II: Read and write a Trie in human/machine readable
format (see below).
- Phase III: Input a file, frequencey count the letters,
create the Huffman Trie and output it to a file.
Check carefully that all is well.
- Phase IV: Continue the program in Phase III to rescan the
input file, encoding it into a second output file.
- Phase V: Write a companion program (begin by making a duplicate
of the current program) which reads the Huffman Trie from
its file, and reads the encoded data from its file, and
decodes the file into a third file.
- Phase VI: Finish off the nice stuff: combine the Trie and
encoded data files into a single file; combine the two
programs into one, selecting the action based on the
first line of the file read.
A tree can be put into human/machine readable format using
the formula:
(left-sub-tree right-sub-tree)
and a one leaf, with ab as content is format ab.
Example:
((ef)((a0)(4f)))
looks like this:
+
/ \
ef +
/ \
a0 4f
If we assume that leaf contents will be hexidecimal, then
we can ignore any characters in the input besides:
)(0123456789abcdef
For example, newlines, tabs and blanks on output
are used to keep the output
lines short but are completely ignored when reading back the tree.
Rather than go into it too much here, another
html page has the details about this tree to string description
conversion.