Saint Louis University
Computer Science 2100
Dept. of Math & Computer Science
Please see the general programming webpage for details about the programming environment for this course, guidelines for programming style, and details on electronic submission of assignments.
For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.
Please make sure you adhere to the policies on academic integrity in this regard.
The files you may need for this assignment can be downloaded here.
This project is the dual of the decode project. In that earlier project, you wrote code to take a compressed file and turn it back into a traditional ASCII file. In this assignment, we want you to take an original ASCII file and produce a compressed version of it using the same conventions. Therefore your encode and decode programs ideally form a pair that can work together.
One of the bigger challenges of this assignment will be in developing an appropriate coding scheme that optimizes the compression for a particular input file. To this end, you will use the Huffman coding algorithm as discussed in Chapter 12.4 of the text as well as demonstrated by our huffman software. Carefully review both of those references for an overview of the process.
Your program must see the encoding process through from beginning to end. You should prompt the user for the filename of the original input, build up the Huffman coding tree for that input, then prompt the user for an output file in which you will write the compressed form of the file (including the necessary header information for representing the tree itself). Here is a suggested outline of the steps.
Open the input file and read it in its entirety, eight bits at a time, so that you can compute the frequency counts for each character. Keep in mind that the original ASCII characters can be viewed as integers from 0 to 255, and so you can use an array for storing these counts as you go. Go ahead and close the input file for the time being; we'll come back to it later.
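As a rough sketch of this first pass (using the standard library directly; the provided InBitStream class could serve the same purpose, though its exact interface may differ from what is shown here):

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        std::string filename;
        std::cout << "Input file: ";
        std::cin >> filename;

        int freq[257] = {0};         // counts for bytes 0..255, plus EOM at 256
        std::ifstream in(filename, std::ios::binary);
        char c;
        while (in.get(c))
            freq[static_cast<unsigned char>(c)]++;  // treat each byte as 0..255
        freq[256] = 1;               // one presumed occurrence of the EOM
        in.close();                  // reopened later for the encoding pass
    }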
Build the tree that represents the Huffman code, including all characters which had non-zero frequency as well as one presumed occurrence of a special EOM character. The general algorithm for this process is given on page 577 of the text.
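In outline, the merging process might look like the sketch below. Here HuffNode is a hypothetical stand-in for the course's LinkedBinaryTree class (whose actual interface may differ), and the min-oriented priority queue options are discussed later on this page:

    #include <queue>
    #include <vector>

    struct HuffNode {
        int symbol;                  // 0..255 for bytes, 256 for EOM, -1 if internal
        int freq;                    // subtree's total frequency (its priority)
        HuffNode* left;
        HuffNode* right;
    };

    struct MinByFreq {               // reversed comparison yields a min-oriented queue
        bool operator()(const HuffNode* a, const HuffNode* b) const {
            return a->freq > b->freq;
        }
    };

    HuffNode* buildHuffmanTree(const int freq[257]) {
        std::priority_queue<HuffNode*, std::vector<HuffNode*>, MinByFreq> pq;
        for (int c = 0; c <= 256; c++)           // a leaf for each used symbol + EOM
            if (freq[c] > 0)
                pq.push(new HuffNode{c, freq[c], nullptr, nullptr});
        while (pq.size() > 1) {                  // repeatedly merge the two cheapest
            HuffNode* a = pq.top(); pq.pop();
            HuffNode* b = pq.top(); pq.pop();
            pq.push(new HuffNode{-1, a->freq + b->freq, a, b});
        }
        return pq.top();                         // the completed coding tree
    }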
Soon, we will begin to encode the original message by reading each character of the input file (again) and outputting the Huffman code for that character. This code could be computed for each character by walking up from the corresponding leaf of the tree (and then reversing the pattern to get the true code), but this is time-consuming and also requires that we know where each character's associated leaf lies.
A better approach is to precompute the codes for each character, storing them directly in a table indexed from 0 to 256 (including the EOM). As an intermediate representation, let's assume that we want to compute strings of a form such as "0010", so that we have the code string for each original character from 0 to 256 (in the end, we will not output these strings themselves, but rather a stream of bits; still, the strings will be easy to work with for now).
It turns out that the easiest way to fill in this entire table is not by doing it one character at a time, but instead by doing a recursive tree traversal. If you can do a tree traversal, keeping track of the "prefix" associated with a node as you go, then each time you reach a leaf of the tree, that prefix will be the full code for that associated character; you can write it directly into the proper place in the table we are building.
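Continuing the hypothetical HuffNode sketch from above, the traversal might take this shape:

    #include <string>

    // Fill codes[0..256] with the bit-string for each symbol by walking the
    // tree once, carrying the path ("prefix") taken so far.
    void fillCodeTable(const HuffNode* node, const std::string& prefix,
                       std::string codes[257]) {
        if (node->left == nullptr && node->right == nullptr) {
            codes[node->symbol] = prefix;        // at a leaf, the prefix is the code
            return;
        }
        fillCodeTable(node->left,  prefix + "0", codes);   // 0 = go left
        fillCodeTable(node->right, prefix + "1", codes);   // 1 = go right
    }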
You are now ready to create the actual output file. Keep in mind that the output file must have two parts to it. First, the tree itself must be written to the beginning of that file, using the precise conventions that we discussed with the decode assignment. Just as we suggested a recursive approach when parsing that format, you can generate the format quite easily with recursion. Perform a preorder traversal, outputting a '0' for each internal node, and a '1' followed by a 9-bit pattern for each external node.
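For instance, the header-writing traversal could look like the following sketch; OutBitStream and its writeBit method are hypothetical names here, so adapt the calls to whatever interface the provided classes actually offer:

    // Write the tree header in preorder: '0' for an internal node,
    // '1' plus a 9-bit symbol for each leaf.
    void writeTree(const HuffNode* node, OutBitStream& out) {
        if (node->left == nullptr && node->right == nullptr) {
            out.writeBit(1);                     // leaf marker
            for (int i = 8; i >= 0; i--)         // 9 bits, most significant first
                out.writeBit((node->symbol >> i) & 1);
        } else {
            out.writeBit(0);                     // internal-node marker
            writeTree(node->left, out);
            writeTree(node->right, out);
        }
    }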
Immediately after the header should be the encoded version of the entire original message, followed by the single EOM character. You will need to explicitly (re)open the input file in order to begin reading it from the beginning (recall that you already read to the end of that file when computing the original frequencies). With it reopened, you can read the file byte by byte, outputting the appropriate binary code associated with each character. As we noted earlier, you are not actually outputting the "01" string that was computed earlier; instead, you need to output the individual 0 and 1 bits as you iterate through the characters of that string.
Note that the original input file does NOT actually have any special EOM character, as this is not the standard convention for text files on a computer system. Instead, note that our InBitStream class has an eof() method which allows you to test whether or not you have reached the end of the actual file.
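Putting these pieces together, the second pass might look roughly like this (readByte and writeBit are again hypothetical method names; only the eof() method is documented above):

    // Second pass: reopen the input, emit the code for each byte, then the EOM.
    void encodeBody(InBitStream& in, OutBitStream& out,
                    const std::string codes[257]) {
        while (!in.eof()) {
            int c = in.readByte();                 // hypothetical 8-bit read
            for (char bit : codes[c])
                out.writeBit(bit == '1' ? 1 : 0);  // emit the precomputed code
        }
        for (char bit : codes[256])                // finally, the EOM code
            out.writeBit(bit == '1' ? 1 : 0);
    }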
The supporting files are identical to those that were provided for the decode assignment, other than a new makefile that will allow make encode to compile the new project.
All such files can be downloaded here, and descriptions can be found in the decode project description.
You will notice that the Huffman algorithm requires maintaining a priority queue of trees, with each tree having an integer frequency as its priority, and with the goal of finding the minimum-priority trees at each step. From a technical perspective, you want to avoid copying trees as elements of the priority queue, since that is expensive, so you should really think of having a priority queue of LinkedBinaryTree* values (i.e., pointers).
As far as the priority queue implementation, we are not providing you with one. You have several options:

Code your own.
Priority queues with binary heaps are described in Chapter 8.3 of the book. Coding the behaviors, especially if you use the simple vector representation of the heap, isn't so bad.

STL's std::priority_queue class
There is a working priority queue class in the standard libraries. You are welcome to use it. However, please be aware that the convention is that the queue is max-oriented, with the top of the queue having the biggest priority; to extract minimums you will need to supply a comparator that reverses the comparison.

STL's <algorithm> library
The algorithm library offers support for maintaining a priority queue directly in an array or vector through its push_heap and pop_heap functions. As with the priority_queue class, the convention is that the heap is max-oriented, with the front of the queue having the biggest priority.
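If you take the third route, here is a minimal sketch, reusing the hypothetical HuffNode and MinByFreq comparator from earlier (the reversed comparison is what makes the heap min-oriented):

    #include <algorithm>
    #include <vector>

    std::vector<HuffNode*> heap;

    void heapPush(HuffNode* t) {
        heap.push_back(t);
        std::push_heap(heap.begin(), heap.end(), MinByFreq());
    }

    HuffNode* heapPop() {
        std::pop_heap(heap.begin(), heap.end(), MinByFreq());  // min moves to the back
        HuffNode* t = heap.back();
        heap.pop_back();
        return t;
    }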
As was the case with the preceding assignment, debugging your code is a bit challenging, especially given that your eventual output file is an arbitrary binary file, and thus not directly viewable in an editor. It is hard even to verify whether your output is legitimate, much less to determine what went wrong when it is not. To this end, we provide the following assistance.
Keep in mind that because ties may arise during the Huffman algorithm, the encoding tree you get may not be precisely the same as the encoding tree that I get. Therefore, your compressed version of a file may not be identical to my compressed version of the same file.
But if you've done the job properly the size of the file should be the same. More importantly, the decoder program should always be able to properly uncompress the data to retrieve the original version. If you are confident in your own decoder, you might try running it on your encoded output to make sure that the two parts work well together. If you are not confident in your own decoder, I am providing everyone with access to my own decoder (not the source code, but an executable). You will find that at: turing.slu.edu:/Public/goldwasser/2100/tools/myDecode
If something goes wrong, it might be nice to see precisely what you wrote into the output file, but as we mentioned, a standard editor will not be that helpful. To this end, we have written a special program, Viewer, which you can use to turn a true binary file into an ASCII version of the file using the characters "0" and "1". Obviously, this destroys the compression, but it gives you a way to view your output with a standard editor. You will find that at: turing.slu.edu:/Public/goldwasser/2100/tools/Viewer
encode.cpp
For simplicity, please put all of your code in a single file (you may declare whatever variables, classes, and functions you wish within this single file).
Readme File
A brief summary of your program, and any further comments you
wish to make to the grader. If you do the extra credit, please
make this clear.
The assignment is worth 20 points.
Though we characterized this entire assignment under the assumption that the original file to be compressed was composed of ASCII characters, that was completely unnecessary. Your encoder can be used, without modification, on any file type you wish. It will be read eight bits at a time, with the frequencies of those eight-bit patterns used in the algorithm, yet it does not matter whether those bits truly represented text in the original file. Of course, the amount of compression achieved will still depend upon taking advantage of irregularities in the frequency counts.
If interested, try compressing various other file types (e.g. various images, MS word documents, even an executable itself, or even a previously compressed "myzip" file). See how much compression you get. The number of bytes of a file can be found by typing ls -l (the long format output for the ls command).
For this project, we assumed that we processed the original file in eight-bit chunks, and thus had a character set which ranged over values 0 to 255 (not including the EOM character we used). Yet the Huffman encoding algorithm did not depend upon the assumption that we originally considered 8-bit patterns; it can be used for any original fixed-length pattern size.
For extra credit, implement another version of your program, encode16, which does the encoding process using 16 bits at a time, thus with characters indexed from 0 to 65535 (again not including the EOM). Presumably you will need to rewrite your decoder to get a 16-bit version of the decoder as well. Once you have the 16-bit encoding implemented, run some experiments on various files to determine how this version compares to the original in terms of the amount of compression achieved. Include a discussion of this in your readme file.
Note: I have not actually tried this yet! I expect the implementation to be quite similar (even better if you make BITS_PER_PATTERN a predefined constant which you use when writing your code). The only complication for you to deal with is in deciding what to do in the case that the original file has an odd number of bytes overall. In this case, there will be one incomplete 16-bit pattern at the end which you still must include in some way. This likely means you may need to alter the EOM conventions in an appropriate way.
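As an untested sketch of how such a constant might drive the rest of the code:

    // Sizing everything from one constant makes an encode16 variant mostly a
    // one-line change; the incomplete-final-pattern issue still needs separate
    // handling, per the note above.
    const int BITS_PER_PATTERN = 8;                  // 16 for encode16
    const int NUM_PATTERNS = 1 << BITS_PER_PATTERN;  // 256, or 65536 for encode16
    const int EOM = NUM_PATTERNS;                    // one extra pseudo-symbol
    const int TABLE_SIZE = NUM_PATTERNS + 1;         // size of frequency/code tables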