Saint Louis University | Computer Science 180 | Dept. of Math & Computer Science
Please see the general programming webpage for details about the programming environment for this course, guidelines for programming style, and details on electronic submission of assignments.
For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.
Please make sure you adhere to the policies on academic integrity in this regard.
The files you may need for this assignment can be downloaded here.
This project is the dual of the decode project. In that earlier project, you wrote code to take a compressed file and turn it back into a traditional ASCII file. In this assignment, we want you to take an original ASCII file and produce a compressed version of it using the same conventions. Therefore your encode and decode programs ideally form a pair which can work together.
One of the bigger challenges of this assignment will be in developing an appropriate coding scheme which optimizes the compression for a particular input file. To this end, you will use the Huffman coding algorithm as discussed in Chapter 11.4 of the text as well as in lecture. Carefully review both of those references for an overview of the process.
Your program must see the encoding process through from beginning to end. You should prompt the user for the filename of the original input, build up the Huffman coding tree for that input, then prompt the user for an output file in which you will write the compressed form of the file (including the necessary header information for representing the tree itself). Here is a suggested outline of the steps.
Open the input file and read the entire thing eight bits at a time, so that you can compute the frequency counts for each character. Keep in mind that the original ASCII characters can be viewed as integers from 0 to 255, and so you can use an array for storing these counts as you go. Go ahead and close the input file for the time being; we'll come back to it later.
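As a rough illustration of the counting step, here is a minimal sketch. It deliberately uses a standard `std::istream` as a stand-in for the provided InBitStream class (whose `read(8)` and `eof()` you would use in your actual solution); the function name `countFrequencies` and the constant `NUM_BYTES` are our own hypothetical choices, not part of the provided files.

```cpp
#include <array>
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this out in memory

// Number of distinct eight-bit patterns (byte values 0..255).
// Hypothetical name, not part of the provided files.
constexpr std::size_t NUM_BYTES = 256;

// Count how often each byte value occurs in the stream.
// Assumption: a std::istream stands in for looping over
// InBitStream::read(8) until eof() reports the end of the file.
std::array<unsigned long, NUM_BYTES> countFrequencies(std::istream& in) {
    std::array<unsigned long, NUM_BYTES> counts{};  // all counts start at zero
    char c;
    while (in.get(c)) {
        // Cast through unsigned char so that bytes >= 128 index correctly.
        ++counts[static_cast<unsigned char>(c)];
    }
    return counts;
}
```

For example, calling `countFrequencies` on a `std::istringstream` holding "abca" should leave a count of 2 at index 'a' and a count of 1 at each of 'b' and 'c'.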
Build the tree which represents the Huffman code, including all characters which had non-zero frequency as well as one presumed occurrence of a special EOM character. The general algorithm for this process is given on page 563 of the text.
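To make the shape of the tree-building step concrete, here is a sketch under several assumptions. It uses `std::priority_queue` and a simple index-based `Node` struct rather than the provided PriorityQueue and BinaryTree classes, and the names `Node`, `EOM`, and `buildHuffmanTree` are hypothetical. Your own solution should follow the textbook algorithm with the provided classes.

```cpp
#include <array>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

constexpr int NUM_BYTES = 256;  // byte values 0..255
constexpr int EOM = 256;        // assumed index for the end-of-message symbol

// Hypothetical index-based node; children are positions in a std::vector<Node>.
struct Node {
    unsigned long weight;
    int symbol;  // 0..256 for a leaf, -1 for an internal node
    int left;    // index of left child, -1 if none
    int right;   // index of right child, -1 if none
};

// Appends the tree's nodes to `nodes` and returns the index of the root.
int buildHuffmanTree(const std::array<unsigned long, NUM_BYTES>& counts,
                     std::vector<Node>& nodes) {
    using Entry = std::pair<unsigned long, int>;  // (weight, node index)
    // std::greater turns the default max-heap into a min-heap on weight.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;

    auto addLeaf = [&](unsigned long w, int sym) {
        nodes.push_back({w, sym, -1, -1});
        pq.push({w, static_cast<int>(nodes.size()) - 1});
    };
    for (int s = 0; s < NUM_BYTES; ++s) {
        if (counts[s] > 0) addLeaf(counts[s], s);  // skip unused characters
    }
    addLeaf(1, EOM);  // one presumed occurrence of EOM

    // Repeatedly merge the two lightest subtrees until one tree remains.
    while (pq.size() > 1) {
        Entry a = pq.top(); pq.pop();
        Entry b = pq.top(); pq.pop();
        nodes.push_back({a.first + b.first, -1, a.second, b.second});
        pq.push({a.first + b.first, static_cast<int>(nodes.size()) - 1});
    }
    return pq.top().second;
}
```

In this sketch the root's weight always equals the number of bytes in the input plus one (for the EOM), which makes a convenient sanity check.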
Soon, we will begin to encode the original message by reading each character of the input file (again), and outputting the Huffman code for that character. This code can be computed for each character by walking up from the corresponding leaf of the tree (and then reversing the pattern to get the true code), but this is time-consuming and also requires that we know which leaf is associated with each character.
A better approach is to precompute the codes for each character, storing them directly in a table indexed from 0 to 256 (including the EOM). As an intermediary representation, let's assume that we want to compute strings of a form such as "0010", so that we have the code string for each original character from 0 to 256. (In the end, we don't really output these strings but rather a stream of bits; the strings are simply easy to deal with for now.)
It turns out that the easiest way to fill in this entire table is not by doing it one character at a time, but instead by doing a recursive tree traversal. If you can do a tree traversal, keeping track of the "prefix" associated with a node as you go, then each time you reach a leaf of the tree, that prefix will be the full code for that associated character; you can write it directly into the proper place in the table we are building.
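The traversal described above might look roughly like the following sketch. It assumes a hypothetical index-based `Node` struct instead of the provided BinaryTree class, and it assumes the convention that a left branch contributes '0' and a right branch contributes '1'; check that against the conventions from the decode assignment before relying on it.

```cpp
#include <array>
#include <string>
#include <vector>

// Hypothetical index-based node; children are positions in a std::vector<Node>.
struct Node {
    unsigned long weight;
    int symbol;  // 0..256 for a leaf, -1 for an internal node
    int left;    // index of left child, -1 if none
    int right;   // index of right child, -1 if none
};

// One code string per symbol: byte values 0..255 plus the EOM at index 256.
using CodeTable = std::array<std::string, 257>;

// Recursive traversal that carries the prefix accumulated so far.
// Assumption: left edge means '0', right edge means '1'.
void fillCodes(const std::vector<Node>& nodes, int index,
               std::string& prefix, CodeTable& table) {
    const Node& n = nodes[index];
    if (n.symbol >= 0) {
        // At a leaf, the prefix is the complete code for this symbol.
        table[n.symbol] = prefix;
        return;
    }
    prefix.push_back('0');
    fillCodes(nodes, n.left, prefix, table);
    prefix.back() = '1';
    fillCodes(nodes, n.right, prefix, table);
    prefix.pop_back();  // restore the prefix for the caller
}
```

Passing the prefix by reference and undoing each change on the way back avoids copying a string at every node; passing it by value would be equally correct and arguably simpler.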
You are now ready to create the actual output file. Keep in mind that the output file must have two parts to it. First the tree itself must be written to the beginning of that file, using the precise conventions that we discussed with the decode assignment. Just as we suggested a recursive approach when parsing that format, you can generate the format quite easily with recursion.
Immediately after the header should be the encoded version of the entire original message, followed by the single EOM character. You will need to explicitly (re)open the input file in order to begin reading it from the beginning (recall that you already read to the end of that file when computing the original frequencies). With it reopened, you can read the file byte by byte, outputting the appropriate binary code associated with each character. Note that the original input file does NOT actually have any special EOM character, as this is not the standard convention for text files on a computer system. Instead, our InBitStream class has an eof() method which allows you to test whether or not you have reached the end of the actual file.
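The body-encoding loop can be sketched as follows. Everything here is a stand-in: `BitSink` is a hypothetical in-memory substitute for the provided OutBitStream (whose `write(int value)` you would call instead), a `std::istream` substitutes for InBitStream, and the choices of packing bits most-significant-first and padding the final byte with zeros are assumptions made only so the sketch is self-contained. The tree header is not shown, since its format is defined by the decode assignment.

```cpp
#include <array>
#include <cstdint>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this out in memory
#include <string>
#include <vector>

constexpr int EOM = 256;  // assumed index for the end-of-message symbol
using CodeTable = std::array<std::string, 257>;

// Hypothetical in-memory stand-in for OutBitStream.
// Assumption: bits are packed most-significant-bit first.
class BitSink {
public:
    void write(int bit) {
        current_ = static_cast<std::uint8_t>((current_ << 1) | (bit & 1));
        if (++filled_ == 8) {
            bytes_.push_back(current_);
            current_ = 0;
            filled_ = 0;
        }
    }
    // Pads any final partial byte with zero bits (an assumption).
    void finish() {
        while (filled_ != 0) write(0);
    }
    const std::vector<std::uint8_t>& bytes() const { return bytes_; }
private:
    std::vector<std::uint8_t> bytes_;
    std::uint8_t current_ = 0;
    int filled_ = 0;
};

// Emit a code string such as "0010" one bit at a time.
void writeCode(const std::string& code, BitSink& out) {
    for (char ch : code) out.write(ch == '1' ? 1 : 0);
}

// Encode every input byte, then append the EOM code exactly once.
void encodeBody(std::istream& in, const CodeTable& table, BitSink& out) {
    char c;
    while (in.get(c)) {
        writeCode(table[static_cast<unsigned char>(c)], out);
    }
    writeCode(table[EOM], out);
    out.finish();
}
```

With the codes a = "1", b = "00", EOM = "01", the input "aba" produces the bits 1 00 1 01, which this sketch pads to the single byte 10010100.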
All such files can be downloaded here.
For this assignment you must write your own top-level program; all of your code should go in a single file, encode.cpp, with the main routine starting out the process. Yet to aid your program, we are providing several existing classes for convenience.
PriorityQueue.h, PriorityQueue.tcc
A definition and complete implementation of a templated
PriorityQueue class akin to that of Ch. 7 of our textbook.
Instantiating this class requires three different template
parameters, as PriorityQueue<Key,Element,Comp>
where you will have to define an appropriate comparator class
for this application.
BinaryTree.h, BinaryTree.tcc
A definition and complete implementation of a BinaryTree class
akin to that of our textbook.
Since you will need the ability to modify your underlying tree, we
have implemented the following update methods, some of which are
discussed on page 294 of the text (though some are not):
replaceElement(const Position& p, const Object& element)
Replaces the position's current element with the provided element.
expandExternal(const Position& p)
Takes an external position p and converts
it to an internal node by creating two new (external) children.
A BoundaryViolationException will be thrown if p
is internal.
removeAboveExternal(const Position& w)
Takes an external position w of
T, and deletes w and the parent of w from
the tree, promoting the sibling of w into the parent's
place (see Figure 6.13 on page 276).
replaceExternalWithSubtree(const Position& p, BinaryTree& T2)
This method replaces the external position
p with a new subtree which is based upon the entire
contents of parameter T2. Please note that tree
T2 is itself destroyed by this action.
A BoundaryViolationException will be thrown if p
is internal.
Bitstream.h, BitStreams.cpp
Defines two convenient classes, one for reading from and one for
writing to binary files.
InBitStream
Supports the following methods:
InBitStream()
New input stream object; though not usable yet until a file is opened.
bool open(filename)
Opens (or reopens) a file to be accessed via the
stream. Returns true if successful.
close()
Closes the underlying file.
int read()
Reads the next bit of the file. Returns either 0 or 1.
int read(int n)
Reads the next n bits of the file.
Returns those n bits as an integer value equivalent
to the associated n-bit binary number.
bool isOpen()
Returns true if stream currently associated with an
open file.
bool eof()
Has end of underlying file been reached?
OutBitStream
Supports the following methods:
OutBitStream()
New output stream object; though not usable yet until a file is opened.
bool open(filename)
Opens (or reopens) a file to be accessed via the
stream. Returns true if successful.
close()
Closes the underlying file.
void write(int value)
Writes single bit to the file (assumes value is 0 or 1).
void write(int value, int n)
Writes n bits to the file. Those bits are described
by giving an integer value equivalent to the
associated n-bit binary number.
bool isOpen()
Returns true if stream currently associated with an
open file.
VariousExceptions.h
There are a variety of exception classes involved in this
assignment. Rather than define each in its own file, we've
bundled them all together for convenience.
makefile
This makefile should allow you to rebuild your project by
simply typing 'make' rather than invoking the compiler
directly.
As was the case with the preceding assignment, debugging your code is a bit challenging, especially given that your eventual output file is an arbitrary binary file, and thus not directly viewable in an editor. Thus it is a bit challenging to even verify whether your output was legitimate, much less determine what went wrong if not correct. To this end, we provide the following assistance.
Keep in mind that because there were some ties during the Huffman algorithm, the encoding tree you get may not be precisely the same as the encoding tree that I get. Therefore, your compressed version of a file may not be identical to my compressed version of the same file.
But if you've done the job properly the size of the file should be the same. More importantly, the decoder program should always be able to properly uncompress the data to retrieve the original version. If you are confident in your own decoder, you might try running it on your encoded output to make sure that the two parts work well together. If you are not confident in your own decoder, I am providing everyone with access to my own decoder (not the source code, but an executable). You will find that at: turing::~goldwasser/cs180/tools/myDecode
If something goes wrong, it might be nice to see precisely what you wrote into the output file, but as we mentioned, a standard editor will not be that helpful. To this end, we have written a special program, Viewer, which you can use to turn a true binary file into an ASCII version of the file using the characters "0" and "1". Obviously, this destroys the compression, but it gives you a way to view your output with a standard editor. You will find that at: turing::~goldwasser/cs180/tools/Viewer
encode.cpp
For simplicity, please put all of your code in a single file.
(You may declare whatever variables,
classes, and functions you wish within this single file.)
Readme File
A brief summary of your program, and any further comments you
wish to make to the grader. If you do the extra credit, please
make this clear.
The assignment is worth 10 points.
Though we characterized this entire assignment under the assumption that the original file to be compressed was composed of ASCII characters, that was completely unnecessary. Your encoder can be used, without modification, on any file type that you wish. It will be read eight bits at a time, with frequencies of those eight-bit patterns used in the algorithm, yet it does not matter whether those bits truly represented text in the original file. Of course the amount of compression achieved will still depend upon taking advantage of irregularities in the frequency counts.
If interested, try compressing various other file types (e.g. various images, MS word documents, even an executable itself, or even a previously compressed "myzip" file). See how much compression you get. The number of bytes of a file can be found by typing ls -l (the long format output for the ls command).
For this project, we assumed that we processed the original file in eight-bit chunks, and thus had a character set which ranged over values 0 to 255 (not including the EOM message we used). Yet the Huffman encoding algorithm did not depend upon the assumption that we originally considered 8-bit patterns; it can be used for any original fixed-length pattern size.
For extra credit, implement another version of your program, encode16, which does the encoding process using 16 bits at a time, thus with characters indexed from 0 to 65535 (again not including the EOM). Presumably you will need to rewrite your decoder to get a 16-bit version of the decoder as well. Once you have the 16-bit encoding implemented, run some experiments on various files to determine how this version compares to the original in terms of the amount of compression achieved. Include a discussion of this in your readme file.
Note: I have not actually tried this yet! I expect the implementation to be quite similar (even better if you define BITS_PER_PATTERN as a constant which you use when writing your code). The only complication for you to deal with is in deciding what to do in the case that the original file had an odd number of bytes overall. In this case, there will be one incomplete 16-bit pattern at the end which you still must include in some way. This likely means you may need to alter the EOM conventions in an appropriate way.