Course Home | Homework | Programming | Schedule & Lecture Notes | Submit

Saint Louis University

Computer Science 180
Data Structures

Michael Goldwasser

Spring 2006

Dept. of Math & Computer Science

Programming Assignment 09

Encode

Due: Friday, 5 May 2006, 8pm


Please see the general programming webpage for details about the programming environment for this course, guidelines for programming style, and details on electronic submission of assignments.

Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.

Please make sure you adhere to the policies on academic integrity in this regard.


The files you may need for this assignment can be downloaded here.



Overview

This project is the dual of the decode project. In that earlier project, you wrote code to take a compressed file and turn it back into a traditional ASCII file. In this assignment, we want you to take an original ASCII file and produce a compressed version of it using the same conventions. Your encode and decode programs should therefore form a pair that can work together.

One of the bigger challenges of this assignment will be in developing an appropriate coding scheme which optimizes the compression for a particular input file. To this end, you will use the Huffman coding algorithm as discussed in Chapter 11.4 of the text as well as in lecture. Carefully review both of those references for an overview of the process.
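As a rough illustration of the Huffman construction mentioned above, here is a minimal sketch in C++. It is not the code supplied with the assignment: the names Node, buildHuffmanTree, assignCodes and huffmanCodes are hypothetical, and the frequency table is assumed to be a plain vector indexed by pattern value (an extra slot could be reserved for the EOM marker). The tree header format and any tie-breaking conventions from the decode project are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node layout: children are stored as indices into one vector.
struct Node {
    std::uint64_t weight;  // total frequency of the leaves below this node
    int symbol;            // pattern value for a leaf, -1 for an internal node
    int left;              // index of the left child, -1 for a leaf
    int right;             // index of the right child, -1 for a leaf
};

// Builds a Huffman tree over all symbols with nonzero frequency.
// The root, if any, is the last element of the returned vector.
std::vector<Node> buildHuffmanTree(const std::vector<std::uint64_t>& freq) {
    std::vector<Node> nodes;
    using Entry = std::pair<std::uint64_t, int>;  // (weight, node index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    for (std::size_t s = 0; s < freq.size(); ++s) {
        if (freq[s] > 0) {
            nodes.push_back({freq[s], static_cast<int>(s), -1, -1});
            pq.push({freq[s], static_cast<int>(nodes.size()) - 1});
        }
    }
    // Repeatedly merge the two lightest subtrees until one tree remains.
    while (pq.size() > 1) {
        Entry a = pq.top(); pq.pop();
        Entry b = pq.top(); pq.pop();
        nodes.push_back({a.first + b.first, -1, a.second, b.second});
        pq.push({a.first + b.first, static_cast<int>(nodes.size()) - 1});
    }
    return nodes;
}

// Walks the tree, appending '0' for a left branch and '1' for a right branch.
void assignCodes(const std::vector<Node>& nodes, int index,
                 const std::string& prefix, std::map<int, std::string>& codes) {
    assert(index >= 0 && static_cast<std::size_t>(index) < nodes.size());
    const Node& n = nodes[index];
    if (n.left < 0) {
        // A tree with a single leaf still needs a one-bit code.
        codes[n.symbol] = prefix.empty() ? "0" : prefix;
        return;
    }
    assignCodes(nodes, n.left, prefix + "0", codes);
    assignCodes(nodes, n.right, prefix + "1", codes);
}

// Convenience wrapper: frequency table in, symbol-to-bitstring table out.
std::map<int, std::string> huffmanCodes(const std::vector<std::uint64_t>& freq) {
    std::map<int, std::string> codes;
    std::vector<Node> nodes = buildHuffmanTree(freq);
    if (!nodes.empty()) {
        assignCodes(nodes, static_cast<int>(nodes.size()) - 1, "", codes);
    }
    return codes;
}
```

With frequencies 5, 2, 1, 1 for four symbols, this sketch assigns code lengths of 1, 2, 3 and 3 bits respectively, so the most frequent symbol gets the shortest code. Codes are kept as strings of '0' and '1' only for readability; a real encoder would pack them into bytes.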


Your Task

Your program must see the encoding process through from beginning to end. You should prompt the user for the filename of the original input, build up the Huffman coding tree for that input, then prompt the user for an output file in which you will write the compressed form of the file (including the necessary header information for representing the tree itself). Here is a suggested outline of the steps.


Supporting Files

All such files can be downloaded here.

For this assignment you must write your own top-level program; all of your code should go in a single file, encode.cpp, with the main routine starting out the process. To aid your program, we are providing several existing classes for convenience.


Debugging Tools

As with the preceding assignment, debugging your code is somewhat challenging, especially because your eventual output is an arbitrary binary file and thus not directly viewable in an editor. It can be hard even to verify that your output is legitimate, much less to determine what went wrong when it is not. To this end, we provide the following assistance.


Files to Submit


Grading Standards

The assignment is worth 10 points.


Playing With Non-Text Formats

Though we described this entire assignment under the assumption that the original file to be compressed was composed of ASCII characters, that assumption was completely unnecessary. Your encoder can be used, without modification, on any file type you wish. The file will be read eight bits at a time, with the frequencies of those eight-bit patterns used in the algorithm; it does not matter whether those bits truly represented text in the original file. Of course, the amount of compression achieved will still depend upon taking advantage of irregularities in the frequency counts.
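To make the "eight bits at a time" reading concrete, here is a minimal sketch of a frequency count that treats the input as raw bytes rather than text. The function names countPatterns and countPatternsInFile are hypothetical and are not part of the supplied classes; the sketch also assumes the file is opened in binary mode so that no byte values are translated or dropped.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Counts how often each eight-bit pattern (0..255) occurs in the stream.
// Taking a std::istream lets the same code run on a file or on a
// std::istringstream when experimenting.
std::array<std::uint64_t, 256> countPatterns(std::istream& in) {
    std::array<std::uint64_t, 256> freq{};  // value-initialized to all zeros
    char c;
    while (in.get(c)) {
        // Convert through unsigned char so bytes above 127 index correctly.
        ++freq[static_cast<unsigned char>(c)];
    }
    return freq;
}

// Opens the named file in binary mode and counts its byte patterns.
// If the file cannot be opened, every count is simply zero.
std::array<std::uint64_t, 256> countPatternsInFile(const std::string& filename) {
    std::ifstream in(filename, std::ios::binary);
    return countPatterns(in);
}
```

Because the counts are indexed by byte value and nothing in the loop interprets the data as characters, the same routine applies equally to an image, an executable, or a previously compressed file.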

If interested, try compressing various other file types (e.g. various images, MS Word documents, an executable, or even a previously compressed "myzip" file). See how much compression you get. The number of bytes in a file can be found by typing ls -l (the long format output for the ls command).


Extra Credit (1 point)

For this project, we assumed that we processed the original file in eight-bit chunks, and thus had a character set which ranged over values 0 to 255 (not including the EOM message we used). Yet the Huffman encoding algorithm did not depend upon the assumption that we originally considered 8-bit patterns; it can be used for any original fixed-length pattern size.

For extra credit, implement another version of your program, encode16, which does the encoding process using 16 bits at a time, and thus uses characters indexed from 0 to 65535 (again not including the EOM). Presumably you will need to rewrite your decoder to get a 16-bit version of the decoder as well. Once you have the 16-bit encoding implemented, run some experiments on various files to determine how this version compares to the original in terms of the amount of compression achieved. Include a discussion of this in your readme file.

Note: I have not actually tried this yet! I expect the implementation to be quite similar (even better if you make BITS_PER_PATTERN a predefined constant which you use when writing your code). The only complication for you to deal with is deciding what to do when the original file has an odd number of bytes overall. In this case, there will be one incomplete 16-bit pattern at the end which you still must include in some way. This likely means you will need to alter the EOM conventions in an appropriate way.
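As one possible starting point for the odd-length complication, here is a minimal sketch that groups bytes into 16-bit patterns and sets aside a leftover byte when the input length is odd. Everything here is an assumption made for illustration: the Patterns16 struct, the splitInto16BitPatterns function, and the choice to place the first byte of each pair in the high-order bits. How the leftover byte is then signalled in the compressed stream (for example, through a modified EOM convention) is deliberately left open.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const int BITS_PER_PATTERN = 16;  // predefined constant, as suggested in the note above

// Hypothetical result type: the complete patterns plus an optional odd byte.
struct Patterns16 {
    std::vector<std::uint16_t> patterns;  // complete 16-bit patterns, first byte in the high bits
    bool hasTrailingByte;                 // true if the input had an odd number of bytes
    std::uint8_t trailingByte;            // the leftover byte (meaningful only if hasTrailingByte)
};

// Groups the input bytes into 16-bit patterns, two bytes per pattern.
Patterns16 splitInto16BitPatterns(const std::vector<std::uint8_t>& bytes) {
    Patterns16 result{{}, false, 0};
    std::size_t i = 0;
    for (; i + 1 < bytes.size(); i += 2) {
        result.patterns.push_back(
            static_cast<std::uint16_t>((bytes[i] << 8) | bytes[i + 1]));
    }
    if (i < bytes.size()) {
        // One byte remains: keep it separate instead of padding it silently,
        // so the decoder side can be told not to emit an extra byte.
        result.hasTrailingByte = true;
        result.trailingByte = bytes[i];
    }
    return result;
}
```

Keeping the leftover byte separate, rather than padding it with zeros, avoids a decoder that writes one byte too many; whichever approach is chosen, the encoder and decoder must agree on it.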


Michael Goldwasser
CSCI 180, Spring 2006
Last modified: Wednesday, 26 April 2006