
Saint Louis University

Computer Science 2100
Data Structures

Michael Goldwasser

Fall 2015

Dept. of Math & Computer Science

Programming Assignment 07

Encode

Due: Monday, 7 December 2015, 11:59pm


Please see the general programming webpage for details about the programming environment for this course, guidelines for programming style, and details on electronic submission of assignments.

Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.

Please make sure you adhere to the policies on academic integrity in this regard.


The files you may need for this assignment can be downloaded here.



Overview

This project is the dual of the decode project. In that earlier project, you wrote code to take a compressed file and turn it back into a traditional ASCII file. In this assignment, we want you to take an original ASCII file and produce a compressed version of it using the same conventions. Therefore your encode and decode programs ideally form a pair that can work together.

One of the bigger challenges of this assignment will be in developing an appropriate coding scheme that optimizes the compression for a particular input file. To this end, you will use the Huffman coding algorithm as discussed in Section 12.4 of the text and as demonstrated by our huffman software. Carefully review both of those references for an overview of the process.


Your Task

Your program must see the encoding process through from beginning to end. You should prompt the user for the filename of the original input, build up the Huffman coding tree for that input, then prompt the user for an output file in which you will write the compressed form of the file (including the necessary header information for representing the tree itself). Here is a suggested outline of the steps.


Supporting Files

The supporting files are identical to those that were provided for the decode assignment, other than a new makefile that will allow make encode to compile the new project.

All such files can be downloaded here, and descriptions can be found in the decode project description.


Priority Queues

You will notice that the Huffman algorithm requires maintaining a priority queue of trees, with each tree having an integer frequency as its priority, and the goal of finding the minimum-priority trees at each step. From a technical perspective, you want to avoid copying trees as elements of the priority queue, since that is expensive, so you should really think of having a priority queue of LinkedBinaryTree* values (i.e., pointers).

As for the priority queue implementation, we are not providing one. You have several options:


Debugging Tools

As was the case with the preceding assignment, debugging your code is a bit challenging, especially given that your eventual output file is an arbitrary binary file, and thus not directly viewable in an editor. It is therefore hard even to verify whether your output is legitimate, much less to determine what went wrong if it is not. To this end, we provide the following assistance.


Files to Submit


Grading Standards

The assignment is worth 20 points.


Playing With Non-Text Formats

Though we characterized this entire assignment under the assumption that the original file to be compressed was composed of ASCII characters, that assumption was completely unnecessary. Your encoder can be used, without modification, on any file type that you wish. The file will be read eight bits at a time, with the frequencies of those eight-bit patterns used in the algorithm, and it does not matter whether those bits truly represented text in the original file. Of course, the amount of compression achieved will still depend upon taking advantage of irregularities in the frequency counts.

If interested, try compressing various other file types (e.g., various images, MS Word documents, even an executable itself, or even a previously compressed "myzip" file). See how much compression you get. The number of bytes in a file can be found by typing ls -l (the long-format output of the ls command).


Extra Credit (2 points)

For this project, we assumed that we processed the original file in eight-bit chunks, and thus had a character set which ranged over values 0 to 255 (not including the EOM message we used). Yet the Huffman encoding algorithm did not depend upon the assumption that we originally considered 8-bit patterns; it can be used for any original fixed-length pattern size.

For extra credit, implement another version of your program, encode16, which does the encoding process using 16 bits at a time, with characters indexed from 0 to 65535 (again not including the EOM). Presumably you will need to rewrite your decoder to get a 16-bit version of the decoder as well. Once you have the 16-bit encoding implemented, run some experiments on various files to determine how this version compares to the original in terms of the amount of compression achieved. Include a discussion of this in your readme file.

Note: I have not actually tried this yet! I expect the implementation to be quite similar (even better if you make BITS_PER_PATTERN a predefined constant which you use when writing your code). The only complication for you to deal with is deciding what to do in the case that the original file has an odd number of bytes overall. In this case, there will be one incomplete 16-bit pattern at the end which you still must include in some way. This likely means you will need to alter the EOM conventions in an appropriate way.


Michael Goldwasser
CSCI 2100, Fall 2015
Last modified: Monday, 07 December 2015