COS 226, Spring 1999. Program 6

COS 226 Programming Assignment 6 - Checklist

Checklist 6: Language Modeling.

The following is a checklist which provides a summary of the assignment. This is meant only as a supplement; please read the original assignment description.

Frequently Asked Questions

Requirements

Possible Progress Steps

Sample Input

Performance

Frequently Asked Questions:
(This is the only part of this checklist which will change during the week.)

[Michael 4/3] At one point, the original assignment states: "Your driver program should not store the whole text string. Indeed, you only need to keep track of the most recent k characters in the string."
However, in one of my precepts I said that we will not be strictly checking this requirement, and you will notice that it has never appeared as a requirement anywhere in the checklist. It is unclear in context whether this comment was talking about the warmup method or about your final submitted program (it appeared in the paragraph which started, "As a warmup...").
At the same time, I said that the reason for the comment is that most people's programs really have no need to explicitly store the original string. For example, in the direct symbol table program, the flow of control is to walk through the original string and call Insert for each string you pass. In this case, you never need the original string again, so rather than take the space to store the whole thing, you can just do the insertions while you read in the input.
I think this is true of most of the more interesting methods too; however, if you feel there is good reason to store the input string in your program, you are certainly free to do so. (As we said, just don't throw away space, and explain your method in the readme.)
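A minimal sketch of the "most recent k characters" idea from the assignment quote, assuming a fixed-size buffer that is shifted as each character arrives. The names window_push, window and len are hypothetical and not part of the assignment; a circular buffer would avoid the shift at the cost of a longer sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: keep only the k most recent characters.
 * `window` has room for k characters and currently holds `*len` of them.
 * The new character `c` is appended; once the window is full, the oldest
 * character is dropped to make room. */
void window_push(char *window, size_t *len, size_t k, char c)
{
    assert(k > 0 && *len <= k);
    if (*len == k) {
        memmove(window, window + 1, k - 1);  /* drop the oldest character */
        window[k - 1] = c;
    } else {
        window[(*len)++] = c;
    }
}
```

A driver could call window_push for each character returned by getchar() and, once len reaches k, hand the window to whatever insertion routine the program uses.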

[Michael 3/30] Some performance bounds for my own implementation have been added to the bottom of this file.

[Michael 3/30] The original 'baby.txt' had some extended ASCII values (accented versions of vowels with values above 127), but these have now been replaced. At this point, all characters in the input files are standard ASCII and thus have values between 0 and 127 inclusive.

[Michael 3/30] For this assignment, it is important to be able to generate different random trials, by changing the seed used for the random number generator.
For my program, I use random(), and thus have the following two lines at the top:

#include <sys/types.h>
#include <sys/timeb.h>

And the following two lines within my main routine:

long seed = time(NULL);
srandom(seed);
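The same seeding idea as a self-contained sketch. It assumes a POSIX system and uses <stdlib.h> and <time.h>, where random(), srandom() and time() are declared; the helper names seed_from_clock and random_below are hypothetical.

```c
#define _XOPEN_SOURCE 600   /* request the POSIX declarations of random()/srandom() */
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: seed random() from the wall clock and return the
 * seed, so a run can be reproduced later by passing it to srandom(). */
long seed_from_clock(void)
{
    long seed = (long)time(NULL);
    srandom((unsigned int)seed);
    return seed;
}

/* Hypothetical helper: an integer in [0, n). The modulo bias is small
 * when n is tiny compared with the range of random(). */
long random_below(long n)
{
    assert(n > 0);
    return random() % n;
}
```

Printing the returned seed makes it possible to regenerate a particularly amusing output later.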

Requirements:

Functionality:

What is most important is that your program correctly generates each output character based on the correct probability, as described in the original assignment.
Submit only your final, most efficient program. (All of the other progress steps in the assignment are to be considered only as suggestions.)

In particular, please note the original comments on the use of space in this assignment.
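The requirement that each output character be generated with the correct probability can be sketched as a weighted choice over successor counts. This is only an illustration under stated assumptions: a 128-entry count array for the current k-character context, and a caller-supplied random value r. The name pick_char and the array layout are hypothetical and not the assignment's prescribed design.

```c
#include <assert.h>

#define ALPHABET 128  /* assumes 7-bit ASCII input */

/* Hypothetical helper: choose a character with probability proportional
 * to its count. counts[c] is how often character c followed the current
 * context; r must lie in [0, total), where total is the sum of counts.
 * Returns the chosen character, or -1 if r is not below the total. */
int pick_char(const long counts[ALPHABET], long r)
{
    assert(r >= 0);
    for (int c = 0; c < ALPHABET; c++) {
        if (r < counts[c])
            return c;
        r -= counts[c];
    }
    return -1;
}
```

A caller might draw r as random() % total. A total of zero means the context has no recorded successor, which is the dead-end case.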

Input Format: We will test your program by typing a.out k < inputfile on a Unix system, where the command line argument k is the order of the language model. The input file may contain any character. (Note: this is in contrast to the input files we used for the previous assignment, which contained only letters and spaces.)
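The command line a.out k < inputfile implies reading k from argv[1] and the text from standard input. A small sketch of the argument parsing, assuming strtol and a hypothetical helper parse_order that returns -1 for anything other than a non-negative integer. Accepting k = 0 is an assumption here, since the checklist does not say.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse the model order k from a command line
 * argument. Returns -1 if the argument is empty, has trailing junk,
 * overflows, or is negative. */
long parse_order(const char *arg)
{
    char *end;
    long k;

    assert(arg != NULL);
    errno = 0;
    k = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0' || k < 0)
        return -1;
    return k;
}
```

In main, one would check argc first and then call parse_order(argv[1]), printing a usage message when it returns -1.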

Output Format: The output for your program should be text which is generated by your program.

The original assignment says that your output should contain the same number of characters as the input. However, since we are using extremely large input files, please feel free to stop the output at any reasonable length (say 5000 characters?).

There is a possibility that your random process will at some point generate a match for the final k characters of the input file. If that string does not appear anywhere else in the input, then this will force a dead end to the random generation. If this causes your output to be too short, make sure you re-run and get a better sample output to turn in.

readme: Your readme file for this program should contain the following:

A high-level description for the design of your algorithm, and what influenced your decisions.

A (brief) discussion of your performance characteristics, especially space usage.

Submit one or two of the most amusing examples that your program generated.

Possible Progress Steps: These are purely suggestions for how you might make progress. You do not have to follow these steps.

Do an array-based implementation where each (k+1)-letter combination has an entry which counts the number of occurrences in the text, and use this to generate your language model.

Try to significantly reduce the space usage by either modifying your original method or designing a completely new data structure.

Make sure to experiment as the value of k grows.
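The array-based first step above can be sketched as follows, assuming 7-bit ASCII input and a table indexed by reading each (k+1)-letter combination as a base-128 number. The names combo_index and count_combos are hypothetical, and the sketch takes the text as an array only for brevity; the same counting could be fed from a window of recent characters.

```c
#include <assert.h>
#include <stddef.h>

#define ALPHABET 128  /* assumes 7-bit ASCII input */

/* Hypothetical helper: index of the (k+1)-letter combination starting at
 * s, read as a base-128 number. Only practical for very small k. */
size_t combo_index(const char *s, size_t k)
{
    size_t idx = 0;
    for (size_t i = 0; i <= k; i++) {
        unsigned char c = (unsigned char)s[i];
        assert(c < ALPHABET);
        idx = idx * ALPHABET + c;
    }
    return idx;
}

/* Count every (k+1)-letter combination in text[0..n).
 * `table` must hold 128^(k+1) counters, all initially zero. */
void count_combos(const char *text, size_t n, size_t k, unsigned long *table)
{
    for (size_t i = 0; i + k < n; i++)
        table[combo_index(text + i, k)]++;
}
```

The table needs 128^(k+1) counters: 16,384 for k = 1, but already 268,435,456 for k = 3. A text of N characters contains at most N - k distinct combinations, so almost all of those entries stay zero, which is the space that a different data structure can save.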

Sample Input: Here are some sample input files which you may use, although you are free to find some other interesting and fun examples. (Note: unlike assignment 5, these files contain original spacing, punctuation and capitalization.)

Inputfile	Source	N
princeton.txt	A Packet article about Princeton	7959
aesopshort.txt	collection of Aesop's fables	10280
moby1.txt	Moby Dick - Chapter 1	12218
amendments.txt	Constitutional Amendments	18369
y2kintro.txt	Introduction of the recent Senate report on Y2K	21224
baby.txt	How babies learn language	22200
manifesto.txt	Communist Party Manifesto	72955
muchado.txt	Much Ado about Nothing	123413
aesop.txt	collection of Aesop's fables	191945
starr.txt	The Starr Report narrative	234378	(warning: explicit language)
lilwomen.txt	Little Women	1042048
mobydick.txt	Moby Dick	1191463

[Note: all of these files are also located in ~/cs226/prog6_files/ on the phoenix machines, so there is no need to download them if you are running on those machines]

Performance: As always, take these numbers with a grain of salt. It is possible that students will have implementations much better than my own, and it is possible that we will give full credit to programs that are less efficient than mine. The following table lists the user CPU time in seconds (the first field if you use the 'time' command) when my program was run on flagstaff (average load was between 5 and 6), generating 3000 characters of output.

Inputfile	k=3	k=7	k=12	k=20
princeton.txt	0.02	0.10	0.17	0.27
aesopshort.txt	0.05	0.15	0.22	0.34
moby1.txt	0.04	0.16	0.26	0.42
amendments.txt	0.05	0.18	0.29	0.50
y2kintro.txt	0.08	0.24	0.42	0.70
baby.txt	0.08	0.25	0.39	0.70
manifesto.txt	0.17	0.67	1.20	2.14
muchado.txt	0.28	1.18	2.20	3.70
aesop.txt	0.38	1.70	3.40	5.89
starr.txt	0.41	1.85	3.60	6.60
lilwomen.txt	1.86	9.45	18.75	31.03
mobydick.txt	2.38	11.69	22.04	37.34

wass@cs.princeton.edu

Last modified: April 4, 1999