Computer Science P125
Introduction to Computer Science

Programming Assignment 06

Shakespearl

Due: Wednesday, 20 April 2005, 8pm

Please see the general programming webpage for details about the programming environment for this course, guidelines for programming style, and details on electronic submission of assignments.

Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.

Please make sure you adhere to the policies on academic integrity in this regard.

Overview

A classic claim is that having a million monkeys typing on a million keyboards would eventually lead to the creation of each of Shakespeare's works. In essence, there is a minuscule chance that a selection of random characters might match such a work, and so with enough random experiments it should eventually happen. Not surprisingly, small-scale tests of this experiment have thus far failed.

(image from http://www.vivaria.net/experiments/notes/documentation/)

Our goal in this assignment will be to help those monkeys out just a bit. In fact, perhaps they can create works even better than those of Shakespeare himself. The idea is the following. Rather than modeling the process of generating text by choosing each keystroke uniformly at random, we will use an existing work of Shakespeare to seed a random process for language generation.

A Model for Language Generation

The input to the generator will be a source document together with a parameter k which we will call the order of the model. Our output will always begin with the first k characters of the source document. After that point, however, each additional character is chosen based upon the most recent k-character string in the output. Let us denote this k-character string as the current base. We will find all occurrences of that base in the source document, and select one of those occurrences uniformly at random. The next character of the output will then be chosen to match the character which followed the chosen occurrence in the source.

For example, assume that k=2 and that the current base is th. If the source document had the following form,

*******th***thth******th****th************th****th***********th**

we would hope to pick one of the eight occurrences of th uniformly at random, such as,

*******th***thth******th****th************[th]****th***********th**

and then generate the next character of output based on which character followed the chosen occurrence in the source document, e.g.,

*******th***thth******th****th************[th]e***th***********th**

Next time we pick a new character of the output, we would continue with he as the base.

Note well: If you implement this scheme correctly, you can always be assured that the base occurs at least once in the source document. However, one special case which may occur is when the randomly chosen occurrence of the base happens to be the final k characters of the source. In this case, there is no follow-up character to use; your generator can return a special value which will signify the end of the output.

Another Note: Consider a scenario with k=3, a current base of aaa, and a source string "aaaaaaab". You should consider this string to have five different occurrences of the base, with four of those five followed by an additional a and the fifth occurrence followed by b, and thus we'd expect that the next character should be chosen as an a with probability of 4/5.
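To make the counting rule concrete, here is a minimal sketch that tallies which characters follow each occurrence of a base, overlapping occurrences included. The helper name tallyFollowers is hypothetical and is not part of the provided generator.h; it is meant only as a way to check a small example by machine, not as the required implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper (not part of generator.h): for every position at which
// `base` occurs in `source`, overlapping occurrences included, record the
// character that follows it. An occurrence that ends the source has no
// follower and is recorded under the character value 0.
std::map<char, int> tallyFollowers(const std::string& source,
                                   const std::string& base) {
  assert(!base.empty());
  std::map<char, int> counts;
  const std::size_t k = base.size();
  // Slide a k-character window over the source one position at a time.
  for (std::size_t i = 0; i + k <= source.size(); ++i) {
    if (source.compare(i, k, base) == 0) {
      char follower = (i + k < source.size()) ? source[i + k] : '\0';
      ++counts[follower];
    }
  }
  return counts;
}
```

For the source "aaaaaaab" and base "aaa", the tally holds four entries for a and one for b, which matches the 4/5 probability described above.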

Object-Oriented Design

Rather than implement the entire program, you will be responsible for one key piece of the overall code. Namely, you will need to implement a new class, Generator, which manages the process for generating each new character of text. In particular, we provide the following class definition in file generator.h:


class Generator {
private:
  string source;	// the original string
  unsigned int k;	// the order value for the generation model
  string base;		// the current base string

public:

  /*
   *  The constructor takes two pieces of information, namely a
   *  reference to the original string and the order 'k' for the
   *  generation model.
   */
  Generator(const string& src, unsigned int order);

  /*
   *  This routine is responsible for returning one additional
   *  character based upon the model.
   */
  char nextChar();


  /*
   * Extra Credit: 
   *   Same end result as nextChar() but with a more efficient
   *   technique, as described in the assignment.
   */
  char nextCharExtra();

};

Generator(const string& src, unsigned int order)
A constructor for your generator which can perform any initialization which you deem necessary. The first parameter is a reference to the original source document; the second is the value of k as discussed above.
char nextChar()
This method is responsible for generating another character of the output. For the first k calls, the method should return each of the first k characters of the original source, in order. From that point on, each new character should be based upon the k most recently generated characters, as per the assignment description. In the special case discussed earlier (the chosen occurrence of the base is the final k characters of the source), you should return the value 0.

Your class should not have any reason to store the entire output, though you will need to maintain a string which represents the most recent k-character base.
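One possible way to keep the base current is to treat it as a sliding window over the output. The sketch below assumes a free function named slideBase, which is hypothetical; in your class the same idea would presumably be applied to the base data member.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: drop the oldest character of the base and append the
// newly generated one, so the result holds the k most recent characters.
std::string slideBase(const std::string& base, char next) {
  assert(!base.empty());
  return base.substr(1) + next;
}
```

With a base of th and a newly generated e, the result is he, which is the base used for the following character in the earlier example.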

The biggest challenge will be the implementation of the nextChar() method. You should accomplish this based upon the following algorithm.

During a first pass through the source, count the number of occurrences of the base. Let's say that there are p such occurrences.

Now use a random number generator to choose a number between 0 and p-1, then make a second pass through the source to locate the desired occurrence of the base.

Based on the chosen occurrence of the base, determine the next character as the character which immediately follows that occurrence. That character should be returned to the caller (but before doing so, you should think about how to update your representation of the base so that your generator will be ready the next time it needs to generate a character).
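The three steps above can be sketched as a standalone function. This is only an illustration under stated assumptions: the name twoPassNext is hypothetical, the function takes the base as a parameter rather than reading a data member, and it deliberately leaves the update of the base to the caller.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical free-function sketch of the two-pass scheme. Returns the
// character following a uniformly chosen occurrence of `base`, or 0 when the
// chosen occurrence is the final characters of `source`.
char twoPassNext(const std::string& source, const std::string& base) {
  assert(!base.empty());

  // First pass: count the occurrences of the base (overlaps included).
  unsigned int p = 0;
  for (std::string::size_type pos = source.find(base);
       pos != std::string::npos; pos = source.find(base, pos + 1)) {
    ++p;
  }
  if (p == 0) return 0;  // should not happen if the base came from the source

  // Choose a number between 0 and p-1.
  unsigned int target = static_cast<unsigned int>(std::rand()) % p;

  // Second pass: walk forward to the chosen occurrence.
  std::string::size_type pos = source.find(base);
  for (unsigned int i = 0; i < target; ++i) {
    pos = source.find(base, pos + 1);
  }

  // The follower is the character just past the occurrence, if there is one.
  std::string::size_type after = pos + base.size();
  return (after < source.size()) ? source[after] : 0;
}
```

Searching again from pos + 1, rather than from the end of the previous match, is what allows overlapping occurrences to be counted.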

Files You Will Need

To get going, please copy the project to your own directory on turing with the command:

cp -Rp ~goldwasser/csp125/programs/Shakespearl .

There are several files:

generator.h
This header file defines the Generator class; however, it does not contain implementations for the methods.
generator.cpp
This is the file which you will need to edit. It contains relatively vacuous stubs for the necessary methods. Your task will be to fully implement these methods.
shakespearl.cpp
This file defines the front-end driver we are using. There is really little reason for you to read this file.
Makefile
You should simply type make at the command prompt and then your program will be compiled. Either syntax errors will be displayed, or else the compilation will be successful and an executable named Shakespearl will be created which you can run.
input
This subdirectory contains a set of possible input files.

Additional Programming Details

To successfully complete this assignment, you will need to make use of two existing tools from the C++ libraries.

The string class
The string class is discussed in great detail in the text, as well as in our own lecture notes. The original source document will be represented as one long string object. You will also want to use a string to represent the current base.

The at(int index) method can be used to get a single character of a string, at the given index. The substr(int index, int length) method can be used to get a string which represents a substring of the original, starting at the given index and continuing up to length characters.

The find(string pattern, int start) method will be very useful for finding occurrences of the base in the source document. There is one piece of information about this method which we did not discuss earlier in the semester. Recall that find returns an index denoting the starting point in the original string at which the next occurrence of the desired pattern begins.

Until now, we never told you what happens when you call find and no further occurrences exist. In this case, the returned index is a special constant denoted as string::npos (to denote that the pattern was found at no position). You will likely need to check whether the returned value is equal to this constant when writing a loop to count the overall number of occurrences.
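As a small illustration of find and string::npos working together, the following sketch collects the starting index of every occurrence of a pattern. The function name occurrencePositions is hypothetical, and storing all positions in a vector is just one convenient way to inspect the result; the assignment itself only needs a count.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: starting indices of every occurrence of `pattern` in
// `text`, overlapping occurrences included.
std::vector<std::size_t> occurrencePositions(const std::string& text,
                                             const std::string& pattern) {
  assert(!pattern.empty());
  std::vector<std::size_t> positions;
  std::string::size_type pos = text.find(pattern, 0);
  while (pos != std::string::npos) {  // npos means "no further occurrence"
    positions.push_back(pos);
    pos = text.find(pattern, pos + 1);  // resume one past the last match
  }
  return positions;
}
```

For the text "aaaaaaab" and pattern "aaa" this yields the indices 0 through 4, and for a pattern that never appears it yields an empty vector.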
Generating random numbers
Much like your own Generator class is used in this assignment to generate characters by this model, C++ provides an underlying function for generating random numbers. This function does not truly have the ability to pick things at random, rather it simulates this by using a mathematical function which seems somewhat random.

You may call the function rand() which returns a random integer. The integer is chosen to lie between 0 and a particularly large maximum number. Since you will want to generate a number which is between 0 and p-1 for some value p, you should use the expression (rand()%p), as the remainder of a number divided by p will surely be at least 0 and at most p-1.

For those interested in a more detailed discussion, please see Chapter 5.7 of our text.
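A minimal sketch of the rand() % p idiom follows. The wrapper name randomIndex is hypothetical. Note that the result is only approximately uniform when p does not evenly divide the number of values rand() can produce, which is usually negligible for small p.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: a pseudo-random index in the range 0..p-1.
unsigned int randomIndex(unsigned int p) {
  assert(p > 0);  // a remainder by zero is undefined
  return static_cast<unsigned int>(std::rand()) % p;
}
```

Seeding with srand is assumed to be handled by the provided driver, since the driver is what prompts for the seed, so a Generator implementation would presumably not need to call it.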

Testing Your Program

The text generated by this language model will be greatly affected by the parameter k. When k is small, the output will seem almost like gibberish; if k were quite large, the output would eventually be identical to the source document. But as k varies in between, we get some interesting texts which are original, though in the style of the source work.

This may seem like fun and games as you watch your program spitting out random looking text. So how are you to tell whether or not your program is working correctly, given the presence of some randomness in the model? This is not so easy, but indeed you are responsible for assuring the accuracy of your program in carrying out the prescribed model. (We will certainly be evaluating your work in this regard.)

We suggest that you start by using somewhat small, controlled input sources. This will allow you to test and debug your program while you are still able to hand simulate the scenario. We will offer the following three simple examples. What behavior would you expect for these source strings:

example1.txt
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890!@#$%^&*()
using any value of k
example2.txt
aaabaaacaaadaaabaaabaaac
using value of k=3
example3.txt
aaaaaaaaaaaaXaaaaaaaaaaaaXaaaaaaaaaaaaXaaaaaaaaaaaaXc
using value of k=4
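One way to reason about such controlled inputs is to ask whether any randomness is possible at all. If every k-character substring of the source is distinct, each base has exactly one occurrence, so a correct generator has no choice to make. The sketch below checks that property; the name everyBaseUnique is hypothetical and the function is offered only as a testing aid.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical test aid: true when every k-character substring of `source`
// is distinct, in which case the generated output is fully determined.
bool everyBaseUnique(const std::string& source, std::size_t k) {
  assert(k > 0);
  std::set<std::string> seen;
  for (std::size_t i = 0; i + k <= source.size(); ++i) {
    // insert() reports false in .second when the substring was seen before.
    if (!seen.insert(source.substr(i, k)).second) return false;
  }
  return true;
}
```

The check holds for the string in example1.txt with k of at least 1, and fails for the string in example2.txt with k=3, where the base aaa occurs several times with different followers.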

If you feel that you have things under control for these types of examples, then have fun and move on to some larger examples, such as those discussed in the next section.

Alternative Document Sources

You are welcome to create your own input file if you wish. For convenience and fun, we are providing several interesting and lengthier examples. You will find a subdirectory named input in your own project directory, containing the following files:

muchado.txt (123413 characters)
Much Ado About Nothing. (Let's see how those monkeys do with this).
billiken.txt (3938 characters)
An article on the history of the Billiken mascot.
amendments.txt (18369 characters)
The Amendments to the Constitution
(given the current political climate, it will be interesting to see which side of the debate is supported by our experiments).
manifesto.txt (72955 characters)
Communist Party Manifesto
(to consider perhaps alternative political structures)
aesopshort.txt (10280 characters)
A short collection of Aesop's fables.
aesop.txt (191945 characters)
A much longer collection of Aesop's fables.
lilwomen.txt (1042048 characters)
Little Women

Running the Shakespearl Program

When the main driver is executed, the user will be prompted for four pieces of information:

The full pathname and filename for the source document.
The order k to be used in the generation.
The number of characters of output which should be generated.
(Optionally) A numerical value to be used in "seeding" the random number generator. If this fourth parameter is not specified, the program will choose its own seed.

Files to Submit

generator.h, generator.cpp
These two files are the only source code which you will write. They contain the Generator class definition and implementation, respectively.
Readme File

Again, we also ask for you to estimate the amount of time you spent on the assignment, and to let us know of any difficulties you had or other issues you wish to discuss.

If you worked as a pair, please make sure that both names are given and that you discuss how you each contributed to the submitted work.

Grading Standards

The assignment is worth 10 points. If you worked as a pair, you will each be given the same grade.

The primary criterion will be whether your program generates text in accordance with the precise algorithm described in this assignment. Please make sure you read the section on testing your program.

A secondary criterion will be the style, readability and documentation of your source code.

Extra Credit (1 point)

Note that the required algorithm for implementing the nextChar method requires two distinct passes through the source string: the first to compute a count of the number of occurrences of the base, the second to randomly select one such occurrence, which determines the continuation.

It turns out that a more efficient approach exists, requiring only a single pass. The apparent hurdle is that we want to choose among those occurrences of the base with equal probability but we do not know how many such occurrences we are choosing between until we have made a pass through the input. Fortunately, there is a clever mathematical trick to make this selection on the fly, while still ensuring that the end result is a fair choice.

Perform a pass through the source, searching for occurrences of the base. While doing this, maintain two important pieces of additional information:

Identify one of the occurrences so far as the current leader in the selection process.

Keep track of a count of the number of occurrences which you have seen thus far.

When you come upon the nth occurrence of the base, reset the leader to this occurrence with probability 1/n. Do this for each occurrence which is found, until the end of the source is reached. Then use the leading candidate as the chosen occurrence for determining the new output character.
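The single-pass selection described above might be sketched as follows. As before, this is a standalone illustration under assumptions: the name onePassNext is hypothetical, the base is passed in as a parameter, and updating the base is left to the caller.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical free-function sketch of the single-pass scheme. Returns the
// character following the selected occurrence of `base`, or 0 when that
// occurrence is the final characters of `source`.
char onePassNext(const std::string& source, const std::string& base) {
  assert(!base.empty());
  std::string::size_type leader = std::string::npos;  // current leader
  unsigned int n = 0;                                  // occurrences seen so far

  for (std::string::size_type pos = source.find(base);
       pos != std::string::npos; pos = source.find(base, pos + 1)) {
    ++n;
    // Replace the leader with probability 1/n. For n == 1 the remainder is
    // always 0, so the first occurrence always becomes the initial leader.
    if (static_cast<unsigned int>(std::rand()) % n == 0) {
      leader = pos;
    }
  }
  if (leader == std::string::npos) return 0;  // base not found at all

  std::string::size_type after = leader + base.size();
  return (after < source.size()) ? source[after] : 0;
}
```

Only the leader's position and the running count are stored, so no list of occurrences is needed.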

Though you certainly do not need to prove that this technique results in all occurrences being equally likely to be chosen, you may want to try it out by hand to convince yourself. For example, trace through what happens when there are a total of three occurrences from which to choose.
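For reference, the three-occurrence trace works out as follows. An occurrence ends up as the final leader if it becomes the leader when it is found and is not replaced by any later occurrence; the general line uses i for the occurrence's position and n for the total count, under the assumption that each replacement decision is made independently.

```latex
\begin{align*}
P(\text{occurrence 1 is chosen}) &= 1 \cdot \left(1 - \tfrac{1}{2}\right) \cdot \left(1 - \tfrac{1}{3}\right) = \tfrac{1}{3} \\
P(\text{occurrence 2 is chosen}) &= \tfrac{1}{2} \cdot \left(1 - \tfrac{1}{3}\right) = \tfrac{1}{3} \\
P(\text{occurrence 3 is chosen}) &= \tfrac{1}{3} \\[4pt]
P(\text{occurrence } i \text{ of } n \text{ is chosen}) &= \frac{1}{i} \prod_{j=i+1}^{n} \left(1 - \frac{1}{j}\right)
  = \frac{1}{i} \cdot \frac{i}{n} = \frac{1}{n}
\end{align*}
```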

So as not to jeopardize your score on the required assignment, we have defined another method, nextCharExtra as part of the Generator class. If you wish to attempt the extra credit, you should first make sure that you get nextChar implemented correctly, according to the original assignment specifications. Then implement nextCharExtra to complete the extra credit challenge.

Michael Goldwasser

CS-P125, Spring 2005
Last modified: Wednesday, 20 April 2005

Saint Louis University

Dept. of Math & Computer Science