
Saint Louis University

Computer Science P125
Introduction to Computer Science

Michael Goldwasser

Spring 2005

Dept. of Math & Computer Science

Programming Assignment 06

Shakespearl

Due: Wednesday, 20 April 2005, 8pm

Please see the general programming webpage for details about the programming environment for this course, guidelines for programming style, and details on electronic submission of assignments.

Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.

Please make sure you adhere to the policies on academic integrity in this regard.


(image from http://members.aol.com/LearnNothing/TypingMonkeys.htm)

Overview

A classic claim is that having a million monkeys typing on a million keyboards would eventually lead to the creation of each of Shakespeare's works. In essence, there is a minuscule chance that a selection of random characters might match such a work, and so with enough random experiments it should eventually happen. Not surprisingly, small-scale tests of this experiment have thus far failed.


(image from http://www.vivaria.net/experiments/notes/documentation/)

Our goal in this assignment will be to help those monkeys out just a bit. In fact, perhaps they can create works even better than those of Shakespeare himself. The idea is the following. Rather than modeling the process of generating text by choosing each keystroke uniformly at random, we will use an existing work of Shakespeare to seed a random process for language generation.


A Model for Language Generation

The input to the generator will be a source document together with a parameter k which we will call the order of the model. Our output will always begin with the first k characters of the source document. After that point, however, each additional character is chosen based upon the most recent k-character string in the output. Let us denote this k-character string as the current base. We will find all occurrences of that base in the source document, and select one of those occurrences uniformly at random. The next character of the output will then be chosen to match the character which followed the chosen occurrence in the source.

For example, assume that k=2 and that the current base is th. If the source document had the following form,

*******th***thth******th****th************th****th***********th**

we would hope to pick one of the eight occurrences of th uniformly at random, such as the one marked here with brackets,

*******th***thth******th****th************[th]****th***********th**

and then generate the next character of output based on which character followed the chosen occurrence in the source document, e.g.,

*******th***thth******th****th************th[e]***th***********th**

Next time we pick a new character of the output, we would continue with he as the base.

Note well: If you implement this scheme correctly, you can always be assured that the base occurs at least once in the source document. However, one special case which may occur is when the randomly chosen occurrence of the base happens to be the final k characters of the source. In this case, there is no follow-up character to use; your generator can return a special value which will signify the end of the output.

Another Note: Consider a scenario with k=3, a current base of aaa, and a source string "aaaaaaab". You should consider this string to have five different occurrences of the base, with four of those five followed by an additional a and the fifth occurrence followed by b, and thus we'd expect that the next character should be chosen as an a with probability of 4/5.
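The counting in both notes above can be checked with a short sketch. The helper name followerCounts and the use of '\0' as the end marker are assumptions made for illustration; neither is part of generator.h.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper (not part of generator.h): tallies the character that
// follows each occurrence of `base` in `source`, counting overlapping
// occurrences. An occurrence at the very end of the source has no follower
// and is tallied under '\0', an assumed end-of-output marker.
std::map<char, int> followerCounts(const std::string& source,
                                   const std::string& base) {
  assert(!base.empty());
  std::map<char, int> counts;
  std::string::size_type pos = source.find(base);
  while (pos != std::string::npos) {
    std::string::size_type next = pos + base.size();
    char follower = (next < source.size()) ? source[next] : '\0';
    ++counts[follower];
    // Resume one position later, not base.size() later, so that
    // overlapping matches are also found.
    pos = source.find(base, pos + 1);
  }
  return counts;
}
```

For the source "aaaaaaab" and base aaa, this tally gives four occurrences followed by a and one followed by b, matching the 4/5 probability described above.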


Object-Oriented Design

Rather than implement the entire program, you will be responsible for one key piece of the overall code. Namely, you will need to implement a new class, Generator, which manages the process for generating each new character of text. In particular, we provide the following class definition in file generator.h:


class Generator {
private:
  string source;	// the original string
  unsigned int k;	// the order value for the generation model
  string base;		// the current base string

public:

  /*
   *  The constructor takes two pieces of information, namely a
   *  reference to the original string and the order 'k' for the
   *  generation model.
   */
  Generator(const string& src, unsigned int order);

  /*
   *  This routine is responsible for returning one additional
   *  character based upon the model.
   */
  char nextChar();


  /*
   * Extra Credit: 
   *   Same end result as nextChar() but with a more efficient
   *   technique, as described in the assignment.
   */
  char nextCharExtra();

};

Your class should not have any reason to store the entire output, though you will need to maintain a string which represents the most recent k-character base.

The biggest challenge will be the implementation of the nextChar() method. You should accomplish this based upon the following algorithm.
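The extra-credit section of this assignment summarizes the required approach as two passes through the source: one to count the occurrences of the base, and one to walk to a randomly selected occurrence. A rough sketch of that idea follows, written as a free function rather than as the Generator method. The name nextCharTwoPass, the END_OF_OUTPUT sentinel, and the use of rand() are all assumptions made for illustration, not the course's reference solution.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Assumed sentinel for "no follow-up character"; the assignment leaves the
// choice of special value to the implementer.
const char END_OF_OUTPUT = '\0';

// Hypothetical free-function version of the two-pass idea. Given the source
// and the current base, it returns the next character and updates the base.
char nextCharTwoPass(const std::string& source, std::string& base) {
  assert(!base.empty());

  // Pass 1: count the (possibly overlapping) occurrences of the base.
  unsigned int count = 0;
  for (std::string::size_type pos = source.find(base);
       pos != std::string::npos; pos = source.find(base, pos + 1)) {
    ++count;
  }
  if (count == 0) return END_OF_OUTPUT;  // should not happen in correct use

  // Pick which occurrence to use. rand() % count is very slightly biased
  // unless count divides RAND_MAX + 1; tolerated in this sketch.
  unsigned int target = std::rand() % count;

  // Pass 2: walk to the chosen occurrence.
  std::string::size_type pos = source.find(base);
  for (unsigned int i = 0; i < target; ++i) {
    pos = source.find(base, pos + 1);
  }

  std::string::size_type next = pos + base.size();
  if (next >= source.size()) return END_OF_OUTPUT;  // base ends the source

  char c = source[next];
  base = base.substr(1) + c;  // slide the window: drop oldest, append newest
  return c;
}
```

In the class itself, source and base would be the private data members, so the method would take no parameters. With source "abcabc" and base ab, both occurrences are followed by c, so the sketch returns c and the base becomes bc regardless of the random choice.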


Files You Will Need

To get going, please copy the project to your own directory on turing with the command:

cp -Rp ~goldwasser/csp125/programs/Shakespearl .

There are several files:


Additional Programming Details

To successfully complete this assignment, you will need to make use of two existing tools from the C++ libraries.

Testing Your Program

The text generated by this language model will be greatly affected by the parameter k. When k is small, the output will seem almost like gibberish; if k were quite large, the output would eventually be identical to the source document. But as k varies in between, we get some interesting texts which are original, though in the style of the source work.

This may seem like fun and games as you watch your program spitting out random looking text. So how are you to tell whether or not your program is working correctly, given the presence of some randomness in the model? This is not so easy, but indeed you are responsible for assuring the accuracy of your program in carrying out the prescribed model. (We will certainly be evaluating your work in this regard.)

We suggest that you start by using somewhat small, controlled input sources. This will allow you to test and debug your program while you are still able to simulate the scenario by hand. We will offer the following three simple examples. What behavior would you expect for these source strings:

If you feel that you have things under control for these types of examples, then have fun and move on to some larger examples, such as those discussed in the next section.

Alternative Document Sources

You are welcome to create your own input file if you wish. For convenience and fun, we are providing several interesting and lengthier examples. You will find a subdirectory named input in your own project directory, containing the following files:


Running the Shakespearl Program

When the main driver is executed, the user will be prompted for four pieces of information:

  1. The full pathname and filename for the source document.

  2. The order k to be used in the generation.

  3. The number of characters of output which should be generated.

  4. (Optionally) A numerical value to be used in "seeding" the random number generator. If this fourth parameter is not specified, the program will choose its own seed.
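Supplying a fixed seed is useful when testing, because it makes a run repeatable. The sketch below illustrates this, assuming the driver relies on the C library's srand and rand (the assignment does not say which generator it uses). The helper name drawSequence is hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical helper: seed the C library generator, then draw n values.
// Calling it twice with the same seed yields the same sequence.
std::vector<int> drawSequence(unsigned int seed, int n) {
  assert(n >= 0);
  std::srand(seed);
  std::vector<int> values;
  for (int i = 0; i < n; ++i) {
    values.push_back(std::rand());
  }
  return values;
}
```

Under that assumption, rerunning the program with the same source, order, and seed should reproduce the same output, which makes it easier to compare behavior before and after a code change.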


Files to Submit


Grading Standards

The assignment is worth 10 points. If you worked as a pair, you will each be given the same grade.

The primary criterion will be whether your program generates text in accordance with the precise algorithm described in this assignment. Please make sure you read the section on testing your program.

A secondary criterion will be the style, readability and documentation of your source code.


Extra Credit (1 point)

Note that the required algorithm for implementing the nextChar method requires two distinct passes through the source string: the first computes a count of the number of occurrences of the base; the second then randomly selects one such occurrence, which determines the continuation.

It turns out that a more efficient approach exists, requiring only a single pass. The apparent hurdle is that we want to choose among those occurrences of the base with equal probability but we do not know how many such occurrences we are choosing between until we have made a pass through the input. Fortunately, there is a clever mathematical trick to make this selection on the fly, while still ensuring that the end result is a fair choice.

Though you certainly do not need to prove that this technique makes all occurrences equally likely to be chosen, you may want to try it out by hand to convince yourself. For example, trace through what happens when there are a total of three occurrences from which to choose.
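One well-known technique that fits this description is reservoir sampling with a reservoir of size one: keep a single candidate, and replace it with the i-th occurrence seen with probability 1/i. Whether this is exactly the trick intended here is an assumption, and the function name chooseOccurrence is hypothetical. A minimal sketch:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical single-pass selection (reservoir sampling, reservoir size 1).
// Returns the start index of a randomly chosen occurrence of `base` in
// `source`, or std::string::npos if the base never occurs.
std::string::size_type chooseOccurrence(const std::string& source,
                                        const std::string& base) {
  assert(!base.empty());
  std::string::size_type chosen = std::string::npos;
  unsigned int seen = 0;
  for (std::string::size_type pos = source.find(base);
       pos != std::string::npos; pos = source.find(base, pos + 1)) {
    ++seen;
    // Replace the current candidate with probability 1/seen. The first
    // occurrence is always taken, since rand() % 1 is always 0.
    if (std::rand() % seen == 0) {
      chosen = pos;
    }
  }
  return chosen;
}
```

With three occurrences, the third is taken with probability 1/3. The second is taken with probability 1/2 and then survives the third step with probability 2/3, giving 1/3. The first is taken with probability 1 and survives both later steps with probability 1/2 × 2/3 = 1/3. As in any rand() % n scheme, there is a slight bias when n does not divide RAND_MAX + 1.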

So as not to jeopardize your score on the required assignment, we have defined another method, nextCharExtra, as part of the Generator class. If you wish to attempt the extra credit, you should first make sure that you get nextChar implemented correctly, according to the original assignment specifications. Then implement nextCharExtra to complete the extra credit challenge.


Michael Goldwasser
CS-P125, Spring 2005
Last modified: Wednesday, 20 April 2005