Saint Louis University | Computer Science P125 | Dept. of Math & Computer Science
Please see the general programming webpage for details about the programming environment for this course, guidelines for programming style, and details on electronic submission of assignments.
For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.
Please make sure you adhere to the policies on academic integrity in this regard.
A classic claim is that having a million monkeys
typing on a million keyboards would eventually lead to the creation of
each of Shakespeare's works. In essence, there is a minuscule chance
that a selection of random characters might match such a work, and so
with enough random experiments it should eventually happen.
Not surprisingly, small scale tests of this experiment have thus far
failed.
Our goal in this assignment will be to help those monkeys out just a bit. In fact, perhaps they can create works even better than those of Shakespeare himself. The idea is the following. Rather than modeling the process of generating text by choosing each keystroke uniformly at random, we will use an existing work of Shakespeare to seed a random process for language generation.
The input to the generator will be a source document together with a parameter k which we will call the order of the model. Our output will always begin with the first k characters of the source document. After that point, however, each additional character is chosen based upon the most recent k-character string in the output. Let us denote this k-character string as the current base. We will find all occurrences of that base in the source document, and select one of those occurrences uniformly at random. The next character of the output will then be chosen to match the character which followed the chosen occurrence in the source.
For example, assume that k=2 and that the current base is
th. If the source document had the following form,
*******th***thth******th****th************th****th***********th**
we would hope to pick one of the eight occurrences of th
uniformly at random, such as,
*******th***thth******th****th************[th]****th***********th**
and then generate the next character of output based on which
character followed the chosen occurrence in the source document, e.g.,
*******th***thth******th****th************the***th***********th**
Next time we pick a new character of the output,
we would continue with he as the base.
Rather than implement the entire program, you will be responsible for one key piece of the overall code. Namely, you will need to implement a new class, Generator, which manages the process for generating each new character of text. In particular, we provide the following class definition in file generator.h:
class Generator {
private:
  string source;     // the original string
  unsigned int k;    // the order value for the generation model
  string base;       // the current base string

public:
  /*
   * The constructor takes two pieces of information, namely a
   * reference to the original string and the order 'k' for the
   * generation model.
   */
  Generator(const string& src, unsigned int order);

  /*
   * This routine is responsible for returning one additional
   * character based upon the model.
   */
  char nextChar();

  /*
   * Extra Credit:
   * Same end result as nextChar() but with a more efficient
   * technique, as described in the assignment.
   */
  char nextCharExtra();
};
Generator(const string& src, unsigned int order)
A constructor for your generator which can perform any
initialization which you deem necessary. The first
parameter is a reference to the original source document; the second is the
value of k as discussed above.
char nextChar()
This method is responsible for generating another character of
the output. For the first k calls, the method should
be returning each of the first k characters of the
original source. From that point on, each new character should
be based upon the k most recently generated characters,
as per the assignment description. In the
special case discussed
earlier, you should return the value 0.
Your class should not have any reason to store the entire output, though you will need to maintain a string which represents the most recent k-character base.
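As a minimal sketch of what maintaining the base can look like, the helper below slides a k-character window forward by one character. The name advanceBase is hypothetical (it is not part of generator.h), and the std:: prefix is written out in case no using directive is in effect.

```cpp
#include <string>

// Hypothetical helper: slides the k-character window forward by one,
// dropping the oldest character and appending the newest one.
std::string advanceBase(const std::string& base, char next) {
    if (base.empty()) {
        return base;  // assumption: with order 0 there is no base to track
    }
    return base.substr(1) + next;
}
```

With k=2, a base of th followed by a newly generated e becomes he, matching the example above.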
The biggest challenge will be the implementation of the nextChar() method. You should accomplish this based upon the following algorithm.
- During a first pass through the source, count the number of occurrences of the base. Let's say that there are p such occurrences.
- Now use a random number generator to choose a number between 0 and p-1, then make a second pass through the source to locate the desired occurrence of the base.
- Based on the chosen occurrence of the base, determine the next character as the character which immediately follows that occurrence. That should be returned to the caller (but before doing so, you should think about how to update your representation of the base so that your generator will be ready the next time it needs to generate a character).
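The three steps above can be sketched as a standalone function. This is only an illustration under stated assumptions, not the required member function: the name pickFollowerTwoPass is hypothetical, the source and base are passed in as parameters rather than read from the class, and returning 0 when nothing follows the chosen occurrence is our reading of the special case mentioned for nextChar.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical free function (not the Generator member itself): returns the
// character following a uniformly chosen occurrence of `base` in `source`.
char pickFollowerTwoPass(const std::string& source, const std::string& base) {
    // Pass 1: count the occurrences of the base; call the total p.
    std::size_t p = 0;
    for (std::size_t pos = source.find(base); pos != std::string::npos;
         pos = source.find(base, pos + 1)) {
        ++p;
    }
    if (p == 0) {
        return 0;  // assumption: no occurrence at all, nothing to return
    }

    // Choose which occurrence to use, numbered 0 through p-1.
    std::size_t target = static_cast<std::size_t>(std::rand()) % p;

    // Pass 2: walk forward to the chosen occurrence.
    std::size_t pos = source.find(base);
    for (std::size_t i = 0; i < target; ++i) {
        pos = source.find(base, pos + 1);
    }

    // The follower sits just past the end of the chosen occurrence.
    std::size_t next = pos + base.size();
    if (next >= source.size()) {
        return 0;  // assumption: the occurrence ends the source
    }
    return source[next];
}
```

When the base occurs exactly once, the result is deterministic, which makes such inputs convenient for a first test.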
To get going, please copy the project to your own directory on turing with the command:
cp -Rp ~goldwasser/csp125/programs/Shakespearl .
There are several files:
generator.h
This header file defines the Generator class; however, it does
not contain implementations for the methods.
generator.cpp
This is the file which you will need to edit. It contains
relatively vacuous stubs for the necessary methods. Your task
will be to fully implement these methods.
shakespearl.cpp
This file defines the front-end driver we are using. There is
really little reason for you to read this file.
Makefile
You should simply type make at
the command prompt and then your program will be compiled.
Either syntax errors will be displayed, or else the compilation
will be successful and an executable named Shakespearl will
be created which you can run.
input
This subdirectory contains a set of possible input files.
The string class is discussed in great detail in the text, as well as in our own lecture notes. The original source document will be represented as one long string object. You will also want to use a string to represent the current base.
The find(string pattern, int start) method will be very useful for finding occurrences of the base in the source document. There is one piece of information about this method which we did not discuss earlier in the semester. Recall that find returns an index denoting the starting point in the original string at which the next occurrence of the desired pattern begins.
Until now, we never told you what happens when you call find and no further occurrences exist. In this case, the returned index is a special constant denoted as string::npos (to denote that the pattern was found at no position). You will likely need to check whether the returned value is equal to this constant when writing a loop to count the overall number of occurrences.
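A counting loop built on find might look like the following minimal sketch. The helper name countOccurrences is hypothetical (it is not part of generator.h), and the std:: prefix is written out in case no using directive is in effect. Restarting each search one position past the previous match means overlapping occurrences, such as the two copies of aa inside aaa, are both counted.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts how many times `pattern` occurs in `text`,
// including occurrences that overlap one another.
std::size_t countOccurrences(const std::string& text,
                             const std::string& pattern) {
    if (pattern.empty()) {
        return 0;  // assumption: an empty pattern is treated as "no matches"
    }
    std::size_t count = 0;
    std::size_t pos = text.find(pattern, 0);
    while (pos != std::string::npos) {      // npos signals "no further match"
        ++count;
        pos = text.find(pattern, pos + 1);  // resume one past the last match
    }
    return count;
}
```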
Much like your own Generator class is used in this assignment to generate characters by this model, C++ provides an underlying function for generating random numbers. This function does not truly have the ability to pick things at random; rather, it simulates this by using a mathematical function which seems somewhat random.
You may call the function rand(), which returns a pseudo-random integer between 0 and a particularly large maximum (the constant RAND_MAX). Since you will want to generate a number which is between 0 and p-1 for some value p, you should use the expression (rand()%p), as the remainder of a number divided by p will surely be at least 0 and at most p-1. For those interested in a more detailed discussion, please see Chapter 5.7 of our text.
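The expression can be wrapped in a small function, as in this sketch. The name randomIndex is hypothetical, and the assertion simply documents the assumption that p is positive, since a remainder by zero is undefined.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: returns a pseudo-random index in the range 0..p-1.
unsigned int randomIndex(unsigned int p) {
    assert(p > 0);  // assumption: the caller has found at least one occurrence
    return static_cast<unsigned int>(std::rand()) % p;
}
```

The remainder is very slightly biased toward small values whenever p does not evenly divide the number of values rand() can return; for the purposes of this assignment that bias can be ignored.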
The text generated by this language model will be greatly affected by the parameter k. When k is small, the output will seem almost like gibberish; if k were quite large, the output would eventually be identical to the source document. But as k varies in between, we get some interesting texts which are original, though in the style of the source work.
This may seem like fun and games as you watch your program spitting out random looking text. So how are you to tell whether or not your program is working correctly, given the presence of some randomness in the model? This is not so easy, but indeed you are responsible for assuring the accuracy of your program in carrying out the prescribed model. (We will certainly be evaluating your work in this regard.)
We suggest that you start by using somewhat small, controlled input sources. This will allow you to test and debug your program while you are still able to hand simulate the scenario. We will offer the following three simple examples. What behavior would you expect for these source strings:
example1.txt
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890!@#$%^&*()
using any value of k
example2.txt
aaabaaacaaadaaabaaabaaac
using value of k=3
example3.txt
aaaaaaaaaaaaXaaaaaaaaaaaaXaaaaaaaaaaaaXaaaaaaaaaaaaXc
using value of k=4
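One way to hand-check a small source is to list the character that follows every occurrence of a base; each character should then show up in the output with a frequency proportional to how often it appears in that list. The helper followersOf below is a hypothetical debugging aid, not part of the provided project.

```cpp
#include <cstddef>
#include <string>

// Hypothetical debugging aid: collects, in order, the character that follows
// each occurrence of `base` in `source`. Occurrences at the very end of the
// source have no follower and are skipped.
std::string followersOf(const std::string& source, const std::string& base) {
    std::string followers;
    if (base.empty()) {
        return followers;  // assumption: only meaningful for k >= 1
    }
    for (std::size_t pos = source.find(base); pos != std::string::npos;
         pos = source.find(base, pos + 1)) {
        std::size_t next = pos + base.size();
        if (next < source.size()) {
            followers += source[next];
        }
    }
    return followers;
}
```

For the example2.txt string shown above with base aaa, this yields bcdbbc, so after aaa a correct generator should produce b about half the time, c about a third of the time, and d about a sixth of the time.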
You are welcome to create your own input file if you wish. For convenience and fun, we are providing several interesting and lengthier examples. You will find a subdirectory named input in your own project directory, containing the following files:
muchado.txt (123413 characters)
Much Ado About Nothing. (Let's see how those monkeys do with this).
billiken.txt (3938 characters)
An article on the history of the Billiken mascot.
amendments.txt (18369 characters)
The Amendments to the Constitution (given the current political climate, it will be interesting to see which side of the debate is supported by our experiments).
manifesto.txt (72955 characters)
Communist Party Manifesto (to consider perhaps alternative political structures)
aesopshort.txt (10280 characters)
A short collection of Aesop's fables.
aesop.txt (191945 characters)
A much longer collection of Aesop's fables.
lilwomen.txt (1042048 characters)
Little Women
When the main driver is executed, the user will be prompted for four pieces of information:
The full pathname and filename for the source document.
The order k to be used in the generation.
The number of characters of output which should be generated.
(Optionally) A numerical value to be used in "seeding" the random number generator. If this fourth parameter is not specified, the program will choose its own seed.
generator.h, generator.cpp
These two files are the only source code which you will write.
They contain the Generator class definition and implementation,
respectively.
Again, we also ask for you to estimate the amount of time you spent on the assignment, and to let us know of any difficulties you had or other issues you wish to discuss.
If you worked as a pair, please make sure that both names are given and that you discuss how you each contributed to the submitted work.
The assignment is worth 10 points. If you worked as a pair, you will each be given the same grade.
The primary criterion will be whether your program generates text in accordance with the precise algorithm described in this assignment. Please make sure you read the section on testing your program.
A secondary criterion will be the style, readability and documentation of your source code.
Note that the required algorithm for implementing the nextChar method requires two distinct passes through the source string: the first to compute a count of the number of occurrences of the base, and the second to then randomly select one such occurrence, which determines the continuation.
It turns out that a more efficient approach exists, requiring only a single pass. The apparent hurdle is that we want to choose among those occurrences of the base with equal probability but we do not know how many such occurrences we are choosing between until we have made a pass through the input. Fortunately, there is a clever mathematical trick to make this selection on the fly, while still ensuring that the end result is a fair choice.
Perform a pass through the source, searching for occurrences of the base. While doing this, maintain two important pieces of additional information:
- Identify one of the occurrences so far as the current leader in the selection process.
- Keep track of a count of the number of occurrences which you have seen thus far.
When you come upon the nth occurrence of the base, reset the leader to this occurrence with probability 1/n. Do this for each occurrence which is found, until the end of the source is reached. Then use the leading candidate as the chosen occurrence for determining the new output character.
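The single-pass selection just described can be sketched as follows. As before, this is an illustration rather than the required nextCharExtra member: the name pickFollowerOnePass is hypothetical, the source and base are passed as parameters, and returning 0 when nothing follows the chosen occurrence is our reading of the special case mentioned for nextChar.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical free function: chooses one occurrence of `base` uniformly at
// random in a single pass and returns the character that follows it.
char pickFollowerOnePass(const std::string& source, const std::string& base) {
    std::size_t leader = std::string::npos;  // current leading occurrence
    std::size_t n = 0;                       // occurrences seen so far

    for (std::size_t pos = source.find(base); pos != std::string::npos;
         pos = source.find(base, pos + 1)) {
        ++n;
        // The nth occurrence takes over the lead with probability 1/n.
        // For n == 1 the remainder is always 0, so the first one always leads.
        if (static_cast<std::size_t>(std::rand()) % n == 0) {
            leader = pos;
        }
    }

    if (leader == std::string::npos) {
        return 0;  // assumption: no occurrence at all, nothing to return
    }
    std::size_t next = leader + base.size();
    if (next >= source.size()) {
        return 0;  // assumption: the occurrence ends the source
    }
    return source[next];
}
```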
Though you certainly do not need to prove that this technique results in all occurrences being chosen with equal probability, you may want to try it out by hand to convince yourself. For example, trace through what happens when there are a total of three occurrences from which to choose.
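For readers who want to check their trace, the three-occurrence case works out as follows (a worked check, not a required part of the assignment). An occurrence ends up as the final leader only if it takes the lead when it is found and then survives every later replacement:

```latex
\begin{align*}
P(\text{1st chosen}) &= 1 \cdot \left(1 - \tfrac{1}{2}\right)\left(1 - \tfrac{1}{3}\right) = \tfrac{1}{3} \\
P(\text{2nd chosen}) &= \tfrac{1}{2} \cdot \left(1 - \tfrac{1}{3}\right) = \tfrac{1}{3} \\
P(\text{3rd chosen}) &= \tfrac{1}{3}
\end{align*}
% In general, with n occurrences in total, the i-th one is chosen with probability
\[
\frac{1}{i} \prod_{j=i+1}^{n} \left(1 - \frac{1}{j}\right)
  = \frac{1}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{n-1}{n}
  = \frac{1}{n}.
\]
```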
So as not to jeopardize your score on the required assignment, we have defined another method, nextCharExtra as part of the Generator class. If you wish to attempt the extra credit, you should first make sure that you get nextChar implemented correctly, according to the original assignment specifications. Then implement nextCharExtra to complete the extra credit challenge.