Saint Louis University, Computer Science Department
Computer Science 1300/5001
For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.
Please make sure you adhere to the policies on academic integrity in this regard.
We have all heard the argument that having a million monkeys
typing on a million keyboards would eventually lead to the creation of
each of Shakespeare's works. In essence, there is a minuscule chance
that a selection of random characters might match such a work, and so
with enough random experiments it should eventually happen.
Not surprisingly, small-scale tests of this experiment have thus far
failed.
Our goal in this assignment will be to help those monkeys out just a bit. In fact, perhaps they can create works even better than those of Shakespeare himself. The idea is the following. Rather than modeling the process of generating text by choosing each keystroke uniformly at random, we will use an existing work of Shakespeare to seed a random process for language generation.
The input to the generator will be a source text together with a
parameter n which we will call the order of the
model. Our output will always begin with the first n
characters of the source document. After that point, each
additional character is chosen using a random process that is based
upon the most recent n-character string in the output, known
in linguistics as an n-gram.
The model for randomly generating the next character in our output will be to determine the number of times the n-gram occurs in the source text, and to mimic the distribution of what character follows those occurrences in the source.
As an example, assume we have an order-2 model and our source text is
aaabaaacaaadaaabaaabaaac
We would automatically begin our output with aa. To determine
the next character, notice that there are twelve occurrences of
aa in the source. (Each aaa represents two separate
occurrences.) Of those twelve occurrences, six are immediately
followed by another a, three are followed
by b, two are followed by c and one
by d. So we randomly pick the next character using this
distribution: with probability 1/2 we pick a, with
probability 1/4 we pick b, with probability 1/6 we
pick c, and with probability 1/12 we pick d.
If we had next picked b by that process, our most recent n-gram is now ab, and based on the source text we are sure to next pick a, because 100% of the ab occurrences in the source are followed by a. Had we originally picked c as our third character, and thus had ac as the most recent n-gram, we would notice that there are two occurrences of ac in the original source. One of those is followed by a and the other occurs at the end of the string. In our model, this means that we should pick a with probability 1/2, and consider ourselves at the end of the output with probability 1/2 (in which case, we should not generate any further characters).
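The counting described above can be checked directly. The following sketch takes the "scan the whole source" view: given an n-gram, it finds every occurrence and tallies what character follows each one (the function name follower_counts is our own, not part of the assignment's template).

```python
def follower_counts(source, gram):
    """Count which characters follow each occurrence of gram in source."""
    counts = {}
    n = len(gram)
    for i in range(len(source) - n + 1):
        if source[i:i+n] == gram:
            # Slicing past the end yields '', marking "end of source".
            follower = source[i+n:i+n+1]
            counts[follower] = counts.get(follower, 0) + 1
    return counts

source = 'aaabaaacaaadaaabaaabaaac'
print(follower_counts(source, 'aa'))   # {'a': 6, 'b': 3, 'c': 2, 'd': 1}
print(follower_counts(source, 'ac'))   # {'a': 1, '': 1}
```

Note how the empty string '' naturally records the case where an occurrence of the n-gram sits at the very end of the source.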
The text generated by this language model will be greatly affected by the parameter n. When n is small, the output will seem almost like gibberish; if n were quite large, the output would eventually be identical to the source document. But as n varies in between, we get some interesting texts which are original, though in the style of the source work.
Generator(source, n)
A constructor for your generator class that takes the source
text as a string and the value n defining the order of the
model.
The constructor should initialize any necessary attributes of
your generator object, including preprocessing the input as
described in the following section.
nextChar()
This method is responsible for generating and returning one
additional character of the output. It will have to rely on the
current state of your object to determine an appropriate response.
For the first n calls, the method should
return each of the first n characters of the
original source. From that point on, each new character should
be based upon the n most recently generated characters,
as per the assignment description. In the special case where you
want to indicate the end of the output, return the empty
string ''.
A somewhat "lazy" approach to implementing this model would be to scan the entire input string every time nextChar() is called, to look for all of the occurrences of the most recent n-gram in order to determine the desired distribution for the next character. However, if generating a large number of output characters, it is far more efficient to do some preprocessing of the input within the constructor (so that there is less work to do within the nextChar() method).
We wish you to be more efficient through the use of dictionaries. In particular, we wish to have you create and store a dictionary as part of the initialization of your generator, with the following structure. The keys of the dictionary should be the n-grams that occur in the source text, and the value associated with a particular n-gram should be yet another dictionary! That secondary dictionary should keep frequency counts for what character followed the associated n-gram. Returning to our earlier example input and n=2, the dictionary should appear as follows:
{'aa': {'a': 6, 'b': 3, 'c': 2, 'd': 1}, 'ab': {'a': 3}, 'ba': {'a': 3}, 'ac': {'a': 1, '': 1}, 'ca': {'a': 1}, 'ad': {'a': 1}, 'da': {'a': 1}}
Notice that there are a total of seven two-grams that occur in the original string ('aa', 'ab', 'ba', 'ac', 'ca', 'ad', 'da'). For the n-gram 'aa', the secondary dictionary represents that the twelve occurrences were followed by 'a' six times, by 'b' three times, by 'c' two times, and by 'd' one time. Notice that the 'ab' n-gram was followed by 'a' all three times it occurred, and that n-gram 'ac' was followed once by 'a' and once was at the end of the input.
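One possible shape for this preprocessing is a single pass over the source, as sketched below. The function name build_model is ours for illustration; in your solution this work would live inside the constructor and the result would be stored as an attribute of your generator object.

```python
def build_model(source, n):
    """Map each n-gram in source to a dictionary of follower counts."""
    model = {}
    for i in range(len(source) - n + 1):
        gram = source[i:i+n]
        # '' as follower means this occurrence ends the source.
        follower = source[i+n:i+n+1]
        inner = model.setdefault(gram, {})
        inner[follower] = inner.get(follower, 0) + 1
    return model

print(build_model('aaabaaacaaadaaabaaabaaac', 2))
```

Run on the example source with n=2, this reproduces exactly the nested dictionary shown above.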
By precomputing such a dictionary, the task of nextChar() is greatly simplified (though still not trivial), as the necessary dictionary for the current n-gram can be examined. However, it is still necessary to maintain the most recent n-gram to have been output (and to properly handle the initial generation of the first n characters that match the source).
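With the counts precomputed, sampling the next character reduces to a weighted random choice over one inner dictionary. One way to do this is with random.choices from the standard library (the function name sample_next is our own; it assumes it is handed the inner dictionary for the current n-gram).

```python
import random

def sample_next(counts):
    """Pick one follower at random, weighted by its frequency count."""
    followers = list(counts.keys())
    weights = list(counts.values())
    return random.choices(followers, weights=weights)[0]

# Inner dictionary for 'aa' from the running example:
# 'a' should come back with probability 1/2, 'b' with 1/4,
# 'c' with 1/6, and 'd' with 1/12.
print(sample_next({'a': 6, 'b': 3, 'c': 2, 'd': 1}))
```

If the sampled value is the empty string '', that is precisely the "end of output" signal that nextChar() is expected to return.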
We provide two Python script files for this project:
generator.py
A template for the class which you must write. This is the only
file that you should be modifying.
driver.py
This is a file which you should not edit. It is the main driver
to execute your program. It queries the user for a name of a
source file, the choice of n, the number of characters
of output that are desired, an optional random number
generator seed, and the name of the desired output file.
You are free (and encouraged) to make your own small input files to test your implementation. The biggest challenge with testing this program is that it is a random process and so it will take care to determine whether your generator is behaving in a way that is consistent with the indicated model. (We will certainly be evaluating your work in this regard.)
We suggest that you start by using somewhat small, controlled input sources. This will allow you to test and debug your program while you are still able to hand-simulate the scenario. For such small examples, you might print a copy of your computed dictionary so that you can manually examine it (but please comment out any such print statements in the end result).
We also offer the following three simple examples. What behavior would you expect for these source strings?
example1.txt
aaabaaacaaadaaabaaabaaac
This is our original example. We recommend testing with values of
n=2 or n=3.
example2.txt
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890!@#$%^&*()
using any value of n
example3.txt
aaaaaaaaaaaaXaaaaaaaaaaaaXaaaaaaaaaaaaXaaaaaaaaaaaaXc
using value of n=4
For lengthier and more interesting examples, we are providing the following files, which you may download individually or as a combined zip file.
muchado.txt (122095 characters)
Much Ado About Nothing. (Let's see how those monkeys do with this.)
billiken.txt (3939 characters)
An article on the history of the Billiken mascot.
amendments.txt (18369 characters)
The Amendments to the Constitution (given the current political climate, it will be interesting to see which side of the debate is supported by our experiments).
manifesto.txt (72955 characters)
Communist Party Manifesto (to consider perhaps alternative political structures).
aesopshort.txt (10164 characters)
A short collection of Aesop's fables.
aesop.txt (189969 characters)
A much longer collection of Aesop's fables.
lilwomen.txt (1039348 characters)
Little Women
You should be submitting two files:
Please see details regarding the submission process from the general programming web page, as well as a discussion of the late policy.
The assignment is worth 40 points, apportioned as follows: