
Saint Louis University

Computer Science 1300/5001
Introduction to Object-Oriented Programming

Michael Goldwasser

Fall 2018

Computer Science Department

Programming Assignment 09

Shakespearl

Due: 11:59pm, Monday, 3 December 2018



(image from http://members.aol.com/LearnNothing/TypingMonkeys.htm)

Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.

Please make sure you adhere to the policies on academic integrity in this regard.


Overview

We have all heard the argument that having a million monkeys typing on a million keyboards would eventually lead to the creation of each of Shakespeare's works. In essence, there is a minuscule chance that a selection of random characters might match such a work, and so with enough random experiments it should eventually happen. Not surprisingly, small-scale tests of this experiment have thus far failed.


(image originally from www.vivaria.net)

Our goal in this assignment will be to help those monkeys out just a bit. In fact, perhaps they can create works even better than those of Shakespeare himself. The idea is the following. Rather than modeling the process of generating text by choosing each keystroke uniformly at random, we will use an existing work of Shakespeare to seed a random process for language generation.


A Model for Language Generation

The input to the generator will be a source text together with a parameter n which we will call the order of the model. Our output will always begin with the first n characters of the source document. After that point, each additional character is chosen using a random process that is based upon the most recent n-character string in the output, known in linguistics as an n-gram. (In genomics, these are often denoted with parameter k and called k-mers.)

The model for randomly generating the next character in our output will be to determine the number of times the n-gram occurs in the source text, and to mimic the distribution of what character follows those occurrences in the source.

As an example, assume we have an order-2 model and our source text is

aaabaaacaaadaaabaaabaaac

We would automatically begin our output with aa. To determine the next character, notice that there are twelve occurrences of aa in the source. (Each aaa represents two separate occurrences.) Of those twelve occurrences, six are immediately followed by another a, three are followed by b, two are followed by c and one by d. So we randomly pick the next character using this distribution, therefore with probability 1/2 of picking a, probability 1/4 of picking b, probability 1/6 of picking c and probability 1/12 of picking d.

If we had next picked b by that process, our most recent n-gram is now ab and based on the source text we are sure to next pick a because 100% of the ab occurrences in the source are followed by a. Had we originally picked c as our third character, and thus have ac as the most recent n-gram, we notice that there are two occurrences of ac in the original source. One of those is followed by a and the other occurs at the end of the string. In our model, this means that we should pick a with probability 1/2 and we should consider ourselves at the end of the output with probability 1/2 (in which case, we should not generate any further characters).
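To make the counting concrete, here is a small sketch of our own (it is not part of the assignment's provided files, and the helper name followers is our invention) that tallies what follows each occurrence of a given n-gram in the example source:

source = 'aaabaaacaaadaaabaaabaaac'
n = 2

def followers(gram):
    """Count what character follows each occurrence of gram in source."""
    counts = {}
    for i in range(len(source) - n + 1):
        if source[i:i+n] == gram:
            nxt = source[i+n:i+n+1]   # empty string when gram ends the source
            counts[nxt] = counts.get(nxt, 0) + 1
    return counts

print(followers('aa'))   # {'a': 6, 'b': 3, 'c': 2, 'd': 1}
print(followers('ac'))   # {'a': 1, '': 1}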

The text generated by this language model will be greatly affected by the parameter n. When n is small, the output will seem almost like gibberish; when n is quite large, the output will eventually be identical to the source document. But as n varies in between, we get some interesting texts which are original, though in the style of the source work.


Your Task

Your task will be to write code for a new class Generator which supports the following two methods: a constructor, which accepts the source text and the order n and performs any preprocessing, and nextChar(), which returns the next character of the randomly generated output.

Preprocessing

A somewhat "lazy" approach to implementing this model would be to scan the entire input string every time nextChar() is called, to look for all of the occurrences of the most recent n-gram in order to determine the desired distribution for the next character. However, if generating a large number of output characters, it is far more efficient to do some preprocessing of the input within the constructor (so that there is less work to do within the nextChar() method).

We wish you to be more efficient through the use of dictionaries. In particular, you should create and store a dictionary as part of the initialization of your generator, with the following structure. The keys of the dictionary should be the n-grams that occur in the source text, and the value associated with a particular n-gram should be yet another dictionary! That secondary dictionary should keep frequency counts for what character followed the associated n-gram. Returning to our earlier example input and n=2, the dictionary should appear as follows:

{'aa': {'a': 6, 'b': 3, 'c': 2, 'd': 1}, 'ab': {'a': 3}, 'ba': {'a': 3},
 'ac': {'a': 1, '': 1}, 'ca': {'a': 1}, 'ad': {'a': 1}, 'da': {'a': 1}}
Notice that there are a total of seven two-grams that occur in the original string ('aa', 'ab', 'ba', 'ac', 'ca', 'ad', 'da'). For the n-gram 'aa', the secondary dictionary represents that the twelve occurrences were followed by 'a' six times, by 'b' three times, by 'c' two times, and by 'd' one time. Notice that the 'ab' n-gram was followed by 'a' all three times it occurred, and that n-gram 'ac' was followed once by 'a' and once was at the end of the input.
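As one concrete illustration, the following sketch of our own (the helper name build_model is not prescribed by the assignment) builds exactly that dictionary in a single pass over the source:

def build_model(source, n):
    """Map each n-gram of source to a dictionary of follower frequencies."""
    model = {}
    for i in range(len(source) - n + 1):
        gram = source[i:i+n]
        nxt = source[i+n:i+n+1]        # empty string marks the end of the source
        inner = model.setdefault(gram, {})
        inner[nxt] = inner.get(nxt, 0) + 1
    return model

print(build_model('aaabaaacaaadaaabaaabaaac', 2))

A single pass like this keeps the preprocessing time proportional to the length of the source, rather than rescanning the source for every generated character.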

By precomputing such a dictionary, the task of nextChar() is greatly simplified (though still not trivial), as the necessary dictionary for the current n-gram can be examined. However, it is still necessary to maintain the most recent n-gram to have been output (and to properly handle the initial generation of the first n characters that match the source).
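One possible shape for the class, reusing build_model from the sketch above, appears below. The attribute names and the convention of returning the empty string once the output has ended are our own assumptions, not requirements of the assignment:

import random

class Generator:
    def __init__(self, source, n):
        self._source = source
        self._n = n
        self._model = build_model(source, n)   # nested dictionary, as sketched above
        self._recent = ''                      # most recent n-gram of our output

    def nextChar(self):
        # The first n characters of the output echo the source verbatim.
        if len(self._recent) < self._n:
            ch = self._source[len(self._recent)]
            self._recent += ch
            return ch
        # Afterward, choose the next character using the observed distribution.
        counts = self._model[self._recent]
        ch = random.choices(list(counts), weights=list(counts.values()))[0]
        if ch:                                 # empty string signals end of output
            self._recent = self._recent[1:] + ch
        return ch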


Files You Will Need

We provide two Python script files for this project:


Testing Your Program

You are free (and encouraged) to make your own small input files to test your implementation. The biggest challenge in testing this program is that it relies on a random process, and so it will take care to determine whether your generator is behaving in a way that is consistent with the indicated model. (We will certainly be evaluating your work in this regard.)

We suggest that you start by using somewhat small, controlled input sources. This will allow you to test and debug your program while you are still able to hand simulate the scenario. For such small examples, you might print a copy of your computed dictionary so that you can manually examine it (but please comment out any such print statements in the end result).
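For instance, assuming the Generator interface sketched earlier, one possible sanity check regenerates the third character of the example output many times and compares the observed frequencies with those computed by hand:

from collections import Counter

tally = Counter()
for _ in range(12000):
    g = Generator('aaabaaacaaadaaabaaabaaac', 2)
    g.nextChar()                   # the output always begins with 'a'
    g.nextChar()                   # ... followed by a second 'a'
    tally[g.nextChar()] += 1       # the third character follows the n-gram 'aa'
print(tally)   # expect roughly 6000 'a', 3000 'b', 2000 'c', 1000 'd'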

We also offer the following three simple examples. What behavior would you expect for these source strings?

If you feel that you have things under control for these types of examples, then have fun and move onto some larger examples, such as those discussed in the next section.

Alternative Document Sources

For lengthier and more interesting examples, we are providing the following files, which you may download individually or as a combined zip file.


Submitting Your Assignment

You should be submitting two files:

Please see details regarding the submission process from the general programming web page, as well as a discussion of the late policy.


Grading Standards

The assignment is worth 40 points, apportioned as follows:


Acknowledgment

This project is inspired by Joe Zachary's "Nifty Assignment".
Michael Goldwasser
Last modified: Monday, 03 December 2018