Saint Louis University |
Computer Science 150
|
Dept. of Math & Computer Science |
For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.
Please make sure you adhere to the policies on academic integrity in this regard.
Software, such as a word processor, search engine, or mobile interface, typically includes plug-in support specific to a language to aid with spelling. In this assignment, you will implement a class that provides general language support; such a class could presumably be (re)used in these broader software applications.
For the purpose of spell checking, a simple language model is a set of valid words. By convention, a language specification may include both capitalized and uncapitalized words. A word that is capitalized in the language specification is only legitimate when capitalized (e.g., 'Missouri' is okay but 'missouri' is not). A word that is uncapitalized in the language specification can be used in either capitalized or uncapitalized from (e.g., if 'dog' is in the language specification, then both 'dog' and 'Dog' are legitimate usages).
The goals of the new class will be to answer the following types of queries:
Is a given string a legitimate word in the language? (based on the above conventions regarding capitalization)
Given a string, which may or may not be in the language, produce a list of suggestions that are valid words in the language and reasonably "close" to the given string in terms of spelling. (We will say more below, about the notion of distance between words.)
Formally, you are to implement a LanguageHelper class with the following three methods.
__init__(self, words)
The words parameter can be any iterable sequence of
strings that define the words in the language. For example, the
parameter may be a list of strings, or a file object that has one
word per line.
The class is responsible for recording all words from the language into an internal data representation, and stripping any extraneous whitespace from each entry (such as newline characters that will appear in a file). For the sake of efficiency, we recommend that you store the language words in a Python set instance. (We discuss sets in a later section.)
__contains__(self, query)
The query parameter is a string.
This method should determine whether the
string is considered a legitimate word, returning
True if the word is contained in the language and
False otherwise. This method should adhere to
the aforementioned conventions regarding capitalized and
uncapitalized words. For example, dog, Dog and
Missouri are contained in the English language, yet
missouri and Missourri are not.
The __contains__ special method is used by Python to support the in operator. It allows the standard syntax
which is implicitly translated by Python to the internal call'Missouri' in language
presuming that language is an instance of our LanguageHelper class.language.__contains__('Missouri')
getSuggestions(self, query)
Given a query string, this method should return an alphabetical
list of "nearby" words in the language. Doing a good job at offering
suggestions is the most difficult part of writing a good
language helper. We discuss this aspect of the project in
a later section.
Although a natural internal representation of the language would be to maintain a list of words, there is another built-in data structure in Python, known as a set, that provides greater efficiency for lookups. The textbook discusses the class on pages 409-414, but for this project you only need to use three behaviors.
A common rule for defining how close a given string is to another (potentially correct) string is by measuring the number of changes that would have to be made to get from one word to the other. This is typically called the edit distance between two words.
More concretely, consider the following types of edits which might get from a misspelled word back to a legitimate word.
Delete one character from the word.
(e.g. converting unneccessary to unnecessary)
Add one letter to the word.
(e.g., converting writen to written)
Change a single letter to some other letter
(e.g. converting flexable to flexible)
Invert two neighboring characters
(e.g., converting wierd to
weird, or wierd to wired)
For this project, the getSuggestions method should return a sorted list of all legitimate language words that are precisely one edit away from the query.
The challenge will be in implementing this efficiently enough so that you can produce these suggestions without any significant delay for the user. There are several possible approaches. One is to write a method to check whether any two words are within one edit of each other. Then you could iterate this test between the mistaken word and each word of the language file. But for a language file with hundreds of thousands of words, this is unnecessarily expensive.
Instead, you are to take the query word and generate all possible strings that are one edit away from it; then, for each of those strings, test if it is a legitimate word in the language. Notice that for a typical 7-letter word, there are only 7 ways to delete a character, 6 ways to invert two neighboring characters, and 8*26 ways to insert a new character, because there are 8 possible slots, and 26 possible letters to put in each slot. Okay, depending on how capitalization is managed, perhaps 52 possible characters, or more if allowing hyphen or apostrophe. But still, this seems like a far smalller number of things to check than trying to compare the mistaken word to each word in the language.
The final list returned by getSuggestions should be sorted alphabetically, and should not contain any duplicates (even though there may have been two different edits which converge to the same word, such as deciding which 'c' of unneccessary to delete).
If the query word begins with a capital letter (e.g., Wierd), all reported suggestions should be capitalized as well (figuring that if a user typed a capital letter when misspelling a word, they would be likely to want the correctly spelled word to be capitalized.
If the query word begins with a lowercase letter (e.g., missouri), you may consider replacing the first letter with any other lowercase letter, or with the uppercase version of the same letter (e.g., Missouri)
For your LanguageHelper class, you must use the following good software practices:
Naming Conventions
Please follow all the guidelines for naming conventions given in
Chapter 7.4 and 7.6.
Formal Documentation
Provide formal documentation for the class and all its public
methods, as described in Chapter 7.5
Unit Testing
Provide unit testing embedded directly at the end
of the LanguageHelper.py file, as described
in Chapter 7.7
Make sure that your test includes many interesting cases, demonstrating both the check of containment for words that are included and are not included, and a variety of calls to getSuggestions. Although it is fun to play with a large language (see below), we suggest developing some of your unit tests in a more controlled environment with a small, artificial language of your choosing.
We are providing a file, English.txt containing over 364,000 correctly spelled English words. The words may involve a combination of uppercase and lowercase letters, as discussed above. The file has one word per line and words are alphabetized as they might appear in a standard English dictionary.
To ease your program development, you can make it appear as if this file is in your own directory on turing by typing the following command.
ln -sf /Public/goldwasser/150/programs/spell/English.txt .
This doesn't really copy the file, rather it creates what is called a symbolic link (aka shortcut) to this file within your directory. It seems unnecessary wasteful for everyone to have their own copy of this file.
The source code for your LanguageHelper class should be placed in a file named LanguageHelper.py and submitted electronically.
You should also submit a separate 'readme' text file, as outlined in the general webpage on programming assignments. Please include additional observations in this readme about the efficiency of the getSuggestions method on various size words and languages.
Please see details regarding the submission process from the general programming web page, as well as a discussion of the late policy.
The assignment is worth 10 points. Those points will be apportioned approximately as:
Implement a method
If you complete the extra credit challenge, make sure to discuss it in the readme, and to demonstrate use of the new method in your unit tests.