Saint Louis University | Computer Science 1300 | Computer Science Department
For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.
Please make sure you adhere to the policies on academic integrity in this regard.
Software, such as a word processor, search engine, or mobile interface, typically includes plug-in support specific to a language to aid with spelling. In this assignment, you will implement a class that provides general language support; such a class could presumably be (re)used in these broader software applications.
For the purpose of spell checking, a simple language model is a set of valid words. By convention, a language specification may include both capitalized and uncapitalized words. A word that is entirely lowercased in the language specification can be used in either capitalized or uncapitalized form (e.g., if 'dog' is in the language specification, then both 'dog' and 'Dog' are legitimate usages). However, any word that includes one or more uppercased letters in the original language reflects a form that cannot be modified (e.g., 'Missouri' is acceptable but 'missouri' is not; 'NATO' is acceptable, but neither 'Nato', 'nato', nor 'nAtO' would be acceptable).
The goals of the new class will be to answer the following types of queries:
Is a given string a legitimate word in the language? (based on the above conventions regarding capitalization)
Given a string, which may or may not be in the language, produce a list of suggestions that are valid words in the language and reasonably "close" to the given string in terms of spelling. (We will say more below about the notion of distance between words.)
Formally, you are to provide a file named language_tools.py that defines a LanguageHelper class with the following three methods.
__init__(self, words)
The words parameter can be any iterable sequence of strings that define the words in the language. For example, the parameter may be a list of strings, or a file object that has one word per line. All you should assume about this parameter is that you are able to loop over it with a for loop.
The class is responsible for recording all words from the language into an internal data representation, and stripping any extraneous whitespace from each entry (such as newline characters that will appear in a file). For the sake of efficiency, we recommend that you store the language words in a Python set instance. (We discuss sets in a later section.)
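As a rough illustration of the constructor described above, here is a minimal sketch. The attribute name _words and the decision to skip blank entries are our own assumptions, not requirements of the assignment.

```python
import io


class LanguageHelper:
    """Minimal sketch: only the constructor is shown."""

    def __init__(self, words):
        # Assume only that `words` can be looped over.
        self._words = set()          # hypothetical attribute name
        for w in words:
            entry = w.strip()        # drop newlines and surrounding whitespace
            if entry:                # skip blank lines (an assumption)
                self._words.add(entry)


# Works with a list of strings...
from_list = LanguageHelper(['dog', 'Missouri', 'NATO'])
# ...and with a file-like object that yields one word per line.
from_file = LanguageHelper(io.StringIO('dog\nMissouri\nNATO\n'))
```

Because the constructor relies only on a for loop, the same code handles a list, an open file, or any other iterable of strings.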
__contains__(self, query)
The query parameter is a string. This method should determine whether the string is considered a legitimate word, returning True if the word is contained in the language and False otherwise. This method should adhere to the aforementioned conventions regarding capitalized and uncapitalized words. For example, dog, Dog and Missouri are contained in the English language, yet missouri and Missourri are not.
The __contains__ special method is used by Python to support the in operator. It allows the standard syntax
'Missouri' in language
which is implicitly translated by Python to the internal call
language.__contains__('Missouri')
presuming that language is an instance of our LanguageHelper class.
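One possible way to express the capitalization conventions inside __contains__ is sketched below. This is an illustration under our own assumptions (a set stored in a hypothetical _words attribute, and "capitalized" meaning only the first letter is raised), not the required implementation.

```python
class LanguageHelper:
    def __init__(self, words):
        self._words = {w.strip() for w in words}   # hypothetical attribute name

    def __contains__(self, query):
        # An exact match is always legitimate.
        if query in self._words:
            return True
        # Otherwise, a capitalized query is acceptable only when its
        # uncapitalized form is an entirely lowercase entry of the language.
        if query[:1].isupper():
            lowered = query[0].lower() + query[1:]
            return lowered == lowered.lower() and lowered in self._words
        return False


language = LanguageHelper(['dog', 'Missouri', 'NATO'])
print('Dog' in language)        # True
print('missouri' in language)   # False
```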
getSuggestions(self, query)
Given a query string, this method should return an alphabetical list of "nearby" words in the language. Doing a good job at offering suggestions is the most difficult part of writing a good language helper. We discuss this aspect of the project in a later section.
For the sake of simplicity, you do not need to provide robust error checking of any parameters for the purpose of this project.
Although a natural internal representation of the language would be to maintain a list of words, there is another built-in data structure in Python, known as a set, that provides greater efficiency for lookups. For this project you only need to use three behaviors.
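The three behaviors are not listed in this section. We assume here that they are creating an empty set, adding an element, and testing membership, which a short sketch can illustrate:

```python
words = set()            # create an empty set
words.add('dog')         # add an element
words.add('dog')         # adding a duplicate has no effect
print('dog' in words)    # True  (membership test)
print('cat' in words)    # False
print(len(words))        # 1
```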
A common rule for defining how close a given string is to another (potentially correct) string is by measuring the number of changes that would have to be made to get from one word to the other. This is typically called the edit distance between two words.
More concretely, consider the following types of edits which might get from a misspelled word back to a legitimate word.
Delete one character from the word.
(e.g., converting unneccessary to unnecessary)
Add one letter to the word.
(e.g., converting writen to written)
Change a single letter to some other letter.
(e.g., converting flexable to flexible)
Invert two neighboring characters
(e.g., converting wierd to weird, or wierd to wired)
For this project, the getSuggestions method should return a sorted list of all legitimate language words that are at most one edit away from the query. (If the query itself is legitimate, you should include that in the results but still include any further suggestions that are one edit away.)
The challenge will be in implementing this efficiently enough so that you can produce these suggestions without any significant delay for the user. There are several possible approaches. One is to write a method to check whether any two words are within one edit of each other. Then you could iterate this test between the mistaken word and each word of the language file. But for a language file with hundreds of thousands of words, this is unnecessarily expensive.
Instead, you are to take the query word and generate all possible strings that are one edit away from it; then, for each of those strings, test if it is a legitimate word in the language. Notice that for a typical 7-letter word, there are only 7 ways to delete a character, 6 ways to invert two neighboring characters, 7*25 ways to change a character to a different letter, and 8*26 ways to insert a new character, because there are 8 possible slots and 26 possible letters to put in each slot. Depending on how capitalization is managed, there may be 52 possible characters, or more if hyphens or apostrophes are allowed. Still, this is a far smaller number of things to check than comparing the mistaken word to each word in the language.
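To make the counting concrete, the following sketch generates the four kinds of one-edit strings for a 7-letter query. The function name one_edit_candidates and the choice of the 26 lowercase letters as the alphabet are assumptions made for illustration only.

```python
import string


def one_edit_candidates(word, alphabet=string.ascii_lowercase):
    """Return four lists of strings, one per type of edit."""
    n = len(word)
    # Delete the character at position i.
    deletes = [word[:i] + word[i + 1:] for i in range(n)]
    # Invert the neighboring characters at positions i and i+1.
    swaps = [word[:i] + word[i + 1] + word[i] + word[i + 2:]
             for i in range(n - 1)]
    # Change the character at position i to a different letter.
    changes = [word[:i] + c + word[i + 1:]
               for i in range(n) for c in alphabet if c != word[i]]
    # Insert a letter into one of the n+1 slots.
    inserts = [word[:i] + c + word[i:]
               for i in range(n + 1) for c in alphabet]
    return deletes, swaps, changes, inserts


deletes, swaps, changes, inserts = one_edit_candidates('speling')
print(len(deletes), len(swaps), len(changes), len(inserts))   # 7 6 175 208
```

That is 396 candidate strings in total (some of them duplicates), compared with hundreds of thousands of words in a large language file.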
The final list returned by getSuggestions should be sorted alphabetically, and should not contain any duplicates (even though there may have been two different edits which converge to the same word, such as deciding which 'c' of unneccessary to delete).
If the query word begins with a capital letter (e.g., Wierd), all reported suggestions should be capitalized as well (figuring that if a user typed a capital letter when misspelling a word, they would be likely to want the correctly spelled word to be capitalized).
If the query word begins with a lowercase letter (e.g., missouri), you may consider replacing the first letter with any other lowercase letter, or with the uppercase version of the same letter (e.g., Missouri).
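The rules above for duplicates, sorting, and capitalization can be combined in many ways. The sketch below is one possibility, written as standalone functions so that it runs on its own. The names is_word, one_edit, and get_suggestions are hypothetical, only lowercase letters are inserted or substituted, and for a lowercase query the capitalized form is reported only when it is itself an entry of the language; all of these are our assumptions rather than requirements.

```python
import string


def is_word(query, words):
    """Capitalization-aware membership test, as described for __contains__."""
    if query in words:
        return True
    lowered = query[:1].lower() + query[1:]
    return (query[:1].isupper() and lowered == lowered.lower()
            and lowered in words)


def one_edit(word, alphabet=string.ascii_lowercase):
    """Return the set of strings exactly one edit away from word."""
    result = set()
    for i in range(len(word) + 1):
        head, tail = word[:i], word[i:]
        result.update(head + c + tail for c in alphabet)            # insert
        if tail:
            result.add(head + tail[1:])                              # delete
            result.update(head + c + tail[1:] for c in alphabet)     # change
        if len(tail) > 1:
            result.add(head + tail[1] + tail[0] + tail[2:])          # invert
    result.discard(word)      # some edits reproduce the word itself
    return result


def get_suggestions(query, words):
    capital = query[:1].isupper()
    base = query[:1].lower() + query[1:]
    found = set()             # a set removes duplicate suggestions
    for candidate in one_edit(base) | {base}:
        if capital:           # capitalized query: capitalize every result
            candidate = candidate[:1].upper() + candidate[1:]
        if is_word(candidate, words):
            found.add(candidate)
    if not capital:
        proper = base[:1].upper() + base[1:]
        if proper in words:   # e.g., missouri -> Missouri
            found.add(proper)
    return sorted(found)


words = {'weird', 'wired', 'Missouri', 'dog', 'dig'}
print(get_suggestions('wierd', words))      # ['weird', 'wired']
print(get_suggestions('Wierd', words))      # ['Weird', 'Wired']
print(get_suggestions('missouri', words))   # ['Missouri']
print(get_suggestions('dog', words))        # ['dig', 'dog']
```

Note that sorted uses Python's default string ordering, which places uppercase letters before lowercase ones; whether that matches the intended notion of "alphabetical" for mixed-case results is a design decision left to you.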
For your LanguageHelper class, you must use the following good software practices:
Naming Conventions
Please follow all the guidelines for naming conventions given in Chapters 7.4 and 7.6.
Formal Documentation
Provide formal documentation for the class and all its public methods, as described in Chapter 7.5.
Unit Testing
You will be required to provide a series of unit tests using Python's unittest module. We will get you started by providing the following template: language_tests.py.
Although the large English corpus is fun for playing with a large data set and for experimenting with the efficiency of our methods, it is difficult to use for testing the accuracy of answers, because it would require that we manually compute what we know to be the correct answer for each such test. So instead, we ask that you build up a more controlled testing environment by defining your own personal lexicon of words, crafted to test the variety of situations that might arise with the formal specifications for this project.
While we start you off with some basic tests, such as making sure that all words in the lexicon are recognized by the LanguageHelper, you should add your own additional tests for more interesting cases. Make sure to test the rules for how upper and lower cases are to be handled, including some words that should NOT be identified as part of the language, and write a range of interesting tests for the getSuggestions method.
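As an illustration of what such tests might look like, here is a small sketch built on a personal lexicon of three words. It does not reproduce the provided template; the test class and method names are hypothetical, and a stand-in LanguageHelper is defined inline only so that the sketch runs on its own.

```python
import unittest


# Stand-in so the sketch is self-contained; in language_tests.py you would
# instead write:  from language_tools import LanguageHelper
class LanguageHelper:
    def __init__(self, words):
        self._words = {w.strip() for w in words}

    def __contains__(self, query):
        lowered = query[:1].lower() + query[1:]
        return query in self._words or (
            query[:1].isupper() and lowered == lowered.lower()
            and lowered in self._words)


class CapitalizationTests(unittest.TestCase):
    def setUp(self):
        # A small, controlled lexicon with known correct answers.
        self.language = LanguageHelper(['dog', 'Missouri', 'NATO'])

    def test_lowercase_entry_accepts_both_forms(self):
        self.assertIn('dog', self.language)
        self.assertIn('Dog', self.language)

    def test_fixed_capitalization_is_enforced(self):
        self.assertIn('Missouri', self.language)
        for bad in ('missouri', 'Nato', 'nato', 'nAtO'):
            self.assertNotIn(bad, self.language)

    def test_misspelling_is_rejected(self):
        self.assertNotIn('Missourri', self.language)
```

The assertIn and assertNotIn methods use the in operator, so they exercise __contains__ directly. Tests like these can be run with python -m unittest language_tests.py.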
We are providing a file, English.txt, containing over 364,000 correctly spelled English words. The words may involve a combination of uppercase and lowercase letters, as discussed above. The file has one word per line, and words are alphabetized as they might appear in a standard English dictionary.
If you are working on hopper, you can make it appear as if this file is in your own directory by typing the following command.
ln -sf /public/goldwasser/1300/spell/English.txt .
This doesn't really copy the file; rather, it creates what is called a symbolic link (aka shortcut) to this file within your directory. It seems unnecessarily wasteful for everyone to have their own copy of this file.
This project must be submitted electronically using our department's git repository. More specifically, we have created a folder named program04 in your repository, and you should place the following three files within:
language_tools.py: this file should contain the definition of your LanguageHelper class.
language_tests.py: this file should contain all of your unit-testing.
readme.txt: for every project a "readme" text file must be submitted containing:
See as well a discussion of the late policy for programming assignments.
The assignment is worth 40 points. Those points will be apportioned approximately as:
Implement a method
If you complete the extra credit challenge, make sure to discuss it in the readme, and to demonstrate use of the new method in your unit tests.