Programming Assignment #6: Spell Checker

Assigned: Monday, Nov. 21
Due: Wednesday, Nov. 30 by midnight

Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.

Please make sure you adhere to the policies on academic integrity in this regard.

Overview

Topic: Class Design and Good Software Practices
Related Reading: Chapters 6 and 7

This assignment is worth 10 points.

Goal:

Software, such as a word processor, search engine, or mobile interface, typically includes plug-in support to aid with spelling. In this assignment, you will implement a class that provides general language support, which could presumably be used in these broader software applications.

For the purpose of spell checking, a simple language model is a set of valid words. By convention, a language specification may include both capitalized and uncapitalized words. A word that is capitalized in the language specification is only legitimate when capitalized (e.g., 'Missouri' is okay but 'missouri' is not). A word that is uncapitalized in the language specification can be used in either capitalized or uncapitalized from (e.g., if 'dog' is in the language specification, then both 'dog' and 'Dog' are legitimate usages).

The goals of the new class will be to answer the following types of queries:

Is a given string a legitimate word in the language? (based on the above conventions regarding capitalization)
Given a string, which may or may not be in the language, produce a list of suggestions that are valid words in the language and reasonably "close" to the given string in terms of spelling. (We will say more below, about the notion of distance between words.)

The `LanguageHelper` Class Specifications

Formally, you are to implement a LanguageHelper class with the following three methods.

We are providing you with the start of the class, LanguageHelper.py, in order to facilitate your efforts.

You are to generate the rest of the class, which should include the following three methods:

__init__(self, languageFilename)
The languageFilename should be a string containing the filename (and path, if not in same directory as the class) of the file defining the language to use for spell checking. The file we'll be using for spell checking against the English language, is provided below.

The class is responsible for recording all words from the language into an internal data representation, and stripping any extraneous whitespace from each entry (such as the newline characters at the end of each line in the file).
For the sake of efficiency, we recommend that you store the language words in a Python set instance. (We discuss sets in a later section.)
__contains__(self, query)
The query parameter is a string. This method should determine whether the string is considered a legitimate word, returning True if the word is contained in the language and False otherwise. This method should adhere to the aforementioned conventions regarding capitalized and uncapitalized words. For example, dog, Dog and Missouri are contained in the English language, yet missouri and Missourri are not.

The __contains__ special method is used by Python to support the in operator. It allows the standard syntax
```
   'Missouri' in language
```
which is implicitly translated by Python to the internal call
```
   language.__contains__('Missouri')
```
presuming that language is an instance of our LanguageHelper class.
getSuggestions(self, query)
Given a query string, this method should return an alphabetical list of "nearby" words in the language. Doing a good job at offering suggestions is the most difficult part of writing a good language helper. We discuss this aspect of the project in a later section.

For the sake of simplicity, you do not need to provide robust error checking of any parameters for the purpose of this project.

Python's `set` Class

Although a natural internal representation of the language would be to maintain a list of words, there is another built-in data structure in Python, known as a set, that provides greater efficiency for lookups. The textbook discusses the class on pages 409-414, but for this project you only need to use three behaviors.

s = set()
create a new set instance (which is initially an empty set).
s.add(value)
adds the given value to the set (value will be a string in our application).
value in s
returns True if the given value is currently in the set, and False otherwise.

Generating Good Suggestions

A common rule for defining how close a given string is to another (potentially correct) string is by measuring the number of changes that would have to be made to get from one word to the other. This is typically called the edit distance between two words.

More concretely, consider the following types of edits which might get from a misspelled word back to a legitimate word.

Delete one character from the word.
(e.g. converting unneccessary to unnecessary)
Add one letter to the word.
(e.g., converting writen to written)
Change a single letter to some other letter
(e.g. converting flexable to flexible)
Invert two neighboring characters
(e.g., converting wierd to weird, or wierd to wired)

For this project, the getSuggestions method should return a sorted list of all legitimate language words that are precisely one edit away from the query.

The challenge will be in implementing this efficiently enough so that you can produce these suggestions without any significant delay for the user. There are several possible approaches. One is to write a method to check whether any two words are within one edit of each other. Then you could iterate this test between the mistaken word and each word of the language file. But for a language file with hundreds of thousands of words, this is unnecessarily expensive.

Instead, you are to take the query word and generate all possible strings that are one edit away from it; then, for each of those strings, test if it is a legitimate word in the language. Notice that for a typical 7-letter word, there are only 7 ways to delete a character, 6 ways to invert two neighboring characters, and 8*26 ways to insert a new character, because there are 8 possible slots, and 26 possible letters to put in each slot. Okay, depending on how capitalization is managed, perhaps 52 possible characters, or more if allowing hyphen or apostrophe. But still, this seems like a far smalller number of things to check than trying to compare the mistaken word to each word in the language.

Specifications

The final list returned by getSuggestions should be sorted alphabetically, and should not contain any duplicates (even though there may have been two different edits which converge to the same word, such as deciding which 'c' of unneccessary to delete).
If the query word begins with a capital letter (e.g., Wierd), all reported suggestions should be capitalized as well (figuring that if a user typed a capital letter when misspelling a word, they would be likely to want the correctly spelled word to be capitalized.
If the query word begins with a lowercase letter (e.g., missouri), you may consider replacing the first letter with any other lowercase letter, or with the uppercase version of the same letter (e.g., Missouri)

Good Software Practices

For your LanguageHelper class, you must use the following good software practices:

Naming Conventions
Please follow all the guidelines for naming conventions given in Chapter 7, Sections 7.4 and 7.6.
Formal Documentation
Provide formal documentation for the class and all its public methods, as described in Chapter 7, Section 7.5
Unit Testing
Provide unit testing embedded directly at the end of the LanguageHelper.py file, as described in Chapter 7, Section 7.7
Make sure that your test includes many interesting cases, demonstrating both the check of containment for words that are included and are not included, and a variety of calls to getSuggestions. Although it is fun to play with a large language (see below), we suggest developing some of your unit tests in a more controlled environment with a small, artificial language of your choosing.

Files You Will Need

Here's an initial start for the LanguageHelper class, LanguageHelper.py, to help you get started.

We are providing a file, English.txt containing over 364,000 correctly spelled English words. The words may involve a combination of uppercase and lowercase letters, as discussed above. The file has one word per line and words are alphabetized as they might appear in a standard English dictionary.

To ease your program development, you can make it appear as if this file is in your own directory on hopper.slu.edu by typing the following command.
```
ln -sf /public/fritts/csci1300/English.txt .
```
This doesn't really copy the file, rather it creates what is called a symbolic link (aka shortcut) to this file within your directory. It seems unnecessary wasteful for everyone to have their own copy of this file.

Here are some sample files with various words misspelled, which may aid you in preliminary testing:
- sample
- sampleLetter

Submitting Your Assignment

The source code for your LanguageHelper class should be saved in a file named LanguageHelper.py and submitted electronically.

You should also submit a separate 'readme' text file, with the requisite information.

You should submit these two files electronically by emailing them to the instructor.

Grading Standards

The assignment is worth 10 points. Those points will be apportioned approximately as:

(2 points) Proper construction of a LanguageHelper instance
(2 points) Proper implementation of the __contains__ method, including the desired case-sensitivity.
(4 points) Proper implementation of the getSuggestions method, including the desired case-sensitivity.
(2 points) Good software practices, including name conventions, formal documentation, and appropriate unit tests.