Programming Assignment #6:   Spell Checker

Assigned:    Monday, Nov. 21
Due:    Wednesday, Nov. 30    by midnight

Contents:


Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.

Please make sure you adhere to the policies on academic integrity in this regard.


Overview

Topic:   Class Design and Good Software Practices
Related Reading:   Chapters 6 and 7

This assignment is worth 10 points.

Goal:

Software, such as a word processor, search engine, or mobile interface, typically includes plug-in support to aid with spelling. In this assignment, you will implement a class that provides general language support, which could presumably be used in these broader software applications.

For the purpose of spell checking, a simple language model is a set of valid words. By convention, a language specification may include both capitalized and uncapitalized words. A word that is capitalized in the language specification is only legitimate when capitalized (e.g., 'Missouri' is okay but 'missouri' is not). A word that is uncapitalized in the language specification can be used in either capitalized or uncapitalized from (e.g., if 'dog' is in the language specification, then both 'dog' and 'Dog' are legitimate usages).

The goals of the new class will be to answer the following types of queries:


The LanguageHelper Class Specifications

Formally, you are to implement a LanguageHelper class with the following three methods.

We are providing you with the start of the class, LanguageHelper.py, in order to facilitate your efforts.

You are to generate the rest of the class, which should include the following three methods:

For the sake of simplicity, you do not need to provide robust error checking of any parameters for the purpose of this project.


Python's set Class

Although a natural internal representation of the language would be to maintain a list of words, there is another built-in data structure in Python, known as a set, that provides greater efficiency for lookups. The textbook discusses the class on pages 409-414, but for this project you only need to use three behaviors.


Generating Good Suggestions

A common rule for defining how close a given string is to another (potentially correct) string is by measuring the number of changes that would have to be made to get from one word to the other. This is typically called the edit distance between two words.

More concretely, consider the following types of edits which might get from a misspelled word back to a legitimate word.

For this project, the getSuggestions method should return a sorted list of all legitimate language words that are precisely one edit away from the query.

The challenge will be in implementing this efficiently enough so that you can produce these suggestions without any significant delay for the user. There are several possible approaches. One is to write a method to check whether any two words are within one edit of each other. Then you could iterate this test between the mistaken word and each word of the language file. But for a language file with hundreds of thousands of words, this is unnecessarily expensive.

Instead, you are to take the query word and generate all possible strings that are one edit away from it; then, for each of those strings, test if it is a legitimate word in the language. Notice that for a typical 7-letter word, there are only 7 ways to delete a character, 6 ways to invert two neighboring characters, and 8*26 ways to insert a new character, because there are 8 possible slots, and 26 possible letters to put in each slot. Okay, depending on how capitalization is managed, perhaps 52 possible characters, or more if allowing hyphen or apostrophe. But still, this seems like a far smalller number of things to check than trying to compare the mistaken word to each word in the language.

Specifications


Good Software Practices

For your LanguageHelper class, you must use the following good software practices:


Files You Will Need


Submitting Your Assignment

The source code for your LanguageHelper class should be saved in a file named LanguageHelper.py and submitted electronically.

You should also submit a separate 'readme' text file, with the requisite information.

You should submit these two files electronically by emailing them to the instructor.


Grading Standards

The assignment is worth 10 points. Those points will be apportioned approximately as: