
Saint Louis University

Computer Science 150
Introduction to Object-Oriented Programming

Michael Goldwasser

Spring 2013

Dept. of Math & Computer Science

Programming Assignment 07

"Did you mean Billiken?"

Due: 11:59pm, Tuesday, 9 April 2013



Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.

Please make sure you adhere to the policies on academic integrity in this regard.


Overview

Software such as a word processor, search engine, or mobile interface typically includes language-specific plug-in support to aid with spelling. In this assignment, you will implement a class that provides general language support; such a class could presumably be (re)used in these broader software applications.

For the purpose of spell checking, a simple language model is a set of valid words. By convention, a language specification may include both capitalized and uncapitalized words. A word that is capitalized in the language specification is only legitimate when capitalized (e.g., 'Missouri' is okay but 'missouri' is not). A word that is uncapitalized in the language specification can be used in either capitalized or uncapitalized form (e.g., if 'dog' is in the language specification, then both 'dog' and 'Dog' are legitimate usages).
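The capitalization rule can be sketched as a small standalone function. This is only an illustration, not the required interface: the name is_legitimate and the choice to pass the language in as a set are assumptions made for this example.

```python
def is_legitimate(word, language):
    """Return True if word is a legitimate usage of some entry in language.

    language is assumed to be a set of words exactly as they appear
    in the language specification (hypothetical helper, for illustration).
    """
    # An exact match is always legitimate.
    if word in language:
        return True
    # A capitalized usage is also legitimate when the uncapitalized
    # form appears in the specification ('Dog' is fine if 'dog' is listed).
    if word[:1].isupper():
        return (word[0].lower() + word[1:]) in language
    return False

language = {'Missouri', 'dog'}
print(is_legitimate('Missouri', language))  # True
print(is_legitimate('missouri', language))  # False
print(is_legitimate('dog', language))       # True
print(is_legitimate('Dog', language))       # True
```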

The goals of the new class will be to answer the following types of queries:


The LanguageHelper Class Specifications

Formally, you are to implement a LanguageHelper class with the following three methods.

For the sake of simplicity, you do not need to provide robust error checking of any parameters for the purpose of this project.

Python's set Class

Although a natural internal representation of the language would be to maintain a list of words, there is another built-in data structure in Python, known as a set, that provides greater efficiency for lookups. The textbook discusses the class on pages 409-414, but for this project you only need to use three behaviors.
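As a brief illustration of why a set suits this project, the sketch below creates an empty set, adds words to it, and tests membership. These are assumed to be the behaviors you will need most; consult the textbook pages cited above for the full discussion.

```python
# Create an empty set and add a few words to it.
words = set()
words.add('dog')
words.add('Missouri')
words.add('dog')          # adding a duplicate has no effect

print(len(words))         # 2

# Membership tests use the 'in' operator. For a set these take roughly
# constant time on average, whereas a list must be scanned item by item.
print('dog' in words)     # True
print('cat' in words)     # False
```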


Generating Good Suggestions

A common way to define how close a given string is to another (potentially correct) string is to measure the number of changes that would have to be made to get from one word to the other. This is typically called the edit distance between two words.

More concretely, consider the following types of edits which might get from a misspelled word back to a legitimate word.

For this project, the getSuggestions method should return a sorted list of all legitimate language words that are precisely one edit away from the query.

The challenge will be in implementing this efficiently enough so that you can produce these suggestions without any significant delay for the user. There are several possible approaches. One is to write a method to check whether any two words are within one edit of each other. Then you could apply this test to the mistaken word and each word of the language file in turn. But for a language file with hundreds of thousands of words, this is unnecessarily expensive.

Instead, you are to take the query word and generate all possible strings that are one edit away from it; then, for each of those strings, test if it is a legitimate word in the language. Notice that for a typical 7-letter word, there are only 7 ways to delete a character, 6 ways to invert two neighboring characters, and 8*26 ways to insert a new character, because there are 8 possible slots, and 26 possible letters to put in each slot. Okay, depending on how capitalization is managed, perhaps 52 possible characters, or more if allowing hyphen or apostrophe. But still, this seems like a far smaller number of things to check than trying to compare the mistaken word to each word in the language.
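The counts in the preceding paragraph can be checked with a short sketch. It covers only the three kinds of edits named in that paragraph (deletion, inversion of neighbors, insertion), assumes a lowercase 26-letter alphabet, and ignores capitalization entirely. The function name one_edit_candidates and the tiny sample language are hypothetical, chosen only for this example.

```python
import string

def one_edit_candidates(word, alphabet=string.ascii_lowercase):
    """Return three lists of strings that are one edit away from word."""
    # Delete the character at each position.
    deletions = [word[:i] + word[i+1:] for i in range(len(word))]
    # Swap each pair of neighboring characters.
    inversions = [word[:i] + word[i+1] + word[i] + word[i+2:]
                  for i in range(len(word) - 1)]
    # Insert each letter of the alphabet into each possible slot.
    insertions = [word[:i] + c + word[i:]
                  for i in range(len(word) + 1) for c in alphabet]
    return deletions, inversions, insertions

deletions, inversions, insertions = one_edit_candidates('billken')
print(len(deletions))    # 7
print(len(inversions))   # 6
print(len(insertions))   # 208, that is 8 slots * 26 letters

# Keep only the candidates that are actually words in a (tiny, made-up) language.
language = {'billiken', 'bill'}
candidates = set(deletions + inversions + insertions)
suggestions = sorted(candidates & language)
print(suggestions)       # ['billiken']
```

Roughly 221 set lookups replace a comparison against every word in the language, which is why this approach scales with the length of the query rather than the size of the language file.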

Specifications


Good Software Practices

For your LanguageHelper class, you must use the following good software practices:


Files You Will Need

We are providing a file, English.txt, containing over 364,000 correctly spelled English words. The words may involve a combination of uppercase and lowercase letters, as discussed above. The file has one word per line and words are alphabetized as they might appear in a standard English dictionary.

To ease your program development, you can make it appear as if this file is in your own directory on turing by typing the following command.

ln -sf /Public/goldwasser/150/programs/spell/English.txt .

This doesn't really copy the file; rather, it creates what is called a symbolic link (aka shortcut) to this file within your directory. It seems unnecessarily wasteful for everyone to have their own copy of this file.
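Since the file has one word per line, reading it into a set might look like the sketch below. The function name load_words is hypothetical and is not part of the required class interface. So that the example runs anywhere, it is demonstrated on a tiny temporary stand-in file rather than on English.txt itself.

```python
import os
import tempfile

def load_words(filename):
    """Read a one-word-per-line file and return its words as a set."""
    words = set()
    with open(filename) as f:
        for line in f:
            word = line.strip()      # drop the trailing newline
            if word:                 # skip any blank lines
                words.add(word)
    return words

# Demonstrate on a small stand-in file; with the symbolic link in place,
# load_words('English.txt') would be called the same way.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Missouri\ndog\n\n')
sample = load_words(tmp.name)
os.remove(tmp.name)

print(sorted(sample))    # ['Missouri', 'dog']
```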


Submitting Your Assignment

The source code for your LanguageHelper class should be placed in a file named LanguageHelper.py and submitted electronically.

You should also submit a separate 'readme' text file, as outlined in the general webpage on programming assignments. Please include additional observations in this readme about the efficiency of the getSuggestions method on various size words and languages.

Please see details regarding the submission process from the general programming web page, as well as a discussion of the late policy.


Grading Standards

The assignment is worth 10 points. Those points will be apportioned approximately as:


Extra Credit

Implement a method getSuggestionsExtra(self, word) that computes all language entries that have an edit distance of at most two from the original word. (We are specifically asking you to use a different name for the extra credit method, so as not to interfere with the original method that uses edit-distance one.)

If you complete the extra credit challenge, make sure to discuss it in the readme, and to demonstrate use of the new method in your unit tests.


Michael Goldwasser
Last modified: Friday, 05 April 2013