
Saint Louis University

Computer Science 1300
Introduction to Object-Oriented Programming

Michael Goldwasser

Spring 2020

Computer Science Department

Programming Assignment 07

"Did you mean Billiken?"

Due: 11:59pm, Wednesday, 8 April 2020




Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.

Please make sure you adhere to the policies on academic integrity in this regard.


Overview

Software, such as a word processor, search engine, or mobile interface, typically includes plug-in support specific to a language to aid with spelling. In this assignment, you will implement a class that provides general language support; such a class could presumably be (re)used in these broader software applications.

For the purpose of spell checking, a simple language model is a set of valid words. By convention, a language specification may include both capitalized and uncapitalized words. A word that is entirely lowercased in the language specification can be used in either capitalized or uncapitalized form (e.g., if 'dog' is in the language specification, then both 'dog' and 'Dog' are legitimate usages). However, any word that includes one or more uppercased letters in the original language reflects a form that cannot be modified (e.g., 'Missouri' is acceptable but 'missouri' is not; 'NATO' is acceptable, but neither 'Nato', 'nato', nor 'nAtO' would be acceptable).
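To make the capitalization convention concrete, here is a minimal sketch of how such a check might look. The function name is_legitimate and the use of a plain Python set for the language are assumptions made for illustration only; they are not part of the required interface.

```python
def is_legitimate(word, language):
    """Return True if word is an acceptable usage under the convention above.

    language is assumed to be a set of words exactly as they appear
    in the language specification (hypothetical representation).
    """
    # Any word used exactly as specified is acceptable.
    if word in language:
        return True
    # A capitalized query such as 'Dog' is acceptable only when its
    # uncapitalized form is an entirely lowercase language entry.
    if word[:1].isupper():
        uncapitalized = word[0].lower() + word[1:]
        return uncapitalized == uncapitalized.lower() and uncapitalized in language
    return False

language = {'dog', 'Missouri', 'NATO'}
print(is_legitimate('Dog', language))       # True
print(is_legitimate('missouri', language))  # False
print(is_legitimate('Nato', language))      # False
```

Note that the sketch never lowercases an entry that contains an uppercase letter, which is what keeps 'missouri' and 'nato' from being accepted.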

The goals of the new class will be to answer the following types of queries:


The LanguageHelper Class Specifications

Formally, you are to provide a file named language_tools.py that defines a LanguageHelper class with the following three methods.

For the sake of simplicity, you do not need to provide robust error checking of any parameters for the purpose of this project.


Python's set Class

Although a natural internal representation of the language would be to maintain a list of words, there is another built-in data structure in Python, known as a set, that provides greater efficiency for lookups. For this project you only need to use three behaviors.
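As a brief illustration of working with a set, the sketch below shows three basic operations: creating an empty set, adding an element, and testing membership. Which three behaviors the assignment has in mind is an assumption here; these are simply the ones most relevant to storing and looking up words.

```python
words = set()             # create an empty set
words.add('dog')          # add an element
words.add('Missouri')
words.add('dog')          # adding a duplicate has no effect

print('dog' in words)       # True
print('missouri' in words)  # False (membership is case-sensitive)
print(len(words))           # 2
```

A membership test on a set takes roughly constant time on average, whereas the same test on a list must scan the elements one by one.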


Generating Good Suggestions

A common way to define how close a given string is to another (potentially correct) string is to measure the number of changes that would have to be made to get from one word to the other. This is typically called the edit distance between two words.

More concretely, consider the following types of edits which might get from a misspelled word back to a legitimate word.

For this project, the getSuggestions method should return a sorted list of all legitimate language words that are at most one edit away from the query. (If the query itself is legitimate, you should include that in the results but still include any further suggestions that are one edit away.)

The challenge will be in implementing this efficiently enough so that you can produce these suggestions without any significant delay for the user. There are several possible approaches. One is to write a method to check whether any two words are within one edit of each other. Then you could iterate this test between the mistaken word and each word of the language file. But for a language file with hundreds of thousands of words, this is unnecessarily expensive.

Instead, you are to take the query word and generate all possible strings that are one edit away from it; then, for each of those strings, test if it is a legitimate word in the language. Notice that for a typical 7-letter word, there are only 7 ways to delete a character, 6 ways to invert two neighboring characters, and 8*26 ways to insert a new character, because there are 8 possible slots, and 26 possible letters to put in each slot. Okay, depending on how capitalization is managed, perhaps 52 possible characters, or more if allowing hyphen or apostrophe. But still, this seems like a far smaller number of things to check than trying to compare the mistaken word to each word in the language.
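The generate-then-test idea can be sketched as follows. This is only an illustration under simplifying assumptions: it covers just the three kinds of edits counted in the paragraph above, it inserts only the 26 lowercase letters, it ignores the capitalization rules entirely, and the names one_edit_candidates and suggestions are hypothetical rather than part of the required interface.

```python
import string

def one_edit_candidates(word, alphabet=string.ascii_lowercase):
    """Return the set of strings reachable from word by one deletion,
    one swap of neighboring characters, or one insertion."""
    candidates = set()
    for i in range(len(word)):
        # delete the character at index i
        candidates.add(word[:i] + word[i+1:])
    for i in range(len(word) - 1):
        # swap the characters at indices i and i+1
        candidates.add(word[:i] + word[i+1] + word[i] + word[i+2:])
    for i in range(len(word) + 1):
        # insert each letter of the alphabet into slot i
        for c in alphabet:
            candidates.add(word[:i] + c + word[i:])
    return candidates

def suggestions(word, language):
    """Return a sorted list of language entries within one edit of word,
    including word itself when it is in the language."""
    candidates = one_edit_candidates(word) | {word}
    return sorted(c for c in candidates if c in language)

language = {'cat', 'cart', 'act', 'at', 'dog'}
print(suggestions('cat', language))   # ['act', 'at', 'cart', 'cat']
```

Because the candidates are collected in a set, duplicates (such as deleting either 'l' from 'hello') are stored only once, so the number of lookups for a 7-letter word is at most 7 + 6 + 8*26 = 221.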

Specifications


Good Software Practices

For your LanguageHelper class, you must use the following good software practices:


Files You Will Need

We are providing a file, English.txt, containing over 364,000 correctly spelled English words. The words may involve a combination of uppercase and lowercase letters, as discussed above. The file has one word per line and words are alphabetized as they might appear in a standard English dictionary.

If you are working on hopper, you can make it appear as if this file is in your own directory by typing the following command.

ln -sf /public/goldwasser/1300/spell/English.txt .

This doesn't really copy the file; rather, it creates what is called a symbolic link (aka shortcut) to this file within your directory. It seems unnecessarily wasteful for everyone to have their own copy of this file.
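Since the file has one word per line, reading it into a set is straightforward. The sketch below shows one possible way to do so; the function name load_words is hypothetical, and how your class actually receives its words is governed by the method specifications rather than by this example.

```python
def load_words(filename):
    """Return a set containing each non-empty line of the given file,
    with surrounding whitespace (including the newline) removed."""
    words = set()
    with open(filename) as source:
        for line in source:
            entry = line.strip()
            if entry:              # skip blank lines
                words.add(entry)
    return words
```

For example, load_words('English.txt') would produce a set suitable for fast membership tests, with the original capitalization of each entry preserved.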


Submitting Your Assignment

This project must be submitted electronically using our department's git repository. More specifically, we have created a folder named program04 in your repository and you should place the following three files within:

See as well a discussion of the late policy for programming assignments.


Grading Standards

The assignment is worth 40 points. Those points will be apportioned approximately as:


Extra Credit

Implement a method getSuggestionsExtra(self, word) that computes all language entries that have an edit distance of at most two from the original word. (We are specifically asking you to use a different name for the extra credit method, so as not to interfere with the original method that uses edit-distance one.)

If you complete the extra credit challenge, make sure to discuss it in the readme, and to demonstrate use of the new method in your unit tests.
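One possible way to think about edit distance two is to apply the one-edit generation step twice: every string within two edits of the query is within one edit of some string that is itself within one edit of the query. The sketch below illustrates that idea only; the helper names are hypothetical, it uses the same three kinds of edits and lowercase-only insertions as an assumption, and it ignores capitalization. Other approaches may be more efficient.

```python
import string

def one_edit(word, alphabet=string.ascii_lowercase):
    """Return all strings one deletion, neighbor swap, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return deletes | swaps | inserts

def within_two_edits(word):
    """Return all strings at most two edits away from word."""
    first = one_edit(word) | {word}
    second = set()
    for candidate in first:
        # one more edit applied to each string already within one edit
        second |= one_edit(candidate)
    return first | second

print('coats' in within_two_edits('cat'))  # True (two insertions)
print('dog' in within_two_edits('cat'))    # False
```

The resulting candidate set is much larger than in the one-edit case, so filtering it against the language set is where the efficiency of set lookups matters most.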


Michael Goldwasser
Last modified: Saturday, 21 December 2019