Saint Louis University
Computer Science 362
Dept. of Math & Computer Science
For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.
Please make sure you adhere to the policies on academic integrity in this regard.
In this assignment, we will be experimenting with several classification algorithms in the context of identifying unwanted text messages. Our data set, provided as part of the UCI Machine Learning Repository, is a series of over 5000 text messages that have been previously labeled as either legitimate ("ham") or unwanted ("spam"). In particular, there are 4518 ham messages and 653 spam messages, taken primarily from users in the UK and Singapore.
We will perform experiments in Python using 10-fold cross validation; a round of that process involves randomly dividing the input into 10 stripes, and then using each of those 10 stripes as an isolated test set on a classifier trained with the other nine stripes.
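As a rough illustration of that partitioning (this is not the provided driver script; the names ten_fold_indices and samples are hypothetical), one round might be organized as follows:

import random

def ten_fold_indices(n, seed=0):
    """Randomly partition the indices 0..n-1 into 10 stripes of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[k::10] for k in range(10)]

# Hypothetical usage: each stripe serves once as the isolated test set.
# samples = [...]                      # list of (message, label) pairs
# for stripe in ten_fold_indices(len(samples)):
#     test = [samples[i] for i in stripe]
#     train = [s for i, s in enumerate(samples) if i not in set(stripe)]
#     ... train a classifier on train, then score it on test ...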
The steps of our experiment are as follows:
The raw text messages and their labels as ham or spam are read from an input file.
The messages may optionally be converted to lowercase.
Each message string is tokenized, resulting in a list of string tokens. By default, the string is tokenized by breaking it apart using whitespace, but this process can be customized.
Each list of tokens is converted to a feature space, which we represent using a dictionary that maps from feature name to feature value. By default, our feature space maps each string that appears as a token to the value 1 (even in cases where a token appears more than once). Strings that do not occur as tokens do not appear explicitly in the dictionary, although we consider them as implicitly mapped to the value None. The conversion from a list of tokens to a feature space can be customized (for example, looking at broader properties such as the number of uppercase letters or punctuation symbols).
A classifier is trained on 90% of the original samples. Specifically, for each sample the classifier is given the feature space dictionary and the ham vs. spam designation.
For each sample in the 10% designated as the test set, the classifier is presented with its feature space dictionary and must return a ham/spam prediction.
The necessary software and data files are found within turing:/Public/goldwasser/362/classification/, or can be downloaded as the attached zip file. It consists of six files:
tokenize.py
This file contains definitions for the process of converting a
message string to a list of tokens.
For example, the string
Its ok, if anybody asks abt me, u tel them..:-P
might be split on whitespace to form the tokens:
["Its", "ok,", "if", "anybody", "asks", "abt", "me,", "u", "tel", "them..:-P"]
Or we might downcase and strip away any form of punctuation while splitting, to get the tokens:
["its", "ok", "if", "anybody", "asks", "abt", "me", "u", "tel", "them", "p" ]
feature.py
This file contains definitions for converting a list of tokens
to a feature space.
For example, if given the following tokens,
["Its", "ok,", "if", "anybody", "asks", "abt", "me,", "u", "tel", "them..:-P" ]we might represent a feature space as the map:
{ "Its":1, "ok,":1, "if":1, "anybody":1, "asks":1, "abt":1, "me,":1, "u":1, "tel":1, "them..:-P":1 }
classification.py
This file contains definitions for a variety of classification
algorithms. To coordinate the storage of information, all
algorithms are represented as subclasses of a Classifier
abstract base class. This provides a constructor that records
command line options and a random number generator with known
seed, and it provides an evaluate method that automates
the training and testing.
What remains for each subclass is simply to override the methods train(examples) and classify(sample).
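In rough outline, the relationship between the base class and a subclass looks something like the sketch below; the real classification.py also manages the command line options, the seeded random number generator, and the evaluate method, so this is only a simplified stand-in.

class Classifier:
    """Simplified stand-in for the provided abstract base class."""
    def train(self, examples):
        """examples is a list of (feature_dict, label) pairs."""
        raise NotImplementedError

    def classify(self, sample):
        """sample is a feature_dict; return the predicted 'ham' or 'spam'."""
        raise NotImplementedError

class MajorityClassifier(Classifier):
    """Toy subclass: always predicts the label seen most often in training."""
    def train(self, examples):
        labels = [label for _, label in examples]
        self.prediction = max(set(labels), key=labels.count)

    def classify(self, sample):
        return self.prediction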
SMS_Spam.dat
This is a data file containing the full set of 5171 labeled messages.
It is the default input file, and it is parsed by the
spamalot.py script that we provide.
SMS_Spam_mini.dat
This is a smaller data set composed of 1000 randomly selected
messages from the full set. We provide it to aid in development,
as some of the algorithms will take significant training time on
the full dataset.
k-Nearest Neighbor Classifier
Implement an algorithm that classifies a sample based on the
majority of labels among its k nearest neighbors in Euclidean
space, with a dimension for every token in the training set;
a sample has value 1 in a dimension if it contains the
corresponding token and value 0 if it does not.
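Since every feature value is 1 or 0, the squared Euclidean distance between two samples is simply the number of tokens that appear in exactly one of the two feature spaces (ignoring, for simplicity, the restriction of dimensions to tokens seen in training). A sketch of the computation, not the required implementation, follows:

from collections import Counter

def squared_distance(features_a, features_b):
    """Number of tokens present in exactly one of the two feature dictionaries."""
    return len(set(features_a) ^ set(features_b))

def knn_predict(train_examples, sample, k):
    """Majority label among the k training examples nearest to the sample.
    train_examples is a list of (feature_dict, label) pairs."""
    nearest = sorted(train_examples, key=lambda ex: squared_distance(ex[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]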
We have stubbed a NearestNeighborClassifier class within
classifiers.py for this purpose. It can be chosen using
the commandline option
Improved Tokenization
By default, a message is tokenized simply by splitting on whitespace. For example, for the message
I was going to suggest a definite move out--if i'm still there-- after greece.
this rule leaves as tokens "out--if" and "there--" and "greece.".
We can get better results out of classification algorithms if we instead break the message into tokens using primarily alphabetic characters, so that we separate tokens "out" and "if" in the above example, or the simpler "there" rather than "there--", and "greece" rather than "greece.". However, we might want to special case the use of an apostrophe, allowing "i'm" to remain as a token. (Even more care may be needed if we wish to keep emoticons such as ":-P", which will otherwise become token "P" by this rule.)
You must implement one such nontrivial tokenizer, based on tokens that consist primarily of alphabetic characters.
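One possible starting point (a sketch only; the exact pattern and any special handling of emoticons is up to you) keeps runs of letters with an optional internal apostrophe:

import re

def alpha_tokens(message):
    """Keep runs of alphabetic characters, allowing one internal apostrophe
    so that contractions such as "i'm" survive as single tokens."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", message.lower())

# alpha_tokens("I was going to suggest a definite move out--if i'm still there-- after greece.")
#   -> ['i', 'was', 'going', 'to', 'suggest', 'a', 'definite', 'move',
#       'out', 'if', "i'm", 'still', 'there', 'after', 'greece']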
Naive Bayesian Classifier
You are to implement the following version of naive Bayesian
classification. When we assume independence of the tokens in a
message, a probability such as

    Pr(spam | t1, t2, ..., tn)

for tokens t1, t2, ..., tn becomes equivalent to

    Pr(spam | t1) * Pr(spam | t2) * ... * Pr(spam | tn)

We can estimate a term such as Pr(spam | t) from the training set
as the fraction of the samples containing token t that are labeled
as spam.
However, there is a potential problem with infrequent
tokens. For example, if a token t occurs only once in the training
set, in a sample that is spam, does that mean we should treat
Pr(spam | t) as 1? To temper the influence of such rare tokens, we
instead use the smoothed estimate

    P̂(spam | t) = (b * Pr(spam) + k * Pr(spam | t)) / (b + k)

where Pr(spam) is the a priori probability of a sample being spam,
k is the number of occurrences of token t in the training set, and
b is a tunable weight given to that a priori probability. Notice
that if b is zero, then this expression is precisely Pr(spam | t).
In the special case that both b and k are zero, you can decide how
you want to let such a token influence your classification.
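The following sketch shows one way to organize the training counts and the smoothed estimate. It is not the required NaiveBayesianClassifier; comparing a spam score against a ham score is only one reasonable classification rule, and the handling of unseen tokens is an assumption.

import math

class SketchNaiveBayes:
    def __init__(self, b=1.0):
        self.b = b                                # weight given to the prior

    def train(self, examples):
        """examples is a list of (feature_dict, label) pairs."""
        self.prior_spam = sum(label == 'spam' for _, label in examples) / len(examples)
        self.token_count = {}                     # k: occurrences of token t
        self.spam_count = {}                      # occurrences of t in spam samples
        for features, label in examples:
            for token in features:
                self.token_count[token] = self.token_count.get(token, 0) + 1
                if label == 'spam':
                    self.spam_count[token] = self.spam_count.get(token, 0) + 1

    def smoothed(self, token):
        """P-hat(spam | t) = (b*Pr(spam) + k*Pr(spam | t)) / (b + k)."""
        k = self.token_count.get(token, 0)
        if k == 0:
            return self.prior_spam                # one choice for unseen tokens
        pr_spam_given_t = self.spam_count.get(token, 0) / k
        return (self.b * self.prior_spam + k * pr_spam_given_t) / (self.b + k)

    def classify(self, sample):
        """Compare the product of P-hat(spam|t) with the product of 1 - P-hat(spam|t),
        using logarithms to avoid numerical underflow."""
        log_spam = log_ham = 0.0
        for token in sample:
            p = min(max(self.smoothed(token), 1e-9), 1 - 1e-9)
            log_spam += math.log(p)
            log_ham += math.log(1 - p)
        return 'spam' if log_spam > log_ham else 'ham'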
We have stubbed a NaiveBayesianClassifier class within
classifiers.py for this goal. It can be chosen using
the commandline option
Decision Tree Classifier
You are to implement a binary decision tree based on the feature
space, using the greedy rule described in our text in which the
root of the tree is chosen by which feature produces a split
that maximizes the information gain (or equivalently, which
minimizes the remaining entropy of the samples).
Although the feature space for this application is huge, our training set has only about 5000 samples, so the number of nodes of the tree will be bounded by that number, and hopefully smaller if it finds large clusters of clearly designated samples. In the interest of pruning, both for efficiency and to avoid possible overfitting, you are to support two additional rules for stopping the growth of the tree:
If a subtree has at most m training examples, leave that subtree as a leaf; classify samples consistent with the majority label of those training examples.
If at least a fixed fraction u of the training examples at a certain node of the tree share the same classification, then leave that node as a leaf with that classification. For example, if u=0.75, then we would stop splitting the tree if at least 3 out of every 4 samples at the node had a common classification. (Although we will probably want to use values much closer to 1.)
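The greedy splitting rule and these two stopping tests can be sketched with helper functions such as the following (hypothetical names; your DecisionTreeClassifier must still fit the provided stub):

import math

def entropy(examples):
    """Entropy of the ham/spam labels in a list of (feature_dict, label) pairs."""
    n = len(examples)
    if n == 0:
        return 0.0
    spam = sum(label == 'spam' for _, label in examples)
    result = 0.0
    for count in (spam, n - spam):
        if count:
            p = count / n
            result -= p * math.log2(p)
    return result

def information_gain(examples, token):
    """Gain from splitting on whether a sample's feature space contains the token."""
    with_token = [ex for ex in examples if token in ex[0]]
    without_token = [ex for ex in examples if token not in ex[0]]
    n = len(examples)
    remainder = (len(with_token) / n) * entropy(with_token) \
              + (len(without_token) / n) * entropy(without_token)
    return entropy(examples) - remainder

def should_stop(examples, m, u):
    """Pruning rules: at most m examples, or a fraction of at least u share one label."""
    spam = sum(label == 'spam' for _, label in examples)
    majority = max(spam, len(examples) - spam)
    return len(examples) <= m or majority / len(examples) >= u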
We have stubbed a DecisionTreeClassifier class within
classifiers.py for this goal. It can be chosen using
the commandline option
Note: Constructing a decision tree for the full training set may be time-consuming. We recommend that you begin with the smaller version we provide (or even make a smaller one that you can trace manually).
For those interested, here is a typical decision tree that results from our data set, described using a notation similar to our book and the c4.5 software.
Perform experiments with your implementations in an attempt to tune the parameters to get the greatest success rate in classification. You should be able to reach 96% or 97% with the techniques described above, and perhaps better with more effort in tokenizing and building the feature space.
You are to submit a revised version of files tokenize.py and classification.py. Furthermore, you are to submit a README file that describes your experimental results with each of the algorithms and the various tuning parameters.
Submit your files electronically via the course website (details on the submission process).
This project will be worth a total of 50 points. We will devote 15 points to each of the primary classification methods (nearest neighbor, naive Bayesian, decision tree), and the 5 remaining points for the improved tokenization. The quality of the required README file will impact the evaluation of all components of the software.
Up to 5 points of extra credit will be awarded for those who go above and beyond the basic requirements in experimenting with the design and tuning of algorithms (especially if it goes to improve the overall success rate). Please make sure to explain any such extraordinary efforts in the README.