
Saint Louis University

Computer Science 362
Artificial Intelligence

Michael Goldwasser

Fall 2013

Dept. of Math & Computer Science

Assignment 06

Text Message Spam Filtering

Due: Monday, 9 December 2013, 11:59pm


Contents: Collaboration Policy, Overview, Experiment Pipeline, Necessary Software and Data, Your Tasks, Experiments, Submitting Your Assignment, Grading Standards, Extra Credit


Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.

Please make sure you adhere to the policies on academic integrity in this regard.


Overview

In this assignment, we will be experimenting with several classification algorithms in the context of identifying unwanted text messages. Our data set, provided as part of the UCI Machine Learning Repository, is a series of over 5000 text messages that have been previously labeled as either legitimate ("ham") or unwanted ("spam"). In particular, there are 4518 ham messages and 653 spam messages, taken primarily from users in the UK and Singapore.

We will perform experiments in Python using 10-fold cross-validation; a round of that process involves randomly dividing the input into 10 stripes and then using each of those 10 stripes in turn as an isolated test set for a classifier trained on the other nine stripes.
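
Concretely, one simple way to form the 10 stripes is a random shuffle followed by a round-robin split. The sketch below is only illustrative (the provided driver may organize this differently, and the name ten_stripes is a placeholder):

    import random

    def ten_stripes(samples):
        """Randomly partition the samples into 10 stripes of nearly equal size."""
        shuffled = list(samples)
        random.shuffle(shuffled)
        return [shuffled[i::10] for i in range(10)]

    # Each stripe then serves once as the test set, with the other nine
    # stripes combined to form the training set for that round.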


Experiment Pipeline

The steps of our experiment are as follows:

  1. The raw text messages and their labels as ham or spam are read from an input file.

  2. The messages may optionally be converted to lowercase.

  3. Each message string is tokenized, resulting in a list of string tokens. By default, the string is tokenized by breaking it apart using whitespace, but this process can be customized.

  4. Each list of tokens is converted to a feature space, which we represent using a dictionary that maps from feature name to feature value. By default, our feature space maps each string that appears as a token to the value 1 (even in cases where a token appears more than once). Strings that do not occur as tokens do not appear explicitly in the dictionary, although we consider them as implicitly mapped to the value None. The conversion from a list of tokens to a feature space can be customized (for example, looking at broader properties such as the number of uppercase letters or punctuation symbols). A sketch of the default conversion appears just after this list.

  5. The 10-fold cross-validation process is performed. During each phase of this process, a classifier is trained on nine of the stripes and then used to predict a label for each sample in the remaining stripe; those predictions are compared to the known labels so that an overall success rate can be tabulated across all ten phases.
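
In their default form, steps 2 through 4 amount to roughly the following. This is only a sketch for orientation, not the code distributed with the assignment, and the function names are placeholders:

    def tokenize(message, lowercase=False):
        """Default tokenization: optionally lowercase, then split on whitespace."""
        if lowercase:
            message = message.lower()
        return message.split()

    def extract_features(tokens):
        """Map each string that occurs as a token to the value 1; strings that
        never occur are simply absent from the dictionary (implicitly None)."""
        features = {}
        for token in tokens:
            features[token] = 1
        return features

    # extract_features(tokenize("are we still on for lunch today?"))
    #   -> {'are': 1, 'we': 1, 'still': 1, 'on': 1, 'for': 1, 'lunch': 1, 'today?': 1}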


Necessary Software and Data

The necessary software and data files are found within turing:/Public/goldwasser/362/classification/, or can be downloaded as the attached zip file. It consists of six files:


Your Tasks

You are to complete each of the following tasks. We list them in order of increasing sophistication (and accuracy).
  1. k-Nearest Neighbor Classifier
    Implement an algorithm that classifies a sample based on the majority label among its k nearest neighbors in Euclidean space, where there is a dimension for every token appearing in the training set and a sample has value 1 in a dimension if it contains that token and value 0 if it does not.

    We have stubbed a NearestNeighborClassifer class within classifiers.py for this purpose. It can be chosen using the commandline option -c neighbor, and the value of k can be selected with an option such as -k 3. From within your code, that value can be accessed as attribute self.options.knn.
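
    The following sketch illustrates the intended classification rule, assuming the training data is available as a list of (features, label) pairs; the names are placeholders and do not match the structure of the stubbed class:

        import math
        from collections import Counter

        def euclidean(a, b):
            """Distance between two binary feature dictionaries: a token contributes
            1 to the squared distance when it appears in exactly one of the two."""
            return math.sqrt(len(set(a) ^ set(b)))

        def knn_classify(training, sample, k):
            """training: list of (features, label) pairs; sample: a features dict."""
            nearest = sorted(training, key=lambda pair: euclidean(pair[0], sample))[:k]
            votes = Counter(label for _, label in nearest)
            return votes.most_common(1)[0][0]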


  2. Alternate Tokenization
    By default, we use Python's split method to break a string using whitespace as a delimiter. However, on a message such as
    I was going to suggest a definite move out--if i'm still there-- after greece.
    this rule leaves as tokens "out--if" and "there--" and "greece.".

    We can get better results out of classification algorithms if we instead break the message into tokens using primarily alphabetic characters, so that we separate tokens "out" and "if" in the above example, or the simpler "there" rather than "there--", and "greece" rather than "greece.". However, we might want to special case the use of an apostrophe, allowing "i'm" to remain as a token. (Even more care may be needed if we wish to keep emoticons such as ":-P", which will otherwise become token "P" by this rule.)

    You must implement one such nontrivial tokenizer that is based on using tokens that are primarily alphabetic characters.
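
    One possible tokenizer in this spirit is sketched below; the exact rules (including whether to preserve emoticons) are left to you, and this regular expression is only one illustrative choice:

        import re

        # Runs of letters, optionally joined by an internal apostrophe, so that
        # "i'm" survives while "out--if" becomes "out" and "if".
        WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

        def alpha_tokenize(message):
            return WORD.findall(message)

        # alpha_tokenize("move out--if i'm still there-- after greece.")
        #   -> ['move', 'out', 'if', "i'm", 'still', 'there', 'after', 'greece']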


  3. Naive Bayesian Classifier
    You are to implement the following version of naive Bayesian classification. If we assume that the tokens in a message are independent, a probability such as

    Pr(spam | t1, t2, ..., tn)
    for tokens t1, t2, ..., tn becomes equivalent to
    Pr(spam | t1) * Pr(spam | t2) * ... * Pr(spam | tn)
    We can estimate a term such as Pr(spam | t1) empirically by looking at all messages in the training set containing token t1 and how many of those were known to be spam vs. ham. A newly examined sample can be classified by comparing which of Pr(spam | t1, t2, ..., tn) or Pr(ham | t1, t2, ..., tn) is greater.

    However, there is a potential problem with infrequent tokens. For example, if a token t occurs only once in the training set, in a sample that is spam, does that mean Pr(spam | t) = 1, and that any newly observed message containing t is spam (no matter what other content it contains)? To balance the impact of infrequent tokens (while not completely ignoring them), you are to implement a "corrected conditional probability" based on a balancing parameter b, as follows:

    P̂(spam | t) = (b * Pr(spam) + k * Pr(spam | t)) / (b + k)
    where Pr(spam) is the a priori probability of a sample being spam, and k is the number of occurrences of token t in the training set. Notice that if b is zero, then this expression is precisely Pr(spam | t). But by choosing a positive value of b, we balance between the a priori probability and that estimated empirically.

    In the special case that both b and k are zero, you may decide how you want such a token to influence your classification.

    We have stubbed a NaiveBayesianClassifier class within classifiers.py for this goal. It can be chosen using the commandline option -c bayes and the value of b can be selected with an option such as -b 0.5. From within your code, that value can be accessed as attribute self.options.bayesStrength.
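
    The sketch below shows one way the corrected probability and the resulting comparison might fit together, assuming per-token counts have already been tabulated from the training set; the names are placeholders rather than the stubbed class:

        def corrected(prior_spam, spam_with_t, k, b):
            """P-hat(spam | t) = (b*Pr(spam) + k*Pr(spam | t)) / (b + k), where k
            messages in the training set contain token t, spam_with_t of them spam."""
            if b + k == 0:
                return prior_spam        # one reasonable choice for the b = k = 0 case
            empirical = spam_with_t / float(k) if k else 0.0
            return (b * prior_spam + k * empirical) / (b + k)

        # For example, with b = 1 and a priori Pr(spam) = 653/5171 (about 0.126),
        # a token seen in a single spam message (k = 1, Pr(spam|t) = 1) yields
        # roughly (0.126 + 1) / 2, or about 0.56, rather than 1.

        def classify(tokens, prior_spam, spam_count, total_count, b):
            """Compare the product of corrected spam probabilities to the product for
            ham; a token's corrected ham probability is the complement of its
            corrected spam probability."""
            p_spam = p_ham = 1.0
            for t in set(tokens):
                p = corrected(prior_spam, spam_count.get(t, 0), total_count.get(t, 0), b)
                p_spam *= p
                p_ham *= 1.0 - p
            return 'spam' if p_spam > p_ham else 'ham'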


  4. Decision Tree Classifier
    You are to implement a binary decision tree based on the feature space, using the greedy rule described in our text in which the root of the tree is chosen by which feature produces a split that maximizes the information gain (or equivalently, which minimizes the remaining entropy of the samples).

    Although the feature space for this application is huge, our training set has only 5000 samples, so the number of nodes of the tree will be bounded by that number, and hopefully smaller if it finds large clusters of clearly designated samples. In the interest of efficiency and to avoid possible overfitting, you are to support two additional rules for pruning the tree: a node containing fewer than m samples becomes a leaf (labeled with the majority label among its samples), and a node in which at least a fraction u of the samples already share the same label likewise becomes a leaf.

    We have stubbed a DecisionTreeClassifier class within classifiers.py for this goal. It can be chosen using the commandline option -c tree. The value of m described above can be selected with an option such as -m 5 and accessed from within the program as attribute self.options.treeThreshold; the value of u can be selected with an option such as -u 0.995 and accessed as attribute self.options.treeUniformity.

    Note: Constructing a decision tree for the full training set may be time-consuming. We recommend that you begin with the smaller version we provide (or even make a smaller one that you can trace manually).

    For those interested, here is a typical decision tree that results from our data set, described using a notation similar to our book and the c4.5 software.
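
    As a starting point, the greedy choice of a splitting feature might look something like the sketch below, which operates on the dictionary-based feature space; the recursive construction of subtrees and the stopping tests based on m and u are omitted, and the names are placeholders:

        import math

        def entropy(labels):
            """Entropy (in bits) of a list of 'spam'/'ham' labels."""
            if not labels:
                return 0.0
            n = float(len(labels))
            result = 0.0
            for c in ('spam', 'ham'):
                p = labels.count(c) / n
                if p > 0:
                    result -= p * math.log(p, 2)
            return result

        def best_feature(samples, labels):
            """Choose the token whose presence/absence split leaves the least
            weighted entropy (equivalently, maximizes the information gain)."""
            n = float(len(samples))
            best, best_remaining = None, float('inf')
            for f in set(f for s in samples for f in s):
                with_f = [lab for s, lab in zip(samples, labels) if f in s]
                without = [lab for s, lab in zip(samples, labels) if f not in s]
                remaining = (len(with_f) / n) * entropy(with_f) \
                            + (len(without) / n) * entropy(without)
                if remaining < best_remaining:
                    best, best_remaining = f, remaining
            return best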


Experiments

Perform experiments with your implementations in an attempt to tune the parameters to get the greatest success rate in classification. You should be able to reach 96% or 97% with the techniques described above, and perhaps better with more effort in tokenizing and building the feature space.


Submitting Your Assignment

You are to submit a revised version of files tokenize.py and classification.py. Furthermore, you are to submit a README file that describes your experimental results with each of the algorithms and the various tuning parameters.

Submit your files electronically via the course website (details on the submission process).


Grading Standards

This project will be worth a total of 50 points. We will devote 15 points to each of the primary classification methods (nearest neighbor, naive Bayesian, decision tree), and the 5 remaining points for the improved tokenization. The quality of the required README file will impact the evaluation of all components of the software.


Extra Credit

Up to 5 points of extra credit will be awarded to those who go above and beyond the basic requirements in experimenting with the design and tuning of the algorithms (especially if it improves the overall success rate). Please make sure to explain any such extraordinary efforts in your README.


Michael Goldwasser
CSCI 362, Fall 2013
Last modified: Monday, 09 December 2013