Saint Louis University
Computer Science 362
Dept. of Math & Computer Science
For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.
Please make sure you adhere to the policies on academic integrity in this regard.
In this assignment, we will be experimenting with several classification algorithms in the context of identifying unwanted text messages. Our data set, provided as part of the UCI Machine Learning Repository, is a series of over 5000 text messages that have been previously labeled as either legitimate ("ham") or unwanted ("spam"). In particular, there are 4518 ham messages and 653 spam messages, taken primarily from users in the UK and Singapore.
We will perform experiments in Python using 10-fold cross validation; a round of that process involves randomly dividing the input into 10 stripes, and then using each of those 10 stripes as an isolated test set on a classifier trained with the other nine stripes.
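As a rough illustration of that partitioning (this is not the provided driver script; the names ten_fold_indices and samples are hypothetical), one round might be organized as follows:

import random

def ten_fold_indices(n, seed=0):
    """Randomly partition the indices 0..n-1 into 10 stripes of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[k::10] for k in range(10)]

# Hypothetical usage: each stripe serves once as the isolated test set.
# samples = [...]                      # list of (message, label) pairs
# for stripe in ten_fold_indices(len(samples)):
#     test = [samples[i] for i in stripe]
#     train = [s for i, s in enumerate(samples) if i not in set(stripe)]
#     ... train a classifier on train, then score it on test ...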
The steps of our experiment are as follows:
The raw text messages and their labels as ham or spam are read from an input file.
The messages may optionally be converted to lowercase.
Each message string is tokenized, resulting in a list of string tokens. By default, the string is tokenized by breaking it apart using whitespace, but this process can be customized.
Each list of tokens is converted to a feature space, which we represent using a dictionary that maps from feature name to feature value. By default, our feature space maps each string that appears as a token to the value 1 (even in cases where a token appears more than once). Strings that do not occur as tokens do not appear explicitly in the dictionary, although we consider them as implicitly mapped to the value None. The conversion from a list of tokens to a feature space can be customized (for example, looking at broader properties such as the number of uppercase letters or punctuation symbols).
A classifier is trained on 90% of the original samples. Specifically, for each sample the classifier is given the feature space dictionary and the ham vs. spam designation.
For each sample in the 10% designated as the test set, the classifier is presented with its feature space dictionary and must return a ham/spam prediction.
The necessary software and data files are found within turing:/Public/goldwasser/362/classification/, or can be downloaded as the attached zip file. It consists of six files:
tokenize.py
This file contains definitions for the process of converting a
message string to a list of tokens.
For example, the string
Its ok, if anybody asks abt me, u tel them..:-P
might be split on whitespace to form the tokens:
["Its", "ok,", "if", "anybody", "asks", "abt", "me,", "u", "tel", "them..:-P"]
Or we might downcase and strip away any form of punctuation while splitting, to get the tokens:
["its", "ok", "if", "anybody", "asks", "abt", "me", "u", "tel", "them", "p" ]
feature.py
This file contains definitions for converting a list of tokens
to a feature space.
For example, if given the following tokens,
["Its", "ok,", "if", "anybody", "asks", "abt", "me,", "u", "tel", "them..:-P" ]we might represent a feature space as the map:
{ "Its":1, "ok,":1, "if":1, "anybody":1, "asks":1, "abt":1, "me,":1, "u":1, "tel":1, "them..:-P":1 }
classification.py
This file contains definitions for a variety of classification
algorithms. To coordinate the storage of information, all
algorithms are represented as subclasses of a Classifier
abstract base class. This provides a constructor that records
command line options and a random number generator with known
seed, and it provides an evaluate method that automates
the training and testing.
What remains for each subclass is simply to override the methods train(examples) and classify(sample).
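In rough outline, the relationship between the base class and a subclass looks something like the sketch below; the real classification.py also manages the command line options, the seeded random number generator, and the evaluate method, so this is only a simplified stand-in.

class Classifier:
    """Simplified stand-in for the provided abstract base class."""
    def train(self, examples):
        """examples is a list of (feature_dict, label) pairs."""
        raise NotImplementedError

    def classify(self, sample):
        """sample is a feature_dict; return the predicted 'ham' or 'spam'."""
        raise NotImplementedError

class MajorityClassifier(Classifier):
    """Toy subclass: always predicts the label seen most often in training."""
    def train(self, examples):
        labels = [label for _, label in examples]
        self.prediction = max(set(labels), key=labels.count)

    def classify(self, sample):
        return self.prediction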
SMS_Spam.dat
This is a data file containing the full set of 5171 labeled messages.
It is the default input file, and it is parsed by the
spamalot.py script that we provide.
SMS_Spam_mini.dat
This is a smaller data set composed of 1000 randomly selected
messages from the full set. We provide it to aid in development,
as some of the algorithms will take significant training time on
the full dataset.
k-Nearest Neighbor Classifier
Implement an algorithm that classifies a sample based on the
majority of labels among its k nearest neighbors in Euclidean
space, with a dimension for every token in the training set;
a sample has value 1 in a dimension if it contains the
corresponding token and value 0 if it does not.
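Since every feature value is 1 or 0, the squared Euclidean distance between two samples is simply the number of tokens that appear in exactly one of the two feature spaces (ignoring, for simplicity, the restriction of dimensions to tokens seen in training). A sketch of the computation, not the required implementation, follows:

from collections import Counter

def squared_distance(features_a, features_b):
    """Number of tokens present in exactly one of the two feature dictionaries."""
    return len(set(features_a) ^ set(features_b))

def knn_predict(train_examples, sample, k):
    """Majority label among the k training examples nearest to the sample.
    train_examples is a list of (feature_dict, label) pairs."""
    nearest = sorted(train_examples, key=lambda ex: squared_distance(ex[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]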
We have stubbed a NearestNeighborClassifier class within
classifiers.py for this purpose. It can be chosen using
the commandline option
Improved Tokenization
By default, a message is tokenized simply by splitting on whitespace. For example, for the message
I was going to suggest a definite move out--if i'm still there-- after greece.
this rule leaves as tokens "out--if" and "there--" and "greece.".
We can get better results out of classification algorithms if we instead break the message into tokens using primarily alphabetic characters, so that we separate tokens "out" and "if" in the above example, or the simpler "there" rather than "there--", and "greece" rather than "greece.". However, we might want to special case the use of an apostrophe, allowing "i'm" to remain as a token. (Even more care may be needed if we wish to keep emoticons such as ":-P", which will otherwise become token "P" by this rule.)
You must implement one such nontrivial tokenizer, based on tokens that consist primarily of alphabetic characters.
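One possible starting point (a sketch only; the exact pattern and any special handling of emoticons is up to you) keeps runs of letters with an optional internal apostrophe:

import re

def alpha_tokens(message):
    """Keep runs of alphabetic characters, allowing one internal apostrophe
    so that contractions such as "i'm" survive as single tokens."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", message.lower())

# alpha_tokens("I was going to suggest a definite move out--if i'm still there-- after greece.")
#   -> ['i', 'was', 'going', 'to', 'suggest', 'a', 'definite', 'move',
#       'out', 'if', "i'm", 'still', 'there', 'after', 'greece']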
Naive Bayesian Classifier
You are to implement the following version of naive Bayesian
classification. When we assume independence of the tokens in a
message, a probability such as

    Pr(spam | t1, t2, ..., tn)

for tokens t1, t2, ..., tn becomes equivalent to

    Pr(spam | t1) * Pr(spam | t2) * ... * Pr(spam | tn)

We can estimate a term such as Pr(spam | t) from the training set
as the fraction of the samples containing token t that are labeled
as spam.
However, there is a potential problem with infrequent
tokens. For example, if a token t occurs only once in the training
set, in a sample that is spam, does that mean we should treat
Pr(spam | t) as 1? To temper the influence of such rare tokens, we
instead use the smoothed estimate

    P̂(spam | t) = (b * Pr(spam) + k * Pr(spam | t)) / (b + k)

where Pr(spam) is the a priori probability of a sample being spam,
k is the number of occurrences of token t in the training set, and
b is a tunable weight given to that a priori probability. Notice
that if b is zero, then this expression is precisely Pr(spam | t).
In the special case that both b and k are zero, you can decide how
you want to let such a token influence your classification.
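The following sketch shows one way to organize the training counts and the smoothed estimate. It is not the required NaiveBayesianClassifier; comparing a spam score against a ham score is only one reasonable classification rule, and the handling of unseen tokens is an assumption.

import math

class SketchNaiveBayes:
    def __init__(self, b=1.0):
        self.b = b                                # weight given to the prior

    def train(self, examples):
        """examples is a list of (feature_dict, label) pairs."""
        self.prior_spam = sum(label == 'spam' for _, label in examples) / len(examples)
        self.token_count = {}                     # k: occurrences of token t
        self.spam_count = {}                      # occurrences of t in spam samples
        for features, label in examples:
            for token in features:
                self.token_count[token] = self.token_count.get(token, 0) + 1
                if label == 'spam':
                    self.spam_count[token] = self.spam_count.get(token, 0) + 1

    def smoothed(self, token):
        """P-hat(spam | t) = (b*Pr(spam) + k*Pr(spam | t)) / (b + k)."""
        k = self.token_count.get(token, 0)
        if k == 0:
            return self.prior_spam                # one choice for unseen tokens
        pr_spam_given_t = self.spam_count.get(token, 0) / k
        return (self.b * self.prior_spam + k * pr_spam_given_t) / (self.b + k)

    def classify(self, sample):
        """Compare the product of P-hat(spam|t) with the product of 1 - P-hat(spam|t),
        using logarithms to avoid numerical underflow."""
        log_spam = log_ham = 0.0
        for token in sample:
            p = min(max(self.smoothed(token), 1e-9), 1 - 1e-9)
            log_spam += math.log(p)
            log_ham += math.log(1 - p)
        return 'spam' if log_spam > log_ham else 'ham'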
We have stubbed a NaiveBayesianClassifier class within
classifiers.py for this goal. It can be chosen using
the commandline option
Decision Tree Classifier
You are to implement a binary decision tree based on the feature
space, using the greedy rule described in our text in which the
root of the tree is chosen by which feature produces a split
that maximizes the information gain (or equivalently, which
minimizes the remaining entropy of the samples).
Although the feature space for this application is huge, our training set has only about 5000 samples, so the number of nodes of the tree will be bounded by that number, and hopefully smaller if it finds large clusters of clearly designated samples. In the interest of pruning, both for efficiency and to avoid possible overfitting, you are to support two additional rules for stopping the growth of the tree:
If a subtree has at most m training examples, leave that subtree as a leaf; classify samples consistent with the majority label of those training examples.
If at least a fixed fraction u of the training examples at a certain node of the tree share the same classification, then leave that node as a leaf with that classification. For example, if u=0.75, then we would stop splitting the tree if at least 3 out of every 4 samples at the node had a common classification. (Although we will probably want to use values much closer to 1.)
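The greedy splitting rule and these two stopping tests can be sketched with helper functions such as the following (hypothetical names; your DecisionTreeClassifier must still fit the provided stub):

import math

def entropy(examples):
    """Entropy of the ham/spam labels in a list of (feature_dict, label) pairs."""
    n = len(examples)
    if n == 0:
        return 0.0
    spam = sum(label == 'spam' for _, label in examples)
    result = 0.0
    for count in (spam, n - spam):
        if count:
            p = count / n
            result -= p * math.log2(p)
    return result

def information_gain(examples, token):
    """Gain from splitting on whether a sample's feature space contains the token."""
    with_token = [ex for ex in examples if token in ex[0]]
    without_token = [ex for ex in examples if token not in ex[0]]
    n = len(examples)
    remainder = (len(with_token) / n) * entropy(with_token) \
              + (len(without_token) / n) * entropy(without_token)
    return entropy(examples) - remainder

def should_stop(examples, m, u):
    """Pruning rules: at most m examples, or a fraction of at least u share one label."""
    spam = sum(label == 'spam' for _, label in examples)
    majority = max(spam, len(examples) - spam)
    return len(examples) <= m or majority / len(examples) >= u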
We have stubbed a DecisionTreeClassifier class within
classifiers.py for this goal. It can be chosen using
the commandline option
Note: Constructing a decision tree for the full training set may be time-consuming. We recommend that you begin with the smaller version we provide (or even make a smaller one that you can trace manually).
For those interested, here is a typical decision tree that results from our data set, described using a notation similar to our book and the c4.5 software.
Perform experiments with your implementations in an attempt to tune the parameters to get the greatest success rate in classification. You should be able to reach 96% or 97% with the techniques described above, and perhaps better with more effort in tokenizing and building the feature space.
You are to submit a revised version of files tokenize.py and classification.py. Furthermore, you are to submit a README file that describes your experimental results with each of the algorithms and the various tuning parameters.
Submit your files electronically via the course website (details on the submission process).
This project will be worth a total of 50 points. We will devote 15 points to each of the primary classification methods (nearest neighbor, naive Bayesian, decision tree), and the 5 remaining points for the improved tokenization. The quality of the required README file will impact the evaluation of all components of the software.
Up to 5 points of extra credit will be awarded for those who go above and beyond the basic requirements in experimenting with the design and tuning of algorithms (especially if it goes to improve the overall success rate). Please make sure to explain any such extraordinary efforts in the README.