
Saint Louis University

Computer Science 150
Introduction to Object-Oriented Programming

Michael Goldwasser

Spring 2013

Dept. of Math & Computer Science

Programming Assignment 09

Twitter Trends

Due: 11:59pm, Thursday, 25 April 2013 (extended from the original deadline of 11:59pm, Tuesday, 23 April 2013)


Contents:

  Collaboration Policy
  Overview
  Files We Are Providing
  Your Task
  Submitting Your Assignment
  Grading Standards
  Extra Credit
  Acknowledgment

Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.

Please make sure you adhere to the policies on academic integrity in this regard.


Overview

In this assignment, you will be developing a geographic visualization of twitter data from across the USA. As an example, consider the following map which portrays how people across the country feel about Justin Bieber (based upon an analysis of their tweets). States that are red have the most positive view, while states that are dark blue have the most negative view; yellow represents a more neutral view, while states in gray have insufficient data.

The methodology used to produce this diagram is as follows:

  1. We filter through all tweets in our data set, pulling aside those containing the word 'bieber'.

  2. Each tweet comes with geographic metadata in the form of a latitude and longitude, which can be used to associate the tweet with a specific state.

  3. We analyze the "sentiment" of a tweet by breaking the message into individual words, and then looking up each word in a prescribed dictionary that maps certain words to floating-point values in the range [-1, +1]. A sentiment of -1.0 is the most negative, while a sentiment of +1.0 is the most positive. We will rely upon an existing classification of about 22,000 words with sentiment values.

    If a word of the tweet is not found in the sentiment dictionary, it is ignored. The overall sentiment of the tweet is the average of the sentiment scores that are found. If no sentiment scores are found for any of the words of the tweet, the tweet is ignored entirely. (A sketch of this computation appears after this list.)

  4. The overall sentiment of a state is computed as the average sentiment score for all tweets that are associated with that state (ignoring those tweets that did not have a sentiment score). The state's sentiment score is then mapped to a color between blue (negative) and red (positive) using a prescribed color gradient.
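As a concrete illustration of steps 3 and 4, the following sketch shows how the sentiment of a single tweet might be computed. The function name tweet_sentiment and the parameter name sentiments are used here only for illustration; they are not necessarily the names used in the provided framework.

def tweet_sentiment(text, sentiments):
    """Return the average sentiment of the words in text.

    sentiments is assumed to be a dict mapping words to floats in [-1, +1].
    Returns None if no word of the tweet appears in the dictionary.
    """
    scores = [sentiments[word] for word in text.lower().split()
              if word in sentiments]
    if not scores:
        return None                       # the tweet is ignored entirely
    return sum(scores) / len(scores)      # average of the scores that were found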


Files We Are Providing

We are providing you with much of the necessary framework for this project; your task (described below) will be to implement the main workflow for analyzing a query. The source code and various data files are being distributed as follows. If you are working on turing, you can get a copy of the distribution by typing the following command from your account:

cp -R /Public/goldwasser/150/trends .

For those working on your own machine, we have packaged the project as trends.zip. We note that this distribution includes data files for the example queries: bieber, bacon, dog, and cat for experimentation, but it does not include the larger data file all_tweets.txt that can be used to evaluate trends for other search terms. That file is 193 MB in size. If you wish, it can be downloaded here: all_tweets.txt

This project uses many files, but all of your code will be placed in the remainder of the file trends.py. However, you will need to be aware of the following:


Your Task

All of your code should go in the file trends.py. We have started you out by importing the necessary modules, loading the sentiments dictionary, the states list, the tweets list, and a Country instance named usa. You must implement the methodology described in the original overview of this project.

Our portion of the code already takes care of filtering through all tweets to find only those that have a given query term (e.g., bieber). Note that if you do not have the full all_tweets.txt file on your system, you will only be able to run the program for one of the four predefined queries (bieber, bacon, cat, dog). If you run Python from a command prompt, you can indicate the query word(s) as a command-line argument, as in:

python trends.py bieber

If you are running the program from within IDLE, you will have to enter the query when prompted after the program begins.

Your primary tasks will be the following:

You should feel free to define any additional functions within the trends.py file that help you organize your code in a more clear and modular fashion.
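To make the overall flow of trends.py concrete, here is one rough sketch of the per-state aggregation described in the overview. It reuses the hypothetical tweet_sentiment function sketched earlier, and the helper names find_state and shade_state, as well as the tweet attributes text, latitude, and longitude, are assumptions made only for illustration; consult the distributed code for the actual interfaces of the Tweet and Country classes.

from collections import defaultdict

def analyze(tweets, sentiments, usa):
    """Rough sketch: average the tweet sentiments for each state and color the map.

    find_state and shade_state are hypothetical helpers standing in for
    whatever the provided framework actually offers.
    """
    state_scores = defaultdict(list)          # state -> list of tweet sentiments
    for tweet in tweets:                      # tweets already filtered by the query term
        score = tweet_sentiment(tweet.text, sentiments)
        if score is None:
            continue                          # no sentiment-bearing words; skip the tweet
        state = find_state(tweet.latitude, tweet.longitude, usa)   # hypothetical lookup
        state_scores[state].append(score)
    for state, scores in state_scores.items():
        shade_state(state, sum(scores) / len(scores))   # hypothetical gradient coloring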


Submitting Your Assignment

All of your new code should be placed in the trends.py file. That file should be submitted electronically. You should also submit a separate 'readme' text file, as outlined in the general webpage on programming assignments.

Please see details regarding the submission process from the general programming web page, as well as a discussion of the late policy.


Grading Standards

The assignment is worth 10 points.


Extra Credit

The original data set that we are using is based on a snapshot of 1.7 million tweets recorded during a one-week period from August 28th through September 3rd of 2011.

For extra credit, instead of producing a single image for the weekly sentiment, produce a series of daily images that are displayed either with a timed delay from day to day, or by waiting for the user to indicate when s/he is ready to see the next image.

To accomplish this task, note that each Tweet instance has a timestamp, which is represented as an instance of Python's standard datetime class. You may rely on the fact that the series of tweets are loaded in chronological order.
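For instance, one way to form the daily batches, assuming each Tweet exposes its datetime through an attribute named timestamp as described above, is sketched below; process_day is a placeholder for whatever per-day analysis and drawing you implement.

from collections import OrderedDict
import time

daily = OrderedDict()                         # date -> list of tweets from that day
for tweet in tweets:                          # tweets are loaded in chronological order
    daily.setdefault(tweet.timestamp.date(), []).append(tweet)

for day, batch in daily.items():
    process_day(batch)                        # placeholder: analyze and draw this day's map
    time.sleep(2)                             # or instead wait for the user before continuing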


Acknowledgment

This project is a variant of one developed by Aditi Muralidharan, John DeNero, and Hamilton Nguyen, as described at http://nifty.stanford.edu/2013/denero-muralidharan-trends/.


Michael Goldwasser
Last modified: Sunday, 14 April 2013