Saint Louis University | Computer Science 150 | Dept. of Math & Computer Science
For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.
Please make sure you adhere to the policies on academic integrity in this regard.
In this assignment, you will be developing a geographic visualization of Twitter data from across the USA. As an example, consider the following map, which portrays how people across the country feel about Justin Bieber (based upon an analysis of their tweets). States that are red have the most positive view, while states that are dark blue have the most negative view; yellow represents a more neutral view, and states in gray have insufficient data.
The methodology used to produce this diagram is as follows:
We filter through all tweets in our data set, pulling aside those containing the word 'bieber'.
Each tweet comes with geographic metadata in the form of latitude and longitude, which can be used to associate the tweet with a specific state.
We analyze the "sentiment" of a tweet by breaking the message into individual words, and then looking up each word in a prescribed dictionary that maps certain words to floating-point values in the range [-1, +1]. A sentiment of -1.0 is the most negative, while a sentiment of +1.0 is the most positive. We will rely upon an existing classification of about 22000 words with sentiment values. For example:
If a word of the tweet is not found in the sentiment dictionary, it is ignored. The overall sentiment of the tweet is the average of the sentiment scores that are found. If no sentiment scores are found for any of the words of the tweet, this tweet is ignored.
The overall sentiment of a state is computed as the average sentiment score for all tweets that are associated with that state (ignoring those tweets that did not have a sentiment score). The state's sentiment score is then mapped to a color between blue (negative) and red (positive) using a prescribed color gradient.
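To make the lookup-and-average rule concrete, here is a small illustration. The word scores below are invented for illustration only; they are not the values found in the actual sentiment data file.

sentiments = {'love': 0.6, 'awesome': 0.8, 'terrible': -0.7}   # illustrative scores only

words = ['i', 'love', 'my', 'terrible', 'haircut']
scores = [sentiments[w] for w in words if w in sentiments]     # finds 'love' and 'terrible'
if scores:
    tweet_sentiment = sum(scores) / len(scores)                # (0.6 + -0.7) / 2 = -0.05
else:
    tweet_sentiment = None                                     # no scored words: the tweet is ignored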
We are providing you with much of the necessary framework for this project; your task (described below) will be to implement the main workflow for analyzing a query. The source code and various data files are being distributed as follows. If you are working on turing, you can get a copy of the distribution by typing the following command from your account:
cp -R /Public/goldwasser/150/trends .
For those working on your own machine, we have packaged the project as trends.zip. We note that this distribution includes data files for the example queries (bieber, bacon, dog, and cat) for experimentation, but it does not include the larger data file all_tweets.txt that can be used to evaluate trends for other search terms. That file is 193 MB in size. If you wish, it can be downloaded here: all_tweets.txt
This project uses many files, but all of your code will be placed in the remainder of the file trends.py. However, you will need to be aware of the following:
A geo module defines a GeoPosition class to represent a geographic location in terms of latitude and longitude. There will be such a position associated with each tweet, and the state's geographic descriptions are based on these as well.
The most important issue for you to remember is that geographic positions are spherical (even if we project them to two dimensions for drawing a map). In particular, this means that when computing the distance between two positions, you cannot rely on the familiar planar formula. Instead, the GeoPosition class has a distance method that properly computes the shortest path between two geographic locations (based on the distance traveled along the great circle that connects them).
The class also provides methods latitude and longitude, to access the individual components.
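As a rough sketch of how the class might be used (the constructor argument order, latitude then longitude, is our assumption; check the geo module for the actual signature):

from geo import GeoPosition

saint_louis = GeoPosition(38.63, -90.20)    # assumed order: latitude, longitude
chicago = GeoPosition(41.88, -87.63)

d = saint_louis.distance(chicago)           # great-circle distance, not the planar formula
print(saint_louis.latitude(), saint_louis.longitude(), d)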
The tweet module provides a Tweet class, with an instance of that class representing a single Twitter message. The class supports accessor methods for the tweet's message, its geographic position, and its timestamp (see the tweet module for their exact names).
The state module defines a State class used to represent information about a state. Each state has a standard two-letter abbreviation (e.g., MO for Missouri), that is returned by the abbrev() method.
The boundaries of each state are defined with a series of geographic positions. These are used when creating the graphical representation of our visualization. (Fortunately for you, we will take care of that!) What you will need to know about is that the State class supports a method, centroid(), that returns a single GeoPosition that represents what is known as the centroid of the state. Informally, the centroid is an "average" of all positions in the state. We will use that single position as an approximation for the entire state when determining the state to which a tweet is closest.
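For example, a helper along the following lines could select the nearest state for a tweet's position. This is a sketch of our own (the name closest_state is not part of the provided framework); it assumes a GeoPosition argument and the distance, centroid, and abbrev methods described above.

def closest_state(position, states):
    """Return the abbreviation of the state whose centroid is nearest to position."""
    best_abbrev = None
    best_distance = None
    for state in states:
        d = position.distance(state.centroid())
        if best_distance is None or d < best_distance:
            best_distance = d
            best_abbrev = state.abbrev()
    return best_abbrev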
The us_states module contains the actual data needed for representing the United States. You will not need to examine this file; it will be used by other parts of the project.
The country module defines a Country class that handles the actual rendering of the states. An instance of this class will be very much like a specialized canvas in our cs1graphics framework. It supports two methods (see the country module for their signatures).
The colors module provides support for translating our numeric "sentiment" values into an appropriate color based on a fixed gradient suggested by Cynthia Brewer of Penn State University. In particular, the module defines a function with calling signature:
get_sentiment_color(sentimentValue)
This function returns an RGB triple of an appropriate color for the given numeric sentiment value. If None is sent as a parameter, it returns the color gray (which is different from the color indicated by a neutral sentiment of 0.0).
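A minimal usage sketch (the value 0.35 below is just an arbitrary sample):

from colors import get_sentiment_color

print(get_sentiment_color(0.35))   # RGB triple toward the positive (red) end of the gradient
print(get_sentiment_color(None))   # the gray used when a state has insufficient data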
The parse module handles most of the low-level interaction with the data files (so that you won't have to). In particular, it has functionality to load the original dictionary of sentiment scores, and to load and filter the tweet data based on a desired search term. You will not need to directly call any of these functions.
The data folder contains the raw data for sentiment scores and tweets.
The samples folder contains four examples of complete images for the respective terms: bacon, bieber, cat, and dog. The bieber image is the one shown at the beginning of this page; others can be viewed for bacon, cat, and dog.
All of your code should go in the file trends.py. We have started you out by importing the necessary modules, loading the sentiments dictionary, the states list, the tweets list, and a Country instance named usa. You must implement the methodology described in the original overview of this project.
Our portion of the code already takes care of filtering through all tweets to find only those that have a given query term (e.g., bieber). Actually, if you do not have the full all_tweets.txt file on your system, you will only be able to run the program for one of the four predefined queries (bieber, bacon, cat, dog). If you run Python from a command prompt, you can indicate the query word(s) as a command-line argument, as in
python trends.py bieber
If you are running the program from within IDLE, you will have to enter the query when prompted after the program begins.
Your primary tasks will be the following:
Initialize a dictionary (initially empty) that will map a given state abbreviation (e.g., 'MO') to a list of sentiment scores for tweets that are associated with that state.
Loop through the list of tweets and for each tweet:
Compute the average sentiment for that tweet by breaking the full message into words, and looking up each word in the sentiment dictionary (only some will be found).
You are to use the following rule for determining the words of the message. Each maximal substring of consecutive alphabetic characters should be considered a word. As an example, if the original tweet were
justin bieber...doesn't deserve the award..eminem deserves it.
The words of the tweet should be considered: ['justin', 'bieber', 'doesn', 't', 'deserve', 'the', 'award', 'eminem', 'deserves', 'it']. (A regular-expression approach to this splitting appears in the sketch following this list of tasks.)
Assuming the tweet has a sentiment score (that is, at least one word of the tweet was identified in the sentiments dictionary), assign this tweet's sentiment score to the "closest" state.
The rule that you should use is to assign the tweet to whichever state has its centroid closest to the location of the tweet. This is an imperfect rule (for example, because tweets from New York City will actually be closer to the centroid of Connecticut and New Jersey than to the centroid of New York state); but it is an easy rule to implement, and it will do for now.
Once you have scored all tweets and assigned those scores to the appropriate state, compute the cumulative sentiment for each state as the average of all sentiments that were assigned. Then use that sentiment to pick an appropriate color (using the get_sentiment_color function from our colors module), and set the state's color in the visualization.
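The skeleton below sketches one way to organize these tasks; it is not a finished solution. It assumes the sentiments, tweets, and states names that the starter code loads for you, it reuses the closest_state sketch from earlier on this page, and the lines marked with ... stand in for accessor and drawing calls whose actual names you should look up in the tweet and country modules.

import re

def tweet_words(text):
    """Maximal runs of alphabetic characters, lowercased for dictionary lookup."""
    return re.findall('[a-zA-Z]+', text.lower())

def tweet_sentiment(text, sentiments):
    """Average score of the words found in the sentiment dictionary, or None if none are found."""
    scores = [sentiments[w] for w in tweet_words(text) if w in sentiments]
    if not scores:
        return None
    return sum(scores) / len(scores)

state_scores = {}                                   # maps a state abbreviation to a list of tweet scores
for t in tweets:
    text = ...                                      # replace: the Tweet accessor for its message
    position = ...                                  # replace: the Tweet accessor for its GeoPosition
    score = tweet_sentiment(text, sentiments)
    if score is not None:
        abbrev = closest_state(position, states)
        state_scores.setdefault(abbrev, []).append(score)

for state in states:
    scores = state_scores.get(state.abbrev())
    overall = sum(scores) / len(scores) if scores else None
    color = get_sentiment_color(overall)            # gray when overall is None
    ...                                             # replace: the Country method that sets this state's color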
You should feel free to define any additional functions within the trends.py file that help you organize your code in a more clear and modular fashion.
All of your new code should be placed in the trends.py file. That file should be submitted electronically. You should also submit a separate 'readme' text file, as outlined in the general webpage on programming assignments.
Please see details regarding the submission process from the general programming web page, as well as a discussion of the late policy.
The assignment is worth 10 points.
The original data set that we are using is based on a snapshot of 1.7 million tweets recorded during a one-week period from August 28th through September 3rd of 2011.
For extra credit, instead of producing a single image for the weekly sentiment, produce a series of daily images that are displayed either with a timed delay from day to day, or by waiting for the user to indicate when s/he is ready to see the next image.
To accomplish this task, note that each Tweet instance has a timestamp, which is represented as an instance of Python's standard datetime class. You may rely on the fact that the series of tweets are loaded in chronological order.
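One way to organize the daily breakdown is to bucket the tweets by calendar date before running the same scoring and coloring steps on each bucket. The sketch below is framework-neutral: it takes the timestamp accessor as a parameter, since we do not repeat the tweet module's actual accessor name here.

from collections import defaultdict

def group_by_day(tweets, get_timestamp):
    """Bucket tweets by calendar date, preserving the chronological order of the data."""
    buckets = defaultdict(list)
    for t in tweets:
        buckets[get_timestamp(t).date()].append(t)
    return buckets

Each daily bucket can then be run through the same workflow as above, with a timed delay or a key press between the resulting images.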
This project is a variant of one developed by Aditi Muralidharan, John DeNero, and Hamilton Nguyen, as described at http://nifty.stanford.edu/2013/denero-muralidharan-trends/.