Saint Louis University
Computer Science 1300
Computer Science Department
For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.
Please make sure you adhere to the policies on academic integrity in this regard.
In this assignment, we are going to attempt to recreate a data science experiment described in a 2013 article published in PLoS ONE titled "The Geography of Happiness: Connecting Twitter Sentiment and Expression, Demographics, and Objective Characteristics of Place". The general approach of the research was to take a large random sample of tweets from the United States and to analyze differences in perceived sentiment according to the geographic location from which the tweets were posted. The relative "happiness" associated with a word can be approximated by looking at what other words appear in the same tweets (and using computed estimates of the sentiment associated with those words).
Our experiment is admittedly more simplistic, giving us some practice in data analysis but with smaller data sets and some corner-cutting for the sake of efficiency. As a result, it is not clear how reliable any conclusions drawn from our results will be (but it is still fun to hypothesize). We describe our complete methodology shortly, but as a motivating example, the following figure displays the result of searching our data set for tweets that use the word "college" and then analyzing the relative sentiment of the tweets when grouped by state. The most positive sentiments are visualized in red (e.g., Kentucky) and the most negative in blue (e.g., Wisconsin), with a spectrum of colors in between such that yellow is relatively neutral. States drawn in black do not have a sufficient quantity of information within the data set for that query.
In this section, we describe the complete methodology of our experiments. Note that we will have already implemented much of the supporting software, with your specific task further outlined in a later section of this document. A diagram such as the above example is constructed as follows:
Given a query term (e.g. 'college'), we filter through all tweets in our data set, pulling aside those containing the query word. This is one of the more expensive tasks, and so to save time for repeated testing, our software saves the set of tweets for a given query in a separate file within a data/ folder, so that this step can be skipped if a particular query is repeated.
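The provided parse module implements this caching on your behalf; purely as illustration, the pattern is roughly the following sketch, in which the file-naming scheme and the helper names are hypothetical:

import os

def matching_tweets(query):
    # Return all tweets containing the query word, reusing a cached
    # file inside data/ when one exists from a previous run.
    cache = os.path.join('data', query + '.txt')   # hypothetical naming scheme
    if os.path.exists(cache):
        return load_tweets(cache)                  # hypothetical parse helper
    matches = [t for t in load_all_tweets()        # hypothetical parse helper
               if contains_word(t, query)]         # word test described below
    save_tweets(matches, cache)                    # hypothetical parse helper
    return matches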
Each tweet comes with geographic metadata in the form of a latitude and longitude, which can be used to associate the tweet with a specific state. Doing this accurately would require a careful geometric analysis of the states' shapes, so we instead use a rougher estimate as follows. With each state, we store the longitude and latitude of the state's center of mass (the so-called "centroid"). We then map a tweet to whichever state has the nearest centroid. (This does mean that we grossly mismap some tweets; for example, it is likely that any tweet from New York City would be mapped to Connecticut rather than New York, given how far away the centroid of New York state is.)
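As a sketch, the nearest-centroid mapping can be expressed quite compactly, assuming states is a list of the provided State instances (the centroid() and distance(other) methods are described later in this document):

def closest_state(position, states):
    # Return the State whose centroid is nearest to the given GeoPosition.
    return min(states, key=lambda st: position.distance(st.centroid()))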
We analyze the "sentiment" of a tweet by breaking the message into individual words. Since there are a variety of punctuation styles on Twitter and not always traditional spacing, we do not simply use a split() command. Instead, we define a word to be any maximal consecutive sequence of alphabetic characters (including an apostrophe), delimited by non-alphabetic characters. For example, one of the tweets in the data set reads
Tried....but I can't stay up any longer....nite tweeps....

We will break that apart into a list of the following 10 words:
["Tried", "but", "I", "can't", "stay", "up", "any", "longer", "nite", "tweeps"]
For each of the identified words other than the query itself, we check if that word (when lowercased) is on a known list of words with identified sentiment values. Those sentiment values are floating-point values on a scale from 1 (extremely negative) to 9 (extremely positive). For example, the word 'laugh' has a sentiment score of 8.22, while the word 'cry' has a sentiment score of 1.84. We are relying on a list of just over 10,000 sentiment scores provided by the Mechanical Turk project that was referenced in the academic paper.

If a word of a tweet is not found in the sentiment dictionary, it is ignored. Also, to avoid giving weight to relatively neutral words, we ignore all words that have sentiment scores from 4.0 to 6.0.
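To make these filtering rules concrete, a helper along the following lines can decide whether a given word contributes a score (the function name and the sentiments dictionary are our own illustration, not part of the provided code):

def significant_score(word, sentiments, query):
    # Return the word's sentiment score if it should be counted, else None.
    w = word.lower()
    if w == query.lower():        # never score the query term itself
        return None
    score = sentiments.get(w)     # sentiments maps lowercased word -> float
    if score is None:             # word not on the sentiment list: ignore it
        return None
    if 4.0 <= score <= 6.0:       # relatively neutral words are ignored too
        return None
    return score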
The overall sentiment of a state is computed as the average sentiment score over all occurrences of significant words (that is, non-query words whose scores fall outside the moderate range of 4.0 to 6.0) in tweets mapped to that state. Note that if a word occurs multiple times within a state, we include each occurrence when computing the average. In an attempt to avoid clear outliers, we require that a state have at least four relevant words in order for us to report its average sentiment.
We also compute the overall national sentiment for tweets with the query word (including results for those states that might have had too few words).
Finally, we visualize the results by mapping a state's sentiment score to a color between blue (negative) and red (positive), using a prescribed color gradient that we've implemented. States that did not have sufficient number of significant words are drawn in black.
We have already implemented, and are providing, significant portions of this project. In particular, we manage parsing of all the data files and we provide tools for the cs1graphics visualization of the map.
All of your code must be in the file trends.py within the project. That is the only file you will submit. Your task is to perform the core analysis of the tweets and sentiment scores. You are expected to follow the precise methodology described above, so re-read it carefully and if you feel there are any ambiguities, please ask. You should feel free to define any additional functions within the trends.py file that help you organize your code in a clearer, more modular fashion.
As advice, we suggest that you initialize one or more dictionaries (initially empty) that will map a given state abbreviation (e.g., 'MO') to the corresponding statistics. While it may be tempting to build a list of all relevant words for each state, it is possible to instead keep sufficient statistical markers along the way so that you do not need the full list of words.
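For instance, building on the hypothetical helpers sketched earlier, that bookkeeping might look roughly as follows (the Tweet accessor names message() and position() are placeholders for the actual accessors listed in the distributed code):

totals = {}   # maps a state abbreviation, e.g. 'MO', to [sum_of_scores, count]
for tweet in matching_tweets(query):
    abbrev = closest_state(tweet.position(), states).abbrev()
    for word in extract_words(tweet.message()):
        score = significant_score(word, sentiments, query)
        if score is not None:
            stats = totals.setdefault(abbrev, [0.0, 0])
            stats[0] += score
            stats[1] += 1

MIN_WORDS = 4                   # threshold from the methodology above
averages = {st: s / n for st, (s, n) in totals.items() if n >= MIN_WORDS}
national = (sum(s for s, n in totals.values())
            / sum(n for s, n in totals.values()))  # includes below-threshold states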
In addition to the graphical display, you are required to provide textual output, as demonstrated in the later examples, giving the overall nationwide average sentiment and the minimum and maximum sentiment scores among states that had sufficient data.
As noted above, we are providing you with much of the necessary framework for this project. The source code and various data files are being distributed as follows. If you are working on hopper, you can get a copy of the distribution by typing the following command from your account:
cp -R /public/goldwasser/1300/trends .
For those working on your own machine, we have packaged the project as trends.zip. We note that this distribution includes data files for the example queries city, church, college, and police for experimentation, but it does not include the larger data file all_tweets.txt that can be used to evaluate trends for other search terms. That file is 42MB in size. If you wish, it can be downloaded here: all_tweets.txt and placed within the data/ directory. (Note: this data set was captured, by another researcher, from a random sample of tweets; we take no responsibility for the content of those tweets.)
This project uses many files, but all of your code must be placed within the file trends.py. With that said, you will need to be aware of the following:
A geo module defines a GeoPosition class to represent a geographic location in terms of latitude and longitude. There will be such a position associated with each tweet, and the state's geographic descriptions are based on these as well.
The most important issue for you to remember is that geographic positions are spherical (even if we project them to two-dimensions for drawing a map). In particular, this means that when computing the distance between two positions, you cannot rely on the familiar Euclidean equation. Instead, there is a distance(other) method of the GeoPosition class that properly computes the shortest path between two geographic locations (based on the distance traveled on the great circle that connects them).
The class also provides methods latitude() and longitude(), to access the individual components of a position, if interested.
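You should always rely on the provided distance(other) method, but for intuition, great-circle distances are commonly computed with the haversine formula, roughly as in the following sketch (this illustrates the general technique and is not necessarily the class's actual implementation):

import math

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance, in miles, between two latitude/longitude points.
    radius = 3959.0                       # mean radius of the Earth in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))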
The tweet module provides a Tweet class, with an instance of that class representing a single Twitter message. The class supports accessor methods for retrieving the tweet's message and its geographic position.
The state module defines a State class used to represent information about a state. Each state has a standard two-letter abbreviation (e.g., MO for Missouri), that is returned by the abbrev() method.
The boundaries of each state are defined with a series of geographic positions. These are used when creating the graphical representation of our visualization. (Fortunately for you, we will take care of that!) What you will need to know about is that the State class supports a method, centroid(), that returns a single GeoPosition that represents what is known as the centroid of the state. Informally, the centroid is an "average" of all positions in the state. We will use that single position as an approximation for the entire state when determining the state to which a tweet is closest.
The us_states module contains the actual data needed for representing the United States. You will not need to examine this file; it will be used by other parts of the project.
The country module defines a Country class that handles the actual rendering of the states. An instance of this class will be very much like a specialized canvas in our cs1graphics framework, supporting methods for displaying the map and for updating the colors of individual states.
The colors module provides support for translating our numeric "sentiment" values into an appropriate color based on a fixed gradient formula. In particular, the module defines a function with calling signature:
get_sentiment_color(sentimentValue)

that returns an RGB triple of an appropriate color for the given numeric sentiment value. If None is sent as a parameter, it returns black as the color.
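For example, a brief usage sketch (the sample values are taken from the results shown later in this document):

from colors import get_sentiment_color

warm = get_sentiment_color(6.71)     # an RGB triple toward red (positive)
cool = get_sentiment_color(4.52)     # an RGB triple toward blue (negative)
black = get_sentiment_color(None)    # black, for states lacking sufficient data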
The parse module handles most of the low-level interaction with the data files (so that you won't have to). In particular, it has functionality to load the original dictionary of sentiment scores, and to load and filter the tweet data based on a desired search term. You will not need to directly call any of these functions.
The data folder contains the raw data for sentiment scores and tweets, and this is where raw results for previous queries are cached for efficiency.
The main driver for the project is the trends.py file, so that is the one that should be executed. When executed, it asks the user to enter a query word. Note that if you do not download the full all_tweets.txt file on your system, you will only be able to run the program for one of the four predefined queries (city, church, college, police). With the all_tweets.txt file you can pick any query word you wish.
Note that if running Python from a command prompt, rather than from IDLE, you can indicate the query word as a command-line argument, using a syntax such as
python trends.py college
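Support for such an optional argument relies on sys.argv; the provided driver presumably does something along these lines:

import sys

if len(sys.argv) > 1:
    query = sys.argv[1]                  # e.g. 'college' from the command line
else:
    query = input('Enter a query word: ')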
We have prepared four sample results to allow you to do some sanity checking of your program relative to our results.
Query: city
National sentiment value 5.96
Highest sentiment value of 6.54 in MN
Lowest sentiment value of 4.46 in AK
Query: church
National sentiment value 6.05
Highest sentiment value of 6.90 in RI
Lowest sentiment value of 5.45 in PA
Query: college
National sentiment value 5.90
Highest sentiment value of 6.71 in KY
Lowest sentiment value of 4.67 in WI
Query: police
National sentiment value 5.37
Highest sentiment value of 6.55 in KY
Lowest sentiment value of 4.52 in DE
All of your new code should be placed in the trends.py file. That file should be submitted electronically. You should also submit a separate 'readme' text file, as outlined in the general webpage on programming assignments.
Please see details regarding the submission process from the general programming web page, as well as a discussion of the late policy.
The assignment is worth 40 points, distributed as follows:
This project is a variant of one developed by Aditi Muralidharan, John DeNero, and Hamilton Nguyen, as described at http://nifty.stanford.edu/2013/denero-muralidharan-trends/.
Our set of tweets is taken from the corpus provided at http://www.cs.cmu.edu/~ark/GeoText/
Our base list of word sentiment scores is taken from the Mechanical Turk project (https://www.uvm.edu/storylab/2011/12/08/hedonometrics/)