Course Home | Assignments | Computing Resources | Lab Hours/Tutoring | Python | Schedule | Submit

Saint Louis University

Computer Science 1300
Introduction to Object-Oriented Programming

Michael Goldwasser

Fall 2017

Computer Science Department

Programming Assignment 09

Twitter Trends

Due: 11:59pm, Tuesday, 5 December 2017


Contents:


Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.

Please make sure you adhere to the policies on academic integrity in this regard.


Overview

In this assignment, we are going to attempt to recreate a data science experiment described in a 2013 article published in PLoS ONE titled The Geography of Happiness: Connecting Twitter Sentiment and Expression, Demographics, and Objective Characteristics of Place. The general approach of the research was to take a large random sample of tweets from the United States and to analyze differences in perceived sentiment according to the geographic location from which tweets were posted. The relative "happiness" associated with one word can be approximated by looking at which other words appear in the same tweets (and using computed estimates of the sentiment associated with those words).

Our experiment will be admittedly more simplistic — giving us some practice in data analysis, but with smaller data sets and some corner-cutting for the sake of efficiency. As a result, it's not clear how reliable our conclusions will be (but it's still fun to hypothesize). We describe our complete methodology shortly, but as a motivating example, the following figure displays the result of searching our data set for tweets that use the word "college" and then analyzing the relative sentiment of the tweets when grouped by state. The most positive sentiments are visualized in red (e.g., Kentucky) and the most negative in blue (e.g., Wisconsin), with a spectrum of colors between such that yellow is relatively neutral. States drawn in black do not have a sufficient quantity of information within the data set for that query.

Methodology

In this section, we describe the complete methodology of our experiments. Note that we will have already implemented much of the supporting software, with your specific task further outlined in a later section of this document. A diagram such as the above example is constructed as follows:

  1. Given a query term (e.g. 'college'), we filter through all tweets in our data set, pulling aside those containing the query word. This is one of the more expensive tasks, and so to save time for repeated testing, our software saves the set of tweets for a given query in a separate file within a data/ folder, so that this step can be skipped if a particular query is repeated.

  2. Each tweet comes with geographic metadata in the form of latitude and longitude, which can be used to associate the tweet with a specific state. Doing this accurately would require careful geometric analysis of the states' shapes, so we use a rougher estimate as follows. For each state, we store the longitude and latitude of the state's center of mass (the so-called "centroid"). Then we map a tweet to whichever state has the nearest centroid. (This does mean that we are grossly mismapping some tweets; for example, it's likely that any tweet from NYC would be mapped to Connecticut rather than New York, given how far away the centroid of New York state is.)

  3. We analyze the "sentiment" of a tweet by breaking the message into individual words. Since there are a variety of punctuation styles on Twitter and not always traditional spacing, we do not simply use a split() command. Instead, we define a word to be any maximal consecutive sequence of alphabetic characters (including an apostrophe), delimited by any non-alphabetic characters. For example, one of the tweets in the data set reads

    Tried....but I can't stay up any longer....nite tweeps....
    We will break that apart into a list of the following 10 words:
    ["Tried", "but", "I", "can't", "stay", "up", "any", "longer", "nite", "tweeps"]

  4. For each of the identified words other than the query itself, we check if that word (when lowercased) is on a known list of words with identified sentiment values. Those sentiment values are floating-point values on a range with 1 being extremely negative and 9 being extremely positive. For example, the word 'laugh' has a sentiment score of 8.22, while the word 'cry' has a sentiment score of 1.84. We are relying on a list of just over 10,000 sentiment scores provided by the Mechanical Turk project that was referenced in the academic paper.

    If a word of a tweet is not found in the sentiment dictionary, it is ignored. Also, to avoid words that have less significance, we ignore all words that have sentiment scores from 4.0 to 6.0.

  5. The overall sentiment of a state is computed as the average sentiment score for all occurrences of non-query words in tweets mapped to that state that have sentiment scores outside the moderate range of 4.0 to 6.0. Note that if a word occurs multiple times within a state, we include each occurrence when computing the average. In an attempt to avoid clear outliers, we require that a state have at least four relevant words in order for us to report its average sentiment.

    We also compute the overall national sentiment for tweets with the query word (including results for those states that might have had too few words).

  6. Finally, we visualize the results by mapping a state's sentiment score to a color between blue (negative) and red (positive), using a prescribed color gradient that we've implemented. States that did not have sufficient number of significant words are drawn in black.
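As a rough illustration of steps 2 and 3 above, here is one possible sketch in Python. The function names, and the assumption that centroids are available as a dictionary mapping state abbreviations to (latitude, longitude) pairs, are our own for illustration — the provided framework may organize things differently.

```python
import re
from math import sqrt

def extract_words(text):
    """Return the maximal runs of letters/apostrophes in a tweet (step 3)."""
    return re.findall(r"[A-Za-z']+", text)

def nearest_state(lat, lon, centroids):
    """Map a (lat, lon) pair to whichever state has the nearest centroid (step 2).

    Here centroids is assumed to be a dict mapping a state abbreviation
    (e.g., 'CT') to its (lat, lon) centroid.
    """
    best, best_dist = None, float('inf')
    for state, (clat, clon) in centroids.items():
        d = sqrt((lat - clat) ** 2 + (lon - clon) ** 2)
        if d < best_dist:
            best, best_dist = state, d
    return best

# The sample tweet from step 3 yields the ten words listed there:
# extract_words("Tried....but I can't stay up any longer....nite tweeps....")
# returns ['Tried', 'but', 'I', "can't", 'stay', 'up', 'any', 'longer', 'nite', 'tweeps']
```

Note that nearest_state compares raw latitude/longitude differences rather than true geographic distance; that is consistent with the rough-estimate spirit of step 2, but it is an additional simplification.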


Your Task

We have already implemented, and are providing, significant portions of this project. In particular, we manage parsing of all the data files and we provide tools for the cs1graphics visualization of the map.

All of your code must be in the file trends.py within the project. That is the only file you will submit. Your task is to perform the core analysis of the tweets and sentiment scores. You are expected to follow the precise methodology described above, so re-read it carefully, and if you feel there are any ambiguities, please ask. You should feel free to define any additional functions within the trends.py file that help you organize your code in a clearer and more modular fashion.

As advice, we suggest that you initialize one or more dictionaries (initially empty) that will map a given state abbreviation (e.g., 'MO') to the corresponding statistics. While it may be tempting to build a list of all relevant words for each state, it is possible to instead keep sufficient statistical markers along the way so that you do not need the full list of words.
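One possible shape for that bookkeeping is sketched below. The function name and the (state, score) pair representation are our own assumptions for illustration; the key idea is that a running sum and count per state suffice, so the full list of words never needs to be stored.

```python
def state_averages(occurrences, minimum=4):
    """Compute the average sentiment per state from running totals (step 5).

    occurrences is assumed to be an iterable of (state, score) pairs,
    one pair per occurrence of a significant non-query word.
    """
    totals = {}  # e.g., 'MO' -> [running sum of scores, count of scores]
    for state, score in occurrences:
        stats = totals.setdefault(state, [0.0, 0])
        stats[0] += score
        stats[1] += 1
    # report an average only for states with at least `minimum` relevant words
    return {state: s / n for state, (s, n) in totals.items() if n >= minimum}
```

With minimum=4 this also enforces the four-word threshold from step 5 of the methodology.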

In addition to the graphical display, you are required to provide textual output, as demonstrated in the later examples, giving the overall nationwide average sentiment, and the minimum and maximum sentiment scores for states that had sufficient data.


Files We Are Providing

As noted above, we are providing you with much of the necessary framework for this project. The source code and various data files are being distributed as follows. If you are working on hopper, you can get a copy of the distribution by typing the following command from your account:

cp -R /public/goldwasser/1300/trends .

For those working on your own machine, we have packaged the project as trends.zip. We note that this distribution includes data files for the example queries city, church, college, and police for experimentation, but it does not include the larger data file all_tweets.txt that can be used to evaluate trends for other search terms. That file is 42MB in size. If you wish, it can be downloaded here: all_tweets.txt and placed within the data/ directory. (Note: this data set was captured, by another researcher, from a random sample of tweets; we take no responsibility for the content of those tweets.)

This project uses many files, but all of your code must be placed within the file trends.py. With that said, you will need to be aware of the following:


Running the Program

The main driver for the project is the trends.py file, so that is the one that should be executed. When executed, it asks the user to enter a query word. Note that if you do not download the full all_tweets.txt file on your system, you will only be able to run the program for one of the four predefined queries (city, church, college, police). With the all_tweets.txt file you can pick any query word you wish.

Note that if running Python from a command prompt, rather than IDLE, you can indicate the query word as a command-line argument, using a syntax such as

python trends.py college
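If you are curious how a script can support that invocation style, one common pattern is sketched below (the provided trends.py may already handle this; the prompt text here is our own):

```python
import sys

def get_query():
    """Use the first command-line argument as the query word, else prompt."""
    if len(sys.argv) > 1:
        return sys.argv[1]
    return input('Enter a query word: ')
```

Running `python trends.py college` would then make get_query() return 'college' without prompting.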


Examples

We have prepared four sample results to allow you to do some sanity checking of your program relative to our results. You may click on any of the following images to see its full-sized version.

Query: city

National sentiment value 5.96
Highest sentiment value of 6.54 in MN
Lowest sentiment value of 4.46 in AK


Query: church

National sentiment value 6.05
Highest sentiment value of 6.90 in RI
Lowest sentiment value of 5.45 in PA


Query: college

National sentiment value 5.90
Highest sentiment value of 6.71 in KY
Lowest sentiment value of 4.67 in WI


Query: police

National sentiment value 5.37
Highest sentiment value of 6.55 in KY
Lowest sentiment value of 4.52 in DE


Submitting Your Assignment

All of your new code should be placed in the trends.py file. That file should be submitted electronically. You should also submit a separate 'readme' text file, as outlined in the general webpage on programming assignments.

Please see details regarding the submission process from the general programming web page, as well as a discussion of the late policy.


Grading Standards

The assignment is worth 40 points, distributed as follows:


Acknowledgments

This project is a variant of one developed by Aditi Muralidharan, John DeNero, and Hamilton Nguyen, as described at http://nifty.stanford.edu/2013/denero-muralidharan-trends/.

Our set of tweets are taken from the corpus provided at http://www.cs.cmu.edu/~ark/GeoText/

Our base list of word sentiment scores are taken from the Mechanical Turk project (https://www.uvm.edu/storylab/2011/12/08/hedonometrics/)


Michael Goldwasser
Last modified: Thursday, 21 December 2017