Abstract

The CUNY DATA 607 final project is an opportunity to work with Twitter data, AWS, and MySQL while learning to conduct a sentiment analysis.

Because of the unique temporal nature of Twitter, we create a dataset culled from tweets of a popular live event with a curated hashtag: the NBC airing of the live Hairspray musical on December 7, 2016 (December 8 GMT) - #hairspraylive.

We acquired the dataset after the live airing using the Twitter API. Because of the nuances of the API and the overall volume, the scrape was split by day, with AWS EC2 instances issuing the GET requests and a MySQL database storing the results.

The scraped data is then used in conjunction with a positive/negative word lexicon to conduct analysis of:

  • Overall Sentiment
  • Sentiment as a TimeSeries

Environment Prep

if (!require("RMySQL")) install.packages('RMySQL')
if (!require('plyr')) install.packages ('plyr')
if (!require('dplyr')) install.packages ('dplyr')
if (!require('stringr')) install.packages ('stringr')
if (!require('ggplot2')) install.packages ('ggplot2')
if (!require('reshape2')) install.packages ('reshape2')

Data Acquisition

AWS EC2

This was a popular event, and grabbing all the data proved to be extremely challenging and time consuming. While the final dataset isn't particularly large, it was challenging to manage the acquisition via the Twitter API.

This great walkthrough helped a noob like me use RStudio in AWS EC2 and access it both from the CLI and in a browser. Louis Aslett is kind enough to maintain an RStudio Amazon Machine Image (AMI) that also integrates with Dropbox, so it was reasonably trivial to save .csv files with generated/scraped data to Dropbox instead of an S3 bucket. I spent some time configuring an instance of RStudio, then saved it as my own AMI so I could quickly and easily spin up additional instances with the same settings.

I anticipate working this way often in the future.

Twitter API Credentials

In order to run the scrape function and get data, the user will need to create an application on the Twitter developer site and populate the fields below. Ours are stored in the file sourced below; a placeholder template follows the list.

  • api_key
  • api_secret
  • access_token
  • access_token_secret
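
A hypothetical FPCredentials.R might look like the sketch below. The variable names match those used in setup_twitter_oauth() later; the values shown are placeholders, not real credentials.

# FPCredentials.R (placeholder values - substitute your own app's keys)
api_key             <- "YOUR_API_KEY"
api_secret          <- "YOUR_API_SECRET"
access_token        <- "YOUR_ACCESS_TOKEN"
access_token_secret <- "YOUR_ACCESS_TOKEN_SECRET"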

AWS_FP.R

Below is an example of what was run in EC2 to scrape Twitter. Each VM ran this to scrape a single day, then saved the results to a CSV, which we later import into MySQL.

if (!require('twitteR')) { install.packages('twitteR'); library(twitteR) }
if (!require('ROAuth'))  { install.packages('ROAuth');  library(ROAuth) }
if (!require('httr'))    { install.packages('httr');    library(httr) }

source('FPCredentials.R')

options(httr_oauth_cache=TRUE)
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

#version to just look for date range
data1 <- searchTwitter("#hairspraylive", n=100000, since='2016-12-07', until='2016-12-08', lang="en")

# convert the list of status objects to a data frame
df1 <- twListToDF(data1)

#write to csv for later use
write.csv(df1, file="df1.csv")

rm(access_token, access_token_secret, api_key, api_secret)

## Check Session Limits
rate.limit <- getCurRateLimitInfo()
# print out all metrics that have been changed
rate.limit[rate.limit$limit != rate.limit$remaining,]

Handling Rate Limit

Twitter throttles API calls coming from a single IP address. Below, you can see the periodic call restrictions imposed, with each successive call getting a lower ceiling.

Diminishing returns on GET calls to the Twitter API

To solve this problem, I attempted to run these calls in parallel from separate instances, each instance responsible for gathering data from a single day.
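
As a complement to spreading the work across instances, the search call itself can be told to back off when it hits the limit. A minimal sketch, assuming the twitteR package's retryOnRateLimit argument (not used in the runs shown above):

# ask twitteR to sleep and retry (up to 120 times) when the rate limit is hit
data1 <- searchTwitter("#hairspraylive", n = 100000,
                       since = '2016-12-07', until = '2016-12-08',
                       lang = "en", retryOnRateLimit = 120)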

SIDE NOTE: Twitter has many disclaimers about its API - ultimately it is not a good resource for reproducibility. To test this, I ran the same query for data from 48 hours prior on two separate instances at the same time. The resulting datasets were similar, but where one had 37.4k observations, the other had 39.2k. Again, not ideal for reproducibility, but good enough for an informal sentiment analysis.

I used AWS CloudWatch to set alerts and configure a small dashboard where I could monitor progress and performance. Having monitoring in place helped diagnose some of the issues related to pulling tweets from the day of the event (since that day produced a much larger dataset).

My CloudWatch Dashboard

Major Issue: Handling Day Of

There were serious problems acquiring the entire dataset; ultimately, the final dataset used for the analysis below is incomplete.

The Twitter API does not permit pulling at a level more granular than date. I attempted to pull day-of data from multiple EC2 VMs of different instance types over the span of about a week. In the end, I ran into the following issues:

  • the API would choke on my GET request, which resulted in the VM idling
  • the GET request would complete, producing the expected R object, but the VM would then crash when I attempted to get the data out of it

The latter was particularly frustrating, as I "had" the data but then couldn't transfer it out without losing it. In the instances where a GET completed, I saw between ~240,000 and ~600,000 tweets for that single day. I tried:

  • saving as an R object and transferring via Dropbox
  • saving as an R object and using scp to copy it to my local machine
  • saving as an R object and transferring to S3
  • snapshotting the instance before taking further action (still lost the data in RStudio)
  • saving as a CSV and using any transfer method (note: I was never able to convert the large object to a data frame and then to a CSV before a crash)

I did have a successful scrape using my local machine left running overnight, which resulted in a dataset of 100,000 tweets. The Twitter API returns the most recent data first, so my 100k tweets ended at 23:59:59 GMT and ran backwards to about 04:30:00. Unfortunately, the show itself started at 01:00:00 GMT, so my only successful pull missed the show itself.

To compensate in some small way, in the hope of acquiring at least a few tweets during this span, I used the resultType="popular" argument and was able to get at least one tweet for each missing hour. These were then merged with the 100k dataset mentioned above (a sketch of the merge follows the snippet below). See Mosaic Plot, Hourly Tweets below.

#use popular result type to at least get a few (returns < 20, but in needed time frame)
d2 <- searchTwitter("#hairspraylive", since='2016-12-08', until='2016-12-09', lang="en", resultType="popular")
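
The merge itself is a straightforward row bind once both pulls are data frames. A sketch, where df_dayof is an illustrative name for the 100k day-of data frame:

# convert the handful of 'popular' tweets and append them to the day-of data
d2.df    <- twListToDF(d2)
df_dayof <- rbind(df_dayof, d2.df)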

Data Management

Storing in MySQL

A better data scientist would have found a more elegant workflow for managing this data, but for now, saving the data culled from the VMs as CSVs and then loading them into a local MySQL database works. If the volume were much greater, a more refined approach would be essential.
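
The table schema itself isn't reproduced in this writeup. As a rough sketch only, the DDL for the columns queried later might look like the snippet below; the actual table mirrors every column written by twListToDF(), plus the row names added by write.csv()/dbWriteTable().

# hypothetical, abbreviated schema - the real table has many more columns
create_tweets <- "
  CREATE TABLE IF NOT EXISTS tweets (
    text         TEXT,
    created      DATETIME,
    retweetCount INT,
    isRetweet    TINYINT(1)
  );"
# dbGetQuery(mydb, create_tweets)   # run once against the hairsprayTwitter database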

mydb = dbConnect(MySQL(), user='root', dbname='hairsprayTwitter', host='localhost')

df1 <- read.csv('df1.csv')
df2 <- read.csv('df2.csv')
df3 <- read.csv('df3.csv')
df4 <- read.csv('df4.csv')

# drop table rows if already there, but keep schema defined in MySQL.
dbGetQuery(mydb,"TRUNCATE TABLE tweets")

dbWriteTable(mydb, name='tweets', value=df1, append=TRUE)
dbWriteTable(mydb, name='tweets', value=df2, append=TRUE)
dbWriteTable(mydb, name='tweets', value=df3, append=TRUE)
dbWriteTable(mydb, name='tweets', value=df4, append=TRUE)

dbDisconnect(mydb)

Pulling DF into R

mydb = dbConnect(MySQL(), user='root', dbname='hairsprayTwitter', host='localhost')
query = dbSendQuery(mydb, 
                    "SELECT text, created, retweetCount, isRetweet
                    FROM hairsprayTwitter.tweets;")

HS.df <- fetch(query, n=-1)
dbDisconnect(mydb)
## [1] TRUE

Start preparing the data

Time Prep

To help standardize the acquired data, the Twitter API assigns every observation a "created" variable that gives a timestamp in GMT. For this project, we'll leave our data in GMT but prepare some supplementary variables for later use.

#create date and time fields
HS.df$created <- strptime(HS.df$created, "%Y-%m-%d %H:%M:%S", tz = "GMT")
HS.df$date <- format(HS.df$created, "%m-%d")
HS.df$hour <- format(HS.df$created, "%H")
HS.df$ID <- format(HS.df$created, "%d:%H")

Mosaic Plot, Hourly Tweets

plot(table(HS.df$date, HS.df$hour))

table(HS.df$date)
## 
##  12-07  12-08  12-09  12-10 
##  37412 100013  27764   8886

Note that even though we are missing the full volume of a crucial 4 hours on the day of the event, that day still has about 35% more tweets than the other three days combined.
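
For reference, that comparison works out as:

# day-of tweets relative to the other three days combined
100013 / (37412 + 27764 + 8886)   # ~1.35, i.e. about 35% more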

Opinion Lexicon

Pull Established Opinion Lexicon

Pulled from Hu and Liu, University of Illinois at Chicago. See References.

pos <- scan('opinion-lexicon-English/positive-words.txt',
                  what='character', comment.char=';')
neg <- scan('opinion-lexicon-English/negative-words.txt',
                  what='character', comment.char=';')

Customize Opinion Lexicon

This is based on the iterative process detailed further along (see Sentiment Analysis: Improving Lexicon), where we look at tweets that eluded scoring. Some should indeed have a score of 0, as there is no obvious sentiment; for others, we can improve our classifier by reviewing samples and customizing our lexicon.

We will also try to add the most obvious emojis, represented here as the byte strings produced when the tweet text is converted to ASCII in the scoring function.

pos <- c(pos, "excited", "friend", "honor", "cuddles", "warm", "favorite",
         "real", "hearts", "gift", "heavens", "jubilant", "exhilarating", "proud",
         "amazing", "happy", "queen", "love", "respect", "rocks", 
         "<ed><a0><bd><ed><b2><93>", "<ed><a0><bd><ed><b3><a3>", "<e2><99><a1>")
neg <- c(neg, "<ed><a0><bd><ed><b2><a9>")

Sentiment Analysis

The function below compares the text of our tweets against our positive and negative lists. It is inspired by the sentiment function described here - see References.

score.sentiment <- function(sentences, good_text, bad_text, .progress='none'){
    scores = laply(sentences, function(sentence, good_text, bad_text) {
        
        # clean text
        sentence <- gsub('[[:punct:]]', '', sentence)
        sentence <- gsub('[[:cntrl:]]', '', sentence)
        sentence <- gsub('\\d+', '', sentence)
        sentence <- iconv(sentence, 'latin1', 'ASCII', 'byte')
        sentence <- tolower(sentence)        
        words <- unlist(str_split(sentence, '\\s+'))
        
        # compare to lexicon
        pos.matches <- !is.na(match(words, good_text))
        neg.matches <- !is.na(match(words, bad_text))
        
        # calc score as difference between two
        score <- sum(pos.matches) - sum(neg.matches)
        
        return(score)
    }, good_text, bad_text, .progress=.progress )
    return(scores)
}
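
As a quick sanity check, we can score a made-up sentence by hand: with the customized lexicon above, "love" and "amazing" match the positive list and "bad" matches the negative list, so the expected score is +1.

# made-up example: two positive matches minus one negative match = 1
score.sentiment("I love this amazing show but the sound was bad", pos, neg)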

Score Data

HS.df$score <- score.sentiment(HS.df$text, pos, neg)

Improving Lexicon

We will try to improve the quality of our lexicon by looking for entries that come up as score=0 and seeing if there are words that should be obviously classified in a particular direction.

zeros <- HS.df[HS.df$score == 0,]
sample(zeros$text, 5)
## [1] "RT @FrankieJGrande: Without a doubt, @HairsprayLive is one of the greatest things to ever happen to live tv. ;_\xcd\xe8\xe6;_\xcd__;_\xcd__ #HairsprayLive"                
## [2] "RT @IndianaLowdmouf: You can't stop the beat. I miss dancing this number. Such a great song #HairsprayLive"                                                                
## [3] "RT @cookieballinger: YOU CANT STOP THE BEA- #HairsprayLive https://t.co/Ka7CUcj48P"                                                                                        
## [4] "Not another advert @itv2 #HairsprayLive"                                                                                                                                   
## [5] "RT @stahanson: tbt to when I dressed up as Tracy for drama club sophomore year ;_\xcd\xd9\xe4;_\xcd\xd9\xe4;_\xcd\xd9\xe4;_\xcd_\xab#hairspraylive https://t.co/FoohO5wgFI"

Distribution of Scores

qplot(factor(HS.df$score), fill=factor(HS.df$date), geom="bar", xlab = "Sentiment Score")

We can also take the mean of our distribution to find that the overall sentiment (including 0s) is 0.5112509.
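
For reference, that figure is simply:

# overall mean sentiment, zero-score tweets included
mean(HS.df$score)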

What do the extremes look like?

low <- HS.df$text[HS.df$score==-5]
low
## [1] "Derek Hough is not doing it for me. Weak voice. And that weird pretend to bite the Oreo bit was just #awkward... #hairspraylive #fail"
high <- HS.df$text[HS.df$score==7]
high
## [1] "Still thinking about Harvey Fierstein's moment of pure joy on the hot dog cart in \"Welcome to the 60s\" - love, love, love. #hairspraylive"                     
## [2] "RT @holyarigrande: Good night @ArianaGrande I love u and I'm so so so proud of u. Thank u thank thank u #HairsprayLive. I love u so much ;_\xcd\xd9_\x8a\x9d_\xb4ُ"
## [3] "Good night @ArianaGrande I love u and I'm so so so proud of u. Thank u thank thank u #HairsprayLive. I love u so much ;_\xcd\xd9_\x8a\x9d_\xb4ُ"                   
## [4] "I LOVE IT I LOVE IT I LOVE IT I LOVE IT I LOVE IT I LOVE IT I LOVE IY  #HairsprayLive"

Sentiment Over Time

Calculate the mean sentiment for each hour, both for our full dataset AND for the dataset with zero-score tweets removed.

# column 8 of HS.df is the sentiment score
hourmean <- aggregate(HS.df[,8], list(HS.df$ID), mean)

## look at only those with a non zero sentiment score
nozeros <- HS.df[!HS.df$score == 0,]
hourmean_n <- aggregate(nozeros[,8], list(nozeros$ID), mean)

## looks like we lost an hour observation that was a single tweet with 0 sentiment 
## let's add it back in with the same value as its neighbors (= 1)
hourmean_n <- rbind(hourmean_n[1:26,], c("08:02", 1), hourmean_n[27:95,] )
final <- cbind(hourmean, hourmean_n$x)
names(final) <- c("ID", "AllData", "NoZero")
final$NoZero <- as.numeric(as.character(final$NoZero))

# melt, to prepare for final visualization
final2 <- melt(final, id="ID")

Plot as time series

ggplot(final2, aes(x=ID, y=value, group=variable, colour=variable)) + 
    geom_line() + 
    geom_vline(xintercept = c(26, 28), linetype="dashed", color="black") +
    geom_text(aes(x=28, label="\n the show", y = 0.2), colour="black", angle=90) +
    annotate("rect", fill="red", alpha = 0.1, xmin=25, xmax=28, ymin=-Inf, ymax=Inf) +
    geom_text(aes(x=28, label="\n the missing data", y = 1.15), 
              colour="red", angle=90) +
    geom_vline(xintercept = c(0), linetype="longdash", color="grey") +
    geom_text(aes(x=0, label="\n day before", y = 0.1), colour="grey", angle=90) +
    geom_vline(xintercept = c(25), linetype="longdash", color="grey") +
    geom_text(aes(x=25, label="day of \n", y = 0.1), colour="grey", angle=90) +
    geom_vline(xintercept = c(49), linetype="longdash", color="grey") +
    geom_text(aes(x=49, label="1 day after \n", y = 0.1), colour="grey", angle=90) +
    geom_vline(xintercept = c(73), linetype="longdash", color="grey") +
    geom_text(aes(x=73, label="2 day after \n", y = 0.1), colour="grey", angle=90) +
    ylab("Sentiment Score") + 
    ggtitle("#hairspraylive Sentiment Analysis \n Mean Hourly Score") +
    theme(axis.title.x=element_blank(), axis.text.x=element_blank(), 
          axis.ticks.x=element_blank())

Here, we can also see that, compared to the score mean that included tweets with no sentiment weight (0.5112509), the mean sentiment of the scored tweets alone is 0.93342.

However, since we removed the steadying influence of the 0s, our variability goes up, as the quick check after this list shows:

  • SD - Full data set: 0.9405649
  • SD - No zeros: 1.1050435
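
These figures come straight from the score vectors (using the nozeros subset defined above):

# mean and standard deviation, with and without the zero-score tweets
mean(HS.df$score);   sd(HS.df$score)
mean(nozeros$score); sd(nozeros$score)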

It is also interesting to note that the sentiment during the ‘hype’ period (day before) is pretty similar to the sentiment after the event. We see a slight drop off after 48 hrs.

Conclusion

This was an excellent learning experience in handling the Twitter API (in its current state), managing an RStudio AMI in AWS EC2, and conducting a sentiment analysis.

The limitations in acquiring the desired data made one of our secondary goals impossible: conducting a minute-by-minute sentiment analysis during the live show itself and looking for obvious correlations with the contents of the show.

Were I to do this again, I would attempt to use separate functionality to acquire Twitter data as a live stream instead of trying to GET past data. While I don't think I came up with an ideal solution by any means, this type of project shows the importance of creating a data acquisition and management architecture well suited to the task at hand.
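
A rough sketch of that streaming approach, using the streamR package (not used in this project; my_oauth stands in for an ROAuth credential object prepared beforehand), might look like:

# hypothetical live-stream capture during the broadcast window
library(streamR)
filterStream(file.name = "hairspray_stream.json",
             track     = "#hairspraylive",
             timeout   = 4 * 60 * 60,      # capture for roughly 4 hours
             oauth     = my_oauth)
stream.df <- parseTweets("hairspray_stream.json")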