Introduction

A few weeks ago, some colleagues and I were discussing the Oklahoma City Thunder's quality of play given their seemingly excellent offseason roster moves. During that discussion, we wondered whether the fan base was generally displeased (we assumed so, given the poor record). In recent months, we have found that with R, an ambitious person can set out to study just about anything tangible. One idea was to study tweets associated with the Thunder and measure how pleased or displeased the fan base is or was. Our discussion happened to occur while the Thunder were playing, so we decided that in-game sentiment could point us in the right direction.

Enter the twitteR package. twitteR exposes the Twitter API and allows users to pull tweets (and some metadata) into R as a data frame. There are numerous tutorials demonstrating how to set up developer access to Twitter's API through the various keys and tokens, so that will not be covered here; the code below assumes you have proper access. For the sentiment analysis itself, the tidytext package provides text analytics functions in a "tidy" workflow, including methods for parsing sentiment out of words.

Sentiment analysis, loosely defined, is categorizing an element, such as a tweet, as positive or negative. This can be done by labeling each word, each phrase, or the entire tweet. For this exercise, we will label each word from each tweet carrying the #ThunderUp hashtag as positive or negative. At the end, we will plot the proportion of positive and negative words over the course of a game to visualize sentiment over time. The tidytext package provides a sentiment lexicon covering thousands of words, which we can join our tweet words against to label them accordingly.

Since we will be plotting the sentiment of Thunder tweets over time, we will need to handle and manipulate the Twitter data much like a time series dataset. The Thunder played the Knicks last night, so we will use that game as our example.

Setup

To start, we will load the required packages and ingest the tweets into our environment.

#load packages
library(twitteR)
library(purrr)
library(dplyr)
library(stringr)
library(lubridate)
library(scales)
library(tidytext)
library(tidyr)
library(ggplot2)
library(ggfortify)

# establish connection
setup_twitter_oauth(consumerKey, consumerSecret, accessToken, accessTokenSecret)
## [1] "Using direct authentication"
# ingest tweets into R
tweets <- searchTwitter("#ThunderUp", n = 3000) %>%
  map_df(as.data.frame)
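
Note that setup_twitter_oauth() above expects four credential objects to already exist in the environment. As a minimal placeholder sketch (run before the authentication call, substituting the keys and tokens from your own Twitter developer app):

# placeholder credentials from your Twitter developer app (not real values)
consumerKey       <- "YOUR_CONSUMER_KEY"
consumerSecret    <- "YOUR_CONSUMER_SECRET"
accessToken       <- "YOUR_ACCESS_TOKEN"
accessTokenSecret <- "YOUR_ACCESS_TOKEN_SECRET"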

Here are the column names available:

##  [1] "text"          "favorited"     "favoriteCount" "replyToSN"    
##  [5] "created"       "truncated"     "replyToSID"    "id"           
##  [9] "replyToUID"    "statusSource"  "screenName"    "retweetCount" 
## [13] "isRetweet"     "retweeted"     "longitude"     "latitude"

Data Cleansing - String Manipulation

Looking at the data, the text of some of these tweets contains characters that represent some action taken on the tweet. For example, tweets starting with "RT @…" have been retweeted. Some tweets contain links to images or GIFs, and other special characters are present that are not essential to our goal here. Using some regular expressions, we can remove many of those characters to clean up the tweets. Likewise, we can filter out the encoding errors introduced by emojis. The preview data below highlights some examples of characters we are not interested in when performing sentiment analysis.

## [1] "RT @NBA: Russell Westbrook arrives in style for work in NYC! #ThunderUp https://t.co/vqwxZVjuAp"                                                     
## [2] "108-91<ed><U+00A0><U+00BD><ed><U+00B4><U+00A5> #ThunderUp #DubNation #OKCThunder #Warriors #nba #nba2k #whynot #Westbrook #RussellWestbrook #paulgeorge… https://t.co/3JUV08nM7h"
## [3] "Destacando aqui Patrick Patterson nos últimos 6 jogos:\n\n- 6.6 PPG\n- 2.8 RPG \n- 51.1% FG (13-24) \n- 58.8% 3PT (10-17)\n\n#ThunderUp"             
## [4] "RT @RealBasketFlow: OKC | #CarmeloAnthony sobre su regreso a #NewYorkCity: \"¿Cómo podría no echar de menos a NY?\" | #ThunderUp #Knicks #NBA…"      
## [5] "RT @NBA: Russell Westbrook arrives in style for work in NYC! #ThunderUp https://t.co/vqwxZVjuAp"                                                     
## [6] "RT @NBA: Russell Westbrook arrives in style for work in NYC! #ThunderUp https://t.co/vqwxZVjuAp"

Now to apply the regex.

# filter out various characters from tweet text
tweets <- tweets %>%
  mutate(text = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", " ", text),
         text = gsub("@\\w+", " ", text),
         text = gsub("[[:punct:]]", " ", text),
         text = gsub("[[:digit:]]", " ", text),
         text = gsub("http\\w+", " ", text),
         text = gsub("[ \t]{2,}", " ", text),
         text = gsub("^\\s+|\\s+$", " ", text),
         text = gsub("[<].*[>]", " ", text),
         text = gsub("[<]*.[>]", " ", text)
  )

# filter out encoding errors for emojis and such
tweets$text <- iconv(tweets$text, "utf-8", "ASCII", sub = " ")

Let’s have another look after scrubbing the text column.

## [1] " Russell Westbrook arrives in style for work in NYC ThunderUp t co vqwxZVjuAp"                                                                        
## [2] "                                      ThunderUp DubNation OKCThunder Warriors nba nba k whynot Westbrook RussellWestbrook paulgeorge    t co JUV nM h"
## [3] "Destacando aqui Patrick Patterson nos   ltimos jogos \n\n PPG\n RPG \n FG \n PT \n\n ThunderUp"                                                       
## [4] " OKC CarmeloAnthony sobre su regreso a NewYorkCity C  mo podr  a no echar de menos a NY ThunderUp Knicks NBA   "                                      
## [5] " Russell Westbrook arrives in style for work in NYC ThunderUp t co vqwxZVjuAp"                                                                        
## [6] " Russell Westbrook arrives in style for work in NYC ThunderUp t co vqwxZVjuAp"

As you can see, there are duplicates of some tweets because they have been retweeted many times over. Depending on the objective at hand, one could choose to handle duplicates accordingly; we are going to leave them intact for this exercise.
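
If duplicates did need to be removed, a minimal sketch (storing the result in a hypothetical tweets_dedup object) could keep one copy of each distinct tweet text with dplyr:

# keep a single copy of each distinct tweet text; this also collapses
# retweets that repeat the same wording
tweets_dedup <- tweets %>%
  distinct(text, .keep_all = TRUE)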

Data Cleansing - Timestamp Manipulation

The Twitter API output contains a timestamp in UTC. In the next step, we will convert the timestamps to Central Time and create some useful columns for grouping later on. Since we will observe sentiment in 15-minute windows as we plot it over time, we need to normalize the timestamps into 15-minute increments; the lubridate package provides the floor_date() function for exactly that.

# convert date to central time zone
tweets$dates <- format(tweets$created, tz="America/Chicago",usetz=TRUE)

# parse date out of timestamp column for filtering purposes
tweets$day <- as.Date(as.POSIXct(tweets$dates, 'America/Chicago'))

# parse hour for possible filtering / grouping purposes using lubridate package
tweets$hr <- hour(tweets$dates)

# floor_date for normalized time windows using lubridate package
tweets$time <- floor_date(tweets$created, unit = "15 minutes")

# convert normalized time back to central (subtract 6 hours, i.e. 21,600 seconds)
tweets$time <- tweets$time - 21600
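
The subtraction above assumes a fixed six-hour offset between UTC and Central Standard Time. As a sketch of a daylight-saving-safe alternative (not part of the original workflow; it stores the result in a hypothetical time_ct column), lubridate's with_tz() can handle the conversion:

# alternative to the fixed offset: keep the same instant but display it
# in Central Time, letting lubridate account for daylight saving time
tweets$time_ct <- with_tz(floor_date(tweets$created, unit = "15 minutes"),
                          tzone = "America/Chicago")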

Sentiment Mining

Now that the text column is cleaned up and the timestamps are transformed for grouping and plotting, the next step is to attach a sentiment label to each word of each tweet. tidytext provides access to the NRC Word-Emotion Association Lexicon, which associates words with ten sentiment categories: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. For this exercise we are going to stick with positive and negative words only. We will create a data frame in which each word of each tweet is a row, then join the sentiment labels from the lexicon onto the words.

# sentiment reference
sentimentRef <- sentiments %>%
  filter(lexicon == "nrc") %>%
  dplyr::select(word, sentiment)

# unnest each tweet word by word
tweet_words <- tweets %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]")) %>%
  group_by(word) %>%
  mutate(counts = n()) %>%
  ungroup()

# join tweet words to sentiment reference 
sentiment <- tweet_words %>%
  inner_join(sentimentRef, by = "word")

# preview
head(sentiment[,c("word","sentiment")])
## # A tibble: 6 x 2
##      word    sentiment
##     <chr>        <chr>
## 1   court        anger
## 2   court anticipation
## 3   court         fear
## 4 winning anticipation
## 5 winning      disgust
## 6 winning          joy

Visualize the Data

We’re all set to visualize the sentiment during the course of the game. Using ggplot, in tidy fashion, we can pipe all of our filters to ensure that our tweets occur during the game, and the words must be either a positive or negative sentiment. We will plot the proportion of positive and negative words in 15 minute increments. Behold:

sentiment %>%
  filter(dates >= '2017-12-16 18:45:00',
         dates <= '2017-12-16 21:30:00',
         sentiment %in% c("positive","negative")) %>%
  count(time, sentiment) %>%
  group_by(time) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(time, round(prop*100,0), color = sentiment)) +
  geom_line(size = 1.5) + 
  theme(axis.title=element_text(face="bold",size="14", color="black")) + 
  theme(axis.text = element_text(face="bold",size="14", color="black")) +
  xlab("Time") + ylab("Percentage of Tweets") + 
  theme(legend.title=element_blank()) + theme(legend.text = element_text(size = 12, face = "bold")) + 
  ggtitle("Sentiment of #ThunderUp Tweets During Knicks Game") + theme(plot.title = element_text(face="bold", size=16)) + 
  scale_color_manual(values=c("red","darkgreen")) + ylim(0,100)

Conclusion

Given it was Carmelo’s return to NYC and the Thunder played well in the first quarter, you can see the positive vibes on Twitter until right around halftime. The Thunder made a few runs to keep it close, but the Knicks controlled the second half. Later in the game, it got ugly and you can somewhat infer that from the negative tweets.

Further steps to make this a little more scientific would be to determine how to handle retweets, which could no doubt dilute this analysis. For example, suppose the Thunder are losing by 20 but the official Thunder account tweets a happy-birthday message that gets retweeted heavily; those retweets would swamp the in-game sentiment and render this approach useless. It would not be difficult to filter out retweets altogether, or to exclude tweets from popular accounts.
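
As a rough sketch of that first idea (using the isRetweet flag that twitteR returns, applied to the raw tweets data frame and stored in a hypothetical original_tweets object):

# keep only original tweets so heavily-retweeted posts cannot dominate the counts
original_tweets <- tweets %>%
  filter(!isRetweet)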

Another step might be to quantify sentiment more intelligently. Measuring sentiment at the word level may introduce too much noise; it might be better to measure sentiment at the tweet level, counting the number of positive and negative words in each tweet and scoring the tweet by that count.
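
As a rough sketch of that tweet-level idea, building on the sentiment data frame from earlier (the tweet_scores name and the use of the id column to identify tweets are assumptions for illustration):

# score each tweet by its net count of positive minus negative words
tweet_scores <- sentiment %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(id, time, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(score = positive - negative)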