Text Mining SuperBowl 2019

It has been talked about a lot, criticized a lot, and it is the hot topic of this week. Yes, I’m talking about Super Bowl LIII. It was not considered the best Super Bowl ever, but it still brought excitement, especially in Boston. I wanted to look at the highlights of the Super Bowl on Twitter, so I text-mined tweets under a few hashtags posted since February 1st, 2019. PS: This is only my second Super Bowl, and I didn’t know much about American football before.
I decided to investigate the following hashtags:
* #SuperBowl2019 + #SuperBowlSunday
* #SuperBowlLIII
* #SuperBowl + #halftimeshow
* #PatriotsParade
After I pull the tweets, I will clean the text with tidy packages and then do some sentiment, n-gram and wordcloud analysis to understand which topics were tweeted about the most.

Get Ready & Setup Your Twitter API

You need two packages to get started:
* twitteR : this is the package that pulls tweets through the Twitter API
* tm : this is a text-mining package used for many natural language processing (NLP) tasks
You also have to set up an account on developer.twitter.com and describe your use case. After that you can create your own unique keys and tokens. I will regenerate my keys and tokens, so copy-pasting the ones below won’t work if you want to try this method. The process is neither long nor complicated.

library("twitteR")
library("tm")
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

#to get your consumerKey and consumerSecret see the twitteR documentation for instructions
consumer_key <- 'wGHv0WAyCkyZHf6QWer3VeVP1'
consumer_secret <- '3F91HXVkqYxCkwiHNTNN1KD5I1fKOucixOz9DtMlji779a9eBj'
access_token <- '164255500-A8fk1u7M3EHb3XHsWxuZ5DjsJROcOMqdQuJt8Afy'
access_secret <- 'cBIzHbUvBcKlP6E5GhJMbskrjWZM8b7q1VUQEYX0QYfPh'

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

## [1] "Using direct authentication"
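
If you would rather not paste keys into a script at all, one option is to read them from environment variables. The sketch below is only an illustration: the function name and the four variable names (TWITTER_CONSUMER_KEY and so on) are my own assumptions, not something twitteR requires.

```r
# Hypothetical helper: reads the four credentials from environment variables.
# The variable names are assumptions; define them in ~/.Renviron or via Sys.setenv().
read_twitter_credentials <- function() {
  keys <- c(
    consumer_key    = Sys.getenv("TWITTER_CONSUMER_KEY"),
    consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
    access_token    = Sys.getenv("TWITTER_ACCESS_TOKEN"),
    access_secret   = Sys.getenv("TWITTER_ACCESS_SECRET")
  )
  # Sys.getenv() returns "" for unset variables, so empty strings mean "missing"
  missing <- names(keys)[keys == ""]
  if (length(missing) > 0) {
    stop("Missing credentials: ", paste(missing, collapse = ", "))
  }
  keys
}

# Usage (commented out because it needs real keys and the twitteR package):
# creds <- read_twitter_credentials()
# setup_twitter_oauth(creds[["consumer_key"]], creds[["consumer_secret"]],
#                     creds[["access_token"]], creds[["access_secret"]])
```

This keeps the secrets out of the published post while the rest of the workflow stays the same.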

Searching for Tweets About SuperBowl 2019
I will pull up to 5000 tweets per search, posted since February 1st, 2019 (February 2nd for some of the searches), and set the language to English. Only for #PatriotsParade will I narrow the search geographically, since the parade happened in Boston. For that hashtag I set the geocode to Boston and the search date to February 4th, 2019. I found the latitude and longitude at:
https://www.findlatitudeandlongitude.com/?loc=boston+ma&id=43320#.XFnnZhlKjOQ

#searching twitter  for #SuperBowl2019 + #SuperBowlSunday
sb <- twitteR::searchTwitter('#SuperBowl2019 + #SuperBowlSunday', n = 5000, since = '2019-02-01', lang= "en", retryOnRateLimit = 1e4)
sb = twitteR::twListToDF(sb)

head(sb$text)

#searching twitter for #SuperBowlLIII 
sb2 <- twitteR::searchTwitter('#SuperBowlLIII', n = 5000, since = '2019-02-02', lang= "en", retryOnRateLimit = 1e3)
sb2 = twitteR::twListToDF(sb2)
head(sb2$text)

#searching twitter for #SuperBowl + #halftimeshow
hts2 <- twitteR::searchTwitter('#SuperBowl + #halftimeshow', n = 5000, since = '2019-02-02', lang= "en", retryOnRateLimit = 1e3)
hts2 = twitteR::twListToDF(hts2)
head(hts2$text)

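#searching twitter for #PatriotsParade 
#NOTE: twitteR expects geocode as "latitude,longitude,radius" (radius in mi or km); the radius number is missing in the string below
pats <- twitteR::searchTwitter('#PatriotsParade', n = 5000, since = '2019-02-04', lang="en", geocode = "42.358431,-71.059773,km", retryOnRateLimit = 1e3)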
pts = twitteR::twListToDF(pats)

head(pts$text)
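
As far as I understand the twitteR documentation, the geocode argument is a single string of the form "latitude,longitude,radius", with the radius given in miles (mi) or kilometers (km). Here is a small sketch of a helper that builds such a string. The helper name and the 10 km radius are my own assumptions for illustration, not the values used for the search above.

```r
# Hypothetical helper that builds a geocode string for twitteR::searchTwitter().
# The 10 km radius used below is an assumed value, chosen only for illustration.
make_geocode <- function(lat, lon, radius, unit = c("km", "mi")) {
  unit <- match.arg(unit)                      # only "km" or "mi" are accepted
  stopifnot(is.numeric(radius), radius > 0)    # a radius number is required
  paste0(lat, ",", lon, ",", radius, unit)
}

boston_geocode <- make_geocode(42.358431, -71.059773, radius = 10)
print(boston_geocode)
```

The resulting string can then be passed as geocode = boston_geocode in the search call.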

Data Clean-Up

Here I will use gsub and tidy methods to clean the data. Removing punctuation, converting to lower case and splitting the text into one word per row are all done by the unnest_tokens function.
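
To make the tokenizing step concrete, here is a tiny sketch on made-up text (the two sentences are invented, not real tweets). It shows unnest_tokens turning a text column into one lower-case word per row with the punctuation dropped.

```r
library(dplyr)
library(tidytext)

# Made-up example text, standing in for the stripped_text column
toy <- tibble(stripped_text = c("Go Pats!!! What a GAME, Boston.",
                                "Best parade ever"))

# One row per word; lower-casing and punctuation removal happen by default
toy_tokens <- toy %>%
  unnest_tokens(word, stripped_text)

print(toy_tokens)
```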

library(dplyr)
library(tidytext)
library(tidyverse)

###############################################
#   #SuperBowl2019 + #SuperBowlSunday
###############################################
#Removing the URLs 
sb$stripped_text <- gsub("http.*", "", sb$text)

#tokenizing: tidying the text
sb_tweets_clean <- sb %>% 
  dplyr::select(stripped_text) %>%
  unnest_tokens(word, stripped_text)

###############################################
#    #SuperBowlLIII 
###############################################
#Removing the URLs 
sb2$stripped_text <- gsub("http.*", "", sb2$text)

#tokenizing: tidying the text
sb2_tweets_clean <- sb2 %>% 
  dplyr::select(stripped_text) %>%
  unnest_tokens(word, stripped_text)


###############################################
#    #SuperBowl + #halftimeshow
###############################################
#Removing the URLs 
hts2$stripped_text <- gsub("http.*", "", hts2$text)

#tokenizing: tidying the text
hts2_tweets_clean <- hts2 %>% 
  dplyr::select(stripped_text) %>%
  unnest_tokens(word, stripped_text)

###############################################
#    #PatriotsParade 
###############################################
pts$stripped_text <- gsub("http.*", "", pts$text)

#tokenizing: tidying the text
pts_tweets_clean <- pts %>% 
  dplyr::select(stripped_text) %>%
  unnest_tokens(word, stripped_text)
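
One thing worth knowing about the pattern "http.*": it removes the URL and everything that comes after it in the tweet. The sketch below, on an invented tweet with a made-up link, compares it with a pattern that stops at the first whitespace. This is only an alternative to consider; the results in this post were produced with "http.*".

```r
# Invented tweet with a made-up link, for illustration only
tweet <- "Sweet Victory! https://t.co/abc123 best halftime ever"

# Pattern used in this post: drops the URL and all text after it
strip_to_end <- gsub("http.*", "", tweet)

# Alternative: drops only the URL itself (up to the next whitespace)
strip_url_only <- gsub("http[^[:space:]]*", "", tweet)

print(strip_to_end)
print(strip_url_only)
```

Which one is preferable depends on whether the words after a link matter for the analysis.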

Actually I’m not done with the cleaning, because the tweets still include words such as “a”, “it” etc., which are called “stop words”. These words don’t contribute to our analysis at all. There are also other words that won’t give us insights, such as “super” and “bowl”. So as I look at each hashtag, I’m going to remove some undesired words of my own as well.
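
The two-step filtering described above can be sketched on a handful of made-up words. Here anti_join drops the standard stop words that ship with tidytext, and a small custom vector (named my_undesired only for this example) drops the rest.

```r
library(dplyr)
library(tidytext)

# Made-up tokens, standing in for one of the *_tweets_clean tables
toy_words <- tibble(word = c("the", "patriots", "it", "rt", "brady"))

# Custom list for this toy example only
my_undesired <- c("rt")

toy_kept <- toy_words %>%
  anti_join(stop_words, by = "word") %>%   # remove standard stop words
  filter(!word %in% my_undesired)          # remove my own undesired words

print(toy_kept)
```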

#SuperBowl2019 + #SuperBowlSunday

library(ggplot2) #we will use ggplot later to plot some cool graphs
data("stop_words")
library(stopwords)
library(tm)
#my list of words to not include
undesired_words <- c("rt", "angeles", "2019", "superbowl", "bowl", "super")

######################
#Creating a Wordcloud
######################
library(wordcloud)  
library(RColorBrewer) #adding some color to our wordcloud

sb_tweets_clean %>% 
  count(word, sort=TRUE) %>%
  anti_join(stop_words) %>% #exclude stopwords
  filter(!word %in% undesired_words) %>% #excluding the undesired words
  filter(nchar(word)>3) %>% #only including words that have more than 3 characters
  with(wordcloud(word, #add the word
                 n,  #word-count
                 main= "Wordcloud for  #SuperBowl2019 + #SuperBowlSunday", 
                 scale=c(5,0.5),
                 use.r.layout=FALSE,
                 max.words = 40, #limit the number of words
                 colors=brewer.pal(8, "Dark2")))#adding some color

Some Insights
It seems like Patriots fans tweeted more than Rams fans. Maroon 5’s performance was also talked about. There is one misspelled “supebowl”, which might be related to beer consumption during the game :P. People also watched the game on YouTube. It also seems like Tom Brady was highlighted in most of the tweets. A last interesting point might be #thatericalper: Eric Alper is a Canadian musician, but I don’t know what the relationship is between him and the Super Bowl.

#SuperBowlLIII

undesired_words_2 <- c("rt", "angeles", "2019", "superbowl", "bowl", "super", "superbowlliii", "don't")

sb2_tweets_clean %>% 
  count(word, sort=TRUE) %>%
  anti_join(stop_words) %>% #exclude stopwords
  filter(!word %in% undesired_words_2) %>% #excluding the undesired words
  filter(nchar(word)>3) %>% #only including words that have more than 3 characters
  with(wordcloud(word, #add the word
                 n,  #word-count
                 scale=c(4, 0.5),
                 use.r.layout=FALSE,
                 max.words = 40, #limit the number of words
                 colors=brewer.pal(8, "Dark2")))#adding some color

Some Insights: Raising Awareness
In contrast to the #SuperBowl2019 hashtags, people used this hashtag to tweet more about social injustice. According to the wordcloud, former NFL player Colin Kaepernick, who started the “take a knee” protest, saw support through the words “imwithkap” and “kaepernick”. Other words like “children”, “mothers”, “killed” and “chuckmodi1” all point to police brutality and racial injustice. Additionally, it seems like Rams fans used this hashtag to tweet.

Bigrams for #SuperBowlLIII
It might be interesting to see the word pairs for this particular hashtag. In this part I will analyze the word pairs (bigrams) that were used most frequently under #SuperBowlLIII. The code loads the widyr package, although the bigrams themselves are created with unnest_tokens(token = "ngrams", n = 2) from tidytext. Once the bigrams are created, you will notice that there are again undesired words that don’t give us any insight. To filter them out, we first need to split each pair into two columns with separate() and then filter each column. That’s why we clean the data before graphing it to gather insights.
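
As a quick illustration of those two steps, here is a sketch on a single made-up sentence. It first creates the bigrams and then splits each pair into the columns word1 and word2, so that each side can be filtered on its own.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# One made-up sentence, standing in for the stripped_text column
toy <- tibble(stripped_text = "colin kaepernick took a knee")

toy_bigrams <- toy %>%
  unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 2) %>% # overlapping word pairs
  separate(paired_words, c("word1", "word2"), sep = " ")                  # one column per word

print(toy_bigrams)
```

A five-word sentence gives four overlapping pairs, and pairs such as “took a” are the ones the stop-word filter removes afterwards.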

library(widyr) #loaded here, but the bigrams below come from tidytext's unnest_tokens

undesired_words_3 <- c("rt", "angeles", "2019", "superbowl", "bowl", "super", "superbowlliii", "don't")

#creating the bigrams
sb2_paired_words <- sb2 %>% 
  dplyr::select(stripped_text) %>%
  unnest_tokens(paired_words, stripped_text, token="ngrams", n=2) #bigrams

#word-pair counts
sb2_paired_words %>%
  count(paired_words, sort=TRUE)

## # A tibble: 24,227 x 2
##    paired_words                n
##    <chr>                   <int>
##  1 superbowlliii rt          711
##  2 super bowl                511
##  3 of the                    271
##  4 in the                    244
##  5 the superbowlliii         234
##  6 if you                    191
##  7 the super                 190
##  8 the patriots              183
##  9 at the                    181
## 10 superbowl superbowlliii   165
## # ... with 24,217 more rows

#separating to filter
sb2_separated_words <- sb2_paired_words %>% 
  separate(paired_words, c("word1", "word2"), sep = " ") 

#Excluding the stop words from n-grams
sb2_tweets_filter <- sb2_separated_words %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(nchar(word1)>3) %>%
  filter(!word1 %in% undesired_words_2) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word2 %in% undesired_words_2) %>%
  filter(nchar(word2)>3) 

#view the filtered word pairs
sb2_word_counts <- sb2_tweets_filter %>%
  count(word1, word2, sort=TRUE)

library(kableExtra)
head(sb2_word_counts) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

word1	word2	n
colin	kaepernick	101
children	killed	95
chuckmodi1	colin	95
bey_legion	chloexhalle	93
chloexhalle	perform	88
debut	album	88

Now we can visualize our network of bigrams.

library(tidyr)
library(igraph)
library(ggraph)

sb2_word_counts %>% 
  filter(n>=50) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes()) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label=name), vjust=1.8, size=5)

Word Pairs used under #SuperBowlLIII

More insights
Apparently Chuck Modi’s tweet drew a lot of attention. People supported Kaepernick by referring to Chuck Modi’s tweet:
“Colin Kaepernick took a knee for us!”

Everyone has opinion on Kap, but what about mothers who had their children killed by police. Pls listen: #ImWithKap #TakeAKnee on #SuperBowlLIII
Reference: https://twitter.com/ChuckModi1/status/1092197280701169665
Additionally, this month being Black History Month also made raising awareness a popular topic.

There are other word pairs related to the performances. The performance of Beyonce’s protégées Chloe X Halle also caught attention during the Super Bowl. “bey_legion” here refers to Beyonce, which I didn’t know before.

This year we heard tons of complaints that the Super Bowl was really boring… Well, some people actually described themselves as “superbored” by the Super Bowl.

It wouldn’t be fair not to give credit to Pampers for their ad. John Legend and Adam Levine’s performance about diapers was also a hot topic.

It seems like there were more tweets about other topics than about the game itself. On Twitter, people shared opinions on Super Bowl related topics rather than on the game.

#SuperBowl + #halftimeshow

Now we can look at some insights on the half time show. Maroon 5’s performance and Adam Levine being “half time naked” were a big deal. So let’s explore how this was actually reflected on Twitter. Before creating some graphs, we need to do some more cleaning and filtering. For this one I am going to add “halftimeshow” as an undesired word. Let’s have a look at the most frequently used words about the half time show. I will create the plot with ggplot.

undesired_words_4 <- c("rt", "angeles", "2019", "superbowl", "bowl", "super", "superbowlliii", "halftimeshow", "halftime", "literally")


#plotting the frequency plot for the words mentioned under the hashtag

hts2_tweets_clean %>% 
  count(word, sort=TRUE) %>%
  top_n(50) %>% #keeps the 50 most frequent words; this runs before the stop words are removed, so fewer words may remain
  anti_join(stop_words) %>%
  filter(!word %in% undesired_words_4) %>% #excluding the undesired words
  filter(nchar(word)>3) %>% #only including words that have more than 3 characters
  mutate(word= reorder(word, n)) %>%
  ggplot(aes(x=word, y = n)) +
  geom_bar(stat= "identity",
           fill="blue") +  #coloring the bars blue
  xlab(NULL) + #removing the x-axis label
  ylab(NULL) + #removing the y-axis label
  ggtitle("Which words were most frequent in half time show tweets?")+
  theme(plot.title = element_text(size=10))+ #changing the text size
  coord_flip()
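
A side note on the order of the steps in the plot code: top_n(50) runs before the stop words are removed, so the plot can end up with far fewer than 50 words. The sketch below shows the effect on made-up counts with a tiny made-up stop list (both are assumptions for illustration, not data from the tweets).

```r
library(dplyr)

# Made-up word counts and a made-up stop list, for illustration only
toy_counts <- tibble(word = c("the", "a", "spongebob", "sweet", "victory"),
                     n    = c(10, 8, 6, 5, 4))
toy_stop   <- tibble(word = c("the", "a"))

# Order used in the plot code: take the top words first, then drop stop words
top_then_filter <- toy_counts %>%
  top_n(3, n) %>%
  anti_join(toy_stop, by = "word")

# Alternative order: drop stop words first, then take the top words
filter_then_top <- toy_counts %>%
  anti_join(toy_stop, by = "word") %>%
  top_n(3, n)

print(top_then_filter)
print(filter_then_top)
```

With the first order only one word survives in this toy case, while the second order keeps three.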

Insight on Halftime Show

Apparently SpongeBob and David Glen Eisley’s Sweet Victory were mentioned a lot in the tweets about the half time show. Actually more than Maroon 5! So Adam Levine going partially naked was not as interesting as SpongeBob. We can look at the word pairs for this hashtag to understand more about the opinions on the halftime show.

#PatriotsParade

I’m in Boston right now, so I saw with my own eyes that the parade was very crowded and people were very excited about it. I want to do a sentiment analysis on the tweets of people from Boston.
The tidytext package offers three general-purpose sentiment lexicons (a lexicon is a dictionary that maps words to sentiments or scores):
* AFINN
* nrc
* bing
Since I want to understand how people were actually feeling about the parade, I will use the nrc lexicon, which labels words with emotions.
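
To show how a lexicon join works in general, here is a small sketch on made-up words. It uses the bing lexicon rather than nrc, only because bing ships with tidytext, while in recent tidytext versions nrc is downloaded through the textdata package and asks you to accept its license first. The three words are invented for the example.

```r
library(dplyr)
library(tidytext)

# Made-up tokens, standing in for pts_tweets_clean
toy_words <- tibble(word = c("happy", "parade", "disrespectful"))

# inner_join keeps only the words that appear in the lexicon
toy_sentiment <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word")

print(toy_sentiment)
```

Words that are not in the lexicon simply drop out of the result, which is why only part of the vocabulary shows up in a sentiment plot.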

#get_sentiments("nrc")

undesired_words_5 <- c("rt", "angeles", "2019", "superbowl", "bowl", "super", "superbowlliii", "parade")

pts_tweets_clean %>%
  inner_join(get_sentiments("nrc")) %>% #using this lexicon
  count(word, sentiment, sort=TRUE) %>% 
  anti_join(stop_words) %>%
  distinct() %>%
  filter(!word %in% undesired_words_5) %>% #excluding the undesired words
  filter(nchar(word)>3) %>%
  group_by(sentiment) %>%
  top_n(5) %>% #showing the top 5 words for every emotion
  ungroup() %>%
  ggplot(aes(word, n, fill=sentiment))+
  geom_col(show.legend = FALSE) + #hiding the legend
  facet_wrap(~sentiment, scales="free_y") + #adding multiple graphs
  xlab(NULL) +
  ylab("Emotions about the Patriots Parade")+
  theme(axis.text.x=element_text(size=6, angle=45)) +
  coord_flip()

It looks like there are mixed feelings about the parade. Although people enjoyed the Pats’ championship, it seems like there was some negativity too. This might be due to the parade being overwhelmingly crowded. The negative words are drinking, disrespectful, police and ambulance. Like every overcrowded event, the parade had its share of congestion as well.

SuperBowl2019

merve ozgul