It has been talked about too much and criticized too much, and it is the hot topic of this week. Yes, I’m talking about Super Bowl LIII. It was not considered the best Super Bowl ever, but it still brought excitement, especially in Boston. I wanted to look at the highlights of the Super Bowl on Twitter, so I did some text mining on tweets and hashtags posted from February 1st, 2019 onwards. PS: This is only my second Super Bowl, and I didn’t know much about American football before.
I decided to investigate the following hashtags:
* #SuperBowl2019 + #SuperBowlSunday
* #SuperBowlLIII
* #SuperBowl + #halftimeshow
* #PatriotsParade
After pulling the tweets, I will clean the text with tidy packages and then run sentiment, n-gram and wordcloud analyses to see which topics were tweeted about the most.
You need two packages to pull and process tweets from Twitter:
* twitteR: the client used to query the Twitter API
* tm: a text-mining package widely used for natural language processing (NLP)
Then you also have to set up an account on developer.twitter.com and create your own use case. After that you can generate your own unique keys and tokens. I will regenerate my tokens and keys, so you can’t copy and paste mine if you want to try this method, but generating your own is neither complicated nor long.
library("twitteR")
library("tm")
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
#to get your consumerKey and consumerSecret see the twitteR documentation for instructions
#the credentials are read from environment variables so they never appear in the post;
#the variable names below are an assumption, use whatever names you set for your own keys
consumer_key <- Sys.getenv("TWITTER_CONSUMER_KEY")
consumer_secret <- Sys.getenv("TWITTER_CONSUMER_SECRET")
access_token <- Sys.getenv("TWITTER_ACCESS_TOKEN")
access_secret <- Sys.getenv("TWITTER_ACCESS_SECRET")
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
## Searching for Tweets About SuperBowl 2019
I will pull 5000 tweets per hashtag posted since the beginning of February 2019 and set the language to English. I will go into more detail only for #PatriotsParade, as the parade took place in Boston. For that hashtag I will set the geocode to Boston and the search start date to the 4th of February 2019. I looked up the latitude and longitude at:
https://www.findlatitudeandlongitude.com/?loc=boston+ma&id=43320#.XFnnZhlKjOQ
#searching twitter for #SuperBowl2019 + #SuperBowlSunday
sb <- twitteR::searchTwitter('#SuperBowl2019 + #SuperBowlSunday', n = 5000, since = '2019-02-01', lang= "en", retryOnRateLimit = 1e4)
sb = twitteR::twListToDF(sb)
head(sb$text)
#searching twitter for #SuperBowlLIII
sb2 <- twitteR::searchTwitter('#SuperBowlLIII', n = 5000, since = '2019-02-02', lang= "en", retryOnRateLimit = 1e3)
sb2 = twitteR::twListToDF(sb2)
head(sb2$text)
#searching twitter for #SuperBowl + #halftimeshow
hts2 <- twitteR::searchTwitter('#SuperBowl + #halftimeshow', n = 5000, since = '2019-02-02', lang= "en", retryOnRateLimit = 1e3)
hts2 = twitteR::twListToDF(hts2)
head(hts2$text)
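#searching twitter for #PatriotsParade
#geocode must be "latitude,longitude,radius" with the radius in mi or km;
#the radius value was missing here, so 10km is an assumed placeholder
pats <- twitteR::searchTwitter('#PatriotsParade', n = 5000, since = '2019-02-04', lang="en", geocode = "42.358431,-71.059773,10km", retryOnRateLimit = 1e3)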
pts = twitteR::twListToDF(pats)
head(pts$text)
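The geocode argument of searchTwitter() takes a single string of the form latitude,longitude,radius, with the radius given in miles (mi) or kilometers (km). As a minimal sketch, the helper below builds such a string. Note that make_geocode() and the 10 km radius are hypothetical and only illustrate the format.

```r
# Hypothetical helper (not part of twitteR): builds a "lat,long,radius" string
make_geocode <- function(lat, lon, radius, unit = c("km", "mi")) {
  unit <- match.arg(unit)
  stopifnot(is.numeric(lat), is.numeric(lon), is.numeric(radius), radius > 0)
  # six decimals matches the precision of the coordinates used in this post
  sprintf("%.6f,%.6f,%g%s", lat, lon, radius, unit)
}

# Boston coordinates from the post; the 10 km radius is an assumed value
boston_geocode <- make_geocode(42.358431, -71.059773, radius = 10)
print(boston_geocode)
```

The resulting string could then be passed as the geocode argument of the #PatriotsParade search.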
Here I will use gsub and tidy methods to clean the data. Removing punctuation, converting to lower case and splitting the text into one word per row are all handled by the unnest_tokens() function.
library(dplyr)
library(tidytext)
library(tidyverse)
###############################################
# #SuperBowl2019 + #SuperBowlSunday
###############################################
#Removing the URLs
sb$stripped_text <- gsub("http.*", "", sb$text)
#tokenizing: tidying the text
sb_tweets_clean <- sb %>%
dplyr::select(stripped_text) %>%
unnest_tokens(word, stripped_text)
###############################################
# #SuperBowlLIII
###############################################
#Removing the URLs
sb2$stripped_text <- gsub("http.*", "", sb2$text)
#tokenizing: tidying the text
sb2_tweets_clean <- sb2 %>%
dplyr::select(stripped_text) %>%
unnest_tokens(word, stripped_text)
###############################################
# #SuperBowl + #halftimeshow
###############################################
#Removing the URLs
hts2$stripped_text <- gsub("http.*", "", hts2$text)
#tokenizing: tidying the text
hts2_tweets_clean <- hts2 %>%
dplyr::select(stripped_text) %>%
unnest_tokens(word, stripped_text)
###############################################
# #PatriotsParade
###############################################
pts$stripped_text <- gsub("http.*", "", pts$text)
#tokenizing: tidying the text
pts_tweets_clean <- pts %>%
dplyr::select(stripped_text) %>%
unnest_tokens(word, stripped_text)
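One thing to be aware of: the pattern "http.*" removes the URL and everything that follows it in the tweet. If only the link itself should go, a narrower pattern can be used. The sketch below uses base R only; strip_urls() and the sample tweet are made up for illustration.

```r
# Hypothetical helper: removes only the URL, not the text after it
strip_urls <- function(text) {
  no_urls <- gsub("http[^[:space:]]*", "", text)
  # collapse the whitespace left behind by the removed URL
  trimws(gsub("[[:space:]]+", " ", no_urls))
}

tweet <- "Go Pats! https://t.co/abc123 what a game"  # made-up example tweet

greedy   <- gsub("http.*", "", tweet)  # pattern used in this post
narrower <- strip_urls(tweet)

print(greedy)    # the words after the link are gone
print(narrower)  # only the link is gone
```

Whether the difference matters depends on how often links appear in the middle of a tweet rather than at the end.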
Actually I’m not done with the cleaning, because the tweets still include words such as “a”, “it” etc., which are called “stop words”. These words don’t contribute to our analysis at all. There are also other words that won’t give us insights, such as “super bowl”. So as I look at each hashtag, I’m going to remove some undesired words of my own.
library(ggplot2) #we will use ggplot later to plot some cool graphs
data("stop_words")
library(stopwords)
library(tm)
#my list of words to not include
undesired_words <- c("rt", "angeles", "2019", "superbowl", "bowl", "super")
######################
#Creating a Wordcloud
######################
library(wordcloud)
library(RColorBrewer) #adding some color to our wordcloud
sb_tweets_clean %>%
count(word, sort=TRUE) %>%
anti_join(stop_words) %>% #exclude stopwords
filter(!word %in% undesired_words) %>% #excluding the undesired words
filter(nchar(word)>3) %>% #only including words that have more than 3 chr
with(wordcloud(word, #add the word
n, #word-count
main= "Wordcloud for #SuperBowl2019 + #SuperBowlSunday",
scale=c(5,0.5),
use.r.layout=FALSE,
max.words = 40, #limit the number of words
colors=brewer.pal(8, "Dark2")))#adding some color
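The filtering steps in the pipeline above can be tried on a toy word list. This is a base R sketch with made-up counts and a tiny stand-in for the stop word list, so it runs without any of the packages loaded above. It also shows that nchar(word) > 3 keeps only words with four or more characters.

```r
# Made-up word counts, standing in for the output of count(word, sort = TRUE)
toy_counts <- data.frame(
  word = c("the", "patriots", "rt", "tom", "brady", "rams", "superbowl"),
  n    = c(120L, 80L, 75L, 40L, 38L, 30L, 25L),
  stringsAsFactors = FALSE
)

toy_stop_words <- c("the", "a", "it")   # tiny stand-in for stop_words$word
toy_undesired  <- c("rt", "superbowl")  # stand-in for undesired_words

keep <- !(toy_counts$word %in% toy_stop_words) &  # like anti_join(stop_words)
  !(toy_counts$word %in% toy_undesired) &         # like filter(!word %in% ...)
  nchar(toy_counts$word) > 3                      # drops "tom" (3 characters)

toy_filtered <- toy_counts[keep, ]
print(toy_filtered)
```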
Some Insights
It seems like Patriots fans tweeted more than Rams fans. Maroon 5’s performance seems to have been talked about as well. There is also one misspelled “supebowl”, which might be related to beer consumption during the game :P. People watched the game on YouTube. Tom Brady also seems to have been highlighted in many of the tweets. A last interesting point is #thatericalper: Eric Alper is a Canadian music publicist, but I don’t know what his connection to the Super Bowl is.
undesired_words_2 <- c("rt", "angeles", "2019", "superbowl", "bowl", "super", "superbowlliii", "don't")
sb2_tweets_clean %>%
count(word, sort=TRUE) %>%
anti_join(stop_words) %>% #exclude stopwords
filter(!word %in% undesired_words_2) %>% #excluding the undesired words
filter(nchar(word)>3) %>% #only including words that have more than 3 chr
with(wordcloud(word, #add the word
n, #word-count
scale=c(4, 0.5),
use.r.layout=FALSE,
max.words = 40, #limit the number of words
colors=brewer.pal(8, "Dark2")))#adding some color
Some Insights: Raising Awareness
In contrast to the #SuperBowl2019 hashtags, people used this hashtag to tweet more about social injustice. According to the wordcloud, former NFL player Colin Kaepernick, who started the “take a knee” protest, also saw support through the words “imwithkap” and “kaepernick”. Other words like “children”, “mothers”, “killed” and “chuckmodi1” all point to police brutality and racial injustice. Additionally, it seems like Rams fans used this hashtag to tweet.
## Bigrams for #SuperBowlLIII
It might be interesting to see the word pairs for this particular hashtag. In this part I will analyze the word pairs (bigrams) that were used most frequently with #SuperBowlLIII. I load the widyr package here, but the bigrams themselves are created with unnest_tokens() and token = "ngrams". After creating the bigrams, you will notice that there are again undesired words that don’t give us any insight. We remove them by filtering each word of the pair, but first we need to split the pairs into two columns with separate(). That’s why we clean the data before graphing it.
library(widyr) #for n-grams
undesired_words_3 <- c("rt", "angeles", "2019", "superbowl", "bowl", "super", "superbowlliii", "don't")
#creating the bigrams
sb2_paired_words <- sb2 %>%
dplyr::select(stripped_text) %>%
unnest_tokens(paired_words, stripped_text, token="ngrams", n=2) #bigrams
#word-pair counts
sb2_paired_words %>%
count(paired_words, sort=TRUE)
## # A tibble: 24,227 x 2
## paired_words n
## <chr> <int>
## 1 superbowlliii rt 711
## 2 super bowl 511
## 3 of the 271
## 4 in the 244
## 5 the superbowlliii 234
## 6 if you 191
## 7 the super 190
## 8 the patriots 183
## 9 at the 181
## 10 superbowl superbowlliii 165
## # ... with 24,217 more rows
#separating to filter
sb2_separated_words <- sb2_paired_words %>%
separate(paired_words, c("word1", "word2"), sep = " ")
#Excluding the stop words and the undesired words from the bigrams
sb2_tweets_filter <- sb2_separated_words %>%
filter(!word1 %in% stop_words$word) %>%
filter(nchar(word1)>3) %>%
filter(!word1 %in% undesired_words_3) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word2 %in% undesired_words_3) %>%
filter(nchar(word2)>3)
#view the filtered word pairs
sb2_word_counts <- sb2_tweets_filter %>%
count(word1, word2, sort=TRUE)
library(kableExtra)
head(sb2_word_counts) %>%kable()%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| word1 | word2 | n |
|---|---|---|
| colin | kaepernick | 101 |
| children | killed | 95 |
| chuckmodi1 | colin | 95 |
| bey_legion | chloexhalle | 93 |
| chloexhalle | perform | 88 |
| debut | album | 88 |
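To make the idea of a bigram concrete, here is a minimal base R sketch that pairs each word with the one following it and then splits the pair into two columns. It only approximates what unnest_tokens() with token = "ngrams" and separate() do, since tidytext uses its own tokenizer; make_bigrams() is a hypothetical helper.

```r
# Hypothetical helper: lower-cases, strips punctuation, pairs adjacent words
make_bigrams <- function(text) {
  cleaned <- tolower(gsub("[[:punct:]]", "", text))
  words <- strsplit(cleaned, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

pairs <- make_bigrams("Colin Kaepernick took a knee")
print(pairs)

# Splitting each pair on the space, similar in spirit to separate()
parts <- do.call(rbind, strsplit(pairs, " ", fixed = TRUE))
separated <- data.frame(word1 = parts[, 1], word2 = parts[, 2],
                        stringsAsFactors = FALSE)
print(separated)
```

Once the two words sit in their own columns, each one can be checked against the stop words and the undesired words separately.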
Now we can visualize our network of bigrams.
library(tidyr)
library(igraph)
library(ggraph)
sb2_word_counts %>%
filter(n>=50) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes()) +
geom_node_point(color = "darkslategray4", size = 3) +
geom_node_text(aes(label=name), vjust=1.8, size=5)
Word Pairs used under #SuperBowlLIII
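graph_from_data_frame() treats the first two columns of the data frame as the two ends of an edge and keeps any further columns (here n) as edge attributes. The base R sketch below shows which edges and nodes survive the n >= 50 filter. The first three rows come from the table above; the last row is made up to show a pair being dropped.

```r
toy_pairs <- data.frame(
  word1 = c("colin", "children", "debut", "took"),
  word2 = c("kaepernick", "killed", "album", "knee"),
  n     = c(101L, 95L, 88L, 12L),  # the last count is a made-up value
  stringsAsFactors = FALSE
)

# Same threshold as filter(n >= 50) in the plotting code
edges <- toy_pairs[toy_pairs$n >= 50, ]

# Every distinct word that appears in a remaining pair becomes a node
nodes <- unique(c(edges$word1, edges$word2))

print(edges)
print(nodes)
```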
More insights
Apparently Chuck Modi’s tweet drew a lot of attention. People supported Kaepernick by referring to Chuck Modi’s tweet:
“Colin Kaepernick took a knee for us!” Everyone has opinion on Kap, but what about mothers who had their children killed by police. Pls listen: #ImWithKap #TakeAKnee on #SuperBowlLIII
Reference: https://twitter.com/ChuckModi1/status/1092197280701169665
Additionally, with this month being “Black History Month”, that was also a popular topic for raising awareness.
There are other word pairs related to the performances. The performance by Beyonce’s protegees Chloe X Halle also caught attention during the Super Bowl. “bey_legion” here refers to Beyonce, which I didn’t know before.
This year we heard tons of complaints that the Super Bowl was really boring… Indeed, some people expressed their feelings about the Super Bowl as “superbored”.
It wouldn’t be fair not to give any credit to Pampers for their ad. John Legend and Adam Levine’s performance about diapers was also a hot topic.
Well, it seems like there were more tweets about other topics than about the game. On Twitter, people stated their opinions about Super Bowl related topics rather than about the game itself.
Now we can look at some insights on the halftime show. Maroon 5’s performance and Adam Levine being “half time naked” were a big deal, so let’s explore how this was reflected on Twitter. Before creating graphs, we need to do some more cleaning and filtering. For this one I am going to add “halftimeshow” as an undesired word. Let’s have a look at the most frequently used words about the halftime show. I will create the plot with ggplot.
undesired_words_4 <- c("rt", "angeles", "2019", "superbowl", "bowl", "super", "superbowlliii", "halftimeshow", "halftime", "literally")
#plotting the frequency plot for the words mentioned under the hashtag
hts2_tweets_clean %>%
count(word, sort=TRUE) %>%
top_n(50) %>%
anti_join(stop_words) %>%
filter(!word %in% undesired_words_4) %>% #excluding the undesired words
filter(nchar(word)>3) %>% #only including words that have more than 3 chr
mutate(word= reorder(word, n)) %>%
ggplot(aes(x=word, y = n)) +
geom_bar(stat= "identity",
fill="blue") + #coloring the bars blue
xlab(NULL) + #removing the x-axis label
ylab(NULL) + #removing the y-axis label
ggtitle("Which words were most frequent in tweets about the halftime show?")+
theme(plot.title = element_text(size=10))+ #changing the text size
coord_flip()
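In the pipeline above, top_n(50) runs before the stop words and undesired words are removed, so the final plot can contain fewer than 50 words. The toy example below illustrates how the order of the two steps changes the result. It uses base R with made-up counts, and head() on sorted data as a rough stand-in for top_n(), which also keeps ties.

```r
# Made-up counts for illustration only
toy <- data.frame(
  word = c("the", "and", "spongebob", "sweet", "victory", "maroon"),
  n    = c(90L, 70L, 60L, 50L, 45L, 40L),
  stringsAsFactors = FALSE
)
toy_stop <- c("the", "and")  # tiny stand-in for stop_words$word

# Order used in this post: take the top 4 first, then drop stop words
top_then_filter <- head(toy[order(-toy$n), ], 4)
top_then_filter <- top_then_filter[!top_then_filter$word %in% toy_stop, ]

# Alternative order: drop stop words first, then take the top 4
filter_then_top <- toy[!toy$word %in% toy_stop, ]
filter_then_top <- head(filter_then_top[order(-filter_then_top$n), ], 4)

print(top_then_filter$word)
print(filter_then_top$word)
```

Both orders are defensible; the second one simply guarantees that the plot shows the requested number of words whenever enough of them remain.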
Insight on Halftime Show
Apparently SpongeBob and David Glen Eisley’s “Sweet Victory” were mentioned a lot in the tweets about the halftime show, actually more than Maroon 5! So Adam Levine going partially naked was not as interesting as SpongeBob. We can look at the word pairs for this hashtag to understand more about the opinions on the halftime show.
I’m in Boston right now, so I saw with my own eyes that the parade was very crowded and that people were very excited about it. I want to do a sentiment analysis on the tweets of people from Boston.
In R, the tidytext package gives access to three commonly used sentiment lexicons (a lexicon is a dictionary that maps words to sentiments or emotions):
* AFINN
* nrc
* bing
Since I want to understand how people were actually feeling about the parade, I will use the nrc lexicon, which categorizes words by emotion.
#get_sentiments("nrc")
undesired_words_5 <- c("rt", "angeles", "2019", "superbowl", "bowl", "super", "superbowlliii", "parade")
pts_tweets_clean %>%
inner_join(get_sentiments("nrc")) %>% #using this lexicon
count(word, sentiment, sort=TRUE) %>%
anti_join(stop_words) %>%
distinct() %>%
filter(!word %in% undesired_words_5) %>% #excluding the undesired words
filter(nchar(word)>3) %>%
group_by(sentiment) %>%
top_n(5) %>% #showing the top 5 words for every emotion
ungroup() %>%
ggplot(aes(word, n, fill=sentiment))+
geom_col(show.legend = FALSE) + #hiding the legend
facet_wrap(~sentiment, scales="free_y") + #adding multiple graphs
xlab(NULL) +
ylab("Emotions about the Patriots Parade")+
theme(axis.text.x=element_text(size=6, angle=45)) +
coord_flip()
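The key step in the code above is the inner join between the tweet words and the lexicon: words that are not in the lexicon disappear, and a word tagged with several emotions is counted once for each of them. Here is a minimal base R sketch of that behavior. The toy lexicon is entirely made up and does not reflect the actual nrc entries.

```r
# Made-up tokens, standing in for pts_tweets_clean
toy_tokens <- data.frame(
  word = c("champion", "crowded", "police", "champion", "parade"),
  stringsAsFactors = FALSE
)

# Hypothetical lexicon: these word/sentiment pairs are NOT taken from nrc
toy_lexicon <- data.frame(
  word      = c("champion", "champion", "police", "police"),
  sentiment = c("joy", "positive", "fear", "trust"),
  stringsAsFactors = FALSE
)

# merge() with default settings is an inner join on the shared column "word"
joined <- merge(toy_tokens, toy_lexicon, by = "word")

# Count each word/sentiment combination, like count(word, sentiment)
joined$n <- 1L
counts <- aggregate(n ~ word + sentiment, data = joined, FUN = sum)
print(counts)
```

Because one word can feed several facets, the bars across emotions are not independent of each other, which is worth keeping in mind when reading the plot.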
It looks like there are mixed feelings about the parade. Although people enjoyed the Pats’ championship, there seems to have been some negativity too. This might be due to the parade being overwhelmingly crowded. The negative words are “drinking”, “disrespectful”, “police” and “ambulance”. Like every overcrowded event, the parade had its share of commotion as well.