In this notebook we analyze tweets from March Madness 2018
Use regular expressions to clean the tweet text
Become familiar with some natural language processing tools
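The cleaning step itself is not shown in this notebook, so here is a minimal sketch of what regular-expression cleaning could look like in base R. The helper name `clean_tweet` and the sample strings are hypothetical. The choice to replace punctuation with spaces is an assumption, suggested by the cleaned text printed later (for example "you re" and "tomorrow s").

```r
# Hypothetical helper (not from the original notebook): strip common
# Twitter noise from raw tweet text with base R regular expressions.
clean_tweet <- function(x) {
  x <- gsub("https?://\\S+", " ", x)           # drop URLs
  x <- gsub("[@#]\\w+", " ", x)                # drop @mentions and #hashtags
  x <- gsub("[^[:alnum:][:space:]]", " ", x)   # replace punctuation and symbols with spaces
  x <- gsub("\\s+", " ", x)                    # collapse runs of whitespace
  trimws(x)                                    # remove leading and trailing spaces
}

# Made-up sample tweets for illustration
raw <- c("Go @RamblersMBB!!! #FinalFour https://t.co/abc123",
         "What a run, Sr. Jean :)")
clean_tweet(raw)
```

Whether hashtags should be dropped or kept as plain words depends on the analysis; the second `gsub()` can be removed if hashtags carry meaning you want to score.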
# Here we are checking if the package is installed
if(!require("tidyverse")){
install.packages("tidyverse", dependencies = TRUE)
library("tidyverse")
}
## Warning: package 'ggplot2' was built under R version 3.4.4
if(!require("syuzhet")){
install.packages("syuzhet", dependencies = TRUE)
library("syuzhet")
}
## Warning: package 'syuzhet' was built under R version 3.4.4
if(!require("cleanNLP")){
install.packages("cleanNLP", dependencies = TRUE)
library("cleanNLP")
}
## Warning: package 'cleanNLP' was built under R version 3.4.4
if(!require("magrittr")){
install.packages("magrittr", dependencies = TRUE)
library("magrittr")
}
if(!require("wordcloud")){
install.packages("wordcloud", dependencies = TRUE)
library("wordcloud")
}
tweets <- read.csv("data/sentiment_march_madness.csv")
mydata <- read.csv("data/march_madness.csv")
tweets$tweet_id <- as.character(tweets$tweet_id)
head(tweets[12:21])
## anticipation disgust fear joy sadness surprise trust negative positive
## 1 2 0 0 2 0 2 2 0 2
## 2 1 3 3 1 2 2 3 4 1
## 3 0 0 0 0 0 0 0 0 1
## 4 1 0 0 1 0 0 0 0 3
## 5 2 0 0 1 0 1 1 0 1
## 6 1 1 3 0 0 0 1 1 2
## sentiment_bing
## 1 2
## 2 -1
## 3 1
## 4 0
## 5 1
## 6 1
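The emotion columns above match the categories returned by `syuzhet::get_nrc_sentiment()`, and `sentiment_bing` looks like a net score from `syuzhet::get_sentiment()` with `method = "bing"`. The scoring step is not part of this notebook, so the sketch below is an assumption about how such columns could be produced. The sample sentences are made up.

```r
# Assumption: the sentiment columns were produced with syuzhet; the
# original scoring code is not shown in this notebook.
sample_text <- c("What a great run by Loyola", "I hate losing this game")

scores <- NULL
if (requireNamespace("syuzhet", quietly = TRUE)) {
  # One row per text; counts for 8 emotions plus negative and positive
  nrc <- syuzhet::get_nrc_sentiment(sample_text)
  # One net valence score per text using the Bing lexicon
  bing <- syuzhet::get_sentiment(sample_text, method = "bing")
  scores <- cbind(nrc, sentiment_bing = bing)
}
scores
```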
qplot(x = seq_along(tweets$sentiment_bing),
y = tweets$sentiment_bing,
geom = "line",
xlab = "Narrative Time",
ylab = "Emotional Valence",
main = "Tweets Sentiment Trajectory")
angry_tweets <- which(tweets$anger > 0)
tibble(tweet = tweets$text[angry_tweets][1:2])
## # A tibble: 2 x 1
## tweet
## <fct>
## 1 Look I get that that you re all excited that you beat an 11 seed but ~
## 2 "Ben Richardson was extremely emotional leaving the court screaming in~
joy_tweets <- which(tweets$joy > 0)
tibble(tweet = tweets$text[joy_tweets][5:7])
## # A tibble: 3 x 1
## tweet
## <fct>
## 1 Thank you Loyola of Chicago and Sr Jean What great a basketball run pl~
## 2 After being honored at tomorrow s chicagobulls game the FinalFour Ra~
## 3 With everything being said I respect Loyola so much for what they acco~
value <- as.double(colSums(prop.table(tweets[, 11:18])))
emotion <- names(tweets)[11:18]
emotion <- factor(emotion, levels = names(tweets)[11:18][order(value, decreasing = FALSE)])
emotions <- tibble(emotion, percent = value * 100)
head(emotions)
## # A tibble: 6 x 2
## emotion percent
## <fct> <dbl>
## 1 anger 6.72
## 2 anticipation 21.8
## 3 disgust 3.58
## 4 fear 8.07
## 5 joy 21.1
## 6 sadness 5.62
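To make the percentage calculation concrete, here is a small worked example on made-up counts. It shows that `colSums(prop.table(...))` gives each emotion's share of all emotion counts, which is the same as dividing the column totals by the grand total.

```r
# Toy data (made up for illustration): three tweets, three emotion counts
toy <- data.frame(anger = c(1, 0, 0),
                  joy   = c(2, 1, 0),
                  trust = c(0, 3, 1))

# prop.table() with no margin divides every cell by the grand total,
# so each column sum is that emotion's share of all counts
share_a <- colSums(prop.table(as.matrix(toy)))

# Equivalent and more direct: column totals over the grand total
share_b <- colSums(toy) / sum(toy)

round(share_b * 100, 1)
```

With a grand total of 8 counts, anger is 1/8, joy is 3/8, and trust is 4/8 of everything expressed.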
ggplot(data = emotions, aes(x = emotion, y = percent)) +
geom_bar(stat = "identity", aes(fill = emotion)) +
scale_fill_brewer(palette="RdYlGn") +
coord_flip() +
xlab("Emotion") +
ylab("Percentage")
summary(tweets)
## tweet_id text username
## Length:20187 : 1273 @LALATE : 81
## Class :character : 1245 @RamblersMBB : 30
## Mode :character : 197 @SkywayChicago : 27
## : 51 @chicagomargaret: 21
## : 35 @sschrimp : 18
## SisterJean: 15 @loyolaforus : 16
## (Other) :17371 (Other) :19994
## fullname date datetime
## LALATE : 81 2018-03-25:10708 2018-03-25T00:21:10Z: 16
## Loyola Basketball: 31 2018-03-23: 2976 2018-03-25T00:21:31Z: 16
## Steve Timble : 27 2018-03-24: 2274 2018-03-25T00:21:09Z: 15
## Margaret Holt : 21 2018-03-26: 1504 2018-03-25T00:21:35Z: 15
## Mark : 21 2018-03-18: 1099 2018-03-25T00:21:08Z: 14
## Steve : 19 2018-03-27: 241 2018-03-25T00:21:11Z: 14
## (Other) :19987 (Other) : 1385 (Other) :20097
## verified reply retweets favorite
## Min. :0.00000 Min. : 0.0000 Min. : 0.000 Min. : 0.0
## 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0
## Median :0.00000 Median : 0.0000 Median : 0.000 Median : 1.0
## Mean :0.06192 Mean : 0.3467 Mean : 3.146 Mean : 15.8
## 3rd Qu.:0.00000 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 3.0
## Max. :1.00000 Max. :591.0000 Max. :5143.000 Max. :32180.0
##
## anger anticipation disgust fear
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.1342 Mean :0.4359 Mean :0.07143 Mean :0.1612
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :4.0000 Max. :7.0000 Max. :3.00000 Max. :6.0000
##
## joy sadness surprise trust
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.421 Mean :0.1122 Mean :0.1798 Mean :0.4806
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :8.000 Max. :5.0000 Max. :4.0000 Max. :7.0000
##
## negative positive sentiment_bing
## Min. :0.0000 Min. :0.0000 Min. :-5.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.0000
## Median :0.0000 Median :0.0000 Median : 0.0000
## Mean :0.2395 Mean :0.6676 Mean : 0.5141
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.: 1.0000
## Max. :6.0000 Max. :9.0000 Max. :11.0000
##
## links
## @RamblersMBB : 1139
## #LoyolaChicago : 1027
## #SisterJean : 778
## https://twitter.com#SisterJean: 231
## #LoyolaChicago; #MarchMadness : 208
## (Other) :16117
## NA's : 687
The dataset covers replies, retweets, favorites, and tweets in general over the period from 2/9/11 until 4/6/2018. Most of the accounts included are unverified: only about 6% of tweets come from verified accounts. There are significantly more favorites than retweets or replies (on average 15.8 favorites per tweet, versus 3.1 retweets and 0.35 replies). The text, username, date, and links columns hold character or categorical data, while the engagement counts and sentiment scores are numeric. The sentiment is largely positive and most often expresses trust, anticipation, and joy. Of course, Sister Jean was talked about often.
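The headline numbers in this summary can be pulled out directly instead of being read off the `summary()` printout. The sketch below uses a made-up data frame with the same column names as `tweets`, and the same two lines would apply to the real data.

```r
# Toy data (made up): same column names as the real tweets data frame
toy <- data.frame(verified = c(0, 0, 1, 0),
                  reply    = c(0, 1, 0, 0),
                  retweets = c(0, 2, 10, 0),
                  favorite = c(1, 3, 40, 0))

# Proportion of tweets that come from verified accounts
verified_share <- mean(toy$verified == 1)

# Average engagement per tweet for each type of interaction
avg_engagement <- colMeans(toy[, c("reply", "retweets", "favorite")])

verified_share
avg_engagement
```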
knitr::include_graphics("imgs/Final4Tweets.png")
Top 10 accounts by number of tweets that contain #FinalFour. The color indicates whether an account belongs to Michigan, Loyola, or Other. U of M had more accounts tweeting about the Final Four than Loyola did.
knitr::include_graphics("imgs/LoyolaMentions.png")
This graph shows that Loyola was mentioned in more tweets after losing the game on March 31 than on the day of their Final Four game. Additionally, Loyola was tweeted about the most on March 25 by far.
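The image above was prepared outside this notebook, so as a rough sketch, daily mention counts like these could be computed in base R as follows. The data here is made up, and matching on the word "Loyola" regardless of case is an assumption about how mentions were defined.

```r
# Toy data (made up): one row per tweet
toy <- data.frame(
  date = c("2018-03-24", "2018-03-25", "2018-03-25", "2018-03-31"),
  text = c("Loyola wins", "Go Ramblers", "Loyola to the Final Four", "Proud of Loyola"),
  stringsAsFactors = FALSE
)

# Flag tweets that mention Loyola, ignoring case
mentions <- grepl("loyola", toy$text, ignore.case = TRUE)

# Count the flagged tweets per day
mentions_per_day <- table(toy$date[mentions])
mentions_per_day
```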
knitr::include_graphics("imgs/LoyolaTweeters.png")
The plot shows the accounts that tweeted about Loyola the most and how many of their tweets included “Loyola”.
knitr::include_graphics("imgs/Top10Tweeters.png")
This image shows the top 10 tweeters over the period for which tweets were pulled.
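A ranking like this can be reproduced from the `username` column with a frequency table. The sketch below uses made-up usernames; on the real data the vector would be `tweets$username`, optionally filtered first to tweets that mention Loyola.

```r
# Toy data (made up): one username per tweet
usernames <- c("@a", "@b", "@a", "@c", "@a", "@b")

# Count tweets per account, sort descending, keep at most the top 10
top <- head(sort(table(usernames), decreasing = TRUE), 10)
top
```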
knitr::include_graphics("imgs/TweetRetweet.png")
This plot compares the number of tweets and retweets over the time period. Tweets spiked on March 26, while the largest spike in retweets came on the day of the final game.
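As a sketch of how the two daily series behind such a plot could be built, the example below counts tweets per day and sums retweets per day on made-up data. The column names follow the `tweets` data frame; the values are invented.

```r
# Toy data (made up): one row per tweet with its retweet count
toy <- data.frame(date     = c("2018-03-25", "2018-03-25", "2018-03-26", "2018-04-02"),
                  retweets = c(3, 0, 1, 50),
                  stringsAsFactors = FALSE)

# Total retweets received by the tweets posted on each day
daily <- aggregate(retweets ~ date, data = toy, FUN = sum)

# Number of tweets posted on each day, aligned to the same dates
daily$tweets <- as.integer(table(toy$date)[daily$date])
daily
```

Note that the two series measure different things: one day with few tweets can still dominate retweets if a single tweet is shared widely.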
Loyola has a large, positive presence on Twitter. The team was talked about both before and after the game. Most negative tweets were in defense of Sr. Jean or the players, not attacks on other teams, which is a great testament to Loyola’s sportsmanship. Most tweets carried largely positive emotions.
ggplot(data = emotions, aes(x = emotion, y = percent)) +
geom_bar(stat = "identity", aes(fill = emotion)) +
scale_fill_brewer(palette="RdYlGn") +
coord_flip() +
xlab("Emotion") +
ylab("Percentage")
About 24% of the emotions expressed were trust, and a little over 20% each were anticipation and joy. Less than 5% showed disgust and less than 7.5% showed anger. Overall, the sentiment analysis for Loyola is largely positive.
Have more accounts tweet about Loyola and Sr. Jean, and retweet more often. While Loyola had a great opportunity, it was out-tweeted by U of M because U of M had more accounts dedicated to its team. The sentiment was very positive, so tweets from March Madness could be incorporated into promotions for upcoming athletic events and into outreach to new students. Capitalize on the good PR and use analytics to find the good comments from people unassociated with Loyola, since they are less biased.
knitr::include_graphics("imgs/Watson.Activity.PNG")
This image compares the total favorites, retweets, and replies seen in March 2018. There were significantly more favorites than retweets and replies. This trend suggests that people are more likely to engage with tweets in less committal ways (a favorite does not show up on your feed, while a reply involves typing a comment back).
knitr::include_graphics("imgs/Watson.Favorite.PNG")
This image compares the March Madness favorites by official, verified accounts and unofficial, unverified accounts over the year thus far. The verified accounts saw much more favoriting activity with the bulk of the favorites occurring in April. Unverified accounts saw much more favoriting in March.
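The chart above was built in a separate tool, but a comparable breakdown can be sketched in R by grouping favorites by verification status and month. The data below is made up, and the `month` column is a helper introduced only for this example.

```r
# Toy data (made up): one row per tweet
toy <- data.frame(date     = as.Date(c("2018-03-25", "2018-03-31", "2018-04-02", "2018-04-02")),
                  verified = c(0, 1, 1, 0),
                  favorite = c(5, 20, 100, 2))

# Hypothetical helper column: year and month of each tweet
toy$month <- format(toy$date, "%Y-%m")

# Total favorites for each combination of verified status and month
fav <- aggregate(favorite ~ verified + month, data = toy, FUN = sum)
fav
```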
knitr::include_graphics("imgs/Watson3.31v4.1.PNG")
This chart compares the numbers of favorites, retweets, and replies for the day of the Final Four games and the day of the final game. The final game saw much more Twitter activity than the Final Four games.