In this notebook we analyze tweets from March Madness 2018.
Use regular expressions to clean the tweet text.
Become familiar with some natural language processing tools.
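The cleaning step itself is not shown in this notebook, so the following is only a minimal sketch of what regular-expression cleaning of tweet text could look like. The function name `clean_tweet` and the example tweet are hypothetical, and the exact rules used to build the CSV may differ.

```r
# Hypothetical helper: strip URLs and punctuation from tweet text (base R only)
clean_tweet <- function(x) {
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)              # drop URLs
  x <- gsub("[^[:alnum:][:space:]]", " ", x, perl = TRUE)      # punctuation and symbols become spaces
  x <- gsub("\\s+", " ", x, perl = TRUE)                       # collapse runs of whitespace
  trimws(x)                                                    # remove leading and trailing spaces
}

# Made-up example tweet, not taken from the data set
cleaned <- clean_tweet("You're all excited! https://t.co/abc #FinalFour")
print(cleaned)
```

Replacing punctuation with a space rather than deleting it keeps neighbouring words apart, at the cost of splitting contractions such as "you're" into two tokens.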
# Load each package, installing it first if it is missing
if (!require("tidyverse")) {
  install.packages("tidyverse", dependencies = TRUE)
  library("tidyverse")
}
if (!require("syuzhet")) {
  install.packages("syuzhet", dependencies = TRUE)
  library("syuzhet")
}
if (!require("cleanNLP")) {
  install.packages("cleanNLP", dependencies = TRUE)
  library("cleanNLP")
}
if (!require("magrittr")) {
  install.packages("magrittr", dependencies = TRUE)
  library("magrittr")
}
if (!require("wordcloud")) {
  install.packages("wordcloud", dependencies = TRUE)
  library("wordcloud")
}
# Read tweet_id as text so the long IDs are not rounded when parsed as numbers
tweets <- read_csv("data/sentiment_march_madness.csv",
                   col_types = cols(tweet_id = col_character()))
head(tweets[12:21])
## # A tibble: 6 x 10
## anticipation disgust fear joy sadness surprise trust negative
## <int> <int> <int> <int> <int> <int> <int> <int>
## 1 2 0 0 2 0 2 2 0
## 2 1 3 3 1 2 2 3 4
## 3 0 0 0 0 0 0 0 0
## 4 1 0 0 1 0 0 0 0
## 5 2 0 0 1 0 1 1 0
## 6 1 1 3 0 0 0 1 1
## # ... with 2 more variables: positive <int>, sentiment_bing <int>
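The emotion columns above (together with `anger`, `negative`, and `positive`) have the same names as the output of syuzhet's `get_nrc_sentiment()`, and `sentiment_bing` looks like a Bing-lexicon score from `get_sentiment()`. The notebook does not show how the CSV was scored, so this is only a sketch of how such columns could be produced, using two made-up sentences rather than real tweets.

```r
library(syuzhet)

# Made-up example texts (not from the data set)
texts <- c("What a great basketball run, so proud of this team",
           "That loss was awful and the referees were terrible")

# One row per text, one count column per NRC emotion plus negative/positive
nrc <- get_nrc_sentiment(texts)

# One signed score per text from the Bing lexicon
bing <- get_sentiment(texts, method = "bing")

# Combine into a single table shaped like the columns shown above
scored <- cbind(text = texts, nrc, sentiment_bing = bing)
print(names(scored))
```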
cnlp_init_udpipe()
## Loading required namespace: udpipe
doc <- cnlp_annotate(input = tweets$text, as_strings = TRUE, doc_ids = tweets$tweet_id, meta = tweets[-c(1,2)])
## Warning in cnlp_annotate(input = tweets$text, as_strings = TRUE, doc_ids =
## tweets$tweet_id, : duplicated document ids given
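The warning above reports that some document ids repeat. A minimal sketch of how repeated ids could be found and, if appropriate, dropped before annotation is shown here on a toy data frame; the values are invented, and whether dropping repeats is the right choice for the real tweets is an assumption.

```r
# Toy data standing in for the tweets table (values are invented)
toy <- data.frame(
  tweet_id = c("a1", "b2", "a1", "c3"),
  text     = c("first", "second", "first again", "third"),
  stringsAsFactors = FALSE
)

# Ids that occur more than once
dup_ids <- unique(toy$tweet_id[duplicated(toy$tweet_id)])

# Keep only the first row for each id
deduped <- toy[!duplicated(toy$tweet_id), ]

print(dup_ids)
print(nrow(deduped))
```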
qplot(x = seq_along(tweets$sentiment_bing),
      y = tweets$sentiment_bing,
      geom = "line",
      xlab = "Narrative Time",
      ylab = "Emotional Valence",
      main = "Tweets Sentiment Trajectory")
angry_tweets <- which(tweets$anger > 0)
tibble(tweet = tweets$text[angry_tweets][1:2])
## # A tibble: 2 x 1
## tweet
## <chr>
## 1 Look I get that that you re all excited that you beat an 11 seed but …
## 2 Ben Richardson was extremely emotional leaving the court screaming int…
joy_tweets <- which(tweets$joy > 0)
tibble(tweet = tweets$text[joy_tweets][5:7])
## # A tibble: 3 x 1
## tweet
## <chr>
## 1 Thank you Loyola of Chicago and Sr Jean What great a basketball run pl…
## 2 After being honored at tomorrow s chicagobulls game the FinalFour Ra…
## 3 With everything being said I respect Loyola so much for what they acco…
value <- as.double(colSums(prop.table(tweets[, 11:18])))
emotion <- names(tweets)[11:18]
emotion <- factor(emotion, levels = names(tweets)[11:18][order(value, decreasing = FALSE)])
emotions <- tibble(emotion, percent = value * 100)
head(emotions)
## # A tibble: 6 x 2
## emotion percent
## <fct> <dbl>
## 1 anger 6.72
## 2 anticipation 21.8
## 3 disgust 3.58
## 4 fear 8.07
## 5 joy 21.1
## 6 sadness 5.62
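The percentages above are each emotion's share of all emotion word counts across all tweets. A small sketch on invented counts shows the arithmetic; the three columns and their values are assumptions chosen only to make the result easy to check by hand.

```r
# Invented counts: three tweets, three emotions
counts <- data.frame(anger = c(1, 0, 1),
                     joy   = c(2, 1, 0),
                     trust = c(0, 3, 2))

# Share of the grand total contributed by each column
share   <- colSums(counts) / sum(counts)
percent <- share * 100

# The same result via prop.table(), as in the notebook code
share_pt <- colSums(prop.table(as.matrix(counts)))

print(percent)
```

Because every cell is divided by the grand total, the percentages sum to 100 across emotions, so they compare emotions with each other rather than describing the fraction of tweets that express each emotion.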
ggplot(data = emotions, aes(x = emotion, y = percent)) +
  geom_bar(stat = "identity", aes(fill = emotion)) +
  scale_fill_brewer(palette = "RdYlGn") +
  coord_flip() +
  xlab("Emotion") +
  ylab("Percentage")
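# stringsAsFactors = TRUE reproduces the factor summaries below on R >= 4.0
mydata <- read.csv("data/sentiment_march_madness.csv", stringsAsFactors = TRUE)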
summary(mydata)
## tweet_id text username
## Min. :3.542e+16 : 1273 @LALATE : 81
## 1st Qu.:9.774e+17 : 1245 @RamblersMBB : 30
## Median :9.777e+17 : 197 @SkywayChicago : 27
## Mean :9.753e+17 : 51 @chicagomargaret: 21
## 3rd Qu.:9.777e+17 : 35 @sschrimp : 18
## Max. :9.824e+17 SisterJean: 15 @loyolaforus : 16
## (Other) :17371 (Other) :19994
## fullname date datetime
## LALATE : 81 3/25/18:10708 2018-03-25T00:21:10Z: 16
## Loyola Basketball: 31 3/23/18: 2976 2018-03-25T00:21:31Z: 16
## Steve Timble : 27 3/24/18: 2274 2018-03-25T00:21:09Z: 15
## Margaret Holt : 21 3/26/18: 1504 2018-03-25T00:21:35Z: 15
## Mark : 21 3/18/18: 1099 2018-03-25T00:21:08Z: 14
## Steve : 19 3/27/18: 241 2018-03-25T00:21:11Z: 14
## (Other) :19987 (Other): 1385 (Other) :20097
## verified reply retweets favorite
## Min. :0.00000 Min. : 0.0000 Min. : 0.000 Min. : 0.0
## 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0
## Median :0.00000 Median : 0.0000 Median : 0.000 Median : 1.0
## Mean :0.06192 Mean : 0.3467 Mean : 3.146 Mean : 15.8
## 3rd Qu.:0.00000 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 3.0
## Max. :1.00000 Max. :591.0000 Max. :5143.000 Max. :32180.0
##
## anger anticipation disgust fear
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.1342 Mean :0.4359 Mean :0.07143 Mean :0.1612
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :4.0000 Max. :7.0000 Max. :3.00000 Max. :6.0000
##
## joy sadness surprise trust
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.421 Mean :0.1122 Mean :0.1798 Mean :0.4806
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :8.000 Max. :5.0000 Max. :4.0000 Max. :7.0000
##
## negative positive sentiment_bing
## Min. :0.0000 Min. :0.0000 Min. :-5.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.0000
## Median :0.0000 Median :0.0000 Median : 0.0000
## Mean :0.2395 Mean :0.6676 Mean : 0.5141
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.: 1.0000
## Max. :6.0000 Max. :9.0000 Max. :11.0000
##
## links
## @RamblersMBB : 1139
## #LoyolaChicago : 1027
## #SisterJean : 778
## https://twitter.com#SisterJean: 231
## #LoyolaChicago; #MarchMadness : 208
## (Other) :16117
## NA's : 687
##1: This shows the official Loyola Twitter accounts and the number of replies, retweets, and favorites each received during March Madness. The RamblersMBB account had the most interactions and LoyolaQuinlan had barely any. Quinlan can use this to improve its social media engagement.
##2: This shows the number of retweets per username. The University of Michigan’s basketball Twitter account had the most retweets, and Loyola had significantly fewer. Many of the other usernames with high retweet counts were related to the city of Chicago, such as the Cubs and Bulls accounts.
##3: This shows the number of tweets about Loyola per day during March Madness. March 25th had the most tweets, which could be related to Loyola winning its Elite Eight game the previous night.
##4: This shows the number of favorites for verified users compared with unverified users. Verified users typically had more favorites, with the most being 32,149. This makes sense because verified users typically have a larger following, so their tweets are seen by more people.
##5: This shows the most common links in a tweet. The larger the circle, the more times that link was mentioned. The hashtag #SisterJean had one of the highest mention counts, along with the username @umichbball.
##2: This shows the users that had the most angry-sentiment tweets towards Loyola. @loyolaforus, an account for a workers' coalition, had angry tweets towards Loyola. This is most likely unrelated to March Madness.
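The charts described in these notes are not included here. As a rough sketch, summaries like the tweets-per-day count and the verified versus unverified favorites comparison could be computed in base R as follows. The column names `date`, `verified`, and `favorite` come from the data set; the rows are invented, and this is not necessarily how the original charts were built.

```r
# Invented rows using the data set's column names
toy <- data.frame(
  date     = c("3/24/18", "3/25/18", "3/25/18", "3/26/18"),
  verified = c(0, 1, 0, 0),
  favorite = c(3, 120, 1, 0),
  stringsAsFactors = FALSE
)

# Number of tweets per day
tweets_per_day <- table(toy$date)

# Largest favorite count among unverified (0) and verified (1) users
max_fav_by_verified <- tapply(toy$favorite, toy$verified, max)

print(tweets_per_day)
print(max_fav_by_verified)
```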