In this notebook we analyze tweets from March Madness 2018.
Use regular expressions to clean the tweet text
Get familiar with some natural language processing tools
# Check whether each package is installed; install it if missing, then load it
if(!require("tidyverse")){
install.packages("tidyverse", dependencies = TRUE)
library("tidyverse")
}
if(!require("syuzhet")){
install.packages("syuzhet", dependencies = TRUE)
library("syuzhet")
}
## Warning: package 'syuzhet' was built under R version 3.4.4
if(!require("cleanNLP")){
install.packages("cleanNLP", dependencies = TRUE)
library("cleanNLP")
}
## Warning: package 'cleanNLP' was built under R version 3.4.4
if(!require("magrittr")){
install.packages("magrittr", dependencies = TRUE)
library("magrittr")
}
if(!require("wordcloud")){
install.packages("wordcloud", dependencies = TRUE)
library("wordcloud")
}
## Warning: package 'wordcloud' was built under R version 3.4.4
tweets <- read_csv("data/march_madness.csv")
# Change the tweet IDs from long integers to character strings
tweets$tweet_id <- as.character(tweets$tweet_id)
# Extract and delete the links variable to add it at the end
links <- tweets$links
tweets$links <- NULL
# Inspect the first 6 rows
head(tweets)
## # A tibble: 6 x 10
## tweet_id text username fullname date datetime verified
## <chr> <chr> <chr> <chr> <date> <dttm> <int>
## 1 9802205~ Good~ @mill_c~ Mill Ca~ 2018-03-31 2018-03-31 23:09:34 0
## 2 9802732~ Look~ @Hoodie~ A.J. Sc~ 2018-04-01 2018-04-01 02:38:48 0
## 3 9802186~ #Loy~ @DanLea~ Dan Lea~ 2018-03-31 2018-03-31 23:01:56 1
## 4 9784337~ Chec~ @chisel~ Chisele~ 2018-03-27 2018-03-27 00:49:15 0
## 5 9802406~ Bye ~ @ProSpo~ Pro Spo~ 2018-04-01 2018-04-01 00:29:18 0
## 6 9802403~ Ben ~ @RyanSc~ Ryan Sc~ 2018-04-01 2018-04-01 00:27:55 0
## # ... with 3 more variables: reply <int>, retweets <int>, favorite <int>
tweets <- read_csv("data/march_madness.csv")
tweets$tweet_id <- as.character(tweets$tweet_id)
head(tweets)
## # A tibble: 6 x 11
## tweet_id text username fullname date datetime verified
## <chr> <chr> <chr> <chr> <date> <dttm> <int>
## 1 9802205~ Good~ @mill_c~ Mill Ca~ 2018-03-31 2018-03-31 23:09:34 0
## 2 9802732~ Look~ @Hoodie~ A.J. Sc~ 2018-04-01 2018-04-01 02:38:48 0
## 3 9802186~ #Loy~ @DanLea~ Dan Lea~ 2018-03-31 2018-03-31 23:01:56 1
## 4 9784337~ Chec~ @chisel~ Chisele~ 2018-03-27 2018-03-27 00:49:15 0
## 5 9802406~ Bye ~ @ProSpo~ Pro Spo~ 2018-04-01 2018-04-01 00:29:18 0
## 6 9802403~ Ben ~ @RyanSc~ Ryan Sc~ 2018-04-01 2018-04-01 00:27:55 0
## # ... with 4 more variables: reply <int>, retweets <int>, favorite <int>,
## # links <chr>
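One goal of this notebook is to clean the tweet text with regular expressions. The following is a minimal sketch of what that could look like with base R's `gsub()`. The helper name `clean_tweet_text` and the sample strings are hypothetical, and the patterns (links, mentions, the `#` symbol, extra whitespace) are assumptions about what counts as noise, not rules taken from this analysis.

```r
# Hypothetical helper: strips common Twitter noise from a character vector.
clean_tweet_text <- function(text) {
  text <- gsub("https?://\\S+", "", text, perl = TRUE)  # remove links
  text <- gsub("@\\w+", "", text, perl = TRUE)          # remove @mentions
  text <- gsub("#", "", text, fixed = TRUE)             # keep the hashtag word, drop '#'
  text <- gsub("\\s+", " ", text, perl = TRUE)          # collapse runs of whitespace
  trimws(text)                                          # drop leading/trailing spaces
}

# Made-up sample strings, not rows from the dataset
sample_text <- c(
  "Go #Ramblers!  https://example.com/game",
  "@fan_account what a finish #MarchMadness"
)
clean_tweet_text(sample_text)
```

On the real data this could be applied as `clean_tweet_text(tweets$text)`, assuming the `text` column holds the raw tweet text.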
summary(tweets)
## tweet_id text username
## Length:20187 Length:20187 Length:20187
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## fullname date datetime
## Length:20187 Min. :2011-02-09 Min. :2011-02-09 19:42:51
## Class :character 1st Qu.:2018-03-24 1st Qu.:2018-03-24 11:03:02
## Mode :character Median :2018-03-25 Median :2018-03-25 00:22:38
## Mean :2018-03-18 Mean :2018-03-18 10:40:33
## 3rd Qu.:2018-03-25 3rd Qu.:2018-03-25 03:08:07
## Max. :2018-04-06 Max. :2018-04-06 21:24:17
## verified reply retweets favorite
## Min. :0.00000 Min. : 0.0000 Min. : 0.000 Min. : 0.0
## 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0
## Median :0.00000 Median : 0.0000 Median : 0.000 Median : 1.0
## Mean :0.06192 Mean : 0.3467 Mean : 3.146 Mean : 15.8
## 3rd Qu.:0.00000 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 3.0
## Max. :1.00000 Max. :591.0000 Max. :5143.000 Max. :32180.0
## links
## Length:20187
## Class :character
## Mode :character
##
##
##
knitr::include_graphics("image11.png")
This shows that most of the tweets occurred on Sunday. Tweet volume rose as the weekend approached, and on Sunday many people took to Twitter and favorited a lot of tweets.
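The day-of-week pattern from the chart can also be checked directly in R. This is a rough sketch on made-up dates, not the code behind the chart; `toy_dates` stands in for `tweets$date`.

```r
# Toy dates standing in for tweets$date (assumed values, not from the dataset)
toy_dates <- as.Date(c("2018-03-23", "2018-03-24", "2018-03-24",
                       "2018-03-25", "2018-03-25", "2018-03-25"))

# "%u" gives the ISO weekday number (1 = Monday ... 7 = Sunday),
# which does not depend on the system locale
weekday <- factor(format(toy_dates, "%u"),
                  levels = as.character(1:7),
                  labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

# Count tweets per weekday; days with no tweets show up as 0
tweets_per_weekday <- table(weekday)
tweets_per_weekday
```

Using a factor with all seven levels keeps empty weekdays in the table, which makes the counts easier to compare with a bar chart.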
knitr::include_graphics("image12.png")
These are the tweets that were favorited the most. One tweet had over 30,000 favorites, which is a very large number for a single tweet.
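A ranking like this one can be reproduced by sorting on the `favorite` column. The sketch below uses a small made-up data frame; the IDs and counts are hypothetical.

```r
# Toy data standing in for the tweets table (hypothetical IDs and counts)
toy <- data.frame(
  tweet_id = c("a1", "a2", "a3", "a4"),
  favorite = c(3, 120, 0, 25),
  stringsAsFactors = FALSE
)

# Sort by favorites, largest first, and keep the top 3 rows
top_favorited <- head(toy[order(-toy$favorite), ], 3)
top_favorited
```

Keeping `tweet_id` as character, as done earlier in the notebook, avoids precision problems with the very long numeric IDs.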
knitr::include_graphics("image13.png")
This graph shows the time of day when most tweets occur. The largest number of retweets came at midnight, reaching nearly 30,000. A likely explanation is that people are out after watching the games and are on their phones retweeting what other people wrote about them.
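To check the hourly pattern outside of the chart, retweets can be summed by hour of day. This is a sketch on toy values; the time zone of the `datetime` column is not stated in the data, so `UTC` here is an assumption.

```r
# Toy data standing in for tweets$datetime and tweets$retweets (assumed values)
toy <- data.frame(
  datetime = as.POSIXct(c("2018-03-25 00:22:38", "2018-03-25 00:45:10",
                          "2018-03-25 03:08:07", "2018-03-24 11:03:02"),
                        tz = "UTC"),
  retweets = c(10, 5, 2, 1)
)

# Extract the hour (0-23) from each timestamp
toy$hour <- as.integer(format(toy$datetime, "%H", tz = "UTC"))

# Total retweets per hour of day
retweets_by_hour <- aggregate(retweets ~ hour, data = toy, FUN = sum)

# Hour with the most retweets
peak_hour <- retweets_by_hour$hour[which.max(retweets_by_hour$retweets)]
retweets_by_hour
```

If the timestamps are stored in a different time zone than the games were played in, the peak hour would shift accordingly.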
knitr::include_graphics("image14.png")
As expected, most of the March Madness retweets occurred in March, since that is the month when the tournament takes place.
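The monthly totals can be sketched the same way. Since `summary(tweets)` shows dates going back to 2011-02-09, grouping by year and month together keeps older tweets from being mixed into 2018. The values below are made up for illustration.

```r
# Toy data standing in for tweets$date and tweets$retweets (assumed values)
toy <- data.frame(
  date = as.Date(c("2011-02-09", "2018-03-24", "2018-03-25", "2018-04-06")),
  retweets = c(1, 40, 60, 5)
)

# Label each tweet with its year and month, e.g. "2018-03"
toy$year_month <- format(toy$date, "%Y-%m")

# Total retweets per year-month
retweets_by_month <- aggregate(retweets ~ year_month, data = toy, FUN = sum)
retweets_by_month
```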
knitr::include_graphics("image15.png")
It was interesting to see all the spikes, because they show when the games occurred, and the further Loyola advanced, the higher the spike.
Explanations are above.
Almost nobody was talking about Loyola at first, but as Loyola advanced further into the tournament, more people gave the team recognition on social media. I do not have a Twitter account, so I do not know exactly how favorites and retweets work, but the closer it got to game time, the more tweets went out.
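One way to put numbers on the growing attention is to count, per day, the tweets whose text mentions Loyola. This is a sketch on made-up rows; matching the single word "loyola" (case-insensitive) is an assumption and would miss tweets that only say "Ramblers".

```r
# Toy data standing in for tweets$date and tweets$text (made-up rows)
toy <- data.frame(
  date = as.Date(c("2018-03-15", "2018-03-24", "2018-03-24", "2018-03-31")),
  text = c("Bracket time!", "Loyola wins again",
           "What a game, #LoyolaChicago", "Go Ramblers, go loyola"),
  stringsAsFactors = FALSE
)

# TRUE when the tweet text contains "loyola" in any letter case
toy$mentions_loyola <- grepl("loyola", toy$text, ignore.case = TRUE)

# Number of Loyola mentions per day (sum of TRUE values)
mentions_per_day <- aggregate(mentions_loyola ~ date, data = toy, FUN = sum)
mentions_per_day
```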
summary(tweets)
## tweet_id text username
## Length:20187 Length:20187 Length:20187
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## fullname date datetime
## Length:20187 Min. :2011-02-09 Min. :2011-02-09 19:42:51
## Class :character 1st Qu.:2018-03-24 1st Qu.:2018-03-24 11:03:02
## Mode :character Median :2018-03-25 Median :2018-03-25 00:22:38
## Mean :2018-03-18 Mean :2018-03-18 10:40:33
## 3rd Qu.:2018-03-25 3rd Qu.:2018-03-25 03:08:07
## Max. :2018-04-06 Max. :2018-04-06 21:24:17
## verified reply retweets favorite
## Min. :0.00000 Min. : 0.0000 Min. : 0.000 Min. : 0.0
## 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0
## Median :0.00000 Median : 0.0000 Median : 0.000 Median : 1.0
## Mean :0.06192 Mean : 0.3467 Mean : 3.146 Mean : 15.8
## 3rd Qu.:0.00000 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 3.0
## Max. :1.00000 Max. :591.0000 Max. :5143.000 Max. :32180.0
## links
## Length:20187
## Class :character
## Mode :character
##
##
##
The maximum number of favorites was 32,180, which matches what I saw in Tableau, where one user's tweet accounted for 32,180 favorites on its own. Most of the traffic on Twitter happened at certain times, and marketing teams should advertise on Twitter at those times to maximize the number of people who see the ads.
I recommend that Loyola advertise more late at night on weekends, because that is when most people were retweeting and favoriting, and Loyola could get its name out to a lot of people this way.
knitr::include_graphics("image16.png")
This confirms what I found in Tableau: March has the highest Twitter traffic regarding March Madness.
knitr::include_graphics("image17.png")
This chart shows favorites by username; as expected, the Ramblers Men’s Bball team account has the highest number of favorites.
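Favorites per account can be totaled in R by grouping on `username`. The sketch below uses hypothetical usernames and counts, not values from the dataset.

```r
# Toy data standing in for tweets$username and tweets$favorite (hypothetical)
toy <- data.frame(
  username = c("@user_a", "@user_b", "@user_a", "@user_c"),
  favorite = c(10, 4, 25, 0),
  stringsAsFactors = FALSE
)

# Total favorites per account
favorites_by_user <- aggregate(favorite ~ username, data = toy, FUN = sum)

# Largest totals first
favorites_by_user <- favorites_by_user[order(-favorites_by_user$favorite), ]
favorites_by_user
```

Summing per account answers a different question than ranking single tweets: an account can lead here through many moderately popular tweets rather than one viral one.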
knitr::include_graphics("image18.png")
The strongest drivers of verified are favorite and reply, with a strength of 94%, which is quite strong.
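As a rough cross-check in R, the typical number of favorites and replies can be compared between verified and unverified accounts. This is only a sketch on made-up values, and it does not reproduce the strength figure reported by the tool, whose calculation is not described here.

```r
# Toy data standing in for the verified, favorite and reply columns (made up)
toy <- data.frame(
  verified = c(0, 0, 0, 1, 1),
  favorite = c(0, 1, 3, 40, 100),
  reply    = c(0, 0, 1, 5, 9)
)

# Median favorites and replies for unverified (0) and verified (1) accounts
by_verified <- aggregate(cbind(favorite, reply) ~ verified, data = toy, FUN = median)
by_verified
```

The median is used instead of the mean because the summary above shows very skewed counts (most tweets have 0 replies while the maximum is 591).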