In this notebook we are going to analyze tweets from March Madness 2018
Use regular expressions to clean the tweet text
Become familiar with some natural language processing tools
# Here we are checking if the package is installed
if (!require("tidyverse")) {
  install.packages("tidyverse", dependencies = TRUE)
  library("tidyverse")
}
if (!require("syuzhet")) {
  install.packages("syuzhet", dependencies = TRUE)
  library("syuzhet")
}
if (!require("cleanNLP")) {
  install.packages("cleanNLP", dependencies = TRUE)
  library("cleanNLP")
}
if (!require("magrittr")) {
  install.packages("magrittr", dependencies = TRUE)
  library("magrittr")
}
if (!require("wordcloud")) {
  install.packages("wordcloud", dependencies = TRUE)
  library("wordcloud")
}
tweets <- read_csv("data/march_madness.csv")
# Change the tweet IDs from long integers to character strings
tweets$tweet_id <- as.character(tweets$tweet_id)
# Extract and delete the links variable to add it at the end
links <- tweets$links
tweets$links <- NULL
# Inspect the first 6 rows (the default for head())
head(tweets)
replace_reg <- 'https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|(pic.twitter.com/[A-Za-z\\d]+)|&amp;|&lt;|&gt;|RT|(.*.)\\.com(.*.)\\S+\\s|[^[:alnum:]]|(http|https)\\S+\\s*|(#|@)\\S+\\s*|\\n|\\"'
tweets <- tweets %>%
  mutate(text = str_replace_all(text, replace_reg, " ")) %>%
  mutate(text = iconv(text, from = "ASCII", to = "UTF-8", sub = " "))
Error in mutate_impl(.data, dots) :
Evaluation error: embedded nul in string: 'How to draw kawaii step by step leassons on Google Play animejapan KAWAIIcollection FinalFour SisterJean ORLvUTA c\003\tc\003)c\0024c\0033c\003\034c\003<c\003+h6\005 nitiasa precure d;.i\035"c\003)c\002$c\003\0c\003<c\003\023c\003+c\003\t c\0025c\0033c\003\ac\003<c\003"c\003<c\003\vc\0033c\0020 CBX SomeoneLikeYou l\v m\031\024l\025< l\n$k,4l\0024l\035\004 l6\025m\025\030m\0254 l5\034j0\025l0=k/< '.
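The pipeline above fails with an embedded-nul error on tweets that contain non-ASCII text. One possible workaround, sketched here and not run against the actual data, is to convert the text from UTF-8 to ASCII so that each unconvertible byte becomes a space. The `sample_text` vector is a hypothetical stand-in for the `text` column, and the UTF-8 input encoding is an assumption.

```r
# Hypothetical stand-in for tweets$text; the second element mixes in Japanese characters
sample_text <- c("Go Ramblers FinalFour", "kawaii \u30ab\u30ef\u30a4\u30a4 FinalFour")

# Assumption: the input is UTF-8. Bytes that cannot be represented in ASCII
# are replaced with the string given in `sub` (here a space).
ascii_text <- iconv(sample_text, from = "UTF-8", to = "ASCII", sub = " ")

# Collapse the runs of spaces left behind and trim the ends
ascii_text <- trimws(gsub(" +", " ", ascii_text))
ascii_text
```

If this behaves as expected on the real data, the same `iconv()` call could be moved ahead of `str_replace_all()` so the pattern only ever sees ASCII text.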
The data set has columns such as date, verified, reply, text, and links, each recording an attribute of a tweet on Twitter.
mydata <- read.csv("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\09-notebook-lab\\data\\march_madness.csv")
summary(mydata)
tweet_id
Min. :3.542e+16
1st Qu.:9.774e+17
Median :9.777e+17
Mean :9.753e+17
3rd Qu.:9.777e+17
Max. :9.824e+17
text
Yes! My favorite sports city has a team in the #FinalFour of #MarchMadness! The Loyola-Chicago Ramblers are in the final 4 for the 1st time since 1963! They’ve been underdogs this whole tournament & still keep winning! Go @RamblersMBB! Win for #SisterJean! #OnwardLU #NoFinishLine: 8
Congrats @RamblersMBB : 7
Let’s go @RamblersMBB : 4
Congrats @RamblersMBB. : 3
Congratulations @RamblersMBB : 3
I love #SisterJean : 3
(Other) :20159
username fullname date datetime verified reply retweets
@LALATE : 81 LALATE : 81 2018-03-25:10708 2018-03-25T00:21:10Z: 16 Min. :0.00000 Min. : 0.0000 Min. : 0.000
@RamblersMBB : 30 Loyola Basketball: 31 2018-03-23: 2976 2018-03-25T00:21:31Z: 16 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.: 0.000
@SkywayChicago : 27 Steve Timble : 27 2018-03-24: 2274 2018-03-25T00:21:09Z: 15 Median :0.00000 Median : 0.0000 Median : 0.000
@chicagomargaret: 21 Margaret Holt : 21 2018-03-26: 1504 2018-03-25T00:21:35Z: 15 Mean :0.06192 Mean : 0.3467 Mean : 3.146
@sschrimp : 18 Mark : 21 2018-03-18: 1099 2018-03-25T00:21:08Z: 14 3rd Qu.:0.00000 3rd Qu.: 0.0000 3rd Qu.: 0.000
@loyolaforus : 16 Steve : 19 2018-03-27: 241 2018-03-25T00:21:11Z: 14 Max. :1.00000 Max. :591.0000 Max. :5143.000
(Other) :19994 (Other) :19987 (Other) : 1385 (Other) :20097
favorite links
Min. : 0.0 @RamblersMBB : 1139
1st Qu.: 0.0 #LoyolaChicago : 1027
Median : 1.0 #SisterJean : 778
Mean : 15.8 https://twitter.com#SisterJean: 231
3rd Qu.: 3.0 #LoyolaChicago; #MarchMadness : 208
Max. :32180.0 (Other) :16117
NA's : 687
knitr::include_graphics("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\img1.png")
knitr::include_graphics("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\img2.png")
knitr::include_graphics("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\img3.png")
The first figure shows the number of records both graphically and in a table. March appears to have the most tweets in this data set. The second figure shows how many of those tweets came from verified accounts, were retweeted, or were favorited, and the time of day when tweeting was most active. The third figure breaks the tweet features down by number of tweets, and the bubble chart shows the frequency of links: the more often a hashtag was tweeted, the bigger its circle.
Looking at the most frequent links and hashtags, #LoyolaChicago, #SisterJean, and @RamblersMBB appear to be the most tweeted tags, giving Loyola a positive image.

### 2B) Use descriptive statistics to back up your arguments
summary(mydata)
According to the summary, the tags mentioned previously are the most frequent values in the links column: @RamblersMBB (1,139), #LoyolaChicago (1,027), and #SisterJean (778).
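These counts come from `summary()`, which treats each distinct value of `links` as its own category, so a combined entry such as `#LoyolaChicago; #MarchMadness` is tallied separately from `#LoyolaChicago`. A minimal sketch of counting individual tags instead, assuming entries are separated by semicolons as in the summary output; `sample_links` is a hypothetical stand-in for `mydata$links`:

```r
# Hypothetical stand-in for mydata$links (made-up values, including a missing one)
sample_links <- c("@RamblersMBB", "#LoyolaChicago", "#LoyolaChicago; #MarchMadness",
                  "#SisterJean", "@RamblersMBB", NA)

# as.character() guards against the column having been read as a factor.
# Split each entry on semicolons (with optional spaces) and flatten to one vector.
tags <- unlist(strsplit(as.character(sample_links), "\\s*;\\s*"))

# table() drops NA by default; sort so the most frequent tags come first
tag_counts <- sort(table(tags), decreasing = TRUE)
head(tag_counts, 3)
```

Splitting first means a tag is counted once for every tweet it appears in, whether it was used alone or alongside other tags.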
knitr::include_graphics("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\explore.png")
knitr::include_graphics("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\img4.png")
knitr::include_graphics("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\img5.png")
knitr::include_graphics("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\img6.png")
The first figure shows the number of retweets by month in 2017. The graph shows that March had the most retweets, with a value of nearly 40,000. The second figure shows Twitter replies as a bubble chart, where larger bubbles mean more replies. The third figure is another graph, in which the number of retweets by verified users can be analyzed.
Overall, the figures share at least one conclusion: March produced the most tweets, as expected given March Madness.
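The month-level comparison can also be checked numerically rather than read off the figures. A minimal sketch in base R, assuming `date` is stored as `YYYY-MM-DD` text as in the summary output; `sample_tweets` and its values are made up for illustration and are not taken from the data:

```r
# Hypothetical miniature of the data with only the two columns needed here
sample_tweets <- data.frame(
  date = c("2018-02-27", "2018-03-18", "2018-03-25", "2018-03-25"),
  retweets = c(1, 0, 12, 3),
  stringsAsFactors = FALSE
)

# Derive a year-month label from the date string
sample_tweets$month <- format(as.Date(sample_tweets$date), "%Y-%m")

# Number of tweets and total retweets per month
tweets_per_month <- table(sample_tweets$month)
retweets_per_month <- tapply(sample_tweets$retweets, sample_tweets$month, sum)

tweets_per_month
retweets_per_month
```

Run on `mydata`, the same two summaries would give the monthly tweet and retweet totals behind the figures.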