March Madness Analysis

About this Notebook

In this notebook we are going to analyze tweets from March Madness 2018
Use regular expressions to clean the tweet text
Become familiar with some natural language processing tools

Analytics Toolkit: Required Packages

tidyverse: https://www.tidyverse.org/
syuzhet: https://github.com/mjockers/syuzhet
cleanNLP: https://github.com/statsmaths/cleanNLP
wordcloud: https://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf

# Check whether each package is installed; if not, install it and then load it
if(!require("tidyverse")){
  install.packages("tidyverse", dependencies = TRUE)
  library("tidyverse")
}

if(!require("syuzhet")){
  install.packages("syuzhet", dependencies = TRUE)
  library("syuzhet")
}

## Warning: package 'syuzhet' was built under R version 3.4.4

if(!require("cleanNLP")){
  install.packages("cleanNLP", dependencies = TRUE)
  library("cleanNLP")
}

## Warning: package 'cleanNLP' was built under R version 3.4.4

if(!require("magrittr")){
  install.packages("magrittr", dependencies = TRUE)
  library("magrittr")
}

if(!require("wordcloud")){
  install.packages("wordcloud", dependencies = TRUE)
  library("wordcloud")
}

## Warning: package 'wordcloud' was built under R version 3.4.4
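
The five install-or-load blocks above all follow the same pattern, so they could be collapsed into a small helper. This is only a sketch: the function names missing_packages and load_packages are made up for illustration and are not part of any package.

```r
# Return the packages in `pkgs` that are not yet installed.
# (missing_packages is a hypothetical helper name.)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install whatever is missing, then attach every package.
# (load_packages is a hypothetical helper name.)
load_packages <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install, dependencies = TRUE)
  }
  for (pkg in pkgs) {
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Harmless check using packages that ship with R
missing_packages(c("stats", "utils"))
```

For this notebook the call would be load_packages(c("tidyverse", "syuzhet", "cleanNLP", "magrittr", "wordcloud")).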

Data Preparation: Cleaning tweets using regular expressions

Reading and inspecting the dataset

tweets <- read_csv("data/march_madness.csv")

# Change the tweet IDs from long integers to character strings
tweets$tweet_id <- as.character(tweets$tweet_id)

# Extract and delete the links variable to add it at the end
links <- tweets$links
tweets$links <- NULL

# Inspect the first 6 rows (the default for head())
head(tweets)

## # A tibble: 6 x 10
##   tweet_id text  username fullname date       datetime            verified
##   <chr>    <chr> <chr>    <chr>    <date>     <dttm>                 <int>
## 1 9802205~ Good~ @mill_c~ Mill Ca~ 2018-03-31 2018-03-31 23:09:34        0
## 2 9802732~ Look~ @Hoodie~ A.J. Sc~ 2018-04-01 2018-04-01 02:38:48        0
## 3 9802186~ #Loy~ @DanLea~ Dan Lea~ 2018-03-31 2018-03-31 23:01:56        1
## 4 9784337~ Chec~ @chisel~ Chisele~ 2018-03-27 2018-03-27 00:49:15        0
## 5 9802406~ Bye ~ @ProSpo~ Pro Spo~ 2018-04-01 2018-04-01 00:29:18        0
## 6 9802403~ Ben ~ @RyanSc~ Ryan Sc~ 2018-04-01 2018-04-01 00:27:55        0
## # ... with 3 more variables: reply <int>, retweets <int>, favorite <int>

Now let's read the dataset again, this time keeping the links column, and inspect the first 6 rows

tweets <- read_csv("data/march_madness.csv")
tweets$tweet_id <- as.character(tweets$tweet_id)
head(tweets)

## # A tibble: 6 x 11
##   tweet_id text  username fullname date       datetime            verified
##   <chr>    <chr> <chr>    <chr>    <date>     <dttm>                 <int>
## 1 9802205~ Good~ @mill_c~ Mill Ca~ 2018-03-31 2018-03-31 23:09:34        0
## 2 9802732~ Look~ @Hoodie~ A.J. Sc~ 2018-04-01 2018-04-01 02:38:48        0
## 3 9802186~ #Loy~ @DanLea~ Dan Lea~ 2018-03-31 2018-03-31 23:01:56        1
## 4 9784337~ Chec~ @chisel~ Chisele~ 2018-03-27 2018-03-27 00:49:15        0
## 5 9802406~ Bye ~ @ProSpo~ Pro Spo~ 2018-04-01 2018-04-01 00:29:18        0
## 6 9802403~ Ben ~ @RyanSc~ Ryan Sc~ 2018-04-01 2018-04-01 00:27:55        0
## # ... with 4 more variables: reply <int>, retweets <int>, favorite <int>,
## #   links <chr>
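
The regular-expression cleaning named in this section's title could look roughly like the sketch below. It uses only base R; the function name clean_tweet and the choice of what to strip (URLs, @mentions, the # symbol, punctuation) are assumptions for illustration, not steps taken from the original analysis.

```r
# Hypothetical helper: normalize tweet text with regular expressions.
clean_tweet <- function(text) {
  text <- gsub("https?://\\S+", " ", text, perl = TRUE)   # remove URLs
  text <- gsub("@\\w+", " ", text, perl = TRUE)           # remove @mentions
  text <- gsub("#", "", text, fixed = TRUE)               # keep hashtag word, drop '#'
  text <- gsub("[^[:alnum:][:space:]']", " ", text, perl = TRUE)  # drop punctuation
  text <- gsub("\\s+", " ", text, perl = TRUE)            # collapse whitespace
  trimws(tolower(text))
}

# Made-up example tweet, not taken from the data set
clean_tweet("Go #Ramblers! https://t.co/abc123 @LoyolaChicago")
```

Because gsub is vectorized, the same function could be applied to the whole column, for example tweets$clean_text <- clean_tweet(tweets$text).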

Task 1: Data Exploration - Tableau

1A) Generally describe the data (summary)

summary(tweets)

##    tweet_id             text             username        
##  Length:20187       Length:20187       Length:20187      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##    fullname              date               datetime                  
##  Length:20187       Min.   :2011-02-09   Min.   :2011-02-09 19:42:51  
##  Class :character   1st Qu.:2018-03-24   1st Qu.:2018-03-24 11:03:02  
##  Mode  :character   Median :2018-03-25   Median :2018-03-25 00:22:38  
##                     Mean   :2018-03-18   Mean   :2018-03-18 10:40:33  
##                     3rd Qu.:2018-03-25   3rd Qu.:2018-03-25 03:08:07  
##                     Max.   :2018-04-06   Max.   :2018-04-06 21:24:17  
##     verified           reply             retweets           favorite      
##  Min.   :0.00000   Min.   :  0.0000   Min.   :   0.000   Min.   :    0.0  
##  1st Qu.:0.00000   1st Qu.:  0.0000   1st Qu.:   0.000   1st Qu.:    0.0  
##  Median :0.00000   Median :  0.0000   Median :   0.000   Median :    1.0  
##  Mean   :0.06192   Mean   :  0.3467   Mean   :   3.146   Mean   :   15.8  
##  3rd Qu.:0.00000   3rd Qu.:  0.0000   3rd Qu.:   0.000   3rd Qu.:    3.0  
##  Max.   :1.00000   Max.   :591.0000   Max.   :5143.000   Max.   :32180.0  
##     links          
##  Length:20187      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

1B) Use Tableau to create at least 5 plots

knitr::include_graphics("image11.png")

This shows that most of the tweets occurred on Sunday. Tweet volume rose as the weekend approached, and on Sunday many people took to Twitter and favorited a large number of tweets.

knitr::include_graphics("image12.png")

These are the tweet IDs with the most favorites. One tweet ID had over 30,000 favorites, which is a very large number for a single tweet.

knitr::include_graphics("image13.png")

This graph shows the time of day at which most activity occurs. Retweets peaked at midnight, reaching nearly 30,000. A likely explanation is that people are out after watching the games and are on their phones retweeting what others wrote about them.

knitr::include_graphics("image14.png")

As expected, most of the March Madness retweets occurred in March, since that is the month in which the tournament takes place.

knitr::include_graphics("image15.png")

It was cool to see all the spikes, because they show when the games occurred, and the further Loyola advanced, the higher the spike.

1C) Explain each plot and relate it to the date/time of the tweets

Explanations are above.

Task 2: Data Analysis

2A) Based on your plots and data description, give a general narrative of Loyola's image on Twitter

At first almost nobody was talking about Loyola, but the further Loyola advanced in the tournament, the more recognition it received on social media. I do not have a Twitter account, so I do not know exactly how favorites and retweets work, but the closer it got to game time, the more tweets went out.

2B) Use descriptive statistics to back up your arguments

summary(tweets)

##    tweet_id             text             username        
##  Length:20187       Length:20187       Length:20187      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##    fullname              date               datetime                  
##  Length:20187       Min.   :2011-02-09   Min.   :2011-02-09 19:42:51  
##  Class :character   1st Qu.:2018-03-24   1st Qu.:2018-03-24 11:03:02  
##  Mode  :character   Median :2018-03-25   Median :2018-03-25 00:22:38  
##                     Mean   :2018-03-18   Mean   :2018-03-18 10:40:33  
##                     3rd Qu.:2018-03-25   3rd Qu.:2018-03-25 03:08:07  
##                     Max.   :2018-04-06   Max.   :2018-04-06 21:24:17  
##     verified           reply             retweets           favorite      
##  Min.   :0.00000   Min.   :  0.0000   Min.   :   0.000   Min.   :    0.0  
##  1st Qu.:0.00000   1st Qu.:  0.0000   1st Qu.:   0.000   1st Qu.:    0.0  
##  Median :0.00000   Median :  0.0000   Median :   0.000   Median :    1.0  
##  Mean   :0.06192   Mean   :  0.3467   Mean   :   3.146   Mean   :   15.8  
##  3rd Qu.:0.00000   3rd Qu.:  0.0000   3rd Qu.:   0.000   3rd Qu.:    3.0  
##  Max.   :1.00000   Max.   :591.0000   Max.   :5143.000   Max.   :32180.0  
##     links          
##  Length:20187      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

The maximum number of favorites was 32,180, which matches my Tableau plot, where a single tweet ID received 32,180 favorites on its own. Most of the traffic on Twitter happened at certain times, so marketing teams should advertise on Twitter at those times to maximize the number of people who see their posts.
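
The timing claims from the Tableau plots could also be checked directly in R by counting tweets per weekday and summing retweets per hour. The sketch below runs on a tiny made-up data frame (sample_tweets is hypothetical and only borrows the datetime, retweets, and favorite column names from the real data); the same steps should apply to the full tweets table.

```r
# Hypothetical sample with the same column names as the tweets data set
sample_tweets <- data.frame(
  datetime = as.POSIXct(c("2018-03-24 23:10:00", "2018-03-25 00:05:00",
                          "2018-03-25 00:40:00", "2018-03-27 14:00:00"),
                        tz = "UTC"),
  retweets = c(2, 10, 5, 1),
  favorite = c(4, 30, 12, 0)
)

# Derive weekday and hour without relying on the system locale
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
lt <- as.POSIXlt(sample_tweets$datetime, tz = "UTC")
sample_tweets$weekday <- factor(day_names[lt$wday + 1], levels = day_names)
sample_tweets$hour <- lt$hour

# Number of tweets per weekday and total retweets per hour of day
tweets_per_weekday <- table(sample_tweets$weekday)
retweets_per_hour <- aggregate(retweets ~ hour, data = sample_tweets, FUN = sum)

tweets_per_weekday
retweets_per_hour
```

Comparing the mean and median in the summary output above (for favorite, 15.8 versus 1.0) is another quick check: it indicates that a few very popular tweets account for most of the engagement.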

2C) Any recommendations for Loyola’s marketing team

I recommend that Loyola advertise more on weekends, late at night, because that is when most people were retweeting and favoriting, and Loyola could get its name out to a lot of people by doing so.

Task 3: Watson Analysis

3A) Use Watson Analytics to explore the data

3B) Give at least 3 plots or discoveries using Watson. Explain your findings.

knitr::include_graphics("image16.png")

This confirms what I found in Tableau: March has the highest Twitter traffic regarding March Madness.

knitr::include_graphics("image17.png")

This plot shows favorites by username; as expected, the Ramblers Men’s Bball team account has the most favorites.

knitr::include_graphics("image18.png")

The strongest drivers of verified are favorite and reply, with a strength of 94%, which is quite strong.