March Madness Analysis

About this Notebook

In this notebook we analyze tweets from March Madness 2018
Use regular expressions to clean the tweet text
Become familiar with some natural language processing tools
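
The cleaning step itself is not shown in this notebook, since the dataset loaded below already contains cleaned text. A minimal sketch of what such a step might look like, assuming the goal is to drop URLs and replace every non-alphanumeric character with a space (which is consistent with the tweet text printed later, but is only an assumption). The helper name clean_tweet and the sample tweet are hypothetical.

```r
# Hypothetical helper, not part of the original notebook:
# remove URLs, then replace anything that is not a letter, digit, or space.
clean_tweet <- function(text) {
  text <- gsub("https?://\\S+", " ", text, perl = TRUE)  # drop URLs first
  text <- gsub("[^A-Za-z0-9 ]", " ", text)               # punctuation, #, @, emoji
  trimws(text)                                           # trim leading/trailing spaces
}

# Made-up sample tweet for illustration
cleaned <- clean_tweet("Go @RamblersMBB! #FinalFour https://t.co/abc")
cleaned
```

Replacing characters with a space rather than deleting them keeps words from running together, at the cost of leaving runs of repeated spaces in the text.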

Analytics Toolkit: Required Packages

tidyverse: https://www.tidyverse.org/
syuzhet: https://github.com/mjockers/syuzhet
cleanNLP: https://github.com/statsmaths/cleanNLP
wordcloud: https://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf

# Load each package, installing it first if it is not already available
if(!require("tidyverse")){
  install.packages("tidyverse", dependencies = TRUE)
  library("tidyverse")
}

if(!require("syuzhet")){
  install.packages("syuzhet", dependencies = TRUE)
  library("syuzhet")
}

if(!require("cleanNLP")){
  install.packages("cleanNLP", dependencies = TRUE)
  library("cleanNLP")
}

if(!require("magrittr")){
  install.packages("magrittr", dependencies = TRUE)
  library("magrittr")
}

if(!require("wordcloud")){
  install.packages("wordcloud", dependencies = TRUE)
  library("wordcloud")
}

Sentiment Analysis: Natural Language Processing

First, let's read the dataset and inspect the first six rows of the sentiment columns (columns 12 to 21)

tweets <- read_csv("data/sentiment_march_madness.csv")
tweets$tweet_id <- as.character(tweets$tweet_id)
head(tweets[12:21])

## # A tibble: 6 x 10
##   anticipation disgust  fear   joy sadness surprise trust negative
##          <int>   <int> <int> <int>   <int>    <int> <int>    <int>
## 1            2       0     0     2       0        2     2        0
## 2            1       3     3     1       2        2     3        4
## 3            0       0     0     0       0        0     0        0
## 4            1       0     0     1       0        0     0        0
## 5            2       0     0     1       0        1     1        0
## 6            1       1     3     0       0        0     1        1
## # ... with 2 more variables: positive <int>, sentiment_bing <int>

For the natural language processing step we use the cleanNLP package. cleanNLP can run on several backends, including Stanford CoreNLP; the code below initializes the udpipe backend. We then create an annotation object from the text column, with tweet_id as the document ID and the remaining columns passed as metadata

cnlp_init_udpipe()

## Loading required namespace: udpipe

doc <- cnlp_annotate(input = tweets$text, as_strings = TRUE, doc_ids = tweets$tweet_id, meta = tweets[-c(1,2)])

## Warning in cnlp_annotate(input = tweets$text, as_strings = TRUE, doc_ids =
## tweets$tweet_id, : duplicated document ids given
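
One possible cause of this warning, offered as an assumption rather than a diagnosis: tweet IDs are 18-digit integers, which cannot all be represented exactly when they are parsed as numbers, so distinct IDs can collapse to the same value before as.character() is applied. The sketch below shows the effect on a toy CSV with made-up IDs, using base R's read.csv so that it is self-contained.

```r
# Toy CSV with two distinct 18-digit IDs (made-up values, not from the dataset)
csv_text <- "tweet_id,text\n977400000000000001,first\n977400000000000002,second"

# Parsed as numbers, both IDs are rounded to the same double
as_number <- read.csv(text = csv_text)
ids_num <- as.character(as_number$tweet_id)
any(duplicated(ids_num))            # TRUE: the two IDs are no longer distinct

# Parsed as character, the IDs keep every digit
as_text <- read.csv(text = csv_text, colClasses = c(tweet_id = "character"))
any(duplicated(as_text$tweet_id))   # FALSE
```

With readr, the equivalent would be passing col_types = cols(tweet_id = col_character()) to read_csv. If the IDs were already rounded when the CSV was saved, or if the same tweet was collected more than once, the duplicates are in the file itself and this change alone would not remove the warning.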

The plot below shows how the Bing sentiment score changes from tweet to tweet, in the order the tweets appear in the dataset

qplot(x = 1:length(tweets$sentiment_bing), 
      y = tweets$sentiment_bing, 
      geom = "line", 
      xlab = "Narrative Time", 
      ylab = "Emotional Valence", 
      main = "Tweets Sentiment Trajectory")

Look at tweets that express anger (anger score above zero)

angry_tweets <- which(tweets$anger > 0)
tibble(tweet = tweets$text[angry_tweets][1:2])  # tibble() replaces the deprecated data_frame()

## # A tibble: 2 x 1
##   tweet                                                                   
##   <chr>                                                                   
## 1 Look  I get that that you re all excited that you beat an 11 seed  but …
## 2 Ben Richardson was extremely emotional leaving the court  screaming int…

Look at tweets that express joy (joy score above zero)

joy_tweets <- which(tweets$joy > 0)
tibble(tweet = tweets$text[joy_tweets][5:7])  # tibble() replaces the deprecated data_frame()

## # A tibble: 3 x 1
##   tweet                                                                   
##   <chr>                                                                   
## 1 Thank you Loyola of Chicago and Sr Jean  What great a basketball run pl…
## 2 After being honored at tomorrow s  chicagobulls game the  FinalFour  Ra…
## 3 With everything being said  I respect Loyola so much for what they acco…

Let's explore the emotions in the tweets in more depth. Here we extract the eight emotion columns and compute each emotion's share of the total emotion count.

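# Columns 11 to 18 hold the eight emotion counts (anger through trust)
emotion_cols <- names(tweets)[11:18]
# Share of each emotion in the total emotion count
value <- as.double(colSums(prop.table(tweets[, 11:18])))
# Order the factor levels by share so the bar chart is sorted
emotion <- factor(emotion_cols, levels = emotion_cols[order(value, decreasing = FALSE)])
emotions <- tibble(emotion, percent = value * 100)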

head(emotions)

## # A tibble: 6 x 2
##   emotion      percent
##   <fct>          <dbl>
## 1 anger           6.72
## 2 anticipation   21.8 
## 3 disgust         3.58
## 4 fear            8.07
## 5 joy            21.1 
## 6 sadness         5.62
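
To make the percentage calculation concrete, here is a small worked example on made-up counts (the toy data frame is not from the dataset). Each percentage is the emotion's column total divided by the grand total of all emotion counts, which is what colSums(prop.table(...)) computes.

```r
# Toy emotion counts for three tweets (made-up numbers)
toy <- data.frame(joy = c(2, 1, 0), trust = c(1, 0, 1), anger = c(0, 0, 5))

# Column totals are 3, 2, and 5; the grand total is 10
share <- colSums(toy) / sum(toy) * 100
share   # joy 30, trust 20, anger 50

# Same result via prop.table(), as used in the notebook
share_prop <- colSums(prop.table(as.matrix(toy))) * 100
all.equal(share, share_prop)
```

Note that these are shares of emotion word counts, not shares of tweets: a single tweet with many angry words counts several times.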

Now we can plot the emotions in the March Madness tweets

ggplot(data = emotions, aes(x = emotion, y = percent)) + 
  geom_bar(stat = "identity", aes(fill = emotion)) + 
  scale_fill_brewer(palette="RdYlGn") + 
  coord_flip() +
  xlab("Emotion") +
  ylab("Percentage")

Task 1: Data Exploration - Tableau

1A) Generally describe the data (summary)

mydata <- read.csv('data/sentiment_march_madness.csv')
summary(mydata)

##     tweet_id                   text                   username    
##  Min.   :3.542e+16               : 1273   @LALATE         :   81  
##  1st Qu.:9.774e+17               : 1245   @RamblersMBB    :   30  
##  Median :9.777e+17               :  197   @SkywayChicago  :   27  
##  Mean   :9.753e+17               :   51   @chicagomargaret:   21  
##  3rd Qu.:9.777e+17               :   35   @sschrimp       :   18  
##  Max.   :9.824e+17     SisterJean:   15   @loyolaforus    :   16  
##                      (Other)     :17371   (Other)         :19994  
##               fullname          date                       datetime    
##  LALATE           :   81   3/25/18:10708   2018-03-25T00:21:10Z:   16  
##  Loyola Basketball:   31   3/23/18: 2976   2018-03-25T00:21:31Z:   16  
##  Steve Timble     :   27   3/24/18: 2274   2018-03-25T00:21:09Z:   15  
##  Margaret Holt    :   21   3/26/18: 1504   2018-03-25T00:21:35Z:   15  
##  Mark             :   21   3/18/18: 1099   2018-03-25T00:21:08Z:   14  
##  Steve            :   19   3/27/18:  241   2018-03-25T00:21:11Z:   14  
##  (Other)          :19987   (Other): 1385   (Other)             :20097  
##     verified           reply             retweets           favorite      
##  Min.   :0.00000   Min.   :  0.0000   Min.   :   0.000   Min.   :    0.0  
##  1st Qu.:0.00000   1st Qu.:  0.0000   1st Qu.:   0.000   1st Qu.:    0.0  
##  Median :0.00000   Median :  0.0000   Median :   0.000   Median :    1.0  
##  Mean   :0.06192   Mean   :  0.3467   Mean   :   3.146   Mean   :   15.8  
##  3rd Qu.:0.00000   3rd Qu.:  0.0000   3rd Qu.:   0.000   3rd Qu.:    3.0  
##  Max.   :1.00000   Max.   :591.0000   Max.   :5143.000   Max.   :32180.0  
##                                                                           
##      anger         anticipation       disgust             fear       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.1342   Mean   :0.4359   Mean   :0.07143   Mean   :0.1612  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :4.0000   Max.   :7.0000   Max.   :3.00000   Max.   :6.0000  
##                                                                      
##       joy           sadness          surprise          trust       
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.421   Mean   :0.1122   Mean   :0.1798   Mean   :0.4806  
##  3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :8.000   Max.   :5.0000   Max.   :4.0000   Max.   :7.0000  
##                                                                    
##     negative         positive      sentiment_bing   
##  Min.   :0.0000   Min.   :0.0000   Min.   :-5.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 0.0000  
##  Median :0.0000   Median :0.0000   Median : 0.0000  
##  Mean   :0.2395   Mean   :0.6676   Mean   : 0.5141  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.: 1.0000  
##  Max.   :6.0000   Max.   :9.0000   Max.   :11.0000  
##                                                     
##                             links      
##  @RamblersMBB                  : 1139  
##  #LoyolaChicago                : 1027  
##  #SisterJean                   :  778  
##  https://twitter.com#SisterJean:  231  
##  #LoyolaChicago; #MarchMadness :  208  
##  (Other)                       :16117  
##  NA's                          :  687

The data contains tweets about Loyola Men’s Basketball during March Madness. The most frequently mentioned account was @RamblersMBB, which is the official Loyola Men’s Basketball Twitter account. The most common hashtags were #LoyolaChicago and #SisterJean. The date with the most tweets about Loyola was March 25th, with 10,708 tweets.
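
The per-day counts can also be computed directly in R. The sketch below uses a toy data frame with the same formats as the date and datetime columns (the rows are illustrative only). It also converts the UTC timestamps to Chicago local time, on the assumption that the trailing Z in datetime marks UTC; an evening game in the US can then fall on the next calendar day in UTC.

```r
# Toy rows in the same formats as the `date` and `datetime` columns
toy <- data.frame(
  date     = c("3/25/18", "3/25/18", "3/23/18"),
  datetime = c("2018-03-25T00:21:10Z", "2018-03-25T00:21:31Z",
               "2018-03-23T18:00:00Z"),
  stringsAsFactors = FALSE
)

# Parse the m/d/yy strings and count tweets per day
toy$day <- as.Date(toy$date, format = "%m/%d/%y")
per_day <- table(toy$day)
per_day

# Parse the timestamps as UTC, then express the day in Chicago local time
utc <- as.POSIXct(toy$datetime, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
local_day <- format(utc, "%Y-%m-%d", tz = "America/Chicago")
local_day
```

In this toy example the two tweets stamped shortly after midnight UTC on March 25th belong to the evening of March 24th in Chicago.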

1B) Use Tableau to create at least 5 plots

1C) Explain each plot and relate it to the date/time of the tweets

1: This shows official Loyola Twitter accounts and the number of replies, retweets, and favorites they had during March Madness. The RamblersMBB account had the most interactions with people and LoyolaQuinlan had barely any. Quinlan can use this to improve their social media interactions.

2: This shows the number of retweets per username. The University of Michigan’s basketball Twitter account had the highest number of retweets, and Loyola had significantly fewer. Many of the other usernames with high numbers of retweets were related to the city of Chicago, such as the Cubs and the Bulls accounts.

3: This shows the number of tweets per day during March Madness about Loyola. March 25th had the highest number of tweets per day, which could be related to the fact that Loyola won the Elite Eight game the previous night.

4: This shows the number of favorites for verified users compared to unverified users. Verified users typically had more favorites, with the most being 32,149. This makes sense because verified users typically have a larger following and their tweets are seen by more people.

5: This shows the most common links in a tweet. The larger the circle, the more times that link was mentioned. The hashtag #SisterJean had one of the highest numbers of mentions, along with the username @umichbball.

Task 2: Data Analysis

2A) Based on your plots and data description, give a general narrative for the image of Loyola on Twitter

Loyola had a mostly positive image on Twitter, and many of the mentions were related to Sister Jean. However, Loyola did not have as many Twitter interactions, such as favorites or retweets, as the University of Michigan or other March Madness related accounts.

2B) Use descriptive statistics to back up your arguments

qplot(x = 1:length(tweets$sentiment_bing), 
      y = tweets$sentiment_bing, 
      geom = "line", 
      xlab = "Narrative Time", 
      ylab = "Emotional Valence", 
      main = "Tweets Sentiment Trajectory")

ggplot(data = emotions, aes(x = emotion, y = percent)) + 
  geom_bar(stat = "identity", aes(fill = emotion)) + 
  scale_fill_brewer(palette="RdYlGn") + 
  coord_flip() +
  xlab("Emotion") +
  ylab("Percentage")

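These plots show that the tweets about Loyola were mostly positive in tone. Sadness and disgust were rare (5.62% and 3.58% of emotion words), while trust was the most common emotion (about 24%, based on the column means in the summary above), followed by anticipation (21.8%) and joy (21.1%).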
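
The plots could be complemented with a few summary numbers. A minimal sketch on made-up scores (the score vector is illustrative; in this notebook it would be tweets$sentiment_bing), showing the mean score and the share of positive, negative, and neutral tweets:

```r
# Made-up Bing sentiment scores for eight tweets
score <- c(2, 0, -1, 1, 0, 3, -2, 1)

summary_stats <- c(
  mean         = mean(score),             # average valence
  pct_positive = mean(score > 0) * 100,   # share of tweets scoring above zero
  pct_negative = mean(score < 0) * 100,   # share of tweets scoring below zero
  pct_neutral  = mean(score == 0) * 100   # share of tweets scoring exactly zero
)
summary_stats
```

Taking the mean of a logical vector gives the proportion of TRUE values, so no explicit counting is needed.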

2C) Any recommendations for Loyola’s marketing team

Loyola should do a better job of interacting with Twitter users and creating content that encourages users to interact with Loyola by retweeting or favoriting its tweets.

Task 3: Watson Analysis

3A) Use Watson Analytics to explore the data

3B) Give at least 3 plots or discoveries using Watson. Explain your findings.

1: This shows how many positive tweets there were per day. March 25th had the most positive tweets and was also the day with the most tweets about Loyola. The number of positive tweets dropped significantly the following day and did not recover, most likely due to Loyola’s loss.

2: This shows the users whose tweets about Loyola carried the most angry sentiment. @loyolaforus, an account for a workers’ coalition, had angry tweets directed at Loyola. This is most likely unrelated to March Madness.