About this Notebook



Analytics Toolkit: Require Packages


# Here we are checking if the package is installed
if(!require("tidyverse")){
  install.packages("tidyverse", dependencies = TRUE)
  library("tidyverse")
}
if(!require("syuzhet")){
  install.packages("syuzhet", dependencies = TRUE)
  library("syuzhet")
}
if(!require("cleanNLP")){
  install.packages("cleanNLP", dependencies = TRUE)
  library("cleanNLP")
}
if(!require("magrittr")){
  install.packages("magrittr", dependencies = TRUE)
  library("magrittr")
}
if(!require("wordcloud")){
  install.packages("wordcloud", dependencies = TRUE)
  library("wordcloud")
}

Data Preparation: Cleaning tweets using regular expressions


Reading and inspecting the dataset

tweets <- read_csv("data/march_madness.csv")
# Change the tweet IDs from long integers to character strings
tweets$tweet_id <- as.character(tweets$tweet_id)
# Extract and delete the links variable to add it at the end
links <- tweets$links
tweets$links <- NULL
# Inspect the first rows (head() shows 6 by default)
head(tweets)

Next, we clean the text of the raw tweets with regular expressions, removing URLs, HTML entities, retweet markers, hashtags, mentions, and punctuation.

replace_reg <- 'https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|(pic.twitter.com/[A-Za-z\\d]+)|&amp;|&lt;|&gt;|RT|(.*.)\\.com(.*.)\\S+\\s|[^[:alnum:]]|(http|https)\\S+\\s*|(#|@)\\S+\\s*|\\n|\\"'
tweets <- tweets %>% 
  mutate(text = str_replace_all(text, replace_reg, " ")) %>% 
  # Drop non-ASCII characters. The original call,
  # iconv(text, from = "ASCII", to = "UTF-8", sub = " "),
  # stopped with "Evaluation error: embedded nul in string" on tweets with non-ASCII text.
  # Assumption: read_csv() has loaded the text as UTF-8.
  mutate(text = iconv(text, from = "UTF-8", to = "ASCII", sub = " "))
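
The cleaning step can also be split into smaller passes, which makes it easier to see what each pattern removes. The following is a minimal sketch in base R, not the notebook's own code: the sample tweet and the `clean_tweet` helper are hypothetical, and the patterns are simplified versions of the ones in `replace_reg`.

```r
# Hypothetical sample tweet, not taken from the dataset
sample_tweet <- "RT @RamblersMBB: Final Four bound! #SisterJean https://t.co/abc123 &amp; more"

# Hypothetical helper: apply one simplified pattern per pass
clean_tweet <- function(x) {
  x <- gsub("https?://\\S+", " ", x)     # drop URLs
  x <- gsub("&amp;|&lt;|&gt;", " ", x)   # drop HTML entities
  x <- gsub("[#@]\\S+", " ", x)          # drop hashtags and mentions
  x <- gsub("\\bRT\\b", " ", x)          # drop the retweet marker as a whole word
  x <- gsub("[^[:alnum:]]", " ", x)      # replace remaining punctuation with spaces
  x <- gsub("\\s+", " ", x)              # collapse runs of whitespace
  trimws(x)
}

cleaned <- clean_tweet(sample_tweet)
print(cleaned)
```

Matching `RT` only as a whole word avoids deleting the letters "RT" from inside other words, which a bare `RT` alternative would do.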

Here we use Saif Mohammad’s NRC Emotion Lexicon to extract the sentiment of each tweet. The NRC Emotion Lexicon is a list of words and their associations with eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).
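
To illustrate how a word-emotion lexicon turns text into scores, here is a toy sketch in base R. The lexicon below is made up for illustration and is not the NRC lexicon, and `score_emotions` is a hypothetical helper.

```r
# Toy lexicon: hypothetical words and labels, NOT the real NRC lexicon
toy_lexicon <- data.frame(
  word    = c("win", "win", "love", "lose", "afraid"),
  emotion = c("joy", "positive", "joy", "sadness", "fear"),
  stringsAsFactors = FALSE
)

# Count how often each emotion is triggered by the words of one text
score_emotions <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  hits <- unlist(lapply(words, function(w) lexicon$emotion[lexicon$word == w]))
  table(factor(hits, levels = unique(lexicon$emotion)))
}

emotion_counts <- score_emotions("We love to win and win again", toy_lexicon)
print(emotion_counts)
```

With the real lexicon, the `get_nrc_sentiment()` function in the syuzhet package loaded above takes a character vector and returns a data frame with one column per emotion and sentiment.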

Here we are going to merge the NRC Emotion results with the original data to create a complete dataset.
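
One way to combine per-tweet scores with the original columns is a join on the tweet ID. The sketch below uses base R `merge()` on two small hypothetical data frames; the column names other than `tweet_id` and `text` are assumptions.

```r
# Hypothetical stand-ins for the tweets table and the per-tweet emotion scores
tweets_toy <- data.frame(
  tweet_id = c("101", "102", "103"),
  text     = c("love this team", "so afraid to lose", "game tonight"),
  stringsAsFactors = FALSE
)
emotions_toy <- data.frame(
  tweet_id = c("101", "102", "103"),
  joy      = c(1, 0, 0),
  fear     = c(0, 1, 0),
  sadness  = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# Join on the shared key so each tweet keeps its metadata and gains its scores
tweets_scored <- merge(tweets_toy, emotions_toy, by = "tweet_id")
print(tweets_scored)
```

Keeping `tweet_id` as a character string, as done when the data is read, avoids precision loss when very large IDs are used as join keys.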

As a second sentiment metric, we use another lexicon, developed by Professor Minqing Hu and Professor Bing Liu of the University of Illinois at Chicago.
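
A positive/negative word list of this kind is typically reduced to a single net score per text. The following base R sketch shows the idea with made-up word lists; they are not the actual Hu and Liu lexicon, and `polarity_score` is a hypothetical helper.

```r
# Toy word lists: hypothetical entries, NOT the actual Hu and Liu lexicon
positive_words <- c("love", "win", "great", "congrats")
negative_words <- c("lose", "bad", "awful", "hate")

# Net polarity: +1 for each positive word, -1 for each negative word
polarity_score <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

scores <- sapply(
  c("Congrats, great win!", "Awful call, bad loss", "Tip-off at 6"),
  polarity_score
)
print(unname(scores))
```

In the syuzhet package, `get_sentiment()` with `method = "bing"` computes a score of this kind from the full lexicon.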

Sentiment Analysis: Natural Language Processing


First, let's read the dataset and inspect the first 10 rows.

For sentiment analysis we use the cleanNLP package, which uses the Stanford CoreNLP natural language software in the backend. First we initialize the CoreNLP engine, then we create an annotation object from the text column; tweet_id and the other columns are passed as metadata.

Distribution of tweet/sentence length. A tweet can contain at most 280 characters.
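
Tweet length can be measured in characters or in words. This is a minimal base R sketch on hypothetical sample tweets, not the notebook's own code.

```r
# Hypothetical sample tweets, not taken from the dataset
sample_text <- c("Go Ramblers", "What a game that was", "Final Four")

char_len <- nchar(sample_text)                      # characters per tweet
word_len <- lengths(strsplit(sample_text, "\\s+"))  # whitespace-separated words per tweet

print(summary(char_len))
print(word_len)

# Histogram of the character counts
hist(char_len, main = "Tweet length", xlab = "Characters")
```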

Here we can see the change of sentiment in the tweets

Here we can find the most used entities from the tweets entity table. The document corpus yields an alternative way to see the underlying topics.
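
Ranking entities comes down to counting how often each one is mentioned. The sketch below assumes a hypothetical entity table with one row per mention; the column names are illustrative and may not match what cleanNLP returns.

```r
# Hypothetical entity table: one row per entity mention, keyed by tweet
entities_toy <- data.frame(
  tweet_id = c("101", "101", "102", "103", "103"),
  entity   = c("Loyola", "Chicago", "Loyola", "Sister Jean", "Loyola"),
  stringsAsFactors = FALSE
)

# Count mentions per entity and list the most frequent first
entity_counts <- sort(table(entities_toy$entity), decreasing = TRUE)
print(head(entity_counts, 3))
```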

Here we create a high-level summary of the tweet text by extracting all direct-object dependencies.

Look at the tweets with negative sentiment

Look at the tweets with positive sentiment

Let's explore the emotions in the tweets in more depth. Here we extract the emotion variables and create a subset.

Now we can create a plot of the emotions in the March Madness tweets
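
A simple version of such a plot sums each emotion column and draws the totals as bars. The sketch below uses base R graphics on hypothetical scores; the values are invented for illustration.

```r
# Hypothetical per-tweet emotion scores (rows are tweets, columns are emotions)
emotion_scores_toy <- data.frame(
  joy          = c(2, 1, 0, 3),
  trust        = c(1, 0, 1, 2),
  anticipation = c(0, 2, 1, 1),
  sadness      = c(0, 0, 1, 0)
)

# Total score per emotion, largest first
emotion_totals <- sort(colSums(emotion_scores_toy), decreasing = TRUE)
print(emotion_totals)

barplot(emotion_totals, las = 2, main = "Emotions in tweets", ylab = "Total score")
```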


Reporting: A Wordcloud from March Madness Tweets


Wordclouds are a fun and engaging way to display data. Here we set some stop words that we don't want in the plot.

Now let's create a dataframe of words and filter it using the predefined stop words

Set a threshold for the min/max frequency of words and scale of the wordcloud

The last step is to create the wordcloud by counting the frequency of the words
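
The wordcloud steps above can be sketched end to end in a few lines. The text, stop words, and threshold below are hypothetical, and the plotting call runs only if the wordcloud package is installed.

```r
# Hypothetical cleaned tweets and stop words, not taken from the dataset
clean_text <- c("loyola wins again", "loyola final four", "the final four is set")
stop_words <- c("the", "is", "again")

words <- unlist(strsplit(clean_text, "\\s+"))
words <- words[!words %in% stop_words]              # drop stop words
word_freq <- sort(table(words), decreasing = TRUE)  # count each remaining word

min_freq <- 2                                       # frequency threshold (assumed value)
frequent <- word_freq[word_freq >= min_freq]
print(frequent)

# Draw the cloud only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(frequent), as.vector(frequent),
                       min.freq = min_freq, scale = c(3, 0.5))
}
```

Filtering before plotting makes it possible to inspect exactly which words will appear in the cloud.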


Task 1: Data Exploration - Tableau


1A) Generally describe the data (summary)

The data has 11 columns: tweet_id, text, username, fullname, date, datetime, verified, reply, retweets, favorite, and links. These correspond to features of the Twitter application.

mydata <- read.csv("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\09-notebook-lab\\data\\march_madness.csv")
summary(mydata)
    tweet_id        
 Min.   :3.542e+16  
 1st Qu.:9.774e+17  
 Median :9.777e+17  
 Mean   :9.753e+17  
 3rd Qu.:9.777e+17  
 Max.   :9.824e+17  
                    
                                                                                                                                                                                                                                                                                             text      
 Yes! My favorite sports city has a team in the #FinalFour of #MarchMadness! The Loyola-Chicago Ramblers are in the final 4 for the 1st time since 1963! They’ve been underdogs this whole tournament & still keep winning! Go @RamblersMBB! Win for #SisterJean! #OnwardLU #NoFinishLine:    8  
 Congrats @RamblersMBB                                                                                                                                                                                                                                                                         :    7  
 Let’s go @RamblersMBB                                                                                                                                                                                                                                                                    :    4  
 Congrats @RamblersMBB.                                                                                                                                                                                                                                                                        :    3  
 Congratulations @RamblersMBB                                                                                                                                                                                                                                                                  :    3  
 I love #SisterJean                                                                                                                                                                                                                                                                            :    3  
 (Other)                                                                                                                                                                                                                                                                                       :20159  
             username                  fullname             date                       datetime        verified           reply             retweets       
 @LALATE         :   81   LALATE           :   81   2018-03-25:10708   2018-03-25T00:21:10Z:   16   Min.   :0.00000   Min.   :  0.0000   Min.   :   0.000  
 @RamblersMBB    :   30   Loyola Basketball:   31   2018-03-23: 2976   2018-03-25T00:21:31Z:   16   1st Qu.:0.00000   1st Qu.:  0.0000   1st Qu.:   0.000  
 @SkywayChicago  :   27   Steve Timble     :   27   2018-03-24: 2274   2018-03-25T00:21:09Z:   15   Median :0.00000   Median :  0.0000   Median :   0.000  
 @chicagomargaret:   21   Margaret Holt    :   21   2018-03-26: 1504   2018-03-25T00:21:35Z:   15   Mean   :0.06192   Mean   :  0.3467   Mean   :   3.146  
 @sschrimp       :   18   Mark             :   21   2018-03-18: 1099   2018-03-25T00:21:08Z:   14   3rd Qu.:0.00000   3rd Qu.:  0.0000   3rd Qu.:   0.000  
 @loyolaforus    :   16   Steve            :   19   2018-03-27:  241   2018-03-25T00:21:11Z:   14   Max.   :1.00000   Max.   :591.0000   Max.   :5143.000  
 (Other)         :19994   (Other)          :19987   (Other)   : 1385   (Other)             :20097                                                          
    favorite                                  links      
 Min.   :    0.0   @RamblersMBB                  : 1139  
 1st Qu.:    0.0   #LoyolaChicago                : 1027  
 Median :    1.0   #SisterJean                   :  778  
 Mean   :   15.8   https://twitter.com#SisterJean:  231  
 3rd Qu.:    3.0   #LoyolaChicago; #MarchMadness :  208  
 Max.   :32180.0   (Other)                       :16117  
                   NA's                          :  687  

1B) Use Tableau to create at least 5 plots

knitr::include_graphics("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\img1.png")

knitr::include_graphics("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\img2.png")

knitr::include_graphics("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\img3.png")

1C) Explain each plot and relate it to the date/time of the tweets

The first figure shows the number of records both as a chart and as a table. March appears to have the most tweets in this data set. The second figure shows how many of those tweets were verified, retweeted, or favorited, and the time of day when tweets were most active. The third figure breaks down the tweet features by number of tweets, and the bubble chart shows the frequency of links: the more often a hashtag was tweeted, the bigger its circle.

Task 2: Data Analysis


2A) Based on your plots and data description, give a general narrative for the image of Loyola on Twitter

Looking at the most frequent links/hashtags, #LoyolaChicago, #SisterJean, and @RamblersMBB appear most often, giving Loyola a positive image.

2B) Use descriptive statistics to back up your arguments

summary(mydata)
    tweet_id        
 Min.   :3.542e+16  
 1st Qu.:9.774e+17  
 Median :9.777e+17  
 Mean   :9.753e+17  
 3rd Qu.:9.777e+17  
 Max.   :9.824e+17  
                    
                                                                                                                                                                                                                                                                                             text      
 Yes! My favorite sports city has a team in the #FinalFour of #MarchMadness! The Loyola-Chicago Ramblers are in the final 4 for the 1st time since 1963! They’ve been underdogs this whole tournament & still keep winning! Go @RamblersMBB! Win for #SisterJean! #OnwardLU #NoFinishLine:    8  
 Congrats @RamblersMBB                                                                                                                                                                                                                                                                         :    7  
 Let’s go @RamblersMBB                                                                                                                                                                                                                                                                    :    4  
 Congrats @RamblersMBB.                                                                                                                                                                                                                                                                        :    3  
 Congratulations @RamblersMBB                                                                                                                                                                                                                                                                  :    3  
 I love #SisterJean                                                                                                                                                                                                                                                                            :    3  
 (Other)                                                                                                                                                                                                                                                                                       :20159  
             username                  fullname             date                       datetime        verified           reply             retweets       
 @LALATE         :   81   LALATE           :   81   2018-03-25:10708   2018-03-25T00:21:10Z:   16   Min.   :0.00000   Min.   :  0.0000   Min.   :   0.000  
 @RamblersMBB    :   30   Loyola Basketball:   31   2018-03-23: 2976   2018-03-25T00:21:31Z:   16   1st Qu.:0.00000   1st Qu.:  0.0000   1st Qu.:   0.000  
 @SkywayChicago  :   27   Steve Timble     :   27   2018-03-24: 2274   2018-03-25T00:21:09Z:   15   Median :0.00000   Median :  0.0000   Median :   0.000  
 @chicagomargaret:   21   Margaret Holt    :   21   2018-03-26: 1504   2018-03-25T00:21:35Z:   15   Mean   :0.06192   Mean   :  0.3467   Mean   :   3.146  
 @sschrimp       :   18   Mark             :   21   2018-03-18: 1099   2018-03-25T00:21:08Z:   14   3rd Qu.:0.00000   3rd Qu.:  0.0000   3rd Qu.:   0.000  
 @loyolaforus    :   16   Steve            :   19   2018-03-27:  241   2018-03-25T00:21:11Z:   14   Max.   :1.00000   Max.   :591.0000   Max.   :5143.000  
 (Other)         :19994   (Other)          :19987   (Other)   : 1385   (Other)             :20097                                                          
    favorite                                  links      
 Min.   :    0.0   @RamblersMBB                  : 1139  
 1st Qu.:    0.0   #LoyolaChicago                : 1027  
 Median :    1.0   #SisterJean                   :  778  
 Mean   :   15.8   https://twitter.com#SisterJean:  231  
 3rd Qu.:    3.0   #LoyolaChicago; #MarchMadness :  208  
 Max.   :32180.0   (Other)                       :16117  
                   NA's                          :  687  

According to the summary, the hashtags and mention noted earlier are the most frequent values in the links column: @RamblersMBB = 1139, #LoyolaChicago = 1027, and #SisterJean = 778.

2C) Any recommendations to Loyola’s marketing team
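
The date column of the same summary can back up the argument about timing. The sketch below copies the per-date counts from the summary output and converts them to percentages; the variable names are illustrative.

```r
# Tweet counts per date, copied from the `date` column of the summary output
tweets_per_day <- c(
  "2018-03-25" = 10708, "2018-03-23" = 2976, "2018-03-24" = 2274,
  "2018-03-26" = 1504,  "2018-03-18" = 1099, "2018-03-27" = 241,
  "(Other)"    = 1385
)

total_tweets <- sum(tweets_per_day)
share <- round(100 * tweets_per_day / total_tweets, 1)  # percent of all tweets
print(total_tweets)
print(share)
```

On these counts, 2018-03-25 alone accounts for about 53% of the 20,187 tweets.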

Task 3: Watson Analysis


3A) Use Watson Analytics to explore the data

knitr::include_graphics("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\explore.png")

3B) Give at least 3 plots or discoveries using Watson. Explain your findings.

knitr::include_graphics("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\img4.png")

knitr::include_graphics("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\img5.png")

knitr::include_graphics("C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 9\\img6.png")

The first figure shows the number of retweets over the months in 2017. The graph shows that March had the most retweets, with a value of nearly 40,000. The second figure shows the number of Twitter replies as bubbles, where a larger bubble means more replies. The third figure is another graph, in which the number of retweets by verified users can be analyzed.

Overall, the figures share one common conclusion: March had the most tweets, as expected given March Madness.

