---
title: "March Madness Analysis"
author: 
- "Quinlan School of Business - Loyola University Chicago"
- "Jose Luis Rodriguez"
output:
  html_notebook: default
  html_document: default
date: "April 9, 2018"
subtitle: "CME Group Foundation Business Analytics Lab"
---

<br>

--------------

## About this Notebook

--------------

* In this notebook we analyze tweets from March Madness 2018

* Use regular expressions to clean the tweet text

* Become familiar with some natural language processing tools

<br>

--------------

## Analytics Toolkit: Required Packages

--------------

<br>

* tidyverse: https://www.tidyverse.org/
* syuzhet: https://github.com/mjockers/syuzhet
* cleanNLP: https://github.com/statsmaths/cleanNLP
* magrittr: https://magrittr.tidyverse.org/
* wordcloud: https://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf
* tidytext: https://github.com/juliasilge/tidytext

```{r , message=FALSE}

# Install each package if it is missing, then load it
if(!require("tidyverse")){
  install.packages("tidyverse", dependencies = TRUE)
  library("tidyverse")
}

if(!require("syuzhet")){
  install.packages("syuzhet", dependencies = TRUE)
  library("syuzhet")
}

if(!require("cleanNLP")){
  install.packages("cleanNLP", dependencies = TRUE)
  library("cleanNLP")
}

if(!require("magrittr")){
  install.packages("magrittr", dependencies = TRUE)
  library("magrittr")
}

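if(!require("wordcloud")){
  install.packages("wordcloud", dependencies = TRUE)
  library("wordcloud")
}

# tidytext provides unnest_tokens() and stop_words, used in the wordcloud section
if(!require("tidytext")){
  install.packages("tidytext", dependencies = TRUE)
  library("tidytext")
}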

```

<br>

--------------

## Data Preparation: Cleaning tweets using regular expressions

--------------

<br>

#### Reading and inspecting the dataset
```{r, message=FALSE}
tweets <- read_csv("data/march_madness.csv")

# Convert the tweet IDs from long integers to character strings
tweets$tweet_id <- as.character(tweets$tweet_id)

# Extract and delete the links variable to add it at the end
links <- tweets$links
tweets$links <- NULL

# Inspect the first rows (head() shows 6 by default)
head(tweets)
```

<br>

#### Next we extract the text from the raw tweets and clean it using regular expressions
```{r}
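# Pattern matching URLs, HTML entities, retweet markers, and non-alphanumeric characters
replace_reg <- 'https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|(pic.twitter.com/[A-Za-z\\d]+)|&amp;|&lt;|&gt;|RT|(.*.)\\.com(.*.)\\S+\\s|[^[:alnum:]]|(http|https)\\S+\\s*|(#|@)\\S+\\s*|\\n|\\"'

tweets <- tweets %>% 
  # Replace every match with a space
  mutate(text = str_replace_all(text, replace_reg, " ")) %>% 
  # Replace any bytes that are not valid ASCII with a space
  mutate(text = iconv(text, from = "ASCII", to = "UTF-8", sub = " "))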

head(tweets['text'])
```
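
To make the cleaning step easier to follow, here is a minimal sketch in base R on a single made-up tweet. It is a simplified two-step version, not the full `replace_reg` pattern above: the sample tweet and the shorter patterns are assumptions for illustration only.

```r
# Hypothetical sample tweet, invented for illustration
sample_tweet <- "Go #Ramblers! https://t.co/abc123 &amp; more"

# Step 1: replace links and HTML entities with a space
no_links <- gsub("https?://\\S+|&amp;|&lt;|&gt;", " ", sample_tweet, perl = TRUE)

# Step 2: replace every remaining non-alphanumeric character with a space
alnum_only <- gsub("[^[:alnum:]]", " ", no_links, perl = TRUE)

# Step 3: collapse repeated whitespace and trim the ends
cleaned <- trimws(gsub("\\s+", " ", alnum_only, perl = TRUE))

print(cleaned)
```

In this sketch only the `#` symbol is replaced, so the hashtag word itself is kept in the cleaned text.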

<br>

#### Here we use Saif Mohammad's NRC Emotion Lexicon to extract the emotions expressed in each tweet. The NRC Emotion Lexicon is a list of words and their associations with eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).

* Source: http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm

```{r}
nrc_data <- get_nrc_sentiment(tweets$text)
nrc_data <- as_tibble(nrc_data)

head(nrc_data)
```
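
The idea behind lexicon-based emotion scoring can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `toy_lexicon` and `score_emotions()` are hypothetical names, the five-word lexicon is invented, and the real NRC lexicon is far larger.

```r
# Toy lexicon, invented for illustration; not the NRC word list
toy_lexicon <- data.frame(
  word    = c("win", "love", "lose", "hate", "upset"),
  emotion = c("joy", "joy", "sadness", "anger", "surprise"),
  stringsAsFactors = FALSE
)

# Count how many words of a text are associated with each emotion
score_emotions <- function(text, lexicon) {
  # Lowercase the text and split it on anything that is not a letter
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  # Look up the emotion of each word (NA when the word is not in the lexicon)
  matched <- lexicon$emotion[match(words, lexicon$word)]
  emotions <- unique(lexicon$emotion)
  vapply(emotions, function(e) sum(matched == e, na.rm = TRUE), integer(1))
}

scores <- score_emotions("Love this win, hate to lose", toy_lexicon)
print(scores)
```

Each tweet ends up with one count per emotion, which is the same shape as the columns returned by `get_nrc_sentiment()`.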

<br>

#### Here we merge the NRC emotion results with the original data to create a complete dataset.
```{r}
tweets <- bind_cols(tweets, nrc_data)

```

<br>

#### As a second sentiment metric, we use the opinion lexicon developed by Minqing Hu and Bing Liu at the University of Illinois at Chicago.
* Source: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

```{r}
tweets$sentiment_bing <- get_sentiment(char_v = tweets$text, method="bing", language = "english")

```
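
Roughly speaking, a score based on an opinion lexicon is the number of positive words minus the number of negative words in a text. The sketch below shows that idea in base R. The word lists and the `net_sentiment()` helper are hypothetical and are not the Hu and Liu lexicon or the `get_sentiment()` implementation.

```r
# Toy opinion word lists, invented for illustration
toy_positive <- c("great", "win", "love")
toy_negative <- c("bad", "lose", "hate")

# Net score: positive word count minus negative word count
net_sentiment <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(words %in% toy_positive) - sum(words %in% toy_negative)
}

toy_tweets <- c("Great win tonight", "Hate to lose like that", "Tip-off at nine")
toy_scores <- vapply(toy_tweets, net_sentiment, integer(1))
print(unname(toy_scores))
```

A positive score suggests a mostly positive text, a negative score a mostly negative one, and zero means no opinion words were found or they cancelled out.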

<br>

#### Finally, add the links variable back and save the complete dataset
```{r}
tweets$links <- links

write_csv(tweets, "data/march_madness_sent.csv")

head(tweets[3:10])
```

<br>

--------------

## Sentiment Analysis: Natural Language Processing

--------------

<br>

#### First, let's read the dataset and inspect the first rows
```{r, message=FALSE}
tweets <- read_csv("data/march_madness_sent.csv")
tweets$tweet_id <- as.character(tweets$tweet_id)
head(tweets[12:21])
```

<br>

#### For the language analysis we use the cleanNLP package, which supports several NLP backends, including Stanford CoreNLP and udpipe. First we initialize the udpipe backend, then create an annotation object from the text column; tweet_id supplies the document IDs and the remaining columns are passed as metadata.
```{r}
cnlp_init_udpipe()

doc <- cnlp_annotate(input = tweets$text, as_strings = TRUE, doc_ids = tweets$tweet_id, meta = tweets[-c(1,2)])

```

<br>

#### Distribution of sentence length, measured in tokens per sentence. For reference, a tweet is limited to 280 characters.
```{r}
tokens <- cnlp_get_token(doc) %>%
  group_by(id, sid) %>%
  summarize(sent_len = n())

quantile(tokens$sent_len, seq(0,1,0.1))

```
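
The grouping step above counts the tokens in each sentence of each tweet. Here is a minimal base R sketch of the same idea on a hypothetical token table; the `toy_tokens` data is invented, and `aggregate()` stands in for `group_by()` plus `summarize()`.

```r
# Hypothetical token table: one row per token, with document and sentence ids
toy_tokens <- data.frame(
  id  = c("t1", "t1", "t1", "t1", "t2", "t2", "t2"),
  sid = c(1, 1, 1, 2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Add a column of ones and sum it within each (id, sid) pair
sent_len <- aggregate(n ~ id + sid, data = transform(toy_tokens, n = 1), FUN = sum)

# Minimum, median, and maximum sentence length
print(quantile(sent_len$n, c(0, 0.5, 1)))
```

The toy table has three sentences, so the result has three rows, one token count per sentence.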

<br>

#### Here we plot how the Bing sentiment score changes across the tweets, in dataset order
```{r}
qplot(x = seq_along(tweets$sentiment_bing), 
      y = tweets$sentiment_bing, 
      geom = "line", 
      xlab = "Narrative Time", 
      ylab = "Emotional Valence", 
      main = "Tweets Sentiment Trajectory")

```

<br>

#### Here we find the most frequent nouns (by lemma) in the token table, as a proxy for the entities mentioned in the tweets. They offer an alternative way to see the underlying topics of the corpus.
```{r}
tweets_entities <- cnlp_get_token(doc) %>%
  filter(upos == "NOUN") %>%
  group_by(lemma) %>%
  summarize(count = n()) %>%
  top_n(n = 80, count) %>%
  arrange(desc(count)) %>%
  use_series(lemma)

tibble(tweets_entities)

```

<br>

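#### Here we create a high-level summary of the tweet text from its dependency pairs, keeping only pairs whose target lemma is rare. Adding a filter on the relation column would restrict the summary to direct objects.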
```{r}
tweets_summary <- cnlp_get_dependency(doc, get_token = TRUE) %>%
  left_join(cnlp_get_document(doc)) %>%
  select(id = id, start = word, word = lemma_target) %>%
  left_join(word_frequency) %>%
  filter(frequency < 0.0001) %>%
  select(id, start, word) %$%
  sprintf("%s => %s", start, word)

tibble(tweets_summary)

```

<br>

#### Look at tweets that express anger
```{r}
# Row indices of tweets with at least one anger word
angry_tweets <- which(tweets$anger > 0)
tibble(tweet = tweets$text[angry_tweets][1:2])

```

<br>

#### Look at tweets that express joy
```{r}
# Row indices of tweets with at least one joy word
joy_tweets <- which(tweets$joy > 0)
tibble(tweet = tweets$text[joy_tweets][5:7])

```

<br>

#### Let's explore the emotions in the tweets in more depth. Here we extract the eight emotion variables and compute each emotion's share of all emotion counts.
```{r}
# Columns 11 to 18 hold the eight NRC emotion counts
value <- as.double(colSums(prop.table(tweets[, 11:18])))
emotion <- names(tweets)[11:18]
# Order the factor levels by share so the bars are sorted in the plot
emotion <- factor(emotion, levels = names(tweets)[11:18][order(value, decreasing = FALSE)])
emotions <- tibble(emotion, percent = value * 100)

head(emotions)
```
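
The percentage calculation is easier to check on a small example. The sketch below uses a made-up table of emotion counts (`toy_scores` is an assumption, not the tweet data) and computes each emotion's share as its column total divided by the grand total, which gives the same result as `colSums(prop.table(...))`.

```r
# Hypothetical emotion counts for three tweets, invented for illustration
toy_scores <- data.frame(
  anger = c(1, 0, 1),
  joy   = c(2, 3, 1),
  trust = c(0, 1, 1)
)

# Total count per emotion, then each emotion's share of all counts
totals  <- colSums(toy_scores)
percent <- totals / sum(totals) * 100

# Sort ascending, as the factor levels are ordered for the plot
print(sort(percent))
```

The shares always sum to 100, so the plot shows the relative weight of each emotion rather than raw counts.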

<br>

#### Now we can plot the emotions in the March Madness tweets
```{r}
ggplot(data = emotions, aes(x = emotion, y = percent)) + 
  geom_bar(stat = "identity", aes(fill = emotion)) + 
  scale_fill_brewer(palette="RdYlGn") + 
  coord_flip() +
  xlab("Emotion") +
  ylab("Percentage")

```

<br>

--------------

## Reporting: A Wordcloud from March Madness Tweets

--------------

<br>

#### Wordclouds are a fun and engaging way to display text data. Here we define some extra stop words that we don't want in the plot.
```{r}
remove_words <- c( "twitter", "chicago", "loyola", "ramblers", "loyolaramblers","school", "gonna",
                   "university", "luc", "loyolachicago" , "ramblersmbb", "ncaa","ve","basketball" ,
                   "umichbball", "marchmadness2018", "marchmadness", "final", "marchmaddness",
                   "goblue", "finalfour", "sisterjean", "ncaatournament", "ncaatournament2018",
                   "didn","city", "hey", "day", "college", "games", "tourney", "march", "game")

# Combine the custom words with the stop_words table from tidytext
my_stop_words <- bind_rows(tibble(word = remove_words, lexicon = "custom"), stop_words)

```

<br>

#### Now let's split the tweets into one word per row and filter out the stop words
```{r}
twt_text <- tibble(text = tweets$text) %>% 
  unnest_tokens(word, text) %>%
  filter(!word %in% my_stop_words$word, str_detect(word, "[a-z]"))

```

<br>

#### Set the minimum word frequency, the maximum number of words, and the scale of the wordcloud
```{r}
min_freq <- 80
max_words <- 100
fig_scale <- c(3, 0.5)
```

<br>

#### The last step is to create the wordcloud by counting the frequency of the words
```{r, message=FALSE}
twt_text %>%
  anti_join(my_stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, 
                 scale = fig_scale,
                 min.freq = min_freq,
                 max.words = max_words))

```
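
The counting behind the wordcloud can be sketched in base R without any packages. The texts, the stop word list, and the threshold below are all invented for illustration; the steps mirror tokenizing, removing stop words, counting, and applying a minimum frequency.

```r
# Hypothetical tweets and stop words, invented for illustration
toy_text <- c("Loyola wins the game",
              "What a game for the Ramblers",
              "Upset wins again")
toy_stop <- c("the", "a", "for", "what", "game")

# Tokenize: lowercase, then split on anything that is not a letter
words <- unlist(strsplit(tolower(toy_text), "[^a-z]+"))

# Drop empty strings and stop words
words <- words[nzchar(words) & !words %in% toy_stop]

# Count each word and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)

# Keep only words that reach a toy minimum frequency of 2
frequent <- freq[freq >= 2]
print(frequent)
```

In the notebook, `min_freq` and `max_words` play the role of this threshold when they are passed to `wordcloud()`.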
