Social Media Sentiment Analysis

The purpose of this exercise is to set up a framework for ingesting data from social media platforms and analyzing the text it contains.

Introduction

The majority of the data available in the world today is unstructured. This data can carry great insight for businesses looking to learn and improve their offerings. There is also huge potential to use similar text analysis for market and opposition research: identifying what one's competition is doing differently and learning from it.

Data

We use a combination of API calls and web scraping to gather data on Toronto hotels that primarily serve travellers, selected purely on the basis of their location near the airport.

This also lets us deal with data as we would encounter it in the real world, rather than working with one of the many pre-cleaned datasets available on the internet.

## Loading the libraries. Install commands are kept below for reference.
## install.packages("remotes")
## remotes::install_github("richierocks/yelp")
## install.packages("tidyverse")
## install.packages("data.table")

rm(list = ls())
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(yelp) ## Tips from https://github.com/richierocks/yelp
library(RCurl) ## API Calls
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete
library(data.table) ## For %like% function
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(leaflet) ## For mapping

## For manual scraping
library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding
## For corpus creation
library(stringr)
library(bitops)
library(NLP)
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(tm)

## For lemmatizing word cloud and graph words
library(pacman)
pacman::p_load_gh("trinker/textstem")
pacman::p_load(textstem, dplyr)

## For Word Cloud
library(RColorBrewer)
library(wordcloud)

## For clustering of words
library(graph)
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following object is masked from 'package:NLP':
## 
##     annotation
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which, which.max, which.min
## 
## Attaching package: 'graph'
## The following object is masked from 'package:stringr':
## 
##     boundary
## For sentiment analysis
library(syuzhet)

Data Gathering and Prep

Working with the YELP API

We use the Yelp API to gather data on hotels in Toronto near the airport.

Use the command store_access_token() to authenticate against the Yelp API with your access token.

For more information, see the short tutorial at https://github.com/richierocks/yelp.
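A one-time setup per session might look like the sketch below; the key comes from registering an app on the Yelp developer site, and the placeholder below is not a real token.

## Authenticate once per session (placeholder, not a real key)
## store_access_token("YOUR_YELP_API_KEY")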

## Getting hotels around the Airport
Airport_Hotels <- business_search("Airport Hotel", "Toronto, Ontario", limit = 50, offset = 0)

We gather this data and place it in a dataframe to make manipulation easier.

Ideally, we would keep and analyze all the hotels we retrieved, but for this exercise we limit ourselves to the four hotels with the most reviews.

## Shortlisting hotels - We pick the Top 4
Shortlist_Hotels <- Airport_Hotels %>% 
  dplyr::filter(name %like% "Air") %>% 
  dplyr::filter(category_titles %like% "Hotels") %>% 
  dplyr::filter(review_count >= 40)

saveRDS(Shortlist_Hotels, "shortlist_hotels.RDS")
saveRDS(Shortlist_Hotels$name, "hotel_list.RDS")

Mapping our Hotels - with Leaflet

The first exercise is to place the hotels on a map to judge their proximity to each other.

## Viewing our shortlist on a map
leaflet() %>% 
  addTiles() %>% 
  addMarkers(lng = Shortlist_Hotels$longitude, lat = Shortlist_Hotels$latitude,
             popup  = Shortlist_Hotels$name
              ) %>% 
            setView(lng=-79.607579, lat=43.685894 , zoom=15)

We see the hotels are, as we selected, close to the airport and to each other, with the Sheraton located on the airport premises.

Data Scraping with rvest

We use the rvest library to augment the data we get from the Yelp API. A limitation of the API is that it returns only the top 3 reviews, and we want to see all of them. To keep the project simple, we focus on 40 reviews from each hotel.

The method we use is endlessly extendable: the same approach could augment our review data with reviews from Google Maps, TripAdvisor, etc.

## Pulling top 3 reviews
yelp_reviews_top_3 <- reviews(Shortlist_Hotels$business_id)


## Manually pulling Yelp's newest 40 reviews - using the rvest package
hotel_1 <- "https://www.yelp.com/biz/sheraton-gateway-hotel-in-toronto-international-airport-toronto"
html_1 <- read_html(hotel_1)
reviews_1 <- html_nodes(html_1, ".comment__373c0__3EKjH .lemon--span__373c0__3997G")
reviews_text_1 <- as.data.frame(html_text(reviews_1, trim = TRUE))
hotel_1 <- "https://www.yelp.com/biz/sheraton-gateway-hotel-in-toronto-international-airport-toronto?start=20"
html_1 <- read_html(hotel_1)
reviews_1 <- html_nodes(html_1, ".comment__373c0__3EKjH .lemon--span__373c0__3997G")
reviews_text_1 <- rbind(reviews_text_1, as.data.frame(html_text(reviews_1, trim = TRUE)))
rm(hotel_1, html_1, reviews_1)


hotel_2 <- "https://www.yelp.com/biz/the-westin-toronto-airport-toronto?osq=The+Westin+Toronto+Airport"
html_2 <- read_html(hotel_2)
reviews_2 <- html_nodes(html_2, ".comment__373c0__3EKjH .lemon--span__373c0__3997G")
reviews_text_2 <- as.data.frame(html_text(reviews_2, trim = TRUE))
hotel_2 <- "https://www.yelp.com/biz/the-westin-toronto-airport-toronto?osq=The+Westin+Toronto+Airport&start=20"
html_2 <- read_html(hotel_2)
reviews_2 <- html_nodes(html_2, ".comment__373c0__3EKjH .lemon--span__373c0__3997G")
reviews_text_2 <- rbind(reviews_text_2, as.data.frame(html_text(reviews_2, trim = TRUE)))
rm(hotel_2, html_2, reviews_2)

hotel_3 <- "https://www.yelp.com/biz/toronto-airport-marriott-hotel-toronto?osq=Toronto+Airport+Marriott+Hotel"
html_3 <- read_html(hotel_3)
reviews_3 <- html_nodes(html_3, ".comment__373c0__3EKjH .lemon--span__373c0__3997G")
reviews_text_3 <- as.data.frame(html_text(reviews_3, trim = TRUE))
hotel_3 <- "https://www.yelp.com/biz/toronto-airport-marriott-hotel-toronto?osq=Toronto+Airport+Marriott+Hotel&start=20"
html_3 <- read_html(hotel_3)
reviews_3 <- html_nodes(html_3, ".comment__373c0__3EKjH .lemon--span__373c0__3997G")
reviews_text_3 <- rbind(reviews_text_3, as.data.frame(html_text(reviews_3, trim = TRUE)))
rm(hotel_3, html_3, reviews_3)

hotel_4 <- "https://www.yelp.com/biz/holiday-inn-toronto-international-airport-toronto"
html_4 <- read_html(hotel_4)
reviews_4 <- html_nodes(html_4, ".comment__373c0__3EKjH .lemon--span__373c0__3997G")
reviews_text_4 <- as.data.frame(html_text(reviews_4, trim = TRUE))
hotel_4 <- "https://www.yelp.com/biz/holiday-inn-toronto-international-airport-toronto?start=20"
html_4 <- read_html(hotel_4)
reviews_4 <- html_nodes(html_4, ".comment__373c0__3EKjH .lemon--span__373c0__3997G")
reviews_text_4 <- rbind(reviews_text_4, as.data.frame(html_text(reviews_4, trim = TRUE)))
rm(hotel_4, html_4, reviews_4)
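The four blocks above repeat the same scrape for each hotel. As a sketch of how this pattern could be folded into one helper (the CSS selector is the same Yelp class used above, and the returned column name differs from those used later, so this is illustrative rather than a drop-in replacement):

scrape_yelp_reviews <- function(base_url, offsets = c(0, 20)) {
  selector <- ".comment__373c0__3EKjH .lemon--span__373c0__3997G"
  sep <- if (grepl("\\?", base_url)) "&" else "?"  ## URLs already carrying ?osq=... need &
  pages <- lapply(offsets, function(offset) {
    url <- if (offset == 0) base_url else paste0(base_url, sep, "start=", offset)
    html_text(html_nodes(read_html(url), selector), trim = TRUE)
  })
  data.frame(review = unlist(pages), stringsAsFactors = FALSE)
}

## e.g. scrape_yelp_reviews("https://www.yelp.com/biz/holiday-inn-toronto-international-airport-toronto")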

Data Transformation

We now have all our data neatly arranged in dataframes. Next, we build a corpus so the data is usable for sentiment and text analysis.

Creating a Corpus

A corpus is, at its simplest, a structured collection of texts; in this case, words pertaining to reviews and hotels. The main difference between the final corpus we are aiming for and the words we start with is streamlining: we want to get rid of words that would not add value to our analysis. Words used in everyday speech, like "a", "you", "the", etc., are good for clarity but add nothing to the analysis we are looking for. Similarly, plurals, verb forms, and tenses might muddle our analysis.
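For a sense of what the stop-word filter removes, tm's bundled English list can be inspected directly (output abbreviated):

head(stopwords("english"))
## e.g. "i" "me" "my" "myself" "we" "our"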

## Creating Corpus
hotel_1_corpus <- Corpus(VectorSource(reviews_text_1$`html_text(reviews_1, trim = TRUE)`))

## Cleanup

hotel_1_clean <-tm_map(hotel_1_corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(hotel_1_corpus, removePunctuation):
## transformation drops documents
hotel_1_clean <-tm_map(hotel_1_clean, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(hotel_1_clean, content_transformer(tolower)):
## transformation drops documents
hotel_1_clean <-tm_map(hotel_1_clean, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(hotel_1_clean, removeWords,
## stopwords("english")): transformation drops documents
hotel_1_clean <-tm_map(hotel_1_clean, removeNumbers)
## Warning in tm_map.SimpleCorpus(hotel_1_clean, removeNumbers): transformation
## drops documents
hotel_1_clean <-tm_map(hotel_1_clean, stripWhitespace)
## Warning in tm_map.SimpleCorpus(hotel_1_clean, stripWhitespace): transformation
## drops documents
hotel_1_clean <- tm_map(hotel_1_clean, removeWords, c(stopwords("english"),
                        "hotel", "one", "two", "airport", "terminal", "toronto", "really", "just", "flight")
                        ) 
## Warning in tm_map.SimpleCorpus(hotel_1_clean, removeWords,
## c(stopwords("english"), : transformation drops documents
## Lemmatize corpus
hotel_1_word_cloud <- lemmatize_words(hotel_1_clean)

saveRDS(hotel_1_word_cloud, "wordcloud_1.RDS")
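One caveat worth flagging here: textstem's lemmatize_words() expects a vector of individual words, so running it over a corpus of full documents may leave multi-word text largely untouched. If that proves to be the case, mapping lemmatize_strings() over the corpus content is an alternative (a sketch, not a change made here):

## hotel_1_word_cloud <- tm_map(hotel_1_clean, content_transformer(lemmatize_strings))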

############# Similarly, creating and saving corpora for the other hotels
hotel_2_corpus <- Corpus(VectorSource(reviews_text_2$`html_text(reviews_2, trim = TRUE)`))
hotel_2_clean <-tm_map(hotel_2_corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(hotel_2_corpus, removePunctuation):
## transformation drops documents
hotel_2_clean <-tm_map(hotel_2_clean, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(hotel_2_clean, content_transformer(tolower)):
## transformation drops documents
hotel_2_clean <-tm_map(hotel_2_clean, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(hotel_2_clean, removeWords,
## stopwords("english")): transformation drops documents
hotel_2_clean <-tm_map(hotel_2_clean, removeNumbers)
## Warning in tm_map.SimpleCorpus(hotel_2_clean, removeNumbers): transformation
## drops documents
hotel_2_clean <-tm_map(hotel_2_clean, stripWhitespace)
## Warning in tm_map.SimpleCorpus(hotel_2_clean, stripWhitespace): transformation
## drops documents
hotel_2_clean <- tm_map(hotel_2_clean, removeWords, c(stopwords("english"),
                                                      "didnt","youre","westin","hotel", "one", "two", "airport", "terminal", "toronto", "really", "just", "flight")
) 
## Warning in tm_map.SimpleCorpus(hotel_2_clean, removeWords,
## c(stopwords("english"), : transformation drops documents
hotel_2_word_cloud <- lemmatize_words(hotel_2_clean)
saveRDS(hotel_2_word_cloud, "wordcloud_2.RDS")

hotel_3_corpus <- Corpus(VectorSource(reviews_text_3$`html_text(reviews_3, trim = TRUE)`))
hotel_3_clean <-tm_map(hotel_3_corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(hotel_3_corpus, removePunctuation):
## transformation drops documents
hotel_3_clean <-tm_map(hotel_3_clean, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(hotel_3_clean, content_transformer(tolower)):
## transformation drops documents
hotel_3_clean <-tm_map(hotel_3_clean, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(hotel_3_clean, removeWords,
## stopwords("english")): transformation drops documents
hotel_3_clean <-tm_map(hotel_3_clean, removeNumbers)
## Warning in tm_map.SimpleCorpus(hotel_3_clean, removeNumbers): transformation
## drops documents
hotel_3_clean <-tm_map(hotel_3_clean, stripWhitespace)
## Warning in tm_map.SimpleCorpus(hotel_3_clean, stripWhitespace): transformation
## drops documents
hotel_3_clean <- tm_map(hotel_3_clean, removeWords, c(stopwords("english"),
                                                      "marriott","hotel", "one", "two", "airport", "terminal", "toronto", "really", "just", "flight")
) 
## Warning in tm_map.SimpleCorpus(hotel_3_clean, removeWords,
## c(stopwords("english"), : transformation drops documents
hotel_3_word_cloud <- lemmatize_words(hotel_3_clean)
saveRDS(hotel_3_word_cloud, "wordcloud_3.RDS")

hotel_4_corpus <- Corpus(VectorSource(reviews_text_4$`html_text(reviews_4, trim = TRUE)`))
hotel_4_clean <-tm_map(hotel_4_corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(hotel_4_corpus, removePunctuation):
## transformation drops documents
hotel_4_clean <-tm_map(hotel_4_clean, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(hotel_4_clean, content_transformer(tolower)):
## transformation drops documents
hotel_4_clean <-tm_map(hotel_4_clean, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(hotel_4_clean, removeWords,
## stopwords("english")): transformation drops documents
hotel_4_clean <-tm_map(hotel_4_clean, removeNumbers)
## Warning in tm_map.SimpleCorpus(hotel_4_clean, removeNumbers): transformation
## drops documents
hotel_4_clean <-tm_map(hotel_4_clean, stripWhitespace)
## Warning in tm_map.SimpleCorpus(hotel_4_clean, stripWhitespace): transformation
## drops documents
hotel_4_clean <- tm_map(hotel_4_clean, removeWords, c(stopwords("english"),
                                                      "told","said","hotel", "one", "two", "airport", "terminal", "toronto", "really", "just", "flight")
) 
## Warning in tm_map.SimpleCorpus(hotel_4_clean, removeWords,
## c(stopwords("english"), : transformation drops documents
hotel_4_word_cloud <- lemmatize_words(hotel_4_clean)
saveRDS(hotel_4_word_cloud, "wordcloud_4.RDS")

Word Cloud

Finally, with data prep done, we can dive into some of the fun analytical stuff.

We use the easy-to-use wordcloud library loaded earlier.

We display the 25 most common words, with the size of each word indicating its frequency.

## Color Palettes
hotel_pal <- brewer.pal(8,"Dark2")

## Wordcloud - play with parameters
## Saving png
wordcloud(hotel_1_word_cloud, random.order = F, max.words = 25, scale = c(5,2), colors = hotel_pal)

No surprise that "room" is the most important topic for customers and guests at a hotel.
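The chunk comment above mentions saving a png; the usual base-graphics pattern for that would be (a sketch, with a hypothetical file name):

## png("wordcloud_1.png", width = 800, height = 800)
## wordcloud(hotel_1_word_cloud, random.order = FALSE, max.words = 25,
##           scale = c(5, 2), colors = hotel_pal)
## dev.off()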

Sentiment Analysis

Sentiment analysis looks at words and their usage and, based on several criteria, assigns them different emotional values. The syuzhet library we use compares words against the NRC lexicon and scores them along the emotional spectrum (those interested can find more information at http://sentiment.nrc.ca/lexicons-for-research/).
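As a quick illustration of what the lexicon lookup returns, a single made-up sentence produces one row of counts:

get_nrc_sentiment("The room was clean but the service was terrible")
## returns a one-row data frame with counts for anger, anticipation, disgust,
## fear, joy, sadness, surprise, trust, negative and positive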

Corpus Preparation

First, we convert our reviews into a character vector so that each review can be assigned appropriate values.

Note: there are two schools of thought here. One would apply the sentiment analysis to the original review data, which hasn't been cleaned or lemmatized, since different tenses, active and passive constructions, etc. score differently against the NRC lexicon. The other would score the cleaned data against the lexicon.

Ideally, market research would not rely wholly on the sentiment analysis; it would reconfirm the results by reading the reviews, getting feedback from the original posters, etc.

Here we score the original review text as scraped.

## Sentiments along the emotional spectrum (NRC Lexicon).

reviews_1 <- as.vector(reviews_text_1$`html_text(reviews_1, trim = TRUE)`)
reviews_text_1$`html_text(reviews_1, trim = TRUE)` <- as.character(reviews_text_1$`html_text(reviews_1, trim = TRUE)`)

hotel_1_sentiments <- get_nrc_sentiment(reviews_1)
hotel_1_sentiment_scores <-data.frame(colSums(hotel_1_sentiments[,]))
names(hotel_1_sentiment_scores) <- "Score"
hotel_1_sentiment_scores <- cbind("sentiment" = rownames(hotel_1_sentiment_scores), hotel_1_sentiment_scores)
rownames(hotel_1_sentiment_scores) <- NULL

saveRDS(hotel_1_sentiment_scores, "sentiment_1.RDS")

## Similarly, for other 3
reviews_2 <- as.vector(reviews_text_2$`html_text(reviews_2, trim = TRUE)`)
reviews_text_2$`html_text(reviews_2, trim = TRUE)` <- as.character(reviews_text_2$`html_text(reviews_2, trim = TRUE)`)
hotel_2_sentiments <- get_nrc_sentiment(reviews_2)
hotel_2_sentiment_scores <-data.frame(colSums(hotel_2_sentiments[,]))
names(hotel_2_sentiment_scores) <- "Score"
hotel_2_sentiment_scores <- cbind("sentiment" = rownames(hotel_2_sentiment_scores), hotel_2_sentiment_scores)
rownames(hotel_2_sentiment_scores) <- NULL
saveRDS(hotel_2_sentiment_scores, "sentiment_2.RDS")

reviews_3 <- as.vector(reviews_text_3$`html_text(reviews_3, trim = TRUE)`)
reviews_text_3$`html_text(reviews_3, trim = TRUE)` <- as.character(reviews_text_3$`html_text(reviews_3, trim = TRUE)`)
hotel_3_sentiments <- get_nrc_sentiment(reviews_3)
hotel_3_sentiment_scores <-data.frame(colSums(hotel_3_sentiments[,]))
names(hotel_3_sentiment_scores) <- "Score"
hotel_3_sentiment_scores <- cbind("sentiment" = rownames(hotel_3_sentiment_scores), hotel_3_sentiment_scores)
rownames(hotel_3_sentiment_scores) <- NULL
saveRDS(hotel_3_sentiment_scores, "sentiment_3.RDS")

reviews_4 <- as.vector(reviews_text_4$`html_text(reviews_4, trim = TRUE)`)
reviews_text_4$`html_text(reviews_4, trim = TRUE)` <- as.character(reviews_text_4$`html_text(reviews_4, trim = TRUE)`)
hotel_4_sentiments <- get_nrc_sentiment(reviews_4)
hotel_4_sentiment_scores <-data.frame(colSums(hotel_4_sentiments[,]))
names(hotel_4_sentiment_scores) <- "Score"
hotel_4_sentiment_scores <- cbind("sentiment" = rownames(hotel_4_sentiment_scores), hotel_4_sentiment_scores)
rownames(hotel_4_sentiment_scores) <- NULL
saveRDS(hotel_4_sentiment_scores, "sentiment_4.RDS")

Sentiment Graph

Finally, we plot our results.

## Plotting
ggplot(data = hotel_1_sentiment_scores, aes(x = sentiment , y = Score)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Score") + ggtitle("Sentiment Score for Sheraton Gateway Hotel Reviews")

Association Analysis

Next, we look at how words associate with other words. This can be a useful exercise to see how customers think, and which words they pair with each other.

The goal of this exercise is to look at the reviews people leave and uncover the relationships between words. From an app perspective, we would want to suggest words like "good" or "bad" and see how they relate to "room", "lobby", etc.

Term Document Matrix

For this analysis we create a term-document matrix. A document-term matrix (or term-document matrix) is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In simple words, it is a list of frequencies for all the words that occur in our corpus, broken out by document.
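A toy example makes the structure concrete: three one-line "documents" give a matrix whose rows are terms, columns are documents, and cells are counts (a sketch with made-up text):

toy_corpus <- Corpus(VectorSource(c("clean room", "dirty room", "great staff")))
inspect(TermDocumentMatrix(toy_corpus))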

## Building a Document-Term Matrix

Hotel_1_Dtm <- TermDocumentMatrix(hotel_1_word_cloud, control = list(minWordLength = 1,
                                                     weighting =
                                                       function(x)
                                                         weightTfIdf(x, normalize =
                                                                       FALSE),
                                                     stopwords = TRUE))
inspect(Hotel_1_Dtm)
## <<TermDocumentMatrix (terms: 1380, documents: 41)>>
## Non-/sparse entries: 2571/54009
## Sparsity           : 95%
## Maximal term length: 20
## Weighting          : term frequency - inverse document frequency (tf-idf)
## Sample             :
##          Docs
## Terms           11       19       20       21        27       29       38
##   charges 0.000000 0.000000 0.000000 4.357552 26.145312 0.000000 0.000000
##   didnt   0.000000 0.000000 0.000000 0.000000 13.072656 0.000000 0.000000
##   get     0.000000 1.550197 0.000000 3.100394  4.650591 0.000000 6.200788
##   nice    2.715104 0.000000 0.000000 1.357552  1.357552 0.000000 0.000000
##   night   1.450661 2.901323 0.000000 1.450661  2.901323 1.450661 0.000000
##   parking 0.000000 9.106872 0.000000 0.000000  0.000000 0.000000 0.000000
##   room    4.826173 1.930469 0.000000 1.930469  1.930469 2.895704 0.000000
##   service 2.901323 0.000000 1.450661 0.000000  0.000000 2.901323 4.351984
##   since   3.772590 0.000000 0.000000 3.772590 30.180716 0.000000 0.000000
##   stay    3.810267 1.270089 1.270089 0.000000  8.890624 0.000000 0.000000
##          Docs
## Terms             4        6         7
##   charges  0.000000 0.000000  0.000000
##   didnt   17.430208 0.000000  0.000000
##   get      0.000000 1.550197  1.550197
##   nice     0.000000 1.357552  1.357552
##   night    0.000000 0.000000  0.000000
##   parking  0.000000 0.000000 18.213743
##   room     1.930469 2.895704  1.930469
##   service  2.901323 1.450661  1.450661
##   since    0.000000 0.000000  0.000000
##   stay     0.000000 3.810267  2.540178
# Based on the above matrix, many data mining tasks can be done,
# for example clustering, classification, and association analysis.

# Frequent terms and association
freq.terms <- findFreqTerms(Hotel_1_Dtm, lowfreq=25)

term.freq <- rowSums(as.matrix(Hotel_1_Dtm))
term.freq <- subset(term.freq, term.freq >= 30)
hotel_1_df <- data.frame(term = names(term.freq), freq = term.freq)

saveRDS(hotel_1_df, "df_1.RDS")
saveRDS(Hotel_1_Dtm, "dtm_1.RDS")

#### Similarly, for other 3 hotels
Hotel_2_Dtm <- TermDocumentMatrix(hotel_2_word_cloud, control = list(minWordLength = 1,
                                                                     weighting =
                                                                       function(x)
                                                                         weightTfIdf(x, normalize =
                                                                                       FALSE),
                                                                     stopwords = TRUE))
inspect(Hotel_2_Dtm)
## <<TermDocumentMatrix (terms: 873, documents: 40)>>
## Non-/sparse entries: 2598/32322
## Sparsity           : 93%
## Maximal term length: 26
## Weighting          : term frequency - inverse document frequency (tf-idf)
## Sample             :
##            Docs
## Terms              1        14       19       21       25       26        34
##   break      0.00000 17.287712 0.000000  0.00000 0.000000 0.000000 17.287712
##   breakfast  0.00000  0.000000 0.000000  0.00000 4.543720 1.514573  0.000000
##   great      0.00000  1.321928 2.643856  0.00000 2.643856 0.000000  1.321928
##   like       0.00000  3.029146 3.029146  0.00000 0.000000 3.029146  3.029146
##   manager    0.00000 23.253497 0.000000  0.00000 0.000000 0.000000 23.253497
##   night      0.00000 10.000000 2.000000  0.00000 4.000000 0.000000 10.000000
##   person     0.00000 13.287712 0.000000  0.00000 0.000000 0.000000 13.287712
##   room       2.00000  1.000000 3.000000  2.00000 2.000000 3.000000  1.000000
##   staff     10.94786 19.158759 0.000000 10.94786 0.000000 0.000000 19.158759
##   weather    0.00000 17.287712 0.000000  0.00000 0.000000 0.000000 17.287712
##            Docs
## Terms             39        5        6
##   break     0.000000 0.000000 0.000000
##   breakfast 0.000000 4.543720 1.514573
##   great     2.643856 2.643856 0.000000
##   like      3.029146 0.000000 3.029146
##   manager   0.000000 0.000000 0.000000
##   night     2.000000 4.000000 0.000000
##   person    0.000000 0.000000 0.000000
##   room      3.000000 2.000000 3.000000
##   staff     0.000000 0.000000 0.000000
##   weather   0.000000 0.000000 0.000000
freq.terms <- findFreqTerms(Hotel_2_Dtm, lowfreq=25)
term.freq <- rowSums(as.matrix(Hotel_2_Dtm))
term.freq <- subset(term.freq, term.freq >= 30)
hotel_2_df <- data.frame(term = names(term.freq), freq = term.freq)
saveRDS(hotel_2_df, "df_2.RDS")
saveRDS(Hotel_2_Dtm, "dtm_2.RDS")

Hotel_3_Dtm <- TermDocumentMatrix(hotel_3_word_cloud, control = list(minWordLength = 1,
                                                                     weighting =
                                                                       function(x)
                                                                         weightTfIdf(x, normalize =
                                                                                       FALSE),
                                                                     stopwords = TRUE))
inspect(Hotel_3_Dtm)
## <<TermDocumentMatrix (terms: 744, documents: 40)>>
## Non-/sparse entries: 2164/27596
## Sparsity           : 93%
## Maximal term length: 23
## Weighting          : term frequency - inverse document frequency (tf-idf)
## Sample             :
##             Docs
## Terms               15       20        25       26       28        35       40
##   breakfast   0.000000 11.60964  0.000000 0.000000 2.321928  0.000000 11.60964
##   club       12.000000  0.00000  0.000000 0.000000 4.000000 12.000000  0.00000
##   conference  0.000000  0.00000  0.000000 0.000000 0.000000  0.000000  0.00000
##   free        0.000000  0.00000  8.210897 0.000000 0.000000  0.000000  0.00000
##   got         0.000000  0.00000 13.287712 0.000000 0.000000  0.000000  0.00000
##   lounge      6.643856  0.00000  0.000000 9.965784 0.000000  6.643856  0.00000
##   nice        0.000000  0.00000  1.514573 0.000000 0.000000  0.000000  0.00000
##   order       0.000000 17.28771  0.000000 0.000000 0.000000  0.000000 17.28771
##   quite       0.000000  0.00000  9.287712 2.321928 0.000000  0.000000  0.00000
##   rooms       2.000000  0.00000  1.000000 0.000000 0.000000  2.000000  0.00000
##             Docs
## Terms                5        6        8
##   breakfast   0.000000 0.000000 2.321928
##   club        0.000000 0.000000 4.000000
##   conference  0.000000 0.000000 0.000000
##   free        8.210897 0.000000 0.000000
##   got        13.287712 0.000000 0.000000
##   lounge      0.000000 9.965784 0.000000
##   nice        1.514573 0.000000 0.000000
##   order       0.000000 0.000000 0.000000
##   quite       9.287712 2.321928 0.000000
##   rooms       1.000000 0.000000 0.000000
freq.terms <- findFreqTerms(Hotel_3_Dtm, lowfreq=25)
term.freq <- rowSums(as.matrix(Hotel_3_Dtm))
term.freq <- subset(term.freq, term.freq >= 30)
hotel_3_df <- data.frame(term = names(term.freq), freq = term.freq)
saveRDS(hotel_3_df, "df_3.RDS")
saveRDS(Hotel_3_Dtm, "dtm_3.RDS")

Hotel_4_Dtm <- TermDocumentMatrix(hotel_4_word_cloud, control = list(minWordLength = 1,
                                                                     weighting =
                                                                       function(x)
                                                                         weightTfIdf(x, normalize =
                                                                                       FALSE),
                                                                     stopwords = TRUE))
inspect(Hotel_4_Dtm)
## <<TermDocumentMatrix (terms: 1168, documents: 40)>>
## Non-/sparse entries: 2100/44620
## Sparsity           : 96%
## Maximal term length: 16
## Weighting          : term frequency - inverse document frequency (tf-idf)
## Sample             :
##          Docs
## Terms            1        12       13       14       18       33        36
##   asked   0.000000 0.0000000 0.000000 0.000000 4.321928 0.000000 25.931569
##   extra   0.000000 0.0000000 0.000000 0.000000 0.000000 0.000000 16.609640
##   floor   0.000000 2.0000000 0.000000 2.000000 2.000000 0.000000  2.000000
##   like    0.000000 0.0000000 0.000000 0.000000 2.514573 0.000000  0.000000
##   nice    0.000000 0.0000000 1.621488 0.000000 6.485954 0.000000  4.864465
##   parking 0.000000 0.0000000 0.000000 0.000000 0.000000 8.608012  2.152003
##   room    1.356144 0.6780719 1.356144 8.814935 2.712288 0.000000  4.068431
##   rooms   5.587489 0.0000000 1.862496 1.862496 1.862496 0.000000  0.000000
##   staff   3.703396 0.0000000 6.172326 1.234465 1.234465 1.234465  2.468931
##   well    1.621488 0.0000000 3.242977 1.621488 3.242977 1.621488  4.864465
##          Docs
## Terms           40         5        8
##   asked   0.000000 0.0000000 0.000000
##   extra   0.000000 0.0000000 0.000000
##   floor   8.000000 0.0000000 4.000000
##   like    0.000000 0.0000000 0.000000
##   nice    0.000000 0.0000000 0.000000
##   parking 0.000000 0.0000000 6.456009
##   room    2.034216 0.6780719 1.356144
##   rooms   3.724993 0.0000000 0.000000
##   staff   0.000000 0.0000000 0.000000
##   well    1.621488 0.0000000 1.621488
freq.terms <- findFreqTerms(Hotel_4_Dtm, lowfreq=25)
term.freq <- rowSums(as.matrix(Hotel_4_Dtm))
term.freq <- subset(term.freq, term.freq >= 30)
hotel_4_df <- data.frame(term = names(term.freq), freq = term.freq)
saveRDS(hotel_4_df, "df_4.RDS")
saveRDS(Hotel_4_Dtm, "dtm_4.RDS")

Plotting Word Frequencies

One insight we can glean, similar to what we learned from our word cloud, is the frequency with which words occur in the review corpus.

ggplot(hotel_1_df, aes(x = term, y = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()

Plot and Table of Associations

We use the code below to see how words relate to each other, with a minimum correlation threshold of 0.50 for our analysis.

The plot below shows us the relationships between words, with the thickness of the lines indicating the strength of those relationships.

library(graph)
##plot(Hotel_1_Dtm, term = freq.terms, corThreshold = 0.50, weighting = T)

Note: we comment out the above code, as it has issues with knitr and shiny.

Similarly, we can pick specific words, as below, and see how other words relate to them.

In our Shiny app, we make this part interactive: users can input a word of their choice and see all its relationships.
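A minimal sketch of that interactive piece, with hypothetical input and output names:

## output$assoc <- renderPrint(findAssocs(Hotel_1_Dtm, input$word, 0.50))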

findAssocs(Hotel_1_Dtm, 'great', 0.50)
## $great
##      area    highly       low     light    screen     super traveling    always 
##      0.72      0.69      0.69      0.69      0.69      0.58      0.58      0.56 
##      huge connected  friendly 
##      0.53      0.53      0.52
findAssocs(Hotel_1_Dtm, 'good', 0.50)
## $good
##       rather       reason    selection        along          bee     ceilings 
##         0.62         0.62         0.62         0.60         0.60         0.60 
##        clock      descent      dropoff      english         flax       french 
##         0.60         0.60         0.60         0.60         0.60         0.60 
##          hit        issue     language       liquor       matter         miss 
##         0.60         0.60         0.60         0.60         0.60         0.60 
##    oversized       poorly       radios       rental       simple        speak 
##         0.60         0.60         0.60         0.60         0.60         0.60 
##     surprise    traintram      updated         huge surprisingly 
##         0.60         0.60         0.60         0.51         0.51
findAssocs(Hotel_1_Dtm, 'bad', 0.65)
## $bad
##            requested             achieved                  air 
##                  1.0                  0.7                  0.7 
##      allergyfriendly              aspects                avoid 
##                  0.7                  0.7                  0.7 
##              blazing               cardio             catering 
##                  0.7                  0.7                  0.7 
##                clerk                 deem             downtown 
##                  0.7                  0.7                  0.7 
##           electrical              enjoyed            essential 
##                  0.7                  0.7                  0.7 
##          featherfree             features           filtration 
##                  0.7                  0.7                  0.7 
##              fitness             goodlife             informed 
##                  0.7                  0.7                  0.7 
##             majority                mbsec              meeting 
##                  0.7                  0.7                  0.7 
##       meetingoverall             meetings                  new 
##                  0.7                  0.7                  0.7 
##                  now              outlets               placei 
##                  0.7                  0.7                  0.7 
##               pollen                 pure              quickly 
##                  0.7                  0.7                  0.7 
##              receive             received scientificallyproven 
##                  0.7                  0.7                  0.7 
##                space               speeds            surprised 
##                  0.7                  0.7                  0.7 
##          thatfinally           thoroughly             touchthe 
##                  0.7                  0.7                  0.7 
##             training           treadmills           treatments 
##                  0.7                  0.7                  0.7 
##              unusual                 warm              watches 
##                  0.7                  0.7                  0.7 
##               weight                wifii               worthy 
##                  0.7                  0.7                  0.7 
##                 youd              awesome                 bath 
##                  0.7                  0.7                  0.7 
##             checking                crazy                 crib 
##                  0.7                  0.7                  0.7 
##               deluxe                 dock                gross 
##                  0.7                  0.7                  0.7 
##             insanely                kinda               looked 
##                  0.7                  0.7                  0.7 
##               planes            porcelain                 runs 
##                  0.7                  0.7                  0.7 
##                  set              trouble                  tub 
##                  0.7                  0.7                  0.7 
##               unload 
##                  0.7

Clustering

Lastly, we create word clusters to see how words "bunch" together in reviews, and what insight this can give us into the reviews themselves.

Data Prep

First, we create a matrix with sparse terms removed. Essentially, we drop terms whose sparsity exceeds a chosen threshold (we choose 0.80), since such terms might only add noise to our data.

This will allow us to focus only on words that are more frequent and hence appear more often in reviews.

#clustering
# remove sparse terms
hotel_1_tdm2 <- removeSparseTerms(Hotel_1_Dtm, sparse = 0.80)
hotel_1_m2 <- as.matrix(hotel_1_tdm2)
# cluster terms
Hotel_1_distMatrix <- dist(scale(hotel_1_m2))
fit_1 <- hclust(Hotel_1_distMatrix, method = "ward.D2")

saveRDS(fit_1, "fit_1.RDS")

### Similarly, for other 3
hotel_2_tdm2 <- removeSparseTerms(Hotel_2_Dtm, sparse = 0.80)
hotel_2_m2 <- as.matrix(hotel_2_tdm2)
Hotel_2_distMatrix <- dist(scale(hotel_2_m2))
fit_2 <- hclust(Hotel_2_distMatrix, method = "ward.D2")
saveRDS(fit_2, "fit_2.RDS")

hotel_3_tdm2 <- removeSparseTerms(Hotel_3_Dtm, sparse = 0.80)
hotel_3_m2 <- as.matrix(hotel_3_tdm2)
Hotel_3_distMatrix <- dist(scale(hotel_3_m2))
fit_3 <- hclust(Hotel_3_distMatrix, method = "ward.D2")
saveRDS(fit_3, "fit_3.RDS")

hotel_4_tdm2 <- removeSparseTerms(Hotel_4_Dtm, sparse = 0.80)
hotel_4_m2 <- as.matrix(hotel_4_tdm2)
Hotel_4_distMatrix <- dist(scale(hotel_4_m2))
fit_4 <- hclust(Hotel_4_distMatrix, method = "ward.D2")
saveRDS(fit_4, "fit_4.RDS")

Clustering Plots

Finally, we plot the clusters created above for a visual analysis.

plot(fit_1)
rect.hclust(fit_1, k = 4) # cut tree into 4 clusters
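To read the cluster memberships off programmatically rather than from the dendrogram, cutree() returns the assignment directly:

groups <- cutree(fit_1, k = 4)
split(names(groups), groups)  ## terms listed by cluster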

Conclusion

The above analysis is a good starting point for analyzing unstructured data and comparing different businesses. We can easily expand it to other businesses and, with the help of APIs and web scraping, bring in other social media sources to add to our review corpus.

The other benefit of using APIs and scraping is that our analysis can be nearly real-time, with our dashboard showing the most recent social media trends and sentiments.

Shiny App

The above analysis is replicated in a simple dashboard deployed using the Shiny framework. The Shiny app is available at: https://qasimahmed.shinyapps.io/TorontoHotelSentimentAnalysis