We now have all our data neatly arranged in dataframes. Next, we build a corpus so the data can be used for sentiment and text analysis.
Creating a Corpus
A corpus is a structured collection of texts, in this case the review text for each hotel. The main difference between the raw text we start with and the corpus we are aiming for is streamlining: we want to remove words that add no value to our analysis. Everyday words like "a", "you", and "the" are good for clarity but carry no analytical weight. Similarly, plurals and verb tenses can muddle the analysis, so we lemmatize the text to reduce each word to its dictionary form.
## Creating and cleaning the corpus
library(tm)        # text mining: Corpus(), tm_map(), transformations
library(textstem)  # lemmatize_strings() for lemmatization

## Stopwords shared by all four hotels: standard English stopwords plus
## words that appear in nearly every review without adding meaning
common_stopwords <- c(stopwords("english"), "hotel", "one", "two", "airport",
                      "terminal", "toronto", "really", "just", "flight")

## Clean a vector of review text: strip punctuation, lowercase, drop
## stopwords, remove numbers, collapse whitespace, then lemmatize.
## Note: tm_map() on a SimpleCorpus emits a "transformation drops
## documents" warning at each step; it is spurious here and can be ignored.
clean_corpus <- function(text, extra_stopwords = character(0)) {
  corpus <- Corpus(VectorSource(text))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, c(common_stopwords, extra_stopwords))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  ## Lemmatize so plurals and tenses collapse to one dictionary form
  tm_map(corpus, content_transformer(lemmatize_strings))
}

hotel_1_word_cloud <- clean_corpus(reviews_text_1$`html_text(reviews_1, trim = TRUE)`)
saveRDS(hotel_1_word_cloud, "wordcloud_1.RDS")

## Same pipeline for the other hotels, each with a few extra
## hotel-specific stopwords (brand names and filler words)
hotel_2_word_cloud <- clean_corpus(reviews_text_2$`html_text(reviews_2, trim = TRUE)`,
                                   c("didnt", "youre", "westin"))
saveRDS(hotel_2_word_cloud, "wordcloud_2.RDS")

hotel_3_word_cloud <- clean_corpus(reviews_text_3$`html_text(reviews_3, trim = TRUE)`,
                                   "marriott")
saveRDS(hotel_3_word_cloud, "wordcloud_3.RDS")

hotel_4_word_cloud <- clean_corpus(reviews_text_4$`html_text(reviews_4, trim = TRUE)`,
                                   c("told", "said"))
saveRDS(hotel_4_word_cloud, "wordcloud_4.RDS")
Social Media Sentiment Analysis
The purpose of this exercise is to set up a framework for ingesting data from various social media platforms and using it to analyze the text contained therein.
Introduction
The majority of data available in the world today is unstructured. This data can carry great insight for businesses looking to learn from customers and improve their offerings. There is also huge potential to use the same text analysis for market and competitor research, identifying what one's competition is doing differently and learning from it.
Data
We use a combination of API calls and web scraping to gather data on hotels in Toronto that primarily serve travellers. We base this purely on the hotels' location near the airport.
This also lets us deal with data as we would encounter it in the real world, rather than working with the pre-cleaned datasets widely available on the internet.
Data Gathering and Prep
Working with the Yelp API
We use the Yelp API to gather data on hotels in Toronto near the airport.
Use the store_access_token() command to authenticate against the Yelp API with your token. For more information, see the tutorial at: https://github.com/richierocks/yelp
We gather this data and place it in a dataframe to make manipulation easier.
Ideally, we would keep and analyze all the hotels we retrieve, but for this exercise we limit ourselves to the four hotels with the most reviews.
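As a minimal sketch of this step (the search term, location string, and the review_count column name are assumptions, not the original code), the retrieval looks roughly like this with the yelp package:
library(yelp)    # remotes::install_github("richierocks/yelp")
library(dplyr)

## One-time authentication, as described above
store_access_token("<your token>")

## Search for hotels in Toronto and keep the four with the most reviews
hotels <- business_search("hotels", location = "Toronto, ON")
top_hotels <- hotels %>%
  arrange(desc(review_count)) %>%
  slice_head(n = 4)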
Mapping our Hotels - with Leaflet
The first exercise is to place the hotels on a map to judge their proximity to each other.
We see that the hotels are, as we selected, close to the airport and to each other, with the Sheraton located on the airport premises itself.
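A minimal leaflet sketch of this map (the dataframe and coordinate column names carry over from the search sketch above and are assumptions):
library(leaflet)

## Interactive map of the four hotels; popups show the hotel names
leaflet(top_hotels) %>%
  addTiles() %>%
  addMarkers(lng = ~longitude, lat = ~latitude, popup = ~name)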
Data Scraping with rvest
We use the rvest library to augment the data we get from the Yelp API. A limitation of the API is that it returns only the top three reviews per business, and we want to see many more. To keep the project simple, we focus on 40 reviews from each hotel. A sketch of the scraping step follows.
The method we use is endlessly extendable: the same approach could augment our review data with reviews from Google Maps, TripAdvisor, and other sources.
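The scraping step, reconstructed as a hedged sketch (the URL and CSS selector are illustrative placeholders; Yelp's markup changes, so the selector must be checked against the live page):
library(rvest)

## Read a hotel's review page and pull out the review paragraphs
page <- read_html("https://www.yelp.com/biz/example-hotel-toronto")
reviews_1 <- html_nodes(page, "p.review-text")

## Keep the raw column name so it matches the corpus code above
reviews_text_1 <- data.frame(html_text(reviews_1, trim = TRUE),
                             check.names = FALSE)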
Data Transformation
With all our data neatly arranged in dataframes, we transform it into the cleaned, lemmatized corpus shown in Creating a Corpus above.
Word Cloud
Finally, with data prep done, we can dive into some of the fun analytical stuff.
Using the easy-to-use wordcloud library, we set up a word cloud for each hotel.
We display the 25 most common words, with the size of each word indicating its frequency.
It is no surprise that "room" is the most prominent topic for customers and guests at a hotel.
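A sketch of how such a cloud can be produced from the cleaned corpus (the parameter choices here are illustrative, not the original settings):
library(tm)
library(wordcloud)
library(RColorBrewer)

## Word frequencies from the cleaned corpus, most frequent first
tdm <- TermDocumentMatrix(hotel_1_word_cloud)
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

## Draw the 25 most common words, sized by frequency
wordcloud(names(freqs), freqs, max.words = 25, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))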
Sentiment Analysis
Sentiment analysis looks at words and their usage and, based on several criteria, assigns them different emotional values. The syuzhet library we use compares words against the NRC lexicon and scores them along the emotional spectrum (those interested can find more information at http://sentiment.nrc.ca/lexicons-for-research/).
Corpus Preparation
First, we convert our cleaned corpus into a character vector so that each review can be scored individually.
Note: There are two schools of thought here. One would apply the sentiment analysis to the original review data, the text that has not been cleaned or lemmatized, since different tenses and active or passive constructions score differently against the NRC lexicon. The other school would score the cleaned data against the lexicon.
Ideally, market research would not rely wholly on the sentiment analysis; it would reconfirm the results by reading the reviews, following up with the original posters, and so on.
Here we compare against our cleaned review corpus.
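A hedged sketch of the scoring step with syuzhet (the conversion via sapply() is an assumption about how the corpus was vectorized):
library(tm)
library(syuzhet)

## One character string per review
review_vector <- sapply(hotel_1_word_cloud, as.character)

## Score each review against the NRC lexicon: eight emotion columns
## plus overall negative and positive counts
sentiments <- get_nrc_sentiment(review_vector)
head(sentiments)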
Sentiment Graph
Finally, we plot our results.
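Continuing the sketch above, one simple way to plot the totals (the plotting choices are illustrative):
## Total each emotion across all reviews and plot, largest first
totals <- colSums(sentiments)
barplot(sort(totals, decreasing = TRUE), las = 2,
        main = "NRC sentiment, Hotel 1 reviews")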
Association Analysis
Next, we look at how words associate with other words. This can be a useful exercise to see how customers think, and which words they pair with each other.
The goal is to look at the reviews people leave and uncover the relationships that exist between words. From an app perspective, we would want to suggest words like "good" or "bad" and see how they relate to "room", "lobby", and so on.
Term Document Matrix
For this analysis we create a term-document matrix, built as sketched below. A document-term matrix (or term-document matrix) is a matrix that describes the frequency of terms occurring in a collection of documents. In simple terms, it records, for every word in our corpus, how often it appears in each review.
Plotting Word Frequencies
One insight we can glean, similar to what we learned from our word cloud, is the frequency with which words occur in the review corpus.
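A sketch covering both steps, building the term-document matrix and plotting the top frequencies (the cutoff of 15 words is illustrative):
library(tm)

## Term-document matrix: rows are terms, columns are individual reviews
hotel_1_tdm <- TermDocumentMatrix(hotel_1_word_cloud)

## Most frequent words across all reviews
freqs <- sort(rowSums(as.matrix(hotel_1_tdm)), decreasing = TRUE)
barplot(freqs[1:15], las = 2, main = "Most frequent words in reviews")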
Plot and Table of Associations
We use the code below to see how words are related to each other. We use a minimum correlation threshold of 0.50 for our analysis.
The plot below shows the relationships between words, with the thickness of the lines indicating the strength of these relationships.
Note: We comment out the above code, as it has issues with knitr and shiny.
Similarly, we can pick specific words, as below, to see how other words relate to them.
In our shiny app, we will make this part interactive, where users can input a word of their choice and see all relationships.
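A hedged sketch of the lookup for one word (the query word "room" is an example; findAssocs() returns all terms correlated at or above the threshold):
library(tm)

## All terms correlated with "room" at 0.50 or above
findAssocs(hotel_1_tdm, "room", 0.50)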
Clustering
Lastly, we want to create word clusters to see how words "bunch" together in reviews, which can give us further insight into what guests write about.
Data Prep
First, we create a matrix with sparse terms removed. Essentially, we drop words whose sparsity exceeds a chosen threshold (we choose 0.80), since words that appear in very few reviews likely only add noise to our data.
This allows us to focus only on the words that appear most often across reviews.
Clustering Plots
Finally, we plot the clusters created above for a visual analysis.
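A sketch of both steps, removing sparse terms and plotting a hierarchical clustering of the remainder (the distance scaling and the ward.D2 linkage are assumptions, not the original settings):
library(tm)

## Keep only terms that appear in enough reviews (sparsity <= 0.80)
tdm_dense <- removeSparseTerms(hotel_1_tdm, sparse = 0.80)

## Cluster words by the similarity of their occurrence across reviews
dists <- dist(scale(as.matrix(tdm_dense)))
fit <- hclust(dists, method = "ward.D2")
plot(fit, main = "Word clusters in Hotel 1 reviews")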
Conclusion
The above analysis is a good start at analyzing unstructured data and comparing different businesses. We can easily expand this analysis to other businesses and, with the help of APIs and web scraping, bring in other social media sources to add to our review corpus.
The other benefit of using APIs and scraping is that our analysis can run in near real time, with our dashboard showing the most recent social media trends and sentiments.
Shiny App
The above analysis is replicated in a simple dashboard built and deployed with the Shiny framework. The app is available at: https://qasimahmed.shinyapps.io/TorontoHotelSentimentAnalysis