As competition among businesses increases, ratings and reviews play a pivotal role in the success of any business. This is an analysis of Austin coffee shop review data. Online reviews and customer feedback can give businesses great insights for improving their business model. Even though review-like data is readily available and easily accessible on social media platforms, there is no easy way to analyze such unstructured data.
In this project, reviews of coffee shops in Austin are studied using several techniques to understand which factors cause customers to be satisfied or dissatisfied. To uncover these insights, I explore three of the most popular topics in Natural Language Processing: Sentiment Analysis, Word Embeddings (e.g., Word2Vec and GloVe), and Topic Modeling.
Sentiment Analysis aims to determine the attitude of a speaker or writer with respect to some topic, or the overall contextual polarity of a document. In our case we will use it to determine whether a review is positive or negative. Sentiment analysis is contextual mining of text that identifies and extracts subjective information in source material, helping a business understand the social sentiment around its brand, product, or service while monitoring online conversations.
Word embedding is one of the most popular representations of document vocabulary. It is capable of capturing the context of a word in a document, its semantic and syntactic similarity, its relation with other words, and so on.
A Topic Model identifies topics in which words sharing similar contextual meanings appear together. Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds natural groups of items even when we're not sure what we're looking for. It's as if similar words are clustered together, except that a word can appear in multiple topics. Additionally, each review can be characterized by one or more topics. Topics are identified based on the likelihood of term co-occurrence, as determined by a model. Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model.
# Load (and install if missing) all required packages
pacman::p_load(dplyr, caret, ggplot2, tm, SnowballC, data.table, tidyr, tidytext, wordcloud,
               reshape2, textstem, topicmodels, Rmpfr, LDAvis, stringi, sentimentr, e1071,
               text2vec, ggrepel, Rtsne, tibble, scales, ldatuning)
# Read the reviews, drop the unused vibe_sent column, and keep complete cases
data <- fread("ratings_and_sentiments.csv")
df <- data.frame(data)
df <- df %>% select(-vibe_sent)
df <- df[complete.cases(df),]
Dataset: roughly 7,000 Yelp.com reviews of coffee shops in Austin, with a row for each review and columns for the coffee shop name, the review text, and the review score.
mean(df$num_rating)
## [1] 4.169207
The average rating across all reviews of Austin coffee shops is 4.17 out of 5.
49.6% of the reviews have a rating of 5, 31% a rating of 4, 9.7% a rating of 3, 6% a rating of 2, and 3.7% a rating of 1.
The distribution above also shows that online reviews are heavily skewed toward positive ratings, with over 80% of reviews being 4 or 5 stars. This suggests that in the review world, a rating below 4 stars qualifies as a "bad review", while 4 stars or above signals a top-quality experience.
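For reference, a minimal sketch of how these shares can be computed from the rating column:

# Percentage of reviews at each star rating
round(prop.table(table(df$num_rating)) * 100, 1)

# Share of reviews rated 4 stars or higher
mean(df$num_rating >= 4) * 100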
Review text cleaning: removing numbers, punctuation, accents, and unwanted words; converting to lowercase; stripping whitespace; and lemmatizing.
# Keep a row id per review, then clean the text step by step
sent_data <- df %>% mutate(id = row_number()) %>% select(id, coffee_shop_name, review_text, num_rating)
sent_data$review_text <- removeNumbers(sent_data$review_text)
sent_data$review_text <- tolower(sent_data$review_text)
# Remove domain words that appear in nearly every review, plus SMART stopwords
sent_data$review_text <- removeWords(sent_data$review_text, c('coffee', 'food', 'drink', 'austin', stopwords("SMART")))
sent_data$review_text <- removePunctuation(sent_data$review_text)
# Strip accents and other non-ASCII characters
sent_data$review_text <- iconv(sent_data$review_text, "latin1", "ASCII", sub="")
sent_data$review_text <- stripWhitespace(sent_data$review_text)
sent_data$review_text <- lemmatize_strings(sent_data$review_text)
tidy_sent <- sent_data %>% unnest_tokens(word, review_text)
sentiment <- tidy_sent %>% inner_join(get_sentiments("bing"))
## Joining, by = "word"
The 10 most frequent positive and negative words in the reviews:
## Selecting by n
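The chunk that produced this plot is not echoed; below is a minimal sketch of how it could be built from the `sentiment` data frame above (the `Selecting by n` message is emitted when `top_n()` is called without naming a weighting column):

# Count words within each Bing sentiment, keep the 10 most frequent, and plot
sentiment %>%
  count(word, sentiment) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  coord_flip()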
We perform a lexicon-based, unsupervised sentiment analysis using the sentimentr package, written by Tyler Rinker. This lexicon-based approach is more sophisticated than it sounds: it takes into consideration concepts such as "amplifiers" and "valence shifters" when calculating sentiment.
Now we plot sentiment by rating using a box plot. We would expect that the higher the rating a review received, the higher the sentiment of the review.
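A minimal sketch of how the review-level scores and the box plot could be produced; note that `review_text` has already been cleaned and lemmatized above, which is a simplification, as the original analysis may have scored the raw review text instead:

# Average sentence-level sentiment per review; sentimentr handles
# amplifiers and valence shifters internally
rev_sent <- sentiment_by(get_sentences(sent_data$review_text))
sent_data$ave_sentiment <- rev_sent$ave_sentiment

# Sentiment distribution by star rating
ggplot(sent_data, aes(x = factor(num_rating), y = ave_sentiment)) +
  geom_boxplot() +
  labs(x = "Star rating", y = "Average review sentiment")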
Reviews with a star rating of 1 tend to be consistently negative, while reviews with a star rating of 5 are a mixed bag of sentiment, though high on average.
The comparison cloud gives a clear contrast between the words used by people who are happy with the service and those who are not. Reviewers who would not recommend a place use negative words such as disappoint, bad, complaint, smell, and rude. Reviewers who would recommend a place use positive words such as great, delicious, friendly, fast, and fantastic.
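For reference, a sketch of how such a comparison cloud could be drawn with the wordcloud and reshape2 packages loaded earlier:

# Cast Bing word counts into a word-by-sentiment matrix and draw the cloud
tidy_sent %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "darkgreen"), max.words = 100)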
Working with text corpora involves natural language processing techniques. First we split the reviews into train and test sets; the classifier's vocabulary is built from the training data. We then create a VectorSource, which is the input type for the Corpus function defined in the tm package. That gives us a VCorpus object, essentially a collection of content+metadata objects where the content contains our sentences. To derive document input features for our classifier, we reshape this corpus into a document-term matrix: a numeric matrix with a column for each distinct word in the whole corpus. If we treated each column as a model term, we would end up with a very complex model with 11,522 features, which would be slow and probably not very effective. Some terms are more important than others, and we want to remove those that carry little information. After removing sparse terms, we are left with just 555 terms.
# Drop neutral 3-star reviews and binarize: 1 = positive (4-5 stars), 0 = negative (1-2 stars)
sent_data <- sent_data %>% filter(num_rating != 3) %>% mutate(review = ifelse(num_rating >= 4, 1, 0))
set.seed(123)
split <- createDataPartition(sent_data$review, p = 0.8, list = FALSE)
train <- sent_data[split,]
test <- sent_data[-split,]
corpus_review <- Corpus(VectorSource(train$review_text))
dtm_review <- DocumentTermMatrix(corpus_review)
dtm_review
## <<DocumentTermMatrix (documents: 5500, terms: 11522)>>
## Non-/sparse entries: 172711/63198289
## Sparsity : 100%
## Maximal term length: 28
## Weighting : term frequency (tf)
dtm_review <- removeSparseTerms(dtm_review, 0.99)
dtm_review
## <<DocumentTermMatrix (documents: 5500, terms: 555)>>
## Non-/sparse entries: 117883/2934617
## Sparsity : 96%
## Maximal term length: 13
## Weighting : term frequency (tf)
Now we convert this matrix into a data frame that we can use to train a classifier.
# Append the label as column "y" and convert to a data frame for training
train_review <- as.matrix(dtm_review)
train_review <- cbind(train_review, train$review)
colnames(train_review)[ncol(train_review)] <- "y"
train_review <- as.data.frame(train_review)
train_review$y <- as.factor(train_review$y)
test_corpus <- Corpus(VectorSource(test$review_text))
# Restrict the test DTM to the vocabulary learned from the training data
test_dtm <- DocumentTermMatrix(test_corpus, control=list(dictionary = Terms(dtm_review)))
test_review <- as.matrix(test_dtm)
test_review <- as.data.frame(test_review)
From the summary we see that the model selects 5,094 observations as support vectors. We also see that the kernel is radial, the cost is 0.1, gamma is 0.1, and we are predicting two classes, 0 and 1.
set.seed(123)
review_model <- svm(y ~.,train_review, kernel='radial', gamma=0.1, cost=0.1)
summary(review_model)
##
## Call:
## svm(formula = y ~ ., data = train_review, kernel = "radial", gamma = 0.1,
## cost = 0.1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.1
##
## Number of Support Vectors: 5094
##
## ( 4501 593 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
The train error is 10.78% and the test error is 10.48%. Note, however, that the confusion matrices below show the model predicting class 1 for every review, so these error rates simply mirror the class imbalance (about 10.8% of training reviews are negative) rather than genuine discriminative power.
pred_train <- predict(review_model, newdata = train_review)
table(Actual=train$review, Predicted = pred_train)
## Predicted
## Actual 0 1
## 0 0 593
## 1 0 4907
100-mean(train$review == pred_train)*100
## [1] 10.78182
pred_test <- predict(review_model, newdata = test_review)
table(Actual=test$review, Predicted = pred_test)
## Predicted
## Actual 0 1
## 0 0 144
## 1 0 1230
100-mean(test$review == pred_test)*100
## [1] 10.48035
Word embedding is a technique that takes a corpus (a structured set of texts, such as reviews) and transforms it in such a way that it captures the context of a word in a document/review, its semantic and syntactic similarity, and its relation with other words. Here we consider only the negative reviews, to see what users are most dissatisfied with.
t-SNE maps high-dimensional data such as word embeddings into a lower dimension such that the distance between two words roughly describes their similarity. Additionally, it begins to form natural clusters.
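The chunk that fits the embedding is not echoed in this document. Below is a minimal sketch of how it could be done with the text2vec and Rtsne packages loaded earlier, assuming the current text2vec GloVe API. The window size, vocabulary pruning threshold, and iteration count are assumptions; the 50-dimensional vectors and perplexity of 50 are taken from the console output that follows. The `vocab` object created here is the one filtered further below.

# Tokenize the negative reviews and build the text2vec vocabulary
it <- itoken(neg_rev$text, tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5)

# Term co-occurrence matrix and 50-dimensional GloVe word vectors
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)
glove <- GlobalVectors$new(rank = 50, x_max = 10)
word_vectors <- glove$fit_transform(tcm, n_iter = 20) + t(glove$components)

# Project the vectors to 2-D (Rtsne performs PCA preprocessing by default)
set.seed(123)
tsne <- Rtsne(word_vectors, dims = 2, perplexity = 50)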
## Performing PCA
## Read the 1099 x 50 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 50.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.32 seconds (sparsity = 0.195681)!
## Learning embedding...
## Iteration 50: error is 63.653259 (50 iterations in 0.24 seconds)
## Iteration 100: error is 63.656278 (50 iterations in 0.22 seconds)
## Iteration 150: error is 63.668862 (50 iterations in 0.26 seconds)
## Iteration 200: error is 63.655218 (50 iterations in 0.28 seconds)
## Iteration 250: error is 63.651729 (50 iterations in 0.26 seconds)
## Iteration 300: error is 2.508521 (50 iterations in 0.16 seconds)
## Iteration 350: error is 2.471283 (50 iterations in 0.12 seconds)
## Iteration 400: error is 2.460460 (50 iterations in 0.12 seconds)
## Iteration 450: error is 2.455687 (50 iterations in 0.12 seconds)
## Iteration 500: error is 2.450819 (50 iterations in 0.13 seconds)
## Fitting performed in 1.93 seconds.
To make this even more interesting, let's overlay sentiment. To estimate sentiment at the word level, we use the sentence-level sentiment: we take all the sentences that contain the word of interest and average the sentiment across all those sentences.
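A minimal sketch of that aggregation, assuming `neg_rev` holds the negative reviews with a `text` column (as it does in the topic-modeling code below); the `word_sent` name is a hypothetical placeholder:

# Split each review into sentences, score them with sentimentr, then
# average sentence sentiment over every sentence containing a given word
word_sent <- neg_rev %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(score = sentiment_by(get_sentences(sentence))$ave_sentiment) %>%
  unnest_tokens(word, sentence) %>%
  group_by(word) %>%
  summarise(word_sentiment = mean(score), n_sentences = n())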
We then convert the reviews into a Document-Term Matrix (DTM). A DTM (or TDM, for Term-Document Matrix) is a very popular way of storing or structuring text data that allows for easy manipulation, such as fitting a Latent Dirichlet Allocation model.
To find the right number of topics, we run Latent Dirichlet Allocation at varying levels of k (the number of topics) and determine the most appropriate number by looking at several evaluation metrics.
Here, too, we consider only negative reviews, to identify what customers are dissatisfied with.
# Keep words that appear in at least 1% of the negative reviews
frequent_words <- vocab %>%
  filter(doc_count >= nrow(neg_rev) * .01) %>%
  rename(word = term) %>%
  select(word)
# Tokenize negative reviews, drop stop words, and count words per review
by_review_word <- neg_rev %>%
  mutate(id = 1:nrow(.)) %>%
  unnest_tokens(word, text)
word_counts <- by_review_word %>%
  anti_join(stop_words, by = "word") %>%
  count(id, word, sort = TRUE) %>%
  ungroup()
# Cast the counts into a document-term matrix
dtm <- word_counts %>%
  cast_dtm(id, word, n)
# Evaluate candidate topic counts k = 2 to 15 with four ldatuning metrics
result <- FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
## fit models... done.
## calculate metrics:
## Griffiths2004... done.
## CaoJuan2009... done.
## Arun2010... done.
## Deveaud2014... done.
FindTopicsNumber_plot(result)
By inspecting the "maximize" and "minimize" evaluation metrics, k = 7 topics seems to be an appropriate number.

We now refit the model using k = 7 and, within each topic, return the top 15 words by their beta values (the per-topic word probabilities).
From the top words of the different topics, we can say that customers are dissatisfied with taste according to Topic 1. Seating, parking, and a loud environment are the concerns in Topic 2. Price and quality are the concerns in Topic 3. Topic 4 emphasizes terrible service, bad customer service, and rudeness. In Topic 5, location and internet are the main concerns.
Now we overlay the sentiment and the word vectors to create a single cohesive visualization that encapsulates all three Natural Language Processing tasks.

We have successfully created a single visualization that encapsulates the sentiment (using lexicon-based, domain-specific sentiment analysis), the semantic word similarity (using GloVe word embeddings), and the topics (using topic modeling with Latent Dirichlet Allocation).
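The plotting code for this combined view is not echoed here. Below is a rough sketch of the idea, where `tsne_df` (word plus t-SNE X/Y coordinates), `word_sent` (word-level sentiment), and `word_topic` (most probable topic per word) are hypothetical stand-ins for objects computed in the steps above:

# Hypothetical: one row per word with 2-D coordinates, sentiment, and topic
plot_df <- tsne_df %>%
  left_join(word_sent, by = "word") %>%
  left_join(word_topic, by = "word")

ggplot(plot_df, aes(X, Y, colour = word_sentiment)) +
  geom_point(aes(shape = factor(topic)), size = 2) +
  geom_text_repel(aes(label = word), size = 3) +
  scale_colour_gradient2(low = "red", mid = "grey70", high = "darkgreen") +
  theme_minimal()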
From this graph we can see that seating, service, vibe, wait time, rudeness, and parking are the terms that stand out in the sentiment overlay.
Top words in Topics 1 to 7
corpus <- tm::Corpus(tm::DataframeSource(neg_rev))
neg_rev_dtm <- tm::DocumentTermMatrix(corpus)
# Mean tf-idf per term; keep only terms at or above the median
term_tfidf <- tapply(neg_rev_dtm$v/slam::row_sums(neg_rev_dtm)[neg_rev_dtm$i], neg_rev_dtm$j, mean) *
  log2(tm::nDocs(neg_rev_dtm)/slam::col_sums(neg_rev_dtm > 0))
neg_rev_dtm_rm <- neg_rev_dtm[, term_tfidf >= median(term_tfidf)]
# Drop documents left empty by the tf-idf filter
ui = unique(neg_rev_dtm_rm$i)
neg_rev_dtm_rm = neg_rev_dtm_rm[ui,]
# Fit the final LDA model with k = 7 topics via Gibbs sampling
model <- topicmodels::LDA(neg_rev_dtm_rm, k = 7, method = "Gibbs", control = list(iter = 2000, seed = 123))
# Most likely topic per review, and the top 30 terms per topic
topics <- topicmodels::topics(model, 1)
terms <- as.data.frame(topicmodels::terms(model, 30), stringsAsFactors = FALSE)
terms[1:10,]
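The bar chart of top words is not echoed here; a sketch of how the top 15 words per topic could be pulled out by beta value using tidytext's `tidy()` method for LDA models:

# Per-topic word probabilities (beta), keeping the 15 highest per topic
top_terms <- tidy(model, matrix = "beta") %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup()

ggplot(top_terms, aes(reorder_within(term, beta, topic), beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_x_reordered() +
  coord_flip() +
  labs(x = NULL, y = "beta")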
To summarize, I have also built an interactive visualization of the topics and the words belonging to each topic, but it cannot be embedded in a PDF, so please see the R code file.
# Convert a fitted topicmodels LDA object into the JSON structure LDAvis expects
topicmodels2LDAvis <- function(x, ...){
  post <- topicmodels::posterior(x)
  if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
  mat <- x@wordassignments
  LDAvis::createJSON(
    phi = post[["terms"]],      # topic-term probabilities
    theta = post[["topics"]],   # document-topic probabilities
    vocab = colnames(post[["terms"]]),
    doc.length = slam::row_sums(mat, na.rm = TRUE),
    term.frequency = slam::col_sums(mat, na.rm = TRUE)
  )
}
serVis(topicmodels2LDAvis(model))
## Loading required namespace: servr