1. Title: Text Mining on German Yelp Reviews

2. Introduction

By applying text mining to German Yelp reviews, this report tries to shed light on the question of which aspects influence people when giving their star ratings. The (contestable) rationale behind this approach is the following: we assume that the information in the reviews (i.e. the words written) has a causal effect on the rating given. The results might be interesting for two audiences: first, people who are interested in text mining in general; second, business owners who want to know which salient aspects drive people's rating behaviour.

3. Methods and Data

In the following, the relevant code chunks are presented and commented. After loading the review data set, we first had to remove the reviews that were not written in German; the rows were identified manually (an automated alternative is sketched after the chunk below). Then, a binary variable called 'label' (bad, good) was created.

review_Deutsch <- readRDS("review_Deutsch.rds")

# Drop the reviews that were identified (manually) as not written in German
review_Deutsch <- review_Deutsch[-c(91, 195, 197, 199, 201, 262, 267, 272, 275, 328, 339, 355, 356, 422, 436, 456, 457, 459, 460, 461, 524, 528, 541, 623, 709, 724, 741, 758, 782, 822, 925, 956, 969, 979, 983, 993, 996, 1022, 1054, 1097, 1103, 1119, 1149, 1157, 1161, 1164, 1173, 1211, 1212, 1248, 1280, 1347, 1359, 1362, 1368, 1394, 1400, 1405, 1409, 1431, 1447, 1454, 1467, 1554, 1687, 1690, 1704, 1709, 1736, 1754, 1787, 1788, 1813, 1815, 1816, 1831, 1832, 1835, 1840, 1842, 1847, 1875, 1878, 1899, 1906, 1961, 2040, 2044, 2047, 2081, 2089, 2113, 2137, 2145, 2272, 2285, 2319, 2323, 2354, 2383, 2386, 2387, 2453, 2454, 2490, 2498, 2515, 2632), ]
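
As a side note, the hand-picked row indices could be replaced by automatic language detection. The following is a minimal sketch assuming the textcat package, which was not used in the original analysis.

# Hypothetical alternative: detect each review's language automatically
# instead of using the manual indices above (assumes the 'textcat' package)
library(textcat)
langs <- textcat(review_Deutsch$text)
review_Deutsch <- review_Deutsch[!is.na(langs) & langs == "german", ]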

# Binarise the star ratings: 1-3 stars -> "bad", 4-5 stars -> "good"
review_Deutsch$label <- ifelse(review_Deutsch$stars >= 4, "good", "bad")

Next, we created a data frame containing only the information needed for the subsequent analysis: the reviews and their corresponding labels.

review_Deutsch_text <- data.frame(Text  = review_Deutsch$text,
                                  Label = review_Deutsch$label,
                                  stringsAsFactors = FALSE)

Then, we used the tm package for the text processing. A list of German stop words was used to remove the most common German words from the reviews. As these words occur so frequently, they add no information to the analysis and are therefore discarded. Further modifications, typical in text processing, were made: punctuation and numbers were removed, all words were converted to lowercase, the words were stemmed, and excess white space was stripped. Then a Document Term Matrix (DTM) was built. Before converting it into a data frame, we removed terms that appear in fewer than five documents, as such rare terms, much like the most common words, rather impede the analysis.

library(tm)
## Warning: package 'tm' was built under R version 3.1.3
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.1.3
options(stringsAsFactors = FALSE)

# German stop word list, one word per line, UTF-8 encoded
germanStopwords <- readLines("stopwords_de.txt", encoding = "UTF-8")

corpus <- Corpus(VectorSource(review_Deutsch_text$Text))

corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes=TRUE)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, germanStopwords)
corpus <- tm_map(corpus, stemDocument, language = "german")
corpus <- tm_map(corpus, stripWhitespace)

DTM <- DocumentTermMatrix(corpus)
dim(DTM)
## [1]  2583 17270
# Discard terms that occur in fewer than five documents
minimumFrequency <- 5
DTM <- DocumentTermMatrix(corpus, control = list(bounds = list(global = c(minimumFrequency, Inf))))
dim(DTM)
## [1] 2583 2735
DTM_dataframe <- as.data.frame(as.matrix(DTM))

# Ensure syntactically valid column names for the modelling formula below
colnames(DTM_dataframe) <- make.names(colnames(DTM_dataframe))

DTM_dataframe$label <- review_Deutsch$label

In a next step, we applied a CART model to get a first understanding of the relationship between reviews and ratings. However, as the reader may know, single trees often lack accuracy, which was also the case here. That is why we will only comment on the results from the random forest model we fitted.
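
For completeness, a minimal sketch of this CART step is given below; it assumes the rpart package, and the resulting tree is not reported.

# Hypothetical sketch of the CART model (assumes the 'rpart' package);
# the tree itself is not shown, as single trees lacked accuracy here
library(rpart)
cart_data <- DTM_dataframe
cart_data$label <- as.factor(cart_data$label)
CART <- rpart(label ~ ., data = cart_data, method = "class")
printcp(CART)  # cross-validated error for each tree size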

4. Results

After splitting the data into a training set and a test set, we first had to define the label variable as a factor, so that the algorithm would perform a classification rather than a regression. Random forests are among the most widely applied methods in data mining. However, they often come with considerable computational cost. That is why we decided to grow a comparatively small forest of 128 trees.

library(randomForest)
## Warning: package 'randomForest' was built under R version 3.1.3
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(caTools)
## Warning: package 'caTools' was built under R version 3.1.3
set.seed(123)

split <- sample.split(DTM_dataframe$label, SplitRatio = 0.7)
train <- subset(DTM_dataframe, split==TRUE)
test <- subset(DTM_dataframe, split==FALSE)

train$label <- as.factor(train$label)
test$label <- as.factor(test$label)

RF <- randomForest(label ~ ., data = train, ntree = 128)

predictRF <- predict(RF, newdata = test)
table(test$label, predictRF)
##       predictRF
##        bad good
##   bad  124  140
##   good  24  487
tab <- table(test$label, predictRF)

When assessing the model, we opted for a range of indicators: Accuracy, Precision (positive predictive value), Recall (also called Sensitivity), and F1 (the harmonic mean of Precision and Recall). As it turns out, both Accuracy and Precision have a similar value of almost 80%. Compared with models from other data mining analyses, this does not seem very high. However, keep in mind that language is a rather unstructured and, if you will, noisy phenomenon. Hence, we consider the results acceptable, although there is definitely potential for improvement in further analyses. Recall turns out to be excellent. However, this might have something to do with an imbalance between the two categories bad (ratings 1, 2, 3) and good (ratings 4, 5), the latter exceeding the former by far. Finally, F1 shows a decent value of about .85; enough, we propose, to have a look at the words that had the biggest impact on the model.

#Accuracy:
((tab[1,1]+tab[2,2])/(tab[1,1]+tab[1,2]+tab[2,1]+tab[2,2]))
## [1] 0.7883871
#Precision:
((tab[2,2])/(tab[2,2]+tab[1,2]))
## [1] 0.7767145
#Recall(Sensitivity):
((tab[2,2])/(tab[2,2]+tab[2,1]))
## [1] 0.9530333
#F1(harmonic mean of precision and recall):
p <- ((tab[2,2])/(tab[2,2]+tab[1,2]))
r <- ((tab[2,2])/(tab[2,2]+tab[2,1]))
f <- 2*((p*r)/(p+r))
f
## [1] 0.8558875
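
As a cross-check, the same metrics could be obtained in a single call; this is a sketch assuming the caret package, which was not used in the original analysis.

# Hypothetical cross-check (assumes the 'caret' package); "good" is
# treated as the positive class, matching the calculations above
library(caret)
confusionMatrix(predictRF, test$label, positive = "good")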

At this point, the interest of business owners comes in, as they can get an idea of which aspects customers consider most salient in their consuming activities. In first and second place, we see two words that are rather non-informative: schad (pity) and schlecht (bad). Clearly two negative sentiments, but they don't give any hint for action. However, the next two words, unfreund (unfriendly) and leck (tasty), are more interesting. Perceived unfriendly service seems to be the most prevalent aspect that makes a customer give a negative rating. As most reviews concern restaurants, bars, etc., it does not come as a surprise that tastiness is another strong indicator in the model. Further words worth mentioning are geschmack (taste), lieblos (a tricky word; it means something like 'carelessly', 'without love') and freundlich (friendly).

varImpPlot(RF)
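
In addition to the plot, the most important terms can be listed numerically with functions from the randomForest package itself; a minimal sketch:

# List the ten terms with the highest mean decrease in Gini impurity;
# a numerical complement to the varImpPlot output above
imp <- importance(RF)
head(sort(imp[, "MeanDecreaseGini"], decreasing = TRUE), 10)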

5. Discussion

To be honest, the practical results presented do not come as a surprise. So this analysis seems to appeal more to the data analyst than to the business owner. However, we can state that the model was able to classify a large number of ratings correctly based on the information retrieved from the text data. This leaves room for further analyses. 1: How do different kinds of businesses (restaurants, shops, galleries) differ as to which aspects customers consider most salient? 2: How do different nationalities differ in that regard? 3: How do the results change with the number of terms retained in the Document Term Matrix? 4: Remember that, due to the data given, the number of good ratings was much larger than the number of bad ratings. If we had enough ratings and selected only a subset of good ratings so as to have equal numbers of bad and good ratings, how would this affect the results (a sketch of such a balancing step follows below)? Keep in mind, for example, the high Recall rate presented earlier. One can clearly see that, following this analysis, further questions on this topic abound.
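
Regarding the fourth question, a minimal sketch of such a balancing step is given below. The random downsampling of the majority class is our assumption; it was not carried out in this report.

# Hypothetical sketch: downsample "good" reviews so that both classes
# have equal size (not performed in this analysis)
set.seed(123)
n_bad    <- sum(DTM_dataframe$label == "bad")
good_idx <- which(DTM_dataframe$label == "good")
balanced <- rbind(DTM_dataframe[DTM_dataframe$label == "bad", ],
                  DTM_dataframe[sample(good_idx, n_bad), ])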