The dataset used in this project consists of Tripadvisor reviews of hotels.
It contains 20491 reviews described by 2 variables:
Review = full text of the review
Rating = rating from 1 to 5 for each review
To make the dataset easier to work with, we extract a random sample of 1000 observations.
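The sampling step can be sketched as follows. This is only an illustration under stated assumptions: the data frame `reviews_full` is a synthetic stand-in for the real review table, and the names `reviews_full`, `sample_idx` and `reviews_sample` are hypothetical.

```r
# Toy stand-in for the full review table (hypothetical data, not the real reviews)
set.seed(123)  # fix the seed so the same rows are drawn on every run
reviews_full <- data.frame(
  Review = paste("review text", seq_len(20491)),
  Rating = sample(1:5, 20491, replace = TRUE),
  stringsAsFactors = FALSE
)

# Draw 1000 distinct row indices and keep only those rows
sample_idx <- sample(nrow(reviews_full), size = 1000)
reviews_sample <- reviews_full[sample_idx, ]

dim(reviews_sample)
```

Setting the seed before sampling keeps the sample reproducible, which matters because every later result depends on which 1000 reviews are drawn.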
Let's visualize the table that lets us browse our new data.
First of all, we check whether the data contain missing values. As the count below shows, there are none.
sum(is.na(T_rw))
## [1] 0
To study how the ratings are distributed, we created the following bar plot.
Most reviews come with a rating of 5.
To give an idea of the text, the table below shows the first 2 reviews.
##
##
## +--------------------------------+--------------------------------+
## | tiny comfortable location | excellent vacation july 14-21/ |
## | terrific beware noise busy | 2007we traveled extensively |
## | intersection, ambient smells | carraibe loved hotel1 food no |
## | neighbouring restaurant | gastroenteritis 2 mosquitoes |
## | wafting room evening night | not issue3 beach activities |
## | hours difficult sleep.the | kids club nice 3 y/o daughter |
## | hotel staff cooly efficient | enjoyed 5 staff friendly |
## | helpful point knowledge end | helpful6 rooms big separate |
## | door hotel, evening arrival | sofa additional person7 nb |
## | looking advice place eat none | children allergy n't hesitate |
## | offered discovered later | needs accommodated, |
## | handful good restaurants right | |
## | corner, rooms small 2 people | |
## | beware appointed comfortable, | |
## | comparable boutique hotel new | |
## | york city size price-wise, | |
## +--------------------------------+--------------------------------+
Naturally, before proceeding with our analysis, we have to clean the text.
The function gsub replaces every match of a pattern with another string: in our case we replace the following words with a blank space:
hotel
room
Other important functions allow us to complete the text cleanup, removing:
English stopwords
library(tm)  # provides removePunctuation, removeNumbers, removeWords, stripWhitespace, stemDocument
T_rw$Review <- gsub("â", "", T_rw$Review)
T_rw$Review <- gsub("ç", "", T_rw$Review)
T_rw$Review <- gsub("^aa", "", T_rw$Review)
T_rw$Review <- gsub("<\\w+ *", "", T_rw$Review)
T_rw$Review <- gsub("@\\w+", "", T_rw$Review)
T_rw$Review <- gsub("https?://.+", "", T_rw$Review)
T_rw$Review <- gsub("\\d+\\w*\\d*", "", T_rw$Review)
T_rw$Review <- gsub("#\\w+", "", T_rw$Review)
T_rw$Review <- gsub("\n", " ", T_rw$Review)
T_rw$Review <- gsub("^\\s+", "", T_rw$Review)
T_rw$Review <- gsub("\\s+$", "", T_rw$Review)
T_rw$Review <- gsub("[ |\t]+", " ", T_rw$Review)
T_rw$Review <- gsub("[^\x01-\x7F]", "", T_rw$Review)
T_rw$Review <- removePunctuation(T_rw$Review)
T_rw$Review <- removeNumbers(T_rw$Review)
T_rw$Review <- tolower(T_rw$Review)
T_rw$Review <- gsub("hotel", " ", T_rw$Review)
T_rw$Review <- gsub("room", " ", T_rw$Review)
T_rw$Review <- removeWords(T_rw$Review, stopwords("english"))
T_rw$Review <- stripWhitespace(T_rw$Review)
T_rw$Review <- stemDocument(T_rw$Review)
We first set up the source and then build the corpus from it.
T_rw_source <- VectorSource(T_rw$Review)
corpus <- Corpus(T_rw_source)
It is time to inspect our corpus by randomly selecting 2 reviews.
## [1] "great holiday stop marco polo gateway hong kong excel good servic friend noth troubl recommend marco polo gateway just fot locat ideal kowloon just short walk corner ferri hong kong love saw honk kong realli good templ street market favourit brows lot bargain caught ferri lantau island huge buddha weather aw just leav sky clear memor sight husband eat chines food resteraunt cheap peopl help explain menus easi went victoria peak tram fabul experi went late afternoon saw light turn harbour went week way uk recommend fabul place"
## [1] "great locat friend welcom warm welcom rosanna alloc loev clean quiet mini bar competit price locat excel close best sight definitley stay"
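To see what the regular-expression steps in the cleaning code do, they can be tried on a single string. The sketch below uses base R only and a made-up review (`raw` is hypothetical). It is a simplified variant of the cleaning above, not a copy of it: in particular it removes a URL only up to the next space, whereas the pattern `https?://.+` removes everything after the URL as well.

```r
# Hypothetical raw review used only for illustration
raw <- "Great HOTEL!! Room 12 was clean,\n  see https://example.com"

clean <- gsub("https?://\\S+", "", raw)   # drop URLs (up to the next whitespace)
clean <- gsub("\\d+", "", clean)           # drop digits
clean <- gsub("[[:punct:]]", "", clean)    # drop punctuation
clean <- tolower(clean)                    # lower-case everything
clean <- gsub("hotel|room", " ", clean)    # blank out the two domain words
clean <- gsub("\\s+", " ", clean)          # collapse runs of whitespace
clean <- trimws(clean)                     # strip leading and trailing spaces

clean
```

Lower-casing before removing "hotel" and "room" matters: applied the other way round, "HOTEL" and "Room" would survive.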
Once the cleaning is finished, we can convert the corpus into a DocumentTermMatrix.
Let's present the 10 most common words in a table and a plot.
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing = TRUE)
head(frequency, 10)
##  stay great staff  good locat night  time   day  nice  just
##  1402  1048   804   786   713   684   656   645   634   619
frequency<-as.data.frame(frequency)
frequency <- cbind(Word = rownames(frequency),frequency)
frequency<-as.data.frame(frequency)
frequency.top <- frequency[1:10,]
frequency.top$Word <- as.factor(frequency.top$Word)
Text clustering is the application of cluster analysis to text-based documents.
It uses machine learning and natural language processing (NLP) to understand and categorize unstructured textual data.
It groups a set of texts so that the texts in one group (cluster) are more similar to each other than to the texts in other clusters.
Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that shows how important a word is to a document within a corpus.
First of all, we use the matrix of term counts to get the Inverse Document Frequency vector.
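library(textmineR)  # provides TermDocFreq
tdf_rw <- TermDocFreq(dtm2)
Then we weight the term counts by IDF. The resulting TF-IDF matrix is the input for cosine similarity, a measure of similarity between two non-zero vectors of an inner product space.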
tfidf <- t(dtm2[ , tdf_rw$term]) * tdf_rw$idf
tfidf <- t(tfidf)
Next we compute the cosine similarity and convert it into a cosine distance by subtracting it from 1.
csim <- tfidf / sqrt(rowSums(tfidf * tfidf))
csim <- csim %*% t(csim)
cdist <- as.dist(1 - csim)
In the agglomerative approach, each review starts as its own cluster and similar clusters are then merged into bigger ones.
There are several methods for combining clusters in the agglomerative approach.
We use Ward's method, where the distance between two clusters is how much the sum of squares increases when we merge them.
hc <- hclust(cdist, "ward.D")
We use the function cutree to cut the dendrogram into 2 clusters.
Finally, we add each review's cluster label to our dataset.
clustering <- cutree(hc, 2)
T_rw$cluster <- clustering
This is the graphical result of our grouping.
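The whole chain of TF-IDF weighting, cosine distance, Ward clustering and cutree can be followed on a matrix small enough to check by hand. The sketch below uses base R only. The toy counts and all `toy_*` names are hypothetical, and it assumes IDF is defined as log(number of documents / number of documents containing the term).

```r
# Toy document-term matrix: 4 documents x 4 terms (hypothetical counts)
toy_dtm <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 3, 1,
    0, 0, 1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("staff", "locat", "beach", "pool"))
)

# Inverse document frequency: log(number of docs / docs containing the term)
doc_freq <- colSums(toy_dtm > 0)
idf <- log(nrow(toy_dtm) / doc_freq)

# TF-IDF: scale every term column by its IDF weight
toy_tfidf <- sweep(toy_dtm, 2, idf, "*")

# Cosine similarity: normalise rows to unit length, then take inner products
unit_rows <- toy_tfidf / sqrt(rowSums(toy_tfidf^2))
toy_csim <- unit_rows %*% t(unit_rows)

# Cosine distance, Ward clustering, and a cut into 2 groups
toy_cdist <- as.dist(1 - toy_csim)
toy_hc <- hclust(toy_cdist, method = "ward.D")
toy_groups <- cutree(toy_hc, k = 2)
toy_groups
```

Here doc1 and doc2 share the terms "staff" and "locat" while doc3 and doc4 share "beach" and "pool", so the two pairs have cosine distance 1 from each other and end up in separate clusters.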
In this section we inspect the main words inside each cluster created above.
The sample of 1000 observations was divided into clusters with these sizes:
| | Cluster 1 | Cluster 2 |
|---|---|---|
| Size | 891 | 109 |
| Size % | 89.1 | 10.9 |
The top words below suggest an initial description of the groups:
## cluster size top_words
## 1 1 892 stay, locat, great, breakfast, staff, bed, bath
## 2 2 108 resort, beach, food, water, pool, time, peopl
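The word cloud code in this section relies on an object called `cluster_words` whose construction is not shown. One possible way to build such an object, offered purely as an assumption, is to sum the term counts of the documents in each cluster. The toy matrix, `toy_clusters` and `cluster_summary` below are hypothetical, and the original analysis may have weighted the words differently.

```r
# Toy inputs (hypothetical): a 4 x 3 document-term matrix and cluster labels
toy_dtm <- matrix(
  c(3, 1, 0,
    2, 2, 0,
    0, 1, 4,
    0, 0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("stay", "staff", "beach"))
)
toy_clusters <- c(1, 1, 2, 2)
cluster_ids <- sort(unique(toy_clusters))

# For each cluster, sum the term counts over its documents and sort descending
cluster_words <- lapply(cluster_ids, function(k) {
  counts <- colSums(toy_dtm[toy_clusters == k, , drop = FALSE])
  sort(counts, decreasing = TRUE)
})

# Summary table: cluster id, size, and the two most frequent terms
cluster_summary <- data.frame(
  cluster = cluster_ids,
  size = as.vector(table(toy_clusters)),
  top_words = sapply(cluster_words, function(w) paste(names(w)[1:2], collapse = ", "))
)
cluster_summary
```

Each element of `cluster_words` is a named numeric vector, which is the shape that `names(cluster_words[[ 1 ]])` and `freq = cluster_words[[ 1 ]]` expect.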
The word cloud below, with 100 words, confirms our idea.
The first cluster is full of words typically associated with a hotel used for a business trip or for city tourism.
library(RColorBrewer)  # provides brewer.pal
wordcloud::wordcloud(words = names(cluster_words[[ 1 ]]),
                     freq = cluster_words[[ 1 ]],
                     max.words = 100,
                     random.order = FALSE,
                     colors = brewer.pal(10, "RdGy"),
                     main = "Top words in cluster 1")
Now we choose one of the most important words in the first cluster (breakfast) and draw a word association plot with an association threshold greater than 20%.
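The association threshold can be illustrated with base R. The sketch assumes that "association" means the Pearson correlation between term-count columns, which is how we understand `tm::findAssocs` to work. The toy counts and the names `toy_dtm`, `assoc` and `strong` are hypothetical.

```r
# Toy document-term matrix (hypothetical counts, 6 documents x 3 terms)
toy_dtm <- matrix(
  c(2, 2, 0,
    1, 1, 0,
    0, 0, 3,
    3, 2, 1,
    0, 1, 2,
    1, 1, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(NULL, c("breakfast", "buffet", "beach"))
)

# Pearson correlation between the target term and every other term
target <- "breakfast"
others <- toy_dtm[, colnames(toy_dtm) != target]
assoc <- cor(toy_dtm[, target], others)[1, ]

# Keep only the terms whose correlation reaches the chosen threshold
threshold <- 0.20
strong <- assoc[assoc >= threshold]
strong
```

In this toy matrix "buffet" rises and falls with "breakfast" and passes the threshold, while "beach" is negatively correlated and is dropped.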
On the other hand, the second cluster is clearly the opposite.
It is easy to see many words related to holidays and relaxation, and to the beach and water.
wordcloud::wordcloud(words = names(cluster_words[[ 2 ]]),
                     freq = cluster_words[[ 2 ]],
                     max.words = 100,
                     random.order = FALSE,
                     colors = brewer.pal(5,"Set1"),
                     main = "Top words in cluster 2")
As before, we take one of the relevant words in the second cluster (resort) and draw another word association plot, this time with an association threshold greater than 40%.
Now we can assign a name to each cluster:
Travel
Holiday
# Cluster 1 becomes "Travel", cluster 2 becomes "Holiday"
T_rw$cluster <- as.factor(ifelse(T_rw$cluster == 1, "Travel", "Holiday"))
The k-means algorithm first chooses k initial seeds (centers), then repeatedly assigns each observation to its nearest center and updates the centers, producing k clusters.
In our case k = 2.
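A minimal sketch of k-means with k = 2, using base R's `kmeans` on synthetic two-dimensional points rather than on the review data (`toy_points` and `toy_kfit` are hypothetical names):

```r
set.seed(123)  # k-means starts from random centers, so fix the seed

# Two well-separated toy groups of 2-D points (hypothetical data)
toy_points <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)

# k = 2 clusters; nstart repeats the random initialisation and keeps the best fit
toy_kfit <- kmeans(toy_points, centers = 2, nstart = 10)

table(toy_kfit$cluster)   # cluster sizes
toy_kfit$centers          # final cluster centers
```

Because the result depends on the starting centers, a larger `nstart` makes it less likely that the algorithm stops in a poor local optimum.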
To conclude our clustering, we run k-means on the cosine distance and draw the result with clusplot.
Several reviews fall in the area where the two clusters overlap: these are reviews that neither group describes well.
library(cluster)  # provides clusplot
kfit <- kmeans(cdist, 2, nstart=100)
clusplot(as.matrix(cdist), kfit$cluster, color=TRUE, shade=TRUE, labels=4, lines=4)
The aim of this section is to create a model that predicts:
Rating
To classify Rating into two classes, Positive = 1 and Negative = 0, we remove the reviews with a rating of 3 from our analysis because they are neutral.
Let's visualize the distribution of rating_new:
library(dplyr)  # provides %>%, filter, mutate, if_else
Hotel_rw <- T_rw %>% filter(Rating != 3) %>%
  mutate(rating_new = if_else(Rating >= 4, 1, 0))
table(Hotel_rw$rating_new)
##
##   0   1
## 169 729
We randomly select 70% of the observations and use them as the training set.
library(caret)  # provides createDataPartition, train, confusionMatrix
set.seed(123)
training_obs <- createDataPartition(Hotel_rw$Rating, p = 0.7, list = FALSE)
Hotel_rw.train <- Hotel_rw[training_obs,]
We keep only the columns Review and rating_new.
Moreover, we rename the variable rating_new to y and convert it to a factor.
It is time to create our model using a support-vector machine.
corpus_Hotel_rw <- Corpus(VectorSource(Hotel_rw.train$Review))
tdm_Hotel_rw <- DocumentTermMatrix(corpus_Hotel_rw)
training_set_Hotel_rw <- as.matrix(tdm_Hotel_rw)
training_set_Hotel_rw <- cbind(training_set_Hotel_rw, Hotel_rw.train$rating_new)
colnames(training_set_Hotel_rw)[ncol(training_set_Hotel_rw)] <- "y"
training_set_Hotel_rw <- as.data.frame(training_set_Hotel_rw)
training_set_Hotel_rw$y <- as.factor(training_set_Hotel_rw$y)
set.seed(123)
Hotel_rw_model <- train(y ~., data = training_set_Hotel_rw, method = 'svmLinear3')
We perform the same procedure for the test set.
Hotel_rw.test <- Hotel_rw[-training_obs,]
test_corpus_Hotel_rw <- Corpus(VectorSource(Hotel_rw.test$Review))
test_tdm_Hotel_rw <- DocumentTermMatrix(test_corpus_Hotel_rw, control = list(dictionary = Terms(tdm_Hotel_rw)))
test_tdm_Hotel_rw <- as.matrix(test_tdm_Hotel_rw)
Finally, we can make the final prediction and show the predicted classes.
| | Positive | Negative |
|---|---|---|
| PREDICTION | 235 | 33 |
set.seed(123)
Hotel_rw_model_result <- predict(Hotel_rw_model, newdata = test_tdm_Hotel_rw)
table(Hotel_rw_model_result)
## Hotel_rw_model_result
##   0   1
##  33 235
Now we create the confusion matrix:
Accuracy = 0.8881 is the proportion of observations correctly classified.
Kappa = 0.5616 measures the agreement between the predictions and the actual values, corrected for chance agreement.
Sensitivity = 0.9638 is the ratio between the Positive reviews correctly classified and the total number of Positive reviews.
Specificity = 0.5319 is the ratio between the Negative reviews correctly classified and the total number of Negative reviews.
The confusion matrix shows a grid of true and false predictions compared to the actual values.
The accuracy is high, although only moderately above the no-information rate of 0.8246, and the low specificity shows that negative reviews are often classified as positive.
confusionMatrix(data = Hotel_rw_model_result,
                reference = as.factor(Hotel_rw.test$rating_new), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25 8
## 1 22 213
##
## Accuracy : 0.8881
## 95% CI : (0.8441, 0.9232)
## No Information Rate : 0.8246
## P-Value [Acc > NIR] : 0.002736
##
## Kappa : 0.5616
##
## Mcnemar's Test P-Value : 0.017622
##
## Sensitivity : 0.9638
## Specificity : 0.5319
## Pos Pred Value : 0.9064
## Neg Pred Value : 0.7576
## Prevalence : 0.8246
## Detection Rate : 0.7948
## Detection Prevalence : 0.8769
## Balanced Accuracy : 0.7479
##
## 'Positive' Class : 1
##
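The main statistics can be recomputed by hand from the four counts of the confusion matrix, which makes their definitions concrete. The sketch uses base R only. The variable names are ours, and the kappa formula is the standard Cohen's kappa.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference)
tn <- 25   # predicted 0, actual 0
fn <- 8    # predicted 0, actual 1
fp <- 22   # predicted 1, actual 0
tp <- 213  # predicted 1, actual 1
n  <- tn + fn + fp + tp

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # share of actual positives that were found
specificity <- tn / (tn + fp)   # share of actual negatives that were found

# Cohen's kappa: observed agreement corrected for agreement expected by chance
p_expected  <- ((tn + fn) * (tn + fp) + (tp + fp) * (tp + fn)) / n^2
cohen_kappa <- (accuracy - p_expected) / (1 - p_expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = cohen_kappa), 4)
```

With 221 positive and only 47 negative test reviews, a model can reach high accuracy while still misclassifying almost half of the negative reviews, which is why specificity and kappa are worth reading alongside accuracy.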