The database that we are going to use in this project consists of TripAdvisor reviews of hotels.

The data is made up of 20,491 reviews described by 2 variables:

  • Review = the full text of the review
  • Rating = a rating from 1 to 5 for each review

In order to make the dataset usable and easier to work with, we decided to extract a random sample of 1,000 observations.
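The sampling step itself is not shown in this report; a minimal sketch of how it could be done, assuming the full dataset was loaded into a data frame named reviews (a hypothetical name) and that a seed was fixed for reproducibility:

# Hypothetical sketch: reproducible random sample of 1,000 reviews.
# `reviews` stands in for the full 20,491-row dataset; the seed is an assumption.
set.seed(123)
T_rw <- reviews[sample(nrow(reviews), 1000), ]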

Let's visualize the table that allows us to navigate our new data!

First of all, we check whether the data contains any missing values. As we can see, there are none!

sum(is.na(T_rw))
## [1] 0

Purpose of the analysis

  1. Visualise the distribution of ratings among the reviews
  2. Clean the review text
  3. Identify the clusters that represent the types of reviews and plot them
  4. Predict whether a review is classified as positive or negative

First view of variables

In order to study the distribution of Rating across the different votes, we created the following barplot!
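The plotting code is hidden in the report; a minimal sketch with base R graphics (an assumption about the tool actually used):

# Sketch (assumption): barplot of the rating distribution
barplot(table(T_rw$Rating),
        xlab = "Rating", ylab = "Number of reviews",
        main = "Distribution of Rating")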

The majority of reviews carry a rating of 5.

On the other hand, to get an idea of the text, here is a table presenting the first 2 reviews.

## 
## 
## +--------------------------------+--------------------------------+
## |   tiny comfortable location    | excellent vacation july 14-21/ |
## |   terrific beware noise busy   |  2007we traveled extensively   |
## |  intersection, ambient smells  | carraibe loved hotel1 food no  |
## |    neighbouring restaurant     |  gastroenteritis 2 mosquitoes  |
## |   wafting room evening night   |  not issue3 beach activities   |
## |   hours difficult sleep.the    | kids club nice 3 y/o daughter  |
## |  hotel staff cooly efficient   |    enjoyed 5 staff friendly    |
## |  helpful point knowledge end   |  helpful6 rooms big separate   |
## |  door hotel, evening arrival   |   sofa additional person7 nb   |
## | looking advice place eat none  | children allergy n't hesitate  |
## |    offered discovered later    |      needs accommodated,       |
## | handful good restaurants right |                                |
## |  corner, rooms small 2 people  |                                |
## | beware appointed comfortable,  |                                |
## | comparable boutique hotel new  |                                |
## |   york city size price-wise,   |                                |
## +--------------------------------+--------------------------------+

Text Cleaning

Naturally, before proceeding with our analysis, we have to clean the text.

The gsub function replaces a selected pattern with another string: in our case, we decided to strip out the following elements:

  • Mentions
  • URLs
  • Emojis
  • Special characters
  • Newlines
  • Dates
  • The words hotel and room

Other important functions then allow us to complete the text cleanup:

  • Remove extra whitespace
  • Remove punctuation
  • Remove numbers
  • Transform uppercase letters to lowercase
  • Stem words
  • Remove English stopwords
T_rw$Review <- gsub("â", "", T_rw$Review)
T_rw$Review <- gsub("ç", "", T_rw$Review)
T_rw$Review <- gsub("^aa", "", T_rw$Review)
T_rw$Review <- gsub("<\\w+ *", "", T_rw$Review)
T_rw$Review <- gsub("@\\w+", "", T_rw$Review)
T_rw$Review <- gsub("https?://.+", "", T_rw$Review)
T_rw$Review <- gsub("\\d+\\w*\\d*", "", T_rw$Review)
T_rw$Review <- gsub("#\\w+", "", T_rw$Review)
T_rw$Review <- gsub("\n", " ", T_rw$Review)
T_rw$Review <- gsub("^\\s+", "", T_rw$Review)
T_rw$Review <- gsub("\\s+$", "", T_rw$Review)
T_rw$Review <- gsub("[ |\t]+", " ", T_rw$Review)
T_rw$Review <- gsub("[^\x01-\x7F]", "", T_rw$Review)

T_rw$Review <- removePunctuation(T_rw$Review)
T_rw$Review <- removeNumbers(T_rw$Review)
T_rw$Review <- tolower(T_rw$Review)

T_rw$Review <- gsub("hotel", " ", T_rw$Review)
T_rw$Review <- gsub("room", " ", T_rw$Review)

T_rw$Review <- removeWords(T_rw$Review, stopwords("english"))
T_rw$Review <- stripWhitespace(T_rw$Review)
T_rw$Review <- stemDocument(T_rw$Review)

Source and Corpus

We first set up the source, and then build the corpus from it.

T_rw_source <- VectorSource(T_rw$Review)
corpus <- Corpus(T_rw_source)

It is time to inspect our corpus by randomly selecting 2 reviews.
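The inspection code is not displayed; a minimal sketch that prints two randomly chosen documents (the selection method and seed are assumptions):

# Sketch (assumption): print two randomly selected reviews from the corpus
set.seed(42)
for (i in sample(length(corpus), 2)) print(as.character(corpus[[i]]))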

## [1] "great holiday stop marco polo gateway hong kong excel good servic friend noth troubl recommend marco polo gateway just fot locat ideal kowloon just short walk corner ferri hong kong love saw honk kong realli good templ street market favourit brows lot bargain caught ferri lantau island huge buddha weather aw just leav sky clear memor sight husband eat chines food resteraunt cheap peopl help explain menus easi went victoria peak tram fabul experi went late afternoon saw light turn harbour went week way uk recommend fabul place"
## [1] "great locat friend welcom warm welcom rosanna alloc loev clean quiet mini bar competit price locat excel close best sight definitley stay"

DocumentTermMatrix

Once the cleaning is complete, we can convert the corpus into a DocumentTermMatrix.

Let's present the 10 most common words in a table and a plot.

dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
# Total frequency of each term across the whole corpus
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing = TRUE)
head(frequency, 10)
##  stay great staff  good locat night  time   day  nice  just 
##  1402  1048   804   786   713   684   656   645   634   619
# Turn the frequency vector into a data frame of words and counts for plotting
frequency <- as.data.frame(frequency)
frequency <- cbind(Word = rownames(frequency), frequency)
frequency.top <- frequency[1:10, ]
frequency.top$Word <- as.factor(frequency.top$Word)
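The accompanying plot comes from a hidden chunk; a minimal sketch, assuming ggplot2 was the library used:

library(ggplot2)

# Sketch (assumption): bar chart of the 10 most frequent words
ggplot(frequency.top, aes(x = reorder(Word, -frequency), y = frequency)) +
  geom_col() +
  labs(x = "Word", y = "Frequency", title = "10 most common words")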

Clustering

Text clustering is the application of cluster analysis to text-based documents.

In order to understand and categorize unstructured textual data, it utilizes machine learning and natural language processing (NLP).

It involves grouping a set of texts in such a way that the texts in the same group (cluster) share more properties with each other than with the texts in other groups or clusters.

TF-IDF

Term Frequency-Inverse Document Frequency is a numerical statistic that reflects how important a word is to a document in a corpus.
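For reference, a common formulation (the natural-log variant, which we assume is what the idf vector returned below follows) for a term t in document d of a corpus of N documents is:

$$\operatorname{tf\text{-}idf}(t, d) = \operatorname{tf}(t, d) \times \log\frac{N}{\operatorname{df}(t)}$$

where tf(t, d) is the count of t in d and df(t) is the number of documents containing t.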

First of all, we create the matrix of term counts to get the Inverse Document Frequency vector.

library(textmineR)  # TermDocFreq

tdf_rw <- TermDocFreq(dtm2)

Then we compute the cosine similarity, a measure of similarity between two non-zero vectors of an inner product space.

# Weight each document's term counts by the idf of each term
tfidf <- t(dtm2[ , tdf_rw$term]) * tdf_rw$idf
tfidf <- t(tfidf)

Once the cosine similarity is computed, we convert it into a cosine distance by subtracting it from 1.

# Normalize rows to unit length; inner products then give cosine similarities
csim <- tfidf / sqrt(rowSums(tfidf * tfidf))
csim <- csim %*% t(csim)
cdist <- as.dist(1 - csim)

Hierarchical clustering

In the agglomerative approach, each document starts as its own cluster, and we then merge similar clusters into bigger ones.

There are several methods for combining clusters in the agglomerative approach.

We decided to use Ward's method, in which the distance between two clusters is the increase in the total within-cluster sum of squares when they are merged.
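For reference, in Ward's criterion the cost of merging clusters A and B, with sizes n_A and n_B and centroids m_A and m_B, is:

$$\Delta(A, B) = \frac{n_A\, n_B}{n_A + n_B}\, \lVert m_A - m_B \rVert^2$$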

hc <- hclust(cdist, "ward.D")

We used the cutree function to split the resulting dendrogram into 2 clusters.

Finally, we assign the cluster labels to our dataset.

clustering <- cutree(hc, 2)

T_rw$cluster <- clustering

This is the graphical result of our grouping.
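The figure is produced by a hidden chunk; a minimal sketch of such a dendrogram plot, assuming base R graphics:

# Sketch (assumption): dendrogram with the 2-cluster cut highlighted
plot(hc, labels = FALSE, main = "Hierarchical clustering of reviews")
rect.hclust(hc, k = 2, border = "red")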

In this section we inspect the main words inside each cluster created above.

The sample of 1000 observations was divided into clusters with the following frequencies:

            Cluster 1   Cluster 2
Size              892         108
Size %           89.2        10.8

The top words below suggest an initial description of the groups:

  1. A group of words related to reviews of business/city hotels.
  2. A group of words related to reviews of holiday/seaside hotels.
##   cluster size                                       top_words
## 1       1  892 stay, locat, great, breakfast, staff, bed, bath
## 2       2  108   resort, beach, food, water, pool, time, peopl

The wordcloud below, with 100 words, confirms our idea.

The first cluster is full of words associated with hotels typically used for work trips or city tourism.
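The object cluster_words used in the wordcloud calls below is built in a hidden chunk; a plausible sketch, following the usual textmineR recipe of scoring words by how much more probable they are inside a cluster than in the corpus overall (an assumption, not the author's confirmed code):

# Sketch (assumption): in-cluster word probability minus overall corpus probability
p_words <- colSums(dtm2) / sum(dtm2)
cluster_words <- lapply(unique(clustering), function(x) {
  rows <- dtm2[clustering == x, ]
  rows <- rows[, colSums(rows) > 0]
  sort(colSums(rows) / sum(rows) - p_words[colnames(rows)], decreasing = TRUE)
})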

library(RColorBrewer)  # brewer.pal palettes

wordcloud::wordcloud(words = names(cluster_words[[ 1 ]]), 
                     freq = cluster_words[[ 1 ]], 
                     max.words = 100, 
                     random.order = FALSE, 
                     colors = brewer.pal(10, "RdGy"),
                     main = "Top words in cluster 1") 

We then chose one of the most important words in the first cluster (breakfast) and made a word-association plot with an association greater than 20%.
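The underlying association scores can be obtained with tm's findAssocs; a minimal sketch, assuming dtm is the DocumentTermMatrix built earlier:

# Sketch (assumption): terms associated with "breakfast" at correlation >= 0.2
findAssocs(dtm, "breakfast", corlimit = 0.2)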

On the other hand, the second cluster is clearly the opposite of the first.

It is easy to see that there are many words evoking holiday/relaxation and beach/water.

wordcloud::wordcloud(words = names(cluster_words[[ 2 ]]), 
                     freq = cluster_words[[ 2 ]], 
                     max.words = 100, 
                     random.order = FALSE, 
                     colors = brewer.pal(5,"Set1"),
                     main = "Top words in cluster 2") 

As before, taking one of the relevant words in the second cluster (resort), we illustrate another word-association plot, this time with an association greater than 40%.
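Analogously, a minimal sketch for the resort associations:

# Sketch (assumption): terms associated with "resort" at correlation >= 0.4
findAssocs(dtm, "resort", corlimit = 0.4)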

Now we can assign a name to each cluster:

  1. Travel Hotel Review <- Travel
  2. Holiday Hotel Review <- Holiday
# Recode the numeric cluster labels with descriptive names
T_rw$cluster <- ifelse(T_rw$cluster == 1, "Travel", "Holiday")

T_rw$cluster <- as.factor(T_rw$cluster)

K means

The k-means algorithm first forms k seeds (initial centroids) and then groups the observations into k clusters, reassigning each point to the nearest centroid until the grouping stabilizes.

In our case k = 2.

To conclude our clustering, we run k-means on the cosine distance matrix and plot the result with clusplot.

We can see that several reviews fall in the overlap between the two clusters: these represent reviews that are not clearly assignable to either group.

library(cluster)  # clusplot

# k-means on the cosine distance matrix, with 100 random starts
kfit <- kmeans(cdist, 2, nstart = 100)

clusplot(as.matrix(cdist), kfit$cluster, color = TRUE, shade = TRUE, labels = 4, lines = 4)

Prediction

The aim of this section is to create a model that predicts:

  • Whether a review is positive or negative <- Rating

Rating

In order to classify the Rating into two classes, Positive = 1 and Negative = 0, we decided to drop reviews with a rating of 3 from the analysis, since they are neutral.

Let's visualize the distribution of rating_new:

library(dplyr)  # filter, mutate, %>%

# Drop neutral ratings (3) and binarize: >= 4 -> 1 (positive), <= 2 -> 0 (negative)
Hotel_rw <- T_rw %>% filter(Rating != 3) %>% 
  mutate(rating_new = if_else(Rating >= 4, 1, 0))
table(Hotel_rw$rating_new)
## 
##   0   1 
## 169 729

We randomly select 70% of the observations and train the model on this portion.

library(caret)  # createDataPartition, train

set.seed(123)
training_obs <- createDataPartition(Hotel_rw$Rating, p = 0.7, list = FALSE)
Hotel_rw.train <- Hotel_rw[training_obs, ]

We are going to select only the columns Review and rating_new.

Moreover, we rename the variable rating_new to y, converting it into a factor.

It is time to create our model using a support-vector machine.

# Build the training document-term matrix from the training reviews
corpus_Hotel_rw <- Corpus(VectorSource(Hotel_rw.train$Review))
tdm_Hotel_rw <- DocumentTermMatrix(corpus_Hotel_rw)
training_set_Hotel_rw <- as.matrix(tdm_Hotel_rw)

# Append the response as the last column and rename it to "y"
training_set_Hotel_rw <- cbind(training_set_Hotel_rw, Hotel_rw.train$rating_new)
colnames(training_set_Hotel_rw)[ncol(training_set_Hotel_rw)] <- "y"

training_set_Hotel_rw <- as.data.frame(training_set_Hotel_rw)
training_set_Hotel_rw$y <- as.factor(training_set_Hotel_rw$y)

# Fit a linear SVM (caret's 'svmLinear3' method, backed by the LiblineaR package)
set.seed(123)
Hotel_rw_model <- train(y ~ ., data = training_set_Hotel_rw, method = 'svmLinear3')

We perform the same procedure for the test set.

Hotel_rw.test <- Hotel_rw[-training_obs, ]

# Build the test matrix with the training dictionary so the columns match
test_corpus_Hotel_rw <- Corpus(VectorSource(Hotel_rw.test$Review))
test_tdm_Hotel_rw <- DocumentTermMatrix(test_corpus_Hotel_rw, control = list(dictionary = Terms(tdm_Hotel_rw)))
test_tdm_Hotel_rw <- as.matrix(test_tdm_Hotel_rw)

Finally, we can make the final prediction and visualize the predicted results.

              Positive   Negative
PREDICTION         235         33
set.seed(123)
Hotel_rw_model_result <- predict(Hotel_rw_model, newdata = test_tdm_Hotel_rw)
table(Hotel_rw_model_result)
## Hotel_rw_model_result
##   0   1 
##  33 235

Now we create the confusion matrix:

  • The Accuracy = 0.8881 indicates the proportion of observations correctly classified.
  • The Kappa = 0.5616 measures the agreement between predictions and actual values, corrected for chance.
  • The Sensitivity = 0.9638 comes from the ratio of positive reviews correctly classified to the total number of positive reviews.
  • The Specificity = 0.5319 is the ratio of negative reviews correctly classified to the total number of negative reviews.

The confusion matrix shows a grid of true and false predictions compared to the actual values.

We obtain a high prediction accuracy, although the specificity is lower because negative reviews are underrepresented in the data.

confusionMatrix(data = Hotel_rw_model_result,
                reference = as.factor(Hotel_rw.test$rating_new), positive = "1") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  25   8
##          1  22 213
##                                           
##                Accuracy : 0.8881          
##                  95% CI : (0.8441, 0.9232)
##     No Information Rate : 0.8246          
##     P-Value [Acc > NIR] : 0.002736        
##                                           
##                   Kappa : 0.5616          
##                                           
##  Mcnemar's Test P-Value : 0.017622        
##                                           
##             Sensitivity : 0.9638          
##             Specificity : 0.5319          
##          Pos Pred Value : 0.9064          
##          Neg Pred Value : 0.7576          
##              Prevalence : 0.8246          
##          Detection Rate : 0.7948          
##    Detection Prevalence : 0.8769          
##       Balanced Accuracy : 0.7479          
##                                           
##        'Positive' Class : 1               
##