1. Introduction


People love to travel, whether for personal or business purposes. A bad stay can cost a hotel dearly, especially when the customer is travelling on business. Hence, identifying the best hotels is important both for a company's mobility team and for anyone looking for a great experience while travelling in a foreign country for the first time.

This document has two main parts:
1. Data Analysis - The first part aims to identify the top hotels in Europe that one should consider, based on user ratings and the number of reviews provided.

2. Sentiment Analysis - The second part tries to identify the most important factors (positive or negative) that customers care about, based on their review comments.

The data was scraped from Booking.com. The dataset contains about 1 million (10 lakh) customer reviews and scores for 1,493 luxury hotels across Europe. The geographical location of each hotel is also provided for further analysis. The data contains the following columns.

Column                                      Description
Hotel_Address                               Address of the hotel
Review_Date                                 Date when the reviewer posted the review
Average_Score                               Average score of the hotel, calculated from the latest comments in the last year
Hotel_Name                                  Name of the hotel
Reviewer_Nationality                        Nationality of the reviewer
Review                                      Review the reviewer gave to the hotel
Review_Word_Counts                          Total number of words in the review
Reviewer_Score                              Score the reviewer gave the hotel, based on his/her experience
Total_Number_of_Reviews_Reviewer_Has_Given  Number of reviews the reviewer has given in the past
Total_Number_of_Reviews                     Total number of valid reviews the hotel has
Tags                                        Tags the reviewer gave the hotel
days_since_review                           Duration between the review date and the scrape date
Additional_Number_of_Scoring                Number of guests who gave a score without writing a review
lat                                         Latitude of the hotel
lng                                         Longitude of the hotel

2. Packages

The packages used for this analysis are listed below.
- tm: Text mining package in R.
- SnowballC: Used for word stemming.
- ggplot2: Data visualisation in R.
- leaflet: Creates interactive web maps.
- tidyr: Used for data tidying and reshaping.
- dplyr: Easy-to-use functions for data manipulation in R.
- stringr: String operations in R.
- wordcloud: Used to create word clouds.
- syuzhet: Basic NLP package, also used for sentiment analysis in R.
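
The code in the following sections assumes that these packages are attached. A minimal sketch of one way to check for and load them is shown below; the variable names `required_pkgs`, `is_installed` and `missing_pkgs` are illustrative and not part of the original analysis.

```r
# Packages used in this analysis (names as published on CRAN)
required_pkgs <- c("tm", "SnowballC", "ggplot2", "leaflet", "tidyr",
                   "dplyr", "stringr", "wordcloud", "syuzhet")

# Find packages that are not installed, without attaching anything yet
is_installed <- vapply(required_pkgs, requireNamespace,
                       logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach all packages so that %>%, Corpus(), leaflet() etc. are available
  invisible(lapply(required_pkgs, library, character.only = TRUE))
}
```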

3. Data Preparation

The data is loaded from the source and cleaned for further analysis.

HotelData = read.delim2("Hotel_Reviews_R.txt",quote = '', stringsAsFactors = FALSE)

Let's look at the data structure.

str(HotelData)
## 'data.frame':    1031476 obs. of  15 variables:
##  $ Hotel_Address                             : chr  " s Gravesandestraat 55 Oost 1092 AA Amsterdam Netherlands" " s Gravesandestraat 55 Oost 1092 AA Amsterdam Netherlands" " s Gravesandestraat 55 Oost 1092 AA Amsterdam Netherlands" " s Gravesandestraat 55 Oost 1092 AA Amsterdam Netherlands" ...
##  $ Additional_Number_of_Scoring              : int  194 194 194 194 194 194 194 194 194 194 ...
##  $ Review_Date                               : chr  "03-08-2017" "03-08-2017" "31-07-2017" "31-07-2017" ...
##  $ Average_Score                             : chr  "7.7" "7.7" "7.7" "7.7" ...
##  $ Hotel_Name                                : chr  "Hotel Arena" "Hotel Arena" "Hotel Arena" "Hotel Arena" ...
##  $ Reviewer_Nationality                      : chr  " Russia " " Ireland " " Australia " " United Kingdom " ...
##  $ Total_Number_of_Reviews                   : int  1403 1403 1403 1403 1403 1403 1403 1403 1403 1403 ...
##  $ Review                                    : chr  " I am so angry that i made this post available via all possible sites i use when planing my trips so no one wil"| __truncated__ "No Negative" " Rooms are nice but for elderly a bit difficult as most rooms are two story with narrow steps So ask for single"| __truncated__ " My room was dirty and I was afraid to walk barefoot on the floor which looked as if it was not cleaned in week"| __truncated__ ...
##  $ Review_Word_Counts                        : int  397 0 42 210 140 17 33 11 34 15 ...
##  $ Total_Number_of_Reviews_Reviewer_Has_Given: int  7 7 9 1 3 1 6 1 3 1 ...
##  $ Reviewer_Score                            : chr  "2.9" "7.5" "7.1" "3.8" ...
##  $ Tags                                      : chr  "\"[' Leisure trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 6 nights ']\"" "\"[' Leisure trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 4 nights ']\"" "\"[' Leisure trip ', ' Family with young children ', ' Duplex Double Room ', ' Stayed 3 nights ', ' Submitted f"| __truncated__ "\"[' Leisure trip ', ' Solo traveler ', ' Duplex Double Room ', ' Stayed 3 nights ']\"" ...
##  $ days_since_review                         : chr  "0 days" "0 days" "3 days" "3 days" ...
##  $ lat                                       : chr  "52.3605759" "52.3605759" "52.3605759" "52.3605759" ...
##  $ lng                                       : chr  "4.9159683" "4.9159683" "4.9159683" "4.9159683" ...

Creating a table of unique hotels with their geo-locations

# Keep one row per hotel address and drop rows without coordinates
hotel.names = HotelData %>%
  select(Hotel_Name, Hotel_Address, lat, lng, Average_Score, Total_Number_of_Reviews) %>%
  filter(lat != 0 & lng != 0 & !duplicated(Hotel_Address)) %>%
  group_by(Hotel_Name, Hotel_Address, lat, lng, Average_Score) %>%
  # Only one row per address remains, so sum() simply carries the value through
  summarise(Total_Number_of_Reviews = sum(Total_Number_of_Reviews))
## Warning: package 'bindrcpp' was built under R version 3.5.1
hotel.names
## # A tibble: 1,476 x 6
## # Groups:   Hotel_Name, Hotel_Address, lat, lng [?]
##    Hotel_Name   Hotel_Address    lat   lng   Average_Score Total_Number_of~
##    <chr>        <chr>            <chr> <chr> <chr>                    <int>
##  1 11 Cadogan ~ 11 Cadogan Gard~ 51.4~ -0.1~ 8.7                        393
##  2 1K Hotel     13 Boulevard Du~ 48.8~ 2.36~ 7.7                        663
##  3 25hours Hot~ Lerchenfelder S~ 48.2~ 16.3~ 8.8                       4324
##  4 41           41 Buckingham P~ 51.4~ -0.1~ 9.6                        244
##  5 45 Park Lan~ 45 Park Lane We~ 51.5~ -0.1~ 9.4                         68
##  6 88 Studios   88 Holland Road~ 51.4~ -0.2~ 8.4                        955
##  7 9Hotel Repu~ 7 9 Rue Pierre ~ 48.8~ 2.36~ 8.8                        857
##  8 A La Villa ~ 44 Rue Madame 6~ 48.8~ 2.33~ 8.8                        185
##  9 ABaC Restau~ Avenida Tibidab~ 41.4~ 2.13~ 8.8                        111
## 10 Abba Garden  Santa Rosa Espl~ 41.3~ 2.10~ 7.9                        959
## # ... with 1,466 more rows

4. Exploratory Data Analysis

4.1 Geographical Location for the Hotels

First, let's plot a map to see where these hotels are located.

# The geographical coordinates are currently stored as character; convert them to numeric
points <- cbind(as.numeric(hotel.names$lng),as.numeric(hotel.names$lat))

#Creating a Leaflet Map

leaflet() %>% 
  addProviderTiles('OpenStreetMap.Mapnik',
                   options = providerTileOptions(noWrap = TRUE)) %>%
  addMarkers(data = points,
             popup = paste0("<strong>Hotel: </strong>",
                            hotel.names$Hotel_Name,                 
                            "<br><strong>Address: </strong>", 
                            hotel.names$Hotel_Address, 
                            "<br><strong>Average Score: </strong>", 
                            hotel.names$Average_Score, 
                            "<br><strong>Number of Reviews: </strong>", 
                            hotel.names$Total_Number_of_Reviews),
             clusterOptions = markerClusterOptions())

Click on the orange bubbles to zoom in to the exact location.

Now we add an extra column, Country, to the datasets. We can then start asking questions such as which country or city has the most highly rated hotels.

hotel.names$Country = sapply(str_split(hotel.names$Hotel_Address," "),function(X) {X[length(X)]})  #Extracting country information
hotel.names$Country = str_replace(hotel.names$Country, "Kingdom","U.K")

# Adding Country information to our main dataset (columns selected by name rather than position)
HotelData = HotelData %>% left_join(hotel.names[, c("Hotel_Address", "Country")], by = 'Hotel_Address')
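
To make the country-extraction rule concrete, here is a minimal base-R sketch of the same idea: take the last word of the address and map the trailing "Kingdom" of "United Kingdom" to "U.K". The helper name `extract_country` and the second address are made up for illustration.

```r
# Hypothetical helper: the last word of an address is treated as the country;
# the trailing "Kingdom" of "United Kingdom" is mapped to "U.K".
extract_country <- function(address) {
  words <- strsplit(trimws(address), "\\s+")
  last_word <- vapply(words, function(w) w[length(w)], character(1))
  ifelse(last_word == "Kingdom", "U.K", last_word)
}

addresses <- c(
  " s Gravesandestraat 55 Oost 1092 AA Amsterdam Netherlands",  # from the data
  "1 Example Street London United Kingdom"                      # made-up address
)
extract_country(addresses)
## [1] "Netherlands" "U.K"
```

Note that a last-word rule like this would mislabel any other multi-word country name, so it is worth checking the distinct values of Country after the extraction.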

4.2 Distribution of Hotel Reviews


A histogram shows how the average ratings are distributed across all reviews.

HotelData %>%
  select(Hotel_Name, Average_Score, Country) %>%
  group_by(Hotel_Name, Average_Score, Country) %>%
  ggplot(aes(x = as.numeric(Average_Score))) +
  geom_histogram(color = 'blue', fill = 'blue', alpha = 0.4, bins = 30) +
  labs(x = "Average Rating", y = "No of People who provided Rating", title = "Distribution of Ratings")


4.3 Average Review Score for each Country


Let's see whether hotels in some countries get higher ratings on average. It seems that hotels in all countries have similar ratings, with Spain, Austria and France rated slightly higher than the rest.

Hotel_Data = na.omit(HotelData)
# Generating Boxplot
Hotel_Data %>%
  ggplot(aes(x = as.factor(Country), y = as.numeric(Average_Score))) +
  geom_boxplot() +
  labs(x = 'Country', y = 'Average Score', title = 'Average Rating for each country')

Another point to note is the number of reviews made by users, because an average based on a larger number of reviews is more reliable.

# Bar chart of the number of reviews per country
Hotel_Data %>%
  ggplot(aes(x = Country)) +
  geom_bar(position = position_stack(reverse = TRUE), color = "black", fill = "white") +
  coord_flip() +
  stat_count(aes(label = ..count..), geom = "text")

The picture is different here: the UK has by far the highest number of reviews, which suggests that it is the most visited destination, with Spain and France having the second and third most reviews.

4.4 Identifying which nationality of people visited these countries the most


The next analysis finds out which nationalities preferred to visit these countries.

# Identifying which nationalities visited the UK the most

Hotel_Data %>%
  filter(Country == 'U.K', Reviewer_Nationality != ' United Kingdom ') %>%
  select(Reviewer_Nationality) %>%
  group_by(Reviewer_Nationality) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(5)
## # A tibble: 5 x 2
##   Reviewer_Nationality             n
##   <chr>                        <int>
## 1 " United States of America " 22022
## 2 " Australia "                16946
## 3 " Ireland "                  14040
## 4 " United Arab Emirates "      7728
## 5 " Saudi Arabia "              6404
# Identifying which nationalities visited France the most

Hotel_Data %>%
  filter(Country == 'France', Reviewer_Nationality != ' France ') %>%
  select(Reviewer_Nationality) %>%
  group_by(Reviewer_Nationality) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(5)
## # A tibble: 5 x 2
##   Reviewer_Nationality             n
##   <chr>                        <int>
## 1 " United Kingdom "           33558
## 2 " United States of America " 14484
## 3 " Australia "                 7748
## 4 " Saudi Arabia "              3964
## 5 " United Arab Emirates "      3062
# Identifying which nationalities visited Spain the most

Hotel_Data %>%
  filter(Country == 'Spain', Reviewer_Nationality != ' Spain ') %>%
  select(Reviewer_Nationality) %>%
  group_by(Reviewer_Nationality) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(5)
## # A tibble: 5 x 2
##   Reviewer_Nationality             n
##   <chr>                        <int>
## 1 " United Kingdom "           41770
## 2 " United States of America " 12286
## 3 " Australia "                 5802
## 4 " Ireland "                   4262
## 5 " Canada "                    2898
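
The three queries above differ only in the country and the home nationality, so they could be wrapped in one function. The following is a minimal base-R sketch under that assumption; the helper name `top_visitors` and the toy data frame are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (base R): top n reviewer nationalities for one country,
# excluding reviewers from that country itself.
top_visitors <- function(data, country, home_nationality, n = 5) {
  nationality <- trimws(data$Reviewer_Nationality)  # values carry padding spaces
  keep <- data$Country == country & nationality != home_nationality
  counts <- sort(table(nationality[keep]), decreasing = TRUE)
  head(counts, n)
}

# Toy data, not taken from the real data set
toy <- data.frame(
  Country = c("U.K", "U.K", "U.K", "U.K", "France"),
  Reviewer_Nationality = c(" Ireland ", " Ireland ", " Australia ",
                           " United Kingdom ", " United Kingdom "),
  stringsAsFactors = FALSE
)
top_visitors(toy, "U.K", "United Kingdom")
```

With the real data, the same call would be made once per country, for example `top_visitors(Hotel_Data, "Spain", "Spain")`.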

In general, reviewers from the US, the UK and Australia account for the most visits to these European countries.

4.5 Identifying the top hotels in each country


The last analysis finds the top 10 hotels across Europe, based on a combination of the average score given by reviewers and the total number of reviews. The top 10 hotels are then located geographically using the leaflet package in R.

# Identifying the top 10 hotels across Europe
# Average_Score is stored as character, so convert it before comparing
Top_Hotels = Hotel_Data %>%
  filter(as.numeric(Average_Score) > 9) %>%
  select(Hotel_Name, Country, Hotel_Address, Total_Number_of_Reviews, Average_Score, lng, lat) %>%
  distinct() %>%
  arrange(desc(Total_Number_of_Reviews), desc(Average_Score)) %>%
  head(10)

leaflet() %>% 
  addProviderTiles('OpenStreetMap.Mapnik',
                   options = providerTileOptions(noWrap = TRUE)) %>%
  addMarkers(data = cbind(as.numeric(Top_Hotels$lng),as.numeric(Top_Hotels$lat)) ,
             popup = paste0("<strong>Hotel: </strong>",
                            Top_Hotels$Hotel_Name,                 
                            "<br><strong>Address: </strong>", 
                            Top_Hotels$Hotel_Address, 
                            "<br><strong>Average Score: </strong>", 
                            Top_Hotels$Average_Score, 
                            "<br><strong>Number of Reviews: </strong>", 
                            Top_Hotels$Total_Number_of_Reviews),
             clusterOptions = markerClusterOptions())
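
Because `Average_Score` is stored as character (see the `str()` output earlier), a comparison such as `Average_Score > 9` is evaluated as text unless the column is converted first. The short sketch below, using made-up scores, shows where the two comparisons can disagree.

```r
scores <- c("8.7", "9.1", "9.6", "10")   # made-up scores stored as text

# Text comparison: 9 is converted to "9" and the strings are compared
# character by character, so "10" sorts before "9".
text_result <- scores > 9
text_result
## [1] FALSE  TRUE  TRUE FALSE

# Numeric comparison: convert first, then compare.
numeric_result <- as.numeric(scores) > 9
numeric_result
## [1] FALSE  TRUE  TRUE  TRUE
```

For scores below 10 with one decimal place the two approaches mostly agree, but converting with `as.numeric()` avoids relying on that coincidence.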

5. Sentiment Analysis

Before performing a sentiment analysis, we try to find out what the user reviews talk about and which factors (positive or negative) hotel administrators need to take care of in order not to lose their premium customers.
The first step is to build a word cloud of the words with the highest frequencies in the reviews. The text of each review is converted to lower case, and punctuation, numbers and stop words are removed from it.

# Building corpus
corpus = HotelData$Review
corpus = Corpus(VectorSource(corpus))

# Cleaning the text
corpus = tm_map(corpus, tolower)
## Warning in tm_map.SimpleCorpus(corpus, tolower): transformation drops
## documents
corpus = tm_map(corpus,removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation
## drops documents
corpus = tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
corpus = tm_map(corpus, removeWords, stopwords('english'))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents
corpus = tm_map(corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corpus, stripWhitespace): transformation
## drops documents
inspect(corpus[1:5])    # View the first 5 reviews
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1]  angry made post available via possible sites use planing trips one will make mistake booking place made booking via booking com stayed nights hotel july upon arrival placed small room nd floor hotel turned room booked specially reserved level duplex room big windows high ceilings room ok don t mind broken window can closed hello rain mini fridge contained sort bio weapon least guessed smell intimately asked change room explaining times booked duplex btw costs simple double got way volume due high ceiling offered room next day check next day o clock order get room waned best way begin holiday wait till order check new room wonderful waist time room got just wanted peaceful internal garden view big window tired waiting room placed belongings rushed city evening turned constant noise room guess made vibrating vent tubes something constant annoying hell stop even making hard fall asleep wife audio recording can attach want can send via e mail next day technician came able determine cause disturbing sound offered change room hotel fully booked room left one smaller seems newer 
## [2]  negative
## [3]  rooms nice elderly bit difficult rooms two story narrow steps ask single level inside rooms basic just tea coffee boiler bar empty fridge
## [4]  room dirty afraid walk barefoot floor looked cleaned weeks white furniture looked nice pictures dirty door looked like attacked angry dog shower drain clogged staff respond request clean day heavy rainfall pretty common occurrence amsterdam roof room leaking luckily bed also see signs earlier water damage also saw insects running floor overall second floor property looked dirty badly kept top repairman came fix something room next door midnight noisy many guests understand challenges running hotel old building negligence inconsistent prices demanded hotel last night complained water damage night shift manager offered move different room offer came pretty late around midnight already bed ready sleep                                                                                                                                                                                                                                                                                                                                                                                              
## [5]  booked company line showed pictures room thought getting paying arrived s room booked staff told book villa suite theough directly completely false advertising realised grouped lots rooms photos together leaving consumer confused extreamly disgruntled especially wife s th birthday present please make website clear pricing photos didn t really know paying much wnded photos told getting something wasn t happy won t using


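Words such as "room" and "hotel" appear throughout the corpus but add no information, because it is obvious that we are talking about hotel reviews; hence, we remove them. Also, some comments read “No Negative”, which can be interpreted as the user having had a positive experience, and vice versa. Hence, we replace “No Negative” with “Positive” and “No Positive” with “Negative”. Note that at this point the corpus has already been lower-cased and the stop word “no” removed (review [2] above now reads just “negative”), so the patterns below are unlikely to match; for the recoding to take effect it would have to be applied to the raw review text before cleaning.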

corpus = tm_map(corpus, removeWords, c('hotel','room'))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, c("hotel", "room")):
## transformation drops documents
corpus = tm_map(corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corpus, stripWhitespace): transformation
## drops documents
corpus = tm_map(corpus, gsub, pattern = "No Negative", replacement = "Positive")
## Warning in tm_map.SimpleCorpus(corpus, gsub, pattern = "No Negative",
## replacement = "Positive"): transformation drops documents
corpus = tm_map(corpus, gsub, pattern = "no negative", replacement = "Positive")
## Warning in tm_map.SimpleCorpus(corpus, gsub, pattern = "no negative",
## replacement = "Positive"): transformation drops documents
corpus = tm_map(corpus, gsub, pattern = "No negative", replacement = "Positive")
## Warning in tm_map.SimpleCorpus(corpus, gsub, pattern = "No negative",
## replacement = "Positive"): transformation drops documents
corpus = tm_map(corpus, gsub, pattern = "No negatives", replacement = "Positive")
## Warning in tm_map.SimpleCorpus(corpus, gsub, pattern = "No negatives",
## replacement = "Positive"): transformation drops documents
corpus = tm_map(corpus,gsub, pattern = "No Positive", replacement = "Negative")
## Warning in tm_map.SimpleCorpus(corpus, gsub, pattern = "No Positive",
## replacement = "Negative"): transformation drops documents
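
A minimal sketch of doing this recoding on the raw review text, before lower-casing and stop-word removal can alter the placeholders, is shown below. The helper name `recode_placeholders` and the sample strings are hypothetical; the patterns are anchored so that only reviews consisting solely of the placeholder are changed.

```r
# Hypothetical helper: recode placeholder reviews on the raw text,
# i.e. before the corpus is built and cleaned.
recode_placeholders <- function(reviews) {
  reviews <- sub("^\\s*no negatives?\\s*$", "Positive", reviews, ignore.case = TRUE)
  reviews <- sub("^\\s*no positives?\\s*$", "Negative", reviews, ignore.case = TRUE)
  reviews
}

raw_reviews <- c("No Negative", " No Positive ",
                 " Rooms are nice and there are no negatives to report")
recode_placeholders(raw_reviews)
```

Under this assumption the corpus would then be built from the recoded vector, for example `Corpus(VectorSource(recode_placeholders(HotelData$Review)))`. This would change the word frequencies reported below, so the figures in this document would need to be regenerated.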


Creating a term-document matrix and a bar plot of the highest-frequency words in the users' comments.

# Term Document Matrix - To convert unstructured data into structured data as rows and columns
tdm = TermDocumentMatrix(corpus)
tdm = removeSparseTerms(tdm, 0.99)
tdm = as.matrix(tdm)

bargraph = rowSums(tdm)

# Keeping only the words that occur more than 30,000 times.
bargraph = subset(bargraph, bargraph>30000)

barplot(bargraph,las = 2,col = rainbow(75))

# From the barplot we can see that "staff", "breakfast" and "location" are the three words that people talk about the most.
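
To illustrate what the row sums of the term-document matrix represent, here is a small base-R sketch that counts word frequencies directly with `table()`. The three reviews are made up, and this is a simplified stand-in for `TermDocumentMatrix()`, not a description of how tm works internally.

```r
# Toy, already-cleaned reviews (made up for illustration)
reviews <- c("staff friendly location great",
             "breakfast good staff helpful",
             "location perfect staff friendly")

# Split into words and count occurrences across all reviews
words <- unlist(strsplit(reviews, "\\s+"))
term_freq <- sort(table(words), decreasing = TRUE)

# Keep terms that occur more than once (the analysis above uses 30,000)
frequent <- term_freq[term_freq > 1]
frequent
```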


Creating a word cloud

tdm = TermDocumentMatrix(corpus)
tdm = removeSparseTerms(tdm, 0.99)
tdm = as.matrix(tdm)

wordCl = sort(rowSums(tdm), decreasing = FALSE)
set.seed(100)
wordcloud(words = names(wordCl), 
          freq = wordCl,
          max.words = 250,
          scale = c(5,0.3),
          random.order = FALSE,
          colors = brewer.pal(8, 'Dark2'),
          rot.per = 0.7)


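Sentiment analysis for hotel reviews:

# Note: as.character() is applied to the corpus object as a whole. The output
# below has only three rows, which suggests that the result is not one string
# per review and that all review text is scored together in the first row.
review_text <- as.character(corpus)
class(review_text)
## [1] "character"
# Obtain sentiment scores

sentiment <- get_nrc_sentiment(review_text)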
head(sentiment)
##   anger anticipation disgust fear joy sadness surprise trust negative
## 1   807          644     716  927 543     816      396   902     2246
## 2     0            0       0    0   0       0        0     0        0
## 3     0            0       0    0   0       0        0     0        0
##   positive
## 1     1758
## 2        0
## 3        0
# Bar Plot for Sentiment Analysis
barplot(colSums(sentiment), 
                las = 2, 
                col = rainbow(10),
                ylab = 'Count',
                main = 'Sentiment Scores for Hotel Reviews')
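
To show the idea behind lexicon-based scoring on a per-review basis, the sketch below counts hits against a tiny made-up lexicon. It is not the NRC lexicon and not syuzhet's implementation; the names `lexicon` and `score_reviews` are hypothetical.

```r
# Toy lexicon (made up); the NRC lexicon used by syuzhet is far larger
lexicon <- list(
  positive = c("friendly", "great", "clean", "helpful"),
  negative = c("dirty", "noisy", "broken", "rude")
)

# Count lexicon hits: one row per review, one column per category
score_reviews <- function(reviews, lexicon) {
  tokens <- strsplit(tolower(reviews), "[^a-z]+")
  scores <- sapply(lexicon, function(terms) {
    vapply(tokens, function(tok) sum(tok %in% terms), numeric(1))
  })
  matrix(scores, nrow = length(reviews),
         dimnames = list(NULL, names(lexicon)))
}

reviews <- c("Friendly staff and a clean room",
             "Dirty floor, noisy vent and a broken window")
scores <- score_reviews(reviews, lexicon)
scores
colSums(scores)   # column totals, the kind of summary a sentiment bar plot shows
```

Scoring one review per row makes it possible to relate sentiment to other columns, such as `Reviewer_Score` or `Country`, rather than only reporting overall totals.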

6. Summary

The main purpose of this analysis was to identify the top hotels in Europe to consider when travelling abroad, whether for business or personal purposes.
The second part of the analysis aimed to understand the most common keywords that customers use in their reviews.

Analysis
1. The UK, Spain and France were identified as the three most popular travel destinations, with the UK having the highest number of reviews (~5 times the reviews of other countries).

2. We also looked at the nationalities of the visitors to these countries and found that reviewers from the US, the UK and Australia travelled the most within Europe.

3. Some of the best hotels in the UK were found to be in London, such as ‘M by Montcalm Shoreditch London Tech City’ and ‘citizenM Tower of London’, and hence I believe London can be the best spot to organise business meetings.

4. A basic sentiment analysis was performed to gauge the overall reaction of customers. The overall response was found to be somewhat neutral, with customers talking most about staff, location, negative, breakfast, friendly, service, etc.