TRIPADVISOR RECOMMENDER SYSTEM

I chose Tripadvisor as the recommender system for this project because it influences how a lot of people, me included, plan their trips. Its ratings of places, hotels, cars, flights and restaurants come from people who actually visited, ate or stayed there and then reviewed their personal experience. Millions of travelers around the world use it to find a good place to stay, eat or rent while they are traveling, and its restaurant rankings keep getting more popular. Unlike recommender systems that rank results mainly by sponsorship or advertising, Tripadvisor ranks places by these user ratings.

My hypothesis is that the higher a place is rated, the more likely it is to be a nice place to stay, eat or drive.

Data Source

It was a little tough to find a Tripadvisor dataset because the data is not openly available. You have to submit a request on their website to get data for analysis or academic purposes.

Luckily, after searching several websites I found a dataset that contains car and hotel reviews for about 10 different cities in the world.

I chose the New York City dataset, since I live here and know the major hotels and locations. That makes it easier for me to check whether the reviews and ratings are legit.

After removing the columns that were unnecessary for this project, I saved the CSV file on GitHub.

# read the CSV data from GitHub (the first row holds the column names)
hotels <- read.csv('https://raw.githubusercontent.com/maharjansudhan/DATA607/master/new-york-city.csv', header = TRUE, sep = ',')
# count the rows, one per hotel (vars = "hotel_name" only adds a constant label column, so n is the row count)
library(dplyr)
count(hotels, vars = "hotel_name")
## # A tibble: 1 x 2
##   vars           n
##   <chr>      <int>
## 1 hotel_name   260

In this dataset, 260 hotels in New York City are reviewed and rated as of 2012.
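As a cross-check, the number of distinct hotel names can be compared with the number of rows. The sketch below does not use the full file; `toy_hotels` is an illustrative stand-in built from three names in the output, with one name repeated on purpose to show the difference.

```r
# Toy stand-in for the real data (an assumption for illustration):
# three hotel names from the dataset, one of them deliberately duplicated.
toy_hotels <- data.frame(
  hotel_name = c("3 west club", "414 hotel", "6 columbus", "414 hotel"),
  stringsAsFactors = FALSE
)

n_rows   <- nrow(toy_hotels)                       # counts rows, duplicates included
n_hotels <- length(unique(toy_hotels$hotel_name))  # counts unique hotel names

n_rows
n_hotels
```

On the real data the two numbers agree when every hotel appears exactly once. dplyr's n_distinct() gives the same unique count.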

summary(hotels)
##                                 hotel_name 
##  3 west club                         :  1  
##  414 hotel                           :  1  
##  6 columbus                          :  1  
##  60 thompson                         :  1  
##  70 park avenue hotel a kimpton hotel:  1  
##  ace hotel nyc                       :  1  
##  (Other)                             :254  
##                          street               city      num_reviews   
##  1335 avenue of the americas:  2   new york city:260   Min.   :  1.0  
##  85 west street             :  2                       1st Qu.: 80.0  
##  1 central park west        :  1                       Median :203.0  
##  100 east 50th street       :  1                       Mean   :212.1  
##  100 orchard st             :  1                       3rd Qu.:339.2  
##  101 w 57th st at 6th ave   :  1                       Max.   :580.0  
##  (Other)                    :252                                      
##   CLEANLINESS         ROOM          SERVICE         LOCATION    
##  Min.   :2.966   Min.   :2.585   Min.   :2.621   Min.   :2.720  
##  1st Qu.:4.009   1st Qu.:3.616   1st Qu.:3.684   1st Qu.:4.294  
##  Median :4.316   Median :3.920   Median :3.996   Median :4.564  
##  Mean   :4.263   Mean   :3.922   Mean   :3.990   Mean   :4.455  
##  3rd Qu.:4.557   3rd Qu.:4.256   3rd Qu.:4.312   3rd Qu.:4.705  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##      VALUE       overall_ratingsource
##  Min.   :3.023   Min.   :3.205       
##  1st Qu.:3.667   1st Qu.:3.905       
##  Median :3.973   Median :4.103       
##  Mean   :3.934   Mean   :4.113       
##  3rd Qu.:4.176   3rd Qu.:4.356       
##  Max.   :5.000   Max.   :5.000       
## 

These hotels are rated on a scale of up to 5 for cleanliness, room, service, location and value.

There is also an overall rating for each individual hotel.

str(hotels)
## 'data.frame':    260 obs. of  10 variables:
##  $ hotel_name          : Factor w/ 260 levels "3 west club",..: 127 31 134 150 193 180 114 112 228 167 ...
##  $ street              : Factor w/ 258 levels "1 central park west",..: 132 60 136 193 166 191 171 222 199 9 ...
##  $ city                : Factor w/ 1 level "new york city": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_reviews         : int  98 330 341 345 33 328 348 372 152 388 ...
##  $ CLEANLINESS         : num  4.98 4.73 4.9 4.8 4.9 ...
##  $ ROOM                : num  4.94 4.22 4.49 4.65 4.72 ...
##  $ SERVICE             : num  4.97 4.77 4.79 4.65 4.69 ...
##  $ LOCATION            : num  4.89 4.84 4.81 4.86 4.79 ...
##  $ VALUE               : num  4.76 4.38 4.38 4.35 4.66 ...
##  $ overall_ratingsource: num  4.91 4.59 4.67 4.66 4.75 ...

There are 10 variables: the hotel name, the address (street and city), the number of reviews and six different ratings for each hotel.

library(tidyverse)
head(hotels)
##                             hotel_name               street          city
## 1                    inn new york city    266 west 71st st. new york city
## 2                     casablanca hotel 147 west 43rd street new york city
## 3                        library hotel   299 madison avenue new york city
## 4                new york palace hotel      455 madison ave new york city
## 5 the french quarters guest apartments   346 w. 46th street new york city
## 6                     sofitel new york  45 west 44th street new york city
##   num_reviews CLEANLINESS     ROOM  SERVICE LOCATION    VALUE
## 1          98    4.984848 4.939394 4.969697 4.893939 4.757576
## 2         330    4.734219 4.215947 4.767442 4.840532 4.382060
## 3         341    4.897010 4.491694 4.787375 4.807309 4.375415
## 4         345    4.803987 4.647841 4.651163 4.860465 4.345515
## 5          33    4.896552 4.724138 4.689655 4.793103 4.655172
## 6         328    4.764120 4.607973 4.544850 4.847176 4.265781
##   overall_ratingsource
## 1             4.909091
## 2             4.588040
## 3             4.671761
## 4             4.661794
## 5             4.751724
## 6             4.605980
# remove any rows whose hotel name is blank
hotels <- hotels %>% filter(trimws(as.character(hotel_name)) != "")
# confirm that there are no more blank rows
head(hotels,15)
##                                         hotel_name
## 1                                inn new york city
## 2                                 casablanca hotel
## 3                                    library hotel
## 4                            new york palace hotel
## 5             the french quarters guest apartments
## 6                                 sofitel new york
## 7                                    hotel giraffe
## 8                                     hotel elysee
## 9           the ritz carlton new york central park
## 10 residence inn by marriott times square new york
## 11                                  affinia dumont
## 12                               chelsea pines inn
## 13              the hampton inn times square north
## 14                  hilton garden inn times square
## 15                               the lucerne hotel
##                             street          city num_reviews CLEANLINESS
## 1                266 west 71st st. new york city          98    4.984848
## 2             147 west 43rd street new york city         330    4.734219
## 3               299 madison avenue new york city         341    4.897010
## 4                  455 madison ave new york city         345    4.803987
## 5               346 w. 46th street new york city          33    4.896552
## 6              45 west 44th street new york city         328    4.764120
## 7                   365 park ave s new york city         348    4.837209
## 8              60 east 54th street new york city         372    4.711111
## 9            50 central park south new york city         152    4.829268
## 10     1033 avenue of the americas new york city         388    4.642140
## 11            150 east 34th street new york city         345    4.485050
## 12            317 west 14th street new york city         205    4.447205
## 13 851 eighth avenue @ 51st street new york city         332    4.588040
## 14               790 eighth avenue new york city         334    4.604651
## 15            201 west 79th street new york city         347    4.600000
##        ROOM  SERVICE LOCATION    VALUE overall_ratingsource
## 1  4.939394 4.969697 4.893939 4.757576             4.909091
## 2  4.215947 4.767442 4.840532 4.382060             4.588040
## 3  4.491694 4.787375 4.807309 4.375415             4.671761
## 4  4.647841 4.651163 4.860465 4.345515             4.661794
## 5  4.724138 4.689655 4.793103 4.655172             4.751724
## 6  4.607973 4.544850 4.847176 4.265781             4.605980
## 7  4.617940 4.598007 4.511628 4.299003             4.572757
## 8  4.485185 4.625926 4.792593 4.388889             4.600741
## 9  4.682927 4.597561 4.914634 4.146341             4.634146
## 10 4.525084 4.381271 4.725753 4.397993             4.534448
## 11 4.485050 4.411960 4.647841 4.352159             4.476412
## 12 4.167702 4.546584 4.850932 4.366460             4.475776
## 13 4.435216 4.568106 4.787375 4.441860             4.564120
## 14 4.392027 4.478405 4.784053 4.365449             4.524917
## 15 4.288235 4.441176 4.641176 4.211765             4.436471
hist(hotels$overall_ratingsource)

According to the histogram, the overall ratings look roughly bell-shaped, so they are approximately normally distributed. There are no extreme outliers, meaning no hotel is rated far below the rest because of bad service and quality, or far above the rest because of good service and quality.
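A histogram only gives a visual impression, so a formal check can back it up. The sketch below is one possible approach and not part of the original analysis: it uses the Shapiro-Wilk test and a Q-Q plot from base R. The helper name `normality_p_value` is hypothetical. The ratings are simulated so that the block runs on its own, and the standard deviation of 0.3 is an assumption.

```r
# Hypothetical helper: Shapiro-Wilk p-value for a numeric vector.
normality_p_value <- function(x) {
  x <- x[!is.na(x)]          # drop missing values before testing
  shapiro.test(x)$p.value
}

# Simulated ratings for illustration only (assumed mean and sd);
# on the real data, pass hotels$overall_ratingsource instead.
set.seed(42)
simulated_ratings <- rnorm(260, mean = 4.1, sd = 0.3)

p <- normality_p_value(simulated_ratings)
p  # a small p-value (for example below 0.05) would argue against normality

# Visual check: points close to the reference line suggest normality.
qqnorm(simulated_ratings)
qqline(simulated_ratings)
```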

# compute the 1st through 99th percentiles of the overall rating
probs <- seq(0.01, 0.99, 0.01)
q <- quantile(hotels$overall_ratingsource, probs)
d <- data.frame(percentile = names(q), score = unname(q))

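From the percentiles, around 60% of the hotels are rated higher than 4. This is consistent with the summary above, where the first quartile is 3.905 and the median is 4.103.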
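The share of hotels above a given rating can also be computed directly instead of being read off the percentile table. This is a minimal sketch. `share_above` is a hypothetical helper, and the demo vector holds only the six overall ratings printed by head(hotels), so its results describe those six hotels and not the whole dataset.

```r
# Hypothetical helper: fraction of values strictly greater than a threshold.
share_above <- function(x, threshold) {
  mean(x > threshold, na.rm = TRUE)
}

# The six overall ratings shown by head(hotels), used only as a demo;
# on the full data this would be share_above(hotels$overall_ratingsource, 4).
demo_ratings <- c(4.909091, 4.588040, 4.671761, 4.661794, 4.751724, 4.605980)

share_above(demo_ratings, 4)    # all six demo hotels are above 4
share_above(demo_ratings, 4.7)  # two of the six demo hotels are above 4.7
```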

d%>%arrange(desc(score))%>%head(5)
##   percentile    score
## 1        99% 4.816245
## 2        98% 4.669967
## 3        97% 4.639097
## 4        96% 4.604094
## 5        95% 4.596481
category <- hotels %>% select(CLEANLINESS, ROOM, SERVICE, LOCATION, VALUE)

# one box per rating category; the rating is on the vertical axis
boxplot(category, xlab = "Category", ylab = "Rating")

According to the boxplot, cleanliness and location receive the highest ratings, so it seems like people care most about the cleanliness of their room and the location of the hotel.

#Compare between Service and Value
plot(hotels$SERVICE ~ hotels$VALUE)

I wanted to see how the service rating relates to the value rating, and the two look positively correlated: the higher the service rating, the higher the value rating. So it seems that guests who get good service also tend to feel they got good value for their money.

#Compare between location and Value
plot(hotels$LOCATION ~ hotels$VALUE)

Comparing location with value, it seems people tend to care more about the location than about the cost of staying there.

#Compare between Cleanliness and Value
plot(hotels$CLEANLINESS ~ hotels$VALUE)

Surprisingly, it seems people care more about having a clean room than about the cost of the place.
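The three scatter plots can be summarized with correlation coefficients. The sketch below is an illustration and not part of the original analysis. It uses only the six rows printed by head(hotels), which are all top-rated hotels, so the numbers it produces are not representative of the full dataset. A correlation measures how two ratings move together; it does not by itself show which one guests prefer.

```r
# Demo data: the six rows printed by head(hotels). On the full data,
# use the hotels data frame in place of demo.
demo <- data.frame(
  CLEANLINESS = c(4.984848, 4.734219, 4.897010, 4.803987, 4.896552, 4.764120),
  SERVICE     = c(4.969697, 4.767442, 4.787375, 4.651163, 4.689655, 4.544850),
  LOCATION    = c(4.893939, 4.840532, 4.807309, 4.860465, 4.793103, 4.847176),
  VALUE       = c(4.757576, 4.382060, 4.375415, 4.345515, 4.655172, 4.265781)
)

# Pearson correlation of each category with VALUE.
cors <- sapply(
  demo[c("SERVICE", "LOCATION", "CLEANLINESS")],
  function(col) cor(col, demo$VALUE)
)
cors
```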

# keep only the hotel name and its overall rating, highest rated first
rated_hotels <- select(hotels, hotel_name, overall_ratingsource)
highest_rated_hotels <- arrange(rated_hotels, desc(overall_ratingsource))
head(highest_rated_hotels)
##                             hotel_name overall_ratingsource
## 1                  crosby street hotel             5.000000
## 2                             the mark             5.000000
## 3                    inn new york city             4.909091
## 4 the french quarters guest apartments             4.751724
## 5                     the mercer hotel             4.673077
## 6                        library hotel             4.671761
tail(highest_rated_hotels)
##                               hotel_name overall_ratingsource
## 255                amsterdam court hotel             3.435583
## 256                         hudson hotel             3.433887
## 257                    astor on the park             3.420091
## 258                      the grant hotel             3.278261
## 259                   woogo central park             3.262069
## 260 best western convention center hotel             3.205333

The highest rated hotels are CROSBY STREET HOTEL and THE MARK.

The lowest rated hotels are WOOGO CENTRAL PARK and BEST WESTERN CONVENTION CENTER HOTEL.

# check whether the hotels rated 5.0 (shown in red) have only a few reviews

plot(hotels$num_reviews ~ hotels$overall_ratingsource, col=ifelse(hotels$overall_ratingsource==5, "red", "black"))

# keep the hotel name, number of reviews and overall rating, highest rated first
rated_vs_reviews <- select(hotels, hotel_name, num_reviews, overall_ratingsource)
highest_rated_hotels <- arrange(rated_vs_reviews, desc(overall_ratingsource))
head(highest_rated_hotels)
##                             hotel_name num_reviews overall_ratingsource
## 1                  crosby street hotel           4             5.000000
## 2                             the mark           9             5.000000
## 3                    inn new york city          98             4.909091
## 4 the french quarters guest apartments          33             4.751724
## 5                     the mercer hotel          84             4.673077
## 6                        library hotel         341             4.671761
tail(highest_rated_hotels)
##                               hotel_name num_reviews overall_ratingsource
## 255                amsterdam court hotel         334             3.435583
## 256                         hudson hotel         339             3.433887
## 257                    astor on the park         343             3.420091
## 258                      the grant hotel          40             3.278261
## 259                   woogo central park         163             3.262069
## 260 best western convention center hotel         119             3.205333

One thing that made me suspicious is that the highest rated hotel, CROSBY STREET HOTEL, has a 5.0 rating but only 4 reviews in total. Similarly, THE MARK has only 9 reviews.

It is unclear whether their ratings are 5 because they have so few reviews, or because they really do a good job of taking care of their guests by giving them a clean, good-value room at a good price in a very convenient location. Other people might have the same bias that I have right now.
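One common way to handle ratings that rest on very few reviews is a weighted (Bayesian) average that pulls each hotel toward the overall mean until it has enough reviews. This technique is not part of the original analysis. The sketch below assumes a prior weight `m` of 80 reviews (the first quartile of num_reviews in the summary) and uses the overall mean of 4.113 from the summary; both choices are assumptions, and `weighted_rating` is a hypothetical helper.

```r
# Hypothetical helper: shrink a rating toward the overall mean.
# rating = hotel's average rating, n = its number of reviews,
# overall_mean = mean rating across all hotels, m = assumed prior weight.
weighted_rating <- function(rating, n, overall_mean, m) {
  (n / (n + m)) * rating + (m / (n + m)) * overall_mean
}

# Four hotels with their ratings and review counts from the tables above.
demo <- data.frame(
  hotel_name  = c("crosby street hotel", "the mark",
                  "inn new york city", "library hotel"),
  num_reviews = c(4, 9, 98, 341),
  overall_ratingsource = c(5.000000, 5.000000, 4.909091, 4.671761),
  stringsAsFactors = FALSE
)

demo$weighted <- weighted_rating(
  demo$overall_ratingsource, demo$num_reviews,
  overall_mean = 4.113,  # mean of overall_ratingsource from summary(hotels)
  m = 80                 # assumed prior weight (1st quartile of num_reviews)
)

# Rank by the weighted rating instead of the raw rating.
ranked <- demo[order(-demo$weighted), ]
ranked
```

With these assumed settings, the two hotels rated 5.0 on a handful of reviews fall below hotels with slightly lower ratings backed by many more reviews. A different `m` would change how strongly the ratings are pulled toward the mean.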

Tripadvisor is used by people all over the world. It is a huge recommender system that covers flights, cars, hotels, restaurants and more. The business keeps growing as more people use smartphones and find it convenient to use the app on their phone.

# present the top 10 rated hotels of New York City in 2012 according to Tripadvisor
library(leaflet)
library(gridExtra)
library(topicmodels)
geo_hotels <- select(hotels, hotel_name, overall_ratingsource)
geo_hotels <- arrange(geo_hotels, desc(overall_ratingsource))
x <- head(geo_hotels,10)
x
##                                hotel_name overall_ratingsource
## 1                     crosby street hotel             5.000000
## 2                                the mark             5.000000
## 3                       inn new york city             4.909091
## 4    the french quarters guest apartments             4.751724
## 5                        the mercer hotel             4.673077
## 6                           library hotel             4.671761
## 7                   new york palace hotel             4.661794
## 8             the sherry netherland hotel             4.654545
## 9                        the bowery hotel             4.634483
## 10 the ritz carlton new york central park             4.634146
# labels and coordinates of the top 10 hotels
top10 <- data.frame(
  label = c("Crosby Street Hotel", "The Mark", "Inn New York City",
            "The French Quarters Guest Apartments", "The Mercer Hotel",
            "Library Hotel", "New York Palace Hotel",
            "The Sherry Netherland Hotel", "The Bowery Hotel",
            "The Ritz Carlton New York Central Park"),
  lng = c(-73.9996057, -73.9780981, -73.9867765, -73.9919417, -74.0007262,
          -73.9908914, -73.9771662, -73.974817, -73.9938177, -73.9781879),
  lat = c(40.7230126, 40.7675672, 40.7787641, 40.76035, 40.7248889,
          40.7471436, 40.75802, 40.7644768, 40.726017, 40.7652719),
  stringsAsFactors = FALSE
)

# one marker per hotel, each with an always-visible label
leaflet(top10) %>% addTiles() %>% setView(-74.00, 40.73, 12) %>%
  addMarkers(
    lng = ~lng, lat = ~lat, label = ~label,
    labelOptions = labelOptions(noHide = TRUE, textsize = "15px"))

What are the main goals?

The key goal of Tripadvisor users is to find the best options for flights, cruises, places to visit or places to eat.

It started as a small company in 2000.

How can you help them accomplish those goals?

Based on a user's previous history, we can suggest other places they might like to visit, eat at and so on.
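One simple way to turn a user's history into suggestions is to look for hotels whose category ratings are closest to a hotel the user already liked. This is only a sketch of a content-based approach and not how Tripadvisor works. The helper `similar_hotels` is hypothetical, and the demo data are the six rows printed by head(hotels).

```r
# Demo data: the six rows printed by head(hotels).
demo <- data.frame(
  hotel_name = c("inn new york city", "casablanca hotel", "library hotel",
                 "new york palace hotel",
                 "the french quarters guest apartments", "sofitel new york"),
  CLEANLINESS = c(4.984848, 4.734219, 4.897010, 4.803987, 4.896552, 4.764120),
  ROOM        = c(4.939394, 4.215947, 4.491694, 4.647841, 4.724138, 4.607973),
  SERVICE     = c(4.969697, 4.767442, 4.787375, 4.651163, 4.689655, 4.544850),
  LOCATION    = c(4.893939, 4.840532, 4.807309, 4.860465, 4.793103, 4.847176),
  VALUE       = c(4.757576, 4.382060, 4.375415, 4.345515, 4.655172, 4.265781),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n hotels closest to a liked hotel, measured by
# Euclidean distance over the five category ratings (smaller = more similar).
similar_hotels <- function(data, liked, n = 3) {
  features <- as.matrix(
    data[, c("CLEANLINESS", "ROOM", "SERVICE", "LOCATION", "VALUE")]
  )
  rownames(features) <- data$hotel_name
  target <- features[liked, ]
  # subtract the liked hotel's ratings from every row, then take row norms
  dists <- sqrt(rowSums(sweep(features, 2, target)^2))
  dists <- dists[names(dists) != liked]  # do not recommend the same hotel
  head(sort(dists), n)
}

suggestions <- similar_hotels(demo, "library hotel", n = 2)
suggestions
```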

If the user is new, we can recommend the best-rated places within a 10 mile range of the location they are searching for.
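The radius rule for new users can be sketched with the haversine great-circle distance. The hotel coordinates and ratings below are the ones used in the map and the top 10 table. The search point (roughly Times Square) and the helper names `haversine_miles` and `best_nearby` are assumptions made for illustration.

```r
# Great-circle distance in miles between (lat1, lon1) and (lat2, lon2).
haversine_miles <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 3958.8 * asin(pmin(1, sqrt(a)))  # 3958.8 = mean Earth radius in miles
}

# Four of the top rated hotels with their map coordinates.
demo <- data.frame(
  hotel_name = c("crosby street hotel", "the mark",
                 "inn new york city", "the mercer hotel"),
  lat = c(40.7230126, 40.7675672, 40.7787641, 40.7248889),
  lng = c(-73.9996057, -73.9780981, -73.9867765, -74.0007262),
  overall_ratingsource = c(5.000000, 5.000000, 4.909091, 4.673077),
  stringsAsFactors = FALSE
)

# Hypothetical helper: hotels within radius_miles of a point, best rated first.
best_nearby <- function(data, lat, lng, radius_miles = 10) {
  data$distance <- haversine_miles(lat, lng, data$lat, data$lng)
  nearby <- data[data$distance <= radius_miles, ]
  nearby[order(-nearby$overall_ratingsource), ]
}

# Assumed search point: approximately Times Square.
within_10 <- best_nearby(demo, lat = 40.7580, lng = -73.9855)
within_2  <- best_nearby(demo, lat = 40.7580, lng = -73.9855, radius_miles = 2)
within_10
within_2
```

All four demo hotels are in Manhattan, so the 10 mile radius keeps every one of them; the 2 mile call shows the filter actually removing the hotels that are farther downtown.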

The good thing about Tripadvisor ratings and reviews is that they are all written by visitors. They experienced something there and wrote about what they experienced. They have the right to rate higher or lower and to write good or bad reviews. It all depends on the guests.

The only issue is that some companies can create fake accounts and write good reviews about their own place. The ratings alone give no way to verify that a review is legitimate.

Conclusion

After all the calculations and analysis, we can conclude that ratings and reviews cannot always be trusted. A review depends on what the person thought about the place at that particular moment. If a person who stayed in a hotel is in a good mood at checkout, or while writing the review, then there is a high chance that they will write a good review and give a higher rating. So you should not decide to stay in a hotel based on one person's review. Read a few reviews and do some research about the place: why do you want to stay there, is it for a business visit or a vacation, and how close is it to the places you plan to visit? All these things matter a lot when choosing a hotel. Still, a recommender system like Tripadvisor helps us a lot to at least narrow down the many available options.

“When expectation doesn’t meet reality, people write bad reviews about places and restaurants.”

Citation

Ganesan, K. A., and C. X. Zhai, “Opinion-Based Entity Ranking”, Information Retrieval, vol. 15, no. 2, pp. 116–150, 2012, Springer. http://kavita-ganesan.com/opinion-based-entity-ranking/#.XA4JRWhKjtQ