I chose Tripadvisor as the recommender system for this project because it has influenced and motivated many people, me included. The ratings you see for places, hotels, cars, flights, and restaurants are, in my experience, trustworthy. Tripadvisor's restaurant rankings are getting more popular, and millions of people use the site to find good places to stay, eat, or rent while traveling around the world. Its ratings cover places all over the world. Unlike recommender systems that are driven by sponsorships or advertisements, it relies on people who have visited, eaten at, or stayed at a place and who write about their personal experience of it.
My hypothesis is that the higher a place's rating, the better it is as a place to stay, eat, or drive.
It was a little tough to find a Tripadvisor dataset because the data is not open source. You have to submit a request on their website to get a dataset for data analysis or academic purposes.
Luckily, after searching different websites, I found a dataset with car and hotel reviews for about 10 different cities around the world.
I chose the New York City dataset, since I live here and know the major hotels and locations, which makes it easier to check whether the reviews and ratings are legitimate.
After removing the columns that were unnecessary for this project, I saved the CSV file on GitHub.
#getting the csv data from github (header row present, comma-separated)
hotels <- read.csv('https://raw.githubusercontent.com/maharjansudhan/DATA607/master/new-york-city.csv', header = TRUE, sep = ',')
#to check how many hotels are rated and reviewed
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
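# note: vars = "hotel_name" adds a constant column, so n is the total number of rows (one row per hotel)
count(hotels, vars = "hotel_name")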
## # A tibble: 1 x 2
## vars n
## <chr> <int>
## 1 hotel_name 260
This dataset contains 260 hotels in New York City that were reviewed and rated in 2012.
summary(hotels)
## hotel_name
## 3 west club : 1
## 414 hotel : 1
## 6 columbus : 1
## 60 thompson : 1
## 70 park avenue hotel a kimpton hotel: 1
## ace hotel nyc : 1
## (Other) :254
## street city num_reviews
## 1335 avenue of the americas: 2 new york city:260 Min. : 1.0
## 85 west street : 2 1st Qu.: 80.0
## 1 central park west : 1 Median :203.0
## 100 east 50th street : 1 Mean :212.1
## 100 orchard st : 1 3rd Qu.:339.2
## 101 w 57th st at 6th ave : 1 Max. :580.0
## (Other) :252
## CLEANLINESS ROOM SERVICE LOCATION
## Min. :2.966 Min. :2.585 Min. :2.621 Min. :2.720
## 1st Qu.:4.009 1st Qu.:3.616 1st Qu.:3.684 1st Qu.:4.294
## Median :4.316 Median :3.920 Median :3.996 Median :4.564
## Mean :4.263 Mean :3.922 Mean :3.990 Mean :4.455
## 3rd Qu.:4.557 3rd Qu.:4.256 3rd Qu.:4.312 3rd Qu.:4.705
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## VALUE overall_ratingsource
## Min. :3.023 Min. :3.205
## 1st Qu.:3.667 1st Qu.:3.905
## Median :3.973 Median :4.103
## Mean :3.934 Mean :4.113
## 3rd Qu.:4.176 3rd Qu.:4.356
## Max. :5.000 Max. :5.000
##
Each hotel is rated on a 5-point scale for cleanliness, room, service, location, and value.
There is also an overall rating for each individual hotel.
str(hotels)
## 'data.frame': 260 obs. of 10 variables:
## $ hotel_name : Factor w/ 260 levels "3 west club",..: 127 31 134 150 193 180 114 112 228 167 ...
## $ street : Factor w/ 258 levels "1 central park west",..: 132 60 136 193 166 191 171 222 199 9 ...
## $ city : Factor w/ 1 level "new york city": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_reviews : int 98 330 341 345 33 328 348 372 152 388 ...
## $ CLEANLINESS : num 4.98 4.73 4.9 4.8 4.9 ...
## $ ROOM : num 4.94 4.22 4.49 4.65 4.72 ...
## $ SERVICE : num 4.97 4.77 4.79 4.65 4.69 ...
## $ LOCATION : num 4.89 4.84 4.81 4.86 4.79 ...
## $ VALUE : num 4.76 4.38 4.38 4.35 4.66 ...
## $ overall_ratingsource: num 4.91 4.59 4.67 4.66 4.75 ...
There are 10 variables, which represent the hotel name, address, number of reviews, and the different kinds of ratings for these hotels.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.1
## -- Attaching packages ---------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v readr 1.1.1
## v tibble 1.4.2 v purrr 0.2.5
## v tidyr 0.8.1 v stringr 1.3.1
## v ggplot2 3.1.0 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## Warning: package 'tibble' was built under R version 3.5.1
## Warning: package 'tidyr' was built under R version 3.5.1
## Warning: package 'readr' was built under R version 3.5.1
## Warning: package 'purrr' was built under R version 3.5.1
## Warning: package 'stringr' was built under R version 3.5.1
## Warning: package 'forcats' was built under R version 3.5.1
## -- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
head(hotels)
## hotel_name street city
## 1 inn new york city 266 west 71st st. new york city
## 2 casablanca hotel 147 west 43rd street new york city
## 3 library hotel 299 madison avenue new york city
## 4 new york palace hotel 455 madison ave new york city
## 5 the french quarters guest apartments 346 w. 46th street new york city
## 6 sofitel new york 45 west 44th street new york city
## num_reviews CLEANLINESS ROOM SERVICE LOCATION VALUE
## 1 98 4.984848 4.939394 4.969697 4.893939 4.757576
## 2 330 4.734219 4.215947 4.767442 4.840532 4.382060
## 3 341 4.897010 4.491694 4.787375 4.807309 4.375415
## 4 345 4.803987 4.647841 4.651163 4.860465 4.345515
## 5 33 4.896552 4.724138 4.689655 4.793103 4.655172
## 6 328 4.764120 4.607973 4.544850 4.847176 4.265781
## overall_ratingsource
## 1 4.909091
## 2 4.588040
## 3 4.671761
## 4 4.661794
## 5 4.751724
## 6 4.605980
# remove any rows whose hotel name is blank (empty or whitespace only)
hotels <- hotels %>% filter(trimws(as.character(hotel_name)) != "")
## Warning: package 'bindrcpp' was built under R version 3.5.1
# to confirm that there are no more blank rows
head(hotels,15)
## hotel_name
## 1 inn new york city
## 2 casablanca hotel
## 3 library hotel
## 4 new york palace hotel
## 5 the french quarters guest apartments
## 6 sofitel new york
## 7 hotel giraffe
## 8 hotel elysee
## 9 the ritz carlton new york central park
## 10 residence inn by marriott times square new york
## 11 affinia dumont
## 12 chelsea pines inn
## 13 the hampton inn times square north
## 14 hilton garden inn times square
## 15 the lucerne hotel
## street city num_reviews CLEANLINESS
## 1 266 west 71st st. new york city 98 4.984848
## 2 147 west 43rd street new york city 330 4.734219
## 3 299 madison avenue new york city 341 4.897010
## 4 455 madison ave new york city 345 4.803987
## 5 346 w. 46th street new york city 33 4.896552
## 6 45 west 44th street new york city 328 4.764120
## 7 365 park ave s new york city 348 4.837209
## 8 60 east 54th street new york city 372 4.711111
## 9 50 central park south new york city 152 4.829268
## 10 1033 avenue of the americas new york city 388 4.642140
## 11 150 east 34th street new york city 345 4.485050
## 12 317 west 14th street new york city 205 4.447205
## 13 851 eighth avenue @ 51st street new york city 332 4.588040
## 14 790 eighth avenue new york city 334 4.604651
## 15 201 west 79th street new york city 347 4.600000
## ROOM SERVICE LOCATION VALUE overall_ratingsource
## 1 4.939394 4.969697 4.893939 4.757576 4.909091
## 2 4.215947 4.767442 4.840532 4.382060 4.588040
## 3 4.491694 4.787375 4.807309 4.375415 4.671761
## 4 4.647841 4.651163 4.860465 4.345515 4.661794
## 5 4.724138 4.689655 4.793103 4.655172 4.751724
## 6 4.607973 4.544850 4.847176 4.265781 4.605980
## 7 4.617940 4.598007 4.511628 4.299003 4.572757
## 8 4.485185 4.625926 4.792593 4.388889 4.600741
## 9 4.682927 4.597561 4.914634 4.146341 4.634146
## 10 4.525084 4.381271 4.725753 4.397993 4.534448
## 11 4.485050 4.411960 4.647841 4.352159 4.476412
## 12 4.167702 4.546584 4.850932 4.366460 4.475776
## 13 4.435216 4.568106 4.787375 4.441860 4.564120
## 14 4.392027 4.478405 4.784053 4.365449 4.524917
## 15 4.288235 4.441176 4.641176 4.211765 4.436471
hist(hotels$overall_ratingsource)
According to the histogram, the ratings look roughly normally distributed. There are no obvious outliers, meaning that no hotel is rated extremely low because of bad service and quality, or extremely high because of good service and quality.
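A histogram is only a rough guide to normality. As a minimal sketch, assuming just the 15 overall ratings printed by head(hotels, 15) above (an excerpt, not the full data), a Shapiro-Wilk test and the 1.5 * IQR boxplot rule give a more explicit check. On the real data the input would be hotels$overall_ratingsource.

```r
# Excerpt: overall ratings of the first 15 hotels printed above (not all 260)
ratings <- c(4.909091, 4.588040, 4.671761, 4.661794, 4.751724,
             4.605980, 4.572757, 4.600741, 4.634146, 4.534448,
             4.476412, 4.475776, 4.564120, 4.524917, 4.436471)

# Shapiro-Wilk test: a small p-value argues against normality
sw <- shapiro.test(ratings)
print(sw$p.value)

# Flag outliers with the usual 1.5 * IQR boxplot rule
q <- quantile(ratings, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
outliers <- ratings[ratings < q[1] - fence | ratings > q[2] + fence]
print(outliers)
```

The first rows of this file happen to be highly rated hotels, so the excerpt says nothing about the full distribution; it only shows the mechanics.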
#let's look at the percentiles (1% to 99%) of the overall rating
q <- quantile(hotels$overall_ratingsource, seq(0.01, 0.99, 0.01))
d <- data.frame(percentile = names(q), score = unname(q))
According to these percentiles, around 60% of the hotels are rated higher than 4.
d %>% arrange(desc(score)) %>% head(5)
## percentile score
## 1 99% 4.816245
## 2 98% 4.669967
## 3 97% 4.639097
## 4 96% 4.604094
## 5 95% 4.596481
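The share of hotels above a threshold can also be computed directly instead of being read off a percentile table. A small sketch, assuming only the 12 ratings that appear in this report's six-highest and six-lowest listings, so the result is not the real figure for all 260 hotels. The helper name share_above is made up for illustration; with the full data the input would be hotels$overall_ratingsource.

```r
# Fraction of values strictly above a threshold
share_above <- function(x, threshold) mean(x > threshold, na.rm = TRUE)

# Excerpt only: the six highest and six lowest overall ratings in this report
excerpt <- c(5.000000, 5.000000, 4.909091, 4.751724, 4.673077, 4.671761,
             3.435583, 3.433887, 3.420091, 3.278261, 3.262069, 3.205333)

# 0.5 for this excerpt: six of the twelve values exceed 4
print(share_above(excerpt, 4))
```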
category <- hotels %>% select(CLEANLINESS, ROOM, SERVICE, LOCATION, VALUE)
# categories are on the x axis, ratings on the y axis
boxplot(category, xlab = "Category", ylab = "Rating")
According to the boxplot, cleanliness and location receive the highest ratings, which suggests that people care most about the cleanliness of their room and the location of the hotel.
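The same comparison can be made numerically by ranking the category means. The sketch below reuses the means reported by summary(hotels) earlier in this report rather than recomputing them; with the data loaded, colMeans(category) should give the same numbers.

```r
# Category means as reported by summary(hotels) above
category_means <- c(CLEANLINESS = 4.263, ROOM = 3.922, SERVICE = 3.990,
                    LOCATION = 4.455, VALUE = 3.934)

# Rank categories from highest to lowest mean rating
ranked <- sort(category_means, decreasing = TRUE)
print(ranked)
```

Location and cleanliness come out on top, in line with the boxplot.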
#Compare between Service and Value
plot(hotels$SERVICE ~ hotels$VALUE)
I wanted to see how Service relates to Value, and the two appear to be positively correlated: the higher the service rating, the higher the value rating tends to be. One reading is that people do not mind the price when the service is good.
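A scatter plot shows the trend, and a correlation coefficient puts a number on it. As a rough sketch, assuming only the six SERVICE and VALUE pairs printed by head(hotels) above, the coefficient can be computed as follows. On the full data the call would be cor(hotels$SERVICE, hotels$VALUE). Six points are far too few to draw conclusions from, and a correlation does not say which rating drives the other.

```r
# Excerpt: SERVICE and VALUE ratings of the first six hotels printed above
service <- c(4.969697, 4.767442, 4.787375, 4.651163, 4.689655, 4.544850)
value   <- c(4.757576, 4.382060, 4.375415, 4.345515, 4.655172, 4.265781)

# Pearson measures linear association; Spearman only uses the rank order
pearson  <- cor(service, value, method = "pearson")
spearman <- cor(service, value, method = "spearman")
print(c(pearson = pearson, spearman = spearman))
```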
#Compare between location and Value
plot(hotels$LOCATION ~ hotels$VALUE)
As for Location versus Value, people tend to favor a good location over the cost of staying there.
#Compare between Cleanliness and Value
plot(hotels$CLEANLINESS ~ hotels$VALUE)
Surprisingly, people seem to prefer a clean room over a cheaper price.
rated_hotels <- select(hotels, hotel_name, overall_ratingsource)
highest_rated_hotels <- arrange(rated_hotels, desc(overall_ratingsource))
head(highest_rated_hotels)
## hotel_name overall_ratingsource
## 1 crosby street hotel 5.000000
## 2 the mark 5.000000
## 3 inn new york city 4.909091
## 4 the french quarters guest apartments 4.751724
## 5 the mercer hotel 4.673077
## 6 library hotel 4.671761
tail(highest_rated_hotels)
## hotel_name overall_ratingsource
## 255 amsterdam court hotel 3.435583
## 256 hudson hotel 3.433887
## 257 astor on the park 3.420091
## 258 the grant hotel 3.278261
## 259 woogo central park 3.262069
## 260 best western convention center hotel 3.205333
The highest rated hotels are Crosby Street Hotel and The Mark.
The lowest rated hotels are Woogo Central Park and Best Western Convention Center Hotel.
#clarifying doubts
plot(hotels$num_reviews ~ hotels$overall_ratingsource, col=ifelse(hotels$overall_ratingsource==5, "red", "black"))
rated_vs_reviews <- select(hotels, hotel_name, num_reviews, overall_ratingsource)
highest_rated_hotels <- arrange(rated_vs_reviews, desc(overall_ratingsource))
head(highest_rated_hotels)
## hotel_name num_reviews overall_ratingsource
## 1 crosby street hotel 4 5.000000
## 2 the mark 9 5.000000
## 3 inn new york city 98 4.909091
## 4 the french quarters guest apartments 33 4.751724
## 5 the mercer hotel 84 4.673077
## 6 library hotel 341 4.671761
tail(highest_rated_hotels)
## hotel_name num_reviews overall_ratingsource
## 255 amsterdam court hotel 334 3.435583
## 256 hudson hotel 339 3.433887
## 257 astor on the park 343 3.420091
## 258 the grant hotel 40 3.278261
## 259 woogo central park 163 3.262069
## 260 best western convention center hotel 119 3.205333
One thing that made me suspicious is that the highest rated hotel, Crosby Street Hotel, has a 5.0 rating but only 4 reviews in total. Similarly, The Mark has only 9 reviews.
It is unclear whether their ratings are 5 because they have so few reviews, or whether they really are doing a good job of taking care of their guests by giving them a clean, good-value room at a good price in a very convenient location. People might have a bias, like I have right now.
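One simple way to guard against this is to rank only hotels that have a minimum number of reviews. The sketch below is written in base R so it runs without extra packages, and it uses the six highest-rated rows printed above. The helper rank_with_min_reviews and the threshold of 30 reviews are assumptions chosen for illustration, not something the dataset prescribes.

```r
# Six highest-rated hotels with their review counts, as printed above
top_rated <- data.frame(
  hotel_name = c("crosby street hotel", "the mark", "inn new york city",
                 "the french quarters guest apartments", "the mercer hotel",
                 "library hotel"),
  num_reviews = c(4, 9, 98, 33, 84, 341),
  overall_ratingsource = c(5.000000, 5.000000, 4.909091,
                           4.751724, 4.673077, 4.671761),
  stringsAsFactors = FALSE
)

# Keep hotels with enough reviews, then sort by rating (highest first)
rank_with_min_reviews <- function(df, min_reviews) {
  kept <- df[df$num_reviews >= min_reviews, ]
  kept[order(-kept$overall_ratingsource), ]
}

# Hypothetical threshold: at least 30 reviews
trusted <- rank_with_min_reviews(top_rated, min_reviews = 30)
print(trusted)
```

With this threshold the two 5.0 hotels drop out and inn new york city moves to the top. A different threshold would give a different ranking, so the cutoff itself is a judgment call.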
Tripadvisor is used by people all over the world. It is a huge recommender system that covers flights, cars, hotels, restaurants, and more. The business keeps growing because more people use smartphones these days and the app is very convenient to use on a phone.
#Presenting all the Top 10 rated hotels of New York City in 2012 according to Tripadvisor
library(leaflet)
## Warning: package 'leaflet' was built under R version 3.5.1
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.5.1
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(topicmodels)
## Warning: package 'topicmodels' was built under R version 3.5.1
geo_hotels <- select(hotels, hotel_name, overall_ratingsource)
geo_hotels <- arrange(geo_hotels, desc(overall_ratingsource))
x <- head(geo_hotels,10)
x
## hotel_name overall_ratingsource
## 1 crosby street hotel 5.000000
## 2 the mark 5.000000
## 3 inn new york city 4.909091
## 4 the french quarters guest apartments 4.751724
## 5 the mercer hotel 4.673077
## 6 library hotel 4.671761
## 7 new york palace hotel 4.661794
## 8 the sherry netherland hotel 4.654545
## 9 the bowery hotel 4.634483
## 10 the ritz carlton new york central park 4.634146
# coordinates of the top 10 hotels
top10 <- data.frame(
  label = c("Crosby Street Hotel", "The Mark", "INN New York City",
            "The French Quarters Guest Apartments", "The Mercer Hotel",
            "Library Hotel", "New York Palace Hotel",
            "The Sherry Netherland Hotel", "The Bowery Hotel",
            "The Ritz Carlton New York Central Park"),
  lng = c(-73.9996057, -73.9780981, -73.9867765, -73.9919417, -74.0007262,
          -73.9908914, -73.9771662, -73.974817, -73.9938177, -73.9781879),
  lat = c(40.7230126, 40.7675672, 40.7787641, 40.76035, 40.7248889,
          40.7471436, 40.75802, 40.7644768, 40.726017, 40.7652719)
)
# one marker per hotel, with an always-visible label
# (the labelOptions argument is textsize; the misspelled textSizw was silently ignored)
leaflet(top10) %>% addTiles() %>% setView(-74.00, 40.73, 12) %>%
  addMarkers(
    lng = ~lng, lat = ~lat,
    label = ~label,
    labelOptions = labelOptions(noHide = TRUE, textsize = "15px"))
The key goals of Tripadvisor users are to find the best flights, cruises, places to visit, or places to eat.
Tripadvisor started as a small company in 2000.
Based on a user's previous history, we can suggest other places they might like to visit, eat at, and so on.
If the user is new, we can recommend the best places within a 10-mile range of where they are searching.
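The idea for new users can be sketched as a distance filter followed by a sort on rating. The example below uses the haversine (great-circle) formula and four of the top-rated hotels with the coordinates and ratings shown earlier in this report. The function names, the search location, and the 3958.8-mile Earth radius are assumptions made for illustration; this is not how Tripadvisor itself computes recommendations.

```r
# Great-circle distance in miles between two points given in degrees
haversine_miles <- function(lat1, lng1, lat2, lng2) {
  earth_radius_miles <- 3958.8  # mean Earth radius (assumed constant)
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * earth_radius_miles * asin(pmin(1, sqrt(a)))
}

# Best-rated places within radius_miles of the search location
recommend_nearby <- function(places, lat, lng, radius_miles = 10, n = 3) {
  places$distance_miles <- haversine_miles(lat, lng, places$lat, places$lng)
  nearby <- places[places$distance_miles <= radius_miles, ]
  head(nearby[order(-nearby$rating), ], n)
}

# Four hotels with coordinates and overall ratings taken from this report
places <- data.frame(
  name = c("Crosby Street Hotel", "The Mercer Hotel",
           "Library Hotel", "The Bowery Hotel"),
  lat = c(40.7230126, 40.7248889, 40.7471436, 40.726017),
  lng = c(-73.9996057, -74.0007262, -73.9908914, -73.9938177),
  rating = c(5.000000, 4.673077, 4.671761, 4.634483),
  stringsAsFactors = FALSE
)

# Hypothetical search location, roughly midtown Manhattan
picks <- recommend_nearby(places, lat = 40.758, lng = -73.9855)
print(picks)
```

Sorting on the raw rating inherits the few-reviews problem discussed above, so a real recommender would likely combine the distance filter with a minimum review count.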
The good thing about Tripadvisor ratings and reviews is that they are all written by visitors. They experienced something there and wrote about what they experienced. They have the right to rate higher or lower, and to write good reviews or bad ones. It all depends on the guests.
The only issue is that some companies can create fake accounts and write good reviews about their own place. There is no guarantee of legitimacy on this particular issue.
After all the calculations and analysis, we can see that ratings and reviews cannot always be trusted. A review depends on what a person thinks about the place at one particular moment. If a person who stayed in a hotel is in a good mood at checkout or while writing the review, there is a high chance that they will write a good review and give a higher rating. So you should not choose a hotel based on one person's review. Read a few reviews and do some research about the place: why do you want to stay there, is it for a business visit or a vacation, and how close is it to the places you want to visit? All these things matter a lot when choosing a hotel. Still, a recommender system like Tripadvisor helps us a lot to at least pick something from the many options available.
“When expectation doesn’t meet reality, people write bad reviews about places and restaurants.”
Ganesan, K. A., and C. X. Zhai, “Opinion-Based Entity Ranking”, Information Retrieval, 15(2), 116–150, 2012, Springer. http://kavita-ganesan.com/opinion-based-entity-ranking/#.XA4JRWhKjtQ