My parents love to travel to Las Vegas. They go at least twice a year. I thought it would be interesting to discover which hotel has the highest rating, and which features (star rating, pool, gym, tennis court, spa, casino, or free internet) might contribute the most.
The dataset in question comes from the UC Irvine Machine Learning Repository.
First, let’s read in the CSV, which I have already uploaded to my own GitHub repository. Then we can explore it a bit.
directory <- getwd()
url <- "https://raw.githubusercontent.com/EyeDen/las-vegas/master/LasVegasTripAdvisorReviews-Dataset.csv"
destfile <- file.path(directory, "LasVegasTripAdvisorReviews-Dataset.csv")
download.file(url = url, destfile = destfile)
# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0),
# which the summary() output below assumes
vegas <- read.csv(destfile, sep = ';', stringsAsFactors = TRUE)
head(vegas, 10)
## User.country Nr..reviews Nr..hotel.reviews Helpful.votes Score
## 1 USA 11 4 13 5
## 2 USA 119 21 75 3
## 3 USA 36 9 25 5
## 4 UK 14 7 14 4
## 5 Canada 5 5 2 4
## 6 Canada 31 8 27 3
## 7 UK 45 12 46 4
## 8 USA 2 1 4 4
## 9 India 24 3 8 4
## 10 Canada 12 7 11 3
## Period.of.stay Traveler.type Pool Gym Tennis.court Spa Casino
## 1 Dec-Feb Friends NO YES NO NO YES
## 2 Dec-Feb Business NO YES NO NO YES
## 3 Mar-May Families NO YES NO NO YES
## 4 Mar-May Friends NO YES NO NO YES
## 5 Mar-May Solo NO YES NO NO YES
## 6 Mar-May Couples NO YES NO NO YES
## 7 Mar-May Couples NO YES NO NO YES
## 8 Mar-May Families NO YES NO NO YES
## 9 Mar-May Friends NO YES NO NO YES
## 10 Mar-May Families NO YES NO NO YES
## Free.internet Hotel.name Hotel.stars
## 1 YES Circus Circus Hotel & Casino Las Vegas 3
## 2 YES Circus Circus Hotel & Casino Las Vegas 3
## 3 YES Circus Circus Hotel & Casino Las Vegas 3
## 4 YES Circus Circus Hotel & Casino Las Vegas 3
## 5 YES Circus Circus Hotel & Casino Las Vegas 3
## 6 YES Circus Circus Hotel & Casino Las Vegas 3
## 7 YES Circus Circus Hotel & Casino Las Vegas 3
## 8 YES Circus Circus Hotel & Casino Las Vegas 3
## 9 YES Circus Circus Hotel & Casino Las Vegas 3
## 10 YES Circus Circus Hotel & Casino Las Vegas 3
## Nr..rooms User.continent Member.years Review.month Review.weekday
## 1 3773 North America 9 January Thursday
## 2 3773 North America 3 January Friday
## 3 3773 North America 2 February Saturday
## 4 3773 Europe 6 February Friday
## 5 3773 North America 7 March Tuesday
## 6 3773 North America 2 March Tuesday
## 7 3773 Europe 4 April Friday
## 8 3773 North America 0 April Tuesday
## 9 3773 Asia 3 May Saturday
## 10 3773 North America 5 May Tuesday
summary(vegas)
## User.country Nr..reviews Nr..hotel.reviews Helpful.votes
## USA :217 Min. : 1.00 Min. : 0.00 Min. : 0.00
## UK : 72 1st Qu.: 12.00 1st Qu.: 5.00 1st Qu.: 8.00
## Canada : 65 Median : 23.50 Median : 9.00 Median : 16.00
## Australia: 36 Mean : 48.13 Mean : 16.02 Mean : 31.75
## Ireland : 13 3rd Qu.: 54.25 3rd Qu.: 18.00 3rd Qu.: 35.00
## India : 11 Max. :775.00 Max. :263.00 Max. :365.00
## (Other) : 90
## Score Period.of.stay Traveler.type Pool Gym
## Min. :1.000 Dec-Feb:124 Business: 74 NO : 24 NO : 24
## 1st Qu.:4.000 Jun-Aug:126 Couples :214 YES:480 YES:480
## Median :4.000 Mar-May:128 Families:110
## Mean :4.123 Sep-Nov:126 Friends : 82
## 3rd Qu.:5.000 Solo : 24
## Max. :5.000
##
## Tennis.court Spa Casino Free.internet
## NO :384 NO :120 NO : 48 NO : 24
## YES:120 YES:384 YES:456 YES:480
##
##
##
##
##
## Hotel.name Hotel.stars Nr..rooms
## Bellagio Las Vegas : 24 3 : 96 Min. : 188
## Caesars Palace : 24 3,5: 72 1st Qu.: 826
## Circus Circus Hotel & Casino Las Vegas: 24 4 :120 Median :2700
## Encore at wynn Las Vegas : 24 4,5: 24 Mean :2196
## Excalibur Hotel & Casino : 24 5 :192 3rd Qu.:3025
## Hilton Grand Vacations at the Flamingo: 24 Max. :4027
## (Other) :360
## User.continent Member.years Review.month Review.weekday
## Africa : 7 Min. :-1806.0000 April : 42 Friday :65
## Asia : 36 1st Qu.: 2.0000 August : 42 Monday :74
## Europe :118 Median : 4.0000 December: 42 Saturday :61
## North America:295 Mean : 0.7679 February: 42 Sunday :77
## Oceania : 41 3rd Qu.: 6.0000 January : 42 Thursday :62
## South America: 7 Max. : 13.0000 July : 42 Tuesday :80
## (Other) :252 Wednesday:85
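The summary flags two things worth handling before any numeric work: Member.years has a minimum of -1806, which cannot be a real membership length, and Hotel.stars uses a comma as the decimal mark (3,5 and 4,5), so R treats it as a set of categories. Here is a minimal sketch of one way to clean both, shown on a small made-up data frame (the `toy` name, the `Stars.num` column, and all values are illustrative, not taken from the dataset):

```r
# Toy stand-in for the two affected columns; values are invented for illustration
toy <- data.frame(
  Hotel.stars  = c("3", "3,5", "4,5", "5"),
  Member.years = c(9, -1806, 2, 0),
  stringsAsFactors = TRUE
)

# Swap the decimal comma for a point, then convert to numeric.
# as.character() comes first because as.numeric() on a factor returns level codes.
toy$Stars.num <- as.numeric(sub(",", ".", as.character(toy$Hotel.stars), fixed = TRUE))

# Treat impossible negative membership lengths as missing
toy$Member.years[toy$Member.years < 0] <- NA

toy
```

Neither column feeds into the steps that follow as a number, so this cleanup is optional here, but it would matter for anything that averages or models these fields.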
# Calculate mean and median ratings for each hotel
aggregate(Score~Hotel.name, vegas, mean)
## Hotel.name Score
## 1 Bellagio Las Vegas 4.208333
## 2 Caesars Palace 4.125000
## 3 Circus Circus Hotel & Casino Las Vegas 3.208333
## 4 Encore at wynn Las Vegas 4.541667
## 5 Excalibur Hotel & Casino 3.708333
## 6 Hilton Grand Vacations at the Flamingo 3.958333
## 7 Hilton Grand Vacations on the Boulevard 4.166667
## 8 Marriott's Grand Chateau 4.541667
## 9 Monte Carlo Resort&Casino 3.291667
## 10 Paris Las Vegas 4.041667
## 11 The Cosmopolitan Las Vegas 4.250000
## 12 The Cromwell 4.083333
## 13 The Palazzo Resort Hotel Casino 4.375000
## 14 The Venetian Las Vegas Hotel 4.583333
## 15 The Westin las Vegas Hotel Casino & Spa 3.916667
## 16 Treasure Island- TI Hotel & Casino 3.958333
## 17 Tropicana Las Vegas - A Double Tree by Hilton Hotel 4.041667
## 18 Trump International Hotel Las Vegas 4.375000
## 19 Tuscany Las Vegas Suites & Casino 4.208333
## 20 Wyndham Grand Desert 4.375000
## 21 Wynn Las Vegas 4.625000
aggregate(Score~Hotel.name, vegas, median)
## Hotel.name Score
## 1 Bellagio Las Vegas 4.5
## 2 Caesars Palace 4.5
## 3 Circus Circus Hotel & Casino Las Vegas 3.0
## 4 Encore at wynn Las Vegas 5.0
## 5 Excalibur Hotel & Casino 4.0
## 6 Hilton Grand Vacations at the Flamingo 4.0
## 7 Hilton Grand Vacations on the Boulevard 4.5
## 8 Marriott's Grand Chateau 5.0
## 9 Monte Carlo Resort&Casino 3.5
## 10 Paris Las Vegas 4.0
## 11 The Cosmopolitan Las Vegas 5.0
## 12 The Cromwell 4.5
## 13 The Palazzo Resort Hotel Casino 5.0
## 14 The Venetian Las Vegas Hotel 5.0
## 15 The Westin las Vegas Hotel Casino & Spa 4.0
## 16 Treasure Island- TI Hotel & Casino 4.0
## 17 Tropicana Las Vegas - A Double Tree by Hilton Hotel 4.0
## 18 Trump International Hotel Las Vegas 5.0
## 19 Tuscany Las Vegas Suites & Casino 5.0
## 20 Wyndham Grand Desert 4.5
## 21 Wynn Las Vegas 5.0
It’s interesting that so many hotels score so highly! Several hotels have a median score of 5, such as the Encore at Wynn, Marriott’s Grand Chateau, and the Venetian. With 24 reviews per hotel, a median of 5 means that more than half of the ratings were a 5. Indeed, their mean scores are among the highest, at around 4.5. Sadly, this data set does not have an indicator of how expensive a hotel is. It would have been nice to see whether expensive hotels are rated higher. For now, we will use each hotel’s star rating as a proxy for price.
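To check the "more than half" reasoning directly, the share of 5 ratings can be computed alongside the mean and median. This is a minimal sketch on made-up scores (the `reviews` data frame, the hotel names A and B, and the `share5` statistic are hypothetical names of my own):

```r
reviews <- data.frame(
  Hotel.name = rep(c("A", "B"), each = 6),
  Score      = c(5, 5, 5, 5, 4, 3,   # hotel A: four of six ratings are 5
                 5, 4, 4, 3, 3, 2)   # hotel B: one of six ratings is 5
)

# One row per hotel with mean, median, and proportion of 5 ratings
stats <- aggregate(Score ~ Hotel.name, reviews,
                   function(x) c(mean = mean(x), median = median(x), share5 = mean(x == 5)))

# aggregate() stores the three statistics as one matrix column; flatten it
stats <- do.call(data.frame, stats)
stats
```

Applied to `vegas`, the same call would show at a glance which hotels with a median of 5 also clear the halfway mark.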
Since we would like to focus on the effect amenities/star rating may have on hotel reviews, let’s make a subset of only the values we care about.
# Get all unique hotels and turn them into a vector
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
hotels <- vegas %>% distinct(Hotel.name)
# as.character() drops the factor levels and leaves a plain character vector
hotels_vector <- as.character(hotels$Hotel.name)
hotels_vector
## [1] "Circus Circus Hotel & Casino Las Vegas"
## [2] "Excalibur Hotel & Casino"
## [3] "Monte Carlo Resort&Casino"
## [4] "Treasure Island- TI Hotel & Casino"
## [5] "Tropicana Las Vegas - A Double Tree by Hilton Hotel"
## [6] "Caesars Palace"
## [7] "The Cosmopolitan Las Vegas"
## [8] "The Palazzo Resort Hotel Casino"
## [9] "Wynn Las Vegas"
## [10] "Trump International Hotel Las Vegas"
## [11] "The Cromwell"
## [12] "Encore at wynn Las Vegas"
## [13] "Hilton Grand Vacations on the Boulevard"
## [14] "Marriott's Grand Chateau"
## [15] "Tuscany Las Vegas Suites & Casino"
## [16] "Hilton Grand Vacations at the Flamingo"
## [17] "Wyndham Grand Desert"
## [18] "The Venetian Las Vegas Hotel"
## [19] "Bellagio Las Vegas"
## [20] "Paris Las Vegas"
## [21] "The Westin las Vegas Hotel Casino & Spa"
# We see that there are 21 hotels in this data set.
# The issue, however, is that the hotel-level values are identical for every review of a hotel.
# i.e. For all entries of Circus Circus, the number of rooms doesn't change.
# That means we should condense this down to one entry per hotel, ideally.
sub_vegas <- vegas[, c(5, 8:16)]
temp <- subset(sub_vegas, Hotel.name == hotels_vector[1])
final_vegas <- temp[1,]
# Now we just have to repeat this 20 more times.
# Fortunately, we should be able to do this with a function.
make_df <- function(df, v)
{
sub_vegas <- df[, c(5, 8:16)]
temp <- subset(sub_vegas, Hotel.name == v[1])
final_vegas <- temp[1,]
for(i in 2:21)
{
temp <- subset(sub_vegas, Hotel.name == v[i])
final_vegas <- rbind(final_vegas, temp[1,])
}
return(final_vegas)
}
final_vegas <- make_df(vegas, hotels_vector)
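Because every row for a given hotel repeats the same amenity values, keeping the first row per hotel gives the same result as the loop. Below is a shorter alternative sketch using base R's `duplicated()`, on a made-up data frame (`toy`, `one_per_hotel`, and the values are illustrative):

```r
toy <- data.frame(
  Score      = c(5, 3, 4, 2),
  Pool       = c("NO", "NO", "YES", "YES"),
  Hotel.name = c("Circus Circus", "Circus Circus", "Excalibur", "Excalibur")
)

# duplicated() marks every repeat of a hotel name after its first appearance,
# so negating it keeps exactly one row per hotel
one_per_hotel <- toy[!duplicated(toy$Hotel.name), ]
one_per_hotel
```

This avoids hard-coding the number of hotels, so it would keep working if hotels were added or removed.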
# Now let's change all the "YES" and "NO" values to 1 and 0
# Again, this will probably be easier as a function.
YNconverter <- function(df)
{
  for(i in 2:7)
  {
    # Compare against "YES" explicitly so that a column containing only
    # one of the two values is still coded correctly
    df[,i] <- factor(ifelse(df[,i] == "YES", 1, 0), levels = c(0, 1))
  }
  return(df)
}
final_vegas <- YNconverter(final_vegas)
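The same conversion can also be written without a loop or positional indices, by naming the amenity columns and applying one function to all of them. This is a minimal sketch on made-up data (`toy` and `amenities` are hypothetical names); note that it produces plain integers, so a column would need wrapping in `factor()` wherever a categorical variable is expected:

```r
toy <- data.frame(
  Hotel.name = c("A", "B", "C"),
  Pool       = c("NO", "YES", "YES"),
  Gym        = c("YES", "YES", "YES"),
  stringsAsFactors = TRUE
)

amenities <- c("Pool", "Gym")

# x == "YES" gives TRUE/FALSE; as.integer() turns that into 1/0
toy[amenities] <- lapply(toy[amenities], function(x) as.integer(x == "YES"))
toy
```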
# Now we need the number of scores for each hotel.
# And we can delete the Score column.
get_ratings <- function(df1, df2, v)
{
for(i in 2:21)
{
df2$x <- subset(df1, Hotel.name == v[i], select = c("Freq"))
# Renaming doesn't work, but is necessary for other columns to be added
colnames(df2)[i+1] <- v[i]
}
return(df2)
}
final_vegas[1] <- NULL
hotel_ratings <- subset(vegas, select = c("Score", "Hotel.name"))
hotel_ratings <- as.data.frame(table(hotel_ratings))
scores <- subset(hotel_ratings, Hotel.name == hotels_vector[1], select = c("Score", "Freq"))
scores <- get_ratings(hotel_ratings, scores, hotels_vector)
# Now we need to transpose this subset of scores and join with the rest
t_ratings <- function(df1, df2, v)
{
# df1 is scores
# df2 is final
for(i in 3:22)
{
temp <- as.data.frame(t(df1[,i]))
z <- i - 1
temp$x <- v[z]
colnames(temp) <- c("r1", "r2", "r3", "r4", "r5", "Hotel.name")
df2 <- rbind(df2, temp)
}
rownames(df2) <- NULL
return(df2)
}
final <- as.data.frame(t(scores[,2]))
final$x <- hotels_vector[1]
colnames(final) <- c("r1", "r2", "r3", "r4", "r5", "Hotel.name")
final <- t_ratings(scores, final, hotels_vector)
# And now we can finally merge the two dataframes into our final data
unique_hotels <- merge(final_vegas, final, by = "Hotel.name")
unique_hotels
## Hotel.name Pool Gym
## 1 Bellagio Las Vegas 1 1
## 2 Caesars Palace 1 1
## 3 Circus Circus Hotel & Casino Las Vegas 0 1
## 4 Encore at wynn Las Vegas 1 1
## 5 Excalibur Hotel & Casino 1 1
## 6 Hilton Grand Vacations at the Flamingo 1 1
## 7 Hilton Grand Vacations on the Boulevard 1 1
## 8 Marriott's Grand Chateau 1 1
## 9 Monte Carlo Resort&Casino 1 1
## 10 Paris Las Vegas 1 1
## 11 The Cosmopolitan Las Vegas 1 1
## 12 The Cromwell 1 0
## 13 The Palazzo Resort Hotel Casino 1 1
## 14 The Venetian Las Vegas Hotel 1 1
## 15 The Westin las Vegas Hotel Casino & Spa 1 1
## 16 Treasure Island- TI Hotel & Casino 1 1
## 17 Tropicana Las Vegas - A Double Tree by Hilton Hotel 1 1
## 18 Trump International Hotel Las Vegas 1 1
## 19 Tuscany Las Vegas Suites & Casino 1 1
## 20 Wyndham Grand Desert 1 1
## 21 Wynn Las Vegas 1 1
## Tennis.court Spa Casino Free.internet Hotel.stars Nr..rooms r1 r2 r3 r4
## 1 0 1 1 1 5 3933 0 3 1 8
## 2 0 1 1 1 5 3348 2 0 3 7
## 3 0 0 1 1 3 3773 2 4 7 9
## 4 0 1 1 1 5 2034 1 0 1 5
## 5 0 1 1 1 3 3981 0 1 9 10
## 6 0 0 0 1 3 315 0 2 6 7
## 7 0 1 1 1 3,5 1228 1 2 1 8
## 8 0 0 1 1 3,5 732 0 0 1 9
## 9 0 1 1 0 4 3003 1 5 6 10
## 10 0 1 1 1 4 2916 0 3 3 8
## 11 0 1 1 1 5 2959 1 3 0 5
## 12 0 0 1 1 4,5 188 1 2 3 6
## 13 0 1 1 1 5 3025 0 0 4 7
## 14 0 1 1 1 5 4027 0 0 1 8
## 15 0 1 1 1 4 826 0 1 6 11
## 16 1 1 1 1 4 2884 0 0 6 13
## 17 1 1 1 1 4 1467 1 1 3 10
## 18 0 1 1 1 5 1282 1 1 1 6
## 19 1 1 1 1 3 716 0 1 6 4
## 20 1 0 0 1 3,5 787 0 0 3 9
## 21 1 1 1 1 5 2700 0 1 1 4
## r5
## 1 12
## 2 12
## 3 2
## 4 17
## 5 4
## 6 9
## 7 12
## 8 14
## 9 2
## 10 10
## 11 15
## 12 12
## 13 13
## 14 15
## 15 6
## 16 5
## 17 9
## 18 15
## 19 13
## 20 12
## 21 18
# But we also need a subset with all the scores intact.
hotel_scores <- vegas[, c(5, 8:16)]
hotel_scores <- YNconverter(hotel_scores)
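The per-hotel counts of each score (the r1 to r5 columns) can also be produced in one step with a two-way table. This is a sketch on made-up data, assuming a plain cross-tabulation is all that is needed (`reviews` and `counts` are illustrative names):

```r
reviews <- data.frame(
  Hotel.name = c("A", "A", "A", "B", "B"),
  Score      = c(5, 5, 3, 4, 1)
)

# Fix the levels at 1:5 so that scores nobody gave still get a zero column
reviews$Score <- factor(reviews$Score, levels = 1:5)

# Rows are hotels, columns are scores, cells are counts
counts <- as.data.frame.matrix(table(reviews$Hotel.name, reviews$Score))
colnames(counts) <- paste0("r", 1:5)
counts$Hotel.name <- rownames(counts)
rownames(counts) <- NULL
counts
```

The result has the same shape as the transposed-and-bound data frame built above, ready to merge on Hotel.name.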
Now that we have our final data, we can attempt to see if there’s any correlation between amenities offered by hotels and their TripAdvisor ratings.
library(ggplot2)
library(reshape2)
# Boxplot
ggplot(hotel_scores, aes(y = Score, x = Hotel.name)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + labs(title = "Boxplot of Hotel Scores", x = "Hotels", y = "Scores")
# As we see in the boxplot, several hotels have medians at the maximum score of 5.
# Very impressive! Now let's see how the scores break down for each individual hotel.
temp <- unique_hotels[,c("Hotel.name", "r1", "r2", "r3", "r4", "r5")]
melted_data <- melt(temp, id.vars = "Hotel.name")
ggplot(melted_data, aes(Hotel.name, value, col = variable)) + geom_point() + stat_smooth() + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + labs(title="Number of scores for each hotel", x = "Hotels", y = "Number of scores")
## `geom_smooth()` using method = 'loess'
# This is a lot of data, but we see some interesting things.
# Encore at Wynn, Cosmopolitan, Venetian, Trump International, and the Wynn all
# have earned 15 or more 5 ratings from TripAdvisor users.
# Let's see if amenities might have helped them.
ggplot(unique_hotels, aes(x = Pool)) + geom_bar() + ggtitle("Number of hotels with a pool")
ggplot(unique_hotels, aes(x = Gym)) + geom_bar() + ggtitle("Number of hotels with a gym")
ggplot(unique_hotels, aes(x = Tennis.court)) + geom_bar() + ggtitle("Number of hotels with a tennis court")
ggplot(unique_hotels, aes(x = Spa)) + geom_bar() + ggtitle("Number of hotels with a spa")
ggplot(unique_hotels, aes(x = Casino)) + geom_bar() + ggtitle("Number of hotels with a casino")
ggplot(unique_hotels, aes(x = Free.internet)) + geom_bar() + ggtitle("Number of hotels with free internet")
# Well, apart from tennis courts (only 5 hotels have one) and spas (16 of 21),
# nearly every hotel offers the same amenities.
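The six bar charts above can also be condensed into a single table of counts, which makes the amenities easier to compare side by side. Here is a minimal sketch on made-up 0/1 factors (`toy` and `tally` are hypothetical names; the real columns would come from `unique_hotels`):

```r
toy <- data.frame(
  Pool         = factor(c(1, 1, 0), levels = c(0, 1)),
  Spa          = factor(c(1, 0, 0), levels = c(0, 1)),
  Tennis.court = factor(c(0, 0, 1), levels = c(0, 1))
)

# One column per amenity; row "0" counts hotels without it, row "1" hotels with it
tally <- sapply(toy, table)
tally
```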
# Let's focus on star ratings
high_list <- c("Encore at wynn Las Vegas", "The Cosmopolitan Las Vegas", "The Venetian Las Vegas Hotel", "Trump International Hotel Las Vegas", "Wynn Las Vegas")
best_hotels <- hotel_scores[hotel_scores$Hotel.name %in% high_list, ]
best_unique <- unique_hotels[unique_hotels$Hotel.name %in% high_list, ]
ggplot(best_unique, aes(Hotel.stars)) + geom_bar() + ggtitle("Star ratings of the highest rated hotels")
# So among our highest rated hotels, all of them are considered to be 5-star!
# Let's also examine the number of rooms, to see if that has any impact
ggplot(best_unique, aes(x = Hotel.name, y = Nr..rooms)) + geom_point() + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + ggtitle("Number of rooms for each hotel")
# There is a wide disparity in the number of rooms. The Venetian has over 4000,
# while Trump International has a little over 1000! Yet, they earned the same
# number of 5 ratings from TripAdvisor visitors!
In short, our data isn’t detailed enough to evaluate whether amenities affect TripAdvisor ratings.
We can roughly conclude that the star rating of a hotel might influence a visitor, as the top rated hotels were all 5-star hotels. Sadly, most of the amenity data is binary. It would have been better to know if the quality of the amenities affected ratings. Perhaps the hotels that earned 5-star ratings have better internet, gyms, pools, or casinos. Perhaps the quality of Trump International’s rooms allowed it to match the massive Venetian. We can’t really tell with this data.
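One rough way to put a number on the star-rating observation is to group the per-hotel mean scores reported earlier by star rating. The sketch below retypes those 21 means and star ratings by hand (with 3,5 and 4,5 written as 3.5 and 4.5); the vector names are my own, and with only 21 hotels this is a descriptive check, not a significance test:

```r
# Mean score per hotel, in the alphabetical order of the aggregate() output
mean_score <- c(4.208333, 4.125000, 3.208333, 4.541667, 3.708333, 3.958333,
                4.166667, 4.541667, 3.291667, 4.041667, 4.250000, 4.083333,
                4.375000, 4.583333, 3.916667, 3.958333, 4.041667, 4.375000,
                4.208333, 4.375000, 4.625000)

# Star rating per hotel, in the same order, from the unique_hotels table
stars <- c(5, 5, 3, 5, 3, 3, 3.5, 3.5, 4, 4, 5, 4.5, 5, 5, 4, 4, 4, 5, 3, 3.5, 5)

# Average of the hotel means within each star level
by_stars <- tapply(mean_score, stars, mean)
by_stars

# Pearson correlation between star rating and mean score
cor(stars, mean_score)
```

A positive correlation here would be consistent with the star-rating observation, though it cannot separate the effect of stars from price or anything else that travels with them.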
Still, it might be that the old adage is true: you get what you pay for.