Final

Examination of Las Vegas hotel reviews by TripAdvisor Users

My parents love to travel to Las Vegas. They go at least twice a year. I thought it would be interesting to discover which hotel has the highest rating, and which features (star rating, pool, gym, tennis court, spa, casino, or free internet) might contribute the most.

The dataset comes from the UC Irvine Machine Learning Repository.

Load and explore data

First, let’s read in the CSV file, which I have already uploaded to my own GitHub repository. Then we can explore it a bit.

directory <- getwd()
url <- "https://raw.githubusercontent.com/EyeDen/las-vegas/master/LasVegasTripAdvisorReviews-Dataset.csv"
download.file(url = url, destfile = file.path(directory, "LasVegasTripAdvisorReviews-Dataset.csv"))

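# The file is semicolon-separated. stringsAsFactors = TRUE reproduces the
# factor columns shown below on R >= 4.0, where the default became FALSE.
vegas <- read.csv("LasVegasTripAdvisorReviews-Dataset.csv", sep = ";", stringsAsFactors = TRUE)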

head(vegas, 10)

##    User.country Nr..reviews Nr..hotel.reviews Helpful.votes Score
## 1           USA          11                 4            13     5
## 2           USA         119                21            75     3
## 3           USA          36                 9            25     5
## 4            UK          14                 7            14     4
## 5        Canada           5                 5             2     4
## 6        Canada          31                 8            27     3
## 7            UK          45                12            46     4
## 8           USA           2                 1             4     4
## 9         India          24                 3             8     4
## 10       Canada          12                 7            11     3
##    Period.of.stay Traveler.type Pool Gym Tennis.court Spa Casino
## 1         Dec-Feb       Friends   NO YES           NO  NO    YES
## 2         Dec-Feb      Business   NO YES           NO  NO    YES
## 3         Mar-May      Families   NO YES           NO  NO    YES
## 4         Mar-May       Friends   NO YES           NO  NO    YES
## 5         Mar-May          Solo   NO YES           NO  NO    YES
## 6         Mar-May       Couples   NO YES           NO  NO    YES
## 7         Mar-May       Couples   NO YES           NO  NO    YES
## 8         Mar-May      Families   NO YES           NO  NO    YES
## 9         Mar-May       Friends   NO YES           NO  NO    YES
## 10        Mar-May      Families   NO YES           NO  NO    YES
##    Free.internet                             Hotel.name Hotel.stars
## 1            YES Circus Circus Hotel & Casino Las Vegas           3
## 2            YES Circus Circus Hotel & Casino Las Vegas           3
## 3            YES Circus Circus Hotel & Casino Las Vegas           3
## 4            YES Circus Circus Hotel & Casino Las Vegas           3
## 5            YES Circus Circus Hotel & Casino Las Vegas           3
## 6            YES Circus Circus Hotel & Casino Las Vegas           3
## 7            YES Circus Circus Hotel & Casino Las Vegas           3
## 8            YES Circus Circus Hotel & Casino Las Vegas           3
## 9            YES Circus Circus Hotel & Casino Las Vegas           3
## 10           YES Circus Circus Hotel & Casino Las Vegas           3
##    Nr..rooms User.continent Member.years Review.month Review.weekday
## 1       3773  North America            9      January       Thursday
## 2       3773  North America            3      January         Friday
## 3       3773  North America            2     February       Saturday
## 4       3773         Europe            6     February         Friday
## 5       3773  North America            7        March        Tuesday
## 6       3773  North America            2        March        Tuesday
## 7       3773         Europe            4        April         Friday
## 8       3773  North America            0        April        Tuesday
## 9       3773           Asia            3          May       Saturday
## 10      3773  North America            5          May        Tuesday

summary(vegas)

##     User.country  Nr..reviews     Nr..hotel.reviews Helpful.votes   
##  USA      :217   Min.   :  1.00   Min.   :  0.00    Min.   :  0.00  
##  UK       : 72   1st Qu.: 12.00   1st Qu.:  5.00    1st Qu.:  8.00  
##  Canada   : 65   Median : 23.50   Median :  9.00    Median : 16.00  
##  Australia: 36   Mean   : 48.13   Mean   : 16.02    Mean   : 31.75  
##  Ireland  : 13   3rd Qu.: 54.25   3rd Qu.: 18.00    3rd Qu.: 35.00  
##  India    : 11   Max.   :775.00   Max.   :263.00    Max.   :365.00  
##  (Other)  : 90                                                      
##      Score       Period.of.stay  Traveler.type  Pool      Gym     
##  Min.   :1.000   Dec-Feb:124    Business: 74   NO : 24   NO : 24  
##  1st Qu.:4.000   Jun-Aug:126    Couples :214   YES:480   YES:480  
##  Median :4.000   Mar-May:128    Families:110                      
##  Mean   :4.123   Sep-Nov:126    Friends : 82                      
##  3rd Qu.:5.000                  Solo    : 24                      
##  Max.   :5.000                                                    
##                                                                   
##  Tennis.court  Spa      Casino    Free.internet
##  NO :384      NO :120   NO : 48   NO : 24      
##  YES:120      YES:384   YES:456   YES:480      
##                                                
##                                                
##                                                
##                                                
##                                                
##                                   Hotel.name  Hotel.stars   Nr..rooms   
##  Bellagio Las Vegas                    : 24   3  : 96     Min.   : 188  
##  Caesars Palace                        : 24   3,5: 72     1st Qu.: 826  
##  Circus Circus Hotel & Casino Las Vegas: 24   4  :120     Median :2700  
##  Encore at wynn Las Vegas              : 24   4,5: 24     Mean   :2196  
##  Excalibur Hotel & Casino              : 24   5  :192     3rd Qu.:3025  
##  Hilton Grand Vacations at the Flamingo: 24               Max.   :4027  
##  (Other)                               :360                             
##        User.continent  Member.years          Review.month   Review.weekday
##  Africa       :  7    Min.   :-1806.0000   April   : 42   Friday   :65    
##  Asia         : 36    1st Qu.:    2.0000   August  : 42   Monday   :74    
##  Europe       :118    Median :    4.0000   December: 42   Saturday :61    
##  North America:295    Mean   :    0.7679   February: 42   Sunday   :77    
##  Oceania      : 41    3rd Qu.:    6.0000   January : 42   Thursday :62    
##  South America:  7    Max.   :   13.0000   July    : 42   Tuesday  :80    
##                                            (Other) :252   Wednesday:85
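One thing stands out in the summary: the minimum of Member.years is -1806, which cannot be a real membership length, and it drags the mean down to about 0.77 while the median is 4. The analysis below does not use that column, so this is optional, but here is a minimal sketch of how such entries could be flagged and set to NA. It runs on a made-up vector rather than the real column.

```r
# Toy vector standing in for vegas$Member.years (values are illustrative only)
member_years <- c(9, 3, 2, -1806, 7, 0)

# Negative membership lengths are impossible, so treat them as missing
bad <- member_years < 0
member_years[bad] <- NA

sum(bad)                          # number of flagged entries
mean(member_years, na.rm = TRUE)  # mean without the flagged value
```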

# Calculate mean and median ratings for each hotel
aggregate(Score~Hotel.name, vegas, mean)

##                                             Hotel.name    Score
## 1                                   Bellagio Las Vegas 4.208333
## 2                                       Caesars Palace 4.125000
## 3               Circus Circus Hotel & Casino Las Vegas 3.208333
## 4                             Encore at wynn Las Vegas 4.541667
## 5                             Excalibur Hotel & Casino 3.708333
## 6               Hilton Grand Vacations at the Flamingo 3.958333
## 7              Hilton Grand Vacations on the Boulevard 4.166667
## 8                             Marriott's Grand Chateau 4.541667
## 9                            Monte Carlo Resort&Casino 3.291667
## 10                                     Paris Las Vegas 4.041667
## 11                          The Cosmopolitan Las Vegas 4.250000
## 12                                        The Cromwell 4.083333
## 13                     The Palazzo Resort Hotel Casino 4.375000
## 14                        The Venetian Las Vegas Hotel 4.583333
## 15             The Westin las Vegas Hotel Casino & Spa 3.916667
## 16                  Treasure Island- TI Hotel & Casino 3.958333
## 17 Tropicana Las Vegas - A Double Tree by Hilton Hotel 4.041667
## 18                 Trump International Hotel Las Vegas 4.375000
## 19                   Tuscany Las Vegas Suites & Casino 4.208333
## 20                                Wyndham Grand Desert 4.375000
## 21                                      Wynn Las Vegas 4.625000

aggregate(Score~Hotel.name, vegas, median)

##                                             Hotel.name Score
## 1                                   Bellagio Las Vegas   4.5
## 2                                       Caesars Palace   4.5
## 3               Circus Circus Hotel & Casino Las Vegas   3.0
## 4                             Encore at wynn Las Vegas   5.0
## 5                             Excalibur Hotel & Casino   4.0
## 6               Hilton Grand Vacations at the Flamingo   4.0
## 7              Hilton Grand Vacations on the Boulevard   4.5
## 8                             Marriott's Grand Chateau   5.0
## 9                            Monte Carlo Resort&Casino   3.5
## 10                                     Paris Las Vegas   4.0
## 11                          The Cosmopolitan Las Vegas   5.0
## 12                                        The Cromwell   4.5
## 13                     The Palazzo Resort Hotel Casino   5.0
## 14                        The Venetian Las Vegas Hotel   5.0
## 15             The Westin las Vegas Hotel Casino & Spa   4.0
## 16                  Treasure Island- TI Hotel & Casino   4.0
## 17 Tropicana Las Vegas - A Double Tree by Hilton Hotel   4.0
## 18                 Trump International Hotel Las Vegas   5.0
## 19                   Tuscany Las Vegas Suites & Casino   5.0
## 20                                Wyndham Grand Desert   4.5
## 21                                      Wynn Las Vegas   5.0

It’s interesting that so many hotels are scored so highly! Several hotels have a median score of 5, such as the Encore at Wynn, the Marriott’s Grand Chateau, and the Venetian. With 24 reviews per hotel, that means more than half of their ratings must have been a 5. Indeed, their mean scores are among the highest, at around 4.5. Sadly, this data set has no indicator of how expensive a hotel is. It would have been nice to see whether more expensive hotels are rated higher. For now, we will use each hotel’s star rating as a rough proxy for price.
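To see why a median of 5 implies that more than half of a hotel's 24 ratings are 5s, the individual scores can be rebuilt from per-score counts. The sketch below uses the counts listed for two hotels in the r1 to r5 columns of the wrangled table further down in this report. The helper name scores_from_counts is made up for illustration.

```r
# Hypothetical helper: expand counts of 1..5 ratings into individual scores
scores_from_counts <- function(counts) rep(1:5, times = counts)

tuscany  <- scores_from_counts(c(0, 1, 6, 4, 13))  # 13 of 24 ratings are 5s
bellagio <- scores_from_counts(c(0, 3, 1, 8, 12))  # exactly 12 of 24 are 5s

median(tuscany)   # 5: the 12th and 13th sorted values are both 5
median(bellagio)  # 4.5: the 12th sorted value is 4 and the 13th is 5
```

With an even number of reviews, the median is the average of the two middle values, so exactly half the ratings being 5 is not enough to reach a median of 5.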

Data Wrangling

Since we would like to focus on the effect amenities/star rating may have on hotel reviews, let’s make a subset of only the values we care about.

# Get all unique hotels and turn it into a vector
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

hotels <- vegas %>% distinct(Hotel.name)
hotels_vector <- as.character(hotels$Hotel.name)
hotels_vector

##  [1] "Circus Circus Hotel & Casino Las Vegas"             
##  [2] "Excalibur Hotel & Casino"                           
##  [3] "Monte Carlo Resort&Casino"                          
##  [4] "Treasure Island- TI Hotel & Casino"                 
##  [5] "Tropicana Las Vegas - A Double Tree by Hilton Hotel"
##  [6] "Caesars Palace"                                     
##  [7] "The Cosmopolitan Las Vegas"                         
##  [8] "The Palazzo Resort Hotel Casino"                    
##  [9] "Wynn Las Vegas"                                     
## [10] "Trump International Hotel Las Vegas"                
## [11] "The Cromwell"                                       
## [12] "Encore at wynn Las Vegas"                           
## [13] "Hilton Grand Vacations on the Boulevard"            
## [14] "Marriott's Grand Chateau"                           
## [15] "Tuscany Las Vegas Suites & Casino"                  
## [16] "Hilton Grand Vacations at the Flamingo"             
## [17] "Wyndham Grand Desert"                               
## [18] "The Venetian Las Vegas Hotel"                       
## [19] "Bellagio Las Vegas"                                 
## [20] "Paris Las Vegas"                                    
## [21] "The Westin las Vegas Hotel Casino & Spa"

# We see that there are 21 hotels in this data set.
# The issue, however, is that the hotel-level values are repeated in every
# review of the same hotel.
# i.e. For all entries of Circus Circus, the number of rooms doesn't change.
# That means we should condense this down to one entry per hotel, ideally.

sub_vegas <- vegas[, c(5, 8:16)]
temp <- subset(sub_vegas, Hotel.name == hotels_vector[1])
final_vegas <- temp[1,]

# Now we just have to repeat this 20 more times.
# Fortunately, we should be able to do this with a function.

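make_df <- function(df, v)
{
  # Keep Score (column 5) and the hotel-level columns Pool through Nr..rooms (8:16)
  sub_vegas <- df[, c(5, 8:16)]
  temp <- subset(sub_vegas, Hotel.name == v[1])
  final_vegas <- temp[1,]
  
  # Append the first row of every remaining hotel
  for(i in seq_along(v)[-1])
  {
    temp <- subset(sub_vegas, Hotel.name == v[i])
    final_vegas <- rbind(final_vegas, temp[1,])
  }
  
  return(final_vegas)
}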

final_vegas <- make_df(vegas, hotels_vector)

# Now let's change all the "YES" and "NO" values to 1 and 0
# Again, this will probably be easier as a function.

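YNconverter <- function(df)
{
  # Columns 2:7 are Pool, Gym, Tennis.court, Spa, Casino and Free.internet
  for(i in 2:7)
  {
    # Compare against "YES" directly so this works whether the column is a
    # factor or a character vector
    df[,i] <- factor(ifelse(df[,i] == "YES", 1, 0), levels = c(0, 1))
  }
  return(df)
}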

final_vegas <- YNconverter(final_vegas)

# Now we need the number of scores for each hotel.
# And we can delete the Score column.

get_ratings <- function(df1, df2, v)
{
  # df1 is the long table of counts (Score, Hotel.name, Freq)
  # df2 starts with the Score and Freq columns of the first hotel
  for(i in seq_along(v)[-1])
  {
    # Add one column of counts per remaining hotel, named after the hotel
    df2[[v[i]]] <- df1$Freq[df1$Hotel.name == v[i]]
  }
  
  return(df2)
}

final_vegas[1] <- NULL

hotel_ratings <- subset(vegas, select = c("Score", "Hotel.name"))
hotel_ratings <- as.data.frame(table(hotel_ratings))
scores <- subset(hotel_ratings, Hotel.name == hotels_vector[1], select = c("Score", "Freq"))
scores <- get_ratings(hotel_ratings, scores, hotels_vector)

# Now we need to transpose this subset of scores and join with the rest

t_ratings <- function(df1, df2, v)
{
  # df1 is scores
  # df2 is final
  for(i in 3:22)
  {
    temp <- as.data.frame(t(df1[,i]))
    z <- i - 1
    temp$x <- v[z]
    colnames(temp) <- c("r1", "r2", "r3", "r4", "r5", "Hotel.name")
    df2 <- rbind(df2, temp)
  }
  
  rownames(df2) <- NULL
  return(df2)
}

final <- as.data.frame(t(scores[,2]))
final$x <- hotels_vector[1]
colnames(final) <- c("r1", "r2", "r3", "r4", "r5", "Hotel.name")
final <- t_ratings(scores, final, hotels_vector)

# And now we can finally merge the two dataframes into our final data
unique_hotels <- merge(final_vegas, final, by = "Hotel.name")
unique_hotels

##                                             Hotel.name Pool Gym
## 1                                   Bellagio Las Vegas    1   1
## 2                                       Caesars Palace    1   1
## 3               Circus Circus Hotel & Casino Las Vegas    0   1
## 4                             Encore at wynn Las Vegas    1   1
## 5                             Excalibur Hotel & Casino    1   1
## 6               Hilton Grand Vacations at the Flamingo    1   1
## 7              Hilton Grand Vacations on the Boulevard    1   1
## 8                             Marriott's Grand Chateau    1   1
## 9                            Monte Carlo Resort&Casino    1   1
## 10                                     Paris Las Vegas    1   1
## 11                          The Cosmopolitan Las Vegas    1   1
## 12                                        The Cromwell    1   0
## 13                     The Palazzo Resort Hotel Casino    1   1
## 14                        The Venetian Las Vegas Hotel    1   1
## 15             The Westin las Vegas Hotel Casino & Spa    1   1
## 16                  Treasure Island- TI Hotel & Casino    1   1
## 17 Tropicana Las Vegas - A Double Tree by Hilton Hotel    1   1
## 18                 Trump International Hotel Las Vegas    1   1
## 19                   Tuscany Las Vegas Suites & Casino    1   1
## 20                                Wyndham Grand Desert    1   1
## 21                                      Wynn Las Vegas    1   1
##    Tennis.court Spa Casino Free.internet Hotel.stars Nr..rooms r1 r2 r3 r4
## 1             0   1      1             1           5      3933  0  3  1  8
## 2             0   1      1             1           5      3348  2  0  3  7
## 3             0   0      1             1           3      3773  2  4  7  9
## 4             0   1      1             1           5      2034  1  0  1  5
## 5             0   1      1             1           3      3981  0  1  9 10
## 6             0   0      0             1           3       315  0  2  6  7
## 7             0   1      1             1         3,5      1228  1  2  1  8
## 8             0   0      1             1         3,5       732  0  0  1  9
## 9             0   1      1             0           4      3003  1  5  6 10
## 10            0   1      1             1           4      2916  0  3  3  8
## 11            0   1      1             1           5      2959  1  3  0  5
## 12            0   0      1             1         4,5       188  1  2  3  6
## 13            0   1      1             1           5      3025  0  0  4  7
## 14            0   1      1             1           5      4027  0  0  1  8
## 15            0   1      1             1           4       826  0  1  6 11
## 16            1   1      1             1           4      2884  0  0  6 13
## 17            1   1      1             1           4      1467  1  1  3 10
## 18            0   1      1             1           5      1282  1  1  1  6
## 19            1   1      1             1           3       716  0  1  6  4
## 20            1   0      0             1         3,5       787  0  0  3  9
## 21            1   1      1             1           5      2700  0  1  1  4
##    r5
## 1  12
## 2  12
## 3   2
## 4  17
## 5   4
## 6   9
## 7  12
## 8  14
## 9   2
## 10 10
## 11 15
## 12 12
## 13 13
## 14 15
## 15  6
## 16  5
## 17  9
## 18 15
## 19 13
## 20 12
## 21 18

# But we also need a subset with all the scores intact.
hotel_scores <- vegas[, c(5, 8:16)]
hotel_scores <- YNconverter(hotel_scores)
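The loops above could also be written more compactly. The following is a sketch on a small made-up data frame (toy, not the real vegas data) that keeps one row per hotel with duplicated() and counts each score per hotel with table(). It is an alternative under those assumptions, not the code that produced the output above.

```r
# Toy stand-in for the review data; hotel names and values are invented
toy <- data.frame(
  Hotel.name = c("A", "A", "A", "B", "B", "B"),
  Score      = c(5, 4, 5, 3, 3, 1),
  Pool       = c("YES", "YES", "YES", "NO", "NO", "NO"),
  stringsAsFactors = FALSE
)

# One row per hotel: keep the first occurrence of each hotel name
per_hotel <- toy[!duplicated(toy$Hotel.name), c("Hotel.name", "Pool")]
per_hotel$Pool <- as.integer(per_hotel$Pool == "YES")  # YES/NO -> 1/0

# Count how often each score 1..5 occurs per hotel (zero counts included)
counts <- table(toy$Hotel.name, factor(toy$Score, levels = 1:5))
counts_df <- as.data.frame.matrix(counts)
names(counts_df) <- paste0("r", 1:5)
counts_df$Hotel.name <- rownames(counts_df)

# Join the hotel-level columns with the score counts
unique_toy <- merge(per_hotel, counts_df, by = "Hotel.name")
unique_toy
```

Fixing the score levels to 1:5 keeps a column for every rating even when no review in the data used it.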

Graphics

Now that we have our final data, we can attempt to see if there’s any correlation between amenities offered by hotels and their TripAdvisor ratings.

library(ggplot2)
library(reshape2)

# Boxplot
ggplot(hotel_scores, aes(y = Score, x = Hotel.name)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + labs(title = "Boxplot of Hotel Scores", x = "Hotels", y = "Scores")

# As we see in the boxplot, several hotels have medians at the maximum score of 5.
# Very impressive! Now let's see how the scores break down for each individual hotel.
temp <- unique_hotels[,c("Hotel.name", "r1", "r2", "r3", "r4", "r5")]
melted_data <- melt(temp, id.vars = "Hotel.name")

ggplot(melted_data, aes(Hotel.name, value, col = variable)) + geom_point() + stat_smooth() + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + labs(title="Number of scores for each hotel", x = "Hotels", y = "Number of scores")

## `geom_smooth()` using method = 'loess'

# This is a lot of data, but we see some interesting things.
# Encore at Wynn, Cosmopolitan, Venetian, Trump International, and the Wynn all
# have earned 15 or more 5 ratings from TripAdvisor users.
# Let's see if amenities might have helped them.

ggplot(unique_hotels, aes(x = Pool)) + geom_bar() + ggtitle("Number of hotels with a pool")

ggplot(unique_hotels, aes(x = Gym)) + geom_bar() + ggtitle("Number of hotels with a gym")

ggplot(unique_hotels, aes(x = Tennis.court)) + geom_bar() + ggtitle("Number of hotels with a tennis court")

ggplot(unique_hotels, aes(x = Spa)) + geom_bar() + ggtitle("Number of hotels with a spa")

ggplot(unique_hotels, aes(x = Casino)) + geom_bar() + ggtitle("Number of hotels with a casino")

ggplot(unique_hotels, aes(x = Free.internet)) + geom_bar() + ggtitle("Number of hotels with free internet")

# Well, nearly every hotel has a pool, gym, casino, and free internet.
# Only the spa (16 of 21 hotels) and the tennis court (5 of 21) vary much.
# Let's focus on star ratings instead.
high_list <- c("Encore at wynn Las Vegas", "The Cosmopolitan Las Vegas", "The Venetian Las Vegas Hotel", "Trump International Hotel Las Vegas", "Wynn Las Vegas")

best_hotels <- hotel_scores[hotel_scores$Hotel.name %in% high_list, ]
best_unique <- unique_hotels[unique_hotels$Hotel.name %in% high_list, ]

ggplot(best_unique, aes(Hotel.stars)) + geom_bar() + ggtitle("Quality of hotel stars among the highest rated hotels")

# So among our highest rated hotels, all of them are considered to be 5-star!
# Let's also examine the number of rooms, to see if that has any impact

ggplot(best_unique, aes(x = Hotel.name, y = Nr..rooms)) + geom_point() + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + ggtitle("Number of rooms for each hotel")

# There is a wide disparity in the number of rooms. The Venetian has over 4000,
# while Trump International has fewer than 1300! Yet they earned the same
# number of 5 ratings from TripAdvisor users!
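Hotel.stars is not numeric in this data because half-star ratings use a decimal comma (3,5 and 4,5), so it cannot be averaged or correlated directly. Below is a minimal sketch of converting it to numbers and comparing mean scores by star rating. It uses six hotels whose stars and mean scores are copied (rounded to three decimals) from the tables above. The data frame name stars_demo and the shortened hotel names are made up for illustration.

```r
# Six hotels copied from the tables above; names shortened for readability
stars_demo <- data.frame(
  Hotel.name  = c("Circus Circus", "Excalibur", "Hilton Boulevard",
                  "The Cromwell", "The Venetian", "Wynn"),
  Hotel.stars = c("3", "3", "3,5", "4,5", "5", "5"),
  mean_score  = c(3.208, 3.708, 4.167, 4.083, 4.583, 4.625),
  stringsAsFactors = FALSE
)

# Swap the decimal comma for a point, then convert to numeric
stars_demo$stars_num <- as.numeric(sub(",", ".", stars_demo$Hotel.stars, fixed = TRUE))

# Average of the hotel-level mean scores within each star rating
by_stars <- aggregate(mean_score ~ stars_num, data = stars_demo, FUN = mean)
by_stars
```

With only six hotels this says little on its own. It mainly shows the conversion step that any numeric treatment of the star rating would need.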

Conclusion

In short, our data isn’t of high enough quality to evaluate whether amenities affect TripAdvisor ratings.

We can roughly conclude that the star rating of a hotel might influence a visitor, as the hotels with the most 5 ratings were all 5-star hotels. Sadly, most of the amenity data is binary. It would have been better to know whether the quality of the amenities affected ratings. Perhaps the 5-star hotels have better internet, gyms, pools, or casinos. Perhaps the quality of Trump International’s rooms allowed it to match the massive Venetian. We can’t really tell with this data.

Still, it might be that the old adage is true: you get what you pay for.