“The data reveal the story.” This principle guides the project, which uses data to better understand guest satisfaction and to support more informed lodging decisions. Guests who choose four-star hotels typically have higher expectations regarding service, comfort, and overall experience. Rather than comparing hotels in general, this study focuses on U.S. four-star hotels to identify which characteristics are associated with higher guest satisfaction within a market where quality expectations are already high.
This study analyzes the Booking.com Accommodation Review Dataset, which contains approximately 1.6 million reviews collected from accommodations worldwide. The final sample includes 30,175 complete observations from four-star hotels located in the United States. Multiple Linear Regression is used to examine whether factors such as accommodation type, room nights, accommodation score, helpful votes, and location characteristics are associated with guest review scores. The objective is to provide evidence-based insights that help travelers make better lodging decisions.
Choosing a four-star hotel does not always guarantee a superior guest experience. Travelers often ask:
Human Question: What characteristics are associated with a better guest experience in U.S. four-star hotels?
To answer this question objectively, the following analytical question was developed:
Analytical Question: What factors are associated with higher guest satisfaction in U.S. four-star hotels?
Note*: This study identifies statistical associations rather than causal relationships.
Before fitting the regression model, the dataset was prepared to ensure that only observations relevant to the research questions were included. The data preparation process consisted of loading the required library, filtering the dataset to include only four-star hotels located in the United States, selecting the variables used in the analysis, removing observations with missing values, and verifying the final sample size. After these data-cleaning steps, the final analytical dataset contained 30,175 complete observations ready for multiple linear regression.
library(dplyr)
To keep the comparison within a single quality tier, the dataset was filtered to accommodations with a four-star rating. This filter keeps every property type rated four stars (hotels, resorts, inns, and so on), which is why accommodation type is later used as a predictor.
Booking_clean <- BookingReviews_train %>%
filter(accommodation_star_rating == 4)
Only accommodations located in the United States were retained for analysis.
Booking_clean <- Booking_clean %>%
filter(accommodation_country == "United States of America")
Only the variables required for the regression analysis were selected.
Booking_clean <- Booking_clean %>%
  select(review_score, review_helpful_votes, room_nights, month,
         accommodation_type, accommodation_country, accommodation_score,
         accommodation_star_rating, location_is_ski, location_is_beach,
         location_is_city_center)
Observations with missing values were removed to ensure a complete dataset.
Booking_clean <- Booking_clean %>%
na.omit()
The cleaned dataset contains 30,175 complete observations and 11 variables, as confirmed by its dimensions.
dim(Booking_clean)
## [1] 30175 11
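The preparation steps above can be checked one at a time by recording the row count after each filter. The sketch below does this on a small made-up data frame (`toy_reviews`, a hypothetical stand-in for `BookingReviews_train`). It uses base R rather than dplyr so that it runs without extra packages; the column names follow the report, but the values do not come from the dataset.

```r
# Toy stand-in for BookingReviews_train; every value here is invented.
toy_reviews <- data.frame(
  review_score              = c(9, 7, NA, 8, 6, 10),
  accommodation_star_rating = c(4, 4, 4, 3, 4, 4),
  accommodation_country     = c("United States of America",
                                "United States of America",
                                "United States of America",
                                "United States of America",
                                "France",
                                "United States of America"),
  room_nights               = c(2, 1, 3, 2, 4, NA)
)

# Step 1: keep four-star accommodations only.
step_star <- subset(toy_reviews, accommodation_star_rating == 4)

# Step 2: keep accommodations in the United States only.
step_us <- subset(step_star,
                  accommodation_country == "United States of America")

# Step 3: drop rows that have a missing value in any column.
step_complete <- step_us[complete.cases(step_us), ]

# Row counts after each step, to document how the sample shrinks.
row_counts <- c(raw       = nrow(toy_reviews),
                four_star = nrow(step_star),
                us        = nrow(step_us),
                complete  = nrow(step_complete))
print(row_counts)
```

Reporting these intermediate counts alongside the final sample size makes it clear how many reviews each filter removed, in particular how many were lost to missing values.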
Multiple Linear Regression was selected because guest satisfaction is likely related to several hotel characteristics at once. This method evaluates the association between guest review scores and multiple predictors simultaneously, estimating each association while holding the other variables constant. The dependent variable is review_score, while the explanatory variables include accommodation type, accommodation score, room nights, helpful votes, month of stay, and location characteristics. Together, these variables allow the model to identify which characteristics are most strongly associated with guest satisfaction among travelers staying in four-star hotels in the United States.
Regression_Model <- lm(
  review_score ~
    review_helpful_votes + room_nights + month + accommodation_type +
    accommodation_score + location_is_ski + location_is_beach + location_is_city_center,
  data = Booking_clean
)
summary(Regression_Model)
##
## Call:
## lm(formula = review_score ~ review_helpful_votes + room_nights +
## month + accommodation_type + +accommodation_score + location_is_ski +
## location_is_beach + location_is_city_center, data = Booking_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3118 -0.8199 0.1640 0.9576 3.3046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.796837 0.106023 35.811 < 2e-16 ***
## review_helpful_votes -0.282251 0.014657 -19.258 < 2e-16 ***
## room_nights -0.028459 0.002902 -9.808 < 2e-16 ***
## month -0.002357 0.002070 -1.139 0.25481
## accommodation_typeApartment -0.016799 0.195945 -0.086 0.93168
## accommodation_typeBed and Breakfast -0.106892 0.066629 -1.604 0.10866
## accommodation_typeGuest house -0.089795 0.132411 -0.678 0.49768
## accommodation_typeHoliday home -0.264659 0.353606 -0.748 0.45419
## accommodation_typeHotel -0.091849 0.035122 -2.615 0.00892 **
## accommodation_typeInn -0.080960 0.055600 -1.456 0.14537
## accommodation_typeLodge 0.016197 0.104493 0.155 0.87682
## accommodation_typeMotel -0.062719 0.086026 -0.729 0.46596
## accommodation_typeResort -0.122297 0.038820 -3.150 0.00163 **
## accommodation_score 0.636193 0.011732 54.226 < 2e-16 ***
## location_is_ski -0.105054 0.041183 -2.551 0.01075 *
## location_is_beach 0.052728 0.016588 3.179 0.00148 **
## location_is_city_center 0.014066 0.015116 0.930 0.35212
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.106 on 30158 degrees of freedom
## Multiple R-squared: 0.1093, Adjusted R-squared: 0.1088
## F-statistic: 231.3 on 16 and 30158 DF, p-value: < 2.2e-16
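In the coefficient table, accommodation_type appears as a set of indicator terms. With R's default treatment coding, one level is absorbed into the intercept and each listed coefficient is a difference from that baseline. The output does not show which level is the baseline here; it would be the first level of the factor (the alphabetically first type if the column is stored as text). The sketch below illustrates the mechanism on invented data; `toy`, its scores, and the choice of "Hotel" as an alternative baseline are assumptions made for illustration only.

```r
# Invented scores for three accommodation types (two reviews each).
toy <- data.frame(
  score = c(6, 8, 8, 9, 7, 9),
  type  = factor(c("Apartment", "Apartment", "Hotel", "Hotel",
                   "Resort", "Resort"))
)

# Default coding: the first factor level is the baseline.
levels(toy$type)[1]
fit_default <- lm(score ~ type, data = toy)
# The intercept is the baseline group's mean score; typeHotel and
# typeResort are differences from that mean.
coef(fit_default)

# Choosing a different baseline changes what each coefficient compares,
# but not the fitted values.
toy_releveled <- transform(toy, type = relevel(type, ref = "Hotel"))
fit_releveled <- lm(score ~ type, data = toy_releveled)
coef(fit_releveled)
```

Read this way, the negative Hotel and Resort coefficients in the report describe lower scores relative to the omitted accommodation type, not lower scores in absolute terms.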
The regression model was statistically significant (F = 231.3 on 16 and 30,158 degrees of freedom, p < 0.001), indicating that the predictors jointly explain a statistically detectable share of the variation in guest review scores. The model explains approximately 10.9% of the observed variability (Adjusted R² = 0.1088). A modest value is reasonable here because guest satisfaction also depends on personal preferences and other factors not included in the model.
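The fit statistics in the summary output are linked by standard formulas, so they can be cross-checked by hand. The sketch below recomputes the F-statistic and the adjusted R² from the reported multiple R², the sample size, and the number of estimated slopes. The variable names are illustrative, and because the inputs are rounded values copied from the output, the results match only to rounding.

```r
# Values copied from the summary() output in this report.
r_squared <- 0.1093   # Multiple R-squared
n_obs     <- 30175    # complete observations
k_terms   <- 16       # estimated slopes, excluding the intercept

# Residual degrees of freedom: n - k - 1.
df_residual <- n_obs - k_terms - 1

# Overall F-statistic: explained variation per slope divided by
# unexplained variation per residual degree of freedom.
f_stat <- (r_squared / k_terms) / ((1 - r_squared) / df_residual)

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_residual

# Upper-tail probability of the F distribution.
p_value <- pf(f_stat, k_terms, df_residual, lower.tail = FALSE)

round(c(F = f_stat, adj_R2 = adj_r_squared), 4)
```

With 16 slopes and more than 30,000 observations, the adjustment barely moves R², which is why the two reported values (0.1093 and 0.1088) are nearly identical.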
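Statistically significant predictors (p < 0.05):
- Accommodation score (strongest predictor, positive)
- Helpful votes (negative)
- Room nights (negative)
- Hotel (negative, relative to the omitted baseline accommodation type)
- Resort (negative, relative to the omitted baseline accommodation type)
- Ski location (negative)
- Beach location (positive)
Predictors that were not statistically significant:
- Month
- Apartment
- Bed and Breakfast
- Guest house
- Holiday home
- Inn
- Lodge
- Motel
- City center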
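Overall, accommodation score was the strongest predictor of guest satisfaction. Beach locations were associated with slightly higher review scores (about 0.05 points), whereas ski locations were associated with lower review scores (about 0.11 points), holding the other predictors constant.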
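The size and direction of each association can be read from its estimate and standard error. The sketch below rebuilds the t value and an approximate 95% confidence interval for three coefficients using only numbers copied from the summary output. The helper `wald_ci` is a hypothetical name introduced for this illustration; with the fitted model at hand, `confint(Regression_Model)` would give the intervals directly.

```r
# Residual degrees of freedom reported in the summary output.
df_residual <- 30158

# Hypothetical helper: t value and 95% interval from an estimate and
# its standard error.
wald_ci <- function(estimate, std_error, df) {
  t_crit <- qt(0.975, df)  # two-sided 95% critical value
  c(estimate = estimate,
    t_value  = estimate / std_error,
    lower    = estimate - t_crit * std_error,
    upper    = estimate + t_crit * std_error)
}

# Estimates and standard errors copied from the coefficient table.
ci_score <- wald_ci(0.636193, 0.011732, df_residual)   # accommodation_score
ci_ski   <- wald_ci(-0.105054, 0.041183, df_residual)  # location_is_ski
ci_beach <- wald_ci(0.052728, 0.016588, df_residual)   # location_is_beach

round(rbind(accommodation_score = ci_score,
            location_is_ski     = ci_ski,
            location_is_beach   = ci_beach), 3)
```

The interval for ski locations lies entirely below zero and the interval for beach locations entirely above zero, so the two location effects point in opposite directions. Both are small next to the accommodation score coefficient.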
Standard diagnostic plots were used to evaluate whether the assumptions of the multiple linear regression model were reasonably satisfied.
plot(Regression_Model)
Regression Diagnostics
Residuals vs Fitted:
The residuals are generally scattered around zero without a strong pattern, suggesting that the linear relationship is reasonably appropriate for the data.
Normal Q-Q Plot:
Most residuals follow the reference line, although small deviations appear at the ends. With a large sample size, these departures are expected and do not seriously affect the analysis.
Scale-Location Plot:
The spread of the residuals is relatively consistent across fitted values. Some variation is present, but there is no strong evidence of severe heteroscedasticity.
Residuals vs Leverage:
Most observations have low leverage, with only a few potentially influential cases. No observations appear to have an undue impact on the overall regression model.
Overall, the diagnostic plots suggest that the assumptions of multiple linear regression are reasonably satisfied. Minor departures from normality and constant variance are expected because review scores are measured on a bounded 0-10 scale and the data set contains more than 30,000 observations. Therefore, the regression results are considered reliable for this analysis.
This study examined which hotel characteristics are associated with higher guest satisfaction among travelers staying in four-star hotels in the United States. The results indicate that accommodation score is the strongest predictor of guest review scores. Beach locations are positively associated with guest satisfaction, whereas ski locations are negatively associated with it. Although the model explains only part of the variation in guest review scores, it identifies several meaningful factors that help explain differences in guest experiences.
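The visual reading of the leverage plot can be backed by a few numbers. The sketch below computes leverage and Cook's distance on simulated data and flags cases against two common rule-of-thumb cutoffs (twice the average leverage, and 4/n for Cook's distance). The simulated data, the object names, and the cutoffs are assumptions for illustration, not values taken from this analysis.

```r
# Simulated data standing in for a fitted regression; not the report's data.
set.seed(1)
x <- runif(200)
y <- 5 + 2 * x + rnorm(200)
fit <- lm(y ~ x)

n <- nobs(fit)          # number of observations used in the fit
p <- length(coef(fit))  # number of estimated coefficients

# Leverage: how unusual each observation's predictor values are.
leverage <- hatvalues(fit)

# Cook's distance: how much the fitted values change when an
# observation is dropped.
cooks <- cooks.distance(fit)

# Rule-of-thumb flags (conventions, not formal tests).
high_leverage <- which(leverage > 2 * p / n)
influential   <- which(cooks > 4 / n)

c(flagged_leverage = length(high_leverage),
  flagged_cooks    = length(influential),
  max_cooks        = max(cooks))
```

Applied to the report's model, the same calls would take `Regression_Model` in place of `fit`. Flagged cases are candidates for inspection rather than automatic removal.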
Booking.com Accommodation Review Dataset