Hotel Image
Hotel pricing seems to be constantly changing and there are a many theories as to when the best time to book a hotel is (closer to the stay date or far in advance). I am hoping to use this data to see if there is a correlation between booking date and price. I am also hoping to explore the two categories within the hotel variable to determine if there is a difference in the sweet spot for booking a Resort Hotel versus a City Hotel. To determine if these two variables do influence each other, I am running a correlation test and a linear regression model. The goal is to provide consumers with an understanding of how the average daily rate may be determined by their booking date.
In order to work with the data set in R, I installed and used several packages:
library(janitor)
library(ggpubr)
library(feasts)
library(ggpubr)
library(dplyr)
library(plyr)
The data set (hotels) was downloaded from Github and is an open hotel booking demand dataset from Antonio, Almeida and Nunes, 2019. The data set has 119390 rows with 32 columns and includes a variety of types (character, date, numeric). The data is a combination of two data sets - one with data related to a resort hotel and the other a city hotel. The researcher previously combined and cleaned the two data sets (H1 for Resort Hotel and H2 for City Hotel) into one (Hotels). However, I did some cleaning by deleting empty rows and columns, renaming complicated columns to much simpler names, combining three date columns to form one column with a full date, and removing outlier values.
Here is a quick view of the data set:
head(hotels, n=10)
## # A tibble: 10 x 32
## hotel is_canceled lead_time arrival_date_ye~ arrival_date_mo~
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Reso~ 0 342 2015 July
## 2 Reso~ 0 737 2015 July
## 3 Reso~ 0 7 2015 July
## 4 Reso~ 0 13 2015 July
## 5 Reso~ 0 14 2015 July
## 6 Reso~ 0 14 2015 July
## 7 Reso~ 0 0 2015 July
## 8 Reso~ 0 9 2015 July
## 9 Reso~ 1 85 2015 July
## 10 Reso~ 1 75 2015 July
## # ... with 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## # deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## # customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
## # total_of_special_requests <dbl>, reservation_status <chr>,
## # reservation_status_date <date>
colnames(hotels)
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
In order to complete the analysis, I will be working primarily with the ‘hotel’, ‘adr’, and ‘lead_time’ variables; the ‘date’ variable I merged is another possibility that I wanted in a better format in case it is needed. The first step I took was to get an understanding of each of the variables.
hotel variable - a character variable and consists only of only two responses - Resort Hotel and City Hotel. I am going to use the different categories to subset the data for more in depth analyses.adr variable - this represents the average daily rate, in dollars, for each of the hotel bookings within this data set. This is a numeric value so I was able run some statistics related to this variable: mean = 101.84. By looking at the summary of data it is obvious that there are outlier values that need to be removed (with a negative min value and a max value of 5,400). I will follow up by determining the specific values and removing them from the data set.lead_time variable - this is numeric and represents the amount of days a hotel is booked prior to the arrival date. The mean for this variable is 69 days with a min value of 0. I am not removing outlier values within this variable because 0 is a very real possibility and 737 could also be possible - especially when booking for large groups.adr |
lead_time |
|---|---|
| Min. -6.38 | Min. 0 |
| 1st Qu. 69.29 | 1st Qu. 18 |
| Median 94.58 | Median 69 |
| Mean 101.83 | Mean 104 |
| 3rd Qu. 126.00 | 3rd Qu. 160 |
| Max. 5400.00 | Max. 737 |
To clean the data, I started by renaming some complicated column names.mmThe three columns that I renamed were all arrival date fields that were separated into three columns containing the month, date, and year. I am not sure if I will be needing the date field for future analysis but just in case, I merged the three columns into one that holds the entire date (mm/dd/yyyy).
Next, I removed any columns and rows that were entirely empty so they would not muddy up the data.
remove_empty(hotels)
## value for "which" not specified, defaulting to c("rows", "cols")
I proceeded to idenfity all of the NA values:
is.na(hotels)
I then plotted the variables (lead_time and adr) to identify and remove any outliers.
boxplot(hotels$adr)$out
Q <- quantile(hotels$adr, probs=c(.25, .75), na.rm = FALSE)
iqr <- IQR(hotels$adr)
eliminated<- subset(hotels, hotels$adr > (Q[1] - 1.5*iqr) & hotels$adr < (Q[2]+1.5*iqr))
I first want to plot out the two variables to see the relationship between the two.
I filtered the records to get a more compact visualization and filtered to only display observations with a daily average rate greater than 0 and less then 500 and a lead time not equal to zero (there were some erroneous observations with a negative lead_time).
The visualization shows that the average daily hotel rate definitely increases as the days until hotel arrival date decrease. Next I want to see if this relationship is the same or different between the two different hotel categories (resort hotels versus city hotels).
My next thought is to create a scatterplot with a line to determine if a linear relationship exists between the two variables and then separately with the groups.
The scatterplot of the variables as a whole show a very weak negative, linear relationship between the variables (as the line is very close to flat and the dots are scatter far apart from the line). When looking at the variables separated into City and Resort hotel:
This plot definitely shows a VERy different relationship for each hotel and the average daily rate paid. Resort hotels seem to follow a much less linear pattern than City hotels as the Resort Hotel line is almost flat. On the other hand, the City Hotel linear relationship appears to be stronger. Therefore, when I run the correlation test and linear regression model, I will need to make sure to look at the hotel categories separately.
The first test is to determine the correlation coefficient of the two variables:
## [1] -0.06307685
As displayed above, the correlation coefficient of the two variables results in a value = -.06307685. Since the coefficient is not equal to zero, our next step is to run a correlation test.
##
## Pearson's product-moment correlation
##
## data: hotels$adr and hotels$lead_time
## t = -21.838, df = 119388, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06872464 -0.05742502
## sample estimates:
## cor
## -0.06307685
Both the correlation coefficient value and the correlation test demonstrate that there is a very, very weak negative relationship between the variables as a whole. As such, it is unlikely that we will a linear regression between the two variables.
However, next I would like to see if a better relationship exists when the hotel types are looked at independently.
## hotel COR
## 1 City Hotel -0.13724040
## 2 Resort Hotel 0.03048248
This result shows a much stronger negative relationship between the two variables within the City Hotels category so I may find a relationship if I run a linear regression model for this group only.
##
## Call:
## lm(formula = ADR_500CH2$lead_time ~ ADR_500CH2$adr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -146.51 -83.24 -33.80 56.91 1774.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 146.509175 1.019933 143.65 <2e-16 ***
## ADR_500CH2$adr -0.349211 0.008949 -39.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 109.9 on 79328 degrees of freedom
## Multiple R-squared: 0.01883, Adjusted R-squared: 0.01882
## F-statistic: 1523 on 1 and 79328 DF, p-value: < 2.2e-16
Based on the results of the analysis, we can reject the null hypothesis which states that there is no relationship between the two variables. With such a small p value of < 2e-15, and large t value, we can assume that there is a linear relationship between the average daily rate of a City hotel and the time of booking the hotel. This relationship is negative, which means that as the days prior to booking are reduced, the average daily rate of the hotel goes up.
In this analysis, I was hoping to determine that a correlation and linear relationship exists between the variables adr - average daily rate and lead_time - number of days hotel is booked prior to hotel stay within a hotel booking data set. To determine if this relationship existed, I started by plotting out the data to see if a visual relationship was present. I then went a step further and researched a visual relationship between those same variables but according to the two categories in the hotel variable: City Hotel and Resort Hotel. This proved to be much more enlightening than running the entire group as one. By adding a linear regression line, I could see visually that a relationship appeared to exist more within the City Hotel group.
This prompted me to run a correlation coefficient and correlation test on: * the two variables of the entire data set * the two variables as separate hotel groups: City Hotel and Resort Hotel
The results of these analyses lead me to determine that a relationship did not exist within the Resort Hotel group but it may exist within the City Hotel group. So I proceeded to run a linear regression model to determine if a linear relationship did exist within that City Hotel category according to adr and lead_time. The results of the analyses resulted in a negative relationship between the two variables lead_time and adr. The relationship is negative in that the less time ahead a customer books a hotel room, the higher the average daily rate of the stay is.
While this is an interesting result, it wasn’t exactly how my research started. I was hoping to identify an ideal date to book a hotel, based on the number of days prior to a stay, to get the best rate possible. While the visualizations show a negative relationship between the two variables, at a certain point the data changes and is no longer linear. It would be great to take this a step further to create an equation that a consumer could use to determine when may be the best time for them to book a trip. I hope this is at least a step in that direction!