Hotel pricing seems to be constantly changing and there are a many theories as to when the best time to book a hotel is (closer to the stay date or far in advance). I am hoping to use this data to see if there is a correlation between booking date and price. I have always leaned toward the “it is better to wait” category but then you always risk losing a booking or paying an exorbitant amount. So I am interested to see if I can find the sweet spot for booking. In order to determine this, I am hoping to run a correlation test between the lead time and the average daily rate of the booking. I am also hoping to explore the two categories within the hotel variable to determine if there is a difference in the sweet spot for booking a Resort Hotel versus a City Hotel. The goal is to provide consumers with a way to get the best price possible.
In order to work with the data set in R, I installed and used several packages:
library(ggpubr)
library(feasts)
library(ggpubr)
I am using the first two packages to clean the data set and the ggpubr package to run analyses (correlation test).
The dataset (hotels) was downloaded from Github and is an open hotel booking demand dataset from Antonio, Almeida and Nunes, 2019. The data set has 119390 rows with 32 columns and includes a variety of types (character, date, numeric). The data is a combination of two data sets - one with data related to a resort hotel and the other a city hotel. The researcher previously combined and cleaned the two data sets (H1 for Resort Hotel and H2 for City Hotel) into one (Hotels). However, I did some cleaning by deleting empy rows and columns, renaming complicated columns to much simpler names, combining three date columns to form one column with a full date, removing outliers.
In order to complete the analysis, I will be working primarily with the ‘hotel’, ‘adr’, and ‘lead_time’ variables; the ‘date’ variable I merged is another possibility. The first step I took was to get an understanding of each of the variables.
hotel variable - a character variable and consists only of only two responses - Resort Hotel and City Hotel. I am hoping to also use the different categories to subset the data for more in depth analyses.adr variable - this represents the average daily rate, in dollars, for each of the hotel bookings within this data set. This is a numeric value so I was able run some statistics related to this variable: mean = 101.84. By looking at this summmary of data it is obvious that there are outliers to remove (with a negative min value and a max value of 5400). I will follow up by determining the outliers and removing them from the data set.lead_time variable - this is numeric and represents the amount of days a hotel is booked prior to the arrival date. The mean for this variable is 69 days with a min value of 0. I am not removing outliers within this variable because 0 is a very real possibility and 737 could also be possible - especially when booking for large groups.adr |
lead_time |
|---|---|
| Min. -6.38 | Min. 0 |
| 1st Qu. 69.29 | 1st Qu. 18 |
| Median 94.58 | Median 69 |
| Mean 101.83 | Mean 104 |
| 3rd Qu. 126.00 | 3rd Qu. 160 |
| Max. 5400.00 | Max. 737 |
To clean the data, I started by renaming columns that were not in R style and very long. The three columns that I renamed were all arrival date fields that were separated into three columns containging the month, date, and year. I am not sure if I will be needing that date for future analysis but just in case, I merged the three columns into one that holds the entire date (mm/dd/yyyy) The I removed any columns and rows that were entirely empty so they would not muddy up the data. I then plotted the variables (lead_time and adr) to identify any outliers. Adr (average daily rate) definitely had some outliers that needed to be removed so I entered code to do that next.
I want to first test to see if there is a correlation between the variables lead_time and adr. I want to take that a step further and subset the data to look to see if a correlation exists between one group (Resort Hotel and City Hotel) and not the other. Finally, I want to run a regression analysis to determine if lead_time can predict the average daily rate of a booking.
Several things I need to rethink: * Check to verify that all data within the adr variable is in US dollars. * Review how to incorporate better code into R Markdown for better inclusion of code I am using. * Review how to incorporate plots in R Markdown for better visuals.