The Question

Hotel pricing seems to be constantly changing and there are a many theories as to when the best time to book a hotel is (closer to the stay date or far in advance). I am hoping to use this data to see if there is a correlation between booking date and price. I have always leaned toward the “it is better to wait” category but then you always risk losing a booking or paying an exorbitant amount. So I am interested to see if I can find the sweet spot for booking. In order to determine this, I am hoping to run a correlation test between the lead time and the average daily rate of the booking. I am also hoping to explore the two categories within the hotel variable to determine if there is a difference in the sweet spot for booking a Resort Hotel versus a City Hotel. The goal is to provide consumers with a way to get the best price possible.

Packages

In order to work with the data set in R, I installed and used several packages:

library(ggpubr)
library(feasts)
library(ggpubr)

I am using the first two packages to clean the data set and the ggpubr package to run analyses (correlation test).

The Data

The dataset (hotels) was downloaded from Github and is an open hotel booking demand dataset from Antonio, Almeida and Nunes, 2019. The data set has 119390 rows with 32 columns and includes a variety of types (character, date, numeric). The data is a combination of two data sets - one with data related to a resort hotel and the other a city hotel. The researcher previously combined and cleaned the two data sets (H1 for Resort Hotel and H2 for City Hotel) into one (Hotels). However, I did some cleaning by deleting empy rows and columns, renaming complicated columns to much simpler names, combining three date columns to form one column with a full date, removing outliers.

Data Preparation

In order to complete the analysis, I will be working primarily with the ‘hotel’, ‘adr’, and ‘lead_time’ variables; the ‘date’ variable I merged is another possibility. The first step I took was to get an understanding of each of the variables.

adr lead_time
Min. -6.38 Min. 0
1st Qu. 69.29 1st Qu. 18
Median 94.58 Median 69
Mean 101.83 Mean 104
3rd Qu. 126.00 3rd Qu. 160
Max. 5400.00 Max. 737

To clean the data, I started by renaming columns that were not in R style and very long. The three columns that I renamed were all arrival date fields that were separated into three columns containging the month, date, and year. I am not sure if I will be needing that date for future analysis but just in case, I merged the three columns into one that holds the entire date (mm/dd/yyyy) The I removed any columns and rows that were entirely empty so they would not muddy up the data. I then plotted the variables (lead_time and adr) to identify any outliers. Adr (average daily rate) definitely had some outliers that needed to be removed so I entered code to do that next.

Proposed Data Analysis

I want to first test to see if there is a correlation between the variables lead_time and adr. I want to take that a step further and subset the data to look to see if a correlation exists between one group (Resort Hotel and City Hotel) and not the other. Finally, I want to run a regression analysis to determine if lead_time can predict the average daily rate of a booking.

Several things I need to rethink: * Check to verify that all data within the adr variable is in US dollars. * Review how to incorporate better code into R Markdown for better inclusion of code I am using. * Review how to incorporate plots in R Markdown for better visuals.