Have you ever wondered about the trends in hotel bookings? How long do people stay? How often do people cancel? What are the busiest months? In this analysis, I explore a large dataset to examine these questions.
In this project, I used R to examine the “hotel_bookings” dataset. I downloaded the data as a .csv file and then read it into R with the read.csv function to begin my analysis.
This dataset contains booking records for guest stays at hotels. More specifically, it covers a city hotel and a resort hotel, and includes details such as when the booking was made, the length of stay, the number of adults, children, and babies, and the number of parking spaces required, among other things. For this post, I focused on only a few of these variables.
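As a rough sketch of that import step: the file name hotel_bookings.csv and the helper name load_hotel are my assumptions for illustration. I pass stringsAsFactors = TRUE because R 4.0 and later no longer turn text columns into factors by default, and the summary() output further down relies on the hotel column being a factor.

```r
# Hypothetical helper: read the bookings file into a data frame.
load_hotel <- function(path) {
  # stringsAsFactors = TRUE makes text columns such as `hotel` factors
  read.csv(path, stringsAsFactors = TRUE)
}

# Assumed file name and location; adjust to wherever the download was saved.
csv_path <- "hotel_bookings.csv"
if (file.exists(csv_path)) {
  hotel <- load_hotel(csv_path)
  str(hotel$hotel)
}
```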
I looked at the following:
-Hotel Type
-Length of Stay
-Guest Count per Stay
-Cancellations
-Arrival Months and Days
Let’s take a look at the makeup of the dataset. The following pie chart shows the share of records from each hotel type.
slices<-c(79330,40060)
lbls<-c("City Hotel","Resort Hotel")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct)
lbls <- paste(lbls,"%",sep="")
pie(slices,labels = lbls, col=c("Blue","White"),
main="Hotel Type")
summary(hotel$hotel)
## City Hotel Resort Hotel
## 79330 40060
From the pie chart we can see that 66% of the records in the dataset are from city hotels and the remaining 34% are from resort hotels.
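The percentages can also be computed directly from the counts rather than read off the chart. Here is a small sketch using the two counts printed above (the variable names are just for illustration):

```r
# Counts copied from the summary(hotel$hotel) output above
hotel_counts <- c("City Hotel" = 79330, "Resort Hotel" = 40060)

# prop.table() divides each count by the total
hotel_pct <- round(100 * prop.table(hotel_counts))
print(hotel_pct)
```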
Next, let’s look at how long guests stay. The dataset records weekend nights and weeknights in separate columns, so I added the two columns together into a single vector and examined that.
weekend<-hotel$stays_in_weekend_nights
week<-hotel$stays_in_week_nights
total<-weekend+week
total[total==0]<-NA
summary(total)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.000 3.000 3.449 4.000 69.000 715
boxplot(total,col="chartreuse3",ylab="Nights Stayed")
From the summary, we can see that the mean stay in the dataset is 3.449 nights, with a median of 3 nights. The boxplot shows many outliers, and the distribution is skewed right. Note that 715 records had zero nights, meaning the guests booked for the day only. I set these to NA, so they are excluded from the summary statistics and the plot.
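To put a number on “outlier”: by default, R's boxplot flags points that lie more than 1.5 times the interquartile range beyond the box. Here is a quick sketch of that arithmetic using the quartiles from the summary above. R's boxplot computes its hinges slightly differently from summary(), so treat the cutoff as approximate.

```r
# Quartiles taken from the summary(total) output above
q1 <- 2
q3 <- 4
iqr <- q3 - q1

# Conventional 1.5 * IQR fences used for boxplot whiskers
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)
```

Under that rule, any stay longer than 7 nights is drawn as an outlier point, which is why the 69-night maximum sits so far above the box.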
Let’s now take a look at the number of guests per reservation. Similar to the data on length of stay, guest count was split up into 3 categories: adults, children, and babies. The specific age ranges for the categories are unknown to me. I once again combined these three categories into a vector and examined the result.
adults<-hotel$adults
children<-hotel$children
babies<-hotel$babies
people<-adults+children+babies
people[people==0]<-NA
summary(people)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.000 2.000 1.971 2.000 55.000 184
boxplot(people, col="chartreuse3",ylab="Number of Guests")
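The boxplot above is hard to read: the first quartile, median, and third quartile are all 2, so the box collapses into a single line. To get a clearer picture, I converted the vector into a factor and examined it as a categorical variable in a bar chart.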
peoplef<-factor(people)
plot(peoplef, col="chartreuse3",ylab="Count",xlab="Number of Guests")
From this plot we can better see the distribution of the number of guests per reservation. Most records have 2 guests per stay, with 1, 3, and 4 guests rounding out the top four, as expected. Perhaps the most interesting part of this plot is the upper extreme: 55 guests in one reservation!
Next, I took a look at cancelled stays.
cancel_v_f<-factor(c(hotel$is_canceled))
plot(cancel_v_f,col=c("chartreuse3","red"),ylab="Count",xlab="Cancelled? 0 = No | 1 = Yes")
The green bar above represents stays that were not cancelled, and the red bar represents bookings that the guest cancelled. I was surprised there were so many cancellations. As someone who has never cancelled a hotel booking, this proportion seemed high to me.
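To put a number on that proportion, it helps to know that the mean of a 0/1 column is the share of 1s. Here is a minimal sketch with a made-up vector standing in for hotel$is_canceled (these values are illustrative, not from the dataset):

```r
# Toy stand-in for hotel$is_canceled: 1 = cancelled, 0 = not cancelled
is_canceled_toy <- c(0, 1, 0, 0, 1, 0, 0, 0)

# The mean of a 0/1 vector equals the proportion of 1s
cancel_rate <- mean(is_canceled_toy)
cancel_rate
```

On the real data the same idea would be mean(hotel$is_canceled), assuming the column is read in as the numbers 0 and 1.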
I wanted to examine the cancellation data further, so I looked at how the proportion of cancellations differs by hotel type.
counts<-table(hotel$is_canceled, hotel$hotel)
barplot(counts, main="Hotel Type and Cancellations", xlab= "Hotel Type",col=c("chartreuse3","firebrick2"),legend=rownames(counts),ylab="Count")
Once again, green represents bookings that were not cancelled and red represents bookings that were. As you can see, city hotel bookings are cancelled considerably more often than resort hotel bookings. My guess is that this comes down to the purpose of the booking. People plan ahead for vacations, making time and setting aside money to book a resort hotel, whereas city hotel bookings may be made only a few days in advance for work or general travel, and those plans can be cancelled because of poor weather or last-minute changes.
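Because there are roughly twice as many city hotel records, raw counts can exaggerate the gap, so comparing cancellation rates within each hotel type is a fairer check. Here is a sketch with a small invented data frame (toy values, not the real data):

```r
# Toy data standing in for the hotel data frame; values are made up
toy <- data.frame(
  hotel = c("City Hotel", "City Hotel", "City Hotel", "City Hotel",
            "Resort Hotel", "Resort Hotel", "Resort Hotel", "Resort Hotel"),
  is_canceled = c(1, 1, 0, 0, 1, 0, 0, 0)
)

toy_counts <- table(toy$is_canceled, toy$hotel)

# margin = 2 makes each column (hotel type) sum to 1
toy_rates <- prop.table(toy_counts, margin = 2)
toy_rates
```

Passing a table like toy_rates to barplot() draws every bar at the same height, so the red share is directly comparable across hotel types.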
Next I wanted to look at the distribution of hotel bookings by months of the year. I plotted the bookings below in a bar plot:
month_vector<-factor(hotel$arrival_date_month)
month<-factor(month_vector,levels=month.name)
levels(month)<-c("J","F","M","A","My","Jn","Jl","Au","S","O","N","D")
plot(month,col="chartreuse3",ylab="Count",xlab="Month")
As expected, there are more bookings in the summer and generally warmer months than in the colder months. August has the most bookings and January has the fewest.
Just for kicks, I wanted to see if there were any trends in the arrival day of hotel bookings.
day_vector<-factor(hotel$arrival_date_day_of_month)
plot(day_vector,col="chartreuse3",xlab="Day of Month",ylab="Count")
From the plot, there is no discernible trend. The 31st has far fewer arrivals than other days, but this is because not every month has 31 days.
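The same reasoning applies, more mildly, to the 29th and 30th. Here is a quick sketch that counts how many months contain each day of the month in one non-leap calendar year (2015 is picked only as an example year):

```r
# Every date in an example non-leap year
dates <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")

# Day-of-month for each date, then how many months contain that day
day_of_month <- as.integer(format(dates, "%d"))
months_with_day <- table(day_of_month)
months_with_day[c("28", "29", "30", "31")]
```

Dividing each day's arrival count by the number of months that contain it would give a fairer comparison between days.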
https://www.kaggle.com/jessemostipak/hotel-booking-demand/data