In the hotel industry, it is important to understand how customers make their bookings. We need to dig into the hotel booking data to find out what customers prefer, how they book, and what influences their decisions. This knowledge enables hotels to improve their advertisements and offers, enhance guest satisfaction, and optimize room availability.
We are going to use R to perform data analysis and visualization to help us understand customer demographics and booking trends through the following steps:
library(tidyverse) # loads the core tidyverse packages (including dplyr and ggplot2)
library(dplyr)     # data manipulation
library(ggplot2)   # data visualization
library(magrittr)  # pipe operators
setwd("/Users/lianganni/Desktop/untitled folder")
# Load the original data
original_data <- read.csv("hotels.csv")
# Number of variables in the original data set
num_variables <- ncol(original_data) # 32 variables in the data set
# Check for peculiarities, such as how missing values are recorded
colSums(is.na(original_data))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
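Note that `is.na()` only counts values that R read in as `NA`. The first rows printed further down show the literal text `NULL` in `agent` and `company`, so those entries are not counted above. Below is a minimal sketch of how such placeholder strings could be counted and, optionally, converted to `NA` at load time. The CSV text and the names `csv_text`, `raw`, and `clean` are made up for illustration and are not taken from hotels.csv.

```r
# Small invented CSV standing in for the real file (assumption, not real data)
csv_text <- "agent,company,children\nNULL,NULL,0\n304,NULL,NA\n240,NULL,1\nNULL,110,0"

# Default read: "NULL" stays as text, only "NA" becomes a missing value
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Count cells per column that hold the literal text "NULL"
null_counts <- sapply(raw, function(col) sum(col == "NULL", na.rm = TRUE))
print(null_counts)

# Alternative read: treat both "NA" and "NULL" as missing values
clean <- read.csv(text = csv_text, na.strings = c("NA", "NULL"))
print(colSums(is.na(clean)))
```

Whether `NULL` in `agent` and `company` should count as missing, or simply means that no agent or company was involved, is a judgment call that depends on how the data were collected.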
sum(is.na(original_data$children)) # four missing values
## [1] 4
original_data$children[is.na(original_data$children)] <- "UNKNOWN" # replace NA with "UNKNOWN" (this converts the column to character)
table(original_data$children) # check if the values were correctly assigned
##
## 0 1 10 2 3 UNKNOWN
## 110796 4861 1 3652 76 4
sum(is.na(original_data$children)) # no missing values left in this column
## [1] 0
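One side effect of writing a string into `children` is that the whole column becomes character, which is why `10` sorts before `2` in the table above. If numeric summaries of this column are needed later, one possible alternative is to keep the `NA` values and record them in a separate flag. The sketch below uses a short invented vector, and the names `as_text` and `children_missing` are hypothetical.

```r
# Invented values for illustration only
children <- c(0, 1, NA, 2, 10, NA)

# Approach used above: assigning a string coerces the vector to character
as_text <- children
as_text[is.na(as_text)] <- "UNKNOWN"
print(class(as_text))

# Alternative: keep the numbers and flag the missing entries separately
children_missing <- is.na(children)
print(sum(children_missing))

# Numeric summaries still work if the NAs are skipped explicitly
print(mean(children, na.rm = TRUE))
```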
head(original_data, 10)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## 7 Resort Hotel 0 0 2015 July
## 8 Resort Hotel 0 9 2015 July
## 9 Resort Hotel 1 85 2015 July
## 10 Resort Hotel 1 75 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## 7 27 1 0
## 8 27 1 0
## 9 27 1 0
## 10 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## 7 2 2 0 0 BB PRT Direct
## 8 2 2 0 0 FB PRT Direct
## 9 3 2 0 0 BB PRT Online TA
## 10 3 2 0 0 HB PRT Offline TA/TO
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## 7 Direct 0 0
## 8 Direct 0 0
## 9 TA/TO 0 0
## 10 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## 7 0 C C
## 8 0 C C
## 9 0 A A
## 10 0 D D
## booking_changes deposit_type agent company days_in_waiting_list
## 1 3 No Deposit NULL NULL 0
## 2 4 No Deposit NULL NULL 0
## 3 0 No Deposit NULL NULL 0
## 4 0 No Deposit 304 NULL 0
## 5 0 No Deposit 240 NULL 0
## 6 0 No Deposit 240 NULL 0
## 7 0 No Deposit NULL NULL 0
## 8 0 No Deposit 303 NULL 0
## 9 0 No Deposit 240 NULL 0
## 10 0 No Deposit 15 NULL 0
## customer_type adr required_car_parking_spaces total_of_special_requests
## 1 Transient 0.0 0 0
## 2 Transient 0.0 0 0
## 3 Transient 75.0 0 0
## 4 Transient 75.0 0 0
## 5 Transient 98.0 0 1
## 6 Transient 98.0 0 1
## 7 Transient 107.0 0 0
## 8 Transient 103.0 0 1
## 9 Transient 82.0 0 1
## 10 Transient 105.5 0 0
## reservation_status reservation_status_date
## 1 Check-Out 2015-07-01
## 2 Check-Out 2015-07-01
## 3 Check-Out 2015-07-02
## 4 Check-Out 2015-07-02
## 5 Check-Out 2015-07-03
## 6 Check-Out 2015-07-03
## 7 Check-Out 2015-07-03
## 8 Check-Out 2015-07-03
## 9 Canceled 2015-05-06
## 10 Canceled 2015-04-22
To summarize the data for answering key questions, I can create summary statistics and visualize trends using charts and graphs. Additionally, creating new variables, aggregating data at different levels (daily, monthly), and joining external datasets can provide a more comprehensive understanding of customer behavior and booking patterns.
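As a rough sketch of what creating a variable and aggregating at the monthly level could look like with dplyr: the example uses a small toy data frame that borrows column names from the hotel data, but all values are invented. The names `bookings`, `total_nights`, and `monthly_summary` are assumptions made for illustration.

```r
library(dplyr)

# Toy bookings with the same column names as the hotel data; values are invented
bookings <- data.frame(
  hotel = c("Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel", "City Hotel"),
  arrival_date_month = c("July", "July", "July", "August", "August"),
  stays_in_weekend_nights = c(0, 2, 1, 0, 2),
  stays_in_week_nights = c(1, 3, 2, 2, 5),
  is_canceled = c(0, 1, 0, 0, 1),
  adr = c(75, 98, 107, 82, 105.5)
)

monthly_summary <- bookings %>%
  # New variable: total length of stay in nights
  mutate(total_nights = stays_in_weekend_nights + stays_in_week_nights) %>%
  # Aggregate at the hotel-by-month level
  group_by(hotel, arrival_date_month) %>%
  summarise(
    n_bookings  = n(),               # number of bookings in the group
    cancel_rate = mean(is_canceled), # share of canceled bookings
    mean_adr    = mean(adr),         # average daily rate
    mean_nights = mean(total_nights),
    .groups = "drop"
  )

print(monthly_summary)
```

The same pattern should carry over to `original_data` by replacing `bookings` with the real data frame, provided the columns involved are numeric.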
The charts I will include are listed below:
I am not yet familiar with ggplot2, so I will need to learn it in order to answer these questions. At this point, I do not plan to incorporate linear regression or any more advanced models.
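As a possible starting point for the ggplot2 work, here is a minimal sketch of a grouped bar chart of bookings by hotel type and cancellation status. It uses a toy data frame with invented values, and the chart type and labels are only one option, not a fixed plan for the final analysis.

```r
library(ggplot2)

# Toy data with two columns named as in the hotel data; values are invented
bookings <- data.frame(
  hotel = c("Resort Hotel", "Resort Hotel", "Resort Hotel",
            "City Hotel", "City Hotel", "City Hotel", "City Hotel"),
  is_canceled = c(0, 0, 1, 0, 1, 1, 0)
)

# geom_bar() counts rows per group; position = "dodge" puts bars side by side
p <- ggplot(bookings, aes(x = hotel, fill = factor(is_canceled))) +
  geom_bar(position = "dodge") +
  labs(
    x = "Hotel type",
    y = "Number of bookings",
    fill = "Canceled (1 = yes)",
    title = "Bookings by hotel type and cancellation status"
  )

print(p)
```

A ggplot is built in layers: `ggplot()` sets the data and the aesthetic mappings, each `geom_*()` call adds a layer that draws something, and `labs()` sets the titles and axis labels.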