Travel is an enlightening and eye-opening experience for adults and children of all ages. However, traveling with children can also be overwhelming because of unpredictable schedules and needs such as eating and lodging. Taking eating as an example, would people rather eat at the hotel or at a restaurant when traveling with kids? In this project, I use real booking data to answer this question.
The original data for this project come from an open hotel booking demand dataset published by Antonio, Almeida, and Nunes (2019). A cleaned version was then prepared by a data scientist. The dataset includes 32 variables recording hotel and stay information such as hotel type, arrival time, number of guests, and the meal plan chosen.
I plan to select four variables from the dataset (adults, children, babies, and meal) to address the question. The variable meal will be used as the response, and the selected data will be divided into a training set and a testing set. The KNN and Random Forest algorithms will be applied to build classification models. The prediction performance of both models will be evaluated by AUC and misclassification rate.
The resulting model can be used to predict which meal plan a customer would choose, given the number of adults, children, and babies in the booking.
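The planned workflow (split the data, fit a classifier, measure the misclassification rate) can be sketched on a small scale. The following is only a minimal illustration, not the final model: the data frame `toy` contains invented values, and the 25% hold-out share and `k = 3` are arbitrary assumptions.

```r
library(class)  # provides knn()

set.seed(1)  # make the random split reproducible

# Toy stand-in for the real data (values are invented for illustration only)
toy <- data.frame(
  adults   = c(2, 2, 1, 2, 2, 1, 2, 3, 2, 2, 1, 2),
  children = c(0, 1, 0, 2, 0, 0, 1, 0, 0, 2, 0, 0),
  babies   = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  meal     = factor(c("BB", "HB", "SC", "HB", "BB", "SC",
                      "HB", "BB", "BB", "HB", "SC", "BB"))
)

# Hold out 25% of the rows for testing (the split ratio is an assumption)
test_idx <- sample(nrow(toy), size = round(0.25 * nrow(toy)))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Classify each test row by its k nearest training rows (k = 3 is arbitrary)
pred <- knn(train = train[, 1:3], test = test[, 1:3], cl = train$meal, k = 3)

# Misclassification rate: share of test rows whose predicted meal is wrong
misclass_rate <- mean(pred != test$meal)
misclass_rate
```

With only a dozen invented rows the resulting rate carries no meaning; the point is the shape of the computation, which stays the same when `toy` is replaced by the cleaned hotel data.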
The following packages are needed to import and manipulate the data.
library(tidyverse) #tidy data, visualisation, transformation
library(tibble) #create tibbles
library(tidyr) #tidy data
library(dplyr) #data analysis
library(DT) #display data set
library(ggplot2) #data visualization
library(magrittr) #pipe operator
library(knitr) #display tables
These packages are required to build and evaluate the models.
library(class) #KNN modeling
library(randomForest) #build Random forest model
library(ROCR) #ROC curve
I downloaded the data set from the tidytuesday website. The original data set includes 32 variables.
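# stringsAsFactors = TRUE keeps meal as a factor (R >= 4.0 reads strings as character by default)
hotels <- read.csv("E:/Courses/20Spring/Data wrangling/Final/hotels/hotels.csv", stringsAsFactors = TRUE)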
To answer my question, I only need four variables: adults, children, babies, and meal. I select these four variables and store them in a new data set, which is the one manipulated in the following steps.
hotel <- hotels[, c("adults", "children", "babies", "meal")]
There are four missing values in the variable children. They are imputed with the median of this variable. Extreme values are capped: values of adults greater than 20 are set to 20, values of children greater than 5 are set to 5, and values of babies greater than 3 are set to 3.
hotel$adults[hotel$adults > 20] <- 20
hotel$children[hotel$children > 5] <- 5
hotel$babies[hotel$babies > 3] <- 3
hotel$children[is.na(hotel$children)] <- median(hotel$children,na.rm = TRUE)
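The same two steps, capping and median imputation, can be checked on a short vector. This is a small sketch with invented values; it uses `pmin()` as an alternative way to cap, which is an assumption of this example and not the code used above.

```r
# Toy vector with one missing value and one extreme value (invented values)
children <- c(0, 1, NA, 2, 10, 0)

# Cap at 5: pmin() returns the element-wise minimum and leaves NA as NA
children_capped <- pmin(children, 5)

# Replace missing values with the median of the observed values
children_clean <- ifelse(is.na(children_capped),
                         median(children_capped, na.rm = TRUE),
                         children_capped)
children_clean
```

Because the missing entry stays `NA` through the capping step, the order of capping and imputing does not matter here.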
The “Undefined” meal values are recoded as “SC” because both “Undefined” and “SC” mean the customer chose not to book a meal package at the hotel.
levels(hotel$meal) <- c("BB", "FB", "HB", "SC", "SC")
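Assigning the same name to two levels merges them, which is what the line above relies on. It also relies on the levels being in alphabetical order, with "Undefined" last. The sketch below shows this on an invented factor, followed by a variant that renames the level by name and so does not depend on the order.

```r
# Toy factor with the same five meal codes (invented values)
meal <- factor(c("BB", "SC", "Undefined", "HB", "FB", "Undefined"))
levels(meal)  # alphabetical order: BB, FB, HB, SC, Undefined

# Positional recoding: the fifth level ("Undefined") is merged into "SC"
levels(meal) <- c("BB", "FB", "HB", "SC", "SC")
table(meal)

# Order-independent variant: rename only the level called "Undefined"
meal2 <- factor(c("BB", "SC", "Undefined", "HB", "FB", "Undefined"))
levels(meal2)[levels(meal2) == "Undefined"] <- "SC"
table(meal2)
```

Both versions end with the four levels BB, FB, HB, and SC.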
The cleaned data includes 119390 observations of 4 variables. The variables adults, children, and babies are numeric, while the variable meal is a factor.
str(hotel)
## 'data.frame': 119390 obs. of 4 variables:
## $ adults : num 2 2 1 1 2 2 2 2 2 2 ...
## $ children: num 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : Factor w/ 4 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
Below is a preview of the first 20 rows of the cleaned data set hotel.
head(hotel, n = 20)
## adults children babies meal
## 1 2 0 0 BB
## 2 2 0 0 BB
## 3 1 0 0 BB
## 4 1 0 0 BB
## 5 2 0 0 BB
## 6 2 0 0 BB
## 7 2 0 0 BB
## 8 2 0 0 FB
## 9 2 0 0 BB
## 10 2 0 0 HB
## 11 2 0 0 BB
## 12 2 0 0 HB
## 13 2 0 0 BB
## 14 2 1 0 HB
## 15 2 0 0 BB
## 16 2 0 0 BB
## 17 2 0 0 BB
## 18 2 0 0 BB
## 19 2 0 0 BB
## 20 2 0 0 BB
Data description
hotel.type <- sapply(hotel, class) # character vector of column classes
hotel.var_desc <- c("Number of adults",
"Number of children",
"Number of babies",
"Type of meal booked. SC for no meal package, BB for Bed & Breakfast, HB for Half board (breakfast and one other meal usually dinner), FB for Full board (breakfast, lunch and dinner)")
hotel.var_names <- colnames(hotel)
# as_data_frame() is deprecated in tibble, so the tibble is built directly
data.description <- tibble(hotel.var_names, hotel.type, hotel.var_desc)
colnames(data.description) <- c("Variable name", "Data Type", "Variable Description")
kable(data.description)
| Variable name | Data Type | Variable Description |
|---|---|---|
| adults | numeric | Number of adults |
| children | numeric | Number of children |
| babies | numeric | Number of babies |
| meal | factor | Type of meal booked. SC for no meal package, BB for Bed & Breakfast, HB for Half board (breakfast and one other meal usually dinner), FB for Full board (breakfast, lunch and dinner) |
The table below summarizes each variable: the five-number summary and mean for the numeric variables, and the counts per level for meal.
summary(hotel)
## adults children babies meal
## Min. : 0.000 Min. :0.0000 Min. :0.00000 BB:92310
## 1st Qu.: 2.000 1st Qu.:0.0000 1st Qu.:0.00000 FB: 798
## Median : 2.000 Median :0.0000 Median :0.00000 HB:14463
## Mean : 1.855 Mean :0.1038 Mean :0.00784 SC:11819
## 3rd Qu.: 2.000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :20.000 Max. :5.0000 Max. :3.00000
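The meal counts in the summary are easier to compare as percentages. A minimal sketch, with the counts typed in from the output above so that it runs without the data file:

```r
# Counts per meal plan, copied from the summary(hotel) output
meal_counts <- c(BB = 92310, FB = 798, HB = 14463, SC = 11819)

# Share of each meal plan, in percent of all bookings
meal_share <- round(100 * prop.table(meal_counts), 1)
meal_share
```

On the full data set the same shares can be obtained with `prop.table(table(hotel$meal))`.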
I use boxplots to present the distributions of the numeric variables.
boxplot(hotel[, 1:3], notch = TRUE, col = c("red", "blue", "yellow"))
I use a bar plot to display the categorical variable meal.
attach(hotel)
barplot(table(hotel$meal), main = "meal", cex.names=.5,col = "Green")
I use plots and contingency tables to display the relationship between meal and each of the numeric variables.
plot(meal ~ adults, data = hotel)
plot(meal ~ children, data = hotel)
plot(meal ~ babies, data = hotel)
table(hotel$meal,hotel$adults)
##
## 0 1 2 3 4 5 6 10 20
## BB 290 19450 67185 5314 58 2 1 1 9
## FB 0 75 682 40 1 0 0 0 0
## HB 9 1779 11951 718 3 0 0 0 3
## SC 104 1723 9862 130 0 0 0 0 0
table(hotel$meal,hotel$children)
##
## 0 1 2 3 5
## BB 85231 3896 3118 64 1
## FB 732 54 12 0 0
## HB 13217 757 477 12 0
## SC 11620 154 45 0 0
table(hotel$meal,hotel$babies)
##
## 0 1 2 3
## BB 91648 649 11 2
## FB 774 24 0 0
## HB 14285 174 4 0
## SC 11766 53 0 0
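Raw counts are hard to compare across columns because most bookings have no children or babies. One way to read the meal-by-children table is to convert each column to proportions. The sketch below retypes the counts from that table into a matrix (the object names are made up for this example) and uses `prop.table()` with `margin = 2`.

```r
# Counts copied from table(hotel$meal, hotel$children)
meal_by_children <- matrix(
  c(85231, 3896, 3118, 64, 1,
      732,   54,   12,  0, 0,
    13217,  757,  477, 12, 0,
    11620,  154,   45,  0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(meal = c("BB", "FB", "HB", "SC"),
                  children = c("0", "1", "2", "3", "5"))
)

# margin = 2 divides each column by its total, so every column sums to 1
col_share <- prop.table(meal_by_children, margin = 2)
round(100 * col_share, 1)
```

Each column then shows how bookings with that number of children are distributed over the four meal plans. On the full data set, `prop.table(table(hotel$meal, hotel$children), margin = 2)` gives the same result.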