1 INTRODUCTION

1.1 Problem

Travel is an enlightening and eye-opening experience for adults and children of all ages. However, traveling with children can also be overwhelming because of unpredictable schedules and needs around eating, housing, etc. Taking eating as an example, would people rather eat at the hotel or at a restaurant when traveling with kids? In this project, I use real data to answer this question.

1.2 Data

The original data for this project come from an open hotel booking demand dataset published by Antonio, Almeida and Nunes (2019). It was then cleaned, primarily by other data scientists. The dataset includes 32 variables recording hotel and stay information such as hotel type, arrival time, number of people, the meal plan chosen, and so on.

1.3 Strategy

I plan to select four variables from the dataset (adults, children, babies, and meal) to address the question. The variable meal will be used as the response, and the selected data will be divided into a training set and a testing set. The KNN and Random Forest algorithms will be applied to build classification models. The prediction performance of both models will be evaluated by AUC and misclassification rate.

1.4 Insights

The model I build can be used to predict which meal plan a customer would choose, given the number of children traveling with the customer.

2 PACKAGES REQUIRED

The following packages are needed to import and manipulate the data.

library(tidyverse)    #tidy data, visualisation, transformation
library(tibble)       #create tibbles
library(tidyr)        #tidy data
library(dplyr)        #data analysis
library(DT)           #display data set
library(ggplot2)      #data visualization
library(magrittr)     #pipe operator
library(knitr)        #display tables

These packages are required to build and evaluate the models.

library(class)        #KNN modeling
library(randomForest) #build Random forest model
library(ROCR)         #ROC curve

3 DATA PREPARATION

3.1 Read data into R

I downloaded the data set from the tidytuesday website. The original data set includes 32 variables.

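hotels <- read.csv("E:/Courses/20Spring/Data wrangling/Final/hotels/hotels.csv", stringsAsFactors = TRUE)  # keep text columns such as meal as factors (R >= 4.0 reads them as character by default)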

3.2 Create a new data set that only includes the variables I need

To answer my question, I only need four variables: adults, children, babies, and meal. I select these four variables and include them in a new data set for further manipulation.

hotel <- hotels[, c("adults", "children", "babies", "meal")]

3.3 Take care of the missing values and extreme values

The variable children has four missing values, which are imputed with the median of that variable. Extreme values are capped: values of adults greater than 20 are set to 20, values of children greater than 5 are set to 5, and values of babies greater than 3 are set to 3.

hotel$children[is.na(hotel$children)] <- median(hotel$children, na.rm = TRUE)  # impute missing values with the median
hotel$adults[hotel$adults > 20] <- 20      # cap adults at 20
hotel$children[hotel$children > 5] <- 5    # cap children at 5
hotel$babies[hotel$babies > 3] <- 3        # cap babies at 3
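
The two steps can be checked on a small vector before they are applied to the full data. The sketch below is only an illustration: the vector `x` holds made-up values, and `cap_and_impute()` is a hypothetical helper, not part of the analysis above.

```r
# Toy vector standing in for a count column; the values are made up.
x <- c(0, 1, 2, NA, 7, 10)

# Hypothetical helper: fill missing values with the median, then cap at `upper`.
cap_and_impute <- function(v, upper) {
  v[is.na(v)] <- median(v, na.rm = TRUE)  # median of the observed values
  pmin(v, upper)                          # values above `upper` become `upper`
}

x_clean <- cap_and_impute(x, upper = 5)
print(x_clean)
```

Wrapping the two steps in one function keeps the rule identical across columns; only the cap changes from variable to variable.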

The “Undefined” meal values are recoded as “SC” because both “Undefined” and “SC” mean that the customer chose not to eat at the hotel.

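hotel$meal <- factor(hotel$meal)  # make sure meal is a factor before touching its levels
levels(hotel$meal)[levels(hotel$meal) == "Undefined"] <- "SC"  # merge "Undefined" into "SC" by name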
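
The recoding can be tried on a small made-up factor first. The sketch below relabels the level by name, so it does not depend on the alphabetical order of the levels; the toy values are assumptions, not rows from the data set.

```r
# Made-up meal values, including the level that should be merged.
meal <- factor(c("BB", "SC", "Undefined", "HB", "Undefined", "FB"))

# Renaming "Undefined" to an existing level name merges the two levels.
levels(meal)[levels(meal) == "Undefined"] <- "SC"

print(levels(meal))
print(table(meal))
```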

The cleaned data includes 119390 observations of 4 variables. The variables adults, children, and babies are numerical, while the variable meal is a factor.

str(hotel)
## 'data.frame':    119390 obs. of  4 variables:
##  $ adults  : num  2 2 1 1 2 2 2 2 2 2 ...
##  $ children: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal    : Factor w/ 4 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...

Below is a preview of the clean data set hotel.

head(hotel, n = 20)
##    adults children babies meal
## 1       2        0      0   BB
## 2       2        0      0   BB
## 3       1        0      0   BB
## 4       1        0      0   BB
## 5       2        0      0   BB
## 6       2        0      0   BB
## 7       2        0      0   BB
## 8       2        0      0   FB
## 9       2        0      0   BB
## 10      2        0      0   HB
## 11      2        0      0   BB
## 12      2        0      0   HB
## 13      2        0      0   BB
## 14      2        1      0   HB
## 15      2        0      0   BB
## 16      2        0      0   BB
## 17      2        0      0   BB
## 18      2        0      0   BB
## 19      2        0      0   BB
## 20      2        0      0   BB

Data description is as follows.

hotel.type <- lapply(hotel, class)
hotel.var_desc <- c("Number of adults",
                    "Number of children",
                    "Number of babies",
                    "Type of meal booked. SC for no meal package, BB for Bed & Breakfast, HB for Half board (breakfast and one other meal – usually dinner), FB for Full board (breakfast, lunch and dinner)"
)
hotel.var_names <- colnames(hotel)
data.description <- tibble(hotel.var_names, unname(unlist(hotel.type)), hotel.var_desc)  # tibble() replaces the deprecated as_data_frame()
colnames(data.description) <- c("Variable name", "Data Type", "Variable Description")
kable(data.description)
|Variable name |Data Type |Variable Description |
|:-------------|:---------|:--------------------|
|adults        |numeric   |Number of adults |
|children      |numeric   |Number of children |
|babies        |numeric   |Number of babies |
|meal          |factor    |Type of meal booked. SC for no meal package, BB for Bed & Breakfast, HB for Half board (breakfast and one other meal – usually dinner), FB for Full board (breakfast, lunch and dinner) |

4 EXPLORATORY DATA ANALYSIS

4.1 Data Summary

I plan to summarize each variable and make a table to present the result.

summary(hotel)
##      adults          children          babies        meal      
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.00000   BB:92310  
##  1st Qu.: 2.000   1st Qu.:0.0000   1st Qu.:0.00000   FB:  798  
##  Median : 2.000   Median :0.0000   Median :0.00000   HB:14463  
##  Mean   : 1.855   Mean   :0.1038   Mean   :0.00784   SC:11819  
##  3rd Qu.: 2.000   3rd Qu.:0.0000   3rd Qu.:0.00000             
##  Max.   :20.000   Max.   :5.0000   Max.   :3.00000

4.2 EDA of continuous variables

I plan to use boxplots to present the features of the continuous variables.

boxplot(hotel[, 1:3], notch = TRUE, col = c("red", "blue", "yellow"))

4.3 EDA of categorical variable

I will use a bar plot to display the categorical variable meal.

attach(hotel)
barplot(table(hotel$meal), main = "meal", cex.names = .5,col = "Green")

4.4 Distribution of meal based on adults, children, and babies respectively

I plan to use plots and contingency tables to display the relationship between meal and each of the continuous variables.

plot(meal~adults)

plot(meal~children)

plot(meal~babies)

table(hotel$meal,hotel$adults)
##     
##          0     1     2     3     4     5     6    10    20
##   BB   290 19450 67185  5314    58     2     1     1     9
##   FB     0    75   682    40     1     0     0     0     0
##   HB     9  1779 11951   718     3     0     0     0     3
##   SC   104  1723  9862   130     0     0     0     0     0
table(hotel$meal,hotel$children)
##     
##          0     1     2     3     5
##   BB 85231  3896  3118    64     1
##   FB   732    54    12     0     0
##   HB 13217   757   477    12     0
##   SC 11620   154    45     0     0
table(hotel$meal,hotel$babies)
##     
##          0     1     2     3
##   BB 91648   649    11     2
##   FB   774    24     0     0
##   HB 14285   174     4     0
##   SC 11766    53     0     0
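
Because almost all bookings have no children, the raw counts are hard to compare across columns. One option, sketched here as a suggestion rather than as part of the original analysis, is to convert the meal-by-children counts into within-column shares with prop.table(). The matrix is typed in from the table above; the object names are my own.

```r
# Counts copied from the meal-by-children table above.
meal_by_children <- matrix(
  c(85231, 3896, 3118, 64, 1,
      732,   54,   12,  0, 0,
    13217,  757,  477, 12, 0,
    11620,  154,   45,  0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(meal = c("BB", "FB", "HB", "SC"),
                  children = c("0", "1", "2", "3", "5"))
)

# Share of each meal plan within each children count (each column sums to 1).
meal_share <- prop.table(meal_by_children, margin = 2)
print(round(meal_share, 3))
```

On the real data frame, prop.table(table(hotel$meal, hotel$children), margin = 2) would give the same shares directly.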

5 MACHINE LEARNING

5.1 Data split
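
A minimal sketch of a random split, shown on a simulated data frame rather than the real hotel data. The 70/30 proportion, the seed, and the object names (`toy`, `train`, `test`) are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Simulated stand-in for the hotel data; values are random, not real bookings.
n <- 200
toy <- data.frame(
  adults   = sample(1:3, n, replace = TRUE),
  children = sample(0:2, n, replace = TRUE),
  babies   = sample(0:1, n, replace = TRUE),
  meal     = factor(sample(c("BB", "FB", "HB", "SC"), n, replace = TRUE))
)

# Assumed 70/30 split: draw row indices for training, keep the rest for testing.
train_idx <- sample(seq_len(nrow(toy)), size = round(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(dim(train))
print(dim(test))
```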

5.2 KNN model building
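
A minimal sketch of fitting KNN with knn() from the class package, again on simulated data. The labelling rule, the 140/60 split, and k = 5 are all assumptions chosen only to make the example run; they are not tuned values for the hotel data.

```r
library(class)  # provides knn()

set.seed(1)
n <- 200
# Simulated predictors; values are random, not real bookings.
toy <- data.frame(
  adults   = sample(1:3, n, replace = TRUE),
  children = sample(0:2, n, replace = TRUE),
  babies   = sample(0:1, n, replace = TRUE)
)
# Made-up rule so the labels carry some signal: parties with children get "HB".
toy$meal <- factor(ifelse(toy$children > 0, "HB", "BB"))

train_idx <- sample(seq_len(n), size = 140)

# knn() classifies each test row by a vote among its k nearest training rows.
knn_pred <- knn(train = toy[train_idx, 1:3],
                test  = toy[-train_idx, 1:3],
                cl    = toy$meal[train_idx],
                k     = 5)  # k = 5 is an arbitrary choice

print(table(predicted = knn_pred, actual = toy$meal[-train_idx]))
```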

5.3 Evaluation of KNN model
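
The two metrics named in the strategy can be sketched in base R. Since meal has four classes, AUC needs a convention such as one class versus the rest; the example below assumes that convention and uses the rank (Mann-Whitney) formula. The labels, the scores, and the helper `auc_rank()` are made up for illustration.

```r
# Made-up true labels and predicted scores for one class ("HB" versus the rest).
actual   <- factor(c("BB", "BB", "HB", "HB"))
score_hb <- c(0.10, 0.40, 0.35, 0.80)  # hypothetical predicted probability of "HB"
pred     <- factor(ifelse(score_hb > 0.5, "HB", "BB"), levels = levels(actual))

# Misclassification rate: share of predictions that differ from the truth.
misclass_rate <- mean(pred != actual)

# AUC via the rank formula, treating `positive` as the class of interest.
auc_rank <- function(score, positive) {
  r     <- rank(score)        # average ranks handle tied scores
  n_pos <- sum(positive)
  n_neg <- sum(!positive)
  (sum(r[positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc_hb <- auc_rank(score_hb, actual == "HB")

print(misclass_rate)
print(auc_hb)
```

Repeating the AUC calculation once per meal class and averaging the results is one way to summarize a four-class model in a single number.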

5.4 Random forest model building
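
A minimal sketch of fitting a random forest with the randomForest package loaded in section 2, on the same kind of simulated data. The labelling rule, the 140/60 split, and ntree = 100 are assumptions for illustration only.

```r
library(randomForest)  # provides randomForest() and its predict() method

set.seed(1)
n <- 200
# Simulated predictors; values are random, not real bookings.
toy <- data.frame(
  adults   = sample(1:3, n, replace = TRUE),
  children = sample(0:2, n, replace = TRUE),
  babies   = sample(0:1, n, replace = TRUE)
)
# Made-up rule so the labels carry some signal: parties with children get "HB".
toy$meal <- factor(ifelse(toy$children > 0, "HB", "BB"))

train_idx <- sample(seq_len(n), size = 140)

# A factor response makes randomForest() fit a classification forest.
rf_fit <- randomForest(x = toy[train_idx, 1:3],
                       y = toy$meal[train_idx],
                       ntree = 100)  # number of trees is an arbitrary choice

rf_pred <- predict(rf_fit, newdata = toy[-train_idx, 1:3])
print(mean(rf_pred != toy$meal[-train_idx]))  # test misclassification rate
```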

5.5 Evaluation of Random forest model

6 SUMMARY