Content

Section 1: Introduction

  • 1.1 Provide an introduction that explains the problem statement you are addressing. Why should I be interested in this?
    • This project investigates the problem of hotel reservation cancellation. Specifically, a cancellation prediction model will be built. Reservations cancelled by customers have been a critical topic in the service industry (e.g., hotels, airlines, healthcare). Cancellations make it harder for a company to manage supply and demand, and thus have a negative impact on its profit. This project aims to provide some insight into reservation cancellation so that companies can be proactive and reduce its negative effects.
  • 1.2 Provide a short explanation of how you plan to address this problem statement (the data used and the methodology employed)
    • A logistic regression model will be fitted to an open hotel booking demand dataset to analyze the problem. More information can be found in the later sections.
  • 1.3 Discuss your current proposed approach/analytic technique you think will address (fully or partially) this problem.
    • The logistic regression model includes one dependent variable, is_canceled, and five independent variables (some of them are categorical variables, and they will be separated into a few binary variables later in our analysis). The following equation provides a general idea of our proposed model.

  • 1.4 Explain how your analysis will help the consumer of your analysis.
    • The analysis from our logistic regression model will give hotel companies a better idea of the probability of reservation cancellation under a few different scenarios. Hence, companies can adjust their reservation policy (e.g., overbooking) to better utilize their capacity and reduce lost profit.
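
As a sketch of the general form behind the model described in 1.3, a logistic regression with five predictors can be written as below. The symbols \(x_1, \dots, x_5\) and \(\beta_0, \dots, \beta_5\) are placeholder notation assumed here, not taken from the report. A categorical predictor that is split into binary indicators would contribute one term per indicator.

```latex
\[
\log\!\left(\frac{P(\text{is\_canceled} = 1)}{1 - P(\text{is\_canceled} = 1)}\right)
  = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5
\]
```

Under this form, each coefficient is the change in the log-odds of cancellation for a one-unit increase in the corresponding predictor, holding the others fixed.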


Section 2: Packages Required

  • 2.1 All packages used are loaded upfront so the reader knows which are required to replicate the analysis.

    • readxl
    • tidyverse
    • ROCR
    • MASS
  • 2.2 Messages and warnings resulting from loading the package are suppressed.

  • 2.3 Explanation is provided regarding the purpose of each package (there are over 10,000 packages, don’t assume that I know why you loaded each package).

    • readxl: imports data from Excel files into the R environment.
    • tidyverse: its dplyr package helps with data manipulation, and its ggplot2 package helps with data and model visualization.
    • ROCR: helps visualize the ROC curve obtained from the logistic regression model.
    • MASS: helps analyze the proposed logistic regression model.
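
Items 2.1 and 2.2 can be sketched as a single setup chunk that attaches the four packages while suppressing their startup messages. The variable names `pkgs`, `loaded`, and `missing_pkgs` are illustrative assumptions, and the report's own loading code may well differ.

```r
# Packages named in 2.1.
pkgs <- c("readxl", "tidyverse", "ROCR", "MASS")

# Attach each package quietly. require() returns FALSE instead of stopping
# when a package is not installed, so missing ones can be reported afterwards.
loaded <- vapply(
  pkgs,
  function(p) {
    suppressWarnings(suppressPackageStartupMessages(
      require(p, character.only = TRUE, quietly = TRUE)
    ))
  },
  logical(1)
)

# Names of any packages that still need install.packages().
missing_pkgs <- names(loaded)[!loaded]
```

Because MASS is attached after tidyverse in this order, `select()` resolves to `MASS::select()`; writing `dplyr::select()` explicitly avoids the clash.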


Section 3: Data Preparation

  • 3.1 Original source where the data was obtained is cited and, if possible, hyperlinked.
  • 3.2 Source data is thoroughly explained (i.e. what was the original purpose of the data, when was it collected, how many variables did the original have, explain any peculiarities of the source data such as how missing values are recorded, or how data was imputed, etc.).
    • The dataset is an open hotel booking demand dataset published by Antonio et al. (2019). The authors stated, “Due to the scarcity of real business data for scientific and educational purposes, these datasets can have an important role for research and education in revenue management, machine learning, or data mining, as well as in other fields.”
    • The data was collected between July 1, 2015 and August 31, 2017.
    • There are a total of 32 variables in the original dataset.
    • There are a total of 119,390 observations.
    • The authors state that the dataset contains no missing data. However, “NULL” appears as one of the categories in some variables. It should be read as “not applicable” rather than as missing data.
  • 3.3 Data importing and cleaning steps are explained in the text (tell me why you are doing the data cleaning activities that you perform) and follow a logical process.
    • Data importing:
    • Deduplicate data:
      • hotels_lrm_data <- hotels[!duplicated(hotels), ]
    • Extract the required variables for the analysis:
      • hotels_lrm_data <- hotels_lrm_data[,c(1,2,3,18,19,23,28)]
    • Remove observations with a negative average daily rate (adr):
      • hotels_lrm_data <- filter(hotels_lrm_data, adr >= 0)
    • Count the number of missing values in the dataset:
      • sum(is.na(hotels_lrm_data))
      • There are no missing values, as stated in the paper describing the original dataset.
  • 3.4 Once your data is clean, show what the final data set looks like. However, do not print off a data frame with 200+ rows; show me the data in the most condensed form possible.
    • Total of 87,395 observations
    • Total of 7 variables
    • The head of the dataset:
  • 3.5 Provide summary information about the variables of concern in your cleaned data set. Do not just print off a bunch of code chunks with str(), summary(), etc. Rather, provide me with a consolidated explanation, either with a table that provides summary info for each variable or a nicely written summary paragraph with inline code.
    • Summary Information:
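
The cleaning steps listed in 3.3 and the condensed view asked for in 3.4 can be sketched on a small made-up data frame. The toy values below are assumptions for illustration only, and base R subsetting stands in for `dplyr::filter()` so that the sketch runs without extra packages. Because the toy frame has only four columns, the selection by column position from 3.3 is left out.

```r
# Toy stand-in for the raw `hotels` data frame (all values are made up).
hotels <- data.frame(
  hotel       = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  is_canceled = c(1, 1, 0, 0),
  lead_time   = c(30, 30, 5, 120),
  adr         = c(95.5, 95.5, -6.4, 110)
)

# 1. Drop exact duplicate rows (row 2 repeats row 1).
hotels_lrm_data <- hotels[!duplicated(hotels), ]

# 2. Keep only rows with a non-negative average daily rate.
hotels_lrm_data <- hotels_lrm_data[hotels_lrm_data$adr >= 0, ]

# 3. Count missing values across the whole data frame.
n_missing <- sum(is.na(hotels_lrm_data))

# 4. Condensed view of the result: dimensions and the first rows.
clean_dim  <- dim(hotels_lrm_data)
clean_head <- head(hotels_lrm_data, 3)
```

On the real data the same sequence applies; only the input data frame and the column selection change.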


Section 4: Proposed Exploratory Data Analysis

  • 4.1 Discuss how you plan to uncover new information in the data that is not self-evident. What are different ways you could look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frames to create new summary information? How could you summarize your data to answer key questions?
    • previous_cancellations_rate: this newly created variable will improve our analysis by capturing the share of a customer's past reservations that were cancelled. It will be calculated as previous_cancellations divided by (previous_cancellations + previous_bookings_not_canceled). Note: if the denominator equals zero, the rate will be set to zero.
    • hotel: this variable will be recoded as a binary variable (1: resort hotel; 0: city hotel).
    • deposit_type: this variable will be split into two binary variables, no_deposit and non_refund.
  • 4.2 What types of plots and tables will help you to illustrate the findings to your questions?
    • A receiver operating characteristic (ROC) curve will be used to illustrate the performance of our model (the trade-off between sensitivity and specificity).
    • Box plots, histograms, and density plots will be used to show the distribution of the numeric variables.
    • Bar charts will be used to display the distribution of the categorical (character) variables.
    • Scatter plots will be used to show the relationship between the dependent variable and each independent variable.
  • 4.3 What do you not know how to do right now that you need to learn to answer your questions?
    • The significance of each independent variable is not known at this moment. There might be other variables, not included in the original dataset, that would help explain reservation cancellation. Additional datasets might be required later in the analysis process.
  • 4.4 Do you plan on incorporating any machine learning techniques (i.e. linear regression, discriminant analysis, cluster analysis) to answer your questions?
    • Besides the logistic regression model, cluster analysis would also be interesting for examining the similarities and differences between countries in cancelling hotel reservations.
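
The three derived variables described in 4.1 can be sketched as follows. The toy `bookings` data frame and its values are made up, and the category labels "Resort Hotel", "No Deposit", and "Non Refund" are assumed spellings of the levels in the source data.

```r
# Toy bookings (made-up values) with the columns the new variables depend on.
bookings <- data.frame(
  hotel                          = c("Resort Hotel", "City Hotel", "City Hotel"),
  deposit_type                   = c("No Deposit", "Non Refund", "Refundable"),
  previous_cancellations         = c(0, 1, 3),
  previous_bookings_not_canceled = c(0, 3, 0)
)

# previous_cancellations_rate: past cancellations over all past bookings,
# set to 0 when the customer has no booking history (zero denominator).
total_prev <- bookings$previous_cancellations +
  bookings$previous_bookings_not_canceled
bookings$previous_cancellations_rate <- ifelse(
  total_prev == 0, 0, bookings$previous_cancellations / total_prev
)

# hotel: 1 for resort hotel, 0 for city hotel.
bookings$hotel <- as.integer(bookings$hotel == "Resort Hotel")

# deposit_type: two indicators; a booking with 0 in both is the remaining
# (refundable) category, which acts as the reference level.
bookings$no_deposit <- as.integer(bookings$deposit_type == "No Deposit")
bookings$non_refund <- as.integer(bookings$deposit_type == "Non Refund")
```

Using two indicators for a three-level variable avoids perfect collinearity with the intercept in the regression.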


Section 5: Exploratory Data Analysis and Basic Statistic

  • 5.1 Learn about the data by:
    • Assessing dimensions

      • Total of 32 variables and 119,390 observations
    • Viewing the head and tail of the data

    • Identifying the data types of each variable

    • Identifying missing data

      • There are four missing values in the variable “children”
    • Computing summary statistics for the variables

    • Check for duplicate rows or columns

      • There are 31,994 duplicated rows
      • There are no duplicated columns
  • 5.2 Learn about the data visually by plotting:
    • Histograms

      • For numeric variables
    • Bar charts

      • For characteristic variables
    • Box plots

      • For character variables with lead_time as dependent variable

    • Bar plots

      • For numeric variables
    • Scatter plots

      • For numeric variables with lead_time as dependent variable
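
The non-graphical checks listed in 5.1 can be sketched in base R as below. The toy data frame `df` and its values are made up for illustration, and the variable names are assumptions rather than the report's own code.

```r
# Toy data (made-up values): one missing value and one duplicated row.
df <- data.frame(
  lead_time = c(10, 10, 45, 200),
  children  = c(0, 0, NA, 2),
  adr       = c(80, 80, 120, 60)
)

# Dimensions: number of rows (observations) and columns (variables).
data_dim <- dim(df)

# Data type of each variable.
col_types <- vapply(df, function(x) class(x)[1], character(1))

# Missing values counted per column.
missing_per_column <- colSums(is.na(df))

# Duplicate rows: rows identical to an earlier row.
n_duplicate_rows <- sum(duplicated(df))

# Duplicate columns: columns whose values repeat an earlier column.
n_duplicate_cols <- sum(duplicated(as.list(df)))
```

`head(df)`, `tail(df)`, and `summary(df)` cover the remaining items (first and last rows, summary statistics) without further code.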

Section 6: Logistic Regression Model Analysis

Section 7: Model Summary and Insights

Model Summary

  • Plot Comparison (Logit, Probit, and Complementary Log-Log (Cloglog))
    • Logit is symmetric; its lower tail is close to that of cloglog, and its upper tail is thicker than those of the other two
    • Probit is also symmetric; its lower tail is thinner than those of the other two functions, and its upper tail is thinner than that of logit
    • Cloglog is not symmetric; its lower tail is close to that of logit, but its upper tail is thinner than those of the other two functions
  • ROC Curves and AUCs
    • AUC
       Logit: 0.66973
       Probit: 0.66986
       Cloglog: 0.66982

  • Brief Summary
    • Based on the above analysis, the logit, probit, and cloglog links perform very similarly, so any one of them should be fine. I will keep using logit as the link function here.
  • Chi-squared Test for Logit Model
    • P-Value: 0.9996114
    • Since the p-value is far above the 5% significance level, we fail to reject the null hypothesis that the observed is_canceled values and our predicted values do not differ. The test therefore gives no evidence of lack of fit for our logit model.
  • Limitation
    • The AUCs we obtained from these three models are not excellent, but 0.67 might be acceptable. Further research could include more variables (e.g., region, customer age…) to boost performance.
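
The link comparison and the AUC comparison summarized above can be sketched on simulated data. Everything in this block is an assumption for illustration: the simulated `toy` data, the two predictors, and the helper `auc_rank()`, which computes AUC from ranks (the Mann-Whitney statistic) in base R instead of using ROCR. The AUC values it produces are unrelated to the ones reported above.

```r
# Part 1: the three inverse link functions at a few linear-predictor values.
eta <- c(-3, 0, 3)
inv_links <- data.frame(
  eta     = eta,
  logit   = plogis(eta),           # 1 / (1 + exp(-eta))
  probit  = pnorm(eta),            # standard normal CDF
  cloglog = 1 - exp(-exp(eta))     # complementary log-log
)

# Part 2: fit the same model under each link and compare in-sample AUC.
set.seed(1)
n <- 500
lead_time <- rexp(n, rate = 1 / 80)   # simulated lead time in days
resort <- rbinom(n, 1, 0.4)           # 1 = resort hotel, 0 = city hotel
p_cancel <- plogis(-1 + 0.01 * lead_time - 0.5 * resort)
toy <- data.frame(
  is_canceled = rbinom(n, 1, p_cancel),
  lead_time   = lead_time,
  resort      = resort
)

# Rank-based AUC: probability that a randomly chosen cancelled booking
# receives a higher score than a randomly chosen non-cancelled one.
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

links <- c("logit", "probit", "cloglog")
aucs <- sapply(links, function(l) {
  fit <- glm(is_canceled ~ lead_time + resort,
             family = binomial(link = l), data = toy)
  auc_rank(fitted(fit), toy$is_canceled)
})
```

With ROCR attached, the same quantity should be obtainable from `performance(prediction(score, label), "auc")`. The rank-based helper is used here only to keep the sketch free of package dependencies.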

Insights

  • Reservations at city hotels are more likely to be cancelled than those at resort hotels
  • The longer the lead time, the higher the chance that a reservation will be cancelled
  • Intuitively, we might think refundable reservations would lead to more cancellations, since customers have nothing to lose by cancelling. However, the result points the other way. Thus, a refundable room might give customers more flexibility without disrupting a hotel’s capacity management
  • A customer with a history of cancelling reservations has a higher chance of doing it again


Formatting & Other Requirements

  • All code is visible, proper coding style is followed, and code is well commented (see section regarding style).
  • Coding is systematic - complicated problem broken down into sub-problems that are individually much simpler. Code is efficient, correct, and minimal. Code uses appropriate data structure (list, data frame, vector/matrix/array). Code checks for common errors.
  • Achievement, mastery, cleverness, creativity: Tools and techniques from the course are applied very competently and, perhaps, somewhat creatively. Perhaps student has gone beyond what was expected and required, e.g., extraordinary effort, additional tools not addressed by this course, unusually sophisticated application of tools from course.
  • .Rmd fully executes without any errors and HTML produced matches the HTML report submitted by student.
