Content

Section 1: Introduction

1.1 Provide an introduction that explains the problem statement you are addressing. Why should I be interested in this?
- This project investigates hotel reservation cancellations and builds a model to predict them. Customer cancellations are a critical issue across the service industry (e.g., hotels, airlines, and healthcare). They make it harder for a company to match supply with demand, which in turn hurts its profit. The project aims to give companies insight into reservation cancellations so that they can act proactively and reduce the negative effects.
1.2 Provide a short explanation of how you plan to address this problem statement (the data used and the methodology employed)
- A logistic regression model will be fitted to an open hotel booking demand dataset to analyze the problem. More information can be found in the later sections.
1.3 Discuss your current proposed approach/analytic technique you think will address (fully or partially) this problem.
- The logistic regression model includes one dependent variable, is_canceled, and five independent variables (some of them are categorical and will be separated into several binary variables later in our analysis). The following equation provides a general idea of our proposed model.
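
As a sketch of the general form only, a logistic regression with five predictors models the log-odds of cancellation as a linear function of the predictors. The notation $p$, $\beta_j$, and $x_j$ is assumed here for illustration; a categorical predictor that is split into several binary variables would contribute one term per binary variable.

$$
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5,
\qquad
p = P(\text{cancellation} \mid x_1, \dots, x_5)
$$

Equivalently, $p = 1 / \left(1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_5 x_5)}\right)$, so each $\beta_j$ is the change in the log-odds of cancellation for a one-unit increase in $x_j$, holding the other predictors fixed.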

1.4 Explain how your analysis will help the consumer of your analysis.
- The analysis from our logistic regression model will give hotel companies a better idea of the probability of reservation cancellation under several different scenarios. Companies can then adjust their reservation policies (e.g., overbooking) to better utilize their capacity and reduce lost profit.

$~$

Section 2: Packages Required

2.1 All packages used are loaded upfront so the reader knows which are required to replicate the analysis.
- readxl
- tidyverse
- ROCR
- MASS
2.2 Messages and warnings resulting from loading the package are suppressed.
2.3 Explanation is provided regarding the purpose of each package (there are over 10,000 packages, don’t assume that I know why you loaded each package).
- readxl: loads data from Excel files into the R environment.
- tidyverse: its dplyr package helps with data manipulation, and its ggplot2 package helps with data and model visualization.
- ROCR: helps visualize the ROC curve obtained from the logistic regression model.
- MASS: helps analyze the proposed logistic regression model.

$~$

Section 3: Data Preparation

3.1 Original source where the data was obtained is cited and, if possible, hyperlinked.
- https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md
3.2 Source data is thoroughly explained (i.e. what was the original purpose of the data, when was it collected, how many variables did the original have, explain any peculiarities of the source data such as how missing values are recorded, or how data was imputed, etc.).
- The data come from the open hotel booking demand dataset published by Antonio et al. (2019). The authors state, “Due to the scarcity of real business data for scientific and educational purposes, these datasets can have an important role for research and education in revenue management, machine learning, or data mining, as well as in other fields.”
- The data was collected between the 1st of July of 2015 and the 31st of August 2017.
- There are a total of 32 variables in the original dataset.
- There are a total of 119,390 observations.
- The authors state that the dataset contains no missing data. However, “NULL” appears as one of the categories in some variables. It should be read as “not applicable” rather than as missing data.
3.3 Data importing and cleaning steps are explained in the text (tell me why you are doing the data cleaning activities that you perform) and follow a logical process.
- Data importing:
  - hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
- Deduplicate data:
  - hotels_lrm_data <- hotels[!duplicated(hotels), ]
- Extract the required variables for the analysis:
  - hotels_lrm_data <- hotels_lrm_data[,c(1,2,3,18,19,23,28)]
- Remove observations with a negative average daily rate (adr):
  - hotels_lrm_data <- filter(hotels_lrm_data, adr >= 0)
- Count the number of missing values in the dataset:
  - sum(is.na(hotels_lrm_data))
  - No missing values, as stated in the paper describing the original dataset.
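
The cleaning steps above can be collected into one small function. This is a minimal sketch in base R rather than the report's exact code: `clean_hotels()` is a hypothetical helper, `toy` is made-up data that imitates a few columns of the real file, and columns are selected by name instead of by position.

```r
# Sketch only: `clean_hotels()` is a hypothetical helper and `toy` is
# made-up data; neither comes from the original report.
clean_hotels <- function(df, keep_cols) {
  df <- df[!duplicated(df), , drop = FALSE]                 # drop exact duplicate rows
  df <- df[, keep_cols, drop = FALSE]                       # keep only the model variables
  df <- df[!is.na(df$adr) & df$adr >= 0, , drop = FALSE]    # drop negative daily rates
  rownames(df) <- NULL
  df
}

toy <- data.frame(
  hotel       = c("City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"),
  is_canceled = c(1, 1, 0, 0),
  lead_time   = c(10, 10, 45, 3),
  adr         = c(80, 80, -5, 120),
  stringsAsFactors = FALSE
)

cleaned   <- clean_hotels(toy, c("hotel", "is_canceled", "adr"))
n_missing <- sum(is.na(cleaned))   # count of missing values after cleaning
print(cleaned)
```

Selecting by column name is a design choice of this sketch: positional indices such as c(1,2,3,18,19,23,28) silently pick different variables if the column order of the source file ever changes.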
3.4 Once your data is clean, show what the final data set looks like. However, do not print off a data frame with 200+ rows; show me the data in the most condensed form possible.
- Total of 87,395 observations
- Total of 7 variables
- The head of the dataset:
3.5 Provide summary information about the variables of concern in your cleaned data set. Do not just print off a bunch of code chunks with str(), summary(), etc. Rather, provide me with a consolidated explanation, either with a table that provides summary info for each variable or a nicely written summary paragraph with inline code.
- Summary Information:

$~$

Section 4: Proposed Exploratory Data Analysis

4.1 Discuss how you plan to uncover new information in the data that is not self-evident. What are different ways you could look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frames to create new summary information? How could you summarize your data to answer key questions?
- previous_cancellations_rate: this newly created variable will improve our analysis by capturing how often a customer has cancelled reservations in the past. It will be calculated as previous_cancellations divided by (previous_cancellations + previous_bookings_not_canceled). Note: if the denominator equals zero, the rate will be set to zero.
- hotel: this variable will be recoded as a binary variable (1: resort hotel; 0: city hotel).
- deposit_type: this variable will be separated into two binary variables, no_deposit and non_refund.
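
The three transformations above could look roughly like the following. This is a hedged sketch in base R: `add_features()` is a hypothetical helper, `toy` is made-up data, and the category labels ("Resort Hotel", "No Deposit", "Non Refund") are assumed to match the spelling used in the data file.

```r
# Sketch only: `add_features()` is a hypothetical helper; `toy` is made-up data.
add_features <- function(df) {
  total_prev <- df$previous_cancellations + df$previous_bookings_not_canceled
  # The rate is defined as 0 when the customer has no booking history
  df$previous_cancellations_rate <- ifelse(
    total_prev == 0, 0, df$previous_cancellations / total_prev
  )
  # 1 = resort hotel, 0 = city hotel (labels are assumed, see lead-in)
  df$hotel <- as.integer(df$hotel == "Resort Hotel")
  # Two indicators; any remaining deposit type is the reference level (both 0)
  df$no_deposit <- as.integer(df$deposit_type == "No Deposit")
  df$non_refund <- as.integer(df$deposit_type == "Non Refund")
  df
}

toy <- data.frame(
  hotel                          = c("Resort Hotel", "City Hotel", "City Hotel"),
  deposit_type                   = c("No Deposit", "Non Refund", "Refundable"),
  previous_cancellations         = c(0, 1, 2),
  previous_bookings_not_canceled = c(0, 3, 0),
  stringsAsFactors = FALSE
)

features <- add_features(toy)
print(features)
```

Using two indicators for a three-level variable avoids perfect collinearity in the regression: the level without its own indicator acts as the baseline against which the other two are compared.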
4.2 What types of plots and tables will help you to illustrate the findings to your questions?
- A receiver operating characteristic (ROC) curve will be used to illustrate the performance of our model (the trade-off between sensitivity and specificity).
- Box plots, histograms, and density plots will be used to show the distributions of the numeric variables.
- Bar charts will be used to display the distributions of the categorical variables.
- Scatter plots will be used to show the relationship between the dependent variable and each independent variable.
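
To make the ROC idea concrete, the sketch below fits a logistic regression with `glm()` and computes the ROC points by hand in base R. It is an illustration under stated assumptions, not the report's analysis: the data are simulated, the single predictor and its coefficients are invented, and `roc_points()` is a hypothetical helper that stands in for what the ROCR package would provide.

```r
# Sketch only: simulated data and a hypothetical helper `roc_points()`.
set.seed(42)
n           <- 500
lead_time   <- rexp(n, rate = 1 / 80)              # made-up predictor
true_p      <- plogis(-1.5 + 0.015 * lead_time)    # made-up relationship
is_canceled <- rbinom(n, size = 1, prob = true_p)

fit    <- glm(is_canceled ~ lead_time, family = binomial)
scores <- fitted(fit)                              # predicted cancellation probabilities

roc_points <- function(scores, labels) {
  # Sweep the classification threshold from strictest to most lenient
  thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
  tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))  # sensitivity
  fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))  # 1 - specificity
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

roc <- roc_points(scores, is_canceled)

# Area under the curve by the trapezoidal rule
auc <- sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
```

Calling `plot(roc$fpr, roc$tpr, type = "l")` draws the curve. An area near 0.5 means the model ranks bookings no better than chance, while an area near 1 means cancelled bookings almost always receive higher predicted probabilities than kept ones.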
4.3 What do you not know how to do right now that you need to learn to answer your questions?
- The significance of each independent variable is not known at this moment. Other variables that are not included in the original dataset might also explain reservation cancellation, so additional datasets might be required later in the analysis.
4.4 Do you plan on incorporating any machine learning techniques (i.e. linear regression, discriminant analysis, cluster analysis) to answer your questions?
- In addition to the logistic regression model, cluster analysis would be interesting for examining the similarities and differences between countries in cancelling hotel reservations.

$~$

Formatting & Other Requirements

All code is visible, proper coding style is followed, and code is well commented (see section regarding style).
Coding is systematic - complicated problem broken down into sub-problems that are individually much simpler. Code is efficient, correct, and minimal. Code uses appropriate data structure (list, data frame, vector/matrix/array). Code checks for common errors.
Achievement, mastery, cleverness, creativity: Tools and techniques from the course are applied very competently and, perhaps, somewhat creatively. Perhaps student has gone beyond what was expected and required, e.g., extraordinary effort, additional tools not addressed by this course, unusually sophisticated application of tools from course.
.Rmd fully executes without any errors and HTML produced matches the HTML report submitted by student.

$~$

In-class Coding Exercises - Module 4

Start exploring your final project data! Within your group, discuss your final project data and the 10 specific questions you wanted to ask of your data (per the online class prep). Discuss what kind of transformations would best answer these questions.

Start transforming your data to gain new insights. Based on what you learned this week, some questions you may want to ask are:
- What features could you filter on?
  - Filtering by room type might be interesting. Different room types likely serve different market segments, so it would be useful to examine them separately.
- How could arranging your data in different ways help?
  - Aggregating the data to the country level might be interesting, since it would show how cancellation behavior differs between levels (individual vs. country).
- Can you reduce your data by selecting only certain variables?
  - Yes, the analysis only needs 7 of the 32 variables in the original dataset. Furthermore, there are some duplicated rows that can be removed.
- Could creating new variables add new insights?
  - Yes, a new variable, previous_cancellations_rate, can be created from previous_cancellations and previous_bookings_not_canceled. It would give a better view of how often a customer has cancelled reservations historically.
- Could summary statistics at different categorical levels tell you more?
  - Yes, it could. For example, the original dataset has a variable called “country”, and aggregating the data to the country level can reveal cancellation patterns country by country. This analysis could be helpful for hotel companies that operate in multiple countries.
- How can you incorporate the pipe (%>%) operator to make your code more efficient?
  - The pipe operator would be helpful for feeding the original dataset into group_by(), and then for feeding the grouped data into the rest of our analysis.
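
The country-level aggregation and the pipe usage described in the answers above might be sketched as follows. This assumes the dplyr package is installed; `toy` is made-up data with only two columns, and the output column names `bookings` and `cancel_rate` are chosen here for illustration.

```r
# Sketch only: `toy` is made-up data, not an extract of the real file.
suppressPackageStartupMessages(library(dplyr))

toy <- data.frame(
  country     = c("PRT", "PRT", "PRT", "GBR", "GBR"),
  is_canceled = c(1, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

by_country <- toy %>%
  group_by(country) %>%                  # one group per country
  summarise(
    bookings    = n(),                   # number of reservations in the group
    cancel_rate = mean(is_canceled),     # share of reservations cancelled
    .groups     = "drop"
  ) %>%
  arrange(desc(cancel_rate))             # highest cancellation rate first

print(by_country)
```

Each step reads top to bottom in the order it is applied, which avoids nesting function calls or creating intermediate data frames for every stage.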
Does your final project leverage more than one data set?
- Not so far.
- If so, start joining your data sets with the skills you learned.
- If not, try to identify another data set that you can join to your final project data to make it even more interesting.
Your mid-term project eval is due by the end of today’s class. You need to render this mid-term report as an R Markdown HTML product and you need to show all your code. You will need to upload your HTML report to RPubs and then send me the url via Slack. If you want to incorporate the above tasks into your midterm you certainly can do so at this time.

BANA 7025 Midterm Project - Hotel Reservation Cancellation

Chia-Chun Yang

11/7/2021