Initial setup: configure the data set. Load the data set file into
the variable hotel_data.
Data set - Hotels: This data comes
from an open hotel booking demand data set by Antonio, Almeida and
Nunes.
Summary of data set
summary(hotel_data)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Length:119390
## Class :character
## Mode :character
##
##
##
##
In my earlier analysis, I selected the lead_time and cancellation data.
Therefore, I will use these two variables to form the null hypotheses.
Before running a Neyman-Pearson hypothesis test, we need to ensure that
the sample size is sufficient and that the data meet the assumptions of
the test. The Neyman-Pearson approach is often used for binary
hypothesis testing with known distributions and parameters.
To perform a Neyman-Pearson hypothesis test, we need:
1. Knowledge of the underlying distributions of the data.
2. Specific parameters of the distributions under the null and alternative hypotheses.
3. A large sample size.
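As a rough illustration of the three requirements above, the sketch below sets up a Neyman-Pearson test between two simple hypotheses about a cancellation rate. The sample size and both rates (`n`, `p0`, `p1`) are made-up values chosen for the example, not estimates from hotel_data, and the binomial model is itself an assumption.

```r
# Neyman-Pearson sketch for two simple hypotheses about a cancellation rate.
# All numbers below are illustrative assumptions, not estimates from hotel_data.
n     <- 100    # hypothetical sample size
p0    <- 0.30   # cancellation rate under the null hypothesis
p1    <- 0.40   # cancellation rate under the alternative hypothesis
alpha <- 0.05   # maximum allowed Type I error rate

# Because p1 > p0, the likelihood ratio grows with the number of
# cancellations, so the most powerful test rejects for large counts.
# Smallest critical count k with P(X >= k | p0) <= alpha:
critical_count <- qbinom(1 - alpha, size = n, prob = p0) + 1

# Actual size (Type I error) and power of the resulting test
test_size  <- pbinom(critical_count - 1, size = n, prob = p0, lower.tail = FALSE)
test_power <- pbinom(critical_count - 1, size = n, prob = p1, lower.tail = FALSE)

c(critical_count = critical_count, size = test_size, power = test_power)
```

Raising `n` while keeping `p0`, `p1` and `alpha` fixed increases the power, which is why the sample size matters for this kind of test.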
# Number of records in the hotel_data set
total_no_of_record <- nrow(hotel_data)
# Assumed significance level and margin of error
alpha <- 0.05
margin_of_error <- 0.02
# z-score for a two-sided test at alpha = 0.05
z_score <- qnorm(1 - alpha/2)
# Display z_score
z_score
## [1] 1.959964
# Assumed standard deviation (0.5 is the most conservative value for a proportion)
sigma <- 0.5
# Sample size for estimating a proportion
sample_size <- (z_score^2 * sigma^2) / margin_of_error^2
# Round up to the nearest integer
sample_size <- ceiling(sample_size)
# Display sample_size
sample_size
## [1] 2401
# Display the result
paste('Required sample size for Neyman-Pearson hypothesis test in hotel_data set :', sample_size )
## [1] "Required sample size for Neyman-Pearson hypothesis test in hotel_data set : 2401"
In the calculation above, the required sample size (2401) is smaller than
total_no_of_record for the hotel_data set (119390 rows), so we have enough
data to perform the Neyman-Pearson hypothesis test. This sample size rests
on three assumptions: alpha = 0.05 (a commonly used value),
margin_of_error = 0.02, and sigma = 0.5.
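To see how strongly the result depends on the assumed margin of error, the same formula can be wrapped in a small function. The function name `required_sample_size` and the extra margins 0.01 and 0.05 are introduced here for illustration only.

```r
# Sample size needed to estimate a proportion within a given margin of error.
# sigma = 0.5 is the same conservative assumption used in the calculation above.
required_sample_size <- function(alpha, margin_of_error, sigma = 0.5) {
  z_score <- qnorm(1 - alpha / 2)
  ceiling((z_score * sigma / margin_of_error)^2)
}

# Compare a few margins of error at alpha = 0.05
margins <- c(0.01, 0.02, 0.05)
sizes <- sapply(margins, function(m) {
  required_sample_size(alpha = 0.05, margin_of_error = m)
})
data.frame(margin_of_error = margins, sample_size = sizes)
```

The row for a margin of 0.02 reproduces the 2401 obtained above. Because the margin enters the formula squared, halving it roughly quadruples the required sample size.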
To perform a Fisher-style test of significance, we can build a
contingency table and conduct a chi-squared test of independence. This
test will help us determine whether there is a significant association between
lead time (categorized) and cancellation status.
Hypothesis 1 (with visualizations): Lead time does not significantly affect the likelihood of cancellation.
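# First, categorize lead time into quartiles
# Note: without include.lowest = TRUE, bookings whose lead_time equals the lowest break (0) fall outside Q1 and are dropped from the table as NA.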
lead_time_categories <- cut(hotel_data$lead_time, breaks = quantile(hotel_data$lead_time, probs = c(0, 0.25, 0.5, 0.75, 1)), labels = c("Q1", "Q2", "Q3", "Q4"))
# Create a contingency table
contingency_table_hypothesis1 <- table(lead_time_categories, hotel_data$is_canceled)
# Perform chi-squared test
chi_squared_test_hypothesis1 <- chisq.test(contingency_table_hypothesis1)
# Print the results
chi_squared_test_hypothesis1
##
## Pearson's Chi-squared test
##
## data: contingency_table_hypothesis1
## X-squared = 8676.7, df = 3, p-value < 2.2e-16
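With a very large number of rows, almost any association produces a tiny p-value, so it can help to report an effect size next to the chi-squared statistic. The sketch below computes Cramer's V on synthetic data. The simulated `lead_time` and `is_canceled` vectors and the assumed relationship between them are invented for the example and are not the hotel bookings.

```r
set.seed(42)  # synthetic data only; not the hotel bookings
n_bookings <- 5000
lead_time  <- rexp(n_bookings, rate = 1 / 100)
# Assumed relationship: longer lead times cancel more often
is_canceled <- rbinom(n_bookings, size = 1,
                      prob = plogis(-1 + 0.005 * lead_time))

# Quartile categories; include.lowest = TRUE keeps the minimum value in Q1
lead_time_quartile <- cut(lead_time,
                          breaks = quantile(lead_time, probs = seq(0, 1, 0.25)),
                          labels = c("Q1", "Q2", "Q3", "Q4"),
                          include.lowest = TRUE)
tab  <- table(lead_time_quartile, is_canceled)
test <- chisq.test(tab)

# Cramer's V: chi-squared statistic scaled by sample size and table dimensions
cramers_v <- sqrt(unname(test$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
cramers_v
```

Cramer's V lies between 0 (no association) and 1 (perfect association), which makes it easier to judge the strength of the relationship than the p-value alone.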
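# Reduce contingency_table_hypothesis1 to a 2 x 2 matrix for Fisher's exact test.
# Note: the table was limited to two rows and two columns because RStudio on Windows raised a workspace memory error on the full table.
# Caution: matrix() keeps only the first four of the eight cells in column-major order, which are the Q1-Q4 counts for is_canceled = 0, so the result below does not measure the association between lead time and cancellation.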
contingency_table_hypothesis1_dataset_with_limit <- matrix(contingency_table_hypothesis1, nrow = 2,ncol = 2)
## Warning in matrix(contingency_table_hypothesis1, nrow = 2, ncol = 2): data
## length differs from size of matrix: [8 != 2 x 2]
# Determine fisher result over data set - contingency_table_hypothesis1_dataset_with_limit
fisher_result_test_1 <- fisher.test(contingency_table_hypothesis1_dataset_with_limit)
#Display fisher result
fisher_result_test_1
##
## Fisher's Exact Test for Count Data
##
## data: contingency_table_hypothesis1_dataset_with_limit
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.7928541 0.8423255
## sample estimates:
## odds ratio
## 0.8172328
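Two ways of running Fisher's test on a table that keeps both cancellation outcomes are sketched below on synthetic data. The first builds a genuine 2 x 2 table by splitting lead time at its median. The second keeps all four quartiles and uses the `simulate.p.value` argument of `fisher.test`, which estimates the p-value by Monte Carlo instead of exact enumeration. The data, the median split, and the variable names are assumptions made for this example.

```r
set.seed(1)  # synthetic data only; not the hotel bookings
n_bookings  <- 2000
lead_time   <- rexp(n_bookings, rate = 1 / 100)
is_canceled <- rbinom(n_bookings, size = 1,
                      prob = plogis(-1 + 0.005 * lead_time))

# Option 1: a genuine 2 x 2 table, splitting lead time at its median
long_lead  <- ifelse(lead_time > median(lead_time), "long", "short")
two_by_two <- table(long_lead, is_canceled)
fisher_2x2 <- fisher.test(two_by_two)

# Option 2: keep all four quartiles and estimate the p-value by simulation
lead_time_quartile <- cut(lead_time,
                          breaks = quantile(lead_time, probs = seq(0, 1, 0.25)),
                          labels = c("Q1", "Q2", "Q3", "Q4"),
                          include.lowest = TRUE)
four_by_two <- table(lead_time_quartile, is_canceled)
fisher_4x2  <- fisher.test(four_by_two, simulate.p.value = TRUE, B = 2000)

fisher_2x2$estimate   # odds ratio for the median split
fisher_4x2$p.value    # Monte Carlo p-value for the 4 x 2 table
```

Either option keeps rows and columns meaningful, so the odds ratio or p-value can be read as a statement about lead time and cancellation.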
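# Transpose so that lead time quartiles are on the x-axis and the two colors map to cancellation status
barplot(t(contingency_table_hypothesis1), beside = TRUE, legend = TRUE,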
main = "Lead Time vs. Cancellation Status",
xlab = "Lead Time",
ylab = "Frequency",
col = c("lightblue", "salmon"),
args.legend = list(title = "Cancellation Status"))
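Raw frequencies can hide the pattern when the groups differ in size, so a table of row proportions is a useful companion to the bar plot. The counts in this sketch are hypothetical and only show the mechanics of `prop.table`.

```r
# Hypothetical counts, for illustration only (not taken from hotel_data)
counts <- matrix(c(900, 750, 600, 450,    # is_canceled = 0
                   100, 250, 400, 550),   # is_canceled = 1
                 nrow = 4,
                 dimnames = list(lead_time   = c("Q1", "Q2", "Q3", "Q4"),
                                 is_canceled = c("0", "1")))

# Share of each outcome within every lead time quartile (rows sum to 1)
row_shares <- prop.table(counts, margin = 1)

# Cancellation rate per quartile
cancellation_rate <- row_shares[, "1"]
cancellation_rate
```

Applied to the real contingency table, the same call would give the cancellation rate within each lead time quartile.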
Hypothesis 2 (with visualizations): The type of meal booked does not significantly affect the number of special requests made by guests.
# Create a contingency table for meal type and total special requests
contingency_table_hypothesis2 <- table(hotel_data$meal, hotel_data$total_of_special_requests)
# Perform chi-squared test
chi_squared_test_hypothesis2 <- chisq.test(contingency_table_hypothesis2)
## Warning in chisq.test(contingency_table_hypothesis2): Chi-squared approximation
## may be incorrect
# Print the results
chi_squared_test_hypothesis2
##
## Pearson's Chi-squared test
##
## data: contingency_table_hypothesis2
## X-squared = 1841.1, df = 20, p-value < 2.2e-16
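The warning "Chi-squared approximation may be incorrect" is raised when some expected cell counts are below 5. One way to check this and to obtain a p-value that does not rely on the approximation is sketched below on synthetic data. The meal codes, their probabilities, and the Poisson model for special requests are assumptions made for the example.

```r
set.seed(7)  # synthetic data only; not the hotel bookings
n_bookings <- 3000
# Hypothetical meal codes with one rare category
meal <- sample(c("BB", "HB", "SC", "FB"), n_bookings, replace = TRUE,
               prob = c(0.75, 0.14, 0.10, 0.01))
# Hypothetical request counts, capped at 5
special_requests <- pmin(rpois(n_bookings, lambda = 0.6), 5)
tab <- table(meal, special_requests)

# Expected counts under independence; cells below 5 trigger the warning
expected_counts <- suppressWarnings(chisq.test(tab))$expected
sum(expected_counts < 5)

# Monte Carlo p-value that does not rely on the chi-squared approximation
chisq_sim <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
chisq_sim$p.value
```

Another common remedy is to merge sparse categories before testing, for example by grouping the highest request counts into a single level.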
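# Reduce contingency_table_hypothesis2 to a 2 x 2 matrix for Fisher's exact test.
# Caution: matrix() keeps only the first four of the 30 cells in column-major order, which are the counts of the first four meal types in the first special-requests column, so the result below does not measure the association between meal type and special requests.
contingency_table_hypothesis2_dataset_with_limit <- matrix(contingency_table_hypothesis2, nrow = 2,ncol = 2)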
## Warning in matrix(contingency_table_hypothesis2, nrow = 2, ncol = 2): data
## length differs from size of matrix: [30 != 2 x 2]
# Run Fisher's exact test on contingency_table_hypothesis2_dataset_with_limit
fisher_result_test_2 <- fisher.test(contingency_table_hypothesis2_dataset_with_limit)
#Display fisher result
fisher_result_test_2
##
## Fisher's Exact Test for Count Data
##
## data: contingency_table_hypothesis2_dataset_with_limit
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 38.89655 46.05566
## sample estimates:
## odds ratio
## 42.33422
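# Transpose so that meal types are on the x-axis and the colors map to the number of special requests
barplot(t(contingency_table_hypothesis2), beside = TRUE, legend = TRUE,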
main = "Meal Type vs. Total Special Requests",
xlab = "Meal Type",
ylab = "Frequency",
col = rainbow(ncol(contingency_table_hypothesis2)),
args.legend = list(title = "Total Special Requests"))
Explanation of the above hypotheses:
Thank you!!!