Initial setup and data set configuration.
Load the data set file into the variable hotel_data.
Data set - Hotels: This data comes from an open hotel booking demand data set by Antonio, Almeida and Nunes.
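The loading step can be sketched as follows. This is a minimal sketch, not the exact code used for this report: the real file path is not given here, so the example writes a tiny stand-in CSV with made-up rows to a temporary file and reads it back with read.csv().

```r
# Build a tiny stand-in CSV (made-up rows) so the sketch runs without the real file.
# With the real data, point csv_path at the downloaded hotel bookings file instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "hotel,is_canceled,lead_time,meal",
  "Resort Hotel,0,342,BB",
  "City Hotel,1,85,HB",
  "City Hotel,0,7,SC"
), csv_path)

# stringsAsFactors = FALSE keeps text columns as character vectors
hotel_data <- read.csv(csv_path, stringsAsFactors = FALSE)

# Quick structural checks before any analysis
dim(hotel_data)
str(hotel_data)
```

Checking dim() and str() right after loading confirms that the row count and column types are what the later steps expect.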

NULL Hypotheses

Ask :

Summary of the data set

summary(hotel_data)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Length:119390          
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
## 
In my earlier analysis, I selected the lead_time and cancellation data. Therefore, I will use these two aspects for the null hypotheses.
Null hypothesis 1 concerns lead time and cancellation. Null hypothesis 2 concerns the meal booked.

Neyman-Pearson hypothesis test



Before running a Neyman-Pearson hypothesis test, we need to ensure that we have a sufficient sample size and that the data meet the assumptions of the test. The Neyman-Pearson approach is often used for binary hypothesis testing with known distributions and parameters.
To perform a Neyman-Pearson hypothesis test, we need:
1. Knowledge of the underlying distributions of the data.
2. Specific parameters of the distributions under the null and alternative hypotheses.
3. A large sample size.
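The three requirements above can be illustrated with a small sketch of a Neyman-Pearson style decision rule for a proportion. All numbers here (p0, p1, n) are assumptions chosen for illustration, not values estimated from hotel_data, and the sample is simulated. The sketch uses the normal approximation to the sample proportion.

```r
# Hypothetical simple hypotheses about a cancellation proportion
p0 <- 0.35     # proportion under the null hypothesis (assumed)
p1 <- 0.40     # proportion under the alternative hypothesis (assumed)
n <- 2000      # sample size (assumed)
alpha <- 0.05  # Type I error rate fixed before looking at the data

# Standard errors of the sample proportion under each hypothesis
se0 <- sqrt(p0 * (1 - p0) / n)
se1 <- sqrt(p1 * (1 - p1) / n)

# Rejection region fixed in advance: reject H0 when p_hat > critical_value
critical_value <- p0 + qnorm(1 - alpha) * se0

# Power: probability of rejecting H0 when H1 is true
power <- 1 - pnorm((critical_value - p1) / se1)

# Apply the rule to simulated data (drawn under H1 here)
set.seed(42)
sample_cancellations <- rbinom(n, size = 1, prob = p1)
p_hat <- mean(sample_cancellations)
reject_h0 <- p_hat > critical_value

critical_value
power
reject_h0
```

The point of the design is that alpha, the rejection region, and the power are all settled before the data are examined; the data only decide which side of critical_value the estimate falls on.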


# Number of records in the hotel_data set
total_no_of_record <- nrow(hotel_data)

# Assumed significance level
alpha <- 0.05

# Assumed margin of error
margin_of_error <- 0.02
# Two-sided z-score (critical value) for alpha = 0.05
z_score <- qnorm(1 - alpha/2)

# Display z_score
z_score
## [1] 1.959964
# Assumed standard deviation; 0.5 is the largest possible value for a proportion (p = 0.5)
sigma <- 0.5

# Sample size for estimating a proportion: n = z^2 * sigma^2 / E^2
sample_size <- (z_score^2 * sigma^2) / margin_of_error^2
# Round up to the nearest integer
sample_size <- ceiling(sample_size)

# Display sample_size
sample_size
## [1] 2401
# Display the result
paste('Required sample size for Neyman-Pearson hypothesis test in hotel_data set :', sample_size )
## [1] "Required sample size for Neyman-Pearson hypothesis test in hotel_data set : 2401"
In the calculation above, the required sample size (2401) is smaller than total_no_of_record for the hotel_data set (119390 rows), so we have enough data to perform the Neyman-Pearson hypothesis test. This sample size rests on assumed values: alpha = 0.05 (a commonly used value), margin_of_error = 0.02, and sigma = 0.5.
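To see how sensitive the required sample size is to these assumptions, the same formula, n = z^2 * sigma^2 / E^2, can be wrapped in a small function. The function name required_sample_size is hypothetical and introduced only for this sketch.

```r
# Hypothetical helper wrapping the formula n = z^2 * sigma^2 / E^2
required_sample_size <- function(alpha, margin_of_error, sigma = 0.5) {
  z_score <- qnorm(1 - alpha / 2)
  ceiling((z_score^2 * sigma^2) / margin_of_error^2)
}

# Same assumptions as the calculation above: gives 2401
required_sample_size(alpha = 0.05, margin_of_error = 0.02)

# Required sample size for a few margins of error at alpha = 0.05
margins <- c(0.01, 0.02, 0.05)
sizes <- sapply(margins, function(e) required_sample_size(0.05, e))
data.frame(margin_of_error = margins, sample_size = sizes)
```

Because the margin of error enters the formula squared, halving it roughly quadruples the required sample size.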

Perform a Fisher’s style test for significance on the same hypothesis, and interpret the p-value.


To perform a Fisher’s style test for significance, we can build a contingency table and conduct a chi-squared test of independence. This test helps us determine whether there is a significant association between lead time (categorized) and cancellation status.
Hypothesis 1 (with visualizations): Lead time does not significantly affect the likelihood of cancellation.

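# First, categorize lead time into quartiles
# Note: without include.lowest = TRUE, rows whose lead_time equals the minimum fall outside the first interval and become NA
lead_time_categories <- cut(hotel_data$lead_time, breaks = quantile(hotel_data$lead_time, probs = c(0, 0.25, 0.5, 0.75, 1)), labels = c("Q1", "Q2", "Q3", "Q4"))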


# Create a contingency table
contingency_table_hypothesis1 <- table(lead_time_categories, hotel_data$is_canceled)

# Perform chi-squared test
chi_squared_test_hypothesis1  <- chisq.test(contingency_table_hypothesis1)

# Print the results
chi_squared_test_hypothesis1
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table_hypothesis1
## X-squared = 8676.7, df = 3, p-value < 2.2e-16
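# Create a 2 x 2 matrix from contingency_table_hypothesis1
# Note: the table is limited to two rows and two columns because RStudio (on Windows) reported a workspace memory error for the full table.
# Caution: matrix() keeps only the first four cells of the table in column-major order, i.e. the non-canceled counts for Q1 to Q4, so this 2 x 2 matrix no longer crosses lead time with cancellation status.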
contingency_table_hypothesis1_dataset_with_limit <- matrix(contingency_table_hypothesis1, nrow = 2,ncol = 2)
## Warning in matrix(contingency_table_hypothesis1, nrow = 2, ncol = 2): data
## length differs from size of matrix: [8 != 2 x 2]
# Run Fisher's exact test on contingency_table_hypothesis1_dataset_with_limit
fisher_result_test_1 <- fisher.test(contingency_table_hypothesis1_dataset_with_limit)
# Display the Fisher result
fisher_result_test_1
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table_hypothesis1_dataset_with_limit
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.7928541 0.8423255
## sample estimates:
## odds ratio 
##  0.8172328
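# Transpose the table so that lead time categories are the groups on the x-axis and cancellation status is the legend
barplot(t(contingency_table_hypothesis1), beside = TRUE, legend = TRUE,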
        main = "Lead Time vs. Cancellation Status",
        xlab = "Lead Time",
        ylab = "Frequency",
        col = c("lightblue", "salmon"),
        args.legend = list(title = "Cancellation Status"))

  1. Contingency table: Categorize lead time and create a contingency table with lead time categories as rows and cancellation status as columns.
  2. Perform the chi-squared test: Calculate the chi-squared statistic and its associated p-value.
  3. Interpret the results: The p-value is less than 2.2e-16, which is very small. Such a small p-value is strong evidence against the null hypothesis, so we conclude that there is an association between lead time and cancellation status.
  4. Perform Fisher's test on a limited data set (due to the workspace memory error in RStudio): Run Fisher's exact test on a 2 x 2 matrix taken from contingency_table_hypothesis1. Its p-value is also less than 2.2e-16, but because the 2 x 2 matrix holds only the first four cells of the table, it should not be read as a test of lead time against cancellation status. Explanation: The p-value in a chi-squared test is the probability of observing data at least as extreme as ours if the null hypothesis is true.
    If the p-value is less than the chosen significance level (e.g. 0.05, the alpha used earlier), we reject the null hypothesis, indicating that there is a significant association between lead time and cancellation status.
    If the p-value is greater than the significance level, we fail to reject the null hypothesis, meaning the data do not show a significant association between lead time and cancellation status.
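When the exact test runs out of workspace on a large table, two alternatives keep the test tied to the original question. The sketch below shows both on simulated stand-in data (not the real hotel_data; the cancellation model is an assumption made only to produce an association): a true 2 x 2 table built by splitting lead time at the median, and the full 4 x 2 table tested with a Monte Carlo p-value via the simulate.p.value argument of fisher.test().

```r
set.seed(123)
n <- 2000

# Simulated stand-in data: longer lead times cancel more often (assumed model)
lead_time <- rexp(n, rate = 1 / 100)
is_canceled <- rbinom(n, size = 1, prob = plogis(-1 + 0.008 * lead_time))

# Quartile categories; include.lowest = TRUE keeps the minimum value in Q1
lead_time_quartile <- cut(
  lead_time,
  breaks = quantile(lead_time, probs = c(0, 0.25, 0.5, 0.75, 1)),
  labels = c("Q1", "Q2", "Q3", "Q4"),
  include.lowest = TRUE
)

# Option A: a genuine 2 x 2 table (lead time above the median vs. cancellation)
long_lead <- lead_time > median(lead_time)
table_2x2 <- table(long_lead, is_canceled)
fisher_2x2 <- fisher.test(table_2x2)

# Option B: keep the full 4 x 2 table and estimate the p-value by simulation
table_4x2 <- table(lead_time_quartile, is_canceled)
fisher_4x2 <- fisher.test(table_4x2, simulate.p.value = TRUE, B = 2000)

fisher_2x2$estimate   # odds ratio for long vs. short lead time
fisher_2x2$p.value
fisher_4x2$p.value
```

Option A gives an odds ratio with a direct reading (odds of cancellation for long versus short lead times), while Option B preserves all four lead time categories at the cost of a simulated rather than exact p-value.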

Hypothesis 2 (with visualizations): The type of meal booked does not significantly affect the total number of special requests made by guests.

  # Create a contingency table for meal type and total special requests
  contingency_table_hypothesis2 <- table(hotel_data$meal, hotel_data$total_of_special_requests)
  
  # Perform chi-squared test
  chi_squared_test_hypothesis2 <- chisq.test(contingency_table_hypothesis2)
## Warning in chisq.test(contingency_table_hypothesis2): Chi-squared approximation
## may be incorrect
  # Print the results
  chi_squared_test_hypothesis2
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table_hypothesis2
## X-squared = 1841.1, df = 20, p-value < 2.2e-16
  contingency_table_hypothesis2_dataset_with_limit <- matrix(contingency_table_hypothesis2, nrow = 2,ncol = 2)
## Warning in matrix(contingency_table_hypothesis2, nrow = 2, ncol = 2): data
## length differs from size of matrix: [30 != 2 x 2]
# Run Fisher's exact test on contingency_table_hypothesis2_dataset_with_limit
fisher_result_test_2 <- fisher.test(contingency_table_hypothesis2_dataset_with_limit)
# Display the Fisher result
fisher_result_test_2
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table_hypothesis2_dataset_with_limit
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  38.89655 46.05566
## sample estimates:
## odds ratio 
##   42.33422
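  # Transpose the table so that meal types are the groups on the x-axis and special request counts are the legend
  barplot(t(contingency_table_hypothesis2), beside = TRUE, legend = TRUE,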
        main = "Meal Type vs. Total Special Requests",
        xlab = "Meal Type",
        ylab = "Frequency",
        col = rainbow(ncol(contingency_table_hypothesis2)),
        args.legend = list(title = "Total Special Requests"))

Explanation of the above hypothesis:

  1. Contingency table: contingency_table_hypothesis2 is used in the chi-squared test and contingency_table_hypothesis2_dataset_with_limit in Fisher's test.
  2. The p-value is less than 2.2e-16, which is strong evidence against the null hypothesis.
  3. The alternative hypothesis states that the true odds ratio is not equal to 1.
  4. The 95% confidence interval for the odds ratio is [38.89655, 46.05566].
  5. The estimated odds ratio is 42.33422. The chi-squared result indicates a significant association between meal type and total special requests; the Fisher odds ratio describes only the four cells kept in the 2 x 2 matrix.
    Note: For Fisher's test, I limited the contingency table to two rows and two columns because RStudio threw a workspace memory error. I tried to increase the memory, but that option did not work.
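The warning "Chi-squared approximation may be incorrect" appears when some expected cell counts are small. One way to check this and to avoid relying on the approximation is sketched below on simulated stand-in data (not the real hotel_data; the category labels "A" to "D" and the probabilities are made up for illustration). It inspects the expected counts and then uses the simulate.p.value argument of chisq.test().

```r
set.seed(7)
n <- 1500

# Simulated stand-in data with one rare category (labels are illustrative)
meal <- sample(c("A", "B", "C", "D"), n, replace = TRUE,
               prob = c(0.75, 0.15, 0.09, 0.01))
special_requests <- pmin(rpois(n, lambda = 0.6), 5)

meal_table <- table(meal, special_requests)

# Expected counts under independence; any cell below 5 triggers the warning
expected_counts <- suppressWarnings(chisq.test(meal_table))$expected
any_small_cells <- any(expected_counts < 5)
any_small_cells

# Monte Carlo p-value: does not depend on the chi-squared approximation
chi_squared_simulated <- chisq.test(meal_table, simulate.p.value = TRUE, B = 2000)
chi_squared_simulated$p.value
```

With simulate.p.value = TRUE the reported p-value cannot be smaller than 1 / (B + 1), so a larger B is needed to resolve very small p-values.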

Thank you!!!