Hotels data - NULL Hypotheses

Initial setup: configure the data set.
Load the data set file into the variable hotel_data.
Data set - Hotels: This data comes from an open hotel booking demand data set by Antonio, Almeida and Nunes.

NULL Hypotheses

Ask :

Devise at least two different null hypotheses based on two different aspects (e.g., columns) of your data. For each hypothesis:
- Come up with an alpha level, power level, and minimum effect size, and explain why you chose each value.
- Determine if you have enough data to perform a Neyman-Pearson hypothesis test. If you do, show your sample size calculation, perform the test, and interpret results. If not, explain why there isn’t enough data.
- Perform a Fisher’s style test for significance on the same hypothesis, and interpret the p-value.
- (In the end, you should have two hypothesis tests for each hypothesis, equating to four total tests.)

Summary of data set

summary(hotel_data)

##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Length:119390          
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
##

In my earlier analysis, I selected the lead_time and cancellation data. Therefore, I will use these two aspects for the null hypotheses.
NULL Hypotheses 1: Upon “LEAD TIME/Cancellation”

Lead_time: Number of days between booking and arrival.
Cancellation: Whether the guest cancelled the booking for his/her stay in the hotel (is_canceled).
Aspect of data: The chance of cancellation is not greatly impacted by the lead time, that is, the number of days between the reservation and the arrival.
NULL Hypothesis: There is no difference in cancellation rates between bookings with different lead times.
Alpha Level (the probability of rejecting the null hypothesis when it is true): 0.05. If the p-value is less than 0.05, the null hypothesis is rejected.
Power Level (the probability of rejecting the null hypothesis when it is false): 0.80 or greater. A power of 0.80 means that a true effect of at least the minimum effect size is detected 80% of the time.
Minimum Effect Size: 0.1
1. The minimum effect size is the smallest difference or association that is practically meaningful or relevant in the context of the testing.
2. Choosing a minimum effect size of 0.1 implies that the testing aims to detect relatively small but still meaningful effects.
3. A smaller effect size necessitates a larger data set to achieve sufficient statistical power, so choosing 0.1 means the test must be powered to detect even small effects.
Explanation: The alpha level is a critical threshold used in null hypothesis analysis. It represents the maximum acceptable risk of making a Type I error (a false positive). The value 0.05 is commonly used because it balances the risk of Type I errors (false positives) against the risk of Type II errors (false negatives).
Power is the probability of correctly rejecting the null hypothesis when there is a true effect (avoiding a Type II error). A power of 0.80 or higher ensures that the testing can detect real effects: a power of 0.80 means an 80% chance of detecting a true effect.
The minimum effect size quantifies the magnitude of the difference or association between groups or variables, and a value of 0.1 signifies a small but meaningful effect. For example, suppose we have a new hotel reservation system and we are comparing it to the old one: even a small reduction in cancellations matters for hotel revenue and customer satisfaction.
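To make the minimum effect size of 0.1 more concrete, the sketch below translates it into a difference in cancellation rates. It assumes the effect size is measured as Cohen's h (a standardized difference between two proportions), which the report does not state explicitly; `cohens_h`, `baseline_rate`, and `shifted_rate` are illustrative names, and the baseline of 0.3704 is the mean of is_canceled from the summary above.

```r
# Cohen's h: difference between two arcsine-transformed proportions.
# Assumption: the report's effect size of 0.1 is interpreted as Cohen's h.
cohens_h <- function(p1, p2) {
  2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
}

baseline_rate <- 0.3704  # mean of is_canceled from summary(hotel_data)
min_effect <- 0.1        # minimum effect size chosen for hypothesis 1

# Cancellation rate that lies exactly h = 0.1 above the baseline
shifted_rate <- sin(asin(sqrt(baseline_rate)) + min_effect / 2)^2

shifted_rate - baseline_rate            # difference in cancellation rates
cohens_h(shifted_rate, baseline_rate)   # recovers the chosen effect size
```

Under this assumption, an effect size of 0.1 corresponds to a difference of roughly five percentage points in the cancellation rate around the observed baseline.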

NULL Hypotheses 2: Upon “Meal booked”.

Meal booked: The type of meal booked by the guest, like veg or non-veg / breakfast, lunch, dinner etc.
Aspect of data: The type of meal booked does not significantly affect the total number of special requests made by guests. A small change in the options may still have a big impact on reservations.
Null Hypothesis: There is no difference in the total number of special requests between guests booking different meal types.
Alpha Level: 0.01
Power Level: 0.9. A higher power level makes it more likely that a true effect is detected.
Minimum Effect Size: 0.5. A moderate effect size represents a noticeable difference in the total number of special requests.
Explanation:
1. The alpha level is a critical threshold used in null hypothesis analysis. It represents the maximum acceptable risk of making a Type I error (a false positive).
2. A value of 0.01 is more stringent than 0.05. It reduces the risk of a false positive, so stronger evidence is required before rejecting the null hypothesis that meal booking does not play a major role.
3. Power is the probability of correctly rejecting the null hypothesis when there is a true effect (avoiding a Type II error). A high power of 0.9 is sensitive enough to identify real effects. In other words, 0.9 or 90% indicates a 90% chance of correctly detecting an effect of meal booking in hotel reservations.
4. The minimum effect size quantifies the magnitude of the difference or association between groups or variables. A value of 0.5 is a moderate effect that is practically significant: if the impact of meal booking on special requests exceeds 0.5 standard deviations, it is considered practically meaningful. A small effect may be statistically significant without having substantial practical implications, whereas a larger effect has a more tangible impact.

Neyman-Pearson hypothesis test

To perform a Neyman-Pearson hypothesis test, we need to ensure that we have a sufficient sample size and that the data meet the assumptions of the test. The Neyman-Pearson approach is often used in the context of binary hypothesis testing with known distributions and parameters.
To perform a Neyman-Pearson hypothesis test, we need:
1. Knowledge of the underlying distributions of the data.
2. Specific parameters of the distributions under the null and alternative hypotheses.
3. A large sample size.

# Find the number of records in the hotel_data set.
total_no_of_record <- nrow(hotel_data)

# Assumed value for alpha
alpha <- 0.05

# Assumed margin of error for the estimated proportion
margin_of_error <- 0.02
# Determine the z-score for a two-sided test when alpha is 0.05
z_score <- qnorm(1 - alpha / 2)

# Display z_score
z_score

## [1] 1.959964

# Assumed value for sigma (0.5 is the largest possible standard deviation of a proportion)
sigma <- 0.5

# Calculate sample size for proportions: n = z^2 * sigma^2 / margin_of_error^2
sample_size <- (z_score^2 * sigma^2) / margin_of_error^2
# Round up to the nearest integer
sample_size <- ceiling(sample_size)

# Display sample_size
sample_size

## [1] 2401

# Display the result
paste('Required sample size for Neyman-Pearson hypothesis test in hotel_data set :', sample_size )

## [1] "Required sample size for Neyman-Pearson hypothesis test in hotel_data set : 2401"

In the above calculation, the required sample size (2401) is smaller than total_no_of_record of the hotel_data set (119390), so we have enough data to perform the Neyman-Pearson hypothesis test. This sample size is based on the following assumptions: alpha = 0.05 (a commonly used value), margin_of_error = 0.02, and sigma = 0.5 (the most conservative value for a proportion).
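The calculation above is based on a margin of error. As a complementary sketch, the sample size can also be derived from the alpha level, power level, and minimum effect size chosen for each hypothesis. The function below uses the standard normal-approximation formula n = 2 * ((z_alpha + z_power) / d)^2 per group. It assumes a two-sided comparison of two equally sized groups and a standardized effect size (Cohen's d or h); for hypothesis 2, which has more than two meal types, this is a simplification. The name `n_per_group` is illustrative and not part of the original analysis.

```r
# Normal-approximation sample size per group for a two-sided, two-group test.
# Assumption: `effect_size` is a standardized difference (Cohen's d or h).
n_per_group <- function(effect_size, alpha, power) {
  z_alpha <- qnorm(1 - alpha / 2)  # critical value for the two-sided test
  z_power <- qnorm(power)          # quantile matching the desired power
  ceiling(2 * ((z_alpha + z_power) / effect_size)^2)
}

# Hypothesis 1: alpha = 0.05, power = 0.80, minimum effect size = 0.1
n_h1 <- n_per_group(effect_size = 0.1, alpha = 0.05, power = 0.80)
# Hypothesis 2: alpha = 0.01, power = 0.90, minimum effect size = 0.5
n_h2 <- n_per_group(effect_size = 0.5, alpha = 0.01, power = 0.90)

n_h1  # 1570 per group
n_h2  # 120 per group
```

Both values are far below the 119390 records in hotel_data, so under these assumptions the data set is large enough for either hypothesis.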

Perform a Fisher’s style test for significance on the same hypothesis, and interpret the p-value.

To perform a Fisher-style test for significance, we can build a contingency table and conduct a chi-squared test of independence. This test will help us determine whether there is a significant association between lead time (categorized) and cancellation status.
Hypothesis 1 (with visualizations): Lead time does not significantly affect the likelihood of cancellation.

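# First, categorize lead time (e.g., into quartiles)
# Note: without include.lowest = TRUE, bookings whose lead_time equals the minimum fall outside the first interval and become NA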
lead_time_categories <- cut(hotel_data$lead_time, breaks = quantile(hotel_data$lead_time, probs = c(0, 0.25, 0.5, 0.75, 1)), labels = c("Q1", "Q2", "Q3", "Q4"))


# Create a contingency table
contingency_table_hypothesis1 <- table(lead_time_categories, hotel_data$is_canceled)

# Perform chi-squared test
chi_squared_test_hypothesis1  <- chisq.test(contingency_table_hypothesis1)

# Print the results
chi_squared_test_hypothesis1

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table_hypothesis1
## X-squared = 8676.7, df = 3, p-value < 2.2e-16

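# Create a 2x2 matrix from contingency_table_hypothesis1
# Note - the table has been limited to two rows and two columns because RStudio reported a workspace memory error.
# Caution: matrix() keeps only the first four cells of the 4x2 table in column-major order (see the warning below),
# i.e. the not-canceled counts for Q1 to Q4, so this matrix is not a true lead time x cancellation cross-tabulation.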
contingency_table_hypothesis1_dataset_with_limit <- matrix(contingency_table_hypothesis1, nrow = 2,ncol = 2)

## Warning in matrix(contingency_table_hypothesis1, nrow = 2, ncol = 2): data
## length differs from size of matrix: [8 != 2 x 2]

# Run Fisher's exact test on contingency_table_hypothesis1_dataset_with_limit
fisher_result_test_1 <- fisher.test(contingency_table_hypothesis1_dataset_with_limit)
# Display the Fisher result
fisher_result_test_1

## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table_hypothesis1_dataset_with_limit
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.7928541 0.8423255
## sample estimates:
## odds ratio 
##  0.8172328

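# Transpose the table so that lead time categories form the x-axis groups and cancellation status forms the legend
barplot(t(contingency_table_hypothesis1), beside = TRUE, legend = TRUE,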
        main = "Lead Time vs. Cancellation Status",
        xlab = "Lead Time",
        ylab = "Frequency",
        col = c("lightblue", "salmon"),
        args.legend = list(title = "Cancellation Status"))

Contingency Table: Categorize lead time and create a contingency table with lead time categories as rows and cancellation status as columns.
Perform the Chi-Squared Test: Calculate the chi-squared statistic and its associated p-value.
Interpret the Results: The p-value is less than 2.2e-16, which is far below the alpha level of 0.05. Such a small p-value is strong evidence against the null hypothesis, so we conclude that there is an association between lead time and cancellation status.
Perform Fisher on limited data set (due to workspace memory error in RStudio): Fisher's exact test was run on a 2x2 matrix built from contingency_table_hypothesis1, and its p-value is similar to that of the chi-squared test. Because matrix() truncated the table to its first four cells (see the warning in the output), the reported odds ratio should be read with caution.
Explanation: The p-value in a chi-squared test represents the probability of observing data at least as extreme as ours if the null hypothesis is true.
If the p-value is less than the chosen significance level (e.g. 0.05, as for hypothesis 1), we reject the null hypothesis, indicating that there is a significant association between lead time and cancellation status.
If the p-value is greater than the significance level, we fail to reject the null hypothesis, suggesting that there is no significant association between lead time and cancellation status.
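The same decision can be written in the Neyman-Pearson style, where the rejection region is fixed in advance from alpha and the observed statistic is compared against the critical value. This is a minimal sketch that reuses the X-squared value and degrees of freedom printed by chisq.test() for hypothesis 1; the variable names are illustrative.

```r
alpha <- 0.05                  # alpha level chosen for hypothesis 1
observed_statistic <- 8676.7   # X-squared reported by chisq.test() above
degrees_of_freedom <- 3        # df reported by chisq.test() above

# Critical value: reject the null hypothesis if the statistic exceeds it
critical_value <- qchisq(1 - alpha, df = degrees_of_freedom)
reject_null <- observed_statistic > critical_value

critical_value  # about 7.81
reject_null     # TRUE
```

The observed statistic is far inside the rejection region, which agrees with the p-value based conclusion.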

Hypothesis 2 (with visualizations): The type of meal booked does not significantly affect the total number of special requests made by guests.

  # Create a contingency table for meal type and total special requests
  contingency_table_hypothesis2 <- table(hotel_data$meal, hotel_data$total_of_special_requests)
  
  # Perform chi-squared test
  chi_squared_test_hypothesis2 <- chisq.test(contingency_table_hypothesis2)

## Warning in chisq.test(contingency_table_hypothesis2): Chi-squared approximation
## may be incorrect

  # Print the results
  chi_squared_test_hypothesis2

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table_hypothesis2
## X-squared = 1841.1, df = 20, p-value < 2.2e-16

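  # Limit the table to a 2x2 matrix for fisher.test(); matrix() keeps only the first four cells (see the warning below)
  contingency_table_hypothesis2_dataset_with_limit <- matrix(contingency_table_hypothesis2, nrow = 2,ncol = 2)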

## Warning in matrix(contingency_table_hypothesis2, nrow = 2, ncol = 2): data
## length differs from size of matrix: [30 != 2 x 2]

# Run Fisher's exact test on contingency_table_hypothesis2_dataset_with_limit
fisher_result_test_2 <- fisher.test(contingency_table_hypothesis2_dataset_with_limit)
# Display the Fisher result
fisher_result_test_2

## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table_hypothesis2_dataset_with_limit
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  38.89655 46.05566
## sample estimates:
## odds ratio 
##   42.33422

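  # Transpose the table so that meal types form the x-axis groups and special request counts form the legend
  barplot(t(contingency_table_hypothesis2), beside = TRUE, legend = TRUE,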
        main = "Meal Type vs. Total Special Requests",
        xlab = "Meal Type",
        ylab = "Frequency",
        col = rainbow(ncol(contingency_table_hypothesis2)),
        args.legend = list(title = "Total Special Requests"))

Explanation of the above hypothesis:

Contingency table: contingency_table_hypothesis2 is used in the chi-squared test and contingency_table_hypothesis2_dataset_with_limit in the Fisher test.
The p-value is less than 2.2e-16, which is below the alpha level of 0.01 and indicates strong evidence against the null hypothesis. R warned that the chi-squared approximation may be incorrect, which happens when some cells have small expected counts.
The alternative hypothesis states that the true odds ratio is not equal to 1.
The 95% confidence interval for the odds ratio is [38.89655, 46.05566].
The estimated odds ratio is 42.33422. This result indicates that there is a significant association between meal type and total special requests. Because the 2x2 matrix contains only the first four cells of the full table, the odds ratio itself should be read with caution.
Note: For the Fisher test, I limited the contingency table to two rows and two columns because RStudio was throwing a workspace memory error. I tried to increase the memory, but that option did not work.
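One possible way around the memory limit, without truncating the table, is the Monte Carlo option of fisher.test(), which estimates the p-value by simulation for tables larger than 2x2. The sketch below is not part of the original analysis and runs on simulated toy data, so its p-value says nothing about the hotel data; `toy_meal`, `toy_requests`, and the category labels "A", "B", "C" are made up for illustration.

```r
set.seed(42)  # make the simulated example reproducible

# Toy data: three hypothetical meal types and 0 to 3 special requests
toy_meal <- sample(c("A", "B", "C"), size = 300, replace = TRUE)
toy_requests <- sample(0:3, size = 300, replace = TRUE)

# Full r x c contingency table, no truncation to 2x2
toy_table <- table(toy_meal, toy_requests)

# Monte Carlo p-value instead of the exact network algorithm
fisher_sim <- fisher.test(toy_table, simulate.p.value = TRUE, B = 10000)
fisher_sim$p.value
```

On the real data, the same call would take contingency_table_hypothesis2 in place of toy_table; whether it fits in memory on a given machine is something to verify.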

Thank you!!!

Reshu

2024-02-25
