Initial setup: configure the data set.
Load the data set file into the variable hotel_data.
Data set - Hotels: This data comes from an open hotel booking demand dataset by Antonio, Almeida and Nunes.

Data Dive — Confidence Intervals

Ask: Build at least three pairs of variables

For each pair of variables, include at least one column that you created (i.e., calculated based on others)
All variables for this data dive should be either continuous (i.e., numeric) or ordered (e.g., [‘small’, ‘medium’, ‘large’] is okay, but [“apples”, “oranges”, “bananas”] is not)
At least one pair should be a response variable and an explanatory variable

The section below defines three pairs:

1. Pair 1: Response and Explanatory Variables

Response Variable: is_canceled (binary: 0 for not canceled, 1 for canceled)
Explanatory Variable: lead_time (continuous variable representing the number of days that elapsed between the entering date of the booking and the arrival date)

pair_1 <- hotel_data[, c("is_canceled", "lead_time")]

# Display the first few rows of pair 1
head(pair_1)

##   is_canceled lead_time
## 1           0       342
## 2           0       737
## 3           0         7
## 4           0        13
## 5           0        14
## 6           0        14

2. Pair 2: Continuous Variables

Variable 1: stays_in_weekend_nights (continuous variable representing the number of weekend nights the guest stayed)
Variable 2: stays_in_week_nights (continuous variable representing the number of weekday nights the guest stayed)

# Select the two stay-length columns for pair 2
pair_2 <- hotel_data[, c("stays_in_weekend_nights", "stays_in_week_nights")]

# Display the first few rows of pair 2
head(pair_2)

##   stays_in_weekend_nights stays_in_week_nights
## 1                       0                    0
## 2                       0                    0
## 3                       0                    1
## 4                       0                    1
## 5                       0                    2
## 6                       0                    2

3. Pair 3: Ordered Categorical Variables

Variable 1: reserved_room_type (ordered categorical variable representing the type of room reserved).
Variable 2: assigned_room_type (ordered categorical variable representing the type of room assigned to the guest upon arrival).

The room_type_difference column (in this section) is the created column for this pair. It is calculated from ‘reserved_room_type’ and ‘assigned_room_type’ as the assigned room type minus the reserved room type, after both are converted to numbers.

# Convert room type letters to numeric values (A = 1, ..., H = 8)
room_type_numeric <- function(room_type) {
  # match() returns the position of the letter in A..H,
  # or NA if the room type is not recognized
  match(as.character(room_type), LETTERS[1:8])
}

# Apply the function to the reserved and assigned room type columns
hotel_data$reserved_room_numeric <- sapply(hotel_data$reserved_room_type, room_type_numeric)
hotel_data$assigned_room_numeric <- sapply(hotel_data$assigned_room_type, room_type_numeric)


# Calculate room type difference
hotel_data$room_type_difference <- hotel_data$assigned_room_numeric - hotel_data$reserved_room_numeric

# Pair 3: Ordered Categorical Variables with Room Type Difference
pair_3 <- hotel_data[, c("reserved_room_type", "assigned_room_type", "room_type_difference")]


# Display the first few rows of pair 3
head(pair_3)

##   reserved_room_type assigned_room_type room_type_difference
## 1                  C                  C                    0
## 2                  C                  C                    0
## 3                  A                  C                    2
## 4                  A                  A                    0
## 5                  A                  A                    0
## 6                  A                  A                    0

tail(pair_3)

##        reserved_room_type assigned_room_type room_type_difference
## 119385                  A                  A                    0
## 119386                  A                  A                    0
## 119387                  E                  E                    0
## 119388                  D                  D                    0
## 119389                  A                  A                    0
## 119390                  A                  A                    0

Ask:

Plot a visualization for each relationship, and draw some conclusions based on the plot.

Use what we’ve covered so far in class to scrutinize the plot (e.g., are there any outliers?)

Plots with scrutinization

1. Cancellation vs. Lead Time:

Look for any patterns or trends in cancellations as lead time increases. Are there any outliers in lead time?
Conclusion: Cancellations appear to be more common when lead time is longer, but further analysis is needed to confirm this.

This scatter plot compares lead time (the number of days between booking and arrival) with whether a booking was canceled.
Each point represents one booking in the dataset: the x-coordinate is the lead time in days, and the y-coordinate is 1 if the booking was canceled and 0 if it was not.
The x-axis is labeled “Lead Time” and the y-axis is labeled “Canceled (1) or Not (0)”. Cancellation is the response variable we want to understand in relation to lead time.
Because the response is binary, the points fall on two horizontal bands (y = 0 and y = 1), so any pattern shows up in how far and how densely each band extends along the lead-time axis.
This plot shows how lead time relates to booking cancellations, which helps us understand the effect of timing on cancellation behavior in the dataset.
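
Since a scatter plot of a 0/1 response is hard to read, one way to check the trend is to compare cancellation rates across lead-time ranges. The sketch below is only an illustration: the `toy` data frame is hypothetical (it stands in for hotel_data), and the bucket boundaries are an arbitrary assumption.

```r
# Hypothetical toy data standing in for hotel_data (illustrative values only)
toy <- data.frame(
  lead_time   = c(3, 10, 25, 40, 75, 120, 200, 310),
  is_canceled = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# Bucket lead time into ranges; the break points are an arbitrary choice
toy$lead_bucket <- cut(
  toy$lead_time,
  breaks = c(0, 30, 90, Inf),
  labels = c("0-30", "31-90", "91+"),
  include.lowest = TRUE
)

# The mean of a 0/1 column is the share of canceled bookings in each bucket
cancel_rate <- tapply(toy$is_canceled, toy$lead_bucket, mean)
print(cancel_rate)
```

If the rate rises from one bucket to the next on the real data, that would support the tendency seen in the plot.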

2. Weekend Nights vs. Week Nights:

Are there any extreme values or outliers in the number of weekend or week nights stayed?
Conclusion: Check for any guests staying exceptionally long periods during weekends or weekdays, as these could be potential outliers.

This scatter plot compares the number of weekend nights stayed with the number of week nights stayed.
The x-axis is labeled “Weekend Nights,” the number of weekend nights (Saturday or Sunday) the guest stayed.
The y-axis is labeled “Week Nights,” the number of week nights (Monday through Friday) the guest stayed.
This plot provides a visual representation of the relationship between the number of weekend and week nights stayed, helping us understand how guests’ booking patterns vary across different days of the week.
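
To make the outlier check concrete, the 1.5 × IQR rule can be used to flag unusually long stays. This is a minimal sketch on a hypothetical vector of week-night counts, not values taken from the dataset; the same steps could be applied to hotel_data$stays_in_week_nights.

```r
# Hypothetical week-night counts (illustrative values only)
week_nights <- c(0, 1, 1, 2, 2, 2, 3, 3, 4, 5, 40)

# 1.5 * IQR rule: flag values more than 1.5 IQRs outside the quartiles
q <- unname(quantile(week_nights, c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

outliers <- week_nights[week_nights < lower_fence | week_nights > upper_fence]
print(outliers)
```

Flagged values are candidates for a closer look, not necessarily errors: a very long stay can be a genuine booking.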

3. Reserved Room Type vs. Assigned Room Type:

Are there bookings where the assigned room type differs from the reserved room type, and how far apart are the two?
Conclusion: Check for points away from the diagonal, since these are bookings where the guest did not receive the room type they reserved.

This scatter plot compares the reserved room type with the assigned room type for each booking.
Each point represents a booking in the dataset: the x-coordinate is the reserved room type and the y-coordinate is the assigned room type.
The distribution of points shows where the reserved and assigned room types agree and where they differ:

Most points fall along the diagonal, which means the reserved and assigned room types usually match and guests generally receive the room type they booked.

Points away from the diagonal mark bookings where the assigned room type differs from the reserved one. These could be guests who were upgraded to a different room type or cases where there were issues with room availability.
This plot shows how consistently guests receive the room type they reserved and helps identify discrepancies or trends in room assignments within the dataset.
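
Because many bookings share the same pair of room types, points in the scatter plot overlap, and a cross-tabulation can show how many bookings sit on the diagonal. The sketch below uses a hypothetical `toy` data frame with the A-H room codes; on the real data the two hotel_data columns would take its place.

```r
# Hypothetical bookings (illustrative only), using the A-H room codes
toy <- data.frame(
  reserved_room_type = c("A", "A", "A", "D", "D", "E", "A", "C"),
  assigned_room_type = c("A", "A", "C", "D", "E", "E", "A", "C")
)

# Use one shared level set so the table is square and the diagonal lines up
room_levels <- LETTERS[1:8]
reserved <- factor(toy$reserved_room_type, levels = room_levels)
assigned <- factor(toy$assigned_room_type, levels = room_levels)

room_table <- table(reserved, assigned)

# Bookings on the diagonal received exactly the room type they reserved
match_share <- sum(diag(room_table)) / sum(room_table)
print(match_share)
```

The off-diagonal cells of `room_table` then show which reserved types are most often reassigned, and to which types.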

Ask:

Calculate the appropriate correlation coefficient for each of these combinations

Explain why the value makes sense (or doesn’t) based on the visualization(s)

Build a confidence interval for each of the response variable(s). Provide a detailed conclusion of the response variable (i.e., the population) based on your confidence interval

This section calculates a correlation coefficient for each combination and builds a confidence interval for the response variable, in two steps:
1. Calculate the correlation coefficient for each pair of variables.
2. Construct a confidence interval for the response variable (is_canceled).

# Calculate correlation coefficients
cor_pair1 <- cor(hotel_data$is_canceled, hotel_data$lead_time)
cor_pair2 <- cor(hotel_data$stays_in_weekend_nights, hotel_data$stays_in_week_nights)



# Define function to calculate Cramer's V
cramers_v <- function(x, y) {
  confusion_matrix <- table(x, y)
  n <- sum(confusion_matrix)
  chi_sq <- chisq.test(confusion_matrix)$statistic
  v <- sqrt(chi_sq / (n * (min(nrow(confusion_matrix), ncol(confusion_matrix)) - 1)))
  return(v)
}


# Calculate Cramer's V for reserved_room_type and assigned_room_type
cramer_v_pair3 <- cramers_v(hotel_data$reserved_room_type, hotel_data$assigned_room_type)

## Warning in chisq.test(confusion_matrix): Chi-squared approximation may be
## incorrect

# Print correlation coefficients and Cramer's V
print(paste("Correlation coefficient for Pair 1:", cor_pair1))

## [1] "Correlation coefficient for Pair 1: 0.293123355760716"

print(paste("Correlation coefficient for Pair 2:", cor_pair2))

## [1] "Correlation coefficient for Pair 2: 0.498968818495529"

print(paste("Cramer's V for Pair 3:", cramer_v_pair3))

## [1] "Cramer's V for Pair 3: 0.776358336690268"

# Confidence interval for response variable is_canceled
is_canceled_mean <- mean(hotel_data$is_canceled)
is_canceled_sd <- sd(hotel_data$is_canceled)
n <- length(hotel_data$is_canceled)
standard_error <- is_canceled_sd / sqrt(n)

# Assuming a normal distribution, construct a 95% confidence interval
lower_bound <- is_canceled_mean - qnorm(0.975) * standard_error
upper_bound <- is_canceled_mean + qnorm(0.975) * standard_error

# Print confidence interval
print(paste("95% Confidence Interval for is_canceled:", lower_bound, "-", upper_bound))

## [1] "95% Confidence Interval for is_canceled: 0.367676994663234 - 0.373155570878268"

1. Correlation Coefficients:

Pair 1 (is_canceled vs. lead_time): The correlation coefficient measures the strength and direction of the linear relationship between the variables. Here it is about 0.29, a weak-to-moderate positive relationship: as lead time increases, the likelihood of cancellation increases. Because is_canceled is binary, this Pearson coefficient is a point-biserial correlation. The value agrees with the tendency noted in the plot for cancellations to be more common at longer lead times.
Pair 2 (stays_in_weekend_nights vs. stays_in_week_nights): The coefficient is about 0.50, a moderate positive linear relationship: guests who stay more weekend nights also tend to stay more week nights. This is plausible because longer stays usually cover both kinds of nights.
Pair 3 (reserved_room_type vs. assigned_room_type): Since these are categorical variables, we used Cramer’s V as a measure of association. Cramer’s V ranges from 0 to 1, where 0 indicates no association and 1 indicates a perfect association, and it does not use the ordering of the categories. The value of about 0.78 indicates a strong association, consistent with most points falling on the diagonal of the plot. The chi-squared warning means some cells of the table have small expected counts, so the value should be read as approximate.
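
As a quick sanity check of the 0-to-1 range, the same formula used in cramers_v above can be run on two toy inputs: one where the second variable repeats the first exactly, and one where every combination occurs equally often. The function name `cramers_v_sketch` and the inputs are hypothetical; `correct = FALSE` is an added assumption that only matters for 2x2 tables, where R's chisq.test otherwise applies a continuity correction.

```r
# Same formula as cramers_v above, repeated so this sketch runs on its own
cramers_v_sketch <- function(x, y) {
  tab <- table(x, y)
  # Small toy tables trigger the "approximation may be incorrect" warning
  chi_sq <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  as.numeric(sqrt(chi_sq / (sum(tab) * (min(dim(tab)) - 1))))
}

# Hypothetical toy inputs
x <- rep(c("A", "B", "C"), each = 3)
v_perfect <- cramers_v_sketch(x, x)                         # y repeats x exactly
v_none    <- cramers_v_sketch(x, rep(c("A", "B", "C"), 3))  # each combination once
print(c(perfect = v_perfect, none = v_none))
```

The first call gives 1 and the second gives 0, which brackets the value reported for Pair 3.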

2. Confidence Interval for is_canceled:

Because is_canceled is coded 0/1, its mean is the proportion of bookings that were canceled. The 95% confidence interval runs from about 0.3677 to 0.3732, so we are 95% confident that the true population cancellation rate lies between roughly 36.8% and 37.3%. The interval is narrow because the sample is large.
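
Because the response is a 0/1 variable, the same interval can be approximated with the standard error of a proportion. The sketch below is a cross-check rather than the method used above. It assumes n equals the last row index shown by tail(pair_3) (that is, no rows were dropped) and takes the sample cancellation rate as the rounded midpoint of the printed interval.

```r
# Inputs read off the outputs above (rounded); both are assumptions of this sketch
n     <- 119390    # last row index shown by tail(pair_3)
p_hat <- 0.370416  # midpoint of the printed 95% interval

# Standard error of a sample proportion, then a 95% normal-approximation interval
se <- sqrt(p_hat * (1 - p_hat) / n)
z  <- qnorm(0.975)
ci <- c(lower = p_hat - z * se, upper = p_hat + z * se)
print(round(ci, 4))
```

The result agrees with the interval computed from sd() to about four decimal places, since sd() divides by n - 1 rather than n and the difference is negligible at this sample size.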

Conclusion: Taken together, the correlation coefficients and the confidence interval describe how the variables relate and how often bookings are canceled, which can inform decision-making:
1. Correlation Coefficients:

For Pair 1 (is_canceled vs. lead_time), the positive coefficient (about 0.29) means that longer lead times go with a higher likelihood of cancellation, although lead time alone accounts for only a small part of the variation in cancellations.
For Pair 2 (stays_in_weekend_nights vs. stays_in_week_nights), the positive coefficient (about 0.50) means that guests who stay more weekend nights tend to stay more week nights as well.

2. Confidence Interval for is_canceled:

The 95% confidence interval (about 0.368 to 0.373) gives a range of plausible values for the population cancellation rate. Its narrow width shows that there is little uncertainty around the average cancellation rate in a dataset of this size.

Overall, these analyses help us understand the relationships between the variables and draw conclusions about cancellations in the dataset. Correlation alone does not establish causation, so further analysis may be needed depending on the context and goals of the analysis.

Thank you!

Hotels Week 6 Data Dive

Reshu Gupta

2024-02-19
