Data Description

Load Libraries

```{r}
library(dplyr)
library(ggplot2)
library(summarytools)
```


## Load Dataset 
```{r}
hotel_data <- read.csv("G:/semester_1/4_Statistics_R/syllabus/lab/week3/hotel_bookings.csv")

# Display only the first 10 rows of the dataset
head(hotel_data, 10)
```

Introduction

The key objectives of this data dive are as follows:

Task 1: Unclear Columns or Values

- Identify unclear columns/values and explain their significance.

In my dataset, a few columns and values were unclear until I read the documentation:

1. Column Name - meal

  • The values in this column, such as ‘BB,’ ‘FB,’ and ‘HB,’ are unclear without the documentation. After reading the documentation, I found that ‘BB’ stands for ‘Bed & Breakfast,’ ‘FB’ stands for ‘Full Board,’ and ‘HB’ stands for ‘Half Board.’ These abbreviations were likely used for efficiency and consistency in data entry.

2. Column Name - reservation_status

  • The values ‘Check-Out,’ ‘Canceled,’ and ‘No-Show’ look clear, but without the documentation, it’s unclear what exactly constitutes a ‘No-Show’ reservation. The documentation clarifies that a ‘No-Show’ reservation is one where the guest didn’t check in on the expected arrival date.

3. Column Name - market_segment

  • The values like ‘Direct,’ ‘Corporate,’ and ‘Online TA’ might not be immediately understandable without the documentation. The documentation explains that these values represent different market segments or booking channels.

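4. Column Name - adr

  • The column name “adr” (ADR) is not self-explanatory. The documentation is needed to learn that it stands for Average Daily Rate, calculated by dividing the sum of all lodging transactions by the total number of staying nights.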
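Once the documentation has been read, the codes can be translated into readable labels so that later tables and charts explain themselves. The following is a minimal sketch, not the method used in this analysis: `toy` is a made-up stand-in for `hotel_data`, and the lookup covers only the three meal codes explained above. Any other code is left unchanged so that it stays visible instead of being silently dropped.

```r
# Toy stand-in for hotel_data; these rows are made up for illustration.
toy <- data.frame(
  meal = c("BB", "HB", "FB", "BB", "SC"),
  stringsAsFactors = FALSE
)

# Lookup for the three codes explained in the text above.
meal_labels <- c(BB = "Bed & Breakfast", FB = "Full Board", HB = "Half Board")

# Indexing a named vector by code returns NA for codes not in the lookup.
decoded <- unname(meal_labels[toy$meal])

# Fall back to the raw code when no label is known.
toy$meal_label <- ifelse(is.na(decoded), toy$meal, decoded)

# Frequency of each label, including any code that was not decoded.
table(toy$meal_label)
```

The same pattern could be applied to `hotel_data$meal` after loading the real file. Checking `unique(hotel_data$meal)` first shows which codes still need a label.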

Why They Chose This Encoding

  • It is important to refer to the documentation to understand why certain columns or values are encoded the way they are. For instance, “BB” in the “meal” column stands for “Bed & Breakfast,” a compact code that is quick to enter and easy to keep consistent.

Consequences of Not Reading Documentation

  • Without the documentation, the chance of misinterpreting unclear columns or values increases, which can lead to incorrect analysis and conclusions.

Task 2: Unclear Element Even After Documentation

- Identify any elements that remain unclear even after reading the documentation.

  • After analyzing the dataset, I found a couple of elements that remain unclear. The main one is:

    • Column Name - agent

  • The documentation mentions ‘agent’ as an identifier, but it doesn’t explain the meaning behind the specific numerical values in this column. It’s unclear what these values represent.

  • It’s also unclear why certain bookings were canceled despite having a “No Deposit” policy. The documentation might not explain the reasons behind these cancellations.
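Because ‘agent’ is an identifier, its numbers are labels rather than quantities, so counting bookings per agent is more meaningful than averaging the IDs. The sketch below illustrates this on made-up rows. It assumes the column is read as text and that missing agents appear as the literal string "NULL"; that assumption should be checked against the real file before use.

```r
# Toy stand-in for the agent column; the IDs are made up for illustration.
toy <- data.frame(
  agent = c("9", "240", "9", "NULL", "14", "9"),
  stringsAsFactors = FALSE
)

# Assumption: the placeholder "NULL" marks a booking with no agent recorded.
agent_id <- toy$agent
agent_id[agent_id == "NULL"] <- NA

# Count bookings per agent ID, keeping missing values as their own group.
agent_counts <- sort(table(agent_id, useNA = "ifany"), decreasing = TRUE)
agent_counts
```

A count table like this shows which IDs dominate the bookings. It still does not say who the agents are, which is the part the documentation leaves open.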

Task 3: Building a Visualization

- Visualize the issue of unclear cancellations

  • The chart shows the number of cancellations by deposit type. However, it does not explain why there are cancellations for “No Deposit” bookings.

  • Significance: This visualization helps identify bookings that didn’t adhere to the deposit policy, potentially indicating a problem with the booking system or policy enforcement.

```{r}
# Create a summary table of cancellations by deposit type
cancellation_summary <- hotel_data %>%
  group_by(deposit_type, is_canceled) %>%
  summarise(count = n(), .groups = "drop") %>%  # drop grouping after counting
  filter(is_canceled == 1)                      # keep canceled bookings only
```

```{r}
# Create a bar chart for better understanding
ggplot(cancellation_summary, aes(x = deposit_type, y = count, fill = deposit_type)) +
  geom_bar(stat = "identity") +
  labs(title = "Cancellations by Deposit Type",
       x = "Deposit Type",
       y = "Count of Cancellations") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1")
```
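Raw counts can be hard to compare when one deposit type has far more bookings than the others, so a cancellation rate per deposit type may be a useful companion to the chart. The sketch below uses base R on made-up rows. `toy_bookings` and its values are illustrative assumptions, and the only thing taken from the analysis is the pair of column names `deposit_type` and `is_canceled`.

```r
# Toy stand-in for hotel_data; these rows are made up for illustration.
toy_bookings <- data.frame(
  deposit_type = c("No Deposit", "No Deposit", "No Deposit", "No Deposit",
                   "Non Refund", "Non Refund", "Refundable"),
  is_canceled  = c(0, 1, 0, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# The mean of a 0/1 flag is the share of canceled bookings in each group.
cancel_rate <- aggregate(is_canceled ~ deposit_type, data = toy_bookings, FUN = mean)
names(cancel_rate)[2] <- "cancel_rate"

# Add the number of bookings per group so each rate can be read in context.
bookings_per_type <- table(toy_bookings$deposit_type)
cancel_rate$bookings <- as.integer(
  bookings_per_type[as.character(cancel_rate$deposit_type)]
)

cancel_rate
```

Running the same steps on `hotel_data` would show whether “No Deposit” bookings are canceled unusually often or simply make up most of the bookings.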

Task 4: Identifying Significant Risks

- Identify any significant risks associated with unclear data and propose mitigation strategies

  • One significant risk is misinterpreting the ‘agent’ column.

    • Since the documentation doesn’t explain the meaning of the numerical values, there’s a risk of drawing incorrect conclusions or making flawed decisions based on this data.

    • To reduce negative consequences, I could consider reaching out to the data source or conducting further research to understand the ‘agent’ column’s significance.

  • Another risk could be financial losses due to incorrect cancellations.

    • To mitigate this risk, the hotel management could implement stricter policy enforcement mechanisms or conduct periodic audits to identify and rectify discrepancies.
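On the analysis side, a small automated check can play a role similar to the periodic audits mentioned above: it flags values that the documentation does not describe before they reach a chart or a conclusion. The sketch below is one possible form of such a check. The helper `find_undocumented_values` is a hypothetical name introduced here, the documented set is limited to the three reservation statuses named earlier in this report, and the input vector is made up for illustration.

```r
# Hypothetical helper: return observed values that are not in the documented set.
find_undocumented_values <- function(values, documented) {
  observed <- unique(values)
  setdiff(observed[!is.na(observed)], documented)
}

# The three reservation statuses named earlier in this report.
documented_status <- c("Check-Out", "Canceled", "No-Show")

# Made-up input containing one misspelled status.
toy_status <- c("Check-Out", "Canceled", "No-Show", "Check-Out", "cancelled")

unexpected <- find_undocumented_values(toy_status, documented_status)
unexpected
```

An empty result means every observed value is documented. Anything else is a prompt to reread the documentation or contact the data source.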

Conclusion