```{r}
library(dplyr)
library(ggplot2)
library(summarytools)
```
## Load Dataset
```{r}
hotel_data <- read.csv("G:/semester_1/4_Statistics_R/syllabus/lab/week3/hotel_bookings.csv")
# Display only the first 10 rows of the dataset
head(hotel_data, 10)
```
The key objectives of this data dive are as follows:

- Unclear Elements: Identify data elements that were unclear until we consulted the documentation.
- Data Encoding: Explore encoding choices and their impact on analysis.
- Visual Insight: Create visualizations that highlight the documentation’s significance.
- Risk Mitigation: Address risks arising from unclear data and propose solutions.
In my dataset, a few columns and values were unclear until I read the documentation.
The values in the ‘meal’ column, such as ‘BB,’ ‘FB,’ and ‘HB,’ are unclear without the documentation. After reading it, I found that ‘BB’ stands for ‘Bed & Breakfast,’ ‘FB’ stands for ‘Full Board,’ and ‘HB’ stands for ‘Half Board.’ These abbreviations were likely used for efficiency and consistency in data entry.
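As a minimal sketch of how the documented codes could be applied, the chunk below maps the three abbreviations to their full labels with a named vector. It assumes the column in question is named `meal`; `meal_labels` and `meal_example` are hypothetical names, and `meal_example` is made-up sample data rather than values read from the file.

```{r}
# Lookup table built only from the three codes described above
meal_labels <- c(BB = "Bed & Breakfast", FB = "Full Board", HB = "Half Board")

# Hypothetical sample values standing in for hotel_data$meal
meal_example <- c("BB", "HB", "FB", "BB", "SC")

# Indexing the named vector by code returns the label;
# codes missing from the lookup (here "SC") come back as NA
decoded <- unname(meal_labels[meal_example])
decoded
```

Any code that decodes to `NA` is one the lookup does not cover, which makes it a useful prompt to go back to the documentation.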
Even after reading the documentation, a couple of elements remain unclear. The main one is the ‘agent’ column:
The documentation describes ‘agent’ as an identifier, but it doesn’t explain what the specific numerical values in this column represent.
It’s also unclear why certain bookings were canceled despite having a “No Deposit” policy. The documentation might not explain the reasons behind these discrepancies.
The chart in this section shows the number of cancellations by deposit type. However, it does not make clear why there are cancellations for “No Deposit” bookings.
Significance: This visualization helps identify bookings that didn’t adhere to the deposit policy, potentially indicating a problem with the booking system or policy enforcement.
```{r}
# Create a summary table of cancellations by deposit type
cancellation_summary <- hotel_data %>%
  group_by(deposit_type, is_canceled) %>%
  summarise(count = n(), .groups = "drop") %>%
  filter(is_canceled == 1)
```
```{r}
# Create a bar chart for better understanding
ggplot(cancellation_summary, aes(x = deposit_type, y = count, fill = deposit_type)) +
  geom_bar(stat = "identity") +
  labs(title = "Cancellations by Deposit Type", x = "Deposit Type", y = "Count of Cancellations") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1")
```
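Raw counts do not show how many bookings each deposit type had in total, so a cancellation rate may be easier to interpret alongside the chart. The chunk below is a sketch on a small made-up data frame, `toy_bookings`, which is hypothetical and only reuses the column names `deposit_type` and `is_canceled`. It assumes `is_canceled` is coded as 0/1, as the filter on `is_canceled == 1` suggests.

```{r}
library(dplyr)

# Hypothetical toy data with the same column names as hotel_data
toy_bookings <- data.frame(
  deposit_type = c("No Deposit", "No Deposit", "No Deposit", "No Deposit",
                   "Non Refund", "Non Refund"),
  is_canceled  = c(1, 0, 0, 0, 1, 1)
)

# For each deposit type: total bookings, cancellations, and share canceled
cancellation_rates <- toy_bookings %>%
  group_by(deposit_type) %>%
  summarise(
    bookings      = n(),
    cancellations = sum(is_canceled),
    cancel_rate   = mean(is_canceled),
    .groups = "drop"
  )
cancellation_rates
```

Swapping `toy_bookings` for `hotel_data` should give the same table for the real bookings, provided the column names match.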
One significant risk is misinterpreting the ‘agent’ column.
Since the documentation doesn’t explain the meaning of the numerical values, there’s a risk of drawing incorrect conclusions or making flawed decisions based on this data.
To reduce negative consequences, I could consider reaching out to the data source or conducting further research to understand the ‘agent’ column’s significance.
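Until the meaning of the values is confirmed, one cautious option is to treat ‘agent’ as a categorical label rather than a number, so that it is never averaged or summed by accident. The chunk below is only a sketch: `agent_example` holds made-up IDs standing in for `hotel_data$agent`, and `agent_factor` and `agent_counts` are hypothetical names.

```{r}
# Hypothetical agent IDs standing in for hotel_data$agent
agent_example <- c(9, 240, 9, 14, 240, 9)

# Store the IDs as a factor so they behave as labels, not quantities
agent_factor <- factor(agent_example)

# Count how often each ID appears, most frequent first
agent_counts <- sort(table(agent_factor), decreasing = TRUE)
agent_counts
```

A frequency table like this describes how the IDs are distributed without assuming anything about what each ID means.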
Another risk could be financial losses due to incorrect cancellations.