The dataset used for this assignment is a collection of hotel booking records. Each record includes details such as the hotel type, booking dates, customer demographics, and booking status. The dataset is useful for analyzing trends in hotel bookings, cancellations, and customer behavior.

The dataset was obtained from Kaggle: https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand

R Markdown

# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the dataset
hotel_data <- read_csv("https://raw.githubusercontent.com/Jomifum/Assignment9TidyVerse/main/hotel_bookings.csv")
## Rows: 119390 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date  (1): reservation_status_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(head(hotel_data))
## # A tibble: 6 × 32
##   hotel        is_canceled lead_time arrival_date_year arrival_date_month
##   <chr>              <dbl>     <dbl>             <dbl> <chr>             
## 1 Resort Hotel           0       342              2015 July              
## 2 Resort Hotel           0       737              2015 July              
## 3 Resort Hotel           0         7              2015 July              
## 4 Resort Hotel           0        13              2015 July              
## 5 Resort Hotel           0        14              2015 July              
## 6 Resort Hotel           0        14              2015 July              
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## #   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## #   meal <chr>, country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, …

Examples of packages and plot:

The examples in this section use key tidyverse functions to manipulate and analyze the hotel bookings data: read_csv() loads the data, group_by() and summarize() aggregate it, mutate() adds new variables, arrange() sorts rows, filter() subsets them, and select() picks specific columns. For visualization, ggplot() builds the plots. Together, these tools streamline the analysis and make it efficient to explore booking trends, cancellations, and customer behavior.

# Example 1: Calculate the Cancellation Rate by Country
cancellation_by_country <- hotel_data %>%
  group_by(country) %>%
  summarize(total_bookings = n(),
            cancellations = sum(is_canceled),
            cancellation_rate = (cancellations / total_bookings) * 100) %>%
  arrange(desc(cancellation_rate))

# Display the top 10 countries by cancellation rate
# (slice_max() keeps ties by default, so more than 10 rows can be returned)
cancellation_by_country %>%
  slice_max(order_by = cancellation_rate, n = 10) %>%
  print()
## # A tibble: 12 × 4
##    country total_bookings cancellations cancellation_rate
##    <chr>            <int>         <dbl>             <dbl>
##  1 BEN                  3             3               100
##  2 FJI                  1             1               100
##  3 GGY                  3             3               100
##  4 GLP                  2             2               100
##  5 HND                  1             1               100
##  6 IMN                  2             2               100
##  7 JEY                  8             8               100
##  8 KHM                  2             2               100
##  9 MYT                  2             2               100
## 10 NIC                  1             1               100
## 11 UMI                  1             1               100
## 12 VGB                  1             1               100
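
Every country in the output above has only a handful of bookings, so a 100% rate says little about actual cancellation behavior. The following is a minimal sketch of one way to handle this, assuming a minimum-bookings threshold is acceptable for the analysis. The `toy_bookings` data and the `min_bookings` value are made up for illustration and are not taken from the hotel dataset.

```r
library(dplyr)
library(tibble)

# Toy data (hypothetical, for illustration only)
toy_bookings <- tibble(
  country     = c("AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "CCC"),
  is_canceled = c(1, 0, 0, 1, 1, 1, 0, 1)
)

# Hypothetical threshold: ignore countries with fewer bookings than this
min_bookings <- 3

robust_rates <- toy_bookings %>%
  group_by(country) %>%
  summarize(total_bookings = n(),
            cancellations = sum(is_canceled),
            cancellation_rate = cancellations / total_bookings * 100) %>%
  # Drop countries whose rate rests on too few bookings
  filter(total_bookings >= min_bookings) %>%
  # with_ties = FALSE returns exactly n rows even when rates are tied
  slice_max(order_by = cancellation_rate, n = 1, with_ties = FALSE)

print(robust_rates)
```

In this toy data, country CCC has a 100% rate from a single booking and is dropped by the threshold, so the top row comes from a country with enough bookings to be meaningful. A suitable threshold for the real dataset is a judgment call.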
# Example 2: Plot the average lead time by market segment
hotel_data %>%
  group_by(market_segment) %>%
  summarize(avg_lead_time = mean(lead_time, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(market_segment, avg_lead_time), y = avg_lead_time)) +
  geom_col(fill = "skyblue") +
  coord_flip() +
  labs(title = "Average Lead Time by Market Segment",
       x = "Market Segment",
       y = "Average Lead Time (days)") +
  theme_minimal()

# Example 3: Count the number of bookings by reservation status and plot the counts
# (the plot is not assigned to a variable, so it is rendered directly)
hotel_data %>%
  count(reservation_status) %>%
  ggplot(aes(x = reservation_status, y = n, fill = reservation_status)) +
  geom_col() +
  labs(title = "Bookings by Reservation Status",
       x = "Reservation Status",
       y = "Number of Bookings") +
  theme_minimal()
# Example 4: Using select to keep specific columns
selected_data <- hotel_data %>%
  select(hotel, country, market_segment, is_canceled)

# Display the first few rows of the selected data
print(head(selected_data))
## # A tibble: 6 × 4
##   hotel        country market_segment is_canceled
##   <chr>        <chr>   <chr>                <dbl>
## 1 Resort Hotel PRT     Direct                   0
## 2 Resort Hotel PRT     Direct                   0
## 3 Resort Hotel GBR     Direct                   0
## 4 Resort Hotel GBR     Corporate                0
## 5 Resort Hotel GBR     Online TA                0
## 6 Resort Hotel GBR     Online TA                0
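# Example 5: Using mutate to create a new column for cancellation rate per country
# Note: the result stays grouped by country; call ungroup() afterwards if later steps should not be grouped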
hotel_data <- hotel_data %>%
  group_by(country) %>%
  mutate(total_bookings = n(),
         cancellations = sum(is_canceled),
         cancellation_rate = (cancellations / total_bookings) * 100)

# Display the first few rows of the data with the new column
print(head(hotel_data))
## # A tibble: 6 × 35
## # Groups:   country [2]
##   hotel        is_canceled lead_time arrival_date_year arrival_date_month
##   <chr>              <dbl>     <dbl>             <dbl> <chr>             
## 1 Resort Hotel           0       342              2015 July              
## 2 Resort Hotel           0       737              2015 July              
## 3 Resort Hotel           0         7              2015 July              
## 4 Resort Hotel           0        13              2015 July              
## 5 Resort Hotel           0        14              2015 July              
## 6 Resort Hotel           0        14              2015 July              
## # ℹ 30 more variables: arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## #   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## #   meal <chr>, country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, …
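# Example 6: Using filter to keep bookings from countries whose cancellation rate is higher than 50%
# (the rate is per country, so individual bookings that were not canceled can still appear)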
high_cancellation_rate <- hotel_data %>%
  filter(cancellation_rate > 50)

# Display the first few rows of the filtered data
print(head(high_cancellation_rate))
## # A tibble: 6 × 35
## # Groups:   country [1]
##   hotel        is_canceled lead_time arrival_date_year arrival_date_month
##   <chr>              <dbl>     <dbl>             <dbl> <chr>             
## 1 Resort Hotel           0       342              2015 July              
## 2 Resort Hotel           0       737              2015 July              
## 3 Resort Hotel           0         0              2015 July              
## 4 Resort Hotel           0         9              2015 July              
## 5 Resort Hotel           1        85              2015 July              
## 6 Resort Hotel           1        75              2015 July              
## # ℹ 30 more variables: arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## #   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## #   meal <chr>, country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, …
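
The outputs of Examples 5 and 6 still carry a `# Groups: country` header because a grouped mutate() leaves the data grouped. The sketch below shows two ways to add a per-country column without leaving groups behind. It assumes dplyr 1.1.0 or later for the `.by` argument, and the `toy_bookings` data is made up for illustration. It uses `mean(is_canceled) * 100`, which is equivalent to dividing the cancellation count by the booking count.

```r
library(dplyr)
library(tibble)

# Toy data (hypothetical, for illustration only)
toy_bookings <- tibble(
  country     = c("AAA", "AAA", "BBB", "BBB", "BBB"),
  is_canceled = c(1, 0, 1, 1, 0)
)

# Option 1: group, mutate, then drop the grouping explicitly
with_rate <- toy_bookings %>%
  group_by(country) %>%
  mutate(cancellation_rate = mean(is_canceled) * 100) %>%
  ungroup()

# Option 2: per-operation grouping with .by (dplyr >= 1.1.0);
# the result is never grouped, so no ungroup() is needed
with_rate_by <- toy_bookings %>%
  mutate(cancellation_rate = mean(is_canceled) * 100, .by = country)

# Keep bookings from countries whose rate exceeds 50%
high_rate <- with_rate %>%
  filter(cancellation_rate > 50)

print(high_rate)
```

Both options give the same column values. Which one reads better is a matter of taste, though `.by` avoids accidentally carrying groups into later steps.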

Conclusion:

Analyzing the hotel_data dataset demonstrated that the tidyverse offers a powerful and cohesive set of tools for data manipulation and visualization. A handful of its functions were enough to process and explore the dataset efficiently and to extract insights into booking trends, cancellation rates, and customer behavior. ggplot2 produced informative visualizations that made the data easier to understand. Overall, the tidyverse streamlined the data analysis workflow, making it more intuitive and effective.