Challenge 7

Introduction

This challenge focuses on publication-ready plots with properties such as axis labels and facets.

We load the required libraries.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(readr)
library(here)

## here() starts at C:/Users/SHAURYA/Desktop/Studies/Winter 2024 601/Challenges/challenge 7

library(dplyr)

Reading the data

We load the dataset from a CSV file.

hotel_csv <- read_csv("hotel_bookings.csv", show_col_types = FALSE)

We can have a look at the data.

head(hotel_csv)

## # A tibble: 6 × 32
##   hotel        is_canceled lead_time arrival_date_year arrival_date_month
##   <chr>              <dbl>     <dbl>             <dbl> <chr>             
## 1 Resort Hotel           0       342              2015 July              
## 2 Resort Hotel           0       737              2015 July              
## 3 Resort Hotel           0         7              2015 July              
## 4 Resort Hotel           0        13              2015 July              
## 5 Resort Hotel           0        14              2015 July              
## 6 Resort Hotel           0        14              2015 July              
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## #   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## #   meal <chr>, country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, …

The columns are described below:

hotel: The type of hotel, for example Resort Hotel.
is_canceled: Binary value that indicates whether the booking was canceled.
lead_time: Number of days between the booking date and the arrival date.
arrival_date_year: The year of arrival.
arrival_date_month: The month of arrival.
arrival_date_week_number: The week number of the year for the arrival date.
arrival_date_day_of_month: The day of the month for the arrival date.
stays_in_weekend_nights: Number of weekend nights booked to stay.
stays_in_week_nights: Number of weekday nights booked to stay.
adults: Number of adults included in the reservation.
children: Number of children included in the reservation.
babies: Number of babies included in the reservation.
meal: Type of meal booked.
country: Country of origin.
market_segment: Market segment designation like Online Travel Agencies.
distribution_channel: Booking distribution channel.
is_repeated_guest: Binary value indicating if the guest is a repeated guest or not.
previous_cancellations: Number of previous bookings that were canceled by the guest.
previous_bookings_not_canceled: Number of previous bookings not canceled by the guest.
reserved_room_type: Code of room type reserved.
assigned_room_type: Code of room type assigned.
booking_changes: Number of changes made to the booking.
deposit_type: Type of deposit made by the guest.
agent: ID of the travel agency making the booking.
company: ID of the company making the booking.
days_in_waiting_list: Number of days the booking was in the waiting list before it was confirmed.
customer_type: Type of booking.
adr: Average Daily Rate (the sum of transactional value divided by the number of nights).
required_car_parking_spaces: Number of car parking spaces required by the guest.
total_of_special_requests: Number of special requests made by the guest.
reservation_status: Last status of the reservation, such as Check-Out or Canceled.
reservation_status_date: Date when the last status was set.

We get the dimensions of the data.

dim(hotel_csv)

## [1] 119390     32

We see that there are 119,390 hotel bookings, each described by 32 columns.

We can have a look at the columns.

colnames(hotel_csv)

##  [1] "hotel"                          "is_canceled"                   
##  [3] "lead_time"                      "arrival_date_year"             
##  [5] "arrival_date_month"             "arrival_date_week_number"      
##  [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
##  [9] "stays_in_week_nights"           "adults"                        
## [11] "children"                       "babies"                        
## [13] "meal"                           "country"                       
## [15] "market_segment"                 "distribution_channel"          
## [17] "is_repeated_guest"              "previous_cancellations"        
## [19] "previous_bookings_not_canceled" "reserved_room_type"            
## [21] "assigned_room_type"             "booking_changes"               
## [23] "deposit_type"                   "agent"                         
## [25] "company"                        "days_in_waiting_list"          
## [27] "customer_type"                  "adr"                           
## [29] "required_car_parking_spaces"    "total_of_special_requests"     
## [31] "reservation_status"             "reservation_status_date"

We can also look at a single column of the data. For instance, we can pick the country.

select(hotel_csv, "country")

## # A tibble: 119,390 × 1
##    country
##    <chr>  
##  1 PRT    
##  2 PRT    
##  3 GBR    
##  4 GBR    
##  5 GBR    
##  6 GBR    
##  7 PRT    
##  8 PRT    
##  9 PRT    
## 10 PRT    
## # ℹ 119,380 more rows

We can find the distribution of values in a specific column, for instance the distribution channel.

dc <- select(hotel_csv, distribution_channel)
head(dc)

## # A tibble: 6 × 1
##   distribution_channel
##   <chr>               
## 1 Direct              
## 2 Direct              
## 3 Direct              
## 4 Corporate           
## 5 TA/TO               
## 6 TA/TO

table(dc)

## distribution_channel
## Corporate    Direct       GDS     TA/TO Undefined 
##      6677     14645       193     97870         5

We see that the frequencies vary a lot: TA/TO has by far the highest occurrence among all channels.

We can use a proportion table to get a better insight.

prop.table(table(dc))

## distribution_channel
##    Corporate       Direct          GDS        TA/TO    Undefined 
## 5.592596e-02 1.226652e-01 1.616551e-03 8.197504e-01 4.187955e-05
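
The scientific notation above is hard to read at a glance. As a minimal sketch, the same shares can be expressed as rounded percentages. The counts vector below is typed in by hand from the table(dc) output rather than computed from hotel_csv.

```r
# Counts copied by hand from the table(dc) output above.
counts <- c(Corporate = 6677, Direct = 14645, GDS = 193,
            "TA/TO" = 97870, Undefined = 5)

# prop.table() divides each count by the total; scale to percent and round.
shares <- round(100 * prop.table(counts), 2)
print(shares)
```

On the full data, round(100 * prop.table(table(dc)), 2) should give the same view, with TA/TO accounting for roughly 82% of all bookings.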

Tidying the data

First we check for any missing values.

colSums(is.na(hotel_csv))

##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

Only the children column has missing values (4 of them). We can fix this by replacing the missing children values with 0.

hotel_csv$children[is.na(hotel_csv$children)] <- 0

We can check again for missing values to see if it worked.

colSums(is.na(hotel_csv))

##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              0                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

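The above operation eliminated all NA values. Note that agent and company store missing entries as the string "NULL" (visible in the str() output further down), which is.na() does not count.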
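
The agent and company columns hold the literal string "NULL" for some bookings (see the str() output further down), and is.na() does not flag those. A minimal sketch of how such placeholders could be counted and converted, using a small toy data frame whose values are illustrative only:

```r
# Toy stand-in for two columns of hotel_csv; the values are illustrative only.
toy <- data.frame(
  agent   = c("NULL", "304", "240"),
  company = c("NULL", "NULL", "110"),
  stringsAsFactors = FALSE
)

# Count the literal "NULL" strings per column.
null_counts <- colSums(toy == "NULL")

# Convert them to real missing values so that is.na() can see them.
toy[toy == "NULL"] <- NA
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Alternatively, read_csv() has an na argument, so passing na = c("", "NA", "NULL") at load time would treat these strings as missing from the start. Whether that is desirable depends on how agent and company are used later.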

We can conduct a sanity check to ensure that the values look plausible.

summary(hotel_csv)

##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-02-01     
##  Median :2016-08-07     
##  Mean   :2016-07-30     
##  3rd Qu.:2017-02-08     
##  Max.   :2017-09-14

We see that most columns are in reasonable ranges, so the tidying was successful. A few extremes stand out and may deserve a closer look: adr ranges from -6.38 to 5400, and adults has a maximum of 55.
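
The summary() output above contains a few extremes, for example adr between -6.38 and 5400 and up to 55 adults in one booking. To inspect such rows, we could filter on a plausible range. The sketch below uses a small made-up data frame, and the cut-offs (max_adr, max_adults) are assumptions chosen for illustration, not rules that come from the dataset.

```r
# Toy rows; only the extreme values are taken from the summary() output above.
toy <- data.frame(
  adr    = c(-6.38, 94.58, 5400, 126),
  adults = c(2, 55, 1, 2)
)

# Hypothetical plausibility limits, chosen only for this example.
max_adr    <- 1000
max_adults <- 10

# Keep the rows that violate at least one limit.
suspicious <- subset(toy, adr < 0 | adr > max_adr | adults > max_adults)
print(suspicious)
```

On the real data, the same subset() call on hotel_csv would list the bookings to review before deciding whether to keep, correct, or drop them.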

Mutating the data

We see that there are separate columns for year, month, and day. We can merge them into a single date column.

hotel_csv <- hotel_csv %>%
  mutate(arrival_date = as.Date(paste(arrival_date_year, arrival_date_month, arrival_date_day_of_month, sep = "-"), format = "%Y-%B-%d"))
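
One caveat: the "%B" format matches full month names in the language of the current R locale, so the parse above could return NA on a non-English system. A minimal locale-independent sketch, assuming the months are spelled as in R's built-in month.name vector, converts the name to a number first. The three vectors below are small stand-ins for the arrival_date_* columns.

```r
# Stand-ins for arrival_date_year, arrival_date_month, arrival_date_day_of_month.
year  <- c(2015, 2016)
month <- c("July", "December")
day   <- c(1, 25)

# month.name is a built-in vector of English month names, independent of locale.
month_number <- match(month, month.name)

# Build ISO 8601 strings (YYYY-MM-DD), which as.Date() parses by default.
arrival_date <- as.Date(sprintf("%04d-%02d-%02d", year, month_number, day))
print(arrival_date)
```

Any month name that is not in month.name yields NA from match(), which makes spelling problems easy to spot.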

We can remove the columns that we merged.

hotel_csv <- hotel_csv %>%
  select(-c(arrival_date_year, arrival_date_month, arrival_date_day_of_month))

We can also merge adults, children, and babies into a single guests column.

hotel_csv <- hotel_csv %>%
  mutate(guests = adults + children + babies)

We can now drop the original columns.

hotel_csv <- hotel_csv %>%
  select(-c(adults, children, babies))

We see that there are separate columns for weeknights and weekend nights. We can merge these to get total nights.

hotel_csv <- hotel_csv %>%
  mutate(total_nights = stays_in_weekend_nights + stays_in_week_nights)

We then remove the original columns.

hotel_csv <- hotel_csv %>%
  select(-c(stays_in_weekend_nights, stays_in_week_nights))
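
The create-then-drop pattern used in the last few steps can be shortened. As a sketch, assuming dplyr 1.0.0 or later, mutate() accepts .keep = "unused", which drops the columns consumed by the new expressions. The toy data frame below is made up for illustration.

```r
library(dplyr)

# Toy stand-in for hotel_csv with only the columns needed here.
toy <- data.frame(
  hotel = c("Resort Hotel", "City Hotel"),
  adults = c(2, 1), children = c(0, 2), babies = c(0, 1),
  stays_in_weekend_nights = c(0, 2), stays_in_week_nights = c(1, 5)
)

# Create both derived columns and drop their inputs in a single step.
merged <- toy %>%
  mutate(
    guests = adults + children + babies,
    total_nights = stays_in_weekend_nights + stays_in_week_nights,
    .keep = "unused"
  )
print(merged)
```

The separate mutate() and select() calls used in this document are more explicit, so which form to prefer is a matter of taste.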

Once all is done, we can have a look at the data again to see if the changes were successfully made.

head(hotel_csv)

## # A tibble: 6 × 27
##   hotel        is_canceled lead_time arrival_date_week_number meal  country
##   <chr>              <dbl>     <dbl>                    <dbl> <chr> <chr>  
## 1 Resort Hotel           0       342                       27 BB    PRT    
## 2 Resort Hotel           0       737                       27 BB    PRT    
## 3 Resort Hotel           0         7                       27 BB    GBR    
## 4 Resort Hotel           0        13                       27 BB    GBR    
## 5 Resort Hotel           0        14                       27 BB    GBR    
## 6 Resort Hotel           0        14                       27 BB    GBR    
## # ℹ 21 more variables: market_segment <chr>, distribution_channel <chr>,
## #   is_repeated_guest <dbl>, previous_cancellations <dbl>,
## #   previous_bookings_not_canceled <dbl>, reserved_room_type <chr>,
## #   assigned_room_type <chr>, booking_changes <dbl>, deposit_type <chr>,
## #   agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## #   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
## #   total_of_special_requests <dbl>, reservation_status <chr>, …

We can conduct sanity checks on our mutations.

str(hotel_csv)

## tibble [119,390 × 27] (S3: tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
##  $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119390] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
##  $ arrival_date                  : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
##  $ guests                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
##  $ total_nights                  : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...

summary(hotel_csv)

##     hotel            is_canceled       lead_time   arrival_date_week_number
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   : 1.00           
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:16.00           
##  Mode  :character   Median :0.0000   Median : 69   Median :28.00           
##                     Mean   :0.3704   Mean   :104   Mean   :27.17           
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:38.00           
##                     Max.   :1.0000   Max.   :737   Max.   :53.00           
##      meal             country          market_segment     distribution_channel
##  Length:119390      Length:119390      Length:119390      Length:119390       
##  Class :character   Class :character   Class :character   Class :character    
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character    
##                                                                               
##                                                                               
##                                                                               
##  is_repeated_guest previous_cancellations previous_bookings_not_canceled
##  Min.   :0.00000   Min.   : 0.00000       Min.   : 0.0000               
##  1st Qu.:0.00000   1st Qu.: 0.00000       1st Qu.: 0.0000               
##  Median :0.00000   Median : 0.00000       Median : 0.0000               
##  Mean   :0.03191   Mean   : 0.08712       Mean   : 0.1371               
##  3rd Qu.:0.00000   3rd Qu.: 0.00000       3rd Qu.: 0.0000               
##  Max.   :1.00000   Max.   :26.00000       Max.   :72.0000               
##  reserved_room_type assigned_room_type booking_changes   deposit_type      
##  Length:119390      Length:119390      Min.   : 0.0000   Length:119390     
##  Class :character   Class :character   1st Qu.: 0.0000   Class :character  
##  Mode  :character   Mode  :character   Median : 0.0000   Mode  :character  
##                                        Mean   : 0.2211                     
##                                        3rd Qu.: 0.0000                     
##                                        Max.   :21.0000                     
##     agent             company          days_in_waiting_list customer_type     
##  Length:119390      Length:119390      Min.   :  0.000      Length:119390     
##  Class :character   Class :character   1st Qu.:  0.000      Class :character  
##  Mode  :character   Mode  :character   Median :  0.000      Mode  :character  
##                                        Mean   :  2.321                        
##                                        3rd Qu.:  0.000                        
##                                        Max.   :391.000                        
##       adr          required_car_parking_spaces total_of_special_requests
##  Min.   :  -6.38   Min.   :0.00000             Min.   :0.0000           
##  1st Qu.:  69.29   1st Qu.:0.00000             1st Qu.:0.0000           
##  Median :  94.58   Median :0.00000             Median :0.0000           
##  Mean   : 101.83   Mean   :0.06252             Mean   :0.5714           
##  3rd Qu.: 126.00   3rd Qu.:0.00000             3rd Qu.:1.0000           
##  Max.   :5400.00   Max.   :8.00000             Max.   :5.0000           
##  reservation_status reservation_status_date  arrival_date       
##  Length:119390      Min.   :2014-10-17      Min.   :2015-07-01  
##  Class :character   1st Qu.:2016-02-01      1st Qu.:2016-03-13  
##  Mode  :character   Median :2016-08-07      Median :2016-09-06  
##                     Mean   :2016-07-30      Mean   :2016-08-28  
##                     3rd Qu.:2017-02-08      3rd Qu.:2017-03-18  
##                     Max.   :2017-09-14      Max.   :2017-08-31  
##      guests        total_nights   
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 2.000   1st Qu.: 2.000  
##  Median : 2.000   Median : 3.000  
##  Mean   : 1.968   Mean   : 3.428  
##  3rd Qu.: 2.000   3rd Qu.: 4.000  
##  Max.   :55.000   Max.   :69.000

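We see that all three mutations yield appropriate results: arrival_date is a Date ranging from 2015-07-01 to 2017-08-31, guests ranges from 0 to 55, and total_nights ranges from 0 to 69. The table now has 27 columns instead of 32. This shows that the mutations were successful.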

Plots

We can compare the total nights spent across confirmed bookings (those with a reservation_status of Check-Out) in both types of hotels: city and resort.

confirmed_bookings <- hotel_csv %>%
  filter(reservation_status == "Check-Out")

nights <- ggplot(confirmed_bookings, aes(x = hotel, y = total_nights)) +
  geom_boxplot(color = "skyblue", outlier.shape = 16, outlier.size = 3) +
  labs(title = "Distribution of total nights across confirmed bookings",
       subtitle = "Comparison with types of hotel",
       x = "Hotel",
       y = "Total Nights") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16),
        plot.subtitle = element_text(hjust = 0.5, size = 14),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10))

nights

We use these variables because hotel type is categorical and total nights is numeric, and a box plot does a great job of comparing a numeric distribution across categories.

We see that, for the most part, the number of nights spent is similar in both hotel types. We also see that the minimum total nights spent in resorts is higher, which makes sense because resort hotels tend to be close to tourist attractions that attract longer, more expensive stays.

We also see that certain resort bookings last well over a month, which presents an opportunity to investigate such bookings further.
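
Since this challenge also mentions facets, the same comparison could be drawn as one histogram panel per hotel type. This is only a sketch on a small made-up data frame (toy_bookings is hypothetical); on the real data, confirmed_bookings would take its place.

```r
library(ggplot2)

# Made-up bookings, five per hotel type, for illustration only.
toy_bookings <- data.frame(
  hotel = rep(c("City Hotel", "Resort Hotel"), each = 5),
  total_nights = c(1, 2, 2, 3, 4, 1, 3, 7, 7, 14)
)

# facet_wrap(~ hotel) draws one panel per hotel type on shared axes.
nights_facets <- ggplot(toy_bookings, aes(x = total_nights)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "white") +
  facet_wrap(~ hotel) +
  labs(title = "Total nights per confirmed booking",
       subtitle = "One panel per hotel type",
       x = "Total Nights",
       y = "Number of Bookings") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16),
        plot.subtitle = element_text(hjust = 0.5, size = 14))

nights_facets
```

Compared with the box plot, the faceted histogram shows the shape of each distribution, at the cost of making medians and quartiles harder to read.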

The second graph compares bookings that were assigned the reserved room type with bookings that were assigned a different one.

room_type <- hotel_csv %>%
  group_by(final_room = ifelse(reserved_room_type == assigned_room_type, "Same Room Type", "Different Room Type")) %>%
  summarize(count = n())

plot2 <- ggplot(room_type, aes(x = "", y = count, fill = final_room)) +
  geom_bar(stat = "identity", color = "skyblue") +
  labs(title = "Comparison of Reserved and Assigned Room Types",
       x = NULL,
       y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10))

plot2

A stacked bar chart is used because it gives a clear visual representation of the gap between the two groups. From the chart above, we see that most bookings, about 100,000, were assigned the room type they reserved. Fewer than 25,000 of the 119,390 bookings were assigned a different room type.
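
Counts read off a chart are approximate, so it can help to compute them directly. A minimal sketch, using only the first four reserved_room_type and assigned_room_type values visible in the str() output above as sample input:

```r
# First four values of each column, as shown in the str() output.
reserved <- c("C", "C", "A", "A")
assigned <- c("C", "C", "C", "A")

# Same labelling rule as in the room_type summary.
final_room <- ifelse(reserved == assigned, "Same Room Type", "Different Room Type")

# Absolute counts and percentage shares per group.
room_counts <- table(final_room)
room_shares <- round(100 * prop.table(room_counts), 1)
print(room_counts)
print(room_shares)
```

Applied to the full columns of hotel_csv, the same calls would give the exact counts and percentages behind the bar chart.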

This comparison is useful because it hints at the demand for and supply of the various room types. Hotel management can analyze this information and adjust the number of rooms of each type.

Conclusion

We created a few more graphs compared to Challenge 6 and made them publication-ready by adding appropriate titles, subtitles, axis labels, and legends.