What I explored:

- Where do most guests come from?

- Which segment of the market has the fewest days on the waiting list?

- Types of booking by hotel type.

- Which are the busiest months for both hotel types?

Loading the data

There are 119390 rows and 32 columns in this dataset.

library(readr)
hotel_bookings <- read_csv("hotel_bookings.csv")
## Rows: 119390 Columns: 32
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date  (1): reservation_status_date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v dplyr   1.0.8
## v tibble  3.1.6     v stringr 1.4.0
## v tidyr   1.2.0     v forcats 0.5.1
## v purrr   0.3.4     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(ggplot2)
View(hotel_bookings)

Exploring and preparing the data

str(hotel_bookings)
## spec_tbl_df [119,390 x 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119390] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   hotel = col_character(),
##   ..   is_canceled = col_double(),
##   ..   lead_time = col_double(),
##   ..   arrival_date_year = col_double(),
##   ..   arrival_date_month = col_character(),
##   ..   arrival_date_week_number = col_double(),
##   ..   arrival_date_day_of_month = col_double(),
##   ..   stays_in_weekend_nights = col_double(),
##   ..   stays_in_week_nights = col_double(),
##   ..   adults = col_double(),
##   ..   children = col_double(),
##   ..   babies = col_double(),
##   ..   meal = col_character(),
##   ..   country = col_character(),
##   ..   market_segment = col_character(),
##   ..   distribution_channel = col_character(),
##   ..   is_repeated_guest = col_double(),
##   ..   previous_cancellations = col_double(),
##   ..   previous_bookings_not_canceled = col_double(),
##   ..   reserved_room_type = col_character(),
##   ..   assigned_room_type = col_character(),
##   ..   booking_changes = col_double(),
##   ..   deposit_type = col_character(),
##   ..   agent = col_character(),
##   ..   company = col_character(),
##   ..   days_in_waiting_list = col_double(),
##   ..   customer_type = col_character(),
##   ..   adr = col_double(),
##   ..   required_car_parking_spaces = col_double(),
##   ..   total_of_special_requests = col_double(),
##   ..   reservation_status = col_character(),
##   ..   reservation_status_date = col_date(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>
glimpse(hotel_bookings)
## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort~
## $ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ~
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ~
## $ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201~
## $ arrival_date_month             <chr> "July", "July", "July", "July", "July",~
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~
## $ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ~
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB~
## $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR~
## $ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat~
## $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat~
## $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",~
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",~
## $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit~
## $ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "~
## $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",~
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ customer_type                  <chr> "Transient", "Transient", "Transient", ~
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,~
## $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ~
## $ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", ~
## $ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20~
summary(hotel_bookings)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-02-01     
##  Median :2016-08-07     
##  Mean   :2016-07-30     
##  3rd Qu.:2017-02-08     
##  Max.   :2017-09-14     
## 

Many of the categorical variables are stored as character data, so summary() reports only their length, class, and mode, which is not very informative.

Converting categorical and binary variables into factors for better understanding and exploration

hotel_bookings <- hotel_bookings %>%
  mutate(hotel = as.factor(hotel),      
         is_canceled = as.factor(is_canceled),
         arrival_date_day_of_month = as.factor(arrival_date_day_of_month),
         arrival_date_month = as.factor(arrival_date_month),
         arrival_date_year = as.factor(arrival_date_year),
         meal = as.factor(meal),
         country = as.factor(country),
         market_segment = as.factor(market_segment),
         distribution_channel = as.factor(distribution_channel),
         is_repeated_guest = as.factor(is_repeated_guest),
         reserved_room_type = as.factor(reserved_room_type),
         assigned_room_type = as.factor(assigned_room_type),
         deposit_type = as.factor(deposit_type),
         customer_type = as.factor(customer_type),
         reservation_status = as.factor(reservation_status),
         agent = as.factor(agent),
         company = as.factor(company)
         )

glimpse(hotel_bookings) #seeing if the conversion worked 
## Rows: 119,390
## Columns: 32
## $ hotel                          <fct> Resort Hotel, Resort Hotel, Resort Hote~
## $ is_canceled                    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ~
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ~
## $ arrival_date_year              <fct> 2015, 2015, 2015, 2015, 2015, 2015, 201~
## $ arrival_date_month             <fct> July, July, July, July, July, July, Jul~
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~
## $ arrival_date_day_of_month      <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ~
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ meal                           <fct> BB, BB, BB, BB, BB, BB, BB, FB, BB, HB,~
## $ country                        <fct> PRT, PRT, GBR, GBR, GBR, GBR, PRT, PRT,~
## $ market_segment                 <fct> Direct, Direct, Direct, Corporate, Onli~
## $ distribution_channel           <fct> Direct, Direct, Direct, Corporate, TA/T~
## $ is_repeated_guest              <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ reserved_room_type             <fct> C, C, A, A, A, A, C, C, A, D, E, D, D, ~
## $ assigned_room_type             <fct> C, C, C, A, A, A, C, C, A, D, E, D, E, ~
## $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ deposit_type                   <fct> No Deposit, No Deposit, No Deposit, No ~
## $ agent                          <fct> NULL, NULL, NULL, 304, 240, 240, NULL, ~
## $ company                        <fct> NULL, NULL, NULL, NULL, NULL, NULL, NUL~
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ customer_type                  <fct> Transient, Transient, Transient, Transi~
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,~
## $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ~
## $ reservation_status             <fct> Check-Out, Check-Out, Check-Out, Check-~
## $ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20~
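The conversion above names each column by hand. As a minimal sketch of a shorter route, the snippet below converts every character column to a factor in one step. It uses base R and a small made-up data frame (`bookings`) as a stand-in for `hotel_bookings`, so the values are illustrative only.

```r
# Toy stand-in for hotel_bookings (values are made up for illustration)
bookings <- data.frame(
  hotel = c("Resort Hotel", "City Hotel", "City Hotel"),
  meal = c("BB", "HB", "BB"),
  lead_time = c(342, 7, 13),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert each flagged column to a factor
is_chr <- vapply(bookings, is.character, logical(1))
bookings[is_chr] <- lapply(bookings[is_chr], as.factor)

str(bookings)
```

This only catches columns that are already character. Numeric binary flags such as is_canceled or is_repeated_guest would still need to be converted by name, as in the code above.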

Cleaning the data and checking for missing values in each column:

colSums(is.na(hotel_bookings))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

There seem to be 4 missing values for children and no missing values in the rest of the variables.
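One caveat: the agent and company columns print the text "NULL" in the output above, and is.na() does not treat that string as missing. The sketch below shows the difference on a small made-up data frame (`bookings` is a hypothetical stand-in, not the real data).

```r
# Toy stand-in for the agent, company, and children columns (made-up values)
bookings <- data.frame(
  agent = c("NULL", "304", "240", "NULL"),
  company = c("NULL", "NULL", "NULL", "110"),
  children = c(0, NA, 1, 0),
  stringsAsFactors = FALSE
)

# is.na() only sees true NA values, so the "NULL" strings are not counted
na_counts <- colSums(is.na(bookings))

# Count the literal "NULL" strings column by column
null_counts <- vapply(
  bookings,
  function(x) sum(x == "NULL", na.rm = TRUE),
  integer(1)
)

na_counts
null_counts
```

If those placeholders should count as missing, one option is to pass `na = c("", "NA", "NULL")` to read_csv() when loading the file, so they arrive as NA from the start.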

Replacing the missing values (NA) in children with the rounded column mean

hotel_bookings$children[is.na(hotel_bookings$children)] = round(mean(hotel_bookings$children, na.rm = TRUE),0)
sum(is.na(hotel_bookings$children))
## [1] 0
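Because the mean of children is well below 0.5 (0.1039 in the summary above), the rounded mean is 0, so the step above amounts to filling the four gaps with 0. The sketch below shows the same arithmetic on a short made-up vector and compares it with the median, which is another common fill value for count data.

```r
# Toy stand-in for the children column (made-up values, mostly zeros)
children <- c(0, 0, 0, 2, NA, 0, 1, NA, 0, 0)

# Rounded mean, as in the step above, and the median for comparison
mean_fill <- round(mean(children, na.rm = TRUE), 0)
median_fill <- median(children, na.rm = TRUE)

# Fill the gaps with the rounded mean
children[is.na(children)] <- mean_fill

mean_fill
median_fill
sum(is.na(children))
```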

I noticed that some bookings list 0 adults, which does not seem right, since children and babies can’t be booked without an adult.

# Finding number of observations when no adult checked in
nrow(subset(hotel_bookings, hotel_bookings$adults == 0))
## [1] 403
# Finding number of observations when no adults, children and babies checked in
nrow(subset(hotel_bookings, hotel_bookings$babies == 0 & hotel_bookings$adults == 0 & hotel_bookings$children == 0))
## [1] 180
# Finding number of observations when only babies checked in
nrow(subset(hotel_bookings, hotel_bookings$babies != 0 & hotel_bookings$adults == 0 & hotel_bookings$children == 0))
## [1] 0

Removing observations with no adults, children, or babies from the dataset

# Keep a booking if it has at least one adult, baby, or child
hotel_bookings <- hotel_bookings %>% filter(adults != 0 | babies != 0 | children != 0)
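An equivalent way to express this filter is to add up the guests and keep bookings with a positive total, which avoids repeating a condition by mistake. A minimal sketch on made-up rows (`bookings` and `total_guests` are hypothetical names used only here):

```r
# Toy stand-in for the guest-count columns (made-up values)
bookings <- data.frame(
  adults = c(2, 0, 0, 1),
  children = c(0, 0, 2, 0),
  babies = c(0, 0, 0, 1)
)

# Total party size per booking
bookings$total_guests <- with(bookings, adults + children + babies)

# Keep only bookings with at least one guest of any kind
kept <- subset(bookings, total_guests > 0)

nrow(kept)
```

This assumes none of the three columns contain NA at this point, which holds here because the missing children values were filled in earlier.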

More Exploring…

Exploring the number of countries involved in this dataset

hotel_bookings %>%
  group_by(country)%>%
  summarise(num=n())%>%
  arrange(desc(num))
## # A tibble: 178 x 2
##    country   num
##    <fct>   <int>
##  1 PRT     48483
##  2 GBR     12120
##  3 FRA     10401
##  4 ESP      8560
##  5 DEU      7285
##  6 ITA      3761
##  7 IRL      3374
##  8 BEL      2342
##  9 BRA      2222
## 10 NLD      2103
## # ... with 168 more rows

Where do most guests come from? Only working with the top 20 countries

data_country <- hotel_bookings %>%
  group_by(country) %>%
  summarise(booking_count = n()) %>%
  arrange(desc(booking_count))

# Keep the 20 countries with the most bookings and plot them as horizontal bars
data_country %>%
  slice_max(booking_count, n = 20) %>%
  ggplot(aes(x = reorder(country, booking_count), y = booking_count)) +
  geom_bar(stat = "identity", width = 0.25) +
  coord_flip() +
  ylab('Count') +
  xlab('Country') +
  ggtitle('Bookings by country')

These hotels are located in Portugal, which may explain why most bookings come from European countries and why Portugal ranks highest.
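To put a number on Portugal's lead, the counts can be turned into shares of all bookings. The sketch below reuses the top five counts from the tibble printed earlier. The total is an assumption: the 119390 original rows minus the 180 bookings with no guests that were removed.

```r
# Top five counts copied from the tibble printed earlier
top_counts <- c(PRT = 48483, GBR = 12120, FRA = 10401, ESP = 8560, DEU = 7285)

# Assumed total: 119390 original rows minus the 180 bookings with no guests
total_bookings <- 119390 - 180

# Share of all bookings, in percent, rounded to one decimal place
share_pct <- round(100 * top_counts / total_bookings, 1)

share_pct
```

Under that assumption, Portugal alone accounts for roughly 41 percent of bookings.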

Days on a waiting list by market segment

ggplot(hotel_bookings, aes(x = market_segment, y = days_in_waiting_list)) +
  geom_point() +
  ylab('Days on a waiting list') +
  xlab('Market segment') +
  ggtitle('Days on Waiting List by Market Segment')

The aviation segment has the fewest days on the waiting list. One possible reason is that airlines have to provide immediate accommodation for their staff, and sometimes even their passengers, so they avoid booking hotels that would put them on a waiting list.
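A scatter plot makes it hard to compare segments precisely, because many points overlap at zero. A small summary table can back up the reading of the plot. The sketch below computes the mean and maximum waiting days per segment on made-up rows; the segment names and values are illustrative, not results from the dataset.

```r
# Toy stand-in for the two columns used in the plot (made-up values)
bookings <- data.frame(
  market_segment = c("Aviation", "Aviation", "Groups", "Groups",
                     "Online TA", "Online TA"),
  days_in_waiting_list = c(0, 0, 120, 40, 0, 15)
)

# Mean and maximum waiting days per market segment
mean_wait <- tapply(bookings$days_in_waiting_list, bookings$market_segment, mean)
max_wait <- tapply(bookings$days_in_waiting_list, bookings$market_segment, max)

waiting_summary <- data.frame(
  market_segment = names(mean_wait),
  mean_days = as.vector(mean_wait),
  max_days = as.vector(max_wait)
)

# Segments with the shortest average wait come first
waiting_summary[order(waiting_summary$mean_days), ]
```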

Exploring types of booking by hotel types

checkout = hotel_bookings  %>% filter(reservation_status == 'Check-Out') 
dim(checkout)
## [1] 75011    32
ggplot(checkout, aes(customer_type, fill=hotel)) +
  geom_bar(stat='count', position=position_dodge())+
  scale_fill_manual(values = c ("#b32db5" , "#881a58"))+
  ylab('Count')+
  xlab('Booking type')+
  ggtitle('Type of Booking by Different Hotel Type') +
  labs(fill='Hotel type')+
  theme_classic()

The most common type of booking for both hotel types is Transient: a booking that is not part of a group or contract and is not associated with another transient booking.
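Raw counts can hide differences when the two hotels have very different numbers of bookings. A minimal sketch, on made-up rows, of comparing customer types as shares within each hotel instead:

```r
# Toy stand-in for the hotel and customer_type columns (made-up values)
bookings <- data.frame(
  hotel = c("City Hotel", "City Hotel", "City Hotel", "City Hotel",
            "Resort Hotel", "Resort Hotel"),
  customer_type = c("Transient", "Transient", "Transient", "Contract",
                    "Transient", "Group")
)

# Cross-tabulate hotel against customer type
counts <- table(bookings$hotel, bookings$customer_type)

# margin = 1 scales each row, so every hotel's shares sum to 1
shares <- prop.table(counts, margin = 1)

round(shares, 2)
```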

Exploring Seasonality of different hotel types as my final visualization.

I wanted to explore the seasonality of hotel bookings for city and resort hotels to see the busiest months for each.

hotel_bookings$arrival_date_month = factor(hotel_bookings$arrival_date_month, levels = month.name)

ggplot(hotel_bookings, aes(x=arrival_date_month, fill = hotel))+
  geom_bar(position = 'dodge')+
  scale_fill_manual(values = c ("#d4660a" , "#e8b31a" )) +
  ylab('Count of Hotel Bookings ')+
  xlab('Month')+
  ggtitle('Seasonality of Different Hotel Types')+
  labs(fill='Hotel type')+
  theme_classic()+
   theme(axis.text.x = element_text(size = 6.5),
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 9),
        axis.title.x = element_text(size = 11),
        axis.title.y = element_text(size = 11),
        plot.title = element_text(size = 12, face = 'bold'))

August is the busiest month for both hotel types, and the least busy months are at the beginning and end of the year.
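One thing that may be worth checking: the data runs from 1 July 2015 to 31 August 2017, so July and August appear in three calendar years while every other month appears in two. Dividing each month's total by the number of years it appears in gives a fairer comparison. The sketch below shows the idea on made-up rows (`bookings` is a hypothetical stand-in).

```r
# Toy stand-in: one row per booking (made-up rows)
bookings <- data.frame(
  arrival_date_year = c(2015, 2016, 2017, 2016, 2017, 2016, 2017, 2017),
  arrival_date_month = c("August", "August", "August", "August", "August",
                         "March", "March", "March")
)

# Total bookings per month across all years
total_per_month <- tapply(bookings$arrival_date_year,
                          bookings$arrival_date_month, length)

# Number of distinct years in which each month appears
years_per_month <- tapply(bookings$arrival_date_year,
                          bookings$arrival_date_month,
                          function(y) length(unique(y)))

# Average bookings per month per year
avg_per_year <- total_per_month / years_per_month

avg_per_year
```

Whether August still comes out on top after this adjustment is something to verify on the real data; the toy rows above say nothing about that.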

Data Source, Process and Findings

This data comes from the article Hotel Booking Demand Datasets, written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019. I found the dataset on Kaggle. It combines booking demand for two hotels in Portugal between the 1st of July 2015 and the 31st of August 2017: a resort hotel in the Algarve and a city hotel in Lisbon. The dataset has 119,390 observations of 32 variables to explore, such as the meals bought with every booking, when the bookings were made or canceled, the number of adults and children, length of stay, market segments, country of origin, and more.

Most of the categorical data were stored as character data, so I converted the categorical and binary variables into factors to get more meaningful summary statistics. When cleaning the data, I first checked for missing values and found four in the “children” column, which I replaced with the column mean rounded to the nearest whole number. I then noticed that some bookings listed 0 adults, which didn’t make a lot of sense since children and babies need an adult to be booked. As a result, I counted the observations where no adult checked in, where only babies checked in, and where no adults, children, or babies checked in, and removed only this last group from the dataset.

There were many aspects of this dataset to explore, but I focused on four. I wanted to see where most customers came from, which segment of the market had the fewest days on the waiting list, the types of booking by hotel type, and lastly, as my final visualization, which months are the busiest for both the city hotel and the resort hotel. For my last two visualizations, I used a color-blind-friendly color palette. In my final visualization, although it was no surprise to see a spike for both hotel types during the summer, I found it interesting that the booking trend for the resort hotel was steadier throughout the year than for the city hotel; I would have expected the opposite. I also found it interesting to see such a high number of local tourists in one of my other visualizations. I wish I could have included at least one interactive visualization in the project.