Background

In this activity, I analyzed data in R for a hotel bookings company, using ggplot2 to quickly create data visualizations that let me explore the data and share new insights. The steps below show how I created some simple data visualizations with the ggplot2 package.

Step 1: Import data

I imported the .csv file “hotel_bookings.csv” from the project folder using the read.csv() function and saved it as a data frame called hotel_bookings:

hotel_bookings <- read.csv("hotel_bookings.csv")
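The preview in Step 2 shows the literal text NULL in the agent and company columns. As a minimal sketch, assuming those entries should count as missing values, read.csv() accepts an na.strings argument for this. The three-row CSV written to a temporary file below is made-up illustration data, not the real hotel_bookings.csv.

```r
# Made-up three-row CSV standing in for hotel_bookings.csv (illustration only).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "hotel,lead_time,agent,company",
  "Resort Hotel,342,NULL,NULL",
  "Resort Hotel,13,304,NULL",
  "City Hotel,14,240,NULL"
), tmp)

# Default import: "NULL" is ordinary text, so agent is read as text
# (character, or factor in R versions before 4.0) rather than as a number.
raw <- read.csv(tmp)
class(raw$agent)

# Treat the text "NULL" (and the default "NA") as missing values instead.
clean <- read.csv(tmp, na.strings = c("NULL", "NA"))
class(clean$agent)
sum(is.na(clean$agent))
```

Whether to convert these values depends on the analysis; the plots in this activity do not use either column.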

Step 2: Examine the data

I previewed the data using the head() function, which prints the first six rows:

head(hotel_bookings)
##          hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel           0       342              2015               July
## 2 Resort Hotel           0       737              2015               July
## 3 Resort Hotel           0         7              2015               July
## 4 Resort Hotel           0        13              2015               July
## 5 Resort Hotel           0        14              2015               July
## 6 Resort Hotel           0        14              2015               July
##   arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                       27                         1                       0
## 2                       27                         1                       0
## 3                       27                         1                       0
## 4                       27                         1                       0
## 5                       27                         1                       0
## 6                       27                         1                       0
##   stays_in_week_nights adults children babies meal country market_segment
## 1                    0      2        0      0   BB     PRT         Direct
## 2                    0      2        0      0   BB     PRT         Direct
## 3                    1      1        0      0   BB     GBR         Direct
## 4                    1      1        0      0   BB     GBR      Corporate
## 5                    2      2        0      0   BB     GBR      Online TA
## 6                    2      2        0      0   BB     GBR      Online TA
##   distribution_channel is_repeated_guest previous_cancellations
## 1               Direct                 0                      0
## 2               Direct                 0                      0
## 3               Direct                 0                      0
## 4            Corporate                 0                      0
## 5                TA/TO                 0                      0
## 6                TA/TO                 0                      0
##   previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                              0                  C                  C
## 2                              0                  C                  C
## 3                              0                  A                  C
## 4                              0                  A                  A
## 5                              0                  A                  A
## 6                              0                  A                  A
##   booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1               3   No Deposit  NULL    NULL                    0     Transient
## 2               4   No Deposit  NULL    NULL                    0     Transient
## 3               0   No Deposit  NULL    NULL                    0     Transient
## 4               0   No Deposit   304    NULL                    0     Transient
## 5               0   No Deposit   240    NULL                    0     Transient
## 6               0   No Deposit   240    NULL                    0     Transient
##   adr required_car_parking_spaces total_of_special_requests reservation_status
## 1   0                           0                         0          Check-Out
## 2   0                           0                         0          Check-Out
## 3  75                           0                         0          Check-Out
## 4  75                           0                         0          Check-Out
## 5  98                           0                         1          Check-Out
## 6  98                           0                         1          Check-Out
##   reservation_status_date
## 1              2015-07-01
## 2              2015-07-01
## 3              2015-07-02
## 4              2015-07-02
## 5              2015-07-03
## 6              2015-07-03

I also used the colnames() function to get the names of all the columns in the data:

colnames(hotel_bookings)
##  [1] "hotel"                          "is_canceled"                   
##  [3] "lead_time"                      "arrival_date_year"             
##  [5] "arrival_date_month"             "arrival_date_week_number"      
##  [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
##  [9] "stays_in_week_nights"           "adults"                        
## [11] "children"                       "babies"                        
## [13] "meal"                           "country"                       
## [15] "market_segment"                 "distribution_channel"          
## [17] "is_repeated_guest"              "previous_cancellations"        
## [19] "previous_bookings_not_canceled" "reserved_room_type"            
## [21] "assigned_room_type"             "booking_changes"               
## [23] "deposit_type"                   "agent"                         
## [25] "company"                        "days_in_waiting_list"          
## [27] "customer_type"                  "adr"                           
## [29] "required_car_parking_spaces"    "total_of_special_requests"     
## [31] "reservation_status"             "reservation_status_date"
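head() and colnames() show values and names, but not column types or missing values. The sketch below is one possible extra check, using base R's str() and is.na() on a small made-up data frame that borrows three column names from hotel_bookings. The values are invented for illustration.

```r
# Made-up data frame reusing three hotel_bookings column names (illustration only).
bookings <- data.frame(
  lead_time = c(342, 7, 13, 14, 90),
  children = c(0, NA, 2, 0, 1),
  stays_in_weekend_nights = c(0, 2, 1, 0, 2)
)

# str() lists each column with its type and its first few values.
str(bookings)

# Count missing values per column. Rows with NA in a plotted column are
# the kind of rows ggplot2 drops with a "Removed ... rows" warning.
na_counts <- colSums(is.na(bookings))
na_counts
```

Running the same two calls on hotel_bookings would likely explain the “Removed 4 rows” warnings that appear in Step 4.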

Step 3: Install and load the ‘ggplot2’ package

Before creating visualizations and deriving quick insights, I needed to install and load the ggplot2 package:

install.packages("ggplot2")
library(ggplot2)
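install.packages() downloads the package again each time the script runs. A common alternative, sketched here, is to install only when the package is missing. The repos value is an assumption added so the call also works in a non-interactive session; any CRAN mirror would do.

```r
# Install ggplot2 only when it is not already available, then load it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  # The CRAN mirror below is an assumption; substitute your preferred mirror.
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}
library(ggplot2)
```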

Step 4: Create plots to answer questions

Question One

“The Stakeholder wants to target people who book early, and has a hypothesis that people with children have to book in advance”

I created a simple scatterplot with ggplot2 to check whether that statement holds:

ggplot(data = hotel_bookings)+
  geom_point(mapping = aes(x = lead_time, y = children))+
  geom_smooth(mapping = aes(x = lead_time, y = children))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

Scatterplots are useful for showing the relationship between two numeric variables. The geom_point() function draws one point per booking to create the scatterplot, and geom_smooth() adds a smoothed trend line on top of it.

On the x-axis, the plot shows how far in advance the booking is made, with the bookings furthest to the right happening most in advance. On the y-axis, it shows how many children there are in a party.
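Because geom_point() and geom_smooth() use the same x and y variables, the mapping can be declared once in ggplot() and inherited by both layers. The sketch below shows that pattern on a small made-up data frame. It also swaps in method = "lm" (a straight-line fit) so the example behaves predictably on eight rows, which differs from the GAM smoother ggplot2 chose for the full dataset.

```r
library(ggplot2)

# Made-up data standing in for hotel_bookings (illustration only).
toy <- data.frame(
  lead_time = c(5, 20, 45, 90, 150, 210, 300, 365),
  children  = c(0, 1, 0, 2, 0, 1, 0, 0)
)

# Declaring aes() once in ggplot() lets both layers inherit the same mapping.
p <- ggplot(data = toy, mapping = aes(x = lead_time, y = children)) +
  geom_point() +
  # Straight-line fit, chosen for this small example only.
  geom_smooth(method = "lm", formula = y ~ x)

print(p)
```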

Conclusion drawn

The plot does not support the stakeholder’s hypothesis: many of the bookings made furthest in advance are by people with 0 children.
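A scatterplot with a count variable on the y-axis stacks many points on top of each other, so a numeric summary can help confirm what the plot suggests. The sketch below is one way to compare lead times for bookings with and without children, using base R's aggregate() on made-up data. The has_children column is a hypothetical helper, not part of the dataset.

```r
# Made-up data standing in for hotel_bookings (illustration only).
toy <- data.frame(
  lead_time = c(5, 20, 45, 90, 150, 210, 300, 365),
  children  = c(0, 1, 0, 2, 0, 1, 0, NA)
)

# Hypothetical helper column: TRUE when the booking includes at least one child.
# A missing child count stays NA and is left out of both groups.
toy$has_children <- toy$children > 0

# Median lead time in each group (rows with NA are dropped by the formula method).
res <- aggregate(lead_time ~ has_children, data = toy, FUN = median)
res

# Number of bookings in each group.
table(toy$has_children)
```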

Question Two

“The Stakeholder wants to increase weekend bookings, an important source of revenue for the hotel. The stakeholder wants to know what group of guests booked the most weekend nights in order to target that group in a new marketing campaign. She suggests that guests without children book the most weekend nights. Is this true?”

To answer this question, I created a plot using ggplot2 to check whether the stakeholder is correct:

ggplot(data = hotel_bookings) +
  geom_point(mapping = aes(x = stays_in_weekend_nights, y = children)) +
  geom_smooth(mapping = aes(x = stays_in_weekend_nights, y = children)) +
  annotate("text", x = 13, y = 1.0, label = "No Correlation", color = "red") +
  labs(title = "Do people with children often book stays in weekend nights?", x = "Stays in weekend nights", y = "Children")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

I mapped ‘stays_in_weekend_nights’ on the x-axis and ‘children’ on the y-axis. I then added a chart title and changed the axis names to ‘Stays in weekend nights’ and ‘Children’ respectively using the labs() function.

The annotate() function places the text “No Correlation” inside the chart, in red at x = 13 and y = 1.0, to describe the relationship between the two numeric variables.

Conclusion drawn

The plot shows that most weekend-night bookings are made by people with zero children. Therefore, the stakeholder’s suggestion is correct.
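The stakeholder’s question is about which group booked the most weekend nights, so a total per group answers it more directly than a scatterplot. The sketch below shows one possible way to do that with aggregate() and geom_col() on made-up data. The group labels and the toy values are assumptions for illustration, not results from hotel_bookings.

```r
library(ggplot2)

# Made-up data standing in for hotel_bookings (illustration only).
toy <- data.frame(
  stays_in_weekend_nights = c(0, 2, 1, 2, 0, 4, 2, 1),
  children                = c(0, 0, 1, 0, 2, 0, 0, NA)
)

# Drop rows with a missing child count, then label each booking.
toy <- toy[!is.na(toy$children), ]
toy$group <- ifelse(toy$children > 0, "With children", "No children")

# Total weekend nights booked by each group.
totals <- aggregate(stays_in_weekend_nights ~ group, data = toy, FUN = sum)
totals

# Bar chart of the totals, one bar per group.
p <- ggplot(totals, aes(x = group, y = stays_in_weekend_nights)) +
  geom_col() +
  labs(x = "Guest group", y = "Total weekend nights")
print(p)
```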

Wrap Up

R is a widely used platform for graphical data analysis and visualization. The ggplot2 package allows you to quickly create data visualizations that can answer questions and give you insights about your data.