In this activity, I conducted an analysis in R for a hotel booking
company, using ggplot2 to quickly create data visualizations that let
me explore the data and share new insights. Here is
how I created some simple data visualizations with the ggplot2
package, step by step.
I imported the .csv file in the project folder, called
“hotel_bookings.csv”, using the read.csv() function and
saved it as a data frame called hotel_bookings:
hotel_bookings <- read.csv("hotel_bookings.csv")
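Base R's read.csv() is easy to confuse with read_csv() from the readr package, which is a separate function that returns a tibble rather than a plain data frame. A minimal sketch of what read.csv() returns, assuming a tiny made-up CSV written to a temporary file (the rows are illustrative and not taken from hotel_bookings.csv):

```r
# Write a small, made-up CSV to a temporary file (illustrative data only).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "hotel,lead_time,children",
  "Resort Hotel,342,0",
  "City Hotel,7,2"
), tmp)

# read.csv() ships with base R (utils package) and returns a data.frame.
sample_bookings <- read.csv(tmp)

class(sample_bookings)  # "data.frame"
dim(sample_bookings)    # 2 rows, 3 columns
```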
I previewed the data using the head() function to look
at a sample of the data:
head(hotel_bookings)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
I also used the colnames() function to get the names of all
the columns in the data:
colnames(hotel_bookings)
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
Before creating visualizations and deriving quick insights,
I installed and loaded the ggplot2 package, which is
required for every plot that follows:
install.packages("ggplot2")
library(ggplot2)
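Calling install.packages() unconditionally reinstalls the package every time the script runs. A common alternative, sketched here, installs ggplot2 only when it is missing. The repos URL is an assumption (the CRAN cloud mirror) added so the call also works in a non-interactive session; it is not part of the original code.

```r
# Install ggplot2 only if it is not already available on this machine.
# The mirror URL is an assumed default; any CRAN mirror would do.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}

# Attach the package so ggplot(), aes(), and the geom_*() functions are available.
library(ggplot2)
```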
“The stakeholder wants to target people who book early, and has a hypothesis that people with children have to book in advance.”
I created a simple scatterplot with ggplot2 to check whether that
statement holds:
ggplot(data = hotel_bookings) +
  geom_point(mapping = aes(x = lead_time, y = children)) +
  geom_smooth(mapping = aes(x = lead_time, y = children))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
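The two warnings mean that ggplot2 dropped four rows before drawing, most likely because the children column contains missing values. A minimal sketch of how to check this, using a made-up stand-in data frame (toy_bookings is hypothetical, not the real dataset):

```r
# Made-up stand-in for hotel_bookings with one missing value in `children`.
toy_bookings <- data.frame(
  lead_time = c(342, 737, 7, 13, 14),
  children  = c(0, 0, NA, 1, 2)
)

# Count the rows that would be dropped because `children` is NA.
n_missing <- sum(is.na(toy_bookings$children))
n_missing  # 1

# Optionally remove those rows up front so the plot no longer warns.
complete_bookings <- toy_bookings[!is.na(toy_bookings$children), ]
nrow(complete_bookings)  # 4
```

Running the same sum(is.na(...)) call on hotel_bookings$children would show whether the count matches the four rows reported in the warnings.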
Scatterplots are useful for showing the relationship between two numeric variables. The geom_point() function draws points to create the scatterplot, and geom_smooth() adds a smoothed trend line on top of it.
On the x-axis, the plot shows how far in advance each booking was made, with the points furthest to the right representing the bookings made furthest in advance. On the y-axis, it shows how many children there are in a party.
The plot suggests that the stakeholder’s hypothesis is incorrect: many of the advance bookings are made by people with 0 children.
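A scatterplot with heavy overplotting can be hard to read, so a numeric summary is a useful cross-check. The sketch below compares the median lead time of parties with and without children. The toy_bookings data frame and the has_children column are made up for illustration, so the numbers say nothing about the real dataset.

```r
# Made-up stand-in for hotel_bookings (illustrative values only).
toy_bookings <- data.frame(
  lead_time = c(342, 737, 7, 13, 14, 60),
  children  = c(0, 0, 0, 1, 2, 1)
)

# Hypothetical helper column: TRUE when the party includes at least one child.
toy_bookings$has_children <- toy_bookings$children > 0

# Median lead time (in days) for each group.
lead_by_group <- aggregate(lead_time ~ has_children,
                           data = toy_bookings, FUN = median)
lead_by_group
```

Applied to hotel_bookings, a clearly higher median for has_children == TRUE would support the hypothesis, while similar or lower values would agree with the reading of the plot.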
“The stakeholder wants to increase weekend bookings, an important source of revenue for the hotel. The stakeholder wants to know which group of guests books the most weekend nights in order to target that group in a new marketing campaign. She suggests that guests without children book the most weekend nights. Is this true?”
To answer this question, I created a plot with ggplot2
to see whether the stakeholder is correct:
ggplot(data = hotel_bookings) +
  geom_point(mapping = aes(x = stays_in_weekend_nights, y = children)) +
  geom_smooth(mapping = aes(x = stays_in_weekend_nights, y = children)) +
  annotate("text", x = 13, y = 1.0, label = "No Correlation", color = "red") +
  labs(title = "Do people with children often book stays in weekend nights?",
       x = "Stays in weekend nights", y = "Children")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
I mapped ‘stays_in_weekend_nights’ to the x-axis and ‘children’ to the y-axis. I then added a chart title and changed the axis labels to ‘Stays in weekend nights’ and ‘Children’ respectively using the labs() function.
The annotate() function places the text “No Correlation” inside the chart, at x = 13 and y = 1.0, to describe the relationship between the two numeric variables.
The plot reveals that most of the weekend-night bookings are made by people with zero children. Therefore, the stakeholder is correct and her suggestion is true.
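The same conclusion can be checked numerically by adding up weekend nights per group. The sketch below assumes a made-up toy_bookings data frame and a hypothetical has_children column, and it reports both the total and the mean per booking, since a group with far more bookings tends to have the larger total even when the per-booking average is similar.

```r
# Made-up stand-in for hotel_bookings (illustrative values only).
toy_bookings <- data.frame(
  stays_in_weekend_nights = c(0, 2, 1, 4, 2, 0),
  children                = c(0, 0, 2, 0, 1, NA)
)

# Hypothetical helper column; it is NA where `children` is NA.
toy_bookings$has_children <- toy_bookings$children > 0

# Total weekend nights per group. The formula interface drops NA rows by default.
weekend_total <- aggregate(stays_in_weekend_nights ~ has_children,
                           data = toy_bookings, FUN = sum)

# Mean weekend nights per booking, to account for different group sizes.
weekend_mean <- aggregate(stays_in_weekend_nights ~ has_children,
                          data = toy_bookings, FUN = mean)

weekend_total
weekend_mean
```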
R is a widely used platform for graphical data analysis and
visualization. The ggplot2 package lets you quickly
create data visualizations that can answer questions and give you
insights about your data.