Portugal Hotel vs Resort

Author

Kenny Nguyen

Introduction

The dataset I’m working with contains booking information from a city hotel in Lisbon and a resort in the Algarve region of Portugal, spanning the years 2015 to 2017. It comes from the article “Hotel Booking Demand Datasets” by Nuno Antonio, Ana Almeida, and Luis Nunes, published in Data in Brief (2019), and was retrieved from Kaggle. The dataset includes 32 different variables related to hotel bookings, such as whether a booking was canceled, lead time (the number of days between the booking and the arrival), and the arrival date (including day, week number, month, and year).

It also provides information on the number of weekend nights (Saturday and Sunday) and weekday nights (Monday to Friday) booked, as well as the number of adults, children, and babies in each reservation. Meal types are recorded—ranging from no meal to Bed & Breakfast (BB), Half Board (HB), and Full Board (FB). The dataset also includes the guest’s country of origin, market segment (“TA” for travel agents and “TO” for tour operators), and whether the guest was a returning visitor.

Additional variables cover the number of previous bookings (both canceled and completed), room type (anonymized by letter), number of changes made to a booking, deposit status, travel agency ID, and number of days the booking remained on a waiting list. It also classifies customer type—contract (formal agreement), group, transient (not part of a group or contract), or transient party (linked to another transient booking).

The dataset further includes the average daily rate (ADR), calculated by dividing the total booking cost by the number of nights, along with the number of parking spaces required and the number of special requests made. Finally, it records the reservation’s final status (canceled, checked out, or no-show) and the date of that outcome. The data was originally extracted from the hotels’ Property Management System (PMS) databases and later cleaned by Thomas Mock and Antoine Bichat.

To better understand the dataset’s context, I did some background research. One helpful source was Why Travel to Portugal by TourHero, which highlights Portugal’s appeal and the best times to visit. Spring is recommended for its colorful flowers and pleasant weather, while summer (June to August) brings the highest volume of tourists and hotter temperatures. Interestingly, the article points out that June to August is also the best time to visit the Azores, due to warm, dry weather and minimal rainfall. It includes estimated flight prices from cities like Washington, Paris, and Rome.

While the article discusses travel highlights across Portugal, I focused specifically on Lisbon and the Algarve, the locations represented in the dataset. Lisbon, Europe’s second-oldest capital city, has a rich history shaped by Roman emperors, European crusaders, and the 1755 earthquake that caused a tsunami and fire. This event left an architectural contrast between pre- and post-quake buildings. Lisbon is also known for its cafes, boutiques, and the Route 2B tram, which runs through historic neighborhoods and past landmarks such as the Tower of Belém.

The Algarve, in contrast, is a coastal paradise known for its beaches and over 20 resort towns, including Lagos, Albufeira, Praia da Rocha, and Vilamoura. Lagos has a strong Portuguese cultural presence, Vilamoura is known for luxury stays, and Albufeira is famous for its lively nightlife with bars and clubs in the streets.

I chose this dataset and topic because of my deep love for traveling, especially to beach destinations where I can unwind and escape the stress of everyday life. Since I was a child, I dreamed of traveling, but my hardworking foreign parents, who came to this country with nothing, never had the opportunity to go on vacation. That always saddened me. Their sacrifices have inspired me to travel not just for myself, but for them too.

Now that I’m older and working, I take every opportunity to go on trips with my friends and cousins, using those experiences to heal the inner child in me who always desired to travel. I never take these moments for granted. Over time, I’ve become more efficient at planning meaningful, fun trips. When I came across this Portugal hotel booking dataset, I instantly felt connected to it.

library(tidyverse) # Loading library for dplyr commands
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer) # Loading library for color 
library(GGally) # Loading GGally, a ggplot2 extension (lm() itself comes from base R's stats package)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
library(plotly) # Loading library for interactivity

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
setwd("~/Desktop/Data 110")
hotel <- read_csv("hotel_bookings.csv")
Rows: 119390 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
date  (1): reservation_status_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(hotel) # Opening up my data set 
# A tibble: 6 × 32
  hotel        is_canceled lead_time arrival_date_year arrival_date_month
  <chr>              <dbl>     <dbl>             <dbl> <chr>             
1 Resort Hotel           0       342              2015 July              
2 Resort Hotel           0       737              2015 July              
3 Resort Hotel           0         7              2015 July              
4 Resort Hotel           0        13              2015 July              
5 Resort Hotel           0        14              2015 July              
6 Resort Hotel           0        14              2015 July              
# ℹ 27 more variables: arrival_date_week_number <dbl>,
#   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
#   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
#   meal <chr>, country <chr>, market_segment <chr>,
#   distribution_channel <chr>, is_repeated_guest <dbl>,
#   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
#   reserved_room_type <chr>, assigned_room_type <chr>, …

After reviewing the dataset, I concluded that it does not need any additional cleaning for the analyses below.
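One way to back up a statement like this is to count missing values per column. The snippet below is only a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not the hotel data); the same `colSums(is.na(...))` call could be pointed at `hotel`.

```r
# Hypothetical toy data standing in for a few columns of the hotel dataset
toy <- data.frame(
  lead_time = c(342, 737, NA, 13),
  children  = c(0, NA, NA, 2),
  meal      = c("BB", "HB", "BB", "FB")
)

# Count missing values in every column
na_counts <- colSums(is.na(toy))

# Show only the columns that actually contain missing values
na_counts[na_counts > 0]
```

If the last line returns an empty result for the real data, no column contains `NA` values. Placeholder strings that stand in for missing data would need a separate check.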

Statistical Analysis

cor(hotel$total_of_special_requests, hotel$stays_in_weekend_nights) # Finding the correlation between total number of special requests and weekend nights 
[1] 0.07267083
fit1 <- lm(total_of_special_requests ~ stays_in_weekend_nights, data = hotel) # Creating a model to predict total number of special requests based on weekend nights
  
summary(fit1) # Getting a summary of the regression model (p-value, R-squared, coefficients, etc.)

Call:
lm(formula = total_of_special_requests ~ stays_in_weekend_nights, 
    data = hotel)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.4409 -0.5755 -0.5178  0.4245  4.4822 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             0.517847   0.003123  165.80   <2e-16 ***
stays_in_weekend_nights 0.057693   0.002292   25.18   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7907 on 119388 degrees of freedom
Multiple R-squared:  0.005281,  Adjusted R-squared:  0.005273 
F-statistic: 633.8 on 1 and 119388 DF,  p-value: < 2.2e-16
plot(fit1) # Plotting the regression diagnostic plots

The correlation between the number of weekend nights a guest stays and the total number of special requests is 0.0727, indicating a very weak positive relationship. This suggests that guests who stay longer over the weekend are only slightly more likely to make special requests. The regression model is represented by the equation:

total_of_special_requests = 0.5178 + 0.0577 (stays_in_weekend_nights)

This means that for each additional weekend night a guest stays, the model predicts an increase of about 0.058 special requests. The p-value for stays_in_weekend_nights is less than 2e-16, which is highly statistically significant. With 119,390 bookings, however, even a very small effect reaches significance, so weekend night stays are a statistically detectable but weak predictor of special request totals.
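To make the equation concrete, here is a small sketch that plugs a few weekend-night values into it. The coefficients are copied from the `summary(fit1)` output above; `predict_requests` is a hypothetical helper name used only for illustration.

```r
# Coefficients copied from the summary(fit1) output
intercept <- 0.517847
slope     <- 0.057693

# Hypothetical helper: predicted special requests for a given number of weekend nights
predict_requests <- function(weekend_nights) {
  intercept + slope * weekend_nights
}

# Predictions for 0, 1, and 2 weekend nights
predict_requests(c(0, 1, 2))
```

Going from zero to two weekend nights raises the prediction by only about 0.12 requests, which shows how small the effect is in practical terms.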

The Adjusted R-Squared value is 0.0053, meaning that only about 0.53% of the variation in special requests can be explained by weekend night stays. While there is a statistically significant relationship, the effect is minimal, and most of the variation in special requests is likely influenced by other factors such as the length of total stay or the purpose of the trip.
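In a regression with a single predictor, the Multiple R-squared equals the squared correlation, so the two numbers reported above can be checked against each other. A quick sketch using the correlation printed earlier:

```r
# Correlation copied from the cor() output above
r <- 0.07267083

# For simple linear regression, R-squared is the square of the correlation
r_squared <- r^2

round(r_squared, 6)        # matches the Multiple R-squared of 0.005281
round(100 * r_squared, 2)  # share of variance explained, in percent
```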

Plot 1

summer <- hotel |>
  filter(arrival_date_month %in% c("June", "July", "August")) # Filtering so I focus on the summer months
ss <- summer |> # Use the summer-filtered dataset
  ggplot(aes(x = total_of_special_requests, # x-axis: number of special requests
             y = stays_in_weekend_nights, # y-axis: number of weekend nights
             color = arrival_date_month)) + # Color points by summer month (June, July, August)
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.7, size = 3) + # Jitter to reduce overplotting
  geom_smooth(method = "lm", linetype = "dashed", color = "black") + # Add a linear trend line
  scale_color_brewer(palette = "Spectral") + # Colorful palette to distinguish the months
  theme_minimal() + # Clean theme
  theme(legend.position = "top", # Move the legend to the top
        plot.title = element_text(face = "bold", size = 13)) + # Make the title bold and larger
  labs(title = "How Do Special Requests Relate to Weekend Stays in Summer?", # Title
       x = "# of Special Requests", # x-axis label
       y = "# of Weekend Night Stays", # y-axis label
       color = "Arrival Month", # Legend title
       caption = "Source: Hotel Booking Demand Dataset") # Data source caption
ss 
`geom_smooth()` using formula = 'y ~ x'

My visualization is a scatterplot that shows the number of special requests on the x-axis and the number of weekend nights a guest stayed on the y-axis. I used this type of plot because I wanted to explore the relationship between how many special requests a guest made and how long they stayed during the weekend. After experimenting with geom options in ggplot, I decided on geom_jitter. It solves a common problem in scatterplots where many data points overlap because the same values repeat, which happens here since both variables are whole-number counts. geom_jitter adds a small random offset to each point, so observations that would otherwise appear as a single dot are visually separated and patterns in the data are easier to see.
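The overplotting problem can be seen without drawing anything by counting how many rows share the same pair of values. This is a rough sketch on a made-up data frame (`toy` is hypothetical and only reuses the column names from the plot):

```r
# Hypothetical toy data: six bookings with whole-number counts
toy <- data.frame(
  total_of_special_requests = c(0, 0, 0, 1, 1, 2),
  stays_in_weekend_nights   = c(2, 2, 2, 2, 0, 1)
)

# Cross-tabulate the two variables: each cell is one position on the scatterplot
overlap <- table(toy$total_of_special_requests, toy$stays_in_weekend_nights)
overlap

# Number of distinct positions versus number of bookings
sum(overlap > 0)
nrow(toy)
```

Here six bookings occupy only four positions, and three of them sit on the same spot. A plain geom_point would draw those three as one dot, which is the situation jitter is meant to address.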

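To further emphasize the relationship between the variables, I added a linear regression line using geom_smooth(method = "lm"). Because the plot puts special requests on the x-axis, this line regresses weekend nights on special requests, the reverse direction of the fit1 model above, but it reflects the same weak positive association. Although the line isn’t very steep, it adds a layer of interpretation beyond the individual points and helps the viewer see the overall direction of the data.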

Initially, I planned to use an alluvial plot to show flows between months, weekend nights, and special request categories. However, I realized that this type of visualization was not suitable here because it requires categorical variables and flow relationships between them. Since the variables I chose are numeric counts rather than categories, the alluvial plot didn’t provide meaningful insight. That’s why I chose a scatterplot instead.

The color palette I used, Spectral, was the one used in the alluvial tutorial and helped in distinguishing between the summer arrival months and giving the chart a vibrant, readable aesthetic. One interesting observation from the scatterplot is the cluster of points in August, where many guests had low numbers of special requests and stayed for only one or two weekend nights. This might suggest that August brings in a large volume of short-stay guests with fewer needs.

Overall, this scatterplot provides a clear visual representation of the correlation between special requests and weekend stays during the summer months.

Plot 2

usaholi <- hotel |>
  filter(country == "USA") |>  # Filter for guests from the USA only
  mutate(holiday = case_when(  # Create a new column 'holiday' based on conditions
    arrival_date_month == "January" & arrival_date_day_of_month == 1 ~ "New Year's Day",  # Label Jan 1 as New Year's Day
    arrival_date_month == "May" & arrival_date_day_of_month >= 25 & arrival_date_day_of_month <= 31 ~ "Memorial Day",  # Range of possible dates for the last Monday of May (Memorial Day)
    arrival_date_month == "May" & arrival_date_day_of_month >= 8 & arrival_date_day_of_month <= 14 ~ "Mother's Day",  # Range of possible dates for the 2nd Sunday of May (Mother's Day)
    arrival_date_month == "June" & arrival_date_day_of_month >= 15 & arrival_date_day_of_month <= 21 ~ "Father's Day",  # Range of possible dates for the 3rd Sunday of June (Father's Day)
    arrival_date_month == "July" & arrival_date_day_of_month == 4 ~ "Fourth of July",  # Label Jul 4 as Fourth of July
    arrival_date_month == "November" & arrival_date_day_of_month == 11 ~ "Veterans Day",  # Label Nov 11 as Veterans Day
    arrival_date_month == "November" & arrival_date_day_of_month >= 22 & arrival_date_day_of_month <= 28 ~ "Thanksgiving",  # Range of possible dates for the 4th Thursday of November (Thanksgiving)
    arrival_date_month == "December" & arrival_date_day_of_month == 25 ~ "Christmas"  # Label Dec 25 as Christmas
  )) |>
  filter(!is.na(holiday)) # Keep only rows that were assigned a holiday
p2 <- usaholi |>
  ggplot(aes(x = adr, # x-axis: average daily rate (ADR)
             y = holiday, # y-axis: holiday
             fill = holiday)) + # Fill each box by holiday
  geom_boxplot() + # Add the boxplots
  theme_minimal() + # Use a clean theme
  theme(plot.title = element_text(face = "bold", size = 13)) + # Make the title bold and larger
  scale_fill_brewer() + # Apply the default ColorBrewer fill palette
  labs(title = "Average Daily Rate by U.S. Holiday", # Title
       x = "Average Daily Rate (ADR)", # x-axis label
       y = "Holiday", # y-axis label
       fill = "Holiday", # Legend title
       caption = "Source: Hotel Booking Demand Dataset") # Data source caption

p2

My visualization is a boxplot that shows the average daily rate (ADR) on the x-axis and the holiday on the y-axis. I chose a boxplot because I wanted to compare how hotel prices vary across different U.S. holidays. To prepare the data, I filtered for guests from the U.S. and created a new holiday column using mutate and case_when to assign a holiday label based on arrival dates. Since my dataset spans multiple years and some holidays move from year to year, I manually selected date ranges that cover every day on which each holiday can fall. For example, Memorial Day is the last Monday of May, which always lands between May 25th and 31st, so I included arrivals in that range. This means a floating holiday's group contains every arrival in that seven-day window rather than only arrivals on the holiday itself, but it allowed me to isolate bookings around major holidays despite shifting calendar dates.
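The date windows are a simple approximation. As a rough sketch of how the exact date of a floating holiday could be computed for each year instead, the helpers below use base R only. `nth_weekday` and `last_weekday` are hypothetical names that are not part of the original analysis, and weekdays follow the `as.POSIXlt()` convention (0 = Sunday through 6 = Saturday).

```r
# Hypothetical helper: date of the n-th given weekday in a month
nth_weekday <- function(year, month, wday, n) {
  first <- as.Date(sprintf("%d-%02d-01", year, month))
  # Days from the 1st of the month to the first occurrence of the weekday
  offset <- (wday - as.POSIXlt(first)$wday) %% 7
  first + offset + 7 * (n - 1)
}

# Hypothetical helper: date of the last given weekday in a month
last_weekday <- function(year, month, wday) {
  first <- as.Date(sprintf("%d-%02d-01", year, month))
  last  <- seq(first, by = "month", length.out = 2)[2] - 1  # last day of the month
  # Step back from the last day to the most recent occurrence of the weekday
  last - (as.POSIXlt(last)$wday - wday) %% 7
}

last_weekday(2016, 5, 1)     # Memorial Day 2016: last Monday of May
nth_weekday(2016, 5, 0, 2)   # Mother's Day 2016: 2nd Sunday of May
nth_weekday(2016, 11, 4, 4)  # Thanksgiving 2016: 4th Thursday of November
```

With exact dates available, a holiday label could be restricted to arrivals within a chosen number of days of the holiday rather than a fixed day-of-month window. Whether that would change the boxplot noticeably is not something this sketch can show.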

Boxplots were a useful choice here because they summarize the distribution of ADR values using the interquartile range (25th–75th percentiles), with a line indicating the median (50th percentile). I used geom_boxplot for the plot and applied scale_fill_brewer to style the fill colors. I intentionally reused the same palette from my very first visualization, the airquality assignment, which made this feel like a full-circle moment in my learning journey and added a personal touch.
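To show what those percentiles look like numerically, here is a small sketch on made-up ADR values (the vector `adr` is hypothetical, not taken from the dataset). It computes the quartiles, the interquartile range, and the usual 1.5 × IQR cutoff beyond which geom_boxplot draws points individually as outliers.

```r
# Hypothetical ADR values for one holiday
adr <- c(50, 80, 100, 120, 150, 400)

# 25th, 50th (median), and 75th percentiles
q <- quantile(adr, probs = c(0.25, 0.5, 0.75))
q

# Interquartile range: the width of the box
iqr <- q[["75%"]] - q[["25%"]]

# Values above this cutoff would be plotted as separate outlier points
upper_fence <- q[["75%"]] + 1.5 * iqr
adr[adr > upper_fence]
```

In this toy example the box spans 85 to 142.5 with a median of 110, and the value 400 lies beyond the cutoff.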

One observation that stood out is that Father’s Day appears to have the highest median ADR, which could suggest higher demand for hotel stays during that weekend. Overall, this boxplot provides a concise way to visually compare holiday-related pricing patterns and offers insight into which holidays might have peak hotel rates.

Plot 3

cancel <- hotel |> # Making a new dataset
  filter(reservation_status == "Canceled", arrival_date_year == 2017) |> # Keeping only canceled reservations with a 2017 arrival
  group_by(arrival_date_month) |> # Grouping by arrival month
  summarise(avg_lead_time = mean(lead_time)) # Getting the average lead time per month
cancel$arrival_date_month <- factor(cancel$arrival_date_month,
                                    levels = month.name) # Order months chronologically instead of alphabetically
cancel <- cancel[order(cancel$arrival_date_month), ] # Sort rows by month (used AI to find out how to reorder months; see Bibliography)
p3 <- ggplot(cancel, aes(x = arrival_date_month, # Made the x-axis the month of arrival
y = avg_lead_time, # Made the y-axis the average lead time
fill = arrival_date_month)) + # Made the fill the month of arrival
  geom_col(color = "black", width = 0.7) + # Added black borders to bars
  scale_fill_manual(
    values = c("#FF6F61", "#6B5B95", "#88B04B", "#F7CAC9", "#92A8D1", "#955251",
               "#B565A7", "#009B77")) + # Eight colors from the Google color picker, one per month present in the 2017 data
  labs(
    title = "Average Lead Time for Canceled Hotel Reservations (2017)", # Added the title
    x = "Month of Planned Arrival", # Added the x-axis label
    y = "Average Lead Time (Days)", # Added the y-axis label
    fill = "Month", # Added a label to the legend
    caption = "Source: Hotel Booking Demand Dataset" # Added the caption
  ) +
  theme_minimal(base_size = 12) + # Added a clean theme with a larger text font
  theme(plot.title = element_text(face = "bold", size = 13)) # Made the title more bold and bigger
p3

ggplotly(p3)

My visualization is a column chart that shows the average lead time on the y-axis and the month of planned arrival on the x-axis for canceled hotel reservations in 2017. I chose a column chart because I wanted to show how far in advance canceled reservations had been booked, depending on the time of year. Lead time measures the days between booking and planned arrival, not the days between booking and cancellation. To prepare the data, I filtered the dataset to include only rows where the reservation status was “Canceled” and the arrival year was 2017. Then, I grouped the data by month and calculated the average lead time to understand seasonal patterns.
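The group-and-average step, together with the chronological month ordering, can be illustrated on a tiny made-up data frame. This sketch uses base R's `aggregate()` in place of the dplyr pipeline so it runs without extra packages; `toy` and its values are hypothetical.

```r
# Hypothetical toy data: five canceled bookings
toy <- data.frame(
  arrival_date_month = c("March", "January", "March", "January", "February"),
  lead_time          = c(30, 10, 50, 20, 40)
)

# Turn the month into a factor whose levels follow the calendar (month.name is built in)
toy$arrival_date_month <- factor(toy$arrival_date_month, levels = month.name)

# Average lead time per month; rows come back in factor-level (calendar) order
avg <- aggregate(lead_time ~ arrival_date_month, data = toy, FUN = mean)
avg
```

Without the `factor()` step the months would be sorted alphabetically (February, January, March), which is why the ordering matters before plotting.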

I used geom_col to create the bars and used a custom color palette with scale_fill_manual to make the chart more visually engaging. I also used theme_minimal for a clean, modern look and labeled all elements, including the axes, legend, and data source. I also used ggplotly to make this chart interactive.

One insight that stood out is that canceled reservations for summer arrivals, such as June and July, had the longest average lead times. This may suggest that people plan far in advance for peak travel season but may also cancel as plans change. The chart ends at August because the dataset only covers arrivals through August 31, 2017, so there are no bookings of any kind, canceled or not, for September through December 2017.

Overall, this chart provides a clear view of how booking behavior varies throughout the year, which could be valuable for hotel managers planning around cancellation trends.

Bibliography

Antonio, Nuno, et al. “Hotel Booking Demand Datasets.” Data in Brief, vol. 22, Feb. 2019, pp. 41–49, www.sciencedirect.com/science/article/pii/S2352340918315191, https://doi.org/10.1016/j.dib.2018.11.126.

OpenAI. ChatGPT, 10 May version, 2025, chat.openai.com/.

“The Beautiful Lisbon, Our 3-Day Guide.” Accessed 7 May 2025.

“Why Travel to Portugal?” TourHero, www.tourhero.com/en/magazine/why-travel-to-portugal/.