
The primary goal of this analysis is to develop an Airbnb Budget Trip Planner, which suggests a personalized itinerary optimized for both cost and accommodation quality. Specifically, the objectives are:
Distribute the stay between a main city (e.g., Palermo) and nearby towns.
Maximize the number of nights within the user’s budget.
Ensure a minimum quality level by selecting highly-rated accommodations.
Identify cheaper alternatives through clustering of cities.
Creation of a personalized itinerary with the optimal distribution of nights.
Prediction of the expected price per night using regression models.
Identification of clusters of cities with similar prices and quality.
Clear visualizations using maps and graphs to facilitate decision-making.
Optimization of accommodation selection for budget-conscious travelers.
Support for booking platforms in dynamic price management.
Enhancement of user experience through AI-based recommendations
The exploratory analysis focuses on several key variables:
Host and Booking-Related Variables
host_is_superhost: Indicates whether the host is a
superhost (quality and reliability).
host_identity_verified: Indicates whether the host’s
identity has been verified
room_type: Type of accommodation (e.g., entire home,
private room).
property_type: Specific type of property (e.g.,
apartment, villa, boutique hotel)
instant_bookable: Indicates whether the listing can
be booked instantly.
Price and Stay Duration
price: Price per night of the listing.
minimum_nights: Minimum number of nights required by
the host.
maximum_nights: Maximum number of nights available
for booking.
Guest Ratings
review_scores_rating: Overall rating of the
accommodation.
review_scores_accuracy: Accuracy of the listing’s
information.
review_scores_cleanliness: Cleanliness of the
accommodation.
review_scores_checkin: Ease of check-in.
review_scores_communication: Host’s communication
quality.
review_scores_location: Location rating by previous
guests.
review_scores_value: Value-for-money
rating.
Reviews and Popularity
reviews_per_month: Number of reviews received per
month.
municipality: City/locality of the listing.
review_scores_rating variable helps identify
high-quality accommodations.price and municipality allow analysis of
price variations across cities.room_type differentiates between entire homes and
private rooms.library(readxl)
library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(viridisLite)
library(car)
library(kableExtra)
library(cluster)
library(dendextend)
library(leaflet)
library(combinat)
library(stringr)
## Set the preferred working directory
setwd("C:/Users/Alice/OneDrive - LUISS Libera Università Internazionale degli Studi Sociali Guido Carli/ALICE/LUISS/magistrale/appunti/semestre 2/market data analysis/progetto")
## Read the listings dataset
df_listings <- read_excel("listings .xlsx")
df_reviews <- read_excel("reviews.xlsx")
df_calendar <- read_excel("calendar.xlsx")
## Specify columns to remove for cleaning purposes
columns_to_remove<- c("listing_id", "listing_url", "last_scraped", "name", "description", "neighborhood_overview", "host_id", "host_name", "host_location",
"host_since", "host_response_time", "host_response_rate", "host_acceptance_rate",
"host_neighbourhood", "host_listings_count", "host_has_profile_pic",
"host_verifications", "neighbourhood_cleansed", "number_of_reviews",
"accommodates", "bedrooms", "beds", "amenities",
"availability_30", "availability_90", "availability_365", "number_of_reviews_ltm", "number_of_reviews_l30d",
"first_review", "last_review"
)
## Remove the selected columns
df_listings_cleansed <- df_listings[, !(colnames(df_listings) %in%
columns_to_remove)]
## remove NA values
df_listings_cleansed <- df_listings_cleansed %>%
drop_na()
sum(is.na(df_listings_cleansed))
## [1] 0
##numeric and categorical variables
col_types = sapply(df_listings_cleansed, class)
numeric_cols = names(which(sapply(col_types, function(x) any(x %in% c("numeric", "integer")))))
categorical_cols = names(which(sapply(col_types, function(x) any(x %in% c("factor", "character")))))
# Define the numeric variables
numeric_vars <- c("price")
# Rimuoviamo gli outliers
# abbiamo controllato su airbnb minimo e massimo
df_listings_cleansed <- df_listings_cleansed %>%
filter(price >= 9, price <= 1200)
## Summary statistics for numerical variables
numeric_vars <- df_listings_cleansed %>%
select(where(is.numeric))
numeric_table <- summary(numeric_vars)
kable(numeric_table, caption = "Summary statistics of all numerical variables") %>%
kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover", "condensed", "bordered"))
| latitude | longitude | price | minimum_nights | maximum_nights | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | reviews_per_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :37.60 | Min. :12.98 | Min. : 15.0 | Min. : 1.000 | Min. : 1.0 | Min. :1.000 | Min. :0.000 | Min. :0.00 | Min. :0.000 | Min. :1.00 | Min. :1.000 | Min. :1.000 | Min. : 0.0100 | |
| 1st Qu.:38.06 | 1st Qu.:13.33 | 1st Qu.: 60.0 | 1st Qu.: 1.000 | 1st Qu.: 29.0 | 1st Qu.:4.670 | 1st Qu.:4.700 | 1st Qu.:4.71 | 1st Qu.:4.810 | 1st Qu.:4.83 | 1st Qu.:4.580 | 1st Qu.:4.600 | 1st Qu.: 0.1600 | |
| Median :38.12 | Median :13.36 | Median : 80.0 | Median : 2.000 | Median : 90.0 | Median :4.850 | Median :4.870 | Median :4.89 | Median :4.930 | Median :4.95 | Median :4.800 | Median :4.800 | Median : 0.5000 | |
| Mean :38.10 | Mean :13.44 | Mean : 105.1 | Mean : 2.601 | Mean : 404.4 | Mean :4.741 | Mean :4.764 | Mean :4.78 | Mean :4.835 | Mean :4.84 | Mean :4.697 | Mean :4.689 | Mean : 0.9516 | |
| 3rd Qu.:38.13 | 3rd Qu.:13.39 | 3rd Qu.: 113.0 | 3rd Qu.: 3.000 | 3rd Qu.:1125.0 | 3rd Qu.:5.000 | 3rd Qu.:5.000 | 3rd Qu.:5.00 | 3rd Qu.:5.000 | 3rd Qu.:5.00 | 3rd Qu.:4.960 | 3rd Qu.:4.930 | 3rd Qu.: 1.2900 | |
| Max. :38.72 | Max. :14.25 | Max. :1150.0 | Max. :365.000 | Max. :1125.0 | Max. :5.000 | Max. :5.000 | Max. :5.00 | Max. :5.000 | Max. :5.00 | Max. :5.000 | Max. :5.000 | Max. :14.3300 |
## Summary delle variabili categoriche
# Seleziona solo le variabili categoriche
categorical_vars <- df_listings_cleansed %>%
select(where(~ is.character(.) || is.factor(.)))
# Calcola le frequenze per ogni variabile
categorical_summary <- map(categorical_vars, ~ as.data.frame(table(.))) %>%
imap(~ mutate(.x, Variable = .y)) %>%
bind_rows() %>%
rename(Value = ".", Count = "Freq") %>%
relocate(Variable, .before = Value)
# Visualizza la tabella con kable
kable(categorical_summary, caption = "Frequenze delle variabili categoriche") %>%
kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover", "condensed", "bordered"))
| Variable | Value | Count |
|---|---|---|
| host_is_superhost | f | 4903 |
| host_is_superhost | t | 2646 |
| host_identity_verified | f | 644 |
| host_identity_verified | t | 6905 |
| property_type | Barn | 1 |
| property_type | Boat | 5 |
| property_type | Camper/RV | 1 |
| property_type | Campsite | 1 |
| property_type | Casa particular | 6 |
| property_type | Dome | 4 |
| property_type | Earthen home | 1 |
| property_type | Entire bed and breakfast | 1 |
| property_type | Entire bungalow | 11 |
| property_type | Entire cabin | 11 |
| property_type | Entire chalet | 2 |
| property_type | Entire condo | 1014 |
| property_type | Entire cottage | 61 |
| property_type | Entire guest suite | 30 |
| property_type | Entire guesthouse | 8 |
| property_type | Entire home | 990 |
| property_type | Entire loft | 125 |
| property_type | Entire place | 10 |
| property_type | Entire rental unit | 2726 |
| property_type | Entire serviced apartment | 28 |
| property_type | Entire townhouse | 64 |
| property_type | Entire vacation home | 429 |
| property_type | Entire villa | 623 |
| property_type | Farm stay | 20 |
| property_type | Island | 1 |
| property_type | Private room | 7 |
| property_type | Private room in bed and breakfast | 438 |
| property_type | Private room in boat | 1 |
| property_type | Private room in casa particular | 7 |
| property_type | Private room in castle | 3 |
| property_type | Private room in condo | 178 |
| property_type | Private room in cottage | 2 |
| property_type | Private room in farm stay | 27 |
| property_type | Private room in guest suite | 23 |
| property_type | Private room in guesthouse | 6 |
| property_type | Private room in home | 110 |
| property_type | Private room in hut | 1 |
| property_type | Private room in loft | 7 |
| property_type | Private room in nature lodge | 12 |
| property_type | Private room in rental unit | 340 |
| property_type | Private room in resort | 1 |
| property_type | Private room in serviced apartment | 12 |
| property_type | Private room in tiny home | 1 |
| property_type | Private room in townhouse | 7 |
| property_type | Private room in vacation home | 31 |
| property_type | Private room in villa | 28 |
| property_type | Room in aparthotel | 8 |
| property_type | Room in bed and breakfast | 33 |
| property_type | Room in boutique hotel | 4 |
| property_type | Room in hostel | 1 |
| property_type | Room in hotel | 12 |
| property_type | Room in serviced apartment | 11 |
| property_type | Shared room in condo | 1 |
| property_type | Shared room in home | 1 |
| property_type | Shared room in hostel | 1 |
| property_type | Shared room in rental unit | 4 |
| property_type | Tent | 3 |
| property_type | Tiny home | 54 |
| property_type | Tower | 1 |
| room_type | Entire home/apt | 6231 |
| room_type | Hotel room | 45 |
| room_type | Private room | 1266 |
| room_type | Shared room | 7 |
| instant_bookable | f | 3568 |
| instant_bookable | t | 3981 |
| municipality | Alia | 6 |
| municipality | Alimena | 1 |
| municipality | Altavilla Milicia | 80 |
| municipality | Altofonte | 5 |
| municipality | Bagheria | 112 |
| municipality | Balestrate | 140 |
| municipality | Belmonte Mezzagno | 1 |
| municipality | Bisacquino | 1 |
| municipality | Blufi | 1 |
| municipality | Bolognetta | 5 |
| municipality | Bompietro | 3 |
| municipality | Borgetto | 16 |
| municipality | Caccamo | 25 |
| municipality | Caltavuturo | 3 |
| municipality | Campofelice di Roccella | 189 |
| municipality | Campofiorito | 4 |
| municipality | Camporeale | 1 |
| municipality | Capaci | 54 |
| municipality | Carini | 144 |
| municipality | Castelbuono | 64 |
| municipality | Casteldaccia | 43 |
| municipality | Castellana Sicula | 9 |
| municipality | Castronovo di Sicilia | 5 |
| municipality | Cefalù | 636 |
| municipality | Cerda | 1 |
| municipality | Chiusa Sclafani | 2 |
| municipality | Ciminna | 4 |
| municipality | Cinisi | 114 |
| municipality | Collesano | 47 |
| municipality | Contessa Entellina | 4 |
| municipality | Corleone | 15 |
| municipality | Ficarazzi | 14 |
| municipality | Gangi | 23 |
| municipality | Geraci Siculo | 6 |
| municipality | Giuliana | 1 |
| municipality | Gratteri | 25 |
| municipality | Isnello | 12 |
| municipality | Isola delle Femmine | 100 |
| municipality | Lascari | 82 |
| municipality | Lercara Friddi | 5 |
| municipality | Marineo | 3 |
| municipality | Mezzojuso | 1 |
| municipality | Misilmeri | 7 |
| municipality | Monreale | 79 |
| municipality | Montelepre | 10 |
| municipality | Montemaggiore Belsito | 1 |
| municipality | Palazzo Adriano | 2 |
| municipality | Palermo | 4426 |
| municipality | Partinico | 114 |
| municipality | Petralia Soprana | 18 |
| municipality | Petralia Sottana | 22 |
| municipality | Piana degli Albanesi | 7 |
| municipality | Polizzi Generosa | 19 |
| municipality | Pollina | 53 |
| municipality | Prizzi | 8 |
| municipality | Roccamena | 3 |
| municipality | Roccapalumba | 4 |
| municipality | San Giuseppe Jato | 2 |
| municipality | San Mauro Castelverde | 8 |
| municipality | Santa Cristina Gela | 4 |
| municipality | Santa Flavia | 153 |
| municipality | Sciara | 1 |
| municipality | Scillato | 10 |
| municipality | Sclafani Bagni | 5 |
| municipality | Termini Imerese | 60 |
| municipality | Terrasini | 229 |
| municipality | Torretta | 19 |
| municipality | Trabia | 78 |
| municipality | Trappeto | 104 |
| municipality | Ustica | 84 |
| municipality | Valledolmo | 2 |
| municipality | Ventimiglia di Sicilia | 2 |
| municipality | Villabate | 7 |
| municipality | Villafrati | 1 |
#data visualization for numerical variables
# Prepara il dataset per gli istogrammi
df_long_num <- df_listings_cleansed %>%
select(all_of(numeric_cols)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
# Rinomina le variabili con etichette più compatte
df_long_num$Variable <- as.character(df_long_num$Variable)
df_long_num$Variable <- dplyr::recode(df_long_num$Variable,
"review_scores_accuracy" = "Accuracy",
"review_scores_cleanliness" = "Cleanliness",
"review_scores_checkin" = "Check-in",
"review_scores_communication" = "Communication",
"review_scores_location" = "Location",
"review_scores_rating" = "Rating",
"review_scores_value" = "Value",
"reviews_per_month" = "Reviews/month",
"minimum_nights" = "Min Nights",
"maximum_nights" = "Max Nights",
"price" = "Price",
"latitude" = "Latitude",
"longitude" = "Longitude"
)
# Istogrammi con etichette leggibili e titoli non tagliati
ggplot(df_long_num, aes(x = Value, fill = Variable)) +
geom_histogram(alpha = 0.6, position = "identity", bins = 20) +
facet_wrap(~Variable, scales = "free", ncol = 3) + # distribuisce meglio i grafici
labs(title = "Histograms of Multiple Variables", x = "Value", y = "Frequency") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5),
strip.text = element_text(size = 9, face = "bold"), # più piccolo = meno tagliato
plot.title = element_text(size = 14, hjust = 0.5),
plot.margin = margin(20, 10, 10, 10)
)
The distribution is highly skewed, with several peaks at specific values (e.g., 30, 365, 1000).
This suggests that many hosts impose fixed limits on long-term stays.
Most listings require a minimum of 1 or a few nights, with a small number demanding longer stays.
The data is concentrated at the lower end, meaning that short-term stays are the norm.
The majority of listings have low prices, but there are occasional higher-priced accommodations.
The distribution is right-skewed, indicating a few luxury listings that charge significantly more.
These variables display left-skewed distributions, meaning most properties receive high ratings (close to 5 stars).
The limited variation suggests that guests generally rate their stays very positively.
This may indicate that review scores are not a strong differentiating factor between listings.
Most listings receive relatively few reviews per month, while some accumulate a significantly higher number.
This right-skewed distribution suggests that a small percentage of properties are far more popular than the majority.
#data visualization categorical variables
# Define the maximum number of categories to keep
top_n_municipalities <- 6
top_n_property_types <- 6
# Count occurrences and select the most frequent names
top_municipalities <- df_listings_cleansed %>%
count(municipality, sort = TRUE) %>%
slice_head(n = top_n_municipalities) %>%
pull(municipality)
top_property_types <- df_listings_cleansed %>%
count(property_type, sort = TRUE) %>%
slice_head(n = top_n_property_types) %>%
pull(property_type)
# Crea versione semplificata con "Other" per le categorie meno frequenti
df_listings_cleansed_representation <- df_listings_cleansed %>%
mutate(
municipality = ifelse(municipality %in% top_municipalities, as.character(municipality), "Other"),
property_type = ifelse(property_type %in% top_property_types, as.character(property_type), "Other")
)
# Prepara i dati per il grafico
df_long_cat <- df_listings_cleansed_representation %>%
select(c("host_is_superhost", "room_type", "municipality", "instant_bookable", "property_type", "host_identity_verified")) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category")
# Ordina le categorie come fattori (opzionale)
df_long_cat$Category <- factor(df_long_cat$Category, levels = unique(df_long_cat$Category))
# Genera i barplot
ggplot(df_long_cat, aes(x = Category, fill = Variable)) +
geom_bar(position = "dodge", alpha = 0.8) +
facet_wrap(~ Variable, scales = "free_x") +
labs(title = "Bar Plots of Categorical Variables", x = "Category", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The majority of hosts have a verified identity, indicating a high level of trust and security in the listings. However, a small portion of hosts remain unverified, which might impact guest confidence.
Less than half of the listings belong to Superhosts, meaning that while a significant number of hosts maintain high service quality, the majority do not qualify for this status.
The distribution is quite balanced, meaning that roughly half of the listings allow guests to book instantly, while the other half require manual approval
The distribution shows a highly imbalanced pattern, where a single municipality (likely Palermo) has the highest number of listings, while other municipalities have significantly fewer. This suggests that most Airbnb accommodations are concentrated in a specific area, which may impact clustering and pricing analysis.
There is a wide variety of property types, but a few dominate the market. The most common property type seems to be “Entire home/apartment,” while other types are much less frequent.
The vast majority of listings are entire homes/apartments, followed by private rooms, and very few shared rooms or hotel rooms. This confirms that Airbnb is primarily used for renting out full apartments or private accommodations rather than shared spaces.
### Check relationship between each potential regressor and response (price)
plot_data <- df_listings_cleansed %>%
select(c("maximum_nights", "minimum_nights", "review_scores_rating", "review_scores_accuracy", "review_scores_cleanliness", "review_scores_checkin", "review_scores_communication", "review_scores_location", "review_scores_value", "reviews_per_month", 'price')) %>%
gather(key = "predictor", value = "predictor_value", -price)
plot_data <- df_listings_cleansed %>%
select(c("maximum_nights", "minimum_nights", "review_scores_rating",
"review_scores_accuracy", "review_scores_cleanliness",
"review_scores_checkin", "review_scores_communication",
"review_scores_location", "review_scores_value",
"reviews_per_month", "price")) %>%
tidyr::gather(key = "predictor", value = "predictor_value", -price)
# Converte i nomi delle variabili in etichette più leggibili
plot_data$predictor <- as.character(plot_data$predictor)
plot_data$predictor <- dplyr::recode(plot_data$predictor,
"review_scores_accuracy" = "Accuracy",
"review_scores_cleanliness" = "Cleanliness",
"review_scores_checkin" = "Check-in",
"review_scores_communication" = "Communication",
"review_scores_location" = "Location",
"review_scores_rating" = "Rating",
"review_scores_value" = "Value for Money",
"reviews_per_month" = "Reviews per Month",
"minimum_nights" = "Min Nights",
"maximum_nights" = "Max Nights"
)
# Crea il grafico con etichette leggibili
ggplot(plot_data, aes(x = predictor_value, y = price)) +
geom_point(alpha = 0.5, color = "blue") +
facet_wrap(~predictor, scales = "free_x") +
theme_bw() +
theme(
axis.text = element_text(size = 10),
axis.title = element_text(size = 12),
strip.text = element_text(size = 11, face = "bold"),
plot.title = element_text(size = 14, hjust = 0.5),
plot.margin = margin(20, 10, 10, 10)
) +
labs(
x = "Predictor Value",
y = "Price",
title = "Scatter Plots: Predictors vs Price"
)
##### **Maximum & Minimum Nights:**
#Some listings allow **very long stays**, and these often have highly variable pricing.
#Short-stay listings generally have **lower prices**, but there are exceptions.
#The presence of extreme values suggests that some listings cater to long-term rentals or have unique pricing structures.
##### **Review Scores (Accuracy, Cleanliness, Check-in, Communication, Location, Rating, Value):**
#Most review scores are concentrated between **4 and 5 stars**, leading to a lack of variability.
#No strong correlation between higher review scores and price is apparent.
#This suggests that while guests generally rate accommodations highly, **hosts do not necessarily increase prices based on reviews alone.**
##### **Reviews Per Month:**
#Most listings receive **fewer reviews per month**, but a few have a significantly high number.
#High-priced listings tend to receive fewer reviews, indicating that **premium accommodations might have lower booking frequency.**
par(mfrow = c(3, 3), mar = c(4, 4, 4, 2)) # margine più alto sotto
boxplot(df_listings_cleansed$price ~ df_listings_cleansed$host_identity_verified,
ylab = "Price", xlab = "", main = "Price vs Host Identity Verified",
las = 1, cex.axis = 0.9)
boxplot(df_listings_cleansed$price ~ df_listings_cleansed$host_is_superhost,
ylab = "Price", xlab = "", main = "Price vs Superhost",
las = 1, cex.axis = 0.9)
boxplot(df_listings_cleansed$price ~ df_listings_cleansed$instant_bookable,
ylab = "Price", xlab = "", main = "Price vs Instant Booking",
las = 1, cex.axis = 0.9)
boxplot(df_listings_cleansed$price ~ df_listings_cleansed$municipality,
ylab = "Price", xlab = "", main = "Price vs Municipality",
las = 2, cex.axis = 0.6)
boxplot(df_listings_cleansed$price ~ df_listings_cleansed$property_type,
ylab = "Price", xlab = "", main = "Price vs Property Type",
las = 2, cex.axis = 0.6)
boxplot(df_listings_cleansed$price ~ df_listings_cleansed$room_type,
ylab = "Price", xlab = "", main = "Price vs Room Type",
las = 2, cex.axis = 0.8)
No significant difference in price between verified and non-verified hosts.
Both categories contain several high-price outliers, indicating that verification does not strongly influence pricing.
Superhosts and non-Superhosts have similar price distributions, with some expensive listings in both groups.
The median price appears to be slightly higher for Superhosts, but the effect is not substantial.
The price distribution is almost identical between instant bookable and non-instant bookable listings.
This suggests that instant booking does not significantly impact pricing strategies.
There is substantial variation in price across different municipalities.
Some municipalities have wider interquartile ranges, while others have mostly lower-priced listings.
The price distribution varies significantly by property type.
Some property types have consistently lower prices, while others have wider distributions with multiple outliers.
This indicates that property type is an important factor in pricing.
Entire home/apartment listings tend to have higher prices compared to shared rooms.
Private rooms (not shown separately here) would likely fall between these two categories.
#CLUSTERING
df_listings_cleansed <- df_listings_cleansed %>%
mutate(
mean_review_score_detailed = rowMeans(select(., review_scores_rating:review_scores_value), na.rm = TRUE)
)
#Municipality aggregation
cluster_data <- df_listings_cleansed %>%
group_by(municipality) %>%
summarise(
mean_price = mean(price, na.rm = TRUE),
mean_detailed_score = mean(mean_review_score_detailed, na.rm = TRUE)
) %>%
na.omit()
#Data standardization
numeric_cols <- cluster_data %>%
select(where(is.numeric)) # seleziona solo le variabili numeriche
scaled_data <- scale(numeric_cols)
row.names(scaled_data) <- cluster_data$municipality
#check means and sd
scale_data_mean_sd= as.data.frame(scaled_data) %>%
summarise(across(everything(), list(mean = mean, sd = sd)))
#Hierarchical Clustering GERARCHICO ( WARD method)
dist_matrix <- dist(scaled_data)
hc_ward <- hclust(dist_matrix, method = "ward.D")
#Dendrogram
library(dendextend)
dend <- as.dendrogram(hc_ward) %>%
set("labels", rownames(scaled_data)) %>%
set("branches_k_color")
labels(dend) <- ifelse(seq_along(labels(dend)) %% 3 == 0, labels(dend), "")
par(mar = c(14, 4, 4, 2)) # bottom, left, top, right
plot(dend,
main = "Dendrogram - Ward Method",
ylab = "Distance",
las = 2,
cex = 0.6)
#Choose the optimal number of clusters
#elbow method
library(factoextra)
## Warning: il pacchetto 'factoextra' è stato creato con R versione 4.4.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Metodo del gomito (WSS) applicato ai tuoi dati
fviz_nbclust(as.data.frame(scaled_data), FUN = hcut, method = "wss", hc_method = "ward")
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
#the optimal number of clusters according to thise method is 3
#silhouette
fviz_nbclust(as.data.frame(scaled_data), FUN = hcut, method = "silhouette", hc_method = "ward")
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
#the optimal number of clusters according to thise method is 9
#our decision: k=3 optimal number of clusters
#cut dendogram
dend <- as.dendrogram(hc_ward) %>%
set("labels", rownames(scaled_data)) %>%
set("branches_k_color", k = 3)
labels(dend) <- ifelse(seq_along(labels(dend)) %% 3 == 0, labels(dend), "")
par(mar = c(14, 4, 4, 2))
plot(dend,
main = "Dendrogram - Ward Method",
ylab = "Distance",
las = 2,
cex = 0.6)
abline(h = 23, col = 'red', lwd = 2)
#Cluster
cluster_assignment <- cutree(hc_ward, k = 3)
cluster_data$cluster <- as.factor(cluster_assignment)
ggplot(cluster_data, aes(x = mean_price, y = mean_detailed_score, color = cluster)) +
geom_point(size = 3, alpha = 0.9) +
labs(
title = "Clustering Ward method (k = 3)",
x = "Average price ",
y = "Detailed score"
) +
theme_minimal() +
theme(legend.position = "right") +
scale_color_manual(values = c("red", "green", "blue"))
#Clusters' statistics
summary_clusters <- cluster_data %>%
group_by(cluster) %>%
summarise(
n_comuni = n(),
average_price = round(mean(mean_price), 2),
average_detailed_score = round(mean(mean_detailed_score), 2),
municipality = paste(municipality, collapse = ", ")
)
print(summary_clusters)
## # A tibble: 3 × 5
## cluster n_comuni average_price average_detailed_score municipality
## <fct> <int> <dbl> <dbl> <chr>
## 1 1 20 152. 4.8 Alia, Balestrate, Bisac…
## 2 2 13 54.9 4.94 Alimena, Belmonte Mezza…
## 3 3 41 81.4 4.74 Altavilla Milicia, Alto…
summary_clusters$municipality
## [1] "Alia, Balestrate, Bisacquino, Borgetto, Caccamo, Carini, Cefalù, Cerda, Chiusa Sclafani, Ciminna, Cinisi, Collesano, Misilmeri, Partinico, Polizzi Generosa, Santa Flavia, Terrasini, Trabia, Ustica, Ventimiglia di Sicilia"
## [2] "Alimena, Belmonte Mezzagno, Blufi, Bompietro, Castronovo di Sicilia, Giuliana, Marineo, Mezzojuso, Montelepre, Montemaggiore Belsito, San Giuseppe Jato, Sciara, Villafrati"
## [3] "Altavilla Milicia, Altofonte, Bagheria, Bolognetta, Caltavuturo, Campofelice di Roccella, Campofiorito, Camporeale, Capaci, Castelbuono, Casteldaccia, Castellana Sicula, Contessa Entellina, Corleone, Ficarazzi, Gangi, Geraci Siculo, Gratteri, Isnello, Isola delle Femmine, Lascari, Lercara Friddi, Monreale, Palazzo Adriano, Palermo, Petralia Soprana, Petralia Sottana, Piana degli Albanesi, Pollina, Prizzi, Roccamena, Roccapalumba, San Mauro Castelverde, Santa Cristina Gela, Scillato, Sclafani Bagni, Termini Imerese, Torretta, Trappeto, Valledolmo, Villabate"
#cluster1---> High-end destinations with premium prices and high customer ratings. Likely tourist hotspots.
#cluster2--> Low-cost locations with excellent ratings. Possibly small, lesser-known towns with highly appreciated stays.
#cluster3---> Mid-range price and quality. A diverse group with a mix of performance levels.
#rename clusters to have a clearer representation
cluster_data$cluster <- factor(cluster_data$cluster,
levels = c("1", "2", "3"),
labels = c("Premium", "Affordable & Top Rated", "Mid-Range"))
summary_clusters$cluster <- factor(summary_clusters$cluster,
levels = c("1", "2", "3"),
labels = c("Premium", "Affordable & Top Rated", "Mid-Range"))
#CHOROPLETH MAP
#represent the previously defined clusters on a map with different colors depending on clusters' price and average ratings
# Libreries
library(sf)
## Warning: il pacchetto 'sf' è stato creato con R versione 4.4.3
## Linking to GEOS 3.13.0, GDAL 3.10.1, PROJ 9.5.1; sf_use_s2() is TRUE
library(dplyr)
library(ggplot2)
# Carica e prepara lo shapefile
sf_gb <- read_sf("Com01012025_g", layer = "Com01012025_g_WGS84") %>%
st_make_valid() %>%
st_transform(crs = 4326) %>%
mutate(COMUNE = toupper(COMUNE))
# Prepara cluster_data
cluster_data <- cluster_data %>%
mutate(municipality = toupper(municipality))
# Join e filtro per Palermo
sf_palermo <- sf_gb %>%
left_join(cluster_data %>% select(municipality, cluster),
by = c("COMUNE" = "municipality")) %>%
filter(COD_PROV == "82")
# Separa comuni clusterizzati
sf_clustered <- sf_palermo %>% filter(!is.na(cluster))
# Mappa
ggplot() +
geom_sf(data = sf_palermo, fill = "grey90", color = "white", size = 0.2) +
geom_sf(data = sf_clustered, aes(fill = as.factor(cluster)), color = "white", size = 0.3) +
scale_fill_manual(
values = c("darkblue", "dodgerblue", "lightblue"),
name = "Cluster",
labels = c("Premium", "Affordable & High-Quality", "Mid-Range")
) +
labs(
title = "Clustering of Palermo Province Municipalities on Quality and Price",
subtitle = "Source: ISTAT",
caption = "Market data analysis @ LUISS"
) +
theme_minimal() +
theme_void() +
theme(legend.position = "right")
#ITINERARY
#It’s a function that returns an itinerary based on the user’s total budget, number of nights, and chosen theme among:
#Sea
#Mountain / Mountain & Nature
#Relaxation
#Culture / History & Culture
#Entertainment
#Food & Wine
#Spirituality / Spirituality & Tradition
library(dplyr)
library(geosphere)
## Warning: il pacchetto 'geosphere' è stato creato con R versione 4.4.3
preferences <- read_excel("city_preferences.xlsx")
#group by for a better visualization
preferences <- preferences %>%
distinct(municipality, preference) %>%
group_by(municipality) %>%
summarise(
preferences_list = paste(sort(unique(preference)), collapse = ", ")
) %>%
arrange(municipality)
# calculate centroids with lat and long
sf_centroids_coords <- sf_gb %>%
st_centroid(of_largest_polygon = TRUE) %>%
mutate(
lat = st_coordinates(geometry)[, 2],
lon = st_coordinates(geometry)[, 1],
COMUNE = toupper(COMUNE)
) %>%
st_drop_geometry() %>%
select(COMUNE, lat, lon)
## Warning: st_centroid assumes attributes are constant over geometries
# Join:add lat/lon and preferences to cluster_data
cluster_data <- cluster_data %>%
mutate(municipality = toupper(trimws(municipality))) %>%
left_join(sf_centroids_coords,
by = c("municipality" = "COMUNE")) %>%
left_join(preferences,
by = "municipality")
#function
suggest_itinerary <- function(budget,
total_number_nights,
summary_clusters,
cluster_data,
user_preference) {
cat("\nStarting smart itinerary suggestion …\n")
#column check
needed <- c("municipality", "lat", "lon",
"mean_price", "mean_detailed_score",
"cluster", "preferences_list") # ← nome corretto
if (!all(needed %in% names(cluster_data))) {
stop("cluster_data must contain: ",
paste(needed, collapse = ", "))
}
# Cluster selection based on budget
budget_per_night <- budget / total_number_nights
cat("Total budget:", budget, "€ | per-night budget:",
round(budget_per_night, 2), "€\n")
cat("Requested theme:", user_preference, "\n")
chosen_cluster <- summary_clusters %>%
filter(average_price <= budget_per_night) %>%
arrange(desc(average_price)) %>%
slice_head(n = 1) %>%
pull(cluster)
if (length(chosen_cluster) == 0) {
cat("No cluster fits the budget.\n"); return(NULL)
}
cat("Chosen cluster:", chosen_cluster, "\n")
# Filter cities by cluster AND preferences_list
cities_pool <- cluster_data %>%
filter(cluster == chosen_cluster,
grepl(user_preference,
preferences_list, # ← colonna corretta
ignore.case = TRUE),
!is.na(lat), !is.na(lon)) %>%
distinct(municipality, .keep_all = TRUE)
if (nrow(cities_pool) == 0) {
cat("No cities with theme",
user_preference, "in that cluster.\n"); return(NULL)
}
# How many cities (~1 every 2.5 nights)
n_cities <- max(1,
min( ceiling(total_number_nights / 2.5),
nrow(cities_pool) ) )
cat("Number of cities to visit:", n_cities, "\n")
set.seed(as.integer(Sys.time()))
weights <- cities_pool$mean_detailed_score /
sum(cities_pool$mean_detailed_score)
picked <- cities_pool %>%
slice_sample(n = n_cities, weight_by = weights)
# Allocate nights (≥1 per city, variable)
base_nights <- rep(1, n_cities)
remaining_nights <- total_number_nights - n_cities
if (remaining_nights > 0) {
extra_weights <- (picked$mean_detailed_score^2) /
sum(picked$mean_detailed_score^2)
extra_nights <- rmultinom(1, remaining_nights,
prob = extra_weights) |>
as.numeric()
# ensure variation
if (remaining_nights >= 2 &&
length(unique(extra_nights)) == 1) {
take <- sample(which(extra_nights > 0), 1)
give <- sample(setdiff(seq_along(extra_nights), take), 1)
extra_nights[take] <- extra_nights[take] - 1
extra_nights[give] <- extra_nights[give] + 1
}
} else {
extra_nights <- rep(0, n_cities)
}
picked$nights <- base_nights + extra_nights
picked$total_cost <- round(picked$mean_price * picked$nights, 2)
# Re-balance if cost exceeds budget
attempts <- 0
while (sum(picked$total_cost) > budget && attempts < 15) {
idx <- which.max(picked$mean_price)
if (picked$nights[idx] > 1) {
picked$nights[idx] <- picked$nights[idx] - 1
picked$total_cost <- round(picked$mean_price * picked$nights, 2)
}
attempts <- attempts + 1
}
# Order by nearest neighbour
if (n_cities > 1) {
dist_mat <- distm(picked[, c("lon", "lat")])
order_v <- integer(n_cities); visited <- rep(FALSE, n_cities)
order_v[1] <- 1; visited[1] <- TRUE
for (i in 2:n_cities) {
last <- order_v[i - 1]
cand <- which.min(ifelse(visited, Inf, dist_mat[last, ]))
if (length(cand) == 0 || visited[cand]) break
order_v[i] <- cand; visited[cand] <- TRUE
}
order_v <- order_v[order_v > 0]
picked <- picked[order_v, ]
}
# Output
itinerary <- picked %>%
select(municipality, preferences_list,
lat, lon, nights,
price_per_night = mean_price,
total_cost)
cat("\nSuggested itinerary:\n")
print(itinerary)
}
#recall function
itinerary <- suggest_itinerary(
budget = 800,
total_number_nights = 5,
summary_clusters = summary_clusters,
cluster_data = cluster_data, # deve includere “preferences_list”
user_preference = "Cultura")
##
## Starting smart itinerary suggestion …
## Total budget: 800 € | per-night budget: 160 €
## Requested theme: Cultura
## Chosen cluster: 1
## Number of cities to visit: 2
##
## Suggested itinerary:
## # A tibble: 2 × 7
## municipality preferences_list lat lon nights price_per_night total_cost
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CACCAMO Cultura e Storia, … 37.9 13.7 2 116. 232.
## 2 MISILMERI Cultura e Storia 38.0 13.4 3 132. 397.