Team 17- Palermo

Objectives and Expected Results

Analysis Objectives

The primary goal of this analysis is to develop an Airbnb Budget Trip Planner, which suggests a personalized itinerary optimized for both cost and accommodation quality. Specifically, the objectives are:

  • Distribute the stay between a main city (e.g., Palermo) and nearby towns.

  • Maximize the number of nights within the user’s budget.

  • Ensure a minimum quality level by selecting highly-rated accommodations.

  • Identify cheaper alternatives through clustering of cities.

Expected Results

  • Creation of a personalized itinerary with the optimal distribution of nights.

  • Prediction of the expected price per night using regression models.

  • Identification of clusters of cities with similar prices and quality.

  • Clear visualizations using maps and graphs to facilitate decision-making.

Potential Business Actions Enabled

  • Optimization of accommodation selection for budget-conscious travelers.

  • Support for booking platforms in dynamic price management.

  • Enhancement of user experience through AI-based recommendations

Exploratory Data Analysis (EDA)

Description of Key Variables

The exploratory analysis focuses on several key variables:

Host and Booking-Related Variables

  • host_is_superhost: Indicates whether the host is a superhost (quality and reliability).

  • host_identity_verified: Indicates whether the host’s identity has been verified

  • room_type: Type of accommodation (e.g., entire home, private room).

  • property_type: Specific type of property (e.g., apartment, villa, boutique hotel)

  • instant_bookable: Indicates whether the listing can be booked instantly.

Price and Stay Duration

  • price: Price per night of the listing.

  • minimum_nights: Minimum number of nights required by the host.

  • maximum_nights: Maximum number of nights available for booking.

Guest Ratings

  • review_scores_rating: Overall rating of the accommodation.

  • review_scores_accuracy: Accuracy of the listing’s information.

  • review_scores_cleanliness: Cleanliness of the accommodation.

  • review_scores_checkin: Ease of check-in.

  • review_scores_communication: Host’s communication quality.

  • review_scores_location: Location rating by previous guests.

  • review_scores_value: Value-for-money rating.

Reviews and Popularity

  • reviews_per_month: Number of reviews received per month.

  • municipality: City/locality of the listing.

Data Strengths

  • The review_scores_rating variable helps identify high-quality accommodations.
  • price and municipality allow analysis of price variations across cities.
  • room_type differentiates between entire homes and private rooms.

Data Weaknesses

  • Missing Data: Some listings lack complete review scores.
  • Outliers in Prices: Extreme values with significantly higher prices than the average.
  • Uneven Distribution: Some cities have a disproportionately high number of listings, which may affect clustering results

Analysis

library(readxl)
library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(viridisLite)
library(car)
library(kableExtra)
library(cluster)
library(dendextend)
library(leaflet)
library(combinat)
library(stringr)
## Set the preferred working directory
setwd("C:/Users/Alice/OneDrive - LUISS Libera Università Internazionale degli Studi Sociali Guido Carli/ALICE/LUISS/magistrale/appunti/semestre 2/market data analysis/progetto")
## Read the listings dataset
df_listings <- read_excel("listings .xlsx")
df_reviews <- read_excel("reviews.xlsx")
df_calendar <- read_excel("calendar.xlsx")
## Specify columns to remove for cleaning purposes
columns_to_remove<- c("listing_id", "listing_url", "last_scraped", "name", "description", "neighborhood_overview",  "host_id", "host_name", "host_location",
"host_since", "host_response_time", "host_response_rate", "host_acceptance_rate",  
"host_neighbourhood", "host_listings_count", "host_has_profile_pic",  
"host_verifications", "neighbourhood_cleansed",  "number_of_reviews",
"accommodates", "bedrooms", "beds", "amenities",   
"availability_30", "availability_90", "availability_365", "number_of_reviews_ltm", "number_of_reviews_l30d",  
"first_review", "last_review"
  )
## Remove the selected columns
df_listings_cleansed <- df_listings[, !(colnames(df_listings) %in%
                                         columns_to_remove)]
## remove NA values
df_listings_cleansed <- df_listings_cleansed %>%
  drop_na()
sum(is.na(df_listings_cleansed))
## [1] 0
##numeric and categorical variables
col_types = sapply(df_listings_cleansed, class)

numeric_cols = names(which(sapply(col_types, function(x) any(x %in% c("numeric", "integer")))))

categorical_cols = names(which(sapply(col_types, function(x) any(x %in% c("factor", "character")))))
# Define the numeric variables 
numeric_vars <- c("price")

# Rimuoviamo gli outliers
# abbiamo controllato su airbnb minimo e massimo

df_listings_cleansed <- df_listings_cleansed %>%
  filter(price >= 9, price <= 1200)
## Summary statistics for numerical variables

numeric_vars <- df_listings_cleansed %>%
  select(where(is.numeric))

numeric_table <- summary(numeric_vars)

kable(numeric_table, caption = "Summary statistics of all numerical variables") %>%
  kable_styling(full_width = FALSE, 
                bootstrap_options = c("striped", "hover", "condensed", "bordered"))
Summary statistics of all numerical variables
latitude longitude price minimum_nights maximum_nights review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value reviews_per_month
Min. :37.60 Min. :12.98 Min. : 15.0 Min. : 1.000 Min. : 1.0 Min. :1.000 Min. :0.000 Min. :0.00 Min. :0.000 Min. :1.00 Min. :1.000 Min. :1.000 Min. : 0.0100
1st Qu.:38.06 1st Qu.:13.33 1st Qu.: 60.0 1st Qu.: 1.000 1st Qu.: 29.0 1st Qu.:4.670 1st Qu.:4.700 1st Qu.:4.71 1st Qu.:4.810 1st Qu.:4.83 1st Qu.:4.580 1st Qu.:4.600 1st Qu.: 0.1600
Median :38.12 Median :13.36 Median : 80.0 Median : 2.000 Median : 90.0 Median :4.850 Median :4.870 Median :4.89 Median :4.930 Median :4.95 Median :4.800 Median :4.800 Median : 0.5000
Mean :38.10 Mean :13.44 Mean : 105.1 Mean : 2.601 Mean : 404.4 Mean :4.741 Mean :4.764 Mean :4.78 Mean :4.835 Mean :4.84 Mean :4.697 Mean :4.689 Mean : 0.9516
3rd Qu.:38.13 3rd Qu.:13.39 3rd Qu.: 113.0 3rd Qu.: 3.000 3rd Qu.:1125.0 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.00 3rd Qu.:5.000 3rd Qu.:5.00 3rd Qu.:4.960 3rd Qu.:4.930 3rd Qu.: 1.2900
Max. :38.72 Max. :14.25 Max. :1150.0 Max. :365.000 Max. :1125.0 Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000 Max. :5.00 Max. :5.000 Max. :5.000 Max. :14.3300
## Summary delle variabili categoriche

# Seleziona solo le variabili categoriche
categorical_vars <- df_listings_cleansed %>%
  select(where(~ is.character(.) || is.factor(.)))

# Calcola le frequenze per ogni variabile
categorical_summary <- map(categorical_vars, ~ as.data.frame(table(.))) %>%
  imap(~ mutate(.x, Variable = .y)) %>%
  bind_rows() %>%
  rename(Value = ".", Count = "Freq") %>%
  relocate(Variable, .before = Value)

# Visualizza la tabella con kable
kable(categorical_summary, caption = "Frequenze delle variabili categoriche") %>%
  kable_styling(full_width = FALSE, 
                bootstrap_options = c("striped", "hover", "condensed", "bordered"))
Frequenze delle variabili categoriche
Variable Value Count
host_is_superhost f 4903
host_is_superhost t 2646
host_identity_verified f 644
host_identity_verified t 6905
property_type Barn 1
property_type Boat 5
property_type Camper/RV 1
property_type Campsite 1
property_type Casa particular 6
property_type Dome 4
property_type Earthen home 1
property_type Entire bed and breakfast 1
property_type Entire bungalow 11
property_type Entire cabin 11
property_type Entire chalet 2
property_type Entire condo 1014
property_type Entire cottage 61
property_type Entire guest suite 30
property_type Entire guesthouse 8
property_type Entire home 990
property_type Entire loft 125
property_type Entire place 10
property_type Entire rental unit 2726
property_type Entire serviced apartment 28
property_type Entire townhouse 64
property_type Entire vacation home 429
property_type Entire villa 623
property_type Farm stay 20
property_type Island 1
property_type Private room 7
property_type Private room in bed and breakfast 438
property_type Private room in boat 1
property_type Private room in casa particular 7
property_type Private room in castle 3
property_type Private room in condo 178
property_type Private room in cottage 2
property_type Private room in farm stay 27
property_type Private room in guest suite 23
property_type Private room in guesthouse 6
property_type Private room in home 110
property_type Private room in hut 1
property_type Private room in loft 7
property_type Private room in nature lodge 12
property_type Private room in rental unit 340
property_type Private room in resort 1
property_type Private room in serviced apartment 12
property_type Private room in tiny home 1
property_type Private room in townhouse 7
property_type Private room in vacation home 31
property_type Private room in villa 28
property_type Room in aparthotel 8
property_type Room in bed and breakfast 33
property_type Room in boutique hotel 4
property_type Room in hostel 1
property_type Room in hotel 12
property_type Room in serviced apartment 11
property_type Shared room in condo 1
property_type Shared room in home 1
property_type Shared room in hostel 1
property_type Shared room in rental unit 4
property_type Tent 3
property_type Tiny home 54
property_type Tower 1
room_type Entire home/apt 6231
room_type Hotel room 45
room_type Private room 1266
room_type Shared room 7
instant_bookable f 3568
instant_bookable t 3981
municipality Alia 6
municipality Alimena 1
municipality Altavilla Milicia 80
municipality Altofonte 5
municipality Bagheria 112
municipality Balestrate 140
municipality Belmonte Mezzagno 1
municipality Bisacquino 1
municipality Blufi 1
municipality Bolognetta 5
municipality Bompietro 3
municipality Borgetto 16
municipality Caccamo 25
municipality Caltavuturo 3
municipality Campofelice di Roccella 189
municipality Campofiorito 4
municipality Camporeale 1
municipality Capaci 54
municipality Carini 144
municipality Castelbuono 64
municipality Casteldaccia 43
municipality Castellana Sicula 9
municipality Castronovo di Sicilia 5
municipality Cefalù 636
municipality Cerda 1
municipality Chiusa Sclafani 2
municipality Ciminna 4
municipality Cinisi 114
municipality Collesano 47
municipality Contessa Entellina 4
municipality Corleone 15
municipality Ficarazzi 14
municipality Gangi 23
municipality Geraci Siculo 6
municipality Giuliana 1
municipality Gratteri 25
municipality Isnello 12
municipality Isola delle Femmine 100
municipality Lascari 82
municipality Lercara Friddi 5
municipality Marineo 3
municipality Mezzojuso 1
municipality Misilmeri 7
municipality Monreale 79
municipality Montelepre 10
municipality Montemaggiore Belsito 1
municipality Palazzo Adriano 2
municipality Palermo 4426
municipality Partinico 114
municipality Petralia Soprana 18
municipality Petralia Sottana 22
municipality Piana degli Albanesi 7
municipality Polizzi Generosa 19
municipality Pollina 53
municipality Prizzi 8
municipality Roccamena 3
municipality Roccapalumba 4
municipality San Giuseppe Jato 2
municipality San Mauro Castelverde 8
municipality Santa Cristina Gela 4
municipality Santa Flavia 153
municipality Sciara 1
municipality Scillato 10
municipality Sclafani Bagni 5
municipality Termini Imerese 60
municipality Terrasini 229
municipality Torretta 19
municipality Trabia 78
municipality Trappeto 104
municipality Ustica 84
municipality Valledolmo 2
municipality Ventimiglia di Sicilia 2
municipality Villabate 7
municipality Villafrati 1
#data visualization for numerical variables 

# Prepara il dataset per gli istogrammi
df_long_num <- df_listings_cleansed %>%
  select(all_of(numeric_cols)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Rinomina le variabili con etichette più compatte
df_long_num$Variable <- as.character(df_long_num$Variable)
df_long_num$Variable <- dplyr::recode(df_long_num$Variable,
  "review_scores_accuracy" = "Accuracy",
  "review_scores_cleanliness" = "Cleanliness",
  "review_scores_checkin" = "Check-in",
  "review_scores_communication" = "Communication",
  "review_scores_location" = "Location",
  "review_scores_rating" = "Rating",
  "review_scores_value" = "Value",
  "reviews_per_month" = "Reviews/month",
  "minimum_nights" = "Min Nights",
  "maximum_nights" = "Max Nights",
  "price" = "Price",
  "latitude" = "Latitude",
  "longitude" = "Longitude"
)

# Istogrammi con etichette leggibili e titoli non tagliati
ggplot(df_long_num, aes(x = Value, fill = Variable)) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 20) +
  facet_wrap(~Variable, scales = "free", ncol = 3) +  # distribuisce meglio i grafici
  labs(title = "Histograms of Multiple Variables", x = "Value", y = "Frequency") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.5),
    strip.text = element_text(size = 9, face = "bold"),  # più piccolo = meno tagliato
    plot.title = element_text(size = 14, hjust = 0.5),
    plot.margin = margin(20, 10, 10, 10)
  )

Maximum Nights (Top Left, Red)
  • The distribution is highly skewed, with several peaks at specific values (e.g., 30, 365, 1000).

  • This suggests that many hosts impose fixed limits on long-term stays.

Minimum Nights (Top Center, Yellow)
  • Most listings require a minimum of 1 or a few nights, with a small number demanding longer stays.

  • The data is concentrated at the lower end, meaning that short-term stays are the norm.

Price (Top Right, Brown)
  • The majority of listings have low prices, but there are occasional higher-priced accommodations.

  • The distribution is right-skewed, indicating a few luxury listings that charge significantly more.

Review Scores Accuracy, Check-in, Cleanliness, Communication, Location, Rating, and Value (Middle & Bottom Rows, Green to Purple)
  • These variables display left-skewed distributions, meaning most properties receive high ratings (close to 5 stars).

  • The limited variation suggests that guests generally rate their stays very positively.

  • This may indicate that review scores are not a strong differentiating factor between listings.

Reviews Per Month (Bottom Right, Pink)
  • Most listings receive relatively few reviews per month, while some accumulate a significantly higher number.

  • This right-skewed distribution suggests that a small percentage of properties are far more popular than the majority.

#data visualization categorical variables

# Define the maximum number of categories to keep
top_n_municipalities <- 6
top_n_property_types <- 6

# Count occurrences and select the most frequent names
top_municipalities <- df_listings_cleansed %>%
  count(municipality, sort = TRUE) %>%
  slice_head(n = top_n_municipalities) %>%
  pull(municipality)

top_property_types <- df_listings_cleansed %>%
  count(property_type, sort = TRUE) %>%
  slice_head(n = top_n_property_types) %>%
  pull(property_type)

# Crea versione semplificata con "Other" per le categorie meno frequenti
df_listings_cleansed_representation <- df_listings_cleansed %>%
  mutate(
    municipality = ifelse(municipality %in% top_municipalities, as.character(municipality), "Other"),
    property_type = ifelse(property_type %in% top_property_types, as.character(property_type), "Other")
  )

# Prepara i dati per il grafico
df_long_cat <- df_listings_cleansed_representation %>%
  select(c("host_is_superhost", "room_type", "municipality", "instant_bookable", "property_type", "host_identity_verified")) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category")

# Ordina le categorie come fattori (opzionale)
df_long_cat$Category <- factor(df_long_cat$Category, levels = unique(df_long_cat$Category))

# Genera i barplot
ggplot(df_long_cat, aes(x = Category, fill = Variable)) +
  geom_bar(position = "dodge", alpha = 0.8) +
  facet_wrap(~ Variable, scales = "free_x") +
  labs(title = "Bar Plots of Categorical Variables", x = "Category", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Host Identity Verified (Top Left, Red):

The majority of hosts have a verified identity, indicating a high level of trust and security in the listings. However, a small portion of hosts remain unverified, which might impact guest confidence.

Host is Superhost (Top Center, Brown):

Less than half of the listings belong to Superhosts, meaning that while a significant number of hosts maintain high service quality, the majority do not qualify for this status.

Instant Bookable (Top Right, Green):

The distribution is quite balanced, meaning that roughly half of the listings allow guests to book instantly, while the other half require manual approval

Municipality (Bottom Left, Blue):

The distribution shows a highly imbalanced pattern, where a single municipality (likely Palermo) has the highest number of listings, while other municipalities have significantly fewer. This suggests that most Airbnb accommodations are concentrated in a specific area, which may impact clustering and pricing analysis.

Property Type (Bottom Center, Light Blue):

There is a wide variety of property types, but a few dominate the market. The most common property type seems to be “Entire home/apartment,” while other types are much less frequent.

Room Type (Bottom Right, Pink):

The vast majority of listings are entire homes/apartments, followed by private rooms, and very few shared rooms or hotel rooms. This confirms that Airbnb is primarily used for renting out full apartments or private accommodations rather than shared spaces.

### Check relationship between each potential regressor and response (price)
plot_data <- df_listings_cleansed %>%
  select(c("maximum_nights", "minimum_nights",  "review_scores_rating", "review_scores_accuracy", "review_scores_cleanliness",  "review_scores_checkin", "review_scores_communication", "review_scores_location", "review_scores_value", "reviews_per_month", 'price')) %>%
  gather(key = "predictor", value = "predictor_value", -price)



plot_data <- df_listings_cleansed %>%
  select(c("maximum_nights", "minimum_nights", "review_scores_rating",
           "review_scores_accuracy", "review_scores_cleanliness",
           "review_scores_checkin", "review_scores_communication",
           "review_scores_location", "review_scores_value",
           "reviews_per_month", "price")) %>%
  tidyr::gather(key = "predictor", value = "predictor_value", -price)

# Converte i nomi delle variabili in etichette più leggibili
plot_data$predictor <- as.character(plot_data$predictor)
plot_data$predictor <- dplyr::recode(plot_data$predictor,
  "review_scores_accuracy" = "Accuracy",
  "review_scores_cleanliness" = "Cleanliness",
  "review_scores_checkin" = "Check-in",
  "review_scores_communication" = "Communication",
  "review_scores_location" = "Location",
  "review_scores_rating" = "Rating",
  "review_scores_value" = "Value for Money",
  "reviews_per_month" = "Reviews per Month",
  "minimum_nights" = "Min Nights",
  "maximum_nights" = "Max Nights"
)

# Crea il grafico con etichette leggibili
ggplot(plot_data, aes(x = predictor_value, y = price)) +
  geom_point(alpha = 0.5, color = "blue") +
  facet_wrap(~predictor, scales = "free_x") +
  theme_bw() +
  theme(
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 12),
    strip.text = element_text(size = 11, face = "bold"),
    plot.title = element_text(size = 14, hjust = 0.5),
    plot.margin = margin(20, 10, 10, 10)
  ) +
  labs(
    x = "Predictor Value",
    y = "Price",
    title = "Scatter Plots: Predictors vs Price"
  )

##### **Maximum & Minimum Nights:**

#Some listings allow **very long stays**, and these often have highly variable pricing.
#Short-stay listings generally have **lower prices**, but there are exceptions.
#The presence of extreme values suggests that some listings cater to long-term rentals or have unique pricing structures.

##### **Review Scores (Accuracy, Cleanliness, Check-in, Communication, Location, Rating, Value):**

#Most review scores are concentrated between **4 and 5 stars**, leading to a lack of variability.

#No strong correlation between higher review scores and price is apparent.

#This suggests that while guests generally rate accommodations highly, **hosts do not necessarily increase prices based on reviews alone.**

##### **Reviews Per Month:**
#Most listings receive **fewer reviews per month**, but a few have a significantly high number.

#High-priced listings tend to receive fewer reviews, indicating that **premium accommodations might have lower booking frequency.**

par(mfrow = c(3, 3), mar = c(4, 4, 4, 2))  # margine più alto sotto

boxplot(df_listings_cleansed$price ~ df_listings_cleansed$host_identity_verified, 
        ylab = "Price", xlab = "", main = "Price vs Host Identity Verified",
        las = 1, cex.axis = 0.9)

boxplot(df_listings_cleansed$price ~ df_listings_cleansed$host_is_superhost, 
        ylab = "Price", xlab = "", main = "Price vs Superhost",
        las = 1, cex.axis = 0.9)

boxplot(df_listings_cleansed$price ~ df_listings_cleansed$instant_bookable, 
        ylab = "Price", xlab = "", main = "Price vs Instant Booking",
        las = 1, cex.axis = 0.9)

boxplot(df_listings_cleansed$price ~ df_listings_cleansed$municipality, 
        ylab = "Price", xlab = "", main = "Price vs Municipality", 
        las = 2, cex.axis = 0.6)

boxplot(df_listings_cleansed$price ~ df_listings_cleansed$property_type, 
        ylab = "Price", xlab = "", main = "Price vs Property Type", 
        las = 2, cex.axis = 0.6)

boxplot(df_listings_cleansed$price ~ df_listings_cleansed$room_type, 
        ylab = "Price", xlab = "", main = "Price vs Room Type", 
        las = 2, cex.axis = 0.8)

Host Identity Verified (Top Left)
  • No significant difference in price between verified and non-verified hosts.

  • Both categories contain several high-price outliers, indicating that verification does not strongly influence pricing.

Superhost Status (Top Center)
  • Superhosts and non-Superhosts have similar price distributions, with some expensive listings in both groups.

  • The median price appears to be slightly higher for Superhosts, but the effect is not substantial.

Instant Booking (Top Right)
  • The price distribution is almost identical between instant bookable and non-instant bookable listings.

  • This suggests that instant booking does not significantly impact pricing strategies.

Municipality (Bottom Left)
  • There is substantial variation in price across different municipalities.

  • Some municipalities have wider interquartile ranges, while others have mostly lower-priced listings.

Property Type (Bottom Center)
  • The price distribution varies significantly by property type.

  • Some property types have consistently lower prices, while others have wider distributions with multiple outliers.

  • This indicates that property type is an important factor in pricing.

Room Type (Bottom Right)
  • Entire home/apartment listings tend to have higher prices compared to shared rooms.

  • Private rooms (not shown separately here) would likely fall between these two categories.

#CLUSTERING

df_listings_cleansed <- df_listings_cleansed %>%
  mutate(
    mean_review_score_detailed = rowMeans(select(., review_scores_rating:review_scores_value), na.rm = TRUE)
  )

#Municipality aggregation
cluster_data <- df_listings_cleansed %>%
  group_by(municipality) %>%
  summarise(
    mean_price = mean(price, na.rm = TRUE),
     mean_detailed_score = mean(mean_review_score_detailed, na.rm = TRUE)
  ) %>%
  na.omit()


#Data standardization
numeric_cols <- cluster_data %>%
  select(where(is.numeric))  # seleziona solo le variabili numeriche
scaled_data <- scale(numeric_cols)
row.names(scaled_data) <- cluster_data$municipality

#check means and sd
scale_data_mean_sd= as.data.frame(scaled_data) %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))


#Hierarchical Clustering GERARCHICO ( WARD method)
dist_matrix <- dist(scaled_data)
hc_ward <- hclust(dist_matrix, method = "ward.D")


#Dendrogram
library(dendextend)

dend <- as.dendrogram(hc_ward) %>%
  set("labels", rownames(scaled_data)) %>%
  set("branches_k_color")

labels(dend) <- ifelse(seq_along(labels(dend)) %% 3 == 0, labels(dend), "")

par(mar = c(14, 4, 4, 2))  # bottom, left, top, right

plot(dend,
     main = "Dendrogram - Ward Method",
     ylab = "Distance",
     las = 2,      
     cex = 0.6)

#Choose the optimal number of clusters

#elbow method

library(factoextra)
## Warning: il pacchetto 'factoextra' è stato creato con R versione 4.4.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Metodo del gomito (WSS) applicato ai tuoi dati


fviz_nbclust(as.data.frame(scaled_data), FUN = hcut, method = "wss", hc_method = "ward")
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"

#the optimal number of clusters according to thise method is 3 

#silhouette

fviz_nbclust(as.data.frame(scaled_data), FUN = hcut, method = "silhouette", hc_method = "ward")
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"

#the optimal number of clusters according to thise method is 9

#our decision: k=3 optimal number of clusters

#cut dendogram

dend <- as.dendrogram(hc_ward) %>%
  set("labels", rownames(scaled_data)) %>%
  set("branches_k_color", k = 3)

labels(dend) <- ifelse(seq_along(labels(dend)) %% 3 == 0, labels(dend), "")

par(mar = c(14, 4, 4, 2))  

plot(dend,
     main = "Dendrogram - Ward Method",
     ylab = "Distance",
     las = 2,       
     cex = 0.6)     

abline(h = 23, col = 'red', lwd = 2)

#Cluster
cluster_assignment <- cutree(hc_ward, k = 3)
cluster_data$cluster <- as.factor(cluster_assignment)

ggplot(cluster_data, aes(x = mean_price, y = mean_detailed_score, color = cluster)) +
  geom_point(size = 3, alpha = 0.9) +
  labs(
    title = "Clustering Ward method (k = 3)",
    x = "Average price ",
    y = "Detailed score"
  ) +
  theme_minimal() +
  theme(legend.position = "right") +
  scale_color_manual(values = c("red", "green", "blue")) 

#Clusters' statistics 
summary_clusters <- cluster_data %>%
  group_by(cluster) %>%
  summarise(
    n_comuni = n(),
   average_price = round(mean(mean_price), 2),
    average_detailed_score = round(mean(mean_detailed_score), 2),
    municipality = paste(municipality, collapse = ", ")
  )

print(summary_clusters)
## # A tibble: 3 × 5
##   cluster n_comuni average_price average_detailed_score municipality            
##   <fct>      <int>         <dbl>                  <dbl> <chr>                   
## 1 1             20         152.                    4.8  Alia, Balestrate, Bisac…
## 2 2             13          54.9                   4.94 Alimena, Belmonte Mezza…
## 3 3             41          81.4                   4.74 Altavilla Milicia, Alto…
summary_clusters$municipality
## [1] "Alia, Balestrate, Bisacquino, Borgetto, Caccamo, Carini, Cefalù, Cerda, Chiusa Sclafani, Ciminna, Cinisi, Collesano, Misilmeri, Partinico, Polizzi Generosa, Santa Flavia, Terrasini, Trabia, Ustica, Ventimiglia di Sicilia"                                                                                                                                                                                                                                                                                                                                                      
## [2] "Alimena, Belmonte Mezzagno, Blufi, Bompietro, Castronovo di Sicilia, Giuliana, Marineo, Mezzojuso, Montelepre, Montemaggiore Belsito, San Giuseppe Jato, Sciara, Villafrati"                                                                                                                                                                                                                                                                                                                                                                                                       
## [3] "Altavilla Milicia, Altofonte, Bagheria, Bolognetta, Caltavuturo, Campofelice di Roccella, Campofiorito, Camporeale, Capaci, Castelbuono, Casteldaccia, Castellana Sicula, Contessa Entellina, Corleone, Ficarazzi, Gangi, Geraci Siculo, Gratteri, Isnello, Isola delle Femmine, Lascari, Lercara Friddi, Monreale, Palazzo Adriano, Palermo, Petralia Soprana, Petralia Sottana, Piana degli Albanesi, Pollina, Prizzi, Roccamena, Roccapalumba, San Mauro Castelverde, Santa Cristina Gela, Scillato, Sclafani Bagni, Termini Imerese, Torretta, Trappeto, Valledolmo, Villabate"
#cluster1---> High-end destinations with premium prices and high customer ratings. Likely tourist hotspots.
#cluster2--> Low-cost locations with excellent ratings. Possibly small, lesser-known towns with highly appreciated stays.
#cluster3---> Mid-range price and quality. A diverse group with a mix of performance levels.

#rename clusters to have a clearer representation

cluster_data$cluster <- factor(cluster_data$cluster,
                                 levels = c("1", "2", "3"),
                                 labels = c("Premium", "Affordable & Top Rated", "Mid-Range"))


summary_clusters$cluster  <- factor(summary_clusters$cluster,
                                 levels = c("1", "2", "3"),
                                 labels = c("Premium", "Affordable & Top Rated", "Mid-Range"))
#CHOROPLETH MAP

#represent the previously defined clusters on a map with different colors depending on clusters' price and average ratings

# Libreries


library(sf)
## Warning: il pacchetto 'sf' è stato creato con R versione 4.4.3
## Linking to GEOS 3.13.0, GDAL 3.10.1, PROJ 9.5.1; sf_use_s2() is TRUE
library(dplyr)
library(ggplot2)

# Carica e prepara lo shapefile
sf_gb <- read_sf("Com01012025_g", layer = "Com01012025_g_WGS84") %>%
  st_make_valid() %>%
  st_transform(crs = 4326) %>%
  mutate(COMUNE = toupper(COMUNE))

# Prepara cluster_data
cluster_data <- cluster_data %>%
  mutate(municipality = toupper(municipality))

# Join e filtro per Palermo
sf_palermo <- sf_gb %>%
  left_join(cluster_data %>% select(municipality, cluster),
            by = c("COMUNE" = "municipality")) %>%
  filter(COD_PROV == "82")

# Separa comuni clusterizzati
sf_clustered <- sf_palermo %>% filter(!is.na(cluster))

# Mappa
ggplot() +
  geom_sf(data = sf_palermo, fill = "grey90", color = "white", size = 0.2) +
  geom_sf(data = sf_clustered, aes(fill = as.factor(cluster)), color = "white", size = 0.3) +
  scale_fill_manual(
    values = c("darkblue", "dodgerblue", "lightblue"),
    name = "Cluster",
    labels = c("Premium", "Affordable & High-Quality", "Mid-Range")
  ) +
  labs(
    title = "Clustering of Palermo Province Municipalities on Quality and Price",
    subtitle = "Source: ISTAT",
    caption = "Market data analysis @ LUISS"
  ) +
  theme_minimal() +
  theme_void() +
  theme(legend.position = "right")

#ITINERARY
#It’s a function that returns an itinerary based on the user’s total budget, number of nights, and chosen theme among:

#Sea
#Mountain / Mountain & Nature
#Relaxation

#Culture / History & Culture
#Entertainment
#Food & Wine
#Spirituality / Spirituality & Tradition


library(dplyr)
library(geosphere)
## Warning: il pacchetto 'geosphere' è stato creato con R versione 4.4.3
preferences <- read_excel("city_preferences.xlsx")

#group by for a better visualization
preferences <- preferences %>%
  distinct(municipality, preference) %>%         
  group_by(municipality) %>%
  summarise(
    preferences_list = paste(sort(unique(preference)), collapse = ", ")
  ) %>%
  arrange(municipality)


# calculate centroids with lat and long
sf_centroids_coords <- sf_gb %>%
  st_centroid(of_largest_polygon = TRUE) %>%
  mutate(
    lat = st_coordinates(geometry)[, 2],
    lon = st_coordinates(geometry)[, 1],
    COMUNE = toupper(COMUNE)
  ) %>%
  st_drop_geometry() %>%
  select(COMUNE, lat, lon)
## Warning: st_centroid assumes attributes are constant over geometries
# Join:add lat/lon and preferences to cluster_data
cluster_data <- cluster_data %>%
  mutate(municipality = toupper(trimws(municipality))) %>%
  left_join(sf_centroids_coords,
            by = c("municipality" = "COMUNE")) %>%
  left_join(preferences,     
            by = "municipality")

#function

suggest_itinerary <- function(budget,
                              total_number_nights,
                              summary_clusters,
                              cluster_data,
                              user_preference) {

  cat("\nStarting smart itinerary suggestion …\n")

 #column check
  
  needed <- c("municipality", "lat", "lon",
              "mean_price", "mean_detailed_score",
              "cluster", "preferences_list")   # ← nome corretto
  if (!all(needed %in% names(cluster_data))) {
    stop("cluster_data must contain: ",
         paste(needed, collapse = ", "))
  }

  # Cluster selection based on budget
 
  budget_per_night <- budget / total_number_nights
  cat("Total budget:", budget, "€  |  per-night budget:",
      round(budget_per_night, 2), "€\n")
  cat("Requested theme:", user_preference, "\n")

  chosen_cluster <- summary_clusters %>%
    filter(average_price <= budget_per_night) %>%
    arrange(desc(average_price)) %>%
    slice_head(n = 1) %>%
    pull(cluster)

  if (length(chosen_cluster) == 0) {
    cat("No cluster fits the budget.\n"); return(NULL)
  }
  cat("Chosen cluster:", chosen_cluster, "\n")

  # Filter cities by cluster AND preferences_list
 
  cities_pool <- cluster_data %>%
    filter(cluster == chosen_cluster,
           grepl(user_preference,
                 preferences_list,           # ← colonna corretta
                 ignore.case = TRUE),
           !is.na(lat), !is.na(lon)) %>%
    distinct(municipality, .keep_all = TRUE)

  if (nrow(cities_pool) == 0) {
    cat("No cities with theme",
        user_preference, "in that cluster.\n"); return(NULL)
  }

  # How many cities (~1 every 2.5 nights)
  
  n_cities <- max(1,
                  min( ceiling(total_number_nights / 2.5),
                       nrow(cities_pool) ) )
  cat("Number of cities to visit:", n_cities, "\n")

  set.seed(as.integer(Sys.time()))
  weights <- cities_pool$mean_detailed_score /
             sum(cities_pool$mean_detailed_score)
  picked <- cities_pool %>%
    slice_sample(n = n_cities, weight_by = weights)

  # Allocate nights  (≥1 per city, variable)
 
  base_nights      <- rep(1, n_cities)
  remaining_nights <- total_number_nights - n_cities

  if (remaining_nights > 0) {
    extra_weights <- (picked$mean_detailed_score^2) /
                     sum(picked$mean_detailed_score^2)
    extra_nights  <- rmultinom(1, remaining_nights,
                               prob = extra_weights) |>
                     as.numeric()

    # ensure variation
    if (remaining_nights >= 2 &&
        length(unique(extra_nights)) == 1) {
      take <- sample(which(extra_nights > 0), 1)
      give <- sample(setdiff(seq_along(extra_nights), take), 1)
      extra_nights[take] <- extra_nights[take] - 1
      extra_nights[give] <- extra_nights[give] + 1
    }
  } else {
    extra_nights <- rep(0, n_cities)
  }

  picked$nights     <- base_nights + extra_nights
  picked$total_cost <- round(picked$mean_price * picked$nights, 2)

  #  Re-balance if cost exceeds budget
  
  attempts <- 0
  while (sum(picked$total_cost) > budget && attempts < 15) {
    idx <- which.max(picked$mean_price)
    if (picked$nights[idx] > 1) {
      picked$nights[idx] <- picked$nights[idx] - 1
      picked$total_cost  <- round(picked$mean_price * picked$nights, 2)
    }
    attempts <- attempts + 1
  }

  # Order by nearest neighbour
  
  if (n_cities > 1) {
    dist_mat <- distm(picked[, c("lon", "lat")])
    order_v  <- integer(n_cities); visited <- rep(FALSE, n_cities)
    order_v[1] <- 1; visited[1] <- TRUE

    for (i in 2:n_cities) {
      last <- order_v[i - 1]
      cand <- which.min(ifelse(visited, Inf, dist_mat[last, ]))
      if (length(cand) == 0 || visited[cand]) break
      order_v[i] <- cand; visited[cand] <- TRUE
    }
    order_v <- order_v[order_v > 0]
    picked  <- picked[order_v, ]
  }

  # Output
 
  itinerary <- picked %>%
  select(municipality, preferences_list,
         lat, lon, nights,
         price_per_night = mean_price,
         total_cost)

cat("\nSuggested itinerary:\n")
print(itinerary)
}

#recall function
itinerary <- suggest_itinerary(
  budget              = 800,
  total_number_nights = 5,
  summary_clusters    = summary_clusters,
  cluster_data        = cluster_data,      # deve includere “preferences_list”
  user_preference     = "Cultura")
## 
## Starting smart itinerary suggestion …
## Total budget: 800 €  |  per-night budget: 160 €
## Requested theme: Cultura 
## Chosen cluster: 1 
## Number of cities to visit: 2 
## 
## Suggested itinerary:
## # A tibble: 2 × 7
##   municipality preferences_list      lat   lon nights price_per_night total_cost
##   <chr>        <chr>               <dbl> <dbl>  <dbl>           <dbl>      <dbl>
## 1 CACCAMO      Cultura e Storia, …  37.9  13.7      2            116.       232.
## 2 MISILMERI    Cultura e Storia     38.0  13.4      3            132.       397.