Team 17- Palermo

Objectives and Expected Results

Analysis Objectives

The primary goal of this analysis is to develop an Airbnb Budget Trip Planner, which suggests a personalized itinerary optimized for both cost and accommodation quality. Specifically, the objectives are:

Distribute the stay between a main city (e.g., Palermo) and nearby towns.
Maximize the number of nights within the user’s budget.
Ensure a minimum quality level by selecting highly-rated accommodations.
Identify cheaper alternatives through clustering of cities.

Expected Results

Creation of a personalized itinerary with the optimal distribution of nights.
Prediction of the expected price per night using regression models.
Identification of clusters of cities with similar prices and quality.
Clear visualizations using maps and graphs to facilitate decision-making.

Potential Business Actions Enabled

Optimization of accommodation selection for budget-conscious travelers.
Support for booking platforms in dynamic price management.
Enhancement of user experience through AI-based recommendations

Exploratory Data Analysis (EDA)

Description of Key Variables

The exploratory analysis focuses on several key variables:

Host and Booking-Related Variables

host_is_superhost: Indicates whether the host is a superhost (quality and reliability).
host_identity_verified: Indicates whether the host’s identity has been verified
room_type: Type of accommodation (e.g., entire home, private room).
property_type: Specific type of property (e.g., apartment, villa, boutique hotel)
instant_bookable: Indicates whether the listing can be booked instantly.

Price and Stay Duration

price: Price per night of the listing.
minimum_nights: Minimum number of nights required by the host.
maximum_nights: Maximum number of nights available for booking.

Guest Ratings

review_scores_rating: Overall rating of the accommodation.
review_scores_accuracy: Accuracy of the listing’s information.
review_scores_cleanliness: Cleanliness of the accommodation.
review_scores_checkin: Ease of check-in.
review_scores_communication: Host’s communication quality.
review_scores_location: Location rating by previous guests.
review_scores_value: Value-for-money rating.

Reviews and Popularity

reviews_per_month: Number of reviews received per month.
municipality: City/locality of the listing.

Data Strengths

The review_scores_rating variable helps identify high-quality accommodations.
price and municipality allow analysis of price variations across cities.
room_type differentiates between entire homes and private rooms.

Data Weaknesses

Missing Data: Some listings lack complete review scores.
Outliers in Prices: Extreme values with significantly higher prices than the average.
Uneven Distribution: Some cities have a disproportionately high number of listings, which may affect clustering results

Analysis

library(readxl)
library(dplyr)
library(purrr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(viridisLite)
library(car)
library(kableExtra)
library(cluster)
library(dendextend)
library(leaflet)
library(combinat)
library(stringr)

## Set the preferred working directory
setwd("C:/Users/Alice/OneDrive - LUISS Libera Università Internazionale degli Studi Sociali Guido Carli/ALICE/LUISS/magistrale/appunti/semestre 2/market data analysis/progetto")

## Read the listings dataset
df_listings <- read_excel("listings .xlsx")
df_reviews <- read_excel("reviews.xlsx")
df_calendar <- read_excel("calendar.xlsx")

## Specify columns to remove for cleaning purposes
columns_to_remove<- c("listing_id", "listing_url", "last_scraped", "name", "description", "neighborhood_overview",  "host_id", "host_name", "host_location",
"host_since", "host_response_time", "host_response_rate", "host_acceptance_rate",  
"host_neighbourhood", "host_listings_count", "host_has_profile_pic",  
"host_verifications", "neighbourhood_cleansed",  "number_of_reviews",
"accommodates", "bedrooms", "beds", "amenities",   
"availability_30", "availability_90", "availability_365", "number_of_reviews_ltm", "number_of_reviews_l30d",  
"first_review", "last_review"
  )

## Remove the selected columns
df_listings_cleansed <- df_listings[, !(colnames(df_listings) %in%
                                         columns_to_remove)]

## remove NA values
df_listings_cleansed <- df_listings_cleansed %>%
  drop_na()
sum(is.na(df_listings_cleansed))

## [1] 0

##numeric and categorical variables
col_types = sapply(df_listings_cleansed, class)

numeric_cols = names(which(sapply(col_types, function(x) any(x %in% c("numeric", "integer")))))

categorical_cols = names(which(sapply(col_types, function(x) any(x %in% c("factor", "character")))))

# Define the numeric variables 
numeric_vars <- c("price")

# Rimuoviamo gli outliers
# abbiamo controllato su airbnb minimo e massimo

df_listings_cleansed <- df_listings_cleansed %>%
  filter(price >= 9, price <= 1200)

## Summary statistics for numerical variables

numeric_vars <- df_listings_cleansed %>%
  select(where(is.numeric))

numeric_table <- summary(numeric_vars)

kable(numeric_table, caption = "Summary statistics of all numerical variables") %>%
  kable_styling(full_width = FALSE, 
                bootstrap_options = c("striped", "hover", "condensed", "bordered"))

Summary statistics of all numerical variables
latitude	longitude	price	minimum_nights	maximum_nights	review_scores_rating	review_scores_accuracy	review_scores_cleanliness	review_scores_checkin	review_scores_communication	review_scores_location	review_scores_value	reviews_per_month
Min. :37.60	Min. :12.98	Min. : 15.0	Min. : 1.000	Min. : 1.0	Min. :1.000	Min. :0.000	Min. :0.00	Min. :0.000	Min. :1.00	Min. :1.000	Min. :1.000	Min. : 0.0100
1st Qu.:38.06	1st Qu.:13.33	1st Qu.: 60.0	1st Qu.: 1.000	1st Qu.: 29.0	1st Qu.:4.670	1st Qu.:4.700	1st Qu.:4.71	1st Qu.:4.810	1st Qu.:4.83	1st Qu.:4.580	1st Qu.:4.600	1st Qu.: 0.1600
Median :38.12	Median :13.36	Median : 80.0	Median : 2.000	Median : 90.0	Median :4.850	Median :4.870	Median :4.89	Median :4.930	Median :4.95	Median :4.800	Median :4.800	Median : 0.5000
Mean :38.10	Mean :13.44	Mean : 105.1	Mean : 2.601	Mean : 404.4	Mean :4.741	Mean :4.764	Mean :4.78	Mean :4.835	Mean :4.84	Mean :4.697	Mean :4.689	Mean : 0.9516
3rd Qu.:38.13	3rd Qu.:13.39	3rd Qu.: 113.0	3rd Qu.: 3.000	3rd Qu.:1125.0	3rd Qu.:5.000	3rd Qu.:5.000	3rd Qu.:5.00	3rd Qu.:5.000	3rd Qu.:5.00	3rd Qu.:4.960	3rd Qu.:4.930	3rd Qu.: 1.2900
Max. :38.72	Max. :14.25	Max. :1150.0	Max. :365.000	Max. :1125.0	Max. :5.000	Max. :5.000	Max. :5.00	Max. :5.000	Max. :5.00	Max. :5.000	Max. :5.000	Max. :14.3300

## Summary delle variabili categoriche

# Seleziona solo le variabili categoriche
categorical_vars <- df_listings_cleansed %>%
  select(where(~ is.character(.) || is.factor(.)))

# Calcola le frequenze per ogni variabile
categorical_summary <- map(categorical_vars, ~ as.data.frame(table(.))) %>%
  imap(~ mutate(.x, Variable = .y)) %>%
  bind_rows() %>%
  rename(Value = ".", Count = "Freq") %>%
  relocate(Variable, .before = Value)

# Visualizza la tabella con kable
kable(categorical_summary, caption = "Frequenze delle variabili categoriche") %>%
  kable_styling(full_width = FALSE, 
                bootstrap_options = c("striped", "hover", "condensed", "bordered"))

Frequenze delle variabili categoriche
Variable	Value	Count
host_is_superhost	f	4903
host_is_superhost	t	2646
host_identity_verified	f	644
host_identity_verified	t	6905
property_type	Barn	1
property_type	Boat	5
property_type	Camper/RV	1
property_type	Campsite	1
property_type	Casa particular	6
property_type	Dome	4
property_type	Earthen home	1
property_type	Entire bed and breakfast	1
property_type	Entire bungalow	11
property_type	Entire cabin	11
property_type	Entire chalet	2
property_type	Entire condo	1014
property_type	Entire cottage	61
property_type	Entire guest suite	30
property_type	Entire guesthouse	8
property_type	Entire home	990
property_type	Entire loft	125
property_type	Entire place	10
property_type	Entire rental unit	2726
property_type	Entire serviced apartment	28
property_type	Entire townhouse	64
property_type	Entire vacation home	429
property_type	Entire villa	623
property_type	Farm stay	20
property_type	Island	1
property_type	Private room	7
property_type	Private room in bed and breakfast	438
property_type	Private room in boat	1
property_type	Private room in casa particular	7
property_type	Private room in castle	3
property_type	Private room in condo	178
property_type	Private room in cottage	2
property_type	Private room in farm stay	27
property_type	Private room in guest suite	23
property_type	Private room in guesthouse	6
property_type	Private room in home	110
property_type	Private room in hut	1
property_type	Private room in loft	7
property_type	Private room in nature lodge	12
property_type	Private room in rental unit	340
property_type	Private room in resort	1
property_type	Private room in serviced apartment	12
property_type	Private room in tiny home	1
property_type	Private room in townhouse	7
property_type	Private room in vacation home	31
property_type	Private room in villa	28
property_type	Room in aparthotel	8
property_type	Room in bed and breakfast	33
property_type	Room in boutique hotel	4
property_type	Room in hostel	1
property_type	Room in hotel	12
property_type	Room in serviced apartment	11
property_type	Shared room in condo	1
property_type	Shared room in home	1
property_type	Shared room in hostel	1
property_type	Shared room in rental unit	4
property_type	Tent	3
property_type	Tiny home	54
property_type	Tower	1
room_type	Entire home/apt	6231
room_type	Hotel room	45
room_type	Private room	1266
room_type	Shared room	7
instant_bookable	f	3568
instant_bookable	t	3981
municipality	Alia	6
municipality	Alimena	1
municipality	Altavilla Milicia	80
municipality	Altofonte	5
municipality	Bagheria	112
municipality	Balestrate	140
municipality	Belmonte Mezzagno	1
municipality	Bisacquino	1
municipality	Blufi	1
municipality	Bolognetta	5
municipality	Bompietro	3
municipality	Borgetto	16
municipality	Caccamo	25
municipality	Caltavuturo	3
municipality	Campofelice di Roccella	189
municipality	Campofiorito	4
municipality	Camporeale	1
municipality	Capaci	54
municipality	Carini	144
municipality	Castelbuono	64
municipality	Casteldaccia	43
municipality	Castellana Sicula	9
municipality	Castronovo di Sicilia	5
municipality	Cefalù	636
municipality	Cerda	1
municipality	Chiusa Sclafani	2
municipality	Ciminna	4
municipality	Cinisi	114
municipality	Collesano	47
municipality	Contessa Entellina	4
municipality	Corleone	15
municipality	Ficarazzi	14
municipality	Gangi	23
municipality	Geraci Siculo	6
municipality	Giuliana	1
municipality	Gratteri	25
municipality	Isnello	12
municipality	Isola delle Femmine	100
municipality	Lascari	82
municipality	Lercara Friddi	5
municipality	Marineo	3
municipality	Mezzojuso	1
municipality	Misilmeri	7
municipality	Monreale	79
municipality	Montelepre	10
municipality	Montemaggiore Belsito	1
municipality	Palazzo Adriano	2
municipality	Palermo	4426
municipality	Partinico	114
municipality	Petralia Soprana	18
municipality	Petralia Sottana	22
municipality	Piana degli Albanesi	7
municipality	Polizzi Generosa	19
municipality	Pollina	53
municipality	Prizzi	8
municipality	Roccamena	3
municipality	Roccapalumba	4
municipality	San Giuseppe Jato	2
municipality	San Mauro Castelverde	8
municipality	Santa Cristina Gela	4
municipality	Santa Flavia	153
municipality	Sciara	1
municipality	Scillato	10
municipality	Sclafani Bagni	5
municipality	Termini Imerese	60
municipality	Terrasini	229
municipality	Torretta	19
municipality	Trabia	78
municipality	Trappeto	104
municipality	Ustica	84
municipality	Valledolmo	2
municipality	Ventimiglia di Sicilia	2
municipality	Villabate	7
municipality	Villafrati	1

#data visualization for numerical variables 

# Prepara il dataset per gli istogrammi
df_long_num <- df_listings_cleansed %>%
  select(all_of(numeric_cols)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Rinomina le variabili con etichette più compatte
df_long_num$Variable <- as.character(df_long_num$Variable)
df_long_num$Variable <- dplyr::recode(df_long_num$Variable,
  "review_scores_accuracy" = "Accuracy",
  "review_scores_cleanliness" = "Cleanliness",
  "review_scores_checkin" = "Check-in",
  "review_scores_communication" = "Communication",
  "review_scores_location" = "Location",
  "review_scores_rating" = "Rating",
  "review_scores_value" = "Value",
  "reviews_per_month" = "Reviews/month",
  "minimum_nights" = "Min Nights",
  "maximum_nights" = "Max Nights",
  "price" = "Price",
  "latitude" = "Latitude",
  "longitude" = "Longitude"
)

# Istogrammi con etichette leggibili e titoli non tagliati
ggplot(df_long_num, aes(x = Value, fill = Variable)) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 20) +
  facet_wrap(~Variable, scales = "free", ncol = 3) +  # distribuisce meglio i grafici
  labs(title = "Histograms of Multiple Variables", x = "Value", y = "Frequency") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.5),
    strip.text = element_text(size = 9, face = "bold"),  # più piccolo = meno tagliato
    plot.title = element_text(size = 14, hjust = 0.5),
    plot.margin = margin(20, 10, 10, 10)
  )

Maximum Nights (Top Left, Red)

The distribution is highly skewed, with several peaks at specific values (e.g., 30, 365, 1000).
This suggests that many hosts impose fixed limits on long-term stays.

Minimum Nights (Top Center, Yellow)

Most listings require a minimum of 1 or a few nights, with a small number demanding longer stays.
The data is concentrated at the lower end, meaning that short-term stays are the norm.

Price (Top Right, Brown)

The majority of listings have low prices, but there are occasional higher-priced accommodations.
The distribution is right-skewed, indicating a few luxury listings that charge significantly more.

Review Scores Accuracy, Check-in, Cleanliness, Communication, Location, Rating, and Value (Middle & Bottom Rows, Green to Purple)

These variables display left-skewed distributions, meaning most properties receive high ratings (close to 5 stars).
The limited variation suggests that guests generally rate their stays very positively.
This may indicate that review scores are not a strong differentiating factor between listings.

Reviews Per Month (Bottom Right, Pink)

Most listings receive relatively few reviews per month, while some accumulate a significantly higher number.
This right-skewed distribution suggests that a small percentage of properties are far more popular than the majority.

#data visualization categorical variables

# Define the maximum number of categories to keep
top_n_municipalities <- 6
top_n_property_types <- 6

# Count occurrences and select the most frequent names
top_municipalities <- df_listings_cleansed %>%
  count(municipality, sort = TRUE) %>%
  slice_head(n = top_n_municipalities) %>%
  pull(municipality)

top_property_types <- df_listings_cleansed %>%
  count(property_type, sort = TRUE) %>%
  slice_head(n = top_n_property_types) %>%
  pull(property_type)

# Crea versione semplificata con "Other" per le categorie meno frequenti
df_listings_cleansed_representation <- df_listings_cleansed %>%
  mutate(
    municipality = ifelse(municipality %in% top_municipalities, as.character(municipality), "Other"),
    property_type = ifelse(property_type %in% top_property_types, as.character(property_type), "Other")
  )

# Prepara i dati per il grafico
df_long_cat <- df_listings_cleansed_representation %>%
  select(c("host_is_superhost", "room_type", "municipality", "instant_bookable", "property_type", "host_identity_verified")) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category")

# Ordina le categorie come fattori (opzionale)
df_long_cat$Category <- factor(df_long_cat$Category, levels = unique(df_long_cat$Category))

# Genera i barplot
ggplot(df_long_cat, aes(x = Category, fill = Variable)) +
  geom_bar(position = "dodge", alpha = 0.8) +
  facet_wrap(~ Variable, scales = "free_x") +
  labs(title = "Bar Plots of Categorical Variables", x = "Category", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Host Identity Verified (Top Left, Red):

The majority of hosts have a verified identity, indicating a high level of trust and security in the listings. However, a small portion of hosts remain unverified, which might impact guest confidence.

Host is Superhost (Top Center, Brown):

Less than half of the listings belong to Superhosts, meaning that while a significant number of hosts maintain high service quality, the majority do not qualify for this status.

Instant Bookable (Top Right, Green):

The distribution is quite balanced, meaning that roughly half of the listings allow guests to book instantly, while the other half require manual approval

Municipality (Bottom Left, Blue):

The distribution shows a highly imbalanced pattern, where a single municipality (likely Palermo) has the highest number of listings, while other municipalities have significantly fewer. This suggests that most Airbnb accommodations are concentrated in a specific area, which may impact clustering and pricing analysis.

Property Type (Bottom Center, Light Blue):

There is a wide variety of property types, but a few dominate the market. The most common property type seems to be “Entire home/apartment,” while other types are much less frequent.

Room Type (Bottom Right, Pink):

The vast majority of listings are entire homes/apartments, followed by private rooms, and very few shared rooms or hotel rooms. This confirms that Airbnb is primarily used for renting out full apartments or private accommodations rather than shared spaces.

### Check relationship between each potential regressor and response (price)
plot_data <- df_listings_cleansed %>%
  select(c("maximum_nights", "minimum_nights",  "review_scores_rating", "review_scores_accuracy", "review_scores_cleanliness",  "review_scores_checkin", "review_scores_communication", "review_scores_location", "review_scores_value", "reviews_per_month", 'price')) %>%
  gather(key = "predictor", value = "predictor_value", -price)



plot_data <- df_listings_cleansed %>%
  select(c("maximum_nights", "minimum_nights", "review_scores_rating",
           "review_scores_accuracy", "review_scores_cleanliness",
           "review_scores_checkin", "review_scores_communication",
           "review_scores_location", "review_scores_value",
           "reviews_per_month", "price")) %>%
  tidyr::gather(key = "predictor", value = "predictor_value", -price)

# Converte i nomi delle variabili in etichette più leggibili
plot_data$predictor <- as.character(plot_data$predictor)
plot_data$predictor <- dplyr::recode(plot_data$predictor,
  "review_scores_accuracy" = "Accuracy",
  "review_scores_cleanliness" = "Cleanliness",
  "review_scores_checkin" = "Check-in",
  "review_scores_communication" = "Communication",
  "review_scores_location" = "Location",
  "review_scores_rating" = "Rating",
  "review_scores_value" = "Value for Money",
  "reviews_per_month" = "Reviews per Month",
  "minimum_nights" = "Min Nights",
  "maximum_nights" = "Max Nights"
)

# Crea il grafico con etichette leggibili
ggplot(plot_data, aes(x = predictor_value, y = price)) +
  geom_point(alpha = 0.5, color = "blue") +
  facet_wrap(~predictor, scales = "free_x") +
  theme_bw() +
  theme(
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 12),
    strip.text = element_text(size = 11, face = "bold"),
    plot.title = element_text(size = 14, hjust = 0.5),
    plot.margin = margin(20, 10, 10, 10)
  ) +
  labs(
    x = "Predictor Value",
    y = "Price",
    title = "Scatter Plots: Predictors vs Price"
  )

##### **Maximum & Minimum Nights:**

#Some listings allow **very long stays**, and these often have highly variable pricing.
#Short-stay listings generally have **lower prices**, but there are exceptions.
#The presence of extreme values suggests that some listings cater to long-term rentals or have unique pricing structures.

##### **Review Scores (Accuracy, Cleanliness, Check-in, Communication, Location, Rating, Value):**

#Most review scores are concentrated between **4 and 5 stars**, leading to a lack of variability.

#No strong correlation between higher review scores and price is apparent.

#This suggests that while guests generally rate accommodations highly, **hosts do not necessarily increase prices based on reviews alone.**

##### **Reviews Per Month:**
#Most listings receive **fewer reviews per month**, but a few have a significantly high number.

#High-priced listings tend to receive fewer reviews, indicating that **premium accommodations might have lower booking frequency.**

par(mfrow = c(3, 3), mar = c(4, 4, 4, 2))  # margine più alto sotto

boxplot(df_listings_cleansed$price ~ df_listings_cleansed$host_identity_verified, 
        ylab = "Price", xlab = "", main = "Price vs Host Identity Verified",
        las = 1, cex.axis = 0.9)

boxplot(df_listings_cleansed$price ~ df_listings_cleansed$host_is_superhost, 
        ylab = "Price", xlab = "", main = "Price vs Superhost",
        las = 1, cex.axis = 0.9)

boxplot(df_listings_cleansed$price ~ df_listings_cleansed$instant_bookable, 
        ylab = "Price", xlab = "", main = "Price vs Instant Booking",
        las = 1, cex.axis = 0.9)

boxplot(df_listings_cleansed$price ~ df_listings_cleansed$municipality, 
        ylab = "Price", xlab = "", main = "Price vs Municipality", 
        las = 2, cex.axis = 0.6)

boxplot(df_listings_cleansed$price ~ df_listings_cleansed$property_type, 
        ylab = "Price", xlab = "", main = "Price vs Property Type", 
        las = 2, cex.axis = 0.6)

boxplot(df_listings_cleansed$price ~ df_listings_cleansed$room_type, 
        ylab = "Price", xlab = "", main = "Price vs Room Type", 
        las = 2, cex.axis = 0.8)

Host Identity Verified (Top Left)

No significant difference in price between verified and non-verified hosts.
Both categories contain several high-price outliers, indicating that verification does not strongly influence pricing.

Superhost Status (Top Center)

Superhosts and non-Superhosts have similar price distributions, with some expensive listings in both groups.
The median price appears to be slightly higher for Superhosts, but the effect is not substantial.

Instant Booking (Top Right)

The price distribution is almost identical between instant bookable and non-instant bookable listings.
This suggests that instant booking does not significantly impact pricing strategies.

Municipality (Bottom Left)

There is substantial variation in price across different municipalities.
Some municipalities have wider interquartile ranges, while others have mostly lower-priced listings.

Property Type (Bottom Center)

The price distribution varies significantly by property type.
Some property types have consistently lower prices, while others have wider distributions with multiple outliers.
This indicates that property type is an important factor in pricing.

Room Type (Bottom Right)

Entire home/apartment listings tend to have higher prices compared to shared rooms.
Private rooms (not shown separately here) would likely fall between these two categories.

#CLUSTERING

df_listings_cleansed <- df_listings_cleansed %>%
  mutate(
    mean_review_score_detailed = rowMeans(select(., review_scores_rating:review_scores_value), na.rm = TRUE)
  )

#Municipality aggregation
cluster_data <- df_listings_cleansed %>%
  group_by(municipality) %>%
  summarise(
    mean_price = mean(price, na.rm = TRUE),
     mean_detailed_score = mean(mean_review_score_detailed, na.rm = TRUE)
  ) %>%
  na.omit()


#Data standardization
numeric_cols <- cluster_data %>%
  select(where(is.numeric))  # seleziona solo le variabili numeriche
scaled_data <- scale(numeric_cols)
row.names(scaled_data) <- cluster_data$municipality

#check means and sd
scale_data_mean_sd= as.data.frame(scaled_data) %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))


#Hierarchical Clustering GERARCHICO ( WARD method)
dist_matrix <- dist(scaled_data)
hc_ward <- hclust(dist_matrix, method = "ward.D")


#Dendrogram
library(dendextend)

dend <- as.dendrogram(hc_ward) %>%
  set("labels", rownames(scaled_data)) %>%
  set("branches_k_color")

labels(dend) <- ifelse(seq_along(labels(dend)) %% 3 == 0, labels(dend), "")

par(mar = c(14, 4, 4, 2))  # bottom, left, top, right

plot(dend,
     main = "Dendrogram - Ward Method",
     ylab = "Distance",
     las = 2,      
     cex = 0.6)

#Choose the optimal number of clusters

#elbow method

library(factoextra)

## Warning: il pacchetto 'factoextra' è stato creato con R versione 4.4.3

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

# Metodo del gomito (WSS) applicato ai tuoi dati


fviz_nbclust(as.data.frame(scaled_data), FUN = hcut, method = "wss", hc_method = "ward")

## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"

## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"

#the optimal number of clusters according to thise method is 3 

#silhouette

fviz_nbclust(as.data.frame(scaled_data), FUN = hcut, method = "silhouette", hc_method = "ward")

## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"
## Il metodo "ward" è stato rinominato in "ward.D"; nota il nuovo "ward.D2"

#the optimal number of clusters according to thise method is 9

#our decision: k=3 optimal number of clusters

#cut dendogram

dend <- as.dendrogram(hc_ward) %>%
  set("labels", rownames(scaled_data)) %>%
  set("branches_k_color", k = 3)

labels(dend) <- ifelse(seq_along(labels(dend)) %% 3 == 0, labels(dend), "")

par(mar = c(14, 4, 4, 2))  

plot(dend,
     main = "Dendrogram - Ward Method",
     ylab = "Distance",
     las = 2,       
     cex = 0.6)     

abline(h = 23, col = 'red', lwd = 2)

#Cluster
cluster_assignment <- cutree(hc_ward, k = 3)
cluster_data$cluster <- as.factor(cluster_assignment)

ggplot(cluster_data, aes(x = mean_price, y = mean_detailed_score, color = cluster)) +
  geom_point(size = 3, alpha = 0.9) +
  labs(
    title = "Clustering Ward method (k = 3)",
    x = "Average price ",
    y = "Detailed score"
  ) +
  theme_minimal() +
  theme(legend.position = "right") +
  scale_color_manual(values = c("red", "green", "blue"))

#Clusters' statistics 
summary_clusters <- cluster_data %>%
  group_by(cluster) %>%
  summarise(
    n_comuni = n(),
   average_price = round(mean(mean_price), 2),
    average_detailed_score = round(mean(mean_detailed_score), 2),
    municipality = paste(municipality, collapse = ", ")
  )

print(summary_clusters)

## # A tibble: 3 × 5
##   cluster n_comuni average_price average_detailed_score municipality            
##   <fct>      <int>         <dbl>                  <dbl> <chr>                   
## 1 1             20         152.                    4.8  Alia, Balestrate, Bisac…
## 2 2             13          54.9                   4.94 Alimena, Belmonte Mezza…
## 3 3             41          81.4                   4.74 Altavilla Milicia, Alto…

summary_clusters$municipality

## [1] "Alia, Balestrate, Bisacquino, Borgetto, Caccamo, Carini, Cefalù, Cerda, Chiusa Sclafani, Ciminna, Cinisi, Collesano, Misilmeri, Partinico, Polizzi Generosa, Santa Flavia, Terrasini, Trabia, Ustica, Ventimiglia di Sicilia"                                                                                                                                                                                                                                                                                                                                                      
## [2] "Alimena, Belmonte Mezzagno, Blufi, Bompietro, Castronovo di Sicilia, Giuliana, Marineo, Mezzojuso, Montelepre, Montemaggiore Belsito, San Giuseppe Jato, Sciara, Villafrati"                                                                                                                                                                                                                                                                                                                                                                                                       
## [3] "Altavilla Milicia, Altofonte, Bagheria, Bolognetta, Caltavuturo, Campofelice di Roccella, Campofiorito, Camporeale, Capaci, Castelbuono, Casteldaccia, Castellana Sicula, Contessa Entellina, Corleone, Ficarazzi, Gangi, Geraci Siculo, Gratteri, Isnello, Isola delle Femmine, Lascari, Lercara Friddi, Monreale, Palazzo Adriano, Palermo, Petralia Soprana, Petralia Sottana, Piana degli Albanesi, Pollina, Prizzi, Roccamena, Roccapalumba, San Mauro Castelverde, Santa Cristina Gela, Scillato, Sclafani Bagni, Termini Imerese, Torretta, Trappeto, Valledolmo, Villabate"

#cluster1---> High-end destinations with premium prices and high customer ratings. Likely tourist hotspots.
#cluster2--> Low-cost locations with excellent ratings. Possibly small, lesser-known towns with highly appreciated stays.
#cluster3---> Mid-range price and quality. A diverse group with a mix of performance levels.

#rename clusters to have a clearer representation

cluster_data$cluster <- factor(cluster_data$cluster,
                                 levels = c("1", "2", "3"),
                                 labels = c("Premium", "Affordable & Top Rated", "Mid-Range"))


summary_clusters$cluster  <- factor(summary_clusters$cluster,
                                 levels = c("1", "2", "3"),
                                 labels = c("Premium", "Affordable & Top Rated", "Mid-Range"))

#CHOROPLETH MAP

#represent the previously defined clusters on a map with different colors depending on clusters' price and average ratings

# Libreries


library(sf)

## Warning: il pacchetto 'sf' è stato creato con R versione 4.4.3

## Linking to GEOS 3.13.0, GDAL 3.10.1, PROJ 9.5.1; sf_use_s2() is TRUE

library(dplyr)
library(ggplot2)

# Carica e prepara lo shapefile
sf_gb <- read_sf("Com01012025_g", layer = "Com01012025_g_WGS84") %>%
  st_make_valid() %>%
  st_transform(crs = 4326) %>%
  mutate(COMUNE = toupper(COMUNE))

# Prepara cluster_data
cluster_data <- cluster_data %>%
  mutate(municipality = toupper(municipality))

# Join e filtro per Palermo
sf_palermo <- sf_gb %>%
  left_join(cluster_data %>% select(municipality, cluster),
            by = c("COMUNE" = "municipality")) %>%
  filter(COD_PROV == "82")

# Separa comuni clusterizzati
sf_clustered <- sf_palermo %>% filter(!is.na(cluster))

# Mappa
ggplot() +
  geom_sf(data = sf_palermo, fill = "grey90", color = "white", size = 0.2) +
  geom_sf(data = sf_clustered, aes(fill = as.factor(cluster)), color = "white", size = 0.3) +
  scale_fill_manual(
    values = c("darkblue", "dodgerblue", "lightblue"),
    name = "Cluster",
    labels = c("Premium", "Affordable & High-Quality", "Mid-Range")
  ) +
  labs(
    title = "Clustering of Palermo Province Municipalities on Quality and Price",
    subtitle = "Source: ISTAT",
    caption = "Market data analysis @ LUISS"
  ) +
  theme_minimal() +
  theme_void() +
  theme(legend.position = "right")

#ITINERARY
#It’s a function that returns an itinerary based on the user’s total budget, number of nights, and chosen theme among:

#Sea
#Mountain / Mountain & Nature
#Relaxation

#Culture / History & Culture
#Entertainment
#Food & Wine
#Spirituality / Spirituality & Tradition


library(dplyr)
library(geosphere)

## Warning: il pacchetto 'geosphere' è stato creato con R versione 4.4.3

preferences <- read_excel("city_preferences.xlsx")

#group by for a better visualization
preferences <- preferences %>%
  distinct(municipality, preference) %>%         
  group_by(municipality) %>%
  summarise(
    preferences_list = paste(sort(unique(preference)), collapse = ", ")
  ) %>%
  arrange(municipality)


# calculate centroids with lat and long
sf_centroids_coords <- sf_gb %>%
  st_centroid(of_largest_polygon = TRUE) %>%
  mutate(
    lat = st_coordinates(geometry)[, 2],
    lon = st_coordinates(geometry)[, 1],
    COMUNE = toupper(COMUNE)
  ) %>%
  st_drop_geometry() %>%
  select(COMUNE, lat, lon)

## Warning: st_centroid assumes attributes are constant over geometries

# Join:add lat/lon and preferences to cluster_data
cluster_data <- cluster_data %>%
  mutate(municipality = toupper(trimws(municipality))) %>%
  left_join(sf_centroids_coords,
            by = c("municipality" = "COMUNE")) %>%
  left_join(preferences,     
            by = "municipality")

#function

suggest_itinerary <- function(budget,
                              total_number_nights,
                              summary_clusters,
                              cluster_data,
                              user_preference) {

  cat("\nStarting smart itinerary suggestion …\n")

 #column check
  
  needed <- c("municipality", "lat", "lon",
              "mean_price", "mean_detailed_score",
              "cluster", "preferences_list")   # ← nome corretto
  if (!all(needed %in% names(cluster_data))) {
    stop("cluster_data must contain: ",
         paste(needed, collapse = ", "))
  }

  # Cluster selection based on budget
 
  budget_per_night <- budget / total_number_nights
  cat("Total budget:", budget, "€  |  per-night budget:",
      round(budget_per_night, 2), "€\n")
  cat("Requested theme:", user_preference, "\n")

  chosen_cluster <- summary_clusters %>%
    filter(average_price <= budget_per_night) %>%
    arrange(desc(average_price)) %>%
    slice_head(n = 1) %>%
    pull(cluster)

  if (length(chosen_cluster) == 0) {
    cat("No cluster fits the budget.\n"); return(NULL)
  }
  cat("Chosen cluster:", chosen_cluster, "\n")

  # Filter cities by cluster AND preferences_list
 
  cities_pool <- cluster_data %>%
    filter(cluster == chosen_cluster,
           grepl(user_preference,
                 preferences_list,           # ← colonna corretta
                 ignore.case = TRUE),
           !is.na(lat), !is.na(lon)) %>%
    distinct(municipality, .keep_all = TRUE)

  if (nrow(cities_pool) == 0) {
    cat("No cities with theme",
        user_preference, "in that cluster.\n"); return(NULL)
  }

  # How many cities (~1 every 2.5 nights)
  
  n_cities <- max(1,
                  min( ceiling(total_number_nights / 2.5),
                       nrow(cities_pool) ) )
  cat("Number of cities to visit:", n_cities, "\n")

  set.seed(as.integer(Sys.time()))
  weights <- cities_pool$mean_detailed_score /
             sum(cities_pool$mean_detailed_score)
  picked <- cities_pool %>%
    slice_sample(n = n_cities, weight_by = weights)

  # Allocate nights  (≥1 per city, variable)
 
  base_nights      <- rep(1, n_cities)
  remaining_nights <- total_number_nights - n_cities

  if (remaining_nights > 0) {
    extra_weights <- (picked$mean_detailed_score^2) /
                     sum(picked$mean_detailed_score^2)
    extra_nights  <- rmultinom(1, remaining_nights,
                               prob = extra_weights) |>
                     as.numeric()

    # ensure variation
    if (remaining_nights >= 2 &&
        length(unique(extra_nights)) == 1) {
      take <- sample(which(extra_nights > 0), 1)
      give <- sample(setdiff(seq_along(extra_nights), take), 1)
      extra_nights[take] <- extra_nights[take] - 1
      extra_nights[give] <- extra_nights[give] + 1
    }
  } else {
    extra_nights <- rep(0, n_cities)
  }

  picked$nights     <- base_nights + extra_nights
  picked$total_cost <- round(picked$mean_price * picked$nights, 2)

  #  Re-balance if cost exceeds budget
  
  attempts <- 0
  while (sum(picked$total_cost) > budget && attempts < 15) {
    idx <- which.max(picked$mean_price)
    if (picked$nights[idx] > 1) {
      picked$nights[idx] <- picked$nights[idx] - 1
      picked$total_cost  <- round(picked$mean_price * picked$nights, 2)
    }
    attempts <- attempts + 1
  }

  # Order by nearest neighbour
  
  if (n_cities > 1) {
    dist_mat <- distm(picked[, c("lon", "lat")])
    order_v  <- integer(n_cities); visited <- rep(FALSE, n_cities)
    order_v[1] <- 1; visited[1] <- TRUE

    for (i in 2:n_cities) {
      last <- order_v[i - 1]
      cand <- which.min(ifelse(visited, Inf, dist_mat[last, ]))
      if (length(cand) == 0 || visited[cand]) break
      order_v[i] <- cand; visited[cand] <- TRUE
    }
    order_v <- order_v[order_v > 0]
    picked  <- picked[order_v, ]
  }

  # Output
 
  itinerary <- picked %>%
  select(municipality, preferences_list,
         lat, lon, nights,
         price_per_night = mean_price,
         total_cost)

cat("\nSuggested itinerary:\n")
print(itinerary)
}

#recall function

itinerary <- suggest_itinerary(
  budget              = 800,
  total_number_nights = 5,
  summary_clusters    = summary_clusters,
  cluster_data        = cluster_data,      # deve includere “preferences_list”
  user_preference     = "Cultura")

## 
## Starting smart itinerary suggestion …
## Total budget: 800 €  |  per-night budget: 160 €
## Requested theme: Cultura 
## Chosen cluster: 1 
## Number of cities to visit: 2 
## 
## Suggested itinerary:
## # A tibble: 2 × 7
##   municipality preferences_list      lat   lon nights price_per_night total_cost
##   <chr>        <chr>               <dbl> <dbl>  <dbl>           <dbl>      <dbl>
## 1 CACCAMO      Cultura e Storia, …  37.9  13.7      2            116.       232.
## 2 MISILMERI    Cultura e Storia     38.0  13.4      3            132.       397.

Palermo Analysis