Spatial Determinants of Airbnb Pricing in Rome: A Machine Learning Approach

1. Introduction

The market for short-term rentals has drawn a lot of academic attention, especially in places like Rome that have a strong cultural history. Analytical techniques that take into consideration geographic context and geographical dependency among observations are necessary to fully understand the spatial drivers of nightly pricing on platforms such as Airbnb. This study uses spatial machine learning techniques to examine how neighborhood features, the density of local amenities, and proximity to historical sites affect Airbnb nightly rates in Rome, Italy.

The analysis proceeds through four stages: (1) spatial feature engineering using OpenStreetMap data; (2) formal testing for spatial autocorrelation using Moran’s I statistic; (3) exploratory spatial data analysis of listing price distributions; and (4) predictive modeling using a Random Forest algorithm with spatially-derived covariates. 37,652 Airbnb listings were collected from the Inside Airbnb project to create the original dataset. The research is carried out on a representative random sample of 10,000 postings to guarantee computational tractability during spatial feature engineering, offering thorough coverage of Rome’s short-term rental market.

2. Data and Methodology

2.1 Libraries and Environment

The analysis utilises several R packages for spatial data manipulation (sf), OpenStreetMap querying (osmdata), spatial weights and autocorrelation testing (spdep), and machine learning (ranger).

library(sf)        # Spatial data handling
library(osmdata)   # OpenStreetMap data extraction
library(spdep)     # Spatial dependence and weights
library(ranger)    # Fast Random Forest implementation
library(dplyr)     # Data manipulation
library(ggplot2)   # Static visualisation
library(leaflet)   # Interactive web maps
library(tmap)      # Thematic cartography

2.2 Data Loading and Preprocessing

Each listing in the dataset has 79 variables, such as review scores, pricing details, property attributes, and geographic coordinates. Preprocessing involves cleaning the price variable (removing currency symbols and commas), filtering implausible values focusing on standard rentals rather than extreme luxury villas, and converting the tabular data into a projected spatial object suitable for distance calculations.

# Loading raw CSV and converting to spatial object
airbnb_raw <- read.csv("listings.csv") %>%
  filter(!is.na(price), !is.na(latitude), !is.na(longitude)) %>%
  # Cleaning price string (e.g., "$1,200.00" -> 1200)
  mutate(price_num = as.numeric(gsub("[^0-9.]", "", price))) %>%
  filter(price_num > 0 & price_num < 1000) # Removing zero and extreme outlier values

# Converting to sf object with WGS84, then project to UTM Zone 33N (EPSG:32633)
# for accurate Euclidean distance calculations in meters
airbnb_sf <- st_as_sf(airbnb_raw, coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(32633)

cat("Total listings after filtering:", nrow(airbnb_sf), "\n")

## Total listings after filtering: 33158

# Drawing a random sample to ensure computational tractability
# for spatial operations (buffering, intersection with OSM features)
set.seed(42)
airbnb_sf <- airbnb_sf %>% sample_n(min(10000, nrow(airbnb_sf)))
cat("Sample size for analysis:", nrow(airbnb_sf), "\n")

## Sample size for analysis: 10000

3. Exploratory Spatial Data Analysis

The spatial distribution of listing prices is visualised below using an interactive point map. A visual assessment indicates significant spatial variation, with more expensive listings concentrated in Rome’s historic district (Centro Storico) and close to popular tourist destinations.

# Transforming back to WGS84 for Leaflet visualisation
airbnb_map <- st_transform(airbnb_sf, 4326)

# Defining colour palette
pal <- colorNumeric(palette = "YlOrRd", domain = airbnb_map$price_num)

leaflet(airbnb_map) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(
    radius = 3,
    color = ~pal(price_num),
    stroke = FALSE, fillOpacity = 0.7,
    popup = ~paste0("Price: $", price_num, "<br>",
                    "Room type: ", room_type, "<br>",
                    "Neighbourhood: ", neighbourhood_cleansed)
  ) %>%
  addLegend("bottomright", pal = pal, values = ~price_num, title = "Nightly Price ($)")

4. Spatial Feature Engineering

To capture locational externalities, two types of spatial characteristics are established:

The Euclidean distance (in meters) between each listing and the Colosseum, one of Rome’s most famous tourist attractions.
The number of cafés within a 500-meter radius around each listing, known as the “amenity density” and used as a stand-in for walkability and neighborhood vitality.

The Overpass API is used to retrieve these characteristics from OpenStreetMap.

# 4.1 Distance to the Colosseum (approximate coordinates)
colosseum_pt <- st_sfc(st_point(c(12.4922, 41.8902)), crs = 4326) %>%
  st_transform(32633)

airbnb_sf$dist_to_colosseum <- as.numeric(st_distance(airbnb_sf, colosseum_pt))

# 4.2 Extracting café locations from OpenStreetMap
rome_bbox <- getbb("Rome, Italy")

cafes_osm <- opq(bbox = rome_bbox) %>%
  add_osm_feature(key = 'amenity', value = 'cafe') %>%
  osmdata_sf()

cafes_sf <- cafes_osm$osm_points %>%
  filter(!is.na(geometry)) %>%
  st_transform(32633)

# 4.3 Radial zoning: counting cafés within 500m buffer
buffers_500m <- st_buffer(airbnb_sf, dist = 500)
airbnb_sf$cafe_count_500m <- lengths(st_intersects(buffers_500m, cafes_sf))

# 4.4 Extracting projected coordinates as features (spatial trend surface)
coords <- st_coordinates(airbnb_sf)
airbnb_sf$X <- coords[, 1]
airbnb_sf$Y <- coords[, 2]

cat("Distance to Colosseum - Mean:", round(mean(airbnb_sf$dist_to_colosseum), 0), "m\n")

## Distance to Colosseum - Mean: 4030 m

cat("Café count (500m) - Mean:", round(mean(airbnb_sf$cafe_count_500m), 1), "\n")

## Café count (500m) - Mean: 24.8

5. Spatial Autocorrelation Analysis

Prior to applying a standard machine learning model, it is essential to assess whether the dependent variable exhibits spatial autocorrelation: i.e., whether nearby listings tend to have similar prices. If present, this violates the independence assumption of classical regression and motivates the inclusion of spatial covariates.

A K-Nearest Neighbours (KNN) spatial weight matrix is constructed with \(k = 5\), and Moran’s I statistic is computed under the null hypothesis of complete spatial randomness.

# Constructing KNN spatial weight matrix (k=5)
coords_matrix <- st_coordinates(airbnb_sf)
knn <- knearneigh(coords_matrix, k = 5)
nb <- knn2nb(knn)
listw <- nb2listw(nb, style = "W")

# Global Moran's I test on price
moran_test <- moran.test(airbnb_sf$price_num, listw)
print(moran_test)

## 
##  Moran I test under randomisation
## 
## data:  airbnb_sf$price_num  
## weights: listw    
## 
## Moran I statistic standard deviate = 42.572, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      2.511983e-01     -1.000100e-04      3.484414e-05

A statistically significant positive Moran’s I statistic (\(I > 0\), \(p < 0.001\)) indicates the presence of positive spatial autocorrelation: high-priced listings tend to cluster near other high-priced listings, and conversely for lower-priced properties. This finding highlights the necessity of accounting for spatial structure. While traditional spatial autoregressive models explicitly model this dependence, we address it by engineering robust spatial covariates (distance metrics, density) and providing spatial coordinates (X, Y) to allow the Random Forest algorithm to implicitly learn these non-linear spatial trends.

6. Predictive Modelling: Random Forest

A Random Forest model is trained to predict nightly price using both spatial features (distance to Colosseum, café density, projected coordinates) and intrinsic listing attributes (guest capacity, number of bedrooms, beds, and room type). The ranger implementation is included for computational efficiency with permutation-based variable importance estimation.

# Preparing modelling data frame (drop geometry)
ml_data <- airbnb_sf %>%
  st_drop_geometry() %>%
  select(price_num, dist_to_colosseum, cafe_count_500m, X, Y,
         accommodates, bedrooms, beds, room_type) %>%
  mutate(
    bedrooms = ifelse(is.na(bedrooms), 0, bedrooms),
    beds = ifelse(is.na(beds), 1, beds)
  ) %>%
  na.omit()

# Encoding categorical variable
ml_data$room_type <- as.factor(ml_data$room_type)

# Train/test split (80/20)
set.seed(123)
train_idx <- sample(1:nrow(ml_data), 0.8 * nrow(ml_data))
train_data <- ml_data[train_idx, ]
test_data <- ml_data[-train_idx, ]

cat("Training set size:", nrow(train_data), "\n")

## Training set size: 8000

cat("Test set size:", nrow(test_data), "\n")

## Test set size: 2000

# Training Random Forest
rf_model <- ranger(
  formula = price_num ~ .,
  data = train_data,
  num.trees = 500,
  importance = 'permutation',
  seed = 123
)

print(rf_model)

## Ranger result
## 
## Call:
##  ranger(formula = price_num ~ ., data = train_data, num.trees = 500,      importance = "permutation", seed = 123) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      8000 
## Number of independent variables:  8 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         permutation 
## Splitrule:                        variance 
## OOB prediction error (MSE):       7136.216 
## R squared (OOB):                  0.4432639

7. Model Evaluation and Residual Analysis

7.1 Variable Importance

Permutation importance quantifies the increase in prediction error when each feature’s values are randomly shuffled, thereby breaking the association with the response variable.

importance_df <- data.frame(
  Variable = names(rf_model$variable.importance),
  Importance = rf_model$variable.importance
) %>%
  arrange(desc(Importance))

ggplot(importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Permutation Variable Importance",
    subtitle = "Random Forest Model (500 Trees)",
    x = "Feature",
    y = "Importance (Increase in MSE)"
  )

7.2 Prediction Accuracy

# Generating predictions on held-out test set
predictions <- predict(rf_model, data = test_data)$predictions
test_data$predicted_price <- predictions
test_data$residual <- test_data$price_num - test_data$predicted_price

# Performance metrics
rmse <- sqrt(mean(test_data$residual^2))
mae <- mean(abs(test_data$residual))
r_squared <- 1 - sum(test_data$residual^2) / sum((test_data$price_num - mean(test_data$price_num))^2)

cat("Test Set Performance Metrics:\n")

## Test Set Performance Metrics:

cat("  RMSE: $", round(rmse, 2), "\n")

##   RMSE: $ 84.61

cat("  MAE:  $", round(mae, 2), "\n")

##   MAE:  $ 51.3

cat("  R²:   ", round(r_squared, 4), "\n")

##   R²:    0.456

7.3 Spatial Distribution of Residuals

It is possible to determine whether systematic geographical patterns are still not described by the model by mapping model residuals (actual minus predicted price). Clustered residuals with the same sign would reveal missing spatial variables and show residual spatial autocorrelation.

# Re-attaching geometry from coordinates for mapping
test_sf <- st_as_sf(test_data, coords = c("X", "Y"), crs = 32633) %>%
  st_transform(4326)

# Diverging palette: red = model underpredicts, blue = model overpredicts
res_pal <- colorNumeric(palette = "RdBu", domain = test_sf$residual, reverse = TRUE)

leaflet(test_sf) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(
    radius = 4,
    color = ~res_pal(residual),
    stroke = FALSE, fillOpacity = 0.8,
    popup = ~paste0("Actual: $", price_num,
                    "<br>Predicted: $", round(predicted_price, 2),
                    "<br>Residual: $", round(residual, 2))
  ) %>%
  addLegend("bottomright", pal = res_pal, values = ~residual, title = "Residual ($)")

8. Conclusion

This study demonstrates the utility of spatial machine learning for modelling short-term rental prices. Locational context is a primary driver of Airbnb price in Rome, as confirmed by the significant positive spatial autocorrelation found using Moran’s I. The Random Forest model captures both macro-scale geographical trends and micro-scale neighborhood influence by combining OpenStreetMap-derived spatial variables (distance to the Colosseum and local café density) with projected coordinates and fundamental listing properties.

The residual map offers analytical insight into areas where the model consistently overpredicts or underpredicts, indicating future study topics, while the variable relevance analysis shows the relative contribution of spatial vs non-spatial predictors. Potential improvements include using spatially weighted models, adding more locations of interest (such as restaurants or metro stations), or implementing spatial cross-validation techniques to reduce spatial leakage in model evaluation.