Spatial Econometrics Project

Author

Pasquale Gravante

INTRODUCTION

Spatial Analysis on London AirBnb Data

This dataset provides a comprehensive look at Airbnb prices in London on weekends. Each listing is described by attributes such as room type, cleanliness and guest satisfaction ratings, number of bedrooms, and distance from the city centre, which allows a detailed analysis of Airbnb prices.

Data Description

realSum: The total price of the Airbnb listing. (Numeric)
room_type: The type of room being offered (e.g. private, shared, etc.). (Categorical)
room_shared: Whether the room is shared or not. (Boolean)
room_private: Whether the room is private or not. (Boolean)
person_capacity: The maximum number of people that can stay in the room. (Numeric)
host_is_superhost: Whether the host is a superhost or not. (Boolean)
multi: Whether the listing is for multiple rooms or not. (Boolean)
biz: Whether the listing is for business purposes or not. (Boolean)
cleanliness_rating: The cleanliness rating of the listing. (Numeric)
guest_satisfaction_overall: The overall guest satisfaction rating of the listing. (Numeric)
bedrooms: The number of bedrooms in the listing. (Numeric)
dist: The distance from the city centre. (Numeric)
metro_dist: The distance from the nearest metro station. (Numeric)
lng: The longitude of the listing. (Numeric)
lat: The latitude of the listing. (Numeric)

Research question

How do spatial and non-spatial factors influence Airbnb prices in a given city, and what are the implications?

DATA PREPARATION

Loading libraries

library(tidyverse)
library(GWmodel)
library(cluster)
library(factoextra)
library(leaflet)
library(sf)
library(sp)
library(spatialreg)
library(spdep)
library(spgwr)

Loading Data (Shapefile and csv file)

Both the shapefile with the London ward boundaries and the CSV file with the Airbnb listing data are loaded.

london <- st_read("London_Ward_CityMerged.shp")
Reading layer `London_Ward_CityMerged' from data source 
  `/Users/paky/Desktop/Spatial Econometrics/Spatial Econometrics Project/London_Ward_CityMerged.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 625 features and 7 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 503568.2 ymin: 155850.8 xmax: 561957.5 ymax: 200933.9
Projected CRS: OSGB36 / British National Grid
airbnb <- read_csv("london_weekends.csv")

EDA

In this section the structure of the economic data is explored.

Data Structure

dim(airbnb)
[1] 5379   20
summary(airbnb)
      ...1         realSum          room_type         room_shared    
 Min.   :   0   Min.   :   54.33   Length:5379        Mode :logical  
 1st Qu.:1344   1st Qu.:  174.51   Class :character   FALSE:5352     
 Median :2689   Median :  268.12   Mode  :character   TRUE :27       
 Mean   :2689   Mean   :  364.39                                     
 3rd Qu.:4034   3rd Qu.:  438.27                                     
 Max.   :5378   Max.   :12937.27                                     
 room_private    person_capacity host_is_superhost     multi       
 Mode :logical   Min.   :2.000   Mode :logical     Min.   :0.0000  
 FALSE:2445      1st Qu.:2.000   FALSE:4484        1st Qu.:0.0000  
 TRUE :2934      Median :2.000   TRUE :895         Median :0.0000  
                 Mean   :2.858                     Mean   :0.2798  
                 3rd Qu.:4.000                     3rd Qu.:1.0000  
                 Max.   :6.000                     Max.   :1.0000  
      biz         cleanliness_rating guest_satisfaction_overall    bedrooms    
 Min.   :0.0000   Min.   : 2.000     Min.   : 20.00             Min.   :0.000  
 1st Qu.:0.0000   1st Qu.: 9.000     1st Qu.: 87.00             1st Qu.:1.000  
 Median :0.0000   Median :10.000     Median : 94.00             Median :1.000  
 Mean   :0.3579   Mean   : 9.194     Mean   : 90.92             Mean   :1.133  
 3rd Qu.:1.0000   3rd Qu.:10.000     3rd Qu.: 99.00             3rd Qu.:1.000  
 Max.   :1.0000   Max.   :10.000     Max.   :100.00             Max.   :8.000  
      dist            metro_dist        attr_index      attr_index_norm  
 Min.   : 0.04056   Min.   :0.01388   Min.   :  68.74   Min.   :  4.778  
 1st Qu.: 3.54568   1st Qu.:0.32404   1st Qu.: 177.22   1st Qu.: 12.320  
 Median : 4.93914   Median :0.53613   Median : 247.65   Median : 17.215  
 Mean   : 5.32762   Mean   :1.01653   Mean   : 294.58   Mean   : 20.477  
 3rd Qu.: 6.83807   3rd Qu.:1.09076   3rd Qu.: 361.07   3rd Qu.: 25.099  
 Max.   :17.32120   Max.   :9.17409   Max.   :1438.56   Max.   :100.000  
   rest_index     rest_index_norm        lng                lat       
 Min.   : 140.5   Min.   :  2.515   Min.   :-0.25170   Min.   :51.41  
 1st Qu.: 382.1   1st Qu.:  6.839   1st Qu.:-0.16996   1st Qu.:51.49  
 Median : 527.3   Median :  9.439   Median :-0.11813   Median :51.51  
 Mean   : 625.6   Mean   : 11.197   Mean   :-0.11478   Mean   :51.50  
 3rd Qu.: 764.2   3rd Qu.: 13.678   3rd Qu.:-0.06772   3rd Qu.:51.53  
 Max.   :5587.1   Max.   :100.000   Max.   : 0.12018   Max.   :51.58  
head(airbnb)
# A tibble: 6 × 20
   ...1 realSum room_type       room_shared room_private person_capacity
  <dbl>   <dbl> <chr>           <lgl>       <lgl>                  <dbl>
1     0    121. Private room    FALSE       TRUE                       2
2     1    196. Private room    FALSE       TRUE                       2
3     2    193. Private room    FALSE       TRUE                       3
4     3    180. Private room    FALSE       TRUE                       2
5     4    406. Entire home/apt FALSE       FALSE                      3
6     5    354. Entire home/apt FALSE       FALSE                      2
# ℹ 14 more variables: host_is_superhost <lgl>, multi <dbl>, biz <dbl>,
#   cleanliness_rating <dbl>, guest_satisfaction_overall <dbl>, bedrooms <dbl>,
#   dist <dbl>, metro_dist <dbl>, attr_index <dbl>, attr_index_norm <dbl>,
#   rest_index <dbl>, rest_index_norm <dbl>, lng <dbl>, lat <dbl>

The data cover 5379 listings and 20 variables; the first column (...1) is only a row index. A subset of the variables is selected later.

Listing on the map

ggplot(airbnb, aes(x = lng, y = lat)) +
  geom_point(alpha = 0.5) +
  theme_minimal()

This plot shows where Airbnb listings are dense and where they are sparse.

Data Manipulation

Firstly, we ensure that both datasets have the same coordinate reference system (CRS), WGS 84 (EPSG:4326), and assign each listing to the ward polygon that contains it.

# Ensure both datasets have the same coordinate reference system (CRS)
london <- st_transform(london, 4326)
airbnb_spatial <- st_as_sf(airbnb, coords = c("lng", "lat"), crs=4326)
joined_data <- st_join(airbnb_spatial, london, join = st_within)

Then, a subset of variables is selected and their values are aggregated by polygon (i.e., for each area of London, the average value of each variable over the listings located in that area is taken).

# List of variables
vars <- c("realSum","person_capacity", "bedrooms", "dist", "guest_satisfaction_overall", "cleanliness_rating")

# Aggregate point data by polygon (area) to compute mean values
polygon_summary <- joined_data %>%
  group_by(POLY_ID) %>%
  summarise(
    across(all_of(vars), ~ mean(.x, na.rm = TRUE))  
  ) %>%
  st_drop_geometry()

# Retrieve polygon geometries from the original London dataset
polygon_geometries <- london %>%
  select(POLY_ID, geometry)

# Merge polygon geometries with polygon summary using left_join
final_summary <- left_join(polygon_summary, polygon_geometries, by = "POLY_ID")

# Convert final_summary to an sf object
final_summary <- st_as_sf(final_summary)

We now have a new dataset containing 223 areas of London and 6 economic variables for each one.
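
Area averages based on only one or two listings are noisy, so it can help to keep the number of listings per area next to the means. The following base R sketch shows the idea on a small made-up data frame (the `toy` object and its values are assumptions for illustration, not the Airbnb data):

```r
# Toy listings: hypothetical values, not taken from london_weekends.csv
toy <- data.frame(
  POLY_ID = c(1, 1, 1, 2, 2, 3),
  realSum = c(100, 200, 300, 150, 250, 400)
)

# Mean price per area (same idea as the group_by/summarise step)
area_mean <- aggregate(realSum ~ POLY_ID, data = toy, FUN = mean)

# Number of listings behind each mean
area_n <- aggregate(realSum ~ POLY_ID, data = toy, FUN = length)
names(area_n)[2] <- "n_listings"

# One row per area: mean price and listing count
area_summary <- merge(area_mean, area_n, by = "POLY_ID")
print(area_summary)
```

Areas with very small counts could then be flagged or examined separately before modeling.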

MODELING

Spatial weights

cont.sf <- poly2nb(final_summary)
spatial_weights <- nb2listw(cont.sf, style="W")

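A row-standardized spatial weights matrix (style = "W") is built from the contiguity neighbors of each area; poly2nb uses queen contiguity by default, so areas that share at least one boundary point are treated as neighbors.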
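
To make the effect of style = "W" concrete, here is a minimal base R sketch on a hypothetical arrangement of four areas in a row (the matrix `B` and the values in `y` are made up for illustration). Each row of the weights matrix is divided by its number of neighbors, so the spatial lag of a variable is simply the average over neighboring areas:

```r
# Hypothetical binary contiguity matrix for four areas in a row: 1-2-3-4
B <- matrix(c(0, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)

# Row-standardize: each row is divided by its number of neighbors
W <- B / rowSums(B)

# Hypothetical area values
y <- c(10, 20, 30, 40)

# Spatial lag: for each area, the average of its neighbors' values
lag_y <- as.vector(W %*% y)

print(rowSums(W))  # every row sums to 1
print(lag_y)
```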

Spatial autocorrelation

Moran’s test for spatial autocorrelation is then performed:

moran.test(final_summary$realSum, spatial_weights)

    Moran I test under randomisation

data:  final_summary$realSum  
weights: spatial_weights    

Moran I statistic standard deviate = 13.104, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.521592954      -0.004504505       0.001611774 
  • The Moran’s I test results indicate that there is a significant positive spatial autocorrelation in the price variable among the spatial areas (polygons) represented in the dataset.

  • This positive autocorrelation suggests that values of price tend to be similar among neighboring polygons, implying spatial clustering or patterns in the distribution of this variable across the study area.

  • The strong statistical significance (very low p-value) reinforces the conclusion that the observed spatial autocorrelation is unlikely to occur by random chance alone.
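
The test statistic reported above can be reproduced from the three sample estimates. The sketch below uses only base R and the numbers printed by moran.test(); the number of areas (223) is taken from the aggregation step:

```r
# Values copied from the moran.test() output above
I_obs <- 0.521592954
I_exp <- -0.004504505
I_var <- 0.001611774
n     <- 223

# Under the null of no spatial autocorrelation, E[I] = -1 / (n - 1)
expected_check <- -1 / (n - 1)

# Standard deviate: distance of the observed I from its expectation,
# measured in standard deviations
z <- (I_obs - I_exp) / sqrt(I_var)

# One-sided p-value (alternative hypothesis: greater)
p <- pnorm(z, lower.tail = FALSE)

print(round(expected_check, 9))
print(round(z, 3))
```

The observed Moran's I lies about 13 standard deviations above its expectation, which is why the p-value is reported as smaller than 2.2e-16.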

Spatial Lag Model

# Define the formula for the spatial lag model
formula_lag <- realSum ~ person_capacity + bedrooms + dist + guest_satisfaction_overall + cleanliness_rating

# Fit the spatial lag model
model_lag <- lagsarlm(formula_lag, data = final_summary, listw = spatial_weights)

# View summary of the spatial lag model
summary(model_lag)

Call:
lagsarlm(formula = formula_lag, data = final_summary, listw = spatial_weights)

Residuals:
     Min       1Q   Median       3Q      Max 
-170.117  -51.598  -16.053   26.518  915.766 

Type: lag 
Coefficients: (asymptotic standard errors) 
                           Estimate Std. Error z value  Pr(>|z|)
(Intercept)                134.6892   200.8917  0.6705    0.5026
person_capacity            123.5157    20.2611  6.0962 1.086e-09
bedrooms                    37.5831    31.9576  1.1760    0.2396
dist                       -15.4734     3.2313 -4.7886 1.680e-06
guest_satisfaction_overall  -3.8703     3.1624 -1.2238    0.2210
cleanliness_rating          16.8386    30.6019  0.5502    0.5821

Rho: 0.2998, LR test value: 13.459, p-value: 0.00024381
Asymptotic standard error: 0.083576
    z-value: 3.5872, p-value: 0.00033427
Wald statistic: 12.868, p-value: 0.00033427

Log likelihood: -1348.753 for lag model
ML residual variance (sigma squared): 10309, (sigma: 101.53)
Number of observations: 223 
Number of parameters estimated: 8 
AIC: 2713.5, (AIC for lm: 2725)
LM test for residual autocorrelation
test value: 2.1082, p-value: 0.14652

Interpretation:

Coefficients:

  • Intercept: The estimated intercept is 134.6892, which represents the expected value of price when all other predictor variables are zero.

  • person_capacity: The coefficient is 123.5157 and highly significant (p-value < 0.001). Because the model includes a spatial lag of price, this is not the full marginal effect: a change in person_capacity also feeds back through neighboring areas, so the total effect is larger than the coefficient alone.

  • bedrooms: The coefficient for bedrooms is 37.5831, suggesting that an increase in the number of bedrooms is associated with an increase in price, although the p-value (0.2396) indicates that this relationship is not statistically significant at conventional levels.

  • dist: The coefficient for dist (distance from the city centre) is -15.4734 and statistically significant (p-value < 0.001), indicating that areas farther from the centre tend to have lower prices.

  • guest_satisfaction_overall and cleanliness_rating: These coefficients are not statistically significant (p-values > 0.05), suggesting that there is insufficient evidence to conclude that these variables have a linear relationship with price.

Spatial Autocorrelation:

  • Rho: The spatial autoregressive parameter (rho) is estimated to be 0.2998 and is statistically significant (LR test p-value: 0.00024381). This indicates positive spatial dependence: the price in an area tends to be higher when prices in neighboring areas are higher.

Model Fit:

  • AIC: The Akaike Information Criterion (AIC) for the lag model is 2713.5, which is lower than the AIC for a standard linear regression (lm), indicating that the spatial lag model provides a better fit.

Residual Autocorrelation Test:

  • LM Test for Residual Autocorrelation: The LM test statistic (2.1082) has a p-value of 0.14652. Since this is above 0.05, there is no significant evidence of remaining autocorrelation in the residuals once the spatial lag is included.

Conclusions:

The spatial lag model reveals significant relationships between the price and the variables person_capacity and dist, while also detecting positive spatial autocorrelation, which suggests that nearby observations are more similar than those farther apart. This model provides valuable insights into the spatial dependency of the price variable and the role of different predictors in explaining variations in this variable.
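
In a spatial lag model a change in a predictor spreads through neighboring prices, so a coefficient is not the full marginal effect. With row-standardized weights, the average total impact is the coefficient divided by (1 - rho). The sketch below applies this to the estimates reported above; the small weights matrix is hypothetical and only serves to check the multiplier. In practice, the impacts() function in spatialreg reports direct, indirect and total effects separately.

```r
# Estimates copied from the spatial lag model summary above
rho  <- 0.2998
beta <- c(person_capacity = 123.5157, dist = -15.4734)

# Average total impact with row-standardized weights: beta / (1 - rho)
total_impact <- beta / (1 - rho)
print(round(total_impact, 2))

# Check the multiplier on a hypothetical row-standardized matrix
# (four areas in a row: 1-2-3-4)
B <- matrix(c(0, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)
W <- B / rowSums(B)
multiplier <- solve(diag(4) - rho * W)

# Each row of (I - rho W)^-1 sums to 1 / (1 - rho)
print(rowSums(multiplier))
print(1 / (1 - rho))
```

Under these assumptions, the total effect of person_capacity is roughly 40% larger than its coefficient.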

Spatial Error Model

Since Moran’s test indicates spatial autocorrelation, a spatial error model is also fitted; it places the spatial dependence in the error term rather than in a lag of price.

# Define the formula for the spatial error model
formula_error <- realSum ~ person_capacity + bedrooms + dist + guest_satisfaction_overall + cleanliness_rating

# Fit the spatial error model using errorsarlm
model_error <- errorsarlm(formula_error, data = final_summary, listw = spatial_weights)

# Summarize the model results
summary(model_error)

Call:
errorsarlm(formula = formula_error, data = final_summary, listw = spatial_weights)

Residuals:
     Min       1Q   Median       3Q      Max 
-161.435  -50.154  -16.422   27.628  909.328 

Type: error 
Coefficients: (asymptotic standard errors) 
                           Estimate Std. Error z value  Pr(>|z|)
(Intercept)                250.3207   207.3326  1.2073    0.2273
person_capacity            124.4784    21.2092  5.8691 4.383e-09
bedrooms                    46.4639    31.7415  1.4638    0.1432
dist                       -23.8709     3.4822 -6.8551 7.127e-12
guest_satisfaction_overall  -2.5608     3.2461 -0.7889    0.4302
cleanliness_rating           6.0222    31.0584  0.1939    0.8463

Lambda: 0.32482, LR test value: 10.347, p-value: 0.0012967
Asymptotic standard error: 0.091854
    z-value: 3.5363, p-value: 0.00040582
Wald statistic: 12.505, p-value: 0.00040582

Log likelihood: -1350.309 for error model
ML residual variance (sigma squared): 10419, (sigma: 102.07)
Number of observations: 223 
Number of parameters estimated: 8 
AIC: 2716.6, (AIC for lm: 2725)

The spatial error model gives a similar picture: person_capacity and dist are the only significant predictors. The spatial error parameter Lambda (0.32482) is significant (LR test p-value: 0.0012967), and the AIC (2716.6) is lower than that of the linear model (2725), although slightly higher than that of the spatial lag model (2713.5).
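
The z-value, p-value and Wald statistic reported for Lambda are linked by simple formulas. A short base R sketch, using only the two numbers printed in the summary above:

```r
# Values copied from the spatial error model summary above
lambda    <- 0.32482
se_lambda <- 0.091854

# z-value: estimate divided by its asymptotic standard error
z <- lambda / se_lambda

# Two-sided p-value from the standard normal distribution
p <- 2 * pnorm(abs(z), lower.tail = FALSE)

# The Wald statistic is the squared z-value
wald <- z^2

print(round(c(z = z, wald = wald), 3))
print(signif(p, 3))
```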

Geographically Weighted Regression

Data preparation

Firstly, centroids of each London area are created.

crds.sf<-st_centroid(final_summary$geometry)
crds<-st_coordinates(crds.sf)

Then, the formula for the regression is created:

formula_gwr <- realSum ~ person_capacity + bedrooms + dist + guest_satisfaction_overall + cleanliness_rating

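Lastly, the optimal bandwidth for the kernel is selected by cross-validation. Note that the selection uses family = poisson(), while the model in the next step is fitted with the default Gaussian family of ggwr; using the same family in both calls would make the two steps consistent.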

bw<-ggwr.sel(formula_gwr, data=final_summary, coords=crds, family=poisson(), longlat=TRUE)
Bandwidth: 11.53587 CV score: 2384555 
Bandwidth: 18.64679 CV score: 2394005 
Bandwidth: 7.141076 CV score: 2360428 
Bandwidth: 4.424945 CV score: 2321810 
Bandwidth: 2.746284 CV score: 2277530 
Bandwidth: 1.708814 CV score: 2310278 
Bandwidth: 2.967365 CV score: 2285530 
Bandwidth: 2.520723 CV score: 2270120 
Bandwidth: 2.210601 CV score: 2265873 
Bandwidth: 2.174159 CV score: 2266271 
Bandwidth: 2.269214 CV score: 2265744 
Bandwidth: 2.251943 CV score: 2265722 
Bandwidth: 2.253018 CV score: 2265722 
Bandwidth: 2.252651 CV score: 2265722 
Bandwidth: 2.252611 CV score: 2265722 
Bandwidth: 2.252692 CV score: 2265722 
Bandwidth: 2.252651 CV score: 2265722 
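
The bandwidth controls how quickly the weight given to other areas decays with distance. As a rough sketch, assuming a Gaussian kernel of the form w = exp(-0.5 * (d / bw)^2) and distances in km (as with longlat = TRUE), the selected bandwidth implies the weights below. The kernel is written out by hand rather than called from spgwr, and the distances are made up:

```r
# Bandwidth selected above (assumed to be in km, since longlat = TRUE)
bw <- 2.252651

# Gaussian kernel, written out by hand: w = exp(-0.5 * (d / bw)^2)
gauss_weight <- function(d, bandwidth) exp(-0.5 * (d / bandwidth)^2)

# Hypothetical distances (km) between a regression point and other areas
d <- c(0, 1, bw, 5, 10)
w <- gauss_weight(d, bw)

print(round(w, 4))
```

An area exactly one bandwidth away receives a weight of exp(-0.5), about 0.61, and areas 10 km away contribute almost nothing to the local fit.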

Model

Finally, a Generalized Geographically Weighted Regression is run:

# Compute GGWR model with the selected bandwidth
ggwr_model <- ggwr(formula_gwr, data = final_summary, longlat = TRUE, coords = crds, bandwidth = bw)

# Summary of GGWR model
ggwr_model
Call:
ggwr(formula = formula_gwr, data = final_summary, coords = crds, 
    bandwidth = bw, longlat = TRUE)
Kernel function: gwr.Gauss 
Fixed bandwidth: 2.252651 
Summary of GWR coefficient estimates at data points:
                                 Min.    1st Qu.     Median    3rd Qu.
X.Intercept.               -807.27858  -20.86557  280.29804  484.31534
person_capacity              17.98834   81.95564  115.96462  158.79685
bedrooms                   -118.96642   -6.56119   38.20485  130.28437
dist                        -58.32765  -45.52561  -32.67451  -22.02527
guest_satisfaction_overall  -26.45223   -6.34695   -3.20808    0.50117
cleanliness_rating         -235.73410  -26.11439   17.71770   71.31600
                                 Max.   Global
X.Intercept.                976.63962 343.6122
person_capacity             225.52532 147.3500
bedrooms                    310.90045  23.0610
dist                         -6.03762 -22.5523
guest_satisfaction_overall   14.05380  -5.7395
cleanliness_rating          267.55642  22.8392
  • GWR coefficient estimates help to understand how the relationships between variables differ across space, providing insights into local variations that may not be captured by a traditional global regression model. They highlight the spatial heterogeneity in the studied relationships and can be used as a guide to more targeted and context-specific interpretations.

  • For instance, person_capacity has a median coefficient estimate of about 116, an interquartile range from about 82 to 159, and an overall range from about 18 to 226. The effect of person_capacity on price is therefore positive everywhere, but its strength varies substantially depending on the geographic context.

Visualization

We can see this graphically:

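plots_data <- final_summary
# Map the local coefficient of each predictor (one ggplot per predictor;
# par(mfrow = ...) is not used because it has no effect on ggplot2 graphics)
plots_data$GWR.person_capacity<-ggwr_model$SDF$person_capacity
ggplot()+geom_sf(data=plots_data, aes(fill=GWR.person_capacity))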

plots_data$GWR.bedrooms<-ggwr_model$SDF$bedrooms
ggplot()+geom_sf(data=plots_data, aes(fill=GWR.bedrooms))

plots_data$GWR.dist<-ggwr_model$SDF$dist
ggplot()+geom_sf(data=plots_data, aes(fill=GWR.dist))

plots_data$GWR.guest_satisfaction_overall<-ggwr_model$SDF$guest_satisfaction_overall
ggplot()+geom_sf(data=plots_data, aes(fill=GWR.guest_satisfaction_overall))

plots_data$GWR.cleanliness_rating<-ggwr_model$SDF$cleanliness_rating
ggplot()+geom_sf(data=plots_data, aes(fill=GWR.cleanliness_rating))

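  • The maps use ggplot2’s default continuous fill scale, in which darker shades correspond to lower coefficient values and lighter shades to higher ones. For a predictor whose local coefficients are all positive, such as person_capacity, lighter areas are those where the predictor has the strongest effect on price.

  • For dist, whose local coefficients are all negative, the reading is reversed: darker areas are those where distance from the city centre reduces price the most. Areas with coefficients close to zero are those where the predictor matters little or where other unmodeled factors might be more influential.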

Boxplot

boxplot(as.data.frame(ggwr_model$SDF)[,3:7])
abline(h=0, lty=3, lwd=2, col="red")

  • This boxplot shows the distribution and variability of the local GWR coefficient estimates for each predictor variable, which helps to reveal how the effect of each predictor on the response varies across space.

  • For example, if all of a predictor’s coefficients lie on one side of the zero line, the direction of its effect is the same everywhere and only its strength varies. This is the case for person_capacity (always positive) and dist (always negative), whereas the coefficients of bedrooms, guest_satisfaction_overall and cleanliness_rating change sign across areas.
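
Whether a predictor keeps the same sign everywhere can be checked directly from the minimum and maximum local coefficients. A small base R sketch, using the Min. and Max. columns of the GWR summary printed earlier (the data frame name `gwr_range` is an illustrative choice):

```r
# Minimum and maximum local coefficients copied from the GWR summary above
gwr_range <- data.frame(
  predictor = c("person_capacity", "bedrooms", "dist",
                "guest_satisfaction_overall", "cleanliness_rating"),
  min = c(17.98834, -118.96642, -58.32765, -26.45223, -235.73410),
  max = c(225.52532, 310.90045, -6.03762, 14.05380, 267.55642)
)

# A predictor has a stable direction if its minimum and maximum share a sign
gwr_range$same_sign <- sign(gwr_range$min) == sign(gwr_range$max)

print(gwr_range[, c("predictor", "same_sign")])
```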

CLUSTERING

We are now going to perform clustering of the different areas:

Optimal number of clusters

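# Convert to a data frame first, then keep only the five coefficient columns
# (subsetting the spatial object first would carry the coordinates along)
fviz_nbclust(as.data.frame(ggwr_model$SDF)[,3:7], FUNcluster=kmeans)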

  • As we can see from the graph, the best number of clusters according to the Silhouette index is 3, although 2 or 4 would also be acceptable.

K-Means

K-Means clustering algorithm is then performed by selecting 3 clusters:

# Cluster on the five local coefficient columns only
km3c <- eclust(as.data.frame(ggwr_model$SDF)[,3:7], "kmeans", k=3)

plots_data$clust3 <- km3c$cluster
# Cluster labels are categories, so map them as a factor rather than a continuous scale
ggplot() + geom_sf(data=plots_data, aes(fill=factor(clust3)))

  • The clustering analysis applied to GWR coefficients helps in identifying spatial groupings of locations with similar predictor-response relationships. It facilitates the exploration of spatial patterns, differentiation between areas, and identification of localized trends and variations in the study area.
  • As we can see from the cluster plot, around 60% of the total variability is explained, which is acceptable.
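
K-means starts from random centres, so cluster labels and sizes can change between runs unless a seed is set. The base R sketch below uses made-up data (not the GWR coefficients) to show a fixed seed, several random starts, and one common way to quantify how much variability the clusters separate: the between-cluster sum of squares divided by the total. This measure is not necessarily the one read from the plot above.

```r
# Hypothetical two-column data with three loose groups
set.seed(123)  # fix the seed so the clustering is reproducible
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 4), ncol = 2),
  matrix(rnorm(40, mean = 8), ncol = 2)
)

# nstart > 1 reruns the algorithm from several random starts and keeps the best
km <- kmeans(toy, centers = 3, nstart = 25)

# Share of the total variability that lies between clusters
between_share <- km$betweenss / km$totss

print(table(km$cluster))
print(round(between_share, 2))
```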

MODELING II

Let’s now create dummy variables representing clusters and add them to the model:

# Dummy variables for cluster membership (1 if the area belongs to the cluster, 0 otherwise)
final_summary$clust1 <- as.integer(km3c$cluster == 1)
final_summary$clust2 <- as.integer(km3c$cluster == 2)
final_summary$clust3 <- as.integer(km3c$cluster == 3)
# Only clust1 and clust2 enter the models; cluster 3 is the reference category
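
Only two of the three dummies can enter a model with an intercept, otherwise the columns are perfectly collinear. The sketch below illustrates this on hypothetical cluster labels and shows that treating the labels as a factor lets R build the dummies automatically. Note that the factor approach uses the first level (cluster 1) as the reference category, whereas the models in this report use cluster 3:

```r
# Hypothetical cluster labels for six areas
cluster <- c(1, 2, 3, 1, 2, 3)

# Treating the labels as a factor lets R build the dummies;
# the first level (cluster 1) becomes the reference category
X <- model.matrix(~ factor(cluster))
print(X)

# Intercept plus all three dummies: four columns, but perfectly collinear
D <- cbind(1, cluster == 1, cluster == 2, cluster == 3)
print(qr(D)$rank)  # the rank is lower than the number of columns
```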

Spatial Error Model

By adding the dummy variables to the model, we are controlling for spatial drift.

new_eq <- realSum ~ person_capacity + bedrooms + dist + guest_satisfaction_overall + cleanliness_rating + clust1 + clust2
model.sem<-errorsarlm(new_eq, data=final_summary, listw=spatial_weights)
summary(model.sem)

Call:
errorsarlm(formula = new_eq, data = final_summary, listw = spatial_weights)

Residuals:
     Min       1Q   Median       3Q      Max 
-165.469  -50.351  -16.813   24.876  921.326 

Type: error 
Coefficients: (asymptotic standard errors) 
                           Estimate Std. Error z value  Pr(>|z|)
(Intercept)                251.5349   207.0747  1.2147    0.2245
person_capacity            122.9442    21.3243  5.7655 8.143e-09
bedrooms                    45.6809    31.7169  1.4403    0.1498
dist                       -24.3186     3.8481 -6.3197 2.621e-10
guest_satisfaction_overall  -2.7588     3.2421 -0.8509    0.3948
cleanliness_rating           6.6765    31.0178  0.2152    0.8296
clust1                      24.0430    24.5375  0.9798    0.3272
clust2                      20.4912    28.8552  0.7101    0.4776

Lambda: 0.31442, LR test value: 9.6431, p-value: 0.0019007
Asymptotic standard error: 0.092549
    z-value: 3.3973, p-value: 0.00068054
Wald statistic: 11.542, p-value: 0.00068054

Log likelihood: -1349.822 for error model
ML residual variance (sigma squared): 10388, (sigma: 101.92)
Number of observations: 223 
Number of parameters estimated: 10 
AIC: 2719.6, (AIC for lm: 2727.3)

  • The significant Lambda (0.31442, LR test p-value: 0.0019007) indicates that spatial autocorrelation remains in the error term: nearby areas still have similar unexplained “realSum” (price) components even after accounting for all specified predictors and the cluster dummies (clust1, clust2).

  • person_capacity and dist remain the only predictors with a significant effect on price. The cluster dummies are not statistically significant (p-values 0.3272 and 0.4776), so there is no clear evidence that they capture additional spatial variability in price.

In summary, adding the cluster dummies (clust1, clust2) leaves the coefficient estimates almost unchanged and does not improve the fit: the AIC rises from 2716.6 for the spatial error model without dummies to 2719.6. The spatial error structure itself remains important, as the AIC is still well below that of the corresponding linear model (2727.3).

OLS Model with dummies

Lastly, for comparison, we also fit a linear model that includes the cluster dummies:

ols_model<-lm(new_eq, data=final_summary)
summary(ols_model)

Call:
lm(formula = new_eq, data = final_summary)

Residuals:
    Min      1Q  Median      3Q     Max 
-130.31  -56.64  -24.54   30.81  906.01 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 351.482    207.493   1.694   0.0917 .  
person_capacity             142.240     21.266   6.689 1.92e-10 ***
bedrooms                     23.112     33.732   0.685   0.4940    
dist                        -23.125      3.059  -7.559 1.16e-12 ***
guest_satisfaction_overall   -6.038      3.344  -1.806   0.0724 .  
cleanliness_rating           24.680     32.309   0.764   0.4458    
clust1                       25.397     20.360   1.247   0.2136    
clust2                       22.569     24.500   0.921   0.3580    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 107.1 on 215 degrees of freedom
Multiple R-squared:  0.5243,    Adjusted R-squared:  0.5088 
F-statistic: 33.85 on 7 and 215 DF,  p-value: < 2.2e-16
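
The adjusted R-squared in the output above can be reproduced from the multiple R-squared, the number of observations and the number of predictors. A minimal base R check, using only values printed in the summary:

```r
# Values copied from the lm() summary above
r2 <- 0.5243
n  <- 223   # observations: 215 residual df + 8 estimated coefficients
p  <- 7     # predictors, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adj_r2, 4))
```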

CONCLUSIONS

Modeling

  • According to the R-squared (0.5243; adjusted 0.5088), the linear model with cluster dummies explains only about half of the variability in area-level prices.

  • Among the models with a reported AIC, the spatial lag model fits best (AIC 2713.5), followed by the spatial error model (2716.6); both improve on the corresponding linear model (2725). Adding the cluster dummies did not improve the fit (AIC 2719.6 for the spatial error model, 2727.3 for the linear model). No AIC was reported for the Geographically Weighted Regression, so it cannot be ranked on this criterion.

  • Overall, accounting for spatial dependence is important for a proper analysis of these data and for the conclusions drawn from it.
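
The AIC values used in this comparison follow directly from the log likelihoods and parameter counts printed in the model summaries. A short base R sketch that recomputes and ranks them (the data frame name `models` is an illustrative choice):

```r
# Log likelihoods and parameter counts copied from the model summaries above
models <- data.frame(
  model  = c("spatial lag", "spatial error", "spatial error + cluster dummies"),
  loglik = c(-1348.753, -1350.309, -1349.822),
  k      = c(8, 8, 10)
)

# AIC = -2 * log likelihood + 2 * number of estimated parameters
models$AIC <- -2 * models$loglik + 2 * models$k

# Sort from best (lowest AIC) to worst
models <- models[order(models$AIC), ]
print(models)
```

The two extra parameters of the model with dummies cost more in the AIC penalty than they gain in log likelihood.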

Answers to the research question

The research conducted in this project reveals that both spatial and non-spatial factors significantly influence Airbnb prices. The key findings can be summarized as follows:

  1. Influence of Location (Spatial Factors):

    • Distance from the City Centre: Areas located closer to the city centre generally command higher prices. This is evidenced by the significant negative coefficient for the distance variable (dist) in the spatial lag and spatial error models, and by the local GWR coefficients for dist, which are negative in every area.

    • Spatial Autocorrelation: The presence of positive spatial autocorrelation indicates that Airbnb prices are not randomly distributed but are spatially clustered. This means that high-priced properties are often located near other high-priced properties, and the same holds for low-priced properties. This was confirmed by the Moran’s I statistic and the results of spatial lag and error models.

  2. Property Characteristics (Non-Spatial Factors):

    • Capacity and Size: Areas whose listings have a higher guest capacity have higher prices; the coefficient for person_capacity is positive and significant across models. The coefficient for bedrooms is also positive, but it is not statistically significant in any model.

    • Quality Ratings: Neither guest satisfaction nor cleanliness ratings have a statistically significant effect on prices. The estimated coefficient is positive for cleanliness_rating and negative for guest_satisfaction_overall.

  3. Local Variations and Clustering

    • GWR Analysis: The GWR model highlights that the influence of these factors varies across different areas of the city. For instance, the impact of distance from the city centre on price is more pronounced in some neighborhoods than in others. This local variation can be crucial for hosts aiming to optimize pricing based on their specific location.

    • Cluster Analysis: The clustering analysis grouped areas of London with similar local GWR coefficients, i.e. similar relationships between the predictors and price. However, the cluster dummies were not statistically significant when added to the models, so these groupings should be treated as exploratory.