Short-term rental platforms such as Airbnb have transformed urban hospitality markets by enabling flexible, peer-to-peer accommodation (Guttentag, 2015). Listing prices vary widely within and across cities, shaped by location, room type, proximity to amenities, temporal demand cycles (weekdays/weekends), and perceived quality signals such as cleanliness and guest ratings (Teubner, Hawlitschek, & Dann, 2017; Dogru & Pekin, 2017). Understanding these determinants supports fair pricing, urban tourism policy, and platform governance (Zervas, Proserpio, & Byers, 2017).
Despite abundant city-level studies, systematic multi-city comparisons using a harmonized dataset remain limited (Wachsmuth & Weisler, 2018). This project quantifies how room type, spatial attributes (latitude/longitude, metro access, attractiveness/restaurant indices), cleanliness, and temporal factors jointly explain price differences across major European cities.
Describe the distribution of key variables (price, ratings, spatial features) across cities and between weekdays/weekends.
Estimate the association between price, room type, cleanliness rating, metro distance, and urban amenity indices.
Diagnose model assumptions and assess multicollinearity.
Derive policy and managerial implications for city stakeholders and hosts.
RQ1: How do price levels and dispersion differ across European cities and by day-type?
RQ2: What is the marginal association between price and room type, cleanliness rating, metro distance, and urban amenity indices?
RQ3: Do spatial attributes (lat/lng) and city context remain important after controlling for service quality?
H1: Private rooms and shared rooms are priced lower than entire homes, ceteris paribus.
H2: Higher cleanliness ratings and attractiveness/restaurant indices are positively associated with price.
H3: Greater distance to metro is negatively associated with price.
H4: Price patterns differ between weekdays and weekends.
Findings inform hosts’ pricing strategies, platform search design, and city tourism policy (e.g., transit-oriented planning). A multi-city lens improves generalizability beyond single-city case studies (Li, Moreno, & Zhang, 2019).
The analysis covers multiple European cities with weekday/weekend snapshots. Results reflect listing supply and platform dynamics at the time of data collection and may not capture regulatory changes or seasonality beyond the sampled periods.
Price (realSum): Listing price in local currency units.
Room type: Categorical type of accommodation (entire home/apt, private room, etc.).
Cleanliness rating: Numeric rating reported on the platform.
Attractiveness/Restaurant indices: City grid indices proxying proximity/abundance of attractions and restaurants.
Metro distance: Distance to the nearest metro station.
Business day: Weekday vs. weekend segmentation.
Prior work links room type to price premiums (entire homes > private rooms > shared rooms). Quality signals (cleanliness, reviews) often command higher willingness to pay. Spatial accessibility (centrality, transit proximity) and urban amenity density (cultural and culinary clusters) also influence pricing. Multi-city comparisons highlight heterogeneous urban structures and tourism flows.
Urban cores concentrate demand; peripheries price lower unless compensated by unique attributes. Transit and walkability increase accessibility and perceived value. City-specific zoning/regulatory environments further shape supply elasticity and price dispersion.
Weekend/holiday periods typically fetch higher prices due to leisure demand. Weekdays may reflect business travel and local events. Temporal segmentation is essential in models.
This project contributes a harmonized multi-city dataset, side-by-side visualization of city/day-type patterns, and a unified regression that frames determinants across spatial, temporal, and quality dimensions.
The study uses a quantitative, cross-sectional, multi-city design that combines exploratory visualization, correlation assessment, and linear regression with diagnostic checks.
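To make the regression-with-diagnostics step concrete, the following is a minimal sketch in base R. It runs on simulated data, not the project data: the effect sizes, the log transformation of price, the `room_type` column name, and the subset of predictors are all assumptions chosen for illustration. The variance inflation factor (VIF) is computed by hand as 1 / (1 - R²) among the numeric predictors only, which is a simplification of a full VIF.

```r
# Simulated stand-in for the harmonized listings data (NOT the project data).
set.seed(42)
n <- 500
sim <- data.frame(
  room_type = factor(sample(c("Entire home/apt", "Private room", "Shared room"),
                            n, replace = TRUE, prob = c(0.6, 0.35, 0.05))),
  cleanliness_rating = sample(6:10, n, replace = TRUE),
  metro_dist = rexp(n, rate = 1.5),
  attr_index_norm = runif(n, 1, 60)
)
# Restaurant index built to correlate with the attraction index,
# mimicking co-located amenities (an assumption of this sketch)
sim$rest_index_norm <- 0.8 * sim$attr_index_norm + rnorm(n, sd = 5)

# Assumed data-generating process, chosen only so the example has signal
sim$realSum <- exp(
  5 - 0.5 * (sim$room_type == "Private room") -
    0.9 * (sim$room_type == "Shared room") +
    0.03 * sim$cleanliness_rating - 0.10 * sim$metro_dist +
    0.01 * sim$attr_index_norm + rnorm(n, sd = 0.3)
)

# Log price tames the right skew; "Entire home/apt" is the reference level
fit <- lm(log(realSum) ~ room_type + cleanliness_rating + metro_dist +
            attr_index_norm + rest_index_norm, data = sim)
print(round(coef(summary(fit)), 3))

# VIF for one numeric predictor: 1 / (1 - R^2) from regressing it on the others
vif_one <- function(target, others, data) {
  r2 <- summary(lm(reformulate(others, response = target), data = data))$r.squared
  1 / (1 - r2)
}
num_vars <- c("cleanliness_rating", "metro_dist", "attr_index_norm", "rest_index_norm")
vifs <- sapply(num_vars, function(v) vif_one(v, setdiff(num_vars, v), sim))
print(round(vifs, 2))
```

In this simulated setup the two amenity indices inflate each other's VIF because they were constructed to be correlated. Residual diagnostics could be inspected with `plot(fit)` on the fitted model.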
Data include listings from Amsterdam, Athens, Barcelona, Berlin, Budapest, Lisbon, London, Paris, Rome, and Vienna, each by weekdays and weekends. Core variables include price, room type, guest satisfaction, cleanliness rating, bedrooms, amenity indices (attr_index, rest_index and normalized variants), metro distance, latitude/longitude.
Original data were imported and harmonized across cities. Data wrangling steps included: combining datasets, creating identifiers for city and business day, checking missing values, and performing feature engineering (dummy variables, transformations, clustering, normalization). Regression models estimated determinants of price.
amsterdam_weekdays <- read.csv("amsterdam_weekdays.csv")
amsterdam_weekends <- read.csv("amsterdam_weekends.csv")
athens_weekdays <- read.csv("athens_weekdays.csv")
athens_weekends <- read.csv("athens_weekends.csv")
barcelona_weekdays <- read.csv("barcelona_weekdays.csv")
barcelona_weekends <- read.csv("barcelona_weekends.csv")
berlin_weekdays <- read.csv("berlin_weekdays.csv")
berlin_weekends <- read.csv("berlin_weekends.csv")
budapest_weekdays <- read.csv("budapest_weekdays.csv")
budapest_weekends <- read.csv("budapest_weekends.csv")
lisbon_weekdays <- read.csv("lisbon_weekdays.csv")
lisbon_weekends <- read.csv("lisbon_weekends.csv")
london_weekdays <- read.csv("london_weekdays.csv")
london_weekends <- read.csv("london_weekends.csv")
paris_weekends <- read.csv("paris_weekends.csv")
paris_weekdays <- read.csv("paris_weekdays.csv")
rome_weekdays <- read.csv("rome_weekdays.csv")
rome_weekends <- read.csv("rome_weekends.csv")
vienna_weekdays <- read.csv("vienna_weekdays.csv")
vienna_weekends <- read.csv("vienna_weekends.csv")
A source column tags each observation with its origin (e.g., berlin_weekends), preserving provenance for stratified analysis.
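One way such a source column could be built is sketched below in base R. The two small data frames are toy stand-ins for the per-city files, and the derived `business_day` values ("weekdays"/"weekends") are an assumption about how the identifiers are coded; the column names `Country` and `business_day` follow the plotting code in this report.

```r
# Toy stand-ins for two of the per-city files; in practice each element
# would come from read.csv() on the corresponding CSV.
listings <- list(
  berlin_weekdays = data.frame(realSum = c(185.8, 194.9), metro_dist = c(0.9, 0.4)),
  berlin_weekends = data.frame(realSum = 210.3, metro_dist = 1.2)
)

# Stack the frames, tagging every row with the name of its origin
combined <- do.call(rbind, Map(function(df, src) {
  df$source <- src
  df
}, listings, names(listings)))
rownames(combined) <- NULL

# Split e.g. "berlin_weekends" into city and day-type identifiers
parts <- strsplit(combined$source, "_", fixed = TRUE)
combined$Country <- sapply(parts, `[`, 1)       # holds the city name
combined$business_day <- sapply(parts, `[`, 2)  # "weekdays" or "weekends"

print(combined)
```

Keeping the list named after the source files means the provenance tag comes directly from `names(listings)`, so adding a city only requires adding one list element.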
The bar plot illustrates the distribution of Airbnb listings across the sampled cities. The variation in bar height reflects differences in sample size, with larger bars indicating cities that contributed more listings to the dataset. A greater number of listings enhances the precision and reliability of city-level estimates, while smaller sample sizes may limit the stability of results and increase sampling variability. This distribution is therefore an important consideration when comparing determinants of pricing across cities, as uneven representation may influence the strength of statistical inferences.
The temporal balance analysis compares the number of weekday and weekend observations within the dataset. The relative distribution between these two categories is important for assessing the reliability of temporal effects on pricing. A reasonably balanced distribution ensures that comparisons between weekday and weekend patterns are stable and less prone to bias. However, a substantial imbalance in observations could weaken the validity of temporal comparisons, as estimates for the underrepresented category may be less precise and more sensitive to sampling variability.
The chart evaluates whether both weekdays and weekends are adequately represented within each city. Ensuring balanced representation across temporal categories is critical for making unbiased comparisons of weekday versus weekend pricing patterns. If both time periods are well captured, the resulting contrasts more accurately reflect true temporal effects. In contrast, underrepresentation of either weekdays or weekends in certain cities could introduce bias, limiting the validity of temporal inferences at the city level.
The stacked proportion plot illustrates each city’s relative composition of weekday and weekend listings. This visualization highlights how temporal representation varies across cities, providing insights into the balance of observations within each location. Cities with more even proportions enable more reliable contrasts between weekday and weekend pricing, whereas cities with skewed distributions may produce biased temporal comparisons. Thus, the stacked proportions not only reveal temporal patterns across locations but also serve as a diagnostic tool for assessing the robustness of city-level temporal analyses.
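The counts and within-city proportions behind these balance plots can also be tabulated directly. The sketch below uses a tiny hypothetical data frame; the 10-percentage-point threshold for flagging a skewed city is arbitrary and only illustrates the idea.

```r
# Hypothetical city/day-type labels; the real data would supply these columns
toy <- data.frame(
  Country = c(rep("amsterdam", 4), rep("rome", 4)),
  business_day = c("weekdays", "weekdays", "weekends", "weekends",
                   "weekdays", "weekdays", "weekdays", "weekends")
)

# Listings per city and day type
counts <- table(toy$Country, toy$business_day)
print(counts)

# Row-wise shares: each city's split between weekdays and weekends
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))

# Flag cities whose weekday share strays more than 10 points from an even split
# (the threshold is arbitrary, for illustration only)
skewed <- rownames(shares)[abs(shares[, "weekdays"] - 0.5) > 0.10]
print(skewed)
```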
library(dplyr)    # count(), %>%
library(ggplot2)  # plotting

# Pie chart of weekday vs. weekend listing counts
combined_data %>%
  count(business_day) %>%
  ggplot(aes(x = "", y = n, fill = business_day)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Business Day Distribution") +
  theme_void()
Overall, the results suggest a potential small to moderate imbalance between weekday and weekend listings across the dataset. While both categories are represented, the uneven distribution indicates that one temporal category may be slightly more dominant. Such imbalance does not invalidate temporal comparisons but may influence the precision of estimates, particularly for the underrepresented category. Recognizing this imbalance is therefore important when interpreting weekday–weekend contrasts in pricing.
## X             realSum           person_capacity  multi
## Min.   :   0  Min.   :   34.78  Min.   :2.000    Min.   :0.0000
## 1st Qu.: 646  1st Qu.:  148.75  1st Qu.:2.000    1st Qu.:0.0000
## Median :1334  Median :  211.34  Median :3.000    Median :0.0000
## Mean   :1621  Mean   :  279.88  Mean   :3.162    Mean   :0.2914
## 3rd Qu.:2382  3rd Qu.:  319.69  3rd Qu.:4.000    3rd Qu.:1.0000
## Max.   :5378  Max.   :18545.45  Max.   :6.000    Max.   :1.0000
##
## biz             cleanliness_rating  guest_satisfaction_overall
## Min.   :0.0000  Min.   : 2.000      Min.   : 20.00
## 1st Qu.:0.0000  1st Qu.: 9.000      1st Qu.: 90.00
## Median :0.0000  Median :10.000      Median : 95.00
## Mean   :0.3502  Mean   : 9.391      Mean   : 92.63
## 3rd Qu.:1.0000  3rd Qu.:10.000      3rd Qu.: 99.00
## Max.   :1.0000  Max.   :10.000      Max.   :100.00
##
## bedrooms        dist              metro_dist         attr_index
## Min.   : 0.000  Min.   : 0.01504  Min.   : 0.002301  Min.   :  15.15
## 1st Qu.: 1.000  1st Qu.: 1.45314  1st Qu.: 0.248480  1st Qu.: 136.80
## Median : 1.000  Median : 2.61354  Median : 0.413269  Median : 234.33
## Mean   : 1.159  Mean   : 3.19129  Mean   : 0.681540  Mean   : 294.20
## 3rd Qu.: 1.000  3rd Qu.: 4.26308  3rd Qu.: 0.737840  3rd Qu.: 385.76
## Max.   :10.000  Max.   :25.28456  Max.   :14.273577  Max.   :4513.56
##
## attr_index_norm   rest_index       rest_index_norm   lng
## Min.   :  0.9263  Min.   :  19.58  Min.   :  0.5928  Min.   :-9.2263
## 1st Qu.:  6.3809  1st Qu.: 250.85  1st Qu.:  8.7515  1st Qu.:-0.0725
## Median : 11.4683  Median : 522.05  Median : 17.5422  Median : 4.8730
## Mean   : 13.4238  Mean   : 626.86  Mean   : 22.7862  Mean   : 7.4261
## 3rd Qu.: 17.4151  3rd Qu.: 832.63  3rd Qu.: 32.9646  3rd Qu.:13.5188
## Max.   :100.0000  Max.   :6696.16  Max.   :100.0000  Max.   :23.7860
##
## lat
## Min.   :37.95
## 1st Qu.:41.40
## Median :47.51
## Mean   :45.67
## 3rd Qu.:51.47
## Max.   :52.64
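Ratings are concentrated at the top of their scales: the median cleanliness rating is 10 (mean 9.39, maximum 10) and the median guest satisfaction score is 95 (mean 92.63, maximum 100), suggesting generally positive experiences.
Prices show a strong right skew: the mean (279.88) exceeds the median (211.34), and the maximum (18,545.45) lies far above the third quartile (319.69), indicating many budget and mid-range listings and a small number of premium properties.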
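The mean-versus-median comparison can be complemented by a sample skewness statistic. The sketch below uses simulated log-normal prices as a stand-in for `realSum`; the distribution parameters are assumptions, and the `skewness()` helper is defined here rather than taken from a package.

```r
set.seed(1)
# Simulated right-skewed prices (log-normal); a stand-in for realSum
price <- rlnorm(2000, meanlog = 5.4, sdlog = 0.6)

# Moment-based sample skewness: close to 0 for a symmetric distribution
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Under right skew the mean sits above the median
print(c(mean = mean(price), median = median(price)))

# A log transformation pulls the skewness toward zero
print(c(raw = skewness(price), logged = skewness(log(price))))
```

This is one reason a log-price response is often considered for regression on skewed price data, although whether it suits this dataset would need to be checked against residual diagnostics.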
# Boxplot of realSum by city (the Country column stores the city name)
ggplot(combined_data, aes(x = Country, y = realSum, fill = Country)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Price (realSum) by City", y = "Price", x = "City") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
There appear to be moderate to large differences in median price and dispersion across cities. These visual differences suggest heterogeneity; establishing statistical significance would require formal tests.
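One candidate for such a formal test, given the skewed prices, is the rank-based Kruskal-Wallis test with pairwise follow-ups. The sketch below uses simulated prices for three cities; the city levels and price parameters are invented for illustration, and the choice of test is a suggestion rather than the method used in this report.

```r
set.seed(7)
# Simulated prices for three cities with different typical levels
toy <- data.frame(
  Country = factor(rep(c("athens", "berlin", "paris"), each = 100)),
  realSum = c(rlnorm(100, 5.0, 0.5), rlnorm(100, 5.4, 0.5), rlnorm(100, 5.9, 0.5))
)

# Rank-based test of whether the cities share a common price location
kw <- kruskal.test(realSum ~ Country, data = toy)
print(kw)

# Pairwise follow-up with Holm correction for multiple comparisons
pw <- pairwise.wilcox.test(toy$realSum, toy$Country, p.adjust.method = "holm")
print(pw$p.value)
```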
# Boxplot of cleanliness_rating by business_day
ggplot(combined_data, aes(x = business_day, y = cleanliness_rating, fill = business_day)) +
  geom_boxplot() +
  labs(title = "Cleanliness Rating by Business Day", y = "Rating", x = "Business Day") +
  theme_minimal()
There appears to be little difference in cleanliness ratings by day-type.
# Scatter plot of guest satisfaction against price
ggplot(combined_data, aes(x = guest_satisfaction_overall, y = realSum)) +
  geom_point(alpha = 0.5, color = "#d62728") +
  labs(title = "Guest Satisfaction vs Price", x = "Satisfaction", y = "Price") +
  theme_minimal()
A positive trend would suggest that higher-priced listings also earn higher satisfaction (possibly due to amenities or location). A weak or absent trend would imply that satisfaction is largely unrelated to price.
# Scatter plot of metro distance against price
ggplot(combined_data, aes(x = metro_dist, y = realSum)) +
  geom_point(alpha = 0.5, color = "#9467bd") +
  labs(title = "Metro Distance vs Price", x = "Metro Distance", y = "Price") +
  theme_minimal()
There appears to be a negative relationship between distance to metro and price: listings farther from metro stations tend to be cheaper.
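The visual impression could be quantified with a rank correlation, which is less sensitive to the extreme prices than Pearson's coefficient. The sketch below runs on simulated data in which a negative relationship is built in by assumption; it is a bivariate check only and does not control for room type or city.

```r
set.seed(11)
n <- 400
# Simulated distances and prices with an assumed negative relationship
metro_dist <- rexp(n, rate = 1.5)
realSum <- exp(5.5 - 0.25 * metro_dist + rnorm(n, sd = 0.5))

# Spearman rank correlation with a normal-approximation p-value
rho <- cor.test(metro_dist, realSum, method = "spearman", exact = FALSE)
print(rho$estimate)
print(rho$p.value)
```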
# Mean price per city (the Country column stores the city name)
combined_data %>%
  group_by(Country) %>%
  summarise(avg_price = mean(realSum, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(Country, avg_price), y = avg_price)) +
  geom_col(fill = "#FF7F0E") +
  coord_flip() +
  labs(title = "Average Price by City", x = "City", y = "Average Price") +
  theme_minimal()
The analysis indicates moderate to large differences in average prices across cities, suggesting that location plays an important role in determining Airbnb pricing. These differences may reflect variations in demand, cost of living, tourism activity, and local market conditions. Cities with higher average prices likely capture premium markets or stronger demand, while lower-priced cities may reflect more affordable or less competitive markets. Such cross-city variation highlights the importance of including location-specific factors in the analysis to avoid biased or oversimplified conclusions about pricing determinants.
The geospatial scatter plot reveals clusters with moderately higher attractiveness, indicating that certain neighborhoods or areas within the cities draw relatively stronger demand. These clusters suggest localized effects where factors such as proximity to tourist attractions, accessibility, or neighborhood quality contribute to elevated listing appeal. The moderate strength of clustering implies that while location is influential, it interacts with other determinants (such as room type, amenities, and pricing strategy) in shaping overall demand.
The comparison of amenity distribution across day-types (weekday versus weekend) indicates that the spread of amenities appears broadly similar, with only minor observable differences. These variations are small and, since no formal statistical tests were conducted, cannot be interpreted as significant. The findings therefore suggest that amenity availability is generally consistent across temporal categories, implying that temporal price differences are more likely to be driven by other factors rather than variation in amenity provision.
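A formal comparison of the two distributions could use a two-sample Kolmogorov-Smirnov test. In the sketch below both samples are simulated from the same gamma distribution to mimic the "broadly similar" visual impression; the distribution and sample sizes are assumptions, and real index values with many ties might call for a different test.

```r
set.seed(3)
# Simulated normalized attraction index for each day type, drawn from the
# same distribution (an assumption made for illustration)
weekday_attr <- rgamma(300, shape = 2, rate = 0.15)
weekend_attr <- rgamma(300, shape = 2, rate = 0.15)

# Two-sample Kolmogorov-Smirnov test compares the full distributions,
# not just their centers
ks <- ks.test(weekday_attr, weekend_attr)
print(ks)
```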
The plot compares the spatial distribution of metro accessibility by city, contrasting denser transit grids with more car-oriented peripheries. Transit accessibility varies across cities, with moderate differences observed in the distribution of listings relative to transportation options. These patterns suggest that accessibility may contribute to pricing and demand differences across locations. However, since this conclusion is based on visual inspection, statistical significance has not been formally established. Further testing would be required to determine whether the observed differences in transit accessibility are meaningful predictors of pricing outcomes.
The spatial patterns reveal co-located hotspots of cultural and culinary amenities with moderate strength, suggesting that certain neighborhoods concentrate multiple attractive features for visitors. Such clustering indicates that areas offering both cultural and dining experiences may hold a competitive advantage in shaping demand and pricing. However, these findings are based on visual analysis, and their statistical significance requires confirmation through regression modeling to establish whether amenity concentration is a meaningful predictor of listing performance.
library(leaflet)

# Build the palette once so markers and legend share the same colour scale
pal <- colorNumeric("YlOrRd", domain = combined_data$attr_index_norm)

# Interactive map on an OpenStreetMap basemap
leaflet(combined_data) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lng, lat = ~lat,
    # attr_index_norm ranges up to 100, so scale the pixel radius down
    radius = ~2 + attr_index_norm / 10,
    color = ~pal(attr_index_norm),
    popup = ~paste("City:", Country,
                   "<br>Attr Index:", attr_index,
                   "<br>Rest Index:", rest_index)
  ) %>%
  addLegend("bottomright", pal = pal,
            values = ~attr_index_norm, title = "Attr Index (Norm)")
The interactive mapping results highlight localized clusters of high attractiveness, which appear visually moderate to strong in intensity. These clusters suggest that specific neighborhoods or districts consistently draw higher demand, likely due to their proximity to cultural, commercial, or leisure amenities. The spatial concentration of attractiveness reinforces the importance of location in shaping Airbnb performance. Nonetheless, as these observations are derived from visual inspection, further statistical validation is necessary to confirm the strength and significance of these localized effects.