Abstract

This project assesses the relationship between Airbnb listing distribution and mean household income across the Boston, Massachusetts neighborhoods. A rarely assessed relationship, this project closes the gap in understanding by combining US Census income data with that of Airbnb listings. This produced a process for assessing other variable relationships to Airbnb listing distribution. Utilizing exploratory analysis and linear regression, mean household income and Airbnb listing distribution was found to have no correlation. Utilizing the methodolgies employed in this study, future analysis can be performed to expand the variables assessed to include greater demographic characteristics within each neighborhood.

Introduction

An alternative to hotel accommodations, Airbnb has made it easier for individuals to rent rooms or whole houses to other individuals creating a new competitive corner of the market disrupting tourism business models in larger cities (Gutiérrez et al. 2017). Research into impacts and Airbnb distribution has been primarily focused in European markets with attention given to larger tourist city centers (Adamiak et al. 2019; Garcia-López et al. 2020). An exception can be found in a study focusing on Los Angeles County, California that compared short-term housing rental impacts on housing markets by comparing cities in the county that had implemented restrictions on short-term rentals and those that had no such restrictions. It was found that these restrictions led to a 2% drop in property values (Koster et al. 2021).

A city with a market for historical tourism, Boston mixes historic tourist sites amid residential communities. Boston features 24 economically varied neighborhoods and is heavily touristed with no active restrictions or ordinances impacting Airbnb listings making it an ideal candidate for Airbnb market analysis (Meet Boston | Your Official Guide to Boston, n.d.). This project seeks to determine if there is a relationship between the social-economic status of a neighborhood and the preponderance of Airbnb listings. Utilizing Boston as the region of focus and exploratory and statistical analysis methods, which neighborhoods have the highest mean household income? Do these locations show a correlation with the amount of Airbnb listings?

Data

Airbnb listing data was sourced from Insideairbnb.com. This data contains listings from January 1st, 2025 to December 27th, 2025. To obtain the data, Inside Airbnb utilizes public information compiled from the Airbnb website and is verified and cleaned prior to release. The occupancy rate for each listing are calculated utilizing a model based on the work of (Bunn 2024). The resulting model ,called “San Francisco Model”, takes in to account the average stay length of approximately 50% of reviews in a listing. This value is then multiplied by the estimated bookings of the listing to produce an occupancy rate. This value compared with the listing type was utilized to produce an occupancy rating for each neighborhood normalized by listing type. This data set will require filtering of variables to ensure only the neighborhood, coordinates, occupancy rate, and listing type are selected.

US Census data was sourced utilizing the tidycensus package pulling the 2024 5-yr estimate data. Utilizing the mean household income by census tract, this data set provides the mean household income for each neighborhood when combined. Utilizing tidycensus provides a clean data set that can be combined with other data sets for deeper analysis.

Neighborhood data was sourced from the official City of Boston data hub. This data set provides geographies for the 24 Boston neighborhoods aligned with census tracks allowing for incorporation of tidycensus data.

Methodology

The data available for this project varies by year. For the Airbnb data, it was only available for 2025 while US Census data was available for only as recent as 2024. This required an assumption that mean household incomes for each neighborhood would not drastically change. From this assumption, the rest of the methodology as employed.

Preparing Data

Before analysis could be completed, tools and data was first loaded into the project. This project was structured under the tidy data principles bringing in packages with functions containing tidy inputs and outputs. These included tidyverse, leaflet, sf, ggplot2, dplyr, tidyr, and tidycensus.

# load packages ----

library(tidyverse)
library(here)
library(dplyr)
library(tidyr)
library(janitor)
library(ggplot2)
library(ggiraph)
library(cowplot)
library(scales)
library(leaflet)
library(sf)
library(mapview)
library(tidycensus)
library(tigris)
library(knitr)

With packages loaded, each data set was loaded and cleaned. First, the Boston data was loaded with geography defined as polygon vector data. Reviewing the data set, it was identified that the neighborhood variable name was cut short requiring renaming and an overall cleaning of names overall. Figure 1 features the initial construct of the neighborhood data set.

neighborhoods <- st_read(here("data", "Boston_Neighborhood_Boundaries_Approximated_by_2020_Census_Tracts.shp"), 
                         quiet = TRUE)

view(neighborhoods)

CleanNeighborhoods <- neighborhoods %>%
  rename(neighborhood = neighborho) %>%
  clean_names()
mapview(CleanNeighborhoods, zcol = "neighborhood")

Next, Airbnb listing data was loaded and cleaned. Once loaded, the data set was cleaned, columns were renamed for American English spelling, and a subset of the data was pulled to include the neighborhood and room type variables. The listing data set was first converted to an sf feature utilizing st_as_sf() and was initially plotted by room type over the neighborhood. Figure 2 shows that the most common type of listings was for entire homes and apartment rentals.

listings <- read_csv(here("data", "listings_sum.csv"))

CleanListings <- listings %>%
  clean_names() %>%
  rename(neighborhood = neighbourhood) %>%
  select(id,
         neighborhood,
         room_type,
         latitude,
         longitude)
listings_plot <- st_as_sf(CleanListings, coords = c("longitude", "latitude"), crs = 4326)

neighborhood_listings <- ggplot() +
  geom_sf_interactive(data = CleanNeighborhoods, aes(tooltip = neighborhood, fill = neighborhood), alpha = 0.2) +
  geom_sf(data = listings_plot,
          aes(color = as.factor(room_type))) +
  coord_sf(crs = 4326) +
  labs(title = "Airbnb Listings by Room Type \n Plotted on Boston's Neighborhoods",
       fill = "Neighborhood",
       color = "Room Type")+
  theme_void()

girafe(ggobj = neighborhood_listings) %>%
  girafe_options(opts_hover(css = "fill:cyan;", reactive = TRUE))

Figure 2 - Initial plot of Airbnb listings by room type across the Boston neighborhoods. The most common type of listings is found to be the entire house and apartment listings.

Lastly, the US Census data was pulled at the census tract level for Suffolk County, Massachusetts that matches the metropolitan region of Boston. The associated variable associated with the total mean household income was identified as “B19013_001”. As Boston is a port city, water edges are prevalent in the data. This required the use of erase_water to ensure the census tract polygons matched the geographic reality. Similar to the neighborhood data set, the census data was structured with polygon vectors (Figure 3).

# Find variable for mean household income
vars <- load_variables(2024, "acs5")

View(vars)

# Pull data
suffolk_income <- get_acs(
  geography = "tract",
  variables = "B19013_001",
  state = "MA",
  county = "Suffolk",
  geometry = TRUE
)
##   |                                                                              |                                                                      |   0%  |                                                                              |=                                                                     |   1%  |                                                                              |==                                                                    |   3%  |                                                                              |====                                                                  |   5%  |                                                                              |=====                                                                 |   8%  |                                                                              |=======                                                               |   9%  |                                                                              |=======                                                               |  10%  |                                                                              |========                                                              |  12%  |                                                                              |=========                                                             |  13%  |                                                                              |==========                                                            |  14%  |                                                                              |============                                                          |  17%  |                                                                              |=============                                                         |  19%  |                                                                              |==============                                                        |  20%  |                                                                              |===============                                                       |  21%  |                                                                              |=================                                                     |  24%  |                                                                              |=================                                                     |  25%  |                                                                              |==================                                                    |  26%  |                                                                              |===================                                                   |  27%  |                                                                              |====================                                                  |  28%  |                                                                              |====================                                                  |  29%  |                                                                              |======================                                                |  32%  |                                                                              |========================                                              |  34%  |                                                                              |=========================                                             |  36%  |                                                                              |===========================                                           |  39%  |                                                                              |=============================                                         |  41%  |                                                                              |===============================                                       |  44%  |                                                                              |=================================                                     |  46%  |                                                                              |==================================                                    |  49%  |                                                                              |====================================                                  |  52%  |                                                                              |======================================                                |  54%  |                                                                              |========================================                              |  57%  |                                                                              |=========================================                             |  59%  |                                                                              |===========================================                           |  62%  |                                                                              |=============================================                         |  64%  |                                                                              |===============================================                       |  67%  |                                                                              |=================================================                     |  69%  |                                                                              |==================================================                    |  72%  |                                                                              |====================================================                  |  74%  |                                                                              |======================================================                |  77%  |                                                                              |========================================================              |  79%  |                                                                              |=========================================================             |  82%  |                                                                              |===========================================================           |  85%  |                                                                              |=============================================================         |  87%  |                                                                              |===============================================================       |  90%  |                                                                              |=================================================================     |  92%  |                                                                              |==================================================================    |  95%  |                                                                              |====================================================================  |  97%  |                                                                              |======================================================================| 100%
# Clean up data set
sf_use_s2(FALSE)

suffolk_erase <- erase_water(suffolk_income, 
                             area_threshold = 0.5, 
                             year = 2024)
mapview(suffolk_erase, zcol = "estimate")

Methodological Steps

Figure 4 is a flow chart of the processes employed to determine if a correlation exists between the mean household income and total Airbnb listings in each neighborhood. The three data sets each required unique data wrangling to produce tidy data ready for analysis operations. Selecting the Airbnb variables previously discussed, a new, tidy data set was produced. This data set was combined with Boston neighborhood data to create visual context for distribution of listings across the city. Following this, US Census data for census tracks of the Boston area was pulled using tidycensus. This data set was converted into centroid points for each tract. The mean income for each point was added together for all points in each neighborhood and averaged to produce a mean value for each neighborhood. The resulting data set was plotted and combined with point count data from the Airbnb listings data set to produce a joined table consisting of neighborhoods with variables of mean household income and total listings by type. From this data set, plots were produced of the total listings versus the mean household income of each neighborhood. From these plots, correlation was determined through the use of a simple linear regression model.

Figure 4 - A flow chart of the processes employed to determine if a correlation exists between the types and occupancy of Airbnb listings and the mean household income of each neighborhood
Figure 4 - A flow chart of the processes employed to determine if a correlation exists between the types and occupancy of Airbnb listings and the mean household income of each neighborhood

Determine Mean Income by Neighborhood

To determine the mean income for each neighborhood, each census track was first converted to a point at the centroid. This allows for point to polygon joining to occur. The resulting centroids and neighborhood data set were both set to a common coordinated reference system, WGS84, and combined using st_join. Any rows with no mean income values were removed. This then allowed for each row to be grouped by neighborhood and the mean incomes to be averaged to produce the final mean income y neighborhood data set (Figure 5).

# convert to centroids
income_centroids <- st_centroid(suffolk_erase)
## Warning: st_centroid assumes attributes are constant over geometries
## Warning in st_centroid.sfc(st_geometry(x), of_largest_polygon =
## of_largest_polygon): st_centroid does not give correct centroids for
## longitude/latitude data
# set the coordinate reference

income_centroids <- st_set_crs(income_centroids, 4326)
## Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
## that
CleanNeighborhoods <- st_set_crs(CleanNeighborhoods, 4326)

# Combine the values

joined_income <- st_join(CleanNeighborhoods, income_centroids)
## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar
# convert variable column to integer and save as mean_income

joined_income <- joined_income %>%
  drop_na(estimate)

mean_income_neighborhoods <- joined_income %>%
  group_by(neighborhood) %>%
  summarise(mean_income = (sum(estimate)/n_distinct(estimate)))
## although coordinates are longitude/latitude, st_union assumes that they are
## planar
mapview(mean_income_neighborhoods, zcol = "mean_income")

Create Analysis Data Set

Finally, the Airbnb listing data and mean household income by neighborhood data were joined using st_join to associate each listing with its respective neighborhood location. The resulting data set was reduced to the neighborhood, mean income, room type, and geometry variables. To attain the total count of listings in each neighborhood, the data were grouped by neighborhood and mean household income and the resulting number of listings in the room type variable were counted. These values were then summarized as a new variable call total_listings.

# join tables with polygons preserved
joined_listings_income <- st_join(mean_income_neighborhoods, listings_plot)
## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar
# select columns
joined_listings_income <- joined_listings_income %>%
  select(neighborhood = neighborhood.x,
         mean_income,
         room_type,
         geometry)

# count number of listings by room type by neighborhood
count_listings_income <- joined_listings_income %>%
  group_by(neighborhood, mean_income) %>%
  count(room_type)
## although coordinates are longitude/latitude, st_union assumes that they are
## planar
# plot of Number Listings by Neighborhood

sum_listings <- count_listings_income %>%
  group_by(neighborhood, mean_income) %>%
  summarise(total_listings = sum(n))
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by neighborhood and mean_income.
## ℹ Output is grouped by neighborhood.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(neighborhood, mean_income))` for per-operation
##   grouping (`?dplyr::dplyr_by`) instead.
## although coordinates are longitude/latitude, st_union assumes that they are
## planar

Data Analysis

The first analysis conducted was exploratory analysis. Utilizing ggplot() from the ggplot2 package, two plots were produced, one with the total listings by neighborhood and the other of the mean income by neighborhood. These plots were then compared to identify if any visual relationship was apparent. To reduce visual impacts of outliers, a box plot was created (Figure 6). This identified outliers being total listings beyond 400. The analysis data set was thus filtered of any total listing values beyond 400 and the visual analysis conducted again. To confirm the initial impressions from the exploratory data analysis, a simple linear regression was performed to determine if total listings could be explained by the mean household income.

# plot of mean income of neighborhoods

plot_income <- ggplot(mean_income_neighborhoods) +
  geom_sf(aes(fill = mean_income)) +
  scale_fill_distiller(name = "Mean Income",
                       palette = "Greens",
                       breaks = pretty_breaks(),
                       direction = 1) +
  coord_sf(crs = 4326) +
  labs(title = "Mean Income by Neighborhood",
       fill = "Mean Income") +
  theme_void()

# plot of Number Listings by Neighborhood
plot_listings <- ggplot(sum_listings) +
  geom_sf(aes(fill = total_listings)) +
  scale_fill_distiller(name = "Total Listings",
                       palette = "Blues",
                       breaks = pretty_breaks(),
                       direction = 1) +
  coord_sf(crs = 4326) +
  labs(title = "Total Listings by Neighborhood",
       fill = "Total Listings") +
  theme_void()

# check for outliers

sum_listings %>%
  ggplot(aes(total_listings, total_listings)) +
  geom_boxplot()
## Warning: Orientation is not uniquely specified when both the x and y aesthetics are
## continuous. Picking default orientation 'x'.
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
Figure 6 - Boxplot of total listings by neighborhood. A single outlier can be found beyond 400 total listings.

Figure 6 - Boxplot of total listings by neighborhood. A single outlier can be found beyond 400 total listings.

# filter out outlier
filter_sum_listings <- sum_listings %>%
  filter(total_listings <= 400)

# new plot of filtered listings

plot_filtered_listings <- ggplot(filter_sum_listings) +
  geom_sf(aes(fill = total_listings)) +
  scale_fill_distiller(name = "Total Listings",
                       palette = "Blues",
                       breaks = pretty_breaks(),
                       direction = 1) +
  coord_sf(crs = 4326) +
  labs(title = "Total Listings by Neighborhood",
       fill = "Total Listings") +
  theme_void()

# create linear regression model

lm_r = lm(formula = total_listings ~ mean_income, data = filter_sum_listings)

Results

The first visual comparison of mean household income and total listings by neighborhood (Figure 7) did not appear to show a correlation. The neighborhood with the most listings was found to be Dorchester on the eastern shore. This did not relate to either the lowest or highest income neighborhoods.

plot_grid(plot_income, 
          plot_listings,
          nrow = 2,
          hjust = 0,
          label_x = 0,
          align = "hv")
Figure 7 - Plots of mean household income and total listings by neighborhood. The neighborhood with the most listings was found to be Dorchester on the eastern shore.

Figure 7 - Plots of mean household income and total listings by neighborhood. The neighborhood with the most listings was found to be Dorchester on the eastern shore.

The total listings value for Dorchester was to be an outlier. This value was removed and the data replotted. The resulting data set still did not show any areas that visually correlated between the variables (Figure 8).

plot_grid(plot_income, 
          plot_filtered_listings,
          nrow = 2,
          hjust = 0,
          label_x = 0,
          align = "hv")
Figure 8 - Plots of mean household income and total listings by neighborhood with the Dorchester neighborhood filtered out. No correlating features were identified.

Figure 8 - Plots of mean household income and total listings by neighborhood with the Dorchester neighborhood filtered out. No correlating features were identified.

To confirm the visual analysis, a simple linear regression was conducted. The regression model showed no relationship with a r-squared value of -0.03261. By plotting these values on a scatter plot with a fit line, no relationship is indicated. Overall, the total Airbnb listings by Boston neighborhoods did not have a correlation with the mean household incomes of each.

summary(lm_r)
## 
## Call:
## lm(formula = total_listings ~ mean_income, data = filter_sum_listings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -109.18  -74.43  -32.23   63.12  226.65 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.050e+02  7.148e+01   1.468    0.158
## mean_income 3.537e-04  6.094e-04   0.580    0.568
## 
## Residual standard error: 98.81 on 20 degrees of freedom
## Multiple R-squared:  0.01656,    Adjusted R-squared:  -0.03261 
## F-statistic: 0.3368 on 1 and 20 DF,  p-value: 0.5682
ggplot(filter_sum_listings) +
  geom_point(aes(x = mean_income, y = total_listings)) +
  geom_smooth(method = "lm", se = FALSE, 
              aes(x = mean_income, y = total_listings)) +
  labs(title = "Total Listings vs Mean Income",
       x = "Mean Income",
       y = "Total Listings") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
Figure 9 - linear regression model plot indicating no relationship between total Airbnb listings and mean household income by neighborhood.

Figure 9 - linear regression model plot indicating no relationship between total Airbnb listings and mean household income by neighborhood.

Reflection

While no correlation was found, this project had success is establishing a process for analyzing Airbnb listings data with US Census data. The data was readily available. Boston’s government data hub provides well curated data sets that can be easily accessed. For future analysis, more census variables should be incorporated into the analysis to include demographics, education attainment, and age. This will create a holistic data set for further understanding of variables that could explain the distribution of listings across the city.

References

Adamiak, Czesław, Barbara Szyda, Anna Dubownik, and David García-Álvarez. 2019. “Airbnb Offer in Spainspatial Analysis of the Pattern and Determinants of Its Distribution.” ISPRS International Journal of Geo-Information 8 (3): 155. https://doi.org/10.3390/ijgi8030155.
Bunn, James J. 2024. The Effects of Short-Term Rentals on Housing in Rural Utah.
Garcia-López, Miquel-Àngel, Jordi Jofre-Monseny, Rodrigo Martínez-Mazza, and Mariona Segú. 2020. “Do Short-Term Rental Platforms Affect Housing Markets? Evidence from Airbnb in Barcelona.” Journal of Urban Economics 119 (September): 103278. https://doi.org/10.1016/j.jue.2020.103278.
Gutiérrez, Javier, Juan Carlos García-Palomares, Gustavo Romanillos, and María Henar Salas-Olmedo. 2017. “The Eruption of Airbnb in Tourist Cities: Comparing Spatial Patterns of Hotels and Peer-to-Peer Accommodation in Barcelona.” Tourism Management 62 (October): 278–91. https://doi.org/10.1016/j.tourman.2017.05.003.
Koster, Hans R. A., Jos van Ommeren, and Nicolas Volkhausen. 2021. “Short-Term Rentals and the Housing Market: Quasi-Experimental Evidence from Airbnb in Los Angeles.” Journal of Urban Economics 124 (July): 103356. https://doi.org/10.1016/j.jue.2021.103356.
Meet Boston | Your Official Guide to Boston. n.d. https://www.meetboston.com/.