This project assesses the relationship between Airbnb listing distribution and mean household income across the Boston, Massachusetts neighborhoods. A rarely assessed relationship, this project closes the gap in understanding by combining US Census income data with that of Airbnb listings. This produced a process for assessing other variable relationships to Airbnb listing distribution. Utilizing exploratory analysis and linear regression, mean household income and Airbnb listing distribution was found to have no correlation. Utilizing the methodolgies employed in this study, future analysis can be performed to expand the variables assessed to include greater demographic characteristics within each neighborhood.
An alternative to hotel accommodations, Airbnb has made it easier for individuals to rent rooms or whole houses to other individuals creating a new competitive corner of the market disrupting tourism business models in larger cities (Gutiérrez et al. 2017). Research into impacts and Airbnb distribution has been primarily focused in European markets with attention given to larger tourist city centers (Adamiak et al. 2019; Garcia-López et al. 2020). An exception can be found in a study focusing on Los Angeles County, California that compared short-term housing rental impacts on housing markets by comparing cities in the county that had implemented restrictions on short-term rentals and those that had no such restrictions. It was found that these restrictions led to a 2% drop in property values (Koster et al. 2021).
A city with a market for historical tourism, Boston mixes historic tourist sites amid residential communities. Boston features 24 economically varied neighborhoods and is heavily touristed with no active restrictions or ordinances impacting Airbnb listings making it an ideal candidate for Airbnb market analysis (Meet Boston | Your Official Guide to Boston, n.d.). This project seeks to determine if there is a relationship between the social-economic status of a neighborhood and the preponderance of Airbnb listings. Utilizing Boston as the region of focus and exploratory and statistical analysis methods, which neighborhoods have the highest mean household income? Do these locations show a correlation with the amount of Airbnb listings?
Airbnb listing data was sourced from Insideairbnb.com. This data contains listings from January 1st, 2025 to December 27th, 2025. To obtain the data, Inside Airbnb utilizes public information compiled from the Airbnb website and is verified and cleaned prior to release. The occupancy rate for each listing are calculated utilizing a model based on the work of (Bunn 2024). The resulting model ,called “San Francisco Model”, takes in to account the average stay length of approximately 50% of reviews in a listing. This value is then multiplied by the estimated bookings of the listing to produce an occupancy rate. This value compared with the listing type was utilized to produce an occupancy rating for each neighborhood normalized by listing type. This data set will require filtering of variables to ensure only the neighborhood, coordinates, occupancy rate, and listing type are selected.
US Census data was sourced utilizing the tidycensus
package pulling the 2024 5-yr estimate data. Utilizing the mean
household income by census tract, this data set provides the mean
household income for each neighborhood when combined. Utilizing
tidycensus provides a clean data set that can be combined
with other data sets for deeper analysis.
Neighborhood data was sourced from the official City of Boston data hub. This data
set provides geographies for the 24 Boston neighborhoods aligned with
census tracks allowing for incorporation of tidycensus
data.
The data available for this project varies by year. For the Airbnb data, it was only available for 2025 while US Census data was available for only as recent as 2024. This required an assumption that mean household incomes for each neighborhood would not drastically change. From this assumption, the rest of the methodology as employed.
Before analysis could be completed, tools and data was first loaded
into the project. This project was structured under the tidy data
principles bringing in packages with functions containing tidy inputs
and outputs. These included tidyverse,
leaflet, sf, ggplot2,
dplyr, tidyr, and tidycensus.
# load packages ----
library(tidyverse)
library(here)
library(dplyr)
library(tidyr)
library(janitor)
library(ggplot2)
library(ggiraph)
library(cowplot)
library(scales)
library(leaflet)
library(sf)
library(mapview)
library(tidycensus)
library(tigris)
library(knitr)
With packages loaded, each data set was loaded and cleaned. First, the Boston data was loaded with geography defined as polygon vector data. Reviewing the data set, it was identified that the neighborhood variable name was cut short requiring renaming and an overall cleaning of names overall. Figure 1 features the initial construct of the neighborhood data set.
neighborhoods <- st_read(here("data", "Boston_Neighborhood_Boundaries_Approximated_by_2020_Census_Tracts.shp"),
quiet = TRUE)
view(neighborhoods)
CleanNeighborhoods <- neighborhoods %>%
rename(neighborhood = neighborho) %>%
clean_names()
mapview(CleanNeighborhoods, zcol = "neighborhood")
Next, Airbnb listing data was loaded and cleaned. Once loaded, the
data set was cleaned, columns were renamed for American English
spelling, and a subset of the data was pulled to include the
neighborhood and room type variables. The listing data set was first
converted to an sf feature utilizing
st_as_sf() and was initially plotted by room type over the
neighborhood. Figure 2 shows that the most common type of listings was
for entire homes and apartment rentals.
listings <- read_csv(here("data", "listings_sum.csv"))
CleanListings <- listings %>%
clean_names() %>%
rename(neighborhood = neighbourhood) %>%
select(id,
neighborhood,
room_type,
latitude,
longitude)
listings_plot <- st_as_sf(CleanListings, coords = c("longitude", "latitude"), crs = 4326)
neighborhood_listings <- ggplot() +
geom_sf_interactive(data = CleanNeighborhoods, aes(tooltip = neighborhood, fill = neighborhood), alpha = 0.2) +
geom_sf(data = listings_plot,
aes(color = as.factor(room_type))) +
coord_sf(crs = 4326) +
labs(title = "Airbnb Listings by Room Type \n Plotted on Boston's Neighborhoods",
fill = "Neighborhood",
color = "Room Type")+
theme_void()
girafe(ggobj = neighborhood_listings) %>%
girafe_options(opts_hover(css = "fill:cyan;", reactive = TRUE))
Figure 2 - Initial plot of Airbnb listings by room type across the Boston neighborhoods. The most common type of listings is found to be the entire house and apartment listings.
Lastly, the US Census data was pulled at the census tract level for
Suffolk County, Massachusetts that matches the metropolitan region of
Boston. The associated variable associated with the total mean household
income was identified as “B19013_001”. As Boston is a port city, water
edges are prevalent in the data. This required the use of
erase_water to ensure the census tract polygons matched the
geographic reality. Similar to the neighborhood data set, the census
data was structured with polygon vectors (Figure 3).
# Find variable for mean household income
vars <- load_variables(2024, "acs5")
View(vars)
# Pull data
suffolk_income <- get_acs(
geography = "tract",
variables = "B19013_001",
state = "MA",
county = "Suffolk",
geometry = TRUE
)
## | | | 0% | |= | 1% | |== | 3% | |==== | 5% | |===== | 8% | |======= | 9% | |======= | 10% | |======== | 12% | |========= | 13% | |========== | 14% | |============ | 17% | |============= | 19% | |============== | 20% | |=============== | 21% | |================= | 24% | |================= | 25% | |================== | 26% | |=================== | 27% | |==================== | 28% | |==================== | 29% | |====================== | 32% | |======================== | 34% | |========================= | 36% | |=========================== | 39% | |============================= | 41% | |=============================== | 44% | |================================= | 46% | |================================== | 49% | |==================================== | 52% | |====================================== | 54% | |======================================== | 57% | |========================================= | 59% | |=========================================== | 62% | |============================================= | 64% | |=============================================== | 67% | |================================================= | 69% | |================================================== | 72% | |==================================================== | 74% | |====================================================== | 77% | |======================================================== | 79% | |========================================================= | 82% | |=========================================================== | 85% | |============================================================= | 87% | |=============================================================== | 90% | |================================================================= | 92% | |================================================================== | 95% | |==================================================================== | 97% | |======================================================================| 100%
# Clean up data set
sf_use_s2(FALSE)
suffolk_erase <- erase_water(suffolk_income,
area_threshold = 0.5,
year = 2024)
mapview(suffolk_erase, zcol = "estimate")
Figure 4 is a flow chart of the processes employed to determine if a
correlation exists between the mean household income and total Airbnb
listings in each neighborhood. The three data sets each required unique
data wrangling to produce tidy data ready for analysis operations.
Selecting the Airbnb variables previously discussed, a new, tidy data
set was produced. This data set was combined with Boston neighborhood
data to create visual context for distribution of listings across the
city. Following this, US Census data for census tracks of the Boston
area was pulled using tidycensus. This data set was
converted into centroid points for each tract. The mean income for each
point was added together for all points in each neighborhood and
averaged to produce a mean value for each neighborhood. The resulting
data set was plotted and combined with point count data from the Airbnb
listings data set to produce a joined table consisting of neighborhoods
with variables of mean household income and total listings by type. From
this data set, plots were produced of the total listings versus the mean
household income of each neighborhood. From these plots, correlation was
determined through the use of a simple linear regression model.
To determine the mean income for each neighborhood, each census track
was first converted to a point at the centroid. This allows for point to
polygon joining to occur. The resulting centroids and neighborhood data
set were both set to a common coordinated reference system, WGS84, and
combined using st_join. Any rows with no mean income values
were removed. This then allowed for each row to be grouped by
neighborhood and the mean incomes to be averaged to produce the final
mean income y neighborhood data set (Figure 5).
# convert to centroids
income_centroids <- st_centroid(suffolk_erase)
## Warning: st_centroid assumes attributes are constant over geometries
## Warning in st_centroid.sfc(st_geometry(x), of_largest_polygon =
## of_largest_polygon): st_centroid does not give correct centroids for
## longitude/latitude data
# set the coordinate reference
income_centroids <- st_set_crs(income_centroids, 4326)
## Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
## that
CleanNeighborhoods <- st_set_crs(CleanNeighborhoods, 4326)
# Combine the values
joined_income <- st_join(CleanNeighborhoods, income_centroids)
## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar
# convert variable column to integer and save as mean_income
joined_income <- joined_income %>%
drop_na(estimate)
mean_income_neighborhoods <- joined_income %>%
group_by(neighborhood) %>%
summarise(mean_income = (sum(estimate)/n_distinct(estimate)))
## although coordinates are longitude/latitude, st_union assumes that they are
## planar
mapview(mean_income_neighborhoods, zcol = "mean_income")
Finally, the Airbnb listing data and mean household income by
neighborhood data were joined using st_join to associate
each listing with its respective neighborhood location. The resulting
data set was reduced to the neighborhood, mean income, room type, and
geometry variables. To attain the total count of listings in each
neighborhood, the data were grouped by neighborhood and mean household
income and the resulting number of listings in the room type variable
were counted. These values were then summarized as a new variable call
total_listings.
# join tables with polygons preserved
joined_listings_income <- st_join(mean_income_neighborhoods, listings_plot)
## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar
# select columns
joined_listings_income <- joined_listings_income %>%
select(neighborhood = neighborhood.x,
mean_income,
room_type,
geometry)
# count number of listings by room type by neighborhood
count_listings_income <- joined_listings_income %>%
group_by(neighborhood, mean_income) %>%
count(room_type)
## although coordinates are longitude/latitude, st_union assumes that they are
## planar
# plot of Number Listings by Neighborhood
sum_listings <- count_listings_income %>%
group_by(neighborhood, mean_income) %>%
summarise(total_listings = sum(n))
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by neighborhood and mean_income.
## ℹ Output is grouped by neighborhood.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(neighborhood, mean_income))` for per-operation
## grouping (`?dplyr::dplyr_by`) instead.
## although coordinates are longitude/latitude, st_union assumes that they are
## planar
The first analysis conducted was exploratory analysis. Utilizing
ggplot() from the ggplot2 package, two plots
were produced, one with the total listings by neighborhood and the other
of the mean income by neighborhood. These plots were then compared to
identify if any visual relationship was apparent. To reduce visual
impacts of outliers, a box plot was created (Figure 6). This identified
outliers being total listings beyond 400. The analysis data set was thus
filtered of any total listing values beyond 400 and the visual analysis
conducted again. To confirm the initial impressions from the exploratory
data analysis, a simple linear regression was performed to determine if
total listings could be explained by the mean household income.
# plot of mean income of neighborhoods
plot_income <- ggplot(mean_income_neighborhoods) +
geom_sf(aes(fill = mean_income)) +
scale_fill_distiller(name = "Mean Income",
palette = "Greens",
breaks = pretty_breaks(),
direction = 1) +
coord_sf(crs = 4326) +
labs(title = "Mean Income by Neighborhood",
fill = "Mean Income") +
theme_void()
# plot of Number Listings by Neighborhood
plot_listings <- ggplot(sum_listings) +
geom_sf(aes(fill = total_listings)) +
scale_fill_distiller(name = "Total Listings",
palette = "Blues",
breaks = pretty_breaks(),
direction = 1) +
coord_sf(crs = 4326) +
labs(title = "Total Listings by Neighborhood",
fill = "Total Listings") +
theme_void()
# check for outliers
sum_listings %>%
ggplot(aes(total_listings, total_listings)) +
geom_boxplot()
## Warning: Orientation is not uniquely specified when both the x and y aesthetics are
## continuous. Picking default orientation 'x'.
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
Figure 6 - Boxplot of total listings by neighborhood. A single outlier can be found beyond 400 total listings.
# filter out outlier
filter_sum_listings <- sum_listings %>%
filter(total_listings <= 400)
# new plot of filtered listings
plot_filtered_listings <- ggplot(filter_sum_listings) +
geom_sf(aes(fill = total_listings)) +
scale_fill_distiller(name = "Total Listings",
palette = "Blues",
breaks = pretty_breaks(),
direction = 1) +
coord_sf(crs = 4326) +
labs(title = "Total Listings by Neighborhood",
fill = "Total Listings") +
theme_void()
# create linear regression model
lm_r = lm(formula = total_listings ~ mean_income, data = filter_sum_listings)
The first visual comparison of mean household income and total listings by neighborhood (Figure 7) did not appear to show a correlation. The neighborhood with the most listings was found to be Dorchester on the eastern shore. This did not relate to either the lowest or highest income neighborhoods.
plot_grid(plot_income,
plot_listings,
nrow = 2,
hjust = 0,
label_x = 0,
align = "hv")
Figure 7 - Plots of mean household income and total listings by neighborhood. The neighborhood with the most listings was found to be Dorchester on the eastern shore.
The total listings value for Dorchester was to be an outlier. This value was removed and the data replotted. The resulting data set still did not show any areas that visually correlated between the variables (Figure 8).
plot_grid(plot_income,
plot_filtered_listings,
nrow = 2,
hjust = 0,
label_x = 0,
align = "hv")
Figure 8 - Plots of mean household income and total listings by neighborhood with the Dorchester neighborhood filtered out. No correlating features were identified.
To confirm the visual analysis, a simple linear regression was conducted. The regression model showed no relationship with a r-squared value of -0.03261. By plotting these values on a scatter plot with a fit line, no relationship is indicated. Overall, the total Airbnb listings by Boston neighborhoods did not have a correlation with the mean household incomes of each.
summary(lm_r)
##
## Call:
## lm(formula = total_listings ~ mean_income, data = filter_sum_listings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.18 -74.43 -32.23 63.12 226.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.050e+02 7.148e+01 1.468 0.158
## mean_income 3.537e-04 6.094e-04 0.580 0.568
##
## Residual standard error: 98.81 on 20 degrees of freedom
## Multiple R-squared: 0.01656, Adjusted R-squared: -0.03261
## F-statistic: 0.3368 on 1 and 20 DF, p-value: 0.5682
ggplot(filter_sum_listings) +
geom_point(aes(x = mean_income, y = total_listings)) +
geom_smooth(method = "lm", se = FALSE,
aes(x = mean_income, y = total_listings)) +
labs(title = "Total Listings vs Mean Income",
x = "Mean Income",
y = "Total Listings") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
Figure 9 - linear regression model plot indicating no relationship between total Airbnb listings and mean household income by neighborhood.
While no correlation was found, this project had success is establishing a process for analyzing Airbnb listings data with US Census data. The data was readily available. Boston’s government data hub provides well curated data sets that can be easily accessed. For future analysis, more census variables should be incorporated into the analysis to include demographics, education attainment, and age. This will create a holistic data set for further understanding of variables that could explain the distribution of listings across the city.