My data source is the 2025 Airbnb listing data for Washington, D.C. In this report, I examine several measures of success in the Airbnb market (such as review score and market share) and the factors that may influence them (such as price or superhost status). After removing rows with no price information, the data cover 4644 properties, with variables such as nightly price, review scores, and host characteristics. Four types of properties are represented: hotel room, shared room, private room, and entire home/apartment. There were 2625 unique hosts in the data. The mean nightly price was $155.39, with a minimum of $8, a maximum of $970, and a median of $123.
library(plotly)
library(data.table)
library(DescTools)
library(ggplot2)
library(dplyr)
library(ggthemes)
library(tidyr)
#Setting working directory and importing data; removing unneeded columns
setwd("C:/Users/mcshe/Documents/R Project")
df <- fread("C:/Users/mcshe/Documents/R Project/listings.csv", drop = c("id", "listing_url", "scrape_id", "last_scraped", "source", "description",
"neighborhood_overview", "picture_url", "host_url", "host_name", "host_about",
"host_location", "host_thumbnail_url", "host_picture_url", "host_verifications",
"neighbourhood", "neighbourhood_group_cleansed", "latitude", "longitude",
"bathrooms_text", "minimum_minimum_nights", "maximum_minimum_nights",
"minimum_maximum_nights", "maximum_maximum_nights", "minimum_nights_avg_ntm",
"maximum_nights_avg_ntm", "calendar_updated", "availability_30",
"availability_60", "availability_90", "availability_365", "calendar_last_scraped",
"number_of_reviews_ltm", "availability_eoy", "number_of_reviews_ly",
"estimated_occupancy_l365d", "license"))
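#Converting price column to numbers by removing dollar signs via gsub and using as.numeric
#Note: as.numeric() returns NA for any value that still contains a thousands separator (e.g. "$1,200.00"), so such rows are dropped along with the missing prices below
df$price <- as.numeric(gsub("\\$", "", df$price))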
#Removing any rows with NA or 0 as price
#(logical indexing stays correct even when no rows need dropping, unlike df[-integer(0), ], which returns zero rows)
df <- df[!is.na(df$price) & df$price > 0, ]
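As a cross-check on the cleaning step, the sketch below parses currency strings and computes summary figures of the kind quoted in the introduction. The helper `parse_price` and the toy `listings` table are hypothetical and are not part of the original analysis. Unlike the `gsub("\\$", ...)` call above, this variant also strips thousands separators, so it would keep listings priced at $1,000 or more instead of turning them into `NA`.

```r
# Hypothetical helper: strip "$" and "," before converting to numeric.
parse_price <- function(x) {
  as.numeric(gsub("[$,]", "", x))
}

# Toy data for illustration only (not taken from the D.C. listings file).
listings <- data.frame(
  host_id = c(1, 1, 2, 3, 4),
  price   = c("$85.00", "$1,200.00", "$123.00", "", "$0.00"),
  stringsAsFactors = FALSE
)
listings$price <- parse_price(listings$price)

# Keep only rows with a positive, non-missing price.
listings <- listings[!is.na(listings$price) & listings$price > 0, ]

# Summary figures of the kind reported in the introduction.
price_summary <- c(
  n_listings = nrow(listings),
  n_hosts    = length(unique(listings$host_id)),
  mean       = mean(listings$price),
  median     = median(listings$price),
  min        = min(listings$price),
  max        = max(listings$price)
)
print(price_summary)
```

Applying a parser like this to the real file would likely change the listing count and the maximum price reported above, so the two cleaning approaches should not be mixed within one analysis.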
The number of properties and the average review score were aggregated by host, then plotted against each other to look for a possible correlation. To limit the influence of outliers, any host whose number of properties was more than 1 standard deviation above the mean was excluded from the graph.
host_df <- df %>%
group_by(host_id) %>%
summarise(
avg_review = mean(review_scores_rating, na.rm = TRUE),
listings = n()
) %>%
filter(listings > 0)
#calculate mean and sd to remove outliers
mean_listings <- mean(host_df$listings)
sd_listings <- sd(host_df$listings)
cutoff <- mean_listings + sd_listings
host_df_no_outliers <- host_df %>%
filter(listings <= cutoff)
#scatterplot
ggplot(host_df_no_outliers, aes(x = listings, y = avg_review)) +
geom_point(alpha = 0.5, color = "black") +
labs(
x = "Number of Listings",
y = "Average Review Score",
title = "Listings vs Average Review Score (per Host)"
) +
theme_light() +
theme(plot.title = element_text(hjust = 0.5))
The plot shows that most hosts manage only one or two properties and receive review scores clustered near the upper end of the rating scale. There is no clear upward or downward trend, suggesting that owning more properties does not necessarily lead to higher or lower average ratings. This indicates that host scale alone is neither a strong determinant nor an accurate predictor of customer satisfaction.
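The visual impression of "no clear trend" could be backed by a rank correlation. The sketch below is one way to do that, not the method used in this report; the `host_toy` table and its values are made up for illustration, and in practice the per-host table built above would take its place.

```r
# Toy per-host table for illustration only (values are made up).
host_toy <- data.frame(
  listings   = c(1, 1, 1, 2, 2, 3, 4, 6),
  avg_review = c(4.9, 4.7, NA, 4.8, 4.6, 4.9, 4.7, 4.8)
)

# Spearman's rank correlation is less sensitive to the skew in listing
# counts than Pearson's; hosts without a review average are dropped.
rho <- cor(host_toy$listings, host_toy$avg_review,
           method = "spearman", use = "complete.obs")
print(round(rho, 2))
```

A coefficient close to 0 would agree with the reading of the scatterplot, while a value far from 0 would suggest a monotonic relationship that the plot hides.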
Listings with missing review scores or invalid prices were removed. The remaining listings were then grouped by review score, and the average nightly price for each score was plotted.
#create new df, removing NA's/0's; create avg_price column
price_review_df <- df %>%
filter(!is.na(review_scores_rating) & price > 0) %>%
group_by(review_scores_rating) %>%
summarise(avg_price = mean(price, na.rm = TRUE)) %>%
arrange(review_scores_rating)
#create line chart
ggplot(price_review_df, aes(x = review_scores_rating, y = avg_price)) +
geom_line(color = "red", linewidth = 1) +
geom_point(color = "black", size = 2) +
labs(
x = "Review Score",
y = "Average Price per Night ($)",
title = "Average Price vs Review Score"
) +
theme_calc() +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
)
The chart shows relatively stable pricing across most rating levels, with only modest variation. While higher-rated properties seem to be associated with slightly higher prices, the overall relationship appears weak. This suggests that factors other than rating may play a larger role in determining price.
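Because each distinct review score can represent very few listings, a binned summary that also reports group sizes may make the comparison more robust. The following is a minimal sketch under that assumption; the `toy` data, the bin edges, and the use of the median are illustrative choices, not part of the original analysis.

```r
# Toy listing-level data for illustration only (values are made up).
toy <- data.frame(
  review_scores_rating = c(3.2, 4.1, 4.5, 4.6, 4.8, 4.9, 4.95, 5.0),
  price                = c(90, 110, 100, 140, 120, 160, 150, 130)
)

# Pool sparsely populated scores into bins (bin edges are an assumption).
toy$score_bin <- cut(toy$review_scores_rating,
                     breaks = c(0, 4, 4.5, 4.8, 5),
                     include.lowest = TRUE)

# Median price and number of listings per bin.
bin_summary <- data.frame(
  score_bin    = levels(toy$score_bin),
  median_price = as.numeric(tapply(toy$price, toy$score_bin, median)),
  n_listings   = as.integer(table(toy$score_bin))
)
print(bin_summary)
```

Reporting `n_listings` next to each price makes it easier to see which points on a chart rest on only a handful of properties.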
After restricting the data to the 25 neighborhoods with the most listings, the number of unique hosts in each neighborhood was counted by superhost status. The total number of hosts in each neighborhood is printed at the end of its bar.
#need to trim whitespace around neighborhood name and remove NA's/blanks
df <- df %>%
mutate(host_neighbourhood = trimws(host_neighbourhood)) %>%
filter(!is.na(host_neighbourhood), host_neighbourhood != "")
#create df with just the top 25 neighborhoods in terms of listing number
top25_neighborhoods <- df %>%
count(host_neighbourhood, name = "total_listings") %>%
arrange(desc(total_listings)) %>%
slice_head(n = 25)
#create df that has the neighborhoods (within top 25) and superhost status
#group data by hosts per neighborhood for stacked bar chart
host_neighborhood_df <- df %>%
filter(host_neighbourhood %in% top25_neighborhoods$host_neighbourhood,
!is.na(host_is_superhost)) %>%
mutate(
host_is_superhost = ifelse(host_is_superhost == "t",
"Superhost", "Not Superhost")
) %>%
group_by(host_neighbourhood, host_is_superhost) %>%
summarise(
num_hosts = n_distinct(host_id),
.groups = "drop"
)
#reorder df so that neighborhoods with more hosts will be at top
host_neighborhood_df <- host_neighborhood_df %>%
group_by(host_neighbourhood) %>%
mutate(total_hosts = sum(num_hosts)) %>%
ungroup() %>%
mutate(
host_neighbourhood =
reorder(host_neighbourhood, total_hosts)
)
#create horizontal stacked bar chart using superhost status as fill
ggplot(host_neighborhood_df,
aes(x = host_neighbourhood,
y = num_hosts,
fill = host_is_superhost)) +
geom_bar(stat = "identity") +
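#draw one total label per neighborhood (the summary has one row per host type, which would otherwise print each label twice)
geom_text(data = distinct(host_neighborhood_df, host_neighbourhood, total_hosts),
aes(x = host_neighbourhood,
y = total_hosts,
label = total_hosts),
inherit.aes = FALSE,
hjust = -0.01) +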
coord_flip() +
labs(
x = "Neighborhood (Top 25 by Listings)",
y = "Number of Hosts",
fill = "Host Type",
title = "Hosts in Top 25 Neighborhoods by Listings\n(Superhost vs Non-Superhost)"
) +
theme_light() +
theme(plot.title = element_text(hjust = .5)) +
scale_fill_brewer(palette="Paired", guide = guide_legend(reverse=TRUE))
The chart highlights how host composition varies geographically. Across the major neighborhoods, non-superhosts make up the majority of hosts. However, in areas with higher property concentrations such as Northwest Washington and Northeast Washington, there is a substantial number of superhosts.
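Raw counts make it hard to compare neighborhoods of different sizes, so a per-neighborhood superhost share can complement the stacked bars. The sketch below shows one way to compute it in base R; the `toy` table is made up, and only the column names are borrowed from the code above.

```r
# Toy listing-level data for illustration only (values are made up).
toy <- data.frame(
  host_id            = c(1, 1, 2, 3, 4, 5, 5),
  host_neighbourhood = c("A", "A", "A", "A", "B", "B", "B"),
  host_is_superhost  = c("t", "t", "f", "f", "t", "f", "f"),
  stringsAsFactors = FALSE
)

# Count each host once per neighborhood, mirroring n_distinct(host_id).
hosts <- unique(toy[, c("host_id", "host_neighbourhood", "host_is_superhost")])

# Share of hosts flagged as superhosts in each neighborhood.
superhost_share <- tapply(hosts$host_is_superhost == "t",
                          hosts$host_neighbourhood, mean)
print(round(superhost_share, 2))
```

A neighborhood with many superhosts in absolute terms can still have a below-average share, which is the distinction this calculation brings out.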
Listings are divided by property type (private room, hotel, shared room, or whole home).
property_type <- df %>%
count(room_type)
plot_ly(
property_type, labels = ~room_type, values = ~n, type = "pie",
textposition = "outside", textinfo = "label+percent",
hole = 0.5,
hoverinfo = "label+value+percent"
) %>%
layout(
title = list(
text = "Distribution of D.C. Airbnb Property Types (2025)",
x = 0.5
),
annotations = list(
list(
text = paste0("<b>Total Listings</b><br>", nrow(df)),
x = 0.5,
y = 0.5,
showarrow = FALSE,
font = list(size = 14)
)
)
)
Entire homes/apartments dominated the market, representing the largest share of listings by a wide margin. Private rooms were the second most prevalent, while shared and hotel rooms constituted a relatively small portion of total listings. This suggests that the D.C. Airbnb market is primarily oriented toward full-unit rentals rather than shared accommodations.
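The shares behind a chart like this can also be tabulated directly, which helps when exact percentages are needed in the text. This is a small sketch with made-up counts; the `room_type` vector stands in for the corresponding column of the listings data.

```r
# Toy room types for illustration only (counts are made up).
room_type <- c(rep("Entire home/apt", 7), rep("Private room", 2), "Shared room")

# Percentage of listings in each room type, largest first.
share_pct <- sort(round(100 * prop.table(table(room_type)), 1),
                  decreasing = TRUE)
print(share_pct)
```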
Listings in the top 10 neighborhoods (by listing count) were broken down by neighborhood and property type. The count for each combination of neighborhood and property type was placed in the corresponding cell of a grid (cells with a count of 0 were left blank). A purple gradient conveys the number of listings relative to other neighborhood/property type combinations.
#create df that only holds the top 10 neighborhoods in terms of listing number
top10_neighborhoods <- df %>%
filter(!is.na(room_type)) %>%
count(host_neighbourhood, name = "total_listings") %>%
arrange(desc(total_listings)) %>%
slice_head(n = 10)
property_neighborhood_df <- df %>%
filter(host_neighbourhood %in% top10_neighborhoods$host_neighbourhood,
!is.na(room_type)) %>%
count(host_neighbourhood, room_type, name = "n_listings") %>%
complete(host_neighbourhood, room_type, fill = list(n_listings = 0))
#legend breaks for the heatmap are computed inline with pretty() in scale_fill_gradient below
#create heatmap
ggplot(property_neighborhood_df,
aes(room_type, host_neighbourhood, fill = n_listings)) +
geom_tile(color = "grey85", linewidth = 0.3) +
geom_text(aes(label = ifelse(n_listings == 0, "", n_listings)),
size = 3) +
scale_fill_gradient(
low = "white",
high = "#5B0F8A",
breaks = pretty(property_neighborhood_df$n_listings, n = 5)
) +
labs(
x = "Room Type",
y = "Neighborhood",
fill = "Number of Listings",
title = "Room Type Distribution by Neighborhood (Top 10)"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.x = element_text(angle = 15, hjust = 1),
legend.position = "right"
)
Entire homes are the dominant listing type across the top 10 neighborhoods, particularly in Northwest Washington, which shows the highest overall activity. Private rooms appear most frequently in dense urban neighborhoods, while hotel and shared rooms are relatively rare. These findings are consistent with the host count and property type charts, in which Northwest Washington had the most hosts and entire homes/apartments were the most common property type.
Overall, the visualizations reveal that the Washington, D.C. Airbnb market is dominated by entire-home listings and small-scale hosts, with review scores showing only weak relationships to nightly price and to the number of listings a host manages. Geographic differences are evident, with certain neighborhoods exhibiting higher activity and a stronger presence of superhosts. Together, these patterns suggest a market shaped primarily by location and property type rather than host scale or nightly price alone.