Introducing a new Airbnb for a new world of travel
Introduction
Airbnb's involvement in the short-term rental market has grown significantly in recent years, reshaping the way travelers find accommodation and influencing local economies. Its activities are also increasing the cost of local housing. This project explores Airbnb listings in Washington, DC for the year 2025, analyzing key variables such as pricing trends, neighborhood distribution, and host activity levels.
Data Source
The dataset comes from Airbnb, which collects listing details directly from hosts and guest interactions. This data is sourced through publicly available information on Airbnb’s platform, where hosts provide details about their rentals, including pricing, availability, reviews, and geographic location. Because the information Airbnb gathers depends on what hosts provide, a large number of data points are marked NA.
Key Variables in the Dataset
For this analysis, the following variables will be examined:
• Name: The title of the Airbnb listing
• Neighborhood: The area in Washington, DC where the listing is located
• Latitude & Longitude: The geographic coordinates for mapping
• Room Type: Categories such as private rooms, entire homes, or shared spaces
• Price: The nightly rental rate
• Minimum Nights: The minimum required stay set by the host
• Number of Reviews: Guest feedback count for assessing popularity
• Last Review Date: The most recent review left by a guest
• Reviews Per Month: A metric indicating booking frequency
• Host Listing Count: The number of properties managed by a single host
• Availability (365 days): Indicates how frequently a listing is available throughout the year
Data Collection Methodology
Airbnb gathers this data through host-provided details, guest interactions, and platform activity. Listings are updated dynamically based on host settings, guest bookings, and user reviews.
Why This Topic Matters
The Airbnb market in Washington, DC has a significant impact on local housing, tourism, and revenue generation. By analyzing this dataset, I aim to:
1. Understand pricing trends across neighborhoods.
2. Identify which areas are most frequently booked.
Data Cleaning Summary
Column Renaming
All headers were converted to snake_case via tolower() and gsub(), ensuring consistent, space-free variable names (e.g. Number of Reviews → number_of_reviews).
Variable Selection
From the original 18 columns, we retained the 12 essential fields (name, neighbourhood, latitude, longitude, room_type, price, minimum_nights, number_of_reviews, reviews_per_month, calculated_host_listings_count, availability_365, number_of_reviews_ltm) using dplyr::select().
Critical-Field Filtering
Listings lacking price or geolocation (latitude/longitude) were removed via filter(!is.na(…)), since they cannot be visualized or modeled.
Missing-Value Handling
For booking-frequency proxies (reviews_per_month, number_of_reviews_ltm), NAs were replaced with zero—interpreting a missing count as “no reviews.” All other fields were left as-is (no wholesale row drops), preserving 6,257 listings.
Final Dataset
The cleaned dataset contains 6,257 observations and 12 variables, ready for regression analysis, mapping, and visualization.
Load necessary libraries
library(tidyverse)
Warning: package 'readr' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Warning: package 'plotly' was built under R version 4.4.3
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(leaflet)
Warning: package 'leaflet' was built under R version 4.4.3
library(RColorBrewer)
library(viridis)
Loading required package: viridisLite
library(GGally)
Warning: package 'GGally' was built under R version 4.4.3
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
Load the dataset from the working directory into the global environment, convert the headers to lowercase, and replace spaces with underscores
setwd("C:/Users/Owner/OneDrive/Desktop/Data110")
# Suppress all messages when reading the CSV file
airbnb_2025 <- suppressMessages(read_csv("airbnb_washington_dc,2025.csv", show_col_types = FALSE))
names(airbnb_2025) <- gsub(" ", "_", tolower(names(airbnb_2025)))
head(airbnb_2025)
# A tibble: 6 × 18
id name host_id host_name neighbourhood_group neighbourhood latitude
<dbl> <chr> <dbl> <chr> <lgl> <chr> <dbl>
1 3686 Vita's Hid… 4645 Vita NA Historic Ana… 38.9
2 3943 Historic R… 5059 Vasa NA Edgewood, Bl… 38.9
3 4197 Capitol Hi… 5061 Sandra NA Capitol Hill… 38.9
4 4529 Bertina's … 5803 Bertina NA Eastland Gar… 38.9
5 5589 Cozy apt i… 6527 Ami NA Kalorama Hei… 38.9
6 7103 Lovely gue… 17633 Charlotte NA Spring Valle… 38.9
# ℹ 11 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
# minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
# reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
# availability_365 <dbl>, number_of_reviews_ltm <dbl>, license <chr>
Remove unwanted variables. The removed variables are very important to Airbnb, but I exclude them for the purposes of my statistical analysis.
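The cleaning chunk itself does not appear in this section, so the following is only a minimal sketch of the steps listed in the Data Cleaning Summary (select, filter on critical fields, zero-fill the review proxies). The toy table `airbnb_raw` and its values are hypothetical stand-ins, and only a subset of the 12 retained columns is shown.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the raw listings table
airbnb_raw <- tibble(
  id = 1:4,
  name = c("A", "B", "C", "D"),
  neighbourhood = c("Capitol Hill", "Edgewood", "Capitol Hill", "Georgetown"),
  latitude = c(38.90, 38.91, NA, 38.92),
  longitude = c(-77.00, -77.01, -77.02, -77.03),
  price = c(120, NA, 95, 210),
  reviews_per_month = c(1.2, 0.4, 2.0, NA),
  number_of_reviews_ltm = c(10, 3, 20, NA)
)

airbnb_clean_demo <- airbnb_raw |>
  # Variable selection: keep only the fields needed for the analysis
  select(name, neighbourhood, latitude, longitude, price,
         reviews_per_month, number_of_reviews_ltm) |>
  # Critical-field filtering: drop rows without price or coordinates
  filter(!is.na(price), !is.na(latitude), !is.na(longitude)) |>
  # Missing-value handling: treat a missing review count as "no reviews"
  mutate(
    reviews_per_month = replace_na(reviews_per_month, 0),
    number_of_reviews_ltm = replace_na(number_of_reviews_ltm, 0)
  )

airbnb_clean_demo
```

In this toy table, listing B is dropped for a missing price and listing C for a missing latitude, while listing D is kept with its review fields set to zero.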
Price shows very weak negative correlations with minimum nights (–0.078), total reviews (–0.061), and reviews per month (–0.039). All three are statistically significant but practically tiny. However, there is no appreciable link between price and availability (0.019, not significant).
Minimum nights
Weak to moderate negative correlations with number of reviews (–0.159) and reviews per month (–0.306): listings that require longer stays get fewer reviews and are reviewed less frequently.
Positive correlation with availability (0.108): properties with longer minimum stays tend to be available more days of the year.
Reviews and availability
Strong positive correlation between total reviews and reviews per month (0.596): more‐reviewed listings tend also to get reviewed more often.
Both total reviews (–0.096) and reviews per month (–0.128) have small but significant negative correlations with availability: busier listings are booked more and thus less available.
Distributions and scatterplots
Price and reviews distributions are right‐skewed, with many low values and a long tail.
The linear fit lines in the lower panels confirm the very shallow slopes for most relationships, indicating weak linear trends.
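The code that produced these correlations is not shown in this section. As a hedged illustration of how a single coefficient and its significance can be checked, the sketch below runs a Pearson test with base R's `cor.test()` on simulated data. The simulated vectors only borrow the variable names; they are not the Airbnb data, and the resulting numbers will not match the values reported above.

```r
set.seed(110)
n <- 500

# Simulated stand-ins: longer minimum stays go with fewer monthly reviews
minimum_nights <- rpois(n, lambda = 3) + 1
reviews_per_month <- pmax(0, 3 - 0.3 * minimum_nights + rnorm(n))

# Pearson correlation with a two-sided significance test
ct <- cor.test(minimum_nights, reviews_per_month, method = "pearson")

ct$estimate  # correlation coefficient r
ct$p.value   # p-value for H0: r = 0
ct$conf.int  # 95% confidence interval for r
```

With several thousand listings, even a coefficient close to zero can come out statistically significant, which is why the size of r matters more than the stars when reading the results above.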
Perform multiple linear regression to analyze how various factors affect Airbnb pricing in Washington, DC.
# Fit the multiple linear regression model
fit1 <- lm(price ~ minimum_nights + number_of_reviews + reviews_per_month + availability_365, data = airbnb_clean)
summary(fit1)
Call:
lm(formula = price ~ minimum_nights + number_of_reviews + reviews_per_month +
availability_365, data = airbnb_clean)
Residuals:
Min 1Q Median 3Q Max
-186.7 -76.6 -33.6 24.5 6819.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 188.94291 8.12816 23.245 < 2e-16 ***
minimum_nights -1.23180 0.20480 -6.015 1.96e-09 ***
number_of_reviews -0.10134 0.03653 -2.774 0.00557 **
reviews_per_month -3.99356 2.34635 -1.702 0.08883 .
availability_365 0.03738 0.02925 1.278 0.20140
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 226.7 on 3982 degrees of freedom
Multiple R-squared: 0.0129, Adjusted R-squared: 0.01191
F-statistic: 13.01 on 4 and 3982 DF, p-value: 1.584e-10
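• Intercept (188.94): Expected price when all predictors are zero.
• Minimum nights (−1.23): Each extra required night is associated with a $1.23 lower price, holding the other predictors constant.
• Number of reviews (−0.10): Each additional review is associated with a $0.10 lower price.
• Reviews per month (−3.99): A one-unit increase in the monthly review rate is associated with a $3.99 lower price, although this effect is only marginally significant (p ≈ 0.089).
• Availability_365 (+0.037): Each extra available day per year is associated with a $0.04 higher price, but this coefficient is not statistically significant (p ≈ 0.20).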
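To make the coefficients concrete, here is a small worked example that plugs the estimates from the summary table into the regression equation by hand. The listing itself (2 minimum nights, 50 reviews, 1.5 reviews per month, 200 available days) is hypothetical and chosen only for illustration.

```r
# Coefficient estimates copied from the summary(fit1) output
coefs <- c(
  "(Intercept)"     = 188.94291,
  minimum_nights    = -1.23180,
  number_of_reviews = -0.10134,
  reviews_per_month = -3.99356,
  availability_365  = 0.03738
)

# Hypothetical listing; the leading 1 multiplies the intercept
listing <- c(
  1,
  minimum_nights    = 2,
  number_of_reviews = 50,
  reviews_per_month = 1.5,
  availability_365  = 200
)

# Predicted nightly price = sum of coefficient * predictor value
predicted_price <- sum(coefs * listing)
round(predicted_price, 2)  # about 182.90
```

With the fitted model object available, `predict(fit1, newdata = ...)` should give the same value without copying coefficients by hand.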
Model fit
Adjusted R² = 0.0119: these four factors explain only about 1.2% of the variation in price, so most of what drives price (such as location, room type, and amenities) isn’t captured here.
Diagnostics
Residuals vs. Fitted
# Add fitted values and residuals directly into airbnb_clean
airbnb_clean <- airbnb_clean |>
  mutate(
    Fitted = fitted(fit1),
    Residuals = resid(fit1)
  )

# 1. Residuals vs. Fitted plot
ggplot(airbnb_clean, aes(x = Fitted, y = Residuals)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Residuals vs. Fitted",
    x = "Fitted values",
    y = "Residuals"
  ) +
  theme_minimal()
Residuals vs. Fitted
Purpose: Checks whether errors have constant variance (homoscedasticity) and whether the linear form is appropriate.
What to look for: A funnel shape (residuals spreading out as fitted values grow) ⇒ heteroscedasticity (variance changing with price). Any systematic curve or pattern ⇒ the model’s linear form may be mis-specified (consider adding non-linear terms).
Normal Q–Q
Purpose: Checks whether residuals are approximately normally distributed (an assumption for valid hypothesis tests).
What to look for: Points should fall roughly along the 45° reference line. S-shaped deviations at the ends ⇒ heavy tails (outliers) or skewness (consider transformations or robust methods).
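The code for the normal Q-Q plot is not included in this section. One possible way to draw it with ggplot2 is sketched below, using `stat_qq()` and `stat_qq_line()` on the residuals of a small simulated model. The data frame `demo` and the model `fit_demo` are hypothetical; for the actual analysis the residuals would come from `fit1`.

```r
library(ggplot2)

set.seed(110)

# Simulated data with right-skewed errors, loosely mimicking price data
demo <- data.frame(x = runif(300, 0, 10))
demo$y <- 50 + 5 * demo$x + rlnorm(300, meanlog = 3, sdlog = 1)

fit_demo <- lm(y ~ x, data = demo)
qq_df <- data.frame(Residuals = resid(fit_demo))

# 2. Normal Q-Q plot of the residuals
qq_plot <- ggplot(qq_df, aes(sample = Residuals)) +
  stat_qq(alpha = 0.4) +
  stat_qq_line(color = "red", linetype = "dashed") +
  labs(
    title = "Normal Q-Q",
    x = "Theoretical quantiles",
    y = "Sample quantiles"
  ) +
  theme_minimal()

qq_plot
```

Note that `stat_qq_line()` draws the line through the first and third quartiles of the sample, so points that drift above it on the right indicate a heavy upper tail.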
Significant predictors:
minimum_nights (p<0.001) and number_of_reviews (p≈0.0056) are significant; reviews_per_month is marginal (p≈0.089); availability_365 is not significant (p≈0.20).
Residuals vs Fitted:
A pronounced funnel for high‐priced listings → heteroscedasticity. Extreme positive residuals (under‐predicted luxury prices).
Normal Q–Q:
Bulk of residuals roughly normal, but heavy positive tail of extreme outliers.
Exploring different variables in my dataset, plot at least THREE distinct types of visualizations.
Average Price by Top 10 Neighborhoods
Here we compute mean price per neighborhood, select the top 10 by listing count, and plot their average prices to see where hosts charge the most.
# 1) Summarize & shorten names
neigh_price <- airbnb_dc2025 |>
  group_by(neighbourhood) |>
  summarise(
    avg_price = mean(price, na.rm = TRUE),  # average price
    n_listings = n()                        # total listings
  ) |>
  arrange(desc(n_listings)) |>
  slice(1:10) |>
  mutate(
    # keep only the text before the first comma and wrap at 15 characters
    neighbourhood_short = str_wrap(str_remove(neighbourhood, ",.*"), width = 15)
  )
2) Plot the average Airbnb price for the top 10 neighborhoods
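The plotting code for this step is not shown here. A minimal sketch of one way to draw the chart is given below as a horizontal bar chart with `geom_col()`. The data frame `neigh_price_demo` mimics the columns of `neigh_price` but contains made-up neighborhoods and prices, so it is an assumption rather than the author's actual figure.

```r
library(ggplot2)

# Hypothetical summary table with the same columns as neigh_price
neigh_price_demo <- data.frame(
  neighbourhood_short = c("Capitol Hill", "Dupont Circle", "Georgetown"),
  avg_price = c(180, 165, 240),
  n_listings = c(400, 350, 200)
)

# Horizontal bars, ordered from lowest to highest average price
top_plot <- ggplot(
  neigh_price_demo,
  aes(x = reorder(neighbourhood_short, avg_price), y = avg_price)
) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Average Airbnb price, most-listed neighborhoods",
    x = "Neighborhood",
    y = "Average nightly price (USD)"
  ) +
  theme_minimal()

top_plot
```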
# 1) Compute average price per neighborhood and pick the bottom 10 by avg_price
bottom_price <- airbnb_dc2025 |>
  group_by(neighbourhood) |>
  summarise(
    avg_price = mean(price, na.rm = TRUE),
    n_listings = n()
  ) |>
  arrange(avg_price) |>
  slice(1:10) |>
  # keep only the text before the first comma and wrap at 15 characters
  mutate(
    neighbourhood_short = str_wrap(str_remove(neighbourhood, ",.*"), width = 15)
  )
Plot 2: Average Airbnb price for the bottom ten neighborhoods, made interactive with plotly.
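As a hedged sketch of how the interactive version could be built, the example below wraps a ggplot2 bar chart in `plotly::ggplotly()` and supplies a custom hover label through the `text` aesthetic. The table `bottom_price_demo` uses invented values in place of `bottom_price`, and the tooltip wording is an assumption.

```r
library(ggplot2)
library(plotly)

# Hypothetical summary table with the same columns as bottom_price
bottom_price_demo <- data.frame(
  neighbourhood_short = c("Area A", "Area B", "Area C"),
  avg_price = c(72, 85, 91),
  n_listings = c(40, 25, 60)
)

# Static chart; the text aesthetic feeds the hover tooltip
p <- ggplot(
  bottom_price_demo,
  aes(
    x = reorder(neighbourhood_short, avg_price),
    y = avg_price,
    text = paste0(neighbourhood_short, "\nAverage price: $", avg_price,
                  "\nListings: ", n_listings)
  )
) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  labs(x = "Neighborhood", y = "Average nightly price (USD)") +
  theme_minimal()

# Convert to an interactive plotly widget showing only the custom text
p_interactive <- ggplotly(p, tooltip = "text")
p_interactive
```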