Final Project

Author

Paul Daniel-Orie

Introducing a new Airbnb for a new world of travel

Introduction

Airbnb's involvement in the short-term rental market has grown significantly in recent years, reshaping the way travelers find accommodation and influencing local economies. While its activity is also raising the cost of local housing, this project focuses on Airbnb listings in Washington, DC for the year 2025, analyzing key variables such as pricing trends, neighborhood distribution, and host activity levels.

Data Source

The dataset comes from Airbnb, which collects listing details directly from hosts and guest interactions. The data is drawn from publicly available information on Airbnb’s platform, where hosts provide details about their rentals, including pricing, availability, reviews, and geographic location. Because the information Airbnb gathers depends on what hosts supply, many values are marked NA; for example, 1,488 of the 6,257 listings have no price.

Key Variables in the Dataset

For this analysis, the following variables will be examined:

• Name: The title of the Airbnb listing
• Neighborhood: The area in Washington, DC where the listing is located
• Latitude & Longitude: The geographic coordinates for mapping
• Room Type: Categories such as private rooms, entire homes, or shared spaces
• Price: The nightly rental rate
• Minimum Nights: The minimum required stay set by the host
• Number of Reviews: Guest feedback count for assessing popularity
• Last Review Date: The most recent review left by a guest
• Reviews Per Month: A metric indicating booking frequency
• Host Listing Count: The number of properties managed by a single host
• Availability (365 days): Indicates how frequently a listing is available throughout the year

Data Collection Methodology

Airbnb gathers this data through host-provided details, guest interactions, and platform activity. Listings are updated dynamically based on host settings, guest bookings, and user reviews.

Why This Topic Matters

The Airbnb market in Washington, DC has a significant impact on local housing, tourism, and revenue generation. By analyzing this dataset, I aim to:

1. Understand pricing trends across neighborhoods.

2. Identify which areas are most frequently booked.

Data Cleaning Summary

Column Renaming

All headers were converted to snake_case via tolower() and gsub(), ensuring consistent, space-free variable names (e.g. Number of Reviews → number_of_reviews).
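As a minimal sketch of the renaming step, the snippet below applies tolower() and gsub() to a few made-up header strings. The names in raw_names are hypothetical examples, not necessarily the headers in the CSV file.

```r
# Hypothetical raw headers, for illustration only
raw_names <- c("Number of Reviews", "Room Type", "Availability 365")

# Lowercase first, then replace every space with an underscore
clean_names <- gsub(" ", "_", tolower(raw_names))
print(clean_names)
```

Headers that are already lowercase and space-free pass through this step unchanged, so it is safe to run on any file.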

Variable Selection

From the original 18 columns, we retained the 12 essential fields (name, neighbourhood, latitude, longitude, room_type, price, minimum_nights, number_of_reviews, reviews_per_month, calculated_host_listings_count, availability_365, number_of_reviews_ltm) by dropping the other six (id, host_id, host_name, neighbourhood_group, last_review, license) with dplyr::select().

Critical-Field Filtering

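Listings lacking price or reviews_per_month were removed via filter(!is.na(…)), since they cannot be used in the regression. Latitude and longitude have no missing values in this dataset, so no listing was dropped for lack of geolocation.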

Missing-Value Handling

The summary shows 1,488 missing prices and 1,236 missing reviews_per_month values; number_of_reviews_ltm has none. No values were imputed: the regression subset (airbnb_clean) drops rows missing either field, while the neighborhood summaries keep all 6,257 listings and skip missing prices with mean(price, na.rm = TRUE).
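Whichever strategy is chosen, it helps to count the missing values per column first. The sketch below does this on a small toy data frame; the values in toy are invented for illustration and are not rows from the Airbnb file.

```r
# Toy data frame standing in for the listings table (values are made up)
toy <- data.frame(
  price             = c(60, NA, 128, NA),
  reviews_per_month = c(0.5, 1.2, NA, 2.0),
  latitude          = c(38.90, 38.91, 38.92, 38.93)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows where both modelling fields are present
toy_complete <- toy[!is.na(toy$price) & !is.na(toy$reviews_per_month), ]
nrow(toy_complete)
```

Comparing nrow() before and after the filter shows how many listings a complete-case rule removes.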

Final Dataset

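The working dataset (airbnb_dc2025) contains 6,257 observations and 12 variables and is used for the summaries, mapping, and visualization. The regression subset (airbnb_clean) contains 3,987 complete observations and 10 variables.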

Load necessary libraries

library(tidyverse)
library(plotly)
library(leaflet)
library(RColorBrewer)
library(viridis)
library(GGally)

Load the dataset from the working directory into the global environment, convert the headers to lowercase, and replace spaces with underscores.

setwd("C:/Users/Owner/OneDrive/Desktop/Data110")
# Read the CSV; show_col_types = FALSE hides the column-type message
airbnb_2025 <- suppressMessages(read_csv("airbnb_washington_dc,2025.csv", show_col_types = FALSE))
# Lowercase the headers and replace spaces with underscores
names(airbnb_2025) <- gsub(" ", "_", tolower(names(airbnb_2025)))
head(airbnb_2025)
# A tibble: 6 × 18
     id name        host_id host_name neighbourhood_group neighbourhood latitude
  <dbl> <chr>         <dbl> <chr>     <lgl>               <chr>            <dbl>
1  3686 Vita's Hid…    4645 Vita      NA                  Historic Ana…     38.9
2  3943 Historic R…    5059 Vasa      NA                  Edgewood, Bl…     38.9
3  4197 Capitol Hi…    5061 Sandra    NA                  Capitol Hill…     38.9
4  4529 Bertina's …    5803 Bertina   NA                  Eastland Gar…     38.9
5  5589 Cozy apt i…    6527 Ami       NA                  Kalorama Hei…     38.9
6  7103 Lovely gue…   17633 Charlotte NA                  Spring Valle…     38.9
# ℹ 11 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
#   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
#   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
#   availability_365 <dbl>, number_of_reviews_ltm <dbl>, license <chr>

Remove unwanted variables. Although the removed variables matter to Airbnb, I exclude them for the purposes of my statistical analysis.

airbnb_dc2025 <- airbnb_2025 |>
  select(-c(neighbourhood_group, license, host_id, host_name, id, last_review))
head(airbnb_dc2025)
# A tibble: 6 × 12
  name           neighbourhood latitude longitude room_type price minimum_nights
  <chr>          <chr>            <dbl>     <dbl> <chr>     <dbl>          <dbl>
1 Vita's Hideaw… Historic Ana…     38.9     -77.0 Private …    60             31
2 Historic Rowh… Edgewood, Bl…     38.9     -77.0 Private …    63              1
3 Capitol Hill … Capitol Hill…     38.9     -77.0 Private …   128              4
4 Bertina's  Ho… Eastland Gar…     38.9     -76.9 Private …    64             30
5 Cozy apt in A… Kalorama Hei…     38.9     -77.0 Entire h…    NA             50
6 Lovely guest … Spring Valle…     38.9     -77.1 Entire h…    74             31
# ℹ 5 more variables: number_of_reviews <dbl>, reviews_per_month <dbl>,
#   calculated_host_listings_count <dbl>, availability_365 <dbl>,
#   number_of_reviews_ltm <dbl>

Understanding the variables and summary statistics.

summary(airbnb_dc2025)
     name           neighbourhood         latitude       longitude     
 Length:6257        Length:6257        Min.   :38.82   Min.   :-77.11  
 Class :character   Class :character   1st Qu.:38.90   1st Qu.:-77.03  
 Mode  :character   Mode  :character   Median :38.91   Median :-77.01  
                                       Mean   :38.91   Mean   :-77.01  
                                       3rd Qu.:38.92   3rd Qu.:-76.99  
                                       Max.   :38.99   Max.   :-76.91  
                                                                       
  room_type             price        minimum_nights   number_of_reviews
 Length:6257        Min.   :  10.0   Min.   :  1.00   Min.   :   0.00  
 Class :character   1st Qu.:  88.0   1st Qu.:  1.00   1st Qu.:   1.00  
 Mode  :character   Median : 131.0   Median :  2.00   Median :  19.00  
                    Mean   : 168.7   Mean   : 13.23   Mean   :  66.38  
                    3rd Qu.: 193.0   3rd Qu.: 31.00   3rd Qu.:  86.00  
                    Max.   :7000.0   Max.   :701.00   Max.   :1205.00  
                    NA's   :1488                                       
 reviews_per_month calculated_host_listings_count availability_365
 Min.   : 0.010    Min.   :  1.00                 Min.   :  0.0   
 1st Qu.: 0.470    1st Qu.:  1.00                 1st Qu.: 43.0   
 Median : 1.460    Median :  3.00                 Median :175.0   
 Mean   : 1.974    Mean   : 33.15                 Mean   :175.8   
 3rd Qu.: 2.940    3rd Qu.: 14.00                 3rd Qu.:303.0   
 Max.   :28.200    Max.   :289.00                 Max.   :365.0   
 NA's   :1236                                                     
 number_of_reviews_ltm
 Min.   :  0.0        
 1st Qu.:  0.0        
 Median :  5.0        
 Mean   : 15.8        
 3rd Qu.: 25.0        
 Max.   :290.0        
                      

Filter/clean the data to handle missing values.

# Keep the modelling columns (the first five feed ggpairs and the regression)
airbnb_clean <- airbnb_dc2025 |>
  select(price, minimum_nights, number_of_reviews, reviews_per_month,
         availability_365, number_of_reviews_ltm, latitude, longitude,
         name, neighbourhood) |>
  # Drop rows with a missing price or review rate
  filter(!is.na(price), !is.na(reviews_per_month)) |>
  mutate(price = as.numeric(price))  # convert if needed

Use ggpairs() for regression insight

ggpairs(
  airbnb_clean,
  columns = 1:5,
  lower = list(continuous = wrap("smooth", method = "lm", se = FALSE)),
  diag  = list(continuous = wrap("densityDiag")),
  upper = list(continuous = "cor"),
  progress = FALSE
)

Key Insights from the Correlation Matrix

Price relationships

Price has very weak negative correlations with minimum nights (–0.078), total reviews (–0.061), and reviews per month (–0.039). All three are statistically significant but practically tiny. There is no appreciable link between price and availability (0.019, not significant).

Minimum nights

Weak negative correlation with number of reviews (–0.159) and moderate negative correlation with reviews per month (–0.306): listings that require longer stays get fewer reviews and are reviewed less frequently.

Positive correlation with availability (0.108): properties with longer minimum stays tend to be available more days of the year.

Reviews and availability

Strong positive correlation between total reviews and reviews per month (0.596): more‐reviewed listings tend also to get reviewed more often.

Both total reviews (–0.096) and reviews per month (–0.128) have small but significant negative correlations with availability, which is consistent with busier listings being booked more and therefore less available.

Distributions and scatterplots

Price and reviews distributions are right‐skewed, with many low values and a long tail.

The linear fit lines in the lower panels confirm the very shallow slopes for most relationships, indicating weak linear trends.
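The contrast between "statistically significant" and "practically tiny" can be illustrated with a short simulation. The sketch below uses simulated data, not the Airbnb listings, with a sample size and effect size chosen as assumptions to be roughly comparable to the matrix above.

```r
set.seed(123)

# Simulated data only: a weak negative relationship in a large sample
n <- 4000
x <- rnorm(n)
y <- -0.1 * x + rnorm(n)

# Pearson correlation with its significance test
ct <- cor.test(x, y)

r        <- unname(ct$estimate)  # correlation coefficient
p_value  <- ct$p.value           # p-value of the test
r_square <- r^2                  # share of variance in y explained by x

c(r = r, p_value = p_value, r_square = r_square)
```

With thousands of observations even a correlation near –0.1 yields a very small p-value, while its square shows that it accounts for only about 1% of the variance.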

Perform multiple linear regression to analyze how various factors affect Airbnb pricing in Washington, DC.

# Fit the multiple linear regression model
fit1 <- lm(price ~ minimum_nights + number_of_reviews + reviews_per_month + availability_365, data = airbnb_clean)
summary(fit1)

Call:
lm(formula = price ~ minimum_nights + number_of_reviews + reviews_per_month + 
    availability_365, data = airbnb_clean)

Residuals:
   Min     1Q Median     3Q    Max 
-186.7  -76.6  -33.6   24.5 6819.3 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       188.94291    8.12816  23.245  < 2e-16 ***
minimum_nights     -1.23180    0.20480  -6.015 1.96e-09 ***
number_of_reviews  -0.10134    0.03653  -2.774  0.00557 ** 
reviews_per_month  -3.99356    2.34635  -1.702  0.08883 .  
availability_365    0.03738    0.02925   1.278  0.20140    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 226.7 on 3982 degrees of freedom
Multiple R-squared:  0.0129,    Adjusted R-squared:  0.01191 
F-statistic: 13.01 on 4 and 3982 DF,  p-value: 1.584e-10

Extract Adjusted R² directly

adj_r2 <- summary(fit1)$adj.r.squared
print(adj_r2)
[1] 0.0119078

Regression equation

Price = 188.94 − 1.23(minimum_nights) − 0.10(number_of_reviews) − 3.99(reviews_per_month) + 0.037(availability_365)

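Each coefficient describes an association with price, holding the other predictors fixed:

• Intercept (188.94): Predicted price when all predictors are zero.
• Minimum nights (−1.23): Each extra required night is associated with a price $1.23 lower.
• Number of reviews (−0.10): Each additional review is associated with a price $0.10 lower.
• Reviews per month (−3.99): A one-unit increase in the monthly review rate is associated with a price $3.99 lower.
• Availability_365 (+0.037): Each extra available day per year is associated with a price about $0.04 higher, although this coefficient is not statistically significant.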

Model fit

Adjusted R² = 0.0119: these four factors explain only about 1.2% of the variation in price, so most of what drives price (such as location, room type, and amenities) isn’t captured here.
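To make the fitted equation concrete, the sketch below plugs the coefficients reported by summary(fit1) into a single listing. The listing values are assumptions chosen for illustration, not a row from the data.

```r
# Coefficients copied from the summary(fit1) output above
coefs <- c(
  intercept         = 188.94291,
  minimum_nights    = -1.23180,
  number_of_reviews = -0.10134,
  reviews_per_month = -3.99356,
  availability_365  = 0.03738
)

# Hypothetical listing (assumed values, not a row from the data)
listing <- c(
  minimum_nights    = 2,
  number_of_reviews = 19,
  reviews_per_month = 1.46,
  availability_365  = 175
)

# Predicted price = intercept + sum(coefficient * predictor value)
predicted <- coefs[["intercept"]] + sum(coefs[names(listing)] * listing)
round(predicted, 2)
```

This gives a predicted price of about $185.26. With fit1 in memory, predict(fit1, newdata = ...) on a one-row data frame should agree up to the rounding of the printed coefficients.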

Diagnostics

Residuals vs. Fitted

# Add fitted values and residuals directly into airbnb_clean
airbnb_clean <- airbnb_clean |>
  mutate(
    Fitted    = fitted(fit1),
    Residuals = resid(fit1)
  )

# 1. Residuals vs. Fitted plot
ggplot(airbnb_clean, aes(x = Fitted, y = Residuals)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Residuals vs. Fitted",
    x     = "Fitted values",
    y     = "Residuals"
  ) +
  theme_minimal()

Residuals vs. Fitted

Purpose: Checks whether errors have constant variance (homoscedasticity) and whether the linear form is appropriate.

What to look for:
• A funnel shape (residuals spreading out as fitted values grow) ⇒ heteroscedasticity (variance changing with price).
• Any systematic curve or pattern ⇒ the model’s linear form may be mis-specified (consider adding non-linear terms).

2. Normal Q–Q plot of residuals

ggplot(airbnb_clean, aes(sample = Residuals)) +
  stat_qq(alpha = 0.6) +
  stat_qq_line(color = "red") +
  labs(
    title = "Normal Q–Q Plot",
    x     = "Theoretical Quantiles",
    y     = "Sample Quantiles"
  ) +
  theme_minimal()

Normal Q–Q Plot

Purpose: Checks whether residuals are approximately normally distributed (an assumption for valid hypothesis tests).

What to look for:
• Points should fall roughly along the red reference line drawn by stat_qq_line().
• S-shaped deviations at the ends ⇒ heavy tails (outliers) or skewness (consider transformations or robust methods).

Significant predictors:

• minimum_nights (p < 0.001) and number_of_reviews (p ≈ 0.0056) are significant.
• reviews_per_month is marginal (p ≈ 0.089).
• availability_365 is not significant (p ≈ 0.20).

Residuals vs Fitted:

A pronounced funnel for high‐priced listings → heteroscedasticity. Extreme positive residuals (under‐predicted luxury prices).

Normal Q–Q:

Bulk of residuals roughly normal, but heavy positive tail of extreme outliers.
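One possible follow-up to a heavy right tail is to model price on the log scale. The sketch below only illustrates the idea on simulated, right-skewed data; it is not a refit of fit1, and the variable names, sample size, and parameters are assumptions.

```r
set.seed(42)

# Simulated, right-skewed "prices" (not the Airbnb data)
n <- 500
min_nights <- runif(n, 1, 30)
price_sim  <- exp(5 - 0.01 * min_nights + rnorm(n, sd = 0.6))

fit_raw <- lm(price_sim ~ min_nights)       # price on its original scale
fit_log <- lm(log(price_sim) ~ min_nights)  # price on the log scale

# Simple sample skewness: close to 0 for a symmetric distribution
skew <- function(r) mean((r - mean(r))^3) / sd(r)^3

skew_raw <- skew(resid(fit_raw))
skew_log <- skew(resid(fit_log))
c(raw = skew_raw, log = skew_log)
```

In this simulation the log-scale residuals are far more symmetric than the raw-scale ones. Whether the same holds for the DC listings would have to be checked by refitting the model and redrawing both diagnostic plots.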

Exploring different variables in my dataset with at least three distinct types of visualizations.

Average Price by Top 10 Neighborhoods

Here we compute mean price per neighborhood, select the top 10 by listing count, and plot their average prices to see where hosts charge the most.

# 1) Summarize & shorten names
neigh_price <- airbnb_dc2025 |>
  group_by(neighbourhood) |>
  summarise(
    avg_price  = mean(price, na.rm = TRUE),  # average price
    n_listings = n()                         # total listings
  ) |>
  arrange(desc(n_listings)) |>
  slice(1:10) |>
  mutate(
    # keep only the text before the first comma and wrap at 15 characters
    neighbourhood_short = str_wrap(str_remove(neighbourhood, ",.*"), width = 15)
  ) 

2) Plot the average Airbnb price for the top 10 neighborhoods

ggplot(neigh_price,
       aes(
         x    = reorder(neighbourhood_short, avg_price),
         y    = avg_price,
         fill = avg_price
       )) +
  geom_col() +
  coord_flip() +
  scale_fill_gradientn(
    colours = brewer.pal(5, "YlOrRd"),
    name    = "Avg Price\n(USD)"
  ) +
  labs(
    title   = "Average Airbnb Price\nfor Top 10 DC Neighborhoods",
    x       = NULL,
    y       = "Average Price (USD)",
    caption = "Data source: Airbnb DC listings"
  ) +
  theme_light(base_size = 14) +
  theme(
    plot.title      = element_text(face = "bold", hjust = 0.5),
    axis.text.y     = element_text(size = 11),
    legend.position = "right"
  )

Average Price by Bottom 10 Neighborhoods

# 1) Compute average price per neighborhood and pick the bottom 10 by avg_price
bottom_price <- airbnb_dc2025 |>
  group_by(neighbourhood) |>
  summarise(
    avg_price  = mean(price, na.rm = TRUE),
    n_listings = n()
  ) |>
  arrange(avg_price) |>
  slice(1:10) |>
  # keep only the text before the first comma and wrap at 15 characters
  mutate(
    neighbourhood_short = str_wrap(str_remove(neighbourhood, ",.*"), width = 15)
  )

Plot 2: Average Airbnb price for the bottom ten neighborhoods, made interactive with plotly.

# Create the base ggplot chart
ggplot_bar <- ggplot(bottom_price, aes(
    x    = reorder(neighbourhood_short, avg_price),
    y    = avg_price,
    fill = avg_price
  )) +
  geom_col() +
  coord_flip() +
  scale_fill_gradientn(
    colours = brewer.pal(5, "BuGn"),
    name    = "Avg Price\n(USD)"
  ) +
  labs(
    title   = "Average Airbnb Price\nfor Bottom 10 DC Neighborhoods",
    x       = "Neighborhood",
    y       = "Average Price (USD)",
    caption = "Data source: Airbnb DC listings"
  ) +
  theme_classic(base_size = 14) +
  theme(
    plot.title      = element_text(face = "bold", hjust = 0.5),
    axis.text.y     = element_text(size = 10),
    legend.position = "right"
  )

# Convert ggplot to interactive plotly chart
interactive_plot <- ggplotly(ggplot_bar, tooltip = c("x", "y", "fill"))

# Display interactive plot
interactive_plot