DATA 110_Corrected Alluvial of DC AirBnB Listings by Price and Neighborhood

I was curious to explore the difference in prices for AirBnB listings across DC neighborhoods, so I decided to create an alluvial plot of DC AirBnB listings by price and neighborhood.

Loading Packages and Data Set

library(tidyverse)
library(ggalluvial)
library(RColorBrewer)

#airbnb <- read_csv(file.choose())  # I had a lot of trouble getting R to read the CSV when I put the file name in the parentheses, so I chose to manually select the file from my folder instead

airbnb <- read_csv("airbnb_DC_fixed.csv") 
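
When read_csv() cannot find a file by name, a common cause is that R's working directory is not the folder holding the file. Below is a minimal sketch of how that could be checked before reading. The temporary file and the `demo_path` name are hypothetical stand-ins so the example runs anywhere, and base R's read.csv() is used in place of read_csv() to keep the sketch dependency-free.

```r
# Hypothetical stand-in file so the sketch runs anywhere; not the real data set.
demo_path <- file.path(tempdir(), "airbnb_demo.csv")
write.csv(data.frame(id = c(1, 2), price = c(60, NA)), demo_path, row.names = FALSE)

# A bare file name is resolved relative to the working directory,
# so check where R is looking and whether the file is really there.
getwd()
found <- file.exists(demo_path)

# Read only when the file exists; otherwise stop with a message that shows the path.
if (!found) stop("File not found: ", demo_path)
demo <- read.csv(demo_path)
demo
```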

glimpse(airbnb)

Rows: 6,257
Columns: 18
$ id                             <dbl> 3686, 3943, 4197, 4529, 5589, 7103, 117…
$ name                           <chr> "Vita's Hideaway", "Historic Rowhouse N…
$ host_id                        <dbl> 4645, 5059, 5061, 5803, 6527, 17633, 32…
$ host_name                      <chr> "Vita", "Vasa", "Sandra", "Bertina", "A…
$ neighbourhood_group            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ neighbourhood                  <chr> "Historic Anacostia", "Edgewood, Bloomi…
$ latitude                       <dbl> 38.86339, 38.91195, 38.88719, 38.90585,…
$ longitude                      <dbl> -76.98889, -77.00456, -76.99472, -76.94…
$ room_type                      <chr> "Private room", "Private room", "Privat…
$ price                          <dbl> 60, 63, 128, 64, NA, 74, 85, 52, 125, 5…
$ minimum_nights                 <dbl> 31, 1, 4, 30, 50, 31, 31, 31, 30, 31, 3…
$ number_of_reviews              <dbl> 84, 534, 64, 102, 96, 91, 415, 120, 38,…
$ last_review                    <chr> "8/30/2023", "2/19/2025", "1/30/2025", …
$ reviews_per_month              <dbl> 0.48, 2.77, 0.33, 0.54, 0.51, 0.50, 2.2…
$ calculated_host_listings_count <dbl> 1, 5, 2, 2, 1, 27, 4, 4, 1, 4, 4, 2, 11…
$ availability_365               <dbl> 1, 349, 352, 179, 158, 310, 194, 218, 3…
$ number_of_reviews_ltm          <dbl> 0, 38, 6, 0, 0, 0, 3, 3, 0, 2, 2, 0, 1,…
$ license                        <chr> NA, "Hosted License: 5007242201001033",…

Exploring the data set

# 6257 observations, 18 variables 

names(airbnb)

 [1] "id"                             "name"                          
 [3] "host_id"                        "host_name"                     
 [5] "neighbourhood_group"            "neighbourhood"                 
 [7] "latitude"                       "longitude"                     
 [9] "room_type"                      "price"                         
[11] "minimum_nights"                 "number_of_reviews"             
[13] "last_review"                    "reviews_per_month"             
[15] "calculated_host_listings_count" "availability_365"              
[17] "number_of_reviews_ltm"          "license"

# Checking the distribution of the price variable to decide where to put the cutoffs when creating a categorical variable later
summary(airbnb$price)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   10.0    88.0   131.0   168.7   193.0  7000.0    1488

# Price is right-skewed: the mean (168.7) is greater than the median (131), and the max is $7,000
# Price category variable based on the summary stats:
# Low is less than 100
# Medium is 100 to under 170
# High is 170 to under 250
# High-end is 250 and above

# Checking the number of neighborhoods in the data set so I can select an appropriate color palette

length(unique(airbnb$neighbourhood))

[1] 39

# 39 neighborhoods: this is too many to show in an alluvial. To make it more readable, I will focus on the 10 neighborhoods with the most listings

# Checking types of AirBnB listings
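
The room-type check mentioned above could look something like the sketch below. The small `listings` data frame is made up for illustration: only "Private room" appears in the glimpse() output, and "Entire home/apt" is an assumed second value. On the real data, the same count() call would be applied to the airbnb data frame.

```r
library(dplyr)

# Made-up rows for illustration only; not taken from the real data set.
listings <- tibble(
  room_type = c("Private room", "Entire home/apt", "Private room",
                "Entire home/apt", "Entire home/apt")
)

# Count listings per room type, most common first.
room_counts <- listings |>
  count(room_type, sort = TRUE)
room_counts
```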

Cleaning the Data Set

airbnb2 <- airbnb |>
  filter(!is.na(price)) |>   # remove missing prices first
  mutate(price_category = case_when(   # new variable: price category, with cutoffs based on the summary stats of price
    price < 100 ~ "Low",
    price >= 100 & price < 170 ~ "Medium",
    price >= 170 & price < 250 ~ "High",
    price >= 250 ~ "High-end"
  ))

# Checking that the new price category variable is correctly labeled

unique(airbnb2$price_category)

[1] "Low"      "Medium"   "High-end" "High"

table(airbnb2$price_category) #checking to see the number of listings in each price category


    High High-end      Low   Medium 
     900      679     1559     1631
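
The same four bins could also be built with base R's cut(), which returns a factor whose levels are already ordered from Low to High-end. This is only a sketch of an alternative, not the approach used above. The `prices` vector is made up so that its values sit on and around the cutoffs.

```r
# Made-up prices chosen to sit on and around the cutoffs.
prices <- c(60, 99, 100, 169, 170, 249, 250, 7000)

# right = FALSE makes each bin include its lower bound and exclude its upper bound,
# matching the case_when() rules: under 100, [100, 170), [170, 250), 250 and above.
price_category <- cut(
  prices,
  breaks = c(-Inf, 100, 170, 250, Inf),
  labels = c("Low", "Medium", "High", "High-end"),
  right = FALSE
)

data.frame(prices, price_category)
```

Because the labels are supplied in order, the resulting factor needs no separate releveling step before plotting.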

# Counting listings per neighborhood, sorted from most to fewest

neigh_counts <- airbnb2 |>
  count(neighbourhood, sort = TRUE)

neigh_counts

# A tibble: 39 × 2
   neighbourhood                                                               n
   <chr>                                                                   <int>
 1 Capitol Hill, Lincoln Park                                                490
 2 Union Station, Stanton Park, Kingman Park                                 490
 3 Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View                359
 4 Dupont Circle, Connecticut Avenue/K Street                                320
 5 Brightwood Park, Crestwood, Petworth                                      282
 6 Edgewood, Bloomingdale, Truxton Circle, Eckington                         280
 7 Shaw, Logan Circle                                                        253
 8 Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol…   241
 9 Ivy City, Arboretum, Trinidad, Carver Langston                            184
10 Kalorama Heights, Adams Morgan, Lanier Heights                            170
# ℹ 29 more rows

# Keeping only the 10 neighborhoods with the most listings

top10neighbourhoods <- neigh_counts[1:10, ]
top10neighbourhoods

# A tibble: 10 × 2
   neighbourhood                                                               n
   <chr>                                                                   <int>
 1 Capitol Hill, Lincoln Park                                                490
 2 Union Station, Stanton Park, Kingman Park                                 490
 3 Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View                359
 4 Dupont Circle, Connecticut Avenue/K Street                                320
 5 Brightwood Park, Crestwood, Petworth                                      282
 6 Edgewood, Bloomingdale, Truxton Circle, Eckington                         280
 7 Shaw, Logan Circle                                                        253
 8 Downtown, Chinatown, Penn Quarters, Mount Vernon Square, North Capitol…   241
 9 Ivy City, Arboretum, Trinidad, Carver Langston                            184
10 Kalorama Heights, Adams Morgan, Lanier Heights                            170

# Filtering the data set to keep only listings in the top 10 neighborhoods
airbnb_top10  <- airbnb2 |>
  filter(neighbourhood %in% top10neighbourhoods$neighbourhood)
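
Row indexing with [1:10, ] relies on the counts table already being sorted. Below is a sketch of an alternative using dplyr's slice_max(), which picks the largest counts regardless of row order. The `toy_counts` table is made up, and with_ties = FALSE is an assumption chosen so that exactly the requested number of rows comes back even when two neighborhoods tie at the cutoff.

```r
library(dplyr)

# Made-up counts, deliberately unsorted, with a tie at 5.
toy_counts <- tibble(
  neighbourhood = c("A", "B", "C", "D", "E"),
  n = c(5, 12, 5, 9, 2)
)

# Keep the 3 neighborhoods with the most listings.
top3 <- toy_counts |>
  slice_max(order_by = n, n = 3, with_ties = FALSE)
top3
```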

Summarizing counts for the Alluvial

airbnb_alluvial <- airbnb_top10 |>
  group_by(neighbourhood, price_category) |>
  summarise(count = n())

`summarise()` has grouped output by 'neighbourhood'. You can override using the
`.groups` argument.

airbnb_alluvial

# A tibble: 40 × 3
# Groups:   neighbourhood [10]
   neighbourhood                                            price_category count
   <chr>                                                    <chr>          <int>
 1 Brightwood Park, Crestwood, Petworth                     High              28
 2 Brightwood Park, Crestwood, Petworth                     High-end          20
 3 Brightwood Park, Crestwood, Petworth                     Low              160
 4 Brightwood Park, Crestwood, Petworth                     Medium            74
 5 Capitol Hill, Lincoln Park                               High             123
 6 Capitol Hill, Lincoln Park                               High-end          83
 7 Capitol Hill, Lincoln Park                               Low               77
 8 Capitol Hill, Lincoln Park                               Medium           207
 9 Columbia Heights, Mt. Pleasant, Pleasant Plains, Park V… High              46
10 Columbia Heights, Mt. Pleasant, Pleasant Plains, Park V… High-end          32
# ℹ 30 more rows
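
The summarise() message above is informational: only the last grouping variable is dropped, so the result stays grouped by neighbourhood. Below is a sketch of two ways to get an ungrouped table instead, shown on a made-up `toy` data frame: passing .groups = "drop", or using count() with name = "count".

```r
library(dplyr)

# Made-up listings for illustration only.
toy <- tibble(
  neighbourhood  = c("A", "A", "A", "B", "B"),
  price_category = c("Low", "Low", "High", "Low", "Medium")
)

# Option 1: same summarise() call, but explicitly drop the grouping afterwards.
by_summarise <- toy |>
  group_by(neighbourhood, price_category) |>
  summarise(count = n(), .groups = "drop")

# Option 2: count() groups, tallies, and ungroups in one step.
by_count <- toy |>
  count(neighbourhood, price_category, name = "count")

by_count
```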

Creating Alluvial

airbnb_alluvial$price_category <- factor(
  airbnb_alluvial$price_category,
  levels = c("Low", "Medium", "High", "High-end")
)

ggalluv <- airbnb_alluvial |>
  ggplot(aes(x = price_category,
             y = count,
             alluvium = neighbourhood)) +
  
  geom_alluvium(aes(fill = neighbourhood),
                color = "white",
                width = .1,
                alpha = .8,
                decreasing = FALSE) +
  
  scale_fill_brewer(palette = "Spectral") + # I chose Spectral because it supports up to 11 colors, which is enough for my 10 neighborhoods
  labs(title = "Distribution of DC Airbnb Listings by Price Category & Neighborhood",
       y = "Number of Listings",
       fill = "Neighborhood",
       caption = "Source: DC Airbnb Dataset") + theme_minimal()
ggalluv
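
ColorBrewer palettes have a fixed maximum number of colors, so it can be worth confirming that a palette is large enough before mapping it to a fill. Below is a short sketch using RColorBrewer's brewer.pal.info table. The `n_neighbourhoods` value of 10 comes from the top-10 filter above, and the variable names are hypothetical.

```r
library(RColorBrewer)

n_neighbourhoods <- 10

# brewer.pal.info lists every palette with its maximum number of colors.
max_colors <- brewer.pal.info["Spectral", "maxcolors"]

# TRUE when the palette can supply one distinct color per neighborhood.
max_colors >= n_neighbourhoods

# The hex codes the palette provides for 10 categories.
brewer.pal(n_neighbourhoods, "Spectral")
```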

Brief Explanation of Plot

The alluvial above shows the distribution of DC AirBnB listings by price category for the 10 neighborhoods with the most listings. I really wanted to create an alluvial for this assignment. Having attended GW for graduate school, I am well aware that housing is quite expensive in DC, and I was curious to understand the price differences in AirBnB listings by neighborhood. Each ribbon represents one neighborhood, identified by its color, and the ribbon's width at each price category on the x-axis represents the number of listings that neighborhood has in that category. I limited the plot to the 10 neighborhoods with the most listings because the data set contains 39 neighborhoods, which is too many to plot readably. To create the price categories, I ran summary statistics for the price variable and chose cutoffs by looking at the first quartile, median, mean, and third quartile. I also reordered the price categories so they appear from Low to High-end instead of alphabetically.

One pattern that stands out in the alluvial is that Medium appears to be the most common price category in several neighborhoods, such as Union Station, Capitol Hill, and Dupont Circle, while Low-priced listings appear most in Brightwood Park, Crestwood, and Petworth. Interestingly, Kalorama Heights, which is generally known to be high-end, has very few High-end listings. This is likely because homeowners in that area are not leasing out their giant mansions on AirBnB.
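
Because the neighborhoods differ in size, raw counts can hide how the price mix varies within each one. Below is a sketch of converting counts to within-neighborhood shares, using the eight counts for Brightwood Park and Capitol Hill printed in the airbnb_alluvial output above. The `counts`, `shares`, and `share` names are hypothetical.

```r
library(dplyr)

# Counts copied from the first 8 rows of the airbnb_alluvial output above.
counts <- tibble(
  neighbourhood = rep(c("Brightwood Park, Crestwood, Petworth",
                        "Capitol Hill, Lincoln Park"), each = 4),
  price_category = rep(c("High", "High-end", "Low", "Medium"), times = 2),
  count = c(28, 20, 160, 74, 123, 83, 77, 207)
)

# Share of each price category within its neighborhood, rounded to 3 decimals.
shares <- counts |>
  group_by(neighbourhood) |>
  mutate(share = round(count / sum(count), 3)) |>
  ungroup()
shares
```

On these counts, Low makes up about 57% of the Brightwood Park listings, while Medium makes up about 42% of the Capitol Hill listings.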