Lab 01: Airbnbs in NYC

Author

Amanda Rose Knudsen

Airbnb in NYC

Airbnb has had a disruptive effect on the hotel, rental home, and vacation industry throughout the world. The success of Airbnb has not come without controversies, with critics arguing that Airbnb has adverse impacts on housing and rental prices and also on the daily lives of people living in neighborhoods where Airbnb is popular. This controversy has been particularly intense in NYC, where the debate between Airbnb proponents and detractors eventually led to the city imposing strong restrictions on the use of Airbnb. If you find this issue interesting and want to go deeper, there is the potential for an interesting project that brings in hotels (which have interesting regulations in NYC), hotel price data, and rental data and looks at these things together.

Because Airbnb listings are available online through their website and app, it is possible for us to acquire listing data and visualize the impacts of Airbnb on different cities, including New York City. This is possible through the work of an organization called Inside Airbnb.

Github Instructions

Before we introduce the data and the main assignment, let’s begin with a few key steps to configure the file and create a GitHub repository for your first assignment. This is optional, but I think it is a good idea to start getting familiar with GitHub tools. These steps come from the happygitwithr tutorial.

  • Start a new GitHub repository in your account and clone it to your computer (using RStudio to start a new project from a repository, or any other way).
  • Update the YAML, changing the author name to your name, and Render the document.
  • Commit your changes with a meaningful commit message.
  • Push your changes to GitHub.
  • Go to your repo on GitHub and confirm that your changes are visible in your Qmd and md files. If anything is missing, commit and push again.

Packages

We’ll use the tidyverse package for much of the data wrangling and visualisation, and the ggridges package to make a ridge plot in the last exercise. You may need to install ggridges if you haven’t already; you can do that using:

install.packages("ggridges")

Then make sure to load both packages:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggridges)

Data

The data for this assignment can be found on my GitHub page by clicking here and downloading nycbnb.csv.

You can read the data into R using the command:

nycbnb = read_csv("/Users/amandaknudsen/AmandaLocal/AmandaR/DATA-607/Lab01/nycbnb.csv")

where you should replace the path in quotes with the local path to your file (the original instructions used "/home/georgehagstrom/work/Teaching/DATA607/website/data/nycbnb.csv").
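
One way to avoid hard-coding an absolute path is to build the path relative to the working directory. The following is a minimal sketch, assuming (hypothetically) that nycbnb.csv was saved in a data/ subfolder of the RStudio project; the folder name is an illustration and not part of the assignment.

```r
library(readr)

# Hypothetical location: a data/ folder inside the project directory
data_path <- file.path("data", "nycbnb.csv")

if (file.exists(data_path)) {
  # Same read_csv() call as above, but with a path that works on any machine
  nycbnb <- read_csv(data_path)
} else {
  message("File not found at ", data_path, "; adjust the path to your local copy.")
}
```

A relative path like this is resolved against the current working directory, so it keeps working when the project folder is cloned to another computer.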

Important note: It is generally not wise to include datasets in github repositories, especially if they are large and can change frequently.

Note from Amanda: I would like to review, in general, how RStudio project files relate to the other files in the folder, to the working directory, and to the GitHub repository. Some early questions I’ve encountered: What is the recommended way to nest files, and when should I create a new .Rproj file rather than a new folder within the existing .Rproj working directory?

You can view the dataset as a spreadsheet using the View() function. Note that you should not put this function in your R Markdown document, but instead type it directly in the Console, as it pops open a new window (and the concept of popping open a window in a static document doesn’t really make sense…). When you run this in the console, you’ll see the following data viewer window pop up.

Exercises

Problem 1. How many observations (rows) does the dataset have? Instead of hard coding the number in your answer, use inline code.

comma <- function(x) format(x, digits = 2, big.mark = ",")

There are 37,765 observations (rows) in the nycbnb dataset.
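
As a small sketch of what the inline code evaluates to, the snippet below applies the comma() helper to a stand-in value. The number 37765 is used here in place of nrow(nycbnb) so that the example runs without the data file; in the .qmd source, the sentence would contain the inline expression `r comma(nrow(nycbnb))` instead of a typed number.

```r
# Helper from above: format a number with a thousands separator
comma <- function(x) format(x, digits = 2, big.mark = ",")

n_rows <- 37765L  # stand-in for nrow(nycbnb)

sentence <- paste0(
  "There are ", comma(n_rows),
  " observations (rows) in the nycbnb dataset."
)
print(sentence)
```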

Problem 2. Run View(nycbnb) in your Console to view the data in the data viewer. What does each row in the dataset represent?

Each row in the dataset represents an Airbnb listing: each observation has an ID as well as unique features associated with that ID, such as the listing URL.

Each column represents a variable. We can get a list of the variables in the data frame using the names() function.

names(nycbnb)
 [1] "id"                   "price"                "neighborhood"        
 [4] "borough"              "accommodates"         "bathrooms"           
 [7] "bedrooms"             "beds"                 "review_scores_rating"
[10] "number_of_reviews"    "listing_url"         

You can find descriptions of each of the variables in the help file for the dataset, which you can find online at the Inside Airbnb data dictionary.

Problem 3. Pick one of the five boroughs of NYC (Manhattan, Queens, Brooklyn, the Bronx, or Staten Island), and create a faceted histogram where each facet represents a neighborhood in your chosen borough and displays the distribution of Airbnb prices in that neighborhood. Think critically about whether it makes more sense to stack the facets on top of each other in a column, lay them out in a row, or wrap them around. Along with your visualization, include your reasoning for the layout you chose for your facets.

# #| fig.width: 7
# #| fig.height: 10

# str(nycbnb)

# summary(nycbnb)

airbnbQueens <- nycbnb |> 
  filter(borough == "Queens") |> 
  group_by(neighborhood) |> 
  select("id", "price", "neighborhood")

airbnbQueens_clean <- drop_na(airbnbQueens, price)

ggplot(airbnbQueens_clean, aes(x = price)) +
  geom_histogram(color = "darkgreen", fill = "darkgreen") +
  facet_wrap(~ neighborhood, ncol = 6, as.table = TRUE) +
  labs(title = "Faceted Histogram of Airbnb Prices per Neighborhood in Queens",
       x = "Price",
       y = "Count of Airbnb Listings") +
  theme(strip.text = element_text(size = 6, margin = margin(t = 1, r = 0, b = 1, l = 0)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I chose facet_wrap() because Queens has many neighborhoods, too many to read in a single row or a single column. I set the number of columns to six so that the grid fits on the page, and I reduced the size of the strip text and the margins around it so that each neighborhood name can still be read clearly.
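
The stat_bin() message printed with the plot can be avoided by choosing a bin width explicitly. The following is a minimal sketch on invented prices: the neighborhood labels are placeholders, the numbers are made up for illustration, and the bin width of 25 is an assumption rather than a tuned value.

```r
library(ggplot2)

# Invented prices for illustration only; not taken from nycbnb
toy <- data.frame(
  neighborhood = rep(c("Neighborhood A", "Neighborhood B", "Neighborhood C"), each = 4),
  price = c(80, 95, 120, 150, 60, 70, 90, 110, 130, 160, 200, 240)
)

p <- ggplot(toy, aes(x = price)) +
  # An explicit binwidth replaces the default of 30 bins
  geom_histogram(binwidth = 25, boundary = 0, fill = "darkgreen") +
  # Free y scales keep sparsely populated facets readable
  facet_wrap(~ neighborhood, ncol = 3, scales = "free_y") +
  labs(x = "Price", y = "Count of Airbnb Listings")

p
```

Letting the y axis vary by facet is a trade-off: small neighborhoods become legible, but bar heights can no longer be compared directly across facets.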

Problem 4. Use a single pipeline to identify the neighborhoods city-wide with the top five median listing prices that have a minimum of 50 listings. Then, in another pipeline filter the data for these five neighborhoods and make ridge plots of the distributions of listing prices in these five neighborhoods. In a third pipeline calculate the minimum, mean, median, standard deviation, IQR, and maximum listing price in each of these neighborhoods. Use the visualization and the summary statistics to describe the distribution of listing prices in the neighborhoods. (Your answer will include three pipelines, one of which ends in a visualization, and a narrative.)
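
The first step of this problem (ranking by median price among neighborhoods that meet a listing-count threshold) can be sketched on toy data. Everything in this example is invented for illustration: the neighborhood labels, the prices, and the threshold of 3, which stands in for the assignment's threshold of 50.

```r
library(dplyr)

# Invented listings for illustration only
toy <- tibble(
  neighborhood = c(rep("A", 3), rep("B", 2), rep("C", 4)),
  price = c(100, 200, 300, 900, 950, 50, 60, 70, NA)
)

min_listings <- 3  # stand-in for the assignment's threshold of 50

top_toy <- toy |>
  summarise(
    medianprice = median(price, na.rm = TRUE),
    listings = n(),  # counts every row, including rows with a missing price
    .by = neighborhood
  ) |>
  # "a minimum of" is inclusive, so a count equal to the threshold is kept
  filter(listings >= min_listings) |>
  slice_max(medianprice, n = 1)

top_toy
```

Neighborhood B has the highest median in the toy data but only two listings, so it is dropped before ranking and neighborhood A is returned.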

nycbnb_p1 = nycbnb |> 
  group_by(neighborhood) |> 
  summarise(
    medianprice = median(price, na.rm=TRUE),
    countperneighborhood = n()
  ) |> 
  arrange(desc(medianprice)) |> 
  filter(countperneighborhood >= 50) |>  # "a minimum of 50 listings" is inclusive
  slice_head(n = 5) |> 
  select (neighborhood, medianprice)

nycbnb_p1
# A tibble: 5 × 2
  neighborhood      medianprice
  <chr>                   <dbl>
1 Tribeca                   341
2 Battery Park City         334
3 Greenwich Village         330
4 Theater District          310
5 SoHo                      299

# Alternative to group_by(): per-operation grouping with the .by argument
nycbnbp1alt = nycbnb |> 
  summarise(
    medianprice = median(price, na.rm=TRUE),
    countperneighborhood = n(),
    .by = neighborhood
  ) |> 
  arrange(desc(medianprice)) |> 
  filter(countperneighborhood >= 50) |>  # "a minimum of 50 listings" is inclusive
  slice_head(n = 5) |> 
  select (neighborhood, medianprice)

nycbnbp1alt
# A tibble: 5 × 2
  neighborhood      medianprice
  <chr>                   <dbl>
1 Tribeca                   341
2 Battery Park City         334
3 Greenwich Village         330
4 Theater District          310
5 SoHo                      299

# Keep only the listings in the five neighborhoods identified above
toppriceneighborhoods = nycbnb |> 
  filter(neighborhood %in% nycbnb_p1$neighborhood)

ggplot(toppriceneighborhoods, aes(x = price, y = neighborhood)) +
  geom_density_ridges() +
  theme_ridges() + 
  labs(title = "NYC Neighborhoods with Top Median Airbnb Prices")
Picking joint bandwidth of 63.9
Warning: Removed 533 rows containing non-finite outside the scale range
(`stat_density_ridges()`).

nycbnb_p3 = toppriceneighborhoods |> 
  group_by(neighborhood) |> 
  summarise(
    minprice = min(price, na.rm = TRUE),
    meanprice = mean(price, na.rm = TRUE),
    medianprice = median(price, na.rm = TRUE),
    stdprice =  sd(price, na.rm = TRUE),
    iqrprice = IQR(price, na.rm = TRUE),
    maxprice = max(price, na.rm = TRUE)
  ) 

nycbnb_p3
# A tibble: 5 × 7
  neighborhood      minprice meanprice medianprice stdprice iqrprice maxprice
  <chr>                <dbl>     <dbl>       <dbl>    <dbl>    <dbl>    <dbl>
1 Battery Park City       80      375.         334     211.     351       886
2 Greenwich Village       73      385.         330     251.     285       999
3 SoHo                    89      356.         299     223.     205       995
4 Theater District        57      321.         310     165.     175.      946
5 Tribeca                150      378.         341     188.     217       999

The ridge plots show the distribution of Airbnb prices in the five NYC neighborhoods with the highest median prices. All five neighborhoods are in Manhattan, and each has a broad range of prices. The Theater District has the lowest minimum price (57), SoHo has the lowest median price (299), and Battery Park City has the lowest maximum price (886). Greenwich Village and Tribeca share the highest maximum price (999), and Tribeca has the highest median price (341). Both the ridge plots and the summary statistics point to right-skewed distributions: the tails extend toward higher prices, and in every neighborhood the mean exceeds the median. The Theater District has the lowest standard deviation (165) and the lowest IQR (175). It also has the lowest mean price (321), which is the closest of the five to its median (310). Together these indicate that its prices cluster more tightly around the median than those of the other four neighborhoods.
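
The direction of the skew can be checked from the summary table alone by comparing each mean with its median. The sketch below re-enters the rounded values printed in the table above (so the differences are approximate) and uses base R only.

```r
# Rounded values copied from the summary table above
price_stats <- data.frame(
  neighborhood = c("Battery Park City", "Greenwich Village", "SoHo",
                   "Theater District", "Tribeca"),
  meanprice = c(375, 385, 356, 321, 378),
  medianprice = c(334, 330, 299, 310, 341)
)

# A mean above the median is consistent with a right-skewed distribution
price_stats$mean_minus_median <- price_stats$meanprice - price_stats$medianprice

# Sort from the smallest gap to the largest
price_stats[order(price_stats$mean_minus_median), ]
```

Every gap is positive, and the Theater District has the smallest one, which fits the tighter clustering suggested by its standard deviation and IQR.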

Problem 5. Create a visualization that will help you compare the distribution of review scores (review_scores_rating) across neighborhoods. You get to decide what type of visualization to create and which neighborhoods are most interesting to you, and there is more than one correct answer! In your answer, include a brief interpretation of how Airbnb guests rate properties in general and how the neighborhoods compare to each other in terms of their ratings.

library(ggthemes)

airbnbBrooklyn <- nycbnb |> 
  filter(borough == "Brooklyn") |>
  filter(neighborhood %in% c("Bedford-Stuyvesant", "Clinton Hill", "Crown Heights", 
                             "Fort Greene", "Park Slope", "Prospect Heights")) |> 
  select("id", "price", "neighborhood", "review_scores_rating")

ggplot(airbnbBrooklyn, aes(x = review_scores_rating, y = price)) +
  geom_point(aes(color = neighborhood, shape = neighborhood)) +
  labs(
    title = "Prices and Review Scores for Airbnbs Near Me",
    x = "Airbnb Review Score", y = "Airbnb Price"
  ) +
  scale_color_colorblind()
Warning: Removed 2795 rows containing missing values or values outside the scale range
(`geom_point()`).

ggplot(airbnbBrooklyn, aes(x = neighborhood, y = review_scores_rating)) +
  geom_boxplot(fill = "lightgreen", color = "darkgreen") +
  labs(
    title = "Review Scores for Airbnbs Near Me",
    subtitle = "Bed-Stuy, Clinton Hill, Crown Heights, Fort Greene, Park Slope, and Prospect Heights"
  )
Warning: Removed 1318 rows containing non-finite outside the scale range
(`stat_boxplot()`).

For nearly 10 years, my husband and I have lived on a block that’s historically considered Crown Heights, but that many newcomers (and real estate agents) call Prospect Heights. Last October I read an interesting article on The Upshot in the NYT that visualized the ‘borders’ and ‘names’ of NYC neighborhoods. I was interested to see that so many people had varying names for neighborhoods across the city, not just in my part of town. With that article in mind, I wanted to look at both the relationship between price and review score and the distribution of review scores on their own, which is why I created a scatterplot as well as a boxplot for a handful of neighborhoods near where I live in Brooklyn. The boxplot shows more clearly which neighborhoods (Bed-Stuy, Crown Heights, and Park Slope) have listings with a review score of 1 or lower. I was also interested to see that no listing in Clinton Hill or Prospect Heights received a score lower than 4. Using both visualizations, we can see that any rating below 4 is an outlier in each of the neighborhoods near me. Almost all guests rate Airbnb properties highly, regardless of price or neighborhood name.
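
A numeric summary can back up what the boxplot shows. The following is a minimal sketch on invented ratings: the neighborhood labels and scores are made up for illustration, and the cutoff of 4 is taken from the discussion above.

```r
library(dplyr)

# Invented ratings for illustration only; not taken from nycbnb
toy_reviews <- tibble(
  neighborhood = c("A", "A", "A", "B", "B", "B"),
  review_scores_rating = c(4.9, 4.7, 3.5, 5.0, 4.8, NA)
)

rating_summary <- toy_reviews |>
  summarise(
    rated = sum(!is.na(review_scores_rating)),               # listings with a score
    median_rating = median(review_scores_rating, na.rm = TRUE),
    share_below_4 = mean(review_scores_rating < 4, na.rm = TRUE),
    .by = neighborhood
  )

rating_summary
```

Reporting the number of rated listings alongside the share below 4 matters because many listings have no review score at all, as the warnings printed with the plots indicate.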