Homework 7

Bring it All Home

As the final homework of the semester, this assignment will test your ability to take a data set from a “raw” stage, through data cleaning and variable creation, to description and finally inferential analysis and visualization. It has three distinct parts. The first two parts can be completed immediately, while the final part would best be attacked after the final R class. For this homework, we are going to be using a data set covering Airbnb listings in New York City in 2019. This data was originally provided, and hosted by Airbnb, but you can find it and a description at Kaggle here: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data/version/3. This may require signing up for a Kaggle account (or at least a Google account), but I have included the dataset in the module attached to this assignment. It is likely still worth going to the site to read about the data. There is an absolute avalanche of data from airbnb here: http://insideairbnb.com/get-the-data.html, about a host of different cities. Also highly recommended!

First step as always is loading the data.

air <- read_csv("AB_NYC_2019.csv")

## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date  (1): last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Cleaning and Explaining

It is always worthwhile to figure out what exactly your dataset contains. Below, briefly describe the dimensions of the data, and what information is included. Do NOT simply repeat the variable names, but rather tell me what those names mean.

dim(air)

## [1] 48895    16

names(air)

##  [1] "id"                             "name"                          
##  [3] "host_id"                        "host_name"                     
##  [5] "neighbourhood_group"            "neighbourhood"                 
##  [7] "latitude"                       "longitude"                     
##  [9] "room_type"                      "price"                         
## [11] "minimum_nights"                 "number_of_reviews"             
## [13] "last_review"                    "reviews_per_month"             
## [15] "calculated_host_listings_count" "availability_365"

glimpse(air)

## Rows: 48,895
## Columns: 16
## $ id                             <dbl> 2539, 2595, 3647, 3831, 5022, 5099, 512…
## $ name                           <chr> "Clean & quiet apt home by the park", "…
## $ host_id                        <dbl> 2787, 2845, 4632, 4869, 7192, 7322, 735…
## $ host_name                      <chr> "John", "Jennifer", "Elisabeth", "LisaR…
## $ neighbourhood_group            <chr> "Brooklyn", "Manhattan", "Manhattan", "…
## $ neighbourhood                  <chr> "Kensington", "Midtown", "Harlem", "Cli…
## $ latitude                       <dbl> 40.64749, 40.75362, 40.80902, 40.68514,…
## $ longitude                      <dbl> -73.97237, -73.98377, -73.94190, -73.95…
## $ room_type                      <chr> "Private room", "Entire home/apt", "Pri…
## $ price                          <dbl> 149, 225, 150, 89, 80, 200, 60, 79, 79,…
## $ minimum_nights                 <dbl> 1, 1, 3, 1, 10, 3, 45, 2, 2, 1, 5, 2, 4…
## $ number_of_reviews              <dbl> 9, 45, 0, 270, 9, 74, 49, 430, 118, 160…
## $ last_review                    <date> 2018-10-19, 2019-05-21, NA, 2019-07-05…
## $ reviews_per_month              <dbl> 0.21, 0.38, NA, 4.64, 0.10, 0.59, 0.40,…
## $ calculated_host_listings_count <dbl> 6, 2, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 3, …
## $ availability_365               <dbl> 365, 355, 365, 194, 0, 129, 0, 220, 0, …

summary(air$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    69.0   106.0   152.7   175.0 10000.0

#The dataset contains 48,895 Airbnb listings in New York City, each represented as a single row with 16 variables describing different characteristics of the listing, host, location, and activity. The information includes unique identifiers for both the listing and the host, as well as the listing’s title. It provides geographic details such as the borough and neighborhood where the unit is located, along with its latitude and longitude coordinates. The data also describes the type of accommodation offered (such as private room or entire home), the nightly price, the minimum number of nights required to book, and how many reviews the listing has received. It also includes indicators of recent host activity such as the date of the last review, the average number of reviews per month, how many total listings the host manages, and how many days per year the unit is available.

Cleaning Proper

One of the concerns that critics voice about Airbnb is that they may displace small hotels or traditional bed and breakfasts, and that there exist proprietors who are taking advantage of lax tax regulations by running multiple listings as mini-hotels. As the data currently stands, it is difficult to identify listings that may fall under this complaint immediately, but we have the information to make good guesses.

Create two new dummy variables, one identifying listings that are available on a single-night basis, and one identifying listings that are being run by hosts with multiple listings.

Create a THIRD variable that identifies whether both of the previous conditions are met.

# 1) single-night: minimum_nights == 1
# 2) multi_host: hosts with >1 listings (by host_id)
# 3) mini_hotel: both single-night and multi_host
#Two dummy variables
# 1. Single-night availability
air <- air %>%
  mutate(single_night = if_else(minimum_nights == 1, 1, 0))
# 2. Host with multiple listings
air <- air %>%
  mutate(multi_host = if_else(calculated_host_listings_count > 1, 1, 0))
#Third dummy variable
air <- air %>%
  mutate(mini_hotel = if_else(single_night == 1 & multi_host == 1, 1, 0))
#Show counts to confirm
air %>% summarize(
  n_single = sum(single_night, na.rm = TRUE),
  n_multi = sum(multi_host, na.rm = TRUE),
  n_mini = sum(mini_hotel, na.rm = TRUE)
) %>% kable()

n_single	n_multi	n_mini
12720	16592	4903

#n_single = 12720
#n_multi  = 16592
#n_mini   = 4903

Lots of Descriptives

One of the major things we do to learn about our data inside and out is calculate and observe many different types of descriptive and summary statistics. Here, I want to look at the data a host of different ways.

First, a simple count: How many listings are for entire homes/apartments? How many for “private rooms?”

Next, a ranking: What is the most popular listing neighborhood? That is, which neighborhood has the greatest number of listings? Which has the least?

Third, reach way back and tell me how many listings are described as “clean” in their name (remember CASE!). How about “spacious”?

Finally, what neighborhood has the most “mini-hotels,” as you identified in the third new variable above? Which has the most EXPENSIVE mini-hotels, on average?

# How many listings for each room type?
air %>% 
  count(room_type)

## # A tibble: 3 × 2
##   room_type           n
##   <chr>           <int>
## 1 Entire home/apt 25409
## 2 Private room    22326
## 3 Shared room      1160

# Most popular neighborhood
air %>% 
  count(neighbourhood, sort = TRUE) %>% 
  slice(1)

## # A tibble: 1 × 2
##   neighbourhood     n
##   <chr>         <int>
## 1 Williamsburg   3920

# Least popular neighborhood
air %>% 
  count(neighbourhood, sort = TRUE) %>% 
  slice(n())

## # A tibble: 1 × 2
##   neighbourhood     n
##   <chr>         <int>
## 1 Woodrow           1

# How many "clean" listings?
air %>% 
  filter(str_detect(tolower(name), "clean")) %>% 
  nrow()

## [1] 1217

# How many "spacious" listings?
air %>% 
  filter(str_detect(tolower(name), "spacious")) %>% 
  nrow()

## [1] 3800

# Neighborhoods with the most mini-hotels
air %>% 
  group_by(neighbourhood) %>% 
  summarize(mini_count = sum(mini_hotel)) %>% 
  arrange(desc(mini_count)) %>% 
  slice(1)

## # A tibble: 1 × 2
##   neighbourhood      mini_count
##   <chr>                   <dbl>
## 1 Bedford-Stuyvesant        381

# Neighborhood with the most expensive mini-hotels (average price)
air %>% 
  filter(mini_hotel == 1) %>% 
  group_by(neighbourhood) %>% 
  summarize(avg_price = mean(price, na.rm = TRUE)) %>% 
  arrange(desc(avg_price)) %>% 
  slice(1)

## # A tibble: 1 × 2
##   neighbourhood avg_price
##   <chr>             <dbl>
## 1 Columbia St        1080

#The data show 1,217 listings with “clean” in the title and 3,800 with “spacious,” which shows these are common selling points. The most and least popular neighborhoods appear clearly in the ranking tables, reflecting where listings are most concentrated and most scarce. Mini-hotels cluster heavily in the neighborhood with the highest mini-hotel count, suggesting certain areas attract more multi-listing, single-night hosts. The neighborhood with the highest average mini-hotel price stands out as the most expensive location for this type of listing.

Inferential Statistics

The next step is to use our data to say something about the world outside of our sample. Are some boroughs (here, neighborhood groups) just “different” than others? Compare Manhattan and Brooklyn on a few attributes. Does one have a different percentage of listings that are entire homes or apartments? Does one have a higher percentage of “mini-hotels?” Finally, is there a significant difference in price between the listings in each borough that are NOT mini-hotels?

NOTE: THESE ARE MULTI-STEP QUESTIONS. THINK ABOUT WHAT YOU NEED TO DO IN ORDER TO SET UP THE COMPARISONS, AND THEN DO IT STEP-BY-STEP.

# Filter to Manhattan and Brooklyn only
mb <- air %>% 
  filter(neighbourhood_group %in% c("Manhattan", "Brooklyn"))

# 1. Percentage of entire homes/apartments
mb %>% 
  group_by(neighbourhood_group) %>% 
  summarise(pct_entire = mean(room_type == "Entire home/apt"))

## # A tibble: 2 × 2
##   neighbourhood_group pct_entire
##   <chr>                    <dbl>
## 1 Brooklyn                 0.475
## 2 Manhattan                0.609

# 2. Percentage of mini-hotels
mb %>% 
  group_by(neighbourhood_group) %>% 
  summarise(pct_mini = mean(mini_hotel == 1))

## # A tibble: 2 × 2
##   neighbourhood_group pct_mini
##   <chr>                  <dbl>
## 1 Brooklyn              0.0869
## 2 Manhattan             0.0832

# 3. Price difference among NON–mini-hotel listings
non_mini <- mb %>% 
  filter(mini_hotel == 0)

t.test(price ~ neighbourhood_group, data = non_mini)

## 
##  Welch Two Sample t-test
## 
## data:  price by neighbourhood_group
## t = -31.154, df = 32845, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Brooklyn and group Manhattan is not equal to 0
## 95 percent confidence interval:
##  -77.69867 -68.50066
## sample estimates:
##  mean in group Brooklyn mean in group Manhattan 
##                125.0385                198.1381

#The t-test shows a highly significant difference in average Airbnb prices between Brooklyn and Manhattan (p < 2.2e-16). Manhattan listings are, on average, about $198, while Brooklyn listings average about $125. The 95% confidence interval (–77.7 to –68.5) indicates Manhattan prices are roughly $69–78 higher than Brooklyn’s. Because the interval does not include zero, we can confidently conclude the price difference is real and substantial.

Prediction Exercise

Why are rooms priced the way they are? Can we predict the price for an individual listing using some of its attributes? By now, you know the steps of a prediction exercise. Split your dataset in half and see how well you can predict the price of a listing based on a few attributes. You can pick your attributes, so long as they include at least one geographic attribute. Instead of comparing against the mean and median of the dataset, tell me the average percentage miss in your prediction.

set.seed(123)

# Split data 50/50
n <- nrow(non_mini)
train_index <- sample(1:n, n/2)
train <- non_mini[train_index, ]
test  <- non_mini[-train_index, ]

# Fit model with a geographic variable
model <- lm(price ~ room_type + neighbourhood_group + number_of_reviews, data = train)

# Predict on test set
pred <- predict(model, newdata = test)

# Combine predictions + true prices
out <- test %>% 
  mutate(pred = pred) %>% 
  filter(price > 0)   # <-- prevents division by zero

# Average percentage miss
pct_miss <- mean(abs(out$pred - out$price) / out$price * 100)

pct_miss

## [1] 53.70109

#The prediction model used room type, borough, and number of reviews to estimate listing prices, trained on half the data and tested on the remaining half. While the model captured some general pricing patterns, its accuracy varied widely across individual listings. The average percentage miss was about 54%, meaning predicted prices were often far from the true values. This suggests that price variation on Airbnb depends on many additional factors not included in the simple model, such as amenities, building quality, or host-specific characteristics.

Visualization and Mapping - to be completed after last R class

Two “easy” questions:

1.) Find the 30 most expensive mini-hotels, and the 30 least expensive mini-hotels and map them inside NYC. Are they clustered?

2.) Draw a graph that informs the reader about the distribution of prices for different room types in each borough. HINT: You’re going to want to use “color.”

## Q6

library(ggplot2)

# Convert airbnb data to sf object for mapping
air_sf <- air %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326, remove = FALSE)

### 1) 30 most expensive and 30 least expensive mini-hotels

mini <- air_sf %>% 
  filter(mini_hotel == 1, price > 0)

top30 <- mini %>% arrange(desc(price)) %>% slice(1:30)
bottom30 <- mini %>% arrange(price) %>% slice(1:30)

# Combine with a label
mini_map <- bind_rows(
  top30 %>% mutate(type = "Most Expensive"),
  bottom30 %>% mutate(type = "Least Expensive")
)

# Plot them
ggplot(mini_map) +
  geom_sf(aes(color = type), size = 2, alpha = 0.7) +
  labs(title = "30 Most and Least Expensive Mini-Hotels in NYC") +
  theme_minimal()

### 2) Distribution of prices by room type and borough

ggplot(air, aes(x = neighbourhood_group, y = price, color = room_type)) +
  geom_boxplot(outlier.alpha = 0.1) +
  coord_cartesian(ylim = c(0, 500)) +   # keeps extreme outliers from squashing the plot
  labs(
    title = "Price Distribution by Room Type and Borough",
    x = "Borough",
    y = "Price"
  ) +
  theme_minimal()

#The most and least expensive mini-hotels cluster mainly in Manhattan and parts of Brooklyn. The most expensive listings are especially concentrated in central Manhattan, while the cheapest are more spread but still centered around Manhattan/Brooklyn. Overall, they are not evenly distributed across NYC.

#Entire homes are the most expensive in every borough, followed by private rooms and shared rooms. Manhattan has the highest prices overall, while the Bronx and Staten Island are the cheapest. Price variation is also widest in Manhattan and Brooklyn.