This dataset is used to analyze the basic structure of the New York City Airbnb market, including pricing, availability, and location-based patterns.

Rows as Observations: Each row represents a unique Airbnb listing in New York City.

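Columns as Variables: Each column records one attribute of a listing, such as its host, neighbourhood, room type, price, or availability.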

library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_delim("./AB_NYC_2019.csv", delim = ",")
## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date  (1): last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
##     id name                                             host_id host_name
##  <dbl> <chr>                                              <dbl> <chr>
##   2539 Clean & quiet apt home by the park                  2787 John
##   2595 Skylit Midtown Castle                               2845 Jennifer
##   3647 THE VILLAGE OF HARLEM....NEW YORK !                 4632 Elisabeth
##   3831 Cozy Entire Floor of Brownstone                     4869 LisaRoxanne
##   5022 Entire Apt: Spacious Studio/Loft by central park    7192 Laura
##   5099 Large Cozy 1 BR Apartment In Midtown East           7322 Chris
view(data)
str(data)
## spc_tbl_ [48,895 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                            : num [1:48895] 2539 2595 3647 3831 5022 ...
##  $ name                          : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : num [1:48895] 2787 2845 4632 4869 7192 ...
##  $ host_name                     : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : Date[1:48895], format: "2018-10-19" "2019-05-21" ...
##  $ reviews_per_month             : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   name = col_character(),
##   ..   host_id = col_double(),
##   ..   host_name = col_character(),
##   ..   neighbourhood_group = col_character(),
##   ..   neighbourhood = col_character(),
##   ..   latitude = col_double(),
##   ..   longitude = col_double(),
##   ..   room_type = col_character(),
##   ..   price = col_double(),
##   ..   minimum_nights = col_double(),
##   ..   number_of_reviews = col_double(),
##   ..   last_review = col_date(format = ""),
##   ..   reviews_per_month = col_double(),
##   ..   calculated_host_listings_count = col_double(),
##   ..   availability_365 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

1. Numeric summary of data for Columns

Numeric columns

# Summary of Price column
summary(data$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    69.0   106.0   152.7   175.0 10000.0
# Summary of reviews_per_month column
summary(data$reviews_per_month)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.010   0.190   0.720   1.373   2.020  58.500   10052
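
The summaries above show a minimum price of 0 and 10,052 missing values in reviews_per_month, both of which are worth inspecting before any aggregation. The following is a minimal sketch of such a check. `listings_demo` is a made-up six-row stand-in that reuses the real column names, and the idea that a missing review rate always coincides with zero reviews is an assumption to test, not something the dataset documents.

```r
# Hypothetical stand-in for the real `data` tibble (all values are invented).
listings_demo <- data.frame(
  price             = c(149, 0, 225, 89, 0, 60),
  number_of_reviews = c(9, 0, 45, 0, 3, 12),
  reviews_per_month = c(0.21, NA, 0.38, NA, 0.10, 0.40)
)

# Count missing review rates and zero-price listings.
n_missing_rpm <- sum(is.na(listings_demo$reviews_per_month))
n_zero_price  <- sum(listings_demo$price == 0)

# Check whether a missing rate always coincides with zero reviews.
na_matches_no_reviews <- all(
  is.na(listings_demo$reviews_per_month) == (listings_demo$number_of_reviews == 0)
)

# Summarize price again with the zero-price rows excluded.
summary(listings_demo$price[listings_demo$price > 0])
```

Replacing `listings_demo` with `data` would run the same checks on the full dataset.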

Categorical Columns

# Summary of Room type

table(data$room_type)
## 
## Entire home/apt    Private room     Shared room 
##           25409           22326            1160
# Summary of neighbourhood_group
table(data$neighbourhood_group)
## 
##         Bronx      Brooklyn     Manhattan        Queens Staten Island 
##          1091         20104         21661          5666           373
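
Raw counts are easier to compare as shares of all listings. As a small sketch, the room-type counts printed above can be converted to percentages with `prop.table()`. The counts are copied by hand from that output. On the full dataset, `prop.table(table(data$room_type))` should give the same proportions directly.

```r
# Counts copied from the table(data$room_type) output above.
room_counts <- c(
  "Entire home/apt" = 25409,
  "Private room"    = 22326,
  "Shared room"     = 1160
)

# Convert counts to percentage shares of all listings, rounded to one decimal.
room_share <- round(100 * prop.table(room_counts), 1)
room_share
```

Entire homes or apartments and private rooms together account for nearly all listings, while shared rooms make up only a small fraction.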

Combined Summary

# Combined summary of price (numeric) and room_type (categorical)
summary(data[, c("price", "room_type")])
##      price          room_type        
##  Min.   :    0.0   Length:48895      
##  1st Qu.:   69.0   Class :character  
##  Median :  106.0   Mode  :character  
##  Mean   :  152.7                     
##  3rd Qu.:  175.0                     
##  Max.   :10000.0

2. Novel Questions to Investigate

  1. How does the average price vary by room type?

  2. What effect does the room type (private room, shared room, or entire home/apt) have on the average listing price in each neighbourhood_group?

  3. Is there a relationship between the number of reviews and price?

3. Address at least one of the above questions using an aggregation function

How does the average price vary by room type?

You can use group_by() and summarize() from the dplyr package to calculate the average price for each room type.

mean_price_roomtype <- data |>
  group_by(room_type) |>
  summarize(mean_price = mean(price))

mean_price_roomtype
##  room_type       mean_price
##  <chr>                <dbl>
##  Entire home/apt  211.79425
##  Private room      89.78097
##  Shared room       70.12759

Entire homes or apartments have an average price of about 212, well above private rooms (about 90) and shared rooms (about 70). Hosts and prospective guests can use this gap as a baseline when comparing listing prices.
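
Because the overall price distribution is right-skewed (mean 152.7 against a median of 106, with a maximum of 10,000), the mean on its own can overstate a typical price. The sketch below reports the median and group size next to the mean. `listings_demo` is an invented stand-in with one deliberately extreme price, so its numbers illustrate the technique only and say nothing about the real listings.

```r
library(dplyr)

# Hypothetical stand-in for `data`; prices are invented, with one extreme value.
listings_demo <- tibble(
  room_type = c("Entire home/apt", "Entire home/apt", "Entire home/apt",
                "Private room", "Private room", "Shared room"),
  price     = c(150, 200, 10000, 60, 80, 40)
)

# Report group size, mean, and median together for each room type.
price_by_room <- listings_demo |>
  group_by(room_type) |>
  summarize(
    n            = n(),
    mean_price   = mean(price),
    median_price = median(price),
    .groups      = "drop"
  )

price_by_room
```

In this toy data the single 10,000 listing pulls the entire home/apt mean far above its median, which is the pattern to look for when the same summary is run on `data`.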

4. Visual summaries

Plot 1. Average Listing Price by Room Type and Neighborhood Group:

# Calculate average price by room type and neighborhood group
avg_price <- data %>%
  group_by(neighbourhood_group, room_type) %>%
  summarize(mean_price = mean(price, na.rm = TRUE))
## `summarise()` has grouped output by 'neighbourhood_group'. You can override
## using the `.groups` argument.
print(avg_price)
## # A tibble: 15 × 3
## # Groups:   neighbourhood_group [5]
##    neighbourhood_group room_type       mean_price
##    <chr>               <chr>                <dbl>
##  1 Bronx               Entire home/apt      128. 
##  2 Bronx               Private room          66.8
##  3 Bronx               Shared room           59.8
##  4 Brooklyn            Entire home/apt      178. 
##  5 Brooklyn            Private room          76.5
##  6 Brooklyn            Shared room           50.5
##  7 Manhattan           Entire home/apt      249. 
##  8 Manhattan           Private room         117. 
##  9 Manhattan           Shared room           89.0
## 10 Queens              Entire home/apt      147. 
## 11 Queens              Private room          71.8
## 12 Queens              Shared room           69.0
## 13 Staten Island       Entire home/apt      174. 
## 14 Staten Island       Private room          62.3
## 15 Staten Island       Shared room           57.4
ggplot(avg_price, aes(x = neighbourhood_group, y = mean_price, fill = room_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average Listing Price by Room Type and Neighborhood Group",
       x = "Neighborhood Group",
       y = "Average Price") +
  theme_minimal()

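The bar plot shows how the average price of each room type (private room, shared room, entire home/apt) varies across the neighbourhood groups of New York City. Entire homes or apartments are the most expensive room type in every group, and Manhattan has the highest average price for all three room types.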
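
The same comparison can be read from a borough-by-room-type grid. The sketch below rebuilds the grouped averages from the printed output and cross-tabulates them with `xtabs()`. The values are the rounded means copied by hand from that printout, so they are approximate, and `avg_price_printed` is a name introduced here for illustration.

```r
# Rounded means copied from the printed avg_price tibble above.
avg_price_printed <- data.frame(
  neighbourhood_group = rep(
    c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
    each = 3
  ),
  room_type  = rep(c("Entire home/apt", "Private room", "Shared room"), times = 5),
  mean_price = c(128, 66.8, 59.8,
                 178, 76.5, 50.5,
                 249, 117,  89.0,
                 147, 71.8, 69.0,
                 174, 62.3, 57.4)
)

# Cross-tabulate: one row per neighbourhood group, one column per room type.
price_wide <- xtabs(mean_price ~ neighbourhood_group + room_type,
                    data = avg_price_printed)
price_wide

# Neighbourhood group with the highest mean price for each room type.
top_borough <- rownames(price_wide)[apply(price_wide, 2, which.max)]
top_borough
```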

Plot 2. Scatter plot of price vs. number of reviews

# Scatter plot of price vs. number of reviews
ggplot(data, aes(x = number_of_reviews, y = price)) +
  geom_point(alpha = 0.5, color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Price vs. Number of Reviews", x = "Number of Reviews", y = "Price")
## `geom_smooth()` using formula = 'y ~ x'

Price and number of reviews appear to be weakly negatively correlated: listings with higher prices tend to have fewer reviews. One possible explanation is that cheaper listings attract more frequent guests; another is that more expensive listings target niche markets with fewer guests.
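
A correlation coefficient can put a number on the trend seen in the scatter plot. The sketch below computes Pearson and Spearman correlations with base R's `cor()`. `listings_demo` is an invented eight-row stand-in, so the coefficients it produces say nothing about the real data. Spearman is included as an assumption about what suits heavily skewed prices, since it works on ranks.

```r
# Hypothetical stand-in for `data`; values are invented for illustration.
listings_demo <- data.frame(
  number_of_reviews = c(0, 5, 12, 40, 90, 150, 270, 430),
  price             = c(500, 225, 150, 149, 89, 80, 79, 60)
)

# Pearson measures linear association; Spearman uses ranks, so it is less
# sensitive to a few extreme prices.
pearson_r <- cor(listings_demo$number_of_reviews, listings_demo$price,
                 method = "pearson")
spearman_r <- cor(listings_demo$number_of_reviews, listings_demo$price,
                  method = "spearman")

c(pearson = pearson_r, spearman = spearman_r)
```

Running the same two calls on `data$number_of_reviews` and `data$price` would show how weak or strong the negative relationship is in the full dataset.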