This dataset is used to analyze the basic structure of the New York City Airbnb market, including pricing, availability, and location-based patterns.
Rows as Observations: Each row represents a unique Airbnb listing in New York City.
Columns as Variables
Continuous Variables: price, latitude, longitude, minimum_nights.
Time-based Column: last_review.
Categorical Variables: neighbourhood_group, room_type.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_delim("./AB_NYC_2019.csv", delim = ",")
## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date (1): last_review
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
id <dbl> | name <chr> | host_id <dbl> | host_name <chr> |
---|---|---|---|
2539 | Clean & quiet apt home by the park | 2787 | John |
2595 | Skylit Midtown Castle | 2845 | Jennifer |
3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth |
3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne |
5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura |
5099 | Large Cozy 1 BR Apartment In Midtown East | 7322 | Chris |
view(data)
str(data)
## spc_tbl_ [48,895 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:48895] 2539 2595 3647 3831 5022 ...
## $ name : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : num [1:48895] 2787 2845 4632 4869 7192 ...
## $ host_name : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : Date[1:48895], format: "2018-10-19" "2019-05-21" ...
## $ reviews_per_month : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. name = col_character(),
## .. host_id = col_double(),
## .. host_name = col_character(),
## .. neighbourhood_group = col_character(),
## .. neighbourhood = col_character(),
## .. latitude = col_double(),
## .. longitude = col_double(),
## .. room_type = col_character(),
## .. price = col_double(),
## .. minimum_nights = col_double(),
## .. number_of_reviews = col_double(),
## .. last_review = col_date(format = ""),
## .. reviews_per_month = col_double(),
## .. calculated_host_listings_count = col_double(),
## .. availability_365 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Summary of Price column
summary(data$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 69.0 106.0 152.7 175.0 10000.0
# Summary of reviews_per_month column
summary(data$reviews_per_month)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.010 0.190 0.720 1.373 2.020 58.500 10052
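The summary above reports 10,052 missing values in reviews_per_month. A minimal sketch of how the missing values could be inspected follows. It assumes, without confirming it for the real data, that a missing reviews_per_month might coincide with a listing that has no reviews. The `toy` data frame and its values are hypothetical stand-ins for the real table.

```r
library(dplyr)

# Hypothetical toy data standing in for the real listings table
toy <- data.frame(
  number_of_reviews = c(9, 0, 45, 0, 270),
  reviews_per_month = c(0.21, NA, 0.38, NA, 4.64)
)

# Count missing values in every column
na_counts <- toy |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Cross-tabulate "listing has zero reviews" against "reviews_per_month is missing"
na_vs_zero <- table(
  zero_reviews = toy$number_of_reviews == 0,
  missing_rpm  = is.na(toy$reviews_per_month)
)

na_counts
na_vs_zero
```

Running the same two steps on `data` would show whether the missing values are confined to listings without reviews or are spread more widely.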
# Summary of Room type
table(data$room_type)
##
## Entire home/apt Private room Shared room
## 25409 22326 1160
# Summary of neighbourhood group
table(data$neighbourhood_group)
##
## Bronx Brooklyn Manhattan Queens Staten Island
## 1091 20104 21661 5666 373
# Combined summary of price (numeric) and room_type (categorical)
summary(data[, c("price", "room_type")])
## price room_type
## Min. : 0.0 Length:48895
## 1st Qu.: 69.0 Class :character
## Median : 106.0 Mode :character
## Mean : 152.7
## 3rd Qu.: 175.0
## Max. :10000.0
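In the combined summary above, room_type is reported only as a character column (length, class, mode), which says nothing about its categories. One possible alternative, sketched here on a hypothetical `toy` data frame rather than the real data, is to convert the column to a factor first so that summary() reports a count per level.

```r
# Hypothetical toy data standing in for the real listings table
toy <- data.frame(
  price = c(149, 225, 150, 89, 80),
  room_type = c("Private room", "Entire home/apt", "Private room",
                "Entire home/apt", "Shared room")
)

# Convert the character column to a factor so summary() counts each level
toy$room_type <- factor(toy$room_type)

# Per-level counts for the factor alone
room_summary <- summary(toy$room_type)

# Combined summary: quantiles for price, counts for room_type
summary(toy)
```

The same conversion could be applied to neighbourhood_group, the other categorical variable listed earlier.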
How does the average price vary by room type?
What effect does the type of room (private room, shared room, or entire home/apt) have on the average listing price in different neighbourhood groups?
Is there a relationship between the number of reviews and price?
How does the average price vary by room type?
You can use group_by() and summarize() from the dplyr package to calculate the average price for each room type.
mean_price_roomtype <- data |>
  group_by(room_type) |>
  summarize(mean_price = mean(price))
mean_price_roomtype
room_type <chr> | mean_price <dbl> |
---|---|
Entire home/apt | 211.79425 |
Private room | 89.78097 |
Shared room | 70.12759 |
The average price of an entire home or apartment (about 212) is far higher than that of a private room (about 90) or a shared room (about 70). Hosts and prospective guests can use this information to gauge market pricing patterns.
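The price summary earlier shows a maximum of 10,000 and a mean (152.7) well above the median (106), so a few very expensive listings may be pulling the averages upward. As a hedged sketch, the median can be reported next to the mean for each room type. The `toy` data frame below is hypothetical and includes one deliberately extreme price to illustrate the effect.

```r
library(dplyr)

# Hypothetical toy data; the 10000 value mimics an extreme listing price
toy <- data.frame(
  room_type = c("Entire home/apt", "Entire home/apt", "Entire home/apt",
                "Private room", "Private room", "Private room"),
  price = c(150, 200, 10000, 60, 80, 100)
)

# Mean, median, and group size for each room type
price_by_room <- toy |>
  group_by(room_type) |>
  summarize(
    mean_price   = mean(price),
    median_price = median(price),
    n_listings   = n()
  )

price_by_room
```

In the toy data, the single extreme listing raises the mean for entire homes far above the median, while the private-room mean and median agree. Whether the real data behaves similarly would need to be checked by running the same summary on `data`.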
# Calculate average price by room type and neighborhood group
avg_price <- data %>%
group_by(neighbourhood_group, room_type) %>%
summarize(mean_price = mean(price, na.rm = TRUE))
## `summarise()` has grouped output by 'neighbourhood_group'. You can override
## using the `.groups` argument.
print(avg_price)
## # A tibble: 15 × 3
## # Groups: neighbourhood_group [5]
## neighbourhood_group room_type mean_price
## <chr> <chr> <dbl>
## 1 Bronx Entire home/apt 128.
## 2 Bronx Private room 66.8
## 3 Bronx Shared room 59.8
## 4 Brooklyn Entire home/apt 178.
## 5 Brooklyn Private room 76.5
## 6 Brooklyn Shared room 50.5
## 7 Manhattan Entire home/apt 249.
## 8 Manhattan Private room 117.
## 9 Manhattan Shared room 89.0
## 10 Queens Entire home/apt 147.
## 11 Queens Private room 71.8
## 12 Queens Shared room 69.0
## 13 Staten Island Entire home/apt 174.
## 14 Staten Island Private room 62.3
## 15 Staten Island Shared room 57.4
ggplot(avg_price, aes(x = neighbourhood_group, y = mean_price, fill = room_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average Listing Price by Room Type and Neighborhood Group",
       x = "Neighborhood Group",
       y = "Average Price",
       fill = "Room Type") +
  theme_minimal()
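The bar plot shows how the average price varies by room type (entire home/apt, private room, shared room) across the five neighbourhood groups in New York City. Entire homes/apartments are the most expensive room type in every group, and Manhattan has the highest average price for all three room types.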
# Scatter plot of price vs. number of reviews
ggplot(data, aes(x = number_of_reviews, y = price)) +
geom_point(alpha = 0.5, color = "blue") +
geom_smooth(method = "lm", color = "red") +
labs(title = "Price vs. Number of Reviews", x = "Number of Reviews", y = "Price")
## `geom_smooth()` using formula = 'y ~ x'
The number of reviews and the price appear to be weakly negatively correlated, suggesting that higher-priced listings may receive fewer reviews. One possible reason is that cheaper listings attract more frequent guests. Another is that more expensive listings target niche markets with fewer guests.
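The fitted line only suggests a direction; a correlation coefficient would quantify how strong the relationship is. The sketch below computes Pearson and Spearman correlations on a hypothetical `toy` data frame. Spearman is included as an assumption that a rank-based measure may suit skewed prices better; the report itself does not compute either.

```r
# Hypothetical toy data standing in for the real listings table
toy <- data.frame(
  number_of_reviews = c(0, 5, 20, 50, 120, 300),
  price = c(400, 250, 180, 150, 90, 60)
)

# Pearson correlation measures the linear association
pearson <- cor(toy$number_of_reviews, toy$price, use = "complete.obs")

# Spearman correlation measures the monotonic (rank-based) association
spearman <- cor(toy$number_of_reviews, toy$price,
                method = "spearman", use = "complete.obs")

c(pearson = pearson, spearman = spearman)
```

On the real data, the same calls with `data$number_of_reviews` and `data$price` would give a number to accompany the scatter plot.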