library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_csv("AB_NYC_2019.csv")
## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date (1): last_review
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(data)
## spc_tbl_ [48,895 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:48895] 2539 2595 3647 3831 5022 ...
## $ name : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : num [1:48895] 2787 2845 4632 4869 7192 ...
## $ host_name : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : Date[1:48895], format: "2018-10-19" "2019-05-21" ...
## $ reviews_per_month : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. name = col_character(),
## .. host_id = col_double(),
## .. host_name = col_character(),
## .. neighbourhood_group = col_character(),
## .. neighbourhood = col_character(),
## .. latitude = col_double(),
## .. longitude = col_double(),
## .. room_type = col_character(),
## .. price = col_double(),
## .. minimum_nights = col_double(),
## .. number_of_reviews = col_double(),
## .. last_review = col_date(format = ""),
## .. reviews_per_month = col_double(),
## .. calculated_host_listings_count = col_double(),
## .. availability_365 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Price is the most important response variable in the NYC Airbnb
data. Price has an important influence on booking decisions and revenue, so it
matters to both hosts and guests.
For Guests: Price is a crucial consideration while planning their trip to New York City.
For Hosts: The key to managing a successful listing is understanding how to set a price that is both profitable and competitive.
We hypothesize that the borough, or location, will have a significant
impact on Airbnb listing prices.
Null Hypothesis (H₀): The mean price of Airbnb listings is the same in every borough.
Alternative Hypothesis (H₁): At least two boroughs have
different mean Airbnb listing prices.
To find out if there are any notable differences in the mean prices among the five boroughs of New York City, we will use an ANOVA test.
table(data$neighbourhood_group)
##
## Bronx Brooklyn Manhattan Queens Staten Island
## 1091 20104 21661 5666 373
# Perform one-way ANOVA on price by neighbourhood_group
anova_result <- aov(price ~ neighbourhood_group, data = data)
# Display ANOVA table
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## neighbourhood_group 4 7.959e+07 19897739 355 <2e-16 ***
## Residuals 48890 2.740e+09 56051
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table reports F = 355 with a p-value below 2e-16. Because the p-value is less than 0.05, we reject the null hypothesis: the average price in at least one borough differs significantly from the others, so borough is associated with price.
Had the p-value been ≥ 0.05, the null hypothesis could not have been rejected,
indicating insufficient evidence to support the idea that
average prices vary by borough.
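A significant F test does not say how large the borough effect is. As a rough check, the share of price variance explained by borough (eta squared) can be worked out from the sums of squares printed in the ANOVA table. The variable names below are illustrative, and the inputs are the rounded values from the table, so the result is approximate.

```r
# Sums of squares copied (rounded) from the ANOVA table above.
ss_borough  <- 7.959e+07  # neighbourhood_group
ss_residual <- 2.740e+09  # Residuals

# Eta squared: between-group sum of squares over total sum of squares.
eta_squared <- ss_borough / (ss_borough + ss_residual)
round(eta_squared, 3)
```

With these inputs, borough accounts for roughly 3% of the variance in price, so the effect is statistically significant but leaves most of the price variation unexplained.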
What This Means:
If borough has an impact on price: Guests may find cheaper accommodation in certain boroughs, and hosts may need to set prices according to the market in their borough.
If borough has no effect on price: Other factors (such as room type or reviews) should be taken into account when setting prices, and location might not be the deciding factor.
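The ANOVA only says that at least one borough differs; it does not say which ones. A common follow-up is a post hoc pairwise comparison such as Tukey's HSD. The sketch below uses a small made-up data set (`toy`, with three hypothetical groups A, B and C), not the Airbnb listings, so that it runs on its own.

```r
# Hypothetical stand-in data: three made-up groups, NOT the Airbnb listings.
set.seed(42)
toy <- data.frame(
  borough = factor(rep(c("A", "B", "C"), each = 50)),
  price   = c(rnorm(50, mean = 100, sd = 20),
              rnorm(50, mean = 100, sd = 20),
              rnorm(50, mean = 160, sd = 20))
)

# One-way ANOVA followed by Tukey's honest significant difference test.
toy_fit   <- aov(price ~ borough, data = toy)
toy_tukey <- TukeyHSD(toy_fit, "borough", conf.level = 0.95)

# One row per pair of groups: difference in means, CI bounds, adjusted p-value.
print(toy_tukey$borough)
```

The same idea should carry over to the real model, for example `TukeyHSD(anova_result)`. Because `neighbourhood_group` is read in as a character column, it may need to be converted with `factor()` before fitting for the call to work.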
Explanatory Variable: Number of Reviews
Since properties with more reviews may be seen as more reputable or
well-liked, we hypothesize that the number of reviews will have a linear
relationship with price.
Null Hypothesis (H₀): The price and the number of reviews have no significant linear relationship.
Alternative Hypothesis (H₁): The price and the number of reviews have a significant linear relationship.
We will assess the relationship between price and number_of_reviews using a simple linear regression model.
# Fit a linear regression model for price ~ number_of_reviews
lm_model <- lm(price ~ number_of_reviews, data = data)
# Display the regression summary
summary(lm_model)
##
## Call:
## lm(formula = price ~ number_of_reviews, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -158.7 -84.1 -42.7 24.6 9842.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 158.73718 1.22396 129.69 <2e-16 ***
## number_of_reviews -0.25850 0.02435 -10.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 239.9 on 48893 degrees of freedom
## Multiple R-squared: 0.0023, Adjusted R-squared: 0.002279
## F-statistic: 112.7 on 1 and 48893 DF, p-value: < 2.2e-16
Results Interpretation:
Intercept (β₀): This model’s intercept is 158.74. This is the predicted average price for a listing with zero reviews. It can be thought of as a baseline price: the expected rate for a listing that has not yet received any reviews.
Slope (β₁): number_of_reviews has a slope of -0.2585. According to this negative coefficient, the predicted price of a listing drops by about $0.26, on average, for every additional review. Although small, this inverse relationship suggests that listings with more reviews typically have slightly lower prices, possibly because cheaper listings attract more visitors or because of increased competition.
Significance: The p-value for number_of_reviews is less than 0.05 (p < 2e-16), confirming that there is a statistically significant, albeit weak, relationship between price and number_of_reviews. The model explains only about 0.23% of the variance in price (R-squared = 0.0023).
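To make the intercept and slope concrete, the fitted line can be evaluated by hand. This is a small sketch using the coefficients copied from the regression summary; `predicted_price` is a helper name introduced here for illustration.

```r
# Coefficients copied from the regression summary above.
b0 <- 158.73718  # intercept
b1 <- -0.25850   # slope for number_of_reviews

# Fitted line: price = b0 + b1 * number_of_reviews
predicted_price <- function(n_reviews) b0 + b1 * n_reviews

predicted_price(c(0, 10, 100))
```

The predictions are about $158.74 for 0 reviews, $156.15 for 10 reviews, and $132.89 for 100 reviews. With the fitted model object, the same values should be available from `predict(lm_model, newdata = data.frame(number_of_reviews = c(0, 10, 100)))`.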
Interpretation and Context:
The results show that listings with more reviews typically have
slightly lower prices, which may indicate that listings with many reviews
prioritize accessibility or affordability. Hosts might therefore
deliberately set lower listing prices to gain reviews at first, then
rely on other factors (such as location or room quality) to
support higher prices as review counts grow.
Interpreting the intercept and slope in this way shows how the number of reviews
relates to price, albeit only slightly.
ANOVA Visualization: A box plot showing the distribution of price within each borough.
ggplot(data, aes(x = neighbourhood_group, y = price)) +
  geom_boxplot() +
  labs(title = "Price Distribution by Borough", x = "Borough", y = "Price")
Regression Visualization: A scatter plot with a regression line showing the relationship between the number of reviews and price.
ggplot(data, aes(x = number_of_reviews, y = price)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Relationship between Number of Reviews and Price", x = "Number of Reviews", y = "Price")
## `geom_smooth()` using formula = 'y ~ x'
Conclusion and Additional Research:
ANOVA: Borough has a significant effect on price (p < 2e-16), so guests and hosts can use this information to adjust their budgets or set prices strategically.
Regression: The number of reviews has a statistically significant but
very weak negative relationship with price (R-squared = 0.0023), so obtaining
more reviews is unlikely, on its own, to increase the price a listing can command.
Further questions to investigate:
Do other factors, such as room type and availability, have a greater impact on price than reviews or location?
Do reviews have diminishing returns, meaning that they no longer
affect price after a certain number?
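One possible way to probe the diminishing-returns question is to compare the straight-line model with one that uses the logarithm of the review count, so that each extra review matters less as the count grows. The sketch below runs on simulated data (`toy_reviews`, generated with a built-in log relationship), not on the Airbnb listings. The names `fit_linear` and `fit_log` are illustrative.

```r
# Simulated stand-in data with a built-in diminishing effect of reviews.
set.seed(1)
toy_reviews <- data.frame(
  number_of_reviews = sample(0:300, 500, replace = TRUE)
)
toy_reviews$price <- 150 - 15 * log1p(toy_reviews$number_of_reviews) +
  rnorm(500, sd = 5)

# Straight-line model versus a model in log(1 + reviews);
# log1p() keeps listings with zero reviews in the data.
fit_linear <- lm(price ~ number_of_reviews, data = toy_reviews)
fit_log    <- lm(price ~ log1p(number_of_reviews), data = toy_reviews)

# Lower AIC indicates the better-fitting model for the same response.
AIC(fit_linear, fit_log)
```

On the real data, the same comparison could be run by replacing `toy_reviews` with `data`. If the log model fits noticeably better, that would be consistent with reviews having a diminishing effect on price.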