library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_csv("AB_NYC_2019.csv")
## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date  (1): last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(data)
## spc_tbl_ [48,895 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                            : num [1:48895] 2539 2595 3647 3831 5022 ...
##  $ name                          : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : num [1:48895] 2787 2845 4632 4869 7192 ...
##  $ host_name                     : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : Date[1:48895], format: "2018-10-19" "2019-05-21" ...
##  $ reviews_per_month             : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   name = col_character(),
##   ..   host_id = col_double(),
##   ..   host_name = col_character(),
##   ..   neighbourhood_group = col_character(),
##   ..   neighbourhood = col_character(),
##   ..   latitude = col_double(),
##   ..   longitude = col_double(),
##   ..   room_type = col_character(),
##   ..   price = col_double(),
##   ..   minimum_nights = col_double(),
##   ..   number_of_reviews = col_double(),
##   ..   last_review = col_date(format = ""),
##   ..   reviews_per_month = col_double(),
##   ..   calculated_host_listings_count = col_double(),
##   ..   availability_365 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

1. Select a Continuous Response Variable

Response Variable: Price

Price is the natural response variable in the NYC Airbnb data. It has an important influence on booking decisions and revenue, so it matters to both hosts and guests.

2. Select a Categorical Explanatory Variable

Explanatory Variable: Neighbourhood Group (borough)

We hypothesize that the borough, or location, will have a significant impact on Airbnb listing prices.

To test whether mean prices differ among the five boroughs of New York City, we use a one-way ANOVA.
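Before running the test, it can help to look at the per-borough means that the ANOVA compares. The following is a minimal sketch, not part of the original analysis: `summarise_by_group` is a hypothetical helper, and the `toy` data frame is synthetic and only there to make the example runnable without the CSV file.

```r
library(dplyr)

# Hypothetical helper: count, mean, median, and SD of a numeric column
# within each level of a grouping column.
summarise_by_group <- function(df, group_col, value_col) {
  df %>%
    group_by(across(all_of(group_col))) %>%
    summarise(
      n            = n(),
      mean_value   = mean(.data[[value_col]], na.rm = TRUE),
      median_value = median(.data[[value_col]], na.rm = TRUE),
      sd_value     = sd(.data[[value_col]], na.rm = TRUE),
      .groups = "drop"
    )
}

# Synthetic toy rows (NOT taken from AB_NYC_2019.csv)
toy <- data.frame(
  neighbourhood_group = c("Bronx", "Bronx", "Queens", "Queens", "Queens"),
  price               = c(50, 70, 80, 100, 120)
)

res <- summarise_by_group(toy, "neighbourhood_group", "price")
print(res)
```

On the real data the call would be `summarise_by_group(data, "neighbourhood_group", "price")`.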

table(data$neighbourhood_group)
## 
##         Bronx      Brooklyn     Manhattan        Queens Staten Island 
##          1091         20104         21661          5666           373
# Perform one-way ANOVA on price by neighbourhood_group
anova_result <- aov(price ~ neighbourhood_group, data = data)

# Display ANOVA table
summary(anova_result)
##                        Df    Sum Sq  Mean Sq F value Pr(>F)    
## neighbourhood_group     4 7.959e+07 19897739     355 <2e-16 ***
## Residuals           48890 2.740e+09    56051                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

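3. Interpretation of the ANOVA Test Summary Results:

The F statistic is 355 on 4 and 48,890 degrees of freedom, with a p-value below 2e-16, so we reject the null hypothesis that mean price is equal across the five boroughs. At least one borough's mean price differs from the others; the ANOVA alone does not say which boroughs differ.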
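To gauge how large the borough effect is, and not only whether it is significant, a rough sketch can compute eta-squared from the sums of squares printed in the ANOVA table. The numbers below are copied from that table. `pairwise_borough_tukey` is a hypothetical helper, not something the original analysis ran, and it assumes a data frame with `price` and `neighbourhood_group` columns.

```r
# Sums of squares copied from the ANOVA table above
ss_between  <- 7.959e7   # neighbourhood_group row
ss_residual <- 2.740e9   # Residuals row

# Eta-squared: share of the total variance in price explained by borough
eta_sq <- ss_between / (ss_between + ss_residual)
print(round(eta_sq, 3))  # 0.028

# Hypothetical helper: Tukey's HSD for pairwise borough comparisons.
# The grouping column is converted to a factor before fitting.
pairwise_borough_tukey <- function(df) {
  df$neighbourhood_group <- factor(df$neighbourhood_group)
  TukeyHSD(aov(price ~ neighbourhood_group, data = df))
}
```

An eta-squared of about 0.028 means borough accounts for roughly 3% of the variance in price, so the effect is clearly significant but modest in size. Calling `pairwise_borough_tukey(data)` on the full data would show which pairs of boroughs differ.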

4. Select a Continuous Explanatory Variable


Explanatory Variable: Number of Reviews

Since properties with more reviews may be seen as more reputable or well-liked, we hypothesize that the number of reviews will have a linear relationship with price.

# Fit a linear regression model for price ~ number_of_reviews
lm_model <- lm(price ~ number_of_reviews, data = data)

# Display the regression summary
summary(lm_model)
## 
## Call:
## lm(formula = price ~ number_of_reviews, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -158.7  -84.1  -42.7   24.6 9842.6 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       158.73718    1.22396  129.69   <2e-16 ***
## number_of_reviews  -0.25850    0.02435  -10.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 239.9 on 48893 degrees of freedom
## Multiple R-squared:  0.0023, Adjusted R-squared:  0.002279 
## F-statistic: 112.7 on 1 and 48893 DF,  p-value: < 2.2e-16

5. Linear Regression Model Summary

Results Interpretation:

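Intercept: The estimated intercept is 158.74, which is the predicted price for a listing with zero reviews.

Slope: The estimated slope is -0.2585, so each additional review is associated with a price that is lower by about 0.26 on average.

Significance: The p-value for number_of_reviews is below 2e-16, well under 0.05, so the relationship between price and number_of_reviews is statistically significant. It is also very weak: the R-squared of 0.0023 means the number of reviews explains about 0.2% of the variation in price.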

Interpretation and Context:

The results show that listings with more reviews tend to have slightly lower prices, which suggests that heavily reviewed listings may be the more affordable ones. Hosts might deliberately set lower prices at first to attract reviews, then rely on other factors (such as location or room quality) to support higher prices as review counts grow. Because the data are observational, the model shows an association only, not that reviews cause lower prices.


Reading the intercept and slope this way shows how price varies with the number of reviews, although the association is slight.
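As a worked example of reading the fitted line, the sketch below plugs a few review counts into the equation using the coefficients printed in the regression summary. `predict_price` is a hypothetical helper name. The last step recovers R-squared from the reported F statistic, which for a single-predictor model is F / (F + residual df).

```r
# Coefficients copied from the regression summary above
intercept <- 158.73718
slope     <- -0.25850

# Hypothetical helper: predicted price for a given number of reviews
predict_price <- function(n_reviews) intercept + slope * n_reviews

predicted <- round(predict_price(c(0, 10, 100)), 2)
print(predicted)  # 158.74 156.15 132.89

# R-squared recovered from the F statistic and residual degrees of freedom
f_stat    <- 112.7
df_resid  <- 48893
r_squared <- f_stat / (f_stat + df_resid)
print(round(r_squared, 4))  # 0.0023
```

A difference of 100 reviews shifts the predicted price by only about 26, which is small next to the residual standard error of 239.9.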

6. Visualization for Results

The following boxplot supports the ANOVA by showing the price distribution in each borough:

ggplot(data, aes(x = neighbourhood_group, y = price)) +
  geom_boxplot() +
  labs(title = "Price Distribution by Borough", x = "Borough", y = "Price")
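The regression could be visualized in a similar way. The sketch below is one possible companion plot, not part of the original analysis. `plot_price_vs_reviews` is a hypothetical helper, the `y_max = 500` zoom is an arbitrary choice prompted by the very large residuals in the regression output (maximum 9842.6), and the `toy` data frame is synthetic so that the example runs without the CSV file.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of price against number of reviews
# with a fitted least-squares line. coord_cartesian() zooms the y-axis
# without dropping rows, so the line is still fitted on all points.
plot_price_vs_reviews <- function(df, y_max = 500) {
  ggplot(df, aes(x = number_of_reviews, y = price)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    coord_cartesian(ylim = c(0, y_max)) +
    labs(
      title = "Price vs. Number of Reviews",
      x = "Number of Reviews",
      y = "Price"
    )
}

# Synthetic toy rows (NOT taken from AB_NYC_2019.csv)
toy <- data.frame(
  number_of_reviews = c(0, 5, 20, 80, 150),
  price             = c(180, 150, 140, 120, 95)
)

p <- plot_price_vs_reviews(toy)
print(p)
```

On the real data the call would be `plot_price_vs_reviews(data)`.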

Conclusion and Additional Research: