week8

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data <- read_csv("AB_NYC_2019.csv")

## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date  (1): last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(data)

## spc_tbl_ [48,895 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                            : num [1:48895] 2539 2595 3647 3831 5022 ...
##  $ name                          : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : num [1:48895] 2787 2845 4632 4869 7192 ...
##  $ host_name                     : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : Date[1:48895], format: "2018-10-19" "2019-05-21" ...
##  $ reviews_per_month             : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   name = col_character(),
##   ..   host_id = col_double(),
##   ..   host_name = col_character(),
##   ..   neighbourhood_group = col_character(),
##   ..   neighbourhood = col_character(),
##   ..   latitude = col_double(),
##   ..   longitude = col_double(),
##   ..   room_type = col_character(),
##   ..   price = col_double(),
##   ..   minimum_nights = col_double(),
##   ..   number_of_reviews = col_double(),
##   ..   last_review = col_date(format = ""),
##   ..   reviews_per_month = col_double(),
##   ..   calculated_host_listings_count = col_double(),
##   ..   availability_365 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

1. Select a Continuous Response Variable

Response Variable: Price

Price is the most natural response variable in the NYC Airbnb data. It has an important effect on booking decisions and revenue, so it matters to both hosts and guests.

For Guests: Price is a crucial consideration when planning a trip to New York City.
For Hosts: The key to managing a successful listing is understanding how to set a price that is both profitable and competitive.

2. Select a Categorical Explanatory Variable

Explanatory Variable: Borough (neighbourhood_group)

We hypothesize that the borough, or location, will have a significant impact on Airbnb listing prices.

Null Hypothesis (H₀): The mean Airbnb price is the same in all five boroughs.
Alternative Hypothesis (H₁): At least one borough has a mean Airbnb price that differs from the others.

To test whether mean prices differ among the five boroughs of New York City, we will use a one-way ANOVA.

table(data$neighbourhood_group)

## 
##         Bronx      Brooklyn     Manhattan        Queens Staten Island 
##          1091         20104         21661          5666           373
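
The counts above show that the boroughs are very unevenly represented, so it can help to look at per-borough price summaries before running the ANOVA. The sketch below is only an illustration: `summarise_price()` is a hypothetical helper that is not part of the original analysis, and the `toy` data frame holds made-up prices so the block runs on its own.

```r
# Hypothetical helper: per-group count, mean, and median of price.
# Expects columns named `neighbourhood_group` and `price`, as in the report's data.
summarise_price <- function(df) {
  groups <- split(df$price, df$neighbourhood_group)
  data.frame(
    borough = names(groups),
    n       = vapply(groups, length, integer(1)),
    mean    = vapply(groups, mean, numeric(1)),
    median  = vapply(groups, median, numeric(1)),
    row.names = NULL
  )
}

# Made-up prices for illustration only; these are NOT values from AB_NYC_2019.csv.
toy <- data.frame(
  neighbourhood_group = c("Bronx", "Bronx", "Bronx",
                          "Manhattan", "Manhattan", "Manhattan"),
  price = c(50, 70, 90, 100, 200, 600)
)

borough_summary <- summarise_price(toy)
print(borough_summary)
```

In the report itself the call would presumably be `summarise_price(data)`. A mean well above the median within a borough would point to a few very expensive listings pulling the average up.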

# Perform one-way ANOVA on price by neighbourhood_group
anova_result <- aov(price ~ neighbourhood_group, data = data)

# Display ANOVA table
summary(anova_result)

##                        Df    Sum Sq  Mean Sq F value Pr(>F)    
## neighbourhood_group     4 7.959e+07 19897739     355 <2e-16 ***
## Residuals           48890 2.740e+09    56051                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3. Interpretation of the ANOVA Test Summary Results:

Decision rule: If the p-value is less than 0.05, we reject the null hypothesis and conclude that the mean price in at least one borough differs from the others. If the p-value is 0.05 or greater, we fail to reject the null hypothesis, because there is insufficient evidence that mean prices vary by borough.
Result: The test gives F = 355 on 4 and 48,890 degrees of freedom with p < 2e-16, far below 0.05. We therefore reject the null hypothesis: mean price differs across boroughs.
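
A one-way ANOVA only indicates that at least one borough differs; it does not say which ones. One common follow-up is Tukey's HSD, sketched below. This step was not run in the original analysis. The `toy` data frame is simulated purely so the block runs on its own, and the grouping column is converted to a factor explicitly because `read_csv()` reads it as character.

```r
# Simulated stand-in data; the group means below are invented for illustration.
set.seed(1)
toy <- data.frame(
  neighbourhood_group = factor(rep(c("Bronx", "Brooklyn", "Manhattan"), each = 30)),
  price = c(rnorm(30, mean = 90,  sd = 20),
            rnorm(30, mean = 120, sd = 20),
            rnorm(30, mean = 200, sd = 20))
)

# Same model form as in the report: price explained by borough
fit <- aov(price ~ neighbourhood_group, data = toy)

# Pairwise differences in mean price with family-wise 95% confidence intervals
tukey <- TukeyHSD(fit)
print(tukey)
```

With the five boroughs in the real data, the same call would report ten pairwise comparisons rather than three.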

What This Means:
Because borough has a significant effect on price, guests may find cheaper accommodation in certain boroughs, and hosts may need to set prices according to the market in their borough.
Had borough shown no effect on price, other factors (such as room type or reviews) would need to be taken into account when setting prices, and location might not be the deciding factor.
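
Statistical significance with almost 49,000 listings does not by itself say how large the borough effect is. One way to gauge that is eta-squared, the share of total variation in price attributed to borough. The original report does not compute it; the sketch below simply reuses the two sums of squares printed in the ANOVA table.

```r
# Sums of squares copied from the ANOVA table in this report
ss_borough  <- 7.959e+07  # neighbourhood_group row
ss_residual <- 2.740e+09  # Residuals row

# Eta-squared = between-group SS / total SS
eta_squared <- ss_borough / (ss_borough + ss_residual)
print(round(eta_squared, 3))
```

This comes to roughly 0.028, so borough accounts for a little under 3% of the variation in price, which leaves most of the variation to other factors.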

4. Select a Continuous Explanatory Variable

Explanatory Variable: Number of Reviews

Since properties with more reviews may be seen as more reputable or well-liked, we hypothesize that the number of reviews will have a linear relationship with price.

Null Hypothesis (H₀): There is no linear relationship between price and the number of reviews (β₁ = 0).
Alternative Hypothesis (H₁): There is a linear relationship between price and the number of reviews (β₁ ≠ 0).

We will assess the relationship between price and number_of_reviews using a simple linear regression model.

# Fit a linear regression model for price ~ number_of_reviews
lm_model <- lm(price ~ number_of_reviews, data = data)

# Display the regression summary
summary(lm_model)

## 
## Call:
## lm(formula = price ~ number_of_reviews, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -158.7  -84.1  -42.7   24.6 9842.6 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       158.73718    1.22396  129.69   <2e-16 ***
## number_of_reviews  -0.25850    0.02435  -10.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 239.9 on 48893 degrees of freedom
## Multiple R-squared:  0.0023, Adjusted R-squared:  0.002279 
## F-statistic: 112.7 on 1 and 48893 DF,  p-value: < 2.2e-16

5. Linear Regression Model Summary

Results Interpretation:

Intercept (β₀): The model’s intercept is 158.74. This is the predicted average price, in dollars, for a listing with zero reviews. It serves as the baseline from which the model adjusts the predicted price as the review count grows.
Slope (β₁): number_of_reviews has a slope of -0.2585. This negative coefficient means that each additional review is associated with a drop of about $0.26 in the predicted price, on average. Although small, this inverse relationship suggests that listings with more reviews tend to be slightly cheaper, perhaps because lower prices attract more guests (and therefore more reviews) or because of increased competition.

Significance: The p-value for number_of_reviews is below 2e-16, well under 0.05, confirming a statistically significant relationship between price and number_of_reviews. The relationship is weak, however: with R² = 0.0023, the number of reviews explains only about 0.2% of the variation in price.
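
To make the intercept and slope concrete, the fitted line can be evaluated at a few review counts. The sketch below hard-codes the two coefficients from the regression summary; `predict_price()` is a hypothetical helper added for illustration, not part of the original analysis.

```r
# Coefficients copied from the regression summary in this report
b0 <- 158.73718  # intercept
b1 <- -0.25850   # slope for number_of_reviews

# Hypothetical helper: predicted price for a given number of reviews
predict_price <- function(n_reviews) b0 + b1 * n_reviews

predicted <- predict_price(c(0, 10, 100))
print(predicted)
```

Under this model a listing with 100 reviews is predicted at about $132.89, roughly $26 below a listing with none. Given the residual standard error of 239.9, that gap is small compared with the scatter around the line.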

Interpretation and Context:

The results show that listings with more reviews typically have slightly lower prices, indicating that heavily reviewed listings might prioritize accessibility or affordability. Hosts might therefore price low at first to attract reviews, or rely on other factors (such as location or room quality) to support higher prices as review counts grow.

Reading the intercept and slope this way shows how the number of reviews relates to price, although the association is slight and the model does not establish that reviews cause lower prices.

6. Visualization for Results

The following visualizations support the analysis:

ANOVA Visualization: A box plot showing price distribution across different boroughs.

ggplot(data, aes(x = neighbourhood_group, y = price)) +
  geom_boxplot() +
  labs(title = "Price Distribution by Borough", x = "Borough", y = "Price")

Regression Visualization: A scatter plot with a regression line showing the relationship between the number of reviews and price.

ggplot(data, aes(x = number_of_reviews, y = price)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Relationship between Number of Reviews and Price", x = "Number of Reviews", y = "Price")

## `geom_smooth()` using formula = 'y ~ x'

Conclusion and Additional Research:

ANOVA: Borough has a significant effect on price (p < 2e-16), so hosts can use this information to set prices strategically and guests can use it to adjust their budgets.
Regression: The number of reviews has a statistically significant but very weak negative relationship with price (slope -0.26, R² = 0.0023), so accumulating reviews is unlikely, on its own, to raise the price a listing can command.

Further questions to investigate:
Do other elements, like the kind of room and availability, have a greater impact on price than reviews or location?
Do reviews have diminishing returns, meaning that they no longer affect price after a certain number?
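
One possible way to probe the diminishing-returns question is to compare the straight-line model with one that uses `log1p(number_of_reviews)`, which flattens out as review counts grow. This is only a sketch of the idea, not something the original analysis ran. The `toy` data below is simulated with an invented log-shaped relationship so the block runs on its own, and its output says nothing about the real Airbnb data.

```r
# Simulated data; the relationship and noise level are invented for illustration.
set.seed(42)
n <- 500
toy <- data.frame(number_of_reviews = rpois(n, lambda = 20))
toy$price <- 150 - 10 * log1p(toy$number_of_reviews) + rnorm(n, sd = 15)

# Model 1: price changes by a constant amount per review
fit_linear <- lm(price ~ number_of_reviews, data = toy)

# Model 2: each extra review matters less as the count grows
fit_log <- lm(price ~ log1p(number_of_reviews), data = toy)

# Lower AIC indicates the better-fitting model for the same response
aic <- AIC(fit_linear, fit_log)
print(aic)
```

On the real data the same comparison would use `data` in place of `toy`. A clearly lower AIC for the log model would be consistent with diminishing returns.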