WEEK7

library(pwr)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data <- read_csv("AB_NYC_2019.csv")

## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date  (1): last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(data)

## spc_tbl_ [48,895 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                            : num [1:48895] 2539 2595 3647 3831 5022 ...
##  $ name                          : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : num [1:48895] 2787 2845 4632 4869 7192 ...
##  $ host_name                     : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : Date[1:48895], format: "2018-10-19" "2019-05-21" ...
##  $ reviews_per_month             : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   name = col_character(),
##   ..   host_id = col_double(),
##   ..   host_name = col_character(),
##   ..   neighbourhood_group = col_character(),
##   ..   neighbourhood = col_character(),
##   ..   latitude = col_double(),
##   ..   longitude = col_double(),
##   ..   room_type = col_character(),
##   ..   price = col_double(),
##   ..   minimum_nights = col_double(),
##   ..   number_of_reviews = col_double(),
##   ..   last_review = col_date(format = ""),
##   ..   reviews_per_month = col_double(),
##   ..   calculated_host_listings_count = col_double(),
##   ..   availability_365 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

This section works through Hypothesis 1 step by step under the Neyman-Pearson framework, covering the hypothesis test, alpha level, power, and required sample size.

Hypothesis 1: Price Difference Between Manhattan and Other Boroughs.

Null hypothesis (H0): The average price of Airbnb listings in Manhattan is the same as the average price in the other boroughs.
Alternative hypothesis (H1): The average price of Airbnb listings in Manhattan is higher than the average price in the other boroughs.
The Neyman-Pearson Framework:
To test this hypothesis, we fix the following before looking at the data:

Test: A two-sample t-test comparing the mean price in Manhattan with the mean price in the other boroughs.
Alpha level (Type I error rate): 0.05, a conventional cutoff that allows a 5% probability of rejecting the null hypothesis when it is actually true.
Power (1 - Type II error rate): 0.80, meaning that if a true difference of the assumed size exists, the test has an 80% chance of detecting it.
Effect size: A price difference of $50 is treated as economically meaningful. It sets how large a difference must be before decision-makers would act on it.
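To make the meaning of power concrete, the following is a minimal simulation sketch. It assumes normally distributed data, a standardized effect of 0.5, and 64 observations per group; these values, the seed, and the variable names are illustrative assumptions, not results from the Airbnb data.

```r
set.seed(7)  # arbitrary seed so the simulation is reproducible

# Simulate many two-sample experiments in which a true effect exists
# (standardized difference of 0.5, 64 observations per group) and
# record whether each experiment rejects H0 at alpha = 0.05.
n_sims <- 2000
rejected <- replicate(n_sims, {
  group_a <- rnorm(64, mean = 0.5, sd = 1)
  group_b <- rnorm(64, mean = 0,   sd = 1)
  t.test(group_a, group_b)$p.value < 0.05
})

# The share of rejections is a Monte Carlo estimate of the power
estimated_power <- mean(rejected)
estimated_power
```

Under these assumptions the estimate should land close to 0.80, which is the long-run rate at which a real difference of that size would be detected.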

Calculating the Sample Size

To calculate the required sample size, we first estimate the effect size (Cohen’s d) and then run a power analysis.

Step 1: Calculate Effect Size

The effect size formula (Cohen’s d) is as follows:

$$d = \frac{\mu_1 - \mu_2}{\sigma}$$

Where:

The means of Manhattan and the other boroughs are μ1 and μ2, respectively.
The price’s pooled standard deviation is represented by σ.

To keep things simple, we assume that a meaningful price difference between Manhattan and the other boroughs is $50. Dividing that difference by the standard deviation of the dataset’s prices gives the effect size.
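One way to turn the $50 difference into Cohen’s d is sketched here. The helper name `cohens_d_for_difference` and the toy price vector are hypothetical and not part of the original analysis. For simplicity the sketch divides by the overall standard deviation of the prices rather than a pooled within-group one.

```r
# Hypothetical helper: convert a raw price difference (in dollars) into
# Cohen's d by dividing by the standard deviation of the observed prices.
cohens_d_for_difference <- function(prices, difference = 50) {
  difference / sd(prices, na.rm = TRUE)
}

# Made-up toy prices whose standard deviation is 100, so d = 50 / 100
toy_prices <- c(50, 150, 250)
toy_d <- cohens_d_for_difference(toy_prices)
toy_d
```

With the listings loaded as `data`, calling `cohens_d_for_difference(data$price)` would give the effect size implied by a $50 difference for this dataset.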

Step 2: Perform Power Analysis to Calculate Sample Size

# Power analysis for a medium effect size (Cohen's d = 0.5);
# $n is the required sample size per group
sample_size <- pwr.t.test(d = 0.5, power = 0.8, sig.level = 0.05, type = "two.sample")$n
sample_size

## [1] 63.76561
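The reported value is a per-group sample size and has to be rounded up to a whole number of listings. Because the alternative hypothesis is directional, a one-sided power calculation may also be worth comparing. The sketch below uses base R’s `stats::power.t.test()` as a stand-in for `pwr.t.test()`, a substitution made here only to avoid a package dependency; the variable names are illustrative.

```r
# Base R counterpart of the power analysis: Cohen's d = 0.5 is expressed
# as delta = 0.5 with sd = 1, so that delta / sd equals d.
n_two_sided <- power.t.test(delta = 0.5, sd = 1, power = 0.8,
                            sig.level = 0.05, type = "two.sample",
                            alternative = "two.sided")$n

# Same settings, but one-sided to match H1 (Manhattan > other boroughs)
n_one_sided <- power.t.test(delta = 0.5, sd = 1, power = 0.8,
                            sig.level = 0.05, type = "two.sample",
                            alternative = "one.sided")$n

# Round up: the required n per group must be a whole number of listings
n_per_group <- ceiling(c(two_sided = n_two_sided, one_sided = n_one_sided))
n_per_group
```

The two-sided calculation rounds up to 64 listings per group, and the one-sided version needs fewer. Either requirement is far below the 48,895 listings in the dataset.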

Two-Sample t-Test:

Once enough data are available, a two-sample t-test can determine whether the average price in Manhattan is significantly greater than the average price in the other boroughs.

# Create a new column to group non-Manhattan listings
data$borough_group <- ifelse(data$neighbourhood_group == "Manhattan", 
                                    "Manhattan", 
                                    "OtherBoroughs")


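# One-sided Welch t-test; the groups are ordered alphabetically, so
# "greater" tests whether Manhattan minus OtherBoroughs is above 0
t_test_ <- t.test(price ~ borough_group, data = data, alternative = "greater")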

t_test_

## 
##  Welch Two Sample t-test
## 
## data:  price by borough_group
## t = 34.967, df = 34580, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Manhattan and group OtherBoroughs is greater than 0
## 95 percent confidence interval:
##  75.54538      Inf
## sample estimates:
##     mean in group Manhattan mean in group OtherBoroughs 
##                    196.8758                    117.6012

Interpretation of the Results

Reject the null hypothesis if the p-value is less than 0.05. Here the p-value is below 2.2e-16, so we reject it: Manhattan’s average price (about $197) is significantly higher than the average price in the other boroughs (about $118).
Had the p-value been greater than 0.05, we would have failed to reject the null hypothesis, meaning the data would not show a significant difference in average prices between Manhattan and the other boroughs.
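The decision rule and a check against the $50 threshold can be sketched with the numbers printed in the t-test output. The variable names are illustrative, and the values are copied by hand from that output rather than extracted from `t_test_`.

```r
# Values copied from the Welch t-test output
mean_manhattan <- 196.8758
mean_other     <- 117.6012
ci_lower       <- 75.54538   # lower bound of the one-sided 95% CI
p_upper_bound  <- 2.2e-16    # the output reports p-value < 2.2e-16

alpha          <- 0.05       # pre-specified Type I error rate
min_difference <- 50         # economically meaningful difference ($)

# Observed difference in mean price (Manhattan minus other boroughs)
observed_difference <- mean_manhattan - mean_other

# Neyman-Pearson decision: reject H0 when p < alpha
reject_h0 <- p_upper_bound < alpha

# Practical check: is even the lower CI bound above the $50 threshold?
exceeds_threshold <- ci_lower > min_difference

observed_difference
reject_h0
exceeds_threshold
```

In a live session the same quantities are available as `t_test_$estimate`, `t_test_$conf.int`, and `t_test_$p.value`.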

Visualization: Price Distribution by Borough

Finally, to visually support the hypothesis test, we can plot the price distributions for Manhattan and the other boroughs.

library(ggplot2)

ggplot(data, aes(x = borough_group, y = price)) + 
  geom_boxplot(fill = c("skyblue", "lightcoral")) + 
  theme_minimal() + 
  labs(title = "Price Distribution: Manhattan vs. Other Boroughs", 
       x = "Borough Group", 
       y = "Price")

Hypothesis 2: Is there a relationship between price and the number of reviews?

We will investigate whether the number of reviews an Airbnb listing has and its price are significantly correlated for Hypothesis 2. Fisher’s Significance Testing framework, which focuses on analyzing the p-value to determine the strength of evidence against the null hypothesis, will be used to test this hypothesis.

Null hypothesis (H0): The number of reviews and the price of an Airbnb listing are not correlated.

Alternative hypothesis (H1): The price of an Airbnb listing and its number of reviews have a non-zero correlation.

Fisher’s Framework for Significance Testing:

To test this theory, we’ll do the following:

Test: To determine whether the number of reviews and the price have a linear relationship, we will run a Pearson correlation test.
Significance level (Alpha): 0.05, the commonly accepted cutoff for significance.
P-value interpretation: If the p-value is less than 0.05, we will reject the null hypothesis and conclude that there is a statistically significant correlation between the price and the number of reviews. Otherwise, we will fail to reject the null hypothesis.

Perform Pearson Correlation Test

To test for the correlation between the number of reviews and price, we will use the cor.test() function in R.

# Perform Pearson correlation test (cor.test() drops incomplete pairs
# itself and has no `use` argument, so that argument is omitted)
cor_test <- cor.test(data$number_of_reviews, data$price, method = "pearson")

# Display the result
cor_test

## 
##  Pearson's product-moment correlation
## 
## data:  data$number_of_reviews and data$price
## t = -10.616, df = 48893, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05679384 -0.03910709
## sample estimates:
##         cor 
## -0.04795423

The output reports the correlation coefficient (r) and the p-value.

The correlation coefficient (r) shows the strength and direction of the association:
r > 0 indicates a positive relationship.
r < 0 indicates a negative relationship.
r = 0 means no linear correlation.
The p-value indicates whether the observed correlation is statistically significant.
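To see how r and the p-value relate, the sketch below rebuilds the reported t statistic from r and the degrees of freedom, and computes r squared as a rough measure of strength. The values are copied by hand from the printed output, and the variable names are illustrative.

```r
# Values copied from the cor.test() output
r  <- -0.04795423
df <- 48893

# t statistic for testing H0: correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Share of the variance in price that is linearly associated
# with the number of reviews
r_squared <- r^2

round(t_stat, 3)
round(r_squared, 4)
```

The rebuilt t statistic agrees with the printed value of -10.616, while r squared is only about 0.002. With close to 49,000 listings, even a very weak correlation produces a very small p-value.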

Interpretation of the Results:

Reject the null hypothesis if the p-value is less than 0.05. Here the p-value is below 2.2e-16, so there is evidence that an Airbnb listing’s price and its number of reviews are related, although the correlation is very weak (r ≈ -0.048).
Had the p-value been greater than 0.05, we would have failed to reject the null hypothesis, meaning no statistically significant correlation between price and the number of reviews.
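The 95% confidence interval in the output can be reproduced with the Fisher z-transformation, which helps separate how precisely the correlation is estimated from how strong it is. This is a sketch using values copied by hand from the printed output; the variable names are illustrative.

```r
# Values copied from the cor.test() output
r <- -0.04795423
n <- 48893 + 2            # number of complete pairs = df + 2

# Fisher z-transformation and its approximate standard error
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)

# Two-sided 95% interval on the z scale, transformed back to r
z_crit <- qnorm(0.975)
ci <- tanh(z + c(-1, 1) * z_crit * se_z)
round(ci, 4)
```

The interval is narrow and excludes zero, yet both bounds stay close to zero. The estimate is precise, but the linear association itself is weak.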

Visualization: Scatter Plot of Price vs. Number of Reviews

To visualize the relationship between price and number of reviews, we can create a scatter plot with a linear regression line.

ggplot(data, aes(x = number_of_reviews, y = price)) + 
  geom_point(alpha = 0.5, color = "green") + 
  geom_smooth(method = "lm", color = "red") + 
  theme_minimal() + 
  labs(title = "Scatter Plot of Price vs. Number of Reviews", 
       x = "Number of Reviews", 
       y = "Price")

## `geom_smooth()` using formula = 'y ~ x'

Insights and Further Questions

Hypothesis 1:

Insights: The t-test shows that Manhattan’s Airbnb listings are significantly more expensive on average than those in the other boroughs. A natural next step is to investigate what drives the difference, such as demand, proximity to tourist attractions, or accommodation size.
Significance: This information is crucial for travelers choosing where to stay or for Airbnb hosts trying to offer competitive prices.
Further Questions: Do particular Manhattan neighborhoods drive the high average price? How does the size or type of property affect price?

Hypothesis 2:

Insight: The Pearson correlation test shows a very weak negative correlation (r ≈ -0.048): listings with more reviews tend to have slightly lower prices. This might suggest that higher-priced listings receive fewer reviews (perhaps due to fewer bookings).
Significance: This is important for understanding customer behavior. Because more reviews correlate, if only weakly, with lower prices, Airbnb hosts might price strategically to increase bookings and reviews.
Further Questions: Does the correlation vary by borough or neighborhood? What other factors (e.g., host rating, listing amenities) influence the relationship between price and reviews?