Data Dive — Hypothesis Testing

# Loading necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(pwr)

# Load the data from the CSV file
NY_House_Dataset <- read.csv("C:\\Users\\velag\\Downloads\\NY-House-Dataset.csv")

# Hypothesis 1: Difference in average property prices between houses with different numbers of bedrooms

# Define the null hypothesis
null_hypothesis_1 <- "The mean property price is the same for houses with different numbers of bedrooms."

# Define parameters for Null Hypothesis 1
alpha_1 <- 0.05  # significance level
power_1 <- 0.8  # desired power
effect_size_1 <- 0.2  # minimum effect size to detect

# Calculate sample size for Null Hypothesis 1 (assuming two-sample t-test)
sample_size_1 <- pwr.t.test(d = effect_size_1, sig.level = alpha_1, power = power_1, type = "two.sample")$n

#Explanation:
#Hypothesis 1 asks whether average property prices differ between houses with different numbers of bedrooms. The null hypothesis is that the mean price is the same regardless of bedroom count.
#We set the significance level (alpha = 0.05), the desired power (0.8), and the minimum effect size to detect (Cohen's d = 0.2, conventionally a small effect).
#From these, pwr.t.test() returns the sample size needed in each group of a two-sample t-test, that is, for comparing two bedroom counts such as 2 and 3.
#The test performed further down is a one-way ANOVA across all bedroom counts, so this figure is only a rough guide to how many listings each bedroom group should contain. The calculated sample size is then printed.

# Print the per-group sample size for Null Hypothesis 1
cat("Sample size per group for Null Hypothesis 1 (two-sample t-test, e.g. houses with 2 vs. 3 bedrooms):", sample_size_1, "\n")

## Sample size per group for Null Hypothesis 1 (two-sample t-test, e.g. houses with 2 vs. 3 bedrooms): 393.4057
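
As a cross-check on the figure above, the same calculation can be sketched with base R's power.t.test(), which needs no extra package. This is only a sketch under the same assumptions as the pwr call (a standardized effect of 0.2, expressed here as delta = 0.2 with sd = 1). The names check, n_per_group and n_total are introduced for illustration. The value returned is a per-group size, so it is rounded up and doubled to get the total for a two-group comparison.

```r
# Cross-check of the two-sample t-test sample size using base R (stats).
# Assumption: delta / sd = 0.2 mirrors the Cohen's d passed to pwr.t.test().
check <- power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.8,
                      type = "two.sample", alternative = "two.sided")

n_per_group <- ceiling(check$n)  # observations needed in EACH group
n_total <- 2 * n_per_group       # both groups combined

cat("Per-group n:", n_per_group, " Total n:", n_total, "\n")
```

Rounding up matters because a fractional observation cannot be collected, and rounding down would leave the test slightly below the desired power.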

# Perform hypothesis testing for Null Hypothesis 1 (Difference in mean property price between houses with different numbers of bedrooms)
# We'll perform a one-way ANOVA test
anova_result <- aov(PRICE ~ factor(BEDS), data = NY_House_Dataset)

# Summarize the ANOVA results
summary(anova_result)

##                Df    Sum Sq   Mean Sq F value   Pr(>F)    
## factor(BEDS)   26 7.313e+16 2.813e+15    2.89 1.32e-06 ***
## Residuals    4774 4.646e+18 9.732e+14                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
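
The table above reports significance but not the size of the effect. As a rough sketch, eta squared (the share of price variance associated with bedroom count) and Cohen's f can be derived from the two Sum Sq entries. The numbers below are copied from the rounded output, so the result is approximate; the variable names are illustrative and not part of the original analysis.

```r
# Effect size implied by the ANOVA table (values copied from the printed output).
ss_between  <- 7.313e16   # Sum Sq for factor(BEDS)
ss_residual <- 4.646e18   # Sum Sq for Residuals

eta_sq  <- ss_between / (ss_between + ss_residual)  # share of variance explained
cohen_f <- sqrt(eta_sq / (1 - eta_sq))              # Cohen's f for ANOVA

cat("Eta squared:", round(eta_sq, 4), " Cohen's f:", round(cohen_f, 3), "\n")
```

Under these figures bedroom count accounts for roughly 1.5% of the variance in price, so the result is statistically significant but the effect is small.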

# Visualize the results for Null Hypothesis 1
# We'll create a boxplot to compare the distribution of property prices between houses with different numbers of bedrooms (ggplot2 is already loaded above)
ggplot(NY_House_Dataset, aes(x = factor(BEDS), y = PRICE)) +
  geom_boxplot() +
  labs(title = "Comparison of Property Prices by Number of Bedrooms",
       x = "Number of Bedrooms",
       y = "Property Price")

# Display the p-value for Null Hypothesis 1 (ANOVA test)
anova_p_value <- summary(anova_result)[[1]]$"Pr(>F)"[1]
cat("P-value for Null Hypothesis 1 (ANOVA test):", anova_p_value, "\n")

## P-value for Null Hypothesis 1 (ANOVA test): 1.320088e-06

# Hypothesis 2: Correlation between property square footage and price

# Define the null hypothesis
null_hypothesis_2 <- "The population correlation between property square footage and price is zero."

# Define parameters for Null Hypothesis 2
alpha_2 <- 0.05  # significance level
power_2 <- 0.8  # desired power
effect_size_2 <- 0.3  # minimum effect size to detect

# Calculate sample size for Null Hypothesis 2 (correlation test)
sample_size_2 <- pwr.r.test(n = NULL, r = effect_size_2, sig.level = alpha_2, power = power_2)$n

# Print the sample size for Null Hypothesis 2
cat("Sample size for Null Hypothesis 2 (Correlation between property square footage and price):", sample_size_2, "\n")

## Sample size for Null Hypothesis 2 (Correlation between property square footage and price): 84.07364

#Explanation:
#Hypothesis 2 examines the correlation between property square footage and price. The null hypothesis is that the two variables are uncorrelated.
#As for Hypothesis 1, we set the significance level (alpha = 0.05) and the desired power (0.8); the minimum effect size to detect is a correlation of r = 0.3, conventionally a medium effect.
#pwr.r.test() then returns the total number of observations needed to detect a correlation of that size with a Pearson correlation test. This is the sample size printed above.
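
The sample size for a correlation test can also be approximated by hand with Fisher's z transformation, which makes the role of each parameter visible. This is a textbook approximation and is not claimed to be the exact routine inside pwr.r.test(); the variable names are illustrative.

```r
# Approximate sample size for detecting a correlation r via Fisher's z.
# This is a textbook approximation, not the exact pwr.r.test() computation.
r     <- 0.3    # minimum correlation to detect
alpha <- 0.05   # two-sided significance level
power <- 0.8    # desired power

z_alpha <- qnorm(1 - alpha / 2)  # two-sided critical value
z_power <- qnorm(power)          # quantile for the desired power

# atanh(r) is Fisher's z; its standard error is 1 / sqrt(n - 3).
n_approx <- ((z_alpha + z_power) / atanh(r))^2 + 3

cat("Approximate n:", n_approx, " rounded up:", ceiling(n_approx), "\n")
```

The approximation lands within about one observation of the pwr result, and both round up to the same whole number of observations.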

# Perform hypothesis testing for Null Hypothesis 2 (Correlation between property square footage and price)
# We'll use Pearson's correlation test
correlation_test_result <- cor.test(NY_House_Dataset$PROPERTYSQFT, NY_House_Dataset$PRICE, method = "pearson")
correlation_test_p_value <- correlation_test_result$p.value

# Visualize the results for Null Hypothesis 2
# We'll create a scatter plot to visualize the relationship between property square footage and price
ggplot(NY_House_Dataset, aes(x = PROPERTYSQFT, y = PRICE)) +
  geom_point() +
  labs(title = "Relationship between Property Square Footage and Price",
       x = "Property Square Footage",
       y = "Property Price")

# Display the p-value for Null Hypothesis 2 (correlation test)
cat("P-value for Null Hypothesis 2 (correlation test):", correlation_test_p_value, "\n")

## P-value for Null Hypothesis 2 (correlation test): 1.306682e-14
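
A p-value alone does not say how strong the correlation is. The object returned by cor.test() also carries the estimated coefficient and its confidence interval, and a rank-based Spearman test offers a check that does not rely on linearity. The sketch below uses simulated stand-in data, not the NY housing data, so its numbers say nothing about the real result; sqft and price are hypothetical vectors.

```r
# Simulated stand-in data; these values are NOT from NY-House-Dataset.csv.
set.seed(42)
sqft  <- runif(200, min = 500, max = 5000)
price <- 300 * sqft + rnorm(200, mean = 0, sd = 400000)

pearson  <- cor.test(sqft, price, method = "pearson")
spearman <- cor.test(sqft, price, method = "spearman", exact = FALSE)

r_hat <- unname(pearson$estimate)   # sample correlation coefficient
ci    <- pearson$conf.int           # 95% confidence interval for r
rho   <- unname(spearman$estimate)  # rank-based correlation

cat("Pearson r:", round(r_hat, 3),
    " 95% CI:", round(ci[1], 3), "to", round(ci[2], 3), "\n")
cat("Spearman rho:", round(rho, 3), "\n")
```

On the real data the same fields could be read from correlation_test_result, for example correlation_test_result$estimate and correlation_test_result$conf.int.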

#Here are some further questions that might need to be investigated:

#Hypothesis 1:

# 1) Are there any other variables that could potentially confound the relationship between the number of bedrooms and property prices?
# 2) Is the sample representative, and does each bedroom group reach the calculated sample size? Are there any biases in the data that might affect the generalization of the results?
# 3) Are there any outliers in the dataset that could skew the results of the hypothesis test?

#Hypothesis 2:

# 1) Besides property square footage, are there any other factors that could influence property prices?
# 2) Is the assumption of linearity between property square footage and price met? If not, how might this affect the interpretation of the correlation test results?
# 3) Are there any influential data points that could disproportionately affect the correlation coefficient?

#General:

# 1) Are there any missing values in the dataset that could affect the analysis?
# 2) How robust are the results to changes in the chosen alpha level, power level, and effect size?
# 3) Are there any potential interactions between the variables that should be explored further?
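
Two of the questions above, missing values and outliers, can be checked with a few lines of base R. The sketch below runs on a small made-up data frame (toy) whose column names mirror the dataset; the values are hypothetical and chosen only to show the mechanics.

```r
# Hypothetical toy data; column names mirror the dataset, values are made up.
toy <- data.frame(
  PRICE        = c(350000, 420000, 515000, 480000, NA, 390000, 25000000),
  PROPERTYSQFT = c(800, 950, 1200, NA, 1100, 900, 6000)
)

# Count missing values per column.
na_counts <- colSums(is.na(toy))

# Flag prices more than 1.5 * IQR beyond the hinges (the boxplot rule).
# boxplot.stats() ignores NA values when computing the hinges.
price_outliers <- boxplot.stats(toy$PRICE)$out

print(na_counts)
print(price_outliers)
```

Applied to NY_House_Dataset, the same two calls would show how many rows aov() and cor.test() drop for missing values and which listings sit far above the bulk of the price distribution.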

Abhinandhan Velagapudi

2024-02-25