Data Dive — Hypothesis Testing

# Loading necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(pwr)

# Load the data from the CSV file
NY_House_Dataset <- read.csv("C:\\Users\\velag\\Downloads\\NY-House-Dataset.csv")

# Hypothesis 1: Difference in average property prices between houses with different numbers of bedrooms

# Define the null hypothesis
null_hypothesis_1 <- "There is no difference in mean property price between houses with different numbers of bedrooms."

# Define parameters for Null Hypothesis 1
alpha_1 <- 0.05  # significance level
power_1 <- 0.8  # desired power
effect_size_1 <- 0.2  # minimum effect size to detect

#Explanation for Parameters:
#Alpha Level (α): The alpha level is set at 0.05, indicating a 5% chance of rejecting the null hypothesis when it is actually true. This is a common threshold used in hypothesis testing.

#Power Level: The power level is set at 0.8, indicating an 80% chance of correctly rejecting the null hypothesis when the true effect is at least as large as the minimum effect size. This level of power is generally considered acceptable in statistical analysis.

#Minimum Effect Size: The effect size is set at 0.2 on the Cohen's d scale, that is, a difference between two group means of 0.2 pooled standard deviations of price. By Cohen's convention this is a small effect.

# Calculate sample size for Null Hypothesis 1 (assuming two-sample t-test)
sample_size_1 <- pwr.t.test(d = effect_size_1, sig.level = alpha_1, power = power_1, type = "two.sample")$n

#Explanation:
#Hypothesis 1 asks whether mean property price differs between houses with different numbers of bedrooms. The null hypothesis is that it does not. Given the significance level (alpha), the desired power, and the minimum effect size defined above, pwr.t.test() returns the sample size needed for a two-sample t-test. The returned n is the number of houses required in each of the two groups being compared, not the total. Because it is a two-group calculation, it applies to a single pairwise comparison (the printed label uses 2 versus 3 bedrooms), whereas the ANOVA further down compares all bedroom counts at once.

# Print the sample size for Null Hypothesis 1
cat("Sample size for Null Hypothesis 1 (Difference in mean property prices between houses with 2 and 3 bedrooms):", sample_size_1, "\n")

## Sample size for Null Hypothesis 1 (Difference in mean property prices between houses with 2 and 3 bedrooms): 393.4057
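
# A possible companion calculation, offered as a sketch rather than as part of the original analysis:
# the test performed below is a one-way ANOVA, so its power can also be planned with pwr.anova.test().
# Assumptions (not from the original): k = 27 equally sized groups, taken from the 26 degrees of freedom
# that the ANOVA table reports for factor(BEDS), and a small effect of Cohen's f = 0.1.
library(pwr)

k_groups <- 27        # assumed: 26 degrees of freedom for factor(BEDS), plus 1
effect_size_f <- 0.1  # assumed: "small" effect by Cohen's convention

anova_power <- pwr.anova.test(k = k_groups, f = effect_size_f,
                              sig.level = 0.05, power = 0.8)

# n is the required number of houses in EACH bedroom group, rounded up
n_per_group_anova <- ceiling(anova_power$n)
cat("Houses needed per bedroom group (balanced one-way ANOVA):", n_per_group_anova, "\n")
# Real bedroom groups are unlikely to be balanced, so this figure is only a planning guide.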

# Perform hypothesis testing for Null Hypothesis 1 (Difference in mean property price between houses with different numbers of bedrooms)
# We'll perform a one-way ANOVA test
anova_result <- aov(PRICE ~ factor(BEDS), data = NY_House_Dataset)

# Summarize the ANOVA results
summary(anova_result)

##                Df    Sum Sq   Mean Sq F value   Pr(>F)    
## factor(BEDS)   26 7.313e+16 2.813e+15    2.89 1.32e-06 ***
## Residuals    4774 4.646e+18 9.732e+14                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
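
# A rough effect-size check that uses only the sums of squares printed in the ANOVA table above.
# This is a sketch added for illustration; eta squared and Cohen's f are not reported in the original analysis.
ss_between <- 7.313e+16   # Sum Sq for factor(BEDS), copied from the table
ss_residual <- 4.646e+18  # Sum Sq for Residuals, copied from the table

# Eta squared: share of the total price variance associated with bedroom count
eta_squared <- ss_between / (ss_between + ss_residual)

# Cohen's f, the effect-size scale used by pwr.anova.test()
cohens_f <- sqrt(eta_squared / (1 - eta_squared))

cat("Eta squared:", round(eta_squared, 4), "\n")  # about 0.0155
cat("Cohen's f:", round(cohens_f, 3), "\n")       # about 0.125
# Bedroom count is associated with roughly 1.5% of the variance in price,
# so a very small p-value here goes together with a small effect.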

#Interpretation of Results:
#The power analysis asks for about 394 houses in each group of a two-group comparison. The ANOVA table shows 26 + 4774 = 4800 degrees of freedom, so the test used 4801 houses spread over 27 distinct bedroom counts. The total is large, but individual bedroom groups may hold far fewer than 394 houses, so adequate power for any specific pairwise comparison is not guaranteed.

#The ANOVA test rejects the null hypothesis at the 0.05 level (F = 2.89, p = 1.32e-06): mean property price is not the same across all bedroom counts. The test does not say which groups differ, and it assumes roughly normal residuals and similar variances across groups, which strongly skewed price data may violate.

#The boxplot below shows the distribution of property prices for each number of bedrooms. It is a descriptive aid to the ANOVA result and does not establish significance on its own.

# Visualize the results for Null Hypothesis 1
# I'll create a boxplot to compare the distribution of property prices between houses with different numbers of bedrooms (ggplot2 is already loaded above)
ggplot(NY_House_Dataset, aes(x = factor(BEDS), y = PRICE)) +
  geom_boxplot() +
  labs(title = "Comparison of Property Prices by Number of Bedrooms",
       x = "Number of Bedrooms",
       y = "Property Price")

# Display the p-value for Null Hypothesis 1 (ANOVA test)
anova_p_value <- summary(anova_result)[[1]]$"Pr(>F)"[1]
cat("P-value for Null Hypothesis 1 (ANOVA test):", anova_p_value, "\n")

## P-value for Null Hypothesis 1 (ANOVA test): 1.320088e-06

# Hypothesis 2: Correlation between property square footage and price

# Define the null hypothesis
null_hypothesis_2 <- "The correlation between property square footage and price is zero."

# Define parameters for Null Hypothesis 2
alpha_2 <- 0.05  # significance level
power_2 <- 0.8  # desired power
effect_size_2 <- 0.3  # minimum effect size to detect

#Explanation for Parameters:
#Alpha Level (α): The alpha level is again set at 0.05, maintaining consistency with the previous hypothesis test.

#Power Level: The power level is set at 0.8, giving an 80% chance of detecting a true correlation between property square footage and price that is at least as strong as the minimum effect size.

#Minimum Effect Size: The effect size is set at a Pearson's r of 0.3, a medium correlation by Cohen's convention.

# Calculate sample size for Null Hypothesis 2 (correlation test)
sample_size_2 <- pwr.r.test(n = NULL, r = effect_size_2, sig.level = alpha_2, power = power_2)$n

# Print the sample size for Null Hypothesis 2
cat("Sample size for Null Hypothesis 2 (Correlation between property square footage and price):", sample_size_2, "\n")

## Sample size for Null Hypothesis 2 (Correlation between property square footage and price): 84.07364
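
# As a cross-check on the number above, the usual Fisher z approximation for a two-sided test is
#   n = ((z_alpha + z_power) / atanh(r))^2 + 3
# This sketch shows the approximation only. pwr.r.test() uses a more exact calculation,
# so the two results need not agree to the decimal. Variable names here are illustrative.
alpha_check <- 0.05
power_check <- 0.8
r_check <- 0.3

z_alpha <- qnorm(1 - alpha_check / 2)  # critical value for the two-sided test
z_power <- qnorm(power_check)          # normal quantile matching the desired power

n_approx <- ((z_alpha + z_power) / atanh(r_check))^2 + 3
cat("Approximate sample size (Fisher z):", round(n_approx, 2), "\n")  # about 84.93
cat("Rounded up:", ceiling(n_approx), "\n")  # 85, the same whole number as ceiling(84.07364)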

#Explanation:
#Hypothesis 2 concerns the correlation between property square footage and price. The null hypothesis is that the correlation is zero. As for Hypothesis 1, we fix the significance level, the desired power, and the minimum effect size, then use pwr.r.test() to find the number of observations a Pearson correlation test needs to detect that effect. The result, 84.07, rounds up to 85 paired observations.

# Perform hypothesis testing for Null Hypothesis 2 (Correlation between property square footage and price)
# I'll use Pearson's correlation test
correlation_test_result <- cor.test(NY_House_Dataset$PROPERTYSQFT, NY_House_Dataset$PRICE, method = "pearson")
correlation_test_p_value <- correlation_test_result$p.value

# Visualize the results for Null Hypothesis 2
# I'll create a scatter plot to visualize the relationship between property square footage and price
ggplot(NY_House_Dataset, aes(x = PROPERTYSQFT, y = PRICE)) +
  geom_point() +
  labs(title = "Relationship between Property Square Footage and Price",
       x = "Property Square Footage",
       y = "Property Price")

# Display the p-value for Null Hypothesis 2 (correlation test)
cat("P-value for Null Hypothesis 2 (correlation test):", correlation_test_p_value, "\n")

## P-value for Null Hypothesis 2 (correlation test): 1.306682e-14
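
# The p-value can be turned back into the size of the correlation, which the output above does not show.
# Sketch under an assumption that is not verified here: all 4801 houses counted in the ANOVA table
# (26 + 4774 degrees of freedom, plus 1) had both PROPERTYSQFT and PRICE present, so cor.test() used 4801 pairs.
p_value_reported <- 1.306682e-14  # copied from the output above
n_pairs <- 4801                   # assumed, see the note above
df_resid <- n_pairs - 2

# cor.test() bases its p-value on t = r * sqrt(df) / sqrt(1 - r^2); invert that relation
t_abs <- qt(p_value_reported / 2, df = df_resid, lower.tail = FALSE)
r_abs <- t_abs / sqrt(t_abs^2 + df_resid)

cat("Implied |r|:", round(r_abs, 2), "\n")  # roughly 0.11
# If the assumption holds, the correlation is weak, below the 0.3 used in the power analysis.
# The sign cannot be recovered from the p-value; read it from correlation_test_result$estimate.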

#Interpretation of Results:
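#The power analysis above gives a required sample size of about 85 paired observations (84.07 rounded up) to detect r = 0.3 with 80% power at alpha = 0.05. The dataset is far larger than this (the ANOVA for Hypothesis 1 used 4801 houses), so the test is well powered for correlations of that size.

#The Pearson correlation test shows a statistically significant correlation between property square footage and price (p = 1.31e-14 < 0.05). The direction and strength come from correlation_test_result$estimate, which is not printed above; a positive estimate means that larger properties tend to have higher prices. With several thousand observations, even a weak correlation produces a very small p-value, so significance alone says little about strength.

#The scatter plot shows the relationship between property square footage and price. It is best read together with the correlation estimate, since a few extreme prices can dominate the picture.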

#Here are some further questions that might need to be investigated:

#Hypothesis 1:

# 1) Are there any other variables that could potentially confound the relationship between the number of bedrooms and property prices?
# 2) How representative is the sample? Are there any biases in the data that might affect the generalization of the results?
# 3) Are there any outliers in the dataset that could skew the results of the hypothesis test?

#Hypothesis 2:

# 1) Besides property square footage, are there any other factors that could influence property prices?
# 2) Is the assumption of linearity between property square footage and price met? If not, how might this affect the interpretation of the correlation test results?
# 3) Are there any influential data points that could disproportionately affect the correlation coefficient?

#General:

# 1) Are there any missing values in the dataset that could affect the analysis?
# 2) How robust are the results to changes in the chosen alpha level, power level, and effect size?
# 3) Are there any potential interactions between the variables that should be explored further?
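
# One way to approach General question 2, sketched with the same pwr.t.test() call used for Hypothesis 1.
# The grid of effect sizes below is illustrative and not taken from the original analysis.
library(pwr)

effect_sizes <- c(0.1, 0.2, 0.3, 0.5)

# Required sample size per group for each candidate effect size (alpha = 0.05, power = 0.8)
required_n <- sapply(effect_sizes, function(d) {
  pwr.t.test(d = d, sig.level = 0.05, power = 0.8, type = "two.sample")$n
})
names(required_n) <- paste0("d=", effect_sizes)

# Houses needed per group, rounded up; d = 0.2 corresponds to the 393.4057 reported for Hypothesis 1
print(ceiling(required_n))
# The same loop can be repeated over sig.level or power to see how sensitive the requirement is to each choice.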

Abhinandhan Velagapudi

2024-02-25