library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# ggplot2 and dplyr are already attached by library(tidyverse) above
library(ggpubr)
library(ggrepel)
Homes <- read.csv('D:/DataSet/Homes.csv')
Hypothesis 1: Homes with more bedrooms have higher prices

Test Details

Null hypothesis (H0): There is no difference in price between homes with different numbers of bedrooms
Alternative hypothesis (HA): Homes with more bedrooms have higher prices on average
Alpha: 0.05 (standard alpha level; balances Type I and Type II errors)
Power: 0.8 (we want an 80% chance of detecting the effect if it exists)
Minimum detectable effect size: $500,000 (a meaningful difference in home prices, based on domain knowledge)

Neyman-Pearson Test

Conditions needed: independent observations, normal distribution, equal variance
Test assumptions not met: prices across bedroom groups do not appear normally distributed
Therefore the Neyman-Pearson test cannot be performed

Fisher’s Test
# Plot the raw prices per bedroom count; a boxplot of one mean per group would show no spread
Homes %>%
  mutate(beds = factor(beds)) %>%
  ggboxplot(x = "beds", y = "price", color = "beds")
anova_result <- aov(price ~ beds, data=Homes)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## beds 1 7.738e+14 7.738e+14 120.7 <2e-16 ***
## Residuals 490 3.142e+15 6.412e+12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
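The ANOVA shows a significant effect of number of bedrooms on home price (p < 2e-16, below the 0.05 alpha). Because beds entered the model as a numeric variable (1 degree of freedom), this F test assesses a linear trend in price across bedroom counts rather than differences among separate bedroom groups. We therefore reject the null hypothesis and conclude that home prices differ with the number of bedrooms.

Visualization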
# factor(beds) gives one box per bedroom count and avoids the "Continuous x aesthetic" warning
Homes %>%
  ggplot(aes(x = factor(beds), y = price)) +
  geom_boxplot() +
  labs(title = "Home Price by Number of Bedrooms",
       x = "Number of Bedrooms",
       y = "Price (USD)")
The boxplot shows that median and upper quartile home prices generally
increase as the number of bedrooms increases. The significant ANOVA
provides statistical evidence that the number of bedrooms is associated
with home price.
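The ANOVA table above lists 1 degree of freedom for beds, which is what R reports when the predictor is numeric. The following is a minimal sketch of how the table changes when bedroom count is treated as a grouping factor instead. The `toy` data frame, its values, and the seed are invented for illustration and are not the Homes data.

```r
# Simulated stand-in for the Homes data (hypothetical values, not real results)
set.seed(1)
toy <- data.frame(beds = rep(1:4, each = 25))
toy$price <- 1e6 + 4e5 * toy$beds + rnorm(nrow(toy), sd = 5e5)

# Numeric predictor: a single slope is estimated, so beds has 1 degree of freedom
fit_numeric <- aov(price ~ beds, data = toy)

# Factor predictor: one mean per bedroom count, so beds has k - 1 = 3 degrees of freedom
fit_factor <- aov(price ~ factor(beds), data = toy)

# Pull the degrees of freedom for the beds term out of each ANOVA table
df_numeric <- summary(fit_numeric)[[1]][["Df"]][1]
df_factor  <- summary(fit_factor)[[1]][["Df"]][1]
print(c(numeric = df_numeric, factor = df_factor))
```

On the real data the equivalent call would presumably be `aov(price ~ factor(beds), data = Homes)`; its output is not shown because it was not part of the original analysis.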
Hypothesis 2: Newer homes have higher prices per square foot

Test Details

Null hypothesis (H0): There is no difference in price per square foot between newer and older homes
Alternative hypothesis (HA): Newer homes have a higher average price per square foot compared to older homes
Alpha: 0.01 (more conservative alpha due to the one-tailed test)
Power: 0.9 (we want higher power to detect an effect in a specific direction)
Minimum detectable effect size: $200 price per square foot (based on the typical range of home prices per square foot)

Neyman-Pearson Test

Conditions met: independent observations, price per square foot close to a normal distribution, equal variance across groups

Grouping year built:
Older homes: built before 1980 (at least 40 years old)
Newer homes: built 1980 or later
older_Homes <- Homes %>%
  filter(year_built < 1980)
newer_Homes <- Homes %>%
  filter(year_built >= 1980)
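With the two groups defined, the normality and equal-variance conditions listed above could be checked roughly as follows. This is only a sketch on simulated data: the `toy` data frame and all of its numbers are invented, and the choice of `shapiro.test()` and `var.test()` (both in base R's stats package) is an assumption about how one might run the checks, not the original author's method.

```r
# Simulated stand-in for price_per_sqft in the two age groups (hypothetical values)
set.seed(42)
toy <- data.frame(
  year_built     = c(rep(1960, 60), rep(2000, 60)),
  price_per_sqft = c(rnorm(60, mean = 950, sd = 300),
                     rnorm(60, mean = 1500, sd = 300))
)
toy$older <- toy$year_built < 1980

# Shapiro-Wilk normality test within each group (a small p-value suggests non-normality)
normality_p <- tapply(toy$price_per_sqft, toy$older,
                      function(x) shapiro.test(x)$p.value)

# F test for equal variances between the two groups
variance_p <- var.test(price_per_sqft ~ older, data = toy)$p.value

print(normality_p)
print(variance_p)
```

Note that `t.test()` defaults to `var.equal = FALSE`, which gives the Welch test and does not rely on equal variances; the pooled-variance version requires `var.equal = TRUE`.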
t.test(price_per_sqft ~ Homes$year_built < 1980, Homes,
       alternative="greater")
##
## Welch Two Sample t-test
##
## data: price_per_sqft by Homes$year_built < 1980
## t = 8.3281, df = 263.39, p-value = 2.268e-15
## alternative hypothesis: true difference in means between group FALSE and group TRUE is greater than 0
## 95 percent confidence interval:
## 470.3705 Inf
## sample estimates:
## mean in group FALSE mean in group TRUE
## 1552.150 965.505
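The one-tailed Welch t-test shows a significant difference between older and newer homes (p = 2.268e-15, below the 0.01 alpha). The mean price per square foot is greater for newer homes (1552.150, group FALSE) than for older homes (965.505, group TRUE). Therefore, we reject the null hypothesis and conclude that newer homes have a higher price per square foot on average. The confidence interval is reported at the default 95% level because conf.level was not changed in the call.

Fisher’s Test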
Homes %>%
  group_by(year_built) %>%
  summarize(mean_ppft = mean(price_per_sqft, na.rm=TRUE)) %>%
  ggplot(aes(x = factor(year_built < 1980), y = mean_ppft, fill = factor(year_built < 1980))) +
  geom_boxplot() +
  labs(x = "Year Built < 1980", y = "Mean Price Per Sqft") +
  scale_fill_discrete(name = "Year Built < 1980")
aov_result <- aov(price_per_sqft ~ year_built < 1980, data=Homes)
summary(aov_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## year_built < 1980 1 40365887 40365887 88.3 <2e-16 ***
## Residuals 490 223994399 457131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA shows a significant effect of year built (newer vs. older) on price per square foot (p < 2e-16, below the 0.01 alpha chosen for this hypothesis). This aligns with the conclusion from the Neyman-Pearson test.

Visualization
Homes %>%
  mutate(year_group = ifelse(year_built < 1980, "Built before 1980",
                             "Built 1980 or later")) %>%
  ggplot(aes(x=year_group, y=price_per_sqft)) +
  geom_boxplot() +
  labs(title="Home Price per Square Foot by Year Built",
       x="Year Built",
       y="Price per Square Foot (USD)")
The boxplots illustrate the significant difference in price per square
foot between older and newer homes. Newer homes have a higher median and
upper quartile price per square foot than older homes built
before 1980.
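The alpha, power, and minimum effect size stated for Hypothesis 2 can be turned into a rough sample-size check with `power.t.test()` from base R. This sketch assumes a common standard deviation taken from the residual mean square of the ANOVA above (457131); that approximation is chosen here for illustration and is not part of the original analysis.

```r
# Inputs from the Test Details for Hypothesis 2
alpha <- 0.01   # significance level
power <- 0.9    # desired power
delta <- 200    # minimum detectable difference in price per square foot (USD)

# Assumption: common SD approximated by the square root of the ANOVA residual mean square
sd_est <- sqrt(457131)

# Required number of homes per group for a one-sided two-sample t-test
plan <- power.t.test(delta = delta, sd = sd_est, sig.level = alpha,
                     power = power, type = "two.sample",
                     alternative = "one.sided")
n_per_group <- ceiling(plan$n)
print(n_per_group)
```

Comparing `n_per_group` with the actual group sizes, for example from `table(Homes$year_built < 1980)`, would indicate whether the sample was large enough to detect a $200 difference. The observed difference in means (1552.150 - 965.505, about $587) is well above that minimum.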
Conclusion

The hypothesis tests and visualizations provide evidence that number of bedrooms is associated with home price and that year built is associated with price per square foot. Home prices tend to increase with more bedrooms, and newer homes (built 1980 or later) tend to have a higher price per square foot than older homes.