Introduction

The goal of this project is to investigate the factors that most strongly influence the sale price of houses in Ames, Iowa. Specifically, we model the relationship between sale price and living area, using both classical inference and bootstrap resampling to check the robustness of the results. The broader purpose is to answer a practical question: how much does the square footage of living space affect the sale price of a home in Ames?

Materials

The dataset used is the Ames housing dataset from the AmesHousing package in R, which covers homes sold in Ames, Iowa, between 2006 and 2010. The full dataset contains 2,930 sales.

Methodology

The methodology for this project includes data exploration, fitting a simple linear regression (SLR) model, diagnostic checks based on residual plots, bootstrap resampling to obtain confidence intervals, and a comparison of those intervals with the parametric intervals.

Results and Conclusions

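The fitted linear model estimates an intercept of approximately $13,289. Taken literally, this is the expected sale price of a home with 0 square feet of living area, but that value lies far outside the range of the data, so the intercept is better read as a baseline for the fitted line than as a meaningful price. The slope indicates that each additional square foot of living area is associated with an increase of about $112 in sale price. The residual plots were taken as reasonable support for the linearity and constant-variance assumptions. The 95% bootstrap interval for the slope (about 102 to 121) is centered on nearly the same value as the parametric interval (about 108 to 116) but is wider, which suggests the classical interval may somewhat understate the uncertainty, for example if the error variance is not perfectly constant. The p-value for the slope is below 2e-16, so the relationship is highly significant.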
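
To make the slope and intercept concrete, the short sketch below plugs a few living areas into the fitted line. The coefficients are copied from the model summary in this report; the helper `predict_price` and the three example areas are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the fitted model reported in this document.
intercept <- 13289.634
slope     <- 111.694

# Hypothetical helper: predicted sale price (dollars) for a living area (sq ft).
predict_price <- function(living_area) {
  intercept + slope * living_area
}

# Illustrative living areas, chosen for the example rather than taken from the data.
areas <- c(1000, 1500, 2000)
predicted <- predict_price(areas)
print(data.frame(living_area = areas, predicted_price = round(predicted)))

# The predicted gap between two homes depends only on the slope:
# 500 extra square feet corresponds to 500 * slope dollars.
diff_500 <- predict_price(2000) - predict_price(1500)
print(diff_500)
```

Under this model, two homes that differ by 500 square feet of living area differ by roughly $56,000 in predicted price, regardless of their absolute size.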

General Discussion

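This regression analysis shows that living area is an important driver of home sale price: on its own, it explains about half of the variation in sale price (R-squared of roughly 0.50). Future work could include multiple linear regression (MLR) models to see what other factors affect the price.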

library(AmesHousing)
library(tidyverse)
library(GGally)
library(broom)


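# Load the processed Ames data (one row per sale)
housing <- make_ames()

# Reproducible random subset of 1,000 homes, used only for the pairs plot
set.seed(123)
small_housing <- housing %>% sample_n(1000)

# Pairwise scatterplots and correlations for sale price and four numeric predictors
num_vars <- small_housing %>%
  select(Sale_Price, Gr_Liv_Area, Garage_Area, Lot_Area, Year_Built)
GGally::ggpairs(num_vars)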

# Simple linear regression of sale price on above-ground living area (full dataset)
fit <- lm(Sale_Price ~ Gr_Liv_Area, data = housing)
summary(fit)
## 
## Call:
## lm(formula = Sale_Price ~ Gr_Liv_Area, data = housing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -483467  -30219   -1966   22728  334323 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13289.634   3269.703   4.064 4.94e-05 ***
## Gr_Liv_Area   111.694      2.066  54.061  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56520 on 2928 degrees of freedom
## Multiple R-squared:  0.4995, Adjusted R-squared:  0.4994 
## F-statistic:  2923 on 1 and 2928 DF,  p-value: < 2.2e-16
# Standard lm diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(fit)

# Restore the single-panel layout
par(mfrow = c(1, 1))

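# Case-resampling bootstrap: refit the model on B resamples of the rows
set.seed(123)
B <- 1000
n <- nrow(housing)

# Storage for the bootstrap coefficient estimates
boot_intercepts <- numeric(B)
boot_slopes <- numeric(B)

for (b in seq_len(B)) {
  # Draw n row indices with replacement and refit on the resampled rows
  idx <- sample(1:n, n, replace = TRUE)
  sample_data <- housing[idx, ]
  model_b <- lm(Sale_Price ~ Gr_Liv_Area, data = sample_data)
  boot_intercepts[b] <- coef(model_b)[1]
  boot_slopes[b] <- coef(model_b)[2]
}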


# 95% percentile bootstrap intervals
ci_intercept <- quantile(boot_intercepts, c(0.025, 0.975))
ci_slope <- quantile(boot_slopes, c(0.025, 0.975))

ci_intercept
##      2.5%     97.5% 
##  1051.109 26583.115
ci_slope
##     2.5%    97.5% 
## 101.9418 120.8120
# Classical t-based 95% intervals, for comparison with the bootstrap
confint(fit)
##                 2.5 %     97.5 %
## (Intercept) 6878.4845 19700.7842
## Gr_Liv_Area  107.6429   115.7451

# Interpreting the model: a house with 0 square feet of living area is predicted to cost $13,289, and each additional square foot of living area is expected to increase the price by about $112. The p-value indicates the effect is highly significant. The bootstrap confidence intervals rely on fewer distributional assumptions than the classical ones. Both methods give intervals centered on nearly the same estimates, but the bootstrap intervals are wider (slope: 101.9 to 120.8 versus 107.6 to 115.7), so the classical intervals may somewhat understate the uncertainty.
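
The comparison between the two slope intervals can be checked by hand. The sketch below rebuilds the classical interval from the slope estimate, its standard error, and the residual degrees of freedom reported by `summary(fit)`, then compares its width with the bootstrap percentile interval printed above. All numbers are copied from the output in this report; the variable names are illustrative.

```r
# Values copied from the summary(fit) and bootstrap output in this report.
slope_hat <- 111.694   # estimated slope
slope_se  <- 2.066     # standard error of the slope
df_resid  <- 2928      # residual degrees of freedom

# Classical 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df = df_resid)
ci_parametric <- slope_hat + c(-1, 1) * t_crit * slope_se

# Bootstrap percentile interval for the slope, as reported above.
ci_bootstrap <- c(101.9418, 120.8120)

# Compare the widths of the two intervals.
width_parametric <- diff(ci_parametric)
width_bootstrap  <- diff(ci_bootstrap)
width_ratio <- width_bootstrap / width_parametric

print(round(ci_parametric, 2))
print(round(c(parametric = width_parametric, bootstrap = width_bootstrap), 2))
print(round(width_ratio, 2))
```

The reconstructed classical interval agrees with `confint(fit)` up to rounding of the inputs, and the bootstrap interval comes out roughly 2.3 times as wide. A gap of that size is one common sign that the classical standard error is too small, which would be worth checking against the scale-location diagnostic plot.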