1 Introduction

This assignment uses the Ames Housing dataset to model and predict residential sale price and to identify the most influential factors. I connect the results to statistical tests, model-quality metrics, and the assumptions under which OLS is BLUE (Best Linear Unbiased Estimator).


2 A. Data Loading, Preprocessing and Exploratory Data Analysis

install_if_missing <- function(pkgs) {
  to_get <- pkgs[!pkgs %in% rownames(installed.packages())]
  if (length(to_get)) install.packages(to_get, quiet = TRUE)
}
install_if_missing(c("tidyverse","broom","car"))
library(tidyverse); library(broom); library(car)

# Prefer AmesHousing::make_ames(), fallback to modeldata::ames
raw <- NULL
if (requireNamespace("AmesHousing", quietly = TRUE)) {
  library(AmesHousing); raw <- make_ames()
} else {
  install_if_missing("AmesHousing")
  if (requireNamespace("AmesHousing", quietly = TRUE)) {
    library(AmesHousing); raw <- make_ames()
  }
}
if (is.null(raw)) {
  install_if_missing("modeldata"); library(modeldata); data(ames); raw <- ames
}

# Clean and normalize names
names(raw) <- make.names(names(raw))
safe_rename <- function(dat, from, to) {
  if (from %in% names(dat) && !(to %in% names(dat))) dat <- dplyr::rename(dat, !!to := !!rlang::sym(from))
  dat
}
raw <- safe_rename(raw, "Sale_Price","SalePrice")
raw <- safe_rename(raw, "Gr_Liv_Area","Gr.Liv.Area")
raw <- safe_rename(raw, "Overall_Qual","Overall.Qual")
raw <- safe_rename(raw, "Year_Built","Year.Built")
raw <- safe_rename(raw, "Garage_Cars","Garage.Cars")
raw <- safe_rename(raw, "Full_Bath","Full.Bath")
raw <- safe_rename(raw, "Total_Bsmt_SF","Total.Bsmt.SF")
raw <- safe_rename(raw, "Lot_Area","Lot.Area")
raw <- safe_rename(raw, "Bedroom_AbvGr","Bedroom.AbvGr")
raw <- safe_rename(raw, "Year_Remod_Add","Year.Remod.Add")

vars <- c("SalePrice","Gr.Liv.Area","Overall.Qual","Year.Built","Garage.Cars",
          "Full.Bath","Total.Bsmt.SF","Neighborhood","Lot.Area",
          "Bedroom.AbvGr","Year.Remod.Add")
use <- raw %>% select(any_of(vars)) %>%
  mutate(HouseAge = max(Year.Built, na.rm = TRUE) - Year.Built)

n_before <- nrow(use)
df <- use %>% drop_na()
n_after <- nrow(df)
removed <- n_before - n_after
removed
## [1] 0

Q1: No observations were removed due to missing values (the cleaned AmesHousing data is already complete).

sp_mean <- mean(df$SalePrice); sp_sd <- sd(df$SalePrice)
tibble(Mean_SalePrice = sp_mean, SD_SalePrice = sp_sd)
## # A tibble: 1 × 2
##   Mean_SalePrice SD_SalePrice
##            <dbl>        <dbl>
## 1        180796.       79887.

Q2: Mean = $180,796; SD = $79,886.70.

hist(df$SalePrice, main="Histogram of SalePrice", xlab="SalePrice",
     col="lightblue", border="white")

qqnorm(df$SalePrice, main="Q–Q Plot of SalePrice"); qqline(df$SalePrice, col="red", lwd=2)

shapiro.test(sample(df$SalePrice, size = min(500, nrow(df))))
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(df$SalePrice, size = min(500, nrow(df)))
## W = 0.87733, p-value < 2.2e-16

Q3: Right‑skewed; the Q–Q plot deviates in the upper tail; Shapiro–Wilk on a 500-observation subsample gives p < 0.05 → not normal, though workable for OLS given the large n.
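
Given the skew, a log transform of SalePrice is the standard remedy; the quick visual check below is a minimal illustrative sketch (the models in this report keep SalePrice on the raw dollar scale):

# Illustrative only: a log transform typically pulls in the right tail
par(mfrow = c(1, 2))
hist(df$SalePrice, main = "Raw SalePrice", xlab = "SalePrice",
     col = "lightblue", border = "white")
hist(log(df$SalePrice), main = "log(SalePrice)", xlab = "log(SalePrice)",
     col = "lightblue", border = "white")
par(mfrow = c(1, 1))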

num_vars <- df %>% select(where(is.numeric)) %>% select(-SalePrice)
cors <- purrr::map_dfr(names(num_vars), ~ tibble(
  var = .x,
  cor = suppressWarnings(cor(df$SalePrice, num_vars[[.x]], use = "complete.obs"))
)) %>%
  arrange(desc(abs(cor)))
head(cors, 10)
## # A tibble: 9 × 2
##   var               cor
##   <chr>           <dbl>
## 1 Gr.Liv.Area     0.707
## 2 Garage.Cars     0.648
## 3 Total.Bsmt.SF   0.633
## 4 Year.Built      0.558
## 5 HouseAge       -0.558
## 6 Full.Bath       0.546
## 7 Year.Remod.Add  0.533
## 8 Lot.Area        0.267
## 9 Bedroom.AbvGr   0.144

Q4: Strongest numeric correlations: Gr.Liv.Area (0.71), Garage.Cars (0.65), and Total.Bsmt.SF (0.63), all positive. (Overall.Qual is a factor at this stage, so it does not appear in the numeric correlation table.)

3 Convert any factor variables to numeric for consistency

plot_vars <- c("Gr.Liv.Area","Overall.Qual","Year.Built","Garage.Cars",
               "Full.Bath","Total.Bsmt.SF","Lot.Area","Bedroom.AbvGr","Year.Remod.Add")
# Factors (e.g., Overall.Qual) are converted to their underlying level codes
df <- df %>% mutate(across(all_of(plot_vars), ~ as.numeric(.x)))

df %>%
  pivot_longer(all_of(plot_vars), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(Value, SalePrice)) +
  geom_point(alpha = 0.4, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  facet_wrap(~Variable, scales = "free") +
  labs(title = "Scatterplots: Predictors vs SalePrice", x = NULL, y = "SalePrice") +
  theme_minimal()

Q5: Gr.Liv.Area shows the strongest and most clearly linear relationship with SalePrice.


4 B. Building the Regression Model

if (is.factor(df$Overall.Qual)) df <- df %>% mutate(Overall.Qual = as.numeric(Overall.Qual))

formula_main <- SalePrice ~ Gr.Liv.Area + Overall.Qual + Year.Built +
  Garage.Cars + Full.Bath + Total.Bsmt.SF + Lot.Area +
  Bedroom.AbvGr + Year.Remod.Add

m <- lm(formula_main, data = df)
summary(m)
## 
## Call:
## lm(formula = formula_main, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -553100  -18716   -1678   15688  271930 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.168e+06  8.533e+04 -13.684  < 2e-16 ***
## Gr.Liv.Area     6.149e+01  2.292e+00  26.825  < 2e-16 ***
## Overall.Qual    1.919e+04  7.805e+02  24.579  < 2e-16 ***
## Year.Built      2.818e+02  3.313e+01   8.505  < 2e-16 ***
## Garage.Cars     1.174e+04  1.188e+03   9.885  < 2e-16 ***
## Full.Bath      -6.493e+03  1.750e+03  -3.711  0.00021 ***
## Total.Bsmt.SF   2.886e+01  1.907e+00  15.133  < 2e-16 ***
## Lot.Area        7.613e-01  8.985e-02   8.473  < 2e-16 ***
## Bedroom.AbvGr  -8.095e+03  1.014e+03  -7.987 1.98e-15 ***
## Year.Remod.Add  2.814e+02  4.308e+01   6.531 7.69e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35670 on 2920 degrees of freedom
## Multiple R-squared:  0.8012, Adjusted R-squared:  0.8006 
## F-statistic:  1308 on 9 and 2920 DF,  p-value: < 2.2e-16
coef(summary(m))["Gr.Liv.Area", , drop=TRUE]
##      Estimate    Std. Error       t value      Pr(>|t|) 
##  6.149461e+01  2.292411e+00  2.682530e+01 6.947063e-142

Q6: Each extra square foot of living area adds about $61.49, holding other predictors constant (significant).
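
As a quick worked example of this coefficient (illustrative arithmetic only), the model's implied premium for adding 100 sq ft of living area:

# Implied price change for +100 sq ft, all else held constant
unname(coef(m)["Gr.Liv.Area"] * 100)   # ~61.49 * 100 = ~$6,149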

coef(summary(m))["Overall.Qual", , drop=TRUE]
##      Estimate    Std. Error       t value      Pr(>|t|) 
##  1.918521e+04  7.805371e+02  2.457950e+01 2.045113e-121

Q7: Each 1‑point quality increase adds ~$19,185 (significant and practically meaningful).

new_house <- tibble(
  Gr.Liv.Area    = 2000,
  Overall.Qual   = 7,
  Year.Built     = 2000,
  Garage.Cars    = 2,
  Full.Bath      = 2,
  Total.Bsmt.SF  = 1000,
  Lot.Area       = 10000,
  Bedroom.AbvGr  = 3,
  Year.Remod.Add = 2000
)
predict(m, newdata = new_house, interval = "confidence")
##        fit      lwr      upr
## 1 238583.8 236026.7 241140.9

Q8: Predicted ≈ $238.6K; 95% CI ~[$236K, $241K].


5 C. Which Factors Affect Price Most?

std_cols <- c("Gr.Liv.Area","Overall.Qual","Year.Built","Garage.Cars",
              "Full.Bath","Total.Bsmt.SF","Lot.Area","Bedroom.AbvGr","Year.Remod.Add")
df_scaled <- df %>% mutate(across(all_of(std_cols), scale))
m_std <- lm(formula_main, data = df_scaled)
std_tbl <- tidy(m_std, conf.int = TRUE) %>% filter(term!="(Intercept)") %>%
  mutate(rank = dense_rank(desc(abs(estimate)))) %>% arrange(rank)
std_tbl %>% select(rank, term, estimate, conf.low, conf.high)
## # A tibble: 9 × 5
##    rank term           estimate conf.low conf.high
##   <int> <chr>             <dbl>    <dbl>     <dbl>
## 1     1 Gr.Liv.Area      31086.   28814.    33358.
## 2     2 Overall.Qual     27071.   24911.    29230.
## 3     3 Total.Bsmt.SF    12726.   11077.    14375.
## 4     4 Garage.Cars       8936.    7163.    10708.
## 5     5 Year.Built        8522.    6557.    10487.
## 6     6 Bedroom.AbvGr    -6700.   -8345.    -5055.
## 7     7 Lot.Area          5999.    4610.     7387.
## 8     8 Year.Remod.Add    5869.    4107.     7631.
## 9     9 Full.Bath        -3590.   -5487.    -1693.

Q9: Top three: Gr.Liv.Area, Overall.Qual, Total.Bsmt.SF.
Q10: Standardization puts variables on the same scale (SD units) to compare importance.
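
Because only the predictors were scaled (SalePrice stays in dollars), each standardized coefficient should equal the raw coefficient times that predictor's standard deviation. A minimal cross-check sketch:

# Cross-check: standardized beta = raw beta * SD(predictor),
# since scale() divides each predictor by its SD
coef(m)["Gr.Liv.Area"] * sd(df$Gr.Liv.Area)     # should be ~31,086
coef(m)["Overall.Qual"] * sd(df$Overall.Qual)   # should be ~27,071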


6 D. Statistical Significance and Confidence Intervals

sig_tbl <- tidy(m, conf.int = TRUE) %>%
  mutate(sig = case_when(
    p.value < 0.001 ~ "***",
    p.value < 0.01  ~ "**",
    p.value < 0.05  ~ "*",
    p.value < 0.10  ~ ".",
    TRUE ~ ""))
sig_tbl %>% select(term, estimate, conf.low, conf.high, p.value, sig)
## # A tibble: 10 × 6
##    term               estimate     conf.low    conf.high   p.value sig  
##    <chr>                 <dbl>        <dbl>        <dbl>     <dbl> <chr>
##  1 (Intercept)    -1167626.    -1334932.    -1000319.    2.32e- 41 ***  
##  2 Gr.Liv.Area          61.5         57.0         66.0   6.95e-142 ***  
##  3 Overall.Qual      19185.       17655.       20716.    2.05e-121 ***  
##  4 Year.Built          282.         217.         347.    2.87e- 17 ***  
##  5 Garage.Cars       11740.        9411.       14068.    1.09e- 22 ***  
##  6 Full.Bath         -6493.       -9923.       -3062.    2.10e-  4 ***  
##  7 Total.Bsmt.SF        28.9         25.1         32.6   7.29e- 50 ***  
##  8 Lot.Area              0.761        0.585        0.937 3.74e- 17 ***  
##  9 Bedroom.AbvGr     -8095.      -10082.       -6107.    1.98e- 15 ***  
## 10 Year.Remod.Add      281.         197.         366.    7.69e- 11 ***
sig_tbl %>% filter(term=="Overall.Qual") %>% select(conf.low, conf.high)
## # A tibble: 1 × 2
##   conf.low conf.high
##      <dbl>     <dbl>
## 1   17655.    20716.

Q11: CI excludes 0 → significant positive effect.

sig_tbl %>% filter(term=="Bedroom.AbvGr") %>% select(conf.low, conf.high)
## # A tibble: 1 × 2
##   conf.low conf.high
##      <dbl>     <dbl>
## 1  -10082.    -6107.

Q12: The CI is entirely negative and excludes 0 → holding size constant, additional bedrooms predict lower prices (more, smaller rooms trade off against open living space).

sig_tbl %>% filter(p.value < 0.05, term != "(Intercept)") %>% pull(term)
## [1] "Gr.Liv.Area"    "Overall.Qual"   "Year.Built"     "Garage.Cars"   
## [5] "Full.Bath"      "Total.Bsmt.SF"  "Lot.Area"       "Bedroom.AbvGr" 
## [9] "Year.Remod.Add"

Q13: These are significant at 0.05.

sig_tbl %>% filter(p.value >= 0.05, term != "(Intercept)")
## # A tibble: 0 × 8
## # ℹ 8 variables: term <chr>, estimate <dbl>, std.error <dbl>, statistic <dbl>,
## #   p.value <dbl>, conf.low <dbl>, conf.high <dbl>, sig <chr>
vif(m)
##    Gr.Liv.Area   Overall.Qual     Year.Built    Garage.Cars      Full.Bath 
##       3.090998       2.791987       2.311291       1.880650       2.154127 
##  Total.Bsmt.SF       Lot.Area  Bedroom.AbvGr Year.Remod.Add 
##       1.627778       1.153773       1.619913       1.858940

Q14: No predictors are non-significant here; all VIFs are below 3.1 → no serious multicollinearity.
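
The VIF values can be reproduced by hand: regress each predictor on the remaining ones and apply VIF = 1 / (1 − R²). A minimal sketch for a single predictor:

# Manual VIF for Gr.Liv.Area: auxiliary regression on the other predictors
aux <- lm(Gr.Liv.Area ~ Overall.Qual + Year.Built + Garage.Cars + Full.Bath +
            Total.Bsmt.SF + Lot.Area + Bedroom.AbvGr + Year.Remod.Add, data = df)
1 / (1 - summary(aux)$r.squared)   # should match vif(m)["Gr.Liv.Area"], ~3.09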


7 E. Model Quality Metrics

glance(m) %>% select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual)
## # A tibble: 1 × 7
##   r.squared adj.r.squared  sigma statistic p.value    df df.residual
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>       <int>
## 1     0.801         0.801 35672.     1308.       0     9        2920
summary(m)$fstatistic
##    value    numdf    dendf 
## 1307.715    9.000 2920.000

Q15: R² ≈ 0.80 → ~80% variance explained.
Q16: Adjusted R² is slightly lower (it penalizes model complexity) → prefer it when comparing models; a manual cross-check appears at the end of this section.
Q17: F‑statistic p≈0 → model significant overall.
Q18: Residual plots:

par(mfrow=c(2,2)); plot(m); par(mfrow=c(1,1))

Linearity OK; variance roughly constant; Q–Q near‑linear with mild tails; few high‑leverage points.
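
To make Q16 concrete, adjusted R² can be recomputed by hand from R² and the degrees of freedom, adj R² = 1 − (1 − R²)(n − 1)/(n − p − 1); a minimal cross-check sketch:

# Manual adjusted R^2: penalize R^2 by the degrees-of-freedom ratio
n  <- nrow(df); p <- length(coef(m)) - 1   # n = 2930, p = 9 predictors
r2 <- summary(m)$r.squared
1 - (1 - r2) * (n - 1) / (n - p - 1)       # should be ~0.8006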


8 F. Prediction with Train-Test Split

set.seed(123)
n <- nrow(df)
idx <- sample(seq_len(n), floor(0.80*n))
train_data <- df[idx, ]; test_data <- df[-idx, ]

m_train <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual + Year.Built +
                Garage.Cars + Full.Bath + Total.Bsmt.SF +
                Lot.Area + Bedroom.AbvGr + Year.Remod.Add,
              data = train_data)

r2_train <- summary(m_train)$r.squared
r2_full  <- summary(m)$r.squared
tibble(R2_train = r2_train, R2_full = r2_full)
## # A tibble: 1 × 2
##   R2_train R2_full
##      <dbl>   <dbl>
## 1    0.800   0.801

Q19: Train R² ≈ Full R² → similar, no overfitting.

pred <- predict(m_train, newdata = test_data)
MAE  <- mean(abs(pred - test_data$SalePrice))
RMSE <- sqrt(mean((pred - test_data$SalePrice)^2))
SST  <- sum((test_data$SalePrice - mean(test_data$SalePrice))^2)
SSE  <- sum((pred - test_data$SalePrice)^2)
R2_test <- 1 - SSE/SST
tibble(MAE = MAE, RMSE = RMSE, R2_test = R2_test)
## # A tibble: 1 × 3
##      MAE   RMSE R2_test
##    <dbl>  <dbl>   <dbl>
## 1 22186. 33911.   0.806

Q20: MAE ≈ $22.2K.
Q21: RMSE ≈ $33.9K (about 19% of the mean sale price) and Test R² ≈ 0.81 → strong predictive accuracy.
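
A single 80/20 split depends on the random seed, so as a hedged corroboration (not part of the original assignment), a simple 5-fold cross-validation of RMSE can be sketched in base R:

# Illustrative 5-fold CV: average out-of-fold RMSE
set.seed(123)
folds <- sample(rep(1:5, length.out = nrow(df)))
cv_rmse <- sapply(1:5, function(k) {
  fit <- lm(formula_main, data = df[folds != k, ])
  pr  <- predict(fit, newdata = df[folds == k, ])
  sqrt(mean((pr - df$SalePrice[folds == k])^2))
})
mean(cv_rmse)   # expect a value broadly in line with the holdout RMSE (~$34K)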

tibble(actual=test_data$SalePrice, pred=pred) %>%
  ggplot(aes(pred, actual)) + geom_point(alpha=.6) +
  geom_abline(slope=1, intercept=0, color="red", linetype="dashed") +
  labs(title="Actual vs Predicted (Test Set)", x="Predicted", y="Actual") +
  theme_minimal()

tibble(resid = test_data$SalePrice - pred, pred = pred) %>%
  ggplot(aes(pred, resid)) + geom_point(alpha=.6) +
  geom_hline(yintercept=0, color="red") +
  labs(title="Residuals vs Predicted (Test Set)", x="Predicted", y="Residuals") +
  theme_minimal()


9 G. Prediction Intervals and Uncertainty

pred_int <- predict(m_train, newdata = test_data, interval = "prediction", level = 0.95)
within_interval <- with(test_data, SalePrice >= pred_int[,"lwr"] & SalePrice <= pred_int[,"upr"])
coverage_rate <- mean(within_interval); coverage_rate
## [1] 0.9726962

Q22: PI > CI because PI includes individual outcome variance.
Q23: Empirical coverage ≈ 97.3%, slightly above the nominal 95% → mildly conservative but reliable.
Q24: Visualization shows most actuals inside PIs.

test_vis <- tibble(Actual=test_data$SalePrice,
                   Predicted=pred_int[,"fit"],
                   Lower=pred_int[,"lwr"],
                   Upper=pred_int[,"upr"])
ggplot(test_vis, aes(Predicted, Actual)) +
  geom_point(color="steelblue", alpha=.6) +
  geom_errorbar(aes(ymin=Lower, ymax=Upper), color="gray50", width=0) +
  geom_abline(intercept=0, slope=1, color="red", linetype="dashed") +
  labs(title="Prediction vs Actual with 95% Prediction Intervals",
       x="Predicted Sale Price", y="Actual Sale Price") +
  theme_minimal()
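
To make the CI-vs-PI contrast in Q22 concrete, both interval types can be computed on the same test set; the prediction intervals should be markedly wider. A minimal sketch using objects defined above:

# Average interval widths: the PI adds observation-level noise,
# so it is much wider than the CI for the mean response
ci <- predict(m_train, newdata = test_data, interval = "confidence")
mean(ci[, "upr"] - ci[, "lwr"])             # average CI width
mean(pred_int[, "upr"] - pred_int[, "lwr"]) # average PI width (much larger)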


10 Conclusion

  • R² ≈ 0.80, adjusted similar → parsimonious, strong fit.
  • Gr.Liv.Area and Overall.Qual dominate; Total.Bsmt.SF also strong.
  • Test performance (MAE ~ $22K, RMSE ~ $34K, R² ~ 0.81).
  • PIs achieve near‑nominal coverage, quantifying uncertainty well.
  • BLUE: Assumptions are reasonably satisfied → OLS yields reliable, unbiased estimates.

11 Questions and Answers Summary

This section summarizes all 24 questions from the assignment with full explanations, real-world interpretations, and model insights based on the Ames Housing Dataset.

12 Q1

No observations were removed due to missing values; the cleaned AmesHousing data is already complete, so all 2,930 rows were retained.
Interpretation: Zero data loss means the model is fit on the full sample.
Real-world connection: Missing data often reflects incomplete or invalid listings (e.g., empty lots). Filtering ensures the dataset mirrors real housing markets.

13 Q2

Mean SalePrice = $180,796; Standard Deviation = $79,887.
Interpretation: The average home sells for $180K, but with wide variation due to housing diversity.
Real-world: This spread reflects economic diversity — a realistic range for mid-sized markets with both modest and premium homes.

14 Q3

SalePrice is right-skewed, confirmed by a Shapiro-Wilk test on a 500-observation subsample (W = 0.877, p < 0.001).
Interpretation: Most homes are moderately priced, with few high-end outliers lifting the average.
Real-world: In housing, luxury listings distort averages — they exist but are rare, aligning with socioeconomic distribution.

15 Q4

Top correlated variables with SalePrice:
- Gr.Liv.Area (0.707)
- Garage.Cars (0.648)
- Total.Bsmt.SF (0.633)

Interpretation: Larger homes and more garage space predict higher sale price.
Real-world: Size, utility, and storage are core drivers of perceived value — consistent with market logic.

16 Q5

Scatter plots show the strongest linear relationship with Gr.Liv.Area.
Interpretation: As living area increases, sale price rises linearly.
Real-world: Larger living space remains the most consistent determinant of residential property value.

17 Q6

Coefficient for Gr.Liv.Area = 61.49 (p < 0.001).
Interpretation: Each extra square foot adds ~$61.49 to sale price, holding other factors constant.
Real-world: Buyers pay premiums for space — a universal trend across real estate markets.

18 Q7

Coefficient for Overall.Qual = $19,185 (p < 0.001).
Interpretation: Each 1-point increase in quality adds ~$19,000 to value.
Real-world: Renovations, upgrades, and design quality directly boost equity and resale value.
(“Redoing the kitchen tiles” adds more than aesthetic appeal — it literally increases your home’s worth.)

19 Q8

Predicted sale price for a 2000 sq.ft., quality-7 home built in 2000:
$238,584 (95% CI: $236,027 – $241,141)
Interpretation: A well-built mid-sized house from 2000 is expected to appraise around $238K.
Real-world: Such models help mortgage appraisers and homeowners estimate fair market values for financing or renovation planning.

20 Q9

Top three standardized predictors:
- Gr.Liv.Area (+$31K per SD)
- Overall.Qual (+$27K per SD)
- Total.Bsmt.SF (+$12K per SD)

Interpretation: These are the “breadwinners” — variables with the greatest standardized impact.
Real-world: Space, quality, and functional area create usable value and resale leverage.

21 Q10

Standardized coefficients allow fair comparisons across variables with different units.
Interpretation: They show which predictors matter most, independent of scale.
Real-world: Helps appraisers and analysts weigh design or renovation decisions objectively (e.g., adding space vs. upgrading finishes).

22 Q11

95% Confidence Interval for Overall.Qual = $17,655 – $20,716.
Interpretation: The true mean effect likely lies in this range.
Real-world: Each improvement in quality score is reliably linked to ~$18K–$21K in added value — a strong signal, not random noise.

23 Q12

The 95% CI for Bedroom.AbvGr is entirely negative (about –$10,082 to –$6,107) and excludes zero.
Interpretation: More bedrooms might reduce open living space, decreasing total price.
Real-world: Buyers prefer fewer, larger rooms — space quality outweighs quantity.

24 Q13

Nine predictors are significant (p < 0.05).
Interpretation: These variables meaningfully explain price variance.
Real-world: Real estate value isn’t random — tangible features like area, baths, and quality drive consistent pricing.

25 Q14

All predictors are significant with VIF < 3.1, indicating no multicollinearity.
Interpretation: Predictors aren’t redundant; each adds independent insight.
Real-world: Each feature (garage, quality, basement) plays a unique, interpretable role in determining price.

26 Q15

Model R² = 0.801 (80%).
Interpretation: 80% of price variation is explained by included predictors.
Real-world: Excellent fit — comparable to professional hedonic price models used by banks and realtors.

27 Q16

Adjusted R² is slightly lower than R².
Interpretation: Adjusted R² penalizes unnecessary predictors, preventing overfitting.
Real-world: Ensures model complexity doesn’t inflate apparent accuracy — a key part of trustworthy analytics.

28 Q17

F-statistic = 1307.7 (p < 0.001).
Interpretation: The model as a whole is statistically significant.
Real-world: Confirms that price is not random — key features systematically predict housing value.

29 Q18

Residual diagnostics show:
- Residuals vs Fitted: Random scatter → linearity holds.
- Q-Q Plot: Near-straight line → normal residuals.
- Scale-Location: Constant variance.
- Residuals vs Leverage: No extreme outliers.

Conclusion: OLS assumptions are reasonably satisfied (aside from mild tail deviations in the Q-Q plot).
Real-world: Model predictions are stable and reliable across housing types.

30 Q19

Training R² = 0.7995 vs Full Model R² = 0.8012 → nearly identical.
Interpretation: Minimal overfitting.
Real-world: Model generalizes well — predictive on unseen data.

31 Q20

Mean Absolute Error (MAE) ≈ $22,186.
Interpretation: Average prediction is within ~$22K of actual sale price.
Real-world: Acceptable margin given real-world market variability and unseen factors.

32 Q21

Root Mean Square Error (RMSE) = $33,910.58.
Interpretation: Predictions typically deviate from actual prices by about ±$34K.
Real-world: Strong accuracy (RMSE is roughly 19% of the average home value), reasonable for housing price prediction.

33 Q22

Confidence Interval (CI): Predicts mean sale price range.
Prediction Interval (PI): Predicts where individual sales will fall.
Interpretation: PI is wider because it accounts for both model and observation-level uncertainty.
Real-world: A CI tells you what to expect on average; a PI tells you how much reality might fluctuate.

34 Q23

Empirical coverage rate = 97.3% vs nominal 95%.
Interpretation: Slightly conservative — intervals captured more data than expected.
Real-world: Indicates strong model calibration and realistic uncertainty bounds.

35 Q24

Most actual sale prices fall within the 95% prediction intervals.
Interpretation: Visual plot confirms high model accuracy; outliers are limited.
Real-world: Uncaptured cases likely represent homes with unique or unrecorded features (e.g., location desirability, luxury finish).

36 Summary

This regression model reasonably satisfies the BLUE (Gauss-Markov) assumptions, explains ~80% of price variation, and produces realistic prediction intervals.
It effectively connects statistical inference with real-world real estate analysis — showing how data-driven appraisal can inform smarter property valuation, investment, and renovation strategies.