2026-05-01

Project Summary

  • Housing Prices (Kaggle dataset)
  • Goal: identify structural and qualitative features that predict sale price
  • Dataset
    • Training and testing data
    • 81 columns, 1,460 observations

Variables

  • MSSubClass
  • LotFrontage
  • LotArea
  • OverallQual
  • OverallCond
  • YearBuilt
  • YearRemodAdd
  • MasVnrArea
  • BsmtFinSF1
  • BsmtFinSF2
  • BsmtUnfSF …

Approach

  • Handle missing (NA) values
    • Numeric variables: filled with that variable's median
    • Categorical variables: filled with “Missing”
  • Reduce skew in sale price with a log transformation
  • Build a process to select variables automatically (ongoing)
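
The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy data frame `housing` and the helper `fill_na` are hypothetical, and the values are made up. Only the names `housing_clean` and `logSalePrice` are taken from the model call in this deck.

```r
# Toy data standing in for the Kaggle training file (values are made up).
housing <- data.frame(
  LotFrontage = c(65, NA, 80, 60, NA),
  LotArea     = c(8450, 9600, 11250, 9550, 14260),
  Alley       = c(NA, "Grvl", NA, "Pave", NA),
  SalePrice   = c(208500, 181500, 223500, 140000, 250000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill numeric columns with their median and
# every other column with the label "Missing".
fill_na <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    } else {
      df[[col]] <- as.character(df[[col]])
      df[[col]][is.na(df[[col]])] <- "Missing"
    }
  }
  df
}

housing_clean <- fill_na(housing)

# Log-transform the response to reduce skew.
housing_clean$logSalePrice <- log(housing_clean$SalePrice)
```

Filling categorical gaps with an explicit “Missing” level keeps those rows in the model and lets the absence itself act as a category, which may or may not be appropriate for every variable.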

Sale Price

Log Sale Price

Preliminary Graphs

  • Log Sale Price predicted by Lot Area and Overall Quality

Model Summary

## 
## Call:
## lm(formula = logSalePrice ~ log(LotArea) + OverallQual, data = housing_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.30522 -0.10637  0.01229  0.12192  0.61440 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.819794   0.095337   92.51   <2e-16 ***
## log(LotArea) 0.202737   0.010589   19.15   <2e-16 ***
## OverallQual  0.222510   0.003962   56.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2059 on 1457 degrees of freedom
## Multiple R-squared:  0.7346, Adjusted R-squared:  0.7342 
## F-statistic:  2016 on 2 and 1457 DF,  p-value: < 2.2e-16
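
Because the response is on the log scale, the coefficients read as approximate percentage effects on sale price. The sketch below converts the two reported estimates, holding the other predictor fixed; the variable names are illustrative, and the numbers are copied from the summary above rather than re-estimated.

```r
# Coefficient estimates copied from the summary above.
b_lot_area <- 0.202737   # coefficient on log(LotArea)
b_qual     <- 0.222510   # coefficient on OverallQual

# Log-log term: a 10% larger lot multiplies the predicted price by 1.10^b.
lot_effect_pct <- (1.10^b_lot_area - 1) * 100

# Log-level term: one more quality point multiplies the predicted price by exp(b).
qual_effect_pct <- (exp(b_qual) - 1) * 100

round(c(lot_10pct = lot_effect_pct, qual_1pt = qual_effect_pct), 1)
```

Under this model, a 10% increase in lot area is associated with roughly a 2% higher predicted sale price, and each additional point of overall quality with roughly a 25% higher predicted sale price.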

Testing the Model

Next Steps

  • Continue building the automatic variable selection process
  • Finalize the model
    • Including running diagnostic tests
  • Test the model against real-world situations
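
One possible shape for the selection and diagnostic steps is sketched below. The deck does not say which selection method is being built, so backward stepwise selection by AIC with `step()` is an assumption swapped in for illustration. The data frame `sim` is simulated, and its `Noise` column is a hypothetical unrelated predictor; none of it comes from the housing data.

```r
set.seed(1)
n <- 200

# Simulated stand-in for housing_clean; the real columns and values differ.
sim <- data.frame(
  LotArea     = exp(rnorm(n, mean = 9, sd = 0.5)),
  OverallQual = sample(1:10, n, replace = TRUE),
  Noise       = rnorm(n)   # unrelated predictor, a candidate for removal
)
sim$logSalePrice <- 9 + 0.2 * log(sim$LotArea) + 0.2 * sim$OverallQual +
  rnorm(n, sd = 0.2)

# Backward stepwise selection by AIC, starting from the full model.
full_fit <- lm(logSalePrice ~ log(LotArea) + OverallQual + Noise, data = sim)
selected <- step(full_fit, direction = "backward", trace = 0)

# Inspect which terms survived selection.
formula(selected)

# Basic diagnostics on the selected model.
shapiro.test(residuals(selected))   # normality of residuals
if (interactive()) {
  par(mfrow = c(2, 2))
  plot(selected)                    # the four standard lm diagnostic plots
}
```

Stepwise selection is convenient but can overfit, so checking the chosen model on the held-out testing data would be a sensible complement.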