Linear Regression Model of Housing Prices vs Above Ground Living Square Footage

King’s County Housing Sales with Multiple Regression

This round, we’ll try predict housing prices via multiple regression and including extra predictors:

A quadratic term
A dichotomous term
A dichotomous vs quantitative term

We’ll add these extra terms with our original term of square footaage above ground level (sqft_above) and evaluate if this model explains more of the data.

Reference - https://www.kaggle.com/code/sid321axn/house-price-prediction-gboosting-adaboost-etc/data

Subsetting From the Original Dataset

For convenience’s sake, we’ll simply take a subset of 1000 from the original dataset with more than 2000 rows.

# Sample 1000 rows
sample_housing_prices <- sample_n(housing_prices, 1000, FALSE)


# filter outliers with large lots
#sample_housing_prices <- sample_housing_prices %>% filter( sqft_lot <= 100000)
#sample_housing_prices <- sample_housing_prices %>% filter( sqft_lot15 <= 100000)

Quadratic Term

We will create a quadratic term from the year built column. First, we’ll take a look and look at the data. We can confirm there isn’t a linear relationship between a house’s price and its year built.

plot(price ~ yr_built, data=sample_housing_prices)

There might be a flat, low slope relationship but then again, it looks evenly distributed. We’ll continue on the assumption there is no linear relationship.

# create a quadratic variable by squaring the yr_built column
sample_housing_prices$yr_built_2 <- sample_housing_prices$yr_built^2

Dichotomous Terms

We will create two dichotomous terms. One to add as a predictor by itself and another as an interaction with a quantitative term:

High Number of Bathrooms - any house with four more bathrooms will be classified as a 1
Prewar - houses built before 1929 will be set to 1.

Conventionally, the above terms are popular features that typically drive the price of houses but we will investigate if it’s true with this dataset.

# Create a dichotomous variable if the house has high number of bathrooms
sample_housing_prices <- sample_housing_prices %>% mutate (high_bathrooms = as.integer(bathrooms >= 4))


# create another dichotomous variable showing if the house was made before 1929
sample_housing_prices <- sample_housing_prices %>% mutate (prewar = as.integer(yr_built <= 1929))

Original Linear Regression Model

Our original model used a single explanatory variable: square footage above ground. From the model’s summary it explained about 34% of the data.

single_model <- lm(price ~ sqft_above, data=sample_housing_prices)
summary(single_model)

## 
## Call:
## lm(formula = price ~ sqft_above, data = sample_housing_prices)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -765870 -168450  -41797  115362 3289955 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   650.77   22661.69   0.029    0.977    
## sqft_above    303.87      11.03  27.546   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 312500 on 998 degrees of freedom
## Multiple R-squared:  0.4319, Adjusted R-squared:  0.4313 
## F-statistic: 758.8 on 1 and 998 DF,  p-value: < 2.2e-16

Adding Other Terms

For our multiple regression model we’ll add all of our new terms.

multiple_model <- lm(price ~ sqft_above +       # original predictor
                             yr_built_2 +       # quadratic term
                             high_bathrooms +   # dichotomous term
                             prewar*bathrooms,  # dichotomous vs quantitative term
                             data=sample_housing_prices)
summary(multiple_model)

## 
## Call:
## lm(formula = price ~ sqft_above + yr_built_2 + high_bathrooms + 
##     prewar * bathrooms, data = sample_housing_prices)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1318282  -143674   -25537   100125  2802749 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.011e+06  4.808e+05   8.344 2.39e-16 ***
## sqft_above        2.636e+02  1.438e+01  18.325  < 2e-16 ***
## yr_built_2       -1.073e+00  1.276e-01  -8.413  < 2e-16 ***
## high_bathrooms    4.542e+05  6.781e+04   6.697 3.55e-11 ***
## prewar           -1.174e+05  6.779e+04  -1.732   0.0837 .  
## bathrooms         1.054e+05  1.863e+04   5.659 1.99e-08 ***
## prewar:bathrooms  4.894e+04  3.681e+04   1.329   0.1840    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 279800 on 993 degrees of freedom
## Multiple R-squared:  0.5468, Adjusted R-squared:  0.5441 
## F-statistic: 199.7 on 6 and 993 DF,  p-value: < 2.2e-16

From the above summary, we’ve explained 51% of the data; this is significant improvement but far from perfect.

Quantile Plots

The data seems to start following the variable until the end where the prices rise very high above the normal line.

single_residue <- resid(single_model)
qqnorm(single_residue, main="QQ Plot for Single Linear Regression")
qqline(single_residue)

multiple_residue <- resid(multiple_model)

qqnorm(multiple_residue, main="QQ Plot for Mulitple Linear Regression")
qqline(multiple_residue)

The QQ plot for the multiple regression model still underpredicts the actual values on the right hand side but not as poorly as the single linear model.

Residue Plots

plot(sample_housing_prices$sqft_above, 
     single_residue, 
     main="Residue Plot (Single Regression)", 
     xlab="Square Footage Above", 
     ylab="Residue")
abline(0, 0)

plot(sample_housing_prices$sqft_above + 
     sample_housing_prices$yr_built_2 +
     sample_housing_prices$high_bathrooms +
     sample_housing_prices$prewar*sample_housing_prices$bathrooms, 
     multiple_residue, 
     main="Residue Plot (Multiple Regression)", 
     xlab="Various Predictors", 
     ylab="Residue")
abline(0, 0)

We see the residue is more evenly distributed around the zero line than the single predictor model.