Linear Regression Model of Housing Prices vs Above Ground Living Square Footage

King County Housing Sales

We will perform a simple linear regression with the explanatory variable being the amount of livable square footage above ground and the response variable being the sale price.

Reference - https://www.kaggle.com/code/sid321axn/house-price-prediction-gboosting-adaboost-etc/data

Build a simple linear model

model <- lm(price ~ sqft_above, data=housing_prices)

summary(model)

## 
## Call:
## lm(formula = price ~ sqft_above, data = housing_prices)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -913132 -165624  -41468  109327 5339232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59953.2     4729.8   12.68   <2e-16 ***
## sqft_above     268.5        2.4  111.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 292200 on 21611 degrees of freedom
## Multiple R-squared:  0.3667, Adjusted R-squared:  0.3667 
## F-statistic: 1.251e+04 on 1 and 21611 DF,  p-value: < 2.2e-16
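To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below copies the intercept and slope from the summary output; the 2,000 square foot home is a hypothetical input chosen only for illustration.

```r
# Coefficients copied from the summary output above
intercept <- 59953.2   # estimated price at 0 sq ft above ground
slope <- 268.5         # estimated dollars per additional square foot

# Hypothetical home size, not a row from the data
sqft <- 2000

# Point prediction from the fitted line: intercept + slope * sqft
predicted_price <- intercept + slope * sqft
print(predicted_price)
```

This works out to roughly $597,000. The residual standard error of 292,200 is large relative to that figure, so any single prediction from this model carries a lot of uncertainty.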

Scatter plot

plot(price ~ sqft_above, data=housing_prices)
abline(model)

Quantile Plots

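The residuals follow the normal reference line through the lower and middle quantiles, but in the upper tail they rise far above it. This points to a right-skewed residual distribution, which matches the summary above (largest residual 5339232 versus smallest -913132).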

residue <- resid(model)

qqnorm(residue)
qqline(residue)
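One way to put a number on the departure seen in the upper tail is the sample skewness of the residuals. The sketch below is illustrative only: `skewness` is a small helper defined here (not a base R function), and `fake_resid` is synthetic right-skewed data standing in for the real residuals.

```r
set.seed(1)

# Helper (an assumption for this sketch): third standardized moment
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Synthetic stand-in for the residuals: a centered exponential sample,
# which is right-skewed. This is NOT the housing data.
fake_resid <- rexp(1000) - 1

# Values well above 0 indicate a long right tail
print(skewness(fake_resid))
```

With the fitted model in memory, calling `skewness(residue)` would apply the same check to the actual residuals.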

Checking assumptions with residual plot

plot(housing_prices$sqft_above, residue)
abline(0, 0)

The residuals are not scattered randomly around the zero line. Their spread is much smaller on the left side (less square footage above ground) than on the right side (more square footage), so the constant-variance assumption of linear regression does not hold.
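The fan shape can also be checked numerically by comparing the standard deviation of the residuals across bins of the explanatory variable. The sketch below uses synthetic data whose noise grows with square footage, since it is meant to show the technique rather than reproduce the housing results; the names `sqft`, `price`, `fit`, `res`, `bins` and `spread` are all introduced here for illustration.

```r
set.seed(42)

# Synthetic data (NOT the housing data): noise sd grows with sqft
n <- 2000
sqft <- runif(n, 500, 5000)
price <- 60000 + 270 * sqft + rnorm(n, sd = 60 * sqft)

fit <- lm(price ~ sqft)
res <- resid(fit)

# Split the explanatory variable into quartile bins
bins <- cut(sqft,
            breaks = quantile(sqft, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

# Residual standard deviation within each bin;
# a steady increase across bins indicates non-constant variance
spread <- tapply(res, bins, sd)
print(spread)
```

On the real data, the same steps would use `housing_prices$sqft_above` in place of `sqft` and `residue` in place of `res`.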

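Because the residuals are neither normally distributed nor constant in variance, the assumptions behind this simple linear model are violated, and its standard errors and p-values should not be trusted as they stand. Square footage above ground is still associated with price (R-squared 0.3667), but on its own it leaves most of the variation in housing prices unexplained.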