We fit a simple linear regression with the livable square footage above ground (sqft_above) as the explanatory variable and the sale price (price) as the response variable.
Reference - https://www.kaggle.com/code/sid321axn/house-price-prediction-gboosting-adaboost-etc/data
model <- lm(price ~ sqft_above, data=housing_prices)
summary(model)
##
## Call:
## lm(formula = price ~ sqft_above, data = housing_prices)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -913132 -165624  -41468  109327 5339232
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  59953.2     4729.8   12.68   <2e-16 ***
## sqft_above     268.5        2.4  111.87   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 292200 on 21611 degrees of freedom
## Multiple R-squared: 0.3667, Adjusted R-squared: 0.3667
## F-statistic: 1.251e+04 on 1 and 21611 DF, p-value: < 2.2e-16
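As a rough illustration of how to read the coefficient table, the fitted line can be evaluated by hand. The sketch below copies the two estimates from the summary output. The 2,000 square foot house is a hypothetical example, not a row from the data.

```r
# Estimates copied from the summary(model) output above
intercept <- 59953.2   # (Intercept)
slope     <- 268.5     # sqft_above: estimated price change per extra square foot

# Hypothetical house with 2,000 square feet above ground (assumed value)
sqft <- 2000
predicted_price <- intercept + slope * sqft
print(predicted_price)
```

With the fitted object available, predict(model, newdata = data.frame(sqft_above = 2000)) should give the same value up to rounding of the printed coefficients.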
plot(price ~ sqft_above, data=housing_prices)
abline(model)
The points appear to follow the fitted line at lower square footage, but toward the upper end many sale prices rise far above the line.
residue <- resid(model)
qqnorm(residue)
qqline(residue)
plot(housing_prices$sqft_above, residue)
abline(0, 0)
However, the residuals are not randomly scattered around the zero line. Their variability on the left side (less square footage above ground) is much smaller than on the right side (more square footage), so the residual variance is not constant.
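One simple way to put a number on this visual impression is to compare the residual spread below and above the median of the explanatory variable. The sketch below is only an illustration: spread_by_half is a hypothetical helper, and the data are simulated stand-ins (not the housing data) in which the noise is made to grow with x.

```r
# Hypothetical helper: standard deviation of the residuals for
# observations below/at the median of x and above the median of x.
spread_by_half <- function(x, res) {
  upper <- x > median(x)
  c(lower = sd(res[!upper]), upper = sd(res[upper]))
}

# Simulated stand-in data (assumed values, not the housing data):
# the noise standard deviation is proportional to x.
set.seed(1)
x <- runif(500, 500, 5000)
y <- 60000 + 270 * x + rnorm(500, sd = 50 * x)

fit <- lm(y ~ x)
spread <- spread_by_half(x, resid(fit))
print(spread)

# The analogous call on the data in this report would be:
# spread_by_half(housing_prices$sqft_above, residue)
```

A markedly larger value in the upper half than in the lower half is consistent with the fanning-out pattern seen in the residual plot. This is an informal check, not a formal test.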
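Because the residual variance is not constant, this model as fitted cannot be relied on, and square footage above ground alone does not adequately explain housing prices (the multiple R-squared is only 0.3667).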