This round, we’ll try predict housing prices via multiple regression and including extra predictors:
We’ll add these extra terms with our original term of square footaage above ground level (sqft_above) and evaluate if this model explains more of the data.
Reference - https://www.kaggle.com/code/sid321axn/house-price-prediction-gboosting-adaboost-etc/data
For convenience’s sake, we’ll simply take a subset of 1000 from the original dataset with more than 2000 rows.
# Sample 1000 rows
sample_housing_prices <- sample_n(housing_prices, 1000, FALSE)
# filter outliers with large lots
#sample_housing_prices <- sample_housing_prices %>% filter( sqft_lot <= 100000)
#sample_housing_prices <- sample_housing_prices %>% filter( sqft_lot15 <= 100000)
We will create a quadratic term from the year built column. First, we’ll take a look and look at the data. We can confirm there isn’t a linear relationship between a house’s price and its year built.
plot(price ~ yr_built, data=sample_housing_prices)
There might be a flat, low slope relationship but then again, it looks evenly distributed. We’ll continue on the assumption there is no linear relationship.
# create a quadratic variable by squaring the yr_built column
sample_housing_prices$yr_built_2 <- sample_housing_prices$yr_built^2
We will create two dichotomous terms. One to add as a predictor by itself and another as an interaction with a quantitative term:
Conventionally, the above terms are popular features that typically drive the price of houses but we will investigate if it’s true with this dataset.
# Create a dichotomous variable if the house has high number of bathrooms
sample_housing_prices <- sample_housing_prices %>% mutate (high_bathrooms = as.integer(bathrooms >= 4))
# create another dichotomous variable showing if the house was made before 1929
sample_housing_prices <- sample_housing_prices %>% mutate (prewar = as.integer(yr_built <= 1929))
Our original model used a single explanatory variable: square footage above ground. From the model’s summary it explained about 34% of the data.
single_model <- lm(price ~ sqft_above, data=sample_housing_prices)
summary(single_model)
##
## Call:
## lm(formula = price ~ sqft_above, data = sample_housing_prices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -765870 -168450 -41797 115362 3289955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 650.77 22661.69 0.029 0.977
## sqft_above 303.87 11.03 27.546 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 312500 on 998 degrees of freedom
## Multiple R-squared: 0.4319, Adjusted R-squared: 0.4313
## F-statistic: 758.8 on 1 and 998 DF, p-value: < 2.2e-16
For our multiple regression model we’ll add all of our new terms.
multiple_model <- lm(price ~ sqft_above + # original predictor
yr_built_2 + # quadratic term
high_bathrooms + # dichotomous term
prewar*bathrooms, # dichotomous vs quantitative term
data=sample_housing_prices)
summary(multiple_model)
##
## Call:
## lm(formula = price ~ sqft_above + yr_built_2 + high_bathrooms +
## prewar * bathrooms, data = sample_housing_prices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1318282 -143674 -25537 100125 2802749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.011e+06 4.808e+05 8.344 2.39e-16 ***
## sqft_above 2.636e+02 1.438e+01 18.325 < 2e-16 ***
## yr_built_2 -1.073e+00 1.276e-01 -8.413 < 2e-16 ***
## high_bathrooms 4.542e+05 6.781e+04 6.697 3.55e-11 ***
## prewar -1.174e+05 6.779e+04 -1.732 0.0837 .
## bathrooms 1.054e+05 1.863e+04 5.659 1.99e-08 ***
## prewar:bathrooms 4.894e+04 3.681e+04 1.329 0.1840
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 279800 on 993 degrees of freedom
## Multiple R-squared: 0.5468, Adjusted R-squared: 0.5441
## F-statistic: 199.7 on 6 and 993 DF, p-value: < 2.2e-16
From the above summary, we’ve explained 51% of the data; this is significant improvement but far from perfect.
The data seems to start following the variable until the end where the prices rise very high above the normal line.
single_residue <- resid(single_model)
qqnorm(single_residue, main="QQ Plot for Single Linear Regression")
qqline(single_residue)
multiple_residue <- resid(multiple_model)
qqnorm(multiple_residue, main="QQ Plot for Mulitple Linear Regression")
qqline(multiple_residue)
The QQ plot for the multiple regression model still underpredicts the actual values on the right hand side but not as poorly as the single linear model.
plot(sample_housing_prices$sqft_above,
single_residue,
main="Residue Plot (Single Regression)",
xlab="Square Footage Above",
ylab="Residue")
abline(0, 0)
plot(sample_housing_prices$sqft_above +
sample_housing_prices$yr_built_2 +
sample_housing_prices$high_bathrooms +
sample_housing_prices$prewar*sample_housing_prices$bathrooms,
multiple_residue,
main="Residue Plot (Multiple Regression)",
xlab="Various Predictors",
ylab="Residue")
abline(0, 0)
We see the residue is more evenly distributed around the zero line than the single predictor model.