Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
housing <- MASS::Boston
str(housing)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
We will explore the Boston housing data from the 1970s, where the main goal is predicting the median value of owner-occupied homes.
I chose this data set because there are many excellent examples online that use it, thanks to its flexibility when modeling. With 14 different features there are many ways to demonstrate a variety of data science tools and approaches.
library(corrplot)
corrplot(cor(housing))
We can see from the correlation plot that a few variables have strong relationships with the median value (denoted medv).
The features that stand out most in the plot are:
percentage of lower-status population (lstat), showing a strong negative correlation;
average number of rooms per dwelling (rm), showing the opposite; and
index of accessibility to radial highways (rad), which will serve as our dichotomous term since the problem calls for one and there are only two candidates in this data set (the other being chas, which does not have as strong a correlation).
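As a quick numeric check of these claims (a small sketch added here, not part of the original write-up), we can rank every feature by its correlation with medv:
# Sort the correlations of all features with medv (base R only)
round(sort(cor(housing)[, 'medv']), 2)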
library(lattice)
housingplot <- housing[, colnames(housing) %in% c('rm','rad','lstat','medv')]
splom(housingplot, col = 'Orange')
We can see that rm appears to be a good candidate for a linear term, while lstat may be a candidate for a quadratic transformation for a better fit.
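To make that curvature concrete, here is an optional sketch (not part of the original analysis) overlaying a simple quadratic fit on the medv vs. lstat scatter; the object names quad.check and ord are introduced purely for illustration:
# Does a quadratic in lstat track medv better than a straight line?
quad.check <- lm(medv ~ lstat + I(lstat^2), data = housing)
ord <- order(housing$lstat)
plot(housing$lstat, housing$medv, col = 'grey50', xlab = 'lstat', ylab = 'medv')
lines(housing$lstat[ord], fitted(quad.check)[ord], col = 'orange', lwd = 2)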
library(caTools)
set.seed(315)
# Keep 75% of the rows for training
split <- sample.split(housing, SplitRatio = 0.75)
train <- subset(housing, split == TRUE)
lm.fit <- lm(medv ~ rm + rad + lstat, data=train)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ rm + rad + lstat, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.706 -3.446 -1.010 1.970 28.684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.46006 3.97632 -0.619 0.5365
## rm 5.24601 0.55653 9.426 <2e-16 ***
## rad -0.08222 0.03814 -2.156 0.0318 *
## lstat -0.57107 0.06049 -9.441 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.348 on 357 degrees of freedom
## Multiple R-squared: 0.6503, Adjusted R-squared: 0.6473
## F-statistic: 221.3 on 3 and 357 DF, p-value: < 2.2e-16
So as a baseline with the hand-picked features we have an adjusted R-squared of about 0.65. Interpreting the coefficients: holding the other predictors constant, each additional room (rm) is associated with an increase of roughly 5.25 in medv (about $5,250, since medv is measured in $1000s), each one-unit increase in the highway-accessibility index (rad) with a decrease of about 0.08, and each one-point increase in lstat with a decrease of about 0.57.
We already saw that lstat was a good candidate for a quadratic term. We will also add interaction terms between our significant variables to meet the requirements of the problem.
lm.fit2 <- lm(medv ~ rm*rad + I(lstat^2) + rad*lstat,
data=train)
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ rm * rad + I(lstat^2) + rad * lstat, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9433 -2.4024 -0.3213 2.0728 22.7037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.268501 5.540741 -7.087 7.45e-12 ***
## rm 10.835141 0.744660 14.550 < 2e-16 ***
## rad 3.868066 0.335161 11.541 < 2e-16 ***
## I(lstat^2) 0.033258 0.004209 7.901 3.53e-14 ***
## lstat -0.993555 0.145486 -6.829 3.73e-11 ***
## rm:rad -0.501455 0.047170 -10.631 < 2e-16 ***
## rad:lstat -0.054053 0.005032 -10.742 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.136 on 354 degrees of freedom
## Multiple R-squared: 0.7926, Adjusted R-squared: 0.7891
## F-statistic: 225.5 on 6 and 354 DF, p-value: < 2.2e-16
We see a substantial boost in R-squared (from 0.65 to 0.79) and a larger F-statistic after adding interactions between significant variables and a quadratic transformation that fits lstat better. It is also worth noting that all predictors are now highly significant. Because of the interaction terms the coefficients must be read jointly: the marginal effect of an extra room is roughly 10.84 - 0.50 * rad, and the marginal effect of lstat is roughly -0.99 + 0.07 * lstat - 0.05 * rad, so the benefit of extra rooms diminishes, and the penalty for higher lstat deepens, as highway accessibility (rad) increases.
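As an extra sanity check (not part of the original write-up), we can also compare the two models on the held-out rows from the earlier split; the names test and rmse below are introduced purely for illustration:
# Held-out rows from the same caTools::sample.split made earlier
test <- subset(housing, split == FALSE)
rmse <- function(model) sqrt(mean((test$medv - predict(model, newdata = test))^2))
c(baseline = rmse(lm.fit), quadratic.interactions = rmse(lm.fit2))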
graphics::par(mfrow=c(2,2))
graphics::plot(lm.fit2)
The residual analysis shows the following, which supports the assumptions of the linear model (optional numeric checks follow after these observations):
The Residuals vs Fitted plot shows no non-linear pattern between our explanatory variables and the response. The residuals look equally spread around a horizontal line, indicating that a linear relationship is a reasonable model.
In the Normal Q-Q plot the residuals mostly follow the straight line, so they appear approximately normally distributed.
In the Scale-Location plot the residuals are spread equally along the range of fitted values, demonstrating roughly constant variance (homoscedasticity).
In the Residuals vs Leverage plot we see no influential points that we need to be concerned about affecting our regression.
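For anyone who wants a numeric complement to the visual checks above, a short hedged sketch using the lmtest package (an addition, not part of the original analysis) might look like this:
# Optional formal checks of the same assumptions examined in the plots
library(lmtest)                    # provides the Breusch-Pagan test
bptest(lm.fit2)                    # tests for non-constant error variance
shapiro.test(residuals(lm.fit2))   # tests normality of the residuals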
In conclusion, the linear model is appropriate, especially after choosing features and transformations that give a good fit.