Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
housing <- MASS::Boston
str(housing)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
We will explore the Boston housing data from the 1970s, where the main goal is predicting the median value of owner-occupied homes.
I chose this data set because there are many excellent examples online that use it, thanks to its flexibility when modeling. With 14 different features there are many ways to demonstrate a variety of data science tools and approaches.
library(corrplot)
corrplot(cor(housing))
We can see from the correlation plot that a few variables have strong relationships with the median value (denoted medv).
The features that stand out most in the plot are:
percentage of lower-status population (lstat), showing a strong negative correlation;
average number of rooms per dwelling (rm), showing the opposite; and
index of accessibility to radial highways (rad), which will serve as our dichotomous term since the problem calls for one and there are only two candidates in this data set (the other being chas, which does not have as strong a correlation).
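As a quick numeric check of these claims (a small sketch added here, not part of the original write-up), we can rank every feature by its correlation with medv:
# Sort the correlations of all features with medv (base R only)
round(sort(cor(housing)[, 'medv']), 2)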
library(lattice)
housingplot <- housing[, colnames(housing) %in% c('rm','rad','lstat','medv')]
splom(housingplot, col = 'Orange')
We can see that rm appears to be a good candidate for a linear term, while lstat may be a candidate for a quadratic transformation for a better fit.
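To make that curvature concrete, here is an optional sketch (not part of the original analysis) overlaying a simple quadratic fit on the medv vs. lstat scatter; the object names quad.check and ord are introduced purely for illustration:
# Does a quadratic in lstat track medv better than a straight line?
quad.check <- lm(medv ~ lstat + I(lstat^2), data = housing)
ord <- order(housing$lstat)
plot(housing$lstat, housing$medv, col = 'grey50', xlab = 'lstat', ylab = 'medv')
lines(housing$lstat[ord], fitted(quad.check)[ord], col = 'orange', lwd = 2)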
library(caTools)
set.seed(315)
# Keep 75% of the rows for training
split <- sample.split(housing, SplitRatio = 0.75)
train <- subset(housing, split == TRUE)
lm.fit <- lm(medv ~ rm + rad + lstat, data=train)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ rm + rad + lstat, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.706 -3.446 -1.010 1.970 28.684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.46006 3.97632 -0.619 0.5365
## rm 5.24601 0.55653 9.426 <2e-16 ***
## rad -0.08222 0.03814 -2.156 0.0318 *
## lstat -0.57107 0.06049 -9.441 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.348 on 357 degrees of freedom
## Multiple R-squared: 0.6503, Adjusted R-squared: 0.6473
## F-statistic: 221.3 on 3 and 357 DF, p-value: < 2.2e-16
So as a baseline with the hand-picked features we have an adjusted R-squared of about 0.65. Interpreting the coefficients: holding the other predictors constant, each additional room (rm) is associated with an increase of roughly 5.25 in medv (about $5,250, since medv is measured in $1000s), each one-unit increase in the highway-accessibility index (rad) with a decrease of about 0.08, and each one-point increase in lstat with a decrease of about 0.57.
We already saw that lstat was a good candidate for a quadratic term. We will also add interaction terms between our significant variables to meet the requirements of the problem.
lm.fit2 <- lm(medv ~ rm*rad + I(lstat^2) + rad*lstat,
data=train)
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ rm * rad + I(lstat^2) + rad * lstat, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9433 -2.4024 -0.3213 2.0728 22.7037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.268501 5.540741 -7.087 7.45e-12 ***
## rm 10.835141 0.744660 14.550 < 2e-16 ***
## rad 3.868066 0.335161 11.541 < 2e-16 ***
## I(lstat^2) 0.033258 0.004209 7.901 3.53e-14 ***
## lstat -0.993555 0.145486 -6.829 3.73e-11 ***
## rm:rad -0.501455 0.047170 -10.631 < 2e-16 ***
## rad:lstat -0.054053 0.005032 -10.742 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.136 on 354 degrees of freedom
## Multiple R-squared: 0.7926, Adjusted R-squared: 0.7891
## F-statistic: 225.5 on 6 and 354 DF, p-value: < 2.2e-16
We see a substantial boost in R-squared (from 0.65 to 0.79) and a larger F-statistic after adding interactions between significant variables and a quadratic transformation that fits lstat better. It is also worth noting that all predictors are now highly significant. Because of the interaction terms the coefficients must be read jointly: the marginal effect of an extra room is roughly 10.84 - 0.50 * rad, and the marginal effect of lstat is roughly -0.99 + 0.07 * lstat - 0.05 * rad, so the benefit of extra rooms diminishes, and the penalty for higher lstat deepens, as highway accessibility (rad) increases.
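As an extra sanity check (not part of the original write-up), we can also compare the two models on the held-out rows from the earlier split; the names test and rmse below are introduced purely for illustration:
# Held-out rows from the same caTools::sample.split made earlier
test <- subset(housing, split == FALSE)
rmse <- function(model) sqrt(mean((test$medv - predict(model, newdata = test))^2))
c(baseline = rmse(lm.fit), quadratic.interactions = rmse(lm.fit2))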
graphics::par(mfrow=c(2,2))
graphics::plot(lm.fit2)
The residual analysis shows the following, which supports the assumptions of the linear model (optional numeric checks follow after these observations):
The Residuals vs Fitted plot shows no non-linear pattern between our explanatory variables and the response. The residuals look equally spread around a horizontal line, indicating that a linear relationship is a reasonable model.
In the Normal Q-Q plot the residuals mostly follow the straight line, so they appear approximately normally distributed.
In the Scale-Location plot the residuals are spread equally along the range of fitted values, demonstrating roughly constant variance (homoscedasticity).
In the Residuals vs Leverage plot we see no influential points that we need to be concerned about affecting our regression.
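For anyone who wants a numeric complement to the visual checks above, a short hedged sketch using the lmtest package (an addition, not part of the original analysis) might look like this:
# Optional formal checks of the same assumptions examined in the plots
library(lmtest)                    # provides the Breusch-Pagan test
bptest(lm.fit2)                    # tests for non-constant error variance
shapiro.test(residuals(lm.fit2))   # tests normality of the residuals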
In conclusion, the linear model is appropriate, especially after choosing features and transformations that give a good fit.