This exercise is intended to practice fitting linear regression models and to build conceptual clarity about what needs to be done in a data-science setting.

The following library packages are required for the analysis.

library(MASS)
library(ISLR)

Let us start the exercise with the Boston housing data set (from the MASS package) and have a look at the variables it contains.

fix(Boston)
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

The response variable is "medv", the median value of owner-occupied homes. Let us attempt to fit a linear model to this data set, starting with the single predictor lstat (the percentage of the population of lower socioeconomic status).

attach(Boston)
Boslin <- lm(medv~lstat,data=Boston)
summary(Boslin)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15.17  -3.99  -1.32   2.03  24.50 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.5538     0.5626    61.4   <2e-16 ***
## lstat        -0.9500     0.0387   -24.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.22 on 504 degrees of freedom
## Multiple R-squared:  0.544,  Adjusted R-squared:  0.543 
## F-statistic:  602 on 1 and 504 DF,  p-value: <2e-16

Looking at the statistics, we can see that the predictor accounts for about 54% of the variability in the response (R-squared = 0.544). The p-value is also highly significant, indicating a strong relationship between the predictor and the response.
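
If we want these figures programmatically rather than reading them off the printed summary, they can be extracted directly from the summary object. A small sketch using the model fitted above:

Bossum <- summary(Boslin)
Bossum$r.squared        # proportion of the variance in medv explained by lstat
Bossum$adj.r.squared    # adjusted R-squared
coef(Bossum)            # coefficient table: estimates, standard errors, t and p values
confint(Boslin)         # 95% confidence intervals for the coefficients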

Let us now produce confidence and prediction intervals with the predict() function in R. Note that we have to supply some arbitrary values for the lstat predictor.

Bospred <- predict(Boslin,data.frame(lstat=(c(5,10,15))),interval="confidence")
Bospred
##     fit   lwr   upr
## 1 29.80 29.01 30.60
## 2 25.05 24.47 25.63
## 3 20.30 19.73 20.87
Bospred2 <- predict(Boslin,data.frame(lstat=(c(5,10,15))),interval="prediction")
Bospred2
##     fit    lwr   upr
## 1 29.80 17.566 42.04
## 2 25.05 12.828 37.28
## 3 20.30  8.078 32.53
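
Note that the prediction intervals are considerably wider than the confidence intervals at the same values of lstat, because they account for the irreducible error of a single observation as well as the uncertainty in the fitted line. A small sketch of how both bands could be drawn over a grid of lstat values (the grid, the object names and the matlines() styling are my own illustrative choices):

lstat.grid <- data.frame(lstat = seq(min(Boston$lstat), max(Boston$lstat), length.out = 100))
conf.band <- predict(Boslin, lstat.grid, interval = "confidence")
pred.band <- predict(Boslin, lstat.grid, interval = "prediction")
plot(Boston$lstat, Boston$medv, col = "grey", xlab = "lstat", ylab = "medv")
matlines(lstat.grid$lstat, conf.band, lty = c(1, 2, 2), col = "blue")  # fit plus confidence band
matlines(lstat.grid$lstat, pred.band[, 2:3], lty = 3, col = "red")     # wider prediction band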

We will plot the predictor and the response along with the fitted regression line.

plot(lstat,medv,col="red")

[Figure: scatter plot of medv against lstat]

plot(lstat,medv,pch=20)

[Figure: scatter plot of medv against lstat with filled points]

plot(lstat,medv,pch="+")

[Figure: scatter plot of medv against lstat using "+" symbols]

abline(Boslin,lwd=3,col="blue") ## add the fitted regression line to the scatter plot

[Figure: scatter plot of medv against lstat with the fitted regression line overlaid]

plot(1:50,1:50,pch=1:50) ## This produces a plot of the various plotting symbols
## Warning: unimplemented pch value '26'
## Warning: unimplemented pch value '27'
## Warning: unimplemented pch value '28'
## Warning: unimplemented pch value '29'
## Warning: unimplemented pch value '30'
## Warning: unimplemented pch value '31'

[Figure: demonstration of the available plotting symbols]

From the plots it is evident that the relationship between medv and lstat is not completely linear; there is a fair amount of non-linearity in the relationship.
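
One way to make this non-linearity easier to see is to overlay a non-parametric smoother on the scatter plot alongside the least-squares line. A quick sketch; the lowess() smoother and the colours are just illustrative choices:

plot(lstat, medv, pch = 20, col = "grey")
abline(Boslin, lwd = 2, col = "blue")             # least-squares line
lines(lowess(lstat, medv), lwd = 2, col = "red")  # the smoother follows the curvature in the data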

Let us create some diagnostic plots from the data.

par(mfrow=c(2,2))
plot(Boslin)

[Figure: 2x2 panel of diagnostic plots for the Boslin model]

Plotting the model produces four diagnostic plots.

The first is the residuals versus fitted values plot. A visible pattern in the residuals indicates a non-linear relationship. Another useful diagnostic is the plot of scaled (studentized) residuals. These two plots can also be produced separately using the following plotting functions.

plot(predict(Boslin),residuals(Boslin),pch=20,col="blue")

[Figure: residuals versus fitted values]

plot(predict(Boslin),rstudent(Boslin),pch=20,col="green")

[Figure: studentized residuals versus fitted values]

The leverage of each observation can be computed using the hatvalues() function. To find the index of the data point with the largest leverage, we can use the which.max() function.

plot(hatvalues(Boslin))

[Figure: leverage (hat) values for each observation]

which.max(hatvalues(Boslin))
## 375 
## 375
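
A common rule of thumb is to compare each hat value with the average leverage (p + 1)/n, where p is the number of predictors. A small sketch; the cut-off of twice the average is a conventional choice, not something prescribed by the exercise:

hv <- hatvalues(Boslin)
avg.lev <- length(coef(Boslin)) / length(hv)  # average leverage (p + 1)/n for this model
which(hv > 2 * avg.lev)                       # observations with unusually high leverage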

Multiple regression models

We will next fit a multiple regression model using all the predictors, to examine the relationship between the full set of variables and the outcome variable.

attach(Boston)
## The following objects are masked from Boston (pos = 3):
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio,
##     rad, rm, tax, zn
Bosall <- lm(medv~.,data=Boston)
summary(Bosall)
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.594  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.65e+01   5.10e+00    7.14  3.3e-12 ***
## crim        -1.08e-01   3.29e-02   -3.29  0.00109 ** 
## zn           4.64e-02   1.37e-02    3.38  0.00078 ***
## indus        2.06e-02   6.15e-02    0.33  0.73829    
## chas         2.69e+00   8.62e-01    3.12  0.00193 ** 
## nox         -1.78e+01   3.82e+00   -4.65  4.2e-06 ***
## rm           3.81e+00   4.18e-01    9.12  < 2e-16 ***
## age          6.92e-04   1.32e-02    0.05  0.95823    
## dis         -1.48e+00   1.99e-01   -7.40  6.0e-13 ***
## rad          3.06e-01   6.63e-02    4.61  5.1e-06 ***
## tax         -1.23e-02   3.76e-03   -3.28  0.00111 ** 
## ptratio     -9.53e-01   1.31e-01   -7.28  1.3e-12 ***
## black        9.31e-03   2.69e-03    3.47  0.00057 ***
## lstat       -5.25e-01   5.07e-02  -10.35  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.75 on 492 degrees of freedom
## Multiple R-squared:  0.741,  Adjusted R-squared:  0.734 
## F-statistic:  108 on 13 and 492 DF,  p-value: <2e-16

From the statistics we can see that the R-squared has improved from around 54% to around 74%, meaning that about 74% of the variability in the response is now explained by the model. Looking at the p-values, most of them are significant, indicating a strong relationship between those variables and the outcome. However, two variables in the model, "indus" and "age", have very high p-values, indicating that they are not significant.
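
Rather than scanning the printed table by eye, we could also extract the p-values from the coefficient matrix and sort them, which makes the weakest predictors easy to spot. A small sketch:

pvals <- coef(summary(Bosall))[-1, "Pr(>|t|)"]  # p-values for the predictors, dropping the intercept
sort(pvals, decreasing = TRUE)                  # age and indus should appear at the top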

As a next step, let us analyse this further by creating three separate models, using these variables individually and then together, and check their p-values.

indmod <- lm(medv~indus,data=Boston)
agemod <- lm(medv~age,data=Boston)
combmod <- lm(medv~indus+age,data=Boston)

summary(indmod)
## 
## Call:
## lm(formula = medv ~ indus, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13.02  -4.92  -1.46   3.18  32.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.7549     0.6834    43.5   <2e-16 ***
## indus        -0.6485     0.0523   -12.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.06 on 504 degrees of freedom
## Multiple R-squared:  0.234,  Adjusted R-squared:  0.232 
## F-statistic:  154 on 1 and 504 DF,  p-value: <2e-16
summary(agemod)
## 
## Call:
## lm(formula = medv ~ age, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15.10  -5.14  -1.96   2.40  31.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.9787     0.9991   31.01   <2e-16 ***
## age          -0.1232     0.0135   -9.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.53 on 504 degrees of freedom
## Multiple R-squared:  0.142,  Adjusted R-squared:  0.14 
## F-statistic: 83.5 on 1 and 504 DF,  p-value: <2e-16
summary(combmod)
## 
## Call:
## lm(formula = medv ~ indus + age, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12.54  -4.80  -1.75   2.80  33.21 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  31.1779     0.9409   33.14   <2e-16 ***
## indus        -0.5522     0.0681   -8.11    4e-15 ***
## age          -0.0364     0.0166   -2.19    0.029 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.03 on 503 degrees of freedom
## Multiple R-squared:  0.241,  Adjusted R-squared:  0.238 
## F-statistic:   80 on 2 and 503 DF,  p-value: <2e-16

From the summary statistics it can be seen that the R-squared values for all three models are quite low (around 0.23, 0.14 and 0.24 respectively). Another point to note is that both variables are significant when viewed individually or together in this small model, yet they become insignificant in the model containing all the variables. This can be caused by collinearity between some of the variables. Let us check the Variance Inflation Factor (VIF), which is a measure of collinearity, for the bigger model.

library(car)
vif(Bosall)
##    crim      zn   indus    chas     nox      rm     age     dis     rad 
##   1.792   2.299   3.992   1.074   4.394   1.934   3.101   3.956   7.484 
##     tax ptratio   black   lstat 
##   9.009   1.799   1.349   2.941

The VIF values for the predictors "indus" and "age" are not particularly high, so there does not appear to be any serious collinearity involving these variables. However, the VIF output does reveal something else of interest: the relatively high values for "tax" and "rad" suggest collinearity between those two variables, which would have to be investigated further.
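
As a quick first check of the suspected collinearity between tax and rad, we could look at their pairwise correlation. A minimal sketch; a fuller investigation would examine the wider correlation structure of the predictors:

cor(Boston$tax, Boston$rad)                               # pairwise correlation between tax and rad
round(cor(Boston[, c("tax", "rad", "indus", "nox")]), 2)  # a few related predictors together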

Since age and indus are not significant, let us try a new model with these predictors removed and see what the impact on the overall statistics is. The update() function lets us exclude the two variables from the existing model.

Bosall1 <- update(Bosall,~.-age)
Bosall2 <- update(Bosall1,~.-indus)
summary(Bosall2)
## 
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.598  -2.739  -0.505   1.727  26.237 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.34115    5.06749    7.17  2.7e-12 ***
## crim         -0.10841    0.03278   -3.31  0.00101 ** 
## zn            0.04584    0.01352    3.39  0.00075 ***
## chas          2.71872    0.85424    3.18  0.00155 ** 
## nox         -17.37602    3.53524   -4.92  1.2e-06 ***
## rm            3.80158    0.40632    9.36  < 2e-16 ***
## dis          -1.49271    0.18573   -8.04  6.8e-15 ***
## rad           0.29961    0.06340    4.73  3.0e-06 ***
## tax          -0.01178    0.00337   -3.49  0.00052 ***
## ptratio      -0.94652    0.12907   -7.33  9.2e-13 ***
## black         0.00929    0.00267    3.47  0.00056 ***
## lstat        -0.52255    0.04742  -11.02  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.74 on 494 degrees of freedom
## Multiple R-squared:  0.741,  Adjusted R-squared:  0.735 
## F-statistic:  128 on 11 and 494 DF,  p-value: <2e-16

From the summary statistics we can see that there is no major change in the R-squared value, which means these two predictors can safely be excluded from the model.
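
This can also be checked formally with an F-test comparing the reduced model against the full model; a non-significant p-value would support dropping the two predictors. A short sketch:

anova(Bosall2, Bosall)  # does adding age and indus back improve the fit significantly?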

Interaction terms

Next we will investigate the significance of interaction terms within the model.

In a model formula, lstat:black includes only the interaction term, while lstat*black is shorthand for lstat + black + lstat:black, i.e. the main effects plus the interaction. Let us check out the model.

Bosint <- lm(medv~lstat*black,data=Boston)
summary(Bosint)
## 
## Call:
## lm(formula = medv ~ lstat * black, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15.67  -3.92  -1.15   1.94  24.83 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.334465   3.752731    4.89  1.4e-06 ***
## lstat       -0.260773   0.177290   -1.47  0.14195    
## black        0.042666   0.009833    4.34  1.7e-05 ***
## lstat:black -0.001806   0.000476   -3.80  0.00017 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.11 on 502 degrees of freedom
## Multiple R-squared:  0.561,  Adjusted R-squared:  0.559 
## F-statistic:  214 on 3 and 502 DF,  p-value: <2e-16

From the statistics we can see that the interaction term has a significant p-value, so there is evidence of an interaction effect. The R-squared has also increased marginally.
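
We could also compare the interaction model against a main-effects-only model to confirm that the interaction term is worth keeping. A small sketch; Bosmain is a name introduced here for illustration:

Bosmain <- lm(medv ~ lstat + black, data = Boston)  # main effects only
anova(Bosmain, Bosint)                              # tests whether lstat:black adds explanatory power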

Non-linear transformations of the predictors

As seen from the earlier examples and the residual plot, there is a fair amount of non-linearity in the relationship. Let us try adding a higher-order polynomial term to the model and see if we get a better fit.

Bossq <- lm(medv~lstat+I(lstat^2),data=Boston)
Bossing <- lm(medv~lstat,data=Boston)
summary(Bossq)
## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15.28  -3.83  -0.53   2.31  25.41 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.86201    0.87208    49.1   <2e-16 ***
## lstat       -2.33282    0.12380   -18.8   <2e-16 ***
## I(lstat^2)   0.04355    0.00375    11.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.52 on 503 degrees of freedom
## Multiple R-squared:  0.641,  Adjusted R-squared:  0.639 
## F-statistic:  449 on 2 and 503 DF,  p-value: <2e-16

From this model we can see that the R-squared has increased substantially. We can perform an ANOVA test to compare the two models, and we can also plot the diagnostics of the new model.

anova(Bossq,Bossing)
## Analysis of Variance Table
## 
## Model 1: medv ~ lstat + I(lstat^2)
## Model 2: medv ~ lstat
##   Res.Df   RSS Df Sum of Sq   F Pr(>F)    
## 1    503 15347                            
## 2    504 19472 -1     -4125 135 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2))
plot(Bossq)

[Figure: 2x2 panel of diagnostic plots for the quadratic model]

From the ANOVA test the p-value is virtually zero, so we reject the null hypothesis that the two models fit the data equally well; the model containing the quadratic term is clearly superior.

Also, from the residual plot we can see that there is still a slight pattern in the residuals, indicating some remaining non-linearity.

We can also try higher-order polynomial terms in the model. This is most conveniently done with the poly() function. Let us fit a fifth-order polynomial and check the result.

Bospol <- lm(medv~poly(lstat,5),data=Boston)
summary(Bospol)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.543  -3.104  -0.705   2.084  27.115 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.533      0.232   97.20  < 2e-16 ***
## poly(lstat, 5)1 -152.460      5.215  -29.24  < 2e-16 ***
## poly(lstat, 5)2   64.227      5.215   12.32  < 2e-16 ***
## poly(lstat, 5)3  -27.051      5.215   -5.19  3.1e-07 ***
## poly(lstat, 5)4   25.452      5.215    4.88  1.4e-06 ***
## poly(lstat, 5)5  -19.252      5.215   -3.69  0.00025 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.21 on 500 degrees of freedom
## Multiple R-squared:  0.682,  Adjusted R-squared:  0.679 
## F-statistic:  214 on 5 and 500 DF,  p-value: <2e-16

As seen from the statistics, the poly() fit gives a better R-squared. A question can arise: why don't we keep adding polynomial terms and make a perfect fit? The flip side of this approach is the phenomenon called overfitting. We have to be cognizant of the fact that the aim of our model is to generalize well to new data, not merely to fit the training data. A model which fits the training data almost perfectly can perform very badly on test or validation data because of overfitting, so we have to be judicious in selecting the model we use for our analysis.
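
A quick way to see overfitting in action is to hold out part of the data, fit polynomials of increasing degree on the training portion, and compare the test-set error: beyond a moderate degree the test error typically stops improving or worsens even as the training fit keeps getting better. A small sketch along those lines; the 50/50 split, the seed and the range of degrees are arbitrary illustrative choices:

set.seed(1)
train <- sample(nrow(Boston), nrow(Boston) / 2)
test.mse <- sapply(1:8, function(d) {
  fit <- lm(medv ~ poly(lstat, d), data = Boston, subset = train)
  mean((Boston$medv[-train] - predict(fit, Boston[-train, ]))^2)  # mean squared error on the held-out half
})
round(test.mse, 2)  # test MSE by polynomial degree; note where it stops improving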