Linear Regression (ISLR Chapter 3)

Multiple Linear Regression

Data: Boston from the MASS package

# display the data
library(MASS)  # the Boston dataset ships with the MASS package
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Apply linear regression to the data, choosing medv as the response and all other variables as predictors.

lm.fit <- lm(medv~., data = Boston)
summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Result explanation

  1. Residuals: \(y_i - \hat{y_i}\)
boxplot(lm.fit$residuals, ylab = "Residuals")

From the box plot of the residuals, we can see the median is near 0, which implies our model roughly fits the training data. However, there are many outliers, which suggests a linear model may not be good enough.
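
A quick numeric check of what the box plot shows (a sketch, using the 1.5 * IQR rule that boxplot() applies by default):

res <- residuals(lm.fit)
median(res)                        # should be near 0
q <- quantile(res, c(0.25, 0.75))
iqr <- q[2] - q[1]
sum(res < q[1] - 1.5 * iqr | res > q[2] + 1.5 * iqr)  # count of boxplot outliers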

  2. Coefficients: \(\beta_i\)

We can see that some coefficients have a large \(p\)-value and some have a small one. A large \(p\)-value indicates that the relationship between that variable and the response is not strong.
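
These \(p\)-values can be pulled out of the fitted model directly; for example, a quick sketch that flags predictors whose \(p\)-value exceeds 0.05:

coefs <- summary(lm.fit)$coefficients  # matrix with a "Pr(>|t|)" column
coefs[coefs[, "Pr(>|t|)"] > 0.05, ]    # indus and age in this fit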

  3. Residual standard error: \(RSE\)

RSE roughly measures the average amount by which \(y_i\) differs from \(\hat{y_i}\), so a small RSE indicates a good fit. In our case, the model is not too bad.
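
To make the definition concrete, RSE can be recomputed from the residuals; summary(lm.fit)$sigma reports the same value:

rss <- sum(residuals(lm.fit)^2)                   # residual sum of squares
n <- nrow(Boston); p <- length(coef(lm.fit)) - 1  # 506 observations, 13 predictors
sqrt(rss / (n - p - 1))                           # RSE, ~4.745 as reported above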

  4. R-squared: \(R^2\)

\(R^2\) measures how much of the variability in \(Y\) is explained by \(X\). If the true model is linear, we expect \(R^2\) to be close to 1. In our case, \(R^2\) is not high. Adjusted \(R^2\) additionally penalizes for the number of predictors \(p\), which matters when \(p\) is large.
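
Both quantities follow from the residual and total sums of squares; a self-contained sketch:

rss <- sum(residuals(lm.fit)^2)
tss <- sum((Boston$medv - mean(Boston$medv))^2)
n <- nrow(Boston); p <- length(coef(lm.fit)) - 1
1 - rss / tss                              # multiple R-squared, ~0.7406
1 - (rss / (n - p - 1)) / (tss / (n - 1))  # adjusted R-squared, ~0.7338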

  5. F-statistic:

The F-statistic indicates whether there is any relationship between the predictors and the response. The small \(p\)-value strongly indicates that such a relationship exists.
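
summary() stores the F-statistic, and its \(p\)-value can be recovered from the F distribution:

f <- summary(lm.fit)$fstatistic                             # value, numdf, dendf
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)  # matches the reported p-value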

Diagnostic plots

plot(lm.fit)

(1): Residuals are \(y_i - \hat{y_i}\) and fitted values are \(\hat{y_i}\). If the model is linear, the trend line (red line) should be roughly flat. A U-shaped trend line is therefore an indication of non-linearity in the data.

(2): The Q-Q plot is basically used to assess normality. If the true model is linear, the residuals should be approximately normally distributed, so departure from a straight line indicates non-normality, while a roughly straight line indicates normality. In our case, the plot suggests the true model is not linear.
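
As a numeric complement to the Q-Q plot, a normality test on the residuals points the same way (a sketch using the Shapiro-Wilk test):

shapiro.test(residuals(lm.fit))  # a small p-value rejects normality of the residuals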

(3): \(\sqrt{|\text{standardized residuals}|}\) is a rescaling of the residuals, so the Scale-Location plot is read much like the Residuals vs Fitted plot: a U-shaped trend line indicates a non-linear relationship.

(4): Leverage measures how much each data point influences the regression line. If the linear model holds, the standardized residuals should be centered near 0, stay within roughly 2-3 standard deviations of 0, and be spread symmetrically around 0. A point far from 0 in a region with few neighbors (high leverage) exerts a large influence on the regression line. The dashed red contours mark Cook's distance; we expect the red trend line to stay close to horizontal and no points to cross the contours.
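
The quantities behind this panel are also available directly, which makes it easy to list the most influential observations (a sketch):

lev <- hatvalues(lm.fit)                                  # leverage of each observation
which.max(lev)                                            # the highest-leverage point
head(sort(cooks.distance(lm.fit), decreasing = TRUE), 3)  # largest Cook's distances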

Collinearity Checking

Collinearity refers to two or more predictors being closely related to one another. Because of collinearity, it is hard to determine how each predictor contributes to the response. Collinearity also makes the coefficients difficult to estimate, inflating their standard errors, so we may fail to discover the relationship for some predictors.

1. Correlation matrix

Scatterplot matrices give us a preview of the correlations between predictors.

# scatterplot matrices
pairs(~ crim + rm + black + lstat, data = Boston)  # for illustration only; not the whole scatterplot matrix

In the correlation matrix, an absolute correlation coefficient near 1 indicates a strong linear relationship.

cor(Boston[, 1:5])  # for illustration only; just part of the correlation matrix
##              crim          zn       indus        chas         nox
## crim   1.00000000 -0.20046922  0.40658341 -0.05589158  0.42097171
## zn    -0.20046922  1.00000000 -0.53382819 -0.04269672 -0.51660371
## indus  0.40658341 -0.53382819  1.00000000  0.06293803  0.76365145
## chas  -0.05589158 -0.04269672  0.06293803  1.00000000  0.09120281
## nox    0.42097171 -0.51660371  0.76365145  0.09120281  1.00000000
2. Variance inflation factor (VIF)

However, a correlation matrix cannot reveal collinearity among three or more predictors. The variance inflation factor (VIF) can help us detect it: VIF values larger than 5 or 10 indicate collinearity.

library(car)  # vif() comes from the car package
vif(lm.fit)
##     crim       zn    indus     chas      nox       rm      age      dis 
## 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945 
##      rad      tax  ptratio    black    lstat 
## 7.484496 9.008554 1.799084 1.348521 2.941491
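
VIF can also be computed by hand as \(VIF_j = 1 / (1 - R^2_j)\), where \(R^2_j\) comes from regressing predictor \(j\) on all the other predictors; a quick check for tax:

r2.tax <- summary(lm(tax ~ . - medv, data = Boston))$r.squared  # tax on all other predictors
1 / (1 - r2.tax)                                                # ~9.0, matching vif(lm.fit)["tax"]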

To validate this, check the correlation coefficient for the two largest-VIF predictors, tax and rad.

cor(Boston[,"tax"], Boston[,"rad"])
## [1] 0.9102282
plot(Boston$tax, Boston$rad)

Interaction term

An interaction effect can be viewed as “1+1>2”: when two predictors interact, their joint contribution to the response is greater than the sum of their separate contributions. Therefore, we sometimes introduce an \(X_1 \times X_2\) term, where \(X_1, X_2\) are two predictors with an interaction effect.

lm.fit.inter <- lm(medv~lstat+age+lstat:age, data = Boston)
summary(lm.fit.inter)
## 
## Call:
## lm(formula = medv ~ lstat + age + lstat:age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.806  -4.045  -1.333   2.085  27.552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
## lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
## age         -0.0007209  0.0198792  -0.036   0.9711    
## lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared:  0.5557, Adjusted R-squared:  0.5531 
## F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16
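
Note the shorthand in R's formula syntax: lstat * age expands to lstat + age + lstat:age, so the following fits the same model:

lm.fit.inter2 <- lm(medv ~ lstat * age, data = Boston)
all.equal(coef(lm.fit.inter), coef(lm.fit.inter2))  # TRUE: identical coefficients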

Nonlinear relationship

lm.fit1 <- lm(medv~lstat, data = Boston)
plot(Boston$lstat, Boston$medv, xlab = "lstat", ylab = "medv", pch = 20, col = "blue")
abline(lm.fit1, col = "red")
legend("topright","regression line",lty=1, col="red")

The figure shows that the relationship between lstat and medv may be nonlinear, so we introduce an \(lstat^2\) term to fit the model.

lm.fit2 <- lm(medv~lstat+I(lstat^2), data = Boston)
plot(Boston$lstat, Boston$medv, xlab = "lstat", ylab = "medv", pch = 20, col = "blue")
lines(sort(Boston$lstat), fitted(lm.fit2)[order(Boston$lstat)], col='red', lty=1) 
legend("topright","regression line",lty=1, col="red")

Compare the statistics of these two fitted models.

summary(lm.fit1)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
summary(lm.fit2)
## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
## lstat       -2.332821   0.123803  -18.84   <2e-16 ***
## I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

The residual standard error, multiple R-squared, and F-statistic all indicate that lm.fit2 is better. We can use anova() to check whether this improvement is a coincidence.

anova(lm.fit1, lm.fit2)
## Analysis of Variance Table
## 
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
##   Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
## 1    504 19472                                 
## 2    503 15347  1    4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The small p-value indicates the improvement is not a coincidence: the model with the quadratic term is a significantly better fit.
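
Beyond a hand-written quadratic term, higher-degree polynomial fits can be written with poly(). A brief sketch: poly() uses orthogonal polynomials by default, so the coefficients differ from the I(lstat^2) fit, but the fitted values are identical:

lm.fit2b <- lm(medv ~ poly(lstat, 2), data = Boston)
all.equal(fitted(lm.fit2), fitted(lm.fit2b))  # TRUE: the same fitted values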