# display data
library(MASS) # the Boston data set is in the MASS package
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
lm.fit <- lm(medv~., data = Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
boxplot(lm.fit$residuals, ylab = "Residuals")
From the box plot of the residuals, we can see that the median is near 0, which suggests the model fits the training data reasonably well. However, there are many outliers, which suggests a linear model may not be flexible enough.
Some coefficients have large \(p\)-values and some have small ones. A large \(p\)-value indicates that the relationship between that variable and the response is not strong.
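To see where these \(p\)-values come from, here is a quick sketch (base R only) that recomputes the t value and \(p\)-value for crim from the coefficient table:

t_crim <- coef(summary(lm.fit))["crim", "Estimate"] /
  coef(summary(lm.fit))["crim", "Std. Error"] # t = estimate / std. error
2 * pt(-abs(t_crim), df = lm.fit$df.residual) # two-sided p-value, ~0.001087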
The RSE roughly measures the average difference between \(y_i\) and \(\hat{y}_i\), so a small RSE indicates a good fit. In our case, the model is not too bad.
\(R^2\) measures how much of the variability in \(Y\) is explained by \(X\). If the true model is linear, we expect \(R^2\) to be close to 1; in our case it is not high. Adjusted \(R^2\) additionally penalizes the number of predictors \(p\), which matters when \(p\) is large.
The F-statistic tests whether there is any relationship between the predictors and the response. The tiny \(p\)-value strongly indicates that such a relationship exists.
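As a sanity check, all three statistics can be reproduced by hand from the residual sum of squares (a minimal sketch; the values should match the summary above):

rss <- sum(residuals(lm.fit)^2)                  # residual sum of squares
tss <- sum((Boston$medv - mean(Boston$medv))^2)  # total sum of squares
n <- nrow(Boston); p <- length(coef(lm.fit)) - 1 # n = 506, p = 13 predictors
sqrt(rss / (n - p - 1))                          # RSE, ~4.745
1 - rss / tss                                    # R^2, ~0.7406
1 - (rss / (n - p - 1)) / (tss / (n - 1))        # adjusted R^2, ~0.7338
((tss - rss) / p) / (rss / (n - p - 1))          # F-statistic, ~108.1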
plot(lm.fit)
(1): The residuals are \(y_i - \hat{y}_i\) and the fitted values are \(\hat{y}_i\). If the model is linear, the trend line (red line) should be roughly straight, so a U-shaped trend line is an indication of non-linearity in the data.
(2): The Q-Q plot is basically used to assess normality. If the true model is linear, the residuals should be approximately normally distributed, so a roughly straight line indicates normality, while clear departures from the line indicate non-normality. In our case, the plot suggests the true model is not linear.
(3): \(\sqrt{|\text{standardized residuals}|}\) is a rescaled residual, so the Scale-Location plot is similar to the Residuals vs Fitted plot (it is also useful for spotting non-constant variance). A U-shaped trend line again indicates a non-linear relationship.
(4): Leverage measures how much each data point influences the regression line. If the linear model holds, the standardized residuals should be centered near 0, mostly within 2-3 standard deviations, and roughly symmetric around 0. A point far from 0 with few points nearby has a large influence on the regression line. The red contours mark Cook's distance, and we expect all points to fall well inside them.
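Each diagnostic plot can also be drawn individually via the which argument, and leverage and Cook's distance are available directly; a quick sketch using base R's hatvalues() and cooks.distance():

plot(lm.fit, which = 5)     # Residuals vs Leverage only
lev <- hatvalues(lm.fit)    # leverage h_ii for each observation
which.max(lev)              # the highest-leverage observation
max(cooks.distance(lm.fit)) # the largest Cook's distance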
Collinearity means that two or more predictors are closely related to each other. Because of collinearity, it is hard to determine how each predictor contributes to the response: collinearity makes the coefficients difficult to estimate, inflating their standard errors, so we may fail to discover the relationship between some predictors and the response.
Scatterplot matrices give us a quick preview of the correlations between predictors.
# scatterplot matrices
pairs(~crim+rm+black+lstat, data = Boston) # for brevity, plot only a subset rather than the full matrix
In the correlation matrix, an absolute correlation coefficient near 1 indicates a strong linear relationship.
cor(Boston[,1:5]) # for brevity, show only part of the correlation matrix
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.05589158 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.04269672 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.06293803 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.00000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.09120281 1.00000000
However, the correlation matrix cannot reveal collinearity among three or more predictors. The variance inflation factor (VIF) helps detect this: a VIF larger than 5 or 10 indicates collinearity.
library(car) # vif() is provided by the car package
vif(lm.fit)
## crim zn indus chas nox rm age dis
## 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945
## rad tax ptratio black lstat
## 7.484496 9.008554 1.799084 1.348521 2.941491
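Each VIF equals \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on all the other predictors. As a sketch, we can reproduce the VIF for tax by hand:

r2_tax <- summary(lm(tax ~ . - medv, data = Boston))$r.squared # regress tax on the other predictors
1 / (1 - r2_tax)                                               # ~9.01, matching vif() above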
To validate this, check the correlation coefficient of the two predictors with the largest VIFs, tax and rad:
cor(Boston[,"tax"], Boston[,"rad"])
## [1] 0.9102282
plot(Boston$tax, Boston$rad)
An interaction effect can be viewed as "1 + 1 > 2": when two predictors interact, their joint contribution to the response is greater than the sum of their separate contributions. Therefore, we sometimes introduce an \(X_1 \times X_2\) term, where \(X_1, X_2\) are two predictors with an interaction effect.
lm.fit.inter <- lm(medv~lstat+age+lstat:age, data = Boston)
summary(lm.fit.inter)
##
## Call:
## lm(formula = medv ~ lstat + age + lstat:age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
lm.fit1 <- lm(medv~lstat, data = Boston)
plot(Boston$lstat, Boston$medv, xlab = "lstat", ylab = "medv", pch = 20, col = "blue")
abline(lm.fit1, col = "red")
legend("topright","regression line",lty=1, col="red")
The figure shows that the relationship between lstat and medv could be nonlinear, so we introduce an \(lstat^2\) term to fit the model.
lm.fit2 <- lm(medv~lstat+I(lstat^2), data = Boston)
plot(Boston$lstat, Boston$medv, xlab = "lstat", ylab = "medv", pch = 20, col = "blue")
lines(sort(Boston$lstat), fitted(lm.fit2)[order(Boston$lstat)], col='red', lty=1)
legend("topright","regression line",lty=1, col="red")
Compare the statistics of the two fitted models:
summary(lm.fit1)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
The residual standard error, multiple \(R^2\), and F-statistic all indicate that lm.fit2 is better. We can use anova() to check whether this improvement is real or a coincidence.
anova(lm.fit1, lm.fit2)
## Analysis of Variance Table
##
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 503 15347 1 4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The small p-value indicates that the improvement is not a coincidence: the model with the quadratic term is significantly better.
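As a final check (a small sketch), we can redraw the Residuals vs Fitted plot for lm.fit2; if the quadratic term captures the curvature, the U-shaped trend line seen earlier should largely flatten out:

plot(lm.fit2, which = 1) # Residuals vs Fitted for the quadratic model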