Let’s explain crime rates in Boston -> the “Boston” dataset from the MASS package.
library(MASS)
#?Boston
dim(Boston)
## [1] 506 14
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
What is our dependent variable?
model1 <- lm(Boston$crim~Boston$black)
summary(model1) # getting results
##
## Call:
## lm(formula = Boston$crim ~ Boston$black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.756 -2.299 -2.095 -1.296 86.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.553529 1.425903 11.609 <2e-16 ***
## Boston$black -0.036280 0.003873 -9.367 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared: 0.1483, Adjusted R-squared: 0.1466
## F-statistic: 87.74 on 1 and 504 DF, p-value: < 2.2e-16
Describe the trend. Is the relation positive or negative? Strong or weak? What is the explanatory power?
Here we see a rather weak negative trend (estimate = -0.036). The model explains about 15% of the variability in the crime rate.
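A quick way to see this trend is to plot the data with the fitted line overlaid (a minimal sketch using base graphics):
plot(Boston$black, Boston$crim, xlab = "black", ylab = "crim") # scatterplot of the raw data
abline(model1, col = "red") # overlay the fitted regression line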
model3 <- lm(Boston$crim~Boston$chas+Boston$black)
summary(model3) #does something change?
##
## Call:
## lm(formula = Boston$crim ~ Boston$chas + Boston$black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.794 -2.380 -2.173 -0.995 86.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.579673 1.426456 11.623 <2e-16 ***
## Boston$chas -1.259561 1.394069 -0.904 0.367
## Boston$black -0.036109 0.003878 -9.310 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.948 on 503 degrees of freedom
## Multiple R-squared: 0.1497, Adjusted R-squared: 0.1463
## F-statistic: 44.26 on 2 and 503 DF, p-value: < 2.2e-16
The adjusted R-squared (0.1463) is practically the same, so we cannot say whether this model is better or worse than the previous one. We can also see that there is no significant relation between a town’s crime rate and whether it bounds the Charles River (chas).
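The insignificance of chas can also be seen from its confidence interval, which should include zero (a minimal sketch):
confint(model3) # the interval for Boston$chas crosses zero -> not significant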
model4 <- update(model3, ~.+Boston$indus) # "." means "the same predictors as in the previous model"
summary(model4)
##
## Call:
## lm(formula = Boston$crim ~ Boston$chas + Boston$black + Boston$indus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.454 -2.078 -0.653 0.513 83.496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.441410 1.736139 4.862 1.56e-06 ***
## Boston$chas -2.116562 1.328385 -1.593 0.112
## Boston$black -0.025425 0.003949 -6.439 2.81e-10 ***
## Boston$indus 0.393925 0.052588 7.491 3.11e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.545 on 502 degrees of freedom
## Multiple R-squared: 0.2351, Adjusted R-squared: 0.2306
## F-statistic: 51.45 on 3 and 502 DF, p-value: < 2.2e-16
# chas is still insignificant, so drop it
model5 <- update(model4, ~.-Boston$chas)
summary(model5)
##
## Call:
## lm(formula = Boston$crim ~ Boston$black + Boston$indus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.376 -2.182 -0.649 0.516 83.712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.546889 1.737527 4.919 1.18e-06 ***
## Boston$black -0.025906 0.003943 -6.570 1.26e-10 ***
## Boston$indus 0.386709 0.052472 7.370 7.07e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.557 on 503 degrees of freedom
## Multiple R-squared: 0.2313, Adjusted R-squared: 0.2282
## F-statistic: 75.67 on 2 and 503 DF, p-value: < 2.2e-16
1 Multicollinearity – an important problem
What does it mean? What is the threat of multicollinearity?
library(car)
vif(model4) # easy way to check for multicollinearity
## Boston$chas Boston$black Boston$indus
## 1.009877 1.152777 1.154607
All VIF values here (1.01–1.15) are below 5, so everything’s OK. A value above 10 would mean that the set of independent variables must be changed!
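As a sanity check, a VIF can be computed by hand as 1 / (1 - R-squared) from regressing one predictor on the remaining ones (a minimal sketch; the auxiliary model below is illustrative):
aux <- lm(indus ~ chas + black, data = Boston) # regress one predictor on the others
1 / (1 - summary(aux)$r.squared) # should match vif(model4) for Boston$indus (~1.15)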
2 Outliers and leverages
They affect the values of the coefficients in our regression (slope and intercept).
outlierTest(model4) # Bonferroni p-value for most extreme obs
## rstudent unadjusted p-value Bonferroni p
## 381 12.763595 1.6802e-32 8.5020e-30
## 406 8.874146 1.2515e-17 6.3325e-15
## 419 8.379306 5.4015e-16 2.7331e-13
## 411 4.909203 1.2393e-06 6.2710e-04
## 405 4.653195 4.1891e-06 2.1197e-03
## 399 4.450539 1.0565e-05 5.3458e-03
## 415 4.417722 1.2232e-05 6.1895e-03
qqPlot(model4, main="QQ Plot") #qq plot for studentized resid
## [1] 381 406
How many outliers do we have?
leveragePlots(model4) # leverage plots
par(mfrow=c(2,2)) # to put 4 graphs on 1 screen
plot(model4) # diagnostics plots
What can you say about our regression?
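Influence can also be quantified directly with Cook’s distance (a minimal sketch; the 4/n cutoff is only a common rule of thumb):
cooksd <- cooks.distance(model4) # influence of each observation on the fitted coefficients
which(cooksd > 4/length(cooksd)) # flag observations above the rule-of-thumb cutoff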
3 Other tests to assess the adequacy of our model
Distribution of studentized residuals:
library(MASS)
sresid <- studres(model4) # studentized residuals
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length=40) # grid over the residual range
yfit <- dnorm(xfit) # standard normal density
lines(xfit, yfit) # overlay the normal curve on the histogram
The distribution of the studentized residuals IS NOT NORMAL.
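A formal test backs up the visual impression (a minimal sketch; shapiro.test() comes from base R’s stats package):
shapiro.test(sresid) # a significant p-value -> reject normality of the residuals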
4 Evaluate homoscedasticity
Non-constant error variance test:
ncvTest(model4) # if significant -> sign of heteroscedasticity
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 329.4616, Df = 1, p = < 2.22e-16
# OR
library(lmtest)
bptest(model4)# if significant -> sign of heteroscedasticity
##
## studentized Breusch-Pagan test
##
## data: model4
## BP = 13.797, df = 3, p-value = 0.003195
Let’s try to solve our “heteroscedasticity” problem.
# install.packages('e1071', dependencies=TRUE)
crimBCMod <- caret::BoxCoxTrans(Boston$crim)
print(crimBCMod)
## Box-Cox Transformation
##
## 506 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
##
## Largest/Smallest: 14100
## Sample Skewness: 5.19
##
## Estimated Lambda: -0.1
## With fudge factor, Lambda = 0 will be used for transformations
# transformation value is "estimated Lambda"
Boston$crim1<-predict(crimBCMod, Boston$crim)
hist(Boston$crim)
hist(Boston$crim1)
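Since the estimated lambda was rounded to 0, the Box-Cox transformation reduces to a plain logarithm; a quick check (a minimal sketch, assuming caret applies log() when Lambda = 0):
all.equal(Boston$crim1, log(Boston$crim)) # expected TRUE: Lambda = 0 -> log transform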
# and refit model4 on the transformed outcome
model4.1 <- lm(Boston$crim1 ~ Boston$chas + Boston$indus + Boston$black)
ncvTest(model4.1) # did not solve the problem
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 97.17714, Df = 1, p = < 2.22e-16
# deleting an influential outlier (leverage point): 419
Boston1 <- Boston[-c(419), ]
# new predictor, welcome
Boston1$rad <- as.factor(Boston1$rad)
plot(Boston1$rad)
So, for this task I had to build a model with a factor variable. As there are not many factor variables, and one (chas) was already used, I’m taking the ‘rad’ variable: the index of accessibility to radial highways, with 9 levels. Its distribution can be seen above.
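The counts per level can also be checked numerically (a minimal sketch):
table(Boston1$rad) # number of observations at each rad level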
# new model and interpretation of its results
modelNew1 <- lm(Boston1$crim~Boston1$rad)
summary(modelNew1)
##
## Call:
## lm(formula = Boston1$crim ~ Boston1$rad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.917 -0.593 -0.076 0.086 76.681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03603 1.37235 0.026 0.979
## Boston1$rad2 0.04726 1.85817 0.025 0.980
## Boston1$rad3 0.06133 1.69546 0.036 0.971
## Boston1$rad4 0.35787 1.49190 0.240 0.811
## Boston1$rad5 0.65176 1.48690 0.438 0.661
## Boston1$rad6 0.11403 1.82539 0.062 0.950
## Boston1$rad7 0.11437 2.02461 0.056 0.955
## Boston1$rad8 0.33538 1.85817 0.180 0.857
## Boston1$rad24 12.25933 1.47339 8.321 8.54e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.137 on 496 degrees of freedom
## Multiple R-squared: 0.4245, Adjusted R-squared: 0.4152
## F-statistic: 45.74 on 8 and 496 DF, p-value: < 2.2e-16
For now the model seems quite nice: the p-value is low enough (< .05) and the adjusted R-squared is 0.4152, which means the model explains about 41.5% of the variance of the predicted variable (crime rate). As for the significant factors: when the index of accessibility to radial highways equals 24, its relation to the crime rate is positive. In other words, towns with rad = 24 have a crime rate about 12.3 higher than those at the baseline level (rad = 1).
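With a single factor as the only predictor, the fitted values are just the group means, so the coefficients can be recovered directly (a minimal sketch): the intercept is the mean for rad = 1, and each coefficient is that level’s difference from it.
tapply(Boston1$crim, Boston1$rad, mean) # per-level means; mean(rad=24) - mean(rad=1) is about 12.26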
NewModel <- lm(Boston1$crim ~ Boston1$rad + Boston1$black)
summary(NewModel)
##
## Call:
## lm(formula = Boston1$crim ~ Boston1$rad + Boston1$black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.435 -0.428 -0.084 0.093 77.443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.816147 1.898510 1.483 0.1386
## Boston1$rad2 0.026837 1.851751 0.014 0.9884
## Boston1$rad3 0.083791 1.689612 0.050 0.9605
## Boston1$rad4 0.311084 1.486893 0.209 0.8344
## Boston1$rad5 0.508351 1.483303 0.343 0.7320
## Boston1$rad6 0.100410 1.819077 0.055 0.9560
## Boston1$rad7 0.108371 2.017590 0.054 0.9572
## Boston1$rad8 0.306846 1.851776 0.166 0.8685
## Boston1$rad24 11.551505 1.506072 7.670 9.2e-14 ***
## Boston1$black -0.007142 0.003383 -2.111 0.0353 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.116 on 495 degrees of freedom
## Multiple R-squared: 0.4297, Adjusted R-squared: 0.4193
## F-statistic: 41.43 on 9 and 495 DF, p-value: < 2.2e-16
I’ve also decided to add one more predictor, black. As I remember, it was quite a significant factor. Now I see that it is still significant and has a negative relation to the crime rate, but it is quite a weak predictor. As for the explanatory power of the whole model, it increased just a little: the model now explains almost 42% of the variance of the predicted variable (crime rate).
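Since modelNew1 and NewModel are nested (the second just adds black), the improvement can also be tested formally with an F test, anticipating the section on nested models below (a minimal sketch):
anova(modelNew1, NewModel) # a significant p-value -> adding black improves the fit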
# Checking:
# multicollinearity
vif(NewModel)
## GVIF Df GVIF^(1/(2*Df))
## Boston1$rad 1.252151 8 1.014153
## Boston1$black 1.252151 1 1.118996
# outliers
outlierTest(NewModel)
## rstudent unadjusted p-value Bonferroni p
## 381 15.512078 1.6898e-44 8.5334e-42
## 406 10.165554 3.5962e-22 1.8161e-19
## 411 6.355010 4.7410e-10 2.3942e-07
## 415 5.437406 8.5137e-08 4.2994e-05
## 405 4.959572 9.7322e-07 4.9148e-04
## 399 4.494258 8.7052e-06 4.3961e-03
## 427 3.946641 9.0751e-05 4.5829e-02
qqPlot(NewModel, main="QQ Plot")
## [1] 381 406
# leverages
leveragePlots(NewModel) # leverage plots
par(mfrow=c(2,2)) # to put 4 graphs on 1 screen
plot(NewModel) # diagnostics plots
Nested models – one of the models contains ALL the predictors of the other. If you delete the shared factors from both models, one of them becomes EMPTY: Y1 = x1 + x2 + x3; Y2 = x1 + x2 + x3 + x4 + x5.
Non-nested models – each model has predictors of its own. Two types of non-nested models:
partly non-nested: Y1 = x1 + x2 + x7; Y2 = x1 + x2 + x3 + x4 + x5
completely non-nested: Y1 = x1 + x2 + x3; Y2 = x4 + x5 + x6
Comparing nested models -> anova {stats}. It compares the residual (unexplained) variances while controlling for the degrees of freedom; look at the p-value.
anova(model1, model4) # which one is better?
Comparing non-nested models -> AIC (Akaike Information Criterion). The lower, the better!!
model7 <- lm(Boston$crim~Boston$dis+Boston$medv)
AIC(model4) #3487.085
## [1] 3487.085
AIC(model7) #3484.536 - better.
## [1] 3484.536
How to create a good model? Forward vs. backward selection
modelEMPTY <- lm(Boston$crim~0) # -> adding factors and comparing models = forward
summary(modelEMPTY)
##
## Call:
## lm(formula = Boston$crim ~ 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## 0.006 0.082 0.257 3.677 88.976
##
## No Coefficients
##
## Residual standard error: 9.322 on 506 degrees of freedom
modelFULL <- lm(Boston$crim~., data = Boston) # -> removing insignificant factors = backward
# note: crim1 (the Box-Cox transform of crim) is still a column of Boston, so it shows up among the predictors
summary(modelFULL)
##
## Call:
## lm(formula = Boston$crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.122 -2.570 -0.650 1.481 67.119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.815843 6.826482 4.221 2.90e-05 ***
## zn 0.081738 0.017823 4.586 5.74e-06 ***
## indus -0.126213 0.077567 -1.627 0.104348
## chas -0.597332 1.093331 -0.546 0.585079
## nox -22.455793 5.066704 -4.432 1.15e-05 ***
## rm 0.584749 0.567938 1.030 0.303704
## age -0.017450 0.016735 -1.043 0.297574
## dis -0.970012 0.261062 -3.716 0.000226 ***
## rad 0.137292 0.095503 1.438 0.151194
## tax -0.003363 0.004776 -0.704 0.481675
## ptratio -0.141233 0.173306 -0.815 0.415504
## black -0.002798 0.003443 -0.813 0.416728
## lstat 0.026137 0.071007 0.368 0.712967
## medv -0.231933 0.056176 -4.129 4.29e-05 ***
## crim1 3.156447 0.347787 9.076 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.965 on 491 degrees of freedom
## Multiple R-squared: 0.5324, Adjusted R-squared: 0.5191
## F-statistic: 39.94 on 14 and 491 DF, p-value: < 2.2e-16
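In practice the backward procedure is usually automated; a minimal sketch with stepAIC() from MASS (the direction and trace settings are illustrative choices):
modelStep <- stepAIC(modelFULL, direction = "backward", trace = FALSE) # drop predictors one by one, guided by AIC
summary(modelStep) # the reduced model that survives backward elimination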
What drawbacks do these methods have?
library(ggplot2)
regr <- ggplot(Boston, aes(crim, black))
regr + geom_point()
Perhaps a simple way to address the heteroscedasticity of the residuals is to convert numeric variables into categorical ones; see the sketch below.
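A minimal sketch of this idea, binning indus into terciles with cut() (the breaks and labels are arbitrary illustrative choices):
Boston$indus_cat <- cut(Boston$indus,
breaks = quantile(Boston$indus, probs = c(0, 1/3, 2/3, 1)),
include.lowest = TRUE, labels = c("low", "mid", "high")) # 3 categories
modelCat <- lm(crim ~ indus_cat + black, data = Boston)
ncvTest(modelCat) # re-check for non-constant variance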
Model 4 has the following predictors: chas + black + indus. My NewModel has the predictors rad + black. Both models have one outcome: the crime rate.
So these two models are partly non-nested, because they share one common variable, ‘black’, while the other predictors differ.
Let’s compare them
AIC(model4) #3487.085
## [1] 3487.085
AIC(NewModel) #3274.053 - the lower, the better
## [1] 3274.053
Woohoo, mine is better :) * yes, I’m proud of myself *