We begin by loading the Boston dataset from the MASS package. This time we focus on explaining the crime rate variable ("crim").
# Attaching the packages to work with
library(sjPlot)
library(ggplot2)
library(MASS)
library(lmtest)
library(car)
# Loading the dataset and taking a look at the variables in it
?Boston
dim(Boston)
## [1] 506 14
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Since the models used in this script were already constructed in practical task #1, I will briefly point out the main differences between them in terms of the variables included in the equation and model significance.
# The very first model simply uses numeric variable "black" as the predictor
model1 <- lm(Boston$crim~Boston$black)
summary(model1)
##
## Call:
## lm(formula = Boston$crim ~ Boston$black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.756 -2.299 -2.095 -1.296 86.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.553529 1.425903 11.609 <2e-16 ***
## Boston$black -0.036280 0.003873 -9.367 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared: 0.1483, Adjusted R-squared: 0.1466
## F-statistic: 87.74 on 1 and 504 DF, p-value: < 2.2e-16
# The variable "chas" is added to the equation as a predictor
model3 <- lm(Boston$crim~Boston$chas+Boston$black)
summary(model3)
##
## Call:
## lm(formula = Boston$crim ~ Boston$chas + Boston$black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.794 -2.380 -2.173 -0.995 86.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.579673 1.426456 11.623 <2e-16 ***
## Boston$chas -1.259561 1.394069 -0.904 0.367
## Boston$black -0.036109 0.003878 -9.310 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.948 on 503 degrees of freedom
## Multiple R-squared: 0.1497, Adjusted R-squared: 0.1463
## F-statistic: 44.26 on 2 and 503 DF, p-value: < 2.2e-16
# The variable "indus" is added to the previous model
model4<-update(model3, ~.+Boston$indus)
summary(model4)
##
## Call:
## lm(formula = Boston$crim ~ Boston$chas + Boston$black + Boston$indus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.454 -2.078 -0.653 0.513 83.496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.441410 1.736139 4.862 1.56e-06 ***
## Boston$chas -2.116562 1.328385 -1.593 0.112
## Boston$black -0.025425 0.003949 -6.439 2.81e-10 ***
## Boston$indus 0.393925 0.052588 7.491 3.11e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.545 on 502 degrees of freedom
## Multiple R-squared: 0.2351, Adjusted R-squared: 0.2306
## F-statistic: 51.45 on 3 and 502 DF, p-value: < 2.2e-16
# Variables "chas" is excluded from the model
model5 <- update(model4,~.-Boston$chas)
summary(model5)
##
## Call:
## lm(formula = Boston$crim ~ Boston$black + Boston$indus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.376 -2.182 -0.649 0.516 83.712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.546889 1.737527 4.919 1.18e-06 ***
## Boston$black -0.025906 0.003943 -6.570 1.26e-10 ***
## Boston$indus 0.386709 0.052472 7.370 7.07e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.557 on 503 degrees of freedom
## Multiple R-squared: 0.2313, Adjusted R-squared: 0.2282
## F-statistic: 75.67 on 2 and 503 DF, p-value: < 2.2e-16
We can now trace how the adjusted R-squared changed across the models. The very first model, like the one with the "chas" variable added, explained about 14% of the variance, but in the latter the predictor "chas" was not significant. The model improved noticeably when "indus" was added as a predictor: it explained up to 23% of the variance in the dependent variable, although "chas" was still not statistically significant. After that variable was excluded, the adjusted R-squared dropped slightly (from 0.231 to 0.228), but all the predictors in the last model are statistically significant.
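To make this comparison easier to see at a glance, here is a small helper sketch (my own addition, using only the models fitted above) that collects the adjusted R-squared values in a single vector:
# Collect adjusted R-squared from each fitted model for side-by-side comparison
sapply(list(model1 = model1, model3 = model3, model4 = model4, model5 = model5),
       function(m) summary(m)$adj.r.squared)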
In simple terms, before we interpret the coefficients of the chosen model, we should check the assumptions behind it. Multicollinearity is the first important problem to consider. What does it mean?
As we know, multicollinearity occurs when the explanatory variables in a model are correlated with each other, which is a problem because one of the model requirements is their independence. When we interpret the model summary and coefficients, we assume that the relationship between the dependent variable and one explanatory variable is described while the other variables are held constant. With multicollinearity, however, a change in one explanatory variable also causes a change in another, so the coefficients can no longer be read this way.
# Easy way to check for multicollinearity
vif(model4)
## Boston$chas Boston$black Boston$indus
## 1.009877 1.152777 1.154607
The values for all three variables (1.009877, 1.152777, 1.154607) are below 5, which indicates that multicollinearity is not a problem here. The minimum possible value of the variance inflation factor is 1, and the VIF values of all three variables are not much larger than that. We would only be concerned if the values exceeded 5 or 10.
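As a small illustration of where these numbers come from (my own sketch, not part of the task): the VIF of a predictor equals 1 / (1 - R^2) from regressing that predictor on the remaining predictors.
# Reproduce the VIF of "indus" by hand: regress indus on the other two predictors
r2_indus <- summary(lm(indus ~ chas + black, data = Boston))$r.squared
1 / (1 - r2_indus) # should be close to the vif(model4) value for indus, ~1.15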
Moving on to outliers and leverage points: highly influential outliers would affect the values of the coefficients in our regression (slope and intercept).
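Before the formal tests below, a quick numeric look at leverage (my own addition; the 2*(k+1)/n cut-off is just a common rule of thumb, not something required by the task):
# Hat values measure leverage; flag observations above the usual 2*(k+1)/n threshold
h <- hatvalues(model4)
cutoff <- 2 * length(coef(model4)) / nrow(Boston) # length(coef()) = k + 1
head(which(h > cutoff))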
outlierTest(model4) # Bonferroni p-value for the most extreme observations
## rstudent unadjusted p-value Bonferroni p
## 381 12.763595 1.6802e-32 8.5020e-30
## 406 8.874146 1.2515e-17 6.3325e-15
## 419 8.379306 5.4015e-16 2.7331e-13
## 411 4.909203 1.2393e-06 6.2710e-04
## 405 4.653195 4.1891e-06 2.1197e-03
## 399 4.450539 1.0565e-05 5.3458e-03
## 415 4.417722 1.2232e-05 6.1895e-03
qqPlot(model4, main="QQ Plot") #qq plot for studentized resid
## [1] 381 406
# How many outliers do we have?
We can clearly see two outliers, 381 and 406; they are also labelled on the QQ plot.
leveragePlots(model4) # leverage plots
par(mfrow=c(2,2)) # to put 4 graphs on 1 screen
plot(model4) # diagnostics plots
The "Residuals vs Fitted" plot is used to check the linearity assumption: we are looking for a roughly straight horizontal line, and that appears to be the case for our model. The second plot (Normal Q-Q) checks the normality of the residual distribution; ideally the points would follow the straight dashed line, but towards its upper end there are values that clearly stand out (381, 406, 419). The "Scale-Location" plot checks the homogeneity of the residual variance; the ideal case would be a horizontal line with points spread equally around it, which is not what we see. Cook's distance appears on the last plot ("Residuals vs Leverage"), and some points lie on or close to its line.
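As a numeric complement to the last plot (my own addition; the 4/n threshold is only a rule of thumb), Cook's distance can also be inspected directly:
# Largest Cook's distances and observations above the 4/n rule of thumb
cd <- cooks.distance(model4)
round(head(sort(cd, decreasing = TRUE), 5), 3)
head(which(cd > 4 / nrow(Boston)))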
We can also perform other tests to assess the adequacy of our model. For instance, we can plot the distribution of studentized residuals.
sresid <- studres(model4) # studentized residuals; studres() comes from MASS
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40) # grid over the range of the residuals
yfit<-dnorm(xfit) # standard normal density for comparison
lines(xfit, yfit) # overlay the normal curve on the histogram
It is quite clear that the distribution of the studentized residuals is far from normal.
We move on to evaluate homoscedasticity and perform a non-constant error variance test.
ncvTest(model4) # if significant -> sign of heteroscedasticity
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 329.4616, Df = 1, p = < 2.22e-16
The p-value is statistically significant (< 0.05), which is a sign of heteroscedasticity (the residuals do not have a constant variance).
bptest(model4)# if significant -> sign of heteroscedasticity
##
## studentized Breusch-Pagan test
##
## data: model4
## BP = 13.797, df = 3, p-value = 0.003195
The same conclusion can be drawn from the significant p-value of this test.
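One common way to keep the model but correct the inference (a hedged sketch on my part; it assumes the 'sandwich' package is installed, which this script does not otherwise use) is to report heteroscedasticity-robust standard errors:
# Robust (HC3) standard errors via coeftest() from lmtest and vcovHC() from sandwich
coeftest(model4, vcov = sandwich::vcovHC(model4, type = "HC3"))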
Alternatively, to address the heteroscedasticity problem directly, we can apply a Box-Cox transformation to the dependent variable, here with caret::BoxCoxTrans() (the 'e1071' package mentioned here is used by caret for the skewness estimate).
crimBCMod <- caret::BoxCoxTrans(Boston$crim)
print(crimBCMod)
## Box-Cox Transformation
##
## 506 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
##
## Largest/Smallest: 14100
## Sample Skewness: 5.19
##
## Estimated Lambda: -0.1
## With fudge factor, Lambda = 0 will be used for transformations
# Transformation value is "estimated Lambda"
Boston$crim1<-predict(crimBCMod, Boston$crim)
hist(Boston$crim)
hist(Boston$crim1)
# And refit model4 with the transformed dependent variable
model4.1 <- lm(Boston$crim1 ~ Boston$chas + Boston$indus + Boston$black)
ncvTest(model4.1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 97.17714, Df = 1, p = < 2.22e-16
As we can see, the p-value is still significant, so the transformation did not solve the problem.
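A side note (my own check, assuming I read caret's fudge-factor behaviour correctly): with lambda fixed at 0 the Box-Cox transformation reduces to the natural logarithm, so the transformed variable should simply equal log(crim).
# Sanity check: crim1 should match log(crim) because lambda = 0 was used
all.equal(Boston$crim1, log(Boston$crim))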
Answering the questions from the task:
In the case of outliers, those are the points whose y values do not follow the general trend. Leverage points are observations with extreme x values.
Not necessarily. Even if a leverage point has a very high or very low x value, it might still lie "on the line" of the general trend, and its y value will not differ much from that trend, so it will not be an outlier. However, an outlier (a y value that departs from the general trend) may also have an abnormally high or low x value; in that case we are dealing with an outlier that is also a leverage point. This is usually referred to as a high-leverage point with high influence.
As mentioned above, leverage points may lie on the line of the general trend, and in such cases they should not be deleted. Outliers may also be non-influential (the line does not change much if they are excluded) or jointly non-influential. In general, only high-influence points that change the trend line should be deleted.
I personally was not sure about the distinction between these two concepts, so below I link the source I used to wrap my head around it.
Source: https://online.stat.psu.edu/stat462/node/170/
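As an extra, optional check (my own addition, not required by the task), the car package can combine the three diagnostics discussed above in one picture:
# Studentized residuals vs hat values; bubble area is proportional to Cook's distance
influencePlot(model4)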
As we remember, three outliers stood out in the diagnostics: 381, 406 and 419. However, it was point 419 that was found on the Cook's distance line, which makes it a high-influence case. We shall delete it from the data before building a new model.
Boston1 <- Boston[-c(419),]
model6 <- lm(Boston1$crim ~ Boston1$black + Boston1$ptratio + Boston1$rm)
summary(model6)
##
## Call:
## lm(formula = Boston1$crim ~ Boston1$black + Boston1$ptratio +
## Boston1$rm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.673 -3.227 -1.423 0.986 86.164
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.619824 5.179607 1.471 0.14189
## Boston1$black -0.027024 0.003636 -7.432 4.65e-13 ***
## Boston1$ptratio 0.752565 0.160548 4.687 3.57e-06 ***
## Boston1$rm -1.332336 0.490878 -2.714 0.00687 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.226 on 501 degrees of freedom
## Multiple R-squared: 0.1942, Adjusted R-squared: 0.1893
## F-statistic: 40.24 on 3 and 501 DF, p-value: < 2.2e-16
The model includes the predictors "black", "ptratio" and "rm".
Judging by the summary, the model is better than the null model based on the mean (p-value < 0.05) and explains around 19% of the variance in the dependent variable (adjusted R-squared = 0.189). All predictors are significant.
The regression equation would look like this:
\[Predicted\ crime\ rate = 7.619824 - 0.027024 \cdot black + 0.752565 \cdot ptratio - 1.332336 \cdot rm\]
The interpretation would be:
With each one-unit increase in the variable "black" (in this dataset defined as 1000(Bk − 0.63)², where Bk is the proportion of Black residents in the town) the predicted crime rate decreases by 0.027024. With each one-unit increase in "ptratio" the predicted crime rate increases by 0.752565, and with each one-unit increase in "rm" it decreases by 1.332336.
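To illustrate how the equation is used (the predictor values below are chosen by me for the example, not taken from the task), the model can be refitted with a data argument so that predict() accepts new data:
# Refit with data= so predict() can be used with a new observation
model6df <- lm(crim ~ black + ptratio + rm, data = Boston1)
predict(model6df, newdata = data.frame(black = 390, ptratio = 18, rm = 6))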
A larger pupil-teacher ratio (more students per teacher) might indicate a lower-status area, which could explain the higher crime rate in such cases. Interestingly, the higher the average number of rooms per dwelling, the lower the crime rate. This might be the other way around if an affluent area of detached houses were compared with working-class apartment blocks, but it is an interesting finding that makes us question how areas of the city differ in the type of dwellings.
We shall move on and check multicollinearity.
vif(model6)
## Boston1$black Boston1$ptratio Boston1$rm
## 1.036312 1.166858 1.149924
All the VIF values are below 5, which indicates that there is no multicollinearity problem in this model.
outlierTest(model6)
## rstudent unadjusted p-value Bonferroni p
## 381 14.148244 1.8331e-38 9.2573e-36
## 406 9.494794 9.0626e-20 4.5766e-17
## 411 5.203873 2.8539e-07 1.4412e-04
## 405 4.965383 9.4234e-07 4.7588e-04
## 399 4.751888 2.6379e-06 1.3321e-03
## 415 4.483766 9.1035e-06 4.5973e-03
qqPlot(model6, main="QQ Plot")
## [1] 381 406
There are two outliers: 381 and 406.
plot(model6)
None of the points lie on or beyond the Cook's distance line, which means there are no high-influence cases and no further observations need to be deleted.
Nested models – one of the models contains ALL the predictors of the other.
If you delete the shared factors from both models, one of them becomes EMPTY: Y1 = x1 + x2 + x3; Y2 = x1 + x2 + x3 + x4 + x5.
Non-nested models – each model has some predictors of its own. Two types of non-nested models:
* Partly non-nested: Y1 = x1 + x2 + x7; Y2 = x1 + x2 + x3 + x4 + x5
* Completely non-nested: Y1 = x1 + x2 + x3; Y2 = x4 + x5 + x6
Comparing nested models -> anova() from {stats}: it compares the residuals (unexplained variance) while controlling for the degrees of freedom.
anova(model1, model4)
## Analysis of Variance Table
##
## Model 1: Boston$crim ~ Boston$black
## Model 2: Boston$crim ~ Boston$chas + Boston$black + Boston$indus
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 31823
## 2 502 28577 2 3245.9 28.509 1.874e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is small (< 0.05), which means that adding the two extra predictors (Df = 2) DID improve the model.
Comparing non-nested models -> AIC (Akaike Information Criterion).
model4 <- lm(crim~chas+black + indus, data = Boston)
model7 <- lm(crim~dis+medv, data = Boston)
AIC(model4) #3487.085
## [1] 3487.085
AIC(model7) #3484.536
## [1] 3484.536
Since in this case "the lower the better", the new model (model7) is better than the former one (3484.536 < 3487.085).
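As a sanity check of what AIC measures (my own sketch): for an lm fit, AIC = n*(log(2*pi) + log(RSS/n) + 1) + 2*(number of coefficients + 1), where the extra parameter counts the residual variance.
# Recompute AIC of model4 by hand from its residual sum of squares
rss <- sum(residuals(model4)^2)
n <- nrow(Boston)
n * (log(2 * pi) + log(rss / n) + 1) + 2 * (length(coef(model4)) + 1) # ~3487.085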
How to create a good model? Forward vs. backward selection
modelEMPTY <- lm(Boston$crim~0) # -> adding factors and comparing models = forward
summary(modelEMPTY)
##
## Call:
## lm(formula = Boston$crim ~ 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## 0.006 0.082 0.257 3.677 88.976
##
## No Coefficients
##
## Residual standard error: 9.322 on 506 degrees of freedom
modelFULL <- lm(Boston$crim~., data = Boston) # -> removing insignificant factors = backward
summary (modelFULL)
##
## Call:
## lm(formula = Boston$crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.122 -2.570 -0.650 1.481 67.119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.815843 6.826482 4.221 2.90e-05 ***
## zn 0.081738 0.017823 4.586 5.74e-06 ***
## indus -0.126213 0.077567 -1.627 0.104348
## chas -0.597332 1.093331 -0.546 0.585079
## nox -22.455793 5.066704 -4.432 1.15e-05 ***
## rm 0.584749 0.567938 1.030 0.303704
## age -0.017450 0.016735 -1.043 0.297574
## dis -0.970012 0.261062 -3.716 0.000226 ***
## rad 0.137292 0.095503 1.438 0.151194
## tax -0.003363 0.004776 -0.704 0.481675
## ptratio -0.141233 0.173306 -0.815 0.415504
## black -0.002798 0.003443 -0.813 0.416728
## lstat 0.026137 0.071007 0.368 0.712967
## medv -0.231933 0.056176 -4.129 4.29e-05 ***
## crim1 3.156447 0.347787 9.076 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.965 on 491 degrees of freedom
## Multiple R-squared: 0.5324, Adjusted R-squared: 0.5191
## F-statistic: 39.94 on 14 and 491 DF, p-value: < 2.2e-16
The backward method is usually preferred because of suppressor effects (when a predictor only has an effect while another predictor is held constant). Forward selection is more likely to commit a Type II error and leave out a predictor that actually does predict the dependent variable.
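As a hedged sketch of automated backward selection (my own addition, not part of the task): stepAIC() from MASS starts from the full model and drops predictors as long as the AIC improves. The transformed crim1 column is removed first so it is not used to predict crim.
# Backward selection by AIC, starting from the full model without crim1
bostonRaw <- subset(Boston, select = -crim1)
modelBackward <- stepAIC(lm(crim ~ ., data = bostonRaw), direction = "backward", trace = FALSE)
summary(modelBackward)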
Visualizing the relationship between the crime and black variables (the graph was already presented in practical task #1).
ggplot(Boston, aes(y = crim, x = black)) +
  geom_point(shape = 21, colour = "black", fill = "#fdcdac", size = 3, stroke = 0.2) +
  labs(title = "Relationship between crime rate and share of black people in the area", y = "Crime rate", x = "Share of black people in the area") +
  geom_smooth(method = "lm", se = FALSE) +
  xlim(200, 400)
Neither model #4 nor model #6 contains all the predictors of the other. They share the variable "black" as a predictor, but neither of them becomes empty if the shared variable is deleted, so we are dealing with partly non-nested models (the shared "black" variable is the reason they are not completely non-nested). Therefore we have to use AIC to decide which of them is better.
model4 <- lm(crim ~ chas + black + indus, data = Boston)
model6 <- lm(crim ~ black + ptratio + rm, data = Boston)
AIC(model4)
## [1] 3487.085
AIC(model6)
## [1] 3504.023
The AIC is smaller for the former model (model4), so we conclude that it is better than model6 (3487.085 < 3504.023).