We begin by loading the Boston dataset from the MASS package. This time we focus on explaining the crime rate variable ("crim").
# Attaching the packages to work with
library(sjPlot)
library(ggplot2)
library(MASS)
library(lmtest)
library(car)
# Loading the dataset and taking a look at the variables in it
?Boston
dim(Boston)
## [1] 506 14
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Since the models used in this script were already constructed in practical task #1, I will briefly point out the main differences between them in terms of the variables included in the equation and model significance.
# The very first model simply uses numeric variable "black" as the predictor
model1 <- lm(Boston$crim~Boston$black)
summary(model1)
##
## Call:
## lm(formula = Boston$crim ~ Boston$black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.756 -2.299 -2.095 -1.296 86.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.553529 1.425903 11.609 <2e-16 ***
## Boston$black -0.036280 0.003873 -9.367 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared: 0.1483, Adjusted R-squared: 0.1466
## F-statistic: 87.74 on 1 and 504 DF, p-value: < 2.2e-16
# The variable "chas" is added to the equation as a predictor
model3 <- lm(Boston$crim~Boston$chas+Boston$black)
summary(model3)
##
## Call:
## lm(formula = Boston$crim ~ Boston$chas + Boston$black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.794 -2.380 -2.173 -0.995 86.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.579673 1.426456 11.623 <2e-16 ***
## Boston$chas -1.259561 1.394069 -0.904 0.367
## Boston$black -0.036109 0.003878 -9.310 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.948 on 503 degrees of freedom
## Multiple R-squared: 0.1497, Adjusted R-squared: 0.1463
## F-statistic: 44.26 on 2 and 503 DF, p-value: < 2.2e-16
# The variable "indus" is added to the previous model
model4<-update(model3, ~.+Boston$indus)
summary(model4)
##
## Call:
## lm(formula = Boston$crim ~ Boston$chas + Boston$black + Boston$indus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.454 -2.078 -0.653 0.513 83.496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.441410 1.736139 4.862 1.56e-06 ***
## Boston$chas -2.116562 1.328385 -1.593 0.112
## Boston$black -0.025425 0.003949 -6.439 2.81e-10 ***
## Boston$indus 0.393925 0.052588 7.491 3.11e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.545 on 502 degrees of freedom
## Multiple R-squared: 0.2351, Adjusted R-squared: 0.2306
## F-statistic: 51.45 on 3 and 502 DF, p-value: < 2.2e-16
# Variables "chas" is excluded from the model
model5 <- update(model4,~.-Boston$chas)
summary(model5)
##
## Call:
## lm(formula = Boston$crim ~ Boston$black + Boston$indus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.376 -2.182 -0.649 0.516 83.712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.546889 1.737527 4.919 1.18e-06 ***
## Boston$black -0.025906 0.003943 -6.570 1.26e-10 ***
## Boston$indus 0.386709 0.052472 7.370 7.07e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.557 on 503 degrees of freedom
## Multiple R-squared: 0.2313, Adjusted R-squared: 0.2282
## F-statistic: 75.67 on 2 and 503 DF, p-value: < 2.2e-16
We can now trace how the adjusted R-squared changed across the models. The very first model, like the one with the "chas" variable added, explained about 14% of the variance, but in the latter the predictor "chas" was not significant. The model improved noticeably when "indus" was added as a predictor: it explained up to 23% of the variance in the dependent variable, although "chas" was still not statistically significant. After that variable was excluded, the adjusted R-squared dropped slightly (from 0.231 to 0.228), but all the predictors in the last model are statistically significant.
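To make this comparison easier to see at a glance, here is a small helper sketch (my own addition, using only the models fitted above) that collects the adjusted R-squared values in a single vector:
# Collect adjusted R-squared from each fitted model for side-by-side comparison
sapply(list(model1 = model1, model3 = model3, model4 = model4, model5 = model5),
       function(m) summary(m)$adj.r.squared)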
In simple terms, before we interpret the coefficients of the chosen model, we should check the assumptions behind it. Multicollinearity is the first important problem to consider. What does it mean?
As we know, multicollinearity occurs when the explanatory variables in a model are correlated with each other, which is a problem because one of the model requirements is their independence. When we interpret the model summary and coefficients, we assume that the relationship between the dependent variable and one explanatory variable is described while the other variables are held constant. With multicollinearity, however, a change in one explanatory variable also causes a change in another, so the coefficients can no longer be read this way.
# Easy way to check for multicollinearity
vif(model4)
## Boston$chas Boston$black Boston$indus
## 1.009877 1.152777 1.154607
The values for all three variables (1.009877, 1.152777, 1.154607) are below 5, which indicates that multicollinearity is not a problem here. The minimum possible value of the variance inflation factor is 1, and the VIF values of all three variables are not much larger than that. We would only be concerned if the values exceeded 5 or 10.
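As a small illustration of where these numbers come from (my own sketch, not part of the task): the VIF of a predictor equals 1 / (1 - R^2) from regressing that predictor on the remaining predictors.
# Reproduce the VIF of "indus" by hand: regress indus on the other two predictors
r2_indus <- summary(lm(indus ~ chas + black, data = Boston))$r.squared
1 / (1 - r2_indus) # should be close to the vif(model4) value for indus, ~1.15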
Moving on to outliers and leverage points: highly influential outliers would affect the values of the coefficients in our regression (slope and intercept).
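Before the formal tests below, a quick numeric look at leverage (my own addition; the 2*(k+1)/n cut-off is just a common rule of thumb, not something required by the task):
# Hat values measure leverage; flag observations above the usual 2*(k+1)/n threshold
h <- hatvalues(model4)
cutoff <- 2 * length(coef(model4)) / nrow(Boston) # length(coef()) = k + 1
head(which(h > cutoff))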
outlierTest(model4) # Bonferroni p-value for the most extreme observations
## rstudent unadjusted p-value Bonferroni p
## 381 12.763595 1.6802e-32 8.5020e-30
## 406 8.874146 1.2515e-17 6.3325e-15
## 419 8.379306 5.4015e-16 2.7331e-13
## 411 4.909203 1.2393e-06 6.2710e-04
## 405 4.653195 4.1891e-06 2.1197e-03
## 399 4.450539 1.0565e-05 5.3458e-03
## 415 4.417722 1.2232e-05 6.1895e-03
qqPlot(model4, main="QQ Plot") #qq plot for studentized resid
## [1] 381 406
# How many outliers do we have?
We can clearly see two outliers, 381 and 406; they are also labelled on the QQ plot.
leveragePlots(model4) # leverage plots
par(mfrow=c(2,2)) # to put 4 graphs on 1 screen
plot(model4) # diagnostics plots
The "Residuals vs Fitted" plot is used to check the linearity assumption: we are looking for a roughly straight horizontal line, and that appears to be the case for our model. The second plot (Normal Q-Q) checks the normality of the residual distribution; ideally the points would follow the straight dashed line, but towards its upper end there are values that clearly stand out (381, 406, 419). The "Scale-Location" plot checks the homogeneity of the residual variance; the ideal case would be a horizontal line with points spread equally around it, which is not what we see. Cook's distance appears on the last plot ("Residuals vs Leverage"), and some points lie on or close to its line.
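As a numeric complement to the last plot (my own addition; the 4/n threshold is only a rule of thumb), Cook's distance can also be inspected directly:
# Largest Cook's distances and observations above the 4/n rule of thumb
cd <- cooks.distance(model4)
round(head(sort(cd, decreasing = TRUE), 5), 3)
head(which(cd > 4 / nrow(Boston)))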
We can also perform other tests to assess the adequacy of our model. For instance, we can plot the distribution of studentized residuals.
sresid <- studres(model4) # studentized residuals; studres() comes from MASS
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40) # grid over the range of the residuals
yfit<-dnorm(xfit) # standard normal density for comparison
lines(xfit, yfit) # overlay the normal curve on the histogram
It is quite clear that the distribution of the studentized residuals is far from normal.
We move on to evaluate homoscedasticity and perform a non-constant error variance test.
ncvTest(model4) # if significant -> sign of heteroscedasticity
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 329.4616, Df = 1, p = < 2.22e-16
The p-value is statistically significant (< 0.05), which is a sign of heteroscedasticity (the residuals do not have a constant variance).
bptest(model4)# if significant -> sign of heteroscedasticity
##
## studentized Breusch-Pagan test
##
## data: model4
## BP = 13.797, df = 3, p-value = 0.003195
The same conclusion can be drawn from the significant p-value of this test.
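One common way to keep the model but correct the inference (a hedged sketch on my part; it assumes the 'sandwich' package is installed, which this script does not otherwise use) is to report heteroscedasticity-robust standard errors:
# Robust (HC3) standard errors via coeftest() from lmtest and vcovHC() from sandwich
coeftest(model4, vcov = sandwich::vcovHC(model4, type = "HC3"))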
Alternatively, to address the heteroscedasticity problem directly, we can apply a Box-Cox transformation to the dependent variable, here with caret::BoxCoxTrans() (the 'e1071' package mentioned here is used by caret for the skewness estimate).
crimBCMod <- caret::BoxCoxTrans(Boston$crim)
print(crimBCMod)
## Box-Cox Transformation
##
## 506 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
##
## Largest/Smallest: 14100
## Sample Skewness: 5.19
##
## Estimated Lambda: -0.1
## With fudge factor, Lambda = 0 will be used for transformations
# Transformation value is "estimated Lambda"
Boston$crim1<-predict(crimBCMod, Boston$crim)
hist(Boston$crim)
hist(Boston$crim1)
# And refit model4 with the transformed dependent variable
model4.1 <- lm(Boston$crim1 ~ Boston$chas + Boston$indus + Boston$black)
ncvTest(model4.1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 97.17714, Df = 1, p = < 2.22e-16
As we can see, the p-value is still significant, so the transformation did not solve the problem.
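A side note (my own check, assuming I read caret's fudge-factor behaviour correctly): with lambda fixed at 0 the Box-Cox transformation reduces to the natural logarithm, so the transformed variable should simply equal log(crim).
# Sanity check: crim1 should match log(crim) because lambda = 0 was used
all.equal(Boston$crim1, log(Boston$crim))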
Answering the questions from the task:
In the case of outliers, those are the points whose y values do not follow the general trend. Leverage points are observations with extreme x values.
Not necessarily. Even if a leverage point has a very high or very low x value, it might still lie "on the line" of the general trend, and its y value will not differ much from that trend, so it will not be an outlier. However, an outlier (a y value that departs from the general trend) may also have an abnormally high or low x value; in that case we are dealing with an outlier that is also a leverage point. This is usually referred to as a high-leverage point with high influence.
As mentioned above, leverage points may lie on the line of the general trend, and in such cases they should not be deleted. Outliers may also be non-influential (the line does not change much if they are excluded) or jointly non-influential. In general, only high-influence points that change the trend line should be deleted.
I personally was not sure about the distinction between these two concepts, so below I link the source I used to wrap my head around it.
Source: https://online.stat.psu.edu/stat462/node/170/
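As an extra, optional check (my own addition, not required by the task), the car package can combine the three diagnostics discussed above in one picture:
# Studentized residuals vs hat values; bubble area is proportional to Cook's distance
influencePlot(model4)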
As we remember, three outliers stood out in the diagnostics: 381, 406 and 419. However, it was point 419 that was found on the Cook's distance line, which makes it a high-influence case. We shall delete it from the data before building a new model.
Boston1 <- Boston[-c(419),]
model6 <- lm(Boston1$crim ~ Boston1$black + Boston1$ptratio + Boston1$rm)
summary(model6)
##
## Call:
## lm(formula = Boston1$crim ~ Boston1$black + Boston1$ptratio +
## Boston1$rm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.673 -3.227 -1.423 0.986 86.164
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.619824 5.179607 1.471 0.14189
## Boston1$black -0.027024 0.003636 -7.432 4.65e-13 ***
## Boston1$ptratio 0.752565 0.160548 4.687 3.57e-06 ***
## Boston1$rm -1.332336 0.490878 -2.714 0.00687 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.226 on 501 degrees of freedom
## Multiple R-squared: 0.1942, Adjusted R-squared: 0.1893
## F-statistic: 40.24 on 3 and 501 DF, p-value: < 2.2e-16
The model includes the predictors "black", "ptratio" and "rm".
Judging by the summary, the model is better than the null model based on the mean (p-value < 0.05) and explains around 19% of the variance in the dependent variable (adjusted R-squared = 0.189). All predictors are significant.
The regression equation would look like this:
\[Predicted\ crime\ rate = 7.619824 - 0.027024 \cdot black + 0.752565 \cdot ptratio - 1.332336 \cdot rm\]
The interpretation would be:
With each one-unit increase in the variable "black" (in this dataset defined as 1000(Bk − 0.63)², where Bk is the proportion of Black residents in the town) the predicted crime rate decreases by 0.027024. With each one-unit increase in "ptratio" the predicted crime rate increases by 0.752565, and with each one-unit increase in "rm" it decreases by 1.332336.
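To illustrate how the equation is used (the predictor values below are chosen by me for the example, not taken from the task), the model can be refitted with a data argument so that predict() accepts new data:
# Refit with data= so predict() can be used with a new observation
model6df <- lm(crim ~ black + ptratio + rm, data = Boston1)
predict(model6df, newdata = data.frame(black = 390, ptratio = 18, rm = 6))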
A larger pupil-teacher ratio (more students per teacher) might indicate a lower-status area, which could explain the higher crime rate in such cases. Interestingly, the higher the average number of rooms per dwelling, the lower the crime rate. This might be the other way around if an affluent area of detached houses were compared with working-class apartment blocks, but it is an interesting finding that makes us question how areas of the city differ in the type of dwellings.
We shall move on and check multicollinearity.
vif(model6)
## Boston1$black Boston1$ptratio Boston1$rm
## 1.036312 1.166858 1.149924
All the VIF values are below 5, which indicates that there is no multicollinearity problem in this model.
outlierTest(model6)
## rstudent unadjusted p-value Bonferroni p
## 381 14.148244 1.8331e-38 9.2573e-36
## 406 9.494794 9.0626e-20 4.5766e-17
## 411 5.203873 2.8539e-07 1.4412e-04
## 405 4.965383 9.4234e-07 4.7588e-04
## 399 4.751888 2.6379e-06 1.3321e-03
## 415 4.483766 9.1035e-06 4.5973e-03
qqPlot(model6, main="QQ Plot")
## [1] 381 406
There are two outliers: 381 and 406.
plot(model6)
None of the points lie on or beyond the Cook's distance line, which means there are no high-influence cases and no further observations need to be deleted.
Nested models – one of the models contains ALL the predictors of the other.
If you delete the shared factors from both models, one of them becomes EMPTY: Y1 = x1 + x2 + x3; Y2 = x1 + x2 + x3 + x4 + x5.
Non-nested models – each model has some predictors of its own. Two types of non-nested models:
* Partly non-nested: Y1 = x1 + x2 + x7; Y2 = x1 + x2 + x3 + x4 + x5
* Completely non-nested: Y1 = x1 + x2 + x3; Y2 = x4 + x5 + x6
Comparing nested models -> anova() from {stats}: it compares the residuals (unexplained variance) while controlling for the degrees of freedom.
anova(model1, model4)
## Analysis of Variance Table
##
## Model 1: Boston$crim ~ Boston$black
## Model 2: Boston$crim ~ Boston$chas + Boston$black + Boston$indus
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 31823
## 2 502 28577 2 3245.9 28.509 1.874e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is small (< 0.05), which means that adding the two extra predictors (Df = 2) DID improve the model.
Comparing non-nested models -> AIC (Akaike Information Criterion).
model4 <- lm(crim~chas+black + indus, data = Boston)
model7 <- lm(crim~dis+medv, data = Boston)
AIC(model4) #3487.085
## [1] 3487.085
AIC(model7) #3484.536
## [1] 3484.536
Since in this case "the lower the better", the new model (model7) is better than the former one (3484.536 < 3487.085).
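As a sanity check of what AIC measures (my own sketch): for an lm fit, AIC = n*(log(2*pi) + log(RSS/n) + 1) + 2*(number of coefficients + 1), where the extra parameter counts the residual variance.
# Recompute AIC of model4 by hand from its residual sum of squares
rss <- sum(residuals(model4)^2)
n <- nrow(Boston)
n * (log(2 * pi) + log(rss / n) + 1) + 2 * (length(coef(model4)) + 1) # ~3487.085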
How to create a good model? Forward vs. backward selection
modelEMPTY <- lm(Boston$crim~0) # -> adding factors and comparing models = forward
summary(modelEMPTY)
##
## Call:
## lm(formula = Boston$crim ~ 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## 0.006 0.082 0.257 3.677 88.976
##
## No Coefficients
##
## Residual standard error: 9.322 on 506 degrees of freedom
modelFULL <- lm(Boston$crim~., data = Boston) # -> removing insignificant factors = backward
summary (modelFULL)
##
## Call:
## lm(formula = Boston$crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.122 -2.570 -0.650 1.481 67.119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.815843 6.826482 4.221 2.90e-05 ***
## zn 0.081738 0.017823 4.586 5.74e-06 ***
## indus -0.126213 0.077567 -1.627 0.104348
## chas -0.597332 1.093331 -0.546 0.585079
## nox -22.455793 5.066704 -4.432 1.15e-05 ***
## rm 0.584749 0.567938 1.030 0.303704
## age -0.017450 0.016735 -1.043 0.297574
## dis -0.970012 0.261062 -3.716 0.000226 ***
## rad 0.137292 0.095503 1.438 0.151194
## tax -0.003363 0.004776 -0.704 0.481675
## ptratio -0.141233 0.173306 -0.815 0.415504
## black -0.002798 0.003443 -0.813 0.416728
## lstat 0.026137 0.071007 0.368 0.712967
## medv -0.231933 0.056176 -4.129 4.29e-05 ***
## crim1 3.156447 0.347787 9.076 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.965 on 491 degrees of freedom
## Multiple R-squared: 0.5324, Adjusted R-squared: 0.5191
## F-statistic: 39.94 on 14 and 491 DF, p-value: < 2.2e-16
The backward method is usually preferred because of suppressor effects (when a predictor only has an effect while another predictor is held constant). Forward selection is more likely to commit a Type II error and leave out a predictor that actually does predict the dependent variable.
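As a hedged sketch of automated backward selection (my own addition, not part of the task): stepAIC() from MASS starts from the full model and drops predictors as long as the AIC improves. The transformed crim1 column is removed first so it is not used to predict crim.
# Backward selection by AIC, starting from the full model without crim1
bostonRaw <- subset(Boston, select = -crim1)
modelBackward <- stepAIC(lm(crim ~ ., data = bostonRaw), direction = "backward", trace = FALSE)
summary(modelBackward)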
Visualizing the relationship between the crime and black variables (the graph was already presented in practical task #1).
ggplot(Boston, aes(y = crim, x = black)) +
  geom_point(shape = 21, colour = "black", fill = "#fdcdac", size = 3, stroke = 0.2) +
  labs(title = "Relationship between crime rate and share of black people in the area", y = "Crime rate", x = "Share of black people in the area") +
  geom_smooth(method = "lm", se = FALSE) +
  xlim(200, 400)
Neither model #4 nor model #6 contains all the predictors of the other. They share the variable "black" as a predictor, but neither of them becomes empty if the shared variable is deleted, so we are dealing with partly non-nested models (the shared "black" variable is the reason they are not completely non-nested). Therefore we have to use AIC to decide which of them is better.
model4 <- lm(crim ~ chas + black + indus, data = Boston)
model6 <- lm(crim ~ black + ptratio + rm, data = Boston)
AIC(model4)
## [1] 3487.085
AIC(model6)
## [1] 3504.023
The AIC is smaller for the former model (model4), so we conclude that it is better than model6 (3487.085 < 3504.023).