Data

Let’s explain crime rates in Boston -> dataset “Boston” from package MASS

library(MASS)
#?Boston
dim(Boston)
## [1] 506  14
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Regression Modeling

What is our dependent variable?

  1. Choose one numeric predictor (proportion of blacks -> crime rate) and create a simple regression model
model1 <- lm(Boston$crim~Boston$black)
summary(model1) # getting results
## 
## Call:
## lm(formula = Boston$crim ~ Boston$black)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.756  -2.299  -2.095  -1.296  86.822 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.553529   1.425903  11.609   <2e-16 ***
## Boston$black -0.036280   0.003873  -9.367   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared:  0.1483, Adjusted R-squared:  0.1466 
## F-statistic: 87.74 on 1 and 504 DF,  p-value: < 2.2e-16

Describe the trend. Is the relation positive or negative? Strong or weak? What is the explanatory power?

Here we see a rather weak negative trend (estimate = -0.036). The model explains about 15% of the variability in the crime rate.
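
A quick check on where that R-squared comes from: in a simple regression it is just the squared correlation between the two variables (a minimal sanity check, nothing new assumed):
cor(Boston$crim, Boston$black)^2 # should match Multiple R-squared (~0.148)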

  2. multiple regression: taking several predictors into account (proportion of blacks and whether the tract bounds the Charles River).
model3 <- lm(Boston$crim~Boston$chas+Boston$black) 
summary(model3) #does something change?
## 
## Call:
## lm(formula = Boston$crim ~ Boston$chas + Boston$black)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.794  -2.380  -2.173  -0.995  86.728 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.579673   1.426456  11.623   <2e-16 ***
## Boston$chas  -1.259561   1.394069  -0.904    0.367    
## Boston$black -0.036109   0.003878  -9.310   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.948 on 503 degrees of freedom
## Multiple R-squared:  0.1497, Adjusted R-squared:  0.1463 
## F-statistic: 44.26 on 2 and 503 DF,  p-value: < 2.2e-16

Adjusted R-squared (0.1463) is about the same, so we cannot say whether this model is better or worse than the previous one. We can also see that there is no significant relation between the crime rate and whether the tract bounds the river.
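
Since model1 is nested in model3 (they differ only by chas), the difference can also be tested formally with an F-test; a minimal sketch:
anova(model1, model3) # a non-significant p-value -> chas does not improve the fit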

  3. the “update” function: to add or remove variables from your model
model4 <- update(model3, ~.+Boston$indus) # the dot means "the same as in the previous model"
summary(model4)
## 
## Call:
## lm(formula = Boston$crim ~ Boston$chas + Boston$black + Boston$indus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.454  -2.078  -0.653   0.513  83.496 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.441410   1.736139   4.862 1.56e-06 ***
## Boston$chas  -2.116562   1.328385  -1.593    0.112    
## Boston$black -0.025425   0.003949  -6.439 2.81e-10 ***
## Boston$indus  0.393925   0.052588   7.491 3.11e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.545 on 502 degrees of freedom
## Multiple R-squared:  0.2351, Adjusted R-squared:  0.2306 
## F-statistic: 51.45 on 3 and 502 DF,  p-value: < 2.2e-16
model5 <- update(model4,~.-Boston$chas)
summary(model5)
## 
## Call:
## lm(formula = Boston$crim ~ Boston$black + Boston$indus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.376  -2.182  -0.649   0.516  83.712 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.546889   1.737527   4.919 1.18e-06 ***
## Boston$black -0.025906   0.003943  -6.570 1.26e-10 ***
## Boston$indus  0.386709   0.052472   7.370 7.07e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.557 on 503 degrees of freedom
## Multiple R-squared:  0.2313, Adjusted R-squared:  0.2282 
## F-statistic: 75.67 on 2 and 503 DF,  p-value: < 2.2e-16

Simple diagnostics

1 Multicollinearity: an important problem. What does it mean?
What is the threat of multicollinearity?

library(car)
vif(model4) # easy way to check for multicollinearity
##  Boston$chas Boston$black Boston$indus 
##     1.009877     1.152777     1.154607

All VIF values are less than 5, so everything is OK. A value above 10 means that the set of independent variables must be changed!
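
The VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on the remaining predictors; a sketch that reproduces vif() for indus by hand:
r2_indus <- summary(lm(indus ~ chas + black, data = Boston))$r.squared
1 / (1 - r2_indus) # should match the vif(model4) value for Boston$indus (~1.15)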

2 Outliers and leverage points. They affect the values of the coefficients in our regression (slope and intercept).

outlierTest(model4) # Bonferroni p-value for most extreme obs
##      rstudent unadjusted p-value Bonferroni p
## 381 12.763595         1.6802e-32   8.5020e-30
## 406  8.874146         1.2515e-17   6.3325e-15
## 419  8.379306         5.4015e-16   2.7331e-13
## 411  4.909203         1.2393e-06   6.2710e-04
## 405  4.653195         4.1891e-06   2.1197e-03
## 399  4.450539         1.0565e-05   5.3458e-03
## 415  4.417722         1.2232e-05   6.1895e-03
qqPlot(model4, main="QQ Plot") #qq plot for studentized resid 

## [1] 381 406

How many outliers do we have?

leveragePlots(model4) # leverage plots

par(mfrow=c(2,2)) # to put 4 graphs on 1 screen
plot(model4) # diagnostics plots

  • note to self: the leverage point in this situation is observation 419, because it is the only one that crosses the red line on the 4th plot. The difference between a leverage point and an outlier is that outliers simply don't follow the pattern, while leverage points also pull the regression line toward themselves (naughty!). That is why the latter should be removed. Stay tuned ~
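
The same thing can be checked numerically with hat values (leverage) and Cook's distance; a sketch using the common rules of thumb 2*(k+1)/n and 4/n:
n <- nrow(Boston); k <- 3                  # model4 has 3 predictors
which(hatvalues(model4) > 2 * (k + 1) / n) # high-leverage observations
which(cooks.distance(model4) > 4 / n)      # influential observations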

What can you say about our regression?

3 Other tests to assess the adequacy of our model: distribution of studentized residuals

library(MASS)
sresid <- studres(model4) 
hist(sresid, freq=FALSE, 
     main="Distribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length = 40) # grid over the residual range
yfit <- dnorm(xfit)                                # standard normal density for comparison
lines(xfit, yfit)

The distribution of the studentized residuals IS NOT NORMAL.
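
The visual impression can be backed up with a formal normality test; a sketch using the Shapiro-Wilk test:
shapiro.test(sresid) # a very small p-value would confirm non-normality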

4 Evaluate homoscedasticity: the non-constant error variance test

ncvTest(model4) # if significant -> sign of heteroscedasticity
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 329.4616, Df = 1, p = < 2.22e-16
# OR 
library(lmtest)
bptest(model4) # if significant -> sign of heteroscedasticity
## 
##  studentized Breusch-Pagan test
## 
## data:  model4
## BP = 13.797, df = 3, p-value = 0.003195

Try to solve our “heteroscedasticity” problem

# install.packages('e1071', dependencies=TRUE)
crimBCMod <- caret::BoxCoxTrans(Boston$crim)
print(crimBCMod)
## Box-Cox Transformation
## 
## 506 data points used to estimate Lambda
## 
## Input data summary:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620 
## 
## Largest/Smallest: 14100 
## Sample Skewness: 5.19 
## 
## Estimated Lambda: -0.1 
## With fudge factor, Lambda = 0 will be used for transformations
# the transformation value is the "Estimated Lambda"
Boston$crim1 <- predict(crimBCMod, Boston$crim)
hist(Boston$crim)

hist(Boston$crim1)
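
Since the fudge factor rounds Lambda to 0, the transformation applied here should be just the natural log; a quick sanity check (assuming caret's predict method for BoxCoxTrans behaves as documented):
all.equal(Boston$crim1, log(Boston$crim)) # lambda = 0 -> log transform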

# and repeat the model4
model4.1 <- lm(Boston$crim1 ~ Boston$chas + Boston$indus + Boston$black)
ncvTest(model4.1) # did not solve the problem
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 97.17714, Df = 1, p = < 2.22e-16
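
Since the transformation did not remove the heteroscedasticity, another common remedy is to keep the model but report heteroscedasticity-consistent (robust) standard errors; a sketch, assuming the sandwich package is installed:
library(sandwich)
coeftest(model4, vcov = vcovHC(model4, type = "HC3")) # robust t-tests for model4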

TASK - part 1/2:

  • Delete influential cases from the dataset.
  • Create your own model with a factor predictor, adding more predictors.
  • Interpret the results.
  • Check the multicollinearity, outliers, and leverages of your model.
  • SEND OUTPUT AND INTERPRETATIONS TO .
# deleting the influential leverage point: 419
Boston1 <- Boston[-c(419), ]
# new predictor, welcome
Boston1$rad <- as.factor(Boston1$rad)
plot(Boston1$rad)

So, for the task I had to build a model with a factor variable. As there aren't many factor variables, and one was already used, I'm taking the 'rad' variable. It is the index of accessibility to radial highways, with 9 levels. Its distribution can be seen above.
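
Because rad is now a factor, lm() will dummy-code it against its first level (rad = 1), which is absorbed into the intercept; a quick look at the coding (a minimal sketch):
levels(Boston1$rad)                       # 9 levels: "1".."8" and "24"
head(model.matrix(~ rad, data = Boston1)) # one dummy column per non-reference level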

# new model and interpretation of its results
modelNew1 <- lm(Boston1$crim~Boston1$rad)
summary(modelNew1)
## 
## Call:
## lm(formula = Boston1$crim ~ Boston1$rad)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.917 -0.593 -0.076  0.086 76.681 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.03603    1.37235   0.026    0.979    
## Boston1$rad2   0.04726    1.85817   0.025    0.980    
## Boston1$rad3   0.06133    1.69546   0.036    0.971    
## Boston1$rad4   0.35787    1.49190   0.240    0.811    
## Boston1$rad5   0.65176    1.48690   0.438    0.661    
## Boston1$rad6   0.11403    1.82539   0.062    0.950    
## Boston1$rad7   0.11437    2.02461   0.056    0.955    
## Boston1$rad8   0.33538    1.85817   0.180    0.857    
## Boston1$rad24 12.25933    1.47339   8.321 8.54e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.137 on 496 degrees of freedom
## Multiple R-squared:  0.4245, Adjusted R-squared:  0.4152 
## F-statistic: 45.74 on 8 and 496 DF,  p-value: < 2.2e-16

For now the model seems quite nice: the p-value is low enough (<.05), and the adjusted R-squared is 0.4152, which means the model explains about 41.5% of the variance in the predicted variable (crime rate). As for the significant levels: when the index of accessibility to radial highways equals 24, the predicted crime rate is about 12.3 higher than at the reference level (rad = 1).
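
With a single factor predictor, the coefficients are simply differences in group means relative to the reference group (rad = 1); a sketch to verify:
grp <- tapply(Boston1$crim, Boston1$rad, mean) # mean crime rate per rad level
grp["24"] - grp["1"] # should match the Boston1$rad24 estimate (~12.26)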

NewModel <- lm(Boston1$crim ~ Boston1$rad + Boston1$black)
summary(NewModel)
## 
## Call:
## lm(formula = Boston1$crim ~ Boston1$rad + Boston1$black)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.435  -0.428  -0.084   0.093  77.443 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.816147   1.898510   1.483   0.1386    
## Boston1$rad2   0.026837   1.851751   0.014   0.9884    
## Boston1$rad3   0.083791   1.689612   0.050   0.9605    
## Boston1$rad4   0.311084   1.486893   0.209   0.8344    
## Boston1$rad5   0.508351   1.483303   0.343   0.7320    
## Boston1$rad6   0.100410   1.819077   0.055   0.9560    
## Boston1$rad7   0.108371   2.017590   0.054   0.9572    
## Boston1$rad8   0.306846   1.851776   0.166   0.8685    
## Boston1$rad24 11.551505   1.506072   7.670  9.2e-14 ***
## Boston1$black -0.007142   0.003383  -2.111   0.0353 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.116 on 495 degrees of freedom
## Multiple R-squared:  0.4297, Adjusted R-squared:  0.4193 
## F-statistic: 41.43 on 9 and 495 DF,  p-value: < 2.2e-16

I've also decided to add one more predictor, the proportion of blacks. As I remember, it was quite a significant factor. Now I see that it is still significant and has a negative relation to the crime rate, but it is quite a weak predictor. As for the explanatory power of the whole model, it increased just a little: the model now explains almost 42% of the variance in the predicted variable (crime rate).
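
modelNew1 is nested in NewModel (they differ only by black), so the improvement can also be tested formally; a sketch:
anova(modelNew1, NewModel) # the p-value (~.035) matches the t-test for black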

# Checking:

# multicollinearity
vif(NewModel)
##                   GVIF Df GVIF^(1/(2*Df))
## Boston1$rad   1.252151  8        1.014153
## Boston1$black 1.252151  1        1.118996
# outliers
outlierTest(NewModel)
##      rstudent unadjusted p-value Bonferroni p
## 381 15.512078         1.6898e-44   8.5334e-42
## 406 10.165554         3.5962e-22   1.8161e-19
## 411  6.355010         4.7410e-10   2.3942e-07
## 415  5.437406         8.5137e-08   4.2994e-05
## 405  4.959572         9.7322e-07   4.9148e-04
## 399  4.494258         8.7052e-06   4.3961e-03
## 427  3.946641         9.0751e-05   4.5829e-02
qqPlot(NewModel, main="QQ Plot")

## [1] 381 406
# leverages
leveragePlots(NewModel) # leverage plots

par(mfrow=c(2,2)) # to put 4 graphs on 1 screen
plot(NewModel) # diagnostics plots

  • multicollinearity: all the values are < 5, so everything is OK;
  • outliers: we see a couple of outliers (406 and 381);
  • leverages: no influential leverage points :) (a combined check is sketched below)
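
A compact way to look at outlierness, leverage, and influence in one picture is car's influence plot; a sketch:
influencePlot(NewModel) # circle size is proportional to Cook's distance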

Models comparison: which model is better?

Nested models – one of the models contains ALL the predictors of the other. If you delete the shared predictors from both models, one of them becomes EMPTY: Y1 = x1 + x2 + x3; Y2 = x1 + x2 + x3 + x4 + x5.

Non-nested models – both models have original predictors. Two types of non-nested models:

  1. partly non-nested: Y1 = x1 + x2 + x7; Y2 = x1 + x2 + x3 + x4 + x5

  2. completely non-nested: Y1 = x1 + x2 + x3; Y2 = x4 + x5 + x6

Comparing nested models -> anova {stats}. It compares the residuals (= unexplained variance), controlling for df (degrees of freedom); look at the p-value.

anova(model1, model4) # which one is better?

Comparing non-nested models -> AIC (Akaike Information Criterion). The lower, the better!!

model7 <- lm(Boston$crim~Boston$dis+Boston$medv)
AIC(model4) #3487.085
## [1] 3487.085
AIC(model7) #3484.536 - better.
## [1] 3484.536
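
For reference, AIC = 2k - 2*log(L), where k counts the estimated parameters (the coefficients plus the error variance); a sketch that reproduces AIC(model7) by hand:
k <- length(coef(model7)) + 1          # +1 for sigma
2 * k - 2 * as.numeric(logLik(model7)) # should equal AIC(model7) (~3484.5)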

How to create a good model? Forward vs. backward selection

modelEMPTY <- lm(Boston$crim~0) # -> adding factors and comparing models = forward (note: ~0 drops the intercept; the usual starting point is the intercept-only model ~1)
summary(modelEMPTY)
## 
## Call:
## lm(formula = Boston$crim ~ 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  0.006  0.082  0.257  3.677 88.976 
## 
## No Coefficients
## 
## Residual standard error: 9.322 on 506 degrees of freedom
modelFULL <- lm(Boston$crim~., data = Boston) # -> removing insignificant factors = backward (beware: "." now also picks up crim1, the Box-Cox transform of the outcome created above)
summary(modelFULL)
## 
## Call:
## lm(formula = Boston$crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.122 -2.570 -0.650  1.481 67.119 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  28.815843   6.826482   4.221 2.90e-05 ***
## zn            0.081738   0.017823   4.586 5.74e-06 ***
## indus        -0.126213   0.077567  -1.627 0.104348    
## chas         -0.597332   1.093331  -0.546 0.585079    
## nox         -22.455793   5.066704  -4.432 1.15e-05 ***
## rm            0.584749   0.567938   1.030 0.303704    
## age          -0.017450   0.016735  -1.043 0.297574    
## dis          -0.970012   0.261062  -3.716 0.000226 ***
## rad           0.137292   0.095503   1.438 0.151194    
## tax          -0.003363   0.004776  -0.704 0.481675    
## ptratio      -0.141233   0.173306  -0.815 0.415504    
## black        -0.002798   0.003443  -0.813 0.416728    
## lstat         0.026137   0.071007   0.368 0.712967    
## medv         -0.231933   0.056176  -4.129 4.29e-05 ***
## crim1         3.156447   0.347787   9.076  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.965 on 491 degrees of freedom
## Multiple R-squared:  0.5324, Adjusted R-squared:  0.5191 
## F-statistic: 39.94 on 14 and 491 DF,  p-value: < 2.2e-16
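
Both directions can be automated with step(), which adds or drops terms by AIC; a sketch of backward selection (first excluding crim1, the derived copy of the outcome):
modelFULL2 <- lm(crim ~ . - crim1, data = Boston)
modelBACK <- step(modelFULL2, direction = "backward", trace = 0)
summary(modelBACK)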

What drawbacks do these methods have?

Graphics

library(ggplot2)
regr <- ggplot(Boston, aes(black, crim)) # outcome (crim) on the y-axis
regr + geom_point()

Perhaps a simple way to mitigate the heteroscedasticity of the residuals is to convert numeric predictors into categorical ones, as sketched below.
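
For example, a sketch that bins black into quartiles with cut() and refits model4 (the quartile breaks are an arbitrary illustrative choice):
Boston$black_cat <- cut(Boston$black,
                        breaks = quantile(Boston$black, probs = seq(0, 1, 0.25)),
                        include.lowest = TRUE)
model4.2 <- lm(crim ~ chas + indus + black_cat, data = Boston)
ncvTest(model4.2) # check whether the heteroscedasticity is reduced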

TASK - part 2/2:

  • Compared with model 4, do you have a nested model or a partly/completely non-nested model?
  • Use anova or AIC to compare the two models (model 4 and your model).
  • SEND OUTPUT AND INTERPRETATIONS TO .

Model 4 has the following predictors: chas + black + indus. My NewModel has the predictors rad + black. Both models have the same outcome: crime rate.

So these two models are partly non-nested, because they share one variable, 'black', while the other predictors are different.

Let’s compare them

AIC(model4) #3487.085
## [1] 3487.085
AIC(NewModel) # 3274.053 - the lower, the better (strictly, NewModel is fit on Boston1, which drops one row, so the comparison is only approximate)
## [1] 3274.053

Woohoo, mine is better :) * yes, I’m proud of myself *