Linear Model Diagnostics

Olesya Volchenko

January 23, 2019

Linear model assumptions

The classical assumptions are: a linear relationship between predictors and outcome, normally distributed errors, constant error variance (homoscedasticity), independent observations, and no (perfect) multicollinearity. For an overview see:

https://www.statisticssolutions.com/assumptions-of-linear-regression/

Dataset information from the help file

Crowding and Crime in U. S. Metropolitan Areas

The Freedman data frame has 110 rows and 4 columns. The observations are U. S. metropolitan areas with 1968 populations of 250,000 or more. There are some missing data.

population - Total 1968 population, 1000s.

nonwhite - Percent nonwhite population, 1960.

density - Population per square mile, 1968.

crime - Crime rate per 100,000, 1969.

Source: United States (1970) Statistical Abstract of the United States. Bureau of the Census.

Let’s take a look at our dataset

library(car)  # diagnostic functions used below; also makes the Freedman data available
Crime <- as.data.frame(Freedman[, c("crime", "population", "nonwhite", "density")])
head(Crime)
##             crime population nonwhite density
## Akron        2602        675      7.3     746
## Albany       1388        713      2.6     322
## Albuquerque  5018         NA      3.3      NA
## Allentown    1182        534      0.8     491
## Anaheim      3341       1261      1.4    1612
## Atlanta      2805       1330     22.8     770

I. Modeling

  1. Check for the normality of the dependent variable
plot(density(Crime$crime))
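A formal test can back up the visual check; the line below is an added suggestion (not in the original slides) using base R's Shapiro-Wilk test:

# Added sketch: formal normality test of the dependent variable
shapiro.test(Crime$crime)   # a p-value above 0.05 gives no evidence against normality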

I. Modeling

fit <- lm(crime ~ population + nonwhite, data = Crime)
summary(fit)
## 
## Call:
## lm(formula = crime ~ population + nonwhite, data = Crime)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1989.18  -669.74    77.22   607.64  2217.20 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.185e+03  1.398e+02  15.630  < 2e-16 ***
## population  2.376e-01  5.639e-02   4.214 5.63e-05 ***
## nonwhite    2.611e+01  8.724e+00   2.993   0.0035 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 873.1 on 97 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.2279, Adjusted R-squared:  0.212 
## F-statistic: 14.32 on 2 and 97 DF,  p-value: 3.56e-06

Overall picture

scatterplotMatrix(Crime, spread = FALSE, lty.smooth = 2,
                  main = "Scatterplot matrix")
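As a quick complement (not part of the original slides), base R's plot() method for lm objects draws four standard diagnostic panels in one call (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage):

# Not in the original slides: base R's diagnostic panels for a fitted lm
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))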

Outliers

Outliers: visual

qqPlot(fit, labels=row.names(Crime), simulate=TRUE, 
       main="Q-Q Plot")

##  philadelphia San.Francisco 
##            75            89

Outliers: formal

outlierTest(fit)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##               rstudent unadjusted p-value Bonferonni p
## San.Francisco 2.649709          0.0094217      0.94217

If the Bonferroni-adjusted p-value is above 0.05, the test is not significant and outliers are not a serious problem for the model

Leverages (hat values)

head(hatvalues(fit))
##       Akron      Albany   Allentown     Anaheim     Atlanta Bakersfield 
##  0.01187444  0.01692672  0.02068355  0.01873706  0.02464981  0.01377398
plot(hatvalues(fit))
abline(h = c(2, 3) * 3/110, lty = 2)
text(hatvalues(fit), labels = names(hatvalues(fit)))

3 stands for the number of independent variables in the model plus one (2 + 1), i.e. k + 1.

110 is the number of observations in the data (strictly speaking, only the 100 complete cases enter the model, so n = 100 would be slightly more accurate). The cutoffs are h = c(2, 3) * (k + 1)/n.

(k + 1)/n is the average hatvalue. The 2(k + 1)/n rule comes from results in Belsley, Kuh, and Welsch (1980), Regression Diagnostics, concerning the distribution of the hatvalues when n is large relative to k + 1, and when X is multivariate normal. For smaller n, this tends to nominate too many points, and thus suggests the rule 3(k + 1)/n
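As a small added sketch (not in the original slides), the same cutoffs can be computed from the fitted model itself, using the number of cases actually used in the fit:

# Flag cases whose hat value exceeds the 2(k + 1)/n and 3(k + 1)/n cutoffs
k <- length(coef(fit)) - 1         # number of predictors (2)
n <- length(hatvalues(fit))        # complete cases actually used in the fit
cutoffs <- c(2, 3) * (k + 1) / n
names(which(hatvalues(fit) > cutoffs[2]))   # cases above the larger cutoff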

Cook’s distance

plot(cooks.distance(fit))
text(cooks.distance(fit), labels = names(cooks.distance(fit)))
abline(h = 4/(110 - 2 - 1), lty = 2)

The formula is 4/(n - k - 1), where n stands for the number of observations and k for the number of independent variables in the model

You can also use the identify() function to label points interactively: identify(cooks.distance(fit), labels = names(cooks.distance(fit)))
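The same cutoff can also be used to list the flagged cases directly; the lines below are an added sketch, not part of the original slides:

# Cases whose Cook's distance exceeds 4/(n - k - 1)
n <- length(cooks.distance(fit))
k <- length(coef(fit)) - 1
names(which(cooks.distance(fit) > 4 / (n - k - 1)))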

A plot with Cook’s distances, outliers, hat Values

plot(hatvalues(fit), rstudent(fit), type = 'n')
abline(h = c(-2, 0, 2), lty = 2)
abline(v = c(2, 3) * 3/110, lty = 2)
cook <- sqrt(cooks.distance(fit))
points(hatvalues(fit), rstudent(fit), cex = 10 * cook / max(cook))
text(hatvalues(fit), rstudent(fit), names(rstudent(fit)))

There are three horizontal dashed lines at -2, 0 and 2, corresponding to the usual thresholds for studentized residuals.

There are two vertical lines marking the leverage (hat-value) thresholds.

The size of each circle reflects the observation’s Cook’s distance (the radius is proportional to its square root).
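The car package also offers influencePlot(), which produces essentially the same display in one call and returns the most noteworthy cases; it is not used in the original slides, but it is a convenient shortcut:

# One-call alternative: studentized residuals vs hat values,
# with circle size reflecting Cook's distance
influencePlot(fit, main = "Influence Plot")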

Dfbeta

Let’s create an object containing the dfbetas from the model (one row per case used in the fit, one column per coefficient)

dfbs <- dfbetas(fit)
dfbs[1:5,]
##            (Intercept)   population     nonwhite
## Akron      0.007324608 -0.002091105 -0.002404142
## Albany    -0.150587673  0.026128053  0.094957595
## Allentown -0.190306713  0.043140974  0.129449432
## Anaheim    0.112634531  0.013963544 -0.089082386
## Atlanta    0.006630463 -0.001355014 -0.040975470
plot(dfbs[, c(2, 3)], pch = 16)
abline(v = 0, lty = 2)
abline(h = 0, lty = 2)
text(dfbs[, 2] + 0.02, dfbs[, 3], labels = rownames(dfbs), cex = 0.5)

The Fresno case increases the effect of population and decreases the effect of nonwhite, while the Memphis case decreases the effect of population

What’s wrong with dfbetas? There are too many of them (the number of cases multiplied by the number of coefficients)

Use Cook’s distances instead
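That said, if you do want to screen the dfbetas, a common rule of thumb (Belsley, Kuh, and Welsch) flags any |dfbetas| above 2/sqrt(n). The lines below are an added sketch, not part of the original slides:

# Cases with a large standardized influence on at least one coefficient
n <- nrow(dfbs)
rownames(dfbs)[apply(abs(dfbs) > 2 / sqrt(n), 1, any)]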

Residuals normality: visual

plot(density(residuals(fit)))

hist(residuals(fit))

Residuals normality: formal

library(olsrr)
ols_test_normality(fit)
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9939         0.9342 
## Kolmogorov-Smirnov        0.0605         0.8574 
## Cramer-von Mises          8.4233         0.0000 
## Anderson-Darling          0.2041         0.8714 
## -----------------------------------------------

Non-linearity

A scatterplot can show a non-linear relation between two variables (the effect may be non-linear)

scatterplot(crime ~ nonwhite, data = Crime)
text(Crime$nonwhite, Crime$crime, labels=row.names(Crime), cex = 0.5)

Non-linearity

scatterplot(crime ~ population, data = Crime)
text(Crime$population, Crime$crime, labels=row.names(Crime), cex = 0.5)

Non-linearity for all variables

If you want to check for possible non-linearity in all the variables in the regression at once, use the crPlots() function:

crPlots(fit)

If you’d like to know how to transform your variables

boxTidwell(crime ~ population + nonwhite, data=Crime)
##            MLE of lambda Score Statistic (z) Pr(>|z|)   
## population      -0.50213             -2.4717 0.013448 * 
## nonwhite        -0.35661             -2.7972 0.005155 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## iterations =  6

The score test checks the null hypothesis of no transformation (lambda = 1). If the p-value is below 0.05, the test is significant and a power transformation of that predictor is suggested; the MLE of lambda indicates the power to use.

https://www.youtube.com/watch?v=V_Ut4O1qyqM
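If a transformation is suggested, refitting with the estimated powers is straightforward. The sketch below is illustrative only (not part of the original slides) and uses the rounded MLEs of lambda reported above:

# Illustrative refit with the Box-Tidwell power transformations (rounded)
fit_bt <- lm(crime ~ I(population^-0.5) + I(nonwhite^-0.36), data = Crime)
summary(fit_bt)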

Non-normality of residuals

We can transform the DV

library(MASS)
par(mfrow = c(1, 2))
boxcox(fit)
boxcox(fit, lambda = seq(0, 6, by = .01)) # there should be three vertical dotted lines on the graph

The left and right dashed lines mark the confidence interval for lambda, and the middle one is of major interest: it marks the maximum-likelihood estimate, i.e. the power to which the dependent variable should be raised for the best transformation.

If you don’t see three lines, change the parameters of seq() in such a way that you see all three vertical dotted lines.
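The MLE of lambda can also be extracted programmatically and used to refit the model; the lines below are an added sketch, not part of the original slides:

# Extract the lambda that maximizes the profile log-likelihood and refit
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
fit_bc <- lm(crime^lambda ~ population + nonwhite, data = Crime)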

Heteroscedasticity: general idea

Heteroscedasticity is also known as non-constant (error) variance.

Heteroscedasticity: our case

par(mfrow = c(1, 2))
plot(fitted.values(fit), rstudent(fit))
abline(h=0, lty=2) 

spreadLevelPlot(fit) 

## 
## Suggested power transformation:  0.5363445

The fitted (blue) line in the spread-level plot should be roughly horizontal if the error variance is constant.

ncvTest(fit) 
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.2488035, Df = 1, p = 0.61792

ncvTest() is a formal test for heteroscedasticity. If the p-value is above 0.05, the test is not significant and there is no evidence of heteroscedasticity.
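Here the test is not significant, so nothing needs to be done. If it were, one option already hinted at by spreadLevelPlot() is to apply the suggested power transformation (about 0.54, roughly a square root) to the dependent variable; the lines below are an added sketch, not part of the original slides:

# Illustrative only: refit with an (approximately) square-root transformed DV
fit_sqrt <- lm(sqrt(crime) ~ population + nonwhite, data = Crime)
ncvTest(fit_sqrt)   # re-check non-constant variance after the transformation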

Multicollinearity

vif(fit)
## population   nonwhite 
##   1.004993   1.004993

If all values are less than 4, there is no problem.

If there are categorical variables in the model, vif() reports GVIF (generalized variance inflation factors) instead.

It’s OK to have values higher than 4 if the model includes an interaction term.

Multicollinearity simulation

Copy the code below into R (no random seed is set, so your numbers will differ slightly from the output shown):

#No Multicollinearity
x1 <- rnorm(100, 0, 1)
x2 <- rnorm(100, 0, 1)
#x1 and x2 are independent
e <- rnorm(100, 0, 0.5)
y <- 0.5 * x1 + 0.5 * x2 + e 
mydata <- data.frame(x1, x2, e, y)
model1 <- lm(y ~ x1 + x2, data = mydata)
summary(model1)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3534 -0.3776  0.0243  0.4261  1.3629 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.08486    0.05447  -1.558    0.122    
## x1           0.44646    0.05657   7.892 4.54e-12 ***
## x2           0.51328    0.04824  10.640  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5429 on 97 degrees of freedom
## Multiple R-squared:  0.6547, Adjusted R-squared:  0.6476 
## F-statistic: 91.95 on 2 and 97 DF,  p-value: < 2.2e-16
#library(rgl)
#plot3d(y, x1, x2, size=3)

#Multicollinearity
x21 <- rnorm(100, 0, 1)
x22 <- x21 + rnorm(100, 0, 0.5)
e2 <- rnorm(100, 0, 0.5)
y2 <- 0.5 * x21 + 0.5 * x22 + e2 
model2 <- lm(y2 ~ x21 + x22)
summary(model2)
## 
## Call:
## lm(formula = y2 ~ x21 + x22)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3462 -0.2536  0.0360  0.2808  1.2516 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.02461    0.04753  -0.518  0.60574    
## x21          0.68727    0.11268   6.099 2.17e-08 ***
## x22          0.28793    0.10192   2.825  0.00574 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4751 on 97 degrees of freedom
## Multiple R-squared:  0.8488, Adjusted R-squared:  0.8457 
## F-statistic: 272.3 on 2 and 97 DF,  p-value: < 2.2e-16
#plot3d(y2, x21, x22, size=3)

vif(model1)
##      x1      x2 
## 1.00226 1.00226
vif(model2)
##      x21      x22 
## 7.061465 7.061465

Diagnostics to-do list