Olesya Volchenko
January 23, 2019
https://www.statisticssolutions.com/assumptions-of-linear-regression/
Crowding and Crime in U. S. Metropolitan Areas
The Freedman data frame has 110 rows and 4 columns. The observations are U. S. metropolitan areas with 1968 populations of 250,000 or more. There are some missing data.
population - Total 1968 population, 1000s.
nonwhite - Percent nonwhite population, 1960.
density - Population per square mile, 1968.
crime - Crime rate per 100,000, 1969.
Source: United States (1970) Statistical Abstract of the United States. Bureau of the Census.
References: Freedman, J. (1975) Crowding and Behavior. Viking Press.
library(car)   # provides the Freedman data and the diagnostic functions used below
Crime <- as.data.frame(Freedman[, c("crime", "population", "nonwhite", "density")])
head(Crime)
## crime population nonwhite density
## Akron 2602 675 7.3 746
## Albany 1388 713 2.6 322
## Albuquerque 5018 NA 3.3 NA
## Allentown 1182 534 0.8 491
## Anaheim 3341 1261 1.4 1612
## Atlanta 2805 1330 22.8 770
plot(density(Crime$crime))
fit <- lm(crime ~ population + nonwhite, data=Crime)
summary(fit)
##
## Call:
## lm(formula = crime ~ population + nonwhite, data = Crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1989.18 -669.74 77.22 607.64 2217.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.185e+03 1.398e+02 15.630 < 2e-16 ***
## population 2.376e-01 5.639e-02 4.214 5.63e-05 ***
## nonwhite 2.611e+01 8.724e+00 2.993 0.0035 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 873.1 on 97 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.2279, Adjusted R-squared: 0.212
## F-statistic: 14.32 on 2 and 97 DF, p-value: 3.56e-06
scatterplotMatrix(Crime, spread=FALSE, lty.smooth=2, main="Scatterplot matrix")
qqPlot(fit, labels=row.names(Crime), simulate=TRUE, main="Q-Q Plot")
## Philadelphia San.Francisco
## 75 89
outlierTest(fit)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## San.Francisco 2.649709 0.0094217 0.94217
If the Bonferroni p-value is greater than 0.05, the test is not significant and the outliers are not a serious problem for the model.
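The Bonferroni p is just the unadjusted p-value multiplied by the number of tests (one per fitted case), capped at 1. A quick check against the output above, assuming the 100 complete cases as the number of tests:
p.adjust(0.0094217, method = "bonferroni", n = 100)   # 0.94217, matching the output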
head(hatvalues(fit))
## Akron Albany Allentown Anaheim Atlanta Bakersfield
## 0.01187444 0.01692672 0.02068355 0.01873706 0.02464981 0.01377398
plot(hatvalues(fit))
abline(h=c(2,3)*3/110,lty=2)
text(hatvalues(fit), labels=names(hatvalues(fit)))  # names() matches the 100 cases actually fitted
The 3 in the cutoff is the number of independent variables in the model plus one (k + 1 = 2 + 1),
and 110 is the number of observations, so the reference lines are h = c(2,3)*(k+1)/n.
(k + 1)/n is the average hat value. The 2(k + 1)/n rule comes from results in Belsley, Kuh, and Welsch (1980), Regression Diagnostics, concerning the distribution of the hat values when n is large relative to k + 1 and X is multivariate normal. For smaller n this rule tends to nominate too many points, which suggests the stricter rule 3(k + 1)/n.
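A minimal sketch of flagging high-leverage cases with these cutoffs; note that it takes n from the fit rather than the raw row count, since 10 incomplete cases were dropped (an adjustment, not part of the original code):
h <- hatvalues(fit)
k <- 2                      # predictors in the model
n <- length(h)              # cases actually used in the fit
names(h)[h > 2*(k+1)/n]     # liberal rule
names(h)[h > 3*(k+1)/n]     # stricter rule for smaller n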
plot(cooks.distance(fit))
text(cooks.distance(fit), labels = names(cooks.distance(fit)))
abline(h=4/(110-2-1), lty=2)
The cutoff formula is 4/(n - k - 1), where n is the number of observations and k is the number of independent variables in the model.
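The same cutoff can be applied programmatically; a sketch, again taking n from the fit:
d <- cooks.distance(fit)
n <- length(d); k <- 2
names(d)[d > 4/(n - k - 1)]   # cases above the 4/(n-k-1) cutoff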
You can also label points interactively with the identify() function: identify(cooks.distance(fit), labels=row.names(Crime))
plot(hatvalues(fit), rstudent(fit), type='n')
abline(h=c(-2, 0, 2), lty=2)
abline(v=c(2,3)*3/110, lty=2)
cook <- sqrt(cooks.distance(fit))
points(hatvalues(fit), rstudent(fit), cex=10*cook/max(cook))
text(hatvalues(fit), rstudent(fit), names(hatvalues(fit)))
There are three horizontal dashed lines at -2, 0, and 2, corresponding to the thresholds for the studentized residuals,
and two vertical lines marking the leverage thresholds.
The size of each circle is proportional to the Cook's distance of the observation.
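The car package bundles this picture into a single call, influencePlot(), which also prints the most influential cases; a one-line alternative to the manual plot above:
influencePlot(fit)   # studentized residuals vs. hat values, circles scaled by Cook's distance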
Let’s create an object containing all the dfbetas from the model: one value per fitted case per coefficient (100 × 3 here, since 10 incomplete cases were dropped).
dfbs <- dfbetas(fit)
dfbs[1:5,]
## (Intercept) population nonwhite
## Akron 0.007324608 -0.002091105 -0.002404142
## Albany -0.150587673 0.026128053 0.094957595
## Allentown -0.190306713 0.043140974 0.129449432
## Anaheim 0.112634531 0.013963544 -0.089082386
## Atlanta 0.006630463 -0.001355014 -0.040975470
plot(dfbs[,c(2,3)], pch = 16)
abline(v=0, lty=2)
abline(h=0, lty=2)
text(dfbs[,2] + 0.02, dfbs[,3], labels=rownames(dfbs), cex = 0.5)
The Fresno case increases the effect of population and decreases the effect of nonwhite, while the Memphis case decreases the effect of population.
What’s wrong with dfbetas? There are too many of them (the number of cases multiplied by the number of coefficients).
Use Cook’s distances instead
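If you still want to screen the dfbetas, a common size-adjusted cutoff (from Belsley, Kuh, and Welsch) is 2/sqrt(n); a sketch:
n <- nrow(dfbs)
which(abs(dfbs) > 2/sqrt(n), arr.ind = TRUE)   # rows = cases, columns = coefficients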
plot(density(residuals(fit)))
hist(residuals(fit))
library(olsrr)
ols_test_normality(fit)
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.9939 0.9342
## Kolmogorov-Smirnov 0.0605 0.8574
## Cramer-von Mises 8.4233 0.0000
## Anderson-Darling 0.2041 0.8714
## -----------------------------------------------
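The Shapiro-Wilk line should be reproducible without olsrr, using base R on the residuals:
shapiro.test(residuals(fit))   # expected to match the W = 0.9939, p = 0.9342 in the table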
A scatterplot can show a non-linear relation between two variables (the effect may be non-linear).
scatterplot(crime ~ nonwhite, data = Crime)
text(Crime$nonwhite, Crime$crime, labels=row.names(Crime), cex = 0.5)
scatterplot(crime ~ population, data = Crime)
text(Crime$population, Crime$crime, labels=row.names(Crime), cex = 0.5)
To check for possible non-linearity in all the variables of the regression at once, use the crPlots() function.
crPlots(fit)
boxTidwell(crime ~ population + nonwhite, data=Crime)
## MLE of lambda Score Statistic (z) Pr(>|z|)
## population -0.50213 -2.4717 0.013448 *
## nonwhite -0.35661 -2.7972 0.005155 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## iterations = 6
The score test here tests the null hypothesis λ = 1 (no transformation needed), so if the p-value is smaller than 0.05, the test suggests transforming that predictor, with the MLE of λ as the suggested power.
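A sketch of refitting with the estimated powers; the exponents are the MLEs from the output, rounded, and fit_bt is an illustrative name, not part of the original analysis:
fit_bt <- lm(crime ~ I(population^-0.5) + I(nonwhite^-0.36), data = Crime)
summary(fit_bt)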
We can also transform the DV (Box-Cox):
library(MASS)
par(mfrow = c(1, 2))
boxcox(fit)
boxcox(fit, lambda=seq(0, 6, by=.01))
There should be three vertical dotted lines on the graph: the left and right ones mark the confidence interval, and the middle one is of the main interest here, as it shows the power to which your variable should be raised for the best transformation.
If you don’t see all three lines, change the parameters of seq() so that all three vertical dotted lines are visible.
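The optimal λ can also be read off numerically from the list that boxcox() returns, instead of from the plot; a sketch (bc is an illustrative name):
bc <- boxcox(fit, lambda = seq(0, 6, by = 0.01), plotit = FALSE)
bc$x[which.max(bc$y)]   # the lambda with the highest profile log-likelihood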
Heteroscedasticity, also known as non-constant variance.
par(mfrow = c(1, 2))
plot(fitted.values(fit), rstudent(fit))
abline(h=0, lty=2)
spreadLevelPlot(fit)
##
## Suggested power transformation: 0.5363445
The blue line should be horizontal.
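The suggested power (about 0.54, close to a square root) can be applied to the DV and the model refitted; a sketch, with fit_sqrt as a hypothetical name:
fit_sqrt <- lm(sqrt(crime) ~ population + nonwhite, data = Crime)
spreadLevelPlot(fit_sqrt)   # the blue line should now be flatter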
ncvTest(fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.2488035, Df = 1, p = 0.61792
This is a test for heteroscedasticity: if the p-value is greater than 0.05, the test is not significant and there is no evidence of heteroscedasticity.
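A common alternative check, not used in the original, is the Breusch-Pagan test from the lmtest package:
library(lmtest)
bptest(fit)   # H0: constant variance; a large p-value again means no evidence of heteroscedasticity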
vif(fit)
## population nonwhite
## 1.004993 1.004993
If the VIF is less than 4, there is no problem.
If there are categorical variables in the model, the function reports the GVIF (generalized VIF) instead.
It is OK to have values higher than 4 if the model contains an interaction term.
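The VIF of predictor j is 1/(1 - R²_j), where R²_j comes from regressing predictor j on the other predictors; with two predictors this is easy to verify against the output above (approximately, since vif() uses only the complete cases):
r2 <- summary(lm(population ~ nonwhite, data = Crime))$r.squared
1/(1 - r2)   # should be close to the 1.005 reported by vif(fit)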
Copy the code below into R:
#No Multicollinearity
x1 <- rnorm(100, 0, 1)
x2 <- rnorm(100, 0, 1)
#x1 and x2 are independent
e <- rnorm(100, 0, 0.5)
y <- 0.5 * x1 + 0.5 * x2 + e
mydata <- data.frame(x1, x2, e, y)
model1 <- lm(y ~ x1 + x2, data = mydata)
summary(model1)
##
## Call:
## lm(formula = y ~ x1 + x2, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3534 -0.3776 0.0243 0.4261 1.3629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.08486 0.05447 -1.558 0.122
## x1 0.44646 0.05657 7.892 4.54e-12 ***
## x2 0.51328 0.04824 10.640 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5429 on 97 degrees of freedom
## Multiple R-squared: 0.6547, Adjusted R-squared: 0.6476
## F-statistic: 91.95 on 2 and 97 DF, p-value: < 2.2e-16
#library(rgl)
#plot3d(y, x1, x2, size=3)
#Multicollinearity
x21 <- rnorm(100, 0, 1)
x22 <- x21 + rnorm(100, 0, 0.5)
e2 <- rnorm(100, 0, 0.5)
y2 <- 0.5 * x21 + 0.5 * x22 + e2
model2 <- lm(y2 ~ x21 + x22)
summary(model2)
##
## Call:
## lm(formula = y2 ~ x21 + x22)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3462 -0.2536 0.0360 0.2808 1.2516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.02461 0.04753 -0.518 0.60574
## x21 0.68727 0.11268 6.099 2.17e-08 ***
## x22 0.28793 0.10192 2.825 0.00574 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4751 on 97 degrees of freedom
## Multiple R-squared: 0.8488, Adjusted R-squared: 0.8457
## F-statistic: 272.3 on 2 and 97 DF, p-value: < 2.2e-16
#plot3d(y2, x21, x22, size=3)
vif(model1)
## x1 x2
## 1.00226 1.00226
vif(model2)
## x21 x22
## 7.061465 7.061465
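With two predictors, the VIF is determined entirely by their squared correlation, VIF = 1/(1 - r²), which makes the simulated values easy to verify (only approximately, since the simulation has no fixed seed):
1/(1 - cor(x1, x2)^2)      # close to 1, as for model1
1/(1 - cor(x21, x22)^2)    # around 7, as for model2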