Problem statement: Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
### Brief intro
This dataset covers crime in small cities. Predictors are education characteristics and police funding. Overall crime rate and violent crime rate are targets. The cities are not identified (shucks).
The data dictionary:
Dataset: “mlr06.csv”
crime_rate = total overall reported crime rate per 1 million residents
v_crime = reported violent crime rate per 100,000 residents
police_exp = annual police funding in $/resident
HS_pct = % of people 25 years+ with 4 yrs. of high school
notHS_pct = % of 16 to 19 year-olds not in highschool and not highschool graduates
college18-24 = % of 18 to 24 year-olds in college
college25_plus = % of people 25 years+ with at least 4 years of college
Reference: Life In America’s Small Cities, By G.S. Thomas
df <- data.frame(read.csv("mlr06.csv"))
describe(df)[,-c(1,3)]## n sd median trimmed mad min max range skew
## crime_rate 50 293.94 654.5 674.92 243.15 341 1740 1399 1.44
## v_crime 50 573.74 454.0 532.77 358.79 29 3545 3516 2.73
## police_exp 50 13.82 34.5 35.60 6.67 16 86 70 1.72
## HS_pct 50 9.97 59.0 58.75 13.34 42 81 39 0.05
## notHS_pct 50 6.02 14.0 15.00 5.93 4 34 30 0.65
## college18_24 50 14.80 25.0 27.62 8.90 7 81 74 1.63
## college_25plus 50 5.16 12.0 13.00 2.97 8 36 28 2.03
## kurtosis se
## crime_rate 2.28 41.57
## v_crime 11.07 81.14
## police_exp 3.24 1.95
## HS_pct -1.14 1.41
## notHS_pct 0.24 0.85
## college18_24 2.89 2.09
## college_25plus 5.40 0.73
There are no missing data.
anyNA(df)## [1] FALSE
Some distribution plots. Lots of left skew.
varnames <- colnames(df)
for (i in 1:7){
hist(df[,i], freq=F, col='lightblue',
breaks=30, main=paste(varnames[i]))
lines(density(df[,i]), col='red', lwd=3)
}Correlations: Nothing surprising here
High school and college education are negatively correlated with crime rate and violent crime. Police expenditures are positively correlated with crime. Lack of high school is positively correlated with higher crime.
round(cor(df),3)## crime_rate v_crime police_exp HS_pct notHS_pct college18_24
## crime_rate 1.000 0.757 0.533 -0.135 0.323 -0.175
## v_crime 0.757 1.000 0.509 -0.184 0.291 -0.199
## police_exp 0.533 0.509 1.000 0.120 0.312 -0.277
## HS_pct -0.135 -0.184 0.120 1.000 -0.537 0.182
## notHS_pct 0.323 0.291 0.312 -0.537 1.000 -0.627
## college18_24 -0.175 -0.199 -0.277 0.182 -0.627 1.000
## college_25plus -0.026 -0.046 0.125 0.681 -0.514 0.592
## college_25plus
## crime_rate -0.026
## v_crime -0.046
## police_exp 0.125
## HS_pct 0.681
## notHS_pct -0.514
## college18_24 0.592
## college_25plus 1.000
corrplot::corrplot(cor(df))Scatterplots, crime_rate
Only modest linearity, it seems.
varnames <- colnames(df)
#par(mfrow=c(7,1))
for (i in 3:7){
plot(df[,i], df$crime_rate, col='blue', main=paste(varnames[i]))
abline(lm(df$crime_rate~df[,i]), col='red', lwd=3)
}Scatterplots, v_crime
Same here. A mashup.
varnames <- colnames(df)
#par(mfrow=c(7,1))
for (i in 3:7){
plot(df[,i], df$v_crime, col='blue', main=paste(varnames[i]))
abline(lm(df$v_crime~df[,i]), col='red', lwd=3)
}Try a basic multivariate linear model
It appears that the only significant relationship is police expenditures. But the model is not good: adjusted R2 is only .26 and the F-statistic is very small at 4.4 even though it is statistically significant at p < .05.
The model says that communities with a higher crime spend more on police; about $11 per resident for every crime per 1 million residents.
fit1 <- lm(crime_rate ~., data=df[,-2])
summary(fit1)##
## Call:
## lm(formula = crime_rate ~ ., data = df[, -2])
##
## Residuals:
## Min 1Q Median 3Q Max
## -346.60 -148.46 -62.43 111.75 788.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 489.649 472.366 1.037 0.305592
## police_exp 10.981 3.078 3.568 0.000884 ***
## HS_pct -6.088 6.544 -0.930 0.357219
## notHS_pct 5.480 10.053 0.545 0.588428
## college18_24 0.377 4.417 0.085 0.932367
## college_25plus 5.500 13.754 0.400 0.691150
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 253.2 on 44 degrees of freedom
## Multiple R-squared: 0.3336, Adjusted R-squared: 0.2579
## F-statistic: 4.405 on 5 and 44 DF, p-value: 0.002444
Add a quadratic term
None of our predictors look particularly curvy, but let’s try HS_pct, since it had the second best p-value in our original model.
Hardly made a difference. Data just isn’t set up that way.
fit_quad <- lm(crime_rate ~ police_exp
+ I(HS_pct^2) + notHS_pct
+ college18_24 + college_25plus,
data=df[,-2])
summary(fit_quad)##
## Call:
## lm(formula = crime_rate ~ police_exp + I(HS_pct^2) + notHS_pct +
## college18_24 + college_25plus, data = df[, -2])
##
## Residuals:
## Min 1Q Median 3Q Max
## -355.53 -144.56 -60.79 112.54 784.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 334.202054 316.063325 1.057 0.296105
## police_exp 10.934176 3.067791 3.564 0.000893 ***
## I(HS_pct^2) -0.059577 0.056752 -1.050 0.299552
## notHS_pct 5.107059 9.873728 0.517 0.607581
## college18_24 -0.009014 4.453775 -0.002 0.998394
## college_25plus 7.545981 14.348909 0.526 0.601604
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.6 on 44 degrees of freedom
## Multiple R-squared: 0.3371, Adjusted R-squared: 0.2618
## F-statistic: 4.475 on 5 and 44 DF, p-value: 0.002207
Dichotomous term
We don’t have a categorical variable in this data, so we’ll make one: Whether the ratio of 25 plus college grads minus the ratio of nonHS grads is greater than one. In other words, is crime less when education skews disproportionately higher.
df$ed_ratio <- with(df, ifelse(college_25plus-notHS_pct > 0, 1, 0))Fit it. Alas, a loser approach. Teensy improvement in R2.
fit_di <- lm(crime_rate ~ police_exp
+ I(HS_pct^2) + notHS_pct
+ college18_24 + college_25plus
+ ed_ratio,
data=df[,-2])
summary(fit_di)##
## Call:
## lm(formula = crime_rate ~ police_exp + I(HS_pct^2) + notHS_pct +
## college18_24 + college_25plus + ed_ratio, data = df[, -2])
##
## Residuals:
## Min 1Q Median 3Q Max
## -422.49 -157.50 -26.99 91.99 724.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 420.88870 315.77188 1.333 0.18959
## police_exp 10.25012 3.04906 3.362 0.00163 **
## I(HS_pct^2) -0.05417 0.05593 -0.969 0.33820
## notHS_pct -1.25866 10.52390 -0.120 0.90536
## college18_24 -0.09514 4.38158 -0.022 0.98278
## college_25plus 13.69294 14.64726 0.935 0.35509
## ed_ratio -178.91498 113.86187 -1.571 0.12344
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 248.4 on 43 degrees of freedom
## Multiple R-squared: 0.3731, Adjusted R-squared: 0.2856
## F-statistic: 4.265 on 6 and 43 DF, p-value: 0.001863
Try an interaction term instead
We’ll dump that stupid dichotomous variable and sub in an interactive.
Well, that really messed up stuff. Again things got worse.
fit_int <- lm(crime_rate ~ police_exp
+ I(HS_pct^2) + notHS_pct
+ college18_24 + college_25plus
+ police_exp:college_25plus,
data=df[,-2])
summary(fit_int)##
## Call:
## lm(formula = crime_rate ~ police_exp + I(HS_pct^2) + notHS_pct +
## college18_24 + college_25plus + police_exp:college_25plus,
## data = df[, -2])
##
## Residuals:
## Min 1Q Median 3Q Max
## -357.18 -149.24 -57.36 112.01 783.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 256.08887 491.62979 0.521 0.605
## police_exp 13.45002 12.42645 1.082 0.285
## I(HS_pct^2) -0.05976 0.05739 -1.041 0.304
## notHS_pct 4.73519 10.14001 0.467 0.643
## college18_24 -0.20726 4.60173 -0.045 0.964
## college_25plus 13.40994 31.57682 0.425 0.673
## police_exp:college_25plus -0.16140 0.77195 -0.209 0.835
##
## Residual standard error: 255.3 on 43 degrees of freedom
## Multiple R-squared: 0.3378, Adjusted R-squared: 0.2454
## F-statistic: 3.655 on 6 and 43 DF, p-value: 0.005085
Here is one combination, which bombs. In fact, none of the possible combinations improves the original model (which wasn’t great either).
fit_int2 <- lm(crime_rate ~ police_exp
+ I(HS_pct^2) + notHS_pct
+ college18_24 + college_25plus
+ college_25plus:notHS_pct,
data=df[,-2])
summary(fit_int2)##
## Call:
## lm(formula = crime_rate ~ police_exp + I(HS_pct^2) + notHS_pct +
## college18_24 + college_25plus + college_25plus:notHS_pct,
## data = df[, -2])
##
## Residuals:
## Min 1Q Median 3Q Max
## -364.02 -138.85 -63.62 109.76 775.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 413.60978 366.64287 1.128 0.26553
## police_exp 10.37292 3.34948 3.097 0.00344 **
## I(HS_pct^2) -0.06584 0.05903 -1.115 0.27087
## notHS_pct -2.05023 19.09660 -0.107 0.91500
## college18_24 0.31218 4.55424 0.069 0.94567
## college_25plus 2.77458 18.10175 0.153 0.87890
## notHS_pct:college_25plus 0.66235 1.50753 0.439 0.66260
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 254.9 on 43 degrees of freedom
## Multiple R-squared: 0.3401, Adjusted R-squared: 0.248
## F-statistic: 3.693 on 6 and 43 DF, p-value: 0.004776