Problem statement: Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?


### Brief intro

This dataset covers crime in small cities. Predictors are education characteristics and police funding. Overall crime rate and violent crime rate are targets. The cities are not identified (shucks).

The data dictionary:

Dataset: “mlr06.csv”

crime_rate = total overall reported crime rate per 1 million residents
v_crime = reported violent crime rate per 100,000 residents
police_exp = annual police funding in $/resident
HS_pct = % of people 25 years+ with 4 yrs. of high school
notHS_pct = % of 16 to 19 year-olds not in highschool and not highschool graduates
college18-24 = % of 18 to 24 year-olds in college
college25_plus = % of people 25 years+ with at least 4 years of college

Reference: Life In America’s Small Cities, By G.S. Thomas

df <- data.frame(read.csv("mlr06.csv"))
describe(df)[,-c(1,3)]
##                 n     sd median trimmed    mad min  max range skew
## crime_rate     50 293.94  654.5  674.92 243.15 341 1740  1399 1.44
## v_crime        50 573.74  454.0  532.77 358.79  29 3545  3516 2.73
## police_exp     50  13.82   34.5   35.60   6.67  16   86    70 1.72
## HS_pct         50   9.97   59.0   58.75  13.34  42   81    39 0.05
## notHS_pct      50   6.02   14.0   15.00   5.93   4   34    30 0.65
## college18_24   50  14.80   25.0   27.62   8.90   7   81    74 1.63
## college_25plus 50   5.16   12.0   13.00   2.97   8   36    28 2.03
##                kurtosis    se
## crime_rate         2.28 41.57
## v_crime           11.07 81.14
## police_exp         3.24  1.95
## HS_pct            -1.14  1.41
## notHS_pct          0.24  0.85
## college18_24       2.89  2.09
## college_25plus     5.40  0.73

There are no missing data.

anyNA(df)
## [1] FALSE

Some distribution plots. Lots of left skew.

varnames <- colnames(df)
for (i in 1:7){
 hist(df[,i], freq=F, col='lightblue',
      breaks=30, main=paste(varnames[i]))
 lines(density(df[,i]), col='red', lwd=3)
}

Correlations: Nothing surprising here

High school and college education are negatively correlated with crime rate and violent crime. Police expenditures are positively correlated with crime. Lack of high school is positively correlated with higher crime.

round(cor(df),3)
##                crime_rate v_crime police_exp HS_pct notHS_pct college18_24
## crime_rate          1.000   0.757      0.533 -0.135     0.323       -0.175
## v_crime             0.757   1.000      0.509 -0.184     0.291       -0.199
## police_exp          0.533   0.509      1.000  0.120     0.312       -0.277
## HS_pct             -0.135  -0.184      0.120  1.000    -0.537        0.182
## notHS_pct           0.323   0.291      0.312 -0.537     1.000       -0.627
## college18_24       -0.175  -0.199     -0.277  0.182    -0.627        1.000
## college_25plus     -0.026  -0.046      0.125  0.681    -0.514        0.592
##                college_25plus
## crime_rate             -0.026
## v_crime                -0.046
## police_exp              0.125
## HS_pct                  0.681
## notHS_pct              -0.514
## college18_24            0.592
## college_25plus          1.000
corrplot::corrplot(cor(df))

Scatterplots, crime_rate

Only modest linearity, it seems.

varnames <- colnames(df)
#par(mfrow=c(7,1))
for (i in 3:7){
 plot(df[,i], df$crime_rate, col='blue', main=paste(varnames[i]))
 abline(lm(df$crime_rate~df[,i]), col='red', lwd=3)
}

Scatterplots, v_crime

Same here. A mashup.

varnames <- colnames(df)
#par(mfrow=c(7,1))
for (i in 3:7){
 plot(df[,i], df$v_crime, col='blue', main=paste(varnames[i]))
 abline(lm(df$v_crime~df[,i]), col='red', lwd=3)
}

Try a basic multivariate linear model

It appears that the only significant relationship is police expenditures. But the model is not good: adjusted R2 is only .26 and the F-statistic is very small at 4.4 even though it is statistically significant at p < .05.

The model says that communities with a higher crime spend more on police; about $11 per resident for every crime per 1 million residents.

fit1 <- lm(crime_rate ~., data=df[,-2])
summary(fit1)
## 
## Call:
## lm(formula = crime_rate ~ ., data = df[, -2])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -346.60 -148.46  -62.43  111.75  788.21 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     489.649    472.366   1.037 0.305592    
## police_exp       10.981      3.078   3.568 0.000884 ***
## HS_pct           -6.088      6.544  -0.930 0.357219    
## notHS_pct         5.480     10.053   0.545 0.588428    
## college18_24      0.377      4.417   0.085 0.932367    
## college_25plus    5.500     13.754   0.400 0.691150    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 253.2 on 44 degrees of freedom
## Multiple R-squared:  0.3336, Adjusted R-squared:  0.2579 
## F-statistic: 4.405 on 5 and 44 DF,  p-value: 0.002444

Add a quadratic term

None of our predictors look particularly curvy, but let’s try HS_pct, since it had the second best p-value in our original model.

Hardly made a difference. Data just isn’t set up that way.

fit_quad <- lm(crime_rate ~ police_exp 
               + I(HS_pct^2) + notHS_pct
               + college18_24 + college_25plus, 
               data=df[,-2])
summary(fit_quad)
## 
## Call:
## lm(formula = crime_rate ~ police_exp + I(HS_pct^2) + notHS_pct + 
##     college18_24 + college_25plus, data = df[, -2])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -355.53 -144.56  -60.79  112.54  784.41 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    334.202054 316.063325   1.057 0.296105    
## police_exp      10.934176   3.067791   3.564 0.000893 ***
## I(HS_pct^2)     -0.059577   0.056752  -1.050 0.299552    
## notHS_pct        5.107059   9.873728   0.517 0.607581    
## college18_24    -0.009014   4.453775  -0.002 0.998394    
## college_25plus   7.545981  14.348909   0.526 0.601604    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.6 on 44 degrees of freedom
## Multiple R-squared:  0.3371, Adjusted R-squared:  0.2618 
## F-statistic: 4.475 on 5 and 44 DF,  p-value: 0.002207

Dichotomous term

We don’t have a categorical variable in this data, so we’ll make one: Whether the ratio of 25 plus college grads minus the ratio of nonHS grads is greater than one. In other words, is crime less when education skews disproportionately higher.

df$ed_ratio <- with(df, ifelse(college_25plus-notHS_pct > 0, 1, 0))

Fit it. Alas, a loser approach. Teensy improvement in R2.

fit_di <- lm(crime_rate ~ police_exp 
               + I(HS_pct^2) + notHS_pct
               + college18_24 + college_25plus
               + ed_ratio,
               data=df[,-2])
summary(fit_di)
## 
## Call:
## lm(formula = crime_rate ~ police_exp + I(HS_pct^2) + notHS_pct + 
##     college18_24 + college_25plus + ed_ratio, data = df[, -2])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -422.49 -157.50  -26.99   91.99  724.07 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     420.88870  315.77188   1.333  0.18959   
## police_exp       10.25012    3.04906   3.362  0.00163 **
## I(HS_pct^2)      -0.05417    0.05593  -0.969  0.33820   
## notHS_pct        -1.25866   10.52390  -0.120  0.90536   
## college18_24     -0.09514    4.38158  -0.022  0.98278   
## college_25plus   13.69294   14.64726   0.935  0.35509   
## ed_ratio       -178.91498  113.86187  -1.571  0.12344   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 248.4 on 43 degrees of freedom
## Multiple R-squared:  0.3731, Adjusted R-squared:  0.2856 
## F-statistic: 4.265 on 6 and 43 DF,  p-value: 0.001863

Try an interaction term instead

We’ll dump that stupid dichotomous variable and sub in an interactive.

Well, that really messed up stuff. Again things got worse.

fit_int <- lm(crime_rate ~ police_exp 
               + I(HS_pct^2) + notHS_pct
               + college18_24 + college_25plus
               + police_exp:college_25plus, 
               data=df[,-2])
summary(fit_int)
## 
## Call:
## lm(formula = crime_rate ~ police_exp + I(HS_pct^2) + notHS_pct + 
##     college18_24 + college_25plus + police_exp:college_25plus, 
##     data = df[, -2])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -357.18 -149.24  -57.36  112.01  783.94 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)               256.08887  491.62979   0.521    0.605
## police_exp                 13.45002   12.42645   1.082    0.285
## I(HS_pct^2)                -0.05976    0.05739  -1.041    0.304
## notHS_pct                   4.73519   10.14001   0.467    0.643
## college18_24               -0.20726    4.60173  -0.045    0.964
## college_25plus             13.40994   31.57682   0.425    0.673
## police_exp:college_25plus  -0.16140    0.77195  -0.209    0.835
## 
## Residual standard error: 255.3 on 43 degrees of freedom
## Multiple R-squared:  0.3378, Adjusted R-squared:  0.2454 
## F-statistic: 3.655 on 6 and 43 DF,  p-value: 0.005085

Here is one combination, which bombs. In fact, none of the possible combinations improves the original model (which wasn’t great either).

fit_int2 <- lm(crime_rate ~ police_exp 
               + I(HS_pct^2) + notHS_pct
               + college18_24 + college_25plus
               + college_25plus:notHS_pct, 
               data=df[,-2])
summary(fit_int2)
## 
## Call:
## lm(formula = crime_rate ~ police_exp + I(HS_pct^2) + notHS_pct + 
##     college18_24 + college_25plus + college_25plus:notHS_pct, 
##     data = df[, -2])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -364.02 -138.85  -63.62  109.76  775.10 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)   
## (Intercept)              413.60978  366.64287   1.128  0.26553   
## police_exp                10.37292    3.34948   3.097  0.00344 **
## I(HS_pct^2)               -0.06584    0.05903  -1.115  0.27087   
## notHS_pct                 -2.05023   19.09660  -0.107  0.91500   
## college18_24               0.31218    4.55424   0.069  0.94567   
## college_25plus             2.77458   18.10175   0.153  0.87890   
## notHS_pct:college_25plus   0.66235    1.50753   0.439  0.66260   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 254.9 on 43 degrees of freedom
## Multiple R-squared:  0.3401, Adjusted R-squared:  0.248 
## F-statistic: 3.693 on 6 and 43 DF,  p-value: 0.004776

Conclusion: Better data needed