The data we will be evaluating today comes from 1960 and was used by criminologists to study the effect of punishment regimes on crime rates. In this exercise, we'll build a linear regression model to predict the crime rate and see how different kinds of predictor variables affect it.

We will use MSE and RMSE as measures of our model's performance. For those who are not familiar, MSE (mean squared error) and RMSE (root mean squared error) quantify our error rate. Their values depend on the range of the target: if we get an MSE of 1000 and our data is in the millions, that is a very low error rate, but if our MSE is 1000 and our data is in the hundreds, then our error rate is very high.
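As a quick illustration, here is a minimal sketch of the two metrics (these helper functions are not from the original analysis, just an illustration of the formulas):

```r
# MSE: mean of squared prediction errors; RMSE: its square root
mse  <- function(actual, predicted) mean((actual - predicted)^2)
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))
```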
library(dplyr)

crime <- read.csv("data_input/crime.csv") %>% 
  dplyr::select(-X)

names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")

NA check
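The FALSE below was presumably produced by a missing-value check along these lines (the exact call isn't shown in the output):

```r
# Returns TRUE if any value in the data frame is NA
anyNA(crime)
```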
## [1] FALSE
Data Preview
| percent_m | is_south | mean_education | police_exp60 | police_exp59 | labour_participation | m_per1000f | state_pop | nonwhites_per1000 | unemploy_m24 | unemploy_m39 | gdp | inequality | prob_prison | time_prison | crime_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 151 | 1 | 91 | 58 | 56 | 510 | 950 | 33 | 301 | 108 | 41 | 394 | 261 | 0.084602 | 26.2011 | 791 |
| 143 | 0 | 113 | 103 | 95 | 583 | 1012 | 13 | 102 | 96 | 36 | 557 | 194 | 0.029599 | 25.2999 | 1635 |
| 142 | 1 | 89 | 45 | 44 | 533 | 969 | 18 | 219 | 94 | 33 | 318 | 250 | 0.083401 | 24.3006 | 578 |
| 136 | 0 | 121 | 149 | 141 | 577 | 994 | 157 | 80 | 102 | 39 | 673 | 167 | 0.015801 | 29.9012 | 1969 |
| 141 | 0 | 121 | 109 | 101 | 591 | 985 | 18 | 30 | 91 | 20 | 578 | 174 | 0.041399 | 21.2998 | 1234 |
| 121 | 0 | 110 | 118 | 115 | 547 | 964 | 25 | 44 | 84 | 29 | 689 | 126 | 0.034201 | 20.9995 | 682 |
- percent_m: percentage of males aged 14-24
- is_south: whether the state is a Southern state; 1 for Yes, 0 for No
- mean_education: mean years of schooling
- police_exp60: police expenditure in 1960
- police_exp59: police expenditure in 1959
- labour_participation: labour force participation rate
- m_per1000f: number of males per 1000 females
- state_pop: state population
- nonwhites_per1000: number of non-white residents per 1000 people
- unemploy_m24: unemployment rate of urban males aged 14-24
- unemploy_m39: unemployment rate of urban males aged 35-39
- gdp: gross domestic product per head
- inequality: income inequality
- prob_prison: probability of imprisonment
- time_prison: average time served in prison
- crime_rate: crime rate in an unspecified category

Let's combine the two police_exp and the two unemploy columns to avoid multicollinearity, as sketched below.
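The exact combination isn't shown in the output, so summing each pair here is an assumption; the combined columns police_exp and unemploy are the ones that appear in the later models, and is_south appears to have been converted to a factor (it shows up as is_south1 in the model summaries):

```r
# Assumption: the paired columns are combined by summing them
crime <- crime %>% 
  mutate(police_exp = police_exp60 + police_exp59,
         unemploy   = unemploy_m24 + unemploy_m39,
         is_south   = as.factor(is_south))
```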
## Warning in ggcorr(crime, label = T, hjust = 0.9, layout.exp = 2, label_size =
## 3, : data in column(s) 'is_south' are not numeric and were ignored
From the initial ggcorr plot, we'd like to run some further correlation tests on the variables that correlate weakly with the target (absolute correlation <= 0.2), namely:

- percent_m
- nonwhites_per1000
- unemploy_m24
- inequality

## [1] "Correlation of percent_m :"
## cor
## -0.0894724
## [1] "Correlation of nonwhites_per1000 :"
## cor
## 0.03259884
## [1] "Correlation of unemploy_m24 :"
## cor
## 0.02352789
## [1] "Correlation of inequality :"
## cor
## -0.1790237
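The printed correlations above could be produced with something along these lines (a sketch; the exact code isn't shown in the output):

```r
# Correlation of each weakly correlated candidate with the target variable
low_cor_vars <- c("percent_m", "nonwhites_per1000", "unemploy_m24", "inequality")

for (v in low_cor_vars) {
  print(paste("Correlation of", v, ":"))
  print(cor.test(crime[[v]], crime$crime_rate)$estimate)
}
```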
We ended up using version 4 of the model, but here are the trial models and how we arrived at the version we're happy with.

The first model includes all variables. Its Adjusted R-squared is 0.7129.
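The train/test split and the full model were presumably built along these lines; the seed, the 80/20 proportion, and the object names are assumptions (roughly 37 of the 47 rows end up in the training set, which matches the degrees of freedom in the summary below):

```r
# Assumption: an 80/20 random split into training and test sets
set.seed(123)
idx <- sample(nrow(crime), size = floor(0.8 * nrow(crime)))
crime_train <- crime[idx, ]
crime_test  <- crime[-idx, ]

# Full model: every remaining column as a predictor
crime_model_v1 <- lm(crime_rate ~ ., data = crime_train)
summary(crime_model_v1)
```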
##
## Call:
## lm(formula = crime_rate ~ ., data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -276.6 -105.7 -32.7 134.6 464.1
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5661.6419 2057.0277 -2.752 0.01194 *
## percent_m 1.0782 5.7781 0.187 0.85376
## is_south1 -52.2503 164.0244 -0.319 0.75321
## mean_education 12.0181 7.6205 1.577 0.12972
## police_exp60 18.1742 14.3110 1.270 0.21800
## police_exp59 -11.2350 15.7665 -0.713 0.48394
## labour_participation -0.4187 1.6501 -0.254 0.80215
## m_per1000f 2.6628 2.4307 1.095 0.28571
## state_pop 0.4553 1.6496 0.276 0.78524
## nonwhites_per1000 1.1159 0.8003 1.394 0.17781
## unemploy_m24 -5.4864 4.7685 -1.151 0.26285
## unemploy_m39 9.4924 10.1819 0.932 0.36178
## gdp 2.0402 1.2520 1.630 0.11811
## inequality 9.5846 3.1156 3.076 0.00573 **
## prob_prison -5961.3539 2673.0938 -2.230 0.03678 *
## time_prison -16.7451 10.0799 -1.661 0.11152
## police_exp NA NA NA NA
## unemploy NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 220.5 on 21 degrees of freedom
## Multiple R-squared: 0.8325, Adjusted R-squared: 0.7129
## F-statistic: 6.96 on 15 and 21 DF, p-value: 3.873e-05
We will now work with the variables recommended by the step function, with a few tweaks. The resulting model has an Adjusted R-squared of 0.725, about one percentage point higher than the first model.
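The formula shown below was presumably obtained from stepwise selection on the full model; the direction and object names are assumptions:

```r
# Sketch: backward stepwise selection starting from the full model
crime_model_step <- step(crime_model_v1, direction = "backward", trace = 0)
crime_model_step
```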
##
## Call:
## lm(formula = crime_rate ~ mean_education + police_exp60 + m_per1000f +
## nonwhites_per1000 + gdp + inequality + prob_prison + time_prison,
## data = crime_train)
##
## Coefficients:
## (Intercept) mean_education police_exp60 m_per1000f
## -5715.989 11.389 9.103 1.785
## nonwhites_per1000 gdp inequality prob_prison
## 1.010 2.390 11.027 -5397.630
## time_prison
## -13.770
crime_model_v2 <- lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f +
inequality + prob_prison, data = crime_train)
summary(crime_model_v2)
##
## Call:
## lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f +
## inequality + prob_prison, data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -514.48 -122.42 23.25 147.64 466.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5623.6011 1305.3749 -4.308 0.000154 ***
## mean_education 11.7668 5.8591 2.008 0.053392 .
## police_exp 6.5360 0.8188 7.982 5.18e-09 ***
## m_per1000f 2.7494 1.3149 2.091 0.044826 *
## inequality 8.4805 1.6584 5.114 1.55e-05 ***
## prob_prison -3162.6889 1813.1384 -1.744 0.091013 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 215.8 on 31 degrees of freedom
## Multiple R-squared: 0.7632, Adjusted R-squared: 0.725
## F-statistic: 19.99 on 5 and 31 DF, p-value: 7.09e-09
We've tweaked the predictor variables in this model so that it yields the best result so far. This model has an Adjusted R-squared of 0.7331, about two percentage points higher than the first model.
crime_model_v3 <- lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f + gdp + percent_m +
inequality + prob_prison + is_south, data = crime_train)
summary(crime_model_v3)
##
## Call:
## lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f +
## gdp + percent_m + inequality + prob_prison + is_south, data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -391.70 -120.27 -38.94 124.94 430.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6981.3167 1474.8403 -4.734 5.74e-05 ***
## mean_education 12.7353 6.0076 2.120 0.04301 *
## police_exp 5.4347 0.9975 5.448 8.16e-06 ***
## m_per1000f 2.4680 1.3589 1.816 0.08007 .
## gdp 1.7694 1.0616 1.667 0.10673
## percent_m 3.3433 4.4455 0.752 0.45830
## inequality 9.9214 2.4635 4.027 0.00039 ***
## prob_prison -2986.3250 2059.0858 -1.450 0.15808
## is_south1 83.4337 126.9245 0.657 0.51632
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 212.6 on 28 degrees of freedom
## Multiple R-squared: 0.7924, Adjusted R-squared: 0.7331
## F-statistic: 13.36 on 8 and 28 DF, p-value: 9.82e-08
From version 3, we noticed two variables, prob_prison and is_south, whose standard errors are large relative to their coefficients and whose p-values are high. In this iteration we removed them and obtained an Adjusted R-squared of 0.7322.
crime_model_v4 <- lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f + gdp + percent_m +
inequality, data = crime_train)
summary(crime_model_v4)
##
## Call:
## lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f +
## gdp + percent_m + inequality, data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -387.77 -148.59 -10.31 131.84 451.11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7362.8235 1446.2070 -5.091 1.80e-05 ***
## mean_education 12.5536 5.7968 2.166 0.0384 *
## police_exp 5.6253 0.9774 5.756 2.77e-06 ***
## m_per1000f 2.2382 1.3389 1.672 0.1050
## gdp 2.2170 1.0175 2.179 0.0373 *
## percent_m 3.5920 4.3579 0.824 0.4163
## inequality 11.0016 2.2876 4.809 3.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 213 on 30 degrees of freedom
## Multiple R-squared: 0.7768, Adjusted R-squared: 0.7322
## F-statistic: 17.41 on 6 and 30 DF, p-value: 1.442e-08
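The residuals of the final model are then checked for normality with a histogram and the Shapiro-Wilk test; the calls below are a sketch of what produces the output that follows:

```r
# Visual check of the residual distribution
hist(crime_model_v4$residuals, breaks = 10)

# Shapiro-Wilk test; H0: the residuals are normally distributed
shapiro.test(crime_model_v4$residuals)
```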
##
## Shapiro-Wilk normality test
##
## data: crime_model_v4$residuals
## W = 0.97983, p-value = 0.7256
The histogram indicates a slight skew in the residual distribution, but the residuals pass the Shapiro-Wilk test with no problem (p-value = 0.7256, well above 0.05).

With no distinct pattern in the residuals-versus-fitted plot, we can conclude that our model passes the homoscedasticity assumption.
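The residuals-versus-fitted plot and the VIF values below can be produced along these lines (assuming car::vif, which matches the format of the output):

```r
# Residuals vs fitted values: the absence of a clear pattern suggests constant variance
plot(crime_model_v4$fitted.values, crime_model_v4$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Variance inflation factors for each predictor
car::vif(crime_model_v4)
```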
## mean_education police_exp m_per1000f gdp percent_m
## 3.399311 2.866657 1.371557 8.188752 2.361084
## inequality
## 6.768903
Our VIF test indicates no severe multicollinearity among our predictor variables: every value is below the common threshold of 10, although gdp (8.19) and inequality (6.77) are somewhat elevated.
Here's a comparison of the MSE and RMSE on the training data versus the test data.
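These metrics can be computed along these lines, reusing the helper functions sketched earlier (a sketch, not the original code):

```r
# Predictions on the training and test sets
pred_train <- predict(crime_model_v4, newdata = crime_train)
pred_test  <- predict(crime_model_v4, newdata = crime_test)

# MSE and RMSE for both sets
data.frame(
  set  = c("train", "test"),
  MSE  = c(mse(crime_train$crime_rate, pred_train),
           mse(crime_test$crime_rate, pred_test)),
  RMSE = c(rmse(crime_train$crime_rate, pred_train),
           rmse(crime_test$crime_rate, pred_test))
)
```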
The variables used to build the model are mean education, combined police expenditure from 1959 and 1960, number of males per 1000 females, GDP, percentage of males aged 14-24, and income inequality. With our test data ranging from 508 to 1272, our RMSE of 269.4 is still quite large, meaning our model is still fairly inaccurate at predicting the target variable.

There's a good chance this inaccuracy stems from the small number of observations, only 47 in the full dataset, which makes it difficult for the model to learn the correct impact of each predictor variable. To improve the model, we recommend gathering more data.
We found that our crime rate target is significantly affected by mean education, police expenditure, GDP, and inequality.

I think police expenditure will logically rise as the crime rate rises, so it may not be a very good indicator of what causes the crime rate to go up.

Our three other predictors are mean education, GDP, and inequality. At first glance this can look odd, because all three are positively correlated with the crime rate. That means (if we were to follow the model blindly) an increase in mean education, GDP, or inequality should contribute to an increase in the crime rate. That just doesn't make sense from my perspective.

If we take a closer look, inequality has a very high coefficient of about 11 compared to the other two variables. This means that for every one-unit rise in inequality, the predicted crime rate goes up by about 11, holding the other predictors constant.

In my opinion, mean education and GDP are positively correlated with crime because they measure population averages, so we shouldn't conclude that a rise in mean education or GDP causes the crime rate to go up. It more likely reflects how large the inequality in the society is.
In conclusion, the way to reduce the crime rate is to greatly reduce the amount of inequality.