Background

Introduction

The data we will be evaluating today is a data from 1960 used by criminologists to study the effect of punishment regimes on crime rates, and in this exercise, we’ll try to make a linear regression model to predict the crime rate and see how different kinds of predictor variables affects it.

We will use MSE and RMSE as a measure of our model’s performance For those who are not familiar, MSE and RMSE is our error rate. The value is dependent on our target range, meaning if we get MSE of a 1000 and our data is in the millions, then it’s a very low error rate. But if our MSE is 1000 and our data is in the hundreds, then our error rate is very high.

Data

NA check

## [1] FALSE

Data Preview

percent_m is_south mean_education police_exp60 police_exp59 labour_participation m_per1000f state_pop nonwhites_per1000 unemploy_m24 unemploy_m39 gdp inequality prob_prison time_prison crime_rate
151 1 91 58 56 510 950 33 301 108 41 394 261 0.084602 26.2011 791
143 0 113 103 95 583 1012 13 102 96 36 557 194 0.029599 25.2999 1635
142 1 89 45 44 533 969 18 219 94 33 318 250 0.083401 24.3006 578
136 0 121 149 141 577 994 157 80 102 39 673 167 0.015801 29.9012 1969
141 0 121 109 101 591 985 18 30 91 20 578 174 0.041399 21.2998 1234
121 0 110 118 115 547 964 25 44 84 29 689 126 0.034201 20.9995 682

Variables Description

  • percent_m: percentage of males aged 14-24
  • is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
  • mean_education: mean years of schooling
  • police_exp60: police expenditure in 1960
  • police_exp59: police expenditure in 1959
  • labour_participation: labour force participation rate
  • m_per1000f: number of males per 1000 females
  • state_pop: state population
  • nonwhites_per1000: number of non-whites resident per 1000 people
  • unemploy_m24: unemployment rate of urban males aged 14-24
  • unemploy_m39: unemployment rate of urban males aged 35-39
  • gdp: gross domestic product per head
  • inequality: income inequality
  • prob_prison: probability of imprisonment
  • time_prison: avg time served in prisons
  • crime_rate: crime rate in an unspecified category

EDA

GGCORR

## Warning in ggcorr(crime, label = T, hjust = 0.9, layout.exp = 2, label_size =
## 3, : data in column(s) 'is_south' are not numeric and were ignored

From initial ggcorr test, we’d like to perform some further correlation test on low performing variables (<= 0.2), namely :

  • percent_m
  • nonwhites_per1000
  • unemploy_m24
  • inequality

CorTest

## [1] "Correlation of percent_m :"
##        cor 
## -0.0894724
## [1] "Correlation of nonwhites_per1000 :"
##        cor 
## 0.03259884
## [1] "Correlation of unemploy_m24 :"
##        cor 
## 0.02352789
## [1] "Correlation of inequality :"
##        cor 
## -0.1790237

Modelling

We are using version 4, but here’s the trial data and how we get to the version that we’re happy with.

Version 1 - All Variables

Model with all variables included. Our Adjusted R-squared is 0.7046.

## 
## Call:
## lm(formula = crime_rate ~ ., data = crime_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -276.6 -105.7  -32.7  134.6  464.1 
## 
## Coefficients: (2 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          -5661.6419  2057.0277  -2.752  0.01194 * 
## percent_m                1.0782     5.7781   0.187  0.85376   
## is_south1              -52.2503   164.0244  -0.319  0.75321   
## mean_education          12.0181     7.6205   1.577  0.12972   
## police_exp60            18.1742    14.3110   1.270  0.21800   
## police_exp59           -11.2350    15.7665  -0.713  0.48394   
## labour_participation    -0.4187     1.6501  -0.254  0.80215   
## m_per1000f               2.6628     2.4307   1.095  0.28571   
## state_pop                0.4553     1.6496   0.276  0.78524   
## nonwhites_per1000        1.1159     0.8003   1.394  0.17781   
## unemploy_m24            -5.4864     4.7685  -1.151  0.26285   
## unemploy_m39             9.4924    10.1819   0.932  0.36178   
## gdp                      2.0402     1.2520   1.630  0.11811   
## inequality               9.5846     3.1156   3.076  0.00573 **
## prob_prison          -5961.3539  2673.0938  -2.230  0.03678 * 
## time_prison            -16.7451    10.0799  -1.661  0.11152   
## police_exp                   NA         NA      NA       NA   
## unemploy                     NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 220.5 on 21 degrees of freedom
## Multiple R-squared:  0.8325, Adjusted R-squared:  0.7129 
## F-statistic:  6.96 on 15 and 21 DF,  p-value: 3.873e-05

Version 2 - Stepwise Model

We will try to work with variables recommended by the step function with a bit of tweaks. This model results in Adjusted R squared of 0.725. A 2% improvement from the first model.

## 
## Call:
## lm(formula = crime_rate ~ mean_education + police_exp60 + m_per1000f + 
##     nonwhites_per1000 + gdp + inequality + prob_prison + time_prison, 
##     data = crime_train)
## 
## Coefficients:
##       (Intercept)     mean_education       police_exp60         m_per1000f  
##         -5715.989             11.389              9.103              1.785  
## nonwhites_per1000                gdp         inequality        prob_prison  
##             1.010              2.390             11.027          -5397.630  
##       time_prison  
##           -13.770
## 
## Call:
## lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f + 
##     inequality + prob_prison, data = crime_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -514.48 -122.42   23.25  147.64  466.70 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5623.6011  1305.3749  -4.308 0.000154 ***
## mean_education    11.7668     5.8591   2.008 0.053392 .  
## police_exp         6.5360     0.8188   7.982 5.18e-09 ***
## m_per1000f         2.7494     1.3149   2.091 0.044826 *  
## inequality         8.4805     1.6584   5.114 1.55e-05 ***
## prob_prison    -3162.6889  1813.1384  -1.744 0.091013 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 215.8 on 31 degrees of freedom
## Multiple R-squared:  0.7632, Adjusted R-squared:  0.725 
## F-statistic: 19.99 on 5 and 31 DF,  p-value: 7.09e-09

Version 3 - Customized

We’ve tweaked the predictor variables in this model so that it yields the best result. This model has Adjusted R squared of 0.7331. A 3% improvement from the first model.

## 
## Call:
## lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f + 
##     gdp + percent_m + inequality + prob_prison + is_south, data = crime_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -391.70 -120.27  -38.94  124.94  430.71 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6981.3167  1474.8403  -4.734 5.74e-05 ***
## mean_education    12.7353     6.0076   2.120  0.04301 *  
## police_exp         5.4347     0.9975   5.448 8.16e-06 ***
## m_per1000f         2.4680     1.3589   1.816  0.08007 .  
## gdp                1.7694     1.0616   1.667  0.10673    
## percent_m          3.3433     4.4455   0.752  0.45830    
## inequality         9.9214     2.4635   4.027  0.00039 ***
## prob_prison    -2986.3250  2059.0858  -1.450  0.15808    
## is_south1         83.4337   126.9245   0.657  0.51632    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 212.6 on 28 degrees of freedom
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7331 
## F-statistic: 13.36 on 8 and 28 DF,  p-value: 9.82e-08

Version 4 - Lowering Std Error

From the version 3, we noticed there are 2 variables with a very high coefficient and standard error values, prob_prison and is_south. In this iteration, we removed them and received Adjusted R-squared value at 0.73.

## 
## Call:
## lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f + 
##     gdp + percent_m + inequality, data = crime_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -387.77 -148.59  -10.31  131.84  451.11 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -7362.8235  1446.2070  -5.091 1.80e-05 ***
## mean_education    12.5536     5.7968   2.166   0.0384 *  
## police_exp         5.6253     0.9774   5.756 2.77e-06 ***
## m_per1000f         2.2382     1.3389   1.672   0.1050    
## gdp                2.2170     1.0175   2.179   0.0373 *  
## percent_m          3.5920     4.3579   0.824   0.4163    
## inequality        11.0016     2.2876   4.809 3.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 213 on 30 degrees of freedom
## Multiple R-squared:  0.7768, Adjusted R-squared:  0.7322 
## F-statistic: 17.41 on 6 and 30 DF,  p-value: 1.442e-08

Assumptions

Normality

## 
##  Shapiro-Wilk normality test
## 
## data:  crime_model_v4$residuals
## W = 0.97983, p-value = 0.7256

The histogram indicates there’s a bit of a skew in our distribution, but it passes shapiro test with no problem.

Homoscedacity

As we can see, with no distinct pattern, we can conclude that our model passes the homoscedacity assumption.

Multicolinearity

## mean_education     police_exp     m_per1000f            gdp      percent_m 
##       3.399311       2.866657       1.371557       8.188752       2.361084 
##     inequality 
##       6.768903

Our vif test indicates that there’s no strong linear relationship in between our predictor variables.

Train and Test RMSE

Here’s the comparison between the train MSE RMSE data and the test MSE RMSE data.

Conclusion

Model Performance

Variables used to build the model are mean education, combined police expenditure from 1959 and 1960, number of males per 1000 female, gdp, percentage of males aged 14-24, and income inequality. With our data test range from 508 to 1272, our RMSE at 269.4 is still considered quite large. Meaning our model is still inaccurate in predicting our target variable.

There’s a great chance that such inaccuracy happens because of the small observation number of only 47 in our dataset, causing difficulty for our model to learn the correct impact of each predictor variables. To improve our model, we recommend getting more data.

Model Insight

We found that with our crime rate target is affected significantly by mean education, police expenditure, GDP, and inequality.

I think police expenditure logically will rise as crime rate rises, so it may not be a very good indicator on what causing the crime rate to go up.

Our three other predictor are mean education, GDP, and inequality. At first glance it can be weird to see, because all three are positively correlated with the crime rate. That means -if we were to follow the model blindly- an increase in mean education, GDP, and inequality should contributes to an increase in our crime rate. That just doesn’t make any sense in my perspective.

If we take a closer look, the inequality has a very high coefficient value of 11 compared to the other two variables. This means for every rise in a unit of inequality, our crime rate is going up by 11.

In my opinion, mean education and GDP is positively correlated because it’s measuring the the average of the population. So we shouldn’t conclude that a rise in mean education and GDP causes the crime rate to go up. It just goes to show how big the inequality in the society is.

In conclusion, the way to reduce the crime rate is to greatly reduce the amount of inequality.