Crime Statistics


Using linear regression, let’s predict US crime rates based on various socioeconomic factors. The data we’ll be using was collected from 47 US states in the 1960s.

Pre-Processing

Load Dataset

Let’s start by reading the data set with read.csv. We’ll also rename the columns to make the variables easier to understand.

library(dplyr)

# Read the data set and drop X, the redundant row-index column
data <- read.csv("crime.csv") %>% select(-c(X))

# Rename the columns to more descriptive names
names(data) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")

Let’s take a glimpse at the data set in order to see all the variables.

glimpse(data)
## Rows: 47
## Columns: 16
## $ percent_m            <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
## $ is_south             <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ mean_education       <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
## $ police_exp60         <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
## $ police_exp59         <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
## $ m_per1000f           <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
## $ state_pop            <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
## $ nonwhites_per1000    <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
## $ unemploy_m24         <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
## $ unemploy_m39         <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
## $ gdp                  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
## $ inequality           <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
## $ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
## $ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
## $ crime_rate           <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …

Here is what each variable means:

  • percent_m: percentage of males aged 14-24
  • is_south: whether the state is a Southern state (1 for yes, 0 for no)
  • mean_education: mean years of schooling
  • police_exp60: police expenditure in 1960
  • police_exp59: police expenditure in 1959
  • labour_participation: labour force participation rate
  • m_per1000f: number of males per 1000 females
  • state_pop: state population
  • nonwhites_per1000: number of non-white residents per 1000 people
  • unemploy_m24: unemployment rate of urban males aged 14-24
  • unemploy_m39: unemployment rate of urban males aged 35-39
  • gdp: gross domestic product per head
  • inequality: income inequality
  • prob_prison: probability of imprisonment
  • time_prison: average time served in prisons
  • crime_rate: crime rate in an unspecified category

Clean Up

Let’s check for missing values by counting the number of NA values in each column.

data %>% is.na() %>% colSums()
##            percent_m             is_south       mean_education 
##                    0                    0                    0 
##         police_exp60         police_exp59 labour_participation 
##                    0                    0                    0 
##           m_per1000f            state_pop    nonwhites_per1000 
##                    0                    0                    0 
##         unemploy_m24         unemploy_m39                  gdp 
##                    0                    0                    0 
##           inequality          prob_prison          time_prison 
##                    0                    0                    0 
##           crime_rate 
##                    0

Our data appears to be clean already. The next step is to fix an incorrect data type: is_south is a categorical variable, so it should be a factor rather than an integer. For this, let’s use mutate.

data <- data %>%
  mutate(is_south=as.factor(is_south))
rmarkdown::paged_table(data)

Now our data should be ready.

Modeling

Correlation

Before we start building the model, let’s check the correlations between variables. Here we can use GGally’s ggcorr to draw a correlation plot.

library(GGally)
ggcorr(data, hjust = 1, layout.exp = 3, label = TRUE)

We’re going to predict crime_rate, and according to this plot, police_exp59 and police_exp60 are the two predictors most strongly correlated with it. Aside from those, prob_prison, gdp, state_pop, and mean_education also seem to contribute somewhat.
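
As a numeric cross-check of the plot, here is a minimal sketch that sorts each variable’s correlation with crime_rate by absolute strength. It assumes a recent dplyr (1.0+) for where(), and keeps only the numeric columns since is_south is now a factor:

# Correlation of every numeric column with crime_rate,
# sorted from strongest to weakest in absolute value
# (crime_rate itself appears first with a value of 1)
cors <- cor(select(data, where(is.numeric)))
sort(abs(cors[, "crime_rate"]), decreasing = TRUE)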

Let’s try making a model using only police_exp59 and police_exp60.

model_police <- lm(crime_rate ~ police_exp60 + police_exp59, data= data)
summary(model_police)
## 
## Call:
## lm(formula = crime_rate ~ police_exp60 + police_exp59, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -636.09 -168.62   35.44  141.80  532.10 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    158.26     125.93   1.257   0.2155  
## police_exp60    25.62      12.34   2.076   0.0438 *
## police_exp59   -17.83      13.12  -1.359   0.1810  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 281.3 on 44 degrees of freedom
## Multiple R-squared:  0.494,  Adjusted R-squared:  0.471 
## F-statistic: 21.48 on 2 and 44 DF,  p-value: 3.094e-07

Since this is multiple linear regression, we should look at the model’s Adjusted R-squared to judge its quality. This model’s Adjusted R-squared is only 0.471, which is quite poor, since we’d like the value to be as close to 1 as possible.
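
As a side note, here is a quick sketch of how that number can be pulled out programmatically; this idiom will come in handy when comparing several models later:

# Adjusted R-squared of the police-expenditure-only model
summary(model_police)$adj.r.squared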

All Predictors

Let’s instead try to make a model with all of the predictors.

model_all <- lm(crime_rate ~ ., data= data)
summary(model_all)
## 
## Call:
## lm(formula = crime_rate ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -395.74  -98.09   -6.69  112.99  512.67 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5984.2876  1628.3184  -3.675 0.000893 ***
## percent_m                8.7830     4.1714   2.106 0.043443 *  
## is_south1               -3.8035   148.7551  -0.026 0.979765    
## mean_education          18.8324     6.2088   3.033 0.004861 ** 
## police_exp60            19.2804    10.6110   1.817 0.078892 .  
## police_exp59           -10.9422    11.7478  -0.931 0.358830    
## labour_participation    -0.6638     1.4697  -0.452 0.654654    
## m_per1000f               1.7407     2.0354   0.855 0.398995    
## state_pop               -0.7330     1.2896  -0.568 0.573845    
## nonwhites_per1000        0.4204     0.6481   0.649 0.521279    
## unemploy_m24            -5.8271     4.2103  -1.384 0.176238    
## unemploy_m39            16.7800     8.2336   2.038 0.050161 .  
## gdp                      0.9617     1.0367   0.928 0.360754    
## inequality               7.0672     2.2717   3.111 0.003983 ** 
## prob_prison          -4855.2658  2272.3746  -2.137 0.040627 *  
## time_prison             -3.4790     7.1653  -0.486 0.630708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078 
## F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07

This one has an Adjusted R-squared of 0.7078, which is much better than the previous model but still not great.

Step

Let’s try using step, which will automatically select the best combination of predictors for us. We’ll start with backward elimination, since it can start directly from the previous model with all the predictors.

model_backward <- step(object= model_all, direction= "backward",
                       trace= F)
summary(model_backward)
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.70 -111.07    3.03  122.15  483.30 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6426.101   1194.611  -5.379 4.04e-06 ***
## percent_m          9.332      3.350   2.786  0.00828 ** 
## mean_education    18.012      5.275   3.414  0.00153 ** 
## police_exp60      10.265      1.552   6.613 8.26e-08 ***
## m_per1000f         2.234      1.360   1.642  0.10874    
## unemploy_m24      -6.087      3.339  -1.823  0.07622 .  
## unemploy_m39      18.735      7.248   2.585  0.01371 *  
## inequality         6.133      1.396   4.394 8.63e-05 ***
## prob_prison    -3796.032   1490.646  -2.547  0.01505 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
## F-statistic: 17.74 on 8 and 38 DF,  p-value: 1.159e-10

This one has an Adjusted R-squared of 0.7444, which is better than the previous model. Let’s also try building models with the forward and both directions to see if we can get an even better Adjusted R-squared. For these two directions we need a starting model with no predictors at all, so we put 1 (an intercept-only model) where the predictors would normally go.

model_none <- lm(crime_rate ~ 1, data)
model_forward <- step(object = model_none,
                      direction = "forward",
                      scope = list(upper= model_all),
                      trace=F)
summary(model_forward)
## 
## Call:
## lm(formula = crime_rate ~ police_exp60 + inequality + mean_education + 
##     percent_m + prob_prison + unemploy_m39, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470.68  -78.41  -19.68  133.12  556.23 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5040.505    899.843  -5.602 1.72e-06 ***
## police_exp60      11.502      1.375   8.363 2.56e-10 ***
## inequality         6.765      1.394   4.855 1.88e-05 ***
## mean_education    19.647      4.475   4.390 8.07e-05 ***
## percent_m         10.502      3.330   3.154  0.00305 ** 
## prob_prison    -3801.836   1528.097  -2.488  0.01711 *  
## unemploy_m39       8.937      4.091   2.185  0.03483 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared:  0.7659, Adjusted R-squared:  0.7307 
## F-statistic: 21.81 on 6 and 40 DF,  p-value: 3.418e-11
model_both <- step(object = model_none,
                      direction = "both",
                      scope = list(upper= model_all),
                      trace=F)
summary(model_both)
## 
## Call:
## lm(formula = crime_rate ~ police_exp60 + inequality + mean_education + 
##     percent_m + prob_prison + unemploy_m39, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470.68  -78.41  -19.68  133.12  556.23 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5040.505    899.843  -5.602 1.72e-06 ***
## police_exp60      11.502      1.375   8.363 2.56e-10 ***
## inequality         6.765      1.394   4.855 1.88e-05 ***
## mean_education    19.647      4.475   4.390 8.07e-05 ***
## percent_m         10.502      3.330   3.154  0.00305 ** 
## prob_prison    -3801.836   1528.097  -2.488  0.01711 *  
## unemploy_m39       8.937      4.091   2.185  0.03483 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared:  0.7659, Adjusted R-squared:  0.7307 
## F-statistic: 21.81 on 6 and 40 DF,  p-value: 3.418e-11

It would seem that the forward and both searches select exactly the same model, with an Adjusted R-squared of 0.7307. That makes them better than the all-predictors model but worse than the backward model.
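
As a quick sanity check, here is a small sketch that gathers the Adjusted R-squared of every candidate model into one named vector to confirm the ranking:

# Compare the Adjusted R-squared of all candidate models at a glance
sapply(list(police   = model_police,
            all      = model_all,
            backward = model_backward,
            forward  = model_forward,
            both     = model_both),
       function(m) summary(m)$adj.r.squared)

With the ranking confirmed, let’s use the two best models, backward and forward, to predict our data.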

data$pred_backward <- predict(model_backward, data)
data$pred_forward <- predict(model_forward, data)
rmarkdown::paged_table(data)

Evaluating The Model

Error

While we could inspect the error of each individual prediction, that would be very tedious. Let’s instead use MAPE (Mean Absolute Percentage Error) from the MLmetrics package, which summarizes the error across all predictions as a single percentage.

library(MLmetrics)
MAPE(data$pred_backward, data$crime_rate)
## [1] 0.1748751
MAPE(data$pred_forward, data$crime_rate)
## [1] 0.1695908
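
For reference, MAPE is simply the mean of |actual − predicted| / actual. A minimal sketch reproducing the backward model’s score by hand:

# Manual MAPE for the backward model; should match MAPE() above
mean(abs(data$crime_rate - data$pred_backward) / data$crime_rate)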

It seems that although the backward model has the better Adjusted R-squared, the forward model has a slightly lower error percentage.

Linearity

Another important thing to check is whether our model satisfies the classical assumptions of linear regression, so that the estimator is BLUE (Best Linear Unbiased Estimator). The first assumption is linearity. From here on, let’s analyze just the backward model. We’ll plot the residuals against the fitted values in a scatter plot.

plot(model_backward, which = 1)
abline(h = 100, col = "green")
abline(h = -100, col = "green")


We want the residuals to “bounce randomly” around 0, with no obvious pattern or sudden spikes. Ours seem to satisfy this condition.

Normality of Residuals

Next, we want to make sure that our residuals are normally distributed, with most of them close to 0. We’ll plot them in a histogram.

# histogram residual
hist(model_backward$residuals)


Most of our residuals do appear to lie near 0, though there is a small spike around 200. To be more rigorous, we can also run the Shapiro-Wilk normality test on the residuals.

shapiro.test(model_backward$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_backward$residuals
## W = 0.98511, p-value = 0.8051

The null hypothesis of the Shapiro-Wilk test is that the data are normally distributed, so to pass the test, our p-value has to be above the significance level alpha, conventionally 0.05. Our p-value is 0.8051 > 0.05, so we fail to reject normality and our backward model passes the test.
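
A one-line sketch of that decision rule in code, assuming the conventional alpha of 0.05:

# TRUE means p > 0.05, i.e. we fail to reject normality of the residuals
shapiro.test(model_backward$residuals)$p.value > 0.05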

Homoscedasticity of Residuals

Next, we have to check whether our errors are spread randomly, with constant variance and no discernible pattern. This condition is called homoscedasticity. We can check visually with a scatter plot like the one below.

plot(x = model_backward$fitted.values, y = model_backward$residuals)
abline(h = 0, col = "red")


However, reading such plots can be difficult if we’re not familiar with the patterns that typically appear. Instead, we can use bptest, the studentized Breusch-Pagan test, from the lmtest package.

library(lmtest)
bptest(model_backward)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_backward
## BP = 13.51, df = 8, p-value = 0.09546

Same as before, the null hypothesis here is homoscedasticity, so we want our p-value to be above alpha (0.05). Our p-value is 0.09546 > 0.05, so our backward model also passes the Breusch-Pagan test.

No Multicollinearity

Last but not least, we have to check that there is no strong correlation between our predictors, i.e. no multicollinearity. We can use the VIF (Variance Inflation Factor) from the car package for this.

library(car)
vif(model_backward)
##      percent_m mean_education   police_exp60     m_per1000f   unemploy_m24 
##       2.131963       4.189684       2.560496       1.932367       4.360038 
##   unemploy_m39     inequality    prob_prison 
##       4.508106       3.731074       1.381879

A common rule of thumb is that every predictor should have a VIF below 10. All of our predictors pass this check as well.
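
To make the metric concrete, here is a sketch of how the VIF of a single predictor, say percent_m, could be reproduced by hand: it is 1 / (1 − R²) from the auxiliary regression of that predictor on all the other predictors in the model.

# Auxiliary regression: percent_m on the other predictors of model_backward
aux <- lm(percent_m ~ mean_education + police_exp60 + m_per1000f +
            unemploy_m24 + unemploy_m39 + inequality + prob_prison,
          data = data)

# VIF = 1 / (1 - R^2); should match vif(model_backward)["percent_m"]
1 / (1 - summary(aux)$r.squared)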

Conclusion

Model Evaluation

The model with the best Adjusted R-squared is model_backward, which uses the predictors percent_m, mean_education, police_exp60, m_per1000f, unemploy_m24, unemploy_m39, inequality, and prob_prison. The model fulfills the BLUE assumptions and has a MAPE of around 17.5%. That said, there is still room for improvement, considering its Adjusted R-squared is only about 74.4%.

Crime Rate

If we take a look once more at the summary of model_backward,

summary(model_backward)
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.70 -111.07    3.03  122.15  483.30 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6426.101   1194.611  -5.379 4.04e-06 ***
## percent_m          9.332      3.350   2.786  0.00828 ** 
## mean_education    18.012      5.275   3.414  0.00153 ** 
## police_exp60      10.265      1.552   6.613 8.26e-08 ***
## m_per1000f         2.234      1.360   1.642  0.10874    
## unemploy_m24      -6.087      3.339  -1.823  0.07622 .  
## unemploy_m39      18.735      7.248   2.585  0.01371 *  
## inequality         6.133      1.396   4.394 8.63e-05 ***
## prob_prison    -3796.032   1490.646  -2.547  0.01505 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
## F-statistic: 17.74 on 8 and 38 DF,  p-value: 1.159e-10

We can see that the predictors with the strongest and most significant effects on the target variable are police_exp60 (police expenditure) and inequality, both with positive coefficients. From this we can conclude that income inequality between social classes is strongly associated with higher crime rates, so reducing social inequality may help decrease crime. The positive coefficient on police expenditure should be read with more care: rather than police spending driving crime up, it more plausibly reflects that states with higher crime rates spend more on their police forces. Either way, this is a correlational model, so causal claims should be made cautiously.