Regression for Crime Rate
Daniel Lumban Gaol

Introduction :

The following is an analysis of the average crime rate of how socio-demographic variables influence crime rates in a region.

Case :

We want to know what factors are very influential from the socio demographic, where we can predict the crime rate from the predictor variables we get.

1. Library and Setup

# Data Wrangling
library(tidyverse)

# Cek asumsi model
library(lmtest)
library(car)

# Menghitung error
library(MLmetrics)

# Visualiasi Korelasi
library(GGally)

2. Import Data

crime_data<- read.csv("crime.csv")

str(crime_data)

## 'data.frame':    47 obs. of  16 variables:
##  $ percent_m           : int  151 143 142 136 141 121 127 131 157 140 ...
##  $ is_south            : int  1 0 1 0 0 0 1 1 1 0 ...
##  $ mean_education      : int  91 113 89 121 121 110 111 109 90 118 ...
##  $ police_exp60        : int  58 103 45 149 109 118 82 115 65 71 ...
##  $ police_exp59        : int  56 95 44 141 101 115 79 109 62 68 ...
##  $ labour_participation: int  510 583 533 577 591 547 519 542 553 632 ...
##  $ m_per1000f          : int  950 1012 969 994 985 964 982 969 955 1029 ...
##  $ state_pop           : int  33 13 18 157 18 25 4 50 39 7 ...
##  $ nonwhites_per1000   : int  301 102 219 80 30 44 139 179 286 15 ...
##  $ unemploy_m24        : int  108 96 94 102 91 84 97 79 81 100 ...
##  $ unemploy_m39        : int  41 36 33 39 20 29 38 35 28 24 ...
##  $ gdp                 : int  394 557 318 673 578 689 620 472 421 526 ...
##  $ inequality          : int  261 194 250 167 174 126 168 206 239 174 ...
##  $ prob_prison         : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
##  $ time_prison         : num  26.2 25.3 24.3 29.9 21.3 ...
##  $ crime_rate          : int  791 1635 578 1969 1234 682 963 1555 856 705 ...

Data Description:

percent_m: percentage of males aged 14-24

is_south: whether it is in a Southern state. 1 for Yes, 0 for No.

mean_education: mean years of schooling

police_exp60: police expenditure in 1960

police_exp59: police expenditure in 1959

labour_participation: labour force participation rate

m_per1000f: number of males per 1000 females

state_pop: state population

nonwhites_per1000: number of non-whites resident per 1000 people

unemploy_m24: unemployment rate of urban males aged 14-24

unemploy_m39: unemployment rate of urban males aged 35-39

gdp: gross domestic product per head

inequality: income inequality

prob_prison: probability of imprisonment

time_prison: avg time served in prisons

crime_rate: crime rate in an unspecified category Produce a linear model

3. Exploratory Data Analysis

#checking missing value in each column
colSums(is.na(crime_data))

##            percent_m             is_south       mean_education 
##                    0                    0                    0 
##         police_exp60         police_exp59 labour_participation 
##                    0                    0                    0 
##           m_per1000f            state_pop    nonwhites_per1000 
##                    0                    0                    0 
##         unemploy_m24         unemploy_m39                  gdp 
##                    0                    0                    0 
##           inequality          prob_prison          time_prison 
##                    0                    0                    0 
##           crime_rate 
##                    0

ggcorr(crime_data, label = TRUE, label_size = 2.5, hjust = 1, layout.exp = 3)

From the correlation graph above, we can quickly see the relationship between the predictors and the target variables, if you look at the predictors that have a strong relationship with the target variable are gdp, state_pop, police_exp59, police_exp60 and mean_education.

Cross Validation

Split the data crime_data to be data train and data test, with the proporsion 80% train, 20% test

RNGkind(sample.kind =  "Rounding")
library(rsample)
set.seed(100)

index <- initial_split(crime_data, prop = 0.8, strata = "crime_rate")
data_train <- training(index)
data_test <- testing(index)

3. Model

A model with all predictors

model_crime <- lm(crime_rate ~ ., data_train)

summary(model_crime)

## 
## Call:
## lm(formula = crime_rate ~ ., data = data_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -399.35 -113.24   -7.22   97.57  476.97 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          -7208.2954  1929.4526  -3.736  0.00108 **
## percent_m                9.6455     4.8526   1.988  0.05888 . 
## is_south               -47.8474   169.2143  -0.283  0.77989   
## mean_education          21.3571     7.1292   2.996  0.00646 **
## police_exp60            17.1212    13.5617   1.262  0.21943   
## police_exp59            -9.4456    14.7226  -0.642  0.52750   
## labour_participation    -0.7024     1.7103  -0.411  0.68510   
## m_per1000f               2.1027     2.3316   0.902  0.37648   
## state_pop               -0.7186     1.4062  -0.511  0.61423   
## nonwhites_per1000        0.5840     0.7237   0.807  0.42796   
## unemploy_m24            -7.6064     5.0030  -1.520  0.14205   
## unemploy_m39            21.6649    10.5631   2.051  0.05183 . 
## gdp                      1.6308     1.2472   1.308  0.20393   
## inequality               8.5440     2.6982   3.167  0.00431 **
## prob_prison          -5256.7028  2542.4776  -2.068  0.05011 . 
## time_prison             -5.6192     8.0215  -0.701  0.49063   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 224.4 on 23 degrees of freedom
## Multiple R-squared:  0.8106, Adjusted R-squared:  0.6871 
## F-statistic: 6.564 on 15 and 23 DF,  p-value: 3.539e-05

From the results of the model with all predictors, get the value Adjusted R-square = 0.6871 or 68%, and predictors that have a significant value are:

percent_m, mean_education, police_exp60, unemploy_m39, inequality, prob_prison.

Backward Method

A more appropriate measure if you want to find a trade-off between the model’s ability to follow the data pattern and its predictive ability is to use the Akaike Information Criterion (AIC), with a good model having a low AIC.

Backward method is done by gradually removing variables from the model. If after removing a variable the AIC value decreases, the new model will be selected. But if after removing a variable the AIC value increases, then the previous model will still be used.

step(model_crime, direction = "backward")

## Start:  AIC=433.64
## crime_rate ~ percent_m + is_south + mean_education + police_exp60 + 
##     police_exp59 + labour_participation + m_per1000f + state_pop + 
##     nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality + 
##     prob_prison + time_prison
## 
##                        Df Sum of Sq     RSS    AIC
## - is_south              1      4025 1161857 431.78
## - labour_participation  1      8491 1166323 431.93
## - state_pop             1     13144 1170977 432.08
## - police_exp59          1     20721 1178553 432.33
## - time_prison           1     24703 1182536 432.46
## - nonwhites_per1000     1     32781 1190613 432.73
## - m_per1000f            1     40944 1198776 433.00
## <none>                              1157832 433.64
## - police_exp60          1     80234 1238066 434.25
## - gdp                   1     86072 1243904 434.44
## - unemploy_m24          1    116362 1274195 435.38
## - percent_m             1    198891 1356723 437.82
## - unemploy_m39          1    211761 1369593 438.19
## - prob_prison           1    215194 1373027 438.29
## - mean_education        1    451769 1609602 444.49
## - inequality            1    504755 1662588 445.75
## 
## Step:  AIC=431.78
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
##     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
##     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + 
##     time_prison
## 
##                        Df Sum of Sq     RSS    AIC
## - labour_participation  1      5003 1166860 429.94
## - state_pop             1     13507 1175365 430.23
## - police_exp59          1     19507 1181364 430.43
## - time_prison           1     22069 1183926 430.51
## - nonwhites_per1000     1     29061 1190918 430.74
## - m_per1000f            1     38392 1200249 431.04
## <none>                              1161857 431.78
## - police_exp60          1     79661 1241518 432.36
## - gdp                   1     82402 1244259 432.45
## - unemploy_m24          1    119971 1281828 433.61
## - percent_m             1    194949 1356807 435.83
## - unemploy_m39          1    211611 1373469 436.30
## - prob_prison           1    226287 1388144 436.72
## - mean_education        1    449036 1610894 442.52
## - inequality            1    553702 1715560 444.98
## 
## Step:  AIC=429.94
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
##     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
##     unemploy_m39 + gdp + inequality + prob_prison + time_prison
## 
##                     Df Sum of Sq     RSS    AIC
## - state_pop          1     15447 1182307 428.46
## - police_exp59       1     15772 1182632 428.47
## - time_prison        1     22496 1189356 428.69
## - nonwhites_per1000  1     26133 1192993 428.81
## - m_per1000f         1     33584 1200444 429.05
## <none>                           1166860 429.94
## - police_exp60       1     74734 1241594 430.37
## - gdp                1     81264 1248124 430.57
## - unemploy_m24       1    115047 1281907 431.61
## - prob_prison        1    221972 1388832 434.74
## - percent_m          1    223150 1390010 434.77
## - unemploy_m39       1    237300 1404160 435.16
## - mean_education     1    459305 1626165 440.89
## - inequality         1    550275 1717135 443.01
## 
## Step:  AIC=428.46
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
##     m_per1000f + nonwhites_per1000 + unemploy_m24 + unemploy_m39 + 
##     gdp + inequality + prob_prison + time_prison
## 
##                     Df Sum of Sq     RSS    AIC
## - police_exp59       1     16142 1198449 426.99
## - nonwhites_per1000  1     27344 1209651 427.35
## - time_prison        1     34943 1217250 427.59
## - m_per1000f         1     58977 1241285 428.36
## <none>                           1182307 428.46
## - police_exp60       1     70105 1252412 428.70
## - gdp                1     75736 1258044 428.88
## - unemploy_m24       1    136978 1319285 430.73
## - prob_prison        1    218748 1401055 433.08
## - percent_m          1    239595 1421902 433.65
## - unemploy_m39       1    255139 1437446 434.08
## - mean_education     1    459010 1641318 439.25
## - inequality         1    540718 1723025 441.15
## 
## Step:  AIC=426.99
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
##     nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality + 
##     prob_prison + time_prison
## 
##                     Df Sum of Sq     RSS    AIC
## - nonwhites_per1000  1     19427 1217876 425.61
## - time_prison        1     22157 1220606 425.70
## <none>                           1198449 426.99
## - gdp                1     67475 1265924 427.12
## - m_per1000f         1     96293 1294742 428.00
## - unemploy_m24       1    149740 1348189 429.58
## - prob_prison        1    202618 1401067 431.08
## - percent_m          1    229478 1427927 431.82
## - unemploy_m39       1    270803 1469252 432.93
## - mean_education     1    450871 1649320 437.44
## - police_exp60       1    537596 1736045 439.44
## - inequality         1    540740 1739189 439.51
## 
## Step:  AIC=425.61
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
##     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + 
##     time_prison
## 
##                  Df Sum of Sq     RSS    AIC
## - time_prison     1     15300 1233175 424.10
## - gdp             1     59411 1277287 425.47
## <none>                        1217876 425.61
## - m_per1000f      1     83058 1300934 426.19
## - unemploy_m24    1    142864 1360740 427.94
## - prob_prison     1    184473 1402348 429.11
## - unemploy_m39    1    269060 1486936 431.40
## - percent_m       1    336661 1554536 433.13
## - mean_education  1    439519 1657395 435.63
## - inequality      1    620091 1837967 439.66
## - police_exp60    1    810365 2028241 443.51
## 
## Step:  AIC=424.1
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
##     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
## 
##                  Df Sum of Sq     RSS    AIC
## - gdp             1     54881 1288056 423.80
## <none>                        1233175 424.10
## - m_per1000f      1    120010 1353186 425.72
## - unemploy_m24    1    136881 1370057 426.21
## - prob_prison     1    187861 1421036 427.63
## - unemploy_m39    1    256783 1489958 429.48
## - percent_m       1    321597 1554772 431.14
## - mean_education  1    445529 1678704 434.13
## - inequality      1    605673 1838848 437.68
## - police_exp60    1    814813 2047988 441.88
## 
## Step:  AIC=423.8
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
##     unemploy_m24 + unemploy_m39 + inequality + prob_prison
## 
##                  Df Sum of Sq     RSS    AIC
## <none>                        1288056 423.80
## - m_per1000f      1    150733 1438789 426.11
## - unemploy_m24    1    170339 1458395 426.64
## - prob_prison     1    219958 1508013 427.95
## - percent_m       1    276977 1565033 429.39
## - unemploy_m39    1    307352 1595408 430.14
## - mean_education  1    514283 1802339 434.90
## - inequality      1    669391 1957447 438.12
## - police_exp60    1   1266345 2554401 448.50

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = data_train)
## 
## Coefficients:
##    (Intercept)       percent_m  mean_education    police_exp60      m_per1000f  
##      -7471.644           9.400          21.209          10.021           2.891  
##   unemploy_m24    unemploy_m39      inequality     prob_prison  
##         -7.789          23.685           6.482       -3779.431

Model with the lowest AIC using the backward method is 423.8 with the model :

crime.rate = 9.400(percent_m) + 21.209(mean_education) + 10.021(police_exp60) + 2.891(m_per1000f) + -7.789(unemploy_m24) + 23.685(unemploy_39) + 6.482(inequality) + -3779.431(prob_prison)

crime.model <- lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
                m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
                data = data_train)

summary(crime.model)

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = data_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.41 -118.32    8.16  134.38  477.61 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -7471.644   1427.148  -5.235  1.2e-05 ***
## percent_m          9.400      3.701   2.540  0.01651 *  
## mean_education    21.209      6.128   3.461  0.00164 ** 
## police_exp60      10.021      1.845   5.431  6.9e-06 ***
## m_per1000f         2.891      1.543   1.874  0.07074 .  
## unemploy_m24      -7.789      3.910  -1.992  0.05556 .  
## unemploy_m39      23.685      8.852   2.676  0.01197 *  
## inequality         6.482      1.642   3.949  0.00044 ***
## prob_prison    -3779.431   1669.797  -2.263  0.03101 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 207.2 on 30 degrees of freedom
## Multiple R-squared:  0.7893, Adjusted R-squared:  0.7332 
## F-statistic: 14.05 on 8 and 30 DF,  p-value: 3.014e-08

After using the crime.model model, the value Adjusted R-Squared = 0.7332 / 73% occurs an increase of the Adjusted R-Squared value from the previous model, if all predictors are used, Adjusted R-square = 0.6871 or 68%

The model predict evaluation

predict.crime <- predict(crime.model, newdata = data_test)
head(predict.crime,1)

##        3 
## 358.1575

as.data.frame(predict.crime)

##    predict.crime
## 3       358.1575
## 12      725.2124
## 14      821.2269
## 17      420.4453
## 20     1279.9018
## 27      253.2388
## 34     1006.1802
## 47     1156.3398

interpretation :

The results of the prediction of average criminals if using the model from (crime.model) on the data (data_test) on the data in the first row will get a crime rate score 391.6707

Model Error

MAE(predict.crime, data_test$crime_rate) # Mean Average Error

## [1] 144.1993

MAPE(predict.crime, data_test$crime_rate) # Mean Average Percetage Error

## [1] 0.2174213

RMSE(predict.crime, data_test$crime_rate) # Root Mean Squared Error

## [1] 163.8212

interpretation :

MAE : On average, the prediction of crime rate missed by 144.1993, can be higher or lower

MAPE : The model predicts medical charges with a probability of being deviated by 21% of their actual value

RMSE : The model gave a mean of 163.8212 deviant prediction results,RMSE can be used if we are more concerned with very large errors.

4. Assumption Test

Multicollinearity

terms: All of variabel VIF < 10

vif(crime.model)

##      percent_m mean_education   police_exp60     m_per1000f   unemploy_m24 
##       2.148244       4.276462       2.912328       1.931593       4.382927 
##   unemploy_m39     inequality    prob_prison 
##       4.492641       3.924630       1.360445

all values of vif are <10, so that the variable does not occur multicollinearity or between predictor variables does not have a strong relationship.

Error normally distributed

plot(density(crime.model$residuals))

From the graph above, it can be concluded that the error of the data is normally distributed, assuming the curve is symmetrical and resembles a bell so that the mean, median and mode lie at one point.

terms : p-value > 0.05

shapiro.test(crime.model$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  crime.model$residuals
## W = 0.98922, p-value = 0.9666

Because the p-value is> 0.05, it can be said that the residual / error is normally distributed

Heteroskesdaticity/ Unequal Variance

terms : p-value > 0.05

bptest(crime.model)

## 
##  studentized Breusch-Pagan test
## 
## data:  crime.model
## BP = 12.847, df = 8, p-value = 0.1172

P-value> 0.05 so that H0 is accepted. This also means that the residuals do not have a pattern (heteroscedasticity) where all the existing patterns have been captured by the model created.

5. Conclutions

The model of the model_crime that we tested using all the predictors got an R-square result of 68%, after that when using the backward method, he would iterate until he found the right model with the lowestAIC reference, namely 423, with the increase in the R-square rate to 73% with the model name ‘crime.model’ so that the socio-demographic factors that affect the crime rate in an area is:

m_per1000f, unemploy_m24, prob_prison, unemploy_m39, percent_m, mean_education, inequality, police_exp60

After testing the analysis the model has good criteria with RMSE : 163.8212