Regression for Crime Rate
Daniel Lumban Gaol
The following is an analysis of the average crime rate of how socio-demographic variables influence crime rates in a region.
We want to know what factors are very influential from the socio demographic, where we can predict the crime rate from the predictor variables we get.
# Data Wrangling
library(tidyverse)
# Cek asumsi model
library(lmtest)
library(car)
# Menghitung error
library(MLmetrics)
# Visualiasi Korelasi
library(GGally)
crime_data<- read.csv("crime.csv")
str(crime_data)
## 'data.frame': 47 obs. of 16 variables:
## $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
## $ is_south : int 1 0 1 0 0 0 1 1 1 0 ...
## $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
## $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
## $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
## $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
## $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
## $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
## $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
## $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
## $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
## $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
## $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
## $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
## $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
## $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
Data Description:
percent_m: percentage of males aged 14-24
is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
mean_education: mean years of schooling
police_exp60: police expenditure in 1960
police_exp59: police expenditure in 1959
labour_participation: labour force participation rate
m_per1000f: number of males per 1000 females
state_pop: state population
nonwhites_per1000: number of non-whites resident per 1000 people
unemploy_m24: unemployment rate of urban males aged 14-24
unemploy_m39: unemployment rate of urban males aged 35-39
gdp: gross domestic product per head
inequality: income inequality
prob_prison: probability of imprisonment
time_prison: avg time served in prisons
crime_rate: crime rate in an unspecified category Produce a linear model
#checking missing value in each column
colSums(is.na(crime_data))
## percent_m is_south mean_education
## 0 0 0
## police_exp60 police_exp59 labour_participation
## 0 0 0
## m_per1000f state_pop nonwhites_per1000
## 0 0 0
## unemploy_m24 unemploy_m39 gdp
## 0 0 0
## inequality prob_prison time_prison
## 0 0 0
## crime_rate
## 0
ggcorr(crime_data, label = TRUE, label_size = 2.5, hjust = 1, layout.exp = 3)
From the correlation graph above, we can quickly see the relationship between the predictors and the target variables, if you look at the predictors that have a strong relationship with the target variable are gdp, state_pop, police_exp59, police_exp60 and mean_education.
Split the data crime_data to be data train and data test, with the proporsion 80% train, 20% test
RNGkind(sample.kind = "Rounding")
library(rsample)
set.seed(100)
index <- initial_split(crime_data, prop = 0.8, strata = "crime_rate")
data_train <- training(index)
data_test <- testing(index)
model_crime <- lm(crime_rate ~ ., data_train)
summary(model_crime)
##
## Call:
## lm(formula = crime_rate ~ ., data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -399.35 -113.24 -7.22 97.57 476.97
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7208.2954 1929.4526 -3.736 0.00108 **
## percent_m 9.6455 4.8526 1.988 0.05888 .
## is_south -47.8474 169.2143 -0.283 0.77989
## mean_education 21.3571 7.1292 2.996 0.00646 **
## police_exp60 17.1212 13.5617 1.262 0.21943
## police_exp59 -9.4456 14.7226 -0.642 0.52750
## labour_participation -0.7024 1.7103 -0.411 0.68510
## m_per1000f 2.1027 2.3316 0.902 0.37648
## state_pop -0.7186 1.4062 -0.511 0.61423
## nonwhites_per1000 0.5840 0.7237 0.807 0.42796
## unemploy_m24 -7.6064 5.0030 -1.520 0.14205
## unemploy_m39 21.6649 10.5631 2.051 0.05183 .
## gdp 1.6308 1.2472 1.308 0.20393
## inequality 8.5440 2.6982 3.167 0.00431 **
## prob_prison -5256.7028 2542.4776 -2.068 0.05011 .
## time_prison -5.6192 8.0215 -0.701 0.49063
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 224.4 on 23 degrees of freedom
## Multiple R-squared: 0.8106, Adjusted R-squared: 0.6871
## F-statistic: 6.564 on 15 and 23 DF, p-value: 3.539e-05
From the results of the model with all predictors, get the value Adjusted R-square = 0.6871 or 68%, and predictors that have a significant value are:
percent_m, mean_education, police_exp60, unemploy_m39, inequality, prob_prison.
A more appropriate measure if you want to find a trade-off between the model’s ability to follow the data pattern and its predictive ability is to use the Akaike Information Criterion (AIC), with a good model having a low AIC.
Backward method is done by gradually removing variables from the model. If after removing a variable the AIC value decreases, the new model will be selected. But if after removing a variable the AIC value increases, then the previous model will still be used.
step(model_crime, direction = "backward")
## Start: AIC=433.64
## crime_rate ~ percent_m + is_south + mean_education + police_exp60 +
## police_exp59 + labour_participation + m_per1000f + state_pop +
## nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality +
## prob_prison + time_prison
##
## Df Sum of Sq RSS AIC
## - is_south 1 4025 1161857 431.78
## - labour_participation 1 8491 1166323 431.93
## - state_pop 1 13144 1170977 432.08
## - police_exp59 1 20721 1178553 432.33
## - time_prison 1 24703 1182536 432.46
## - nonwhites_per1000 1 32781 1190613 432.73
## - m_per1000f 1 40944 1198776 433.00
## <none> 1157832 433.64
## - police_exp60 1 80234 1238066 434.25
## - gdp 1 86072 1243904 434.44
## - unemploy_m24 1 116362 1274195 435.38
## - percent_m 1 198891 1356723 437.82
## - unemploy_m39 1 211761 1369593 438.19
## - prob_prison 1 215194 1373027 438.29
## - mean_education 1 451769 1609602 444.49
## - inequality 1 504755 1662588 445.75
##
## Step: AIC=431.78
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 +
## labour_participation + m_per1000f + state_pop + nonwhites_per1000 +
## unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison +
## time_prison
##
## Df Sum of Sq RSS AIC
## - labour_participation 1 5003 1166860 429.94
## - state_pop 1 13507 1175365 430.23
## - police_exp59 1 19507 1181364 430.43
## - time_prison 1 22069 1183926 430.51
## - nonwhites_per1000 1 29061 1190918 430.74
## - m_per1000f 1 38392 1200249 431.04
## <none> 1161857 431.78
## - police_exp60 1 79661 1241518 432.36
## - gdp 1 82402 1244259 432.45
## - unemploy_m24 1 119971 1281828 433.61
## - percent_m 1 194949 1356807 435.83
## - unemploy_m39 1 211611 1373469 436.30
## - prob_prison 1 226287 1388144 436.72
## - mean_education 1 449036 1610894 442.52
## - inequality 1 553702 1715560 444.98
##
## Step: AIC=429.94
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 +
## m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 +
## unemploy_m39 + gdp + inequality + prob_prison + time_prison
##
## Df Sum of Sq RSS AIC
## - state_pop 1 15447 1182307 428.46
## - police_exp59 1 15772 1182632 428.47
## - time_prison 1 22496 1189356 428.69
## - nonwhites_per1000 1 26133 1192993 428.81
## - m_per1000f 1 33584 1200444 429.05
## <none> 1166860 429.94
## - police_exp60 1 74734 1241594 430.37
## - gdp 1 81264 1248124 430.57
## - unemploy_m24 1 115047 1281907 431.61
## - prob_prison 1 221972 1388832 434.74
## - percent_m 1 223150 1390010 434.77
## - unemploy_m39 1 237300 1404160 435.16
## - mean_education 1 459305 1626165 440.89
## - inequality 1 550275 1717135 443.01
##
## Step: AIC=428.46
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 +
## m_per1000f + nonwhites_per1000 + unemploy_m24 + unemploy_m39 +
## gdp + inequality + prob_prison + time_prison
##
## Df Sum of Sq RSS AIC
## - police_exp59 1 16142 1198449 426.99
## - nonwhites_per1000 1 27344 1209651 427.35
## - time_prison 1 34943 1217250 427.59
## - m_per1000f 1 58977 1241285 428.36
## <none> 1182307 428.46
## - police_exp60 1 70105 1252412 428.70
## - gdp 1 75736 1258044 428.88
## - unemploy_m24 1 136978 1319285 430.73
## - prob_prison 1 218748 1401055 433.08
## - percent_m 1 239595 1421902 433.65
## - unemploy_m39 1 255139 1437446 434.08
## - mean_education 1 459010 1641318 439.25
## - inequality 1 540718 1723025 441.15
##
## Step: AIC=426.99
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f +
## nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality +
## prob_prison + time_prison
##
## Df Sum of Sq RSS AIC
## - nonwhites_per1000 1 19427 1217876 425.61
## - time_prison 1 22157 1220606 425.70
## <none> 1198449 426.99
## - gdp 1 67475 1265924 427.12
## - m_per1000f 1 96293 1294742 428.00
## - unemploy_m24 1 149740 1348189 429.58
## - prob_prison 1 202618 1401067 431.08
## - percent_m 1 229478 1427927 431.82
## - unemploy_m39 1 270803 1469252 432.93
## - mean_education 1 450871 1649320 437.44
## - police_exp60 1 537596 1736045 439.44
## - inequality 1 540740 1739189 439.51
##
## Step: AIC=425.61
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f +
## unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison +
## time_prison
##
## Df Sum of Sq RSS AIC
## - time_prison 1 15300 1233175 424.10
## - gdp 1 59411 1277287 425.47
## <none> 1217876 425.61
## - m_per1000f 1 83058 1300934 426.19
## - unemploy_m24 1 142864 1360740 427.94
## - prob_prison 1 184473 1402348 429.11
## - unemploy_m39 1 269060 1486936 431.40
## - percent_m 1 336661 1554536 433.13
## - mean_education 1 439519 1657395 435.63
## - inequality 1 620091 1837967 439.66
## - police_exp60 1 810365 2028241 443.51
##
## Step: AIC=424.1
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f +
## unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
##
## Df Sum of Sq RSS AIC
## - gdp 1 54881 1288056 423.80
## <none> 1233175 424.10
## - m_per1000f 1 120010 1353186 425.72
## - unemploy_m24 1 136881 1370057 426.21
## - prob_prison 1 187861 1421036 427.63
## - unemploy_m39 1 256783 1489958 429.48
## - percent_m 1 321597 1554772 431.14
## - mean_education 1 445529 1678704 434.13
## - inequality 1 605673 1838848 437.68
## - police_exp60 1 814813 2047988 441.88
##
## Step: AIC=423.8
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f +
## unemploy_m24 + unemploy_m39 + inequality + prob_prison
##
## Df Sum of Sq RSS AIC
## <none> 1288056 423.80
## - m_per1000f 1 150733 1438789 426.11
## - unemploy_m24 1 170339 1458395 426.64
## - prob_prison 1 219958 1508013 427.95
## - percent_m 1 276977 1565033 429.39
## - unemploy_m39 1 307352 1595408 430.14
## - mean_education 1 514283 1802339 434.90
## - inequality 1 669391 1957447 438.12
## - police_exp60 1 1266345 2554401 448.50
##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
## data = data_train)
##
## Coefficients:
## (Intercept) percent_m mean_education police_exp60 m_per1000f
## -7471.644 9.400 21.209 10.021 2.891
## unemploy_m24 unemploy_m39 inequality prob_prison
## -7.789 23.685 6.482 -3779.431
Model with the lowest AIC using the backward method is 423.8 with the model :
crime.rate = 9.400(percent_m) + 21.209(mean_education) + 10.021(police_exp60) + 2.891(m_per1000f) + -7.789(unemploy_m24) + 23.685(unemploy_39) + 6.482(inequality) + -3779.431(prob_prison)
crime.model <- lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
data = data_train)
summary(crime.model)
##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
## data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -444.41 -118.32 8.16 134.38 477.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7471.644 1427.148 -5.235 1.2e-05 ***
## percent_m 9.400 3.701 2.540 0.01651 *
## mean_education 21.209 6.128 3.461 0.00164 **
## police_exp60 10.021 1.845 5.431 6.9e-06 ***
## m_per1000f 2.891 1.543 1.874 0.07074 .
## unemploy_m24 -7.789 3.910 -1.992 0.05556 .
## unemploy_m39 23.685 8.852 2.676 0.01197 *
## inequality 6.482 1.642 3.949 0.00044 ***
## prob_prison -3779.431 1669.797 -2.263 0.03101 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 207.2 on 30 degrees of freedom
## Multiple R-squared: 0.7893, Adjusted R-squared: 0.7332
## F-statistic: 14.05 on 8 and 30 DF, p-value: 3.014e-08
After using the crime.model model, the value Adjusted R-Squared = 0.7332 / 73% occurs an increase of the Adjusted R-Squared value from the previous model, if all predictors are used, Adjusted R-square = 0.6871 or 68%
predict.crime <- predict(crime.model, newdata = data_test)
head(predict.crime,1)
## 3
## 358.1575
as.data.frame(predict.crime)
## predict.crime
## 3 358.1575
## 12 725.2124
## 14 821.2269
## 17 420.4453
## 20 1279.9018
## 27 253.2388
## 34 1006.1802
## 47 1156.3398
interpretation :
The results of the prediction of average criminals if using the model from (crime.model) on the data (data_test) on the data in the first row will get a crime rate score 391.6707
MAE(predict.crime, data_test$crime_rate) # Mean Average Error
## [1] 144.1993
MAPE(predict.crime, data_test$crime_rate) # Mean Average Percetage Error
## [1] 0.2174213
RMSE(predict.crime, data_test$crime_rate) # Root Mean Squared Error
## [1] 163.8212
interpretation :
MAE : On average, the prediction of crime rate missed by 144.1993, can be higher or lower
MAPE : The model predicts medical charges with a probability of being deviated by 21% of their actual value
RMSE : The model gave a mean of 163.8212 deviant prediction results,RMSE can be used if we are more concerned with very large errors.
terms: All of variabel VIF < 10
vif(crime.model)
## percent_m mean_education police_exp60 m_per1000f unemploy_m24
## 2.148244 4.276462 2.912328 1.931593 4.382927
## unemploy_m39 inequality prob_prison
## 4.492641 3.924630 1.360445
all values of vif are <10, so that the variable does not occur multicollinearity or between predictor variables does not have a strong relationship.
plot(density(crime.model$residuals))
From the graph above, it can be concluded that the error of the data is normally distributed, assuming the curve is symmetrical and resembles a bell so that the mean, median and mode lie at one point.
terms : p-value > 0.05
shapiro.test(crime.model$residuals)
##
## Shapiro-Wilk normality test
##
## data: crime.model$residuals
## W = 0.98922, p-value = 0.9666
Because the p-value is> 0.05, it can be said that the residual / error is normally distributed
terms : p-value > 0.05
bptest(crime.model)
##
## studentized Breusch-Pagan test
##
## data: crime.model
## BP = 12.847, df = 8, p-value = 0.1172
P-value> 0.05 so that H0 is accepted. This also means that the residuals do not have a pattern (heteroscedasticity) where all the existing patterns have been captured by the model created.
The model of the model_crime that we tested using all the predictors got an R-square result of 68%, after that when using the backward method, he would iterate until he found the right model with the lowestAIC reference, namely 423, with the increase in the R-square rate to 73% with the model name ‘crime.model’ so that the socio-demographic factors that affect the crime rate in an area is:
m_per1000f, unemploy_m24, prob_prison, unemploy_m39, percent_m, mean_education, inequality, police_exp60
After testing the analysis the model has good criteria with RMSE : 163.8212