This RMarkdown is intended for analyzing the linear regression models that can be applied to the data that we have : crime reports in several states in the US in 1960. The results of linear regression model is expected to show the effect of socio-economic condition variables on the target variable crime rate or crime_rate so that it can offer solutions that can reduce the crime rate in the US.
We are going to use these libraries.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
crime <- read.csv("crime.csv")
head(crime)
## X M So Ed Po1 Po2 LF M.F Pop NW U1 U2 GDP Ineq Prob Time y
## 1 1 151 1 91 58 56 510 950 33 301 108 41 394 261 0.084602 26.2011 791
## 2 2 143 0 113 103 95 583 1012 13 102 96 36 557 194 0.029599 25.2999 1635
## 3 3 142 1 89 45 44 533 969 18 219 94 33 318 250 0.083401 24.3006 578
## 4 4 136 0 121 149 141 577 994 157 80 102 39 673 167 0.015801 29.9012 1969
## 5 5 141 0 121 109 101 591 985 18 30 91 20 578 174 0.041399 21.2998 1234
## 6 6 121 0 110 118 115 547 964 25 44 84 29 689 126 0.034201 20.9995 682
Futhermore we will inspect our data
tail(crime)
## X M So Ed Po1 Po2 LF M.F Pop NW U1 U2 GDP Ineq Prob Time y
## 42 42 141 0 109 56 54 523 968 4 2 107 37 489 170 0.088904 12.1996 542
## 43 43 162 1 99 75 70 522 996 40 208 73 27 496 224 0.054902 31.9989 823
## 44 44 136 0 121 95 96 574 1012 29 36 111 37 622 162 0.028100 30.0001 1030
## 45 45 139 1 88 46 41 480 968 19 49 135 53 457 249 0.056202 32.5996 455
## 46 46 126 0 104 106 97 599 989 40 24 78 25 593 171 0.046598 16.6999 508
## 47 47 130 0 121 90 91 623 1049 3 22 113 40 588 160 0.052802 16.0997 849
dim(crime)
## [1] 47 17
names(crime)
## [1] "X" "M" "So" "Ed" "Po1" "Po2" "LF" "M.F" "Pop" "NW"
## [11] "U1" "U2" "GDP" "Ineq" "Prob" "Time" "y"
Now we want to exclude X from out Dataset because we will not use it.
crime <- crime %>%
select(-X) %>%
setNames(c("percent_m", "is_south", "mean_education", "police_exp60",
"police_exp59", "labour_participation", "m_per1000f", "state_pop",
"nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality",
"prob_prison", "time_prison", "crime_rate"))
We would like to know the type of our data.
str(crime)
## 'data.frame': 47 obs. of 16 variables:
## $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
## $ is_south : int 1 0 1 0 0 0 1 1 1 0 ...
## $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
## $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
## $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
## $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
## $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
## $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
## $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
## $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
## $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
## $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
## $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
## $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
## $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
## $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
we would want to change “is_south” into factor
crime <- crime %>%
mutate_at("is_south", function(x) ifelse(x==0, "No", "Yes")) %>%
mutate (is_south = factor(is_south))
now we have to check if there is NA in our data
colSums(is.na(crime))
## percent_m is_south mean_education
## 0 0 0
## police_exp60 police_exp59 labour_participation
## 0 0 0
## m_per1000f state_pop nonwhites_per1000
## 0 0 0
## unemploy_m24 unemploy_m39 gdp
## 0 0 0
## inequality prob_prison time_prison
## 0 0 0
## crime_rate
## 0
GREAT! No missing values! Then by using ggcorr, we will check the correlation of each variable to see which predictor variables are possible to determine the target variable crime_rate.
ggcorr(crime, label=T, label_size = 2)
## Warning in ggcorr(crime, label = T, label_size = 2): data in column(s)
## 'is_south' are not numeric and were ignored
Based on the the result above, it can be seen that the nonwhites_per1000 and crime_rate variables show a correlation value of 0 so that it can be said that the nonwhites_per1000 variable is not suitable to be used as a predictor variable.
Before creating the model, we go deeper into our dataset with dividing data_train and data_test first so that we can train the model first via data train and then evaluate the model via data test.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
row_data <- nrow(crime)
index <- sample(row_data, row_data*0.8)
data_train <- crime[ index, ]
data_test <- crime[ -index, ]
based on the correlation between variables in the previous chunk, we create a simple linear model using one of the variables that has a high correlation with the crime_rate variable, namely the police_exp60 variable.
model_crime <- lm(crime_rate ~ police_exp60, data_train)
summary(model_crime)
##
## Call:
## lm(formula = crime_rate ~ police_exp60, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -587.13 -138.38 25.91 150.55 566.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 164.930 150.015 1.099 0.279
## police_exp60 8.775 1.616 5.431 4.35e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 306.9 on 35 degrees of freedom
## Multiple R-squared: 0.4573, Adjusted R-squared: 0.4418
## F-statistic: 29.5 on 1 and 35 DF, p-value: 4.346e-06
From the model that we have created, it has an R-Squared value of 0.4573, meaning that the model can explain 45.73% of the data, while the remaining 54.27% is explained by variables or other factors that we do not include in the model.
R-Squared itself has a range between 0-1 (0-100%) and a good R-Squared is close to 1. However, finding the highest R-square does not necessarily guarantee to give the best model because R-square only considers how well the model follows existing data patterns and do not reflect the model’s ability to predict new data. A more appropriate measure if you want to find a trade-off between the model’s ability to follow data patterns and its predictive ability is to use the Akaike Information Criterion (AIC), and a good model has a low AIC.
Let’s try to do the selection of predictor variables using the stepwise method with direction = both and starting from the model with the complete predictor variable, to get the lowest possible AIC.
model_full <- lm(crime_rate ~., data_train)
model_step <- step(model_full, direction="both")
## Start: AIC=414.49
## crime_rate ~ percent_m + is_south + mean_education + police_exp60 +
## police_exp59 + labour_participation + m_per1000f + state_pop +
## nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality +
## prob_prison + time_prison
##
## Df Sum of Sq RSS AIC
## - labour_participation 1 653 1143026 412.52
## - time_prison 1 1921 1144294 412.56
## - police_exp59 1 2875 1145248 412.59
## - state_pop 1 3010 1145383 412.59
## - is_south 1 8757 1151130 412.78
## - nonwhites_per1000 1 13116 1155489 412.92
## - gdp 1 13142 1155515 412.92
## - police_exp60 1 28035 1170408 413.39
## - m_per1000f 1 45287 1187660 413.93
## <none> 1142373 414.49
## - prob_prison 1 70404 1212777 414.71
## - percent_m 1 97608 1239981 415.53
## - unemploy_m24 1 103796 1246169 415.71
## - mean_education 1 200844 1343218 418.49
## - unemploy_m39 1 228152 1370525 419.23
## - inequality 1 244314 1386687 419.67
##
## Step: AIC=412.52
## crime_rate ~ percent_m + is_south + mean_education + police_exp60 +
## police_exp59 + m_per1000f + state_pop + nonwhites_per1000 +
## unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison +
## time_prison
##
## Df Sum of Sq RSS AIC
## - police_exp59 1 2351 1145376 410.59
## - time_prison 1 2678 1145704 410.60
## - state_pop 1 4131 1147157 410.65
## - is_south 1 8388 1151414 410.79
## - gdp 1 12638 1155664 410.92
## - nonwhites_per1000 1 13220 1156246 410.94
## - police_exp60 1 27439 1170465 411.39
## - m_per1000f 1 52705 1195731 412.18
## <none> 1143026 412.52
## - prob_prison 1 71104 1214130 412.75
## - percent_m 1 103075 1246101 413.71
## - unemploy_m24 1 121084 1264110 414.24
## + labour_participation 1 653 1142373 414.49
## - mean_education 1 200303 1343329 416.49
## - unemploy_m39 1 227521 1370547 417.23
## - inequality 1 246070 1389096 417.73
##
## Step: AIC=410.59
## crime_rate ~ percent_m + is_south + mean_education + police_exp60 +
## m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 +
## unemploy_m39 + gdp + inequality + prob_prison + time_prison
##
## Df Sum of Sq RSS AIC
## - state_pop 1 3598 1148975 408.71
## - time_prison 1 6058 1151435 408.79
## - is_south 1 8688 1154064 408.87
## - nonwhites_per1000 1 11027 1156404 408.95
## - gdp 1 11041 1156418 408.95
## <none> 1145376 410.59
## - prob_prison 1 68965 1214341 410.76
## - m_per1000f 1 88286 1233663 411.34
## - percent_m 1 102660 1248036 411.77
## - unemploy_m24 1 122681 1268058 412.36
## + police_exp59 1 2351 1143026 412.52
## + labour_participation 1 129 1145248 412.59
## - mean_education 1 203773 1349149 414.65
## - unemploy_m39 1 233268 1378645 415.45
## - inequality 1 245177 1390554 415.77
## - police_exp60 1 358986 1504363 418.68
##
## Step: AIC=408.71
## crime_rate ~ percent_m + is_south + mean_education + police_exp60 +
## m_per1000f + nonwhites_per1000 + unemploy_m24 + unemploy_m39 +
## gdp + inequality + prob_prison + time_prison
##
## Df Sum of Sq RSS AIC
## - time_prison 1 3983 1152958 406.84
## - gdp 1 9092 1158067 407.00
## - is_south 1 9896 1158871 407.03
## - nonwhites_per1000 1 12118 1161093 407.10
## <none> 1148975 408.71
## - prob_prison 1 68307 1217282 408.84
## - percent_m 1 105640 1254614 409.96
## + state_pop 1 3598 1145376 410.59
## + police_exp59 1 1818 1147157 410.65
## + labour_participation 1 789 1148186 410.68
## - unemploy_m24 1 142101 1291075 411.02
## - m_per1000f 1 152198 1301173 411.31
## - mean_education 1 206104 1355079 412.81
## - unemploy_m39 1 246722 1395697 413.91
## - inequality 1 274184 1423159 414.63
## - police_exp60 1 402260 1551235 417.81
##
## Step: AIC=406.84
## crime_rate ~ percent_m + is_south + mean_education + police_exp60 +
## m_per1000f + nonwhites_per1000 + unemploy_m24 + unemploy_m39 +
## gdp + inequality + prob_prison
##
## Df Sum of Sq RSS AIC
## - is_south 1 10049 1163007 405.16
## - gdp 1 10978 1163937 405.19
## - nonwhites_per1000 1 17994 1170952 405.41
## <none> 1152958 406.84
## - percent_m 1 111548 1264506 408.25
## + police_exp59 1 4347 1148611 408.70
## + time_prison 1 3983 1148975 408.71
## + state_pop 1 1523 1151435 408.79
## + labour_participation 1 1016 1151943 408.80
## - prob_prison 1 142080 1295039 409.14
## - unemploy_m24 1 144421 1297379 409.20
## - m_per1000f 1 148215 1301173 409.31
## - mean_education 1 202307 1355265 410.82
## - unemploy_m39 1 251866 1404825 412.15
## - inequality 1 271850 1424809 412.67
## - police_exp60 1 398984 1551942 415.83
##
## Step: AIC=405.16
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f +
## nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality +
## prob_prison
##
## Df Sum of Sq RSS AIC
## - gdp 1 11334 1174342 403.52
## - nonwhites_per1000 1 13901 1176909 403.60
## <none> 1163007 405.16
## - percent_m 1 103127 1266134 406.30
## + is_south 1 10049 1152958 406.84
## + police_exp59 1 4698 1158310 407.01
## + time_prison 1 4136 1158871 407.03
## + state_pop 1 2286 1160721 407.08
## + labour_participation 1 100 1162907 407.15
## - unemploy_m24 1 135954 1298961 407.25
## - m_per1000f 1 148204 1311211 407.60
## - prob_prison 1 213993 1377000 409.41
## - mean_education 1 231928 1394935 409.89
## - unemploy_m39 1 242747 1405754 410.17
## - inequality 1 262973 1425981 410.70
## - police_exp60 1 389276 1552284 413.84
##
## Step: AIC=403.52
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f +
## nonwhites_per1000 + unemploy_m24 + unemploy_m39 + inequality +
## prob_prison
##
## Df Sum of Sq RSS AIC
## - nonwhites_per1000 1 9466 1183808 401.81
## <none> 1174342 403.52
## - percent_m 1 95312 1269653 404.40
## + gdp 1 11334 1163007 405.16
## + is_south 1 10405 1163937 405.19
## + time_prison 1 6091 1168250 405.32
## + police_exp59 1 3274 1171067 405.41
## + state_pop 1 540 1173802 405.50
## + labour_participation 1 455 1173887 405.50
## - unemploy_m24 1 143492 1317833 405.78
## - m_per1000f 1 177963 1352305 406.74
## - mean_education 1 227959 1402300 408.08
## - prob_prison 1 242508 1416850 408.46
## - unemploy_m39 1 249410 1423751 408.64
## - inequality 1 286285 1460626 409.59
## - police_exp60 1 616530 1790872 417.13
##
## Step: AIC=401.81
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f +
## unemploy_m24 + unemploy_m39 + inequality + prob_prison
##
## Df Sum of Sq RSS AIC
## <none> 1183808 401.81
## + time_prison 1 10289 1173518 403.49
## + nonwhites_per1000 1 9466 1174342 403.52
## + gdp 1 6899 1176909 403.60
## + is_south 1 6684 1177123 403.60
## + labour_participation 1 1948 1181860 403.75
## + police_exp59 1 1540 1182267 403.76
## + state_pop 1 647 1183160 403.79
## - unemploy_m24 1 134893 1318700 403.81
## - percent_m 1 149388 1333195 404.21
## - m_per1000f 1 169039 1352847 404.75
## - mean_education 1 220608 1404416 406.14
## - prob_prison 1 237275 1421082 406.57
## - unemploy_m39 1 239975 1423783 406.64
## - inequality 1 461989 1645796 412.00
## - police_exp60 1 1025300 2209107 422.90
summary(model_step)
##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
## data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -394.74 -125.65 27.95 128.24 465.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7289.928 1383.644 -5.269 1.33e-05 ***
## percent_m 7.636 4.062 1.880 0.0706 .
## mean_education 16.603 7.269 2.284 0.0301 *
## police_exp60 9.219 1.872 4.925 3.41e-05 ***
## m_per1000f 3.712 1.857 2.000 0.0553 .
## unemploy_m24 -7.495 4.196 -1.786 0.0849 .
## unemploy_m39 22.563 9.471 2.382 0.0242 *
## inequality 5.625 1.702 3.306 0.0026 **
## prob_prison -3837.725 1619.981 -2.369 0.0250 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 205.6 on 28 degrees of freedom
## Multiple R-squared: 0.8052, Adjusted R-squared: 0.7495
## F-statistic: 14.46 on 8 and 28 DF, p-value: 4.236e-08
Therefore, we will further analyze model_step:
crime_rate = -7289.92 + 7.63 percent_m + 16.60 mean_education + 9.21 police_exp60 + 3.71 m_per1000f - 7.49 unemploy_m24 + 22.56 unemploy_m39 + 5.62 inequality - 3837.72 prob_prison
The predictions and evaluations carried out in this section are aimed to check whether the performance of the model in our train data or test data shows one of the following conditions:
Let’s predict the model’s performance in the data train:
pred_train <- predict(model_step, data_train)
data.frame(MAE = MAE(pred_train, data_train$crime_rate),
MAPE = MAPE(pred_train, data_train$crime_rate),
RMSE = RMSE(pred_train, data_train$crime_rate)
)
## MAE MAPE RMSE
## 1 146.433 0.1835156 178.8709
Then we predict the performance of the model in the test data:
pred_test <- predict(model_step, data_test)
data.frame(MAE = MAE(pred_test, data_test$crime_rate),
MAPE = MAPE(pred_test, data_test$crime_rate),
RMSE = RMSE(pred_test, data_test$crime_rate)
)
## MAE MAPE RMSE
## 1 164.2806 0.2299896 189.1708
From the two performance models above, it can be seen that the performance of the train data and test data is similar or not much different. So we can conclude that the model is neither overfit nor underfitted.
The model assumes that this error has a normal distribution centered on 0. To check the distribution of the residual value whether it is normally distributed (forming a bell curve) or not, we can use a histogram.
hist(model_step$residuals)
Another alternative is to perform statistical tests using the Shapiro-Wilk test,
shapiro.test(model_step$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_step$residuals
## W = 0.98774, p-value = 0.9498
Based on the histogram, it can be seen at a glance that the model_step residuals are normally distributed. Furthermore, the Shapiro-Wilk test shows a p-value > 0.05 so that H0 is accepted, which means that the residuals or errors are normally distributed.
bptest(model_step)
##
## studentized Breusch-Pagan test
##
## data: model_step
## BP = 11.85, df = 8, p-value = 0.158
The Breusch-Pagan statistical test shows p-value > 0.05 so that H0 is accepted and it can be interpreted that the residual does not form a certain pattern so that we can conclude that the model already has a constant error.
Multicollinearity occurs when the predictor variables used in the model have a strong relationship. The presence or absence of multicollinearity can be seen from the value of VIF (Variance Inflation Factor). VIF is a measure that explains how much the coefficient variance increases due to multicollinearity. When the VIF value is more than 10, it means that there is multicollinearity.
vif(model_step)
## percent_m mean_education police_exp60 m_per1000f unemploy_m24
## 2.298626 5.609506 2.990798 2.276976 4.760707
## unemploy_m39 inequality prob_prison
## 5.522660 4.416847 1.392422
From the multicollinearity test above, there is no VIF value equal to or more than 10 so that means there is no multicollinearity between variables (the predictor variables are mutually independent).
model_step has good criteria as a linear regression model. Then, in addition to the model_step interpretation that has been described in the tests carried out, here are some other interpretations:
When all predictor variables are 0, then crime_rate is -7289.92
Through the interpretations described above, to reduce the crime rate or crime_rate, the US government can do:
The decrease in police budget spending and the possibility of using other predictor variables outside of the existing data, which are related to the police and are deemed more appropriate to use.
Based on the model generated from the step model, it can be seen that an increase of 1 dollar in police_exp60 will be followed by an increase in the value of crime_rate. Because it can be said that the police spending budget should not increase or maybe it can be lowered to reduce the crime rate.
Furthermore, based on a study conducted by researchers especially Wilson (1968, https://www.ncjrs.gov/pdffiles1/Digitization/58531NCJRS.pdf page 10 paragraph 3), to be able to reduce crime_rate more precisely, the variables that need to be considered it is not the number of police recruits, the amount of expenditure made, or the number of police patrolling but the behavior of the police patrolling. The first behavior that can be shown by the police can be “aggressive” or active observation and checking. The second behavior that can be shown is “passive” or non-active, such as: 1) The police do not stop and give sanctions to motorists who violate traffic regulations, 2) The police do not check suspicious people or vehicles, 3) The police rarely check or pretend - pretending to be a “decoy” in an area with a high crime rate.
One of the studies that supports Wilson’s opinion is the research conducted by Boydstun (1975, https://www.ncjrs.gov/pdffiles1/Digitization/58531NCJRS.pdf page 10 paragraph 4 - page 11 paragraph 1). Boydstun’s research shows that areas in the city of San Diego where police conduct field interrogations or street stops experienced a significant reduction in certain types of crime. The types of crimes in question are robbery, theft, rape or other immoral acts, and dangerous fraud. Meanwhile, areas in the city of San Diego where the police did not implement field interrogations or street stops, did not experience a change in the crime rate.