Crime is growing rapidly, both in volume and in variety, and it evolves along with the times, especially in developing countries. Urban areas are centers of crime because of several factors: social inequality, economic conditions, the environment, and so on. Crime develops in line with population growth, development, modernization, and urbanization, so it is often said that the growth of a city is always accompanied by growth in the quality and quantity of crime. As a result, this situation has caused unrest among city residents and governments. The world today seems increasingly complex given the variety of human actions and behavior; these patterns of thought and action are not only "positive" but also include "negative" actions that harm others and the actors themselves.
We will learn to use a linear regression model on a crime dataset. We want to know the relationships between variables, especially between the probability of imprisonment and the other variables. With this study, we want to know which variables can influence someone to commit a crime in the city. By knowing which factors influence the occurrence of crime, we can prevent it in the future.
Load the required libraries:
library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(tidymodels)
library(car)
library(scales)
library(lmtest)
library(MLmetrics)
Load the data to build the regression model:
crime <- read.csv(file = "data_input/crime.csv", stringsAsFactors = FALSE) %>%
  dplyr::select(-X)
After we have successfully imported our data, we will inspect it to find out what it contains. We could use the View() function to view the whole dataset, but that takes time, so we use head() to see only the first few rows.
head(crime)
The data structure is still difficult to read, so we will rename the columns based on the references in the data glossary.
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60",
"police_exp59", "labour_participation", "m_per1000f",
"state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39",
"gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
head(crime)
Crime data glossary:
percent_m: percentage of males aged 14-24
is_south: whether it is in a Southern state; 1 for Yes, 0 for No
mean_education: mean years of schooling
police_exp60: police expenditure in 1960
police_exp59: police expenditure in 1959
labour_participation: labour force participation rate
m_per1000f: number of males per 1000 females
state_pop: state population
nonwhites_per1000: number of non-white residents per 1000 people
unemploy_m24: unemployment rate of urban males aged 14-24
unemploy_m39: unemployment rate of urban males aged 35-39
gdp: gross domestic product per head
inequality: income inequality
prob_prison: probability of imprisonment
time_prison: average time served in prison
crime_rate: crime rate in an unspecified category
Check the structure of the data. Are there any columns whose data types do not match?
str(crime)
#> 'data.frame': 47 obs. of 16 variables:
#> $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
#> $ is_south : int 1 0 1 0 0 0 1 1 1 0 ...
#> $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
#> $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
#> $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
#> $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
#> $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
#> $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
#> $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
#> $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
#> $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
#> $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
#> $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
#> $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
#> $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
#> $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
The is_south column is actually categorical (1 = Southern state, 0 = otherwise), so its integer type does not match; we convert it to a factor:
crime <- crime %>%
mutate(is_south = as.factor(is_south))
str(crime)
#> 'data.frame': 47 obs. of 16 variables:
#> $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
#> $ is_south : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 2 2 1 ...
#> $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
#> $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
#> $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
#> $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
#> $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
#> $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
#> $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
#> $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
#> $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
#> $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
#> $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
#> $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
#> $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
#> $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
Check for missing (NA) values in our dataset:
colSums(is.na(crime))
#> percent_m is_south mean_education
#> 0 0 0
#> police_exp60 police_exp59 labour_participation
#> 0 0 0
#> m_per1000f state_pop nonwhites_per1000
#> 0 0 0
#> unemploy_m24 unemploy_m39 gdp
#> 0 0 0
#> inequality prob_prison time_prison
#> 0 0 0
#> crime_rate
#> 0
From this exploration we find that there is no NA or blank data.
Before modeling, we perform exploratory data analysis: we explore the variables and look for patterns that may indicate correlations between them. Let us find the correlations between variables.
ggcorr(crime, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
In the correlation graph, we can see that the variables have both positive and negative correlations with prob_prison, and gdp has the strongest negative correlation of all the factors.
We will create a model with one variable (simple linear regression) to see the significance of the gdp variable with respect to prob_prison.
model_gdp <- lm(formula = prob_prison ~ gdp, data = crime)
summary(model_gdp)
#>
#> Call:
#> lm(formula = prob_prison ~ gdp, data = crime)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.035792 -0.013269 -0.003581 0.008438 0.086533
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.11584205 0.01559902 7.426 0.00000000239 ***
#> gdp -0.00013086 0.00002921 -4.480 0.00005086116 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.01912 on 45 degrees of freedom
#> Multiple R-squared: 0.3084, Adjusted R-squared: 0.293
#> F-statistic: 20.07 on 1 and 45 DF, p-value: 0.00005086
Model Interpretation
The intercept is the value of the target variable when the predictor is 0.
Slope: for every 1-unit increase in the predictor, the target variable changes by the slope; here the slope is negative, so the target decreases.
A negative coefficient indicates a negative correlation between the predictor and the target variable.
Based on this simple linear regression model, we can conclude that the model is significant and shows a negative correlation: each 1-unit increase in gdp reduces the predicted probability of imprisonment by the slope (about 0.00013).
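To make this interpretation concrete, here is a small sketch that extracts the fitted coefficients and predicts prob_prison at a hypothetical gdp value (the 500 below is an arbitrary choice for illustration):
# intercept and slope of the fitted simple model
coef(model_gdp)
# predicted probability of imprisonment at a hypothetical gdp of 500
predict(model_gdp, newdata = data.frame(gdp = 500))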
Let's plot the fitted regression line together with the scatter diagram.
plot(crime$gdp, crime$prob_prison)
abline(model_gdp)
Previously we built a simple linear regression model with a significant relationship between predictor and target. As an alternative, we will now use multiple predictors and compare the one-variable model with a model that uses many variables.
model_all <- lm(formula = prob_prison ~ ., data = crime)
summary(model_all)
#>
#> Call:
#> lm(formula = prob_prison ~ ., data = crime)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.023273 -0.007986 -0.000401 0.005584 0.047269
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.062325373 0.143536075 0.434 0.667140
#> percent_m 0.000103585 0.000328562 0.315 0.754672
#> is_south1 0.008113041 0.010879854 0.746 0.461470
#> mean_education 0.000580481 0.000511212 1.135 0.264869
#> police_exp60 0.001434510 0.000782307 1.834 0.076317 .
#> police_exp59 -0.001502976 0.000836453 -1.797 0.082110 .
#> labour_participation -0.000069833 0.000108085 -0.646 0.522969
#> m_per1000f 0.000006356 0.000151951 0.042 0.966905
#> state_pop -0.000018327 0.000095596 -0.192 0.849221
#> nonwhites_per1000 0.000088464 0.000045450 1.946 0.060719 .
#> unemploy_m24 -0.000212096 0.000317864 -0.667 0.509547
#> unemploy_m39 0.000410978 0.000642768 0.639 0.527268
#> gdp -0.000014757 0.000077506 -0.190 0.850241
#> inequality 0.000058520 0.000191734 0.305 0.762243
#> time_prison -0.001644298 0.000440990 -3.729 0.000772 ***
#> crime_rate -0.000026438 0.000012374 -2.137 0.040627 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.01543 on 31 degrees of freedom
#> Multiple R-squared: 0.6897, Adjusted R-squared: 0.5396
#> F-statistic: 4.595 on 15 and 31 DF, p-value: 0.000165
Model Interpretation
Based on the multiple linear regression model, only a few variables are significant: time_prison and crime_rate, both with negative coefficients.
Stepwise regression is a method for building a regression model that chooses the predictor variables producing the smallest error. At each step, each variable is considered for addition to or removal from the model until the best model is obtained. To evaluate the model at each step, the AIC (Akaike Information Criterion, a measure of information loss) is used: a lower AIC indicates less information lost by the model.
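Before running the stepwise selection, we can inspect the AIC of the models we already have; this quick sketch just compares information loss (lower is better):
# compare AIC across the existing models; lower means less information loss
AIC(model_gdp)
AIC(model_all)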
model_step <- stats::step(model_all, trace = 0)
summary(model_step)
#>
#> Call:
#> lm(formula = prob_prison ~ is_south + police_exp60 + police_exp59 +
#> nonwhites_per1000 + time_prison + crime_rate, data = crime)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.022473 -0.009673 0.000416 0.006346 0.045034
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.09986839 0.01035794 9.642 0.00000000000547 ***
#> is_south1 0.01171026 0.00719409 1.628 0.11143
#> police_exp60 0.00113740 0.00066090 1.721 0.09298 .
#> police_exp59 -0.00121732 0.00068249 -1.784 0.08207 .
#> nonwhites_per1000 0.00007985 0.00003277 2.437 0.01936 *
#> time_prison -0.00167478 0.00031381 -5.337 0.00000402593162 ***
#> crime_rate -0.00002132 0.00000774 -2.755 0.00879 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.01395 on 40 degrees of freedom
#> Multiple R-squared: 0.6729, Adjusted R-squared: 0.6238
#> F-statistic: 13.71 on 6 and 40 DF, p-value: 0.00000002155
By using the step() function we obtain several predictor variables that are significant with respect to our target, namely nonwhites_per1000, time_prison, and crime_rate, and we get an adjusted R-squared that is better than the previous model, namely 0.6238.
One of the goals of predictive modeling is to make predictions on data we do not yet have. After making predictions, we must find out whether the machine learning model is good enough by checking whether the predictions produce small errors. There are several ways to evaluate a regression model:
# r-squared
summary(model_gdp)$r.squared
#> [1] 0.3083966
summary(model_all)$adj.r.squared
#> [1] 0.5396279
summary(model_step)$adj.r.squared
#> [1] 0.6238253
Several error metrics are available:
pred_gdp <- model_gdp$fitted.values
pred_all <- model_all$fitted.values
pred_step <- model_step$fitted.values
crime$pred_gdp <- pred_gdp
crime$pred_all <- pred_all
crime$pred_step <- pred_step
RMSE(crime$pred_gdp, crime$prob_prison)
#> [1] 0.01870644
RMSE(crime$pred_all, crime$prob_prison)
#> [1] 0.01252908
RMSE(crime$pred_step, crime$prob_prison)
#> [1] 0.01286496
MAE(crime$pred_gdp, crime$prob_prison)
#> [1] 0.01313478
MAE(crime$pred_all, crime$prob_prison)
#> [1] 0.009242512
MAE(crime$pred_step, crime$prob_prison)
#> [1] 0.00973549
summary(crime$prob_prison)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00690 0.03270 0.04210 0.04709 0.05445 0.11980
If the MAE or RMSE is small relative to the range of the data, the model has a fairly small error; if it is large relative to the data range, the model has a large error. Based on the MAE values of the three models, we can say that model_all and model_step have relatively small errors.
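One rough way to make "small relative to the data range" concrete is to divide the error by the range of the target. This is an informal heuristic, not a standard metric:
# error as a fraction of the target's range (informal heuristic)
range_width <- diff(range(crime$prob_prison))
RMSE(crime$pred_step, crime$prob_prison) / range_width
MAE(crime$pred_step, crime$prob_prison) / range_width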
MAPE(crime$pred_gdp, crime$prob_prison) * 100
#> [1] 40.38656
MAPE(crime$pred_all, crime$prob_prison) * 100
#> [1] 22.67447
MAPE(crime$pred_step, crime$prob_prison) * 100
#> [1] 22.92335
MAPE is expressed as a percentage; the smaller the MAPE, the better the model. From the MAPE calculations, the model with a single predictor has a very large error percentage (about 40%).
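As a sanity check, MAPE can also be computed by hand; this sketch reproduces the formula that MLmetrics implements, the mean absolute error as a fraction of the true values:
# manual MAPE for model_step: mean(|y - y_hat| / y) * 100
mean(abs((crime$prob_prison - crime$pred_step) / crime$prob_prison)) * 100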
Linear regression has its limitations, and in addition the model relies on several assumptions that need to be met so that the interpretation obtained is not biased. These assumptions only need to be fulfilled if the purpose of building the model is interpretation, that is, to see the effect of each predictor on the target variable. If you only want to use linear regression to make predictions, the assumptions do not have to be met.
The hope is that when we build a linear regression model, the resulting errors are normally distributed, meaning most errors gather around 0. To test this assumption visually, we can use the hist() function:
hist(model_gdp$residuals, breaks = 20)
hist(model_all$residuals, breaks = 20)
hist(model_step$residuals, breaks = 20)
As we can see from the three models, only model_gdp has errors that are not gathered around 0.
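Besides histograms, a normal Q-Q plot is another common visual check for residual normality (an extra sketch, not part of the original workflow); points lying close to the line suggest normally distributed errors:
# Q-Q plot of model_step residuals; points near the line suggest normality
qqnorm(model_step$residuals)
qqline(model_step$residuals, col = "red")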
Statistical test using shapiro.test() (we expect a p-value > alpha, so that the decision is to fail to reject H0).
H0: the errors are normally distributed
H1: the errors are not normally distributed
The expected p-value is > 0.05.
Shapiro-Wilk hypothesis test:
shapiro.test(model_gdp$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_gdp$residuals
#> W = 0.84045, p-value = 0.00001464
shapiro.test(model_all$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_all$residuals
#> W = 0.9399, p-value = 0.01767
shapiro.test(model_step$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_step$residuals
#> W = 0.95471, p-value = 0.06635
Conclusion: only the residuals of model_step are normally distributed (unlike model_gdp and model_all). This means the predictions of model_step are reasonably accurate, since its errors accumulate around 0. When the normality assumption is not met, the significance tests and the standard errors of the intercept and slopes are biased and do not reflect the true values, as in model_gdp and model_all. If the residuals are not normally distributed, you can transform the target variable or add more sample data.
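As a sketch of the transformation remedy (illustrative only; a log transform is one common choice, not necessarily the best one here):
# refit the simple model with a log-transformed target, then re-test normality
model_gdp_log <- lm(log(prob_prison) ~ gdp, data = crime)
shapiro.test(model_gdp_log$residuals)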
Homoscedasticity means that the variance of the residuals (errors) is constant and does not form a particular pattern. If the errors form a pattern such as a line or a cone, we call it heteroscedasticity, and it biases the standard errors of the predictor coefficients (making them too narrow or too wide).
plot(crime$prob_prison, model_gdp$residuals)
abline(h = 0, col = "red")
plot(crime$prob_prison, model_all$residuals)
abline(h = 0, col = "red")
plot(crime$prob_prison, model_step$residuals)
abline(h = 0, col = "red")
Homoscedasticity can be checked visually by looking for a pattern between the predicted values and the residuals. In the plots above, no particular pattern is visible, so we can conclude that the models have constant error variance.
Breusch-Pagan hypothesis test (from the lmtest package; we hope p-value > alpha so that we fail to reject H0):
H0: the error variance is constant (homoscedasticity)
H1: the error variance is not constant and forms a pattern (heteroscedasticity)
bptest(model_gdp)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_gdp
#> BP = 0.8905, df = 1, p-value = 0.3453
bptest(model_all)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_all
#> BP = 20.843, df = 15, p-value = 0.1419
bptest(model_step)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_step
#> BP = 15.069, df = 6, p-value = 0.01973
For model_gdp and model_all, the p-value > 0.05, so we fail to reject H0: the residual variance is constant (homoscedasticity), meaning the patterns in the data have been captured by the model. For model_step, however, the p-value < 0.05, so we reject H0: its residuals show heteroscedasticity.
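If heteroscedasticity is a concern but we still want to interpret model_step, one common remedy (not used in this write-up) is to recompute the coefficient tests with heteroscedasticity-consistent standard errors via the sandwich package:
library(sandwich)
# coefficient tests with heteroscedasticity-consistent (HC) standard errors
coeftest(model_step, vcov = vcovHC(model_step, type = "HC1"))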
We hope there is no multicollinearity in the linear regression model. Multicollinearity occurs when the predictor variables used in the model are strongly related to each other. It can be detected with the VIF (Variance Inflation Factor):
A VIF value above 10 indicates multicollinearity; we hope to get VIF < 10.
vif(model_all)
#> percent_m is_south mean_education
#> 3.295529 5.248747 6.321396
#> police_exp60 police_exp59 labour_participation
#> 104.473392 105.725946 3.687467
#> m_per1000f state_pop nonwhites_per1000
#> 3.875038 2.560112 4.221628
#> unemploy_m24 unemploy_m39 gdp
#> 6.347459 5.695580 10.810051
#> inequality time_prison crime_rate
#> 11.309475 1.887790 4.426502
vif(model_step)
#> is_south police_exp60 police_exp59 nonwhites_per1000
#> 2.808538 91.251423 86.140023 2.685643
#> time_prison crime_rate
#> 1.169928 2.119789
If a model shows signs of multicollinearity, i.e., there are predictors with VIF > 10, we can rebuild the model using only one of the collinear predictors, drop those predictors entirely, or do feature engineering by creating a new variable that contains the average of the two collinear variables, as sketched below.
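As a sketch of that feature-engineering idea, we might average the two highly collinear police-expenditure variables into a single predictor (police_exp_avg is a hypothetical name) and refit; the prediction columns added earlier are dropped first so they do not enter the formula:
# average the collinear predictors into one feature, drop the originals
# and the earlier prediction columns, then refit and re-check VIF
crime_fe <- crime %>%
  mutate(police_exp_avg = (police_exp60 + police_exp59) / 2) %>%
  dplyr::select(-police_exp60, -police_exp59,
                -pred_gdp, -pred_all, -pred_step)
model_fe <- lm(prob_prison ~ ., data = crime_fe)
vif(model_fe)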
The final model_step has an adjusted R-squared of 0.6238 and a MAPE of 22.92%. After the assumption checks, the model meets most of the criteria, but it fails the Breusch-Pagan test for homoscedasticity.
Based on this analysis, the probability of a person being imprisoned is negatively correlated with gdp: a higher GDP per head is associated with a lower chance of imprisonment. Likewise, a longer average time served in prison (time_prison) is associated with a lower probability of imprisonment, suggesting it may deter people from repeating their crimes.