Crime is growing rapidly, both in volume and in variety, and it evolves along with the times, especially in developing countries. Urban areas are centers of crime because of several factors: social inequality, economic conditions, the environment, and so on. Crime develops in line with population growth, development, modernization, and urbanization, so it is often said that the growth of a city is always accompanied by growth in the quality and quantity of crime. As a result, this situation has caused unrest among city residents and governments. The world today seems increasingly complex given the variety of human actions and behavior; these patterns of thought and action are not only "positive" but also include "negative" actions that harm others and the actors themselves.
We will learn to use a linear regression model on a crime dataset. We want to know the relationships between variables, especially between the probability of imprisonment and the other variables. With this study, we want to know which variables can influence someone to commit a crime in the city. By knowing which factors influence the occurrence of crime, we can prevent it in the future.
Load the required libraries:
library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(tidymodels)
library(car)
library(scales)
library(lmtest)
library(MLmetrics)
Load the data to build the regression model:
crime <- read.csv(file = "data_input/crime.csv", stringsAsFactors = FALSE) %>%
  dplyr::select(-X)
After we have successfully imported our data, we will inspect it to find out what it contains. We could use the View() function to view the whole dataset, but that takes time, so we use head() to see only the first few rows.
head(crime)
The data structure is still difficult to read, so we will rename the columns based on the references in the data glossary.
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60",
"police_exp59", "labour_participation", "m_per1000f",
"state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39",
"gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
head(crime)
Crime data glossary:
percent_m: percentage of males aged 14-24
is_south: whether it is in a Southern state; 1 for Yes, 0 for No
mean_education: mean years of schooling
police_exp60: police expenditure in 1960
police_exp59: police expenditure in 1959
labour_participation: labour force participation rate
m_per1000f: number of males per 1000 females
state_pop: state population
nonwhites_per1000: number of non-white residents per 1000 people
unemploy_m24: unemployment rate of urban males aged 14-24
unemploy_m39: unemployment rate of urban males aged 35-39
gdp: gross domestic product per head
inequality: income inequality
prob_prison: probability of imprisonment
time_prison: average time served in prison
crime_rate: crime rate in an unspecified category
Check the structure of the data. Are there any columns whose data types do not match?
str(crime)
#> 'data.frame': 47 obs. of 16 variables:
#> $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
#> $ is_south : int 1 0 1 0 0 0 1 1 1 0 ...
#> $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
#> $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
#> $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
#> $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
#> $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
#> $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
#> $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
#> $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
#> $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
#> $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
#> $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
#> $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
#> $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
#> $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
The is_south column is actually categorical (1 = Southern state, 0 = otherwise), so its integer type does not match; we convert it to a factor:
crime <- crime %>%
mutate(is_south = as.factor(is_south))
str(crime)
#> 'data.frame': 47 obs. of 16 variables:
#> $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
#> $ is_south : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 2 2 1 ...
#> $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
#> $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
#> $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
#> $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
#> $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
#> $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
#> $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
#> $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
#> $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
#> $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
#> $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
#> $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
#> $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
#> $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
Check for missing (NA) values in our dataset:
colSums(is.na(crime))
#> percent_m is_south mean_education
#> 0 0 0
#> police_exp60 police_exp59 labour_participation
#> 0 0 0
#> m_per1000f state_pop nonwhites_per1000
#> 0 0 0
#> unemploy_m24 unemploy_m39 gdp
#> 0 0 0
#> inequality prob_prison time_prison
#> 0 0 0
#> crime_rate
#> 0
From this exploration we find that there is no NA or blank data.
Before modeling, we perform exploratory data analysis: we explore the variables and look for patterns that may indicate correlations between them. Let us find the correlations between variables.
ggcorr(crime, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
In the correlation graph, we can see that the variables have both positive and negative correlations with prob_prison, and gdp has the strongest negative correlation of all the factors.
We will create a model with one variable (simple linear regression) to see the significance of the gdp variable with respect to prob_prison.
model_gdp <- lm(formula = prob_prison ~ gdp, data = crime)
summary(model_gdp)
#>
#> Call:
#> lm(formula = prob_prison ~ gdp, data = crime)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.035792 -0.013269 -0.003581 0.008438 0.086533
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.11584205 0.01559902 7.426 0.00000000239 ***
#> gdp -0.00013086 0.00002921 -4.480 0.00005086116 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.01912 on 45 degrees of freedom
#> Multiple R-squared: 0.3084, Adjusted R-squared: 0.293
#> F-statistic: 20.07 on 1 and 45 DF, p-value: 0.00005086
Model Interpretation
The intercept is the value of the target variable when the predictor is 0.
Slope: for every 1-unit increase in the predictor, the target variable changes by the slope; here the slope is negative, so the target decreases.
A negative coefficient indicates a negative correlation between the predictor and the target variable.
Based on this simple linear regression model, we can conclude that the model is significant and shows a negative correlation: each 1-unit increase in gdp reduces the predicted probability of imprisonment by the slope (about 0.00013).
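To make this interpretation concrete, here is a small sketch that extracts the fitted coefficients and predicts prob_prison at a hypothetical gdp value (the 500 below is an arbitrary choice for illustration):
# intercept and slope of the fitted simple model
coef(model_gdp)
# predicted probability of imprisonment at a hypothetical gdp of 500
predict(model_gdp, newdata = data.frame(gdp = 500))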
Let's plot the fitted regression line together with the scatter diagram.
plot(crime$gdp, crime$prob_prison)
abline(model_gdp)
Previously we built a simple linear regression model with a significant relationship between predictor and target. As an alternative, we will now use multiple predictors and compare the one-variable model with a model that uses many variables.
model_all <- lm(formula = prob_prison ~ ., data = crime)
summary(model_all)
#>
#> Call:
#> lm(formula = prob_prison ~ ., data = crime)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.023273 -0.007986 -0.000401 0.005584 0.047269
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.062325373 0.143536075 0.434 0.667140
#> percent_m 0.000103585 0.000328562 0.315 0.754672
#> is_south1 0.008113041 0.010879854 0.746 0.461470
#> mean_education 0.000580481 0.000511212 1.135 0.264869
#> police_exp60 0.001434510 0.000782307 1.834 0.076317 .
#> police_exp59 -0.001502976 0.000836453 -1.797 0.082110 .
#> labour_participation -0.000069833 0.000108085 -0.646 0.522969
#> m_per1000f 0.000006356 0.000151951 0.042 0.966905
#> state_pop -0.000018327 0.000095596 -0.192 0.849221
#> nonwhites_per1000 0.000088464 0.000045450 1.946 0.060719 .
#> unemploy_m24 -0.000212096 0.000317864 -0.667 0.509547
#> unemploy_m39 0.000410978 0.000642768 0.639 0.527268
#> gdp -0.000014757 0.000077506 -0.190 0.850241
#> inequality 0.000058520 0.000191734 0.305 0.762243
#> time_prison -0.001644298 0.000440990 -3.729 0.000772 ***
#> crime_rate -0.000026438 0.000012374 -2.137 0.040627 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.01543 on 31 degrees of freedom
#> Multiple R-squared: 0.6897, Adjusted R-squared: 0.5396
#> F-statistic: 4.595 on 15 and 31 DF, p-value: 0.000165
Model Interpretation
Based on the multiple linear regression model, only a few variables are significant: time_prison and crime_rate, both with negative coefficients.
Stepwise regression is a method for building a regression model that chooses the predictor variables producing the smallest error. At each step, each variable is considered for addition to or removal from the model until the best model is obtained. To evaluate the model at each step, the AIC (Akaike Information Criterion, a measure of information loss) is used: a lower AIC indicates less information lost by the model.
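Before running the stepwise selection, we can inspect the AIC of the models we already have; this quick sketch just compares information loss (lower is better):
# compare AIC across the existing models; lower means less information loss
AIC(model_gdp)
AIC(model_all)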
model_step <- stats::step(model_all, trace = 0)
summary(model_step)
#>
#> Call:
#> lm(formula = prob_prison ~ is_south + police_exp60 + police_exp59 +
#> nonwhites_per1000 + time_prison + crime_rate, data = crime)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.022473 -0.009673 0.000416 0.006346 0.045034
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.09986839 0.01035794 9.642 0.00000000000547 ***
#> is_south1 0.01171026 0.00719409 1.628 0.11143
#> police_exp60 0.00113740 0.00066090 1.721 0.09298 .
#> police_exp59 -0.00121732 0.00068249 -1.784 0.08207 .
#> nonwhites_per1000 0.00007985 0.00003277 2.437 0.01936 *
#> time_prison -0.00167478 0.00031381 -5.337 0.00000402593162 ***
#> crime_rate -0.00002132 0.00000774 -2.755 0.00879 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.01395 on 40 degrees of freedom
#> Multiple R-squared: 0.6729, Adjusted R-squared: 0.6238
#> F-statistic: 13.71 on 6 and 40 DF, p-value: 0.00000002155
By using the step() function we obtain several predictor variables that are significant with respect to our target, namely nonwhites_per1000, time_prison, and crime_rate, and we get an adjusted R-squared that is better than the previous model, namely 0.6238.
One of the goals of predictive modeling is to make predictions on data we do not yet have. After making predictions, we must find out whether the machine learning model is good enough by checking whether the predictions produce small errors. There are several ways to evaluate a regression model:
# r-squared
summary(model_gdp)$r.squared
#> [1] 0.3083966
summary(model_all)$adj.r.squared
#> [1] 0.5396279
summary(model_step)$adj.r.squared
#> [1] 0.6238253
Several error metrics are available:
pred_gdp <- model_gdp$fitted.values
pred_all <- model_all$fitted.values
pred_step <- model_step$fitted.values
crime$pred_gdp <- pred_gdp
crime$pred_all <- pred_all
crime$pred_step <- pred_step
RMSE(crime$pred_gdp, crime$prob_prison)
#> [1] 0.01870644
RMSE(crime$pred_all, crime$prob_prison)
#> [1] 0.01252908
RMSE(crime$pred_step, crime$prob_prison)
#> [1] 0.01286496
MAE(crime$pred_gdp, crime$prob_prison)
#> [1] 0.01313478
MAE(crime$pred_all, crime$prob_prison)
#> [1] 0.009242512
MAE(crime$pred_step, crime$prob_prison)
#> [1] 0.00973549
summary(crime$prob_prison)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00690 0.03270 0.04210 0.04709 0.05445 0.11980
If the MAE or RMSE is small relative to the range of the data, the model has a fairly small error; if it is large relative to the data range, the model has a large error. Based on the MAE values of the three models, we can say that model_all and model_step have relatively small errors.
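One rough way to make "small relative to the data range" concrete is to divide the error by the range of the target. This is an informal heuristic, not a standard metric:
# error as a fraction of the target's range (informal heuristic)
range_width <- diff(range(crime$prob_prison))
RMSE(crime$pred_step, crime$prob_prison) / range_width
MAE(crime$pred_step, crime$prob_prison) / range_width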
MAPE(crime$pred_gdp, crime$prob_prison) * 100
#> [1] 40.38656
MAPE(crime$pred_all, crime$prob_prison) * 100
#> [1] 22.67447
MAPE(crime$pred_step, crime$prob_prison) * 100
#> [1] 22.92335
MAPE is expressed as a percentage; the smaller the MAPE, the better the model. From the MAPE calculations, the model with a single predictor has a very large error percentage (about 40%).
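As a sanity check, MAPE can also be computed by hand; this sketch reproduces the formula that MLmetrics implements, the mean absolute error as a fraction of the true values:
# manual MAPE for model_step: mean(|y - y_hat| / y) * 100
mean(abs((crime$prob_prison - crime$pred_step) / crime$prob_prison)) * 100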
Linear regression has its limitations, and in addition the model relies on several assumptions that need to be met so that the interpretation obtained is not biased. These assumptions only need to be fulfilled if the purpose of building the model is interpretation, that is, to see the effect of each predictor on the target variable. If you only want to use linear regression to make predictions, the assumptions do not have to be met.
The hope is that when we build a linear regression model, the resulting errors are normally distributed, meaning most errors gather around 0. To test this assumption visually, we can use the hist() function:
hist(model_gdp$residuals, breaks = 20)
hist(model_all$residuals, breaks = 20)
hist(model_step$residuals, breaks = 20)
As we can see from the three models, only model_gdp has errors that are not gathered around 0.
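Besides histograms, a normal Q-Q plot is another common visual check for residual normality (an extra sketch, not part of the original workflow); points lying close to the line suggest normally distributed errors:
# Q-Q plot of model_step residuals; points near the line suggest normality
qqnorm(model_step$residuals)
qqline(model_step$residuals, col = "red")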
Statistical test using shapiro.test() (we expect a p-value > alpha, so that the decision is to fail to reject H0).
H0: the errors are normally distributed
H1: the errors are not normally distributed
The expected p-value is > 0.05.
Shapiro-Wilk hypothesis test:
shapiro.test(model_gdp$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_gdp$residuals
#> W = 0.84045, p-value = 0.00001464
shapiro.test(model_all$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_all$residuals
#> W = 0.9399, p-value = 0.01767
shapiro.test(model_step$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_step$residuals
#> W = 0.95471, p-value = 0.06635
Conclusion: only the residuals of model_step are normally distributed (unlike model_gdp and model_all). This means the predictions of model_step are reasonably accurate, since its errors accumulate around 0. When the normality assumption is not met, the significance tests and the standard errors of the intercept and slopes are biased and do not reflect the true values, as in model_gdp and model_all. If the residuals are not normally distributed, you can transform the target variable or add more sample data.
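As a sketch of the transformation remedy (illustrative only; a log transform is one common choice, not necessarily the best one here):
# refit the simple model with a log-transformed target, then re-test normality
model_gdp_log <- lm(log(prob_prison) ~ gdp, data = crime)
shapiro.test(model_gdp_log$residuals)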
Homoscedasticity means that the variance of the residuals (errors) is constant and does not form a particular pattern. If the errors form a pattern such as a line or a cone, we call it heteroscedasticity, and it biases the standard errors of the predictor coefficients (making them too narrow or too wide).
plot(crime$prob_prison, model_gdp$residuals)
abline(h = 0, col = "red")
plot(crime$prob_prison, model_all$residuals)
abline(h = 0, col = "red")
plot(crime$prob_prison, model_step$residuals)
abline(h = 0, col = "red")
Homoscedasticity can be checked visually by looking for a pattern between the predicted values and the residuals. In the plots above, no particular pattern is visible, so we can conclude that the models have constant error variance.
Breusch-Pagan hypothesis test (from the lmtest package; we hope p-value > alpha so that we fail to reject H0):
H0: the error variance is constant (homoscedasticity)
H1: the error variance is not constant and forms a pattern (heteroscedasticity)
bptest(model_gdp)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_gdp
#> BP = 0.8905, df = 1, p-value = 0.3453
bptest(model_all)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_all
#> BP = 20.843, df = 15, p-value = 0.1419
bptest(model_step)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_step
#> BP = 15.069, df = 6, p-value = 0.01973
For model_gdp and model_all, the p-value > 0.05, so we fail to reject H0: the residual variance is constant (homoscedasticity), meaning the patterns in the data have been captured by the model. For model_step, however, the p-value < 0.05, so we reject H0: its residuals show heteroscedasticity.
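If heteroscedasticity is a concern but we still want to interpret model_step, one common remedy (not used in this write-up) is to recompute the coefficient tests with heteroscedasticity-consistent standard errors via the sandwich package:
library(sandwich)
# coefficient tests with heteroscedasticity-consistent (HC) standard errors
coeftest(model_step, vcov = vcovHC(model_step, type = "HC1"))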
We hope there is no multicollinearity in the linear regression model. Multicollinearity occurs when the predictor variables used in the model are strongly related to each other. It can be detected with the VIF (Variance Inflation Factor):
A VIF value above 10 indicates multicollinearity; we hope to get VIF < 10.
vif(model_all)
#> percent_m is_south mean_education
#> 3.295529 5.248747 6.321396
#> police_exp60 police_exp59 labour_participation
#> 104.473392 105.725946 3.687467
#> m_per1000f state_pop nonwhites_per1000
#> 3.875038 2.560112 4.221628
#> unemploy_m24 unemploy_m39 gdp
#> 6.347459 5.695580 10.810051
#> inequality time_prison crime_rate
#> 11.309475 1.887790 4.426502
vif(model_step)
#> is_south police_exp60 police_exp59 nonwhites_per1000
#> 2.808538 91.251423 86.140023 2.685643
#> time_prison crime_rate
#> 1.169928 2.119789
If a model shows signs of multicollinearity, i.e., there are predictors with VIF > 10, we can rebuild the model using only one of the collinear predictors, drop those predictors entirely, or do feature engineering by creating a new variable that contains the average of the two collinear variables, as sketched below.
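As a sketch of that feature-engineering idea, we might average the two highly collinear police-expenditure variables into a single predictor (police_exp_avg is a hypothetical name) and refit; the prediction columns added earlier are dropped first so they do not enter the formula:
# average the collinear predictors into one feature, drop the originals
# and the earlier prediction columns, then refit and re-check VIF
crime_fe <- crime %>%
  mutate(police_exp_avg = (police_exp60 + police_exp59) / 2) %>%
  dplyr::select(-police_exp60, -police_exp59,
                -pred_gdp, -pred_all, -pred_step)
model_fe <- lm(prob_prison ~ ., data = crime_fe)
vif(model_fe)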
The final model_step has an adjusted R-squared of 0.6238 and a MAPE of 22.92%. After the assumption checks, the model meets most of the criteria, but it fails the Breusch-Pagan test for homoscedasticity.
Based on this analysis, the probability of a person being imprisoned is negatively correlated with gdp: a higher GDP per head is associated with a lower chance of imprisonment. Likewise, a longer average time served in prison (time_prison) is associated with a lower probability of imprisonment, suggesting it may deter people from repeating their crimes.