Linear Regresion on Crime Rate Prediction

Dedy Gusnadi Sianipar

4/16/2021

Introduction

What I`ll Do

i will use linear regression model using crime dataset. i want to know the relationship among variables, especially between the crime rate with other variables. and also want to predit the crime rate of city based on the historical data. crime dataset are contain Crime report data in several states in the US in 1960

Business Goal`s

In particular, the Government wants to look at the crime rate in an area:

  1. Which variable is significant in predicting the crime rate in an area
  2. Some of these variables both describe crime rates

Based on the data, the government has collected a large data set of various human information in an area.We will monitor to model crime rates with the available independent variables. This will be used by the Government to understand how crime rates are with the independent variables.They can accordingly manipulate the police expenditure, prob_prison ,etc. to meet certain crime rate. The government can use this information to anticipate crime so that can save the people / can reduce the crime rate in an area. Further, the model will be a good way for goverment to understand the crime-rate dynamics on a next year

Data Preparation

First, Load the required library package

# data wrangling
library(tidyverse)
# check asumsi model
library(lmtest)
library (car)

# calculate error
library(MLmetrics)
# correlation
library(GGally)

options(scipen = 999)

Load dataset

crime <- read.csv("data_input/crime.csv")

The data has 47 rows and 16 colomns. Our target variable is there crime_rate,

Data Description:

  • percent_m: percentage of males aged 14-24

  • is_south: whether it is in a Southern state. 1 for Yes, 0 for No.

  • mean_education: mean years of schooling

  • police_exp60: police expenditure in 1960

  • police_exp59: police expenditure in 1959

  • labour_participation: labour force participation rate

  • m_per1000f: number of males per 1000 females

  • state_pop: state population

  • nonwhites_per1000: number of non-whites resident per 1000 people

  • unemploy_m24: unemployment rate of urban males aged 14-24

  • unemploy_m39: unemployment rate of urban males aged 35-39

  • gdp: gross domestic product per head

  • inequality: income inequality

  • prob_prison: probability of imprisonment

  • time_prison: avg time served in prisons

  • crime_rate: crime rate in an unspecified category Produce a linear model

Check Dataset

Before we go further, first we need to make sure that our data is clean and will be useful.

Data Type

df_data_crime <- crime %>% 
  mutate(is_south = as.logical(is_south))
glimpse(df_data_crime)
## Rows: 47
## Columns: 16
## $ percent_m            <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,~
## $ is_south             <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRU~
## $ mean_education       <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10~
## $ police_exp60         <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,~
## $ police_exp59         <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, ~
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,~
## $ m_per1000f           <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102~
## $ state_pop            <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2~
## $ nonwhites_per1000    <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106~
## $ unemploy_m24         <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83~
## $ unemploy_m39         <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2~
## $ gdp                  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,~
## $ inequality           <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,~
## $ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,~
## $ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9~
## $ crime_rate           <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, ~

Check Unique Data

Because variable is_south is logical, we must check number of data

table(df_data_crime$is_south)
## 
## FALSE  TRUE 
##    31    16

Based on the table, the total unique data is 47 rows, so there are no empty rows

Check Missing Value

Check it again for all variable

colSums(is.na(df_data_crime))
##            percent_m             is_south       mean_education 
##                    0                    0                    0 
##         police_exp60         police_exp59 labour_participation 
##                    0                    0                    0 
##           m_per1000f            state_pop    nonwhites_per1000 
##                    0                    0                    0 
##         unemploy_m24         unemploy_m39                  gdp 
##                    0                    0                    0 
##           inequality          prob_prison          time_prison 
##                    0                    0                    0 
##           crime_rate 
##                    0

Exploratory Data Analysis

Exploratory data analysis is a phase where we explore the data variables, see if there are any pattern that can indicate any kind of correlation between variables.

Find the Pearson correlation between features.

ggcorr(df_data_crime %>% select(-is_south), label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 4,)

The graphic shows that a lot of variable has strong correlation with the crime_rate variables.

Modeling

#model full / used all variable to be predictors
model_full_ptor <- lm(formula = crime_rate~., data = df_data_crime)
model_stepcrime<- step(model_full_ptor,direction = "both",trace = 0)
summary(model_stepcrime)
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = df_data_crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.70 -111.07    3.03  122.15  483.30 
## 
## Coefficients:
##                 Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)    -6426.101   1194.611  -5.379 0.0000040395 ***
## percent_m          9.332      3.350   2.786      0.00828 ** 
## mean_education    18.012      5.275   3.414      0.00153 ** 
## police_exp60      10.265      1.552   6.613 0.0000000826 ***
## m_per1000f         2.234      1.360   1.642      0.10874    
## unemploy_m24      -6.087      3.339  -1.823      0.07622 .  
## unemploy_m39      18.735      7.248   2.585      0.01371 *  
## inequality         6.133      1.396   4.394 0.0000863344 ***
## prob_prison    -3796.032   1490.646  -2.547      0.01505 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
## F-statistic: 17.74 on 8 and 38 DF,  p-value: 0.0000000001159

interpretation:

  • Adjusted RSquare: 74.44% = The model can explain 74.44% of the variation of the crime_rate, the rest of the other factors that are not yet modeled

  • Predictor p-value:

  1. m_per1000f is not significant (no *), it means that the number of men / 10 000 women has little effect on the crime rate (even has no effect)

  2. unemployment_m24 is not significant (no *), meaning that the unemployment rate for males aged 14-24 years has little effect on the crime rate (even does not affect it)

  • If percent_m increases by 1, it will increase(because the value is positif) the crime rate to 9,332
  • If mean_education increases by 1, it will increase (because the value is positif) the crime rate to 18,012
  • If inequality increases by 1, it will increase (because the value is positif) the crime rate to 6,133
  • If police_exp60 increases by 1, it will increase (because the value is positif) the crime rate to 10,265
  • If unemployment_m39 increases by 1, it will increase (because the value is positif) the crime rate to 18,735
  • If prob_prison increases by 1, it will reduce(because the value is negatif) crime rate to 3796

Evaluation

Evaluasi Model

crime_test <- read.csv("data_input/crime_test.csv")
crime_test$is_south <- as.logical(crime_test$is_south)
pred_crime_test <-  Predict(model_stepcrime,newdata = crime_test)

Check error From the model

# mean absolut error
MAE(pred_crime_test,crime_test$crime_rate)
## [1] 180.7295
# mean absolute percentage error
MAPE(pred_crime_test,crime_test$crime_rate)
## [1] 0.2339692
# Root Mean Squared Error
RMSE(pred_crime_test,crime_test$crime_rate)
## [1] 215.5166

MAE 180.7295, the mean crime rate prediction missed by 180

MAPE 0.233 = average crime rate prediction missed by 23%

RMSE 215.5166, the average predicted crime rate missed was 215.51

Asumption

  1. linearity

ensure, whether the predictor (model step crime – without is_south) has a strong relationship

ggcorr(df_data_crime %>% 
         select(
           percent_m,
           mean_education,
           police_exp60,
           m_per1000f,
           unemploy_m24,
           unemploy_m39,
           inequality,
           prob_prison,
           crime_rate
         ),label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 4)

data.frame(prediksi = model_full_ptor$fitted.values, # prediksi
     error = model_full_ptor$residuals) %>% 
  ggplot(aes(prediksi,error))+
  geom_hline(yintercept= 0)+ # garis lurus disumbu y = 0
  geom_point()+
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  # error

based on ggcor and fitted values, it can be seen that the relationship between the target and the predictors is quite strong, even though there are some predictors that are not very strongly correlated with the target.

  1. Normality Test The second assumption in linear regression is that the residuals follow normal distribution. We can easily check this by using the Saphiro-Wilk/plot-density/qqPlot normality test

2.1 first, check with density plot

plot(density((model_full_ptor$residuals)))

the density plot shows the residual is normal distribution

2.2 Second, we use the shapiro test

# condition accept if : p-value > 0.05
shapiro.test(model_full_ptor$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_full_ptor$residuals
## W = 0.9846, p-value = 0.7849

because the p-value> 0.05, the residual / error is normally distributed

2.3 Third, we check the data distribution using qqPlot

qqPlot(model_full_ptor$residuals)

## [1] 11 19

The qqPlot also shows that the data is normally distributed, as evidenced by all the points in the blue area (not outside the dotted line) and following the blue line pattern.

Based on the three tests, it can be concluded that the data error is normally distributed

  1. heterokesdasticity / unequal variance

Heterocedasticity means that the variances of the error terms are non-constant. One can identify non-constant variances in the errors from the presence of a funnel shape in the residual plot, same with the linearity one.

3.1 BP test

# condition accept if : p-value > 0.05
bptest(model_full_ptor)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_full_ptor
## BP = 18.469, df = 15, p-value = 0.2388

based on the bp test, the results obtained p-value> 0.05, which means that the conditions are met

3.2 check with plot

plot(model_full_ptor$fitted.values, # prediction
     model_full_ptor$residuals) # error

from the plot also does not form a pattern so that the conditions are met.

  1. Multikolinearitas
# condition accept if : vif < 10
vif(model_stepcrime)
##      percent_m mean_education   police_exp60     m_per1000f   unemploy_m24 
##       2.131963       4.189684       2.560496       1.932367       4.360038 
##   unemploy_m39     inequality    prob_prison 
##       4.508106       3.731074       1.381879

based on the test results, condition accept because there is no predictor value greater than 10

Conclusion

The variables that are useful for describing crime rates are percent_m, mean_education, police_exp60, m_per1000f, unemployment_m24, unemployment_m39, inequality, prob_prison, crime_rate. The R-squared model is high, with 78.88% of the variables explaining the crime rate. The accuracy of the model in predicting the crime rate as measured by RMSE, MAPE and MAE with values of 215.5166, 23% and 180 respectively.Based on the assumption test, the model used can measure the crime rate.

Factors that must be considered by the government in preventing an increase in the crime rate, namely

  1. The unemployment factor aged 35-39 years

  2. The factor of the number of males aged 14-24 years

  3. Factors Average years of schooling

  4. Equality factor

  5. Police Expenditure Factors in 1959

  6. The possible factor of imprisonment