1. DATA INTRODUCTION

On this occasion, we will analyze using a linear regression model on the data crime.csv.

We will explore the target variable crime_rate to be able to analyze how socio-demographic variables influence the crime rate in an area.

1.1. Data Preparation

crime <- read.csv("crime.csv")
head(crime)

As we can see in the data display above, the column names in the data are still in the form of codes. Therefore, we will change the column names to match the proper column names.

crime <- crime %>% select(-X)
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")

head(crime)

Check the data information.

str(crime) 
## 'data.frame':    47 obs. of  16 variables:
##  $ percent_m           : int  151 143 142 136 141 121 127 131 157 140 ...
##  $ is_south            : int  1 0 1 0 0 0 1 1 1 0 ...
##  $ mean_education      : int  91 113 89 121 121 110 111 109 90 118 ...
##  $ police_exp60        : int  58 103 45 149 109 118 82 115 65 71 ...
##  $ police_exp59        : int  56 95 44 141 101 115 79 109 62 68 ...
##  $ labour_participation: int  510 583 533 577 591 547 519 542 553 632 ...
##  $ m_per1000f          : int  950 1012 969 994 985 964 982 969 955 1029 ...
##  $ state_pop           : int  33 13 18 157 18 25 4 50 39 7 ...
##  $ nonwhites_per1000   : int  301 102 219 80 30 44 139 179 286 15 ...
##  $ unemploy_m24        : int  108 96 94 102 91 84 97 79 81 100 ...
##  $ unemploy_m39        : int  41 36 33 39 20 29 38 35 28 24 ...
##  $ gdp                 : int  394 557 318 673 578 689 620 472 421 526 ...
##  $ inequality          : int  261 194 250 167 174 126 168 206 239 174 ...
##  $ prob_prison         : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
##  $ time_prison         : num  26.2 25.3 24.3 29.9 21.3 ...
##  $ crime_rate          : int  791 1635 578 1969 1234 682 963 1555 856 705 ...

1.2. Data Preprocessing

From the data information above, there is one variable that has a data type that does not match the data. Therefore, we will change the data type to match. And we will check if any data has missing values.

crime <- crime %>% 
  mutate(is_south = as.factor(is_south))

str(crime)
## 'data.frame':    47 obs. of  16 variables:
##  $ percent_m           : int  151 143 142 136 141 121 127 131 157 140 ...
##  $ is_south            : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 2 2 1 ...
##  $ mean_education      : int  91 113 89 121 121 110 111 109 90 118 ...
##  $ police_exp60        : int  58 103 45 149 109 118 82 115 65 71 ...
##  $ police_exp59        : int  56 95 44 141 101 115 79 109 62 68 ...
##  $ labour_participation: int  510 583 533 577 591 547 519 542 553 632 ...
##  $ m_per1000f          : int  950 1012 969 994 985 964 982 969 955 1029 ...
##  $ state_pop           : int  33 13 18 157 18 25 4 50 39 7 ...
##  $ nonwhites_per1000   : int  301 102 219 80 30 44 139 179 286 15 ...
##  $ unemploy_m24        : int  108 96 94 102 91 84 97 79 81 100 ...
##  $ unemploy_m39        : int  41 36 33 39 20 29 38 35 28 24 ...
##  $ gdp                 : int  394 557 318 673 578 689 620 472 421 526 ...
##  $ inequality          : int  261 194 250 167 174 126 168 206 239 174 ...
##  $ prob_prison         : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
##  $ time_prison         : num  26.2 25.3 24.3 29.9 21.3 ...
##  $ crime_rate          : int  791 1635 578 1969 1234 682 963 1555 856 705 ...
anyNA(crime)
## [1] FALSE

2. DATA ANALYSIS

2.1. Feature Selection

Check the variable correlation.

ggcorr(crime, label = T, hjust = 1, layout.exp = 4)
## Warning in ggcorr(crime, label = T, hjust = 1, layout.exp = 4): data in
## column(s) 'is_south' are not numeric and were ignored

The variables with a strong correlation with crime_rate are police_exp59(0.7) and police_exp60 (0.7).

2.2. Build The Regression Model

We will make several linear regression model to predict crime_rate. Based on the two most potential variables, based on the combination of the two most potential variables, and based on the combination of all the existing variables.

crime_rate_RM_59 <- lm(crime_rate ~ police_exp59, crime)

crime_rate_RM_60 <- lm(crime_rate ~ police_exp60, crime)

crime_rate_RM_5960 <- lm(crime_rate ~ police_exp59 + police_exp60, crime)

crime_rate_RM_all <- lm(crime_rate ~ ., crime)

summary(crime_rate_RM_59)
## 
## Call:
## lm(formula = crime_rate ~ police_exp59, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -595.58 -156.76   12.29  146.74  593.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   165.164    130.427   1.266    0.212    
## police_exp59    9.222      1.537   6.001 3.11e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 291.4 on 45 degrees of freedom
## Multiple R-squared:  0.4445, Adjusted R-squared:  0.4322 
## F-statistic: 36.01 on 1 and 45 DF,  p-value: 3.114e-07
summary(crime_rate_RM_60)
## 
## Call:
## lm(formula = crime_rate ~ police_exp60, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -586.91 -155.63   32.52  139.58  568.84 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   144.464    126.693   1.140     0.26    
## police_exp60    8.948      1.409   6.353 9.34e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 283.9 on 45 degrees of freedom
## Multiple R-squared:  0.4728, Adjusted R-squared:  0.4611 
## F-statistic: 40.36 on 1 and 45 DF,  p-value: 9.338e-08
summary(crime_rate_RM_5960)
## 
## Call:
## lm(formula = crime_rate ~ police_exp59 + police_exp60, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -636.09 -168.62   35.44  141.80  532.10 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    158.26     125.93   1.257   0.2155  
## police_exp59   -17.83      13.12  -1.359   0.1810  
## police_exp60    25.62      12.34   2.076   0.0438 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 281.3 on 44 degrees of freedom
## Multiple R-squared:  0.494,  Adjusted R-squared:  0.471 
## F-statistic: 21.48 on 2 and 44 DF,  p-value: 3.094e-07
summary(crime_rate_RM_all)
## 
## Call:
## lm(formula = crime_rate ~ ., data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -395.74  -98.09   -6.69  112.99  512.67 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5984.2876  1628.3184  -3.675 0.000893 ***
## percent_m                8.7830     4.1714   2.106 0.043443 *  
## is_south1               -3.8035   148.7551  -0.026 0.979765    
## mean_education          18.8324     6.2088   3.033 0.004861 ** 
## police_exp60            19.2804    10.6110   1.817 0.078892 .  
## police_exp59           -10.9422    11.7478  -0.931 0.358830    
## labour_participation    -0.6638     1.4697  -0.452 0.654654    
## m_per1000f               1.7407     2.0354   0.855 0.398995    
## state_pop               -0.7330     1.2896  -0.568 0.573845    
## nonwhites_per1000        0.4204     0.6481   0.649 0.521279    
## unemploy_m24            -5.8271     4.2103  -1.384 0.176238    
## unemploy_m39            16.7800     8.2336   2.038 0.050161 .  
## gdp                      0.9617     1.0367   0.928 0.360754    
## inequality               7.0672     2.2717   3.111 0.003983 ** 
## prob_prison          -4855.2658  2272.3746  -2.137 0.040627 *  
## time_prison             -3.4790     7.1653  -0.486 0.630708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078 
## F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07

And We will also buil a step-wise regression forward model.

# preparation : model without predictor
crime_rate_none <- lm(crime_rate ~ 1, data = crime)

# stepwise regression - forward
crime_rate_sw <- step(object = crime_rate_none, scope = list(upper = crime_rate_RM_all), direction = "forward", trace = F)

summary(crime_rate_sw)
## 
## Call:
## lm(formula = crime_rate ~ police_exp60 + inequality + mean_education + 
##     percent_m + prob_prison + unemploy_m39, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470.68  -78.41  -19.68  133.12  556.23 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5040.505    899.843  -5.602 1.72e-06 ***
## police_exp60      11.502      1.375   8.363 2.56e-10 ***
## inequality         6.765      1.394   4.855 1.88e-05 ***
## mean_education    19.647      4.475   4.390 8.07e-05 ***
## percent_m         10.502      3.330   3.154  0.00305 ** 
## prob_prison    -3801.836   1528.097  -2.488  0.01711 *  
## unemploy_m39       8.937      4.091   2.185  0.03483 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared:  0.7659, Adjusted R-squared:  0.7307 
## F-statistic: 21.81 on 6 and 40 DF,  p-value: 3.418e-11

2.3. Model Evaluation

We will compare the r-squared of each model that we have created.

summary(crime_rate_RM_59)$r.squared
## [1] 0.4445077
summary(crime_rate_RM_60)$r.squared
## [1] 0.4727999
summary(crime_rate_RM_5960)$adj.r.squared
## [1] 0.4710441
summary(crime_rate_RM_all)$adj.r.squared
## [1] 0.7078062
summary(crime_rate_sw)$adj.r.squared
## [1] 0.7307463

And now We will compare the RMSE of each model that we have created. And also W will use 98% Confidence Interval.

pred_crime_rate_RM_59 <- predict(crime_rate_RM_59,
                                 newdata = crime,
                                 interval = "prediction",
                                 level = 0.98)

pred_crime_rate_RM_60 <- predict(crime_rate_RM_60,
                                 newdata = crime,
                                 interval = "prediction",
                                 level = 0.98)

pred_crime_rate_RM_5960 <- predict(crime_rate_RM_5960,
                                   newdata = crime,
                                   interval = "prediction",
                                   level = 0.98)

pred_crime_rate_RM_all <- predict(crime_rate_RM_all,
                                  newdata = crime,
                                  interval = "prediction",
                                  level = 0.98)

pred_crime_rate_sw <- predict(crime_rate_sw,
                              newdata = crime,
                              interval = "prediction",
                              level = 0.98)

# evaluation
RMSE(pred_crime_rate_RM_59, crime$crime_rate)
## [1] 651.7798
RMSE(pred_crime_rate_RM_60, crime$crime_rate)
## [1] 634.9648
RMSE(pred_crime_rate_RM_5960, crime$crime_rate)
## [1] 633.3421
RMSE(pred_crime_rate_RM_all, crime$crime_rate)
## [1] 513.6302
RMSE(pred_crime_rate_sw, crime$crime_rate)
## [1] 464.1492

Based on the evaluation that We conduct above, for the r-squared & RMSE, the best model we can use is linear regression model with all predictor (crime_rate_RM_all).

3. MODEL VALIDATION

There are several assumptions applied by the model. It can be checked whether these assumptions are met to ensure the goodness of the model.

3.1. Normality of Residuals

Histogram residual visualization.

hist(crime_rate_RM_all$residuals)

Saphiro statistic test.

shapiro.test(crime_rate_RM_all$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  crime_rate_RM_all$residuals
## W = 0.9846, p-value = 0.7849
  • H0: error/residual normally distributed
  • H1: error/residual not normally distributed

p-value >= 0.05 = accept H0
p-value < 0.05 = accept H1

Conclusion: error/residual is normally distributed, assumption test is met.

3.2. Homoscedasticity

Scatterplot visualization.

plot(x = crime_rate_RM_all$fitted.values, y = crime_rate_RM_all$residuals)
abline(h = 0, col = "red")

Breusch-Pagan hypothesis statistic test.

bptest(crime_rate_RM_all)
## 
##  studentized Breusch-Pagan test
## 
## data:  crime_rate_RM_all
## BP = 18.469, df = 15, p-value = 0.2388
  • H0: Error variance spreads constant (Homoscedasticity)
  • H1: Error variance spreads not constant/forms a pattern (Heteroscedasticity)

p-value >= 0.05 = accept H0
p-value < 0.05 = accept H1

Conclusion: Variance of error spreads constant (Homoscedasticity), assumption test is met.

3.3. Multicollinearity

VIF (Variance Inflation Factor) test.

vif(crime_rate_RM_all)
##            percent_m             is_south       mean_education 
##             2.892448             5.342783             5.077447 
##         police_exp60         police_exp59 labour_participation 
##           104.658667           113.559262             3.712690 
##           m_per1000f            state_pop    nonwhites_per1000 
##             3.785934             2.536708             4.674088 
##         unemploy_m24         unemploy_m39                  gdp 
##             6.063931             5.088880            10.530375 
##           inequality          prob_prison          time_prison 
##             8.644528             2.809459             2.713785
  • VIF > 10 : multicollinearity
  • VIF < 10 : no multicollinearity

Conclusion: Our model has multicollinearity, the assumption test is not met.

4. MODEL INTERPRETATION & RECOMENDATION

From the normality of residuals, Homoscedasticity, and Multicolinearity test, our crime_rate_RM_all model only fail the assumption in the Multicolinearity test. Therefore, we will validate the second best model, namely crime_rate_sw model.

#Normality of Residuals
shapiro.test(crime_rate_sw$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  crime_rate_sw$residuals
## W = 0.96898, p-value = 0.2423
#Homoscadasticity
bptest(crime_rate_sw)
## 
##  studentized Breusch-Pagan test
## 
## data:  crime_rate_sw
## BP = 14.873, df = 6, p-value = 0.02127
#Multicolinearity
vif(crime_rate_sw)
##   police_exp60     inequality mean_education      percent_m    prob_prison 
##       1.908124       3.530429       2.862893       2.000244       1.378714 
##   unemploy_m39 
##       1.363074

From the three validation test above, our crime_rate_sw model has fail in the Homoscedasticity assumption.

The risk of assumption test:
* Normality is not met: there is a large probability of error in our model (far from 0).
* Homoscedasticity is not met: there is a possibility of a large error in a certain range of values.
* Multicoll not met: less efficient.

So, it can be concluded that our crime_rate_RM_all model is the best model because it can handle errors that may appear better despite the lack of efficiency than crime_rate_sw model.

Based on the model that we have made, all predictor variables are very useful for predicting the target variable. In other words, all socio-demographic variables will influence the crime rate in an area.