1 Initialization

  

1.1 Library Call

library(dplyr)
library(GGally)
library(car)
library(gvlma)

  

1.2 Read Crime Data Source

crime <- read.csv("crime.csv") %>% select(-X) 
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate") 

  

2 Modeling & Variable Selection

  

2.1 Check Data Correlation using ggcorr

ggcorr(crime, label = T,label_size = 2.9, hjust=1, layout.exp = 5)

  

  • The target is crime_rate, so any variable which has low correlation with crime_rate could be ignored, such as:
    • time_prison
    • inequality
    • unemploy_m39
    • unemploy_m24
    • nonwhites_per1000
    • m_per1000f
    • labour_participation
    • is_south
    • percent_m
  • Any variable aside than crime_rate should not have high correlation, such as:
    • police_exp 59 & police_exp60
    • nonwhites_per1000 & is_south
    • mean_education & inequality
    • gdp & inequality

2.2 Create Linear Model

lm.crime <- lm(crime_rate ~ ., crime)
lm.crime.none <- lm(crime_rate ~ 1, crime)

2.3 Doing stepwise check

2.3.1 Backward Direction Step

summary(step(lm.crime, data=crime, direction="backward", trace = 0))
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.70 -111.07    3.03  122.15  483.30 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6426.101   1194.611  -5.379 4.04e-06 ***
## percent_m          9.332      3.350   2.786  0.00828 ** 
## mean_education    18.012      5.275   3.414  0.00153 ** 
## police_exp60      10.265      1.552   6.613 8.26e-08 ***
## m_per1000f         2.234      1.360   1.642  0.10874    
## unemploy_m24      -6.087      3.339  -1.823  0.07622 .  
## unemploy_m39      18.735      7.248   2.585  0.01371 *  
## inequality         6.133      1.396   4.394 8.63e-05 ***
## prob_prison    -3796.032   1490.646  -2.547  0.01505 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
## F-statistic: 17.74 on 8 and 38 DF,  p-value: 1.159e-10

  

2.3.2 Forward Direction Step

summary(step(lm.crime.none, scope = list(upper=lm.crime), data=crime, direction="forward",trace = 0))
## 
## Call:
## lm(formula = crime_rate ~ police_exp60 + inequality + mean_education + 
##     percent_m + prob_prison + unemploy_m39, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470.68  -78.41  -19.68  133.12  556.23 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5040.505    899.843  -5.602 1.72e-06 ***
## police_exp60      11.502      1.375   8.363 2.56e-10 ***
## inequality         6.765      1.394   4.855 1.88e-05 ***
## mean_education    19.647      4.475   4.390 8.07e-05 ***
## percent_m         10.502      3.330   3.154  0.00305 ** 
## prob_prison    -3801.836   1528.097  -2.488  0.01711 *  
## unemploy_m39       8.937      4.091   2.185  0.03483 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared:  0.7659, Adjusted R-squared:  0.7307 
## F-statistic: 21.81 on 6 and 40 DF,  p-value: 3.418e-11

  

2.3.3 Bi-Direction Step

2.3.3.1 From Upper Level

summary(step(lm.crime, scope = list(upper=lm.crime), data=crime, direction="both",trace = 0))
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.70 -111.07    3.03  122.15  483.30 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6426.101   1194.611  -5.379 4.04e-06 ***
## percent_m          9.332      3.350   2.786  0.00828 ** 
## mean_education    18.012      5.275   3.414  0.00153 ** 
## police_exp60      10.265      1.552   6.613 8.26e-08 ***
## m_per1000f         2.234      1.360   1.642  0.10874    
## unemploy_m24      -6.087      3.339  -1.823  0.07622 .  
## unemploy_m39      18.735      7.248   2.585  0.01371 *  
## inequality         6.133      1.396   4.394 8.63e-05 ***
## prob_prison    -3796.032   1490.646  -2.547  0.01505 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
## F-statistic: 17.74 on 8 and 38 DF,  p-value: 1.159e-10

  

2.3.3.2 From Lower Level

summary(step(lm.crime.none, scope = list(upper=lm.crime), data=crime, direction="both",trace = 0))
## 
## Call:
## lm(formula = crime_rate ~ police_exp60 + inequality + mean_education + 
##     percent_m + prob_prison + unemploy_m39, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470.68  -78.41  -19.68  133.12  556.23 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5040.505    899.843  -5.602 1.72e-06 ***
## police_exp60      11.502      1.375   8.363 2.56e-10 ***
## inequality         6.765      1.394   4.855 1.88e-05 ***
## mean_education    19.647      4.475   4.390 8.07e-05 ***
## percent_m         10.502      3.330   3.154  0.00305 ** 
## prob_prison    -3801.836   1528.097  -2.488  0.01711 *  
## unemploy_m39       8.937      4.091   2.185  0.03483 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared:  0.7659, Adjusted R-squared:  0.7307 
## F-statistic: 21.81 on 6 and 40 DF,  p-value: 3.418e-11

  

2.4 Summary

Checking from 4 different directions of stepwise, gave us the best adj. R Square if we chose below variables to be included into the model:

  • percent_m
  • mean_education
  • police_exp60
  • m_per1000f
  • unemploy_m24
  • unemploy_m39
  • inequality
  • prob_prison

Hence our Linear Model call will be

lm.adj <- lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
    m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
    data = crime)

3 Post-Modelling Analysis

3.1 Residual Plots

3.1.1 Residuals vs Fitted

plot(lm.adj, which=c(1,1))

  

This plot shows that the residuals don’t have non-linear patterns, because we can see on the plot that the residuals are spread around a horizontal line without distinct patterns

  

3.1.2 Normal Q-Q

plot(lm.adj, which=c(2,2))

  

From the normal Q-Q Plot given, we could assume that the model is normally distributed, beacause the points is faling along a straight line. And if we look closely to the pattern, the normal distribution is assumed as skewed left type (positive skewness).

  

3.1.3 Scale Location Plot

plot(lm.adj, which=c(3,3))

  

From this plot we can see that the model is homoscedastic because the residuals are tend to spread equally along the predictor range

  

3.1.4 Residuals vs Leverage

plot(lm.adj, which=c(5,5))

  

We can barely see cook’s distance lines because all cases are well inside of the cook’s, hence the outliers didn’t have significant impact to the data model

  

3.2 Variance Inflation Factor

vif(lm.adj)
##      percent_m mean_education   police_exp60     m_per1000f   unemploy_m24 
##       2.131963       4.189684       2.560496       1.932367       4.360038 
##   unemploy_m39     inequality    prob_prison 
##       4.508106       3.731074       1.381879

  

Analyzing the value of VIF, we can assume that the multicollinearity of our data model is low (< 10)

  

3.3 Global Valiadation of Linear Models Assumptions

gvlma(lm.adj)
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = crime)
## 
## Coefficients:
##    (Intercept)       percent_m  mean_education    police_exp60  
##      -6426.101           9.332          18.012          10.265  
##     m_per1000f    unemploy_m24    unemploy_m39      inequality  
##          2.234          -6.087          18.735           6.133  
##    prob_prison  
##      -3796.032  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = lm.adj) 
## 
##                     Value p-value                Decision
## Global Stat        2.4658  0.6508 Assumptions acceptable.
## Skewness           0.0841  0.7718 Assumptions acceptable.
## Kurtosis           0.1539  0.6948 Assumptions acceptable.
## Link Function      2.0589  0.1513 Assumptions acceptable.
## Heteroscedasticity 0.1689  0.6811 Assumptions acceptable.

3.3.1 Global Stat

   Global Stat showing the p-value > 0.05, indicates the linear relationship between X’s and Y

3.3.2 Skewness

   The Skewness value indicate the normal distribution is slightly skewed to left (positive skewness), p-value > 0.05 indicates that the data is distributed normally (no need for data transformation)

3.3.3 Kurtosis

   The kurtosis value indicate the normal distribution is slightly leptokurtic - fatter tails (positive kurtosis), p-value > 0.05 indicates that the data is distributed normally (no need for data transformation)

3.3.5 Heteroscedasticity

   Heteroscedasticity showing the p-value > 0.05, indicates that the model assumption of homoscedasticity is verified

  

3.4 Breusch-Pagan Test

lmtest::bptest(lm.adj)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm.adj
## BP = 13.51, df = 8, p-value = 0.09546

   The BP Test showing the p-value > 0.05, indicates that the model assumption of homoscedasticity is verified

3.5 Summary

After doing several test above we could safely assume that the model is good to be used

4 Overall Summary

The final beta values for defining crime rate based on our data is

x
(Intercept) -6426.101018
prob_prison -3796.031826
unemploy_m24 -6.086633
m_per1000f 2.233975
inequality 6.133494
percent_m 9.332155
police_exp60 10.265316
mean_education 18.012011
unemploy_m39 18.734512

  

Looking at the model shown, in order to decrease the crime rate significantly, we should increase the probability of the people to be imprisoned. Another recommendation that the model can give, we should decrease mean years of schooling (whaaaaaat??) and unemployed Male at the age of 35-39