Analysis Objective

This analysis is done to find the different socioeconomic variables most highly correlated to crime rates (crime_rate).

The analysis will use regression model which is expected to achieve an adjusted R-squared value above the grading threshold of 0.701 and the residual plot resembles a random scatterplot.

Libraries and Setup

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(leaps)
library(GGally)
## Loading required package: ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(shiny)
library(mediumr)

Import the crime.csv data and rename all the column names

crime <- read.csv("data_input/crime.csv") %>% 
  dplyr::select(-X)

names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")

anyNA(crime)
## [1] FALSE

Adding new columns “police_exp” and “unemploy_m_all”

We want to add the sum of “police_exp60” and “police_exp59” to column “police_exp” and the sum of “unemploy_m39” and “unemploy_m24” to column “unemployed_m_all”.

We don’t include “police_exp60”, “police_exp59”, “unemploy_m39” and “unemploy_m24” into our analysis.

crime <- crime %>%
  mutate(police_exp = police_exp60 + police_exp59) %>%
  mutate(unemploy_m_all = unemploy_m39 + unemploy_m24)

crime <- subset(crime, select=-c(police_exp59, police_exp60, unemploy_m39, unemploy_m24))
head(crime)
##   percent_m is_south mean_education labour_participation m_per1000f
## 1       151        1             91                  510        950
## 2       143        0            113                  583       1012
## 3       142        1             89                  533        969
## 4       136        0            121                  577        994
## 5       141        0            121                  591        985
## 6       121        0            110                  547        964
##   state_pop nonwhites_per1000 gdp inequality prob_prison time_prison
## 1        33               301 394        261    0.084602     26.2011
## 2        13               102 557        194    0.029599     25.2999
## 3        18               219 318        250    0.083401     24.3006
## 4       157                80 673        167    0.015801     29.9012
## 5        18                30 578        174    0.041399     21.2998
## 6        25                44 689        126    0.034201     20.9995
##   crime_rate police_exp unemploy_m_all
## 1        791        114            149
## 2       1635        198            132
## 3        578         89            127
## 4       1969        290            141
## 5       1234        210            111
## 6        682        233            113
summary(crime)
##    percent_m        is_south      mean_education  labour_participation
##  Min.   :119.0   Min.   :0.0000   Min.   : 87.0   Min.   :480.0       
##  1st Qu.:130.0   1st Qu.:0.0000   1st Qu.: 97.5   1st Qu.:530.5       
##  Median :136.0   Median :0.0000   Median :108.0   Median :560.0       
##  Mean   :138.6   Mean   :0.3404   Mean   :105.6   Mean   :561.2       
##  3rd Qu.:146.0   3rd Qu.:1.0000   3rd Qu.:114.5   3rd Qu.:593.0       
##  Max.   :177.0   Max.   :1.0000   Max.   :122.0   Max.   :641.0       
##    m_per1000f       state_pop      nonwhites_per1000      gdp       
##  Min.   : 934.0   Min.   :  3.00   Min.   :  2.0     Min.   :288.0  
##  1st Qu.: 964.5   1st Qu.: 10.00   1st Qu.: 24.0     1st Qu.:459.5  
##  Median : 977.0   Median : 25.00   Median : 76.0     Median :537.0  
##  Mean   : 983.0   Mean   : 36.62   Mean   :101.1     Mean   :525.4  
##  3rd Qu.: 992.0   3rd Qu.: 41.50   3rd Qu.:132.5     3rd Qu.:591.5  
##  Max.   :1071.0   Max.   :168.00   Max.   :423.0     Max.   :689.0  
##    inequality     prob_prison       time_prison      crime_rate    
##  Min.   :126.0   Min.   :0.00690   Min.   :12.20   Min.   : 342.0  
##  1st Qu.:165.5   1st Qu.:0.03270   1st Qu.:21.60   1st Qu.: 658.5  
##  Median :176.0   Median :0.04210   Median :25.80   Median : 831.0  
##  Mean   :194.0   Mean   :0.04709   Mean   :26.60   Mean   : 905.1  
##  3rd Qu.:227.5   3rd Qu.:0.05445   3rd Qu.:30.45   3rd Qu.:1057.5  
##  Max.   :276.0   Max.   :0.11980   Max.   :44.00   Max.   :1993.0  
##    police_exp    unemploy_m_all 
##  Min.   : 87.0   Min.   : 91.0  
##  1st Qu.:121.5   1st Qu.:111.5  
##  Median :151.0   Median :126.0  
##  Mean   :165.2   Mean   :129.4  
##  3rd Qu.:200.5   3rd Qu.:143.5  
##  Max.   :323.0   Max.   :188.0

Assumption Checks

First, we want to check the correlation of other variables with the “crime_rate” variable.

round(cor(crime, crime$crime_rate), 3)
##                        [,1]
## percent_m            -0.089
## is_south             -0.091
## mean_education        0.323
## labour_participation  0.189
## m_per1000f            0.214
## state_pop             0.337
## nonwhites_per1000     0.033
## gdp                   0.441
## inequality           -0.179
## prob_prison          -0.427
## time_prison           0.150
## crime_rate            1.000
## police_exp            0.679
## unemploy_m_all        0.024

From this, we can assume that “percent_m”, “is_south”, “nonwhites_per1000”, and “unemploy_m_all” have no correlation to “crime_rate”.

The first assumption of the formula used for our model is:

formula = crime_rate ~ mean_education + labour_participation + m_per1000f + state_pop + gdp + inequality + prob_prison + time_prison + police_exp

crime_rate_model1 <- lm(formula = crime_rate ~ mean_education + labour_participation + m_per1000f + state_pop + gdp + inequality + prob_prison + time_prison + police_exp, data = crime)

summary(crime_rate_model1)
## 
## Call:
## lm(formula = crime_rate ~ mean_education + labour_participation + 
##     m_per1000f + state_pop + gdp + inequality + prob_prison + 
##     time_prison + police_exp, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -458.06 -150.16   29.22  144.11  390.38 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5352.4861  1572.1075  -3.405 0.001607 ** 
## mean_education          11.7720     6.1366   1.918 0.062813 .  
## labour_participation    -0.5037     1.1234  -0.448 0.656484    
## m_per1000f               2.1825     1.6044   1.360 0.181975    
## state_pop               -1.2944     1.3105  -0.988 0.329682    
## gdp                      0.8584     1.0211   0.841 0.405900    
## inequality               8.9176     2.0937   4.259 0.000135 ***
## prob_prison          -3015.0440  2145.2508  -1.405 0.168229    
## time_prison              5.8685     6.7594   0.868 0.390881    
## police_exp               6.0735     1.0940   5.551 2.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 224.9 on 37 degrees of freedom
## Multiple R-squared:  0.7279, Adjusted R-squared:  0.6617 
## F-statistic:    11 on 9 and 37 DF,  p-value: 4.278e-08

Since the adjusted R-squared value is below our target value 0.701, we will not use this formula.

Secondly, we do stepwise regression where the choice of predictive variables is carried out by an automatic procedure.

lm.all <- lm(crime_rate ~., crime)
# we only use it as a guidance when building our model
step(lm.all, direction="backward")
## Start:  AIC=518.45
## crime_rate ~ percent_m + is_south + mean_education + labour_participation + 
##     m_per1000f + state_pop + nonwhites_per1000 + gdp + inequality + 
##     prob_prison + time_prison + police_exp + unemploy_m_all
## 
##                        Df Sum of Sq     RSS    AIC
## - nonwhites_per1000     1      1009 1600646 516.48
## - time_prison           1      2145 1601782 516.51
## - labour_participation  1      3396 1603033 516.55
## - m_per1000f            1      6255 1605891 516.63
## - is_south              1     17497 1617133 516.96
## - state_pop             1     23927 1623563 517.15
## - unemploy_m_all        1     43247 1642883 517.71
## - gdp                   1     62432 1662068 518.25
## <none>                              1599636 518.45
## - prob_prison           1    147449 1747085 520.60
## - percent_m             1    173809 1773446 521.30
## - mean_education        1    244509 1844145 523.14
## - inequality            1    515785 2115422 529.59
## - police_exp            1   1024471 2624107 539.71
## 
## Step:  AIC=516.48
## crime_rate ~ percent_m + is_south + mean_education + labour_participation + 
##     m_per1000f + state_pop + gdp + inequality + prob_prison + 
##     time_prison + police_exp + unemploy_m_all
## 
##                        Df Sum of Sq     RSS    AIC
## - time_prison           1      3014 1603660 514.57
## - labour_participation  1      4926 1605572 514.63
## - m_per1000f            1      5514 1606160 514.64
## - state_pop             1     24238 1624883 515.19
## - is_south              1     25581 1626227 515.23
## - unemploy_m_all        1     48928 1649574 515.90
## - gdp                   1     61999 1662645 516.27
## <none>                              1600646 516.48
## - prob_prison           1    151807 1752453 518.74
## - percent_m             1    199479 1800124 520.00
## - mean_education        1    243802 1844448 521.14
## - inequality            1    519500 2120145 527.69
## - police_exp            1   1356048 2956693 543.32
## 
## Step:  AIC=514.57
## crime_rate ~ percent_m + is_south + mean_education + labour_participation + 
##     m_per1000f + state_pop + gdp + inequality + prob_prison + 
##     police_exp + unemploy_m_all
## 
##                        Df Sum of Sq     RSS    AIC
## - m_per1000f            1      3815 1607476 512.68
## - labour_participation  1      5837 1609497 512.74
## - state_pop             1     21514 1625174 513.20
## - is_south              1     25696 1629356 513.32
## - unemploy_m_all        1     50242 1653902 514.02
## - gdp                   1     65128 1668788 514.44
## <none>                              1603660 514.57
## - percent_m             1    227419 1831079 518.80
## - prob_prison           1    232450 1836111 518.93
## - mean_education        1    241857 1845517 519.17
## - inequality            1    522622 2126282 525.83
## - police_exp            1   1358738 2962398 541.41
## 
## Step:  AIC=512.68
## crime_rate ~ percent_m + is_south + mean_education + labour_participation + 
##     state_pop + gdp + inequality + prob_prison + police_exp + 
##     unemploy_m_all
## 
##                        Df Sum of Sq     RSS    AIC
## - labour_participation  1     14595 1622070 511.11
## - is_south              1     25496 1632971 511.42
## - state_pop             1     40425 1647901 511.85
## - gdp                   1     68865 1676340 512.65
## <none>                              1607476 512.68
## - unemploy_m_all        1    106387 1713862 513.69
## - prob_prison           1    230168 1837643 516.97
## - percent_m             1    265485 1872961 517.87
## - mean_education        1    280743 1888218 518.25
## - inequality            1    558268 2165744 524.69
## - police_exp            1   1443206 3050682 540.79
## 
## Step:  AIC=511.11
## crime_rate ~ percent_m + is_south + mean_education + state_pop + 
##     gdp + inequality + prob_prison + police_exp + unemploy_m_all
## 
##                  Df Sum of Sq     RSS    AIC
## - is_south        1     14140 1636210 509.51
## - state_pop       1     45007 1667078 510.39
## <none>                        1622070 511.11
## - gdp             1     85006 1707076 511.51
## - unemploy_m_all  1     91852 1713922 511.69
## - prob_prison     1    227914 1849985 515.29
## - percent_m       1    278219 1900289 516.55
## - mean_education  1    406829 2028899 519.62
## - inequality      1    771466 2393536 527.39
## - police_exp      1   1430499 3052570 538.82
## 
## Step:  AIC=509.51
## crime_rate ~ percent_m + mean_education + state_pop + gdp + inequality + 
##     prob_prison + police_exp + unemploy_m_all
## 
##                  Df Sum of Sq     RSS    AIC
## - state_pop       1     45088 1681298 508.79
## <none>                        1636210 509.51
## - unemploy_m_all  1     85168 1721378 509.90
## - gdp             1     98370 1734580 510.26
## - prob_prison     1    219956 1856166 513.44
## - percent_m       1    325254 1961464 516.04
## - mean_education  1    403684 2039894 517.88
## - inequality      1   1008407 2644617 530.08
## - police_exp      1   1529176 3165386 538.53
## 
## Step:  AIC=508.79
## crime_rate ~ percent_m + mean_education + gdp + inequality + 
##     prob_prison + police_exp + unemploy_m_all
## 
##                  Df Sum of Sq     RSS    AIC
## <none>                        1681298 508.79
## - unemploy_m_all  1     89362 1770660 509.23
## - gdp             1     90985 1772283 509.27
## - prob_prison     1    187011 1868308 511.75
## - percent_m       1    393761 2075059 516.68
## - mean_education  1    516691 2197989 519.39
## - inequality      1    963607 2644905 528.09
## - police_exp      1   1603714 3285012 538.27
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + gdp + 
##     inequality + prob_prison + police_exp + unemploy_m_all, data = crime)
## 
## Coefficients:
##    (Intercept)       percent_m  mean_education             gdp  
##      -5559.360          10.455          15.530           1.385  
##     inequality     prob_prison      police_exp  unemploy_m_all  
##          8.514       -3384.687           5.522           1.876

From stepwise regression, we get the optimized regression formula for our model :

formula = crime_rate ~ percent_m + mean_education + gdp + inequality + prob_prison + police_exp + unemploy_m_all

Let’s try to run this formula to lm() function and we will inspect the p-value and the R-squared value of the model.

crime_rate_model2 <- lm(formula = crime_rate ~ percent_m + mean_education + gdp + inequality + prob_prison + police_exp + unemploy_m_all, 
    data = crime)

summary(crime_rate_model2)
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + gdp + 
##     inequality + prob_prison + police_exp + unemploy_m_all, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -441.22 -103.95   -9.48   88.75  485.20 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5559.3597  1093.5825  -5.084 9.62e-06 ***
## percent_m         10.4554     3.4595   3.022  0.00442 ** 
## mean_education    15.5299     4.4858   3.462  0.00132 ** 
## gdp                1.3845     0.9530   1.453  0.15429    
## inequality         8.5140     1.8008   4.728 2.94e-05 ***
## prob_prison    -3384.6875  1625.0821  -2.083  0.04388 *  
## police_exp         5.5221     0.9054   6.099 3.77e-07 ***
## unemploy_m_all     1.8763     1.3032   1.440  0.15792    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 207.6 on 39 degrees of freedom
## Multiple R-squared:  0.7557, Adjusted R-squared:  0.7118 
## F-statistic: 17.23 on 7 and 39 DF,  p-value: 3.76e-10

From summary(), we know that there are 2 variable with p-value more than 0.05 which are not too significant to the model/formula.

The adjusted R-squared value for this formula is 0.7118.

  1. We use “all-possible-regression” to re-check all possible subsets of the variables and finds the model that best fits the data according to a specified criteria, which minimum our adjusted r-squared value 0.701.

The nbest parameter also allow us to specify the number of subsets of each size to record, and I will use 5.

We use the all variables from “crime” dataset on regsubsets() from library(leap).

crime_rate_check <- regsubsets(crime_rate ~ percent_m + is_south + mean_education + labour_participation + m_per1000f + state_pop + nonwhites_per1000 + gdp + inequality + prob_prison + time_prison + police_exp + unemploy_m_all, crime, nbest = 1)

plot(crime_rate_check, scale = "adjr", main = "All possible regression: ranked by Adjusted R-squared")

From all-possible-regression method with regsubsets(), we get this formula :

formula = crime_rate ~ percent_m + mean_education + gdp + inequality + prob_prison + police_exp

We choose the formula with the less variables or the less assumptions but we still achieve the minimum R-squared value 0.701.

crime_rate_model3 <- lm(formula = crime_rate ~ percent_m + mean_education + gdp + inequality + prob_prison + police_exp, 
    data = crime)

summary(crime_rate_model3)
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + gdp + 
##     inequality + prob_prison + police_exp, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -460.76  -91.08   -2.90  116.30  430.33 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5024.2858  1042.1977  -4.821 2.09e-05 ***
## percent_m          9.0229     3.3575   2.687   0.0104 *  
## mean_education    14.3643     4.4710   3.213   0.0026 ** 
## gdp                1.4533     0.9645   1.507   0.1397    
## inequality         8.5562     1.8246   4.689 3.17e-05 ***
## prob_prison    -3363.9367  1646.6671  -2.043   0.0477 *  
## police_exp         5.4262     0.9150   5.931 5.92e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 210.4 on 40 degrees of freedom
## Multiple R-squared:  0.7427, Adjusted R-squared:  0.7041 
## F-statistic: 19.24 on 6 and 40 DF,  p-value: 2.134e-10

From the three methods used to find out which formula is the most appropriate for our model, I choose the third one (crime_rate_model3) from all-possible-regression for our regression model.

summary(crime_rate_model3)$adj.r.squared
## [1] 0.704072
summary(crime_rate_model3)$call
## lm(formula = crime_rate ~ percent_m + mean_education + gdp + 
##     inequality + prob_prison + police_exp, data = crime)

So, we can say that :

crime_rate = -5024.2858 + 9.0229 * percent_m + 14.3643 * mean_education + 1.4533 * gdp + 8.5562 * inequality - 3363.9367 * prob_prison + 5.4262 * police_exp

The higher the percentage of males, mean education, gdp, inequality, and police expenses, the higher the crime rate will be. The lower the probability of imprisonment, the higher the crime will be.

It’s interesting that percentage of males (percent_m) and inequality variables have strong relationship (low p-value or high significance) with crime_rate, eventhough they’re not giving strong correlation (-0.089 and 0.179, respectively) in the first place.

Residual Plots

We look for any unwanted patterns in the plot that may indicate a violation in model assumptions related to incorrect specifications.

If there is a pattern in the residual plot, it means that the model can be further improved upon or that it does not meet the linearity assumption.

plot(crime$crime_rate,
  residuals(crime_rate_model3), ylim=c(-600,600), ylab="Residuals", xlab="Crime Rate", pch = 2, cex = 0.4, col = "blue", main = "Residual Plot from Crime Rate Model")
abline(h = 0, col="red", lwd = 1.5)

This residual plot do not show any non-random pattern, and at this point suffice to say we can put the model specifications into production use knowing that it can’t be improved upon any further.

Using the model for prediction

We will predict our regression model for 2 new data from Jakarta and Bandung, to predict the crime rate in each cities. (Note : This is only dummy data)

Jakarta <- data.frame(percent_m = 141, mean_education = 96, gdp = 482, inequality = 195, prob_prison = 0.04, police_exp = 120)

Bandung <- data.frame(percent_m = 128, mean_education = 165, gdp = 322, inequality = 201, prob_prison = NA, police_exp = NA)

Since there are any NA data on Bandung data.frame, we can use the most data occured or median from “crime” dataset to predict crime rate for Bandung.

Bandung$prob_prison <- median(crime$prob_prison)
Bandung$police_exp <- median(crime$police_exp)

Bandung
##   percent_m mean_education gdp inequality prob_prison police_exp
## 1       128            165 322        201      0.0421        151

We use predict() to find out the crime rate for Jakarta and Bandung.

predict(crime_rate_model3, Jakarta, interval = "confidence")
##        fit      lwr      upr
## 1 512.4377 382.4231 642.4522
predict(crime_rate_model3, Bandung, interval = "confidence")
##        fit      lwr     upr
## 1 1366.239 627.0174 2105.46

Apparently, Bandung has the higher crime rate, so be careful and take care when you visit the city.