Predicting Crime Rate in Crime.csv Dataset Using Regression Model

In this section we will analyse about crime using linear regression to understand the trend of crime of the dataset. Dataset called crime.csv from 1960s crime data in crime.

1. Load Dataset

We install library we need for data processing.

library(dplyr)
library(leaps)
library(MASS)
library(GGally)
library(lmtest)
library(car)

We will read the dataset from our folder so we can process it.

crime <- read.csv("crime.csv")

We need to check the dataset.

head(crime)

##   X   M So  Ed Po1 Po2  LF  M.F Pop  NW  U1 U2 GDP Ineq     Prob    Time    y
## 1 1 151  1  91  58  56 510  950  33 301 108 41 394  261 0.084602 26.2011  791
## 2 2 143  0 113 103  95 583 1012  13 102  96 36 557  194 0.029599 25.2999 1635
## 3 3 142  1  89  45  44 533  969  18 219  94 33 318  250 0.083401 24.3006  578
## 4 4 136  0 121 149 141 577  994 157  80 102 39 673  167 0.015801 29.9012 1969
## 5 5 141  0 121 109 101 591  985  18  30  91 20 578  174 0.041399 21.2998 1234
## 6 6 121  0 110 118 115 547  964  25  44  84 29 689  126 0.034201 20.9995  682

To make column clear we need to rename the column so we can understand better.

crime <- read.csv("crime.csv") %>% 
dplyr::select(-X)
names(crime) <- c("percent_m", 
                  "is_south", 
                  "mean_education", 
                  "police_exp60", 
                  "police_exp59", 
                  "labour_participation", 
                  "m_per1000f", 
                  "state_pop", 
                  "nonwhites_per1000", 
                  "unemploy_m24", 
                  "unemploy_m39", 
                  "gdp", 
                  "inequality", 
                  "prob_prison", 
                  "time_prison", 
                  "crime_rate")

After we rename it we check the structure of the data.

glimpse(crime)

## Rows: 47
## Columns: 16
## $ percent_m            <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
## $ is_south             <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ mean_education       <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
## $ police_exp60         <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
## $ police_exp59         <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
## $ m_per1000f           <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
## $ state_pop            <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
## $ nonwhites_per1000    <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
## $ unemploy_m24         <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
## $ unemploy_m39         <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
## $ gdp                  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
## $ inequality           <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
## $ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
## $ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
## $ crime_rate           <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …

The data is good so we can explore the data to understand the data.

2. Explanatory Data Analysis

The dataset was collected in 1960 and a full description of the dataset wasn’t conveniently available. I use the description I gathered from the authors of the MASS package. After you rename the dataset (in your coursebook, around line 410 to line 420), the variables are:
- percent_m: percentage of males aged 14-24 - is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
- mean_education: mean years of schooling
- police_exp60: police expenditure in 1960
- police_exp59: police expenditure in 1959 - labour_participation:labour force participation rate
- m_per1000f: number of males per 1000 females
- state_pop: state population
- nonwhites_per1000: number of non-whites resident per 1000 people
- unemploy_m24: unemployment rate of urban males aged 14-24
- unemploy_m39: unemployment rate of urban males aged 35-39
- gdp: gross domestic product per head
- inequality: income inequality
- prob_prison: probability of imprisonment
- time_prison: avg time served in prisons
- crime_rate: crime rate in an unspecified category

Testing any missing value

anyNA(crime)

## [1] FALSE

The data shown that there is no missing value so we can analyse further.

Seeking correlation with any variable

ggcorr(crime, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

We can see the correlation between each column. In our analysis we want to analyze crime rate in data set so we want to see any corelation in column crime_rate. We can see police_exp59 and police_exp60 have the highest correlation with crime_rate.

3. Making Regression Linear Model

We want to check which variable had the highest correlation with crime_rate. In the picture above we can see that police_exp59 and police_exp60 have the highest correlation with crime_rate with a score of 0.7. We will make the regression model using predictor variable from police_exp59 and police_exp60.

lm(crime_rate ~ police_exp59 , data = crime)

## 
## Call:
## lm(formula = crime_rate ~ police_exp59, data = crime)
## 
## Coefficients:
##  (Intercept)  police_exp59  
##      165.164         9.222

lm(crime_rate ~ police_exp60 , data = crime)

## 
## Call:
## lm(formula = crime_rate ~ police_exp60, data = crime)
## 
## Coefficients:
##  (Intercept)  police_exp60  
##      144.464         8.948

We can save them into objects called crime_r and crime_s

crime_r <- lm(crime_rate ~ police_exp59 , data = crime)
crime_s <- lm(crime_rate ~ police_exp60 , data = crime)

We can check the model using summary() function

summary(crime_r)

## 
## Call:
## lm(formula = crime_rate ~ police_exp59, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -595.58 -156.76   12.29  146.74  593.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   165.164    130.427   1.266    0.212    
## police_exp59    9.222      1.537   6.001 3.11e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 291.4 on 45 degrees of freedom
## Multiple R-squared:  0.4445, Adjusted R-squared:  0.4322 
## F-statistic: 36.01 on 1 and 45 DF,  p-value: 3.114e-07

summary(crime_s)

## 
## Call:
## lm(formula = crime_rate ~ police_exp60, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -586.91 -155.63   32.52  139.58  568.84 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   144.464    126.693   1.140     0.26    
## police_exp60    8.948      1.409   6.353 9.34e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 283.9 on 45 degrees of freedom
## Multiple R-squared:  0.4728, Adjusted R-squared:  0.4611 
## F-statistic: 40.36 on 1 and 45 DF,  p-value: 9.338e-08

From the summary we can understand that police_exp60 is a better predictor because it has bigger R-squared and Adjusted R-squared (0.4445 and 0.4322 versus 0.4728 and 0.4611). To go further, we will draw the plot so we can see the picture of the model.

plot(crime$police_exp60, crime$crime_rate)
abline(crime_s$coefficients[1],crime_s$coefficients[2])

Now we get the regression model using one variable. We will compare it with other method using multiple variable to see which model performs better.

4. Multiple Regression Model

We will use r function to choose the predictor variable for us to make our model. The function we use is step-wise regression with backward elimination method.

crime_m  <- lm(crime_rate ~ ., crime)
step(crime_m, direction = "backward")

## Start:  AIC=514.65
## crime_rate ~ percent_m + is_south + mean_education + police_exp60 + 
##     police_exp59 + labour_participation + m_per1000f + state_pop + 
##     nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality + 
##     prob_prison + time_prison
## 
##                        Df Sum of Sq     RSS    AIC
## - is_south              1        29 1354974 512.65
## - labour_participation  1      8917 1363862 512.96
## - time_prison           1     10304 1365250 513.00
## - state_pop             1     14122 1369068 513.14
## - nonwhites_per1000     1     18395 1373341 513.28
## - m_per1000f            1     31967 1386913 513.74
## - gdp                   1     37613 1392558 513.94
## - police_exp59          1     37919 1392865 513.95
## <none>                              1354946 514.65
## - unemploy_m24          1     83722 1438668 515.47
## - police_exp60          1    144306 1499252 517.41
## - unemploy_m39          1    181536 1536482 518.56
## - percent_m             1    193770 1548716 518.93
## - prob_prison           1    199538 1554484 519.11
## - mean_education        1    402117 1757063 524.86
## - inequality            1    423031 1777977 525.42
## 
## Step:  AIC=512.65
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
##     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
##     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + 
##     time_prison
## 
##                        Df Sum of Sq     RSS    AIC
## - time_prison           1     10341 1365315 511.01
## - labour_participation  1     10878 1365852 511.03
## - state_pop             1     14127 1369101 511.14
## - nonwhites_per1000     1     21626 1376600 511.39
## - m_per1000f            1     32449 1387423 511.76
## - police_exp59          1     37954 1392929 511.95
## - gdp                   1     39223 1394197 511.99
## <none>                              1354974 512.65
## - unemploy_m24          1     96420 1451395 513.88
## - police_exp60          1    144302 1499277 515.41
## - unemploy_m39          1    189859 1544834 516.81
## - percent_m             1    195084 1550059 516.97
## - prob_prison           1    204463 1559437 517.26
## - mean_education        1    403140 1758114 522.89
## - inequality            1    488834 1843808 525.13
## 
## Step:  AIC=511.01
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
##     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
##     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
## 
##                        Df Sum of Sq     RSS    AIC
## - labour_participation  1     10533 1375848 509.37
## - nonwhites_per1000     1     15482 1380797 509.54
## - state_pop             1     21846 1387161 509.75
## - police_exp59          1     28932 1394247 509.99
## - gdp                   1     36070 1401385 510.23
## - m_per1000f            1     41784 1407099 510.42
## <none>                              1365315 511.01
## - unemploy_m24          1     91420 1456735 512.05
## - police_exp60          1    134137 1499452 513.41
## - unemploy_m39          1    184143 1549458 514.95
## - percent_m             1    186110 1551425 515.01
## - prob_prison           1    237493 1602808 516.54
## - mean_education        1    409448 1774763 521.33
## - inequality            1    502909 1868224 523.75
## 
## Step:  AIC=509.37
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
##     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
##     unemploy_m39 + gdp + inequality + prob_prison
## 
##                     Df Sum of Sq     RSS    AIC
## - nonwhites_per1000  1     11675 1387523 507.77
## - police_exp59       1     21418 1397266 508.09
## - state_pop          1     27803 1403651 508.31
## - m_per1000f         1     31252 1407100 508.42
## - gdp                1     35035 1410883 508.55
## <none>                           1375848 509.37
## - unemploy_m24       1     80954 1456802 510.06
## - police_exp60       1    123896 1499744 511.42
## - unemploy_m39       1    190746 1566594 513.47
## - percent_m          1    217716 1593564 514.27
## - prob_prison        1    226971 1602819 514.54
## - mean_education     1    413254 1789103 519.71
## - inequality         1    500944 1876792 521.96
## 
## Step:  AIC=507.77
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
##     m_per1000f + state_pop + unemploy_m24 + unemploy_m39 + gdp + 
##     inequality + prob_prison
## 
##                  Df Sum of Sq     RSS    AIC
## - police_exp59    1     16706 1404229 506.33
## - state_pop       1     25793 1413315 506.63
## - m_per1000f      1     26785 1414308 506.66
## - gdp             1     31551 1419073 506.82
## <none>                        1387523 507.77
## - unemploy_m24    1     83881 1471404 508.52
## - police_exp60    1    118348 1505871 509.61
## - unemploy_m39    1    201453 1588976 512.14
## - prob_prison     1    216760 1604282 512.59
## - percent_m       1    309214 1696737 515.22
## - mean_education  1    402754 1790276 517.74
## - inequality      1    589736 1977259 522.41
## 
## Step:  AIC=506.33
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
##     state_pop + unemploy_m24 + unemploy_m39 + gdp + inequality + 
##     prob_prison
## 
##                  Df Sum of Sq     RSS    AIC
## - state_pop       1     22345 1426575 505.07
## - gdp             1     32142 1436371 505.39
## - m_per1000f      1     36808 1441037 505.54
## <none>                        1404229 506.33
## - unemploy_m24    1     86373 1490602 507.13
## - unemploy_m39    1    205814 1610043 510.76
## - prob_prison     1    218607 1622836 511.13
## - percent_m       1    307001 1711230 513.62
## - mean_education  1    389502 1793731 515.83
## - inequality      1    608627 2012856 521.25
## - police_exp60    1   1050202 2454432 530.57
## 
## Step:  AIC=505.07
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
##     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
## 
##                  Df Sum of Sq     RSS    AIC
## - gdp             1     26493 1453068 503.93
## <none>                        1426575 505.07
## - m_per1000f      1     84491 1511065 505.77
## - unemploy_m24    1     99463 1526037 506.24
## - prob_prison     1    198571 1625145 509.20
## - unemploy_m39    1    208880 1635455 509.49
## - percent_m       1    320926 1747501 512.61
## - mean_education  1    386773 1813348 514.35
## - inequality      1    594779 2021354 519.45
## - police_exp60    1   1127277 2553852 530.44
## 
## Step:  AIC=503.93
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
##     unemploy_m24 + unemploy_m39 + inequality + prob_prison
## 
##                  Df Sum of Sq     RSS    AIC
## <none>                        1453068 503.93
## - m_per1000f      1    103159 1556227 505.16
## - unemploy_m24    1    127044 1580112 505.87
## - prob_prison     1    247978 1701046 509.34
## - unemploy_m39    1    255443 1708511 509.55
## - percent_m       1    296790 1749858 510.67
## - mean_education  1    445788 1898855 514.51
## - inequality      1    738244 2191312 521.24
## - police_exp60    1   1672038 3125105 537.93

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = crime)
## 
## Coefficients:
##    (Intercept)       percent_m  mean_education    police_exp60      m_per1000f  
##      -6426.101           9.332          18.012          10.265           2.234  
##   unemploy_m24    unemploy_m39      inequality     prob_prison  
##         -6.087          18.735           6.133       -3796.032

Now, we get our new model. We want to see the performance of the model with summary() function.

summary(lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
    m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
    data = crime))

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.70 -111.07    3.03  122.15  483.30 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6426.101   1194.611  -5.379 4.04e-06 ***
## percent_m          9.332      3.350   2.786  0.00828 ** 
## mean_education    18.012      5.275   3.414  0.00153 ** 
## police_exp60      10.265      1.552   6.613 8.26e-08 ***
## m_per1000f         2.234      1.360   1.642  0.10874    
## unemploy_m24      -6.087      3.339  -1.823  0.07622 .  
## unemploy_m39      18.735      7.248   2.585  0.01371 *  
## inequality         6.133      1.396   4.394 8.63e-05 ***
## prob_prison    -3796.032   1490.646  -2.547  0.01505 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
## F-statistic: 17.74 on 8 and 38 DF,  p-value: 1.159e-10

The model is good with R-squard of 0.7888 and adjusted R-squared of 0.7444. We will save it into new object called crime_new as our new model.

crime_new  <- lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
    m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
    data = crime)
crime_new

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = crime)
## 
## Coefficients:
##    (Intercept)       percent_m  mean_education    police_exp60      m_per1000f  
##      -6426.101           9.332          18.012          10.265           2.234  
##   unemploy_m24    unemploy_m39      inequality     prob_prison  
##         -6.087          18.735           6.133       -3796.032

These are our regression model to predict crime_rate.

Model crime_new

crime_rate = 9.332(percent_m) + 18.012(mean_education) + 10.265(police_exp60) + 2.234(m_per1000f) - 6.087(unemploy_m24) + 18.735(unemploy_m39) + 6.133 (inequality ) - 3796.032(prob_prison) - 6426.101

Model crime_s

crime_rate = 144.464 + 8.948(police_exp60)

5. Prediction Model and Error Testing

Now, we want to test the model by using regression model into our prediction model.

Prediction model based on crime_new

pred_new <- predict(crime_new, crime, interval = "confidence", level = 0.95)
summary(pred_new)

##       fit              lwr              upr        
##  Min.   : 301.9   Min.   : 118.1   Min.   : 485.7  
##  1st Qu.: 723.7   1st Qu.: 555.1   1st Qu.: 885.5  
##  Median : 797.6   Median : 663.3   Median : 963.5  
##  Mean   : 905.1   Mean   : 736.3   Mean   :1073.8  
##  3rd Qu.:1124.7   3rd Qu.: 932.5   3rd Qu.:1272.0  
##  Max.   :1932.2   Max.   :1668.5   Max.   :2195.9

Prediction model based on crime_s

pred_s <- predict(crime_s, crime, interval = "confidence", level = 0.95)
summary(pred_s)

##       fit              lwr              upr        
##  Min.   : 547.1   Min.   : 406.3   Min.   : 688.0  
##  1st Qu.: 703.7   1st Qu.: 598.7   1st Qu.: 808.8  
##  Median : 842.4   Median : 756.7   Median : 928.2  
##  Mean   : 905.1   Mean   : 792.2   Mean   :1018.0  
##  3rd Qu.:1079.6   3rd Qu.: 979.4   3rd Qu.:1179.7  
##  Max.   :1629.9   Max.   :1385.4   Max.   :1874.4

Now, we want to check Residual Standard Error (RSE) of the model. Now we can use formula to calculate RSE to check if the error of the prediction model is high or low. The lower the RSE means that the model can predict better.

crime_new RSE

sqrt(mean((crime_new$residuals^2)))

## [1] 175.8304

crime_s RSE

sqrt(mean((crime_s$residuals^2)))

## [1] 277.8192

From the model crime_new has lower RSE than crime_s means crime_new has lower error than the other model to predict crime_rate.

6. Evaluating Model

We can see the error of the models using graph to see how the errors is shown.

Normality Histogram

hist(crime_new$residuals)

hist(crime_s$residuals)

From the graph we can see the histogram looks like has normal distribution. To analyze further we can use shapiro.test to check the value of normal distribution.

crime_new Shapiro-Wilk normality test

shapiro.test(crime_new$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  crime_new$residuals
## W = 0.98511, p-value = 0.8051

crime_s Shapiro-Wilk normality test

shapiro.test(crime_s$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  crime_s$residuals
## W = 0.97601, p-value = 0.439

Shapiro-Wilk hypothesis test are:

H0: Residuals are normally distributed

H1: Residuals are not normally distributed

For both model, P-value > 0.05 then H0 accepted. It means residuals are distributed normally.

Heteroscedasticity

plot(crime$crime_rate, crime_new$residuals)
abline(h = 0, col = "red")

bptest(crime_new)

## 
##  studentized Breusch-Pagan test
## 
## data:  crime_new
## BP = 13.51, df = 8, p-value = 0.09546

plot(crime$crime_rate, crime_s$residuals)
abline(h = 0, col = "red")

bptest(crime_s)

## 
##  studentized Breusch-Pagan test
## 
## data:  crime_s
## BP = 21.098, df = 1, p-value = 4.364e-06

For both model, P-value > 0.05 so h0 accepted. It means the residual don’t have heteroscedasticity pattern.

Variance Inflation Factor (Multicollinearity)

VIF can be calculated if there are more than one predictor. Only crime_new model has many predictor so we can calculate its VIF.

vif(crime_new)

##      percent_m mean_education   police_exp60     m_per1000f   unemploy_m24 
##       2.131963       4.189684       2.560496       1.932367       4.360038 
##   unemploy_m39     inequality    prob_prison 
##       4.508106       3.731074       1.381879

There are no predictor that has more than 10. It mean there are no multicollinearity found.

7. Summary and Recommendation

From all the analysis we can interpret that:

Model crime_new perform better than crime_s.
Model crime_new have better R square and adjusted R square.
Model crime_new have much lower RSE, increasing accuracy of prediction of the model.
Both model have normal distribution.
There are no multicollinearity in model crime_new.

From those interpretation, we recommend using model crime_new in predicting crime_rate in crime.csv data set. The model perform relatively well and pretty consistent so for future prediction of crime rate there is possibility of using model crime_new or using it as refference.