Introduction

This report details the process of building a linear regression model to predict the crime rate from a dataset of crime statistics.

Data: Crime Rate

Data Inspection

We first load the required libraries.

library(tidyverse)
library(caret)
library(plotly)
library(ggplot2)
library(data.table)
library(GGally)
library(tidymodels)
library(car)
library(scales)
library(lmtest)

options(scipen = 100, max.print = 1e+06)

Then we load the data.

Our dataset has 17 columns and 47 rows.
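
We can confirm this with a quick dimension check (a minimal sketch, assuming the raw file has already been read into crime as shown further below):

dim(crime)
## [1] 47 17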

The column names are:

colnames(crime)
##  [1] "V1"   "M"    "So"   "Ed"   "Po1"  "Po2"  "LF"   "M.F"  "Pop"  "NW"  
## [11] "U1"   "U2"   "GDP"  "Ineq" "Prob" "Time" "y"

We will rename the columns as follows:

crime <- read.csv("data_input/crime.csv") %>% 
  dplyr::select(-X)

names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60",
                  "police_exp59", "labour_participation", "m_per1000f",
                  "state_pop", "nonwhites_per1000", "unemploy_m24",
                  "unemploy_m39", "gdp", "inequality", "prob_prison",
                  "time_prison", "crime_rate")

str(crime)
## 'data.frame':    47 obs. of  16 variables:
##  $ percent_m           : int  151 143 142 136 141 121 127 131 157 140 ...
##  $ is_south            : int  1 0 1 0 0 0 1 1 1 0 ...
##  $ mean_education      : int  91 113 89 121 121 110 111 109 90 118 ...
##  $ police_exp60        : int  58 103 45 149 109 118 82 115 65 71 ...
##  $ police_exp59        : int  56 95 44 141 101 115 79 109 62 68 ...
##  $ labour_participation: int  510 583 533 577 591 547 519 542 553 632 ...
##  $ m_per1000f          : int  950 1012 969 994 985 964 982 969 955 1029 ...
##  $ state_pop           : int  33 13 18 157 18 25 4 50 39 7 ...
##  $ nonwhites_per1000   : int  301 102 219 80 30 44 139 179 286 15 ...
##  $ unemploy_m24        : int  108 96 94 102 91 84 97 79 81 100 ...
##  $ unemploy_m39        : int  41 36 33 39 20 29 38 35 28 24 ...
##  $ gdp                 : int  394 557 318 673 578 689 620 472 421 526 ...
##  $ inequality          : int  261 194 250 167 174 126 168 206 239 174 ...
##  $ prob_prison         : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
##  $ time_prison         : num  26.2 25.3 24.3 29.9 21.3 ...
##  $ crime_rate          : int  791 1635 578 1969 1234 682 963 1555 856 705 ...

After renaming, the variables are:

- percent_m: percentage of males aged 14-24
- is_south: whether the state is a Southern state (1 = Yes, 0 = No)
- mean_education: mean years of schooling
- police_exp60: police expenditure in 1960
- police_exp59: police expenditure in 1959
- labour_participation: labour force participation rate
- m_per1000f: number of males per 1000 females
- state_pop: state population
- nonwhites_per1000: number of non-white residents per 1000 people
- unemploy_m24: unemployment rate of urban males aged 14-24
- unemploy_m39: unemployment rate of urban males aged 35-39
- gdp: gross domestic product per head
- inequality: income inequality
- prob_prison: probability of imprisonment
- time_prison: average time served in prison
- crime_rate: crime rate in an unspecified category

We check for missing values.

colSums(is.na(crime))
##            percent_m             is_south       mean_education 
##                    0                    0                    0 
##         police_exp60         police_exp59 labour_participation 
##                    0                    0                    0 
##           m_per1000f            state_pop    nonwhites_per1000 
##                    0                    0                    0 
##         unemploy_m24         unemploy_m39                  gdp 
##                    0                    0                    0 
##           inequality          prob_prison          time_prison 
##                    0                    0                    0 
##           crime_rate 
##                    0

There are no missing values.

Next, we inspect the summary statistics.

summary(crime)
##    percent_m        is_south      mean_education   police_exp60  
##  Min.   :119.0   Min.   :0.0000   Min.   : 87.0   Min.   : 45.0  
##  1st Qu.:130.0   1st Qu.:0.0000   1st Qu.: 97.5   1st Qu.: 62.5  
##  Median :136.0   Median :0.0000   Median :108.0   Median : 78.0  
##  Mean   :138.6   Mean   :0.3404   Mean   :105.6   Mean   : 85.0  
##  3rd Qu.:146.0   3rd Qu.:1.0000   3rd Qu.:114.5   3rd Qu.:104.5  
##  Max.   :177.0   Max.   :1.0000   Max.   :122.0   Max.   :166.0  
##   police_exp59    labour_participation   m_per1000f       state_pop     
##  Min.   : 41.00   Min.   :480.0        Min.   : 934.0   Min.   :  3.00  
##  1st Qu.: 58.50   1st Qu.:530.5        1st Qu.: 964.5   1st Qu.: 10.00  
##  Median : 73.00   Median :560.0        Median : 977.0   Median : 25.00  
##  Mean   : 80.23   Mean   :561.2        Mean   : 983.0   Mean   : 36.62  
##  3rd Qu.: 97.00   3rd Qu.:593.0        3rd Qu.: 992.0   3rd Qu.: 41.50  
##  Max.   :157.00   Max.   :641.0        Max.   :1071.0   Max.   :168.00  
##  nonwhites_per1000  unemploy_m24     unemploy_m39        gdp       
##  Min.   :  2.0     Min.   : 70.00   Min.   :20.00   Min.   :288.0  
##  1st Qu.: 24.0     1st Qu.: 80.50   1st Qu.:27.50   1st Qu.:459.5  
##  Median : 76.0     Median : 92.00   Median :34.00   Median :537.0  
##  Mean   :101.1     Mean   : 95.47   Mean   :33.98   Mean   :525.4  
##  3rd Qu.:132.5     3rd Qu.:104.00   3rd Qu.:38.50   3rd Qu.:591.5  
##  Max.   :423.0     Max.   :142.00   Max.   :58.00   Max.   :689.0  
##    inequality     prob_prison       time_prison      crime_rate    
##  Min.   :126.0   Min.   :0.00690   Min.   :12.20   Min.   : 342.0  
##  1st Qu.:165.5   1st Qu.:0.03270   1st Qu.:21.60   1st Qu.: 658.5  
##  Median :176.0   Median :0.04210   Median :25.80   Median : 831.0  
##  Mean   :194.0   Mean   :0.04709   Mean   :26.60   Mean   : 905.1  
##  3rd Qu.:227.5   3rd Qu.:0.05445   3rd Qu.:30.45   3rd Qu.:1057.5  
##  Max.   :276.0   Max.   :0.11980   Max.   :44.00   Max.   :1993.0

We check the correlations between the variables.

ggcorr(crime, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

Data Wrangling

Upon inspection, all variables are in a suitable form for our analysis. No further modifications to the dataset are required for now.

Regression Modelling

We now split the dataset into “train” and “test” sets. The train set will be used to fit the linear regression model, while the test set will be used to check whether the model overfits. We assign 70% of the data to the train set.

set.seed(123)
samplesize <- round(0.7 * nrow(crime), 0)
index <- sample(seq_len(nrow(crime)), size = samplesize)

crime_train <- crime[index, ]
crime_test <- crime[-index, ]
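
As a quick sanity check, we can verify the split sizes; with 47 rows, a 70% split gives 33 train and 14 test observations:

nrow(crime_train)
## [1] 33
nrow(crime_test)
## [1] 14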

Computation

We begin modelling the dataset using linear regression.

LM Model 1

First, we model “crime_rate” with a linear regression using all the predictor variables.

set.seed(123)
crime_lm <- lm(crime_rate ~ ., data = crime_train)

summary(crime_lm)
## 
## Call:
## lm(formula = crime_rate ~ ., data = crime_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -304.71  -91.82  -31.14  101.84  426.16 
## 
## Coefficients:
##                         Estimate  Std. Error t value Pr(>|t|)    
## (Intercept)          -10332.3968   2232.6587  -4.628  0.00024 ***
## percent_m                 9.6677      4.4748   2.160  0.04530 *  
## is_south                  8.1011    223.3111   0.036  0.97148    
## mean_education           26.2959      7.6978   3.416  0.00329 ** 
## police_exp60             20.0908     11.8628   1.694  0.10858    
## police_exp59            -14.8569     13.5366  -1.098  0.28771    
## labour_participation     -4.6592      2.1721  -2.145  0.04669 *  
## m_per1000f                8.2657      2.8606   2.889  0.01019 *  
## state_pop                 1.3699      1.7925   0.764  0.45521    
## nonwhites_per1000         1.3334      0.9991   1.335  0.19960    
## unemploy_m24            -11.9795      5.1323  -2.334  0.03212 *  
## unemploy_m39             14.2389      9.8688   1.443  0.16724    
## gdp                       0.6436      1.2624   0.510  0.61671    
## inequality                4.2472      2.8552   1.488  0.15518    
## prob_prison             940.1481   3823.0185   0.246  0.80869    
## time_prison              14.7756      9.4304   1.567  0.13558    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 199.9 on 17 degrees of freedom
## Multiple R-squared:  0.8813, Adjusted R-squared:  0.7766 
## F-statistic: 8.414 on 15 and 17 DF,  p-value: 0.00003846

From the summary above we can identify the predictor variables with Pr(>|t|) values larger than 0.05: ‘is_south’, ‘police_exp60’, ‘police_exp59’, ‘state_pop’, ‘nonwhites_per1000’, ‘unemploy_m39’, ‘gdp’, ‘inequality’, ‘prob_prison’ and ‘time_prison’. A Pr(>|t|) value larger than 0.05 indicates a predictor that is not statistically significant for predicting the target ‘crime_rate’. Thus, we can try to simplify the model by removing all such predictors, re-fit the model, and see whether we can improve on the Adjusted R-squared of 0.7766.

crime2 <- crime %>% 
  select(!c(is_south, police_exp60, police_exp59, state_pop, nonwhites_per1000,
            unemploy_m39, gdp, inequality, prob_prison, time_prison))
 
set.seed(123)
samplesize <- round(0.7 * nrow(crime2), 0)
index <- sample(seq_len(nrow(crime2)), size = samplesize)

crime2_train <- crime2[index, ]
crime2_test <- crime2[-index, ]


set.seed(123)
crime_lm2 <- lm(crime_rate ~ ., data = crime2_train)

summary(crime_lm2)
## 
## Call:
## lm(formula = crime_rate ~ ., data = crime2_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -683.32 -308.90  -23.25  185.01  939.75 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)          -3415.575   2711.750  -1.260    0.219
## percent_m                3.349      7.444   0.450    0.656
## mean_education          14.013      8.712   1.608    0.119
## labour_participation    -3.359      2.903  -1.157    0.257
## m_per1000f               5.157      3.948   1.306    0.203
## unemploy_m24            -8.766      6.644  -1.320    0.198
## 
## Residual standard error: 414.2 on 27 degrees of freedom
## Multiple R-squared:  0.1904, Adjusted R-squared:  0.0405 
## F-statistic:  1.27 on 5 and 27 DF,  p-value: 0.3054

In this second attempt the Adjusted R-squared is 0.0405; roughly 4% of the variance in the response variable (crime_rate) is explained by the predictor variables. This is a deterioration rather than an improvement compared to the first attempt. We will keep the first model as LM 1 to compare with other linear models.
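
To compare the two fits directly, we can extract the Adjusted R-squared values programmatically (a small sketch reusing the models fitted above):

summary(crime_lm)$adj.r.squared   # LM 1: ~0.7766
summary(crime_lm2)$adj.r.squared  # second attempt: ~0.0405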

LM Model 2

In this second model we run the stepAIC() function from the MASS package, which performs stepwise regression via backward and forward elimination (direction = "both") and selects the linear regression model with the lowest AIC for us.

library(MASS)

crime_lm <- lm(crime_rate ~., data = crime_train)

# Stepwise regression model
step.model <- stepAIC(crime_lm, direction = "both", 
                      trace = FALSE)
summary(step.model)
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     labour_participation + m_per1000f + nonwhites_per1000 + unemploy_m24 + 
##     unemploy_m39 + inequality + time_prison, data = crime_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -283.69  -80.50  -16.92   82.84  429.87 
## 
## Coefficients:
##                        Estimate Std. Error t value   Pr(>|t|)    
## (Intercept)          -9643.5294  1570.9487  -6.139 0.00000353 ***
## percent_m                8.3543     3.9050   2.139   0.043752 *  
## mean_education          25.3573     6.1671   4.112   0.000459 ***
## police_exp60             8.3365     1.7646   4.724   0.000103 ***
## labour_participation    -3.5144     1.5221  -2.309   0.030727 *  
## m_per1000f               7.1285     2.0723   3.440   0.002337 ** 
## nonwhites_per1000        1.1725     0.6474   1.811   0.083817 .  
## unemploy_m24           -10.2895     3.9585  -2.599   0.016368 *  
## unemploy_m39            16.0240     8.4283   1.901   0.070453 .  
## inequality               4.6137     1.7115   2.696   0.013205 *  
## time_prison             15.7576     5.6929   2.768   0.011222 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.1 on 22 degrees of freedom
## Multiple R-squared:  0.8682, Adjusted R-squared:  0.8082 
## F-statistic: 14.49 on 10 and 22 DF,  p-value: 0.0000001705

Using stepAIC() we achieve an Adjusted R-squared of 0.8082. Notice the predictors chosen for the final formula.
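
We can pull the selected formula straight from the fitted object (a quick sketch):

formula(step.model)  # the ten predictors retained by stepAIC()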

Before we proceed to evaluate the models, we take the opportunity to identify the most important variables.

varImp(step.model)
##                       Overall
## percent_m            2.139380
## mean_education       4.111694
## police_exp60         4.724382
## labour_participation 2.308865
## m_per1000f           3.439948
## nonwhites_per1000    1.811009
## unemploy_m24         2.599377
## unemploy_m39         1.901216
## inequality           2.695704
## time_prison          2.767931

We can see above that the top two predictors are ‘police_exp60’ and ‘mean_education’.
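
To rank the predictors explicitly, we can sort the importance scores (a minimal sketch; for a linear model, varImp() reports the absolute t-statistics):

imp <- varImp(step.model)
imp[order(-imp$Overall), , drop = FALSE]  # scores in decreasing order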

Evaluation

LM 1

We compute the RMSE (Root Mean Square Error) with the train dataset.

# RMSE of train dataset

RMSE(pred = crime_lm$fitted.values, obs = crime_train$crime_rate)
## [1] 143.446

We compute the RMSE (Root Mean Square Error) with the test dataset.

lm_pred1 <- predict(crime_lm, newdata = crime_test)
RMSE(pred = lm_pred1, obs = crime_test$crime_rate)
## [1] 434.199

The RMSE (root mean square error) measures the difference between the actual and the predicted values. There is a clear gap between the RMSEs for the train and test datasets (train RMSE < test RMSE). This suggests that the model overfits the train dataset and is not a good fit for unseen data.
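
For reference, the RMSE is simply the square root of the mean squared difference between predictions and actuals; caret’s RMSE() is equivalent to the following sketch:

rmse_manual <- function(pred, obs) {
  sqrt(mean((pred - obs)^2))  # root of the mean squared error
}
rmse_manual(lm_pred1, crime_test$crime_rate)  # matches the RMSE() value above, ~434.2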

LM 2

We compute the RMSE (Root Mean Square Error) with the train dataset.

RMSE(pred = step.model$fitted.values, obs = crime_train$crime_rate)
## [1] 151.1674

We compute the RMSE (Root Mean Square Error) with the test dataset.

lm_pred2 <- predict(step.model, newdata = crime_test)
RMSE(pred = lm_pred2, obs = crime_test$crime_rate)
## [1] 376.4441

Again the RMSE for the train dataset is lower than the RMSE for the test dataset, but the gap is narrower than in LM 1, indicating less overfitting.
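
Collecting the four figures in one table makes the comparison easier to read (a sketch reusing the objects created above):

data.frame(
  model      = c("LM 1", "LM 2"),
  rmse_train = c(RMSE(pred = crime_lm$fitted.values, obs = crime_train$crime_rate),
                 RMSE(pred = step.model$fitted.values, obs = crime_train$crime_rate)),
  rmse_test  = c(RMSE(pred = lm_pred1, obs = crime_test$crime_rate),
                 RMSE(pred = lm_pred2, obs = crime_test$crime_rate))
)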

Assumptions

A linear regression model is only as good as the validity of its assumptions. In this section we check these assumptions for the best linear regression model we have obtained, LM 2.

The assumptions of a linear regression model are:

- Linearity: the relationship between the predictors and the mean of the target is linear.

- Homoscedasticity: the variance of the residuals is the same for any value of the predictors.

- Independence: observations are independent of each other.

- Normality: for any fixed value of the predictors, the target is normally distributed.
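
Each assumption is verified in the subsections that follow. As a compact preview, the formal checks amount to the following sketch (linearity is assessed visually from the diagnostic plots):

shapiro.test(residuals(step.model))  # Normality of the residuals
durbinWatsonTest(step.model)         # Independence / no autocorrelation
bptest(step.model)                   # Homoscedasticity (Breusch-Pagan)
vif(step.model)                      # Multicollinearity (VIF)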

Linearity Test

The linear regression model assumes that the relationship between the predictors and the target is linear.

model.diag.metrics <- augment(step.model)
head(model.diag.metrics)
## # A tibble: 6 × 18
##   .rownames crime_rate percent_m mean_…¹ polic…² labou…³ m_per…⁴ nonwh…⁵ unemp…⁶
##   <chr>          <int>     <int>   <int>   <int>   <int>   <int>   <int>   <int>
## 1 31               373       140      93      55     535    1045      20     135
## 2 15               798       152      87      57     530     986      72      92
## 3 14               664       135     117      62     595     986      46      77
## 4 3                578       142      89      45     533     969     219      94
## 5 42               542       141     109      56     523     968       2     107
## 6 37               831       177      87      58     638     974     349      76
## # … with 9 more variables: unemploy_m39 <int>, inequality <int>,
## #   time_prison <dbl>, .fitted <dbl>, .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## #   .cooksd <dbl>, .std.resid <dbl>, and abbreviated variable names
## #   ¹​mean_education, ²​police_exp60, ³​labour_participation, ⁴​m_per1000f,
## #   ⁵​nonwhites_per1000, ⁶​unemploy_m24
library(ggfortify)
## Registered S3 method overwritten by 'ggfortify':
##   method          from   
##   autoplot.glmnet parsnip
par(mfrow = c(2, 2))
autoplot(step.model)

In the residuals vs fitted plot above, the blue line remains close to the dotted line and the residuals show no regular pattern; both are signs that the linearity assumption holds for the model.

The Q-Q plot is close to a straight line, with no signs of skewness; this is typical of data drawn from a normal distribution.

The residuals vs leverage plot shows that the model produces some extreme predictions (outliers relative to the actual data) that are highly influential on the error measures.

The scale-location plot (square root of the standardized residuals vs fitted values) does not show a horizontal blue line but rather some variation with respect to the fitted values; this may suggest the presence of heteroscedasticity.

We can conclude that the linearity assumption holds for this model.

Normality Test

In addition to the Q-Q plot, we can perform the Shapiro-Wilk test to confirm normality.

shapiro.test(step.model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  step.model$residuals
## W = 0.97453, p-value = 0.6143

The p-value is greater than 0.05, so we fail to reject the null hypothesis of normality; the residuals can be assumed to be normally distributed, which confirms our reading of the Q-Q plot.

Autocorrelation

The linear model assumes that there is no correlation between the residuals. The Durbin-Watson test detects the presence of autocorrelation in the residuals of a regression model.

durbinWatsonTest(step.model)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1715621      1.641577   0.258
##  Alternative hypothesis: rho != 0

The D-W statistic is between 1.5 and 2.5 and the p-value is above 0.05, hence the null hypothesis is not rejected. We read this as an indication that there is no autocorrelation between the residuals.

Heteroscedasticity

We perform the studentized Breusch-Pagan test to formally check for heteroscedasticity.

bptest(step.model)
## 
##  studentized Breusch-Pagan test
## 
## data:  step.model
## BP = 9.5527, df = 10, p-value = 0.4806

Since the p-value is greater than 0.05, there is insufficient evidence to conclude that heteroscedasticity is present. This does not concur with our tentative reading of the scale-location plot; we defer to the formal test and treat the homoscedasticity assumption as satisfied.

Multicollinearity

We compute the Variance Inflation Factors (VIF) of the model's predictors.

vif(step.model)
##            percent_m       mean_education         police_exp60 
##             2.008588             4.492356             3.075057 
## labour_participation           m_per1000f    nonwhites_per1000 
##             3.797952             3.449977             2.662333 
##         unemploy_m24         unemploy_m39           inequality 
##             4.145015             4.330561             3.842859 
##          time_prison 
##             1.739850

There are no values >= 10, hence there is no evidence of multicollinearity.

Interpretation & Summary

The best regression model (LM 2) was obtained by stepwise regression in both the forward and backward directions. LM 2 achieves an Adjusted R-squared of 0.8082. All the tests conducted on LM 2 are consistent with the assumptions of a linear regression model.