Linear Regression on Crime Dataset

Introduction

This article will discus Linear Regression Analysis using Crime Dataset. The dataset used by criminologists to study the effect of punishment regimes on crime rates. Using linear regression analysis, we are going to build a model to predict crime rate based on various factors that influence it.

Source: Google.com

Case Analysis

Load Packages

These are the basic packages that I will use throughout the process:

library(readr) # to read data
library(dplyr) # to tidy data
library(GGally) # to make correlation matrix
library(lmtest) # for Breusch-Pagan/heteroscedasticity test
library(car) # for multicollinearity test

Load Dataset

crime <- read_csv("data_input/crime.csv")
crime

Data Wrangling

# inspecting data structure
glimpse(crime)

## Observations: 47
## Variables: 17
## $ X1   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ M    <dbl> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140, 124, 13...
## $ So   <dbl> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, ...
## $ Ed   <dbl> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 105, 108, ...
## $ Po1  <dbl> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121, 75, 67,...
## $ Po2  <dbl> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, 71, 60, ...
## $ LF   <dbl> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632, 580, 59...
## $ M.F  <dbl> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 1029, 966, ...
## $ Pop  <dbl> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 28, 22, 3...
## $ NW   <dbl> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106, 59, 10...
## $ U1   <dbl> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83, 77, 77...
## $ U2   <dbl> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 25, 27, 4...
## $ GDP  <dbl> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526, 657, 58...
## $ Ineq <dbl> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174, 170, 17...
## $ Prob <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399, 0.03420...
## $ Time <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9995, 20....
## $ y    <dbl> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 167...

The dataset was collected in 1960. Since the full description of the dataset wasn’t available, the descriptions for columns were retrieved from the authors of the MASS package. The variables are:

X1: Index
M: percentage of males aged 14-24
So: whether it is in a Southern state. 1 for Yes, 0 for No.
Ed: mean years of schooling
Po1: police expenditure in 1960
Po2: police expenditure in 1959
LF: labour force participation rate
M.F: number of males per 1000 females
Pop: state population
NW: number of non-whites resident per 1000 people
U1: unemployment rate of urban males aged 14-24
U2: unemployment rate of urban males aged 35-39
GDP: gross domestic product per head
Ineq: income inequality
Prob: probability of imprisonment
Time: average time served in prisons
y: crime rate in an unspecified category

We are going to remove the first column because it only contain index data which will not be needed in our analysis. Additionally, we will also rename the columns for easier analysis and transform So column into a factor data type.

# select column
crime <- crime %>% 
  select(-X1)

# rename column 
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
  
# transform & making new column
crime <- crime %>%
  mutate(is_south = as.factor(is_south))

crime

Explanatory Data Analysis

# check for missing value
anyNA(crime)

## [1] FALSE

# correlation matrix
ggcorr(crime, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

The correlation matrix above reveals the correlation between all of the numeric variables of the crime dataset (is_south was excluded because of its categorical value). From this matrix, we can see that police_exp60 and police_exp59 have high correlation with one another. This condition is not ideal for a linear regression because one of the assumption when we are building a linear regression model is non-multicollinearity, ie. there is no predictor variable that correlate strongly with one another. Further discussion about multicollinearity is in the last section of this article. Based on this result, I will combine those two variables into one pol_exp column that holds police expenditures data from the last 2 years.

crimex <- crime %>%
  mutate(pol_exp = police_exp60+police_exp59) %>% 
  select(-c(police_exp60,police_exp59))

crimex

# updated correlation matrix
ggcorr(crimex, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

From the updated correlation matrix, it can be seen that the police expenditures have the highest correlation with the crime rate. Meanwhile, the number of non-white residents has no correlation with the crime rate. From the analysis above, I decided to not using nonwhites_per1000 variable as a predictor as it has no correlation for our target variable. The use of nonwhites_per1000 variable might affect the calculation of our model variable significance and promoting incorrect significancies).

Model Building

Simple Linear Regression

In building simple linear regression, it is important to have predictor variable that has high linear relationship with the target variables. Based on the correlation matrix, I will be using pol_exp as my predictor variable. Prior to model building, I will inspect if there is any outlier from our data.

# further EDA - outlier inspection
plot(crimex$pol_exp, crimex$crime_rate)

boxplot(crimex$pol_exp)

Based on the boxplot, there is one outlier in our pol_exp data. I will remove this outlier first before building our model because based on the scatterplot, it looks like it has a high influence on our model.

# remove outliers
crime.clean <- crimex %>% 
  filter(pol_exp<320)
         
boxplot(crime.clean$pol_exp)

plot(crime.clean$pol_exp, crime.clean$crime_rate)

# model building
slm.o <- lm(crime_rate ~ pol_exp, crimex) 
slm <- lm(crime_rate ~ pol_exp, crime.clean) # without outlier


# plotting the linear regression model
plot(crimex$pol_exp, crimex$crime_rate)
abline(slm.o$coefficients[1],slm.o$coefficients[2], col = "red")
abline(slm$coefficients[1],slm$coefficients[2], col = "green") # without outlier

summary(slm.o)

## 
## Call:
## lm(formula = crime_rate ~ pol_exp, data = crimex)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -581.07 -151.94   20.33  147.24  580.59 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 152.0674   128.5331   1.183    0.243    
## pol_exp       4.5573     0.7354   6.197 1.59e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 287.2 on 45 degrees of freedom
## Multiple R-squared:  0.4605, Adjusted R-squared:  0.4485 
## F-statistic:  38.4 on 1 and 45 DF,  p-value: 1.591e-07

summary(slm)

## 
## Call:
## lm(formula = crime_rate ~ pol_exp, data = crime.clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -650.8 -153.4   28.0  163.1  541.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.4783   130.6502   0.348    0.729    
## pol_exp       5.2941     0.7679   6.894 1.64e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.8 on 44 degrees of freedom
## Multiple R-squared:  0.5193, Adjusted R-squared:  0.5083 
## F-statistic: 47.53 on 1 and 44 DF,  p-value: 1.636e-08

From both of the regression model, pol_exp has positive corelation with crime.rate and is significantly affecting crime_rate (3 stars/p-value < 0.05).

If we look at the abline of the plot, we can see that the outliers of our data have a high influence over our model. Therefore, it is safer to remove the outliers from the data. The removal of outliers increase the R-squared of the model ~0.6 points. Based on this analysis, I will use slm as our simple linear model. Additionally, It also has higher R-squared as one of the tools for evaluating a linear regression model.

R-squared value tells us how well our model describes our data. It measures the extent to which the variance in our target variable (crime_rate) can be explained by the predictor variables (pol_exp, etc.). As we increase the number of predictor variables, our model’s R-squared value will also increase as it incorporates new legitimate information as well as the noise introduced by these extra variables. Because of this, it is better to evaluate the adjusted R-squared of our model. Adjusted R-squared does not increase the way regular multiple R-squared does because it is adjusted for the number of predictor variables in our model. It increases only when the new variable actually leads to a better prediction.

slm model can be written as:

crime_rate = 45.4783 + 5.2941(pol_exp)

which also means:

every 1 unit increase of police expenditure in the last 2 years, there will be an increase in crime rate at about 5.2941 times of the police expenditure.

Even so, our simple linear model only have an adjusted R-squared of 0.5083. An improvement can be made by adding more data/adding more variables to our model that may explain our target. On the constraint of only having this dataset, I will try to add more variables and use multiple linear regression for this case.

Multiple Linear Regression

On building multiple linear regression, I will use automatic feature selection using step-wise regression with backward elimination. I will use crime_rate as our target variable and all the remaining variables as our predictor, except the nonwhites_per1000 variable which has zero correlation with crime_rate.

mlm <- lm(crime_rate ~ percent_m + is_south + mean_education + labour_participation + 
    m_per1000f + state_pop + unemploy_m24 + 
    unemploy_m39 + gdp + inequality + prob_prison + time_prison + 
    pol_exp, crime.clean)
step(mlm, direction = "backward")

## Start:  AIC=499.41
## crime_rate ~ percent_m + is_south + mean_education + labour_participation + 
##     m_per1000f + state_pop + unemploy_m24 + unemploy_m39 + gdp + 
##     inequality + prob_prison + time_prison + pol_exp
## 
##                        Df Sum of Sq     RSS    AIC
## - state_pop             1        22 1298610 497.42
## - is_south              1       113 1298701 497.42
## - time_prison           1      3277 1301864 497.53
## - labour_participation  1      3529 1302116 497.54
## - gdp                   1      8271 1306859 497.71
## - unemploy_m24          1     36151 1334738 498.68
## - m_per1000f            1     42497 1341084 498.90
## <none>                              1298587 499.41
## - unemploy_m39          1     88024 1386611 500.43
## - prob_prison           1    117246 1415833 501.39
## - percent_m             1    168873 1467461 503.04
## - mean_education        1    299178 1597766 506.95
## - inequality            1    432350 1730938 510.63
## - pol_exp               1   1067853 2366441 525.02
## 
## Step:  AIC=497.42
## crime_rate ~ percent_m + is_south + mean_education + labour_participation + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + gdp + inequality + 
##     prob_prison + time_prison + pol_exp
## 
##                        Df Sum of Sq     RSS    AIC
## - is_south              1       114 1298724 495.42
## - time_prison           1      3523 1302132 495.54
## - labour_participation  1      3546 1302155 495.54
## - gdp                   1      8633 1307242 495.72
## - unemploy_m24          1     36360 1334970 496.69
## - m_per1000f            1     48877 1347487 497.12
## <none>                              1298610 497.42
## - unemploy_m39          1     88625 1387234 498.45
## - prob_prison           1    117555 1416165 499.40
## - percent_m             1    168904 1467514 501.04
## - mean_education        1    299444 1598054 504.96
## - inequality            1    464523 1763132 509.48
## - pol_exp               1   1163183 2461793 524.84
## 
## Step:  AIC=495.42
## crime_rate ~ percent_m + mean_education + labour_participation + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + gdp + inequality + 
##     prob_prison + time_prison + pol_exp
## 
##                        Df Sum of Sq     RSS    AIC
## - time_prison           1      3507 1302231 493.54
## - labour_participation  1      4958 1303683 493.59
## - gdp                   1      9106 1307830 493.74
## - unemploy_m24          1     41820 1340544 494.88
## - m_per1000f            1     49582 1348307 495.14
## <none>                              1298724 495.42
## - unemploy_m39          1     93362 1392087 496.61
## - prob_prison           1    128301 1427026 497.75
## - percent_m             1    180480 1479204 499.41
## - mean_education        1    299371 1598095 502.96
## - inequality            1    597093 1895817 510.82
## - pol_exp               1   1168618 2467342 522.94
## 
## Step:  AIC=493.54
## crime_rate ~ percent_m + mean_education + labour_participation + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + gdp + inequality + 
##     prob_prison + pol_exp
## 
##                        Df Sum of Sq     RSS    AIC
## - labour_participation  1      3805 1306036 491.68
## - gdp                   1     10889 1313120 491.93
## - unemploy_m24          1     44275 1346506 493.08
## - m_per1000f            1     46320 1348551 493.15
## <none>                              1302231 493.54
## - unemploy_m39          1    102971 1405202 495.04
## - percent_m             1    213507 1515738 498.53
## - prob_prison           1    220749 1522981 498.75
## - mean_education        1    295885 1598116 500.96
## - inequality            1    610210 1912441 509.22
## - pol_exp               1   1172569 2474800 521.08
## 
## Step:  AIC=491.68
## crime_rate ~ percent_m + mean_education + m_per1000f + unemploy_m24 + 
##     unemploy_m39 + gdp + inequality + prob_prison + pol_exp
## 
##                  Df Sum of Sq     RSS    AIC
## - gdp             1     10398 1316434 490.04
## - unemploy_m24    1     40481 1346517 491.08
## - m_per1000f      1     44751 1350787 491.23
## <none>                        1306036 491.68
## - unemploy_m39    1    104836 1410871 493.23
## - prob_prison     1    218502 1524538 496.79
## - percent_m       1    231590 1537626 497.19
## - mean_education  1    300525 1606560 499.20
## - inequality      1    609144 1915180 507.29
## - pol_exp         1   1208349 2514385 519.81
## 
## Step:  AIC=490.04
## crime_rate ~ percent_m + mean_education + m_per1000f + unemploy_m24 + 
##     unemploy_m39 + inequality + prob_prison + pol_exp
## 
##                  Df Sum of Sq     RSS    AIC
## - unemploy_m24    1     47775 1364208 489.68
## - m_per1000f      1     50927 1367360 489.79
## <none>                        1316434 490.04
## - unemploy_m39    1    118175 1434609 492.00
## - percent_m       1    221285 1537719 495.19
## - prob_prison     1    257245 1573678 496.25
## - mean_education  1    327854 1644287 498.27
## - inequality      1    828730 2145163 510.50
## - pol_exp         1   1736089 3052523 526.73
## 
## Step:  AIC=489.68
## crime_rate ~ percent_m + mean_education + m_per1000f + unemploy_m39 + 
##     inequality + prob_prison + pol_exp
## 
##                  Df Sum of Sq     RSS    AIC
## - m_per1000f      1     18044 1382252 488.29
## <none>                        1364208 489.68
## - unemploy_m39    1     86771 1450980 490.52
## - percent_m       1    225573 1589781 494.72
## - prob_prison     1    262958 1627167 495.79
## - mean_education  1    297172 1661380 496.75
## - inequality      1    952361 2316569 512.04
## - pol_exp         1   3019469 4383677 541.38
## 
## Step:  AIC=488.29
## crime_rate ~ percent_m + mean_education + unemploy_m39 + inequality + 
##     prob_prison + pol_exp
## 
##                  Df Sum of Sq     RSS    AIC
## <none>                        1382252 488.29
## - unemploy_m39    1    115767 1498020 489.99
## - prob_prison     1    257895 1640148 494.16
## - percent_m       1    265460 1647713 494.37
## - mean_education  1    530001 1912253 501.22
## - inequality      1   1020942 2403194 511.73
## - pol_exp         1   3001462 4383714 539.38

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + unemploy_m39 + 
##     inequality + prob_prison + pol_exp, data = crime.clean)
## 
## Coefficients:
##    (Intercept)       percent_m  mean_education    unemploy_m39  
##      -4605.850           8.724          16.721           7.064  
##     inequality     prob_prison         pol_exp  
##          7.051       -3863.787           6.796

This step-wise regression method will return the best formula that gives lowest AIC (Akaike Information Criterion), whereas the lower the AIC, the lower the information loss that our model has.

# assigning multiple linear regression model to an object
mlm.step <- lm(formula = crime_rate ~ percent_m + mean_education + unemploy_m39 + 
    inequality + prob_prison + pol_exp, data = crime.clean)

summary(mlm.step)

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + unemploy_m39 + 
##     inequality + prob_prison + pol_exp, data = crime.clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -575.90 -117.56   -1.68  125.80  448.44 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -4605.8496   860.8376  -5.350 4.12e-06 ***
## percent_m          8.7238     3.1876   2.737 0.009295 ** 
## mean_education    16.7212     4.3241   3.867 0.000407 ***
## unemploy_m39       7.0635     3.9083   1.807 0.078433 .  
## inequality         7.0513     1.3138   5.367 3.91e-06 ***
## prob_prison    -3863.7869  1432.3620  -2.697 0.010267 *  
## pol_exp            6.7957     0.7385   9.202 2.55e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 188.3 on 39 degrees of freedom
## Multiple R-squared:  0.7985, Adjusted R-squared:  0.7676 
## F-statistic: 25.77 on 6 and 39 DF,  p-value: 3.888e-12

If we compare between our simple linear regression and multiple linear regression model, we can see that our multiple linear regression gave better adjusted R-squared of 0.7676 from previously 0.5083. Therefore, I will use the mlm.step model for our prediction.

From this model, we can see that probability have negative correlation with crime_rate, while other variable positively correlate with crime_rate. The formula of our final model is:

crime_rate = 8.7238(percent_m) + 16.7212(mean_education) + 7.0635(unemploy_m39) + 7.0513(inequality) - 3863.7869(prob_prison) + 6.7957(pol_exp) - 4605.8496

Model Prediction

# make prediction
crimex$cr_pred <- predict(mlm.step, crimex)

# calculating RMSE
sqrt(mean((crimex$cr_pred - crimex$crime_rate)^2))

## [1] 194.0982

range(crime$crime_rate)

## [1]  342 1993

Based on this result, our linear regression model still gave a considerably high error to our prediction and therefore still needs improvement. This result is actually quite expected from the look of our scatterplot, whereas our data only vaguely resemble a linear line between our predictor (which has the highest correlation) with our target variable.

We can improve this model by adding more data (47 observations might not be enough), or searching for another predictor variables that may have a better linear correlation with our target variable)

Another machine learning approach to predict the crime rate should also be considered as this linear regression model is not very robust compared to other machine learning algorithms such as Random Forest.

Model Assumption Evaluation

Linearity

This assumption is already checked prior to model building using correlation matrix in EDA step, and through scatterplot between our predictor and our target variable.

Normality

hist(mlm.step$residuals, breaks = 5)

shapiro.test(mlm.step$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  mlm.step$residuals
## W = 0.96654, p-value = 0.2051

P-value > 0.05; H0 accepted. This result means that the residuals of our model distributed normally, therefore our model will have error around its mean.

Heteroscedasticity

plot(crime.clean$crime_rate, mlm.step$residuals)
abline(h = 0, col = "red")

bptest(mlm.step)

## 
##  studentized Breusch-Pagan test
## 
## data:  mlm.step
## BP = 8.3082, df = 6, p-value = 0.2164

P-value > 0.05; H0 accepted. This means that our model residuals do not have a pattern that cannot be captured by our model.

Variance Inflation Factor (Multicollinearity)

vif(mlm.step)

##      percent_m mean_education   unemploy_m39     inequality    prob_prison 
##       1.970723       3.035998       1.412245       3.486175       1.343397 
##        pol_exp 
##       1.955926

There are no values equal to or more than 10 so Multicollinearity between variables is not found (predictor variables are mutually independent).

Based on the analysis, our mlm.step model passed all of the assumption evaluation.

Conclusions

The mlm.step model gave an R-squared of 0.7676 and has an RMSE of 194. The predictor variables that will contribute to the mlm.step model include percent_m, mean_education, unemploy_m39, inequality, prob_prison, and pol_exp. We can improve this model by adding more data (47 observations might not be enough), or searching for another predictor variables that may have a better linear correlation with our target variable). You can also optimize the prediction by using different machine learning approach to predict the crime rate. As you know, the linear regression model is not very robust compared to other machine learning algorithms such as Random Forest, although it has great interpretability.

Lastly,

Thank you for reading, happy learning, and keep improving!