The data we will be evaluating today comes from 1960 and was used by criminologists to study the effect of punishment regimes on crime rates. In this exercise, we'll build a linear regression model to predict the crime rate and see how different kinds of predictor variables affect it.

We will use MSE and RMSE as measures of our model's performance. For those who are not familiar, MSE (mean squared error) and RMSE (root mean squared error) quantify our error rate. Their values depend on the range of the target: if we get an MSE of 1000 and our data is in the millions, that is a very low error rate, but if our MSE is 1000 and our data is in the hundreds, then our error rate is very high.
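As a quick illustration, here is a minimal sketch of the two metrics (these helper functions are not from the original analysis, just an illustration of the formulas):

```r
# MSE: mean of squared prediction errors; RMSE: its square root
mse  <- function(actual, predicted) mean((actual - predicted)^2)
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))
```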
library(dplyr)

crime <- read.csv("data_input/crime.csv") %>% 
  dplyr::select(-X)

names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")

NA check
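The FALSE below was presumably produced by a missing-value check along these lines (the exact call isn't shown in the output):

```r
# Returns TRUE if any value in the data frame is NA
anyNA(crime)
```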
## [1] FALSE
Data Preview
| percent_m | is_south | mean_education | police_exp60 | police_exp59 | labour_participation | m_per1000f | state_pop | nonwhites_per1000 | unemploy_m24 | unemploy_m39 | gdp | inequality | prob_prison | time_prison | crime_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 151 | 1 | 91 | 58 | 56 | 510 | 950 | 33 | 301 | 108 | 41 | 394 | 261 | 0.084602 | 26.2011 | 791 |
| 143 | 0 | 113 | 103 | 95 | 583 | 1012 | 13 | 102 | 96 | 36 | 557 | 194 | 0.029599 | 25.2999 | 1635 |
| 142 | 1 | 89 | 45 | 44 | 533 | 969 | 18 | 219 | 94 | 33 | 318 | 250 | 0.083401 | 24.3006 | 578 |
| 136 | 0 | 121 | 149 | 141 | 577 | 994 | 157 | 80 | 102 | 39 | 673 | 167 | 0.015801 | 29.9012 | 1969 |
| 141 | 0 | 121 | 109 | 101 | 591 | 985 | 18 | 30 | 91 | 20 | 578 | 174 | 0.041399 | 21.2998 | 1234 |
| 121 | 0 | 110 | 118 | 115 | 547 | 964 | 25 | 44 | 84 | 29 | 689 | 126 | 0.034201 | 20.9995 | 682 |
- percent_m: percentage of males aged 14-24
- is_south: whether the state is a Southern state; 1 for Yes, 0 for No
- mean_education: mean years of schooling
- police_exp60: police expenditure in 1960
- police_exp59: police expenditure in 1959
- labour_participation: labour force participation rate
- m_per1000f: number of males per 1000 females
- state_pop: state population
- nonwhites_per1000: number of non-white residents per 1000 people
- unemploy_m24: unemployment rate of urban males aged 14-24
- unemploy_m39: unemployment rate of urban males aged 35-39
- gdp: gross domestic product per head
- inequality: income inequality
- prob_prison: probability of imprisonment
- time_prison: average time served in prison
- crime_rate: crime rate in an unspecified category

Let's combine the two police_exp and the two unemploy columns to avoid multicollinearity, as sketched below.
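The exact combination isn't shown in the output, so summing each pair here is an assumption; the combined columns police_exp and unemploy are the ones that appear in the later models, and is_south appears to have been converted to a factor (it shows up as is_south1 in the model summaries):

```r
# Assumption: the paired columns are combined by summing them
crime <- crime %>% 
  mutate(police_exp = police_exp60 + police_exp59,
         unemploy   = unemploy_m24 + unemploy_m39,
         is_south   = as.factor(is_south))
```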
## Warning in ggcorr(crime, label = T, hjust = 0.9, layout.exp = 2, label_size =
## 3, : data in column(s) 'is_south' are not numeric and were ignored
From the initial ggcorr plot, we'd like to run some further correlation tests on the variables that correlate weakly with the target (absolute correlation <= 0.2), namely:

- percent_m
- nonwhites_per1000
- unemploy_m24
- inequality

## [1] "Correlation of percent_m :"
## cor
## -0.0894724
## [1] "Correlation of nonwhites_per1000 :"
## cor
## 0.03259884
## [1] "Correlation of unemploy_m24 :"
## cor
## 0.02352789
## [1] "Correlation of inequality :"
## cor
## -0.1790237
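The printed correlations above could be produced with something along these lines (a sketch; the exact code isn't shown in the output):

```r
# Correlation of each weakly correlated candidate with the target variable
low_cor_vars <- c("percent_m", "nonwhites_per1000", "unemploy_m24", "inequality")

for (v in low_cor_vars) {
  print(paste("Correlation of", v, ":"))
  print(cor.test(crime[[v]], crime$crime_rate)$estimate)
}
```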
We ended up using version 4 of the model, but here are the trial models and how we arrived at the version we're happy with.

The first model includes all variables. Its Adjusted R-squared is 0.7129.
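The train/test split and the full model were presumably built along these lines; the seed, the 80/20 proportion, and the object names are assumptions (roughly 37 of the 47 rows end up in the training set, which matches the degrees of freedom in the summary below):

```r
# Assumption: an 80/20 random split into training and test sets
set.seed(123)
idx <- sample(nrow(crime), size = floor(0.8 * nrow(crime)))
crime_train <- crime[idx, ]
crime_test  <- crime[-idx, ]

# Full model: every remaining column as a predictor
crime_model_v1 <- lm(crime_rate ~ ., data = crime_train)
summary(crime_model_v1)
```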
##
## Call:
## lm(formula = crime_rate ~ ., data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -276.6 -105.7 -32.7 134.6 464.1
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5661.6419 2057.0277 -2.752 0.01194 *
## percent_m 1.0782 5.7781 0.187 0.85376
## is_south1 -52.2503 164.0244 -0.319 0.75321
## mean_education 12.0181 7.6205 1.577 0.12972
## police_exp60 18.1742 14.3110 1.270 0.21800
## police_exp59 -11.2350 15.7665 -0.713 0.48394
## labour_participation -0.4187 1.6501 -0.254 0.80215
## m_per1000f 2.6628 2.4307 1.095 0.28571
## state_pop 0.4553 1.6496 0.276 0.78524
## nonwhites_per1000 1.1159 0.8003 1.394 0.17781
## unemploy_m24 -5.4864 4.7685 -1.151 0.26285
## unemploy_m39 9.4924 10.1819 0.932 0.36178
## gdp 2.0402 1.2520 1.630 0.11811
## inequality 9.5846 3.1156 3.076 0.00573 **
## prob_prison -5961.3539 2673.0938 -2.230 0.03678 *
## time_prison -16.7451 10.0799 -1.661 0.11152
## police_exp NA NA NA NA
## unemploy NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 220.5 on 21 degrees of freedom
## Multiple R-squared: 0.8325, Adjusted R-squared: 0.7129
## F-statistic: 6.96 on 15 and 21 DF, p-value: 3.873e-05
We will now work with the variables recommended by the step function, with a few tweaks. The resulting model has an Adjusted R-squared of 0.725, about one percentage point higher than the first model.
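The formula shown below was presumably obtained from stepwise selection on the full model; the direction and object names are assumptions:

```r
# Sketch: backward stepwise selection starting from the full model
crime_model_step <- step(crime_model_v1, direction = "backward", trace = 0)
crime_model_step
```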
##
## Call:
## lm(formula = crime_rate ~ mean_education + police_exp60 + m_per1000f +
## nonwhites_per1000 + gdp + inequality + prob_prison + time_prison,
## data = crime_train)
##
## Coefficients:
## (Intercept) mean_education police_exp60 m_per1000f
## -5715.989 11.389 9.103 1.785
## nonwhites_per1000 gdp inequality prob_prison
## 1.010 2.390 11.027 -5397.630
## time_prison
## -13.770
crime_model_v2 <- lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f +
inequality + prob_prison, data = crime_train)
summary(crime_model_v2)
##
## Call:
## lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f +
## inequality + prob_prison, data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -514.48 -122.42 23.25 147.64 466.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5623.6011 1305.3749 -4.308 0.000154 ***
## mean_education 11.7668 5.8591 2.008 0.053392 .
## police_exp 6.5360 0.8188 7.982 5.18e-09 ***
## m_per1000f 2.7494 1.3149 2.091 0.044826 *
## inequality 8.4805 1.6584 5.114 1.55e-05 ***
## prob_prison -3162.6889 1813.1384 -1.744 0.091013 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 215.8 on 31 degrees of freedom
## Multiple R-squared: 0.7632, Adjusted R-squared: 0.725
## F-statistic: 19.99 on 5 and 31 DF, p-value: 7.09e-09
We've tweaked the predictor variables in this model so that it yields the best result so far. This model has an Adjusted R-squared of 0.7331, about two percentage points higher than the first model.
crime_model_v3 <- lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f + gdp + percent_m +
inequality + prob_prison + is_south, data = crime_train)
summary(crime_model_v3)
##
## Call:
## lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f +
## gdp + percent_m + inequality + prob_prison + is_south, data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -391.70 -120.27 -38.94 124.94 430.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6981.3167 1474.8403 -4.734 5.74e-05 ***
## mean_education 12.7353 6.0076 2.120 0.04301 *
## police_exp 5.4347 0.9975 5.448 8.16e-06 ***
## m_per1000f 2.4680 1.3589 1.816 0.08007 .
## gdp 1.7694 1.0616 1.667 0.10673
## percent_m 3.3433 4.4455 0.752 0.45830
## inequality 9.9214 2.4635 4.027 0.00039 ***
## prob_prison -2986.3250 2059.0858 -1.450 0.15808
## is_south1 83.4337 126.9245 0.657 0.51632
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 212.6 on 28 degrees of freedom
## Multiple R-squared: 0.7924, Adjusted R-squared: 0.7331
## F-statistic: 13.36 on 8 and 28 DF, p-value: 9.82e-08
From version 3, we noticed two variables, prob_prison and is_south, whose standard errors are large relative to their coefficients and whose p-values are high. In this iteration we removed them and obtained an Adjusted R-squared of 0.7322.
crime_model_v4 <- lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f + gdp + percent_m +
inequality, data = crime_train)
summary(crime_model_v4)
##
## Call:
## lm(formula = crime_rate ~ mean_education + police_exp + m_per1000f +
## gdp + percent_m + inequality, data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -387.77 -148.59 -10.31 131.84 451.11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7362.8235 1446.2070 -5.091 1.80e-05 ***
## mean_education 12.5536 5.7968 2.166 0.0384 *
## police_exp 5.6253 0.9774 5.756 2.77e-06 ***
## m_per1000f 2.2382 1.3389 1.672 0.1050
## gdp 2.2170 1.0175 2.179 0.0373 *
## percent_m 3.5920 4.3579 0.824 0.4163
## inequality 11.0016 2.2876 4.809 3.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 213 on 30 degrees of freedom
## Multiple R-squared: 0.7768, Adjusted R-squared: 0.7322
## F-statistic: 17.41 on 6 and 30 DF, p-value: 1.442e-08
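The residuals of the final model are then checked for normality with a histogram and the Shapiro-Wilk test; the calls below are a sketch of what produces the output that follows:

```r
# Visual check of the residual distribution
hist(crime_model_v4$residuals, breaks = 10)

# Shapiro-Wilk test; H0: the residuals are normally distributed
shapiro.test(crime_model_v4$residuals)
```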
##
## Shapiro-Wilk normality test
##
## data: crime_model_v4$residuals
## W = 0.97983, p-value = 0.7256
The histogram indicates a slight skew in the residual distribution, but the residuals pass the Shapiro-Wilk test with no problem (p-value = 0.7256, well above 0.05).

With no distinct pattern in the residuals-versus-fitted plot, we can conclude that our model passes the homoscedasticity assumption.
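The residuals-versus-fitted plot and the VIF values below can be produced along these lines (assuming car::vif, which matches the format of the output):

```r
# Residuals vs fitted values: the absence of a clear pattern suggests constant variance
plot(crime_model_v4$fitted.values, crime_model_v4$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Variance inflation factors for each predictor
car::vif(crime_model_v4)
```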
## mean_education police_exp m_per1000f gdp percent_m
## 3.399311 2.866657 1.371557 8.188752 2.361084
## inequality
## 6.768903
Our VIF test indicates no severe multicollinearity among our predictor variables: every value is below the common threshold of 10, although gdp (8.19) and inequality (6.77) are somewhat elevated.
Here's a comparison of the MSE and RMSE on the training data versus the test data.
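These metrics can be computed along these lines, reusing the helper functions sketched earlier (a sketch, not the original code):

```r
# Predictions on the training and test sets
pred_train <- predict(crime_model_v4, newdata = crime_train)
pred_test  <- predict(crime_model_v4, newdata = crime_test)

# MSE and RMSE for both sets
data.frame(
  set  = c("train", "test"),
  MSE  = c(mse(crime_train$crime_rate, pred_train),
           mse(crime_test$crime_rate, pred_test)),
  RMSE = c(rmse(crime_train$crime_rate, pred_train),
           rmse(crime_test$crime_rate, pred_test))
)
```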
The variables used to build the model are mean education, combined police expenditure from 1959 and 1960, number of males per 1000 females, GDP, percentage of males aged 14-24, and income inequality. With our test data ranging from 508 to 1272, our RMSE of 269.4 is still quite large, meaning our model is still fairly inaccurate at predicting the target variable.

There's a good chance this inaccuracy stems from the small number of observations, only 47 in the full dataset, which makes it difficult for the model to learn the correct impact of each predictor variable. To improve the model, we recommend gathering more data.
We found that our crime rate target is significantly affected by mean education, police expenditure, GDP, and inequality.

I think police expenditure will logically rise as the crime rate rises, so it may not be a very good indicator of what causes the crime rate to go up.

Our three other predictors are mean education, GDP, and inequality. At first glance this can look odd, because all three are positively correlated with the crime rate. That means (if we were to follow the model blindly) an increase in mean education, GDP, or inequality should contribute to an increase in the crime rate. That just doesn't make sense from my perspective.

If we take a closer look, inequality has a very high coefficient of about 11 compared to the other two variables. This means that for every one-unit rise in inequality, the predicted crime rate goes up by about 11, holding the other predictors constant.

In my opinion, mean education and GDP are positively correlated with crime because they measure population averages, so we shouldn't conclude that a rise in mean education or GDP causes the crime rate to go up. It more likely reflects how large the inequality in the society is.
In conclusion, the way to reduce the crime rate is to greatly reduce the amount of inequality.