Using linear regression, let’s predict US crime rates based on various socioeconomic factors. The data we’ll be using was collected from 47 states in the 1960s.
Let’s start by reading the data set with read.csv. We’ll also rename the columns to make the variables easier to understand.
library(dplyr)
data <- read.csv("crime.csv") %>% select(-c(X))
names(data) <- c("percent_m", "is_south", "mean_education", "police_exp60",
                 "police_exp59", "labour_participation", "m_per1000f",
                 "state_pop", "nonwhites_per1000", "unemploy_m24",
                 "unemploy_m39", "gdp", "inequality", "prob_prison",
                 "time_prison", "crime_rate")
Let’s take a glimpse at the data set to see all the variables.
glimpse(data)
## Rows: 47
## Columns: 16
## $ percent_m <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
## $ is_south <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ mean_education <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
## $ police_exp60 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
## $ police_exp59 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
## $ m_per1000f <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
## $ state_pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
## $ nonwhites_per1000 <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
## $ unemploy_m24 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
## $ unemploy_m39 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
## $ gdp <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
## $ inequality <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
## $ prob_prison <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
## $ time_prison <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
## $ crime_rate <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …
This is what the variables mean:

- percent_m: percentage of males aged 14-24
- is_south: whether the state is in the South (1 for Yes, 0 for No)
- mean_education: mean years of schooling
- police_exp60: police expenditure in 1960
- police_exp59: police expenditure in 1959
- labour_participation: labour force participation rate
- m_per1000f: number of males per 1000 females
- state_pop: state population
- nonwhites_per1000: number of non-white residents per 1000 people
- unemploy_m24: unemployment rate of urban males aged 14-24
- unemploy_m39: unemployment rate of urban males aged 35-39
- gdp: gross domestic product per head
- inequality: income inequality
- prob_prison: probability of imprisonment
- time_prison: average time served in prisons
- crime_rate: crime rate in an unspecified category

Let’s check for missing values by counting the NA values in each column.
data %>% is.na() %>% colSums()
## percent_m is_south mean_education
## 0 0 0
## police_exp60 police_exp59 labour_participation
## 0 0 0
## m_per1000f state_pop nonwhites_per1000
## 0 0 0
## unemploy_m24 unemploy_m39 gdp
## 0 0 0
## inequality prob_prison time_prison
## 0 0 0
## crime_rate
## 0
It seems our data is already clean. The next step is fixing incorrect data types: is_south is a categorical variable, so we’ll turn it into a factor using mutate.
data <- data %>%
mutate(is_south = as.factor(is_south))
rmarkdown::paged_table(data)
Now our data should be ready.
Let’s check the correlations before we start building the model. We can use ggcorr from the GGally package to make a correlation plot.
library(GGally)
ggcorr(data, hjust = 1, layout.exp = 3, label = TRUE)
We’re going to predict crime_rate, and according to this plot police_exp59 and police_exp60 are its two most strongly correlated predictors. Aside from those, prob_prison, gdp, state_pop, and mean_education also seem to contribute somewhat.
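If we want the exact numbers behind the plot, a minimal sketch (assuming dplyr 1.0.0 or newer for select(where(...))) is to compute each numeric column’s correlation with crime_rate directly:
# Correlation of each numeric column with crime_rate, strongest first
# (is_south is excluded automatically since it is now a factor)
num_data <- data %>% select(where(is.numeric))
sort(cor(num_data)[, "crime_rate"], decreasing = TRUE)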
Let’s try making a model using only police_exp59 and police_exp60.
model_police <- lm(crime_rate ~ police_exp60 + police_exp59, data = data)
summary(model_police)
##
## Call:
## lm(formula = crime_rate ~ police_exp60 + police_exp59, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -636.09 -168.62 35.44 141.80 532.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 158.26 125.93 1.257 0.2155
## police_exp60 25.62 12.34 2.076 0.0438 *
## police_exp59 -17.83 13.12 -1.359 0.1810
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 281.3 on 44 degrees of freedom
## Multiple R-squared: 0.494, Adjusted R-squared: 0.471
## F-statistic: 21.48 on 2 and 44 DF, p-value: 3.094e-07
Since this is multiple linear regression, we should look at the Adjusted R-squared to judge the model’s quality, as it penalizes the number of predictors. This model only has an Adjusted R-squared of 0.471, which is quite poor since we’d like the value to be as close to 1 as possible.
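As an aside, we don’t have to read this value off the printout; summary() stores it in the model summary object:
# Pull the Adjusted R-squared straight from the summary
summary(model_police)$adj.r.squared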
Let’s instead try to make a model with all of the predictors.
model_all <- lm(crime_rate ~ ., data = data)
summary(model_all)
##
## Call:
## lm(formula = crime_rate ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.74 -98.09 -6.69 112.99 512.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5984.2876 1628.3184 -3.675 0.000893 ***
## percent_m 8.7830 4.1714 2.106 0.043443 *
## is_south1 -3.8035 148.7551 -0.026 0.979765
## mean_education 18.8324 6.2088 3.033 0.004861 **
## police_exp60 19.2804 10.6110 1.817 0.078892 .
## police_exp59 -10.9422 11.7478 -0.931 0.358830
## labour_participation -0.6638 1.4697 -0.452 0.654654
## m_per1000f 1.7407 2.0354 0.855 0.398995
## state_pop -0.7330 1.2896 -0.568 0.573845
## nonwhites_per1000 0.4204 0.6481 0.649 0.521279
## unemploy_m24 -5.8271 4.2103 -1.384 0.176238
## unemploy_m39 16.7800 8.2336 2.038 0.050161 .
## gdp 0.9617 1.0367 0.928 0.360754
## inequality 7.0672 2.2717 3.111 0.003983 **
## prob_prison -4855.2658 2272.3746 -2.137 0.040627 *
## time_prison -3.4790 7.1653 -0.486 0.630708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
## F-statistic: 8.429 on 15 and 31 DF, p-value: 3.539e-07
This one has an Adjusted R-squared of 0.7078, which is much better than the previous model but still not great.

## Step

Let’s try using step, which automatically selects predictors by adding or removing them one at a time to minimize the model’s AIC. We’ll start with the backward direction, since it can simply start from the previous model with all the predictors.
model_backward <- step(object = model_all, direction = "backward", trace = FALSE)
summary(model_backward)
##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -444.70 -111.07 3.03 122.15 483.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6426.101 1194.611 -5.379 4.04e-06 ***
## percent_m 9.332 3.350 2.786 0.00828 **
## mean_education 18.012 5.275 3.414 0.00153 **
## police_exp60 10.265 1.552 6.613 8.26e-08 ***
## m_per1000f 2.234 1.360 1.642 0.10874
## unemploy_m24 -6.087 3.339 -1.823 0.07622 .
## unemploy_m39 18.735 7.248 2.585 0.01371 *
## inequality 6.133 1.396 4.394 8.63e-05 ***
## prob_prison -3796.032 1490.646 -2.547 0.01505 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared: 0.7888, Adjusted R-squared: 0.7444
## F-statistic: 17.74 on 8 and 38 DF, p-value: 1.159e-10
This one has an Adjusted R-squared of 0.7444, which is better than the previous model. Let’s also build models using the forward and both directions to see if we can get a better score. Both of these need a starting model with no predictors at all, so we fit an intercept-only model by putting 1 where the predictors would normally go.
model_none <- lm(crime_rate ~ 1, data = data)
model_forward <- step(object = model_none,
                      direction = "forward",
                      scope = list(upper = model_all),
                      trace = FALSE)
summary(model_forward)
##
## Call:
## lm(formula = crime_rate ~ police_exp60 + inequality + mean_education +
## percent_m + prob_prison + unemploy_m39, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -470.68 -78.41 -19.68 133.12 556.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5040.505 899.843 -5.602 1.72e-06 ***
## police_exp60 11.502 1.375 8.363 2.56e-10 ***
## inequality 6.765 1.394 4.855 1.88e-05 ***
## mean_education 19.647 4.475 4.390 8.07e-05 ***
## percent_m 10.502 3.330 3.154 0.00305 **
## prob_prison -3801.836 1528.097 -2.488 0.01711 *
## unemploy_m39 8.937 4.091 2.185 0.03483 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared: 0.7659, Adjusted R-squared: 0.7307
## F-statistic: 21.81 on 6 and 40 DF, p-value: 3.418e-11
model_both <- step(object = model_none,
                   direction = "both",
                   scope = list(upper = model_all),
                   trace = FALSE)
summary(model_both)
##
## Call:
## lm(formula = crime_rate ~ police_exp60 + inequality + mean_education +
## percent_m + prob_prison + unemploy_m39, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -470.68 -78.41 -19.68 133.12 556.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5040.505 899.843 -5.602 1.72e-06 ***
## police_exp60 11.502 1.375 8.363 2.56e-10 ***
## inequality 6.765 1.394 4.855 1.88e-05 ***
## mean_education 19.647 4.475 4.390 8.07e-05 ***
## percent_m 10.502 3.330 3.154 0.00305 **
## prob_prison -3801.836 1528.097 -2.488 0.01711 *
## unemploy_m39 8.937 4.091 2.185 0.03483 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared: 0.7659, Adjusted R-squared: 0.7307
## F-statistic: 21.81 on 6 and 40 DF, p-value: 3.418e-11
The forward and both procedures end up selecting the exact same model, with an Adjusted R-squared of 0.7307. That makes them better than model_all but worse than model_backward. Let’s use the two best models, backward and forward, to predict our data.
data$pred_backward <- predict(model_backward, data)
data$pred_forward <- predict(model_forward, data)
rmarkdown::paged_table(data)
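As a side note, predict() works just as well on unseen data. Here is a minimal sketch for a hypothetical new state; the values below are made up purely for illustration:
# Hypothetical new observation -- all values are made up for illustration
new_state <- data.frame(
  percent_m = 140, mean_education = 105, police_exp60 = 90,
  m_per1000f = 980, unemploy_m24 = 95, unemploy_m39 = 34,
  inequality = 190, prob_prison = 0.045
)
predict(model_backward, newdata = new_state)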
Checking the error of every single prediction by hand would be tedious. Instead, let’s use the MAPE (Mean Absolute Percentage Error) function from the MLmetrics package.
library(MLmetrics)
MAPE(data$pred_backward, data$crime_rate)
## [1] 0.1748751
MAPE(data$pred_forward, data$crime_rate)
## [1] 0.1695908
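For reference, MAPE is simply the mean of the absolute errors divided by the actual values, so we can reproduce the backward model’s score by hand:
# MAPE by hand: mean(|actual - predicted| / actual)
mean(abs(data$crime_rate - data$pred_backward) / data$crime_rate)
## [1] 0.1748751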
It seems that although the backward model has a better Adjusted R-squared, the forward model has a slightly lower error percentage.
Another important thing to check is whether our model satisfies the classical linear regression assumptions, the ones that make ordinary least squares the Best Linear Unbiased Estimator (BLUE). The first assumption is linearity. Let’s analyze just the backward model here, plotting the fitted values against the residuals in a scatter plot.
plot(model_backward, which = 1)
abline(h = 100, col = "green")
abline(h = -100, col = "green")
We want the residuals to bounce randomly around 0 with no systematic pattern. Ours seem to satisfy this condition, as there is no obvious trend or spike.

## Normality of Residuals

Next, we want to make sure that our residuals are normally distributed, with most of them near 0. We’ll use a histogram to plot this out.
# Histogram of the residuals
hist(model_backward$residuals)
It seems that most of our residuals are indeed concentrated near 0, with a bit of a spike around 200. To be absolutely sure, we can also run the Shapiro-Wilk normality test on the residuals.
shapiro.test(model_backward$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_backward$residuals
## W = 0.98511, p-value = 0.8051
The null hypothesis of this test is that the residuals are normally distributed, so to pass it we want a p-value above alpha, typically 0.05. Our p-value is 0.8051 > 0.05, so our backward model passes the test.
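If we prefer the check as a simple boolean, we can compare the stored p-value against alpha directly:
# TRUE means we fail to reject normality at alpha = 0.05
shapiro.test(model_backward$residuals)$p.value > 0.05
## [1] TRUE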
Next, we have to check that our errors are spread randomly across the fitted values, without any sort of pattern. This condition is called homoscedasticity. We can start by inspecting a scatter plot like the one below.
plot(x = model_backward$fitted.values, y = model_backward$residuals)
abline(h = 0, col = "red")
However, reading such a plot can be hard if we’re not familiar with the patterns that typically appear. Instead, we can use bptest, the studentized Breusch-Pagan test, from the lmtest package.
library(lmtest)
bptest(model_backward)
##
## studentized Breusch-Pagan test
##
## data: model_backward
## BP = 13.51, df = 8, p-value = 0.09546
Same as before, we want the p-value to be above alpha (0.05). In this case, our p-value is 0.09546 > 0.05, so our backward model has also passed the Breusch-Pagan test.
Last but not least, we have to check that there is no strong correlation between the predictors themselves, a problem known as multicollinearity. We can use vif (Variance Inflation Factor) from the car package for this.
library(car)
vif(model_backward)
## percent_m mean_education police_exp60 m_per1000f unemploy_m24
## 2.131963 4.189684 2.560496 1.932367 4.360038
## unemploy_m39 inequality prob_prison
## 4.508106 3.731074 1.381879
The rule of thumb is that every predictor should have a VIF below 10. All of our predictors pass this check as well.
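As a one-line sanity check:
# Does any predictor reach the usual VIF threshold of 10?
any(vif(model_backward) >= 10)
## [1] FALSE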
The model with the best Adjusted R-squared is model_backward, which uses the predictors percent_m, mean_education, police_exp60, m_per1000f, unemploy_m24, unemploy_m39, inequality, and prob_prison. The model satisfies the BLUE assumptions and has a MAPE of around 17.5%. However, it could still be improved, considering its Adjusted R-squared is only about 74.4%.
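To wrap up the model search in one place, here is a small sketch that collects the Adjusted R-squared and in-sample MAPE of every candidate model, assuming all the objects fitted above are still in memory:
# Compare every candidate model on Adjusted R-squared and in-sample MAPE
models <- list(police = model_police, all = model_all,
               backward = model_backward, forward = model_forward)
data.frame(
  adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
  mape = sapply(models, function(m) MAPE(predict(m, data), data$crime_rate))
)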
If we take one more look at the summary of model_backward:
summary(model_backward)
##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -444.70 -111.07 3.03 122.15 483.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6426.101 1194.611 -5.379 4.04e-06 ***
## percent_m 9.332 3.350 2.786 0.00828 **
## mean_education 18.012 5.275 3.414 0.00153 **
## police_exp60 10.265 1.552 6.613 8.26e-08 ***
## m_per1000f 2.234 1.360 1.642 0.10874
## unemploy_m24 -6.087 3.339 -1.823 0.07622 .
## unemploy_m39 18.735 7.248 2.585 0.01371 *
## inequality 6.133 1.396 4.394 8.63e-05 ***
## prob_prison -3796.032 1490.646 -2.547 0.01505 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared: 0.7888, Adjusted R-squared: 0.7444
## F-statistic: 17.74 on 8 and 38 DF, p-value: 1.159e-10
We can see that the predictors with the strongest effect on the target variable are police_exp60 (police expenditure) and inequality, both with positive coefficients. Keeping in mind that regression captures association rather than causation, this suggests that large gaps between social classes go hand in hand with higher crime rates, so reducing income inequality should be part of any effort to reduce crime. The positive coefficient on police expenditure is harder to interpret: it may simply mean that states with more crime spend more on policing, so higher expenditure by itself should not be read as something that lowers (or raises) crime rates.