We will predict crime rate using a linear regression model and identify which variables are important.
library(tidyverse)
library(caret)
library(GGally)
library(car)
library(lmtest)
crime <- read.csv("crime.csv")
glimpse(crime)
## Rows: 47
## Columns: 16
## $ percent_m <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
## $ is_south <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ mean_education <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
## $ police_exp60 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
## $ police_exp59 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
## $ m_per1000f <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
## $ state_pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
## $ nonwhites_per1000 <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
## $ unemploy_m24 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
## $ unemploy_m39 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
## $ gdp <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
## $ inequality <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
## $ prob_prison <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
## $ time_prison <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
## $ crime_rate <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …
Check for missing values
colSums(is.na(crime))
## percent_m is_south mean_education
## 0 0 0
## police_exp60 police_exp59 labour_participation
## 0 0 0
## m_per1000f state_pop nonwhites_per1000
## 0 0 0
## unemploy_m24 unemploy_m39 gdp
## 0 0 0
## inequality prob_prison time_prison
## 0 0 0
## crime_rate
## 0
There are no missing values in our dataset.
Exploratory data analysis is the phase where we explore the variables and look for patterns that may indicate correlation between them.
ggcorr(crime, label = T, label_size = 3, hjust = 1, layout.exp = 2)
The plot shows that not many variables correlate strongly with crime_rate; the notable exceptions are police_exp59, police_exp60, gdp, and mean_education.
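As a complement to the plot, here is a minimal sketch that ranks the predictors by the absolute strength of their correlation with crime_rate (the helper name cor_target is ours):
# cor() works here because every column is numeric
cor_target <- cor(crime)[, "crime_rate"]
sort(abs(cor_target), decreasing = TRUE)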
Now we will fit a linear regression model with crime_rate as the target variable.
set.seed(100)
crime_lm <- lm(crime_rate ~. , data = crime)
summary(crime_lm)
##
## Call:
## lm(formula = crime_rate ~ ., data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.74 -98.09 -6.69 112.99 512.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5984.2876 1628.3184 -3.675 0.000893 ***
## percent_m 8.7830 4.1714 2.106 0.043443 *
## is_south -3.8035 148.7551 -0.026 0.979765
## mean_education 18.8324 6.2088 3.033 0.004861 **
## police_exp60 19.2804 10.6110 1.817 0.078892 .
## police_exp59 -10.9422 11.7478 -0.931 0.358830
## labour_participation -0.6638 1.4697 -0.452 0.654654
## m_per1000f 1.7407 2.0354 0.855 0.398995
## state_pop -0.7330 1.2896 -0.568 0.573845
## nonwhites_per1000 0.4204 0.6481 0.649 0.521279
## unemploy_m24 -5.8271 4.2103 -1.384 0.176238
## unemploy_m39 16.7800 8.2336 2.038 0.050161 .
## gdp 0.9617 1.0367 0.928 0.360754
## inequality 7.0672 2.2717 3.111 0.003983 **
## prob_prison -4855.2658 2272.3746 -2.137 0.040627 *
## time_prison -3.4790 7.1653 -0.486 0.630708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
## F-statistic: 8.429 on 15 and 31 DF, p-value: 3.539e-07
From the result above, the adjusted R-squared is 0.7078. We can also build a model using only significant columns, indicated by a p-value < 0.05 or, at a glance, by the significance stars (*). We will keep the columns police_exp60 and inequality.
Choose the significant columns
crime2 <- crime %>%
select(c(police_exp60, inequality, crime_rate))
Create a second linear regression model
set.seed(100)
crime_lm2 <- lm(crime_rate ~., data = crime2)
summary(crime_lm2)
##
## Call:
## lm(formula = crime_rate ~ ., data = crime2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -703.87 -121.81 -14.56 154.55 506.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -944.662 343.947 -2.747 0.00869 **
## police_exp60 12.415 1.637 7.582 1.62e-09 ***
## inequality 4.095 1.220 3.357 0.00163 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.2 on 44 degrees of freedom
## Multiple R-squared: 0.5803, Adjusted R-squared: 0.5612
## F-statistic: 30.42 on 2 and 44 DF, p-value: 5.061e-09
Our result with only the significant columns is not as good so far: the adjusted R-squared is 0.5612, lower than the first model's.
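For a quick side-by-side check, the adjusted R-squared values can be pulled directly from the summary objects; a small sketch:
summary(crime_lm)$adj.r.squared   # 0.7078 (full model)
summary(crime_lm2)$adj.r.squared  # 0.5612 (significant columns only)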
We build another model to improve the R-squared; this time we try stepwise selection with direction = "both".
First stepwise model, starting from all columns
set.seed(100)
crime_step <- stats::step(crime_lm, direction = "both", trace = 0)
summary(crime_step)
##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
## data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -444.70 -111.07 3.03 122.15 483.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6426.101 1194.611 -5.379 4.04e-06 ***
## percent_m 9.332 3.350 2.786 0.00828 **
## mean_education 18.012 5.275 3.414 0.00153 **
## police_exp60 10.265 1.552 6.613 8.26e-08 ***
## m_per1000f 2.234 1.360 1.642 0.10874
## unemploy_m24 -6.087 3.339 -1.823 0.07622 .
## unemploy_m39 18.735 7.248 2.585 0.01371 *
## inequality 6.133 1.396 4.394 8.63e-05 ***
## prob_prison -3796.032 1490.646 -2.547 0.01505 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared: 0.7888, Adjusted R-squared: 0.7444
## F-statistic: 17.74 on 8 and 38 DF, p-value: 1.159e-10
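Note that step() selects predictors by AIC rather than by p-values, which is why the retained set differs from the starred columns above. A quick sketch to confirm the stepwise model improves (lowers) the AIC relative to the full model:
AIC(crime_lm)   # full model
AIC(crime_step) # stepwise model; should be no higher than the full model's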
Second stepwise model, starting from the significant columns
set.seed(100)
crime_step2 <- stats::step(crime_lm2, direction = "both", trace = 0)
summary(crime_step2)
##
## Call:
## lm(formula = crime_rate ~ police_exp60 + inequality, data = crime2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -703.87 -121.81 -14.56 154.55 506.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -944.662 343.947 -2.747 0.00869 **
## police_exp60 12.415 1.637 7.582 1.62e-09 ***
## inequality 4.095 1.220 3.357 0.00163 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.2 on 44 degrees of freedom
## Multiple R-squared: 0.5803, Adjusted R-squared: 0.5612
## F-statistic: 30.42 on 2 and 44 DF, p-value: 5.061e-09
Based on the results of the two models above, we can compare the adjusted R-squared: the first model reaches 0.7444 while the second reaches 0.5612. The first model is better than the second, even though the second uses only the significant columns. We will keep the first stepwise model as our best model.
We use RMSE (root mean squared error) to evaluate model performance. RMSE penalizes large errors more heavily than MAE (mean absolute error) because it squares the differences between the actual and predicted values before averaging, so predictions with higher error are penalized greatly. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE.
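The formula behind caret's RMSE() is simple; a minimal sketch (the helper name rmse_manual is ours):
# sqrt of the mean squared difference between predictions and actuals
rmse_manual <- function(pred, actual) sqrt(mean((actual - pred)^2))
rmse_manual(crime_step$fitted.values, crime$crime_rate)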
Stepwise performance
RMSE(crime_step$fitted.values, crime$crime_rate)
## [1] 175.8304
Linear Regression performance
RMSE(crime_lm$fitted.values, crime$crime_rate)
## [1] 169.79
From the RMSE calculation we get 175.8304 for the stepwise model and 169.79 for the full linear regression, so by in-sample RMSE the full linear regression looks better than the stepwise model. This is expected: a model with more predictors always fits the training data at least as well, so an out-of-sample comparison is more informative.
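Because the in-sample comparison favors the larger model almost by construction, a hedged sketch of a 5-fold cross-validation comparison with caret (the object names ctrl, cv_full, and cv_step are ours) would give a fairer estimate of out-of-sample RMSE:
set.seed(100)
ctrl <- trainControl(method = "cv", number = 5)
# Fit both candidate formulas with plain lm inside caret's CV loop
cv_full <- train(crime_rate ~ ., data = crime, method = "lm", trControl = ctrl)
cv_step <- train(formula(crime_step), data = crime, method = "lm", trControl = ctrl)
cv_full$results$RMSE
cv_step$results$RMSE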
After building these models, we have to check the assumptions of our chosen model. Here are the assumptions:
The linear regression model assumes that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.
data.frame(residual = crime_step$residuals, fitted = crime_step$fitted.values) %>%
ggplot(aes(fitted, residual)) +
geom_point() +
geom_smooth() +
geom_hline(aes(yintercept = 0)) +
theme(panel.grid = element_blank(), panel.background = element_blank())
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Based on the ggcorr plot and the residuals-versus-fitted plot, the relationship between the target and the predictors appears reasonably linear: the residuals scatter around zero without a strong pattern, even though some predictors are not very strongly correlated with the target.
The second assumption in linear regression is that the residuals follow a normal distribution. We can easily check this using the Shapiro-Wilk normality test.
shapiro.test(crime_step$residuals)
##
## Shapiro-Wilk normality test
##
## data: crime_step$residuals
## W = 0.98511, p-value = 0.8051
The p-value > 0.05 means the residuals of our model follow a normal distribution. We can also see this in the Q-Q plot below.
qqPlot(crime_step$residuals)
## [1] 11 19
The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms (no autocorrelation). If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be. For example, a 95% confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter. In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant. In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model.
Autocorrelation can be detected using the Durbin-Watson test, whose null hypothesis is that there is no autocorrelation.
durbinWatsonTest(crime_step)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1049091 1.752067 0.352
## Alternative hypothesis: rho != 0
Since the p-value > 0.05, we fail to reject the null hypothesis and conclude that autocorrelation is not present.
Heteroscedasticity is a condition where the variability of a variable is unequal across its range of values. In a linear regression model, if the variance of the errors is unequal across the range of the target variable, heteroscedasticity is present, which ties back to the earlier point about non-random patterns in the residuals. We check for it with the Breusch-Pagan test.
bptest(crime_step)
##
## studentized Breusch-Pagan test
##
## data: crime_step
## BP = 13.51, df = 8, p-value = 0.09546
The p-value > 0.05 means our model does not exhibit heteroscedasticity; the residuals are homoscedastic.
Multicollinearity means that there is correlation among the independent variables/predictors. To check for multicollinearity, we can measure the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
vif(crime_step)
## percent_m mean_education police_exp60 m_per1000f unemploy_m24
## 2.131963 4.189684 2.560496 1.932367 4.360038
## unemploy_m39 inequality prob_prison
## 4.508106 3.731074 1.381879
Based on this result, our model does not suffer from multicollinearity, because no VIF value is greater than 5.
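A one-liner sketch to flag any predictor whose VIF exceeds the rule-of-thumb threshold of 5 (here it returns an empty named vector):
vif(crime_step)[vif(crime_step) > 5]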
The variables that are useful for explaining the variance in crime_rate are percent_m, mean_education, police_exp60, m_per1000f, unemploy_m24, unemploy_m39, inequality, and prob_prison. Our final model satisfies the classical assumptions. The adjusted R-squared of the model is quite good: it explains 74.44% of the variance in crime_rate. The accuracy of the model in predicting the crime rate is measured with RMSE; our RMSE is 175.8304. Based on the assumption tests, the model can be used to predict crime_rate.
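As a usage sketch, predictions from the final model come from predict(); here we simply reuse the training rows, while in practice newdata would hold fresh observations:
# Predicted crime_rate for the first few observations
head(predict(crime_step, newdata = crime))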