1 Introduction

We will predict the crime rate using a linear regression model and identify which variables are important.

2 Setup Library

library(tidyverse)  # data wrangling and ggplot2
library(caret)      # RMSE() and other modelling utilities
library(GGally)     # ggcorr() correlation plot
library(car)        # vif(), durbinWatsonTest(), qqPlot()
library(lmtest)     # bptest() for heteroscedasticity

3 Load dataset

crime <- read.csv("crime.csv")
glimpse(crime)
## Rows: 47
## Columns: 16
## $ percent_m            <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
## $ is_south             <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ mean_education       <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
## $ police_exp60         <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
## $ police_exp59         <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
## $ m_per1000f           <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
## $ state_pop            <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
## $ nonwhites_per1000    <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
## $ unemploy_m24         <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
## $ unemploy_m39         <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
## $ gdp                  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
## $ inequality           <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
## $ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
## $ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
## $ crime_rate           <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …

Check for missing values

colSums(is.na(crime))
##            percent_m             is_south       mean_education 
##                    0                    0                    0 
##         police_exp60         police_exp59 labour_participation 
##                    0                    0                    0 
##           m_per1000f            state_pop    nonwhites_per1000 
##                    0                    0                    0 
##         unemploy_m24         unemploy_m39                  gdp 
##                    0                    0                    0 
##           inequality          prob_prison          time_prison 
##                    0                    0                    0 
##           crime_rate 
##                    0

There are no missing values in our dataset.

4 Exploratory Data Analysis

Exploratory data analysis is the phase where we explore the variables and look for patterns that may indicate correlation between them.

4.1 Check Correlation

ggcorr(crime, label = T, label_size = 3, hjust = 1, layout.exp = 2)

The plot shows that not many variables correlate strongly with crime_rate; the strongest correlations are police_exp59, police_exp60, gdp, and mean_education.
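
As a complement to the plot, a quick sketch (output not shown in the original) lists the numeric correlations with crime_rate directly, sorted from strongest positive to strongest negative:

# correlation of every column with crime_rate; all columns are numeric
sort(cor(crime)[, "crime_rate"], decreasing = TRUE)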

5 Modelling

5.1 Linear Regression

Now we fit a linear regression with crime_rate as the target variable and all other columns as predictors.

set.seed(100)
crime_lm <- lm(crime_rate ~. , data = crime)
summary(crime_lm)
## 
## Call:
## lm(formula = crime_rate ~ ., data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -395.74  -98.09   -6.69  112.99  512.67 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5984.2876  1628.3184  -3.675 0.000893 ***
## percent_m                8.7830     4.1714   2.106 0.043443 *  
## is_south                -3.8035   148.7551  -0.026 0.979765    
## mean_education          18.8324     6.2088   3.033 0.004861 ** 
## police_exp60            19.2804    10.6110   1.817 0.078892 .  
## police_exp59           -10.9422    11.7478  -0.931 0.358830    
## labour_participation    -0.6638     1.4697  -0.452 0.654654    
## m_per1000f               1.7407     2.0354   0.855 0.398995    
## state_pop               -0.7330     1.2896  -0.568 0.573845    
## nonwhites_per1000        0.4204     0.6481   0.649 0.521279    
## unemploy_m24            -5.8271     4.2103  -1.384 0.176238    
## unemploy_m39            16.7800     8.2336   2.038 0.050161 .  
## gdp                      0.9617     1.0367   0.928 0.360754    
## inequality               7.0672     2.2717   3.111 0.003983 ** 
## prob_prison          -4855.2658  2272.3746  -2.137 0.040627 *  
## time_prison             -3.4790     7.1653  -0.486 0.630708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078 
## F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07

From the result above the adjusted R-squared is 0.7078. We can also try keeping only the most important predictors; significance is shown by a p-value < 0.05, or, the easy way, by the stars (*). Note that police_exp60 only reaches p < 0.1 here, but given its strong correlation with crime_rate we keep it together with inequality.

Select the chosen columns

crime2 <- crime  %>% 
          select(c(police_exp60, inequality, crime_rate))

Create the second linear regression model

set.seed(100)
crime_lm2 <- lm(crime_rate ~., data = crime2)
summary(crime_lm2)
## 
## Call:
## lm(formula = crime_rate ~ ., data = crime2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -703.87 -121.81  -14.56  154.55  506.45 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -944.662    343.947  -2.747  0.00869 ** 
## police_exp60   12.415      1.637   7.582 1.62e-09 ***
## inequality      4.095      1.220   3.357  0.00163 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.2 on 44 degrees of freedom
## Multiple R-squared:  0.5803, Adjusted R-squared:  0.5612 
## F-statistic: 30.42 on 2 and 44 DF,  p-value: 5.061e-09

The reduced model is not very good so far: its adjusted R-squared is 0.5612, lower than that of the full model.

5.2 Stepwise

We build another model to try to improve the adjusted R-squared; this time we use stepwise selection (which adds and drops predictors to minimize AIC) with direction = "both".

First stepwise model, starting from the full model

set.seed(100)
crime_step <- stats::step(crime_lm, direction = "both", trace = 0)

summary(crime_step)
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.70 -111.07    3.03  122.15  483.30 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6426.101   1194.611  -5.379 4.04e-06 ***
## percent_m          9.332      3.350   2.786  0.00828 ** 
## mean_education    18.012      5.275   3.414  0.00153 ** 
## police_exp60      10.265      1.552   6.613 8.26e-08 ***
## m_per1000f         2.234      1.360   1.642  0.10874    
## unemploy_m24      -6.087      3.339  -1.823  0.07622 .  
## unemploy_m39      18.735      7.248   2.585  0.01371 *  
## inequality         6.133      1.396   4.394 8.63e-05 ***
## prob_prison    -3796.032   1490.646  -2.547  0.01505 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
## F-statistic: 17.74 on 8 and 38 DF,  p-value: 1.159e-10

Second stepwise model, starting from the reduced model

set.seed(100)
crime_step2 <- stats::step(crime_lm2, direction = "both", trace = 0)

summary(crime_step2)
## 
## Call:
## lm(formula = crime_rate ~ police_exp60 + inequality, data = crime2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -703.87 -121.81  -14.56  154.55  506.45 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -944.662    343.947  -2.747  0.00869 ** 
## police_exp60   12.415      1.637   7.582 1.62e-09 ***
## inequality      4.095      1.220   3.357  0.00163 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.2 on 44 degrees of freedom
## Multiple R-squared:  0.5803, Adjusted R-squared:  0.5612 
## F-statistic: 30.42 on 2 and 44 DF,  p-value: 5.061e-09

Comparing the adjusted R-squared of the two stepwise models above, the first reaches 0.7444 while the second stays at 0.5612. The first model is better than the second even though the second started from the significant columns, so we keep the first stepwise model as our best model.
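
As an extra sanity check (not part of the original output), the candidate models could also be compared by AIC, the criterion that step() minimizes; lower is better:

# compare all fitted models by AIC; returns a data.frame of df and AIC
AIC(crime_lm, crime_lm2, crime_step, crime_step2)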

6 Evaluation

6.1 Model Performance

We use RMSE to measure model performance. RMSE penalizes large errors more heavily than MAE (mean absolute error) because it squares the differences between the actual and predicted values before averaging them, so predictions with larger errors are penalized greatly. This makes it a common metric for comparing alternative models, even though it is harder to interpret than MAE.
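
caret's RMSE() is simply the square root of the mean squared residual; a minimal hand-rolled equivalent for the stepwise model would be:

# RMSE by hand: sqrt of the mean squared difference between fitted and actual
sqrt(mean((crime_step$fitted.values - crime$crime_rate)^2))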

Stepwise performance

RMSE(crime_step$fitted.values, crime$crime_rate)
## [1] 175.8304

Linear Regression performance

RMSE(crime_lm$fitted.values, crime$crime_rate)
## [1] 169.79

From the RMSE calculation we get 175.8304 for the stepwise model and 169.79 for the full linear regression, so on the training data the full linear regression performs slightly better than the stepwise model.

7 Assumptions

After building the models, we have to check the assumptions of linear regression for our chosen model. Here are the assumptions:

7.1 Linearity

The linear regression model assumes that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.

data.frame(residual = crime_step$residuals, fitted = crime_step$fitted.values) %>%
  ggplot(aes(fitted, residual)) + 
  geom_point() + 
  geom_smooth() + 
  geom_hline(aes(yintercept = 0)) + 
  theme(panel.grid = element_blank(), panel.background = element_blank())
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Based on the ggcorr plot and the residuals-versus-fitted plot above, the relationship between the target and the predictors looks reasonably linear: the residuals scatter around zero with no strong pattern, even though some predictors are not very strongly correlated with the target.

7.2 Normality

The second assumption in linear regression is that the residuals follow a normal distribution. We can easily check this with the Shapiro-Wilk normality test.

shapiro.test(crime_step$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  crime_step$residuals
## W = 0.98511, p-value = 0.8051

The p-value > 0.05, so we fail to reject the null hypothesis: the residuals are consistent with a normal distribution. The Q-Q plot below shows the same thing.

qqPlot(crime_step$residuals)

## [1] 11 19

The two printed indices (11 and 19) are the observations that qqPlot flags as having the most extreme residuals.

7.3 Autocorrelation

The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms (no autocorrelation). If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be. For example, a 95% confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter. In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant. In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model.

Autocorrelation can be detected with the Durbin-Watson test, whose null hypothesis is that there is no autocorrelation.

durbinWatsonTest(crime_step)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1049091      1.752067   0.352
##  Alternative hypothesis: rho != 0

Since the p-value > 0.05, we fail to reject the null hypothesis and conclude that autocorrelation is not present.

7.4 Heteroscedasticity

Heteroscedasticity is a condition where the variability of a variable is unequal across its range of values. In a linear regression model, if the variance of the errors changes across the range of fitted values, heteroscedasticity is present, and it typically shows up as a non-random pattern in the residuals. We test it with the Breusch-Pagan test, whose null hypothesis is homoscedasticity.

bptest(crime_step)
## 
##  studentized Breusch-Pagan test
## 
## data:  crime_step
## BP = 13.51, df = 8, p-value = 0.09546

The p-value > 0.05, which means the residuals are homoscedastic; heteroscedasticity is not present.

7.5 Multicolinearity

Multicollinearity means that there is correlation among the independent variables/predictors. To check for multicollinearity, we can measure the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

vif(crime_step)
##      percent_m mean_education   police_exp60     m_per1000f   unemploy_m24 
##       2.131963       4.189684       2.560496       1.932367       4.360038 
##   unemploy_m39     inequality    prob_prison 
##       4.508106       3.731074       1.381879

Based on the result, our model has no multicollinearity, since no VIF value is greater than 5.

8 Conclusion

Variables that are useful to describe the variance in crime_rate are percent_m, mean_education, police_exp60, m_per1000f, unemploy_m24, unemploy_m39, inequality, and prob_prison. Our final model satisfies the classical assumptions. The adjusted R-squared of the model is quite good: the predictors explain 74.44% of the variance in the crime rate. The accuracy of the model in predicting the crime rate is measured with RMSE; our result is 175.8304. Based on the assumption tests, our model can be used to predict crime_rate.
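
As a usage sketch, the final model can score a new observation with predict(). The values below are hypothetical, chosen only to illustrate the call:

# hypothetical new observation (values invented for illustration)
new_obs <- data.frame(
  percent_m = 140, mean_education = 105, police_exp60 = 90,
  m_per1000f = 980, unemploy_m24 = 95, unemploy_m39 = 34,
  inequality = 190, prob_prison = 0.045
)
predict(crime_step, newdata = new_obs)  # predicted crime_rate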