Introduction

This project was created to fulfill the Learning by Building (LBB) assignment on the Regression Model material in the Machine Learning Specialization Course. The dataset used in this project is the Crime dataset.

Import Libraries

library(dplyr)
library(tidyverse)
library(caret)
library(car)
library(GGally)
library(MLmetrics)
library(lmtest)

Read Data

crime <- read.csv("dataset/crime.csv")
glimpse(crime)
## Rows: 47
## Columns: 17
## $ X    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19…
## $ M    <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140, 124, 134, 128, …
## $ So   <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,…
## $ Ed   <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 105, 108, 113, 117…
## $ Po1  <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121, 75, 67, 62, 57,…
## $ Po2  <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, 71, 60, 61, 53, …
## $ LF   <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632, 580, 595, 624, …
## $ M.F  <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 1029, 966, 972, 972…
## $ Pop  <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 28, 22, 30, 33, 1…
## $ NW   <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106, 59, 10, 46, 72…
## $ U1   <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83, 77, 77, 92, 11…
## $ U2   <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 25, 27, 43, 47, 3…
## $ GDP  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526, 657, 580, 507, …
## $ Ineq <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174, 170, 172, 206, …
## $ Prob <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399, 0.034201, 0.042…
## $ Time <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9995, 20.6993, 24…
## $ y    <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674, 849, …

First of all, we need to rename the columns and drop the X column, which is only a row index.

crime <- crime %>% 
  select(-X)

names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")

glimpse(crime)
## Rows: 47
## Columns: 16
## $ percent_m            <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
## $ is_south             <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ mean_education       <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
## $ police_exp60         <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
## $ police_exp59         <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
## $ m_per1000f           <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
## $ state_pop            <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
## $ nonwhites_per1000    <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
## $ unemploy_m24         <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
## $ unemploy_m39         <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
## $ gdp                  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
## $ inequality           <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
## $ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
## $ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
## $ crime_rate           <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …

Here is the description of each column:

  • percent_m: percentage of males aged 14-24
  • is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
  • mean_education: mean years of schooling
  • police_exp60: police expenditure in 1960
  • police_exp59: police expenditure in 1959
  • labour_participation: labour force participation rate
  • m_per1000f: number of males per 1000 females
  • state_pop: state population
  • nonwhites_per1000: number of non-white residents per 1000 people
  • unemploy_m24: unemployment rate of urban males aged 14-24
  • unemploy_m39: unemployment rate of urban males aged 35-39
  • gdp: gross domestic product per head
  • inequality: income inequality
  • prob_prison: probability of imprisonment
  • time_prison: average time served in prison
  • crime_rate: crime rate in an unspecified category

After that, we need to check whether there are any missing values.

colSums(is.na(crime))
##            percent_m             is_south       mean_education 
##                    0                    0                    0 
##         police_exp60         police_exp59 labour_participation 
##                    0                    0                    0 
##           m_per1000f            state_pop    nonwhites_per1000 
##                    0                    0                    0 
##         unemploy_m24         unemploy_m39                  gdp 
##                    0                    0                    0 
##           inequality          prob_prison          time_prison 
##                    0                    0                    0 
##           crime_rate 
##                    0

There are no missing values, so we continue by checking the correlation between the variables using ggcorr.

ggcorr(crime, label = TRUE, label_size = 2.5, hjust = 1, layout.exp = 2)

Based on the correlation plot, we can see that not all variables are correlated with the crime_rate variable. The variables with the highest correlation to crime_rate are police_exp59 and police_exp60.
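
For a quick numerical check, the same information can be read off the correlation matrix; a minimal sketch using base R's cor() (all columns here are numeric):

# Correlation of every variable with crime_rate, strongest first
sort(cor(crime)[, "crime_rate"], decreasing = TRUE)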

Cross Validation

In this project, I will split the data into 70% for training and 30% for testing.

set.seed(123)
sample_size <- round(0.70 * nrow(crime))  # number of training rows
index <- sample(seq_len(nrow(crime)), size = sample_size)

crime_train <- crime[index, ]
crime_test <- crime[-index, ]
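
As a quick sanity check (a small sketch, not part of the original pipeline), the split sizes should come out to roughly 70/30 of the 47 rows:

nrow(crime_train)  # 33 rows, i.e. round(0.70 * 47)
nrow(crime_test)   # the remaining 14 rows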

Build Regression Model

I will build two models: the first uses all remaining predictor variables, and the second uses only the most highly correlated predictors.

model_all <- lm(crime_rate ~ ., data = crime_train)
summary(model_all)
## 
## Call:
## lm(formula = crime_rate ~ ., data = crime_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -304.71  -91.82  -31.14  101.84  426.16 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.033e+04  2.233e+03  -4.628  0.00024 ***
## percent_m             9.668e+00  4.475e+00   2.160  0.04530 *  
## is_south              8.101e+00  2.233e+02   0.036  0.97148    
## mean_education        2.630e+01  7.698e+00   3.416  0.00329 ** 
## police_exp60          2.009e+01  1.186e+01   1.694  0.10858    
## police_exp59         -1.486e+01  1.354e+01  -1.098  0.28771    
## labour_participation -4.659e+00  2.172e+00  -2.145  0.04669 *  
## m_per1000f            8.266e+00  2.861e+00   2.889  0.01019 *  
## state_pop             1.370e+00  1.792e+00   0.764  0.45521    
## nonwhites_per1000     1.333e+00  9.991e-01   1.335  0.19960    
## unemploy_m24         -1.198e+01  5.132e+00  -2.334  0.03212 *  
## unemploy_m39          1.424e+01  9.869e+00   1.443  0.16724    
## gdp                   6.436e-01  1.262e+00   0.510  0.61671    
## inequality            4.247e+00  2.855e+00   1.488  0.15518    
## prob_prison           9.401e+02  3.823e+03   0.246  0.80869    
## time_prison           1.478e+01  9.430e+00   1.567  0.13558    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 199.9 on 17 degrees of freedom
## Multiple R-squared:  0.8813, Adjusted R-squared:  0.7766 
## F-statistic: 8.414 on 15 and 17 DF,  p-value: 3.846e-05

Based on the summary above, we can identify the predictor variables whose Pr(>|t|) values exceed 0.05: is_south, police_exp60, police_exp59, state_pop, nonwhites_per1000, unemploy_m39, gdp, inequality, prob_prison, and time_prison. A Pr(>|t|) value greater than 0.05 indicates that the predictor is not statistically significant for predicting the target variable crime_rate.

Consequently, we can simplify the model by removing these insignificant predictors and refitting it, which may improve on the adjusted R-squared of 0.7766.
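
The same list can be extracted programmatically; a minimal sketch reading the coefficient table returned by summary():

coefs <- summary(model_all)$coefficients
# predictor names whose Pr(>|t|) exceeds 0.05
rownames(coefs)[coefs[, "Pr(>|t|)"] > 0.05]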

The second model uses the two predictors most highly correlated with crime_rate.

model <- lm(crime_rate ~ police_exp60 + police_exp59, data = crime_train)
summary(model)
## 
## Call:
## lm(formula = crime_rate ~ police_exp60 + police_exp59, data = crime_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -624.25 -170.68   48.68  153.57  477.05 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)    100.02     145.47   0.688    0.497
## police_exp60    21.35      14.37   1.486    0.148
## police_exp59   -12.69      15.32  -0.829    0.414
## 
## Residual standard error: 294 on 30 degrees of freedom
## Multiple R-squared:  0.5466, Adjusted R-squared:  0.5164 
## F-statistic: 18.08 on 2 and 30 DF,  p-value: 7.03e-06

The adjusted R-squared of this model is only 0.5164, which ideally should be improved, for example by adding more predictor variables. One method for choosing predictor variables that we’ll employ is the stepwise regression algorithm.

model_none <- lm(crime_rate ~ 1, data = crime_train)

model_forward <- step(object = model_none, # model with no predictors
                      direction = "forward", 
                      scope = list(upper = model_all), # model with all predictors
                      trace = F)
summary(model_forward)
## 
## Call:
## lm(formula = crime_rate ~ police_exp60 + percent_m + m_per1000f + 
##     is_south + mean_education + time_prison + inequality, data = crime_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -363.98  -88.84   11.32   78.90  531.21 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -7901.441   1534.858  -5.148 2.54e-05 ***
## police_exp60      11.153      1.514   7.366 1.02e-07 ***
## percent_m          8.039      3.735   2.153  0.04120 *  
## m_per1000f         3.982      1.529   2.605  0.01526 *  
## is_south         344.961    137.104   2.516  0.01866 *  
## mean_education    16.534      5.409   3.057  0.00527 ** 
## time_prison       14.378      5.668   2.537  0.01781 *  
## inequality         3.039      1.957   1.553  0.13293    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200.6 on 25 degrees of freedom
## Multiple R-squared:  0.8241, Adjusted R-squared:  0.7749 
## F-statistic: 16.74 on 7 and 25 DF,  p-value: 5.419e-08
model_backward <- step(object = model_all,
                       direction = "backward",
                       trace = F)
summary(model_backward)
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     labour_participation + m_per1000f + nonwhites_per1000 + unemploy_m24 + 
##     unemploy_m39 + inequality + time_prison, data = crime_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -283.69  -80.50  -16.92   82.84  429.87 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -9643.5294  1570.9487  -6.139 3.53e-06 ***
## percent_m                8.3543     3.9050   2.139 0.043752 *  
## mean_education          25.3573     6.1671   4.112 0.000459 ***
## police_exp60             8.3365     1.7646   4.724 0.000103 ***
## labour_participation    -3.5144     1.5221  -2.309 0.030727 *  
## m_per1000f               7.1285     2.0723   3.440 0.002337 ** 
## nonwhites_per1000        1.1725     0.6474   1.811 0.083817 .  
## unemploy_m24           -10.2895     3.9585  -2.599 0.016368 *  
## unemploy_m39            16.0240     8.4283   1.901 0.070453 .  
## inequality               4.6137     1.7115   2.696 0.013205 *  
## time_prison             15.7576     5.6929   2.768 0.011222 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.1 on 22 degrees of freedom
## Multiple R-squared:  0.8682, Adjusted R-squared:  0.8082 
## F-statistic: 14.49 on 10 and 22 DF,  p-value: 1.705e-07
model_both <- step(object = model_none,
                   direction = "both",
                   scope = list(upper = model_all),
                   trace = F)
summary(model_both)
## 
## Call:
## lm(formula = crime_rate ~ police_exp60 + percent_m + m_per1000f + 
##     is_south + mean_education + time_prison + inequality, data = crime_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -363.98  -88.84   11.32   78.90  531.21 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -7901.441   1534.858  -5.148 2.54e-05 ***
## police_exp60      11.153      1.514   7.366 1.02e-07 ***
## percent_m          8.039      3.735   2.153  0.04120 *  
## m_per1000f         3.982      1.529   2.605  0.01526 *  
## is_south         344.961    137.104   2.516  0.01866 *  
## mean_education    16.534      5.409   3.057  0.00527 ** 
## time_prison       14.378      5.668   2.537  0.01781 *  
## inequality         3.039      1.957   1.553  0.13293    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200.6 on 25 degrees of freedom
## Multiple R-squared:  0.8241, Adjusted R-squared:  0.7749 
## F-statistic: 16.74 on 7 and 25 DF,  p-value: 5.419e-08

After applying the stepwise regression method, I compared the three new models; the highest adjusted R-squared, 0.8082, belongs to model_backward, obtained via backward elimination.
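
The comparison can be made in a single call; a short sketch:

# adjusted R-squared of the three stepwise models, side by side
sapply(list(forward = model_forward, backward = model_backward, both = model_both),
       function(m) summary(m)$adj.r.squared)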

Prediction

pred_crime1 <- predict(model, crime_test)
pred_crime2 <- predict(model_backward, crime_test)

The Mean Absolute Percentage Error (MAPE) measures the average percentage difference between predicted and actual values. To assess the accuracy of each model in predicting the target variable, we will compare the MAPE of model and model_backward.
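
For reference, MAPE can also be written by hand; a minimal sketch that matches MLmetrics::MAPE(y_pred, y_true) for the strictly positive crime_rate values:

# mean absolute percentage error: mean(|y_pred - y_true| / |y_true|)
mape_manual <- function(y_pred, y_true) {
  mean(abs(y_pred - y_true) / abs(y_true))
}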

MAPE(pred_crime1, crime_test$crime_rate)
## [1] 0.2135468
MAPE(pred_crime2, crime_test$crime_rate)
## [1] 0.3699559

For the first model, the MAPE means that on average the predictions differ from the actual values by 21.35%; for the second model the difference is 36.99%. By this metric the first model predicts better, because its MAPE is lower.

Assumption Analysis

Normality: Shapiro Test

shapiro.test(model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model$residuals
## W = 0.96664, p-value = 0.3934
shapiro.test(model_backward$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_backward$residuals
## W = 0.97453, p-value = 0.6143

Based on the Shapiro-Wilk test, we can conclude that the errors of both models are normally distributed, since both p-values are greater than 0.05 (0.3934 for model and 0.6143 for model_backward).
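
A visual check can complement the formal test; a quick sketch using base graphics:

# normal Q-Q plot: points close to the line suggest normal residuals
qqnorm(model_backward$residuals)
qqline(model_backward$residuals)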

Heteroskedasticity Test: Breusch-Pagan

bptest(model)
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 13.691, df = 2, p-value = 0.001064
bptest(model_backward)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_backward
## BP = 9.5527, df = 10, p-value = 0.4806

Based on the Breusch-Pagan test, we can conclude that there is heteroskedasticity in the first model, because its p-value (0.001064) is lower than 0.05. For the second model we conclude homoskedasticity, because its p-value (0.4806) is above 0.05, meaning the error variance is constant.
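
The same conclusion can be eyeballed from a residuals-versus-fitted plot, where a roughly constant spread suggests homoskedasticity; a minimal sketch:

plot(model_backward$fitted.values, model_backward$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2) # reference line at zero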

Multicollinearity

vif(model_backward)
##            percent_m       mean_education         police_exp60 
##             2.008588             4.492356             3.075057 
## labour_participation           m_per1000f    nonwhites_per1000 
##             3.797952             3.449977             2.662333 
##         unemploy_m24         unemploy_m39           inequality 
##             4.145015             4.330561             3.842859 
##          time_prison 
##             1.739850

Based on the Variance Inflation Factor (VIF) test above, we conclude that multicollinearity is not present in model_backward, because the VIF values of all variables are below 10.
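
As a one-line programmatic check (a small sketch using the common threshold of 10):

any(vif(model_backward) >= 10) # expect FALSE: no multicollinearity flagged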

Conclusion

In conclusion, the second model, model_backward, obtained via the stepwise-regression (backward elimination) method, outperforms the first model, which used only the two predictors most highly correlated with crime_rate. The reasons for the superiority of model_backward are as follows:

  • According to the Breusch-Pagan test, only model_backward has homoskedastic (constant-variance) errors; the first model exhibits heteroskedasticity.
  • The Shapiro-Wilk test indicates normally distributed residuals for both models, so model_backward satisfies this assumption as well.
  • The Variance Inflation Factor (VIF) test indicates that multicollinearity is absent in model_backward, as all VIF values are below 10.

However, the MAPE of model_backward (36.99%) is worse than that of model (21.35%), a difference of about 15 percentage points.