Preface

Regression is a form of supervised machine learning: the analyst specifies a formula so that the machine can estimate the effect of the independent variables (predictors) on the dependent variable (the target variable, i.e. the value we’d like to predict).

In this section I’m going to demonstrate the process of building and selecting a model to predict the crime rate, comparing several models and their performance.

Target variable: crime rate


Library Preparation

library(lubridate)  # date-time helpers
library(tidyverse)  # data wrangling (dplyr, magrittr pipe, etc.)
library(ggplot2)    # plotting
library(GGally)     # ggcorr() for correlation plots
library(MLmetrics)  # RMSE() for model evaluation
library(lmtest)     # bptest() for the Breusch-Pagan test
library(car)        # vif() for multicollinearity checks

Read Data
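The crime dataset needs to be loaded first; the file name below is an assumption, adjust it to wherever your copy lives:

crime <- read.csv("crime_data.csv") # hypothetical file name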

head(crime)

Data Wrangling

summary(crime)
#>        X              M               So               Ed       
#>  Min.   : 1.0   Min.   :119.0   Min.   :0.0000   Min.   : 87.0  
#>  1st Qu.:12.5   1st Qu.:130.0   1st Qu.:0.0000   1st Qu.: 97.5  
#>  Median :24.0   Median :136.0   Median :0.0000   Median :108.0  
#>  Mean   :24.0   Mean   :138.6   Mean   :0.3404   Mean   :105.6  
#>  3rd Qu.:35.5   3rd Qu.:146.0   3rd Qu.:1.0000   3rd Qu.:114.5  
#>  Max.   :47.0   Max.   :177.0   Max.   :1.0000   Max.   :122.0  
#>       Po1             Po2               LF             M.F        
#>  Min.   : 45.0   Min.   : 41.00   Min.   :480.0   Min.   : 934.0  
#>  1st Qu.: 62.5   1st Qu.: 58.50   1st Qu.:530.5   1st Qu.: 964.5  
#>  Median : 78.0   Median : 73.00   Median :560.0   Median : 977.0  
#>  Mean   : 85.0   Mean   : 80.23   Mean   :561.2   Mean   : 983.0  
#>  3rd Qu.:104.5   3rd Qu.: 97.00   3rd Qu.:593.0   3rd Qu.: 992.0  
#>  Max.   :166.0   Max.   :157.00   Max.   :641.0   Max.   :1071.0  
#>       Pop               NW              U1               U2       
#>  Min.   :  3.00   Min.   :  2.0   Min.   : 70.00   Min.   :20.00  
#>  1st Qu.: 10.00   1st Qu.: 24.0   1st Qu.: 80.50   1st Qu.:27.50  
#>  Median : 25.00   Median : 76.0   Median : 92.00   Median :34.00  
#>  Mean   : 36.62   Mean   :101.1   Mean   : 95.47   Mean   :33.98  
#>  3rd Qu.: 41.50   3rd Qu.:132.5   3rd Qu.:104.00   3rd Qu.:38.50  
#>  Max.   :168.00   Max.   :423.0   Max.   :142.00   Max.   :58.00  
#>       GDP             Ineq            Prob              Time      
#>  Min.   :288.0   Min.   :126.0   Min.   :0.00690   Min.   :12.20  
#>  1st Qu.:459.5   1st Qu.:165.5   1st Qu.:0.03270   1st Qu.:21.60  
#>  Median :537.0   Median :176.0   Median :0.04210   Median :25.80  
#>  Mean   :525.4   Mean   :194.0   Mean   :0.04709   Mean   :26.60  
#>  3rd Qu.:591.5   3rd Qu.:227.5   3rd Qu.:0.05445   3rd Qu.:30.45  
#>  Max.   :689.0   Max.   :276.0   Max.   :0.11980   Max.   :44.00  
#>        y         
#>  Min.   : 342.0  
#>  1st Qu.: 658.5  
#>  Median : 831.0  
#>  Mean   : 905.1  
#>  3rd Qu.:1057.5  
#>  Max.   :1993.0
any(duplicated(crime))
#> [1] FALSE

Looking at the head of the data frame, it seems that not much cleaning is needed. However, in column So the minimum value is 0 and the maximum is 1, so it is probably a binary variable. Let’s recheck its unique values.

unique(crime$So)
#> [1] 1 0

Column So indeed contains only 2 values (either 1 or 0), so the adjustments we’d like to apply are:

  1. Rename the column to be more meaningful and understandable
  2. Change data type of So column to factor
  3. Drop column X because it shows the row numbers (unused)

crime <- crime %>% 
  select(-X) %>% 
  mutate(So = as.factor(So))


names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")


crime

Data Glossary

  1. percent_m: percentage of males aged 14-24
  2. is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
  3. mean_education: mean years of schooling
  4. police_exp60: police expenditure in 1960
  5. police_exp59: police expenditure in 1959
  6. labour_participation: labour force participation rate
  7. m_per1000f: number of males per 1000 females
  8. state_pop: state population
  9. nonwhites_per1000: number of non-whites resident per 1000 people
  10. unemploy_m24: unemployment rate of urban males aged 14-24
  11. unemploy_m39: unemployment rate of urban males aged 35-39
  12. gdp: gross domestic product per head
  13. inequality: income inequality
  14. prob_prison: probability of imprisonment
  15. time_prison: average time served in prisons
  16. crime_rate: crime rate in an unspecified category

Exploratory Data Analysis

Data Distribution

par(mfrow=c(3,3))
boxplot(crime$percent_m, main = "percent_m")
boxplot(crime$mean_education, main = "mean_education")
boxplot(crime$police_exp60, main = "police_exp60")
boxplot(crime$police_exp59, main = "police_exp59")
boxplot(crime$labour_participation, main = "labour_participation")
boxplot(crime$state_pop, main = "state_pop")
boxplot(crime$nonwhites_per1000, main = "nonwhites_per1000")
boxplot(crime$unemploy_m24, main = "unemploy_m24")
boxplot(crime$unemploy_m39, main = "unemploy_m39")

par(mfrow=c(2,3))
boxplot(crime$gdp, main = "gdp")
boxplot(crime$inequality, main = "inequality")
boxplot(crime$prob_prison, main = "prob_prison")
boxplot(crime$time_prison, main = "time_prison")
boxplot(crime$crime_rate, main = "crime_rate")

From the plots above we can see each column’s distribution, and we can conclude that:

percent_m, police_exp59, state_pop, nonwhites_per1000, unemploy_m24, and unemploy_m39 have outliers. For the moment we won’t apply any treatment to them; later on we’ll check whether these outliers affect our model significantly.
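If we want to list the outlying values themselves, base R’s boxplot.stats() can do so (a quick addition, not part of the original workflow):

boxplot.stats(crime$percent_m)$out # values flagged by the 1.5 * IQR rule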

Correlations

ggcorr(crime, label = T, hjust = 1, layout.exp = 3)

From the plot above we can conclude:

Crime rate has a moderate-to-strong correlation with police_exp59 and police_exp60.
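The exact coefficients can be verified with cor() (a quick check added here for reference):

cor(crime$crime_rate, crime$police_exp60)
cor(crime$crime_rate, crime$police_exp59)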


Regression Model

Linear Model

As a baseline, let’s create a model with all columns as predictors:

# Create a linear model with the lm() function
crime.lm.all <- lm(crime_rate ~ ., data = crime)
summary(crime.lm.all)
#> 
#> Call:
#> lm(formula = crime_rate ~ ., data = crime)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -395.74  -98.09   -6.69  112.99  512.67 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          -5984.2876  1628.3184  -3.675 0.000893 ***
#> percent_m                8.7830     4.1714   2.106 0.043443 *  
#> is_south1               -3.8035   148.7551  -0.026 0.979765    
#> mean_education          18.8324     6.2088   3.033 0.004861 ** 
#> police_exp60            19.2804    10.6110   1.817 0.078892 .  
#> police_exp59           -10.9422    11.7478  -0.931 0.358830    
#> labour_participation    -0.6638     1.4697  -0.452 0.654654    
#> m_per1000f               1.7407     2.0354   0.855 0.398995    
#> state_pop               -0.7330     1.2896  -0.568 0.573845    
#> nonwhites_per1000        0.4204     0.6481   0.649 0.521279    
#> unemploy_m24            -5.8271     4.2103  -1.384 0.176238    
#> unemploy_m39            16.7800     8.2336   2.038 0.050161 .  
#> gdp                      0.9617     1.0367   0.928 0.360754    
#> inequality               7.0672     2.2717   3.111 0.003983 ** 
#> prob_prison          -4855.2658  2272.3746  -2.137 0.040627 *  
#> time_prison             -3.4790     7.1653  -0.486 0.630708    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 209.1 on 31 degrees of freedom
#> Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078 
#> F-statistic: 8.429 on 15 and 31 DF,  p-value: 0.0000003539

Highlights:

  1. Although police_exp59 and police_exp60 are well correlated with crime rate, their coefficients show little significance once all the variables are included (in fact police_exp59 is considered not significant at all)

  2. Adjusted R-squared is 0.7078, fairly close to 1, which means the model is considered good enough to predict the crime rate; still, we may find a better model
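For later comparison, the adjusted R-squared can also be extracted programmatically:

summary(crime.lm.all)$adj.r.squared # 0.7078 for the all-variable model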

conclusion create different models to compare

Stepwise Model

With stepwise regression, step() iteratively eliminates (or adds) predictors, keeping the subset of variables that best explains the target variable according to the AIC criterion.
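Under the hood, step() compares candidate models by AIC (lower is better); the criterion for any fitted model can be inspected directly:

AIC(crime.lm.all) # AIC of the full model; step() searches for subsets that lower this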

# Perform backward stepwise regression to eliminate variables
crime.bwd <- step(object = crime.lm.all,
                  direction = "backward",
                  trace = F)

summary(crime.bwd)
#> 
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
#>     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
#>     data = crime)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -444.70 -111.07    3.03  122.15  483.30 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value     Pr(>|t|)    
#> (Intercept)    -6426.101   1194.611  -5.379 0.0000040395 ***
#> percent_m          9.332      3.350   2.786      0.00828 ** 
#> mean_education    18.012      5.275   3.414      0.00153 ** 
#> police_exp60      10.265      1.552   6.613 0.0000000826 ***
#> m_per1000f         2.234      1.360   1.642      0.10874    
#> unemploy_m24      -6.087      3.339  -1.823      0.07622 .  
#> unemploy_m39      18.735      7.248   2.585      0.01371 *  
#> inequality         6.133      1.396   4.394 0.0000863344 ***
#> prob_prison    -3796.032   1490.646  -2.547      0.01505 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 195.5 on 38 degrees of freedom
#> Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
#> F-statistic: 17.74 on 8 and 38 DF,  p-value: 0.0000000001159

Highlights

Compared to the all-variable linear model, Adjusted R-squared increased to 0.7444. Although the increase is small, the backward stepwise model is the better model, and it has also eliminated the columns that do not significantly affect the crime rate.

conclusion create different models to compare

# Create forward stepwise regression
crime.lm.none <- lm(crime_rate ~ 1, data = crime) # base model with no predictors
summary(step(    
  object = crime.lm.none,
  direction = "forward",
  scope = list(upper = crime.lm.all), # grow from no predictors up to the full model
  trace = F
))
#> 
#> Call:
#> lm(formula = crime_rate ~ police_exp60 + inequality + mean_education + 
#>     percent_m + prob_prison + unemploy_m39, data = crime)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -470.68  -78.41  -19.68  133.12  556.23 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value       Pr(>|t|)    
#> (Intercept)    -5040.505    899.843  -5.602 0.000001715273 ***
#> police_exp60      11.502      1.375   8.363 0.000000000256 ***
#> inequality         6.765      1.394   4.855 0.000018793765 ***
#> mean_education    19.647      4.475   4.390 0.000080720160 ***
#> percent_m         10.502      3.330   3.154        0.00305 ** 
#> prob_prison    -3801.836   1528.097  -2.488        0.01711 *  
#> unemploy_m39       8.937      4.091   2.185        0.03483 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 200.7 on 40 degrees of freedom
#> Multiple R-squared:  0.7659, Adjusted R-squared:  0.7307 
#> F-statistic: 21.81 on 6 and 40 DF,  p-value: 0.00000000003418

Highlights

We can see that Adjusted R-squared is 0.7307 (lower than the backward stepwise model’s).

conclusion at this point the backward stepwise model is the better one, based on the Adjusted R-squared comparison

Observe Outlier

summary(crime.bwd)$call
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
#>     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
#>     data = crime)

Looking at the variables used in our stepwise model, we will now observe those that contain outliers, which are: percent_m, unemploy_m24, and unemploy_m39.

par(mfrow=c(1,3))
boxplot(crime$percent_m, main = "percent_m", xlab = NULL)
boxplot(crime$unemploy_m24, main = "unemploy_m24", xlab = NULL)
boxplot(crime$unemploy_m39, main = "unemploy_m39", xlab = NULL)

max(crime$percent_m)
#> [1] 177
max(crime$unemploy_m24)
#> [1] 142
max(crime$unemploy_m39)
#> [1] 58
# test the outlier's effect on the model (column percent_m)
test.outlier <- crime[crime$percent_m < 177,]
lm.noOutlier1 <- lm(crime_rate ~ percent_m, data = test.outlier)
lm.wOutlier1 <- lm(crime_rate ~ percent_m, data = crime)
plot(crime$percent_m, crime$crime_rate)
abline(lm.wOutlier1, col = "blue")
abline(lm.noOutlier1, col = "red")

Highlights

As we can see, with or without the outlier the regression line shows little difference, i.e. the outlier has low leverage.

decision Observe more variables that contain outliers
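Leverage can also be inspected with R’s built-in regression diagnostics (an optional addition, not part of the original flow):

plot(lm.wOutlier1, which = 5) # residuals vs. leverage plot for the percent_m model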

# test the outlier's effect on the model (column unemploy_m24)
test.outlier <- crime[crime$unemploy_m24 < 142,]
lm.noOutlier2 <- lm(crime_rate ~ unemploy_m24, data = test.outlier)
lm.wOutlier2 <- lm(crime_rate ~ unemploy_m24, data = crime)
plot(crime$unemploy_m24, crime$crime_rate)
abline(lm.wOutlier2, col = "blue")
abline(lm.noOutlier2, col = "red")

Highlights

In this particular case, for column unemploy_m24, there is little difference between the regression lines with and without the outlier.

decision Observe more variables that contain outliers, and compare the adjusted R-squared with and without the outliers

# test the outlier's effect on the model (column unemploy_m39)
test.outlier <- crime[crime$unemploy_m39 < 58,]
lm.noOutlier3 <- lm(crime_rate ~ unemploy_m39, data = test.outlier)
lm.wOutlier3 <- lm(crime_rate ~ unemploy_m39, data = crime)
plot(crime$unemploy_m39, crime$crime_rate)
abline(lm.wOutlier3, col = "blue")
abline(lm.noOutlier3, col = "red")

Highlights

In this particular case, for column unemploy_m39, there is a clear leverage difference between the regression lines with and without the outlier.

decision Compare the R-squared with and without the outlier

Observe R-Squared

# column percent_m
summary(lm.noOutlier1)$r.squared
#> [1] 0.00738487
summary(lm.wOutlier1)$r.squared
#> [1] 0.00800531

Highlights The outlier has low influence on the regression line; the R-squared barely changes with or without it.

Decision Keep the outlier in column percent_m

# column unemploy_m24
summary(lm.noOutlier2)$r.squared
#> [1] 0.004200003
summary(lm.wOutlier2)$r.squared
#> [1] 0.00254802

Highlight

There is a leverage difference in the regression line (though the influence is low), and the R-squared is higher without the outlier (0.0042 vs. 0.0025).

Decision Trim the outlier from column unemploy_m24

# column unemploy_m39
summary(lm.noOutlier3)$r.squared
#> [1] 0.01936133
summary(lm.wOutlier3)$r.squared
#> [1] 0.03144261

Highlight There is a leverage difference in the regression line, but the influence is still considered low, and the R-squared is higher with the outlier.

Decision Keep the outlier in column unemploy_m39, and adjust our model based on the conclusions above.


crime_clean <- crime[crime$unemploy_m24 < 142,]
lm.crime_clean <- lm(crime_rate ~ ., data = crime_clean)
crime_clean.bwd <- step(lm.crime_clean, 
             direction = "backward",
             trace = F)
summary(crime_clean.bwd)$adj.r.squared # model with the unemploy_m24 outlier trimmed
#> [1] 0.7462056
summary(crime.bwd)$adj.r.squared # original model (outliers kept)
#> [1] 0.7443692

Model Evaluation

Create predictions on our data (to assess model performance):

crime$prediction <- predict(object = crime_clean.bwd, newdata = crime)
crime$prediction_wOut <- predict(object = crime.bwd, newdata = crime)
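As a quick visual sanity check (an addition to the original workflow), we can plot the predictions against the actual values; points close to the red diagonal indicate a good fit:

plot(crime$crime_rate, crime$prediction,
     xlab = "actual crime_rate", ylab = "predicted crime_rate")
abline(a = 0, b = 1, col = "red") # perfect-prediction line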

RMSE

To evaluate our model and see the average error it may produce, we’ll calculate the RMSE (root mean squared error):

RMSE(y_pred = crime$prediction, y_true = crime$crime_rate)
#> [1] 176.1401
RMSE(y_pred = crime$prediction_wOut, y_true = crime$crime_rate)
#> [1] 175.8304
range(crime$crime_rate)
#> [1]  342 1993

Considering the range of the target variable, the RMSE is quite high.
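For reference, RMSE is simply the square root of the mean squared error, so it can also be reproduced without MLmetrics:

sqrt(mean((crime$prediction - crime$crime_rate)^2)) # manual RMSE, equivalent to MLmetrics::RMSE()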


Model Assumptions

Linearity

summary(crime_clean.bwd)$call
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
#>     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
#>     data = crime_clean)
ggcorr(crime %>% select(-prediction, -prediction_wOut), label = T, hjust = 1, layout.exp = 3) # exclude the prediction columns

Since all the predictors in the model have a non-zero correlation with crime_rate, we can conclude that linearity is present.

Normality of Residuals

hist(crime_clean.bwd$residuals)

shapiro.test(crime_clean.bwd$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  crime_clean.bwd$residuals
#> W = 0.98497, p-value = 0.8093

P-value > alpha (0.8093 > 0.05): we fail to reject the null hypothesis, so we can conclude that the errors are normally distributed.
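A Q-Q plot is another common visual check for normality (an optional addition):

qqnorm(crime_clean.bwd$residuals)
qqline(crime_clean.bwd$residuals, col = "red") # points should hug this line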

Homoscedasticity of Residuals

plot(x = crime_clean.bwd$fitted.values, y = crime_clean.bwd$residuals)
abline(h = 0, col = "red")

bptest(crime_clean.bwd)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  crime_clean.bwd
#> BP = 14.255, df = 8, p-value = 0.07537

P-value > alpha (0.07537 > 0.05): we fail to reject the null hypothesis, so we can conclude that the residuals are homoscedastic (no heteroscedasticity detected).

No Multicollinearity

vif(crime_clean.bwd)
#>      percent_m mean_education   police_exp60     m_per1000f   unemploy_m24 
#>       2.115006       4.115400       2.560201       1.784947       4.106642 
#>   unemploy_m39     inequality    prob_prison 
#>       4.536383       3.724564       1.381809

All predictors have a VIF below 10, so we can conclude there is no multicollinearity between the predictors.

Conclusion

All assumptions have been fulfilled.


Summary

From all the model evaluation we can conclude that:

  1. Backward stepwise regression is the best model so far (based on adjusted R-squared)
  2. To get higher model performance, we can keep or trim outliers (based on their influence on the regression line). In this particular case, the best way to handle the outliers is to trim the outlier in column unemploy_m24 and keep the outliers in columns percent_m & unemploy_m39
  3. The RMSE value can be considered high (based on comparison to the target variable’s range)
  4. The best model (backward stepwise regression) can be considered to have decent performance in predicting the crime rate; still, to get better predictions we could use a more sophisticated model