Regression is a type of supervised machine learning: we build a formula from labelled data so that the machine can estimate the effect of the independent variables (predictors) on the dependent variable (the target variable, i.e. the value we'd like to predict).
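In general form, a multiple linear regression model can be written as follows, where the β coefficients are estimated from the data and ε is the error term:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$$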
In this section I'm going to demonstrate the process of building and selecting a model to predict crime rate, comparing several models and their performance.
Target variable: crime rate
library(lubridate)
library(tidyverse)
library(ggplot2)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)

head(crime)

summary(crime)
#> X M So Ed
#> Min. : 1.0 Min. :119.0 Min. :0.0000 Min. : 87.0
#> 1st Qu.:12.5 1st Qu.:130.0 1st Qu.:0.0000 1st Qu.: 97.5
#> Median :24.0 Median :136.0 Median :0.0000 Median :108.0
#> Mean :24.0 Mean :138.6 Mean :0.3404 Mean :105.6
#> 3rd Qu.:35.5 3rd Qu.:146.0 3rd Qu.:1.0000 3rd Qu.:114.5
#> Max. :47.0 Max. :177.0 Max. :1.0000 Max. :122.0
#> Po1 Po2 LF M.F
#> Min. : 45.0 Min. : 41.00 Min. :480.0 Min. : 934.0
#> 1st Qu.: 62.5 1st Qu.: 58.50 1st Qu.:530.5 1st Qu.: 964.5
#> Median : 78.0 Median : 73.00 Median :560.0 Median : 977.0
#> Mean : 85.0 Mean : 80.23 Mean :561.2 Mean : 983.0
#> 3rd Qu.:104.5 3rd Qu.: 97.00 3rd Qu.:593.0 3rd Qu.: 992.0
#> Max. :166.0 Max. :157.00 Max. :641.0 Max. :1071.0
#> Pop NW U1 U2
#> Min. : 3.00 Min. : 2.0 Min. : 70.00 Min. :20.00
#> 1st Qu.: 10.00 1st Qu.: 24.0 1st Qu.: 80.50 1st Qu.:27.50
#> Median : 25.00 Median : 76.0 Median : 92.00 Median :34.00
#> Mean : 36.62 Mean :101.1 Mean : 95.47 Mean :33.98
#> 3rd Qu.: 41.50 3rd Qu.:132.5 3rd Qu.:104.00 3rd Qu.:38.50
#> Max. :168.00 Max. :423.0 Max. :142.00 Max. :58.00
#> GDP Ineq Prob Time
#> Min. :288.0 Min. :126.0 Min. :0.00690 Min. :12.20
#> 1st Qu.:459.5 1st Qu.:165.5 1st Qu.:0.03270 1st Qu.:21.60
#> Median :537.0 Median :176.0 Median :0.04210 Median :25.80
#> Mean :525.4 Mean :194.0 Mean :0.04709 Mean :26.60
#> 3rd Qu.:591.5 3rd Qu.:227.5 3rd Qu.:0.05445 3rd Qu.:30.45
#> Max. :689.0 Max. :276.0 Max. :0.11980 Max. :44.00
#> y
#> Min. : 342.0
#> 1st Qu.: 658.5
#> Median : 831.0
#> Mean : 905.1
#> 3rd Qu.:1057.5
#> Max. :1993.0
any(duplicated(crime))
#> [1] FALSE
Looking at the head and summary of the data frame, it seems that not much cleaning is needed. However, in the column So the min value is 0 and the max value is 1, so it is probably a binary variable. Let's recheck its unique values.
unique(crime$So)
#> [1] 1 0
It's true that column So holds only two values (either 1 or 0), so the adjustments we'd like to apply are:
- convert the So column to factor
- drop the X column, because it only stores row numbers (unused)

crime <- crime %>%
  select(-X) %>%
  mutate(So = as.factor(So))
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60",
                  "police_exp59", "labour_participation", "m_per1000f",
                  "state_pop", "nonwhites_per1000", "unemploy_m24",
                  "unemploy_m39", "gdp", "inequality", "prob_prison",
                  "time_prison", "crime_rate")
crime

Data Glossary
par(mfrow=c(3,3))
boxplot(crime$percent_m, main = "percent_m")
boxplot(crime$mean_education,main = "mean_education")
boxplot(crime$police_exp60, main = "police_exp60")
boxplot(crime$police_exp59, main = "police_exp59")
boxplot(crime$labour_participation, main = "labour_participation")
boxplot(crime$state_pop, main = "state_pop")
boxplot(crime$nonwhites_per1000, main = "nonwhites_per1000")
boxplot(crime$unemploy_m24, main = "unemploy_m24")
boxplot(crime$unemploy_m39, main = "unemploy_m39")

par(mfrow=c(2,3))
boxplot(crime$gdp, main = "gdp")
boxplot(crime$inequality,main = "inequality")
boxplot(crime$prob_prison, main = "prob_prison")
boxplot(crime$time_prison, main = "time_prison")
boxplot(crime$crime_rate, main = "crime_rate")

From the plots above we can see each column's distribution, and we can conclude that:
percent_m, police_exp59, state_pop, nonwhites_per1000, unemploy_m24, and unemploy_m39 contain outliers. For the moment we won't apply any treatment to them; later on we will check whether the outliers significantly affect our model.
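The same outliers can also be listed programmatically instead of read off the plots; boxplot.stats() applies the standard 1.5 * IQR whisker rule (a minimal sketch on one column):

# Values flagged as outliers by the boxplot whisker rule
boxplot.stats(crime$percent_m)$out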
ggcorr(crime, label = T, hjust = 1, layout.exp = 3)

From the plot above we can conclude:
Crime rate has a moderate-to-strong correlation with police_exp59 and police_exp60.
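As a quick numeric check of that reading of the correlation plot (a sketch using base R's cor()):

# Pairwise correlations behind the two strongest cells in the plot
cor(crime$police_exp60, crime$crime_rate)
cor(crime$police_exp59, crime$crime_rate)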
For a better understanding, and as a baseline to compare against, let's create a model with all columns:
# Create a linear model with the lm() function
crime.lm.all <- lm(crime_rate ~ ., data = crime)
summary(crime.lm.all)
#>
#> Call:
#> lm(formula = crime_rate ~ ., data = crime)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -395.74 -98.09 -6.69 112.99 512.67
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -5984.2876 1628.3184 -3.675 0.000893 ***
#> percent_m 8.7830 4.1714 2.106 0.043443 *
#> is_south1 -3.8035 148.7551 -0.026 0.979765
#> mean_education 18.8324 6.2088 3.033 0.004861 **
#> police_exp60 19.2804 10.6110 1.817 0.078892 .
#> police_exp59 -10.9422 11.7478 -0.931 0.358830
#> labour_participation -0.6638 1.4697 -0.452 0.654654
#> m_per1000f 1.7407 2.0354 0.855 0.398995
#> state_pop -0.7330 1.2896 -0.568 0.573845
#> nonwhites_per1000 0.4204 0.6481 0.649 0.521279
#> unemploy_m24 -5.8271 4.2103 -1.384 0.176238
#> unemploy_m39 16.7800 8.2336 2.038 0.050161 .
#> gdp 0.9617 1.0367 0.928 0.360754
#> inequality 7.0672 2.2717 3.111 0.003983 **
#> prob_prison -4855.2658 2272.3746 -2.137 0.040627 *
#> time_prison -3.4790 7.1653 -0.486 0.630708
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 209.1 on 31 degrees of freedom
#> Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
#> F-statistic: 8.429 on 15 and 31 DF, p-value: 0.0000003539
Highlights
Although police_exp59 and police_exp60 correlate well with the crime rate, both show little significance once all variables are included (in fact police_exp59 is considered not significant at all). This is likely because the two expenditure variables are highly correlated with each other and split the explanatory power between them.
The adjusted R-squared of 0.7078 is reasonably close to 1, which means our model is considered good enough to predict the crime rate, and yet we may find a better one (the reported value is sanity-checked below).
Conclusion: create different models to compare.
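As a sanity check on the reported value, the adjusted R-squared can be recomputed from its definition, adj.R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), using the rounded numbers from the summary above (n = 47 observations, p = 15 predictors):

1 - (1 - 0.8031) * (47 - 1) / (47 - 15 - 1)
#> [1] 0.7078258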
With stepwise regression, the algorithm iteratively eliminates or adds variables to find the subset of predictors that best explains the target variable.
# Perform backward stepwise regression to eliminate variables
crime.bwd <- step(object = crime.lm.all,
direction = "backward",
trace = F)
summary(crime.bwd)
#>
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
#> m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
#> data = crime)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -444.70 -111.07 3.03 122.15 483.30
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6426.101 1194.611 -5.379 0.0000040395 ***
#> percent_m 9.332 3.350 2.786 0.00828 **
#> mean_education 18.012 5.275 3.414 0.00153 **
#> police_exp60 10.265 1.552 6.613 0.0000000826 ***
#> m_per1000f 2.234 1.360 1.642 0.10874
#> unemploy_m24 -6.087 3.339 -1.823 0.07622 .
#> unemploy_m39 18.735 7.248 2.585 0.01371 *
#> inequality 6.133 1.396 4.394 0.0000863344 ***
#> prob_prison -3796.032 1490.646 -2.547 0.01505 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 195.5 on 38 degrees of freedom
#> Multiple R-squared: 0.7888, Adjusted R-squared: 0.7444
#> F-statistic: 17.74 on 8 and 38 DF, p-value: 0.0000000001159
Highlights
Compared to the all-variable linear model, the adjusted R-squared increased to 0.7444. Although the increase is small, the backward stepwise model is the better one, and it has also eliminated the columns that do not significantly affect the crime rate (see the AIC comparison sketched below).
Conclusion: create another model to compare.
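Since step() selects models by AIC, the two candidates so far can also be compared on that criterion directly; the lower AIC indicates the preferred model:

# Compare the full model and the backward stepwise model by AIC
AIC(crime.lm.all, crime.bwd)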
# Create forward stepwise regression
crime.lm.none <- lm(crime_rate ~ 1, data = crime) # a base model with no predictors
summary(step(
object = crime.lm.none,
direction = "forward",
  scope = list(upper = crime.lm.all), # adds predictors one at a time, from none up to the full set
trace = F
))
#>
#> Call:
#> lm(formula = crime_rate ~ police_exp60 + inequality + mean_education +
#> percent_m + prob_prison + unemploy_m39, data = crime)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -470.68 -78.41 -19.68 133.12 556.23
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -5040.505 899.843 -5.602 0.000001715273 ***
#> police_exp60 11.502 1.375 8.363 0.000000000256 ***
#> inequality 6.765 1.394 4.855 0.000018793765 ***
#> mean_education 19.647 4.475 4.390 0.000080720160 ***
#> percent_m 10.502 3.330 3.154 0.00305 **
#> prob_prison -3801.836 1528.097 -2.488 0.01711 *
#> unemploy_m39 8.937 4.091 2.185 0.03483 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 200.7 on 40 degrees of freedom
#> Multiple R-squared: 0.7659, Adjusted R-squared: 0.7307
#> F-statistic: 21.81 on 6 and 40 DF, p-value: 0.00000000003418
Highlights
We can see that the adjusted R-squared is 0.7307 (lower than the backward stepwise model's).
Conclusion: at this point the backward stepwise model is the better one, based on the adjusted R-squared comparison (collected side by side below).
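For that side-by-side view, the forward model can be saved as an object as well (above, its summary was produced inline); crime.fwd is a name introduced here just for this sketch:

# Refit and store the forward model, then collect both adjusted R-squared values
crime.fwd <- step(object = crime.lm.none,
                  direction = "forward",
                  scope = list(upper = crime.lm.all),
                  trace = F)
c(backward = summary(crime.bwd)$adj.r.squared,
  forward = summary(crime.fwd)$adj.r.squared)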
summary(crime.bwd)$call
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
#> m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
#> data = crime)
Looking at the variables used in our stepwise model, we will now observe the ones that contain outliers, which are: percent_m, unemploy_m24, and unemploy_m39.
par(mfrow=c(1,3))
boxplot(crime$percent_m, main = "percent_m", xlab = NULL)
boxplot(crime$unemploy_m24, main = "unemploy_m24", xlab = NULL)
boxplot(crime$unemploy_m39, main = "unemploy_m39", xlab = NULL)

max(crime$percent_m)
#> [1] 177
max(crime$unemploy_m24)
#> [1] 142
max(crime$unemploy_m39)
#> [1] 58
# Test the outlier's effect on the model (column percent_m)
test.outlier <- crime[crime$percent_m < 177,]
lm.noOutlier1 <- lm(crime_rate ~ percent_m, data = test.outlier)
lm.wOutlier1 <- lm(crime_rate ~ percent_m, data = crime)

plot(crime$percent_m, crime$crime_rate)
abline(lm.wOutlier1, col = "blue")
abline(lm.noOutlier1, col = "red")

Highlights
As we can see, with or without the outlier the regression line barely moves; the outlier has low leverage.
Decision: observe more variables that contain outliers.
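Influence can also be quantified rather than judged by eye; cooks.distance() is base R, and the 4/n cut-off used here is just one common rule of thumb (an assumption, thresholds vary):

# Flag observations whose Cook's distance exceeds 4/n
cooks <- cooks.distance(lm.wOutlier1)
which(cooks > 4 / nrow(crime))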
# Test the outlier's effect on the model (column unemploy_m24)
test.outlier <- crime[crime$unemploy_m24 < 142,]
lm.noOutlier2 <- lm(crime_rate ~ unemploy_m24, data = test.outlier)
lm.wOutlier2 <- lm(crime_rate ~ unemploy_m24, data = crime)

plot(crime$unemploy_m24, crime$crime_rate)
abline(lm.wOutlier2, col = "blue")
abline(lm.noOutlier2, col = "red")
Highlights
In this particular case, for the column unemploy_m24, there is little difference between the regression lines with and without the outlier.
Decision
- observe more variables that contain outliers
- compare the adjusted R-squared with and without the outlier
# Test the outlier's effect on the model (column unemploy_m39)
test.outlier <- crime[crime$unemploy_m39 < 58,]
lm.noOutlier3 <- lm(crime_rate ~ unemploy_m39, data = test.outlier)
lm.wOutlier3 <- lm(crime_rate ~ unemploy_m39, data = crime)

plot(crime$unemploy_m39, crime$crime_rate)
abline(lm.wOutlier3, col = "blue")
abline(lm.noOutlier3, col = "red")
Highlights
In this particular case, for the column unemploy_m39, there is a clear leverage difference between the regression lines with and without the outlier.
Decision: compare the R-squared with and without the outlier.
# Column percent_m
summary(lm.noOutlier1)$r.squared
#> [1] 0.00738487
summary(lm.wOutlier1)$r.squared
#> [1] 0.00800531
Highlights
The outlier has low influence on the regression line, and the R-squared is slightly higher with the outlier included.
Decision: keep the outliers in column percent_m.
# Column unemploy_m24
summary(lm.noOutlier2)$r.squared
#> [1] 0.004200003
summary(lm.wOutlier2)$r.squared
#> [1] 0.00254802
Highlights
There is a leverage difference in the regression line (still a low influence), yet the R-squared increases when the outlier is removed.
Decision: trim the outliers from column unemploy_m24.
# Column unemploy_m39
summary(lm.noOutlier3)$r.squared
#> [1] 0.01936133
summary(lm.wOutlier3)$r.squared
#> [1] 0.03144261
Highlights
There is a leverage difference in the regression line, but the influence is still considered low, and the R-squared is higher with the outlier.
Decision: keep the outliers in column unemploy_m39, and adjust our model based on the conclusions above.
crime_clean <- crime[crime$unemploy_m24 < 142,]
lm.crime_clean <- lm(crime_rate ~ ., data = crime_clean)
crime_clean.bwd <- step(lm.crime_clean,
                        direction = "backward",
                        trace = F)

summary(crime_clean.bwd)$adj.r.squared # model with outlier trimmed
#> [1] 0.7462056
summary(crime.bwd)$adj.r.squared # model with outliers kept
#> [1] 0.7443692
Next, let's create predictions on our data to see each model's performance.
crime$prediction <- predict(object = crime_clean.bwd, newdata = crime)
crime$prediction_wOut <- predict(object = crime.bwd, newdata = crime)

To evaluate our models and see the average error they may produce, we're going to calculate the RMSE.
RMSE(y_pred = crime$prediction, y_true = crime$crime_rate)
#> [1] 176.1401
RMSE(y_pred = crime$prediction_wOut, y_true = crime$crime_rate)
#> [1] 175.8304
range(crime$crime_rate)
#> [1] 342 1993
Considering the range of the target variable (342 to 1993), the RMSE is quite high.
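For reference, RMSE is simply the square root of the mean squared error, so MLmetrics::RMSE() should agree with this manual computation (a minimal sketch):

# Root mean squared error computed by hand; should equal the RMSE() value above
sqrt(mean((crime$prediction - crime$crime_rate)^2))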
summary(crime_clean.bwd)$call
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
#> m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
#> data = crime_clean)
# Exclude the prediction columns before checking correlations
ggcorr(crime[,-(17:19)], label = T, hjust = 1, layout.exp = 3)

Since every predictor's correlation with crime_rate is different from 0, we can conclude that the linearity assumption holds.
hist(crime_clean.bwd$residuals)

shapiro.test(crime_clean.bwd$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: crime_clean.bwd$residuals
#> W = 0.98497, p-value = 0.8093
Since the p-value is greater than alpha (0.8093 > 0.05), we fail to reject the null hypothesis and conclude that the errors are normally distributed.
plot(x = crime_clean.bwd$fitted.values, y = crime_clean.bwd$residuals)
abline(h = 0, col = "red")

bptest(crime_clean.bwd)
#>
#> studentized Breusch-Pagan test
#>
#> data: crime_clean.bwd
#> BP = 14.255, df = 8, p-value = 0.07537
Since the p-value is greater than alpha (0.07537 > 0.05), we fail to reject the Breusch-Pagan null hypothesis of homoscedasticity, so we can conclude that heteroscedasticity is not present.
vif(crime_clean.bwd)
#> percent_m mean_education police_exp60 m_per1000f unemploy_m24
#> 2.115006 4.115400 2.560201 1.784947 4.106642
#> unemploy_m39 inequality prob_prison
#> 4.536383 3.724564 1.381809
All predictors' VIF values are below 10, so we can conclude there is no severe multicollinearity between the predictors.
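For a predictor j, the VIF is defined as 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the other predictors in the model; here is a manual check for percent_m (a sketch, it should match the vif() output above):

# R-squared of percent_m regressed on the other model predictors
r2_j <- summary(lm(percent_m ~ mean_education + police_exp60 + m_per1000f +
                     unemploy_m24 + unemploy_m39 + inequality + prob_prison,
                   data = crime_clean))$r.squared
1 / (1 - r2_j)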
Conclusion
All assumptions have been fulfilled. From the whole model evaluation we can conclude that our best model is the backward stepwise model fitted after trimming the outlier from unemploy_m24, while keeping the outliers in columns percent_m & unemploy_m39.
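As a final usage sketch, the selected model can predict the crime rate for unseen data; the observation below is made up purely for illustration:

# A hypothetical new observation containing the 8 predictors the final model uses
new_obs <- data.frame(percent_m = 140, mean_education = 105, police_exp60 = 90,
                      m_per1000f = 980, unemploy_m24 = 95, unemploy_m39 = 34,
                      inequality = 190, prob_prison = 0.045)
predict(crime_clean.bwd, newdata = new_obs)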