In this chapter, we will learn how to build a linear regression model using the “crime” dataset. We want to understand the relationships among the variables, especially between the crime rate and the other variables. We also want to find patterns that will help us predict the crime rate in a specific area based on historical data.
Load the required packages.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.5
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.0.5
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 4.0.5
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
library(ggplot2)
options(scipen = 100, max.print = 1e+06)
Load the dataset.
crime <- read.csv("crime.csv")
head(crime)
glimpse(crime)
## Rows: 47
## Columns: 16
## $ percent_m <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,~
## $ is_south <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0~
## $ mean_education <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10~
## $ police_exp60 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,~
## $ police_exp59 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, ~
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,~
## $ m_per1000f <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102~
## $ state_pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2~
## $ nonwhites_per1000 <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106~
## $ unemploy_m24 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83~
## $ unemploy_m39 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2~
## $ gdp <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,~
## $ inequality <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,~
## $ prob_prison <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,~
## $ time_prison <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9~
## $ crime_rate <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, ~
The data has 47 rows and 16 columns. Each of the variables captures information that may affect the crime rate of a certain area. For example, gdp and inequality are two of the most well-known factors associated with the presence of crime.
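Before cleaning, a quick numeric summary of the target variable gives a sense of its scale; a minimal sketch using base R (an addition, not part of the original workflow):
# Five-number summary (plus mean) of the target variable
summary(crime$crime_rate)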
Before we go to the next step, we need to make sure that our data is clean. We need to check whether there are any missing values in the dataset.
anyNA(crime)
## [1] FALSE
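anyNA() only reports whether any value is missing at all; if it had returned TRUE, a per-column count would show where the gaps are. A small sketch using base R (an addition, not part of the original workflow):
# Count missing values per column; all zeros confirm the data is complete
colSums(is.na(crime))
# If some columns did contain NAs, the affected rows could be dropped before modeling:
# crime <- na.omit(crime)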
Once we are sure that our data is clean, we can continue to the next step.
Exploratory data analysis (EDA) is the process of exploring the data variables to find patterns that may indicate correlations between variables.
Find the Pearson correlation between variables.
ggcorr(crime, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
From the visualization above, nonwhites_per1000 is the only variable showing no correlation with crime_rate. Furthermore, gdp and inequality have relatively low correlation scores, suggesting they play a smaller role in the presence of crime in a certain area. The variables most correlated with crime_rate are police_exp59 and police_exp60, which makes sense: areas with a high number of incidents tend to have high police expenditure.
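To complement the plot, the exact correlation of each variable with crime_rate can be listed and sorted; a small sketch using base R (this works here because every column is numeric):
# Pearson correlations with crime_rate, strongest first
sort(cor(crime)[, "crime_rate"], decreasing = TRUE)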
However, high correlation does not always imply high influence, and vice versa: a variable with low correlation may still have a significant impact on the prediction. Hence, we will first build a model using all the variables to find new insights about our data.
model_all <- lm(formula = crime_rate ~ ., data = crime)
summary(model_all)
##
## Call:
## lm(formula = crime_rate ~ ., data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.74 -98.09 -6.69 112.99 512.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5984.2876 1628.3184 -3.675 0.000893 ***
## percent_m 8.7830 4.1714 2.106 0.043443 *
## is_south -3.8035 148.7551 -0.026 0.979765
## mean_education 18.8324 6.2088 3.033 0.004861 **
## police_exp60 19.2804 10.6110 1.817 0.078892 .
## police_exp59 -10.9422 11.7478 -0.931 0.358830
## labour_participation -0.6638 1.4697 -0.452 0.654654
## m_per1000f 1.7407 2.0354 0.855 0.398995
## state_pop -0.7330 1.2896 -0.568 0.573845
## nonwhites_per1000 0.4204 0.6481 0.649 0.521279
## unemploy_m24 -5.8271 4.2103 -1.384 0.176238
## unemploy_m39 16.7800 8.2336 2.038 0.050161 .
## gdp 0.9617 1.0367 0.928 0.360754
## inequality 7.0672 2.2717 3.111 0.003983 **
## prob_prison -4855.2658 2272.3746 -2.137 0.040627 *
## time_prison -3.4790 7.1653 -0.486 0.630708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
## F-statistic: 8.429 on 15 and 31 DF, p-value: 0.0000003539
Using a confidence level of 95%, we obtained an Adjusted R-squared score of 0.7078 when training the model on all variables. However, several variables show low significance (p-value > 0.05). In this case, we will eliminate some variables using the backward stepwise method.
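As a side note, the predictors that are already significant at the 5% level can be extracted programmatically from the model summary; a minimal sketch (the "Pr(>|t|)" column name is as returned by summary.lm):
# Keep only coefficient rows whose p-value is below 0.05
coefs <- summary(model_all)$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]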
model_step <- step(object = model_all, direction = "backward")
## Start: AIC=514.65
## crime_rate ~ percent_m + is_south + mean_education + police_exp60 +
## police_exp59 + labour_participation + m_per1000f + state_pop +
## nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality +
## prob_prison + time_prison
##
## Df Sum of Sq RSS AIC
## - is_south 1 29 1354974 512.65
## - labour_participation 1 8917 1363862 512.96
## - time_prison 1 10304 1365250 513.00
## - state_pop 1 14122 1369068 513.14
## - nonwhites_per1000 1 18395 1373341 513.28
## - m_per1000f 1 31967 1386913 513.74
## - gdp 1 37613 1392558 513.94
## - police_exp59 1 37919 1392865 513.95
## <none> 1354946 514.65
## - unemploy_m24 1 83722 1438668 515.47
## - police_exp60 1 144306 1499252 517.41
## - unemploy_m39 1 181536 1536482 518.56
## - percent_m 1 193770 1548716 518.93
## - prob_prison 1 199538 1554484 519.11
## - mean_education 1 402117 1757063 524.86
## - inequality 1 423031 1777977 525.42
##
## Step: AIC=512.65
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 +
## labour_participation + m_per1000f + state_pop + nonwhites_per1000 +
## unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison +
## time_prison
##
## Df Sum of Sq RSS AIC
## - time_prison 1 10341 1365315 511.01
## - labour_participation 1 10878 1365852 511.03
## - state_pop 1 14127 1369101 511.14
## - nonwhites_per1000 1 21626 1376600 511.39
## - m_per1000f 1 32449 1387423 511.76
## - police_exp59 1 37954 1392929 511.95
## - gdp 1 39223 1394197 511.99
## <none> 1354974 512.65
## - unemploy_m24 1 96420 1451395 513.88
## - police_exp60 1 144302 1499277 515.41
## - unemploy_m39 1 189859 1544834 516.81
## - percent_m 1 195084 1550059 516.97
## - prob_prison 1 204463 1559437 517.26
## - mean_education 1 403140 1758114 522.89
## - inequality 1 488834 1843808 525.13
##
## Step: AIC=511.01
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 +
## labour_participation + m_per1000f + state_pop + nonwhites_per1000 +
## unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
##
## Df Sum of Sq RSS AIC
## - labour_participation 1 10533 1375848 509.37
## - nonwhites_per1000 1 15482 1380797 509.54
## - state_pop 1 21846 1387161 509.75
## - police_exp59 1 28932 1394247 509.99
## - gdp 1 36070 1401385 510.23
## - m_per1000f 1 41784 1407099 510.42
## <none> 1365315 511.01
## - unemploy_m24 1 91420 1456735 512.05
## - police_exp60 1 134137 1499452 513.41
## - unemploy_m39 1 184143 1549458 514.95
## - percent_m 1 186110 1551425 515.01
## - prob_prison 1 237493 1602808 516.54
## - mean_education 1 409448 1774763 521.33
## - inequality 1 502909 1868224 523.75
##
## Step: AIC=509.37
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 +
## m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 +
## unemploy_m39 + gdp + inequality + prob_prison
##
## Df Sum of Sq RSS AIC
## - nonwhites_per1000 1 11675 1387523 507.77
## - police_exp59 1 21418 1397266 508.09
## - state_pop 1 27803 1403651 508.31
## - m_per1000f 1 31252 1407100 508.42
## - gdp 1 35035 1410883 508.55
## <none> 1375848 509.37
## - unemploy_m24 1 80954 1456802 510.06
## - police_exp60 1 123896 1499744 511.42
## - unemploy_m39 1 190746 1566594 513.47
## - percent_m 1 217716 1593564 514.27
## - prob_prison 1 226971 1602819 514.54
## - mean_education 1 413254 1789103 519.71
## - inequality 1 500944 1876792 521.96
##
## Step: AIC=507.77
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 +
## m_per1000f + state_pop + unemploy_m24 + unemploy_m39 + gdp +
## inequality + prob_prison
##
## Df Sum of Sq RSS AIC
## - police_exp59 1 16706 1404229 506.33
## - state_pop 1 25793 1413315 506.63
## - m_per1000f 1 26785 1414308 506.66
## - gdp 1 31551 1419073 506.82
## <none> 1387523 507.77
## - unemploy_m24 1 83881 1471404 508.52
## - police_exp60 1 118348 1505871 509.61
## - unemploy_m39 1 201453 1588976 512.14
## - prob_prison 1 216760 1604282 512.59
## - percent_m 1 309214 1696737 515.22
## - mean_education 1 402754 1790276 517.74
## - inequality 1 589736 1977259 522.41
##
## Step: AIC=506.33
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f +
## state_pop + unemploy_m24 + unemploy_m39 + gdp + inequality +
## prob_prison
##
## Df Sum of Sq RSS AIC
## - state_pop 1 22345 1426575 505.07
## - gdp 1 32142 1436371 505.39
## - m_per1000f 1 36808 1441037 505.54
## <none> 1404229 506.33
## - unemploy_m24 1 86373 1490602 507.13
## - unemploy_m39 1 205814 1610043 510.76
## - prob_prison 1 218607 1622836 511.13
## - percent_m 1 307001 1711230 513.62
## - mean_education 1 389502 1793731 515.83
## - inequality 1 608627 2012856 521.25
## - police_exp60 1 1050202 2454432 530.57
##
## Step: AIC=505.07
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f +
## unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
##
## Df Sum of Sq RSS AIC
## - gdp 1 26493 1453068 503.93
## <none> 1426575 505.07
## - m_per1000f 1 84491 1511065 505.77
## - unemploy_m24 1 99463 1526037 506.24
## - prob_prison 1 198571 1625145 509.20
## - unemploy_m39 1 208880 1635455 509.49
## - percent_m 1 320926 1747501 512.61
## - mean_education 1 386773 1813348 514.35
## - inequality 1 594779 2021354 519.45
## - police_exp60 1 1127277 2553852 530.44
##
## Step: AIC=503.93
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f +
## unemploy_m24 + unemploy_m39 + inequality + prob_prison
##
## Df Sum of Sq RSS AIC
## <none> 1453068 503.93
## - m_per1000f 1 103159 1556227 505.16
## - unemploy_m24 1 127044 1580112 505.87
## - prob_prison 1 247978 1701046 509.34
## - unemploy_m39 1 255443 1708511 509.55
## - percent_m 1 296790 1749858 510.67
## - mean_education 1 445788 1898855 514.51
## - inequality 1 738244 2191312 521.24
## - police_exp60 1 1672038 3125105 537.93
summary(model_step)
##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
## data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -444.70 -111.07 3.03 122.15 483.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6426.101 1194.611 -5.379 0.0000040395 ***
## percent_m 9.332 3.350 2.786 0.00828 **
## mean_education 18.012 5.275 3.414 0.00153 **
## police_exp60 10.265 1.552 6.613 0.0000000826 ***
## m_per1000f 2.234 1.360 1.642 0.10874
## unemploy_m24 -6.087 3.339 -1.823 0.07622 .
## unemploy_m39 18.735 7.248 2.585 0.01371 *
## inequality 6.133 1.396 4.394 0.0000863344 ***
## prob_prison -3796.032 1490.646 -2.547 0.01505 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared: 0.7888, Adjusted R-squared: 0.7444
## F-statistic: 17.74 on 8 and 38 DF, p-value: 0.0000000001159
By using the backward stepwise method, we eliminated the less important variables, and the Adjusted R-squared score increased to 0.7444.
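A quick side-by-side check of the two models confirms this; a minimal sketch:
# Adjusted R-squared of the full model vs. the stepwise model
summary(model_all)$adj.r.squared   # 0.7078
summary(model_step)$adj.r.squared  # 0.7444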
Linear regression is a parametric model, meaning that to produce a valid model equation, it must satisfy several classical assumptions. A linear regression model that violates these assumptions may be misleading or simply have biased estimators. In this section, we only check the second model (the model with the reduced set of variables).
The linear regression model assumes that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.
Residual plots are a useful graphical tool for identifying non-linearity. If there is a pattern in the residual plot, it means either that the model can be further improved or that it does not meet the linearity assumption. The plot shows the relationship between the residuals/errors and the predicted/fitted values.
resact <- data.frame(residual = model_step$residuals, fitted = model_step$fitted.values)
resact %>%
  ggplot(aes(fitted, residual)) +
  geom_point() +
  geom_smooth() +
  geom_hline(aes(yintercept = 0)) +
  theme(panel.grid = element_blank(), panel.background = element_blank())
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The second assumption in linear regression is that the residuals follow a normal distribution. We can easily check this using the Shapiro-Wilk normality test.
shapiro.test(model_step$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_step$residuals
## W = 0.98511, p-value = 0.8051
From the test above, we know that the residuals are normally distributed: the p-value of 0.8051 is higher than 0.05, so we fail to reject the null hypothesis of normality.
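A normal Q-Q plot of the residuals offers a visual complement to the test; a small sketch using base R (an addition, not part of the original analysis):
# Points falling close to the reference line support the normality assumption
qqnorm(model_step$residuals)
qqline(model_step$residuals)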
Heteroscedasticity means that the variances of the error terms are non-constant. Non-constant error variance can be identified from the presence of a funnel shape in the residual plot, the same plot used to check linearity. We can also test for it formally with the Breusch-Pagan test.
bptest(model_step)
##
## studentized Breusch-Pagan test
##
## data: model_step
## BP = 13.51, df = 8, p-value = 0.09546
Based on the Breusch-Pagan test above, heteroscedasticity is not present: the p-value of 0.09546 is higher than 0.05, so we fail to reject the null hypothesis of constant error variance.
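As a visual complement, a scale-location plot can also be inspected; a small sketch using the base plot method for lm objects:
# which = 3 selects the scale-location plot; a roughly flat trend suggests constant variance
plot(model_step, which = 3)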
Using the VIF (variance inflation factor), we can determine whether there is multicollinearity among the predictor variables. A high VIF value indicates a high correlation between variables.
library(car)
vif(model_step)
## percent_m mean_education police_exp60 m_per1000f unemploy_m24
## 2.131963 4.189684 2.560496 1.932367 4.360038
## unemploy_m39 inequality prob_prison
## 4.508106 3.731074 1.381879
Multicollinearity is not present in our model because the VIF values for all variables are below 10.
Using the crime_test dataset, we will apply both models to predict the crime rate and compare the predicted values with the actual crime rate values in the dataset.
crime_test <- read.csv("crime_test.csv", stringsAsFactors = T)
crime_test <- crime_test %>%
mutate(prediction = predict(object = model_all, newdata = crime_test))
crime_test
crime_test2 <- crime_test %>%
mutate(prediction = predict(object = model_step, newdata = crime_test))
crime_test2
The performance of our model (how well it predicts the target variable) can be measured with the root mean squared error (RMSE). RMSE is preferable to MAE (mean absolute error) here because it squares the difference between the actual and predicted values, so predictions with larger errors are penalized more heavily. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the MLmetrics package.
RMSE(y_true = crime_test$crime_rate, y_pred = crime_test$prediction)
## [1] 212.7655
RMSE(y_true = crime_test2$crime_rate, y_pred = crime_test2$prediction)
## [1] 215.5166
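Since RMSE is just the square root of the mean squared difference, the value above can be verified by hand; the sketch below also computes MAE from MLmetrics for comparison (an addition, not part of the original comparison):
# Manual RMSE for the full model's predictions on the test set
sqrt(mean((crime_test$crime_rate - crime_test$prediction)^2))
# MAE for the same predictions; less sensitive to large individual errors than RMSE
MAE(y_pred = crime_test$prediction, y_true = crime_test$crime_rate)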
New insight: a higher Adjusted R-squared does not always yield a lower RMSE. Comparing the two models, the first model, which has the lower Adjusted R-squared score, produced the lower RMSE on the test set.
In this case study, further work on which variables should be eliminated is still needed. Although both models seem to give good results, the gap between the actual and predicted values is still large, and such differences could have a substantial effect on the decisions a government makes.