On this occasion, we will analyze using a linear regression model on
the data crime.csv.
We will explore the target variable crime_rate to be
able to analyze how socio-demographic variables influence the crime rate
in an area.
crime <- read.csv("crime.csv")
head(crime)As we can see in the data display above, the column names in the data are still in the form of codes. Therefore, we will change the column names to match the proper column names.
crime <- crime %>% select(-X)
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
head(crime)Check the data information.
str(crime) ## 'data.frame': 47 obs. of 16 variables:
## $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
## $ is_south : int 1 0 1 0 0 0 1 1 1 0 ...
## $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
## $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
## $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
## $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
## $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
## $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
## $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
## $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
## $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
## $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
## $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
## $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
## $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
## $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
From the data information above, there is one variable that has a data type that does not match the data. Therefore, we will change the data type to match. And we will check if any data has missing values.
crime <- crime %>%
mutate(is_south = as.factor(is_south))
str(crime)## 'data.frame': 47 obs. of 16 variables:
## $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
## $ is_south : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 2 2 1 ...
## $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
## $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
## $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
## $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
## $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
## $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
## $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
## $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
## $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
## $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
## $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
## $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
## $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
## $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
anyNA(crime)## [1] FALSE
Check the variable correlation.
ggcorr(crime, label = T, hjust = 1, layout.exp = 4)## Warning in ggcorr(crime, label = T, hjust = 1, layout.exp = 4): data in
## column(s) 'is_south' are not numeric and were ignored
The variables with a strong correlation with
crime_rate are
police_exp59(0.7) and police_exp60 (0.7).
We will make several linear regression model to predict
crime_rate. Based on the two most potential variables,
based on the combination of the two most potential variables, and based
on the combination of all the existing variables.
crime_rate_RM_59 <- lm(crime_rate ~ police_exp59, crime)
crime_rate_RM_60 <- lm(crime_rate ~ police_exp60, crime)
crime_rate_RM_5960 <- lm(crime_rate ~ police_exp59 + police_exp60, crime)
crime_rate_RM_all <- lm(crime_rate ~ ., crime)
summary(crime_rate_RM_59)##
## Call:
## lm(formula = crime_rate ~ police_exp59, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -595.58 -156.76 12.29 146.74 593.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 165.164 130.427 1.266 0.212
## police_exp59 9.222 1.537 6.001 3.11e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 291.4 on 45 degrees of freedom
## Multiple R-squared: 0.4445, Adjusted R-squared: 0.4322
## F-statistic: 36.01 on 1 and 45 DF, p-value: 3.114e-07
summary(crime_rate_RM_60)##
## Call:
## lm(formula = crime_rate ~ police_exp60, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -586.91 -155.63 32.52 139.58 568.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 144.464 126.693 1.140 0.26
## police_exp60 8.948 1.409 6.353 9.34e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 283.9 on 45 degrees of freedom
## Multiple R-squared: 0.4728, Adjusted R-squared: 0.4611
## F-statistic: 40.36 on 1 and 45 DF, p-value: 9.338e-08
summary(crime_rate_RM_5960)##
## Call:
## lm(formula = crime_rate ~ police_exp59 + police_exp60, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -636.09 -168.62 35.44 141.80 532.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 158.26 125.93 1.257 0.2155
## police_exp59 -17.83 13.12 -1.359 0.1810
## police_exp60 25.62 12.34 2.076 0.0438 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 281.3 on 44 degrees of freedom
## Multiple R-squared: 0.494, Adjusted R-squared: 0.471
## F-statistic: 21.48 on 2 and 44 DF, p-value: 3.094e-07
summary(crime_rate_RM_all)##
## Call:
## lm(formula = crime_rate ~ ., data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.74 -98.09 -6.69 112.99 512.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5984.2876 1628.3184 -3.675 0.000893 ***
## percent_m 8.7830 4.1714 2.106 0.043443 *
## is_south1 -3.8035 148.7551 -0.026 0.979765
## mean_education 18.8324 6.2088 3.033 0.004861 **
## police_exp60 19.2804 10.6110 1.817 0.078892 .
## police_exp59 -10.9422 11.7478 -0.931 0.358830
## labour_participation -0.6638 1.4697 -0.452 0.654654
## m_per1000f 1.7407 2.0354 0.855 0.398995
## state_pop -0.7330 1.2896 -0.568 0.573845
## nonwhites_per1000 0.4204 0.6481 0.649 0.521279
## unemploy_m24 -5.8271 4.2103 -1.384 0.176238
## unemploy_m39 16.7800 8.2336 2.038 0.050161 .
## gdp 0.9617 1.0367 0.928 0.360754
## inequality 7.0672 2.2717 3.111 0.003983 **
## prob_prison -4855.2658 2272.3746 -2.137 0.040627 *
## time_prison -3.4790 7.1653 -0.486 0.630708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
## F-statistic: 8.429 on 15 and 31 DF, p-value: 3.539e-07
And We will also buil a step-wise regression forward model.
# preparation : model without predictor
crime_rate_none <- lm(crime_rate ~ 1, data = crime)
# stepwise regression - forward
crime_rate_sw <- step(object = crime_rate_none, scope = list(upper = crime_rate_RM_all), direction = "forward", trace = F)
summary(crime_rate_sw)##
## Call:
## lm(formula = crime_rate ~ police_exp60 + inequality + mean_education +
## percent_m + prob_prison + unemploy_m39, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -470.68 -78.41 -19.68 133.12 556.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5040.505 899.843 -5.602 1.72e-06 ***
## police_exp60 11.502 1.375 8.363 2.56e-10 ***
## inequality 6.765 1.394 4.855 1.88e-05 ***
## mean_education 19.647 4.475 4.390 8.07e-05 ***
## percent_m 10.502 3.330 3.154 0.00305 **
## prob_prison -3801.836 1528.097 -2.488 0.01711 *
## unemploy_m39 8.937 4.091 2.185 0.03483 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared: 0.7659, Adjusted R-squared: 0.7307
## F-statistic: 21.81 on 6 and 40 DF, p-value: 3.418e-11
We will compare the r-squared of each model that we have created.
summary(crime_rate_RM_59)$r.squared## [1] 0.4445077
summary(crime_rate_RM_60)$r.squared## [1] 0.4727999
summary(crime_rate_RM_5960)$adj.r.squared## [1] 0.4710441
summary(crime_rate_RM_all)$adj.r.squared## [1] 0.7078062
summary(crime_rate_sw)$adj.r.squared## [1] 0.7307463
And now We will compare the RMSE of each model that we have created. And also W will use 98% Confidence Interval.
pred_crime_rate_RM_59 <- predict(crime_rate_RM_59,
newdata = crime,
interval = "prediction",
level = 0.98)
pred_crime_rate_RM_60 <- predict(crime_rate_RM_60,
newdata = crime,
interval = "prediction",
level = 0.98)
pred_crime_rate_RM_5960 <- predict(crime_rate_RM_5960,
newdata = crime,
interval = "prediction",
level = 0.98)
pred_crime_rate_RM_all <- predict(crime_rate_RM_all,
newdata = crime,
interval = "prediction",
level = 0.98)
pred_crime_rate_sw <- predict(crime_rate_sw,
newdata = crime,
interval = "prediction",
level = 0.98)
# evaluation
RMSE(pred_crime_rate_RM_59, crime$crime_rate)## [1] 651.7798
RMSE(pred_crime_rate_RM_60, crime$crime_rate)## [1] 634.9648
RMSE(pred_crime_rate_RM_5960, crime$crime_rate)## [1] 633.3421
RMSE(pred_crime_rate_RM_all, crime$crime_rate)## [1] 513.6302
RMSE(pred_crime_rate_sw, crime$crime_rate)## [1] 464.1492
Based on the evaluation that We conduct above, for the r-squared
& RMSE, the best model we can use is linear regression model with
all predictor (crime_rate_RM_all).
There are several assumptions applied by the model. It can be checked whether these assumptions are met to ensure the goodness of the model.
Histogram residual visualization.
hist(crime_rate_RM_all$residuals)
Saphiro statistic test.
shapiro.test(crime_rate_RM_all$residuals)##
## Shapiro-Wilk normality test
##
## data: crime_rate_RM_all$residuals
## W = 0.9846, p-value = 0.7849
H0: error/residual normally distributedH1: error/residual not normally distributedp-value >= 0.05 = accept H0
p-value < 0.05 = accept H1
Conclusion: error/residual is normally distributed, assumption test is met.
Scatterplot visualization.
plot(x = crime_rate_RM_all$fitted.values, y = crime_rate_RM_all$residuals)
abline(h = 0, col = "red")
Breusch-Pagan hypothesis statistic test.
bptest(crime_rate_RM_all)##
## studentized Breusch-Pagan test
##
## data: crime_rate_RM_all
## BP = 18.469, df = 15, p-value = 0.2388
p-value >= 0.05 = accept H0
p-value < 0.05 = accept H1
Conclusion: Variance of error spreads constant (Homoscedasticity), assumption test is met.
VIF (Variance Inflation Factor) test.
vif(crime_rate_RM_all)## percent_m is_south mean_education
## 2.892448 5.342783 5.077447
## police_exp60 police_exp59 labour_participation
## 104.658667 113.559262 3.712690
## m_per1000f state_pop nonwhites_per1000
## 3.785934 2.536708 4.674088
## unemploy_m24 unemploy_m39 gdp
## 6.063931 5.088880 10.530375
## inequality prob_prison time_prison
## 8.644528 2.809459 2.713785
Conclusion: Our model has multicollinearity, the assumption test is not met.
From the normality of residuals, Homoscedasticity, and
Multicolinearity test, our crime_rate_RM_all model only
fail the assumption in the Multicolinearity test. Therefore, we will
validate the second best model, namely crime_rate_sw
model.
#Normality of Residuals
shapiro.test(crime_rate_sw$residuals)##
## Shapiro-Wilk normality test
##
## data: crime_rate_sw$residuals
## W = 0.96898, p-value = 0.2423
#Homoscadasticity
bptest(crime_rate_sw)##
## studentized Breusch-Pagan test
##
## data: crime_rate_sw
## BP = 14.873, df = 6, p-value = 0.02127
#Multicolinearity
vif(crime_rate_sw)## police_exp60 inequality mean_education percent_m prob_prison
## 1.908124 3.530429 2.862893 2.000244 1.378714
## unemploy_m39
## 1.363074
From the three validation test above, our crime_rate_sw
model has fail in the Homoscedasticity assumption.
The risk of assumption test:
* Normality is not met: there is a large probability of error in our
model (far from 0).
* Homoscedasticity is not met: there is a possibility of a large error
in a certain range of values.
* Multicoll not met: less efficient.
So, it can be concluded that our crime_rate_RM_all model
is the best model because it can handle errors that may appear better
despite the lack of efficiency than crime_rate_sw
model.
Based on the model that we have made, all predictor variables are very useful for predicting the target variable. In other words, all socio-demographic variables will influence the crime rate in an area.