library(dplyr)
library(GGally)
library(car)
library(gvlma)
crime <- read.csv("crime.csv") %>% select(-X)
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
ggcorr(crime, label = T,label_size = 2.9, hjust=1, layout.exp = 5)
lm.crime <- lm(crime_rate ~ ., crime)
lm.crime.none <- lm(crime_rate ~ 1, crime)summary(step(lm.crime, data=crime, direction="backward", trace = 0))##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
## data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -444.70 -111.07 3.03 122.15 483.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6426.101 1194.611 -5.379 4.04e-06 ***
## percent_m 9.332 3.350 2.786 0.00828 **
## mean_education 18.012 5.275 3.414 0.00153 **
## police_exp60 10.265 1.552 6.613 8.26e-08 ***
## m_per1000f 2.234 1.360 1.642 0.10874
## unemploy_m24 -6.087 3.339 -1.823 0.07622 .
## unemploy_m39 18.735 7.248 2.585 0.01371 *
## inequality 6.133 1.396 4.394 8.63e-05 ***
## prob_prison -3796.032 1490.646 -2.547 0.01505 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared: 0.7888, Adjusted R-squared: 0.7444
## F-statistic: 17.74 on 8 and 38 DF, p-value: 1.159e-10
summary(step(lm.crime.none, scope = list(upper=lm.crime), data=crime, direction="forward",trace = 0))##
## Call:
## lm(formula = crime_rate ~ police_exp60 + inequality + mean_education +
## percent_m + prob_prison + unemploy_m39, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -470.68 -78.41 -19.68 133.12 556.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5040.505 899.843 -5.602 1.72e-06 ***
## police_exp60 11.502 1.375 8.363 2.56e-10 ***
## inequality 6.765 1.394 4.855 1.88e-05 ***
## mean_education 19.647 4.475 4.390 8.07e-05 ***
## percent_m 10.502 3.330 3.154 0.00305 **
## prob_prison -3801.836 1528.097 -2.488 0.01711 *
## unemploy_m39 8.937 4.091 2.185 0.03483 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared: 0.7659, Adjusted R-squared: 0.7307
## F-statistic: 21.81 on 6 and 40 DF, p-value: 3.418e-11
summary(step(lm.crime, scope = list(upper=lm.crime), data=crime, direction="both",trace = 0))##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
## data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -444.70 -111.07 3.03 122.15 483.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6426.101 1194.611 -5.379 4.04e-06 ***
## percent_m 9.332 3.350 2.786 0.00828 **
## mean_education 18.012 5.275 3.414 0.00153 **
## police_exp60 10.265 1.552 6.613 8.26e-08 ***
## m_per1000f 2.234 1.360 1.642 0.10874
## unemploy_m24 -6.087 3.339 -1.823 0.07622 .
## unemploy_m39 18.735 7.248 2.585 0.01371 *
## inequality 6.133 1.396 4.394 8.63e-05 ***
## prob_prison -3796.032 1490.646 -2.547 0.01505 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared: 0.7888, Adjusted R-squared: 0.7444
## F-statistic: 17.74 on 8 and 38 DF, p-value: 1.159e-10
summary(step(lm.crime.none, scope = list(upper=lm.crime), data=crime, direction="both",trace = 0))##
## Call:
## lm(formula = crime_rate ~ police_exp60 + inequality + mean_education +
## percent_m + prob_prison + unemploy_m39, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -470.68 -78.41 -19.68 133.12 556.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5040.505 899.843 -5.602 1.72e-06 ***
## police_exp60 11.502 1.375 8.363 2.56e-10 ***
## inequality 6.765 1.394 4.855 1.88e-05 ***
## mean_education 19.647 4.475 4.390 8.07e-05 ***
## percent_m 10.502 3.330 3.154 0.00305 **
## prob_prison -3801.836 1528.097 -2.488 0.01711 *
## unemploy_m39 8.937 4.091 2.185 0.03483 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared: 0.7659, Adjusted R-squared: 0.7307
## F-statistic: 21.81 on 6 and 40 DF, p-value: 3.418e-11
Checking from 4 different directions of stepwise, gave us the best adj. R Square if we chose below variables to be included into the model:
Hence our Linear Model call will be
lm.adj <- lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
data = crime)plot(lm.adj, which=c(1,1))
This plot shows that the residuals don’t have non-linear patterns, because we can see on the plot that the residuals are spread around a horizontal line without distinct patterns
plot(lm.adj, which=c(2,2))
From the normal Q-Q Plot given, we could assume that the model is normally distributed, beacause the points is faling along a straight line. And if we look closely to the pattern, the normal distribution is assumed as skewed left type (positive skewness).
plot(lm.adj, which=c(3,3))
From this plot we can see that the model is homoscedastic because the residuals are tend to spread equally along the predictor range
plot(lm.adj, which=c(5,5))
We can barely see cook’s distance lines because all cases are well inside of the cook’s, hence the outliers didn’t have significant impact to the data model
vif(lm.adj)## percent_m mean_education police_exp60 m_per1000f unemploy_m24
## 2.131963 4.189684 2.560496 1.932367 4.360038
## unemploy_m39 inequality prob_prison
## 4.508106 3.731074 1.381879
Analyzing the value of VIF, we can assume that the multicollinearity of our data model is low (< 10)
gvlma(lm.adj)##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
## data = crime)
##
## Coefficients:
## (Intercept) percent_m mean_education police_exp60
## -6426.101 9.332 18.012 10.265
## m_per1000f unemploy_m24 unemploy_m39 inequality
## 2.234 -6.087 18.735 6.133
## prob_prison
## -3796.032
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = lm.adj)
##
## Value p-value Decision
## Global Stat 2.4658 0.6508 Assumptions acceptable.
## Skewness 0.0841 0.7718 Assumptions acceptable.
## Kurtosis 0.1539 0.6948 Assumptions acceptable.
## Link Function 2.0589 0.1513 Assumptions acceptable.
## Heteroscedasticity 0.1689 0.6811 Assumptions acceptable.
Global Stat showing the p-value > 0.05, indicates the linear relationship between X’s and Y
The Skewness value indicate the normal distribution is slightly skewed to left (positive skewness), p-value > 0.05 indicates that the data is distributed normally (no need for data transformation)
The kurtosis value indicate the normal distribution is slightly leptokurtic - fatter tails (positive kurtosis), p-value > 0.05 indicates that the data is distributed normally (no need for data transformation)
Link Function showing the p-value > 0.05, indicates that the dependant variable used is continuous rather than categorical
Heteroscedasticity showing the p-value > 0.05, indicates that the model assumption of homoscedasticity is verified
lmtest::bptest(lm.adj)##
## studentized Breusch-Pagan test
##
## data: lm.adj
## BP = 13.51, df = 8, p-value = 0.09546
The BP Test showing the p-value > 0.05, indicates that the model assumption of homoscedasticity is verified
After doing several test above we could safely assume that the model is good to be used
The final beta values for defining crime rate based on our data is
| x | |
|---|---|
| (Intercept) | -6426.101018 |
| prob_prison | -3796.031826 |
| unemploy_m24 | -6.086633 |
| m_per1000f | 2.233975 |
| inequality | 6.133494 |
| percent_m | 9.332155 |
| police_exp60 | 10.265316 |
| mean_education | 18.012011 |
| unemploy_m39 | 18.734512 |
Looking at the model shown, in order to decrease the crime rate significantly, we should increase the probability of the people to be imprisoned. Another recommendation that the model can give, we should decrease mean years of schooling (whaaaaaat??) and unemployed Male at the age of 35-39