Dataset crime is data that contains information about crime in some state with various factors. In this LBB, we’ll try to make a linear regression model to predict crime rate and see how different predictor variables affect it.
Import file .csv in the same folder with our project then assign it into an object named crime.
library(dplyr)## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
crime <- read.csv("crime.csv", stringsAsFactors = F) %>% select(-X)
head(crime)Change columns name, check the structure of dataset. Change data type correctly.
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60",
"police_exp59", "labour_participation", "m_per1000f",
"state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39",
"gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
glimpse(crime)## Rows: 47
## Columns: 16
## $ percent_m <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,~
## $ is_south <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0~
## $ mean_education <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10~
## $ police_exp60 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,~
## $ police_exp59 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, ~
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,~
## $ m_per1000f <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102~
## $ state_pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2~
## $ nonwhites_per1000 <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106~
## $ unemploy_m24 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83~
## $ unemploy_m39 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2~
## $ gdp <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,~
## $ inequality <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,~
## $ prob_prison <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,~
## $ time_prison <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9~
## $ crime_rate <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, ~
crime <- crime %>% mutate(is_south = as.factor(is_south))
glimpse(crime)## Rows: 47
## Columns: 16
## $ percent_m <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,~
## $ is_south <fct> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0~
## $ mean_education <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10~
## $ police_exp60 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,~
## $ police_exp59 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, ~
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,~
## $ m_per1000f <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102~
## $ state_pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2~
## $ nonwhites_per1000 <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106~
## $ unemploy_m24 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83~
## $ unemploy_m39 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2~
## $ gdp <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,~
## $ inequality <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,~
## $ prob_prison <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,~
## $ time_prison <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9~
## $ crime_rate <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, ~
columns in dataset crime - percent_m: percentage of males aged 14-24 - is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
- mean_education: mean years of schooling
- police_exp60: police expenditure in 1960
- police_exp59: police expenditure in 1959 - labour_participation: labour force participation rate
- m_per1000f: number of males per 1000 females
- state_pop: state population
- nonwhites_per1000: number of non-whites resident per 1000 people
- unemploy_m24: unemployment rate of urban males aged 14-24
- unemploy_m39: unemployment rate of urban males aged 35-39
- gdp: gross domestic product per head
- inequality: income inequality
- prob_prison: probability of imprisonment
- time_prison: avg time served in prisons
- crime_rate: crime rate in an unspecified category
Use GGally libray to see the correlation of between target variable and predictors in dataset crime
library(GGally)## Warning: package 'GGally' was built under R version 4.0.5
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.5
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(crime, label = T, hjust = .9, layout.exp = 2, label_size = 3, cex = 3)## Warning in ggcorr(crime, label = T, hjust = 0.9, layout.exp = 2, label_size =
## 3, : data in column(s) 'is_south' are not numeric and were ignored
According to ggcorr, highest correlation is in police_exp59 and police_exp60. We use crime_rate as target variable.
Make model with all predictors.
model_all <- lm(crime_rate ~., data = crime)
summary(model_all)##
## Call:
## lm(formula = crime_rate ~ ., data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.74 -98.09 -6.69 112.99 512.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5984.2876 1628.3184 -3.675 0.000893 ***
## percent_m 8.7830 4.1714 2.106 0.043443 *
## is_south1 -3.8035 148.7551 -0.026 0.979765
## mean_education 18.8324 6.2088 3.033 0.004861 **
## police_exp60 19.2804 10.6110 1.817 0.078892 .
## police_exp59 -10.9422 11.7478 -0.931 0.358830
## labour_participation -0.6638 1.4697 -0.452 0.654654
## m_per1000f 1.7407 2.0354 0.855 0.398995
## state_pop -0.7330 1.2896 -0.568 0.573845
## nonwhites_per1000 0.4204 0.6481 0.649 0.521279
## unemploy_m24 -5.8271 4.2103 -1.384 0.176238
## unemploy_m39 16.7800 8.2336 2.038 0.050161 .
## gdp 0.9617 1.0367 0.928 0.360754
## inequality 7.0672 2.2717 3.111 0.003983 **
## prob_prison -4855.2658 2272.3746 -2.137 0.040627 *
## time_prison -3.4790 7.1653 -0.486 0.630708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
## F-statistic: 8.429 on 15 and 31 DF, p-value: 3.539e-07
percent_m, mean_education, inequality, and prob_prison is significant to crime_rate. Stepwise is method that can choose predictor variable with minimum error value. Use stepwise regression with backward direction to see significant variables to crime_rate.
model_back <- step(model_all, direction = 'backward', trace = 0)
summary(model_back)##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
## data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -444.70 -111.07 3.03 122.15 483.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6426.101 1194.611 -5.379 4.04e-06 ***
## percent_m 9.332 3.350 2.786 0.00828 **
## mean_education 18.012 5.275 3.414 0.00153 **
## police_exp60 10.265 1.552 6.613 8.26e-08 ***
## m_per1000f 2.234 1.360 1.642 0.10874
## unemploy_m24 -6.087 3.339 -1.823 0.07622 .
## unemploy_m39 18.735 7.248 2.585 0.01371 *
## inequality 6.133 1.396 4.394 8.63e-05 ***
## prob_prison -3796.032 1490.646 -2.547 0.01505 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared: 0.7888, Adjusted R-squared: 0.7444
## F-statistic: 17.74 on 8 and 38 DF, p-value: 1.159e-10
Compare model_all and model_back to see which has better performance
library(performance)## Warning: package 'performance' was built under R version 4.0.5
compare_performance(model_all, model_back)AIC value in model_back is lower than model_all, so we choose model_back for the best model to predict data crime_rate.
# Predict Using the Best Model
crime$predict <- model_back$fitted.values# check correlation
cor.test(crime$mean_education, crime$crime_rate)##
## Pearson's product-moment correlation
##
## data: crime$mean_education and crime$crime_rate
## t = 2.2882, df = 45, p-value = 0.02688
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03931263 0.55824793
## sample estimates:
## cor
## 0.3228349
cor.test(crime$state_pop, crime$crime_rate)##
## Pearson's product-moment correlation
##
## data: crime$state_pop and crime$crime_rate
## t = 2.4049, df = 45, p-value = 0.02035
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.05570577 0.56945429
## sample estimates:
## cor
## 0.3374741
cor.test(crime$gdp, crime$crime_rate)##
## Pearson's product-moment correlation
##
## data: crime$gdp and crime$crime_rate
## t = 3.2991, df = 45, p-value = 0.001902
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1765245 0.6465481
## sample estimates:
## cor
## 0.4413199
crime$residual <- crime$predict - crime$crime_rate
shapiro.test(crime$residual)##
## Shapiro-Wilk normality test
##
## data: crime$residual
## W = 0.98511, p-value = 0.8051
plot(crime$crime_rate, model_back$residuals)library(car)## Loading required package: carData
## Warning: package 'carData' was built under R version 4.0.5
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(model_back)## percent_m mean_education police_exp60 m_per1000f unemploy_m24
## 2.131963 4.189684 2.560496 1.932367 4.360038
## unemploy_m39 inequality prob_prison
## 4.508106 3.731074 1.381879
crime_test <- read.csv('crime.csv', stringsAsFactors = T)
names(crime_test) <- c("percent_m", "is_south", "mean_education", "police_exp60",
"police_exp59", "labour_participation", "m_per1000f",
"state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39",
"gdp", "inequality", "prob_prison", "time_prison", "crime_rate")crime_test$predict <- model_back$fitted.valueslibrary(MLmetrics)## Warning: package 'MLmetrics' was built under R version 4.0.5
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
RMSE(crime$predict, crime$crime_rate)## [1] 175.8304
RMSE(crime_test$predict, crime_test$crime_rate)## [1] 941.4898
From all analysis above, we can conclude that :
percent_m, mean_education, inequality, and prob_prison is significant to crime_rate.
RMSE value at 941.4898 is still considered large, means that our model is still inaccurate in predicting our target variable. If we want to get optimal result, do analysis with adding one or two more prediction, or get another data.
Model_back has linearity because p-value lower than 0.05.
Model_back’s normality distributed normal because p-value > 0.05
Model_back heteroscedasticity, there’s pattern in plot that can affected to standard error value of our model.
There’s no multicollinearity in our model. Multicollinearity happens when there’s correlation between predictor and vif value is greater than 10.