Background

Dataset crime is data that contains information about crime in some state with various factors. In this LBB, we’ll try to make a linear regression model to predict crime rate and see how different predictor variables affect it.

Data Wrangling

Import file .csv in the same folder with our project then assign it into an object named crime.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
crime <- read.csv("crime.csv", stringsAsFactors = F) %>% select(-X)
head(crime)

Change columns name, check the structure of dataset. Change data type correctly.

names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", 
                  "police_exp59", "labour_participation", "m_per1000f", 
                  "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", 
                  "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
glimpse(crime)
## Rows: 47
## Columns: 16
## $ percent_m            <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,~
## $ is_south             <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0~
## $ mean_education       <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10~
## $ police_exp60         <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,~
## $ police_exp59         <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, ~
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,~
## $ m_per1000f           <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102~
## $ state_pop            <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2~
## $ nonwhites_per1000    <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106~
## $ unemploy_m24         <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83~
## $ unemploy_m39         <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2~
## $ gdp                  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,~
## $ inequality           <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,~
## $ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,~
## $ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9~
## $ crime_rate           <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, ~
crime <- crime %>% mutate(is_south = as.factor(is_south))

glimpse(crime)
## Rows: 47
## Columns: 16
## $ percent_m            <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,~
## $ is_south             <fct> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0~
## $ mean_education       <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10~
## $ police_exp60         <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,~
## $ police_exp59         <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, ~
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,~
## $ m_per1000f           <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102~
## $ state_pop            <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2~
## $ nonwhites_per1000    <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106~
## $ unemploy_m24         <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83~
## $ unemploy_m39         <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2~
## $ gdp                  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,~
## $ inequality           <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,~
## $ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,~
## $ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9~
## $ crime_rate           <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, ~

columns in dataset crime - percent_m: percentage of males aged 14-24 - is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
- mean_education: mean years of schooling
- police_exp60: police expenditure in 1960
- police_exp59: police expenditure in 1959 - labour_participation: labour force participation rate
- m_per1000f: number of males per 1000 females
- state_pop: state population
- nonwhites_per1000: number of non-whites resident per 1000 people
- unemploy_m24: unemployment rate of urban males aged 14-24
- unemploy_m39: unemployment rate of urban males aged 35-39
- gdp: gross domestic product per head
- inequality: income inequality
- prob_prison: probability of imprisonment
- time_prison: avg time served in prisons
- crime_rate: crime rate in an unspecified category

Exploratory Data Analysis

Use GGally libray to see the correlation of between target variable and predictors in dataset crime

library(GGally)
## Warning: package 'GGally' was built under R version 4.0.5
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.5
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggcorr(crime, label = T, hjust = .9, layout.exp = 2, label_size = 3, cex = 3)
## Warning in ggcorr(crime, label = T, hjust = 0.9, layout.exp = 2, label_size =
## 3, : data in column(s) 'is_south' are not numeric and were ignored

According to ggcorr, highest correlation is in police_exp59 and police_exp60. We use crime_rate as target variable.

Modelling

Make model with all predictors.

model_all <- lm(crime_rate ~., data = crime)
summary(model_all)
## 
## Call:
## lm(formula = crime_rate ~ ., data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -395.74  -98.09   -6.69  112.99  512.67 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5984.2876  1628.3184  -3.675 0.000893 ***
## percent_m                8.7830     4.1714   2.106 0.043443 *  
## is_south1               -3.8035   148.7551  -0.026 0.979765    
## mean_education          18.8324     6.2088   3.033 0.004861 ** 
## police_exp60            19.2804    10.6110   1.817 0.078892 .  
## police_exp59           -10.9422    11.7478  -0.931 0.358830    
## labour_participation    -0.6638     1.4697  -0.452 0.654654    
## m_per1000f               1.7407     2.0354   0.855 0.398995    
## state_pop               -0.7330     1.2896  -0.568 0.573845    
## nonwhites_per1000        0.4204     0.6481   0.649 0.521279    
## unemploy_m24            -5.8271     4.2103  -1.384 0.176238    
## unemploy_m39            16.7800     8.2336   2.038 0.050161 .  
## gdp                      0.9617     1.0367   0.928 0.360754    
## inequality               7.0672     2.2717   3.111 0.003983 ** 
## prob_prison          -4855.2658  2272.3746  -2.137 0.040627 *  
## time_prison             -3.4790     7.1653  -0.486 0.630708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078 
## F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07

percent_m, mean_education, inequality, and prob_prison is significant to crime_rate. Stepwise is method that can choose predictor variable with minimum error value. Use stepwise regression with backward direction to see significant variables to crime_rate.

model_back <- step(model_all, direction = 'backward', trace = 0)
summary(model_back)
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.70 -111.07    3.03  122.15  483.30 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6426.101   1194.611  -5.379 4.04e-06 ***
## percent_m          9.332      3.350   2.786  0.00828 ** 
## mean_education    18.012      5.275   3.414  0.00153 ** 
## police_exp60      10.265      1.552   6.613 8.26e-08 ***
## m_per1000f         2.234      1.360   1.642  0.10874    
## unemploy_m24      -6.087      3.339  -1.823  0.07622 .  
## unemploy_m39      18.735      7.248   2.585  0.01371 *  
## inequality         6.133      1.396   4.394 8.63e-05 ***
## prob_prison    -3796.032   1490.646  -2.547  0.01505 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
## F-statistic: 17.74 on 8 and 38 DF,  p-value: 1.159e-10

Comparison of Model Performance

Compare model_all and model_back to see which has better performance

library(performance)
## Warning: package 'performance' was built under R version 4.0.5
compare_performance(model_all, model_back)

AIC value in model_back is lower than model_all, so we choose model_back for the best model to predict data crime_rate.

# Predict Using the Best Model

crime$predict <- model_back$fitted.values

Assumption

Linearity

# check correlation
cor.test(crime$mean_education, crime$crime_rate)
## 
##  Pearson's product-moment correlation
## 
## data:  crime$mean_education and crime$crime_rate
## t = 2.2882, df = 45, p-value = 0.02688
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03931263 0.55824793
## sample estimates:
##       cor 
## 0.3228349
cor.test(crime$state_pop, crime$crime_rate)
## 
##  Pearson's product-moment correlation
## 
## data:  crime$state_pop and crime$crime_rate
## t = 2.4049, df = 45, p-value = 0.02035
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05570577 0.56945429
## sample estimates:
##       cor 
## 0.3374741
cor.test(crime$gdp, crime$crime_rate)
## 
##  Pearson's product-moment correlation
## 
## data:  crime$gdp and crime$crime_rate
## t = 3.2991, df = 45, p-value = 0.001902
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1765245 0.6465481
## sample estimates:
##       cor 
## 0.4413199

Normality of Residual

crime$residual <- crime$predict - crime$crime_rate
shapiro.test(crime$residual)
## 
##  Shapiro-Wilk normality test
## 
## data:  crime$residual
## W = 0.98511, p-value = 0.8051

Homoscedasticity

plot(crime$crime_rate, model_back$residuals)

No Multicollinearity

library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.0.5
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(model_back)
##      percent_m mean_education   police_exp60     m_per1000f   unemploy_m24 
##       2.131963       4.189684       2.560496       1.932367       4.360038 
##   unemploy_m39     inequality    prob_prison 
##       4.508106       3.731074       1.381879

Test Data

crime_test <- read.csv('crime.csv', stringsAsFactors = T)

names(crime_test) <- c("percent_m", "is_south", "mean_education", "police_exp60", 
                  "police_exp59", "labour_participation", "m_per1000f", 
                  "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", 
                  "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")

Predict

crime_test$predict <- model_back$fitted.values

Check Error

library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 4.0.5
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
RMSE(crime$predict, crime$crime_rate)
## [1] 175.8304
RMSE(crime_test$predict, crime_test$crime_rate)
## [1] 941.4898

Conclusion

From all analysis above, we can conclude that :

percent_m, mean_education, inequality, and prob_prison is significant to crime_rate.

RMSE value at 941.4898 is still considered large, means that our model is still inaccurate in predicting our target variable. If we want to get optimal result, do analysis with adding one or two more prediction, or get another data.

Model_back has linearity because p-value lower than 0.05.

Model_back’s normality distributed normal because p-value > 0.05

Model_back heteroscedasticity, there’s pattern in plot that can affected to standard error value of our model.

There’s no multicollinearity in our model. Multicollinearity happens when there’s correlation between predictor and vif value is greater than 10.