Case Information

Crime analysis is a crucial aspect of law enforcement and public safety efforts. One commonly used approach in analyzing crime data is the application of a linear regression model. In this context, a linear regression model is employed to understand and quantify the relationship between various factors and the occurrence of crime.

By utilizing historical crime data and relevant predictor variables, such as socio-economic indicators, population density, and geographic attributes, a linear regression model can help identify patterns and provide insights into the factors that contribute to crime rates. The model aims to establish a linear relationship between the predictor variables and the target variable, which in this case is the crime rate.
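
Formally, the model takes the form

crime_rate = β0 + β1·x1 + β2·x2 + … + βk·xk + ε

where x1, …, xk are the predictor variables, β0, …, βk are the regression coefficients to be estimated, and ε is the error term.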

Through the estimation of regression coefficients, the model can determine the strength and direction of the relationship between each predictor variable and the crime rate. These coefficients allow us to interpret the impact of each predictor on the crime rate, controlling for other factors. Additionally, the model can be used to predict future crime rates based on changes in the predictor variables.
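
The analysis below relies on several R packages that are assumed to be loaded beforehand; a minimal setup chunk might look like this:

# packages assumed by the code in this write-up
library(dplyr)        # glimpse(), %>%, select()
library(tidyr)        # gather()
library(ggplot2)      # ggplot(), geom_boxplot()
library(GGally)       # ggcorr()
library(performance)  # compare_performance()
library(lmtest)       # bptest()
library(car)          # vif()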

Reading the Crime Data

crime <- read.csv("data_input/crime.csv", stringsAsFactors = F)
head(crime)

Exploratory Data Analysis

glimpse(crime)
#> Rows: 47
#> Columns: 17
#> $ X    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19…
#> $ M    <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140, 124, 134, 128, …
#> $ So   <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,…
#> $ Ed   <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 105, 108, 113, 117…
#> $ Po1  <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121, 75, 67, 62, 57,…
#> $ Po2  <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, 71, 60, 61, 53, …
#> $ LF   <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632, 580, 595, 624, …
#> $ M.F  <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 1029, 966, 972, 972…
#> $ Pop  <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 28, 22, 30, 33, 1…
#> $ NW   <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106, 59, 10, 46, 72…
#> $ U1   <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83, 77, 77, 92, 11…
#> $ U2   <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 25, 27, 43, 47, 3…
#> $ GDP  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526, 657, 580, 507, …
#> $ Ineq <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174, 170, 172, 206, …
#> $ Prob <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399, 0.034201, 0.042…
#> $ Time <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9995, 20.6993, 24…
#> $ y    <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674, 849, …

Column descriptions (using the descriptive names assigned later in the Data Wrangling section):

  • percent_m: percentage of males aged 14-24
  • is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
  • mean_education: mean years of schooling
  • police_exp60: police expenditure in 1960
  • police_exp59: police expenditure in 1959
  • labour_participation: labour force participation rate
  • m_per1000f: number of males per 1000 females
  • state_pop: state population
  • nonwhites_per1000: number of non-white residents per 1000 people
  • unemploy_m24: unemployment rate of urban males aged 14-24
  • unemploy_m39: unemployment rate of urban males aged 35-39
  • gdp: gross domestic product per head
  • inequality: income inequality
  • prob_prison: probability of imprisonment
  • time_prison: average time served in prison
  • crime_rate: crime rate in an unspecified category

Checking Missing Values (NA)

We inspect the NA values in each column to understand the data and determine what actions, if any, need to be taken.

colSums(is.na(crime))
#>    X    M   So   Ed  Po1  Po2   LF  M.F  Pop   NW   U1   U2  GDP Ineq Prob Time 
#>    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
#>    y 
#>    0

No column contains missing values, so no imputation or row removal is needed.

Correlation Matrix

In this case, we want to predict the crime_rate value, so we choose crime_rate as the target variable. To select the predictors, we need to examine the correlation between the predictors and the target variable and filter them using a correlation matrix.

ggcorr(crime, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

Before examining the correlation matrix further, it helps to rename the columns for better readability. After renaming, we can more easily identify which columns are positively or negatively correlated in the matrix.
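
As a numeric complement to the heatmap, the predictors can also be ranked by their absolute correlation with the target; a quick sketch (using the original column names, before renaming):

# correlation of every predictor with the target y, sorted by strength
cors <- cor(crime[, setdiff(names(crime), c("X", "y"))], crime$y)
cors[order(abs(cors), decreasing = TRUE), , drop = FALSE]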

Data Wrangling

Removing Index Column

crime <- crime %>%
  select(-X)
head(crime)

Renaming Columns

# rename columns to be more descriptive
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
head(crime)

To improve the model’s performance, we can combine the columns police_exp59 and police_exp60, as they exhibit a strong correlation.

crime_exp <- crime
# compute the mean of police_exp60 and police_exp59
crime_exp$police_exp <- rowMeans(crime_exp[, c("police_exp60", "police_exp59")], na.rm = TRUE)
# drop the original police_exp60 and police_exp59 variables
crime_exp <- select(crime_exp, -police_exp60, -police_exp59)
# inspect the result
head(crime_exp)
ggcorr(crime_exp, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

dim(crime_exp)
#> [1] 47 15

Modeling

Simple Linear Regression

model_single <- lm(data = crime_exp,
                 formula = crime_rate ~ police_exp)
summary(model_single)
#> 
#> Call:
#> lm(formula = crime_rate ~ police_exp, data = crime_exp)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -581.07 -151.94   20.33  147.24  580.59 
#> 
#> Coefficients:
#>             Estimate Std. Error t value    Pr(>|t|)    
#> (Intercept)  152.067    128.533   1.183       0.243    
#> police_exp     9.115      1.471   6.197 0.000000159 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 287.2 on 45 degrees of freedom
#> Multiple R-squared:  0.4605, Adjusted R-squared:  0.4485 
#> F-statistic:  38.4 on 1 and 45 DF,  p-value: 0.0000001591
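
Reading the coefficients, the fitted line is crime_rate ≈ 152.07 + 9.12 × police_exp: each additional unit of police expenditure is associated with roughly 9 extra units of crime_rate. As a quick sanity check with predict() (police_exp = 100 is a hypothetical value chosen for illustration):

# predicted crime_rate at police_exp = 100
predict(model_single, newdata = data.frame(police_exp = 100))
# equivalently, by hand from the summary coefficients:
152.067 + 9.115 * 100   # ≈ 1063.57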

Multiple Linear Regression

model_all <- lm(data = crime_exp,
                 formula = crime_rate ~ .)
summary(model_all)
#> 
#> Call:
#> lm(formula = crime_rate ~ ., data = crime_exp)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -459.84 -122.87   11.49  117.58  451.50 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          -6606.4737  1583.4238  -4.172 0.000215 ***
#> percent_m                9.0258     4.2226   2.137 0.040298 *  
#> is_south                 6.1504   150.5377   0.041 0.967664    
#> mean_education          17.3268     6.1901   2.799 0.008612 ** 
#> labour_participation    -0.1626     1.4416  -0.113 0.910921    
#> m_per1000f               1.9532     2.0562   0.950 0.349278    
#> state_pop               -0.7269     1.3066  -0.556 0.581830    
#> nonwhites_per1000        0.2059     0.6369   0.323 0.748596    
#> unemploy_m24            -5.4736     4.2578  -1.286 0.207824    
#> unemploy_m39            17.3475     8.3316   2.082 0.045411 *  
#> gdp                      0.9446     1.0503   0.899 0.375170    
#> inequality               7.3384     2.2928   3.201 0.003091 ** 
#> prob_prison          -4079.3771  2228.7052  -1.830 0.076521 .  
#> time_prison             -0.2575     6.8520  -0.038 0.970261    
#> police_exp               9.8949     2.5688   3.852 0.000530 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 211.8 on 32 degrees of freedom
#> Multiple R-squared:  0.7913, Adjusted R-squared:    0.7 
#> F-statistic: 8.668 on 14 and 32 DF,  p-value: 0.0000002605

Feature Selection

model_backward <- step(object = model_all, # reference model
                       direction = "backward", trace = 0)
summary(model_backward)
#> 
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f + 
#>     unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
#>     police_exp, data = crime_exp)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -475.87 -104.70    0.77  129.36  470.19 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value    Pr(>|t|)    
#> (Intercept)    -6541.179   1210.687  -5.403 0.000003750 ***
#> percent_m          9.258      3.392   2.729     0.00956 ** 
#> mean_education    17.775      5.350   3.323     0.00198 ** 
#> m_per1000f         2.383      1.375   1.733     0.09114 .  
#> unemploy_m24      -6.294      3.375  -1.865     0.06997 .  
#> unemploy_m39      19.257      7.322   2.630     0.01226 *  
#> inequality         6.180      1.418   4.358 0.000096343 ***
#> prob_prison    -3858.651   1508.471  -2.558     0.01464 *  
#> police_exp        10.556      1.635   6.456 0.000000135 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 198 on 38 degrees of freedom
#> Multiple R-squared:  0.7834, Adjusted R-squared:  0.7378 
#> F-statistic: 17.18 on 8 and 38 DF,  p-value: 0.0000000001842

comparison <- compare_performance(model_single, model_all, model_backward)

as.data.frame(comparison)
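
Behind the scenes, step() repeatedly drops (here, with direction = "backward") the predictor whose removal lowers the AIC the most, stopping when no removal improves it. The AIC gain over the full model can be confirmed directly:

# lower AIC means a better fit/complexity trade-off
AIC(model_all, model_backward)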

Model Tuning

Handling Outliers

We will tune the model’s performance by handling outlier values in the following predictors:

  • police_exp
  • prob_prison
  • percent_m

# select the variables to examine
vars <- c("police_exp", "prob_prison", "percent_m")

# subset the data to the selected variables
crime_subset <- crime_exp[, vars]

# reshape the variables into long format
crime_long <- tidyr::gather(crime_subset, key = "Variable", value = "Value")

# draw side-by-side boxplots with ggplot2
ggplot(crime_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  labs(x = "Variable", y = "Value") +
  theme_minimal()
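
The next three subsections apply the same 1.5 × IQR rule to each predictor in turn. As a sketch, the rule could be wrapped in a small helper (remove_iqr_outliers is a hypothetical name, not used in the code below):

# hypothetical helper: keep rows whose value for `col` lies within
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
remove_iqr_outliers <- function(df, col) {
  q   <- quantile(df[[col]], c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  df[df[[col]] >= q[1] - 1.5 * iqr & df[[col]] <= q[2] + 1.5 * iqr, ]
}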

Removing Outliers from police_exp

# compute the quartiles and IQR of police_exp
Q1 <- quantile(crime_exp$police_exp, 0.25, na.rm = TRUE)
Q3 <- quantile(crime_exp$police_exp, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1

# identify the lower and upper bounds for outliers
lower <- Q1 - 1.5 * IQR
upper <- Q3 + 1.5 * IQR

# remove outliers from police_exp
crime_clean <- subset(crime_exp, police_exp >= lower & police_exp <= upper)

boxplot(crime_clean$police_exp)

exp_lm <- lm(crime_rate ~ police_exp, crime_exp)     # with outliers
clean_lm <- lm(crime_rate ~ police_exp, crime_clean) # without outliers

# plotting the linear regression model
plot(crime_exp$police_exp, crime_exp$crime_rate)
abline(exp_lm$coefficients[1],exp_lm$coefficients[2], col = "blue")
abline(clean_lm$coefficients[1],clean_lm$coefficients[2], col = "red") # without outlier

The blue line shows the model fit before removing the outliers, while the red line shows the fit after removal. From the plot above, removing the outliers appears to improve the model’s fit.
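
The two fits can also be compared numerically; a quick sketch:

# slope and intercept before vs. after outlier removal
rbind(with_outliers = coef(exp_lm), without_outliers = coef(clean_lm))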

Removing Outliers from prob_prison

# compute the quartiles and IQR of prob_prison
Q1 <- quantile(crime_clean$prob_prison, 0.25, na.rm = TRUE)
Q3 <- quantile(crime_clean$prob_prison, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1

# identify the lower and upper bounds for outliers
lower <- Q1 - 1.5 * IQR
upper <- Q3 + 1.5 * IQR

# remove outliers from prob_prison
crime_clean <- subset(crime_clean, prob_prison >= lower & prob_prison <= upper)
boxplot(crime_clean$prob_prison)
boxplot(crime_clean$prob_prison)

hist(crime_clean$prob_prison)

Removing Outliers from percent_m

# compute the quartiles and IQR of percent_m
Q1 <- quantile(crime_clean$percent_m, 0.25, na.rm = TRUE)
Q3 <- quantile(crime_clean$percent_m, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1

# identify the lower and upper bounds for outliers
lower <- Q1 - 1.5 * IQR
upper <- Q3 + 1.5 * IQR

# remove outliers from percent_m
crime_clean <- subset(crime_clean, percent_m >= lower & percent_m <= upper)
boxplot(crime_clean$percent_m)

dim(crime)
#> [1] 47 16
dim(crime_clean)
#> [1] 42 15

During outlier removal, a total of five rows identified as outliers were deleted.
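
As a quick check, the difference in row counts matches the dim() output above:

# rows removed by the three IQR filters: 47 - 42 = 5
nrow(crime_exp) - nrow(crime_clean)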

Re-Modeling (Without Outliers)

model_all_clean <- lm(data = crime_clean,
                 formula = crime_rate ~ .)

summary(model_all_clean)
#> 
#> Call:
#> lm(formula = crime_rate ~ ., data = crime_clean)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -413.34  -66.73    8.90   91.30  334.45 
#> 
#> Coefficients:
#>                         Estimate  Std. Error t value Pr(>|t|)    
#> (Intercept)          -6462.60028  1459.25896  -4.429 0.000141 ***
#> percent_m                7.47475     4.86819   1.535 0.136316    
#> is_south                21.41090   156.88322   0.136 0.892457    
#> mean_education          15.28698     6.52794   2.342 0.026820 *  
#> labour_participation     1.06312     1.96991   0.540 0.593842    
#> m_per1000f               1.86527     2.10845   0.885 0.384149    
#> state_pop               -0.65647     1.25915  -0.521 0.606364    
#> nonwhites_per1000        0.96947     0.67739   1.431 0.163854    
#> unemploy_m24            -2.11709     4.17261  -0.507 0.616008    
#> unemploy_m39            10.15330     8.59087   1.182 0.247562    
#> gdp                     -0.07671     1.03462  -0.074 0.941444    
#> inequality               6.57904     2.14183   3.072 0.004816 ** 
#> prob_prison          -6427.68273  3357.53321  -1.914 0.066221 .  
#> time_prison              5.50148     7.77670   0.707 0.485360    
#> police_exp              12.06850     2.71282   4.449 0.000134 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 188.3 on 27 degrees of freedom
#> Multiple R-squared:  0.8526, Adjusted R-squared:  0.7762 
#> F-statistic: 11.16 on 14 and 27 DF,  p-value: 0.00000007988

Step-Wise Selection

model_backward_clean <- step(object = model_all_clean, # reference (full) model
                       direction = "both", trace = 0)  # stepwise in both directions
summary(model_backward_clean)
#> 
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f + 
#>     nonwhites_per1000 + inequality + prob_prison + time_prison + 
#>     police_exp, data = crime_clean)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -372.81  -90.20    7.22  112.50  293.89 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value     Pr(>|t|)    
#> (Intercept)       -6405.2103  1228.6816  -5.213 0.0000098398 ***
#> percent_m             4.5489     3.3539   1.356      0.18420    
#> mean_education       13.8048     4.4838   3.079      0.00417 ** 
#> m_per1000f            2.7545     1.2071   2.282      0.02907 *  
#> nonwhites_per1000     0.9748     0.5640   1.728      0.09329 .  
#> inequality            7.0307     1.4472   4.858 0.0000279991 ***
#> prob_prison       -5128.6279  2887.7465  -1.776      0.08496 .  
#> time_prison           8.7159     6.1719   1.412      0.16725    
#> police_exp           13.0542     1.8503   7.055 0.0000000449 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 176.1 on 33 degrees of freedom
#> Multiple R-squared:  0.8423, Adjusted R-squared:  0.8041 
#> F-statistic: 22.04 on 8 and 33 DF,  p-value: 0.00000000003757

Model Comparison

comparison <- compare_performance(model_all, model_all_clean, model_backward, model_backward_clean)

# subset the Name, AIC, adjusted R-squared, and RMSE columns
as.data.frame(comparison[, c("Name", "AIC", "R2_adjusted", "RMSE")])

Based on the AIC, adjusted R-squared, and root mean square error (RMSE) values above, we can compare the predictive quality of the evaluated models.

  • Model_all has an AIC of 650.7548, adjusted R-squared of 0.7000357, and RMSE of 174.7855.
  • Model_all_clean has an AIC of 572.6177, adjusted R-squared of 0.7761915, and RMSE of 150.9557.
  • Model_backward has an AIC of 640.5084, adjusted R-squared of 0.7377957, and RMSE of 178.0768.
  • Model_backward_clean has an AIC of 563.4474, adjusted R-squared of 0.8041214, and RMSE of 156.1276.

In terms of adjusted R-squared, model_backward_clean has a higher value (0.8041214) compared to model_all_clean (0.7761915). Adjusted R-squared indicates how well the model can explain the variation in the data, considering the number of predictor variables used. In this case, model_backward_clean demonstrates better explanatory power over the observed variation in the data.

Therefore, in this context, if the focus is on lower prediction errors, then model_all_clean may be a better choice based on its lower RMSE. However, if explaining the variation in the data is the primary concern, then model_backward_clean with a higher adjusted R-squared may be considered a better option.

Hence, in this context, model_backward_clean can be regarded as the best model: it explains the variation in the data well (highest adjusted R-squared) while keeping prediction errors competitive (its RMSE is only 5.1719 higher than that of model_all_clean).

The choice of the best model depends on the objectives and preferences within the specific context. It is recommended to consider both evaluation metrics as well as other relevant factors in the model selection process.

Prediction

crime_clean$pred <- predict(object = model_backward_clean,
                                    newdata = crime_clean)

# calculating RMSE
sqrt(mean((crime_clean$pred - crime_clean$crime_rate)^2))
#> [1] 156.1276
range(crime_clean$crime_rate)
#> [1]  342 1993

The obtained RMSE (156.1276) reflects the average prediction error of model_backward_clean on the crime_clean data. It should be judged against the range of crime_rate values given by range(crime_clean$crime_rate), namely 342 to 1993.

In this context, an RMSE of 156.1276 indicates an acceptable average prediction error level relative to the range of crime_rate values.
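
One way to put the RMSE on a comparable scale is to divide it by the spread of the target (values taken from the output above):

# RMSE as a share of the observed crime_rate range: ~9.5%
156.1276 / (1993 - 342)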

Assumptions of Linear Regression

Linear regression relies on several key assumptions for its validity and interpretation. These assumptions help ensure the reliability of the model’s estimates and the accuracy of its predictions.

The following are the main assumptions of linear regression:

  1. Linearity
  2. Normality of Residuals
  3. Homoscedasticity of Residuals
  4. No Multicollinearity

Linearity

The relationship between the dependent variable and the independent variables is assumed to be linear. This means that the change in the dependent variable is proportional to the change in the independent variables.

plot(model_backward_clean, which = 1)

Conclusion: The backward model can be considered to have a linear relationship between the predictors and the target variable.

Normality of Residuals

  1. Visualization of the residual histogram

# histogram of the residuals
hist(model_backward_clean$residuals)

  2. Statistical test using shapiro.test()

Shapiro-Wilk hypothesis test:

  • H0: Errors are normally distributed.

  • H1: Errors are NOT normally distributed.

  • Expected condition: H0

# shapiro test dari residual
shapiro.test(model_backward_clean$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_backward_clean$residuals
#> W = 0.97923, p-value = 0.6317

Conclusion: p-value (0.6317) > alpha (0.05), so we fail to reject H0; the normality assumption is met according to the Shapiro-Wilk test.

Homoscedasticity of Residuals

The variance of the errors (residuals) is constant across all levels of the independent variables. This assumption implies that the spread of the residuals should be consistent throughout the range of predicted values.

# scatter plot
plot(x = model_backward_clean$fitted.values, y = model_backward_clean$residuals)
abline(h = 0, col = "red")

Statistical test using bptest() from the lmtest package. Breusch-Pagan hypothesis test:

  • H0: Errors have constant spread or homoscedasticity
  • H1: Errors do NOT have constant spread or heteroscedasticity
  • Expected condition: H0

bptest(model_backward_clean)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_backward_clean
#> BP = 5.3074, df = 8, p-value = 0.7243

Conclusion: p-value (0.7243) > alpha (0.05), so we fail to reject H0: the errors have constant spread (homoscedasticity).

No Multicollinearity

Multicollinearity is the condition where there is a strong correlation among predictors in a model. This is undesirable as it indicates redundant predictors in the model, and ideally, only one variable should be selected from those highly correlated variables. The goal is to avoid multicollinearity.

The VIF (Variance Inflation Factor) test can be performed using the vif() function from the car package. The interpretation of the VIF values is as follows:

  • VIF > 10: indicates the presence of multicollinearity in the model.
  • VIF < 10: indicates no multicollinearity in the model.

vif(model_backward_clean)
#>         percent_m    mean_education        m_per1000f nonwhites_per1000 
#>          1.818883          3.299894          1.734320          3.076223 
#>        inequality       prob_prison       time_prison        police_exp 
#>          4.038303          3.755321          2.202603          3.065742

Remark: none of the predictors in model_backward_clean show signs of multicollinearity, as all VIF values are below 10.

Conclusion

  1. Adjusted R-squared: The adjusted R-squared value of 0.8041214 indicates that approximately 80.41% of the variation in the response variable (crime_rate) can be explained by the predictor variables in the model_backward_clean. This suggests that the model provides a good explanation of the variation in the data.

  2. RMSE: The RMSE of 156.1276 indicates that, on average, the predictions of model_backward_clean deviate from the actual crime_rate values by approximately 156 units. A lower RMSE value indicates better prediction performance.

  3. Comparison with actual values: To put the prediction errors in perspective, compare the RMSE with the range of actual values of the response variable (crime_rate). In this case, with values ranging from 342 to 1993, an RMSE of 156.1276 is a noticeable, though not extreme, share of that range, so the model’s prediction errors are moderate relative to the actual variation in the data.

  4. To enhance this model, we have several options. First, we could augment the dataset with additional observations, since 47 may not be sufficient. We could also explore alternative predictor variables that exhibit stronger linear relationships with the target. Finally, predictions could be improved by trying other machine learning approaches suited to predicting crime rates; linear regression, while highly interpretable, may not be as strong a predictor as algorithms such as Random Forest.