Crime analysis is a crucial aspect of law enforcement and public safety efforts. One commonly used approach in analyzing crime data is the application of a linear regression model. In this context, a linear regression model is employed to understand and quantify the relationship between various factors and the occurrence of crime.
By utilizing historical crime data and relevant predictor variables, such as socio-economic indicators, population density, and geographic attributes, a linear regression model can help identify patterns and provide insights into the factors that contribute to crime rates. The model aims to establish a linear relationship between the predictor variables and the target variable, which in this case is the crime rate.
Through the estimation of regression coefficients, the model can determine the strength and direction of the relationship between each predictor variable and the crime rate. These coefficients allow us to interpret the impact of each predictor on the crime rate while controlling for the other factors. Additionally, the model can be used to predict future crime rates based on changes in the predictor variables.
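The output below comes from inspecting the raw data. As a sketch of how it might be produced (the file name crime_data.csv is an assumption; glimpse() comes from the dplyr package):
library(dplyr)

# Load the raw data (the file name here is assumed) and inspect its structure:
# row/column counts, column types, and the first few values of each column
crime <- read.csv("crime_data.csv")
glimpse(crime)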
#> Rows: 47
#> Columns: 17
#> $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19…
#> $ M <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140, 124, 134, 128, …
#> $ So <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,…
#> $ Ed <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 105, 108, 113, 117…
#> $ Po1 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121, 75, 67, 62, 57,…
#> $ Po2 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, 71, 60, 61, 53, …
#> $ LF <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632, 580, 595, 624, …
#> $ M.F <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 1029, 966, 972, 972…
#> $ Pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 28, 22, 30, 33, 1…
#> $ NW <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106, 59, 10, 46, 72…
#> $ U1 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83, 77, 77, 92, 11…
#> $ U2 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 25, 27, 43, 47, 3…
#> $ GDP <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526, 657, 580, 507, …
#> $ Ineq <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174, 170, 172, 206, …
#> $ Prob <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399, 0.034201, 0.042…
#> $ Time <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9995, 20.6993, 24…
#> $ y <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674, 849, …
Column descriptions:
percent_m: percentage of males aged 14-24
is_south: whether it is in a Southern state (1 for Yes, 0 for No)
mean_education: mean years of schooling
police_exp60: police expenditure in 1960
police_exp59: police expenditure in 1959
labour_participation: labour force participation rate
m_per1000f: number of males per 1000 females
state_pop: state population
nonwhites_per1000: number of non-white residents per 1000 people
unemploy_m24: unemployment rate of urban males aged 14-24
unemploy_m39: unemployment rate of urban males aged 35-39
gdp: gross domestic product per head
inequality: income inequality
prob_prison: probability of imprisonment
time_prison: average time served in prisons
crime_rate: crime rate in an unspecified category
We inspect the NA values in each column so that we can understand the data we have and determine what actions need to be taken.
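A sketch of the missing-value check that produces the output below:
# Count NA values in every column
colSums(is.na(crime))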
#> X M So Ed Po1 Po2 LF M.F Pop NW U1 U2 GDP Ineq Prob Time
#> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> y
#> 0
In this case, we want to predict the crime_rate value, so we choose crime_rate as the target variable. To select the predictors, we need to examine the correlation between the predictors and the target variable and filter them using a correlation matrix.
Before looking at the correlation matrix, it is important to rename the columns for better readability and understanding. After that, we can identify which columns are positively or negatively correlated in the matrix.
# Rename the columns to be more descriptive
crime <- crime[, -1] # drop the row-index column X (assumed here) so the 16 names below line up
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
head(crime)

To improve the model's performance, we can combine the columns police_exp59 and police_exp60, as they exhibit a strong correlation.
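That strong correlation, and the correlations of the other predictors with crime_rate, can be checked with a correlation matrix. A minimal sketch, assuming the GGally package is available:
library(GGally)

# Correlation matrix of all columns; strongly correlated pairs such as
# police_exp59 and police_exp60 stand out with values near 1
ggcorr(crime, label = TRUE, label_size = 3, hjust = 0.9)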
crime_exp <- crime
# Average police_exp60 and police_exp59 into a single predictor
crime_exp$police_exp <- rowMeans(crime_exp[, c("police_exp60", "police_exp59")], na.rm = TRUE)
# Drop the original police_exp60 and police_exp59 columns
crime_exp <- select(crime_exp, -police_exp60, -police_exp59)
# Inspect the result
head(crime_exp)
dim(crime_exp)
#> [1] 47 15
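The summary below belongs to a single-predictor baseline model. A sketch of how it was presumably fit (the name model_single matches the compare_performance() call later in the section):
# Baseline model: crime_rate explained by the combined police expenditure
model_single <- lm(crime_rate ~ police_exp, data = crime_exp)
summary(model_single)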
#>
#> Call:
#> lm(formula = crime_rate ~ police_exp, data = crime_exp)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -581.07 -151.94 20.33 147.24 580.59
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 152.067 128.533 1.183 0.243
#> police_exp 9.115 1.471 6.197 0.000000159 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 287.2 on 45 degrees of freedom
#> Multiple R-squared: 0.4605, Adjusted R-squared: 0.4485
#> F-statistic: 38.4 on 1 and 45 DF, p-value: 0.0000001591
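The next summary is the full model with all remaining predictors, presumably fit as:
# Full model: crime_rate against every other column of crime_exp
model_all <- lm(crime_rate ~ ., data = crime_exp)
summary(model_all)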
#>
#> Call:
#> lm(formula = crime_rate ~ ., data = crime_exp)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -459.84 -122.87 11.49 117.58 451.50
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6606.4737 1583.4238 -4.172 0.000215 ***
#> percent_m 9.0258 4.2226 2.137 0.040298 *
#> is_south 6.1504 150.5377 0.041 0.967664
#> mean_education 17.3268 6.1901 2.799 0.008612 **
#> labour_participation -0.1626 1.4416 -0.113 0.910921
#> m_per1000f 1.9532 2.0562 0.950 0.349278
#> state_pop -0.7269 1.3066 -0.556 0.581830
#> nonwhites_per1000 0.2059 0.6369 0.323 0.748596
#> unemploy_m24 -5.4736 4.2578 -1.286 0.207824
#> unemploy_m39 17.3475 8.3316 2.082 0.045411 *
#> gdp 0.9446 1.0503 0.899 0.375170
#> inequality 7.3384 2.2928 3.201 0.003091 **
#> prob_prison -4079.3771 2228.7052 -1.830 0.076521 .
#> time_prison -0.2575 6.8520 -0.038 0.970261
#> police_exp 9.8949 2.5688 3.852 0.000530 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 211.8 on 32 degrees of freedom
#> Multiple R-squared: 0.7913, Adjusted R-squared: 0.7
#> F-statistic: 8.668 on 14 and 32 DF, p-value: 0.0000002605
model_backward <- step(object = model_all, # reference model
                       direction = "backward", trace = 0)
summary(model_backward)
#>
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f +
#> unemploy_m24 + unemploy_m39 + inequality + prob_prison +
#> police_exp, data = crime_exp)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -475.87 -104.70 0.77 129.36 470.19
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6541.179 1210.687 -5.403 0.000003750 ***
#> percent_m 9.258 3.392 2.729 0.00956 **
#> mean_education 17.775 5.350 3.323 0.00198 **
#> m_per1000f 2.383 1.375 1.733 0.09114 .
#> unemploy_m24 -6.294 3.375 -1.865 0.06997 .
#> unemploy_m39 19.257 7.322 2.630 0.01226 *
#> inequality 6.180 1.418 4.358 0.000096343 ***
#> prob_prison -3858.651 1508.471 -2.558 0.01464 *
#> police_exp 10.556 1.635 6.456 0.000000135 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 198 on 38 degrees of freedom
#> Multiple R-squared: 0.7834, Adjusted R-squared: 0.7378
#> F-statistic: 17.18 on 8 and 38 DF, p-value: 0.0000000001842
comparison <- compare_performance(model_single, model_all, model_backward)
as.data.frame(comparison)

We will tune the model's performance by handling outlier values in several predictors. The predictors are:
# Select the variables to inspect for outliers
vars <- c("police_exp", "prob_prison", "percent_m")
# Subset the data to the selected variables
crime_subset <- crime_exp[, vars]
# Reshape the selected variables into long format
crime_long <- tidyr::gather(crime_subset, key = "Variable", value = "Value")
# Draw boxplots with ggplot2
ggplot(crime_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  labs(x = "Variable", y = "Value") +
  theme_minimal()
# Compute the quartiles and IQR of police_exp
Q1 <- quantile(crime_exp$police_exp, 0.25, na.rm = TRUE)
Q3 <- quantile(crime_exp$police_exp, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
# Identify the lower and upper bounds for outliers
lower <- Q1 - 1.5 * IQR
upper <- Q3 + 1.5 * IQR
# Remove outliers from police_exp
crime_clean <- subset(crime_exp, police_exp >= lower & police_exp <= upper)
boxplot(crime_clean$police_exp)
exp_lm <- lm(crime_rate ~ police_exp, crime_exp)
clean_lm <- lm(crime_rate ~ police_exp, crime_clean) # without outlier
# plotting the linear regression model
plot(crime_exp$police_exp, crime_exp$crime_rate)
abline(exp_lm$coefficients[1],exp_lm$coefficients[2], col = "blue")
abline(clean_lm$coefficients[1], clean_lm$coefficients[2], col = "red") # without outliers

The blue line represents the model fit before removing the outliers, while the red line represents the fit after removing them. Comparing the two lines shows that removing the outliers improves the model's fit.
# Compute the quartiles and IQR of prob_prison
Q1 <- quantile(crime_clean$prob_prison, 0.25, na.rm = TRUE)
Q3 <- quantile(crime_clean$prob_prison, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
# Identify the lower and upper bounds for outliers
lower <- Q1 - 1.5 * IQR
upper <- Q3 + 1.5 * IQR
# Remove outliers from prob_prison
crime_clean <- subset(crime_clean, prob_prison >= lower & prob_prison <= upper)
boxplot(crime_clean$prob_prison)
# Compute the quartiles and IQR of percent_m
Q1 <- quantile(crime_clean$percent_m , 0.25, na.rm = TRUE)
Q3 <- quantile(crime_clean$percent_m , 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
# Identify the lower and upper bounds for outliers
lower <- Q1 - 1.5 * IQR
upper <- Q3 + 1.5 * IQR
# Remove outliers from percent_m
crime_clean <- subset(crime_clean, percent_m >= lower & percent_m <= upper)
boxplot(crime_clean$percent_m)
#> [1] 47 16
#> [1] 42 15
During the outlier-removal process, a total of 5 rows identified as outliers were dropped (from 47 rows down to 42).
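The full model is then refit on the cleaned data, presumably as:
# Refit the full model on the outlier-free data
model_all_clean <- lm(crime_rate ~ ., data = crime_clean)
summary(model_all_clean)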
#>
#> Call:
#> lm(formula = crime_rate ~ ., data = crime_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -413.34 -66.73 8.90 91.30 334.45
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6462.60028 1459.25896 -4.429 0.000141 ***
#> percent_m 7.47475 4.86819 1.535 0.136316
#> is_south 21.41090 156.88322 0.136 0.892457
#> mean_education 15.28698 6.52794 2.342 0.026820 *
#> labour_participation 1.06312 1.96991 0.540 0.593842
#> m_per1000f 1.86527 2.10845 0.885 0.384149
#> state_pop -0.65647 1.25915 -0.521 0.606364
#> nonwhites_per1000 0.96947 0.67739 1.431 0.163854
#> unemploy_m24 -2.11709 4.17261 -0.507 0.616008
#> unemploy_m39 10.15330 8.59087 1.182 0.247562
#> gdp -0.07671 1.03462 -0.074 0.941444
#> inequality 6.57904 2.14183 3.072 0.004816 **
#> prob_prison -6427.68273 3357.53321 -1.914 0.066221 .
#> time_prison 5.50148 7.77670 0.707 0.485360
#> police_exp 12.06850 2.71282 4.449 0.000134 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 188.3 on 27 degrees of freedom
#> Multiple R-squared: 0.8526, Adjusted R-squared: 0.7762
#> F-statistic: 11.16 on 14 and 27 DF, p-value: 0.00000007988
model_backward_clean <- step(object = model_all_clean, # reference model
                             direction = "both", trace = 0)
summary(model_backward_clean)
#>
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f +
#> nonwhites_per1000 + inequality + prob_prison + time_prison +
#> police_exp, data = crime_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -372.81 -90.20 7.22 112.50 293.89
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6405.2103 1228.6816 -5.213 0.0000098398 ***
#> percent_m 4.5489 3.3539 1.356 0.18420
#> mean_education 13.8048 4.4838 3.079 0.00417 **
#> m_per1000f 2.7545 1.2071 2.282 0.02907 *
#> nonwhites_per1000 0.9748 0.5640 1.728 0.09329 .
#> inequality 7.0307 1.4472 4.858 0.0000279991 ***
#> prob_prison -5128.6279 2887.7465 -1.776 0.08496 .
#> time_prison 8.7159 6.1719 1.412 0.16725
#> police_exp 13.0542 1.8503 7.055 0.0000000449 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 176.1 on 33 degrees of freedom
#> Multiple R-squared: 0.8423, Adjusted R-squared: 0.8041
#> F-statistic: 22.04 on 8 and 33 DF, p-value: 0.00000000003757
comparison <- compare_performance(model_all, model_all_clean, model_backward, model_backward_clean)
# Subset the AIC, adjusted R-squared, and RMSE columns
as.data.frame(comparison[, c("Name", "AIC", "R2_adjusted", "RMSE")])

Based on the AIC, adjusted R-squared, and root mean square error (RMSE) values, we can compare the predictive quality of the evaluated models.
In terms of adjusted R-squared, model_backward_clean has a higher value (0.8041214) compared to model_all_clean (0.7761915). Adjusted R-squared indicates how well the model can explain the variation in the data, considering the number of predictor variables used. In this case, model_backward_clean demonstrates better explanatory power over the observed variation in the data.
In terms of RMSE, model_all_clean has the lower prediction error. So if the focus is purely on minimizing prediction errors, model_all_clean may be the better choice; if explaining the variation in the data is the primary concern, model_backward_clean with its higher adjusted R-squared is preferable.

On balance, model_backward_clean can be regarded as the best model in this context: it explains the variation in the data well (higher adjusted R-squared) and uses fewer predictors, while its RMSE is only 5.1719 higher than that of model_all_clean.
The choice of the best model depends on the objectives and preferences within the specific context. It is recommended to consider both evaluation metrics as well as other relevant factors in the model selection process.
crime_clean$pred <- predict(object = model_backward_clean,
newdata = crime_clean)
# calculating RMSE
sqrt(mean((crime_clean$pred - crime_clean$crime_rate)^2))
#> [1] 156.1276
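The range referenced below was presumably obtained with:
# Range of the observed crime rates in the cleaned data
range(crime_clean$crime_rate)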
#> [1] 342 1993
The obtained RMSE value (156.1276) reflects the average prediction error of model_backward_clean on the crime_clean data. Considering the range of crime_rate values returned by range(crime_clean$crime_rate) (342 to 1993), an RMSE of 156.1276 is moderate relative to that spread.
In this context, an RMSE of 156.1276 indicates an acceptable average prediction error level relative to the range of crime_rate values.
Linear regression relies on several key assumptions for its validity and interpretation. These assumptions help ensure the reliability of the model’s estimates and the accuracy of its predictions.
The following are the main assumptions of linear regression:
The relationship between the dependent variable and the independent variables is assumed to be linear. This means that the change in the dependent variable is proportional to the change in the independent variables.
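One quick, informal check is the linear correlation between each retained predictor and the target. A sketch, using the predictors kept by the stepwise selection above:
# Predictors retained in model_backward_clean
preds <- c("percent_m", "mean_education", "m_per1000f", "nonwhites_per1000",
           "inequality", "prob_prison", "time_prison", "police_exp")
# Correlation of each retained predictor with the target variable
cor(crime_clean[, preds], crime_clean$crime_rate)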
Conclusion: The backward model can be considered to have a linear relationship between the predictors and the target variable.
Shapiro-Wilk hypothesis test:
H0: Errors are normally distributed.
H1: Errors are NOT normally distributed.
Expected condition: H0
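The test below was presumably run as:
# Shapiro-Wilk normality test on the model residuals
shapiro.test(model_backward_clean$residuals)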
#>
#> Shapiro-Wilk normality test
#>
#> data: model_backward_clean$residuals
#> W = 0.97923, p-value = 0.6317
Conclusion: p-value > alpha (0.05). Failed to reject H0, the normality assumption has been met according to the Shapiro-Wilk test.
The variance of the errors (residuals) is constant across all levels of the independent variables. This assumption implies that the spread of the residuals should be consistent throughout the range of predicted values.
# scatter plot
plot(x = model_backward_clean$fitted.values, y = model_backward_clean$residuals)
abline(h = 0, col = "red")

A statistical test uses bptest() from the lmtest package. Breusch-Pagan hypothesis test:

H0: Errors have constant variance (homoscedasticity).

H1: Errors do NOT have constant variance (heteroscedasticity).

Expected condition: H0
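Presumably run as:
library(lmtest)

# Breusch-Pagan test for non-constant error variance
bptest(model_backward_clean)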
#>
#> studentized Breusch-Pagan test
#>
#> data: model_backward_clean
#> BP = 5.3074, df = 8, p-value = 0.7243
Conclusion: p-value > alpha (0.05), so we fail to reject H0; the errors have constant spread (homoscedasticity), and the assumption is met.
Multicollinearity is the condition where there is a strong correlation among predictors in a model. This is undesirable as it indicates redundant predictors in the model, and ideally, only one variable should be selected from those highly correlated variables. The goal is to avoid multicollinearity.
The VIF (Variance Inflation Factor) test can be performed using the vif() function from the car package. As a common rule of thumb, a VIF above 10 signals problematic multicollinearity, while lower values are generally acceptable.
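The values below were presumably produced with:
library(car)

# Variance Inflation Factor of each predictor in the final model
vif(model_backward_clean)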
#> percent_m mean_education m_per1000f nonwhites_per1000
#> 1.818883 3.299894 1.734320 3.076223
#> inequality prob_prison time_prison police_exp
#> 4.038303 3.755321 2.202603 3.065742
Remark: all predictors in model_backward_clean show no signs of multicollinearity, as none of them has a VIF value above 10.
Adjusted R-squared: The adjusted R-squared value of 0.8041214 indicates that approximately 80.41% of the variation in the response variable (crime_rate) can be explained by the predictor variables in the model_backward_clean. This suggests that the model provides a good explanation of the variation in the data.
RMSE: The RMSE of 156.1276 indicates that, on average, the model_backward_clean predicts the crime_rate with an error of approximately 156.1276. In other words, on average, the model’s predictions differ by 156.1276 from the actual values of the response variable (crime_rate). A lower RMSE value indicates better prediction performance.
Comparison with actual values: to put the prediction errors in perspective, compare the RMSE with the range of actual values of the response variable (crime_rate). With a range of 342 to 1993, an RMSE of 156.1276 is a noticeable fraction of that spread, so the model's prediction errors may still be relatively large compared to the actual variation in the data.
To enhance this model, we have several options. First, we should consider augmenting the dataset with additional observations, since 47 may not be sufficient. We could also explore alternative predictor variables that exhibit stronger linear correlations with the target variable. Finally, predictions could be improved by trying other machine learning approaches suited to predicting crime rates; linear regression, while highly interpretable, may not be as strong predictively as algorithms such as Random Forest.