Introduction

This report describes a crime prediction analysis using linear regression on a crime dataset. The dataset has been used by criminologists to study the effect of punishment regimes on crime rates. Using linear regression analysis, we will build a model to predict the crime rate based on the various factors that influence it.

Load Packages

These are the basic packages that I will use throughout the process:

library(readr)      # reading rectangular data
library(dplyr)      # data wrangling
library(GGally)     # correlation matrix plot (ggcorr)
library(lmtest)     # Breusch-Pagan test (bptest)
library(car)        # variance inflation factor (vif)
library(MLmetrics)  # error metrics (MAE, MAPE)

Load Dataset

crime <- read.csv("quiz_4_rm-main/crime.csv")
head(crime)

Data Wrangling

glimpse(crime)
## Rows: 47
## Columns: 16
## $ percent_m            <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
## $ is_south             <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ mean_education       <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
## $ police_exp60         <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
## $ police_exp59         <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
## $ m_per1000f           <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
## $ state_pop            <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
## $ nonwhites_per1000    <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
## $ unemploy_m24         <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
## $ unemploy_m39         <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
## $ gdp                  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
## $ inequality           <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
## $ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
## $ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
## $ crime_rate           <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …

The variables are:

  • percent_m: number of males aged 14-24 per 1000 population
  • is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
  • mean_education: mean years of schooling times 10
  • police_exp60: police expenditure in 1960
  • police_exp59: police expenditure in 1959
  • labour_participation: labour force participation rate
  • m_per1000f: number of males per 1000 females
  • state_pop: state population (in hundred thousands)
  • nonwhites_per1000: number of non-whites resident per 1000 people
  • unemploy_m24: unemployment rate of urban males aged 14-24
  • unemploy_m39: unemployment rate of urban males aged 35-39
  • gdp: gross domestic product per head
  • inequality: income inequality
  • prob_prison: probability of imprisonment
  • time_prison: average time served in prison
  • crime_rate: crime rate in an unspecified category

From the result, we can see that the is_south variable, initially stored as a numeric 0/1 indicator, needs to be converted into a factor. Treating it as a category (‘South’ vs. ‘Not South’) equips us to examine regional disparities in crime rates more effectively and lets the regression model estimate a separate coefficient for each level.

crime <- crime %>%
  mutate(is_south = as.factor(is_south))
glimpse(crime)
## Rows: 47
## Columns: 16
## $ percent_m            <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
## $ is_south             <fct> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ mean_education       <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
## $ police_exp60         <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
## $ police_exp59         <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
## $ m_per1000f           <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
## $ state_pop            <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
## $ nonwhites_per1000    <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
## $ unemploy_m24         <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
## $ unemploy_m39         <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
## $ gdp                  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
## $ inequality           <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
## $ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
## $ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
## $ crime_rate           <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …
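
If more readable labels are preferred, the 0/1 levels can also be mapped to ‘Not South’ and ‘South’ directly. A minimal sketch (crime_labelled is an illustrative name; the rest of this report keeps the 0/1 levels):

# Optional relabelling of the factor levels for readability (not used below)
crime_labelled <- crime %>%
  mutate(is_south = factor(is_south, levels = c(0, 1), labels = c("Not South", "South")))
levels(crime_labelled$is_south)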

Exploratory Data Analysis

Missing Values

anyNA(crime)
## [1] FALSE

The anyNA() check returns FALSE, confirming that the crime dataset contains no missing values. The dataset is therefore complete, and no imputation or other handling of missing data points is needed.
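
For a per-column view, useful when only some variables are affected, a quick equivalent check is:

# Count of missing values in each column (all zeros here)
colSums(is.na(crime))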

Data Distribution and Outliers

boxplot(crime$crime_rate)


From the boxplot, the crime_rate variable contains three outliers, all with substantially higher values than the majority of data points. The distribution is also skewed toward lower values, with observations concentrated around the median (Q2).
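
The three outlying observations flagged by the boxplot can also be listed explicitly; a short sketch using base R's boxplot.stats():

# Values lying beyond 1.5 * IQR from the box, i.e. the plotted outliers
boxplot.stats(crime$crime_rate)$out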

Correlation

ggcorr(crime, label = TRUE, label_size = 2.5, hjust = 1, layout.exp = 2)
## Warning in ggcorr(crime, label = TRUE, label_size = 2.5, hjust = 1, layout.exp
## = 2): data in column(s) 'is_south' are not numeric and were ignored

The correlation matrix shows that police_exp60 and police_exp59 are very strongly correlated with each other. Keeping both would introduce multicollinearity, a concern in linear regression modeling, where predictor variables should not be strongly correlated. To uphold the model's reliability and its adherence to key assumptions, only one of the pair should enter a model: I retain police_exp60, the more recent figure, and drop police_exp59 when building the multiple regression model below. This step helps satisfy the no-multicollinearity requirement and supports the validity of the results in our crime rate prediction analysis.
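
The strength of that relationship can be confirmed numerically; the pairwise correlation between the two expenditure variables is expected to be very close to 1:

# Near-1 correlation: the two expenditure variables carry almost the same information
cor(crime$police_exp60, crime$police_exp59)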

Model Building

Simple Linear Regression

In building a simple linear regression model, it is important to choose a predictor variable that has a strong linear relationship with the target variable. Based on the correlation matrix, I will use police_exp60 as my predictor variable. Prior to model building, I will inspect whether the data contain any outliers.

plot(x = crime$police_exp60, y = crime$crime_rate, 
     main="Scatter Plot: police_exp60 vs. crime_rate",
     xlab="police_exp60",
     ylab="crime_rate"
)

The scatter plot of police_exp60 against crime_rate reveals a clear positive relationship: as police_exp60 increases, crime_rate tends to increase as well. This suggests that higher police expenditure in 1960 is associated with higher crime rates, although association alone does not establish causation.

boxplot(crime$police_exp60)

From the boxplot of police_exp60, there are no outliers present in the dataset. The distribution leans toward lower values, with the majority of observations clustered within a relatively narrow range below and around the median (Q2).

model_crime <- lm(formula = crime_rate ~ police_exp60, data = crime)
summary(model_crime)
## 
## Call:
## lm(formula = crime_rate ~ police_exp60, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -586.91 -155.63   32.52  139.58  568.84 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   144.464    126.693   1.140     0.26    
## police_exp60    8.948      1.409   6.353 9.34e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 283.9 on 45 degrees of freedom
## Multiple R-squared:  0.4728, Adjusted R-squared:  0.4611 
## F-statistic: 40.36 on 1 and 45 DF,  p-value: 9.338e-08

From model_crime, the key coefficient estimates are an intercept of 144.464 and a slope of 8.948 for police_exp60. These values allow us to write down the fitted regression equation:

\[\text{crime_rate} = 144.464 + 8.948 \cdot \text{police_exp60}\]

The regression parameters shed light on the relationship between police_exp60 and crime_rate. The intercept \(\beta_0\) = 144.464 is the predicted crime_rate when police_exp60 is zero, suggesting a baseline level of crime even in the absence of such expenditure (though note that the intercept itself is not statistically significant, p = 0.26). The slope \(\beta_1\) = 8.948 indicates that each one-unit increase in police_exp60 is associated with an expected rise of 8.948 units in crime_rate, a direct positive relationship between police expenditure and crime rates. Finally, the multiple R-squared of 47.28% indicates that approximately 47.28% of the variance in crime_rate is explained by police_exp60 alone.
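
To make the formula concrete, the fitted coefficients can be applied by hand and checked against predict(); a small sketch using an illustrative expenditure value of 100:

# Manual prediction from the fitted equation, compared with predict()
b <- coef(model_crime)
unname(b[1] + b[2] * 100)
predict(model_crime, newdata = data.frame(police_exp60 = 100))  # should match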

The following is a visualization of the regression line:

plot(x = crime$police_exp60, y = crime$crime_rate,
     xlab="police_exp60",
     ylab="crime_rate")
abline(model_crime, col = "red")

Multiple Linear Regression

To build a multiple linear regression model for crime_rate, I use an automated feature selection approach, specifically backward elimination via step-wise regression, to identify the most influential predictors among the available set. Not all variables enter the initial model: nonwhites_per1000 is excluded because its correlation with crime_rate is essentially zero, making it uninformative as a predictor, and police_exp59 is omitted because of its multicollinearity with police_exp60, which would make the coefficients ambiguous to interpret. This selective approach keeps the model to meaningful, non-redundant predictors, which improves both its predictive accuracy and its interpretability.

model_crime_all <- lm(formula = crime_rate ~ percent_m + is_south + mean_education + police_exp60 + labour_participation + m_per1000f + state_pop + unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + time_prison, data = crime)
model_crime_backward <- step(object = model_crime_all, direction = "backward")
summary(model_crime_backward)
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.70 -111.07    3.03  122.15  483.30 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6426.101   1194.611  -5.379 4.04e-06 ***
## percent_m          9.332      3.350   2.786  0.00828 ** 
## mean_education    18.012      5.275   3.414  0.00153 ** 
## police_exp60      10.265      1.552   6.613 8.26e-08 ***
## m_per1000f         2.234      1.360   1.642  0.10874    
## unemploy_m24      -6.087      3.339  -1.823  0.07622 .  
## unemploy_m39      18.735      7.248   2.585  0.01371 *  
## inequality         6.133      1.396   4.394 8.63e-05 ***
## prob_prison    -3796.032   1490.646  -2.547  0.01505 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
## F-statistic: 17.74 on 8 and 38 DF,  p-value: 1.159e-10

Backward step-wise selection refined the model from an initial pool of 13 features down to 8: percent_m, mean_education, police_exp60, m_per1000f, unemploy_m24, unemploy_m39, inequality, and prob_prison. Not all of these are statistically significant: six features, namely percent_m, mean_education, police_exp60, unemploy_m39, inequality, and prob_prison, significantly influence the target variable, while the remaining two, m_per1000f and unemploy_m24, do not reach significance at the 5% level. This selection process keeps the most relevant predictors in the model while excluding the less significant variables.

Based on the coefficients and significance levels provided in the model summary, we have derived a multiple linear regression formula for predicting ‘crime_rate’ as follows:

\[\small \text{crime_rate} = -6426.101 + 9.332 \cdot \text{percent_m} + 18.012 \cdot \text{mean_education} + 10.265 \cdot \text{police_exp60} + \] \[\small 2.234 \cdot \text{m_per1000f} - 6.087 \cdot \text{unemploy_m24} + 18.735 \cdot \text{unemploy_m39} + 6.133 \cdot \text{inequality} - 3796.032 \cdot \text{prob_prison}\]

This formula represents a multiple linear regression model in which crime_rate is predicted from the values of the respective predictor variables. Each coefficient captures both the strength and the direction of that predictor's influence on crime_rate, while the intercept (-6426.101) provides the baseline value when all predictors are zero. The model's adjusted R-squared of 74.44% indicates that the included predictors account for about 74.44% of the variability in crime_rate.
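
Equivalently, the fitted equation can be applied through the model's design matrix; a brief sketch for the first training observation:

# Manual application of the fitted equation to the first row of the data
X <- model.matrix(model_crime_backward)      # design matrix used by the model
drop(X[1, ] %*% coef(model_crime_backward))  # equals fitted(model_crime_backward)[1]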

Model Prediction

To evaluate the model's predictions, we use the crime_test data, which consists of 9 records. The first step is to align crime_test with the structure established in the previous analysis, in particular transforming the is_south variable into a factor, so that the test data meet the requirements of the regression model constructed above.

crime_test <- read.csv("quiz_4_rm-main/crime_test.csv")
crime_test <- crime_test %>%
  mutate(
    is_south = as.factor(is_south)
  )

MAE

crime_test$pred_crime_rate <- predict(model_crime_backward, newdata = crime_test)
MAE(y_pred = crime_test$pred_crime_rate, y_true = crime_test$crime_rate)
## [1] 180.7295
range(crime$crime_rate)
## [1]  342 1993

The Mean Absolute Error (MAE) of the prediction model is 180.7295. Relative to the range of crime_rate in the training data, which spans from 342 to 1,993 (a span of 1,651), this is only about 11% of the span. The model therefore exhibits relatively low error, making it a promising predictor, and the MAE's modest size relative to the data range underscores its effectiveness in capturing and predicting crime rates.
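
The MLmetrics call above is equivalent to the average absolute difference between predictions and actual values, which can be computed directly:

# Manual MAE: mean of absolute prediction errors
mean(abs(crime_test$pred_crime_rate - crime_test$crime_rate))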

MAPE

MAPE(y_pred = crime_test$pred_crime_rate , y_true = crime_test$crime_rate)*100
## [1] 23.39692

The Mean Absolute Percentage Error (MAPE) obtained is 23.4%. For MAPE, a smaller value signifies a more accurate model, so this result means the model's predictions deviate from the actual values by about 23% on average, a moderate but workable level of accuracy. Predictions of this quality can still be valuable in many decision-making and analytical scenarios, though there is clearly room for improvement.
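
As with MAE, the MAPE can be computed directly, which makes its reading as an average percentage error explicit:

# Manual MAPE: absolute error relative to each actual value, averaged, in percent
mean(abs(crime_test$pred_crime_rate - crime_test$crime_rate) / crime_test$crime_rate) * 100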

Prediction Interval

summary(model_crime)$r.squared
## [1] 0.4727999
summary(model_crime_all)$adj.r.squared
## [1] 0.7164553
summary(model_crime_backward)$adj.r.squared
## [1] 0.7443692

The step-wise backward selection process has produced the strongest of the three models, with an adjusted R-squared of 74.44%, meaning the model explains 74.44% of the variability in crime_rate. For prediction intervals and further analysis, we will therefore rely on this model.

crime_test$pred_crime_rate <- predict(model_crime_backward, newdata = crime_test)
pred_model_crime_interval <- predict(
  object = model_crime_backward,
  newdata = crime_test,
  interval = "prediction",
  level = 0.95)
head(pred_model_crime_interval)
##         fit      lwr      upr
## 1  745.0201 310.2824 1179.758
## 2 1012.3317 565.4985 1459.165
## 3  949.8039 513.9257 1385.682
## 4 1190.7017 764.5280 1616.876
## 5  772.4885 334.3809 1210.596
## 6  686.1097 268.7850 1103.434

The 95% prediction intervals provide, for each test observation, a range within which the actual crime_rate is expected to fall 95% of the time; only about 5% of observations are expected to land outside their intervals. An observed value lying within its prediction interval is therefore consistent with the model rather than an error. These intervals quantify the uncertainty around each point forecast and offer a level of confidence in the model's predictions.
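
Whether the actual test values indeed fall inside their 95% intervals can be verified directly; a short sketch:

# Proportion of test observations whose actual crime_rate lies inside its interval
covered <- crime_test$crime_rate >= pred_model_crime_interval[, "lwr"] &
  crime_test$crime_rate <= pred_model_crime_interval[, "upr"]
mean(covered)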

Model Assumption Evaluation

Linearity

plot(model_crime_backward, which = 1)

In the plot above, the residuals “bounce randomly” around zero with no systematic pattern, and the smoothed line stays close to the horizontal axis. This is exactly what we expect from an adequate linear model: the absence of curvature or trend in the residuals supports the assumption that the relationship between the predictors and crime_rate is linear.

Normality

The Shapiro-Wilk Hypothesis Test

hist(model_crime_backward$residuals)

shapiro.test(model_crime_backward$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_crime_backward$residuals
## W = 0.98511, p-value = 0.8051
  • H0: The errors are normally distributed.
  • H1: The errors are NOT normally distributed.

The test yields a p-value of 0.8051, well above the common significance level (alpha) of 0.05, so we do not have enough evidence to reject H0. We can therefore conclude that the model's errors are consistent with a normal distribution. This is a favorable outcome for linear regression analysis, as it supports the normality assumption and thereby the reliability of the model's predictions and inferences.
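
A normal Q-Q plot offers a complementary visual check: points lying close to the reference line are consistent with the Shapiro-Wilk result.

# Residual quantiles vs. theoretical normal quantiles
qqnorm(model_crime_backward$residuals)
qqline(model_crime_backward$residuals, col = "red")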

Heteroscedasticity

Breusch-Pagan Hypothesis Test:

plot(x = model_crime_backward$fitted.values, y = model_crime_backward$residuals)
abline(h = 0, col = "red")

bptest(model_crime_backward)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_crime_backward
## BP = 13.51, df = 8, p-value = 0.09546
  • H0: The errors exhibit constant or homoscedastic spread.
  • H1: The errors do NOT exhibit constant spread, indicating heteroscedasticity.

The test yields a p-value of 0.09546, which is greater than the common significance level (alpha) of 0.05, so we do not have enough evidence to reject H0. This suggests that the model's errors exhibit a constant (homoscedastic) spread, satisfying the homoscedasticity assumption.

Variance Inflation Factor (Multicollinearity)

  • VIF > 10 suggests the presence of multicollinearity within the model.
  • VIF < 10 indicates the absence of severe multicollinearity within the model.
vif(model_crime_backward)
##      percent_m mean_education   police_exp60     m_per1000f   unemploy_m24 
##       2.131963       4.189684       2.560496       1.932367       4.360038 
##   unemploy_m39     inequality    prob_prison 
##       4.508106       3.731074       1.381879

In this case, the VIF values for all features are below 10, a positive outcome. This indicates that the model is free from problematic multicollinearity: the predictor variables do not exhibit strong intercorrelations.

Conclusion

  • The crime rate prediction model yielded several significant insights. We began by preparing and exploring the dataset, transforming the is_south variable into a factor for categorical analysis. We confirmed that the data contain no missing values and, during data wrangling, ensured their readiness for regression analysis.
  • We constructed multiple regression models, including one selected by step-wise backward elimination, which exhibited the best goodness of fit with an adjusted R-squared of 74.44%. This became the model of choice for further analysis and prediction.
  • In the prediction phase, we used the test data to calculate the Mean Absolute Error (MAE) and the Mean Absolute Percentage Error (MAPE). The MAE of 180.7295, compared with the range of the data, suggested that the model's errors were relatively small, making it a promising predictor.
  • The Shapiro-Wilk test confirmed that the model's errors are normally distributed, and the Breusch-Pagan test supported homoscedasticity, ensuring that the error variances are constant.
  • Lastly, we verified that multicollinearity is not an issue in the model, as all features have Variance Inflation Factor (VIF) values below 10.