This report describe Crime Prediction Linear Regression Analysis using Crime Dataset. The dataset used by criminologists to study the effect of punishment regimes on crime rates. Using linear regression analysis, we are going to build a model to predict crime rate based on various factors that influence it.
These are the basic packages that I will use throughout the process:
library(readr)
library(dplyr)
library(GGally)
library(lmtest)
library(car)
library(MLmetrics)
crime <- read.csv("quiz_4_rm-main/crime.csv")
head(crime)
glimpse(crime)
## Rows: 47
## Columns: 16
## $ percent_m <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
## $ is_south <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ mean_education <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
## $ police_exp60 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
## $ police_exp59 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
## $ m_per1000f <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
## $ state_pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
## $ nonwhites_per1000 <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
## $ unemploy_m24 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
## $ unemploy_m39 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
## $ gdp <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
## $ inequality <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
## $ prob_prison <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
## $ time_prison <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
## $ crime_rate <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …
The variables are:
percent_m: number of males aged 14-24 per 1000
populationis_south: whether it is in a Southern state. 1 for Yes,
0 for No.mean_education: mean years of schooling times 10police_exp60: police expenditure in 1960police_exp59: police expenditure in 1959labour_participation: labour force participation
ratem_per1000f: number of males per 1000 femalesstate_pop: state population (in hundred thousands)nonwhites_per1000: number of non-whites resident per
1000 peopleunemploy_m24: unemployment rate of urban males aged
14-24unemploy_m39: unemployment rate of urban males aged
35-39gdp: gross domestic product per headinequality: income inequalityprob_prison: probability of imprisonmenttime_prison: avg time served in prisonscrime_rate: crime rate in an unspecified categoryFrom the result, we can see that the is_south variable,
initially represented as numeric values of 1 or
0, needs to be transformed into a factor. By categorizing
it into ‘South’ and ‘Not South,’ we are now better equipped to examine
regional disparities in crime rates more effectively and to estimate
separate coefficients for each category within our regression
model.”
crime <- crime %>%
mutate(is_south = as.factor(is_south))
glimpse(crime)
## Rows: 47
## Columns: 16
## $ percent_m <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
## $ is_south <fct> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ mean_education <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
## $ police_exp60 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
## $ police_exp59 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
## $ m_per1000f <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
## $ state_pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
## $ nonwhites_per1000 <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
## $ unemploy_m24 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
## $ unemploy_m39 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
## $ gdp <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
## $ inequality <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
## $ prob_prison <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
## $ time_prison <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
## $ crime_rate <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …
anyNA(crime)
## [1] FALSE
The absence of missing values in the crime dataset, denoted by the
result FALSE in our analysis, is a favorable outcome during
our data exploration. This finding indicates that the dataset is
complete, and there is no need for imputing or handling missing data
points.
boxplot(crime$crime_rate)
From the boxplot analysis, I observed that our dataset contains
three outliers, all of which have substantially higher values compared
to the majority of data points. Additionally, in terms of data
distribution, I noticed a concentration of data points around lower
values, particularly around the second quartile (Q2).
ggcorr(crime, label = TRUE, label_size = 2.5, hjust = 1, layout.exp = 2)
## Warning in ggcorr(crime, label = TRUE, label_size = 2.5, hjust = 1, layout.exp
## = 2): data in column(s) 'is_south' are not numeric and were ignored
From the correlation matrix analysis, it is evident that
police_exp60 and police_exp59 exhibit a high
degree of correlation with each other within the crime dataset, and as I
examine this, I recognize that this can introduce multicollinearity, a
concern in linear regression modeling where predictor variables should
not be strongly correlated. To uphold the models reliability and
adherence to key assumptions, I have decided to exclude
police_exp60, the latest data point from 1960, from the
model. This step is essential to ensure the model’s robustness and the
validity of the results in our crime rate prediction analysis, aligning
with the requirement for non-multicollinearity in regression
analysis.
In building simple linear regression, it is important to have
predictor variable that has high linear relationship with the target
variables. Based on the correlation matrix, I will be using
police_exp60 as my predictor variable. Prior to model
building, I will inspect if there is any outlier from our data.
plot(x = crime$police_exp60, y = crime$crime_rate,
main="Scatter Plot Sales vs. Profit",
xlab="police_exp60",
ylab="crime_rate"
)
The correlation plot between police_exp60 and
crime_rate has unveiled a compelling insight. I observe a
strong and statistically significant positive correlation between these
two variables. This signifies that as police_exp60
increases, there is a corresponding increase in crime_rate.
This insight suggests a noteworthy relationship, indicating that higher
police expenditure in the year 1960 is associated with higher crime
rates.
boxplot(crime$police_exp60)
From the boxplot analysis of police_exp60, it is evident
that there are no outliers present in the dataset. Furthermore, the data
distribution exhibits a tendency for values to cluster around the lower
end, specifically around the second quartile (Q2). This indicates that
the majority of observations fall within a relatively narrow range of
values on the lower scale.
model_crime <- lm(formula = crime_rate ~ police_exp60, data = crime)
summary(model_crime)
##
## Call:
## lm(formula = crime_rate ~ police_exp60, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -586.91 -155.63 32.52 139.58 568.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 144.464 126.693 1.140 0.26
## police_exp60 8.948 1.409 6.353 9.34e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 283.9 on 45 degrees of freedom
## Multiple R-squared: 0.4728, Adjusted R-squared: 0.4611
## F-statistic: 40.36 on 1 and 45 DF, p-value: 9.338e-08
From the model_crime I have extracted key coefficient
values, revealing that the intercept is 144.464 and the
coefficient for police_exp60 is 8.948.
These values have a significant implication as they enable me to
construct the regression formula for our model. Consequently, the
regression formula stands as follows:
\[\text{crime_rate} = 144.464 + 8.948
\cdot \text{police_exp60}\] The analysis of the regression model
parameters sheds light on the relationship between
police_exp60 and crime_rate. The intercept,
represented as \(\beta_0\), signifies
that when there is no police expenditure in 1960, the predicted
crime_rate maintains a baseline value of 144.464,
suggesting a persistent level of crime even in the absence of such
expenditure. The coefficient or slope, \(\beta_1\), indicates that for each unit
increase in police_exp60, crime_rate is
anticipated to rise by 8.948 units. This reveals a direct and positive
connection between police expenditure and crime rates, emphasizing that
higher investment in law enforcement corresponds to an increased impact
on crime rates. and The model’s multiple R-squared, obtained at 47.28%,
indicates that approximately 47.28% of the variance in
crime_rate can be explained by the predictor
police_exp60 alone.
The following is a visualization of the regression line:
plot(x = crime$police_exp60, y = crime$crime_rate,
xlab="police_exp60",
ylab="crime_rate")
abline(model_crime, col = "red")
In the process of building a multiple linear regression model to
predict crime_rate, an automated feature selection
approach, specifically backward elimination through step-wise
regression, is employed. This technique aims to identify the most
influential predictor variables among the available set. It is essential
to note that not all variables are retained in the model;
nonwhites_per1000 is excluded due to its zero correlation
with crime_rate, rendering it irrelevant as a predictor.
Additionally, police_exp59 is omitted because of
multicollinearity with police_exp60, which can lead to
ambiguous coefficient interpretation. This selective approach ensures
that the model only includes the most meaningful and non-redundant
predictor variables, ultimately enhancing its predictive accuracy and
interpretability while focusing on capturing the most critical
relationships for crime_rate prediction.
model_crime_all <- lm(formula = crime_rate ~ percent_m + is_south + mean_education + police_exp60 + labour_participation + m_per1000f + state_pop + unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + time_prison, data = crime)
model_crime_backward <- step(object = model_crime_all, direction = "backward")
summary(model_crime_backward)
##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison,
## data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -444.70 -111.07 3.03 122.15 483.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6426.101 1194.611 -5.379 4.04e-06 ***
## percent_m 9.332 3.350 2.786 0.00828 **
## mean_education 18.012 5.275 3.414 0.00153 **
## police_exp60 10.265 1.552 6.613 8.26e-08 ***
## m_per1000f 2.234 1.360 1.642 0.10874
## unemploy_m24 -6.087 3.339 -1.823 0.07622 .
## unemploy_m39 18.735 7.248 2.585 0.01371 *
## inequality 6.133 1.396 4.394 8.63e-05 ***
## prob_prison -3796.032 1490.646 -2.547 0.01505 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared: 0.7888, Adjusted R-squared: 0.7444
## F-statistic: 17.74 on 8 and 38 DF, p-value: 1.159e-10
The application of backward step-wise regression for feature
selection led to a refined model from an initial pool of 13 features.
The final model consists of 8 features:
percent_m, mean_education, police_exp60, m_per1000f, unemploy_m24, unemploy_m39, inequality, and prob_prison.
Notably, among these 8 features, not all exhibit a strong correlation
with crime_rate Specifically, 6 of these features, namely
percent_m, mean_education, police_exp60, unemploy_m39, inequality, and prob_prison,
are found to significantly influence the target variable. However, it’s
worth noting that 2 features, m_per1000f and unemploy_m24,
do not exhibit statistical significance in predicting
crime_rate This rigorous feature selection process ensures
that the model comprises the most relevant predictors, ultimately
enhancing its effectiveness in capturing the essential relationships
while excluding less significant variables.
Based on the coefficients and significance levels provided in the model summary, we have derived a multiple linear regression formula for predicting ‘crime_rate’ as follows:
\[\small \text{crime_rate} = -6426.101 + 9.332 \cdot \text{percent_m} + 18.012 \cdot \text{mean_education} + 10.265 \cdot \text{police_exp60} + \] \[\small 2.234 \cdot \text{m_per1000f} - 6.087 \cdot \text{unemploy_m24} + 18.735 \cdot \text{unemploy_m39} + 6.133 \cdot \text{inequality} - 3796.032 \cdot \text{prob_prison}\]
This formula represents a multiple linear regression model, where
crime_rate can be predicted based on the values of the
respective predictor variables. Each coefficient in the formula
represents both the strength and direction of its influence on
crime_rate, while the intercept (-6426.101) provides the
baseline value when all predictors are zero. Additionally, it’s
essential to note that the model’s adjusted R-squared value, which is
approximately 74.44%, indicates that about 74.44% of the variability in
crime_rate is accounted for by the included predictors.
To build our prediction model, we will use the
crime_test data, which consists of 9 records. The initial
step involves aligning the crime_test data with the
structure we have established in previous analysis. An essential part of
this process is transforming the is_south variable into a
factor. By following these steps, we ensure that the test data in
crime_test meets the necessary requirements for use in our
previously constructed regression model.
crime_test <- read.csv("quiz_4_rm-main/crime_test.csv")
crime_test <- crime_test %>%
mutate(
is_south = as.factor(is_south)
)
crime_test$pred_crime_rate <- predict(model_crime_backward, newdata = crime_test)
MAE(y_pred = crime_test$pred_crime_rate, y_true = crime_test$crime_rate)
## [1] 180.7295
range(crime$crime_rate)
## [1] 342 1993
The Mean Absolute Error (MAE) value I obtained for our prediction
model is 180.7295. When I consider the range of the
crime_rate in our dataset, which spans from 342 to
1.993, it’s evident that the MAE well within this range. This
observation indicates that our model exhibits relatively low error,
making it a promising predictor. The MAE’s alignment with the data range
underscores the effectiveness of model in capturing and predicting crime
rates,
MAPE(y_pred = crime_test$pred_crime_rate , y_true = crime_test$crime_rate)*100
## [1] 23.39692
The Mean Absolute Percentage Error (MAPE) value obtained is 23.4%. It’s important to note that in the context of MAPE, a smaller value signifies a more accurate and reliable model. Therefore, our MAPE of 23.4% indicates that model is performing relatively well in terms of prediction accuracy. A lower MAPE suggests that the model’s predictions are, on average, closer to the actual values, which is a positive outcome for predictive efforts. It demonstrates the model’s capability to provide reasonably accurate forecasts, which can be valuable in various decision-making and analytical scenarios.
summary(model_crime)$r.squared
## [1] 0.4727999
summary(model_crime_all)$adj.r.squared
## [1] 0.7164553
summary(model_crime_backward)$adj.r.squared
## [1] 0.7443692
The step-wise backward feature selection process has produced the
most robust model among the three models developed, with an impressive
R-squared value of 74.44%. This R-squared value
represents the model’s ability to explain 74.44% of the
variability in the crime_rate making it the most accurate
predictor. Therefore, for prediction intervals and further analysis, we
will rely on this high-performing model as it provides the most reliable
forecasts for crime_rate.
crime_test$pred_crime_rate <- predict(model_crime_backward, newdata = crime_test)
pred_model_crime_interval <- predict(
object = model_crime_backward,
newdata = crime_test,
interval = "prediction",
level = 0.95)
head(pred_model_crime_interval)
## fit lwr upr
## 1 745.0201 310.2824 1179.758
## 2 1012.3317 565.4985 1459.165
## 3 949.8039 513.9257 1385.682
## 4 1190.7017 764.5280 1616.876
## 5 772.4885 334.3809 1210.596
## 6 686.1097 268.7850 1103.434
The prediction intervals generated by model exhibit a consistent
pattern where the values of predictor features consistently reside
within the defined range. This suggests that model is designed to
accommodate a 5% margin of error. Consequently, when a
value falls within the prediction interval, it is not considered an
error. This characteristic underscores the model’s robustness and its
ability to provide reliable predictions for crime_rate
within the specified intervals, offering a level of confidence in the
accuracy of these forecasts.
plot(model_crime_backward, which = 1)
From the plot above, it is evident that the residual values
“bounce randomly” around zero, indicating the presence of
irregular patterns. This suggests that the constructed model can be
considered a reasonably adequate linear model, as the visual line it
generates maintains a sense of balance and alignment with the observed
data. Despite the fluctuations in the residuals, the regression line
retains the characteristic of being nearly straight, supporting the
understanding that this model is suitable for explaining the
relationship between the predictor variables and
crime_rate.
The Shapiro-Wilk Hypothesis Test
hist(model_crime_backward$residuals)
shapiro.test(model_crime_backward$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_crime_backward$residuals
## W = 0.98511, p-value = 0.8051
H0: The errors are normally distributed.H1: The errors are NOT normally distributed.Upon conducting the test, the obtained p-value is 0.8051, which exceeds the common significance level (alpha) of 0.05. As a result, we do not have enough evidence to reject the H0. Consequently, we can conclude that the errors in our model follow a normal distribution. This is a favorable outcome in the context of linear regression analysis, as it supports the underlying assumption of normality, validating the reliability of our model’s predictions and inferences.
Breusch-Pagan Hypothesis Test:
plot(x = model_crime_backward$fitted.values, y = model_crime_backward$residuals)
abline(h = 0, col = "red")
bptest(model_crime_backward)
##
## studentized Breusch-Pagan test
##
## data: model_crime_backward
## BP = 13.51, df = 8, p-value = 0.09546
H0: The errors exhibit constant or homoscedastic
spread.H1: The errors do NOT exhibit constant spread,
indicating heteroscedasticity.After conducting the test, the resulting p-value is calculated to be 0.09546, which is greater than the common significance level (alpha) of 0.05. As a result, we do not have enough evidence to reject the H0. This suggests that the errors in model do exhibit a constant or homoscedastic spread, meeting the assumption of homoscedasticity.
vif(model_crime_backward)
## percent_m mean_education police_exp60 m_per1000f unemploy_m24
## 2.131963 4.189684 2.560496 1.932367 4.360038
## unemploy_m39 inequality prob_prison
## 4.508106 3.731074 1.381879
In case, the VIF values calculated for all features are less than 10, which is a positive outcome. This signifies that model is free from multicollinearity, meaning that the predictor variables do not exhibit strong intercorrelations.