This report details the process of building a linear regression model for crime rate from a crime dataset.
We first load the required libraries.
library(tidyverse)
library(caret)
library(plotly)
library(ggplot2)
library(data.table)
library(GGally)
library(tidymodels)
library(car)
library(scales)
library(lmtest)
options(scipen = 100, max.print = 1e+06)
Next, we load the data. The raw dataset has 17 columns and 47 rows. The column names are:
colnames(crime)
## [1] "V1" "M" "So" "Ed" "Po1" "Po2" "LF" "M.F" "Pop" "NW"
## [11] "U1" "U2" "GDP" "Ineq" "Prob" "Time" "y"
We drop the index column and rename the remaining columns as follows:
crime <- read.csv("data_input/crime.csv") %>%
dplyr::select(-X)
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
str(crime)
## 'data.frame': 47 obs. of 16 variables:
## $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
## $ is_south : int 1 0 1 0 0 0 1 1 1 0 ...
## $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
## $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
## $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
## $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
## $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
## $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
## $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
## $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
## $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
## $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
## $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
## $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
## $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
## $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
After renaming, the variables are:

- percent_m: percentage of males aged 14-24
- is_south: whether the state is a Southern state (1 for Yes, 0 for No)
- mean_education: mean years of schooling
- police_exp60: police expenditure in 1960
- police_exp59: police expenditure in 1959
- labour_participation: labour force participation rate
- m_per1000f: number of males per 1000 females
- state_pop: state population
- nonwhites_per1000: number of non-white residents per 1000 people
- unemploy_m24: unemployment rate of urban males aged 14-24
- unemploy_m39: unemployment rate of urban males aged 35-39
- gdp: gross domestic product per head
- inequality: income inequality
- prob_prison: probability of imprisonment
- time_prison: average time served in prison
- crime_rate: crime rate in an unspecified category
We check for missing values.
colSums(is.na(crime))
## percent_m is_south mean_education
## 0 0 0
## police_exp60 police_exp59 labour_participation
## 0 0 0
## m_per1000f state_pop nonwhites_per1000
## 0 0 0
## unemploy_m24 unemploy_m39 gdp
## 0 0 0
## inequality prob_prison time_prison
## 0 0 0
## crime_rate
## 0
There are no missing values. Next, we inspect the summary statistics:
summary(crime)
## percent_m is_south mean_education police_exp60
## Min. :119.0 Min. :0.0000 Min. : 87.0 Min. : 45.0
## 1st Qu.:130.0 1st Qu.:0.0000 1st Qu.: 97.5 1st Qu.: 62.5
## Median :136.0 Median :0.0000 Median :108.0 Median : 78.0
## Mean :138.6 Mean :0.3404 Mean :105.6 Mean : 85.0
## 3rd Qu.:146.0 3rd Qu.:1.0000 3rd Qu.:114.5 3rd Qu.:104.5
## Max. :177.0 Max. :1.0000 Max. :122.0 Max. :166.0
## police_exp59 labour_participation m_per1000f state_pop
## Min. : 41.00 Min. :480.0 Min. : 934.0 Min. : 3.00
## 1st Qu.: 58.50 1st Qu.:530.5 1st Qu.: 964.5 1st Qu.: 10.00
## Median : 73.00 Median :560.0 Median : 977.0 Median : 25.00
## Mean : 80.23 Mean :561.2 Mean : 983.0 Mean : 36.62
## 3rd Qu.: 97.00 3rd Qu.:593.0 3rd Qu.: 992.0 3rd Qu.: 41.50
## Max. :157.00 Max. :641.0 Max. :1071.0 Max. :168.00
## nonwhites_per1000 unemploy_m24 unemploy_m39 gdp
## Min. : 2.0 Min. : 70.00 Min. :20.00 Min. :288.0
## 1st Qu.: 24.0 1st Qu.: 80.50 1st Qu.:27.50 1st Qu.:459.5
## Median : 76.0 Median : 92.00 Median :34.00 Median :537.0
## Mean :101.1 Mean : 95.47 Mean :33.98 Mean :525.4
## 3rd Qu.:132.5 3rd Qu.:104.00 3rd Qu.:38.50 3rd Qu.:591.5
## Max. :423.0 Max. :142.00 Max. :58.00 Max. :689.0
## inequality prob_prison time_prison crime_rate
## Min. :126.0 Min. :0.00690 Min. :12.20 Min. : 342.0
## 1st Qu.:165.5 1st Qu.:0.03270 1st Qu.:21.60 1st Qu.: 658.5
## Median :176.0 Median :0.04210 Median :25.80 Median : 831.0
## Mean :194.0 Mean :0.04709 Mean :26.60 Mean : 905.1
## 3rd Qu.:227.5 3rd Qu.:0.05445 3rd Qu.:30.45 3rd Qu.:1057.5
## Max. :276.0 Max. :0.11980 Max. :44.00 Max. :1993.0
We check for correlations between the variables.
ggcorr(crime, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
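As a numeric complement to the correlation plot, we can also rank the variables by their correlation with the target (a small addition; its output is not part of the original report):

# correlation of every variable with crime_rate, strongest first
sort(cor(crime)[, "crime_rate"], decreasing = TRUE)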
Upon inspection, all variables are in a suitable form for our analysis; no further modifications to the dataset are required for now.
We now split the dataset into “train” and “test” sets. We will use the training set to fit the linear regression model; the test set will be used to check whether the model overfits. We assign 70% of the data to the training set.
set.seed(123)
samplesize <- round(0.7 * nrow(crime), 0)
index <- sample(seq_len(nrow(crime)), size = samplesize)
crime_train <- crime[index, ]
crime_test <- crime[-index, ]
We now begin modelling the dataset with linear regression. First, we model “crime_rate” using all available predictor variables.
set.seed(123)
crime_lm <- lm(crime_rate ~ ., data = crime_train)
summary(crime_lm)
##
## Call:
## lm(formula = crime_rate ~ ., data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -304.71 -91.82 -31.14 101.84 426.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10332.3968 2232.6587 -4.628 0.00024 ***
## percent_m 9.6677 4.4748 2.160 0.04530 *
## is_south 8.1011 223.3111 0.036 0.97148
## mean_education 26.2959 7.6978 3.416 0.00329 **
## police_exp60 20.0908 11.8628 1.694 0.10858
## police_exp59 -14.8569 13.5366 -1.098 0.28771
## labour_participation -4.6592 2.1721 -2.145 0.04669 *
## m_per1000f 8.2657 2.8606 2.889 0.01019 *
## state_pop 1.3699 1.7925 0.764 0.45521
## nonwhites_per1000 1.3334 0.9991 1.335 0.19960
## unemploy_m24 -11.9795 5.1323 -2.334 0.03212 *
## unemploy_m39 14.2389 9.8688 1.443 0.16724
## gdp 0.6436 1.2624 0.510 0.61671
## inequality 4.2472 2.8552 1.488 0.15518
## prob_prison 940.1481 3823.0185 0.246 0.80869
## time_prison 14.7756 9.4304 1.567 0.13558
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 199.9 on 17 degrees of freedom
## Multiple R-squared: 0.8813, Adjusted R-squared: 0.7766
## F-statistic: 8.414 on 15 and 17 DF, p-value: 0.00003846
From the above we can identify the predictor variables whose Pr(>|t|) values are larger than 0.05: ‘is_south’, ‘police_exp60’, ‘police_exp59’, ‘state_pop’, ‘nonwhites_per1000’, ‘unemploy_m39’, ‘gdp’, ‘inequality’, ‘prob_prison’ and ‘time_prison’. A Pr(>|t|) value larger than 0.05 signifies a predictor that is not statistically significant for predicting the target ‘crime_rate’. Thus, we can try to simplify the model by removing all such predictors, re-fit the model, and see whether we can improve on the Adjusted R-squared of 0.7766.
crime2 <- crime %>% select(!c(is_south, police_exp60, police_exp59, state_pop, nonwhites_per1000, unemploy_m39, gdp, inequality, prob_prison, time_prison))
set.seed(123)
samplesize <- round(0.7 * nrow(crime2), 0)
index <- sample(seq_len(nrow(crime2)), size = samplesize)
crime2_train <- crime2[index, ]
crime2_test <- crime2[-index, ]
set.seed(123)
crime_lm2 <- lm(crime_rate ~ ., data = crime2_train)
summary(crime_lm2)
##
## Call:
## lm(formula = crime_rate ~ ., data = crime2_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -683.32 -308.90 -23.25 185.01 939.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3415.575 2711.750 -1.260 0.219
## percent_m 3.349 7.444 0.450 0.656
## mean_education 14.013 8.712 1.608 0.119
## labour_participation -3.359 2.903 -1.157 0.257
## m_per1000f 5.157 3.948 1.306 0.203
## unemploy_m24 -8.766 6.644 -1.320 0.198
##
## Residual standard error: 414.2 on 27 degrees of freedom
## Multiple R-squared: 0.1904, Adjusted R-squared: 0.0405
## F-statistic: 1.27 on 5 and 27 DF, p-value: 0.3054
In this second attempt, the Adjusted R-squared we get is 0.0405: roughly 4% of the variance in the response variable (crime_rate) can be explained by the predictors. This is a deterioration, not an improvement, compared to the first attempt. We will keep the first model as LM 1 to compare with other linear models.
For our second model (LM 2) we run the stepAIC() function, which performs stepwise regression via backward and forward elimination (both directions) and selects the model with the lowest AIC for us.
library(MASS)  # note: MASS masks dplyr::select
crime_lm <- lm(crime_rate ~ ., data = crime_train)
# Stepwise regression model, both directions
step.model <- stepAIC(crime_lm, direction = "both", trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## labour_participation + m_per1000f + nonwhites_per1000 + unemploy_m24 +
## unemploy_m39 + inequality + time_prison, data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -283.69 -80.50 -16.92 82.84 429.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9643.5294 1570.9487 -6.139 0.00000353 ***
## percent_m 8.3543 3.9050 2.139 0.043752 *
## mean_education 25.3573 6.1671 4.112 0.000459 ***
## police_exp60 8.3365 1.7646 4.724 0.000103 ***
## labour_participation -3.5144 1.5221 -2.309 0.030727 *
## m_per1000f 7.1285 2.0723 3.440 0.002337 **
## nonwhites_per1000 1.1725 0.6474 1.811 0.083817 .
## unemploy_m24 -10.2895 3.9585 -2.599 0.016368 *
## unemploy_m39 16.0240 8.4283 1.901 0.070453 .
## inequality 4.6137 1.7115 2.696 0.013205 *
## time_prison 15.7576 5.6929 2.768 0.011222 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.1 on 22 degrees of freedom
## Multiple R-squared: 0.8682, Adjusted R-squared: 0.8082
## F-statistic: 14.49 on 10 and 22 DF, p-value: 0.0000001705
Using stepAIC() we have achieved an Adjusted R-squared of 0.8082. Notice the predictors chosen for the final formula.
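If we want to extract the selected formula directly (not shown above), we can query the fitted object:

# formula selected by the stepwise procedure
formula(step.model)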
Before we proceed to evaluating the linear models, we take the opportunity to identify the most important variables.
varImp(step.model)
## Overall
## percent_m 2.139380
## mean_education 4.111694
## police_exp60 4.724382
## labour_participation 2.308865
## m_per1000f 3.439948
## nonwhites_per1000 1.811009
## unemploy_m24 2.599377
## unemploy_m39 1.901216
## inequality 2.695704
## time_prison 2.767931
We can see above that the top two predictors are ‘police_exp60’ and ‘mean_education’.
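As an optional visual aid (an addition to the original analysis), the importance scores can be plotted as a horizontal bar chart:

vi <- varImp(step.model)  # data frame with one 'Overall' column
barplot(sort(vi$Overall),
        names.arg = rownames(vi)[order(vi$Overall)],
        horiz = TRUE, las = 1, main = "Variable importance (|t| value)")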
We compute the RMSE (Root Mean Square Error) of LM 1 on the train dataset.
# RMSE of train dataset
RMSE(pred = crime_lm$fitted.values, obs = crime_train$crime_rate)
## [1] 143.446
We compute the RMSE (Root Mean Square Error) with the test dataset.
lm_pred1 <- predict(crime_lm, newdata = crime_test)
RMSE(pred = lm_pred1, obs = crime_test$crime_rate)
## [1] 434.199
The RMSE (root mean square error) measures the typical difference between the actual values and the predicted values. We see that there is a large gap between the RMSEs for the train and test datasets (train RMSE < test RMSE). This suggests that the model overfits the train dataset and is not a good fit for unseen data.
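For clarity, the RMSE is simply the square root of the mean squared difference between predictions and observations; the caret RMSE() call above is equivalent to this hand computation:

# manual RMSE on the test set, equivalent to caret::RMSE()
sqrt(mean((lm_pred1 - crime_test$crime_rate)^2))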
We now compute the RMSE of LM 2 (the stepwise model) on the train dataset.
RMSE(pred = step.model$fitted.values, obs = crime_train$crime_rate)
## [1] 151.1674
We compute the RMSE (Root Mean Square Error) with the test dataset.
lm_pred2 <- predict(step.model, newdata = crime_test)
RMSE(pred = lm_pred2, obs = crime_test$crime_rate)
## [1] 376.4441
Again, the train RMSE is lower than the test RMSE, but the gap is smaller than for LM 1, suggesting that LM 2 generalises somewhat better.
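To make the comparison easier to read, the four RMSEs can be collected into a small table (a sketch added here for convenience):

data.frame(
  model      = c("LM 1 (all predictors)", "LM 2 (stepwise)"),
  rmse_train = c(RMSE(crime_lm$fitted.values, crime_train$crime_rate),
                 RMSE(step.model$fitted.values, crime_train$crime_rate)),
  rmse_test  = c(RMSE(lm_pred1, crime_test$crime_rate),
                 RMSE(lm_pred2, crime_test$crime_rate))
)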
A linear regression model is only as good as the validity of its assumptions. In this section we check those assumptions for the best linear regression model we have obtained, LM 2.
The assumptions of a linear regression model are:

- Linearity: the relationship between the predictors and the mean of the target is linear.
- Homoscedasticity: the variance of the residuals is the same for any value of the predictors.
- Independence: observations are independent of each other.
- Normality: for any fixed value of the predictors, the target is normally distributed.
We check linearity first: the relationship between the predictors and the target should be linear.
model.diag.metrics <- augment(step.model)
head(model.diag.metrics)
## # A tibble: 6 × 18
## .rownames crime_rate percent_m mean_…¹ polic…² labou…³ m_per…⁴ nonwh…⁵ unemp…⁶
## <chr> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 31 373 140 93 55 535 1045 20 135
## 2 15 798 152 87 57 530 986 72 92
## 3 14 664 135 117 62 595 986 46 77
## 4 3 578 142 89 45 533 969 219 94
## 5 42 542 141 109 56 523 968 2 107
## 6 37 831 177 87 58 638 974 349 76
## # … with 9 more variables: unemploy_m39 <int>, inequality <int>,
## # time_prison <dbl>, .fitted <dbl>, .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## # .cooksd <dbl>, .std.resid <dbl>, and abbreviated variable names
## # ¹mean_education, ²police_exp60, ³labour_participation, ⁴m_per1000f,
## # ⁵nonwhites_per1000, ⁶unemploy_m24
library(ggfortify)
## Registered S3 method overwritten by 'ggfortify':
## method from
## autoplot.glmnet parsnip
par(mfrow = c(2, 2))
autoplot(step.model)
In the residuals vs fitted plot above, the blue line stays close to the dotted horizontal line and the residuals show no systematic pattern; these are signs that the linearity assumption holds for the model.
The Q-Q plot is almost a straight line, showing no sign of skewness, which is typical of data drawn from a normal distribution.
The residuals vs leverage plot shows that the model produces some extreme residuals (outliers relative to the actual data) that are highly influential on the error measures.
The scale-location plot (square root of the standardized residuals vs fitted values) does not show a horizontal blue line; instead, the spread varies with the fitted values, which may suggest the presence of heteroscedasticity.
We can conclude that the linearity assumption holds for this model.
As an addition to the Q-Q plot, we can perform the Shapiro-Wilk test to confirm normality.
shapiro.test(step.model$residuals)
##
## Shapiro-Wilk normality test
##
## data: step.model$residuals
## W = 0.97453, p-value = 0.6143
The p-value is greater than 0.05, so we do not reject the null hypothesis that the residuals are normally distributed. Hence we can assume normality, which confirms the reading of our Q-Q plot.
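As an extra visual check of normality (an addition; not part of the original output), we could also look at a histogram of the residuals:

hist(step.model$residuals, breaks = 10,
     main = "Distribution of LM 2 residuals", xlab = "Residual")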
The linear model assumes that there is no correlation between the residuals. The Durbin-Watson test detects the presence of autocorrelation in the residuals of a regression model.
durbinWatsonTest(step.model)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1715621 1.641577 0.258
## Alternative hypothesis: rho != 0
The D-W statistic is between 1.5 and 2.5 and the p-value is above 0.05; hence the null hypothesis is not rejected. We read this as an indication that there is no autocorrelation between the residuals.
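As a cross-check, the lmtest package (already loaded) offers dwtest(), which computes the same statistic:

# Durbin-Watson test from lmtest; should agree with car::durbinWatsonTest()
dwtest(step.model)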
Next, we perform the Breusch-Pagan test for heteroscedasticity.
bptest(step.model)
##
## studentized Breusch-Pagan test
##
## data: step.model
## BP = 9.5527, df = 10, p-value = 0.4806
Since the p-value is greater than 0.05, there is no sufficient evidence that heteroscedasticity is present. This does not concur with our observation from the scale-location plot.
We compute the Variance Inflation Factors (VIF) of the model to check for multicollinearity.
vif(step.model)
## percent_m mean_education police_exp60
## 2.008588 4.492356 3.075057
## labour_participation m_per1000f nonwhites_per1000
## 3.797952 3.449977 2.662333
## unemploy_m24 unemploy_m39 inequality
## 4.145015 4.330561 3.842859
## time_prison
## 1.739850
There are no values >= 10, hence there is no evidence of problematic multicollinearity.
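For reference, the VIF of a predictor is 1 / (1 - R²_j), where R²_j comes from regressing that predictor on the other predictors in the model. As an illustrative sketch (an addition to the original analysis), the VIF of police_exp60 can be reproduced by hand:

# regress police_exp60 on the other predictors retained by step.model
r2 <- summary(lm(police_exp60 ~ percent_m + mean_education + labour_participation +
                   m_per1000f + nonwhites_per1000 + unemploy_m24 + unemploy_m39 +
                   inequality + time_prison, data = crime_train))$r.squared
1 / (1 - r2)  # should match vif(step.model)["police_exp60"]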
The best regression model (LM 2) was obtained by stepwise regression in both the forward and backward directions. LM 2 achieves an Adjusted R-squared of 0.8082, and all the tests conducted on LM 2 appear to agree with the assumptions of a linear regression model.