We will learn to use linear regression model using crime dataset. We want to know the relationship among variables, especially between the crime rate with other variables. We also want to predict the future crime rate based on the historical data.
In the following code we take a peek at a dataset used by criminologists to study the effect of punishment regimes on crime rates. We’ll read the dataset and rename the columns.
Load the required package
library(caret)## Loading required package: ggplot2
## Loading required package: lattice
library(plotly)##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(data.table)
library(GGally) ## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(car)## Loading required package: carData
library(scales)
library(lmtest)## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(dplyr)##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(leaps)
library(MASS)##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:plotly':
##
## select
options(scipen = 100, max.print = 1e+06)Load the dataset.
crime <- read.csv("data_input/crime.csv") %>%
dplyr::select(-X)
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
crime$police_exp <- crime$police_exp59 + crime$police_exp60
crime_s <- subset(crime, select=-c(police_exp59, police_exp60))glimpse(crime_s)## Rows: 47
## Columns: 15
## $ percent_m <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
## $ is_south <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ mean_education <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
## $ m_per1000f <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
## $ state_pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
## $ nonwhites_per1000 <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
## $ unemploy_m24 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
## $ unemploy_m39 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
## $ gdp <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
## $ inequality <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
## $ prob_prison <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
## $ time_prison <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
## $ crime_rate <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …
## $ police_exp <int> 114, 198, 89, 290, 210, 233, 161, 224, 127, 139, …
The data has 47 rows and 15 columns. The first column is a unique
identifier for each data, so we can remove it. Field
police_exp59 and police_exp60 can be merged to
police_exp. Our target variable is the
crime_rate, which signifies the rate of the crime. We will
use other variable.
The dataset was collected in 1960 and a full description of the
dataset wasn’t conveniently available. I use the description I gathered
from the authors of the MASS package. After you rename the dataset (in
your coursebook, around line 410 to line 420), the variables are:
- percent_m: percentage of males aged 14-24 -
is_south: whether it is in a Southern state. 1 for Yes, 0
for No.
- mean_education: mean years of schooling
- police_exp60: police expenditure in 1960
- police_exp59: police expenditure in 1959 -
labour_participation: labour force participation rate
- m_per1000f: number of males per 1000 females
- state_pop: state population
- nonwhites_per1000: number of non-whites resident per 1000
people
- unemploy_m24: unemployment rate of urban males aged
14-24
- unemploy_m39: unemployment rate of urban males aged
35-39
- gdp: gross domestic product per head
- inequality: income inequality
- prob_prison: probability of imprisonment
- time_prison: avg time served in prisons
- crime_rate: crime rate in an unspecified category
Exploratory data analysis is a phase where we explore the data variables, see if there are any pattern that can indicate any kind of correlation between variables.
Find the Pearson correlation between features.
ggcorr(crime, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)The graphic shows that a few of variables has strong correlation with
the crime_rate variables.
Before we make the model, we need to split the data into train dataset and test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparasion and see if the model get overfit and can not predict new data that hasn’t been seen during training phase. We will 80% of the data as the training data and the rest of it as the testing data.
set.seed(123)
samplesize <- round(0.8 * nrow(crime_s), 0)
index <- sample(seq_len(nrow(crime_s)), size = samplesize)
data_train <- crime_s[index, ]
data_test <- crime_s[-index, ]Now we will try to model the linear regression using
crime_rate as the target variable
set.seed(123)
model_all <- lm(crime_rate ~ ., data = data_train)
summary(model_all)##
## Call:
## lm(formula = crime_rate ~ ., data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -409.73 -104.29 -0.07 113.64 488.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7323.86557 1940.81527 -3.774 0.000985 ***
## percent_m 8.38471 4.61466 1.817 0.082276 .
## is_south 166.08123 202.14547 0.822 0.419744
## mean_education 17.89829 7.24483 2.470 0.021336 *
## labour_participation 0.43050 1.64230 0.262 0.795552
## m_per1000f 3.13984 2.38274 1.318 0.200566
## state_pop -0.07351 1.64892 -0.045 0.964827
## nonwhites_per1000 0.04631 0.74938 0.062 0.951255
## unemploy_m24 -5.55878 4.88981 -1.137 0.267324
## unemploy_m39 18.86544 10.56948 1.785 0.087475 .
## gdp 0.27470 1.27517 0.215 0.831334
## inequality 4.26827 2.93323 1.455 0.159143
## prob_prison -3158.94404 2494.40612 -1.266 0.218042
## time_prison 2.98775 8.02661 0.372 0.713129
## police_exp 4.63402 1.47146 3.149 0.004490 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 224.5 on 23 degrees of freedom
## Multiple R-squared: 0.8056, Adjusted R-squared: 0.6873
## F-statistic: 6.809 on 14 and 23 DF, p-value: 0.00003092
The summary of crime_lm model shows a lot of information. But for
now, we may be better focus on the Pr(>|t|). This column shows the
signifance level of the variable toward the model. If the value is below
0.05, than we can safely asume that the variable has significant effect
toward the model (meaning that the estimated coefficient are no
different than 0), and vice versa. Thus, we can made a simpler model by
removing variables that has p-value > 0.05, since they don’t have
significant effect toward our model. The estimate value shows the
coefficient of each variable. To interpret the value of each
coefficient, for example with every increased value of 1 unit-point in
prob_prison will contribute to 3158.9 decrease in the
crime_rate.\
Next, the selection of predictor variables will be attempted automatically using step-wise regression with the backward elimination method.
model_step <- step(model_all,
direction = "backward",
trace = F)
summary(model_step)##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f +
## unemploy_m24 + unemploy_m39 + inequality + prob_prison +
## police_exp, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -465.44 -133.71 22.53 138.72 489.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7029.9512 1481.3725 -4.746 0.0000514 ***
## percent_m 9.7056 3.7240 2.606 0.0143 *
## mean_education 17.6753 6.1018 2.897 0.0071 **
## m_per1000f 3.0776 1.5609 1.972 0.0583 .
## unemploy_m24 -7.8050 3.8300 -2.038 0.0508 .
## unemploy_m39 22.8809 8.8535 2.584 0.0151 *
## inequality 5.0624 1.6340 3.098 0.0043 **
## prob_prison -3010.8064 1646.9002 -1.828 0.0778 .
## police_exp 4.9368 0.8862 5.571 0.0000052 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 206 on 29 degrees of freedom
## Multiple R-squared: 0.7937, Adjusted R-squared: 0.7368
## F-statistic: 13.94 on 8 and 29 DF, p-value: 0.00000004513
We have removed the non-significant variables. Too see if this action
affect our model, we can check the Adjusted R-Squared value from our two
previous models. The first model with complete variables has adjusted
R-squared of 0.6873, meaning that the model can explain 68.73% of
variance in the target variable (crime_rate). Meanwhile,
our simpler model has adjusted R-squared of 73.68%, big difference with
our first model. This shows that it is safe to remove variables that has
no significant coefficient values.
The performance of our model (how well our model predict the target variable) can be calculated using root mean squared error:
\(RMSE = \sqrt{\frac{\sum_i^{n}(Predicted_i-Actual_i)^2}{n}}\)
RMSE is better than MAE or mean absolute error, because RMSE squared the difference between the actual values and the predicted values, meaning that prediction with higher error will be penalized greatly. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE () functions from caret package. Below is the first model (with complete variables) performance.
lm_pred <- predict(model_all, newdata = data_test %>% dplyr::select(-crime_rate))
# RMSE of train dataset
RMSE(pred = model_all$fitted.values, obs = data_train$crime_rate)## [1] 174.6438
# RMSE of test dataset
RMSE(pred = lm_pred, obs = data_test$crime_rate)## [1] 218.6472
Below is the second model (result of using step-wise regression) performance.
lm_pred2 <- predict(model_step, newdata = data_test %>% dplyr::select(-crime_rate))
# RMSE of train dataset
RMSE(pred = model_step$fitted.values, obs = data_train$crime_rate)## [1] 179.9273
# RMSE of test dataset
RMSE(pred = lm_pred2, obs = data_test$crime_rate)## [1] 188.1339
The second model is better than the first one in predicting the testing dataset, even only by a small margin. On both models, the error of the training dataset is lower than the test dataset, suggesting that our model may be overfit.
Linear regression is a parametric model, meaning that in order to create a model equation, the model follows some classical assumption. Linear regression that doesn’t follow the assumption may be misleading, or just simply has biased estimator. For this section, we only check the second model (the model with removed variables).
Residual plots are a useful graphical tool for identifying non-linearity. If there is a pattern in the residual plot, it means that the model can be further improved upon or that it does not meet the linearity assumption. The plot shows the relationship between the residuals/errors with the predicted/fitted values.
resact <- data.frame(residual = model_step$residuals, fitted = model_step$fitted.values)
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) +
theme(panel.grid = element_blank(), panel.background = element_blank())## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There is a pattern in the data, with the residuals has become more negative as the fitted values increase ,decreased and increased again. The pattern indicate that our model may be linear enough.
shapiro.test(model_step$residuals)##
## Shapiro-Wilk normality test
##
## data: model_step$residuals
## W = 0.97685, p-value = 0.606
the null hypothesis is that the residuals follow normal distribution. With p-value > 0.05, we can conclude that our hypothesis is accepted, and our residuals are following the normal distribution.
Autocorrelation can be detected using the durbin watson test, with null hypothesis that there is no autocorrelation.
durbinWatsonTest(model_step)## lag Autocorrelation D-W Statistic p-value
## 1 0.1906526 1.552397 0.16
## Alternative hypothesis: rho != 0
The result shows that the null hypothesis is accepted, meaning that our residual has autocorrelation in it.
bptest(model_step)##
## studentized Breusch-Pagan test
##
## data: model_step
## BP = 14.006, df = 8, p-value = 0.08162
resact %>% ggplot(aes(fitted, residual)) + geom_point() + theme_light() + geom_hline(aes(yintercept = 0))We can observe that on lower fitted values, the residuals are concentrated around the value of 0. As the fitted value increases, the residuals are also got bigger. Second way to detect heterocesdasticity is using the Breusch-Pagan test, with null hypothesis is there is no heterocesdasticity. With p-value < 0.05, we can conclude that heterocesdasticity is not present in our model.
Variables that are useful to describe the variances in
crime_rate are
percent_m,mean_education,m_per1000f,unemploy_m24,unemploy_m39,inequality,prob_prison,police_exp.
Our final model has satisfied the classical assumptions. The Adjsted
R-squared of the model is high, with 73.68% of the variables can explain
the variances in the crime_rate. The accuracy of the model
in predicting the car price is measured with RMSE, with training data
has RMSE of 179.9273 and testing data has RMSE of 188.1339, suggesting
that our model may overfit the traning dataset.
We have already learn how to build a linear regression model and what need to be concerned when building the model.