1 Explanation

1.1 Brief

We will learn to use a linear regression model on a crime dataset. We want to understand the relationships among the variables, especially between the crime rate and the other variables. We also want to predict future crime rates based on historical data.

1.2 Data’s Point of View

In the following code we take a peek at a dataset used by criminologists to study the effect of punishment regimes on crime rates. We’ll read the dataset and rename the columns.

2 Data Preparation

Load the required packages.

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(data.table)
library(GGally) 
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(car)
## Loading required package: carData
library(scales)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(leaps)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:plotly':
## 
##     select
options(scipen = 100, max.print = 1e+06)

Load the dataset.

crime <- read.csv("data_input/crime.csv") %>% 
  dplyr::select(-X)
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")

crime$police_exp <- crime$police_exp59 + crime$police_exp60
crime_s <- subset(crime, select=-c(police_exp59, police_exp60))
glimpse(crime_s)
## Rows: 47
## Columns: 15
## $ percent_m            <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
## $ is_south             <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
## $ mean_education       <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
## $ m_per1000f           <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
## $ state_pop            <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
## $ nonwhites_per1000    <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
## $ unemploy_m24         <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
## $ unemploy_m39         <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
## $ gdp                  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
## $ inequality           <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
## $ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
## $ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
## $ crime_rate           <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …
## $ police_exp           <int> 114, 198, 89, 290, 210, 233, 161, 224, 127, 139, …

The data has 47 rows and 15 columns. The first column of the raw file is a unique identifier for each observation, so we removed it. The fields police_exp59 and police_exp60 were merged into a single police_exp column. Our target variable is crime_rate, which signifies the crime rate; the remaining variables will be used as predictors.

The dataset was collected in 1960 and a full description was not conveniently available, so I use the description gathered from the authors of the MASS package. After renaming the dataset (in your coursebook, around line 410 to line 420), the variables are:
- percent_m: percentage of males aged 14-24
- is_south: whether it is in a Southern state (1 for Yes, 0 for No)
- mean_education: mean years of schooling
- police_exp60: police expenditure in 1960
- police_exp59: police expenditure in 1959
- labour_participation: labour force participation rate
- m_per1000f: number of males per 1000 females
- state_pop: state population
- nonwhites_per1000: number of non-white residents per 1000 people
- unemploy_m24: unemployment rate of urban males aged 14-24
- unemploy_m39: unemployment rate of urban males aged 35-39
- gdp: gross domestic product per head
- inequality: income inequality
- prob_prison: probability of imprisonment
- time_prison: average time served in prison
- crime_rate: crime rate in an unspecified category

3 Exploratory Data Analysis

Exploratory data analysis is a phase where we explore the data variables and see whether there are any patterns that indicate correlation between variables.

Find the Pearson correlation between features.

ggcorr(crime, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

The graphic shows that a few of the variables have a strong correlation with the crime_rate variable.
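
For a numeric view of the same information, here is a minimal sketch that ranks each column of crime_s (the data frame with the merged police_exp column) by its Pearson correlation with crime_rate:

# Pearson correlation of every column with crime_rate, sorted from strongest positive
sort(cor(crime_s)[, "crime_rate"], decreasing = TRUE)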

4 Modeling

4.1 Holdout: Train-Test Split

Before we make the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparison, to see whether the model overfits and cannot predict new data it hasn't seen during the training phase. We will use 80% of the data as training data and the rest as testing data.

set.seed(123)
samplesize <- round(0.8 * nrow(crime_s), 0)
index <- sample(seq_len(nrow(crime_s)), size = samplesize)

data_train <- crime_s[index, ]
data_test <- crime_s[-index, ]

4.2 Linear Regression

Now we will try to fit a linear regression model using crime_rate as the target variable.

set.seed(123)
model_all <- lm(crime_rate ~ ., data = data_train)

summary(model_all)
## 
## Call:
## lm(formula = crime_rate ~ ., data = data_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -409.73 -104.29   -0.07  113.64  488.86 
## 
## Coefficients:
##                         Estimate  Std. Error t value Pr(>|t|)    
## (Intercept)          -7323.86557  1940.81527  -3.774 0.000985 ***
## percent_m                8.38471     4.61466   1.817 0.082276 .  
## is_south               166.08123   202.14547   0.822 0.419744    
## mean_education          17.89829     7.24483   2.470 0.021336 *  
## labour_participation     0.43050     1.64230   0.262 0.795552    
## m_per1000f               3.13984     2.38274   1.318 0.200566    
## state_pop               -0.07351     1.64892  -0.045 0.964827    
## nonwhites_per1000        0.04631     0.74938   0.062 0.951255    
## unemploy_m24            -5.55878     4.88981  -1.137 0.267324    
## unemploy_m39            18.86544    10.56948   1.785 0.087475 .  
## gdp                      0.27470     1.27517   0.215 0.831334    
## inequality               4.26827     2.93323   1.455 0.159143    
## prob_prison          -3158.94404  2494.40612  -1.266 0.218042    
## time_prison              2.98775     8.02661   0.372 0.713129    
## police_exp               4.63402     1.47146   3.149 0.004490 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 224.5 on 23 degrees of freedom
## Multiple R-squared:  0.8056, Adjusted R-squared:  0.6873 
## F-statistic: 6.809 on 14 and 23 DF,  p-value: 0.00003092

The summary of model_all shows a lot of information. For now, we focus on the Pr(>|t|) column, which shows the significance level of each variable in the model. If the value is below 0.05, we can safely assume that the variable has a significant effect on the model (meaning that the estimated coefficient is significantly different from 0), and vice versa. Thus, we can build a simpler model by removing variables with p-value > 0.05, since they do not have a significant effect on our model. The Estimate column shows the coefficient of each variable. To interpret a coefficient: for example, every 1 unit-point increase in prob_prison is associated with a decrease of about 3158.9 in crime_rate, holding the other variables constant.
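
The same filtering can be done programmatically. A minimal sketch (coefs is just a temporary name for this illustration) that pulls the coefficient table out of summary(model_all) and keeps only the rows with p-value below 0.05:

# Coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(model_all)$coefficients
# Keep only the rows whose p-value is below 0.05
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]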

Next, the selection of predictor variables will be attempted automatically using step-wise regression with the backward elimination method.

model_step <- step(model_all,
                   direction = "backward",
                   trace = F)
summary(model_step)
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f + 
##     unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
##     police_exp, data = data_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -465.44 -133.71   22.53  138.72  489.94 
## 
## Coefficients:
##                  Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)    -7029.9512  1481.3725  -4.746 0.0000514 ***
## percent_m          9.7056     3.7240   2.606    0.0143 *  
## mean_education    17.6753     6.1018   2.897    0.0071 ** 
## m_per1000f         3.0776     1.5609   1.972    0.0583 .  
## unemploy_m24      -7.8050     3.8300  -2.038    0.0508 .  
## unemploy_m39      22.8809     8.8535   2.584    0.0151 *  
## inequality         5.0624     1.6340   3.098    0.0043 ** 
## prob_prison    -3010.8064  1646.9002  -1.828    0.0778 .  
## police_exp         4.9368     0.8862   5.571 0.0000052 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 206 on 29 degrees of freedom
## Multiple R-squared:  0.7937, Adjusted R-squared:  0.7368 
## F-statistic: 13.94 on 8 and 29 DF,  p-value: 0.00000004513

We have removed the non-significant variables. To see whether this affects our model, we can compare the adjusted R-squared values of the two models. The first model, with the complete set of variables, has an adjusted R-squared of 0.6873, meaning it can explain 68.73% of the variance in the target variable (crime_rate). Meanwhile, our simpler model has an adjusted R-squared of 0.7368, an improvement over the first model. This shows that it is safe to remove variables whose coefficients are not significant.
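
If we only want this comparison without scanning the full summaries, a minimal sketch that extracts the adjusted R-squared directly from each fitted model object:

# Adjusted R-squared of the full model and the stepwise model
summary(model_all)$adj.r.squared   # ~0.6873, as reported above
summary(model_step)$adj.r.squared  # ~0.7368, as reported above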

5 Evaluation

5.1 Model Performance

The performance of our model (how well it predicts the target variable) can be measured using the root mean squared error (RMSE):

\(RMSE = \sqrt{\frac{\sum_{i=1}^{n}(Predicted_i-Actual_i)^2}{n}}\)

RMSE is often preferred over MAE (mean absolute error) because it squares the difference between the actual and predicted values, so predictions with larger errors are penalized more heavily. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the caret package. Below is the performance of the first model (with the complete set of variables).

lm_pred <- predict(model_all, newdata = data_test %>% dplyr::select(-crime_rate))

# RMSE of train dataset
RMSE(pred = model_all$fitted.values, obs = data_train$crime_rate)
## [1] 174.6438
# RMSE of test dataset
RMSE(pred = lm_pred, obs = data_test$crime_rate)
## [1] 218.6472
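
As a sanity check, the test-set value above can be reproduced directly from the RMSE formula; a minimal sketch of the manual computation:

# Manual RMSE, which should match RMSE(lm_pred, data_test$crime_rate) above
sqrt(mean((lm_pred - data_test$crime_rate)^2))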

Below is the second model (result of using step-wise regression) performance.

lm_pred2 <- predict(model_step, newdata = data_test %>% dplyr::select(-crime_rate))

# RMSE of train dataset
RMSE(pred = model_step$fitted.values, obs = data_train$crime_rate)
## [1] 179.9273
# RMSE of test dataset
RMSE(pred = lm_pred2, obs = data_test$crime_rate)
## [1] 188.1339

The second model is better than the first at predicting the testing dataset, although only by a small margin. For both models, the error on the training dataset is lower than on the test dataset, suggesting that our models may be slightly overfit.
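
To keep the four numbers side by side, a minimal sketch that collects the RMSE values above into a single comparison table, reusing the objects already created:

# RMSE comparison for both models on train and test data
data.frame(
  model      = c("model_all", "model_step"),
  rmse_train = c(RMSE(model_all$fitted.values, data_train$crime_rate),
                 RMSE(model_step$fitted.values, data_train$crime_rate)),
  rmse_test  = c(RMSE(lm_pred, data_test$crime_rate),
                 RMSE(lm_pred2, data_test$crime_rate))
)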

5.2 Assumptions

Linear regression is a parametric model, meaning that in order for the model equation to be valid, it has to satisfy several classical assumptions. A linear regression that does not satisfy these assumptions may be misleading or simply have biased estimators. In this section, we only check the second model (the model with the removed variables).

  1. Linearity

The linear regression model assumes that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.

Residual plots are a useful graphical tool for identifying non-linearity. If there is a pattern in the residual plot, it means that the model can be further improved upon or that it does not meet the linearity assumption. The plot shows the relationship between the residuals/errors and the predicted/fitted values.

resact <- data.frame(residual = model_step$residuals, fitted = model_step$fitted.values)

resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) + 
    theme(panel.grid = element_blank(), panel.background = element_blank())
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There is a mild pattern in the data: the residuals become more negative as the fitted values increase, then decrease and increase again. Since the residuals still scatter around zero without a strong systematic trend, the plot indicates that our model may be linear enough.

  2. Normality Test

The second assumption in linear regression is that the residuals follow a normal distribution. We can easily check this using the Shapiro-Wilk normality test.

shapiro.test(model_step$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_step$residuals
## W = 0.97685, p-value = 0.606

The null hypothesis is that the residuals follow a normal distribution. With p-value > 0.05, we fail to reject the null hypothesis, so we can conclude that our residuals follow a normal distribution.
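
As a complementary visual check (not part of the original analysis), a minimal sketch of a normal Q-Q plot of the residuals; the points should lie roughly on a straight line if the residuals are normally distributed:

# Q-Q plot of the stepwise model residuals against a theoretical normal distribution
qqnorm(model_step$residuals)
qqline(model_step$residuals)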

  3. Autocorrelation

The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms (no autocorrelation). If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be. For example, a 95% confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter. In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant. In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model.

Autocorrelation can be detected using the Durbin-Watson test, with the null hypothesis that there is no autocorrelation.

durbinWatsonTest(model_step)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1906526      1.552397    0.16
##  Alternative hypothesis: rho != 0

With p-value > 0.05, we fail to reject the null hypothesis, meaning that there is no autocorrelation in our residuals.

  4. Heteroscedasticity

Heteroscedasticity means that the variances of the error terms are non-constant. One can identify non-constant variances in the errors from the presence of a funnel shape in the residual plot, the same plot used for the linearity check.

bptest(model_step)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_step
## BP = 14.006, df = 8, p-value = 0.08162
resact %>% ggplot(aes(fitted, residual)) + geom_point() + theme_light() + geom_hline(aes(yintercept = 0))

We can observe that at lower fitted values the residuals are concentrated around 0, and as the fitted values increase the residuals also get larger. A second way to detect heteroscedasticity is the Breusch-Pagan test, with the null hypothesis that heteroscedasticity is not present. With p-value > 0.05, we fail to reject the null hypothesis and conclude that heteroscedasticity is not present in our model.

6 Conclusion

The variables that are useful for explaining the variance in crime_rate are percent_m, mean_education, m_per1000f, unemploy_m24, unemploy_m39, inequality, prob_prison, and police_exp. Our final model satisfies the classical assumptions. The adjusted R-squared of the model is fairly high: the model explains 73.68% of the variance in crime_rate. The accuracy of the model in predicting the crime rate is measured with RMSE: the training data has an RMSE of 179.9273 and the testing data an RMSE of 188.1339, suggesting that our model may slightly overfit the training dataset.

We have learned how to build a linear regression model and what needs to be considered when building one.