Intro

What we’ll do

We will learn to use a linear regression model on a crime dataset. We want to understand the relationships among the variables, especially between prob_prison and the other variables, and to predict prob_prison for new observations based on the historical data.

Data Preparation

crime <- fread("data-input/crime.csv")

rmarkdown::paged_table(crime)
crime <- crime %>% 
  select(-V1)

names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", 
                  "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", 
                  "unemploy_m24", "unemploy_m39","gdp", "inequality", "prob_prison", 
                  "time_prison", "crime_rate")

glimpse(crime)
#> Rows: 47
#> Columns: 16
#> $ percent_m            <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
#> $ is_south             <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
#> $ mean_education       <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
#> $ police_exp60         <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
#> $ police_exp59         <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
#> $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
#> $ m_per1000f           <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
#> $ state_pop            <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
#> $ nonwhites_per1000    <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
#> $ unemploy_m24         <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
#> $ unemploy_m39         <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
#> $ gdp                  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
#> $ inequality           <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
#> $ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
#> $ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
#> $ crime_rate           <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …

Data description of crime dataset:

  • percent_m: percentage of males aged 14-24
  • is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
  • mean_education: mean years of schooling
  • police_exp60: police expenditure in 1960
  • police_exp59: police expenditure in 1959
  • labour_participation: labour force participation rate
  • m_per1000f: number of males per 1000 females
  • state_pop: state population
  • nonwhites_per1000: number of non-whites resident per 1000 people
  • unemploy_m24: unemployment rate of urban males aged 14-24
  • unemploy_m39: unemployment rate of urban males aged 35-39
  • gdp: gross domestic product per head
  • inequality: income inequality
  • prob_prison: probability of imprisonment
  • time_prison: avg time served in prisons
  • crime_rate: crime rate in an unspecified category

The data has 47 rows and 16 columns. Our target variable is prob_prison, which signifies the probability of imprisonment. We will use the other variables as predictors.

Before we go further, we first need to make sure that our data is clean and usable. If you look closely, there are some data types that we need to change.

sapply(crime, function(x) length(unique(x)))
#>            percent_m             is_south       mean_education 
#>                   31                    2                   24 
#>         police_exp60         police_exp59 labour_participation 
#>                   38                   39                   40 
#>           m_per1000f            state_pop    nonwhites_per1000 
#>                   36                   35                   44 
#>         unemploy_m24         unemploy_m39                  gdp 
#>                   35                   26                   46 
#>           inequality          prob_prison          time_prison 
#>                   42                   47                   45 
#>           crime_rate 
#>                   45

We can convert columns that have only 2 unique values, such as is_south, into factors because they are yes/no indicators.

crime <- crime %>%
  mutate(is_south = as.factor(is_south))
colSums(is.na(crime))
#>            percent_m             is_south       mean_education 
#>                    0                    0                    0 
#>         police_exp60         police_exp59 labour_participation 
#>                    0                    0                    0 
#>           m_per1000f            state_pop    nonwhites_per1000 
#>                    0                    0                    0 
#>         unemploy_m24         unemploy_m39                  gdp 
#>                    0                    0                    0 
#>           inequality          prob_prison          time_prison 
#>                    0                    0                    0 
#>           crime_rate 
#>                    0

Exploratory Data Analysis

ggcorr(crime,
       label = T,
       label_size = 2,
       hjust = 1,
       layout.exp = 2)

The graphic shows that only one variable, gdp, has a strong correlation with the target prob_prison.
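
To back this interpretation up with numbers, we can also pull the correlations of the numeric columns with prob_prison directly. This is a small convenience sketch; where() assumes dplyr >= 1.0.0 (older versions can use select_if(is.numeric) instead).

# correlation of every numeric column with the target, sorted for readability
cor_mat <- crime %>% 
  select(where(is.numeric)) %>% 
  cor()

sort(cor_mat[, "prob_prison"])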

Modelling

Holdout: Train-Test Split

set.seed(123)
samplesize <- round(0.7 * nrow(crime), 0)
index <- sample(seq_len(nrow(crime)), size = samplesize)

data_train <- crime[index, ]
data_test <- crime[-index,]

Linear Regression

In this section, we will fit linear regression models using prob_prison as the target variable.

model_all <- lm(formula = prob_prison ~ .,
                data = data_train)

model_none <- lm(formula = prob_prison ~ 1,
                 data = data_train)

model_forward <- step(object = model_none,
                      direction = "forward",
                      scope = list(upper = model_all),
                      trace = F)
model_backward <- step(object = model_all,
                      direction = "backward",
                      trace = F)
model_both <- step(object = model_none,
                   direction = "both",
                   scope = list(upper = model_all),
                   trace = F)
summary(model_all)
#> 
#> Call:
#> lm(formula = prob_prison ~ ., data = data_train)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.015707 -0.005356 -0.002696  0.007130  0.017258 
#> 
#> Coefficients:
#>                         Estimate  Std. Error t value Pr(>|t|)   
#> (Intercept)           0.30687009  0.19909130   1.541  0.14164   
#> percent_m            -0.00015612  0.00031768  -0.491  0.62940   
#> is_south1            -0.00733254  0.01403021  -0.523  0.60798   
#> mean_education        0.00016556  0.00063179   0.262  0.79643   
#> police_exp60          0.00048031  0.00080376   0.598  0.55800   
#> police_exp59         -0.00069558  0.00087091  -0.799  0.43549   
#> labour_participation  0.00003204  0.00015486   0.207  0.83854   
#> m_per1000f           -0.00021156  0.00021518  -0.983  0.33932   
#> state_pop            -0.00004389  0.00011496  -0.382  0.70733   
#> nonwhites_per1000     0.00004011  0.00006579   0.610  0.55014   
#> unemploy_m24          0.00001450  0.00037347   0.039  0.96949   
#> unemploy_m39          0.00040691  0.00065474   0.621  0.54252   
#> gdp                  -0.00007475  0.00007849  -0.952  0.35424   
#> inequality            0.00007529  0.00019135   0.393  0.69885   
#> time_prison          -0.00174643  0.00047828  -3.651  0.00198 **
#> crime_rate            0.00000377  0.00001533   0.246  0.80869   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.01266 on 17 degrees of freedom
#> Multiple R-squared:  0.7891, Adjusted R-squared:  0.603 
#> F-statistic:  4.24 on 15 and 17 DF,  p-value: 0.002737
summary(model_forward)
#> 
#> Call:
#> lm(formula = prob_prison ~ gdp + time_prison, data = data_train)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.018068 -0.006369 -0.002147  0.009097  0.020277 
#> 
#> Coefficients:
#>                Estimate  Std. Error t value          Pr(>|t|)    
#> (Intercept)  0.14736233  0.01199762  12.283 0.000000000000311 ***
#> gdp         -0.00012690  0.00002146  -5.912 0.000001783259074 ***
#> time_prison -0.00136747  0.00025706  -5.320 0.000009444018132 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.01073 on 30 degrees of freedom
#> Multiple R-squared:  0.7326, Adjusted R-squared:  0.7147 
#> F-statistic: 41.09 on 2 and 30 DF,  p-value: 0.00000000256
summary(model_backward)
#> 
#> Call:
#> lm(formula = prob_prison ~ gdp + time_prison, data = data_train)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.018068 -0.006369 -0.002147  0.009097  0.020277 
#> 
#> Coefficients:
#>                Estimate  Std. Error t value          Pr(>|t|)    
#> (Intercept)  0.14736233  0.01199762  12.283 0.000000000000311 ***
#> gdp         -0.00012690  0.00002146  -5.912 0.000001783259074 ***
#> time_prison -0.00136747  0.00025706  -5.320 0.000009444018132 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.01073 on 30 degrees of freedom
#> Multiple R-squared:  0.7326, Adjusted R-squared:  0.7147 
#> F-statistic: 41.09 on 2 and 30 DF,  p-value: 0.00000000256
summary(model_both)
#> 
#> Call:
#> lm(formula = prob_prison ~ gdp + time_prison, data = data_train)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.018068 -0.006369 -0.002147  0.009097  0.020277 
#> 
#> Coefficients:
#>                Estimate  Std. Error t value          Pr(>|t|)    
#> (Intercept)  0.14736233  0.01199762  12.283 0.000000000000311 ***
#> gdp         -0.00012690  0.00002146  -5.912 0.000001783259074 ***
#> time_prison -0.00136747  0.00025706  -5.320 0.000009444018132 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.01073 on 30 degrees of freedom
#> Multiple R-squared:  0.7326, Adjusted R-squared:  0.7147 
#> F-statistic: 41.09 on 2 and 30 DF,  p-value: 0.00000000256

In the linear regression model using all variables, only time_prison is a significant predictor of prob_prison (p-value = 0.00198). This may be due to the small dataset size of only 47 rows; for better prediction and model training, a larger dataset would be needed.

However, stepwise feature selection gives us 2 variables that are significantly related to prob_prison in the training dataset: gdp and time_prison.

crime2 <- crime %>% 
  select(gdp, time_prison, prob_prison)

data_train2 <- crime2[index, ]
data_test2 <- crime2[-index, ]

model_selection <- lm(prob_prison ~ .,
                      data = data_train2)

summary(model_selection)
#> 
#> Call:
#> lm(formula = prob_prison ~ ., data = data_train2)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.018068 -0.006369 -0.002147  0.009097  0.020277 
#> 
#> Coefficients:
#>                Estimate  Std. Error t value          Pr(>|t|)    
#> (Intercept)  0.14736233  0.01199762  12.283 0.000000000000311 ***
#> gdp         -0.00012690  0.00002146  -5.912 0.000001783259074 ***
#> time_prison -0.00136747  0.00025706  -5.320 0.000009444018132 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.01073 on 30 degrees of freedom
#> Multiple R-squared:  0.7326, Adjusted R-squared:  0.7147 
#> F-statistic: 41.09 on 2 and 30 DF,  p-value: 0.00000000256

To see how this affects our model, we can compare the adjusted R-squared values of the full model and the simpler model. The first model with the complete set of variables has an adjusted R-squared of 0.603, meaning it explains 60.3% of the variance in the target variable prob_prison. Meanwhile, our simpler model has an adjusted R-squared of 0.7147, a clear improvement over the first model. This suggests that removing variables with non-significant coefficients is worthwhile.
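
As a quick check, the adjusted R-squared values can be pulled straight from the model summaries (a small sketch using the models already fitted above):

# adjusted R-squared of the full model vs the stepwise-selected model
summary(model_all)$adj.r.squared        # ~0.603
summary(model_selection)$adj.r.squared  # ~0.7147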

Evaluation

Model Performance

The performance of our model (how well it predicts the target variable) can be measured with the root mean squared error (RMSE):

\[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (prediction_i - actual_i)^2}\]
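
For reference, the formula can also be written out by hand. The helper below is a minimal sketch (the name rmse_manual is ours, not from any package):

# RMSE computed directly from the formula above
rmse_manual <- function(pred, obs) {
  sqrt(mean((pred - obs)^2))
}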

RMSE (Root Mean Squared Error) is a commonly used metric for evaluating regression models: it measures the typical magnitude of the difference between predicted and actual values, so smaller is better. We can use the RMSE() function from the caret package. Below is the performance of the first model (with the complete set of variables).

all_pred <- predict(model_all, newdata = data_test %>% select(-prob_prison))

#RMSE of train dataset
RMSE(pred = model_all$fitted.values, obs = data_train$prob_prison)
#> [1] 0.009084194
#RMSE of test dataset
RMSE(pred = all_pred, obs = data_test$prob_prison)
#> [1] 0.02646812

Below is the second model (model with selected feature) performance.

selection_pred <- predict(model_selection, newdata = data_test2 %>% select(-prob_prison))

#RMSE of train dataset
RMSE(pred = model_selection$fitted.values, obs = data_train2$prob_prison)
#> [1] 0.01022914
#RMSE of test dataset
RMSE(pred = selection_pred, obs = data_test2$prob_prison)
#> [1] 0.02513543

The selection model is better than the first one at predicting the test dataset, even if only by a small margin. For both models, the error on the training dataset is lower than on the test dataset, suggesting that our models may be overfitting.
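
To see this side by side, the four RMSE values above can be collected into a single table; this is just a convenience sketch that reuses the objects already created:

# train vs test RMSE for both models
data.frame(
  model      = c("model_all", "model_selection"),
  rmse_train = c(RMSE(model_all$fitted.values, data_train$prob_prison),
                 RMSE(model_selection$fitted.values, data_train2$prob_prison)),
  rmse_test  = c(RMSE(all_pred, data_test$prob_prison),
                 RMSE(selection_pred, data_test2$prob_prison))
)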

Assumptions

1. Linearity

The linear regression model assumes that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.

Residual plots are a useful graphical tool for identifying non-linearity. If there is a pattern in the residual plot, it means that the model can be further improved upon or that it does not meet the linearity assumption. The plot shows the relationship between the residuals/errors with the predicted/fitted values.

plot(model_selection,
     which = 1)

The residuals are spread around the 0 line (roughly -0.01 to 0.01) without an obvious pattern, which suggests that the linearity assumption is reasonably satisfied.

2. Normality Test

The second assumption in linear regression is that the residuals follow a normal distribution. We can easily check this using the Shapiro-Wilk normality test. Shapiro-Wilk hypotheses:

  • H0: error distributed normally
  • H1: error NOT distributed normally

H0 is rejected if p-values < 0.05 (alpha)

shapiro.test(model_selection$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_selection$residuals
#> W = 0.97457, p-value = 0.6155
plot(model_selection, which = 2)

The p-value is greater than 0.05, so we fail to reject H0 and conclude that the residuals are normally distributed.

3. Heteroscedasticity

Heteroscedasticity means that the variances of the error terms are non-constant. One can identify non-constant error variance from the presence of a funnel shape in the residual plot, the same plot used for the linearity check. The Breusch-Pagan test can also be used: its null hypothesis (H0) is that the error variance is constant (homoscedasticity), and H0 is rejected if the p-value < 0.05.

plot(x = model_selection$fitted.values,
     y = model_selection$residuals)
abline(h = 0, col = "red")

bptest(model_selection)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_selection
#> BP = 8.1593, df = 2, p-value = 0.01691

Our p-value from bptest is 0.01691, which is less than 0.05. This indicates that the errors are heteroscedastic.

4. Multicollinearity

Multicollinearity means that there is correlation between the independent variables/predictors. To check for multicollinearity, we can measure the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

vif(model_selection)
#>         gdp time_prison 
#>     1.05642     1.05642
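
Both VIF values are around 1.06, far below the rule-of-thumb cutoff of 5, so multicollinearity is not a concern here. With only two predictors the two values are necessarily identical, and we can verify the number by hand from the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors (a small sketch on the training data):

# manual VIF for gdp: regress it on time_prison and invert 1 - R^2
r2_gdp <- summary(lm(gdp ~ time_prison, data = data_train2))$r.squared
1 / (1 - r2_gdp)  # should reproduce the ~1.056 reported by vif()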

Model Improvement

Tuning

We have already seen that our model does not satisfy one of the assumptions: the errors are heteroscedastic. Now we will try to fix it. To stabilize the error variance and make the relationships more linear, we can log-transform the variables, both the target and the predictors.

crime3 <- crime2 %>% mutate_if(is.numeric, log10)
head(crime3)
# split the data
set.seed(123)
data_train3 <- crime3[index, ]
data_test3 <- crime3[-index, ]

# train the model
model_transform <- lm(prob_prison ~ .,
                      data = data_train3)

Performance

summary(model_transform)
#> 
#> Call:
#> lm(formula = prob_prison ~ ., data = data_train3)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.44665 -0.04281  0.01172  0.07306  0.21374 
#> 
#> Coefficients:
#>             Estimate Std. Error t value   Pr(>|t|)    
#> (Intercept)   4.3531     0.8216   5.298 0.00001003 ***
#> gdp          -1.5584     0.3053  -5.105 0.00001734 ***
#> time_prison  -1.0867     0.1925  -5.645 0.00000377 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.1339 on 30 degrees of freedom
#> Multiple R-squared:  0.7052, Adjusted R-squared:  0.6856 
#> F-statistic: 35.89 on 2 and 30 DF,  p-value: 0.00000001102
trans_pred <- predict(model_transform, newdata = data_test3 %>% select(-prob_prison))

#RMSE of train dataset
RMSE(pred = 10^(model_transform$fitted.values), obs = 10^(data_train3$prob_prison))
#> [1] 0.01110531
#RMSE of test dataset
RMSE(pred = 10^(trans_pred), obs = 10^(data_test3$prob_prison))
#> [1] 0.0254162

Our model has an adjusted R-squared of 68.56%. The value drops a bit, but in return we get a model that satisfies the assumptions. Unfortunately, our model still seems to overfit, with an RMSE of 0.01110531 on the training dataset versus 0.0254162 on the test dataset.

Assumptions

1. Linearity

plot(model_transform,
     which = 1)

plot(model_transform,
     which = 2)

There is little to no discernible pattern in our residual plot. After removing the outlier at index 33, we can conclude that the model satisfies the linearity assumption.

2. Normality Test

#removing outlier in data index 33
model_transform1 <- model_transform
model_transform1$residuals <- model_transform1$residuals[-33]

shapiro.test(model_transform1$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_transform1$residuals
#> W = 0.9782, p-value = 0.7458

3. Heteroscedasticity

bptest(model_transform1)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_transform1
#> BP = 3.6562, df = 2, p-value = 0.1607

With a p-value > 0.05, we can conclude that heteroscedasticity is not present.

4. Multicollinearity

vif(model_transform1)
#>         gdp time_prison 
#>    1.038923    1.038923

Conclusion

The variables that are useful for explaining the variance in the probability of imprisonment (prob_prison) are gdp and time_prison. Our final model satisfies the classical assumptions. The adjusted R-squared of the model is fairly high: the predictors explain 68.56% of the variance in prob_prison. The accuracy of the model in predicting prob_prison is measured with RMSE: the training data has an RMSE of 0.01110531 and the test data 0.0254162, suggesting that our model may be overfitting the training dataset.

We have learned how to build a linear regression model and what to consider when building one.