We will learn to use a linear regression model on a crime dataset. We want to understand the relationships among the variables, especially between prob_prison and the other variables, and to predict prob_prison for new observations based on the historical data.
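The chunks below assume the following packages are loaded (a sketch of the setup; the original environment may include others):

# Packages assumed by the code in this section
library(data.table) # fread()
library(dplyr)      # %>%, select(), mutate(), glimpse()
library(GGally)     # ggcorr()
library(caret)      # RMSE()
library(lmtest)     # bptest()
library(car)        # vif()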
crime <- fread("data-input/crime.csv")
rmarkdown::paged_table(crime)

crime <- crime %>%
  select(-V1)
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59",
"labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000",
"unemploy_m24", "unemploy_m39","gdp", "inequality", "prob_prison",
"time_prison", "crime_rate")
glimpse(crime)
#> Rows: 47
#> Columns: 16
#> $ percent_m <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
#> $ is_south <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
#> $ mean_education <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
#> $ police_exp60 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
#> $ police_exp59 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
#> $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
#> $ m_per1000f <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
#> $ state_pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
#> $ nonwhites_per1000 <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
#> $ unemploy_m24 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
#> $ unemploy_m39 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
#> $ gdp <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
#> $ inequality <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
#> $ prob_prison <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
#> $ time_prison <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
#> $ crime_rate <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …
Data description of the crime dataset:

- percent_m: percentage of males aged 14-24
- is_south: whether it is in a Southern state (1 for Yes, 0 for No)
- mean_education: mean years of schooling
- police_exp60: police expenditure in 1960
- police_exp59: police expenditure in 1959
- labour_participation: labour force participation rate
- m_per1000f: number of males per 1000 females
- state_pop: state population
- nonwhites_per1000: number of non-white residents per 1000 people
- unemploy_m24: unemployment rate of urban males aged 14-24
- unemploy_m39: unemployment rate of urban males aged 35-39
- gdp: gross domestic product per head
- inequality: income inequality
- prob_prison: probability of imprisonment
- time_prison: average time served in prison
- crime_rate: crime rate in an unspecified category

The data has 47 rows and 16 columns. Our target variable is prob_prison, which signifies the probability of imprisonment. We will use the other variables as predictors.
Before we go further, we first need to make sure that our data is clean and useful. If you look closely, there are some data types that we need to change.
sapply(crime, function(x) length(unique(x)))
#> percent_m is_south mean_education
#> 31 2 24
#> police_exp60 police_exp59 labour_participation
#> 38 39 40
#> m_per1000f state_pop nonwhites_per1000
#> 36 35 44
#> unemploy_m24 unemploy_m39 gdp
#> 35 26 46
#> inequality prob_prison time_prison
#> 42 47 45
#> crime_rate
#> 45
We can change columns that have only 2 unique values into the factor type, because they are yes/no indicators; here, is_south is the only such column.
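As an aside, the same idea can be written generically with dplyr's across(); a sketch of an alternative to the explicit conversion in the next chunk:

# Sketch: convert every column with exactly two unique values into a factor
crime <- crime %>%
  mutate(across(where(~ length(unique(.x)) == 2), as.factor))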
crime <- crime %>%
mutate(is_south = as.factor(is_south))

We also check that the data contains no missing values:

colSums(is.na(crime))
#> percent_m is_south mean_education
#> 0 0 0
#> police_exp60 police_exp59 labour_participation
#> 0 0 0
#> m_per1000f state_pop nonwhites_per1000
#> 0 0 0
#> unemploy_m24 unemploy_m39 gdp
#> 0 0 0
#> inequality prob_prison time_prison
#> 0 0 0
#> crime_rate
#> 0
Next, we inspect the correlations between the variables:

ggcorr(crime,
label = T,
label_size = 2,
hjust = 1,
layout.exp = 2)

The graphic shows that only one variable has a notable correlation with the target: gdp.
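To read the exact numbers behind the plot, we can also compute the correlations directly; a quick check using base R's cor() on the numeric columns:

# correlation of each numeric variable with the target, sorted ascending
num_cols <- crime %>% select(where(is.numeric))
sort(cor(num_cols)[, "prob_prison"])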
We split the data into training (70%) and test (30%) sets:

set.seed(123)
samplesize <- round(0.7 * nrow(crime), 0)
index <- sample(seq_len(nrow(crime)), size = samplesize)
data_train <- crime[index, ]
data_test <- crime[-index, ]

For this section, we will try to model a linear regression using prob_prison as the target variable.
model_all <- lm(formula = prob_prison ~ .,
data = data_train)
model_none <- lm(formula = prob_prison ~ 1,
data = data_train)
model_forward <- step(object = model_none,
direction = "forward",
scope = list(upper = model_all),
trace = F)
model_backward <- step(object = model_all,
direction = "backward",
trace = F)
model_both <- step(object = model_none,
direction = "both",
scope = list(upper = model_all),
trace = F)
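Before inspecting the summaries one by one, we can compare the candidate models directly by AIC, the criterion that step() minimizes (lower is better):

# compare the stepwise candidates by AIC
AIC(model_all, model_forward, model_backward, model_both)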
summary(model_all)
#>
#> Call:
#> lm(formula = prob_prison ~ ., data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.015707 -0.005356 -0.002696 0.007130 0.017258
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.30687009 0.19909130 1.541 0.14164
#> percent_m -0.00015612 0.00031768 -0.491 0.62940
#> is_south1 -0.00733254 0.01403021 -0.523 0.60798
#> mean_education 0.00016556 0.00063179 0.262 0.79643
#> police_exp60 0.00048031 0.00080376 0.598 0.55800
#> police_exp59 -0.00069558 0.00087091 -0.799 0.43549
#> labour_participation 0.00003204 0.00015486 0.207 0.83854
#> m_per1000f -0.00021156 0.00021518 -0.983 0.33932
#> state_pop -0.00004389 0.00011496 -0.382 0.70733
#> nonwhites_per1000 0.00004011 0.00006579 0.610 0.55014
#> unemploy_m24 0.00001450 0.00037347 0.039 0.96949
#> unemploy_m39 0.00040691 0.00065474 0.621 0.54252
#> gdp -0.00007475 0.00007849 -0.952 0.35424
#> inequality 0.00007529 0.00019135 0.393 0.69885
#> time_prison -0.00174643 0.00047828 -3.651 0.00198 **
#> crime_rate 0.00000377 0.00001533 0.246 0.80869
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.01266 on 17 degrees of freedom
#> Multiple R-squared: 0.7891, Adjusted R-squared: 0.603
#> F-statistic: 4.24 on 15 and 17 DF, p-value: 0.002737
summary(model_forward)
#>
#> Call:
#> lm(formula = prob_prison ~ gdp + time_prison, data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.018068 -0.006369 -0.002147 0.009097 0.020277
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.14736233 0.01199762 12.283 0.000000000000311 ***
#> gdp -0.00012690 0.00002146 -5.912 0.000001783259074 ***
#> time_prison -0.00136747 0.00025706 -5.320 0.000009444018132 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.01073 on 30 degrees of freedom
#> Multiple R-squared: 0.7326, Adjusted R-squared: 0.7147
#> F-statistic: 41.09 on 2 and 30 DF, p-value: 0.00000000256
summary(model_backward)
#>
#> Call:
#> lm(formula = prob_prison ~ gdp + time_prison, data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.018068 -0.006369 -0.002147 0.009097 0.020277
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.14736233 0.01199762 12.283 0.000000000000311 ***
#> gdp -0.00012690 0.00002146 -5.912 0.000001783259074 ***
#> time_prison -0.00136747 0.00025706 -5.320 0.000009444018132 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.01073 on 30 degrees of freedom
#> Multiple R-squared: 0.7326, Adjusted R-squared: 0.7147
#> F-statistic: 41.09 on 2 and 30 DF, p-value: 0.00000000256
summary(model_both)
#>
#> Call:
#> lm(formula = prob_prison ~ gdp + time_prison, data = data_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.018068 -0.006369 -0.002147 0.009097 0.020277
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.14736233 0.01199762 12.283 0.000000000000311 ***
#> gdp -0.00012690 0.00002146 -5.912 0.000001783259074 ***
#> time_prison -0.00136747 0.00025706 -5.320 0.000009444018132 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.01073 on 30 degrees of freedom
#> Multiple R-squared: 0.7326, Adjusted R-squared: 0.7147
#> F-statistic: 41.09 on 2 and 30 DF, p-value: 0.00000000256
From the model built on the training split, we found that in the linear regression model using all variables, only time_prison is a significant predictor of prob_prison (p-value = 0.00198). This may be due to the small dataset size of only 47 rows; to achieve better prediction and model training, a dataset of sufficient scale is needed.
However, from step-wise feature selection we got 2 variables that are significant predictors of prob_prison in the training dataset: gdp and time_prison.
crime2 <- crime %>%
select(gdp, time_prison, prob_prison)
data_train2 <- crime2[index, ]
data_test2 <- crime2[-index, ]
model_selection <- lm(prob_prison ~ .,
data = data_train2)
summary(model_selection)
#>
#> Call:
#> lm(formula = prob_prison ~ ., data = data_train2)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.018068 -0.006369 -0.002147 0.009097 0.020277
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.14736233 0.01199762 12.283 0.000000000000311 ***
#> gdp -0.00012690 0.00002146 -5.912 0.000001783259074 ***
#> time_prison -0.00136747 0.00025706 -5.320 0.000009444018132 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.01073 on 30 degrees of freedom
#> Multiple R-squared: 0.7326, Adjusted R-squared: 0.7147
#> F-statistic: 41.09 on 2 and 30 DF, p-value: 0.00000000256
To see how this selection affects our model, we can compare the adjusted R-squared values of the previous models. The first model, with the complete set of variables, has an adjusted R-squared of 0.603, meaning that it can explain 60.3% of the variance in the target variable prob_prison. Meanwhile, our simpler model has an adjusted R-squared of 0.7147, a clear improvement over the first model. This suggests that removing variables with non-significant coefficients is worthwhile.
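The values quoted above can be pulled straight from the summary objects:

# adjusted R-squared of the full and the selected model
summary(model_all)$adj.r.squared       # ~0.603
summary(model_selection)$adj.r.squared # ~0.7147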
The performance of our model (how well it predicts the target variable) can be calculated using the root mean squared error:
\[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (prediction_i - actual_i)^2}\]
We will use RMSE (Root Mean Squared Error) to assess how well our models predict the target; the model with the smallest error is preferred. RMSE is a commonly used metric for evaluating regression models: it measures the average magnitude of the differences between predicted and actual values. We can use the RMSE() function from the caret package. Below is the performance of the first model (with complete variables).
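For reference, the metric is simple enough to compute by hand; a minimal sketch that mirrors what caret's RMSE() calculates:

# manual RMSE: square the errors, average them, take the square root
rmse_manual <- function(pred, actual) {
  sqrt(mean((pred - actual)^2))
}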
all_pred <- predict(model_all, newdata = data_test %>% select(-prob_prison))
# RMSE of train dataset
RMSE(pred = model_all$fitted.values, obs = data_train$prob_prison)
#> [1] 0.009084194
# RMSE of test dataset
RMSE(pred = all_pred, obs = data_test$prob_prison)
#> [1] 0.02646812
Below is the performance of the second model (with selected features).
selection_pred <- predict(model_selection, newdata = data_test2 %>% select(-prob_prison))
# RMSE of train dataset
RMSE(pred = model_selection$fitted.values, obs = data_train2$prob_prison)
#> [1] 0.01022914
# RMSE of test dataset
RMSE(pred = selection_pred, obs = data_test2$prob_prison)
#> [1] 0.02513543
The selection model is better than the first one at predicting the test dataset, even if only by a small margin. For both models, the error on the training dataset is lower than on the test dataset, suggesting that our models may be overfitting.
The linear regression model assumes that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.
Residual plots are a useful graphical tool for identifying non-linearity. If there is a pattern in the residual plot, it means that the model can be further improved or that it does not meet the linearity assumption. The plot shows the relationship between the residuals/errors and the predicted/fitted values.
plot(model_selection,
which = 1)

The residuals are spread around the 0 line (roughly between -0.01 and 0.01) without a clear pattern, which suggests that the linearity assumption is adequately met.
The second assumption in linear regression is that the residuals follow a normal distribution. We can easily check this using the Shapiro-Wilk normality test, with the hypotheses:

- H0: the residuals are normally distributed
- H1: the residuals are not normally distributed

H0 is rejected if the p-value < 0.05 (alpha).
shapiro.test(model_selection$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_selection$residuals
#> W = 0.97457, p-value = 0.6155
plot(model_selection, which = 2)

The p-value is greater than 0.05, which means we fail to reject H0: the residuals are normally distributed.
Heteroscedasticity means that the variances of the error terms are non-constant. One can identify non-constant variance from the presence of a funnel shape in the residual plot, the same plot used to check linearity. We can also test it formally with the Breusch-Pagan test, using bptest() from the lmtest package.
plot(x = model_selection$fitted.values,
y = model_selection$residuals)
abline(h = 0, col = "red")

bptest(model_selection)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_selection
#> BP = 8.1593, df = 2, p-value = 0.01691
Our p-value from bptest() is 0.01691, which is less than 0.05. This indicates that the errors are heteroscedastic.
Multicollinearity means that there is correlation between the independent variables/predictors. To check for multicollinearity, we can measure the variance inflation factor (VIF) with vif() from the car package. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
vif(model_selection)
#> gdp time_prison
#> 1.05642 1.05642
We have seen that our model does not satisfy one of the assumptions: the errors are heteroscedastic. Now we will try to fix it. To make the model better behaved, we can transform the variables, both the target and the predictors, here with a log10 transformation.
crime3 <- crime2 %>% mutate_if(~is.numeric(.), ~log10(.))
head(crime3)

# split the data
set.seed(123)
data_train3 <- crime3[index, ]
data_test3 <- crime3[-index, ]
# train the model
model_transform <- lm(prob_prison ~ .,
data = data_train3)

summary(model_transform)
#>
#> Call:
#> lm(formula = prob_prison ~ ., data = data_train3)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.44665 -0.04281 0.01172 0.07306 0.21374
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.3531 0.8216 5.298 0.00001003 ***
#> gdp -1.5584 0.3053 -5.105 0.00001734 ***
#> time_prison -1.0867 0.1925 -5.645 0.00000377 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1339 on 30 degrees of freedom
#> Multiple R-squared: 0.7052, Adjusted R-squared: 0.6856
#> F-statistic: 35.89 on 2 and 30 DF, p-value: 0.00000001102
trans_pred <- predict(model_transform, newdata = data_test3 %>% select(-prob_prison))
# RMSE of train dataset
RMSE(pred = 10^(model_transform$fitted.values), obs = 10^(data_train3$prob_prison))
#> [1] 0.01110531
# RMSE of test dataset
RMSE(pred = 10^(trans_pred), obs = 10^(data_test3$prob_prison))
#> [1] 0.0254162
Our model has an adjusted R-squared of 68.56%. The value dropped a bit, but in return we get a model that satisfies the assumptions. Unfortunately, our model still seems to overfit: the RMSE on the training dataset is 0.01110531 while the test dataset has an RMSE of 0.0254162.
plot(model_transform,
     which = 1)

plot(model_transform,
     which = 2)

There is little to no discernible pattern in the residual plot. However, observation 33 stands out as an outlier; once we remove it, we can conclude that our model is linear.
# removing the outlier at index 33 from the residuals (the model itself is not refitted)
model_transform1 <- model_transform
model_transform1$residuals <- model_transform1$residuals[-33]

shapiro.test(model_transform1$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_transform1$residuals
#> W = 0.9782, p-value = 0.7458
bptest(model_transform1)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_transform1
#> BP = 3.6562, df = 2, p-value = 0.1607
With a p-value > 0.05, we can conclude that heteroscedasticity is no longer present.
vif(model_transform1)
#> gdp time_prison
#> 1.038923 1.038923
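Finally, a hedged example of how the transformed model could be used for a new observation; the input values below are made up for illustration, and both the predictors and the prediction live on the log10 scale:

# hypothetical new observation (illustrative values, not from the dataset)
new_obs <- data.frame(gdp = log10(550), time_prison = log10(25))
# predict() returns log10(prob_prison); back-transform with 10^
10^predict(model_transform, newdata = new_obs)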
The variables that are useful for explaining the variance in the probability of imprisonment (prob_prison) are gdp and time_prison. Our final model satisfies the classical assumptions. The adjusted R-squared of the model is fairly high: the predictors explain 68.56% of the variance in prob_prison. The accuracy of the model in predicting prob_prison is measured with RMSE; the training data has an RMSE of 0.01110531 and the testing data an RMSE of 0.0254162, suggesting that our model may overfit the training dataset.
We have now learned how to build a linear regression model and what to pay attention to when building one.