The binary regression model used was to predict the number of persons who were arrested (1) and not arrested (0) based on the data of arrests in California in 1986. The variables used in the model are arrest (formerly known as narr or if a person is arrested or not arrested), pcnv (proportion of prior arrests that led to a conviction), avgsen (average sentence served from prior convictions), tottime (months spent in prison since age 18 before 1986), ptime (months spent in prison in 1986), and qemp (number of quarters (0 to 4) that the man was legally employed in 1986) (Eyal, 2010). The arrest is the dependent variable while the rest of the aforementioned variables are the independent variables. With that being said, there will be training and testing models that will be compared to assess the performance of the binary regression model created if it predicted correctly or incorrectly.
Reference: Eyal, K. (2010, October 25). Quantitative Methods for Economics: Tutorial 12. University of Cape Town. https://vula.uct.ac.za/access/content/group/95dfae58-9991-4317-8a1d-e2d58f80b3a3/Published%20OER%20UCT%20Resources/Quantitative%20Methods%20in%20Economics/Tutorials/Tutorial%2012.pdf
library(pacman)
library(readxl)
library(packrat)
library(rsconnect)
pacman::p_load(psych, Hmisc, broom, tidyverse, pastecs, DescTools, skimr, olsrr, car, GGally, ggfortify, stargazer)
mydata <- readxl::read_excel("california_arrests_1986.xlsx", col_names = TRUE)
##The code for importing the excel came from https://www.tutorialkart.com/r-tutorial/r-read-excel-xls-xlsx/#gsc.tab=0. I added "readxl::" to be somehow similar with the working code when importing .csv file since it did not work when it was only read_excel().
str(mydata)
## tibble [2,725 × 6] (S3: tbl_df/tbl/data.frame)
## $ narr : num [1:2725] 0 2 1 2 1 0 2 5 0 0 ...
## $ pcnv : num [1:2725] 0.38 0.44 0.33 0.25 0 1 0.44 0.75 0.33 0.23 ...
## $ avgsen : num [1:2725] 17.6 0 22.8 0 0 0 0 0 10.9 0 ...
## $ tottime: num [1:2725] 35.2 0 22.8 0 0 0 0 0 21.8 0 ...
## $ ptime : num [1:2725] 12 0 0 5 0 0 0 0 9 0 ...
## $ qemp : num [1:2725] 0 1 0 2 2 4 0 0 0 3 ...
mydata$narr <- as.factor(ifelse(mydata$narr == "1", 1, 0))
colnames(mydata)[colnames(mydata) == "narr"] <- "arrest"
##the data in the narr column are being dummified to either be 0 or 1. (0 = no arrest; 1 = with arrest)
##The dummification of variables in the narr column code were based on this link: https://rstudio-pubs-static.s3.amazonaws.com/709917_68beddf9f3e841519ed53a458ea946ff.html
##The change of columns from "narr" to "arrest" code were based on this link: https://statisticsglobe.com/rename-column-name-in-r-data-frame/
str(mydata)
## tibble [2,725 × 6] (S3: tbl_df/tbl/data.frame)
## $ arrest : Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 1 1 1 ...
## $ pcnv : num [1:2725] 0.38 0.44 0.33 0.25 0 1 0.44 0.75 0.33 0.23 ...
## $ avgsen : num [1:2725] 17.6 0 22.8 0 0 0 0 0 10.9 0 ...
## $ tottime: num [1:2725] 35.2 0 22.8 0 0 0 0 0 21.8 0 ...
## $ ptime : num [1:2725] 12 0 0 5 0 0 0 0 9 0 ...
## $ qemp : num [1:2725] 0 1 0 2 2 4 0 0 0 3 ...
set.seed(3717)
grp <- sample(2, 2725, replace = T, prob = c(0.8, 0.2))
head(grp, 20)
## [1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
str(grp)
## int [1:2725] 1 1 1 1 2 1 1 1 1 1 ...
table(grp)
## grp
## 1 2
## 2155 570
##3717 in the set.seed came from my ID number, 12073717.
##The 2725 observations will be split into two with a proportion of 80:20: training data (used to create the model and check the performance of the model) and testing data (used to run the model made by the training data).
train <- mydata[grp == 1, ]
test <- mydata[grp == 2, ]
head(train, 20)
## # A tibble: 20 × 6
## arrest pcnv avgsen tottime ptime qemp
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.38 17.6 35.2 12 0
## 2 0 0.44 0 0 0 1
## 3 1 0.33 22.8 22.8 0 0
## 4 0 0.25 0 0 5 2
## 5 0 1 0 0 0 4
## 6 0 0.44 0 0 0 0
## 7 0 0.75 0 0 0 0
## 8 0 0.33 10.9 21.8 9 0
## 9 0 0.23 0 0 0 3
## 10 1 0 0 0 0 4
## 11 0 0.17 31.7 63.4 12 0
## 12 0 0.33 0 0 0 4
## 13 0 0 0 0 0 0
## 14 1 0 0 0 0 4
## 15 0 0.2 59.2 59.2 3 1.7
## 16 1 0.38 0 0 0 0
## 17 0 0.5 0 0 0 4
## 18 0 0.6 7.9 23.7 0 0
## 19 0 0.5 0 0 0 4
## 20 0 0.47 5.6 22.4 0 0
head(test, 20)
## # A tibble: 20 × 6
## arrest pcnv avgsen tottime ptime qemp
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 0 0 0 2
## 2 0 0.36 18.7 37.4 0 0
## 3 0 0.33 36.1 36.1 0 4
## 4 1 0.43 0 0 0 1
## 5 0 0.39 11.3 11.3 0 3
## 6 0 0.57 0 0 0 4
## 7 0 0.41 0 0 0 1
## 8 0 1 0 0 0 4
## 9 1 0 0 0 0 1
## 10 0 1 0 0 0 2
## 11 1 0.5 0 0 0 4
## 12 0 0.54 5.5 11 0 3
## 13 0 0.25 0 0 0 4
## 14 0 0 0 0 0 4
## 15 0 0.5 0 0 0 4
## 16 0 1 0 0 0 4
## 17 0 0 0 0 0 4
## 18 1 0 0 0 0 2
## 19 0 1 0 0 0 1
## 20 0 0.5 0 0 0 3
summary(mod1 <- glm(arrest ~ pcnv + avgsen + tottime + ptime + qemp, family = "binomial", data = train))
##
## Call:
## glm(formula = arrest ~ pcnv + avgsen + tottime + ptime + qemp,
## family = "binomial", data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.84482 0.10707 -7.890 3.01e-15 ***
## pcnv -1.07829 0.15652 -6.889 5.62e-12 ***
## avgsen 0.06817 0.06019 1.133 0.2573
## tottime -0.05792 0.05189 -1.116 0.2644
## ptime -0.11298 0.04572 -2.471 0.0135 *
## qemp -0.08866 0.03501 -2.532 0.0113 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2131.5 on 2154 degrees of freedom
## Residual deviance: 2061.2 on 2149 degrees of freedom
## AIC: 2073.2
##
## Number of Fisher Scoring iterations: 5
summary(mod1 <- glm(arrest ~ pcnv + avgsen + ptime + qemp, family = "binomial", data = train))
##
## Call:
## glm(formula = arrest ~ pcnv + avgsen + ptime + qemp, family = "binomial",
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.845857 0.107097 -7.898 2.83e-15 ***
## pcnv -1.081364 0.156610 -6.905 5.03e-12 ***
## avgsen 0.001238 0.018794 0.066 0.94749
## ptime -0.121014 0.045681 -2.649 0.00807 **
## qemp -0.088174 0.035004 -2.519 0.01177 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2131.5 on 2154 degrees of freedom
## Residual deviance: 2062.8 on 2150 degrees of freedom
## AIC: 2072.8
##
## Number of Fisher Scoring iterations: 5
## remove "tottime" from glm model since it has a higher Pr than "avgsen" and not significant.
summary(mod1 <- glm(arrest ~ pcnv + ptime + qemp, family = "binomial", data = train))
##
## Call:
## glm(formula = arrest ~ pcnv + ptime + qemp, family = "binomial",
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.84523 0.10667 -7.924 2.30e-15 ***
## pcnv -1.08088 0.15642 -6.910 4.84e-12 ***
## ptime -0.12047 0.04490 -2.683 0.00729 **
## qemp -0.08828 0.03496 -2.525 0.01157 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2131.5 on 2154 degrees of freedom
## Residual deviance: 2062.8 on 2151 degrees of freedom
## AIC: 2070.8
##
## Number of Fisher Scoring iterations: 5
## remove "avgsen" from glm model since it is still not significant.
## Therefore, the variables pcnv, ptime, and qemp can predict if a person will be arrested using this model considering its high significance.
yadmit = -0.84523118 -1.08087863 * pcnv -0.12047300 * ptime -0.08828444 * qemp
##checking observation 1 of train data set respondent 1 = 0.38pcnv 17.6avgsen 35.2tottime 12ptime 0.0qemp
-0.84523118 -1.08087863 * 0.38 -0.12047300 * 12 -0.08828444 * 0.0
###-2.701641 means y of the first observation
$$ p = e^y / (1+e^y)
$$ ###Convert y to p y = -2.701641 exp(y)/(1+exp(y)) p = 0.06287659
###I added this part just to see if the p is somehow similar with the fitted values of mod1, which resulted that it did.
p1 <- predict(mod1, data = train, type = "response")
head(p1)
## 1 2 3 4 5 6
## 0.06287659 0.19637570 0.23113363 0.13074534 0.09285632 0.21068227
pred1 <- ifelse(p1 > 0.5, 1, 0)
head(pred1)
## 1 2 3 4 5 6
## 0 0 0 0 0 0
##converted the probabilities to 1 or 0
tab1 <- table(Predicted = pred1, Actual = train$arrest)
tab1
## Actual
## Predicted 0 1
## 0 1733 422
##correct prediction is 1733 (not arrested)
##incorrect prediction is 422 (not arrested but arrested in reality)
sum(diag(tab1))/sum(tab1)
## [1] 0.8041763
1-sum(diag(tab1))/sum(tab1)
## [1] 0.1958237
##0.8041763 = predicted correctly; good indication for its accuracy
##0.1958237 = predicted incorrectly
summary(mod1_test <- glm(arrest ~ pcnv + ptime + qemp, family = "binomial", data = test))
##
## Call:
## glm(formula = arrest ~ pcnv + ptime + qemp, family = "binomial",
## data = test)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.58596 0.18552 -3.158 0.00159 **
## pcnv -1.17920 0.28829 -4.090 4.31e-05 ***
## ptime -0.10743 0.07399 -1.452 0.14652
## qemp -0.07955 0.06180 -1.287 0.19801
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 628.69 on 569 degrees of freedom
## Residual deviance: 605.35 on 566 degrees of freedom
## AIC: 613.35
##
## Number of Fisher Scoring iterations: 4
p2 <- predict(mod1_test, data = test, type = "response")
head(p2, 10)
## 1 2 3 4 5 6 7 8
## 0.3218993 0.2668875 0.2152959 0.2363916 0.2167859 0.1713198 0.2406752 0.1107247
## 9 10
## 0.3395035 0.1273871
pred2 <- ifelse(p2 > 0.5, 1, 0)
head(pred2)
## 1 2 3 4 5 6
## 0 0 0 0 0 0
##converted the probabilities to 1 or 0
tab2 <- table(Predicted = pred2, Actual = test$arrest)
tab2
## Actual
## Predicted 0 1
## 0 433 137
##correct prediction is 433 (not arrested)
##incorrect prediction is 137 (not arrested but arrested in reality)
sum(diag(tab2))/sum(tab2)
## [1] 0.7596491
1-sum(diag(tab2))/sum(tab2)
## [1] 0.2403509
##0.7596491 = predicted correctly; good indication for its accuracy
##0.2403509 = predicted incorrectly
The evaluation of the model’s accuracy is through confusion error and misclassification errors which shows how accurate the models are. For both the training and testing model, it did not show 1 more row for 1 (persons who were arrested) which was alarming since the previous exercise showed 2 rows for the predicted column. However, after checking how the model performed, it resulted in 80.42% correct prediction and 19.58% for incorrect prediction, which is a good indication of a high level of accuracy. On the other hand, the testing model had a lower correct prediction percentage of 75.96% and a higher incorrect prediction percentage of 24.04%, which is still a high level of accuracy. This may imply that the training data model may be overfitting, wherein the model fails to generalize to other data within the population (Frost, 2023). Therefore, increasing the number of the training data set could help in having more different samples that can be considered so that in the testing data set, the model can be generalized.
Frost, J. (2023). Overfitting Regression Models: Problems, Detection, and Avoidance. University of Cape Town. https://statisticsbyjim.com/regression/overfitting-regression-models/