Introduction

The binary regression model used was to predict the number of persons who were arrested (1) and not arrested (0) based on the data of arrests in California in 1986. The variables used in the model are arrest (formerly known as narr or if a person is arrested or not arrested), pcnv (proportion of prior arrests that led to a conviction), avgsen (average sentence served from prior convictions), tottime (months spent in prison since age 18 before 1986), ptime (months spent in prison in 1986), and qemp (number of quarters (0 to 4) that the man was legally employed in 1986) (Eyal, 2010). The arrest is the dependent variable while the rest of the aforementioned variables are the independent variables. With that being said, there will be training and testing models that will be compared to assess the performance of the binary regression model created if it predicted correctly or incorrectly.

Reference: Eyal, K. (2010, October 25). Quantitative Methods for Economics: Tutorial 12. University of Cape Town. https://vula.uct.ac.za/access/content/group/95dfae58-9991-4317-8a1d-e2d58f80b3a3/Published%20OER%20UCT%20Resources/Quantitative%20Methods%20in%20Economics/Tutorials/Tutorial%2012.pdf

Install packages

library(pacman)
library(readxl)
library(packrat)
library(rsconnect)
pacman::p_load(psych, Hmisc, broom, tidyverse, pastecs, DescTools, skimr, olsrr, car, GGally, ggfortify, stargazer)

Importing california_arrests_1986.xlsx data

mydata <- readxl::read_excel("california_arrests_1986.xlsx", col_names = TRUE)

##The code for importing the excel came from https://www.tutorialkart.com/r-tutorial/r-read-excel-xls-xlsx/#gsc.tab=0. I added "readxl::" to be somehow similar with the working code when importing .csv file since it did not work when it was only read_excel().  

See the data structure of the imported data

str(mydata)
## tibble [2,725 × 6] (S3: tbl_df/tbl/data.frame)
##  $ narr   : num [1:2725] 0 2 1 2 1 0 2 5 0 0 ...
##  $ pcnv   : num [1:2725] 0.38 0.44 0.33 0.25 0 1 0.44 0.75 0.33 0.23 ...
##  $ avgsen : num [1:2725] 17.6 0 22.8 0 0 0 0 0 10.9 0 ...
##  $ tottime: num [1:2725] 35.2 0 22.8 0 0 0 0 0 21.8 0 ...
##  $ ptime  : num [1:2725] 12 0 0 5 0 0 0 0 9 0 ...
##  $ qemp   : num [1:2725] 0 1 0 2 2 4 0 0 0 3 ...

Categorical variables are dummified

mydata$narr <- as.factor(ifelse(mydata$narr == "1", 1, 0))
colnames(mydata)[colnames(mydata) == "narr"] <- "arrest"

##the data in the narr column are being dummified to either be 0 or 1. (0 = no arrest; 1 = with arrest)
##The dummification of variables in the narr column code were based on this link: https://rstudio-pubs-static.s3.amazonaws.com/709917_68beddf9f3e841519ed53a458ea946ff.html 
##The change of columns from "narr" to "arrest" code were based on this link: https://statisticsglobe.com/rename-column-name-in-r-data-frame/ 

See the data structure of the updated imported data

str(mydata)
## tibble [2,725 × 6] (S3: tbl_df/tbl/data.frame)
##  $ arrest : Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 1 1 1 ...
##  $ pcnv   : num [1:2725] 0.38 0.44 0.33 0.25 0 1 0.44 0.75 0.33 0.23 ...
##  $ avgsen : num [1:2725] 17.6 0 22.8 0 0 0 0 0 10.9 0 ...
##  $ tottime: num [1:2725] 35.2 0 22.8 0 0 0 0 0 21.8 0 ...
##  $ ptime  : num [1:2725] 12 0 0 5 0 0 0 0 9 0 ...
##  $ qemp   : num [1:2725] 0 1 0 2 2 4 0 0 0 3 ...

Splitting of Data

set.seed(3717)
grp <- sample(2, 2725, replace = T, prob = c(0.8, 0.2)) 
head(grp, 20)
##  [1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
str(grp)
##  int [1:2725] 1 1 1 1 2 1 1 1 1 1 ...
table(grp)
## grp
##    1    2 
## 2155  570
##3717 in the set.seed came from my ID number, 12073717. 
##The 2725 observations will be split into two with a proportion of 80:20: training data (used to create the model and check the performance of the model) and testing data (used to run the model made by the training data). 

Train and Test (continuation of spliting data)

train <- mydata[grp == 1, ]
test <- mydata[grp == 2, ]

Check the Train and Test variables

head(train, 20)
## # A tibble: 20 × 6
##    arrest  pcnv avgsen tottime ptime  qemp
##    <fct>  <dbl>  <dbl>   <dbl> <dbl> <dbl>
##  1 0       0.38   17.6    35.2    12   0  
##  2 0       0.44    0       0       0   1  
##  3 1       0.33   22.8    22.8     0   0  
##  4 0       0.25    0       0       5   2  
##  5 0       1       0       0       0   4  
##  6 0       0.44    0       0       0   0  
##  7 0       0.75    0       0       0   0  
##  8 0       0.33   10.9    21.8     9   0  
##  9 0       0.23    0       0       0   3  
## 10 1       0       0       0       0   4  
## 11 0       0.17   31.7    63.4    12   0  
## 12 0       0.33    0       0       0   4  
## 13 0       0       0       0       0   0  
## 14 1       0       0       0       0   4  
## 15 0       0.2    59.2    59.2     3   1.7
## 16 1       0.38    0       0       0   0  
## 17 0       0.5     0       0       0   4  
## 18 0       0.6     7.9    23.7     0   0  
## 19 0       0.5     0       0       0   4  
## 20 0       0.47    5.6    22.4     0   0
head(test, 20)
## # A tibble: 20 × 6
##    arrest  pcnv avgsen tottime ptime  qemp
##    <fct>  <dbl>  <dbl>   <dbl> <dbl> <dbl>
##  1 1       0       0       0       0     2
##  2 0       0.36   18.7    37.4     0     0
##  3 0       0.33   36.1    36.1     0     4
##  4 1       0.43    0       0       0     1
##  5 0       0.39   11.3    11.3     0     3
##  6 0       0.57    0       0       0     4
##  7 0       0.41    0       0       0     1
##  8 0       1       0       0       0     4
##  9 1       0       0       0       0     1
## 10 0       1       0       0       0     2
## 11 1       0.5     0       0       0     4
## 12 0       0.54    5.5    11       0     3
## 13 0       0.25    0       0       0     4
## 14 0       0       0       0       0     4
## 15 0       0.5     0       0       0     4
## 16 0       1       0       0       0     4
## 17 0       0       0       0       0     4
## 18 1       0       0       0       0     2
## 19 0       1       0       0       0     1
## 20 0       0.5     0       0       0     3

Initial GLM for Training Data

summary(mod1 <- glm(arrest ~ pcnv + avgsen + tottime + ptime + qemp, family = "binomial", data = train))
## 
## Call:
## glm(formula = arrest ~ pcnv + avgsen + tottime + ptime + qemp, 
##     family = "binomial", data = train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.84482    0.10707  -7.890 3.01e-15 ***
## pcnv        -1.07829    0.15652  -6.889 5.62e-12 ***
## avgsen       0.06817    0.06019   1.133   0.2573    
## tottime     -0.05792    0.05189  -1.116   0.2644    
## ptime       -0.11298    0.04572  -2.471   0.0135 *  
## qemp        -0.08866    0.03501  -2.532   0.0113 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2131.5  on 2154  degrees of freedom
## Residual deviance: 2061.2  on 2149  degrees of freedom
## AIC: 2073.2
## 
## Number of Fisher Scoring iterations: 5

Refining GLM for Training Data

summary(mod1 <- glm(arrest ~ pcnv + avgsen + ptime + qemp, family = "binomial", data = train))
## 
## Call:
## glm(formula = arrest ~ pcnv + avgsen + ptime + qemp, family = "binomial", 
##     data = train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.845857   0.107097  -7.898 2.83e-15 ***
## pcnv        -1.081364   0.156610  -6.905 5.03e-12 ***
## avgsen       0.001238   0.018794   0.066  0.94749    
## ptime       -0.121014   0.045681  -2.649  0.00807 ** 
## qemp        -0.088174   0.035004  -2.519  0.01177 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2131.5  on 2154  degrees of freedom
## Residual deviance: 2062.8  on 2150  degrees of freedom
## AIC: 2072.8
## 
## Number of Fisher Scoring iterations: 5
## remove "tottime" from glm model since it has a higher Pr than "avgsen" and not significant. 

Final GLM for Training Data

summary(mod1 <- glm(arrest ~ pcnv + ptime + qemp, family = "binomial", data = train))
## 
## Call:
## glm(formula = arrest ~ pcnv + ptime + qemp, family = "binomial", 
##     data = train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.84523    0.10667  -7.924 2.30e-15 ***
## pcnv        -1.08088    0.15642  -6.910 4.84e-12 ***
## ptime       -0.12047    0.04490  -2.683  0.00729 ** 
## qemp        -0.08828    0.03496  -2.525  0.01157 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2131.5  on 2154  degrees of freedom
## Residual deviance: 2062.8  on 2151  degrees of freedom
## AIC: 2070.8
## 
## Number of Fisher Scoring iterations: 5
## remove "avgsen" from glm model since it is still not significant. 
## Therefore, the variables pcnv,  ptime, and qemp can predict if a person will be arrested using this model considering its high significance. 

Generation of BLR equation = y = log(oddsratio)

yadmit = -0.84523118 -1.08087863 * pcnv -0.12047300 * ptime -0.08828444 * qemp

##checking observation 1 of train data set respondent 1 = 0.38pcnv 17.6avgsen 35.2tottime 12ptime 0.0qemp

-0.84523118 -1.08087863 * 0.38 -0.12047300 * 12 -0.08828444 * 0.0

###-2.701641 means y of the first observation

Original formula of p

$$ p = e^y / (1+e^y)

$$ ###Convert y to p y = -2.701641 exp(y)/(1+exp(y)) p = 0.06287659

###I added this part just to see if the p is somehow similar with the fitted values of mod1, which resulted that it did.

Prediction for Training data set

p1 <- predict(mod1, data = train, type = "response")
head(p1)
##          1          2          3          4          5          6 
## 0.06287659 0.19637570 0.23113363 0.13074534 0.09285632 0.21068227

Training data: Converting probabilities to binary (0 (not) or 1 (admitted))

pred1 <- ifelse(p1 > 0.5, 1, 0)
head(pred1)
## 1 2 3 4 5 6 
## 0 0 0 0 0 0
##converted the probabilities to 1 or 0

Training data: Misclassification error

Training data: Confusion error

tab1 <- table(Predicted = pred1, Actual = train$arrest)
tab1
##          Actual
## Predicted    0    1
##         0 1733  422
##correct prediction is 1733 (not arrested) 
##incorrect prediction is 422 (not arrested but arrested in reality)

Training Data: To check how the model performed

sum(diag(tab1))/sum(tab1)
## [1] 0.8041763
1-sum(diag(tab1))/sum(tab1)
## [1] 0.1958237
##0.8041763 = predicted correctly; good indication for its accuracy
##0.1958237 = predicted incorrectly

Test Data: Final GLM from Training Data

summary(mod1_test <- glm(arrest ~ pcnv + ptime + qemp, family = "binomial", data = test))
## 
## Call:
## glm(formula = arrest ~ pcnv + ptime + qemp, family = "binomial", 
##     data = test)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.58596    0.18552  -3.158  0.00159 ** 
## pcnv        -1.17920    0.28829  -4.090 4.31e-05 ***
## ptime       -0.10743    0.07399  -1.452  0.14652    
## qemp        -0.07955    0.06180  -1.287  0.19801    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 628.69  on 569  degrees of freedom
## Residual deviance: 605.35  on 566  degrees of freedom
## AIC: 613.35
## 
## Number of Fisher Scoring iterations: 4

Prediction of testing data set

p2 <- predict(mod1_test, data = test, type = "response")
head(p2, 10)
##         1         2         3         4         5         6         7         8 
## 0.3218993 0.2668875 0.2152959 0.2363916 0.2167859 0.1713198 0.2406752 0.1107247 
##         9        10 
## 0.3395035 0.1273871

Testing Data: Converting probabilities to binary (0 (not) or 1 (admitted))

pred2 <- ifelse(p2 > 0.5, 1, 0)
head(pred2)
## 1 2 3 4 5 6 
## 0 0 0 0 0 0
##converted the probabilities to 1 or 0

Testing Data: Misclassification error

Testing Data: Confusion error

tab2 <- table(Predicted = pred2, Actual = test$arrest)
tab2
##          Actual
## Predicted   0   1
##         0 433 137
##correct prediction is 433 (not arrested) 
##incorrect prediction is 137 (not arrested but arrested in reality)

Testing Data: To check how the model performed

sum(diag(tab2))/sum(tab2)
## [1] 0.7596491
1-sum(diag(tab2))/sum(tab2)
## [1] 0.2403509
##0.7596491 = predicted correctly; good indication for its accuracy
##0.2403509 = predicted incorrectly

Discussion and Analysis of the Training and Testing Models and its results

The evaluation of the model’s accuracy is through confusion error and misclassification errors which shows how accurate the models are. For both the training and testing model, it did not show 1 more row for 1 (persons who were arrested) which was alarming since the previous exercise showed 2 rows for the predicted column. However, after checking how the model performed, it resulted in 80.42% correct prediction and 19.58% for incorrect prediction, which is a good indication of a high level of accuracy. On the other hand, the testing model had a lower correct prediction percentage of 75.96% and a higher incorrect prediction percentage of 24.04%, which is still a high level of accuracy. This may imply that the training data model may be overfitting, wherein the model fails to generalize to other data within the population (Frost, 2023). Therefore, increasing the number of the training data set could help in having more different samples that can be considered so that in the testing data set, the model can be generalized.

Frost, J. (2023). Overfitting Regression Models: Problems, Detection, and Avoidance. University of Cape Town. https://statisticsbyjim.com/regression/overfitting-regression-models/