1 Introduction

In the following assignment, we will define a case study that analyzes the effects of a felony conviction on employment prospects of 126 formerly incarcerated individuals. Our general question is whether or not a felony conviction leads to higher rates of unemployment.

1.1 Data and Description

The data comes from surveys taken at various Clean Slate events in Southeastern Pennsylvania. Respondents are asked a series of socioeconomic questions including questions related to age, race, employment status, income, criminal record, etc. The data set contained here has 126 observations with two variables:

felony: binary dummy variable which determines criminal record (1 for felony conviction, 0 for not felony conviction)
unemployed: binary dummy variable which determines employment status (1 for unemployed, 0 for other status)

The data is private and may not be shared until further notice.

## tibble [126 × 2] (S3: tbl_df/tbl/data.frame)
##  $ felony    : num [1:126] 0 0 0 0 0 0 0 0 0 1 ...
##  $ unemployed: num [1:126] 0 0 0 0 0 0 0 1 0 0 ...
##  - attr(*, "na.action")= 'omit' Named int 13
##   ..- attr(*, "names")= chr "13"

2 Analysis

The following analysis will be sub-divided into two sections:

A preliminary analysis which will determine the relationship of the two variables and define a narrative
Model building and specification which will generate a model and explain the output by determining the accuracy and utility of the model

2.1 Preliminary Analysis

The following section will create two graphical representations of the data:

a side-by-side histogram of the distribution of the two variables
a pairwise plot which will further elucidate the relationship between the two variables

2.1.1 Distribution of the Variables

The following graph shows the distribution of the two variables. The histogram of the distribution of felony shows that around eight individuals had felony convictions and the histogram of unemployed yields a similar conclusion where only six or seven individuals are unemployed.

hist_felony <- ggplot(data, aes(x=felony)) + 
                geom_histogram(fill = "lightblue2", 
                                color = "darkblue", aes(y=..density..))+
                  geom_density(alpha=.2, fill="cadetblue2")+
                theme_minimal() + 
                labs(title = "Distribution of Felony", 
                     subtitle = "Binary Distribution Between 0 and 1",
                      x = "", 
                      y = "Frequency")

hist_unempl <- ggplot(data, aes(x=unemployed)) + 
                geom_histogram(fill = "gold2", 
                                color = "goldenrod4", aes(y=..density..))+
                  geom_density(alpha=.2, fill="orange")+
                theme_minimal() + 
                labs(title = "Distribution of Unemployed", 
                     subtitle = "Binary Distribution Between 0 and 1",
                      x = "", 
                      y = "Frequency")
                

figure <- ggarrange(hist_felony,hist_unempl, nrow = 1, ncol = 2) #plot both histograms in the same graphic

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

figure

2.1.2 Graphical Representation of the Relationship between Variables

The following graph shows a pairwise relationship between the two variables in which the distributions and correlation between the two variables are determined.

GGally::ggpairs(data[1:2])

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

2.2 Model Building and Specification

In the following section, we will first define a logit model to best represent the data. We will then interpret the model and determine its predictive power and capabilities.

2.2.1 Model Building

## 
## Call:
## glm(formula = unemployed ~ felony, family = binomial(link = "logit"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7733  -0.7733  -0.7733  -0.5949   1.9074  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.0542     0.2421  -4.354 1.34e-05 ***
## felony       -0.5881     0.5075  -1.159    0.247    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 135.95  on 125  degrees of freedom
## Residual deviance: 134.51  on 124  degrees of freedom
## AIC: 138.51
## 
## Number of Fisher Scoring iterations: 4

model.coef.stats = summary(model.logit)$coef       # output stats of coefficients
conf.ci = confint(model.logit)                     # confidence intervals of betas

## Waiting for profiling to be done...

sum.stats = cbind(model.coef.stats, conf.ci.95=conf.ci)   # rounding off decimals
kable(sum.stats,caption = "The summary stats of regression coefficients")

The summary stats of regression coefficients
	Estimate	Std. Error	z value	Pr(>\|z\|)	2.5 %	97.5 %
(Intercept)	-1.0541605	0.2421357	-4.353595	0.0000134	-1.549385	-0.5954257
felony	-0.5880672	0.5074982	-1.158757	0.2465551	-1.662269	0.3575115

The table above represents the final logit model. An interpretation of the model is as follows:

the model is statistically significant with \(p<0.0000\) and has an intercept coefficient equal to \(\beta_0=-1.0542\)
the explanatory variable felony has a coefficient of \(\beta_1=-0.5881\) meaning that it has a negative association with unemployent, and the variable is not statistically significant with \(p=0.2466\)
the confidence intervals verify the intercept coefficients as the 95% confidence interval of \(\beta_1=[-1.6623, 0.3575]\) and \(\beta_0=[-1.5494, -0.5954]\)

The table below specifies other goodness-of-fit measurements. Since we have no other models to compare our goodness measurements, we report the measurements solely for completeness.

## Other global goodness-of-fit
dev.resid = model.logit$deviance
dev.0.resid = model.logit$null.deviance
aic = model.logit$aic
goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid,
      AIC = aic)
pander(goodness)

Deviance.residual	Null.Deviance.Residual	AIC
134.5	135.9	138.5

2.2.2 Log-odds Interpretation

In the following section we will interpret the log-odds of the coefficients. This means that we will interpret the probability of being unemployed if you had a felony conviction. The table below shows that the log odds of being unemployed if you have a felony conviction is 0.56 percent. This means that a felony conviction does not impact access to employment as severely as it is commonly percieved.

# Odds ratio
model.coef.stats = summary(model.logit)$coef
odds.ratio = exp(coef(model.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")

Summary Stats with Odds Ratios
	Estimate	Std. Error	z value	Pr(>\|z\|)	odds.ratio
(Intercept)	-1.0541605	0.2421357	-4.353595	0.0000134	0.3484848
felony	-0.5880672	0.5074982	-1.158757	0.2465551	0.5553997

The following graphs show the probability and rate of change of the probability of being unemployed given a felony conviction. The left-hand graph shows that the probability of being unemployed decreases when you have had a felony conviction. And the right-hand graph yields a similar conclusion: if you have been convicted of a felony, the likelihood of being unemployed decreases.

felony.range = range(data$felony)
x = seq(felony.range[1], felony.range[2], length = 200)
beta.x = coef(model.logit)[1] + coef(model.logit)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
##
beta1 = coef(model.logit)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
##
##
par(mfrow = c(1,2))
plot(x, success.prob, type = "l", lwd = 2, col = "navy",
     main = "The probability of being \n  unemployed", 
     ylim=c(0, 1.1*ylimit),
     xlab = "Unemployed",
     ylab = "probability",
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)
# legend(30, 1, c("Success Probability", "Failure Probability"), lwd = rep(2,2), 
#       col = c("navy", "darkred"), cex = 0.7, bty = "n")
##
y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "navy",
     main = "The rate of change in the probability \n  unemployed", 
     xlab = "Unemployed",
     ylab = "Rate of Change",
     ylim=c(0,2*y.rate),
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8
     )
axis(1, pos = 0)
axis(2)

3 Conclusion

In the report above, a preliminary analysis of the data was constructed and found the distributions of the two variables: felony and unemployed. A simple logistic regression model was designed to test the relationship between unemployment and a felony conviction. The model revealed that unemployment is not determined by criminal record, or, at least, it is not determined by felony conviction as the log odds of becoming unemployed if you have a prior felony conviction is 0.55 percent.

In the next assignment, we will include further variables such as ethnicity, income, and other criminal record categories (i.e., misdemeanor, summary offense). We will then run a multiple logistical regression model to determine which variables are most significant in determining whether an individual is unemployed.

Week 6: Simple Logistic Regression Assignment

Angelo Saporito

2023-03-01