In the following assignment, we will define a case study that analyzes the effects of a felony conviction on employment prospects of 126 formerly incarcerated individuals. Our general question is whether or not a felony conviction leads to higher rates of unemployment.
The data comes from surveys taken at various Clean Slate events in Southeastern Pennsylvania. Respondents are asked a series of socioeconomic questions including questions related to age, race, employment status, income, criminal record, etc. The data set contained here has 126 observations with two variables:
The data is private and may not be shared until further notice.
## tibble [126 × 2] (S3: tbl_df/tbl/data.frame)
## $ felony : num [1:126] 0 0 0 0 0 0 0 0 0 1 ...
## $ unemployed: num [1:126] 0 0 0 0 0 0 0 1 0 0 ...
## - attr(*, "na.action")= 'omit' Named int 13
## ..- attr(*, "names")= chr "13"
The following analysis will be sub-divided into two sections:
The following section will create two graphical representations of the data:
The following graph shows the distribution of the two variables. The histogram of the distribution of felony shows that around eight individuals had felony convictions and the histogram of unemployed yields a similar conclusion where only six or seven individuals are unemployed.
hist_felony <- ggplot(data, aes(x=felony)) +
geom_histogram(fill = "lightblue2",
color = "darkblue", aes(y=..density..))+
geom_density(alpha=.2, fill="cadetblue2")+
theme_minimal() +
labs(title = "Distribution of Felony",
subtitle = "Binary Distribution Between 0 and 1",
x = "",
y = "Frequency")
hist_unempl <- ggplot(data, aes(x=unemployed)) +
geom_histogram(fill = "gold2",
color = "goldenrod4", aes(y=..density..))+
geom_density(alpha=.2, fill="orange")+
theme_minimal() +
labs(title = "Distribution of Unemployed",
subtitle = "Binary Distribution Between 0 and 1",
x = "",
y = "Frequency")
figure <- ggarrange(hist_felony,hist_unempl, nrow = 1, ncol = 2) #plot both histograms in the same graphic
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
figure
The following graph shows a pairwise relationship between the two variables in which the distributions and correlation between the two variables are determined.
GGally::ggpairs(data[1:2])
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
In the following section, we will first define a logit model to best represent the data. We will then interpret the model and determine its predictive power and capabilities.
##
## Call:
## glm(formula = unemployed ~ felony, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7733 -0.7733 -0.7733 -0.5949 1.9074
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0542 0.2421 -4.354 1.34e-05 ***
## felony -0.5881 0.5075 -1.159 0.247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 135.95 on 125 degrees of freedom
## Residual deviance: 134.51 on 124 degrees of freedom
## AIC: 138.51
##
## Number of Fisher Scoring iterations: 4
model.coef.stats = summary(model.logit)$coef # output stats of coefficients
conf.ci = confint(model.logit) # confidence intervals of betas
## Waiting for profiling to be done...
sum.stats = cbind(model.coef.stats, conf.ci.95=conf.ci) # rounding off decimals
kable(sum.stats,caption = "The summary stats of regression coefficients")
| Estimate | Std. Error | z value | Pr(>|z|) | 2.5 % | 97.5 % | |
|---|---|---|---|---|---|---|
| (Intercept) | -1.0541605 | 0.2421357 | -4.353595 | 0.0000134 | -1.549385 | -0.5954257 |
| felony | -0.5880672 | 0.5074982 | -1.158757 | 0.2465551 | -1.662269 | 0.3575115 |
The table above represents the final logit model. An interpretation of the model is as follows:
The table below specifies other goodness-of-fit measurements. Since we have no other models to compare our goodness measurements, we report the measurements solely for completeness.
## Other global goodness-of-fit
dev.resid = model.logit$deviance
dev.0.resid = model.logit$null.deviance
aic = model.logit$aic
goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid,
AIC = aic)
pander(goodness)
| Deviance.residual | Null.Deviance.Residual | AIC |
|---|---|---|
| 134.5 | 135.9 | 138.5 |
In the following section we will interpret the log-odds of the coefficients. This means that we will interpret the probability of being unemployed if you had a felony conviction. The table below shows that the log odds of being unemployed if you have a felony conviction is 0.56 percent. This means that a felony conviction does not impact access to employment as severely as it is commonly percieved.
# Odds ratio
model.coef.stats = summary(model.logit)$coef
odds.ratio = exp(coef(model.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)
kable(out.stats,caption = "Summary Stats with Odds Ratios")
| Estimate | Std. Error | z value | Pr(>|z|) | odds.ratio | |
|---|---|---|---|---|---|
| (Intercept) | -1.0541605 | 0.2421357 | -4.353595 | 0.0000134 | 0.3484848 |
| felony | -0.5880672 | 0.5074982 | -1.158757 | 0.2465551 | 0.5553997 |
The following graphs show the probability and rate of change of the probability of being unemployed given a felony conviction. The left-hand graph shows that the probability of being unemployed decreases when you have had a felony conviction. And the right-hand graph yields a similar conclusion: if you have been convicted of a felony, the likelihood of being unemployed decreases.
felony.range = range(data$felony)
x = seq(felony.range[1], felony.range[2], length = 200)
beta.x = coef(model.logit)[1] + coef(model.logit)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
##
beta1 = coef(model.logit)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
##
##
par(mfrow = c(1,2))
plot(x, success.prob, type = "l", lwd = 2, col = "navy",
main = "The probability of being \n unemployed",
ylim=c(0, 1.1*ylimit),
xlab = "Unemployed",
ylab = "probability",
axes = FALSE,
col.main = "navy",
cex.main = 0.8)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)
# legend(30, 1, c("Success Probability", "Failure Probability"), lwd = rep(2,2),
# col = c("navy", "darkred"), cex = 0.7, bty = "n")
##
y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "navy",
main = "The rate of change in the probability \n unemployed",
xlab = "Unemployed",
ylab = "Rate of Change",
ylim=c(0,2*y.rate),
axes = FALSE,
col.main = "navy",
cex.main = 0.8
)
axis(1, pos = 0)
axis(2)
In the report above, a preliminary analysis of the data was constructed and found the distributions of the two variables: felony and unemployed. A simple logistic regression model was designed to test the relationship between unemployment and a felony conviction. The model revealed that unemployment is not determined by criminal record, or, at least, it is not determined by felony conviction as the log odds of becoming unemployed if you have a prior felony conviction is 0.55 percent.
In the next assignment, we will include further variables such as ethnicity, income, and other criminal record categories (i.e., misdemeanor, summary offense). We will then run a multiple logistical regression model to determine which variables are most significant in determining whether an individual is unemployed.