# 1 Introduction
In the following report, we use survey data with 103 respondents to assess the effect of various socioeconomic and demographic factors on employment status. The respondents are all formerly incarcerated or have a criminal record (felony, misdemeanor, or summary offense).
The data are a subset of a larger data set containing 131 respondents and 64 variables, which range from demographic characteristics (age, race, gender, etc.) to socioeconomic attributes (employment status, income, criminal record, education, etc.). Our working data set was created by selecting the following variables from the primary data set and omitting respondents with missing values: felony, misdemeanor, summary, gender, black, white, and Unemployed.
How do various socioeconomic and demographic attributes affect the employment status of individuals with a criminal record? Special consideration is given to the individual's race and criminal record.
The analysis below has several components. It begins with a preliminary analysis, which itself has five main components.
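library(MASS)       # stepAIC; loaded first so that dplyr::select is not masked
library(dplyr)      # %>% and select
library(pander)     # pander tables
library(crosstable) # crosstable and as_flextable
library(GGally)     # ggpairs
library(knitr)      # kable
data <- readxl::read_xlsx("C:/Users/Angelo/OneDrive/Desktop/College Babyyyyyyy/RESEARCH/Clean Slate/data.xlsx") %>%
  select(6:8, 11, 13:14, 38) %>%
  na.omit() # read in data, select variables, omit missing values
data$gender <- ifelse(data$gender == "Male", 1, 0) # make gender variable numeric binary
pander(summary(data),
       caption = "Summary Statistics Table") # summary statistics table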
| felony | misdemeanor | summary | gender |
|---|---|---|---|
| Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 |
| 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 |
| Median :0.0000 | Median :0.0000 | Median :0.0000 | Median :1.0000 |
| Mean :0.2816 | Mean :0.4951 | Mean :0.1748 | Mean :0.5437 |
| 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:0.0000 | 3rd Qu.:1.0000 |
| Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 |
| black | white | Unemployed |
|---|---|---|
| Min. :0.0000 | Min. :0.0000 | Min. :0.0000 |
| 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 |
| Median :1.0000 | Median :0.0000 | Median :0.0000 |
| Mean :0.6893 | Mean :0.2427 | Mean :0.2816 |
| 3rd Qu.:1.0000 | 3rd Qu.:0.0000 | 3rd Qu.:1.0000 |
| Max. :1.0000 | Max. :1.0000 | Max. :1.0000 |
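Because every variable is a binary indicator, each mean in the summary statistics table is the share of respondents with that attribute: about 28 percent have a felony record, 50 percent a misdemeanor, and 17 percent a summary offense; 54 percent are male, 69 percent are Black, 24 percent are white, and 28 percent are unemployed.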
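As a quick check on the summary table, the means of the 0/1 indicators can be converted back into respondent counts. The sketch below assumes the sample size of 103 stated in the introduction and uses the means as printed in the table; the object names are illustrative only.

```r
# Means copied from the summary statistics table; n = 103 respondents
n <- 103
means <- c(felony = 0.2816, misdemeanor = 0.4951, summary = 0.1748,
           gender = 0.5437, black = 0.6893, white = 0.2427,
           Unemployed = 0.2816)

# For a 0/1 indicator, mean * n is the number of respondents coded 1
counts <- round(means * n)
print(counts)
```

The resulting counts can be compared against the row and column totals of the crosstables that follow.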
crosstable(data, felony+misdemeanor+summary~Unemployed, showNA="no",
percent_digits=0, percent_pattern="{n} ({p_col}/{p_row})") %>%
as_flextable()
| label | variable | Unemployed = 0 | Unemployed = 1 |
|---|---|---|---|
| felony | 0 | 51 (69%/69%) | 23 (79%/31%) |
| felony | 1 | 23 (31%/79%) | 6 (21%/21%) |
| misdemeanor | 0 | 36 (49%/69%) | 16 (55%/31%) |
| misdemeanor | 1 | 38 (51%/75%) | 13 (45%/25%) |
| summary | 0 | 59 (80%/69%) | 26 (90%/31%) |
| summary | 1 | 15 (20%/83%) | 3 (10%/17%) |
crosstable(data, Unemployed~black+gender, showNA="no",
percent_digits=0, percent_pattern="{n} ({p_col}/{p_row})") %>%
as_flextable()
| label | variable | gender=0, black=0 | gender=0, black=1 | gender=1, black=0 | gender=1, black=1 |
|---|---|---|---|---|---|
| Unemployed | 0 | 13 (81%/18%) | 21 (68%/28%) | 12 (75%/16%) | 28 (70%/38%) |
| Unemployed | 1 | 3 (19%/10%) | 10 (32%/34%) | 4 (25%/14%) | 12 (30%/41%) |
The first crosstable shows the employment status of respondents by criminal record. Each cell shows the count followed by the column percentage and the row percentage, in that order. For example, among individuals with a felony record, 79 percent are employed and 21 percent are unemployed (the row percentages of the felony = 1 row).
The second crosstable shows employment status by race and gender. Take, for example, Black males: 70 percent are employed and 30 percent are unemployed (the column percentages of the gender = 1, black = 1 column).
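The percentages quoted for respondents with a felony record can be reproduced from the cell counts alone. The following is a minimal base R sketch, assuming only the four felony counts shown in the first crosstable; `tab`, `row_pct`, and `col_pct` are illustrative names, not objects from the original analysis.

```r
# Counts copied from the felony rows of the first crosstable
tab <- matrix(c(51, 23,
                23,  6),
              nrow = 2, byrow = TRUE,
              dimnames = list(felony = c("0", "1"),
                              Unemployed = c("0", "1")))

# Row percentages: employment split within each felony group
row_pct <- round(100 * prop.table(tab, margin = 1))
# Column percentages: felony split within each employment group
col_pct <- round(100 * prop.table(tab, margin = 2))

print(row_pct)
print(col_pct)
```

The `felony = 1` row of `row_pct` gives the employed and unemployed shares among respondents with a felony record.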
ggpairs(data, columns = 1:7)
The pairwise comparison plot shows the distribution of each variable and the correlation between every pair of variables. Our column of interest is Unemployed: felony, misdemeanor, summary, and white are all negatively correlated with Unemployed, whereas gender (male) and black are positively correlated with it.
In the following section, we will build three logit models to best represent our data: a full model in which all variables are included, a reduced model in which variables are manually selected, and a forward model in which variables are automatically selected.
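For reference, the full model can be written out as below. This is a sketch of the standard logit specification; the predictor list is taken from the summary statistics table, and the symbols $p_i$ and $\beta_j$ are notation introduced here rather than in the original report.

```latex
\operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i}
  = \beta_0 + \beta_1\,\text{felony}_i + \beta_2\,\text{misdemeanor}_i
  + \beta_3\,\text{summary}_i + \beta_4\,\text{gender}_i
  + \beta_5\,\text{black}_i + \beta_6\,\text{white}_i,
\qquad p_i = P(\text{Unemployed}_i = 1)
```

Each coefficient is therefore a change in the log-odds of unemployment associated with the corresponding indicator, holding the others fixed.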
full.model = glm(Unemployed~.,
family = binomial(link = "logit"), # logit(p) = log(p/(1-p))!
data = data)
pander(summary(full.model)$coef,
caption="Summary of inferential statistics of the full model")
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -2.176 | 1.236 | -1.76 | 0.07834 |
| felony | -0.6818 | 0.5557 | -1.227 | 0.2198 |
| misdemeanor | -0.245 | 0.4615 | -0.531 | 0.5954 |
| summary | -0.9108 | 0.7276 | -1.252 | 0.2106 |
| gender | 0.02418 | 0.4608 | 0.05248 | 0.9581 |
| black | 1.773 | 1.186 | 1.494 | 0.1351 |
| white | 1.551 | 1.22 | 1.271 | 0.2036 |
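After building the full model, we find that none of the explanatory variables is statistically significant at the 5 percent level; the smallest p-value among the predictors is 0.135 (black).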
reduced.model = glm(Unemployed~.-gender-misdemeanor-white,
family = binomial(link = "logit"), # logit(p) = log(p/(1-p))!
data = data)
pander(summary(reduced.model)$coef,
caption="Summary of inferential statistics of the reduced model")
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.037 | 0.4507 | -2.301 | 0.02137 |
| felony | -0.7189 | 0.5407 | -1.33 | 0.1837 |
| summary | -0.7676 | 0.6845 | -1.121 | 0.2621 |
| black | 0.5617 | 0.5186 | 1.083 | 0.2788 |
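After reducing the full model, we find that the intercept is statistically significant at the 5 percent level (p = 0.021), but felony, summary, and black are not (p-values of 0.18, 0.26, and 0.28).
The variance inflation factors of black and white exceed 5 in the full model, hinting at multicollinearity between the two race indicators. The reduced model as fitted above retains black for narrative purposes and drops white along with gender and misdemeanor.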
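The report does not say how the variance inflation factors were computed, so the following is only the usual linear-model definition, given as an assumption for orientation. Here $R_j^2$ is the coefficient of determination from regressing predictor $j$ on the remaining predictors.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j > 5 \iff R_j^2 > 0.8
```

Under this definition, a value above 5 means that more than 80 percent of the variation in one race indicator is explained by the other predictors.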
final.model.forward = stepAIC(reduced.model,
scope = list(lower=formula(reduced.model),upper=formula(full.model)),
direction = "forward", # forward selection
trace = 0 # do not show the details
)
pander(summary(final.model.forward)$coef,
caption="Summary of inferential statistics of the final model")
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.037 | 0.4507 | -2.301 | 0.02137 |
| felony | -0.7189 | 0.5407 | -1.33 | 0.1837 |
| summary | -0.7676 | 0.6845 | -1.121 | 0.2621 |
| black | 0.5617 | 0.5186 | 1.083 | 0.2788 |
Forward selection, starting from the reduced model, added no further variables, so the forward model is identical to the reduced model. See the reduced model above (Section 2.2.2) for the final results.
global.measure = function(s.logit){
  dev.resid = s.logit$deviance
  dev.0.resid = s.logit$null.deviance
  aic = s.logit$aic
  goodness = cbind(Deviance.residual = dev.resid,
                   Null.Deviance.Residual = dev.0.resid,
                   AIC = aic)
  goodness
}
goodness=rbind(full.model = global.measure(full.model),
reduced.model=global.measure(reduced.model),
final.model=global.measure(final.model.forward))
row.names(goodness) = c("full.model", "reduced.model", "final.model")
kable(goodness, caption ="Comparison of global goodness-of-fit statistics")
| | Deviance.residual | Null.Deviance.Residual | AIC |
|---|---|---|---|
| full.model | 116.0743 | 122.4494 | 130.0743 |
| reduced.model | 118.3444 | 122.4494 | 126.3444 |
| final.model | 118.3444 | 122.4494 | 126.3444 |
We used several goodness-of-fit measures to determine which model was superior for classification purposes. Because the reduced and forward models are identical, the forward model was chosen as the final model: its AIC (126.34) is lower than that of the full model (130.07).
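The AIC values in the table can be checked by hand. For a logistic regression fitted to individual 0/1 responses, the residual deviance equals -2 times the log-likelihood, so AIC is the deviance plus twice the number of estimated coefficients. The sketch below uses the deviances from the table and coefficient counts read off the coefficient tables (7 for the full model, 4 for the reduced model); the object names are illustrative.

```r
# Residual deviances copied from the goodness-of-fit table
deviance <- c(full.model = 116.0743, reduced.model = 118.3444)
# Number of estimated coefficients, including the intercept
n_coef <- c(full.model = 7, reduced.model = 4)

# AIC = deviance + 2 * (number of coefficients) for ungrouped binary data
aic <- deviance + 2 * n_coef
print(aic)
```

The full model fits slightly better in raw deviance, but the penalty for its three extra coefficients outweighs that gain, which is why its AIC is higher.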
model.coef.stats = summary(reduced.model)$coef
odds.ratio = exp(coef(reduced.model))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)
kable(out.stats,caption = "Summary Stats with Odds Ratios")
| | Estimate | Std. Error | z value | Pr(>\|z\|) | odds.ratio |
|---|---|---|---|---|---|
| (Intercept) | -1.0372839 | 0.4507357 | -2.301313 | 0.0213740 | 0.3544160 |
| felony | -0.7188880 | 0.5407112 | -1.329523 | 0.1836755 | 0.4872938 |
| summary | -0.7675842 | 0.6844728 | -1.121424 | 0.2621074 | 0.4641330 |
| black | 0.5617290 | 0.5186215 | 1.083119 | 0.2787554 | 1.7537020 |
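After deciding on the final model, we exponentiated the coefficients, which are on the log-odds scale, to obtain odds ratios for increased interpretability. Holding the other variables constant, the estimated odds of unemployment for respondents with a felony record are about 0.49 times those of respondents without one; for respondents with a summary offense, about 0.46 times; and for Black respondents, about 1.75 times those of non-Black respondents. None of these effects is statistically significant at the 5 percent level.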
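The odds.ratio column can be reproduced from the printed estimates, and the standard errors allow a rough interval around each odds ratio. The sketch below copies the estimates and standard errors from the table above; the 95 percent Wald interval is an addition made here for illustration and is not part of the original analysis.

```r
# Estimates and standard errors copied from the reduced-model table
est <- c("(Intercept)" = -1.0372839, felony = -0.7188880,
         summary = -0.7675842, black = 0.5617290)
se  <- c("(Intercept)" = 0.4507357, felony = 0.5407112,
         summary = 0.6844728, black = 0.5186215)

# Logit coefficients are log-odds; exponentiating gives odds ratios
odds_ratio <- exp(est)

# Approximate 95% Wald interval on the odds-ratio scale (an assumption here)
z  <- qnorm(0.975)
ci <- cbind(lower = exp(est - z * se), upper = exp(est + z * se))

print(round(cbind(odds_ratio, ci), 3))
```

An interval that contains 1 is consistent with the corresponding p-value being above 0.05.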
In the analysis above, a preliminary analysis described the distributions and correlations of the response and explanatory variables. A multiple logistic regression model was fitted to test the relationship between Unemployed and felony, misdemeanor, summary, gender, black, and white. The results of the full model were interpreted, and a reduced model was built in order to increase the model’s predictive power. Another model was created by automatically selecting variables for inclusion; however, it yielded the same results as the reduced model. Using goodness-of-fit measures, the reduced (or forward) model was selected, and its logit coefficients were exponentiated into odds ratios for increased interpretability.