1 Introduction

In the following report, we use survey data with 103 respondents to assess the effect of various socioeconomic and demographic factors on employment status. The respondents are all formerly incarcerated or have a criminal record (felony, misdemeanor, or summary offense).

1.1 Data description

The data is a subset of a much larger data set containing 131 respondents and 64 variables. The variables cover demographic characteristics (age, race, gender, etc.) and socioeconomic attributes (employment status, income, criminal record, education, etc.). Our current data set was created by selecting the following variables from the primary data set and omitting respondents with missing values, which leaves the 103 respondents analyzed here:

  1. Unemployed: indicates whether the respondent is unemployed (1 for unemployed, 0 for employed)
  2. felony: indicates whether the respondent has a felony offense on their criminal record (1 for felony offense, 0 otherwise)
  3. misdemeanor: indicates whether the respondent has a misdemeanor offense on their criminal record (1 for misdemeanor offense, 0 otherwise)
  4. summary: indicates whether the respondent has a summary offense on their criminal record (1 for summary offense, 0 otherwise)
  5. gender: indicates the gender of the respondent. All respondents identified as Male or Female, although the survey also allowed for Transgender, Non-binary, etc. In the analysis, gender is recoded as 1 for Male and 0 for Female.
  6. black: indicates whether the respondent is Black (1 for Black, 0 otherwise)
  7. white: indicates whether the respondent is White (1 for White, 0 otherwise)

1.2 Practical Question

How do various socioeconomic and demographic attributes affect the employment status of individuals with a criminal record? Special consideration is paid to the race and criminal record of the individual.

2 Analysis

The analysis below has four components:

  1. A preliminary analysis of the data, in which the variables are summarized in tables and plots.
  2. Building logistic models to fit the data. There will be three models: a full model, a reduced model, and a forward-selection model.
  3. Goodness-of-fit measures, computed in order to determine the best model for classification.
  4. Final model selection and reinterpretation of the coefficients as odds ratios.

2.1 Preliminary Analysis

There will be three main components to the preliminary analysis:

  • a summary statistics table
  • two crosstables that report counts with column and row percentages
  • a pairwise plot that shows the correlations between variables

2.1.1 Load, Clean, and Summarize Data

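# packages used throughout the report (MASS is attached before dplyr so that dplyr::select() is not masked)
library(MASS); library(dplyr); library(pander)
library(crosstable); library(GGally); library(knitr)

data <- readxl::read_xlsx("C:/Users/Angelo/OneDrive/Desktop/College Babyyyyyyy/RESEARCH/Clean Slate/data.xlsx") %>%
  select(6:8,11,13:14,38) %>%
  na.omit() # read in data, select variables, omit missing values
data$gender <- ifelse(data$gender=="Male",1,0) # make gender variable numeric binary (1 = Male, 0 = Female)
pander(summary(data),
       caption = "Summary Statistics Table") # summary statistics table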
Summary Statistics Table
           felony   misdemeanor   summary   gender   black    white    Unemployed
Min.       0.0000   0.0000        0.0000    0.0000   0.0000   0.0000   0.0000
1st Qu.    0.0000   0.0000        0.0000    0.0000   0.0000   0.0000   0.0000
Median     0.0000   0.0000        0.0000    1.0000   1.0000   0.0000   0.0000
Mean       0.2816   0.4951        0.1748    0.5437   0.6893   0.2427   0.2816
3rd Qu.    1.0000   1.0000        0.0000    1.0000   1.0000   0.0000   1.0000
Max.       1.0000   1.0000        1.0000    1.0000   1.0000   1.0000   1.0000

The summary statistics table yields the following results:

  • 28.16 percent of respondents have a felony offense
  • 49.51 percent of respondents have a misdemeanor offense
  • 17.48 percent of respondents have a summary offense
  • 54.37 percent of respondents are Male
  • 68.93 percent of respondents are Black
  • 24.27 percent of respondents are White
    • 93.2 percent of respondents are Black and/or White
  • 28.16 percent of respondents are unemployed
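
Because every variable is coded 0/1, each mean in the summary table is simply the share of respondents coded 1. The following is a minimal sketch that reproduces the percentages from counts rather than from the raw data. The counts are read off the crosstables in Section 2.1.2, except for white = 25, which is an assumption inferred from the reported mean of 0.2427 and n = 103.

```r
# Number of respondents coded 1 for each indicator.
# All counts come from the crosstables in Section 2.1.2, except white,
# which is ASSUMED to be 25 (0.2427 * 103, rounded).
n <- 103
ones <- c(felony = 29, misdemeanor = 51, summary = 18, gender = 56,
          black = 71, white = 25, Unemployed = 29)

# For a 0/1 variable the mean equals the proportion of 1s.
pct <- round(100 * ones / n, 2)
print(pct)
```

The printed values can be compared line by line with the bullet list above.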

2.1.2 Crosstables of Employment Status and Socioeconomic/Demographic Variables

crosstable(data, felony+misdemeanor+summary~Unemployed, showNA="no",
           percent_digits=0, percent_pattern="{n} ({p_col}/{p_row})") %>% 
  as_flextable()

                        Unemployed
label         variable  0              1
felony        0         51 (69%/69%)   23 (79%/31%)
              1         23 (31%/79%)   6 (21%/21%)
misdemeanor   0         36 (49%/69%)   16 (55%/31%)
              1         38 (51%/75%)   13 (45%/25%)
summary       0         59 (80%/69%)   26 (90%/31%)
              1         15 (20%/83%)   3 (10%/17%)

crosstable(data, Unemployed~black+gender, showNA="no",
           percent_digits=0, percent_pattern="{n} ({p_col}/{p_row})") %>% 
  as_flextable()

                       gender=0                      gender=1
label        variable  black=0        black=1        black=0        black=1
Unemployed   0         13 (81%/18%)   21 (68%/28%)   12 (75%/16%)   28 (70%/38%)
             1         3 (19%/10%)    10 (32%/34%)   4 (25%/14%)    12 (30%/41%)

The first crosstable shows the employment status of respondents based on their criminal record. Each cell reports the count followed by the column percentage and the row percentage. For example, reading the row percentages, 79 percent of individuals with a felony record are employed and 21 percent are unemployed.

The second crosstable shows the employment status based on race and gender. Here the column percentages give the split within each group. Take for example Black males: 70 percent are employed and 30 percent are unemployed.
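
To make the difference between the two percentages concrete, the sketch below recomputes them for the felony rows of the first crosstable with base R's `prop.table()`. The counts are copied from the table. The object names are illustrative and not part of the original analysis.

```r
# Counts for felony (rows) by Unemployed (columns), copied from the first crosstable.
tab <- matrix(c(51, 23,
                23,  6),
              nrow = 2, byrow = TRUE,
              dimnames = list(felony = c("0", "1"), Unemployed = c("0", "1")))

# Row percentages: employment split within each felony group.
row_pct <- round(100 * prop.table(tab, margin = 1))
# Column percentages: felony split within each employment group.
col_pct <- round(100 * prop.table(tab, margin = 2))

print(row_pct)
print(col_pct)
```

In `row_pct`, the felony = 1 row gives the 79 percent employed and 21 percent unemployed quoted above.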

2.1.3 Pairwise Comparison of Variables

ggpairs(data, columns = 1:7)

The pairwise comparison plot shows the distribution of each variable and the pairwise correlations between all variables. Our column of interest is Unemployed: felony, misdemeanor, summary, and white are all negatively correlated with Unemployed, whereas gender (1 = Male) and black are positively correlated.
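
For two 0/1 variables, the Pearson correlation can be recovered directly from a 2x2 table of counts. The sketch below does this for felony and Unemployed using the counts in the first crosstable of Section 2.1.2. It assumes the plot reports Pearson correlations, and the vector names are illustrative.

```r
# Cell counts from the first crosstable, in the order
# (felony=0, Unemployed=0), (0, 1), (1, 0), (1, 1).
counts <- c(51, 23, 23, 6)

# Rebuild the two 0/1 vectors so that they reproduce those cell counts.
felony     <- rep(c(0, 0, 1, 1), times = counts)
unemployed <- rep(c(0, 1, 0, 1), times = counts)

# Pearson correlation of two binary variables (the phi coefficient).
phi <- cor(felony, unemployed)
print(round(phi, 4))
```

The result is negative and small in magnitude (about -0.10), which agrees with the direction of association described above.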

2.2 Model Building

In the following section, we will build three logit models to best represent our data: a full model in which all variables are included, a reduced model in which variables are manually selected, and a forward model in which variables are automatically selected.

2.2.1 Full Logit Model

full.model = glm(Unemployed~., 
         family = binomial(link = "logit"),  #  logit(p) = log(p/(1-p))!
         data = data) 

pander(summary(full.model)$coef, 
   caption="Summary of inferential statistics of the full model")
Summary of inferential statistics of the full model
              Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)   -2.176     1.236        -1.76     0.07834
felony        -0.6818    0.5557       -1.227    0.2198
misdemeanor   -0.245     0.4615       -0.531    0.5954
summary       -0.9108    0.7276       -1.252    0.2106
gender        0.02418    0.4608       0.05248   0.9581
black         1.773      1.186        1.494     0.1351
white         1.551      1.22         1.271     0.2036

After building the full model, we find that the model yields the following results:

  1. the intercept is statistically significant at \(\alpha=0.1\) since \(p = 0.0783\); this p-value tests the intercept only and is not a test of the model as a whole
  2. the intercept coefficient is -2.176 (\(\beta_0=-2.176\))
  3. the independent variables’ coefficients and significance are as follows:
    • felony has a coefficient of -0.6818 meaning that the variable is negatively associated with unemployment. The variable is not statistically significant as the p-value is equal to 0.2198.
    • misdemeanor has a coefficient of -0.245 meaning that the variable is negatively associated with unemployment. The variable is not statistically significant as the p-value is equal to 0.5954.
    • summary has a coefficient of -0.9108 meaning that the variable is negatively associated with unemployment. The variable is not statistically significant at a p-value of 0.2106.
    • gender has a coefficient of 0.02418 meaning that when gender=Male or 1, the variable is positively associated with unemployment. The variable is not statistically significant as the p-value is equal to 0.9581.
    • black has a coefficient of 1.773 meaning that a positive response (Black or 1) is positively associated with unemployment. The variable is not statistically significant at a p-value of 0.1351.
    • white has a coefficient of 1.551 meaning that a positive response (White or 1) is positively associated with unemployment. The variable is not statistically significant at a p-value equal to 0.2036.
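
Written out, the fitted full model corresponds to the equation below, where \(p\) denotes the probability that a respondent is unemployed (Unemployed = 1) and the coefficients are copied from the table above.

```latex
\[
\log\frac{p}{1-p}
  = -2.176
  - 0.6818\,\text{felony}
  - 0.245\,\text{misdemeanor}
  - 0.9108\,\text{summary}
  + 0.02418\,\text{gender}
  + 1.773\,\text{black}
  + 1.551\,\text{white}
\]
```

Each coefficient is the change in the log-odds of unemployment when the corresponding indicator moves from 0 to 1 with the other variables held fixed.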

2.2.2 Reduced Logit Model

reduced.model = glm(Unemployed~.-gender-misdemeanor-white, 
         family = binomial(link = "logit"),  #  logit(p) = log(p/(1-p))!
         data = data) 

pander(summary(reduced.model)$coef, 
   caption="Summary of inferential statistics of the reduced model")
Summary of inferential statistics of the reduced model
              Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)   -1.037     0.4507       -2.301    0.02137
felony        -0.7189    0.5407       -1.33     0.1837
summary       -0.7676    0.6845       -1.121    0.2621
black         0.5617     0.5186       1.083     0.2788

After reducing the full model, we find that the reduced model yields the following results:

  1. the intercept is statistically significant at \(\alpha=0.05\) since \(p = 0.0214\); this p-value tests the intercept only and is not a test of the model as a whole
  2. the intercept coefficient is -1.037 (\(\beta_0=-1.037\))
  3. the independent variables’ coefficients and significance are as follows:
    • felony has a coefficient of -0.7189 meaning that the variable is negatively associated with unemployment. The variable is not statistically significant as the p-value is equal to 0.1837.
    • summary has a coefficient of -0.7676 meaning that the variable is negatively associated with unemployment. The variable is not statistically significant at a p-value of 0.2621.
    • black has a coefficient of 0.5617 meaning that a positive response (Black or 1) is positively associated with unemployment. The variable is not statistically significant at a p-value of 0.2788.

The variance inflation factors of black and white in the full model are greater than 5, hinting at some multicollinearity. The reduced model as fitted above drops white and keeps black for narrative purposes.
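
For reference, the threshold of 5 can be read through the usual definition of the variance inflation factor for a single predictor, where \(R_j^2\) is the R-squared from regressing predictor \(j\) on the remaining predictors. The report does not state how its VIFs were computed, so this definition is an assumption.

```latex
\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j > 5 \iff R_j^2 > 1 - \tfrac{1}{5} = 0.8
\]
```

Under that definition, a VIF above 5 means that more than 80 percent of the variation in one race indicator is explained by the other predictors.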

2.2.3 Forward Logit Model

final.model.forward = stepAIC(reduced.model, 
                      scope = list(lower=formula(reduced.model),upper=formula(full.model)),
                      direction = "forward",   # forward selection
                      trace = 0   # do not show the details
                      )

pander(summary(final.model.forward)$coef, 
   caption="Summary of inferential statistics of the final model")
Summary of inferential statistics of the final model
              Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)   -1.037     0.4507       -2.301    0.02137
felony        -0.7189    0.5407       -1.33     0.1837
summary       -0.7676    0.6845       -1.121    0.2621
black         0.5617     0.5186       1.083     0.2788

Forward selection started from the reduced model, with the full model as the upper bound, and added no further variables, so the forward model is identical to the reduced model. See above (Section 2.2.2) for the final results.

2.3 Goodness-of-fit Measures and Model Selection

# collect global goodness-of-fit statistics from a fitted glm object
global.measure = function(s.logit){
  dev.resid = s.logit$deviance          # residual deviance
  dev.0.resid = s.logit$null.deviance   # null deviance
  aic = s.logit$aic
  cbind(Deviance.residual = dev.resid, Null.Deviance.Residual = dev.0.resid,
        AIC = aic)
}
goodness = rbind(full.model = global.measure(full.model),
                 reduced.model = global.measure(reduced.model),
                 final.model = global.measure(final.model.forward))
# rbind() ignores argument names for matrix inputs, so set the row names explicitly
row.names(goodness) = c("full.model", "reduced.model", "final.model")
kable(goodness, caption = "Comparison of global goodness-of-fit statistics")
Comparison of global goodness-of-fit statistics
                Deviance.residual   Null.Deviance.Residual   AIC
full.model      116.0743            122.4494                 130.0743
reduced.model   118.3444            122.4494                 126.3444
final.model     118.3444            122.4494                 126.3444

We used global goodness-of-fit measures (residual deviance, null deviance, and AIC) to determine which model was superior for classification purposes. Since the reduced and forward models are identical, it was decided that the forward model would be used, as its AIC (126.34) is lower than that of the full model (130.07).
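
The figures in the goodness-of-fit table can be cross-checked by hand. The sketch below recomputes the AIC from the residual deviance and also forms a likelihood-ratio statistic against the intercept-only model. It uses only the deviances printed above. The parameter counts (7 and 4, including the intercept) are taken from the coefficient tables, and the likelihood-ratio test is an addition for illustration rather than part of the original analysis.

```r
# Deviances as printed in the goodness-of-fit table.
null_dev  <- 122.4494
resid_dev <- c(full = 116.0743, reduced = 118.3444)

# Number of estimated coefficients, including the intercept.
k <- c(full = 7, reduced = 4)

# For a 0/1 response the residual deviance equals -2 * log-likelihood,
# so AIC = deviance + 2 * k.
aic <- resid_dev + 2 * k

# Likelihood-ratio statistic against the null model and its
# chi-square p-value with k - 1 degrees of freedom.
lr_stat <- null_dev - resid_dev
lr_p    <- pchisq(lr_stat, df = k - 1, lower.tail = FALSE)

print(rbind(AIC = aic, LR = lr_stat, p.value = lr_p))
```

The recomputed AIC values match the table. Under these assumptions, both likelihood-ratio p-values come out well above 0.1, so the AIC comparison ranks the models without showing that either one improves significantly on the intercept-only model.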

model.coef.stats = summary(reduced.model)$coef
odds.ratio = exp(coef(reduced.model))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")
Summary Stats with Odds Ratios
              Estimate     Std. Error   z value     Pr(>|z|)    odds.ratio
(Intercept)   -1.0372839   0.4507357    -2.301313   0.0213740   0.3544160
felony        -0.7188880   0.5407112    -1.329523   0.1836755   0.4872938
summary       -0.7675842   0.6844728    -1.121424   0.2621074   0.4641330
black         0.5617290    0.5186215    1.083119    0.2787554   1.7537020

After deciding on the final model, we exponentiated the coefficients to obtain odds ratios for increased interpretability. Based on the odds ratios in the table above, the final model yields the following results:

  1. the exponentiated intercept is 0.3544 (\(e^{\beta_0}=0.3544\)) meaning that, holding all explanatory variables constant at 0, the odds of being unemployed are 0.354.
  2. the independent variables’ odds ratios are as follows:
    • felony has an odds ratio of 0.4873 meaning that the odds of being unemployed are about 51 percent lower for an individual with a felony offense than for an individual without one, holding the other variables constant.
    • summary has an odds ratio of 0.4641 meaning that the odds of being unemployed are about 54 percent lower for an individual with a summary offense than for an individual without one, holding the other variables constant.
    • black has an odds ratio of 1.7537 meaning that the odds of being unemployed are about 75 percent higher for a Black individual than for a non-Black individual, holding the other variables constant.

None of these odds ratios is statistically significant at \(\alpha=0.1\), as shown by the p-values in the table above.
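
As an illustration of how the "Summary Stats with Odds Ratios" table can be read, the sketch below converts the printed estimates into odds ratios, percent changes in the odds, and approximate 95 percent Wald intervals. The estimates and standard errors are copied from that table. The Wald intervals and the baseline probability are additions for illustration, not results reported in the original analysis.

```r
# Estimates and standard errors copied from the odds-ratio table.
est <- c(intercept = -1.0372839, felony = -0.7188880,
         summary = -0.7675842, black = 0.5617290)
se  <- c(intercept = 0.4507357, felony = 0.5407112,
         summary = 0.6844728, black = 0.5186215)

# Odds ratio = exp(coefficient); percent change in the odds = 100 * (OR - 1).
odds_ratio <- exp(est)
pct_change <- 100 * (odds_ratio - 1)

# Approximate 95% Wald interval on the odds-ratio scale.
z  <- qnorm(0.975)
ci <- exp(cbind(lower = est - z * se, upper = est + z * se))

# Predicted probability of unemployment when felony, summary and black are all 0.
p_baseline <- plogis(est[["intercept"]])

print(round(cbind(odds_ratio, pct_change, ci), 3))
print(round(p_baseline, 3))
```

An interval that contains 1 on the odds-ratio scale corresponds to a coefficient that is not significant at the 5 percent level.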

3 Conclusion

In the analysis above, a preliminary analysis of the data described the distributions of, and correlations between, the response and explanatory variables. A multiple logistic regression model was fitted to test the relationship between Unemployed and felony, misdemeanor, summary, gender, black, and white. The results of the full model were interpreted and a reduced model was built in order to increase the model’s predictive power. Another model was created by automatically selecting variables for inclusion; however, that model yielded the same results as the reduced model. Using global goodness-of-fit measures, the reduced (or forward) model was selected for use and its logit coefficients were converted to odds ratios for increased interpretability. None of the explanatory variables was statistically significant at \(\alpha=0.1\) in either model.