1 Introduction

In the following report, we use survey data with 129 respondents to assess the effect of various socioeconomic and demographic factors on employment status. The respondents are all formerly incarcerated or have a criminal record (felony, misdemeanor, or summary offense).

1.1 Data description

The data is a subset of a much larger data set containing 130 respondents and 64 variables. The variables range from various demographic characteristics (age, race, gender, etc.) and socioeconomic attributes (employment status, income, criminal record, education, etc.). Our current data set was created by subsetting the primary data set by selecting the following variables:

  1. Unemployed: determines whether the respondent is unemployed (1 for unemployed, 0 for employed)
  2. felony: determines whether the respondent has a felony offense on their criminal record (1 for felony offense, 0 otherwise)
  3. misdemeanor: determines whether the respondent has a misdemeanor offense on their criminal record (1 for misdemeanor offense, 0 otherwise)
  4. summary: determines whether the respondent has a summary offense on their criminal record (1 for summary offense, 0 otherwise)
  5. gender: determines the gender of the respondent. All respondents were binary (Male or Female); however, the survey also allowed for Transgender, Non-binary, etc.
  6. black: specifies the race of the respondent (1 for Black, 0 otherwise) ** there were more than solely Black respondents; however, for this analysis we will only look at Black respondents

1.2 Practical Question

How does various socioeconomic and demographic attributes affect the employment status of individuals with a criminal record? Special consideration is paid to the race and criminal record of the individual.

2 Analysis

There will be multiple components to the following analysis conducted below:

  1. A preliminary analysis of the data in which a basic understanding of the variables will be revealed in various tables and plots.
  2. Building a logistic model to best fit the data.
  3. Splitting of the data and a cross-validation in which a confusion matrix will be constructed and the accuracy of the full model will be assessed

There will be three main components to the preliminary analysis:

2.1.1 Load, Clean, and Summarize Data

data <- readxl::read_xlsx("C:/Users/Angelo/West Chester University of PA/Saboe, Matt B. - ZFolder_AS/Clean_Slate_Survey_Data_Cleaned_2023_UNFINISHED.xlsx") %>%
  dplyr::select(6, 7, 8, 11, 13, 36, 37) %>%
  na.omit()
  # read in data, select variables, omit missing values

data$gender <- ifelse(data$gender=="Male",1,0) # make gender variable numeric binary
data$unemployed <- ifelse(data$unemployed_look==1|data$unemployed_notlook==1,1,0)
data <- data %>%
  dplyr::select(1:5,8)
pander(summary(data),
       caption = "Summary Statistics Table") # summary statistics table
Summary Statistics Table (continued below)
felony misdemeanor summary gender
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :1.0000 Median :0.0000 Median :1.0000
Mean :0.2868 Mean :0.5039 Mean :0.2093 Mean :0.5426
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
black unemployed
Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000
Median :1.0000 Median :0.0000
Mean :0.6589 Mean :0.2248
3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.0000

The summary statistics table yields the following results:

  • 28.68 percent of respondents have a felony offense
  • 50.39 percent of respondents have a misdemeanor offense
  • 20.93 percent of respondents have a summary offense
  • 54.26 percent of respondents are Male
  • 65.89 percent of respondents are Black
    • 93.8 percent of respondents are Black and/or White
  • 22.48 percent of respondents are unemployed

2.1.2 Crosstables of Employment Status and Socioeconomic/Demographic Variables

data.plot <- data

data.plot$crim_record <- ifelse(data.plot$felony==1,"Felony",ifelse(data.plot$misdemeanor==1,"Misdemeanor",ifelse(data.plot$summary==1,"Summary","Other"))) %>%
  factor(levels = c("Other", "Summary","Misdemeanor","Felony"))
  data.plot <- na.omit(data.plot)

ggplot2::ggplot(data = data.plot) + 
  geom_bar(aes(x=after_stat(count)/sum(after_stat(count))*100, y=unemployed,fill = crim_record),position = position_stack(reverse = TRUE), width = .35)+
  labs(y="Employment Status",x="Percentage", title = "Criminal Record and Unemployment", fill="Criminal Record: ", subtitle = "Colored by Criminal Record and Distinguished By Employment Status")+
  scale_y_discrete(limits = c(0, 1), 
                     expand = c(0, 0),
                   labels = c("Employed","Unemployed"))+
  scale_fill_brewer(palette = "Blues")+
  theme_minimal()+
  theme(legend.position = "bottom", axis.text = element_text(size = 10))

2.1.3 Pairwise Comparison of Variables

ggpairs(data, columns = 1:6)

The pairwise comparison plot shows the correlations as well as the distributions between and for all variables. Our column of interest is the Unemployed column as felony, misdemeanor, and summary are all negatively correlated with Unemployed, whereas, gender=Male and black are positively correlated.

2.2 Model Building and Cross-Validation

In the following section we will develop a five-fold cross-validation to resample the data for prediction purposes. We will use the full model to cross-validate.

k=5
fold.size = round(dim(data)[1]/k)

## PE vectors for candidate models
PE1 = rep(0,k)

for(i in 1:k){
  ## Training and testing folds
  valid.id = (fold.size*(i-1)+1):(fold.size*i)
  valid = data[valid.id, ]

  train.dat = data[-valid.id,]

  ##  full model

  m01 = glm(unemployed ~., family = binomial(link = "logit"), data = train.dat) 
 

  ##  predicted probabilities of each candidate model

  pred01 = predict(m01, newdata = valid, type="response")
 
  ## confusion matrix: ftable() will 

  confusion01 = ftable(as.vector((pred01>0.5)),(valid$unemployed==0))

  PE1[i] = (confusion01[1,1]+confusion01[1,2])/length(pred01)
  
}

accuracy <- (confusion01[1,1])/length(pred01)

The final cross-validated model has an accurracy equal to 19.23 percent. The accuracy is far below 50 percent and is therefore rather ineffective at prediction; however, the model may still have utility.

3 Conclusion and Final Remarks

In the above sections, we developed a narrative to understand what factors contribute to unemployment based on socioeconomic and demographic variables. A preliminary analysis of the data was developed and a final model was built to understand what variables contribute to unemployment. Finally, the data was split and cross-validated to determine the accuracy of the full model. The model’s accurracy was assessed to be weak and therefore the model was determined to be relatively poor at prediction.