1 Introduction

In the following report, we use survey data with 129 respondents to assess the effect of various socioeconomic and demographic factors on employment status. The respondents are all formerly incarcerated or have a criminal record (felony, misdemeanor, or summary offense).

1.1 Data description

The data is a subset of a much larger data set containing 130 respondents and 64 variables. The variables range from various demographic characteristics (age, race, gender, etc.) and socioeconomic attributes (employment status, income, criminal record, education, etc.). Our current data set was created by subsetting the primary data set by selecting the following variables:

Unemployed: determines whether the respondent is unemployed (1 for unemployed, 0 for employed)
felony: determines whether the respondent has a felony offense on their criminal record (1 for felony offense, 0 otherwise)
misdemeanor: determines whether the respondent has a misdemeanor offense on their criminal record (1 for misdemeanor offense, 0 otherwise)
summary: determines whether the respondent has a summary offense on their criminal record (1 for summary offense, 0 otherwise)
gender: determines the gender of the respondent. All respondents were binary (Male or Female); however, the survey also allowed for Transgender, Non-binary, etc.
black: specifies the race of the respondent (1 for Black, 0 otherwise) ** there were more than solely Black respondents; however, for this analysis we will only look at Black respondents

1.2 Practical Question

How does various socioeconomic and demographic attributes affect the employment status of individuals with a criminal record? Special consideration is paid to the race and criminal record of the individual.

2 Analysis

There will be multiple components to the following analysis conducted below:

A preliminary analysis of the data in which a basic understanding of the variables will be revealed in various tables and plots.
Building a logistic model to best fit the data.
Splitting of the data and a cross-validation in which a confusion matrix will be constructed and the accuracy of the full model will be assessed

There will be three main components to the preliminary analysis:

a summary statistics table
a graphical representation of employment status based upon criminal record
a pairwise scatterplot that assesses the correlation between variables

2.1.1 Load, Clean, and Summarize Data

data <- readxl::read_xlsx("C:/Users/Angelo/West Chester University of PA/Saboe, Matt B. - ZFolder_AS/Clean_Slate_Survey_Data_Cleaned_2023_UNFINISHED.xlsx") %>%
  dplyr::select(6, 7, 8, 11, 13, 36, 37) %>%
  na.omit()
  # read in data, select variables, omit missing values

data$gender <- ifelse(data$gender=="Male",1,0) # make gender variable numeric binary
data$unemployed <- ifelse(data$unemployed_look==1|data$unemployed_notlook==1,1,0)
data <- data %>%
  dplyr::select(1:5,8)
pander(summary(data),
       caption = "Summary Statistics Table") # summary statistics table

Summary Statistics Table (continued below)
felony	misdemeanor	summary	gender
Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000
1st Qu.:0.0000	1st Qu.:0.0000	1st Qu.:0.0000	1st Qu.:0.0000
Median :0.0000	Median :1.0000	Median :0.0000	Median :1.0000
Mean :0.2868	Mean :0.5039	Mean :0.2093	Mean :0.5426
3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.:0.0000	3rd Qu.:1.0000
Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000

black	unemployed
Min. :0.0000	Min. :0.0000
1st Qu.:0.0000	1st Qu.:0.0000
Median :1.0000	Median :0.0000
Mean :0.6589	Mean :0.2248
3rd Qu.:1.0000	3rd Qu.:0.0000
Max. :1.0000	Max. :1.0000

The summary statistics table yields the following results:

28.68 percent of respondents have a felony offense
50.39 percent of respondents have a misdemeanor offense
20.93 percent of respondents have a summary offense
54.26 percent of respondents are Male
65.89 percent of respondents are Black
- 93.8 percent of respondents are Black and/or White
22.48 percent of respondents are unemployed

2.1.2 Crosstables of Employment Status and Socioeconomic/Demographic Variables

data.plot <- data

data.plot$crim_record <- ifelse(data.plot$felony==1,"Felony",ifelse(data.plot$misdemeanor==1,"Misdemeanor",ifelse(data.plot$summary==1,"Summary","Other"))) %>%
  factor(levels = c("Other", "Summary","Misdemeanor","Felony"))
  data.plot <- na.omit(data.plot)

ggplot2::ggplot(data = data.plot) + 
  geom_bar(aes(x=after_stat(count)/sum(after_stat(count))*100, y=unemployed,fill = crim_record),position = position_stack(reverse = TRUE), width = .35)+
  labs(y="Employment Status",x="Percentage", title = "Criminal Record and Unemployment", fill="Criminal Record: ", subtitle = "Colored by Criminal Record and Distinguished By Employment Status")+
  scale_y_discrete(limits = c(0, 1), 
                     expand = c(0, 0),
                   labels = c("Employed","Unemployed"))+
  scale_fill_brewer(palette = "Blues")+
  theme_minimal()+
  theme(legend.position = "bottom", axis.text = element_text(size = 10))

2.1.3 Pairwise Comparison of Variables

ggpairs(data, columns = 1:6)

The pairwise comparison plot shows the correlations as well as the distributions between and for all variables. Our column of interest is the Unemployed column as felony, misdemeanor, and summary are all negatively correlated with Unemployed, whereas, gender=Male and black are positively correlated.

2.2 Model Building and Cross-Validation

In the following section we will develop a five-fold cross-validation to resample the data for prediction purposes. We will use the full model to cross-validate.

k=5
fold.size = round(dim(data)[1]/k)

## PE vectors for candidate models
PE1 = rep(0,k)

for(i in 1:k){
  ## Training and testing folds
  valid.id = (fold.size*(i-1)+1):(fold.size*i)
  valid = data[valid.id, ]

  train.dat = data[-valid.id,]

  ##  full model

  m01 = glm(unemployed ~., family = binomial(link = "logit"), data = train.dat) 
 

  ##  predicted probabilities of each candidate model

  pred01 = predict(m01, newdata = valid, type="response")
 
  ## confusion matrix: ftable() will 

  confusion01 = ftable(as.vector((pred01>0.5)),(valid$unemployed==0))

  PE1[i] = (confusion01[1,1]+confusion01[1,2])/length(pred01)
  
}

accuracy <- (confusion01[1,1])/length(pred01)

The final cross-validated model has an accurracy equal to 19.23 percent. The accuracy is far below 50 percent and is therefore rather ineffective at prediction; however, the model may still have utility.

Predictive Modeling on Unemployment and Various Socioeconomic and Demographic Factors

Angelo Saporito

2023-02-22