In the following report, we use survey data with 129 respondents to assess the effect of various socioeconomic and demographic factors on employment status. The respondents are all formerly incarcerated or have a criminal record (felony, misdemeanor, or summary offense).
The data is a subset of a much larger data set containing 130 respondents and 64 variables. The variables range from various demographic characteristics (age, race, gender, etc.) and socioeconomic attributes (employment status, income, criminal record, education, etc.). Our current data set was created by subsetting the primary data set by selecting the following variables:
How does various socioeconomic and demographic attributes affect the employment status of individuals with a criminal record? Special consideration is paid to the race and criminal record of the individual.
There will be multiple components to the following analysis conducted below:
There will be three main components to the preliminary analysis:
data <- readxl::read_xlsx("C:/Users/Angelo/West Chester University of PA/Saboe, Matt B. - ZFolder_AS/Clean_Slate_Survey_Data_Cleaned_2023_UNFINISHED.xlsx") %>%
dplyr::select(6, 7, 8, 11, 13, 36, 37) %>%
na.omit()
# read in data, select variables, omit missing values
data$gender <- ifelse(data$gender=="Male",1,0) # make gender variable numeric binary
data$unemployed <- ifelse(data$unemployed_look==1|data$unemployed_notlook==1,1,0)
data <- data %>%
dplyr::select(1:5,8)
pander(summary(data),
caption = "Summary Statistics Table") # summary statistics table
| felony | misdemeanor | summary | gender |
|---|---|---|---|
| Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 |
| 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 |
| Median :0.0000 | Median :1.0000 | Median :0.0000 | Median :1.0000 |
| Mean :0.2868 | Mean :0.5039 | Mean :0.2093 | Mean :0.5426 |
| 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:0.0000 | 3rd Qu.:1.0000 |
| Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 |
| black | unemployed |
|---|---|
| Min. :0.0000 | Min. :0.0000 |
| 1st Qu.:0.0000 | 1st Qu.:0.0000 |
| Median :1.0000 | Median :0.0000 |
| Mean :0.6589 | Mean :0.2248 |
| 3rd Qu.:1.0000 | 3rd Qu.:0.0000 |
| Max. :1.0000 | Max. :1.0000 |
The summary statistics table yields the following results:
data.plot <- data
data.plot$crim_record <- ifelse(data.plot$felony==1,"Felony",ifelse(data.plot$misdemeanor==1,"Misdemeanor",ifelse(data.plot$summary==1,"Summary","Other"))) %>%
factor(levels = c("Other", "Summary","Misdemeanor","Felony"))
data.plot <- na.omit(data.plot)
ggplot2::ggplot(data = data.plot) +
geom_bar(aes(x=after_stat(count)/sum(after_stat(count))*100, y=unemployed,fill = crim_record),position = position_stack(reverse = TRUE), width = .35)+
labs(y="Employment Status",x="Percentage", title = "Criminal Record and Unemployment", fill="Criminal Record: ", subtitle = "Colored by Criminal Record and Distinguished By Employment Status")+
scale_y_discrete(limits = c(0, 1),
expand = c(0, 0),
labels = c("Employed","Unemployed"))+
scale_fill_brewer(palette = "Blues")+
theme_minimal()+
theme(legend.position = "bottom", axis.text = element_text(size = 10))
ggpairs(data, columns = 1:6)
The pairwise comparison plot shows the correlations as well as the distributions between and for all variables. Our column of interest is the Unemployed column as felony, misdemeanor, and summary are all negatively correlated with Unemployed, whereas, gender=Male and black are positively correlated.
In the following section we will develop a five-fold cross-validation to resample the data for prediction purposes. We will use the full model to cross-validate.
k=5
fold.size = round(dim(data)[1]/k)
## PE vectors for candidate models
PE1 = rep(0,k)
for(i in 1:k){
## Training and testing folds
valid.id = (fold.size*(i-1)+1):(fold.size*i)
valid = data[valid.id, ]
train.dat = data[-valid.id,]
## full model
m01 = glm(unemployed ~., family = binomial(link = "logit"), data = train.dat)
## predicted probabilities of each candidate model
pred01 = predict(m01, newdata = valid, type="response")
## confusion matrix: ftable() will
confusion01 = ftable(as.vector((pred01>0.5)),(valid$unemployed==0))
PE1[i] = (confusion01[1,1]+confusion01[1,2])/length(pred01)
}
accuracy <- (confusion01[1,1])/length(pred01)
The final cross-validated model has an accurracy equal to 19.23 percent. The accuracy is far below 50 percent and is therefore rather ineffective at prediction; however, the model may still have utility.
In the above sections, we developed a narrative to understand what factors contribute to unemployment based on socioeconomic and demographic variables. A preliminary analysis of the data was developed and a final model was built to understand what variables contribute to unemployment. Finally, the data was split and cross-validated to determine the accuracy of the full model. The model’s accurracy was assessed to be weak and therefore the model was determined to be relatively poor at prediction.