Do certain resume components predict the chance of receiving a call back from a job application, and which attributes have the greatest influence on callback probability?
The dataset that I used for my final project is the resume dataset that can be found in the openintro repository. This data was collected by Bertrand and Mullainathan in a study that monitored job postings in Boston and Chicago for several months during 2001 and 2002 and used them to build up a set of test cases. Our research question is: do certain resume components predict the chance of receiving a callback from a job application, and which attributes have the greatest influence on callback probability? The dataset contains 4870 observations on 30 variables, where each observation is a resume that was submitted to a real job posting in Boston or Chicago during 2001 and 2002. The key variable in this dataset is received_callback, which records whether the employer contacted the applicant after the resume was submitted. The other variables that I used to predict received_callback are race, gender, honors, years_experience, college_degree, has_email_address, military, and special_skills. I chose this dataset and topic because I think statistical analysis could produce significant and intriguing findings that help explain the prejudice and discrimination that still exist in the application process today. You can find this dataset at https://www.openintro.org/data/index.php?data=resume, where the study is listed as the source.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Documents/Data 101")
resume <- read_csv("resume.csv")
## Rows: 4870 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): job_city, job_industry, job_type, job_ownership, job_req_min_exper...
## dbl (20): job_ad_id, job_fed_contractor, job_equal_opp_employer, job_req_any...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Before we did the logistic regression, we did some data wrangling and cleaning to prepare the dataset. First we selected the 8 predictor variables and the outcome using the select() function. Then we used mutate() to convert the race and gender variables into factors. We also did this for the received_callback variable, except we stated the factor levels as 0 and 1, so we did not have any trouble doing our statistical analysis. Lastly, we used summarise() to calculate the overall callback rate, showing that only 8% of resumes received a callback. This 8% was later used as the classification threshold (0.08) for our confusion matrix.
names(resume) <- tolower(names(resume))
names(resume) <- gsub(" ","_",names(resume))
names(resume) <- gsub("\\.","_",names(resume))
head(resume)
## # A tibble: 6 × 30
## job_ad_id job_city job_industry job_type job_fed_contractor
## <dbl> <chr> <chr> <chr> <dbl>
## 1 384 Chicago manufacturing supervisor NA
## 2 384 Chicago manufacturing supervisor NA
## 3 384 Chicago manufacturing supervisor NA
## 4 384 Chicago manufacturing supervisor NA
## 5 385 Chicago other_service secretary 0
## 6 386 Chicago wholesale_and_retail_trade sales_rep 0
## # ℹ 25 more variables: job_equal_opp_employer <dbl>, job_ownership <chr>,
## # job_req_any <dbl>, job_req_communication <dbl>, job_req_education <dbl>,
## # job_req_min_experience <chr>, job_req_computer <dbl>,
## # job_req_organization <dbl>, job_req_school <chr>, received_callback <dbl>,
## # firstname <chr>, race <chr>, gender <chr>, years_college <dbl>,
## # college_degree <dbl>, honors <dbl>, worked_during_school <dbl>,
## # years_experience <dbl>, computer_skills <dbl>, special_skills <dbl>, …
We cleaned the column names by converting them to lowercase and replacing spaces and periods with underscores.
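As a small illustration of the name cleaning above, the same three steps can be run on a few made-up column names. The names below are hypothetical and are not columns from resume.csv; this is only a sketch of what each step does.

```r
# Hypothetical column names, chosen only to exercise each cleaning step
raw_names <- c("Job City", "received.callback", "Years Experience")

clean_names <- tolower(raw_names)             # lowercase everything
clean_names <- gsub(" ", "_", clean_names)    # spaces become underscores
clean_names <- gsub("\\.", "_", clean_names)  # literal periods become underscores

clean_names
```

The period has to be escaped as "\\." because an unescaped "." in a regular expression matches any character.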
str(resume)
## spc_tbl_ [4,870 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ job_ad_id : num [1:4870] 384 384 384 384 385 386 386 385 386 386 ...
## $ job_city : chr [1:4870] "Chicago" "Chicago" "Chicago" "Chicago" ...
## $ job_industry : chr [1:4870] "manufacturing" "manufacturing" "manufacturing" "manufacturing" ...
## $ job_type : chr [1:4870] "supervisor" "supervisor" "supervisor" "supervisor" ...
## $ job_fed_contractor : num [1:4870] NA NA NA NA 0 0 0 0 0 0 ...
## $ job_equal_opp_employer: num [1:4870] 1 1 1 1 1 1 1 1 1 1 ...
## $ job_ownership : chr [1:4870] "unknown" "unknown" "unknown" "unknown" ...
## $ job_req_any : num [1:4870] 1 1 1 1 1 0 0 1 0 0 ...
## $ job_req_communication : num [1:4870] 0 0 0 0 0 0 0 0 0 0 ...
## $ job_req_education : num [1:4870] 0 0 0 0 0 0 0 0 0 0 ...
## $ job_req_min_experience: chr [1:4870] "5" "5" "5" "5" ...
## $ job_req_computer : num [1:4870] 1 1 1 1 1 0 0 1 0 0 ...
## $ job_req_organization : num [1:4870] 0 0 0 0 1 0 0 1 0 0 ...
## $ job_req_school : chr [1:4870] "none_listed" "none_listed" "none_listed" "none_listed" ...
## $ received_callback : num [1:4870] 0 0 0 0 0 0 0 0 0 0 ...
## $ firstname : chr [1:4870] "Allison" "Kristen" "Lakisha" "Latonya" ...
## $ race : chr [1:4870] "white" "white" "black" "black" ...
## $ gender : chr [1:4870] "f" "f" "f" "f" ...
## $ years_college : num [1:4870] 4 3 4 3 3 4 4 3 4 4 ...
## $ college_degree : num [1:4870] 1 0 1 0 0 1 1 0 1 1 ...
## $ honors : num [1:4870] 0 0 0 0 0 1 0 0 0 0 ...
## $ worked_during_school : num [1:4870] 0 1 1 0 1 0 1 0 0 1 ...
## $ years_experience : num [1:4870] 6 6 6 6 22 6 5 21 3 6 ...
## $ computer_skills : num [1:4870] 1 1 1 1 1 0 1 1 1 0 ...
## $ special_skills : num [1:4870] 0 0 0 1 0 1 1 1 1 1 ...
## $ volunteer : num [1:4870] 0 1 0 1 0 0 1 1 0 1 ...
## $ military : num [1:4870] 0 1 0 0 0 0 0 0 0 0 ...
## $ employment_holes : num [1:4870] 1 0 0 1 0 0 0 1 0 0 ...
## $ has_email_address : num [1:4870] 0 1 0 1 1 0 1 1 0 1 ...
## $ resume_quality : chr [1:4870] "low" "high" "low" "high" ...
## - attr(*, "spec")=
## .. cols(
## .. job_ad_id = col_double(),
## .. job_city = col_character(),
## .. job_industry = col_character(),
## .. job_type = col_character(),
## .. job_fed_contractor = col_double(),
## .. job_equal_opp_employer = col_double(),
## .. job_ownership = col_character(),
## .. job_req_any = col_double(),
## .. job_req_communication = col_double(),
## .. job_req_education = col_double(),
## .. job_req_min_experience = col_character(),
## .. job_req_computer = col_double(),
## .. job_req_organization = col_double(),
## .. job_req_school = col_character(),
## .. received_callback = col_double(),
## .. firstname = col_character(),
## .. race = col_character(),
## .. gender = col_character(),
## .. years_college = col_double(),
## .. college_degree = col_double(),
## .. honors = col_double(),
## .. worked_during_school = col_double(),
## .. years_experience = col_double(),
## .. computer_skills = col_double(),
## .. special_skills = col_double(),
## .. volunteer = col_double(),
## .. military = col_double(),
## .. employment_holes = col_double(),
## .. has_email_address = col_double(),
## .. resume_quality = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Here we checked the structure of the dataset, looking at the variable types.
colSums(is.na(resume))
## job_ad_id job_city job_industry
## 0 0 0
## job_type job_fed_contractor job_equal_opp_employer
## 0 1768 0
## job_ownership job_req_any job_req_communication
## 0 0 0
## job_req_education job_req_min_experience job_req_computer
## 0 2746 0
## job_req_organization job_req_school received_callback
## 0 0 0
## firstname race gender
## 0 0 0
## years_college college_degree honors
## 0 0 0
## worked_during_school years_experience computer_skills
## 0 0 0
## special_skills volunteer military
## 0 0 0
## employment_holes has_email_address resume_quality
## 0 0 0
In the chunk above, we counted the NA values in each column. The job_fed_contractor and job_req_min_experience variables contain NA values, but this does not matter because we do not use them in our statistical analysis.
resume_clean <- resume |>
select(received_callback, race, gender, honors, years_experience,
college_degree, has_email_address, military, special_skills) |>
mutate(
race = as.factor(race),
gender = as.factor(gender),
received_callback = factor(received_callback, levels = c(0, 1))
)
In this chunk we selected the received_callback outcome and the 8 predictor variables that we will use for our analysis. Then I used mutate() to make race and gender factors, and set the factor levels for the received_callback variable to 0 and 1, to ensure that there were no issues with the logistic regression.
resume_clean |>
summarise(callback_rate = mean(as.numeric(received_callback) - 1),
total_callbacks = sum(as.numeric(received_callback) - 1),
total_resumes = n())
## # A tibble: 1 × 3
## callback_rate total_callbacks total_resumes
## <dbl> <dbl> <int>
## 1 0.0805 392 4870
In this chunk, I used summarize to calculate the callback rate, the total number of resumes and the total callbacks out of those resumes. We see that our callback rate is 8%, which will be key for our confusion matrix later on.
We will use a logistic regression model as our outcome variable is binary.
logistic <- glm(received_callback ~ race + gender + honors + years_experience + college_degree + has_email_address + military + special_skills, data = resume_clean, family = "binomial")
summary(logistic)
##
## Call:
## glm(formula = received_callback ~ race + gender + honors + years_experience +
## college_degree + has_email_address + military + special_skills,
## family = "binomial", data = resume_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.473459 0.177969 -19.517 < 2e-16 ***
## racewhite 0.443955 0.108498 4.092 4.28e-05 ***
## genderm -0.015386 0.133895 -0.115 0.908513
## honors 0.587857 0.187939 3.128 0.001761 **
## years_experience 0.035249 0.009739 3.619 0.000295 ***
## college_degree 0.130072 0.123184 1.056 0.291007
## has_email_address 0.110579 0.113433 0.975 0.329636
## military -0.098573 0.213390 -0.462 0.644127
## special_skills 0.799244 0.110860 7.210 5.61e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2726.9 on 4869 degrees of freedom
## Residual deviance: 2622.8 on 4861 degrees of freedom
## AIC: 2640.8
##
## Number of Fisher Scoring iterations: 5
r_square_all <- 1 - (logistic$deviance/logistic$null.deviance)
r_square_all
## [1] 0.03819633
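This value is McFadden’s pseudo R-squared: the model reduces the deviance by about 3.8% compared with the intercept-only model. It is not the same as the variance explained in a linear regression, but it shows that the predictors account for only a small share of the variation in callback outcomes.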
1 - pchisq((logistic$null.deviance - logistic$deviance), df=(length(logistic$coefficients)-1))
## [1] 0
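The p-value for the overall model prints as 0 because it is too small to be represented after the subtraction from 1. This means the model with all 8 predictors fits significantly better than the intercept-only model.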
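The likelihood ratio test can also be computed so that the tiny p-value is not rounded away. This is a minimal sketch that uses the rounded deviances printed by summary() instead of the model object, and asks pchisq() for the upper tail directly with lower.tail = FALSE.

```r
# Deviances and degrees of freedom copied from the summary() output
null_deviance     <- 2726.9
residual_deviance <- 2622.8
df_diff           <- 4869 - 4861  # one degree of freedom per predictor

# Likelihood ratio statistic: drop in deviance from the null model
lr_stat <- null_deviance - residual_deviance

# Upper-tail probability, computed directly to avoid 1 - (almost 1)
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

lr_stat
p_value
```

Because the deviances here are rounded to one decimal place, the result will differ slightly from a calculation on logistic$null.deviance and logistic$deviance.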
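Intercept: The log-odds of receiving a callback when all predictors are at 0 (a black-sounding female name with no honors, no experience, no degree, no email address, no military service, and no special skills) is -3.473. This corresponds to a very low baseline probability of receiving a callback, about 3%.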
racewhite: Resumes with white-sounding names have log-odds of receiving a callback that are 0.444 higher than resumes with black-sounding names. This variable is statistically significant as its p-value is 4.28e-05, which is less than our alpha of 0.05, indicating that race plays a meaningful role in callback probability even after controlling for the other resume attributes in the model.
honors: Listing honors on a resume increases the log-odds of receiving a callback by 0.588. The honors variable is statistically significant, as its p-value of 0.001761 is less than our alpha of 0.05, which suggests employers may lean toward people with academic or professional distinctions.
years_experience: Each additional year of experience increases the log-odds of a callback by 0.035. This variable is statistically significant as its p-value is 0.000295, which is less than our alpha of 0.05. While the effect of a single year is small, it adds up over a longer career: the log-odds increase linearly with each year, so the odds are multiplied by about 1.036 per year.
special_skills: Listing special skills on your resume increases the log-odds of a callback by 0.799. This variable is statistically significant as its p-value is 5.61e-13, which is less than our alpha of 0.05. This shows that specific qualifications are one of the most valued attributes a person can have when applying.
genderm: Resumes with a male-sounding name have a coefficient of -0.015. This means that the log-odds of receiving a callback are 0.015 lower for male names than for female names. This variable is not statistically significant as its p-value is 0.908513, which is greater than our alpha of 0.05.
college_degree: Having a college degree increases the log-odds of receiving a callback by 0.130. However, this variable is not statistically significant as its p-value of 0.291007 is greater than the alpha of 0.05. I think this may be because most of the applicants already have a college degree.
has_email_address: Including an email address on your resume increases the log-odds of a callback by 0.111. However, this variable is not statistically significant, with its p-value of 0.329636 being greater than our alpha of 0.05. I think this is similar to college_degree, as many of the applicants include an email address on the resume.
military: Having military experience on your resume decreases the log-odds of a callback by 0.099. However, this variable is not statistically significant as its p-value of 0.644127 is greater than our alpha of 0.05. This means we have no evidence that military experience hurts or helps callback probability.
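Log-odds are hard to read directly, so it can help to convert the coefficients above into odds ratios and the intercept into a probability. The sketch below copies the estimates from the coefficient table by hand; the variable names coefs, odds_ratios, baseline_prob, and ten_year_or are my own and are not part of the original analysis.

```r
# Estimates copied from the coefficient table in the summary() output
coefs <- c(
  intercept = -3.473459, racewhite = 0.443955, genderm = -0.015386,
  honors = 0.587857, years_experience = 0.035249, college_degree = 0.130072,
  has_email_address = 0.110579, military = -0.098573, special_skills = 0.799244
)

# Odds ratios: exp() of each slope (the intercept is dropped)
odds_ratios <- exp(coefs[-1])
round(odds_ratios, 3)

# Baseline probability of a callback when every predictor is 0
baseline_prob <- plogis(coefs[["intercept"]])
round(baseline_prob, 3)

# Ten extra years of experience multiply the odds by exp(10 * beta)
ten_year_or <- exp(10 * coefs[["years_experience"]])
round(ten_year_or, 3)
```

With the fitted model object available, exp(coef(logistic)) should give the same odds ratios without copying any numbers by hand.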
I lowered the classification threshold to 0.08 because when I ran it with the default of 0.5 there were no true or false positives: the threshold was too high for any resume to be predicted as a callback. I chose 0.08 because it matches the overall callback rate calculated earlier.
# Predicted probabilities
predicted.probs <- logistic$fitted.values
# Predicted classes: 1 if prob > 0.08 (the overall callback rate), else 0
predicted.classes <- ifelse(predicted.probs > 0.08, 1, 0)
# Confusion matrix
confusion <- table(
Predicted = factor(predicted.classes, levels = c(0, 1)),
Actual = factor(resume_clean$received_callback, levels = c(0, 1))
)
confusion
## Actual
## Predicted 0 1
## 0 2878 168
## 1 1600 224
2878 resumes did not receive a callback, and the model correctly predicted them as no callback. This is a True Negative.
168 resumes actually received a callback, but the model mistakenly predicted them as no callback. This is a False Negative.
1600 resumes did not receive a callback, but the model mistakenly predicted them as receiving a callback. This is a False Positive.
224 resumes actually received a callback, and the model correctly predicted them as receiving a callback. This is a True Positive.
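To show why the choice of threshold matters so much for a rare outcome, here is a minimal sketch on simulated data. It does not use the resume dataset; the sample size, the coefficients -2.6 and 0.5, and the helper confusion_at() are all assumptions made up for illustration.

```r
set.seed(101)  # simulated data, not the resume dataset

n <- 5000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-2.6 + 0.5 * x))  # rare outcome

fit   <- glm(y ~ x, family = binomial)
probs <- fitted(fit)

# Confusion matrix at a given probability threshold
confusion_at <- function(threshold) {
  table(
    Predicted = factor(as.integer(probs > threshold), levels = c(0, 1)),
    Actual    = factor(y, levels = c(0, 1))
  )
}

confusion_at(0.5)      # default cutoff
confusion_at(mean(y))  # cutoff at the observed event rate
```

Lowering the cutoff to the observed event rate moves observations into the predicted-positive row, which is the same trade-off made with the 0.08 threshold in this report.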
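# Extract the values from the confusion matrix (rows = Predicted, columns = Actual)
TN <- confusion["0", "0"]
FP <- confusion["1", "0"]
FN <- confusion["0", "1"]
TP <- confusion["1", "1"]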
# Metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision <- TP / (TP + FP)
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
# Print the results
cat("Accuracy: ", round(accuracy, 4), "\n")
## Accuracy: 0.637
cat("Sensitivity: ", round(sensitivity, 4), "\n")
## Sensitivity: 0.5714
cat("Specificity: ", round(specificity, 4), "\n")
## Specificity: 0.6427
cat("Precision: ", round(precision, 4), "\n")
## Precision: 0.1228
cat("F1 Score: ", round(f1_score, 4), "\n")
## F1 Score: 0.2022
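The model achieves an accuracy of 63.7%, with a sensitivity of 57.1% and specificity of 64.3%, so it is roughly balanced between detecting actual callbacks and avoiding false alarms. Precision is only 12.3%, however, which means most resumes predicted to receive a callback did not actually receive one.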
Overall, these results are acceptable for a highly imbalanced real-world dataset where only 8% of resumes received a callback.
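With an imbalanced outcome, accuracy alone can be misleading, so it may be useful to compare it against a trivial rule. The sketch below reuses the four counts from the confusion matrix and computes two extra quantities that are not part of the original analysis: the accuracy of always predicting "no callback", and balanced accuracy.

```r
# Counts copied from the confusion matrix in this report
TN <- 2878; FP <- 1600; FN <- 168; TP <- 224
total <- TN + FP + FN + TP

# Accuracy of a trivial rule that predicts "no callback" for every resume
baseline_accuracy <- (TN + FP) / total

# Balanced accuracy: the mean of sensitivity and specificity
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(baseline = baseline_accuracy, balanced = balanced_accuracy), 4)
```

The trivial rule scores a higher raw accuracy than the model but never detects a callback, which is why sensitivity, specificity, and AUC are more informative here than accuracy.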
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_obj <- roc(response = resume_clean$received_callback,
predictor = logistic$fitted.values,
levels = c(0, 1),
direction = "<")
auc_val <- auc(roc_obj); auc_val
## Area under the curve: 0.6493
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)")
The AUC of 0.649 indicates the model has a modest ability to distinguish between resumes that received a callback and resumes that didn’t. On the plot, the curve sits above the diagonal random-guess line, confirming the model performs better than chance. In plain terms, if you randomly picked one resume that received a callback and one that did not, the model would correctly rank the callback resume higher about 65% of the time.
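The ranking interpretation of AUC can be checked by hand on a tiny example. The scores below are made up for illustration and are not predictions from our model; the sketch compares every callback score with every no-callback score and counts how often the callback score is higher.

```r
# Made-up predicted probabilities, for illustration only
scores_callback    <- c(0.9, 0.6, 0.4)
scores_no_callback <- c(0.7, 0.5, 0.3, 0.2)

# Compare every callback score with every no-callback score
wins <- outer(scores_callback, scores_no_callback, ">")
ties <- outer(scores_callback, scores_no_callback, "==")

# AUC = share of pairs ranked correctly (ties count as half)
pairwise_auc <- mean(wins + 0.5 * ties)
pairwise_auc
```

Here 9 of the 12 pairs are ranked correctly, so the pairwise AUC is 0.75. The same idea applied to all callback and no-callback pairs in our data is what the 0.649 summarizes.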
The logistic regression showed that race and special skills were the most statistically significant predictors of receiving a job callback. Resumes with white-sounding names had significantly higher log-odds of receiving a callback compared to otherwise comparable resumes with black-sounding names. The coefficient for the racewhite variable was 0.444 with a p-value of 4.28e-05, which shows that race is significantly associated with whether a resume got called back. Listing special skills on a resume had the largest positive effect in the entire model (coefficient = 0.799, p-value = 5.61e-13). Honors (coefficient = 0.588, p-value = 0.001761) and years of experience (coefficient = 0.035, p-value = 0.000295) also significantly increased callback odds, while gender, college degree, email address, and military experience were not statistically significant.
These findings answer the research question: certain resume components do predict callbacks, and race significantly influences callback probability. The fact that race remains a significant predictor after controlling for qualifications such as honors, years of experience, and special skills suggests that hiring discrimination is present in this dataset, which has important implications for equal opportunity employment policy and hiring practices. The model achieved an AUC of 0.649 and an accuracy of 63.7%, which is acceptable given that only 8% of resumes received a callback.
Future research could examine whether the effect of race varies by industry or city by incorporating the job information in the dataset. We could also look at whether highly qualified black applicants face the same discrimination as less qualified ones.
Dataset: https://www.openintro.org/data/index.php?data=resume
Source: Bertrand M, Mullainathan S. 2004. “Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination.” The American Economic Review 94(4): 991-1013.