This project uses the resume data set from https://www.openintro.org/data/index.php?data=resume. The data set covers different job types and industries and includes many variables that may affect whether an applicant receives a callback. This project uses logistic regression to examine how the predictors years_college, college_degree, years_experience, and honors affect whether an applicant receives a callback. The data set has 4,870 observations and 30 variables.
Source: Bertrand M, Mullainathan S. 2004. “Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination”. The American Economic Review 94:4 (991-1013).
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
resume1 <- read_csv("resume.csv")
## Rows: 4870 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): job_city, job_industry, job_type, job_ownership, job_req_min_exper...
## dbl (20): job_ad_id, job_fed_contractor, job_equal_opp_employer, job_req_any...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I first looked at the structure using str(), then checked for NAs using colSums(). After that I cleaned the data by selecting the needed variables and filtering for Chicago, since the data covers only two cities and this analysis looks at Chicago only.
str(resume1)
## spc_tbl_ [4,870 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ job_ad_id : num [1:4870] 384 384 384 384 385 386 386 385 386 386 ...
## $ job_city : chr [1:4870] "Chicago" "Chicago" "Chicago" "Chicago" ...
## $ job_industry : chr [1:4870] "manufacturing" "manufacturing" "manufacturing" "manufacturing" ...
## $ job_type : chr [1:4870] "supervisor" "supervisor" "supervisor" "supervisor" ...
## $ job_fed_contractor : num [1:4870] NA NA NA NA 0 0 0 0 0 0 ...
## $ job_equal_opp_employer: num [1:4870] 1 1 1 1 1 1 1 1 1 1 ...
## $ job_ownership : chr [1:4870] "unknown" "unknown" "unknown" "unknown" ...
## $ job_req_any : num [1:4870] 1 1 1 1 1 0 0 1 0 0 ...
## $ job_req_communication : num [1:4870] 0 0 0 0 0 0 0 0 0 0 ...
## $ job_req_education : num [1:4870] 0 0 0 0 0 0 0 0 0 0 ...
## $ job_req_min_experience: chr [1:4870] "5" "5" "5" "5" ...
## $ job_req_computer : num [1:4870] 1 1 1 1 1 0 0 1 0 0 ...
## $ job_req_organization : num [1:4870] 0 0 0 0 1 0 0 1 0 0 ...
## $ job_req_school : chr [1:4870] "none_listed" "none_listed" "none_listed" "none_listed" ...
## $ received_callback : num [1:4870] 0 0 0 0 0 0 0 0 0 0 ...
## $ firstname : chr [1:4870] "Allison" "Kristen" "Lakisha" "Latonya" ...
## $ race : chr [1:4870] "white" "white" "black" "black" ...
## $ gender : chr [1:4870] "f" "f" "f" "f" ...
## $ years_college : num [1:4870] 4 3 4 3 3 4 4 3 4 4 ...
## $ college_degree : num [1:4870] 1 0 1 0 0 1 1 0 1 1 ...
## $ honors : num [1:4870] 0 0 0 0 0 1 0 0 0 0 ...
## $ worked_during_school : num [1:4870] 0 1 1 0 1 0 1 0 0 1 ...
## $ years_experience : num [1:4870] 6 6 6 6 22 6 5 21 3 6 ...
## $ computer_skills : num [1:4870] 1 1 1 1 1 0 1 1 1 0 ...
## $ special_skills : num [1:4870] 0 0 0 1 0 1 1 1 1 1 ...
## $ volunteer : num [1:4870] 0 1 0 1 0 0 1 1 0 1 ...
## $ military : num [1:4870] 0 1 0 0 0 0 0 0 0 0 ...
## $ employment_holes : num [1:4870] 1 0 0 1 0 0 0 1 0 0 ...
## $ has_email_address : num [1:4870] 0 1 0 1 1 0 1 1 0 1 ...
## $ resume_quality : chr [1:4870] "low" "high" "low" "high" ...
## - attr(*, "spec")=
## .. cols(
## .. job_ad_id = col_double(),
## .. job_city = col_character(),
## .. job_industry = col_character(),
## .. job_type = col_character(),
## .. job_fed_contractor = col_double(),
## .. job_equal_opp_employer = col_double(),
## .. job_ownership = col_character(),
## .. job_req_any = col_double(),
## .. job_req_communication = col_double(),
## .. job_req_education = col_double(),
## .. job_req_min_experience = col_character(),
## .. job_req_computer = col_double(),
## .. job_req_organization = col_double(),
## .. job_req_school = col_character(),
## .. received_callback = col_double(),
## .. firstname = col_character(),
## .. race = col_character(),
## .. gender = col_character(),
## .. years_college = col_double(),
## .. college_degree = col_double(),
## .. honors = col_double(),
## .. worked_during_school = col_double(),
## .. years_experience = col_double(),
## .. computer_skills = col_double(),
## .. special_skills = col_double(),
## .. volunteer = col_double(),
## .. military = col_double(),
## .. employment_holes = col_double(),
## .. has_email_address = col_double(),
## .. resume_quality = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
colSums(is.na(resume1))
## job_ad_id job_city job_industry
## 0 0 0
## job_type job_fed_contractor job_equal_opp_employer
## 0 1768 0
## job_ownership job_req_any job_req_communication
## 0 0 0
## job_req_education job_req_min_experience job_req_computer
## 0 2746 0
## job_req_organization job_req_school received_callback
## 0 0 0
## firstname race gender
## 0 0 0
## years_college college_degree honors
## 0 0 0
## worked_during_school years_experience computer_skills
## 0 0 0
## special_skills volunteer military
## 0 0 0
## employment_holes has_email_address resume_quality
## 0 0 0
resume_cleaned1 <- resume1 |>
  select(received_callback, years_college, college_degree, years_experience,
         honors, job_city, job_industry, job_type)
resume_cleaned2 <- resume_cleaned1 |>
  filter(job_city == "Chicago")
# Callback rate and number of resumes per industry (one row per group)
resume_cleaned2 |>
  group_by(job_industry) |>
  summarise(callback_rate = mean(received_callback), n = n())
logistic <- glm(received_callback ~ years_college + college_degree + years_experience + honors,
                data = resume_cleaned2, family = "binomial")
summary(logistic)
##
## Call:
## glm(formula = received_callback ~ years_college + college_degree +
## years_experience + honors, family = "binomial", data = resume_cleaned2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.83629 0.48806 -5.811 6.2e-09 ***
## years_college -0.14498 0.18459 -0.785 0.43222
## college_degree 0.54644 0.32401 1.686 0.09171 .
## years_experience 0.03969 0.01655 2.399 0.01645 *
## honors 0.66743 0.25888 2.578 0.00993 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1333.7 on 2703 degrees of freedom
## Residual deviance: 1316.1 on 2699 degrees of freedom
## AIC: 1326.1
##
## Number of Fisher Scoring iterations: 5
Of the predictors years_college, college_degree, years_experience, and honors, the ones that were statistically significant (p < 0.05) in this model were years_experience and honors. college_degree was only marginally significant (p = 0.092) and years_college was not significant (p = 0.432).
years_experience: 0.03969 (p = 0.01645). Each additional year of experience increases the log-odds of receiving a callback by about 0.04, holding the other predictors constant. The coefficient is statistically significant, so years of experience is associated with whether an applicant receives a callback.
honors: 0.66743 (p = 0.00993). Listing honors increases the log-odds of receiving a callback by about 0.67, holding the other predictors constant. The coefficient is statistically significant, so honors is associated with whether an applicant receives a callback.
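Log-odds are hard to read directly, so it can help to convert the two significant coefficients to odds ratios. The sketch below copies the estimates from the summary output above into a vector (the names log_odds and odds_ratios are only for illustration) and exponentiates them.

```r
# Coefficients copied from the summary(logistic) output above (log-odds scale)
log_odds <- c(years_experience = 0.03969, honors = 0.66743)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios <- exp(log_odds)
round(odds_ratios, 3)
```

The odds ratios are about 1.04 for years_experience and about 1.95 for honors: each extra year of experience multiplies the odds of a callback by roughly 1.04, and listing honors roughly doubles them. With the fitted model in memory, exp(coef(logistic)) should give the same values for all coefficients at once.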
r_square <- 1 - (logistic$deviance/logistic$null.deviance)
r_square
## [1] 0.01317504
The R^2 here is about 1.3%. Because this is a logistic regression, it is a deviance-based pseudo-R^2 (McFadden's): the predictors years_college, college_degree, years_experience, and honors reduce the deviance by about 1.3% compared with the intercept-only model. The value is low, meaning there are other factors that affect whether an applicant receives a callback, and the model might be improved by adding predictors that relate to receiving a callback.
1 - pchisq((logistic$null.deviance - logistic$deviance), df=(length(logistic$coefficients) -1))
## [1] 0.001496063
The p-value of this likelihood-ratio test is about 0.0015, which is below 0.05, so the model with the four predictors fits significantly better than the intercept-only model.
predicted.probs <- logistic$fitted.values
predicted.classes <- ifelse(predicted.probs > 0.5, 1, 0)
confusion <- table(
  Predicted = factor(predicted.classes, levels = c(0, 1)),
  Actual = factor(resume_cleaned2$received_callback, levels = c(0, 1))
)
confusion
## Actual
## Predicted 0 1
## 0 2522 182
## 1 0 0
True negatives: 2,522 did not receive a callback and the model predicted no callback.
False negatives: 182 received a callback but the model predicted no callback.
False positives: 0 did not receive a callback but were predicted to receive one.
True positives: 0 received a callback and were predicted to receive one.
TN <- 2522
FP <- 0
FN <- 182
TP <- 0
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision <- TP / (TP + FP)
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
cat("Accuracy: ", round(accuracy, 4), "\n")
## Accuracy: 0.9327
cat("Sensitivity: ", round(sensitivity, 4), "\n")
## Sensitivity: 0
cat("Specificity: ", round(specificity, 4), "\n")
## Specificity: 1
cat("Precision: ", round(precision, 4), "\n")
## Precision: NaN
cat("F1 Score: ", round(f1_score, 4), "\n")
## F1 Score: NaN
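The model has high accuracy at 93.27%, but this number is misleading: 2,522 of the 2,704 Chicago resumes (93.27%) did not receive a callback, so predicting "no callback" for every resume gives the same accuracy.
The model is better at identifying resumes that did not receive a callback: specificity is 100%, while sensitivity is 0%, so at the 0.5 threshold it identifies none of the resumes that did receive a callback.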
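A sensitivity of 0% is what tends to happen when the outcome is rare and the predictors are weak: no fitted probability gets near 0.5. The sketch below illustrates this on simulated stand-in data, not the resume data. The seed, sample size, and coefficients are assumptions chosen only to mimic a rare outcome, and sensitivity_at is a hypothetical helper.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated stand-in data (NOT the resume data): rare outcome, weak predictor
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-2.7 + 0.3 * x))

fit <- glm(y ~ x, family = "binomial")
probs <- fitted(fit)

# Sensitivity = share of actual positives that are predicted positive
sensitivity_at <- function(threshold) {
  predicted <- as.integer(probs > threshold)
  sum(predicted == 1 & y == 1) / sum(y == 1)
}

sens_default  <- sensitivity_at(0.5)      # default cut-off
sens_baserate <- sensitivity_at(mean(y))  # cut-off at the observed event rate

c(default = sens_default, base_rate = sens_baserate)
```

Lowering the cut-off trades specificity for sensitivity, so which threshold is appropriate depends on how costly each kind of error is.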
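Precision and the F1 score show NaN. This is not a coding error: the model predicts no callbacks at all, so TP + FP = 0 and precision is 0/0, which R reports as NaN. The F1 score depends on precision, so it is NaN as well.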
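One way to make the undefined metrics explicit is to guard the divisions. The sketch below reuses the counts from the confusion matrix above. safe_ratio is a hypothetical helper, not part of any package, that returns NA when the denominator is zero. The block also computes the accuracy of always predicting "no callback" for comparison.

```r
# Counts taken from the confusion matrix above
TP <- 0; FP <- 0; FN <- 182; TN <- 2522

# Hypothetical helper: return NA instead of NaN when the denominator is zero
safe_ratio <- function(num, den) if (den == 0) NA_real_ else num / den

accuracy    <- safe_ratio(TP + TN, TP + TN + FP + FN)
sensitivity <- safe_ratio(TP, TP + FN)
precision   <- safe_ratio(TP, TP + FP)  # 0 / 0: undefined, reported as NA

# Accuracy of always predicting "no callback" (majority-class baseline)
baseline_accuracy <- (TN + FP) / (TP + TN + FP + FN)

round(c(accuracy = accuracy, baseline = baseline_accuracy), 4)
```

Both values come out at 0.9327, which shows that the model's accuracy is no better than the majority-class baseline.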
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_obj <- roc(response = resume_cleaned2$received_callback,
               predictor = logistic$fitted.values,
               levels = c(0, 1),
               direction = "<")
auc_val <- auc(roc_obj); auc_val
## Area under the curve: 0.5948
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
         xlab = "False Positive Rate (1 - Specificity)",
         ylab = "True Positive Rate (Sensitivity)")
AUC = 0.595 and the curve is above the diagonal line, meaning the model is only slightly better than random guessing (AUC = 0.5) at predicting received callbacks.
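The AUC can also be read as a probability: the chance that a randomly chosen resume that received a callback gets a higher predicted probability than a randomly chosen one that did not. The sketch below computes the AUC from ranks (the Mann-Whitney identity) on a small toy example. The labels, scores, and the auc_rank function are made up for illustration and are not from the resume data or from pROC.

```r
# Toy data (hypothetical, not from the resume data)
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)

# AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(labels, scores) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # average ranks handle ties
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_auc <- auc_rank(labels, scores)
toy_auc
```

In the toy data, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75. By the same reading, an AUC of 0.595 means the model ranks a callback resume above a non-callback resume only about 59.5% of the time.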
Overall, this model attempts to predict received callbacks based on the predictors years_college, college_degree, years_experience, and honors. Of these predictors, only years_experience and honors were significant (p < 0.05). The pseudo-R^2 is 1.3%, meaning the predictors reduce the deviance by only about 1.3% compared with the intercept-only model. The model has high accuracy at 93.27%, but that matches the share of resumes that received no callback: specificity is 100% while sensitivity is 0%, so the model identifies non-callbacks well and callbacks not at all. Lastly, the AUC of 0.595 means the model is only slightly better than random guessing at predicting received callbacks.
In the future, it would be better to include more predictors to make the model better at predicting received callbacks.