p3_d101

Introduction

This project uses the resume data set from https://www.openintro.org/data/index.php?data=resume. The data set covers different job types and industries, along with many variables that may affect whether an applicant receives a callback. This project uses logistic regression to examine how the predictors years_college, college_degree, years_experience, and honors affect whether an applicant receives a callback. The data set has 4,870 observations and 30 variables.

Source: Bertrand M, Mullainathan S. 2004. “Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination”. The American Economic Review 94:4 (991-1013).

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

resume1 <- read_csv("resume.csv")

## Rows: 4870 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): job_city, job_industry, job_type, job_ownership, job_req_min_exper...
## dbl (20): job_ad_id, job_fed_contractor, job_equal_opp_employer, job_req_any...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data analysis

I first looked at the structure using str() and then checked for NAs using colSums(). After that, I cleaned the data by selecting the needed variables and filtering for Chicago; the data covers only two cities, and this analysis focuses on Chicago.

Checking structure

str(resume1)

## spc_tbl_ [4,870 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ job_ad_id             : num [1:4870] 384 384 384 384 385 386 386 385 386 386 ...
##  $ job_city              : chr [1:4870] "Chicago" "Chicago" "Chicago" "Chicago" ...
##  $ job_industry          : chr [1:4870] "manufacturing" "manufacturing" "manufacturing" "manufacturing" ...
##  $ job_type              : chr [1:4870] "supervisor" "supervisor" "supervisor" "supervisor" ...
##  $ job_fed_contractor    : num [1:4870] NA NA NA NA 0 0 0 0 0 0 ...
##  $ job_equal_opp_employer: num [1:4870] 1 1 1 1 1 1 1 1 1 1 ...
##  $ job_ownership         : chr [1:4870] "unknown" "unknown" "unknown" "unknown" ...
##  $ job_req_any           : num [1:4870] 1 1 1 1 1 0 0 1 0 0 ...
##  $ job_req_communication : num [1:4870] 0 0 0 0 0 0 0 0 0 0 ...
##  $ job_req_education     : num [1:4870] 0 0 0 0 0 0 0 0 0 0 ...
##  $ job_req_min_experience: chr [1:4870] "5" "5" "5" "5" ...
##  $ job_req_computer      : num [1:4870] 1 1 1 1 1 0 0 1 0 0 ...
##  $ job_req_organization  : num [1:4870] 0 0 0 0 1 0 0 1 0 0 ...
##  $ job_req_school        : chr [1:4870] "none_listed" "none_listed" "none_listed" "none_listed" ...
##  $ received_callback     : num [1:4870] 0 0 0 0 0 0 0 0 0 0 ...
##  $ firstname             : chr [1:4870] "Allison" "Kristen" "Lakisha" "Latonya" ...
##  $ race                  : chr [1:4870] "white" "white" "black" "black" ...
##  $ gender                : chr [1:4870] "f" "f" "f" "f" ...
##  $ years_college         : num [1:4870] 4 3 4 3 3 4 4 3 4 4 ...
##  $ college_degree        : num [1:4870] 1 0 1 0 0 1 1 0 1 1 ...
##  $ honors                : num [1:4870] 0 0 0 0 0 1 0 0 0 0 ...
##  $ worked_during_school  : num [1:4870] 0 1 1 0 1 0 1 0 0 1 ...
##  $ years_experience      : num [1:4870] 6 6 6 6 22 6 5 21 3 6 ...
##  $ computer_skills       : num [1:4870] 1 1 1 1 1 0 1 1 1 0 ...
##  $ special_skills        : num [1:4870] 0 0 0 1 0 1 1 1 1 1 ...
##  $ volunteer             : num [1:4870] 0 1 0 1 0 0 1 1 0 1 ...
##  $ military              : num [1:4870] 0 1 0 0 0 0 0 0 0 0 ...
##  $ employment_holes      : num [1:4870] 1 0 0 1 0 0 0 1 0 0 ...
##  $ has_email_address     : num [1:4870] 0 1 0 1 1 0 1 1 0 1 ...
##  $ resume_quality        : chr [1:4870] "low" "high" "low" "high" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   job_ad_id = col_double(),
##   ..   job_city = col_character(),
##   ..   job_industry = col_character(),
##   ..   job_type = col_character(),
##   ..   job_fed_contractor = col_double(),
##   ..   job_equal_opp_employer = col_double(),
##   ..   job_ownership = col_character(),
##   ..   job_req_any = col_double(),
##   ..   job_req_communication = col_double(),
##   ..   job_req_education = col_double(),
##   ..   job_req_min_experience = col_character(),
##   ..   job_req_computer = col_double(),
##   ..   job_req_organization = col_double(),
##   ..   job_req_school = col_character(),
##   ..   received_callback = col_double(),
##   ..   firstname = col_character(),
##   ..   race = col_character(),
##   ..   gender = col_character(),
##   ..   years_college = col_double(),
##   ..   college_degree = col_double(),
##   ..   honors = col_double(),
##   ..   worked_during_school = col_double(),
##   ..   years_experience = col_double(),
##   ..   computer_skills = col_double(),
##   ..   special_skills = col_double(),
##   ..   volunteer = col_double(),
##   ..   military = col_double(),
##   ..   employment_holes = col_double(),
##   ..   has_email_address = col_double(),
##   ..   resume_quality = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Checking for NAs

colSums(is.na(resume1))

##              job_ad_id               job_city           job_industry 
##                      0                      0                      0 
##               job_type     job_fed_contractor job_equal_opp_employer 
##                      0                   1768                      0 
##          job_ownership            job_req_any  job_req_communication 
##                      0                      0                      0 
##      job_req_education job_req_min_experience       job_req_computer 
##                      0                   2746                      0 
##   job_req_organization         job_req_school      received_callback 
##                      0                      0                      0 
##              firstname                   race                 gender 
##                      0                      0                      0 
##          years_college         college_degree                 honors 
##                      0                      0                      0 
##   worked_during_school       years_experience        computer_skills 
##                      0                      0                      0 
##         special_skills              volunteer               military 
##                      0                      0                      0 
##       employment_holes      has_email_address         resume_quality 
##                      0                      0                      0

Cleaning

resume_cleaned1 <- resume1 |>
  select(received_callback, years_college, college_degree, years_experience, honors, job_city, job_industry, job_type)

resume_cleaned2 <- resume_cleaned1 |>
  filter(job_city == "Chicago")

resume_cleaned2 |>
  group_by(job_industry) |>
  reframe(received_callback)

## # A tibble: 2,704 × 2
##    job_industry                  received_callback
##    <chr>                                     <dbl>
##  1 business_and_personal_service                 0
##  2 business_and_personal_service                 0
##  3 business_and_personal_service                 0
##  4 business_and_personal_service                 0
##  5 business_and_personal_service                 0
##  6 business_and_personal_service                 0
##  7 business_and_personal_service                 0
##  8 business_and_personal_service                 0
##  9 business_and_personal_service                 0
## 10 business_and_personal_service                 0
## # ℹ 2,694 more rows
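
The listing above returns one row per resume, so a per-industry callback rate may be easier to read. A minimal sketch, assuming a small made-up data frame (toy_resumes is hypothetical and only stands in for resume_cleaned2; its values are not from the real data):

```r
library(dplyr)

# Hypothetical toy data standing in for resume_cleaned2 (values are made up)
toy_resumes <- data.frame(
  job_industry = c("manufacturing", "manufacturing", "manufacturing",
                   "finance", "finance"),
  received_callback = c(0, 1, 0, 0, 0)
)

# One row per industry: number of resumes and share that got a callback
callback_by_industry <- toy_resumes |>
  group_by(job_industry) |>
  summarise(
    n_resumes = n(),
    callback_rate = mean(received_callback),
    .groups = "drop"
  )

callback_by_industry
```

Because received_callback is coded 0/1, its mean within a group is the callback rate for that group.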

Logistic Regression

logistic <- glm(received_callback ~ years_college + college_degree + years_experience + honors, data = resume_cleaned2, family = "binomial")

summary(logistic)

## 
## Call:
## glm(formula = received_callback ~ years_college + college_degree + 
##     years_experience + honors, family = "binomial", data = resume_cleaned2)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.83629    0.48806  -5.811  6.2e-09 ***
## years_college    -0.14498    0.18459  -0.785  0.43222    
## college_degree    0.54644    0.32401   1.686  0.09171 .  
## years_experience  0.03969    0.01655   2.399  0.01645 *  
## honors            0.66743    0.25888   2.578  0.00993 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1333.7  on 2703  degrees of freedom
## Residual deviance: 1316.1  on 2699  degrees of freedom
## AIC: 1326.1
## 
## Number of Fisher Scoring iterations: 5

Interpretation

Of the predictors years_college, college_degree, years_experience, and honors, the ones that were statistically significant (p < 0.05) in this model were years_experience and honors.

years_experience: 0.03969 (p = 0.01645). Each additional year of experience increases the log-odds of receiving a callback by about 0.04, holding the other predictors fixed. The effect is statistically significant, so years of experience matters for predicting whether an applicant receives a callback.

honors: 0.66743 (p = 0.00993). Listing honors increases the log-odds of receiving a callback by about 0.67, holding the other predictors fixed. The effect is statistically significant, so honors matters for predicting whether an applicant receives a callback.
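
Log-odds are hard to read directly, so it can help to exponentiate the two significant coefficients into odds ratios. A minimal sketch, with the estimates copied by hand from the summary() output (the vector name log_odds is just for illustration):

```r
# Estimates copied from the summary() output above
log_odds <- c(years_experience = 0.03969, honors = 0.66743)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratios <- exp(log_odds)
round(odds_ratios, 3)
```

Under this model, each additional year of experience multiplies the odds of a callback by about 1.04, and listing honors multiplies them by about 1.95, holding the other predictors fixed.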

R^2

r_square <- 1 - (logistic$deviance/logistic$null.deviance)

r_square

## [1] 0.01317504

The R^2 here is about 1.3%. Because it is computed as 1 minus the ratio of residual deviance to null deviance, it is a pseudo-R^2: the predictors years_college, college_degree, years_experience, and honors reduce the deviance by only about 1.3% relative to a model with no predictors. The low value suggests that other factors affect whether an applicant receives a callback, and the model might be improved by adding predictors related to receiving a callback.

P-value

1 - pchisq((logistic$null.deviance - logistic$deviance), df=(length(logistic$coefficients) -1))

## [1] 0.001496063

The p-value is 0.001496063, which is below 0.05, meaning the model as a whole fits significantly better than a model with no predictors.

Confusion Matrix

predicted.probs <- logistic$fitted.values

predicted.classes <- ifelse(predicted.probs > 0.5, 1, 0)

confusion <- table(
  Predicted = factor(predicted.classes, levels = c(0, 1)),
  Actual = factor(resume_cleaned2$received_callback, levels = c(0, 1))
)

confusion

##          Actual
## Predicted    0    1
##         0 2522  182
##         1    0    0

True negatives: 2522 resumes did not receive a callback, and the model predicted them correctly.

False negatives: 182 resumes received a callback, but the model predicted no callback.

False positives: 0 resumes did not receive a callback but were predicted to receive one.

True positives: 0 resumes received a callback and were predicted to receive one.
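
To see why no resume is predicted as a callback, it may help to compute a fitted probability by hand. A minimal sketch, with the estimates copied from the summary() output and two hypothetical resumes (4 years of college, a degree, 6 years of experience, without and with honors) chosen only for illustration:

```r
# Coefficient estimates copied from the summary() output above
coefs <- c(intercept = -2.83629, years_college = -0.14498,
           college_degree = 0.54644, years_experience = 0.03969,
           honors = 0.66743)

# Two hypothetical resumes; the leading 1 multiplies the intercept
without_honors <- c(1, 4, 1, 6, 0)
with_honors    <- c(1, 4, 1, 6, 1)

# plogis() is the inverse logit: it maps log-odds to a probability
p_callback <- c(
  without_honors = plogis(sum(coefs * without_honors)),
  with_honors    = plogis(sum(coefs * with_honors))
)
round(p_callback, 3)
```

Both probabilities (roughly 0.07 and 0.12) sit well below 0.5, which is consistent with the model never predicting a callback at that cutoff.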

Metrics

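# Pull the counts from the confusion matrix (rows = Predicted, columns = Actual)
TN <- confusion["0", "0"]
FP <- confusion["1", "0"]
FN <- confusion["0", "1"]
TP <- confusion["1", "1"]


accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
precision <- TP / (TP + FP)     # 0/0 here, so NaN: no positives were predicted
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)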


cat("Accuracy:    ", round(accuracy, 4), "\n")

## Accuracy:     0.9327

cat("Sensitivity: ", round(sensitivity, 4), "\n")

## Sensitivity:  0

cat("Specificity: ", round(specificity, 4), "\n")

## Specificity:  1

cat("Precision:   ", round(precision, 4), "\n")

## Precision:    NaN

cat("F1 Score:    ", round(f1_score, 4), "\n")

## F1 Score:     NaN

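The model has high accuracy at 93.27%, but this is misleading: it equals the share of resumes that did not receive a callback (2522 of 2704), because the model predicts no callback for every resume.

The model identifies resumes that did not receive a callback perfectly (specificity of 100%) but fails to identify any that did (sensitivity of 0%).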
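
A sensitivity of 0% at the 0.5 cutoff is common when the positive class is rare, and a lower cutoff trades some specificity for sensitivity. A minimal sketch on simulated data (the data, the coefficients -2.7 and 0.4, and the helper sensitivity_at are all made up for illustration and are not the resume data):

```r
set.seed(101)

# Simulated, imbalanced data (made up; not the resume data)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-2.7 + 0.4 * x))

fit <- glm(y ~ x, family = "binomial")
probs <- fitted(fit)

# Sensitivity = share of actual positives that are predicted as positive
sensitivity_at <- function(threshold) {
  predicted <- as.integer(probs > threshold)
  sum(predicted == 1 & y == 1) / sum(y == 1)
}

base_rate <- mean(y)                     # observed share of positives
sens_half <- sensitivity_at(0.5)         # default cutoff
sens_base <- sensitivity_at(base_rate)   # cutoff at the base rate

c(threshold_0.5 = sens_half, threshold_base_rate = sens_base)
```

Using the base rate as the cutoff is only one possible choice; the right cutoff depends on how costly false positives are relative to false negatives.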

Precision and F1 score show NaN. This is not a coding mistake: the model predicts no callbacks at the 0.5 cutoff, so TP + FP = 0 and precision is 0/0, which is undefined. F1 depends on precision, so it is NaN as well.
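
One way to make the undefined case explicit is to guard the division. A minimal sketch, reusing the counts from the confusion matrix; safe_ratio is a hypothetical helper, not part of the original analysis:

```r
# Counts taken from the confusion matrix reported above
TN <- 2522
FP <- 0
FN <- 182
TP <- 0

# Hypothetical helper: return NA instead of NaN when the denominator is zero
safe_ratio <- function(numerator, denominator) {
  if (denominator == 0) NA_real_ else numerator / denominator
}

precision   <- safe_ratio(TP, TP + FP)   # NA: no positives were predicted
sensitivity <- safe_ratio(TP, TP + FN)

# F1 is only defined when precision and sensitivity are defined and not both 0
f1_score <- if (is.na(precision) || is.na(sensitivity) ||
                precision + sensitivity == 0) {
  NA_real_
} else {
  2 * precision * sensitivity / (precision + sensitivity)
}

cat("Precision:", precision, "\n")
cat("F1 Score: ", f1_score, "\n")
```

Reporting NA signals "not defined for this cutoff" rather than a numerical accident.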

ROC Curve and AUC

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

roc_obj <- roc(response = resume_cleaned2$received_callback,
               predictor = logistic$fitted.values,
               levels = c(0, 1),
               direction = "<")  

auc_val <- auc(roc_obj); auc_val

## Area under the curve: 0.5948

plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
         xlab = "False Positive Rate (1 - Specificity)",
         ylab = "True Positive Rate (Sensitivity)")

AUC = 0.595, and the curve is above the diagonal line, meaning the model is only slightly better than random guessing at distinguishing resumes that received a callback from those that did not.
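
AUC can be read as the probability that a randomly chosen resume with a callback gets a higher fitted probability than a randomly chosen one without. A minimal sketch of that rank-based calculation in base R, on six made-up scores (the vectors below are hypothetical, and this is not a description of how pROC computes AUC internally):

```r
# Made-up labels and scores for illustration only
actual <- c(0, 0, 1, 0, 1, 1)
score  <- c(0.10, 0.40, 0.35, 0.80, 0.70, 0.90)

n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)

# Mann-Whitney U: rank sum of the positives minus its minimum possible value
u_stat <- sum(rank(score)[actual == 1]) - n_pos * (n_pos + 1) / 2

# AUC is U divided by the number of positive/negative pairs
auc_rank <- u_stat / (n_pos * n_neg)
auc_rank
```

Here 6 of the 9 positive/negative pairs are ordered correctly, so the value is 2/3; a value of 0.5 would mean the scores carry no ranking information.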

Conclusion

Overall, this model attempts to predict callbacks from the predictors years_college, college_degree, years_experience, and honors. Of these predictors, only years_experience and honors were significant (p < 0.05). The R^2 is 1.3%, meaning the predictors account for only about 1.3% of the deviance in callbacks. The model has high accuracy at 93.27%, but it achieves this by predicting no callback for every resume: specificity is 100% and sensitivity is 0%. Lastly, AUC = 0.595, meaning the model is only slightly better than random guessing at predicting callbacks.

Future

In the future, it would be better to include more predictors so the model is better at predicting callbacks.