Here is the final project:
We will explore the ResumeNames dataset.
It is a cross-section of resume, call-back and employer information for 4870 fictitious resumes.
Are Emily and Greg more employable than Lakisha and Jamal? In other words, are Caucasian-sounding names more likely to receive a call-back than African American-sounding names?
Let's begin with a walkthrough of the methodology for analyzing the dataset.
library(ggpubr)
## Loading required package: ggplot2
theme_set(theme_pubr())
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gmodels)
library(ggplot2)
ResumeNames<-"https://raw.githubusercontent.com/lszydziak/LS_CUNY/main/ResumeNames.csv"
Resume<-read.table(file=ResumeNames,header=TRUE, sep=",")
class(Resume)
## [1] "data.frame"
#Take a peek at the first few records
head(Resume)
## X name gender ethnicity quality call city jobs experience honors
## 1 1 Allison female cauc low no chicago 2 6 no
## 2 2 Kristen female cauc high no chicago 3 6 no
## 3 3 Lakisha female afam low no chicago 1 6 no
## 4 4 Latonya female afam high no chicago 4 6 no
## 5 5 Carrie female cauc high no chicago 3 22 no
## 6 6 Jay male cauc low no chicago 2 6 yes
## volunteer military holes school email computer special college minimum equal
## 1 no no yes no no yes no yes 5 yes
## 2 yes yes no yes yes yes no no 5 yes
## 3 no no no yes no yes no yes 5 yes
## 4 yes no yes no yes yes yes no 5 yes
## 5 no no no yes yes yes no no some yes
## 6 no no no no no no yes yes none yes
## wanted requirements reqexp reqcomm reqeduc reqcomp reqorg
## 1 supervisor yes yes no no yes no
## 2 supervisor yes yes no no yes no
## 3 supervisor yes yes no no yes no
## 4 supervisor yes yes no no yes no
## 5 secretary yes yes no no yes yes
## 6 other no no no no no no
## industry
## 1 manufacturing
## 2 manufacturing
## 3 manufacturing
## 4 manufacturing
## 5 health/education/social services
## 6 trade
# What is the size of this dataset?
dim(Resume)
## [1] 4870 28
#What types of variables are in the dataset?
str(Resume)
## 'data.frame': 4870 obs. of 28 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ name : chr "Allison" "Kristen" "Lakisha" "Latonya" ...
## $ gender : chr "female" "female" "female" "female" ...
## $ ethnicity : chr "cauc" "cauc" "afam" "afam" ...
## $ quality : chr "low" "high" "low" "high" ...
## $ call : chr "no" "no" "no" "no" ...
## $ city : chr "chicago" "chicago" "chicago" "chicago" ...
## $ jobs : int 2 3 1 4 3 2 2 4 3 2 ...
## $ experience : int 6 6 6 6 22 6 5 21 3 6 ...
## $ honors : chr "no" "no" "no" "no" ...
## $ volunteer : chr "no" "yes" "no" "yes" ...
## $ military : chr "no" "yes" "no" "no" ...
## $ holes : chr "yes" "no" "no" "yes" ...
## $ school : chr "no" "yes" "yes" "no" ...
## $ email : chr "no" "yes" "no" "yes" ...
## $ computer : chr "yes" "yes" "yes" "yes" ...
## $ special : chr "no" "no" "no" "yes" ...
## $ college : chr "yes" "no" "yes" "no" ...
## $ minimum : chr "5" "5" "5" "5" ...
## $ equal : chr "yes" "yes" "yes" "yes" ...
## $ wanted : chr "supervisor" "supervisor" "supervisor" "supervisor" ...
## $ requirements: chr "yes" "yes" "yes" "yes" ...
## $ reqexp : chr "yes" "yes" "yes" "yes" ...
## $ reqcomm : chr "no" "no" "no" "no" ...
## $ reqeduc : chr "no" "no" "no" "no" ...
## $ reqcomp : chr "yes" "yes" "yes" "yes" ...
## $ reqorg : chr "no" "no" "no" "no" ...
## $ industry : chr "manufacturing" "manufacturing" "manufacturing" "manufacturing" ...
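Since `str()` reports the yes/no columns as character, it can help to recode them as logical before tabulating. This is a minimal sketch on a toy data frame; the `toy` object and the `yes_no_cols` name are assumptions made for illustration and are not part of the original analysis.

```r
# Toy stand-in for a few columns of Resume (hypothetical values)
toy <- data.frame(
  call    = c("no", "yes", "no"),
  college = c("yes", "no", "yes"),
  jobs    = c(2L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Find character columns whose values are all "yes" or "no"
yes_no_cols <- names(toy)[vapply(
  toy,
  function(col) is.character(col) && all(col %in% c("yes", "no")),
  logical(1)
)]

# Recode those columns as logical; other columns are left untouched
toy[yes_no_cols] <- lapply(toy[yes_no_cols], function(col) col == "yes")

str(toy)
```

The rest of this report keeps the character coding, so the sketch is optional.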
#quick summary
summary(Resume)
## X name gender ethnicity
## Min. : 1 Length:4870 Length:4870 Length:4870
## 1st Qu.:1218 Class :character Class :character Class :character
## Median :2436 Mode :character Mode :character Mode :character
## Mean :2436
## 3rd Qu.:3653
## Max. :4870
## quality call city jobs
## Length:4870 Length:4870 Length:4870 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:3.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.661
## 3rd Qu.:4.000
## Max. :7.000
## experience honors volunteer military
## Min. : 1.000 Length:4870 Length:4870 Length:4870
## 1st Qu.: 5.000 Class :character Class :character Class :character
## Median : 6.000 Mode :character Mode :character Mode :character
## Mean : 7.843
## 3rd Qu.: 9.000
## Max. :44.000
## holes school email computer
## Length:4870 Length:4870 Length:4870 Length:4870
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## special college minimum equal
## Length:4870 Length:4870 Length:4870 Length:4870
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## wanted requirements reqexp reqcomm
## Length:4870 Length:4870 Length:4870 Length:4870
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## reqeduc reqcomp reqorg industry
## Length:4870 Length:4870 Length:4870 Length:4870
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
#What are the variables?
colnames(Resume)
## [1] "X" "name" "gender" "ethnicity" "quality"
## [6] "call" "city" "jobs" "experience" "honors"
## [11] "volunteer" "military" "holes" "school" "email"
## [16] "computer" "special" "college" "minimum" "equal"
## [21] "wanted" "requirements" "reqexp" "reqcomm" "reqeduc"
## [26] "reqcomp" "reqorg" "industry"
# Let's reduce the number of variables in the dataset
keep <- c("name", "gender", "ethnicity", "quality", "call", "city", "jobs", "experience",
"computer", "college", "minimum", "requirements", "reqexp", "reqcomm", "reqeduc", "reqcomp", "reqorg", "industry")
Resume2 <- Resume[, keep]
# What is the size of the dataset (records, variables)?
dim(Resume2)
## [1] 4870 18
#Take a peek at the dataset with reduced variables
head(Resume2)
## name gender ethnicity quality call city jobs experience computer
## 1 Allison female cauc low no chicago 2 6 yes
## 2 Kristen female cauc high no chicago 3 6 yes
## 3 Lakisha female afam low no chicago 1 6 yes
## 4 Latonya female afam high no chicago 4 6 yes
## 5 Carrie female cauc high no chicago 3 22 yes
## 6 Jay male cauc low no chicago 2 6 no
## college minimum requirements reqexp reqcomm reqeduc reqcomp reqorg
## 1 yes 5 yes yes no no yes no
## 2 no 5 yes yes no no yes no
## 3 yes 5 yes yes no no yes no
## 4 no 5 yes yes no no yes no
## 5 no some yes yes no no yes yes
## 6 yes none no no no no no no
## industry
## 1 manufacturing
## 2 manufacturing
## 3 manufacturing
## 4 manufacturing
## 5 health/education/social services
## 6 trade
Let’s discuss the variable “requirements”, which records whether the ad mentions any requirement for the job.
If there is a requirement, the resume should meet it; otherwise we assume there is no possibility of a call-back.
Conversely, if there is no requirement, anyone is eligible for the job. So let’s separate the jobs that have no requirements and store them for later.
table(Resume2$requirements)
##
## no yes
## 1036 3834
We will store 1036 records for later. Now let’s look at the 3834 records with a “requirement”.
# Recall: if the ad has no requirements, keep these 1036 records.
# If the ad has requirements, the resume must meet them.
# If the resume does not meet the requirements, assume no call-back and drop the record.
NoReq<-Resume2[Resume2$requirements=="no",]
dim(NoReq)
## [1] 1036 18
# No requirements, so store these 1036 records for later.
# Do the remaining 3834 meet their requirements?
# Unfortunately, only education, computer and experience
# can be checked explicitly, so keep the records
# that meet those requirements.
So the question now is whether the resumes answering an ad with a requirement met that requirement.
Here is the rub: the “requirements” variable is simply a yes/no flag.
The specific requirement variables are experience, communication, education, computer and organizational skills. The resume attribute variables, on the other hand, are experience, education, computer and special skills.
We can pair [requirement – attribute] for experience, education and computer.
Unfortunately, we cannot pair communication or organizational skills with the special-skills attribute, because the dataset does not define what a special skill is. These two requirements therefore have no one-to-one attribute to check against.
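The pairing rule for a yes/no requirement and its yes/no attribute can be written as a small helper: a record passes when the ad does not ask for the skill, or when the resume lists it. This is a sketch only; the function name `requirement_met` and the toy vectors are hypothetical and do not appear in the original code.

```r
# TRUE when the ad does not ask for the skill, or the resume lists it.
# Both inputs are "yes"/"no" character vectors of equal length.
requirement_met <- function(required, has_attribute) {
  required == "no" | has_attribute == "yes"
}

# Toy example (hypothetical rows): reqeduc paired with college
reqeduc <- c("no", "no", "yes", "yes")
college <- c("no", "yes", "no", "yes")
requirement_met(reqeduc, college)
```

Only the third toy row fails: the ad requires education and the resume does not list college.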
Let’s begin with the 3834 records that have a requirement.
Req<-Resume2[Resume2$requirements=="yes",]
#Here is the number of resumes that apply to a job with some "requirement"
dim(Req)
## [1] 3834 18
# Let's start with the education requirement: is it met, or is it not required?
table(Req$college,Req$reqeduc)
##
## no yes
## no 986 90
## yes 2328 430
# This cross-tab view gives a better description of the breakdown by categories
CrossTable(Req$college, Req$reqeduc)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3834
##
##
## | Req$reqeduc
## Req$college | no | yes | Row Total |
## -------------|-----------|-----------|-----------|
## no | 986 | 90 | 1076 |
## | 3.364 | 21.440 | |
## | 0.916 | 0.084 | 0.281 |
## | 0.298 | 0.173 | |
## | 0.257 | 0.023 | |
## -------------|-----------|-----------|-----------|
## yes | 2328 | 430 | 2758 |
## | 1.312 | 8.365 | |
## | 0.844 | 0.156 | 0.719 |
## | 0.702 | 0.827 | |
## | 0.607 | 0.112 | |
## -------------|-----------|-----------|-----------|
## Column Total | 3314 | 520 | 3834 |
## | 0.864 | 0.136 | |
## -------------|-----------|-----------|-----------|
##
##
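So 986 + 2328 = 3314 records have no education requirement, and we keep them. Another 430 meet the education requirement, so we keep those as well. Only the 90 resumes without college that answer ads requiring education are dropped, leaving 986 + 2328 + 430 = 3744 records to check against the next requirement.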
CollegeR<-Req[Req$college=="yes"|(Req$college=="no" & Req$reqeduc=="no"),]
dim(CollegeR)
## [1] 3744 18
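The count of 3744 can be cross-checked directly from the cross-tab, without touching the data. A small sketch, with the four cell counts copied from the table above; the names `edu`, `failed` and `kept` are illustrative.

```r
# Counts copied from the college-by-reqeduc cross-tab above
edu <- matrix(c(986, 2328, 90, 430), nrow = 2,
              dimnames = list(college = c("no", "yes"),
                              reqeduc = c("no", "yes")))

# Only resumes without college that answer ads requiring education fail
failed <- edu["no", "yes"]
kept   <- sum(edu) - failed
kept
```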
These 3744 records meet the education requirement; now check whether they meet the computer requirement as well.
table(CollegeR$computer, CollegeR$reqcomp)
##
## no yes
## no 432 110
## yes 1237 1965
This cross tab view gives a better description of the breakdown by categories
CrossTable(CollegeR$computer, CollegeR$reqcomp)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3744
##
##
## | CollegeR$reqcomp
## CollegeR$computer | no | yes | Row Total |
## ------------------|-----------|-----------|-----------|
## no | 432 | 110 | 542 |
## | 150.022 | 120.669 | |
## | 0.797 | 0.203 | 0.145 |
## | 0.259 | 0.053 | |
## | 0.115 | 0.029 | |
## ------------------|-----------|-----------|-----------|
## yes | 1237 | 1965 | 3202 |
## | 25.394 | 20.425 | |
## | 0.386 | 0.614 | 0.855 |
## | 0.741 | 0.947 | |
## | 0.330 | 0.525 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 1669 | 2075 | 3744 |
## | 0.446 | 0.554 | |
## ------------------|-----------|-----------|-----------|
##
##
So 432 + 1237 = 1669 records have no computer requirement, and we keep them. Another 1965 meet the computer requirement, so we keep those as well. That leaves 432 + 1237 + 1965 = 3634 records to check against the next requirement.
CompR<-CollegeR[CollegeR$computer=="yes"|(CollegeR$computer=="no" & CollegeR$reqcomp=="no"),]
dim(CompR)
## [1] 3634 18
These 3634 records meet the computer and education requirements; now check experience.
Required experience takes the values none, some, 0, 0.5, 1-8 and 10. Actual experience ranges from 1 to 44.
So we must check whether the actual experience meets the required experience.
table(CompR$experience,CompR$minimum)
##
## 0 0.5 1 10 2 3 4 5 6 7 8 none some
## 1 0 1 0 0 1 0 0 0 0 0 0 22 4
## 2 1 0 13 1 22 10 0 4 0 0 0 87 93
## 3 0 0 4 1 19 6 0 1 0 0 0 75 45
## 4 0 0 10 0 44 46 1 3 1 0 0 193 110
## 5 0 0 7 2 32 31 2 18 0 0 0 162 111
## 6 1 0 25 2 52 59 1 22 1 0 0 284 166
## 7 0 0 20 1 35 22 0 20 4 3 1 145 118
## 8 1 0 21 1 53 48 1 30 0 2 1 128 153
## 9 0 1 1 0 11 4 0 7 0 2 2 81 26
## 10 0 0 4 4 13 4 0 6 0 1 3 46 27
## 11 0 2 3 6 4 7 0 9 0 1 2 100 20
## 12 0 0 2 0 5 3 0 1 0 0 0 34 17
## 13 0 0 1 0 11 8 0 4 0 0 0 52 33
## 14 0 0 4 0 7 16 0 8 0 1 1 48 30
## 15 0 0 1 0 2 0 0 4 0 0 0 18 5
## 16 0 0 1 0 0 6 1 0 0 0 0 50 7
## 17 0 0 0 0 0 2 0 0 0 0 0 0 0
## 18 0 0 4 0 7 14 0 5 0 1 0 9 18
## 19 0 0 0 0 4 1 0 1 0 0 0 10 7
## 20 0 0 0 0 3 4 0 2 0 0 0 9 6
## 21 0 0 3 0 5 2 1 6 1 0 0 19 9
## 22 0 0 1 0 0 0 1 0 0 0 0 4 2
## 23 0 0 1 0 1 0 0 1 1 0 0 2 1
## 25 0 0 2 0 1 0 0 1 0 0 0 2 1
## 26 0 0 5 0 7 13 0 6 0 1 0 26 18
## 44 0 0 0 0 0 1 0 0 0 0 0 0 0
class(CompR$experience)
## [1] "integer"
class(CompR$minimum)
## [1] "character"
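# Keep resumes whose experience meets the ad's minimum.
# "none", "0" and "some" pass automatically: every resume has experience of at least 1.
ExpR<-CompR[(CompR$minimum == "none"| CompR$minimum=="0")
| (CompR$minimum=="some" & CompR$experience>=0)
# Note: the table above labels this level "0.5", so ".5" matches no rows
# and the 4 resumes at that level are not retained in the counts below.
| (CompR$minimum==".5" & CompR$experience>=0.5)
| (CompR$minimum=="1" & CompR$experience >=1)
| (CompR$minimum=="2" & CompR$experience >=2)
| (CompR$minimum=="3" & CompR$experience >=3)
| (CompR$minimum=="4" & CompR$experience >=4)
| (CompR$minimum=="5" & CompR$experience >=5)
| (CompR$minimum=="6" & CompR$experience >=6)
| (CompR$minimum=="7" & CompR$experience >=7)
| (CompR$minimum=="8" & CompR$experience >=8)
| (CompR$minimum=="10" & CompR$experience >=10),]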
dim(ExpR)
## [1] 3601 18
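The same check can also be expressed numerically instead of label by label. This is a minimal sketch, assuming that "none" and "some" impose no numeric threshold (as in the filter above, where every resume passes those levels); the function name `meets_experience` and the toy vectors are hypothetical. Because it parses the labels as numbers, it treats "0.5" and ".5" alike, so on the real data its counts could differ from those reported here. It has not been run on the dataset.

```r
# Returns TRUE where the resume's experience meets the ad's minimum.
# Assumption: non-numeric labels ("none", "some") mean no threshold.
meets_experience <- function(minimum, experience) {
  required <- suppressWarnings(as.numeric(minimum))  # "none"/"some" become NA
  required[is.na(required)] <- 0
  experience >= required
}

# Toy data (hypothetical), mimicking the labels seen in the table
minimum    <- c("none", "some", "0", "0.5", "1", "3", "10")
experience <- c(1L, 2L, 1L, 1L, 1L, 3L, 9L)
meets_experience(minimum, experience)
```

On the real data the call would look like `CompR[meets_experience(CompR$minimum, CompR$experience), ]`.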
Of the 3834 records with a requirement, 3601 meet the computer, education and experience requirements.
Recall that the 3834 records were retained because “requirements” = yes. It is understood that this flag is set when the ad lists at least one of the following requirements: education, experience, computer, communication or organization.
We were able to check whether the experience, computer and education requirements were met, but we cannot determine whether the organization and communication requirements were met.
Unfortunately, this means we need to drop the records whose ads list an organizational-skills requirement, a communication requirement, or both, because we cannot explicitly check whether the resumes have these attributes.
In other words, we keep only the records with no communication requirement and no organizational-skills requirement.
ExpRFinal<-ExpR[ExpR$reqcomm=="no" & ExpR$reqorg=="no",]
dim(ExpRFinal)
## [1] 2851 18
So the resumes that answer ads with requirements we can check, and that meet those requirements, amount to 2851 records.
Now we append the 1036 resumes with no requirements to the 2851 that met the education, computer and experience requirements: 1036 + 2851 = 3887 records to be analyzed.
NewResume<-rbind(NoReq,ExpRFinal)
dim(NewResume)
## [1] 3887 18
Recall that we began with 4870 records and kept those that either met the requirements (2851) or had no requirements (1036).
We now move to our analysis of the 3887 records.
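The record counts reported at each filtering step can be laid out as a short accounting check. The sketch below only re-uses the numbers printed by `dim()` above; the object names are illustrative.

```r
# Record counts after each filter, copied from the dim() outputs above
n <- c(with_requirement = 3834,
       education_met    = 3744,
       computer_met     = 3634,
       experience_met   = 3601,
       no_comm_or_org   = 2851)

# Records removed at each step
dropped <- -diff(n)
dropped

# Final analysis set: no-requirement records plus those that passed every check
no_requirement <- 1036
analysed <- no_requirement + n[["no_comm_or_org"]]
excluded <- 4870 - analysed
c(analysed = analysed, excluded = excluded)
```

The largest single reduction is the last one, where ads with a communication or organizational requirement are set aside.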
Here are the names in the study, classified by ethnicity:
NewResume$ethnicity[NewResume$ethnicity == 'afam'] <- 'AfricanAmer'
NewResume$ethnicity[NewResume$ethnicity == 'cauc'] <- 'Caucasian'
table(NewResume$name,NewResume$ethnicity)
##
## AfricanAmer Caucasian
## Aisha 134 0
## Allison 0 182
## Anne 0 187
## Brad 0 57
## Brendan 0 56
## Brett 0 51
## Carrie 0 132
## Darnell 38 0
## Ebony 156 0
## Emily 0 169
## Geoffrey 0 53
## Greg 0 48
## Hakim 47 0
## Jamal 54 0
## Jay 0 60
## Jermaine 48 0
## Jill 0 170
## Kareem 61 0
## Keisha 145 0
## Kenya 150 0
## Kristen 0 153
## Lakisha 159 0
## Latonya 186 0
## Latoya 163 0
## Laurie 0 151
## Leroy 57 0
## Matthew 0 56
## Meredith 0 140
## Neil 0 72
## Rasheed 58 0
## Sarah 0 146
## Tamika 200 0
## Tanisha 160 0
## Todd 0 59
## Tremayne 65 0
## Tyrone 64 0
Recall that we are interested in who got a call-back. Let's look at some visualizations.
# There are few call-backs
ggplot(NewResume) + geom_bar(aes(x = call))
# A larger number of qualified female than male resumes
ggplot(NewResume) + geom_bar(aes(x = gender))
# Approximately the same number of African American and Caucasian resumes in the analysis; well balanced.
ggplot(NewResume, aes(ethnicity)) +
geom_bar(fill = "#0073C2FF") +
theme_pubclean()
# Caucasians appear to have more call-backs
ggplot(data = NewResume, aes(x = ethnicity, fill = call)) +
geom_bar()
# The boxplots suggest similar years of experience
# in the two groups, African American and Caucasian
boxplot(NewResume$experience~NewResume$ethnicity)
# This gives you an idea of popular names and call backs
ggplot(NewResume, aes(x = name, fill = call)) +
geom_bar() +labs(title = "Call back by Name")+
theme(axis.text.x = element_text(angle = 90))
# This gives you an idea of applicants by industry:
# a similar number of Caucasian and African American applicants across industries
ggplot(NewResume, aes(x = industry, fill = ethnicity)) +
geom_bar() +labs(title = "Applicants by Industry and Ethnicity")+
theme(axis.text.x = element_text(angle = 90))
# A similar number of Caucasian and African American applicants for ads with and without requirements
ggplot(NewResume, aes(x = requirements, fill = ethnicity)) +
geom_bar() +labs(title = "Applicants by Requirement Status and Ethnicity")+
theme(axis.text.x = element_text(angle = 90))
Now let’s get to the root problem.
Is there an association between call-backs and ethnicity?
table1<-table(NewResume$ethnicity,NewResume$call)
table1
##
## no yes
## AfricanAmer 1818 127
## Caucasian 1752 190
chisq.test(table1)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table1
## X-squared = 13.307, df = 1, p-value = 0.0002644
# The chi-squared test is significant.
There is an association between ethnicity and call-backs, as confirmed by the significant chi-squared test (p-value = 0.0002644). The mosaic plot below gives an overview of the data and makes it possible to recognize relationships between the variables. Notice the larger proportion of call-backs for Caucasians depicted in the plot.
mosaicplot(table1, main = "Mosaic plot: Call backs for Applicants ", color = TRUE)
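The chi-squared test says that an association exists, but not how large it is. The short sketch below computes the call-back rate per group and their ratio, using the counts printed for table1 above; the names `calls`, `rates` and `rate_ratio` are illustrative.

```r
# Counts copied from table1 above
calls <- matrix(c(1818, 1752, 127, 190), nrow = 2,
                dimnames = list(ethnicity = c("AfricanAmer", "Caucasian"),
                                call = c("no", "yes")))

# Row-wise proportions: share of each group's resumes that got a call-back
rates <- prop.table(calls, margin = 1)[, "yes"]
round(100 * rates, 1)   # about 6.5 and 9.8 percent

# How many times higher the Caucasian call-back rate is
rate_ratio <- rates[["Caucasian"]] / rates[["AfricanAmer"]]
round(rate_ratio, 2)    # about 1.5
```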
Now let's analyze two groups: resumes with requirements that were met, and resumes with no requirements. Is there an association between ethnicity and call-backs in these groups?
We begin with the group of resumes with no required qualifications.
NewResumeNoReq<-NewResume[NewResume$requirements=="no",]
table3<-table(NewResumeNoReq$ethnicity,NewResumeNoReq$call)
table3
##
## no yes
## AfricanAmer 480 38
## Caucasian 450 68
chisq.test(table3)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table3
## X-squared = 8.8383, df = 1, p-value = 0.00295
# The chi-squared test is significant.
There is an association between ethnicity and call-backs for applicants responding to ads with no requirements.
mosaicplot(table3, main = "Call backs for Applicants NO requirements needed", color = TRUE)
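For a sense of the size of the gap in this group, base R's `prop.test()` reports the two call-back rates and a confidence interval for their difference. This is offered as a complementary sketch, not as part of the original analysis; the counts are copied from table3 above and `res` is an illustrative name. For a comparison of two groups with the default continuity correction, its test statistic matches the one from `chisq.test()`.

```r
# Call-backs (x) out of resumes sent (n), Caucasian first, then African American
res <- prop.test(x = c(68, 38), n = c(518, 518))

res$estimate    # call-back rate in each group
res$conf.int    # 95% confidence interval for the difference in rates
res$statistic   # same X-squared as chisq.test(table3)
```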
Now let's look at the resumes that met the required qualifications, a more skilled pool of prospective employees.
NewResumeReq<-NewResume[NewResume$requirements=="yes",]
table4<-table(NewResumeReq$ethnicity,NewResumeReq$call)
table4
##
## no yes
## AfricanAmer 1338 89
## Caucasian 1302 122
chisq.test(table4)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table4
## X-squared = 5.3139, df = 1, p-value = 0.02116
# The chisq test is significant.
mosaicplot(table4,main = "Call backs for Applicants meeting requirements", color = TRUE)
The group of resumes that meet requirements also shows an association between ethnicity and call-backs.
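One way to combine the two groups into a single test is the Cochran-Mantel-Haenszel test. It is not used in the original analysis and is sketched here only as a possible extension. The counts are copied from table3 and table4 above, and `strata` and `mh` are hypothetical names.

```r
# 2 x 2 x 2 array: ethnicity x call x group, filled column by column
strata <- array(
  c(480, 450, 38, 68,       # no requirements
    1338, 1302, 89, 122),   # requirements met
  dim = c(2, 2, 2),
  dimnames = list(ethnicity = c("AfricanAmer", "Caucasian"),
                  call      = c("no", "yes"),
                  group     = c("no requirements", "requirements met"))
)

mh <- mantelhaen.test(strata)

# Common odds ratio across both groups; with this layout, values above 1
# mean higher call-back odds for Caucasian resumes
mh$estimate
mh$p.value
```

A common odds ratio summarizes the two groups with one number, on the assumption that the association is similar in both.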
FINAL CONCLUSION
After examining the underlying data, we reduced the dataset to the records that were feasible for our study.
There is an association between ethnicity and call-backs, with Caucasians obtaining a higher proportion of call-backs (190 of 1942, about 9.8%) than African Americans (127 of 1945, about 6.5%).
In a further breakdown of resumes with and without requirements, both groups had a significant chi-squared test, indicating an association between ethnicity and call-backs, with Caucasians more likely to be called back.