R Bridge Course Final Project

This is a final project to show off what you have learned. Select your data set from the list below: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list).

From the list of datasets in the link above, I selected the dataset ResumeNames. The dataset has 4,870 records of fictitious resumes sent in response to employment advertisements. It records various criteria, such as the quality of the resume, the applicant's years of experience, the requirements for the position and many others. The primary question studied with this data is whether employers' call backs are biased by the applicants' names. Do Caucasian sounding names have a higher call back rate than African-American sounding names? If so, is there underlying data that explains the difference?

library(RCurl)
resume <- getURL("https://vincentarelbundock.github.io/Rdatasets/csv/AER/ResumeNames.csv")
resume_data <- read.csv(text = resume)
head(resume_data, 10)

Looking at the first ten records in the dataset, the data looks randomized. The names so far are unique, and there seems to be an even split between the two ethnicities. Let’s see a breakdown of the various names:

library(ggplot2)
# after_stat() replaces stat(), which is deprecated in recent ggplot2 releases
ggplot(resume_data, aes(name)) + geom_bar(aes(y = after_stat(count))) + coord_flip()

The names have an interesting distribution, but we can’t tell right away if there are more Caucasian sounding names or African-American sounding names. Let’s look at a more detailed summary:

summary(resume_data)
##        X            name              gender           ethnicity        
##  Min.   :   1   Length:4870        Length:4870        Length:4870       
##  1st Qu.:1218   Class :character   Class :character   Class :character  
##  Median :2436   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2436                                                           
##  3rd Qu.:3653                                                           
##  Max.   :4870                                                           
##    quality              call               city                jobs      
##  Length:4870        Length:4870        Length:4870        Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:3.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.661  
##                                                           3rd Qu.:4.000  
##                                                           Max.   :7.000  
##    experience        honors           volunteer           military        
##  Min.   : 1.000   Length:4870        Length:4870        Length:4870       
##  1st Qu.: 5.000   Class :character   Class :character   Class :character  
##  Median : 6.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 7.843                                                           
##  3rd Qu.: 9.000                                                           
##  Max.   :44.000                                                           
##     holes              school             email             computer        
##  Length:4870        Length:4870        Length:4870        Length:4870       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    special            college            minimum             equal          
##  Length:4870        Length:4870        Length:4870        Length:4870       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     wanted          requirements          reqexp            reqcomm         
##  Length:4870        Length:4870        Length:4870        Length:4870       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    reqeduc            reqcomp             reqorg            industry        
##  Length:4870        Length:4870        Length:4870        Length:4870       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
## 

The summary function reports the minimum, quartiles, median, mean and maximum of each numeric column. For the character columns, which make up most of the dataset, it reports only the length and class, so it does not yet give counts of the individual categories.
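One way to get category counts out of summary() is to convert the character columns to factors first. The sketch below assumes a small made-up data frame, toy, standing in for resume_data; the column names mirror the original ones, but the values are hypothetical.

```r
# Hypothetical toy data standing in for resume_data (values are made up).
toy <- data.frame(
  ethnicity  = c("afam", "cauc", "cauc", "afam", "cauc"),
  call       = c("no", "yes", "no", "no", "no"),
  experience = c(6, 2, 8, 4, 5),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor,
# so that summary() tabulates the levels instead of printing only the length.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

summary(toy)
```

The same two conversion lines should work on resume_data as long as it is an ordinary data frame; numeric columns are left untouched.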

With a closer look at the columns, we can see that some aren’t intuitive. The names of the columns will be changed for readability.

colnames(resume_data)
##  [1] "X"            "name"         "gender"       "ethnicity"    "quality"     
##  [6] "call"         "city"         "jobs"         "experience"   "honors"      
## [11] "volunteer"    "military"     "holes"        "school"       "email"       
## [16] "computer"     "special"      "college"      "minimum"      "equal"       
## [21] "wanted"       "requirements" "reqexp"       "reqcomm"      "reqeduc"     
## [26] "reqcomp"      "reqorg"       "industry"
resume_data <- setNames(resume_data, c("Index","Name","Gender", "Ethnicity", "Quality_of_Resume", "Called_Back", "City", "Num_Jobs_on_Resume", "Experience", "Honors", "Volunteer", "Military_Experience", "Holes_on_Resume", "Experience_in_School", "Email_on_Resume", "Computer_Skills", "Special_Skills", "College_Degree", "Minimum_Exp_Requirement", "EOE", "Position_Wanted", "Requirements", "Experience_Required", "Communication_Required", "Education_Required", "Computer_Exp_Required", "Organization_Required", "Industry"))

Confirming that the column names have changed:

colnames(resume_data)
##  [1] "Index"                   "Name"                   
##  [3] "Gender"                  "Ethnicity"              
##  [5] "Quality_of_Resume"       "Called_Back"            
##  [7] "City"                    "Num_Jobs_on_Resume"     
##  [9] "Experience"              "Honors"                 
## [11] "Volunteer"               "Military_Experience"    
## [13] "Holes_on_Resume"         "Experience_in_School"   
## [15] "Email_on_Resume"         "Computer_Skills"        
## [17] "Special_Skills"          "College_Degree"         
## [19] "Minimum_Exp_Requirement" "EOE"                    
## [21] "Position_Wanted"         "Requirements"           
## [23] "Experience_Required"     "Communication_Required" 
## [25] "Education_Required"      "Computer_Exp_Required"  
## [27] "Organization_Required"   "Industry"

We’ll begin the analysis by checking the general count of the two ethnicities, as well as checking how many from each group have a high or low quality resume.

ggplot(resume_data, aes(Ethnicity, fill = Quality_of_Resume)) + geom_bar() + ggtitle("Ethnicity and Quality of Resume Count") + xlab("Ethnicity") + ylab("Count")

The bar graph above shows that there are approximately equal numbers of Caucasian sounding names and African-American sounding names in the dataset. Additionally, the two groups have about the same split between high and low quality resumes.

Because the African-American and Caucasian groups are about the same size, the counts can be compared directly without scaling or reweighting. With roughly a 1:1 ratio, the conclusions are easy to interpret. We can now expand on the previous analysis by checking how often each ethnic group has holes in their resumes.

library(gridExtra)
afam <- subset(resume_data, Ethnicity == "afam")
cauc <- subset(resume_data, Ethnicity == "cauc")

afam_holes<-ggplot(afam, aes(Quality_of_Resume, fill = Holes_on_Resume)) + geom_bar() + ggtitle("AFAM Quality & Holes in Resume") + xlab("Quality of Resume") + ylab("Count")
cauc_holes<-ggplot(cauc, aes(Quality_of_Resume, fill = Holes_on_Resume)) + geom_bar() + ggtitle("CAUC Quality & Holes in Resume") + xlab("Quality of Resume") + ylab("Count")
grid.arrange(afam_holes, cauc_holes, nrow=1)

By creating subsets and plotting them side by side, we can look for differences between the two ethnic groups, but in this case there aren’t any. Both groups have a similar number of resumes with holes. So far, it seems that there will not be any bias between the two groups. Given that both groups have the same distribution of resume quality, holes in their resumes and overall count, we can now look at the percentage of applicants who were called back in each group.

library(vtree)
vtree(resume_data, c("Called_Back", "Ethnicity"), horiz = FALSE)

The tree diagram above shows how the records break down. Starting from the 4870 records, we first see the percentage of applicants who were called back: only about 8%. That 8% is then split between the African-American sounding names and the Caucasian sounding names. Everything so far suggests that call backs should be evenly divided between the two groups, but the breakdown shows that Caucasian sounding names receive more call backs.
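The tree splits by call back first and by ethnicity second, so it shows each group's share of the call backs. A call back rate within each group can be sketched with a row-wise proportion table. The toy data frame below is hypothetical and only illustrates the calculation; the column names Ethnicity and Called_Back are the renamed ones used in this report.

```r
# Hypothetical toy records (made-up values), using this report's column names.
toy <- data.frame(
  Ethnicity   = c("afam", "afam", "afam", "afam", "cauc", "cauc", "cauc", "cauc"),
  Called_Back = c("no", "no", "no", "yes", "no", "no", "yes", "yes")
)

# Cross-tabulate ethnicity (rows) against call back outcome (columns).
counts <- table(toy$Ethnicity, toy$Called_Back)

# margin = 1 normalizes each row, so every ethnic group sums to 1
# and the "yes" column is that group's call back rate.
rates <- prop.table(counts, margin = 1)
round(100 * rates, 1)
```

Replacing toy with resume_data should give the per-group rates for the real data.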

The small table below shows how often each number of years of experience appears among the applicants.

table(resume_data$Experience)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##  45 352 194 537 507 817 541 578 159 130 173  69 154 149  34  94   3  77  46  35 
##  21  22  23  25  26  44 
##  47   8   9   7 104   1

With the table, we see that most applicants (3,526 of 4,870, or about 72%) have 2 to 8 years of experience. Perhaps a difference in experience between the two groups has something to do with why Caucasian sounding names receive more call backs.
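The share of applicants in the 2 to 8 year range can be checked directly from the frequency table. The sketch below copies the counts from the table output above into two vectors; the names years, counts and share_2_to_8 are introduced here for illustration.

```r
# Years of experience and their frequencies, copied from the
# table(resume_data$Experience) output in this report.
years  <- c(1:23, 25, 26, 44)
counts <- c(45, 352, 194, 537, 507, 817, 541, 578, 159, 130, 173, 69, 154,
            149, 34, 94, 3, 77, 46, 35, 47, 8, 9, 7, 104, 1)

# Total number of records and the share with 2 to 8 years of experience.
total        <- sum(counts)
share_2_to_8 <- sum(counts[years >= 2 & years <= 8]) / total

round(100 * share_2_to_8, 1)
```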

ggplot(resume_data, aes(x=Ethnicity, y=Experience)) + geom_point(alpha = .2, position = "jitter", color = "orange") + geom_boxplot(alpha = 0) + ggtitle("Ethnicity vs Experience Boxplot") + xlab("Ethnicity") + ylab("Experience")

The boxplot above confirms our initial findings: there isn’t a definitive difference in experience between the two ethnic groups. We have to look at other criteria to see where the difference lies, or is there a non-data reason for the bias towards Caucasian sounding names?
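A numeric counterpart to the boxplot is to compute the median and quartiles of experience per group. The sketch below uses a small hypothetical data frame; only the column names Ethnicity and Experience come from this report.

```r
# Hypothetical toy data (made-up values), using this report's column names.
toy <- data.frame(
  Ethnicity  = c("afam", "afam", "afam", "cauc", "cauc", "cauc"),
  Experience = c(2, 6, 10, 4, 6, 8)
)

# Median years of experience within each ethnic group.
medians <- tapply(toy$Experience, toy$Ethnicity, median)

# Lower and upper quartiles within each group, returned as a list per group.
quartiles <- tapply(toy$Experience, toy$Ethnicity, quantile, probs = c(0.25, 0.75))

medians
quartiles
```

If the medians and quartiles of the two groups are close on the real data, that supports the reading of the boxplot.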

# Bar chart of one column, faceted by ethnicity.
# `column` is the name of a column in resume_data, given as a string.
visualize <- function(column) {
    ggplot(resume_data, aes(.data[[column]])) + geom_bar() + facet_wrap(~Ethnicity, nrow = 1)
}
p1 <- visualize("Computer_Exp_Required")
p2 <- visualize("Education_Required")
p3 <- visualize("Organization_Required")
grid.arrange(p1 + ggtitle("Computer Exp Requirement") + theme(axis.title.x = element_blank()), 
             p2 + ggtitle("Education Requirement") + theme(axis.title.x = element_blank()),
             p3 + ggtitle("Organization Requirements") + theme(axis.title.x = element_blank()), nrow = 1) 

From this final set of plots, we can see that there isn’t much of a difference between the groups in the three criteria above: required computer experience, a required education level and required organizational skills. The positions applied to have the same distribution of requirements for both groups. None of the data examined explains why Caucasian sounding names have a higher call back rate than African-American sounding names. On paper, everyone is on the same playing field: the groups have the same credentials, years of experience, education and many other criteria. There is no obvious explanation in the data for the higher Caucasian call back rate. It is likely that there is an ethnic preference in these positions, or that fictitious resumes are too artificial to support a realistic conclusion.
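One further check, not part of the analysis above, would be to ask whether a gap in call back rates is larger than chance alone would produce. A minimal sketch with base R's prop.test follows. The counts are hypothetical placeholders, not values taken from the dataset; the real counts would come from table(resume_data$Ethnicity, resume_data$Called_Back).

```r
# Hypothetical placeholder counts, NOT taken from the dataset.
callbacks <- c(afam = 60, cauc = 100)    # resumes that received a call back
sent      <- c(afam = 1000, cauc = 1000) # resumes sent per group

# Two-sample test for equality of proportions
# (applies a continuity correction by default).
result <- prop.test(x = callbacks, n = sent)

result$estimate  # call back rate per group
result$p.value   # a small p-value suggests the gap is unlikely to be chance
```

A small p-value would only indicate that the difference is unlikely to be random; it would not by itself say why the difference exists.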