The dataset I am using to conduct my research is from data.gov, the United States Government’s open data site. The dataset is Alleghany County employee salaries from 2022 and includes annual salary, regular pay, incentive pay, and gross pay for employees under the County Executive. It also includes employees’ department, job title, gender, ethnicity, and pay status. My hypotheses are as follows:
Null Hypothesis: There is no significant difference in the likelihood of being terminated between individuals identifying as racial or gender minorities and those identifying as White or male.
Alternative Hypothesis: There is a significant difference in the likelihood of being terminated between individuals identifying as racial or gender minorities and those identifying as White or male.
rm(list=ls())
library(dplyr)
library(ggplot2)
salary <- read.csv("~/Desktop/redacted-2022-december-31-wprdc.csv")
head(salary)
str(salary)
'data.frame': 6280 obs. of 18 variables:
$ FIRST_NAME : chr "CATHERINE" "KEVIN J" "JOY M" "ELIZABETH S" ...
$ LAST_NAME : chr "ABALO" "ABBOTT" "ABBOTT" "ABRAHAM" ...
$ Combo.Name : chr "ABALO, CATHERINE" "ABBOTT, KEVIN J" "ABBOTT, JOY M" "ABRAHAM, ELIZABETH S" ...
$ DEPARTMENT : chr "Kane Regional Centers" "Emergency Management" "Kane Regional Centers" "Parks" ...
$ JOB_TITLE : chr "NURSING ASSISTANT" "FIRE INSTRUCTOR - PART TIME" "COOK" "SEASONAL AIDE" ...
$ ELECTED_OFFICIAL: int 0 0 0 0 0 0 0 0 0 0 ...
$ DATE_STARTED : chr "4/8/2010" "6/11/2018" "2/14/1999" "5/26/2022" ...
$ SEX : chr "F" "M" "F" "F" ...
$ ETHNICITY : chr "Black" "White (Not of Hispanic Origin)" "White (Not of Hispanic Origin)" "White (Not of Hispanic Origin)" ...
$ ORIG_START : chr "7/6/2009" "6/11/2018" "2/2/1998" "5/26/2022" ...
$ DATE_TERM : chr "" "" "" "9/5/2022" ...
$ PAY_STATUS : chr "Active" "Active" "Active" "Terminated" ...
$ ANNUAL_SALARY : chr " 41,581.07 " " 30,576.00 " " 40,761.76 " " 39,936.00 " ...
$ REGULAR_PAY : chr " 40,740.50 " " 378.00 " " 44,015.85 " " 6,488.00 " ...
$ OVERTIME_PAY : chr " 23,416.98 " " - " " 1,923.77 " " - " ...
$ INCENTIVE_PAY : chr " 125.00 " " - " " 225.00 " " 100.00 " ...
$ GROSS_PAY : chr " 64,282.48 " " 378.00 " " 46,164.62 " " 6,588.00 " ...
$ X : chr "" "" "" "" ...
dim(salary)
[1] 6280 18
summary(salary)
FIRST_NAME LAST_NAME Combo.Name DEPARTMENT JOB_TITLE
Length:6280 Length:6280 Length:6280 Length:6280 Length:6280
Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character
ELECTED_OFFICIAL DATE_STARTED SEX ETHNICITY ORIG_START
Min. :0.0000000 Length:6280 Length:6280 Length:6280 Length:6280
1st Qu.:0.0000000 Class :character Class :character Class :character Class :character
Median :0.0000000 Mode :character Mode :character Mode :character Mode :character
Mean :0.0009554
3rd Qu.:0.0000000
Max. :1.0000000
DATE_TERM PAY_STATUS ANNUAL_SALARY REGULAR_PAY OVERTIME_PAY
Length:6280 Length:6280 Length:6280 Length:6280 Length:6280
Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character
INCENTIVE_PAY GROSS_PAY X
Length:6280 Length:6280 Length:6280
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Dropping unnecessary columns. dropping personal information for privacy reasons!
salary <- salary %>% select(-1,-2,-3,-6)
Visuals
ggplot(data = salary, aes(x = ETHNICITY, y = ANNUAL_SALARY, fill = ETHNICITY)) +
geom_bar(stat = "identity", position = "dodge") +
labs(y = "Annual Salary",
title = "Annual Salary by Ethnicity")
#y axis is mess because annual salary is characters not numeric
#since my y-axis had too big of numbers, I am going to try and use the log of annual salary
class(salary$ANNUAL_SALARY)
[1] "character"
#character
salary$ANNUAL_SALARY <- as.numeric(gsub("\\$|,", "", salary$ANNUAL_SALARY))
#changing to numeric
salary$salary_log <- log(salary$ANNUAL_SALARY)
#using fixed numerical salary
#ggplot(data = salary, aes(x = ETHNICITY, y = ANNUAL_SALARY, fill = ETHNICITY)) +
#geom_bar(stat = "identity", position = "dodge") +
#labs(y = "Annual Salary",
#title = "Annual Salary by Ethnicity")
#originally used bar chart but switched to gem_col and grouped by ethnicity and found the average salary
avg_salary <- salary %>%
group_by(ETHNICITY) %>%
summarise(avg_salary = mean(ANNUAL_SALARY, na.rm = TRUE))
ggplot(avg_salary, aes(x = ETHNICITY, y = avg_salary, fill = ETHNICITY)) +
geom_col() +
labs(y = "Average Annual Salary",
title = "Average Annual Salary by Ethnicity")
Bar graph for gender
#ggplot(data = salary, aes(x = SEX, y = ANNUAL_SALARY, fill = SEX)) +
#geom_bar(stat = "identity", position = "dodge") +
#labs(y = "Annual Salary",
#title = "Annual Salary by Sex")
#originally used bar chart but switched to gem_col and grouped by sex and found the average salary
avg_salary_sex <- salary %>%
group_by(SEX) %>%
summarise(avg_salary = mean(ANNUAL_SALARY, na.rm = TRUE))
ggplot(avg_salary_sex, aes(x = SEX, y = avg_salary, fill = SEX)) +
geom_col() +
labs(y = "Average Annual Salary",
title = "Average Annual Salary by Sex")
Exploring other visuals with active pay (experimentation)
library(ggplot2)
summary_data <- salary %>%
group_by(ETHNICITY, PAY_STATUS) %>%
summarise(count = n())
`summarise()` has grouped output by 'ETHNICITY'. You can override using the `.groups` argument.
ggplot(data = summary_data, aes(x = "", y = count, fill = PAY_STATUS)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
facet_wrap(~ ETHNICITY) + #separate pie charts
labs(title = "Distribution of Active Pay Status by Ethnicity") +
theme_void()
#the White category appears as a full pie chart while the other categories appear as small slivers, it is probably because the White category has significantly more observations compared to the other categories. This imbalance in the distribution of observations can lead to differences in the appearance of the pie chart slices.
Density plot for pay status
#scientific notation
options(scipen = 999)
ggplot(salary, aes(x = ANNUAL_SALARY, fill = PAY_STATUS)) +
geom_density(alpha = 0.5) +
labs(x = "Annual Salary", y = "Density", fill = "Active Pay Status") +
scale_fill_manual(values = c("Active" = "blue", "Terminated" = "red")) +
theme_minimal()
Changing gender and race and pay status into binary
#race
salary$ETHNICITY[salary$ETHNICITY== "White (Not of Hispanic Origin)"] <- 0
salary$ETHNICITY[salary$ETHNICITY== "Black"] <- 1
salary$ETHNICITY[salary$ETHNICITY== "American Indian/Alaskan Native"] <- 1
salary$ETHNICITY[salary$ETHNICITY== "Hispanic"] <- 1
salary$ETHNICITY[salary$ETHNICITY== "Asian or Pacific Islander"] <- 1
salary$ETHNICITY[salary$ETHNICITY== "Two or More Races"] <- 1
#gender
salary$SEX[salary$SEX== "M"] <- 0
salary$SEX[salary$SEX== "F"] <- 1
#pay status
salary$PAY_STATUS[salary$PAY_STATUS == "Active"] <- 0
salary$PAY_STATUS[salary$PAY_STATUS == "Terminated"] <- 1
SPLIT DATA– TRAINING AND TEST
# Set seed for reproducibility
set.seed(42)
# Split data into training and testing sets
split <- 0.75
rows <- nrow(salary)
train.entries <- sample(rows, rows*split) # We have already set split as 0.75
model.train <- salary[train.entries, ]
model.valid <- salary[-train.entries, ]
Regression
model.train$PAY_STATUS <- as.numeric(model.train$PAY_STATUS)
model.valid$PAY_STATUS <- as.numeric(model.valid$PAY_STATUS)
model.train$SEX <- factor(model.train$SEX)
model.train$ETHNICITY <- factor(model.train$ETHNICITY)
model.valid$SEX <- factor(model.valid$SEX)
model.valid$ETHNICITY <- factor(model.valid$ETHNICITY)
model <- glm(PAY_STATUS ~ SEX + ETHNICITY + ANNUAL_SALARY, data = model.train, family = binomial)
summary(model)
Call:
glm(formula = PAY_STATUS ~ SEX + ETHNICITY + ANNUAL_SALARY, family = binomial,
data = model.train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.406737941 0.130549863 3.116 0.00184 **
SEX1 0.118293921 0.076221464 1.552 0.12067
ETHNICITY1 -0.189988072 0.098528610 -1.928 0.05382 .
ANNUAL_SALARY -0.000035030 0.000002355 -14.876 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4752.3 on 4709 degrees of freedom
Residual deviance: 4448.1 on 4706 degrees of freedom
AIC: 4456.1
Number of Fisher Scoring iterations: 5
str(model.valid$SEX)
Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 2 2 1 ...
#factor
str(model.valid$ETHNICITY)
Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 2 1 1 ...
#factor
#making predictions
probabilities <- predict(model, newdata = model.valid, type = "response")
CONFUSION MATRIX
predicted <- ifelse(probabilities < 0.5, 1, 0)
conf_matrix <- table(predicted, model.valid$PAY_STATUS)
conf_matrix
predicted 0 1
0 1 6
1 1262 301
#predicted
# 1- terminated
# under predicted 0 1 -- 0 (-) 1 (+)
That’s how we can interpret a Confusion Matrix:
Actual
+ -
Predicted + TP FP - FN TN
TP (True Positive): survived TN (True Negative): not survived FP (False Positive): predicted as survived but didn’t actually survive FN (False Negative): predicted as not survived but did actually survive
The Confusion Matrix can be used to calculate the
accuracy of the model:
# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
[1] 0.1923567
#.1923567
20% accuracy
I used a confusion matrix to gauge the efficacy of my model. Initially, my model accurately predicted the termination of 301 individuals, aligning with the actual outcomes. Similarly, it correctly anticipated the continued employment of one individual, aligning with the actual outcomes again. However, the discrepancy arises in the false positives and negatives: my model projected 1,262 workers to be terminated when they were still active and misjudged six active workers as terminated. Consequently, the model’s accuracy is 20%, indicating its inadequacy. This limitation may come from the limited variables utilized, necessitating a more comprehensive dataset encompassing factors like seniority, industry affiliations, and the potential influence of COVID-19 and the pandemic. Despite leveraging the available data to the best of my ability, the results underscore the imperative for additional information to conduct a thorough and precise analysis.