The dataset I am using to conduct my research is from data.gov, the United States Government’s open data site. The dataset is Alleghany County employee salaries from 2022 and includes annual salary, regular pay, incentive pay, and gross pay for employees under the County Executive. It also includes employees’ department, job title, gender, ethnicity, and pay status. My hypotheses are as follows:

Null Hypothesis: There is no significant difference in the likelihood of being terminated between individuals identifying as racial or gender minorities and those identifying as White or male.

Alternative Hypothesis: There is a significant difference in the likelihood of being terminated between individuals identifying as racial or gender minorities and those identifying as White or male.

rm(list=ls()) 

library(dplyr)
library(ggplot2)

salary <- read.csv("~/Desktop/redacted-2022-december-31-wprdc.csv")
head(salary) 
str(salary) 
'data.frame':   6280 obs. of  18 variables:
 $ FIRST_NAME      : chr  "CATHERINE" "KEVIN J" "JOY M" "ELIZABETH S" ...
 $ LAST_NAME       : chr  "ABALO" "ABBOTT" "ABBOTT" "ABRAHAM" ...
 $ Combo.Name      : chr  "ABALO, CATHERINE" "ABBOTT, KEVIN J" "ABBOTT, JOY M" "ABRAHAM, ELIZABETH S" ...
 $ DEPARTMENT      : chr  "Kane Regional Centers" "Emergency Management" "Kane Regional Centers" "Parks" ...
 $ JOB_TITLE       : chr  "NURSING ASSISTANT" "FIRE INSTRUCTOR - PART TIME" "COOK" "SEASONAL AIDE" ...
 $ ELECTED_OFFICIAL: int  0 0 0 0 0 0 0 0 0 0 ...
 $ DATE_STARTED    : chr  "4/8/2010" "6/11/2018" "2/14/1999" "5/26/2022" ...
 $ SEX             : chr  "F" "M" "F" "F" ...
 $ ETHNICITY       : chr  "Black" "White (Not of Hispanic Origin)" "White (Not of Hispanic Origin)" "White (Not of Hispanic Origin)" ...
 $ ORIG_START      : chr  "7/6/2009" "6/11/2018" "2/2/1998" "5/26/2022" ...
 $ DATE_TERM       : chr  "" "" "" "9/5/2022" ...
 $ PAY_STATUS      : chr  "Active" "Active" "Active" "Terminated" ...
 $ ANNUAL_SALARY   : chr  " 41,581.07 " " 30,576.00 " " 40,761.76 " " 39,936.00 " ...
 $ REGULAR_PAY     : chr  " 40,740.50 " " 378.00 " " 44,015.85 " " 6,488.00 " ...
 $ OVERTIME_PAY    : chr  " 23,416.98 " " -   " " 1,923.77 " " -   " ...
 $ INCENTIVE_PAY   : chr  " 125.00 " " -   " " 225.00 " " 100.00 " ...
 $ GROSS_PAY       : chr  " 64,282.48 " " 378.00 " " 46,164.62 " " 6,588.00 " ...
 $ X               : chr  "" "" "" "" ...
dim(salary) 
[1] 6280   18
summary(salary) 
  FIRST_NAME         LAST_NAME          Combo.Name         DEPARTMENT         JOB_TITLE        
 Length:6280        Length:6280        Length:6280        Length:6280        Length:6280       
 Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                               
                                                                                               
                                                                                               
 ELECTED_OFFICIAL    DATE_STARTED           SEX             ETHNICITY          ORIG_START       
 Min.   :0.0000000   Length:6280        Length:6280        Length:6280        Length:6280       
 1st Qu.:0.0000000   Class :character   Class :character   Class :character   Class :character  
 Median :0.0000000   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.0009554                                                                              
 3rd Qu.:0.0000000                                                                              
 Max.   :1.0000000                                                                              
  DATE_TERM          PAY_STATUS        ANNUAL_SALARY      REGULAR_PAY        OVERTIME_PAY      
 Length:6280        Length:6280        Length:6280        Length:6280        Length:6280       
 Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                               
                                                                                               
                                                                                               
 INCENTIVE_PAY       GROSS_PAY              X            
 Length:6280        Length:6280        Length:6280       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         

Dropping unnecessary columns. dropping personal information for privacy reasons!

salary <- salary %>% select(-1,-2,-3,-6)

Visuals

ggplot(data = salary, aes(x = ETHNICITY, y = ANNUAL_SALARY, fill = ETHNICITY)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(y = "Annual Salary",
       title = "Annual Salary by Ethnicity")


#y axis is mess because annual salary is characters not numeric

#since my y-axis had too big of numbers, I am going to try and use the log of annual salary

class(salary$ANNUAL_SALARY)
[1] "character"
#character

salary$ANNUAL_SALARY <- as.numeric(gsub("\\$|,", "", salary$ANNUAL_SALARY))
#changing to numeric

salary$salary_log <- log(salary$ANNUAL_SALARY)
#using fixed numerical salary 
#ggplot(data = salary, aes(x = ETHNICITY, y = ANNUAL_SALARY, fill = ETHNICITY)) +
  #geom_bar(stat = "identity", position = "dodge") +
  #labs(y = "Annual Salary",
       #title = "Annual Salary by Ethnicity")

#originally used bar chart but switched to gem_col and grouped by ethnicity and found the average salary

avg_salary <- salary %>%
  group_by(ETHNICITY) %>%
  summarise(avg_salary = mean(ANNUAL_SALARY, na.rm = TRUE))


ggplot(avg_salary, aes(x = ETHNICITY, y = avg_salary, fill = ETHNICITY)) +
  geom_col() +
  labs(y = "Average Annual Salary",
       title = "Average Annual Salary by Ethnicity")

Bar graph for gender

#ggplot(data = salary, aes(x = SEX, y = ANNUAL_SALARY, fill = SEX)) +
  #geom_bar(stat = "identity", position = "dodge") +
  #labs(y = "Annual Salary",
       #title = "Annual Salary by Sex")

#originally used bar chart but switched to gem_col and grouped by sex and found the average salary

avg_salary_sex <- salary %>%
  group_by(SEX) %>%
  summarise(avg_salary = mean(ANNUAL_SALARY, na.rm = TRUE))

ggplot(avg_salary_sex, aes(x = SEX, y = avg_salary, fill = SEX)) +
  geom_col() +
  labs(y = "Average Annual Salary",
       title = "Average Annual Salary by Sex")

Exploring other visuals with active pay (experimentation)

library(ggplot2)

summary_data <- salary %>%
  group_by(ETHNICITY, PAY_STATUS) %>%
  summarise(count = n())
`summarise()` has grouped output by 'ETHNICITY'. You can override using the `.groups` argument.
ggplot(data = summary_data, aes(x = "", y = count, fill = PAY_STATUS)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  facet_wrap(~ ETHNICITY) +  #separate pie charts 
  labs(title = "Distribution of Active Pay Status by Ethnicity") +
  theme_void()



#the White category appears as a full pie chart while the other categories appear as small slivers, it is probably because the White category has significantly more observations compared to the other categories. This imbalance in the distribution of observations can lead to differences in the appearance of the pie chart slices.

Density plot for pay status


#scientific notation
  options(scipen = 999)



ggplot(salary, aes(x = ANNUAL_SALARY, fill = PAY_STATUS)) +
  geom_density(alpha = 0.5) +
  labs(x = "Annual Salary", y = "Density", fill = "Active Pay Status") +
  scale_fill_manual(values = c("Active" = "blue", "Terminated" = "red")) +
  theme_minimal()

Changing gender and race and pay status into binary

#race
salary$ETHNICITY[salary$ETHNICITY== "White (Not of Hispanic Origin)"] <- 0
salary$ETHNICITY[salary$ETHNICITY== "Black"] <- 1
salary$ETHNICITY[salary$ETHNICITY== "American Indian/Alaskan Native"] <- 1
salary$ETHNICITY[salary$ETHNICITY== "Hispanic"] <- 1
salary$ETHNICITY[salary$ETHNICITY== "Asian or Pacific Islander"] <- 1
salary$ETHNICITY[salary$ETHNICITY== "Two or More Races"] <- 1

#gender
salary$SEX[salary$SEX== "M"] <- 0
salary$SEX[salary$SEX== "F"] <- 1

#pay status
salary$PAY_STATUS[salary$PAY_STATUS == "Active"] <- 0
salary$PAY_STATUS[salary$PAY_STATUS == "Terminated"] <- 1

SPLIT DATA– TRAINING AND TEST

# Set seed for reproducibility
set.seed(42)

# Split data into training and testing sets
split <- 0.75
rows  <- nrow(salary)

train.entries <- sample(rows, rows*split) # We have already set split as 0.75

model.train <- salary[train.entries, ]
model.valid  <- salary[-train.entries,  ]

Regression


model.train$PAY_STATUS <- as.numeric(model.train$PAY_STATUS)
model.valid$PAY_STATUS <- as.numeric(model.valid$PAY_STATUS)


model.train$SEX <- factor(model.train$SEX)
model.train$ETHNICITY <- factor(model.train$ETHNICITY)

model.valid$SEX <- factor(model.valid$SEX)
model.valid$ETHNICITY <- factor(model.valid$ETHNICITY)


model <- glm(PAY_STATUS ~ SEX + ETHNICITY + ANNUAL_SALARY, data = model.train, family = binomial)


summary(model)

Call:
glm(formula = PAY_STATUS ~ SEX + ETHNICITY + ANNUAL_SALARY, family = binomial, 
    data = model.train)

Coefficients:
                  Estimate   Std. Error z value             Pr(>|z|)    
(Intercept)    0.406737941  0.130549863   3.116              0.00184 ** 
SEX1           0.118293921  0.076221464   1.552              0.12067    
ETHNICITY1    -0.189988072  0.098528610  -1.928              0.05382 .  
ANNUAL_SALARY -0.000035030  0.000002355 -14.876 < 0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4752.3  on 4709  degrees of freedom
Residual deviance: 4448.1  on 4706  degrees of freedom
AIC: 4456.1

Number of Fisher Scoring iterations: 5
str(model.valid$SEX)
 Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 2 2 1 ...
#factor
str(model.valid$ETHNICITY)
 Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 2 1 1 ...
#factor

#making predictions
probabilities <- predict(model, newdata = model.valid, type = "response")

CONFUSION MATRIX

predicted <- ifelse(probabilities < 0.5, 1, 0)
conf_matrix <- table(predicted, model.valid$PAY_STATUS)
conf_matrix
         
predicted    0    1
        0    1    6
        1 1262  301
#predicted 


# 1- terminated
# under predicted 0 1 -- 0 (-) 1 (+) 

That’s how we can interpret a Confusion Matrix:

         Actual
         +   -  

Predicted + TP FP - FN TN

TP (True Positive): survived TN (True Negative): not survived FP (False Positive): predicted as survived but didn’t actually survive FN (False Negative): predicted as not survived but did actually survive

The Confusion Matrix can be used to calculate the accuracy of the model:

# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
[1] 0.1923567
#.1923567

20% accuracy

I used a confusion matrix to gauge the efficacy of my model. Initially, my model accurately predicted the termination of 301 individuals, aligning with the actual outcomes. Similarly, it correctly anticipated the continued employment of one individual, aligning with the actual outcomes again. However, the discrepancy arises in the false positives and negatives: my model projected 1,262 workers to be terminated when they were still active and misjudged six active workers as terminated. Consequently, the model’s accuracy is 20%, indicating its inadequacy. This limitation may come from the limited variables utilized, necessitating a more comprehensive dataset encompassing factors like seniority, industry affiliations, and the potential influence of COVID-19 and the pandemic. Despite leveraging the available data to the best of my ability, the results underscore the imperative for additional information to conduct a thorough and precise analysis.

---
title: "R Notebook"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---

The dataset I am using to conduct my research is from data.gov, the United States Government’s open data site. The dataset is Alleghany County employee salaries from 2022 and includes annual salary, regular pay, incentive pay, and gross pay for employees under the County Executive. It also includes employees' department, job title, gender, ethnicity, and pay status. My hypotheses are as follows:

Null Hypothesis:  There is no significant difference in the likelihood of being terminated between individuals identifying as racial or gender minorities and those identifying as White or male.

Alternative Hypothesis:  There is a significant difference in the likelihood of being terminated between individuals identifying as racial or gender minorities and those identifying as White or male.
```{r}
rm(list=ls()) 

library(dplyr)
library(ggplot2)

salary <- read.csv("~/Desktop/redacted-2022-december-31-wprdc.csv")

```

```{r}
head(salary) 
str(salary) 
dim(salary) 
summary(salary) 
```

Dropping unnecessary columns. dropping personal information for privacy reasons!
```{r}
salary <- salary %>% select(-1,-2,-3,-6)
```


Visuals
```{r}
ggplot(data = salary, aes(x = ETHNICITY, y = ANNUAL_SALARY, fill = ETHNICITY)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(y = "Annual Salary",
       title = "Annual Salary by Ethnicity")

#y axis is mess because annual salary is characters not numeric
```
```{r}

#since my y-axis had too big of numbers, I am going to try and use the log of annual salary

class(salary$ANNUAL_SALARY)
#character

salary$ANNUAL_SALARY <- as.numeric(gsub("\\$|,", "", salary$ANNUAL_SALARY))
#changing to numeric

salary$salary_log <- log(salary$ANNUAL_SALARY)
```



```{r}
#using fixed numerical salary 
#ggplot(data = salary, aes(x = ETHNICITY, y = ANNUAL_SALARY, fill = ETHNICITY)) +
  #geom_bar(stat = "identity", position = "dodge") +
  #labs(y = "Annual Salary",
       #title = "Annual Salary by Ethnicity")

#originally used bar chart but switched to gem_col and grouped by ethnicity and found the average salary

avg_salary <- salary %>%
  group_by(ETHNICITY) %>%
  summarise(avg_salary = mean(ANNUAL_SALARY, na.rm = TRUE))


ggplot(avg_salary, aes(x = ETHNICITY, y = avg_salary, fill = ETHNICITY)) +
  geom_col() +
  labs(y = "Average Annual Salary",
       title = "Average Annual Salary by Ethnicity")

```

Bar graph for gender
```{r}
#ggplot(data = salary, aes(x = SEX, y = ANNUAL_SALARY, fill = SEX)) +
  #geom_bar(stat = "identity", position = "dodge") +
  #labs(y = "Annual Salary",
       #title = "Annual Salary by Sex")

#originally used bar chart but switched to gem_col and grouped by sex and found the average salary

avg_salary_sex <- salary %>%
  group_by(SEX) %>%
  summarise(avg_salary = mean(ANNUAL_SALARY, na.rm = TRUE))

ggplot(avg_salary_sex, aes(x = SEX, y = avg_salary, fill = SEX)) +
  geom_col() +
  labs(y = "Average Annual Salary",
       title = "Average Annual Salary by Sex")

```
Exploring other visuals with active pay (experimentation)
```{r}
library(ggplot2)

summary_data <- salary %>%
  group_by(ETHNICITY, PAY_STATUS) %>%
  summarise(count = n())

ggplot(data = summary_data, aes(x = "", y = count, fill = PAY_STATUS)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  facet_wrap(~ ETHNICITY) +  #separate pie charts 
  labs(title = "Distribution of Active Pay Status by Ethnicity") +
  theme_void()


#the White category appears as a full pie chart while the other categories appear as small slivers, it is probably because the White category has significantly more observations compared to the other categories. This imbalance in the distribution of observations can lead to differences in the appearance of the pie chart slices.

```
Density plot for pay status
```{r}

#scientific notation
  options(scipen = 999)



ggplot(salary, aes(x = ANNUAL_SALARY, fill = PAY_STATUS)) +
  geom_density(alpha = 0.5) +
  labs(x = "Annual Salary", y = "Density", fill = "Active Pay Status") +
  scale_fill_manual(values = c("Active" = "blue", "Terminated" = "red")) +
  theme_minimal()
```


Changing gender and race and pay status into binary
```{r}
#race
salary$ETHNICITY[salary$ETHNICITY== "White (Not of Hispanic Origin)"] <- 0
salary$ETHNICITY[salary$ETHNICITY== "Black"] <- 1
salary$ETHNICITY[salary$ETHNICITY== "American Indian/Alaskan Native"] <- 1
salary$ETHNICITY[salary$ETHNICITY== "Hispanic"] <- 1
salary$ETHNICITY[salary$ETHNICITY== "Asian or Pacific Islander"] <- 1
salary$ETHNICITY[salary$ETHNICITY== "Two or More Races"] <- 1

#gender
salary$SEX[salary$SEX== "M"] <- 0
salary$SEX[salary$SEX== "F"] <- 1

#pay status
salary$PAY_STATUS[salary$PAY_STATUS == "Active"] <- 0
salary$PAY_STATUS[salary$PAY_STATUS == "Terminated"] <- 1

```
SPLIT DATA-- TRAINING AND TEST
```{r}
# Set seed for reproducibility
set.seed(42)

# Split data into training and testing sets
split <- 0.75
rows  <- nrow(salary)

train.entries <- sample(rows, rows*split) # We have already set split as 0.75

model.train <- salary[train.entries, ]
model.valid  <- salary[-train.entries,  ]
```

Regression
```{r}

model.train$PAY_STATUS <- as.numeric(model.train$PAY_STATUS)
model.valid$PAY_STATUS <- as.numeric(model.valid$PAY_STATUS)


model.train$SEX <- factor(model.train$SEX)
model.train$ETHNICITY <- factor(model.train$ETHNICITY)

model.valid$SEX <- factor(model.valid$SEX)
model.valid$ETHNICITY <- factor(model.valid$ETHNICITY)


model <- glm(PAY_STATUS ~ SEX + ETHNICITY + ANNUAL_SALARY, data = model.train, family = binomial)


summary(model)

```

```{r}
str(model.valid$SEX)
#factor
str(model.valid$ETHNICITY)
#factor

#making predictions
probabilities <- predict(model, newdata = model.valid, type = "response")

```


CONFUSION MATRIX
```{r}
predicted <- ifelse(probabilities < 0.5, 1, 0)
conf_matrix <- table(predicted, model.valid$PAY_STATUS)
conf_matrix

#predicted 


# 1- terminated
# under predicted 0 1 -- 0 (-) 1 (+) 
```

That's how we can interpret a Confusion Matrix:

             Actual
             +   -  
Predicted +  TP  FP
          -  FN  TN

TP (True Positive): survived
TN (True Negative): not survived
FP (False Positive): predicted as survived but didn't actually survive
FN (False Negative): predicted as not survived but did actually survive


The Confusion Matrix can be used to calculate the `accuracy` of the model:
```{r}
# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
#.1923567
```

20% accuracy 


I used a confusion matrix to gauge the efficacy of my model. Initially, my model accurately predicted the termination of 301 individuals, aligning with the actual outcomes. Similarly, it correctly anticipated the continued employment of one individual, aligning with the actual outcomes again. However, the discrepancy arises in the false positives and negatives: my model projected 1,262 workers to be terminated when they were still active and misjudged six active workers as terminated. Consequently, the model's accuracy is 20%, indicating its inadequacy. This limitation may come from the limited variables utilized, necessitating a more comprehensive dataset encompassing factors like seniority, industry affiliations, and the potential influence of COVID-19 and the pandemic. Despite leveraging the available data to the best of my ability, the results underscore the imperative for additional information to conduct a thorough and precise analysis.
