Project 1

2024-11-03

Introduction

For this project, I used payroll data taken from all Los Angeles city employees. The data includes employee information from departments like the police, library, and harbor.

The data set contains 555655 observations with 20 variables each, detailing information about the employees’ characteristics and their jobs.

Here is a link to the original data set: https://controllerdata.lacity.org/Payroll/City-Employee-Payroll-Current-/g9h8-fvhu/about_data

Problem Definition

The data centers around the employees’ pay, but it also has a lot of variables to characterize each employee:

##  [1] "ALL_OTHER_PAY"                 "BENEFIT_PAY"                  
##  [3] "CITY_RETIREMENT_CONTRIBUTIONS" "DEPARTMENT_NO"                
##  [5] "DEPARTMENT_TITLE"              "EMPLOYMENT_TYPE"              
##  [7] "ETHNICITY"                     "FIRST_NAME"                   
##  [9] "GENDER"                        "JOB_CLASS_PGRADE"             
## [11] "JOB_STATUS"                    "JOB_TITLE"                    
## [13] "LAST_NAME"                     "MOU"                          
## [15] "MOU_TITLE"                     "OVERTIME_PAY"                 
## [17] "PAY_YEAR"                      "RECORD_NBR"                   
## [19] "REGULAR_PAY"                   "TOTAL_PAY"

So, by how much do these variables have an impact on the employees’ average regular pay?

Data Cleaning

Employees’ pay is stored in a variable named REGULAR_PAY, but this doesn’t specify what kind of position they are in.

To make the analysis more consistent, I decided to only look at the regular pay of full time employees.

Here, I’m just creating a new data frame to store our full time employees and making sure that there are no NA entries.

payrollfull <- payroll[payroll$EMPLOYMENT_TYPE == "FULL_TIME",]
payrollfull <- na.omit(payrollfull)

After this, we now have:

## 
## FULL_TIME 
##    446776

Data Exploration

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      7.5  51068.8  81219.2  82660.2 112921.6 481255.1

Job Status

The JOB_STATUS variable only has two options: active, and not active.

Here, I’ll be using plotly to make an interactive box plot, with one box for each JOB_STATUS option.

The boxes will plot the regular pay of the employees for each option.

plot <- plot_ly(payrollfull, 
                x = payrollfull[payrollfull$JOB_STATUS == "ACTIVE",]
                $REGULAR_PAY, 
                name = "Active", type = "box")
plot <- plot %>% 
  add_trace(payrollfull, x = payrollfull[payrollfull$JOB_STATUS 
                                        == "NOT_ACTIVE",]$REGULAR_PAY, 
            name = "Not Active")

Job Status Plot

Gender

There are four options for the GENDER variable: female, male, other, and unknown. For this plot, I’m taking the means for the employees’ regular pay separated by their reported gender.

More on Gender

The unknown option seems to have a significantly lower pay than the rest. Let’s look at the distribution of these options:

table(payrollfull$GENDER)

## 
##  FEMALE    MALE   OTHER UNKNOWN 
##  126693  302569      36     176

We can see that there are significantly less data points for both other and unknown than there are for male or female.

On its own, this doesn’t invalidate the results of the plot, but more info on the data collection and the individual employees is needed to make this more reliable.

Benefit Pay and Pay Year

The two variables we will be looking at next are BENEFIT_PAY and PAY_YEAR. To do this, we will use ggplot to create a graph that includes three variables and some regression lines.

g = ggplot(payrollfull, aes(x = BENEFIT_PAY, y = REGULAR_PAY, 
                            color= PAY_YEAR)) + 
  geom_point(alpha=1/2) + xlim(0, 80000) + 
  scale_color_gradient(low="blue", high="pink") + 
  geom_smooth(method="lm", se=F, color="orange") + 
  geom_smooth(formula=y ~ poly(x, 2), method="lm", se=F, color="green") + 
  geom_smooth(formula=y ~ poly(x, 4), method="lm", se=F, color="darkred") + 
  ggtitle("Benefit Pay vs. Regular Pay and Pay Year")

Benefit Pay and Pay Year Plot

More Detail on Pay Year

meanbyyear <- payrollfull %>%
  group_by(PAY_YEAR) %>%
  summarise(means = mean(REGULAR_PAY))
meanbyyear

## # A tibble: 8 × 2
##   PAY_YEAR  means
##      <int>  <dbl>
## 1     2017 75541.
## 2     2018 80926.
## 3     2019 80043.
## 4     2020 92391.
## 5     2021 89548.
## 6     2022 88908.
## 7     2023 93190.
## 8     2024 51520.

Ethnicity

piedata = count(payrollfull, payrollfull$ETHNICITY)
plot_ly(data = piedata, labels =~ factor(payrollfull$ETHNICITY), type = 'pie')

More on Ethnicity

We saw in the pie chart that there were some entries marked as integers, which we don’t want for our following plot. Here, we’ll remove any blank or numerical values for ethnicity before making our bar plot.

ethnicitydata = payrollfull[!is.numeric(payrollfull$ETHNICITY), ]
ethnicitydata = ethnicitydata[!ethnicitydata$ETHNICITY == "", ]
plot = ethnicitydata %>%
  group_by(ETHNICITY) %>%
  summarise(AverageRegularPay = mean(REGULAR_PAY)) %>%
  ggplot(aes(x = ETHNICITY, y = AverageRegularPay, fill = ETHNICITY)) +
    geom_col(position = position_dodge()) +
    scale_x_discrete(guide = guide_axis(angle = 90)) + 
  geom_text(aes(label = round(AverageRegularPay, 0)), 
            colour = "black", size = 3,vjust = 1.5, 
            position = position_dodge(.9)) + 
  theme(legend.position='none') + 
  ggtitle("Average Regular Pay by Ethnicity")

Ethnicity Plot

Conclusion

All variables we looked at had some effect on the average regular pay
Options like “other” and “unknown” tended to result in lower pay
Categorical variables that resulted in the highest -> lowest pay:
- Job Status: Active -> Not Active
- Gender: Male -> Unknown
- Pay Year: 2023 -> 2024
- Ethnicity: Asian American -> Two or More Races
Info from data source needed to draw more sweeping conclusions
- Anomalies like regular pay being < 10, “unknown” gender option being extremely low