Research Question: What factors out of gender, pay grade, overtime pay, and longevity pay predict base salary among Montgomery County government employees in 2020?
To answer this question, I used the Employee Salaries: 2020 dataset from the Montgomery County Open Data Portal: https://data.montgomerycountymd.gov/Human-Resources/Employee-Salaries-2020/he7s-ebwb/about_data. This dataset tracks the earnings for county government employees, including their salary, overtime, longevity pay, pay grade, and gender.
Each row represents a single employee. The key variables used in this analysis are:
I will use multiple linear regression because Base_Salary is a continuous outcome variable and I want to understand the influence of multiple predictors simultaneously.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/SwagD/Desktop/Data 101")
data <- read.csv("Employee_Salaries_-_2020_20260507.csv")
head(data)
## Department Department.Name Division Gender
## 1 ABS Alcohol Beverage Services Wholesale Administration F
## 2 ABS Alcohol Beverage Services Administrative Services F
## 3 ABS Alcohol Beverage Services Administration M
## 4 ABS Alcohol Beverage Services Wholesale Operations F
## 5 ABS Alcohol Beverage Services Administration F
## 6 ABS Alcohol Beverage Services Marketing F
## Base.Salary X2020.Overtime.Pay X2020.Longevity.Pay Grade
## 1 $78,902 $199.17 $0 18
## 2 $35,926 $0 $4,038.91 16
## 3 $167,345 $0 $0 M2
## 4 $90,848 $0 $5,717.68 21
## 5 $78,902 $205.16 $2,460.24 18
## 6 $109,761 $0 $0 25
Before I start my analysis I will first clean the data by standardizing column names and converting the currency columns from character strings to numbers.
# Clean column names
names(data) <- tolower(gsub("[(). \\-]", "_", names(data)))
clean_data <- data |>
mutate(
base_salary = as.numeric(gsub("[$,]", "", base_salary)), # removes $ and , and then converts them to number
overtime_pay = as.numeric(gsub("[$,]", "", x2020_overtime_pay)), # same thing as base_salary
longevity_pay = as.numeric(gsub("[$,]", "", x2020_longevity_pay)), # same thing as base_salary
numeric_grade = as.numeric(grade), # converts grade to number
gender = as.factor(gender) # converts gender to factor, where F is the baseline, M is the comparison
) |>
filter(!is.na(base_salary), !is.na(numeric_grade),
!is.na(overtime_pay), !is.na(longevity_pay)) # removes incomplete rows
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `numeric_grade = as.numeric(grade)`.
## Caused by warning:
## ! NAs introduced by coercion
Now that I’ve cleaned my dataset I will look at the summary statistics grouped by gender.
clean_data |>
group_by(gender) |>
summarize(
mean_salary = mean(base_salary),
median_salary = median(base_salary),
max_salary = max(base_salary),
min_salary = min(base_salary)
)
## # A tibble: 2 × 5
## gender mean_salary median_salary max_salary min_salary
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 F 72182. 71852 212556 16004
## 2 M 72822. 68575 280000 11147.
Here is the visualization I created to help visualize the average base salary by gender.
ggplot(clean_data, aes(x = gender, y = base_salary, fill = gender)) +
geom_boxplot(color = "black") +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Distribution of Base Salary by Gender",
x = "Gender",
y = "Base Salary in USD",
caption = "Source: Montgomery County Open Data Portal"
) +
theme_minimal() +
theme(legend.position = "none")
I chose multiple linear regression because the outcome variable Base_Salary is continuous and I want to examine the influence of multiple predictors at the same time. This is because the dependent variable is numeric and I’m examining multiple predictors at once.
I chose Numeric_Grade, Overtime_Pay, Longevity_Pay, and Gender as predictors because grade directly determines where an employee falls on the county pay scale, longevity pay reflects years of service which typically raises compensation, overtime pay captures additional hours worked, and gender is included to test whether a wage gap exists after controlling for those structural factors.
The 5 assumptions I will check are: Linearity, Independence, Normality of Residuals, Homoscedasticity, No Multicollinearity.
# My multiple linear regression model
model <- lm(base_salary ~ gender + numeric_grade + overtime_pay + longevity_pay,
data = clean_data)
summary(model)
##
## Call:
## lm(formula = base_salary ~ gender + numeric_grade + overtime_pay +
## longevity_pay, data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53275 -11699 -1532 8200 263287
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.207e+04 1.076e+03 11.222 < 2e-16 ***
## genderM 4.639e+03 5.740e+02 8.082 7.56e-16 ***
## numeric_grade 3.049e+03 5.245e+01 58.136 < 2e-16 ***
## overtime_pay -1.308e-01 4.637e-02 -2.821 0.00481 **
## longevity_pay 2.934e+00 1.903e-01 15.418 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22530 on 6477 degrees of freedom
## Multiple R-squared: 0.371, Adjusted R-squared: 0.3706
## F-statistic: 955 on 4 and 6477 DF, p-value: < 2.2e-16
r_squared <- summary(model)$r.squared
adj_r_squared <- summary(model)$adj.r.squared
cat("R-squared:", round(r_squared, 3),
"\nAdjusted R-squared:", round(adj_r_squared, 3))
## R-squared: 0.371
## Adjusted R-squared: 0.371
I checked linearity, homoscedasticity, normality of residuals, and influential points using the four standard diagnostic plots.
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
For independence, each row is a separate county employee so independence is reasonably assumed.
For no multicollinearity, I checked the correlation between the numeric predictors.
cor(clean_data[, c("numeric_grade", "overtime_pay", "longevity_pay")], use = "complete.obs")
## numeric_grade overtime_pay longevity_pay
## numeric_grade 1.00000000 -0.05713947 0.08740723
## overtime_pay -0.05713947 1.00000000 0.03575301
## longevity_pay 0.08740723 0.03575301 1.00000000
# Using RMSE for the average prediction error in dollars
residuals <- model$residuals
rmse <- sqrt(mean(residuals^2))
rmse
## [1] 22525.32
My linear regression model went and looked at the factors predicted base salary for Montgomery County government employees in 2020. The variable that represents the results was base_salary.
The R-squared value tells us how much of the change in base salary can be explained by the model. The adjusted R-squared accounts for the number of predictors and is a better measure of fit. The RMSE tells us on average how many dollars the model’s predictions are off by.
Numeric_Grade tells us how much base salary increases for each unit increase in pay grade, while holding everything else constant. Longevity_Pay and Overtime_Pay tell us how much each additional dollar in those pay categories is due to a change in base salary. The Male coefficient tells us how much more or less male employees earn compared to female employees after controlling for grade, overtime, and longevity pay. If the coefficient is positive and significant, it means that a wage gap in favor of male employees. If it is not significant, that means that the grade and experience factors explain the salary differences rather than gender itself.
The best use of this is understanding whether pay grade and experience explain salary differences or whether gender plays an independent role. If a gender gap remains after controlling for structural factors, that is important information for county leadership to address.
This analysis used multiple linear regression to find out what factors predict base salary among Montgomery County government employees. The predictors were gender, pay grade, overtime pay, and longevity pay. The model provides insight into which factors are most strongly associated with salary levels. Future research could include additional variables such as specific job title, department, or full time workers compared to part time workers to improve the model’s power.
Montgomery County, Maryland. (2020).Employee Salaries2020:Montgomery County Open Data Portal. https://data.montgomerycountymd.gov/Human-Resources/Employee-Salaries-2020/he7s-ebwb/about_data
I used gemini to get the command scale_y_continuous(labels = scales::comma) and scale_x_continuous(labels = scales::comma) because Arrival to Cleared in seconds was being shown as 2e+05 and other numbers like that. I didn’t know how to fix it so I asked Gemini and it gave me that command to turn that into 100,000 and 200,000.
I used Gemini to get gsub([$]) to remove dollar signs and commas from the salary columns so they could be converted to numeric. I did not know how to clean it normally