Introduction

My dataset covers annual salary information of all active employees
of Montgomery County, MD in 2021, including gross and overtime pay. My
variables include the Department Code, Department Name, assigned
department division, employee gender, employee base salary, employee
overtime pay for 2021, employee longevity pay for 2021, and employee
grade. The data comes from the data.montgomerycountymd.gov website. I
cleaned up my data simply by removing all NAs out of the respective
columns by filtering them out with dplyr. I chose this dataset because
it was interesting to me to look at local employees information, I
believed it could make for an interesting visualization to reveal
something unseen before.
Data Prepping and Cleaning
# Load and install necessary packages and read csv file
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
install.packages("ggcorrplot")
## Installing package into 'C:/Users/jman1/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'ggcorrplot' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\jman1\AppData\Local\Temp\RtmpycAgfL\downloaded_packages
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.4.2
library(ggpubr)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
install.packages("gapminder")
## Installing package into 'C:/Users/jman1/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'gapminder' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\jman1\AppData\Local\Temp\RtmpycAgfL\downloaded_packages
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.4.2
employeeSalaries <- read_csv("Employee_Salaries_-_2021_20241115.csv")
## Rows: 9907 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Department, Department Name, Division, Gender, Grade
## dbl (3): Base Salary, 2021 Overtime Pay, 2021 Longevity Pay
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Remove NAs
employeeSalaries <- employeeSalaries %>%
filter(!is.na(`2021 Longevity Pay` & `2021 Overtime Pay`))
any(is.na(employeeSalaries))
## [1] FALSE
Data Exploration
# Creating a linear regression model comparing Base Salary to overtime and longevity pay
model <- lm(`Base Salary` ~ `2021 Overtime Pay` + `2021 Longevity Pay`, data = employeeSalaries)
summary(model)
##
## Call:
## lm(formula = `Base Salary` ~ `2021 Overtime Pay` + `2021 Longevity Pay`,
## data = employeeSalaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53524 -9674 -852 8736 61333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.416e+04 8.605e+02 86.189 <2e-16 ***
## `2021 Overtime Pay` 1.855e-01 1.919e-02 9.669 <2e-16 ***
## `2021 Longevity Pay` 3.726e+00 1.841e-01 20.242 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15160 on 1406 degrees of freedom
## Multiple R-squared: 0.2778, Adjusted R-squared: 0.2768
## F-statistic: 270.5 on 2 and 1406 DF, p-value: < 2.2e-16
coefficients <- coef(model)
equation <- paste0(
"Base Salary = ", round(coefficients[1], 2), " + ",
round(coefficients[2], 2), " * Overtime Pay + ",
round(coefficients[3], 2), " * Longevity Pay"
)
ggplot(employeeSalaries, aes(x = `2021 Overtime Pay`, y = `Base Salary`)) +
geom_point(alpha = 0.7, color = "#1c30e8") +
geom_smooth(method = "lm", se = FALSE, color = "#de1414") +
labs(
title = "Base Salary vs. Overtime + Longevity Pay",
subtitle = paste("Regression Equation:", equation),
x = "2021 Overtime Pay",
y = "Base Salary"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

This model yields the equation: Base Salary = 74161.93 + 0.19*
Overtime Pay + 3.73 * Longevity Pay. The p-value is near 0, indicating
that this is a statistically significant relationship between these
variables. The adjusted R-squared value (0.2768) is somewhat low,
suggesting that the correlation is a relatively weak fit but still shows
that the independent variables have some predictive power over the
dependent variable. This makes sense as one’s base salary is probably
related to how much they are paid as salary in other ways.
Bar Chart of Salary Component Breakdown by Department
# Create bar chart of breakdown of base salary, longevity pay and overtime pay, by department
employeeSalaries_long <- employeeSalaries %>%
filter(!is.na(`Base Salary`) & !is.na(`2021 Overtime Pay`) & !is.na(`2021 Longevity Pay`)) %>%
pivot_longer(
cols = c(`Base Salary`, `2021 Overtime Pay`, `2021 Longevity Pay`),
names_to = "PayType",
values_to = "Amount"
)
p <- ggplot(employeeSalaries_long, aes(x = `Department Name`, y = Amount, fill = PayType)) +
geom_bar(stat = "identity", alpha = 0.8) +
labs(
title = "Total Compensation Breakdown by Department",
subtitle = "Base Salary, Overtime, and Longevity Pay",
x = "Department",
y = "Amount",
fill = "Pay Type",
caption = "https://data.montgomerycountymd.gov/Human-Resources/Employee-Salaries-2021/kmkb-bmhe/about_data"
) +
scale_fill_manual(values = c("Base Salary" = "#066db8",
"2021 Overtime Pay" = "#b88906",
"2021 Longevity Pay" = "#06c72d")) +
scale_y_continuous(labels = label_comma()) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.title.x = element_text(size = 12),
plot.caption = element_text(hjust = 0.08, size = 10, color = "gray")
)
print(p)

interactive_plot <- ggplotly(p)
interactive_plot
Conclusion & Observations
This visualization shows the breakdown of pay by department
including all pay types. It includes longevity pay, overtime pay, and
base salary as portions of each bar. I thought it was very interesting
that the Department of Police especially and the Fire and Rescue
services make so much more in totality than all the other departments,
my best guess is that this is simply due to the high number of employees
within these departments. I’d also be curious to know more about this
disparity in pay distribution between departments, for example, the Fire
and Rescue Services have a significant portion of their earnings from
overtime and longevity pay, whereas the vast majority of the Department
of Police’s earnings comes from their base salary.