Introduction

My dataset covers annual salary information of all active employees of Montgomery County, MD in 2021, including gross and overtime pay. My variables include the Department Code, Department Name, assigned department division, employee gender, employee base salary, employee overtime pay for 2021, employee longevity pay for 2021, and employee grade. The data comes from the data.montgomerycountymd.gov website. I cleaned up my data simply by removing all NAs out of the respective columns by filtering them out with dplyr. I chose this dataset because it was interesting to me to look at local employees information, I believed it could make for an interesting visualization to reveal something unseen before.

Data Prepping and Cleaning

# Load and install necessary packages and read csv file
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
install.packages("ggcorrplot")
## Installing package into 'C:/Users/jman1/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'ggcorrplot' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\jman1\AppData\Local\Temp\RtmpycAgfL\downloaded_packages
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.4.2
library(ggpubr)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
install.packages("gapminder")
## Installing package into 'C:/Users/jman1/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'gapminder' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\jman1\AppData\Local\Temp\RtmpycAgfL\downloaded_packages
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.4.2
employeeSalaries <- read_csv("Employee_Salaries_-_2021_20241115.csv")
## Rows: 9907 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Department, Department Name, Division, Gender, Grade
## dbl (3): Base Salary, 2021 Overtime Pay, 2021 Longevity Pay
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Remove NAs
employeeSalaries <- employeeSalaries %>%
  filter(!is.na(`2021 Longevity Pay` & `2021 Overtime Pay`))
any(is.na(employeeSalaries))
## [1] FALSE

Data Exploration

# Creating a linear regression model comparing Base Salary to overtime and longevity pay
model <- lm(`Base Salary` ~ `2021 Overtime Pay` + `2021 Longevity Pay`, data = employeeSalaries)
summary(model)
## 
## Call:
## lm(formula = `Base Salary` ~ `2021 Overtime Pay` + `2021 Longevity Pay`, 
##     data = employeeSalaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53524  -9674   -852   8736  61333 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.416e+04  8.605e+02  86.189   <2e-16 ***
## `2021 Overtime Pay`  1.855e-01  1.919e-02   9.669   <2e-16 ***
## `2021 Longevity Pay` 3.726e+00  1.841e-01  20.242   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15160 on 1406 degrees of freedom
## Multiple R-squared:  0.2778, Adjusted R-squared:  0.2768 
## F-statistic: 270.5 on 2 and 1406 DF,  p-value: < 2.2e-16
coefficients <- coef(model)
equation <- paste0(
  "Base Salary = ", round(coefficients[1], 2), " + ", 
  round(coefficients[2], 2), " * Overtime Pay + ", 
  round(coefficients[3], 2), " * Longevity Pay"
)
ggplot(employeeSalaries, aes(x = `2021 Overtime Pay`, y = `Base Salary`)) +
  geom_point(alpha = 0.7, color = "#1c30e8") +
  geom_smooth(method = "lm", se = FALSE, color = "#de1414") +
  labs(
    title = "Base Salary vs. Overtime + Longevity Pay",
    subtitle = paste("Regression Equation:", equation),
    x = "2021 Overtime Pay",
    y = "Base Salary"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Bar Chart of Salary Component Breakdown by Department

# Create bar chart of breakdown of base salary, longevity pay and overtime pay, by department
employeeSalaries_long <- employeeSalaries %>%
  filter(!is.na(`Base Salary`) & !is.na(`2021 Overtime Pay`) & !is.na(`2021 Longevity Pay`)) %>%
  pivot_longer(
    cols = c(`Base Salary`, `2021 Overtime Pay`, `2021 Longevity Pay`),
    names_to = "PayType",
    values_to = "Amount"
  )
p <- ggplot(employeeSalaries_long, aes(x = `Department Name`, y = Amount, fill = PayType)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  labs(
    title = "Total Compensation Breakdown by Department",
    subtitle = "Base Salary, Overtime, and Longevity Pay",
    x = "Department",
    y = "Amount",
    fill = "Pay Type",
    caption = "https://data.montgomerycountymd.gov/Human-Resources/Employee-Salaries-2021/kmkb-bmhe/about_data"
  ) +
  scale_fill_manual(values = c("Base Salary" = "#066db8", 
                               "2021 Overtime Pay" = "#b88906", 
                               "2021 Longevity Pay" = "#06c72d")) +
  scale_y_continuous(labels = label_comma()) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    axis.title.x = element_text(size = 12),
    plot.caption = element_text(hjust = 0.08, size = 10, color = "gray")
  )
print(p)

interactive_plot <- ggplotly(p)
interactive_plot

Background Research

I found this dataset here: https://data.montgomerycountymd.gov/Human-Resources/Employee-Salaries-2021/kmkb-bmhe/about_data. Although originally i was looking through kaggle but couldn’t find any proper datasets. I also used this website: https://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html for inspiration for good visualizations.

Conclusion & Observations

This visualization shows the breakdown of pay by department including all pay types. It includes longevity pay, overtime pay, and base salary as portions of each bar. I thought it was very interesting that the Department of Police especially and the Fire and Rescue services make so much more in totality than all the other departments, my best guess is that this is simply due to the high number of employees within these departments. I’d also be curious to know more about this disparity in pay distribution between departments, for example, the Fire and Rescue Services have a significant portion of their earnings from overtime and longevity pay, whereas the vast majority of the Department of Police’s earnings comes from their base salary.