Are there significant differences in the average pay for civil servants between New York City’s five boroughs?
New York City is the most populated city in the United States by far, and as such, depends on hundreds of thousands of civil servants to keep the city going. This project will measure the differences in how civil servants are payed across the city’s five wildly different boroughs.
The parent dataset to be used is sourced directly from the city of New York. It contains citywide payroll data going back to 2014 and lists each employee plus a variety of relevant information. The dataset specifically imported into this project has already been filtered to solely contain payroll data from 2024. It contains payroll data for 562898 employees. The columns relevant to the upcoming analysis are:
| Column | Description |
|---|---|
| Payroll Number | Numerical payroll identifier, unique to each agency |
| Work Location Borough | Employee’s primary work location |
| Base Salary | Base salary assigned to employee |
| Regular Gross Paid | Amount paid to employee from base salary |
| Total OT Paid | Amount paid to employee from overtime |
| Total Other Pay | Any compensation in addition to gross and OT pay; includes differentials, backpay, settlements, allowances |
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
data05 <- read_csv("D:/DATA 101/Datasets/Citywide_Payroll_Data_(Fiscal_Year)_20251207.csv")
## Rows: 562898 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): Agency Name, Last Name, First Name, Mid Init, Agency Start Date, W...
## dbl (2): Fiscal Year, Payroll Number
## num (2): Regular Hours, OT Hours
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I will now use a series of R chunks to reduce my dataset to the parts essential for answering our question. We begin with some exploratory data analysis, looking at NAs by column and the locations under the “Work Location Borough” column. The first step in preparing the data for analysis is to only keep the variables relevant to my analysis. This dataset is so large that we are saving large amounts of storage and memory by filtering out these columns. Next is renaming the remaining columns to a more computer-friendly format. This analysis is looking at NYC’s five boroughs, so we filter for those. There is one NA value in the “borough” column, but it is functionally irrelevant against 557308 other observations and will be automatically filtered out. The pay values are currently in the dataset as character sequences because of the dollar sign, so we use the parse_number column from readr to return numerical values. We’re also removing Staten Island DOT employees with “other pay” over a thousand dollars, as the city handed out over $30 million in back pay to them that shouldn’t affect this analysis. The final step is to create the total_pay column by adding gross pay (from base pay), overtime pay, and other pay.
colSums(is.na(data05))
## Fiscal Year Payroll Number
## 0 0
## Agency Name Last Name
## 0 416
## First Name Mid Init
## 428 235478
## Agency Start Date Work Location Borough
## 77 1
## Title Description Leave Status as of June 30
## 68 0
## Base Salary Pay Basis
## 0 0
## Regular Hours Regular Gross Paid
## 0 0
## OT Hours Total OT Paid
## 0 0
## Total Other Pay
## 0
unique(data05$"Work Location Borough")
## [1] "MANHATTAN" "BROOKLYN" "BRONX" "QUEENS"
## [5] "RICHMOND" "ALBANY" "WESTCHESTER" "NASSAU"
## [9] "DELAWARE" "SULLIVAN" "ORANGE" "ULSTER"
## [13] NA "OTHER" "SCHOHARIE" "PUTNAM"
## [17] "DUTCHESS" "GREENE" "WASHINGTON DC"
data <- data05 |>
select("Payroll Number", "Work Location Borough", "Base Salary", "Regular Gross Paid", "Total OT Paid", "Total Other Pay") |>
rename(
"agency" = "Payroll Number",
"borough" = "Work Location Borough",
"base_pay" = "Base Salary",
"gross_pay" = "Regular Gross Paid",
"overtime_pay" = "Total OT Paid",
"other_pay" = "Total Other Pay"
) |>
filter(borough %in% c("MANHATTAN", "BRONX", "QUEENS", "BROOKLYN", "RICHMOND")) ## Richmond is Staten Island
data$gross_pay <- parse_number(data$gross_pay)
data$overtime_pay <- parse_number(data$overtime_pay)
data$other_pay <- parse_number(data$other_pay)
data <- data |>
filter(gross_pay > 0) |>
filter(!(agency==841 & borough=="RICHMOND" & other_pay>=1000))
data <- data |>
mutate("total_pay" = gross_pay + overtime_pay + other_pay)
To answer this question, we will use an ANOVA (Analysis of Varaince) test. This makes the most sense because our question concerns differences in mean (variance), which is exactly what an ANOVA test analyzes.
Hypotheses
\(H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5\)
\(H_A: \mu_1 \neq \mu_2 \neq \mu_3 \neq \mu_4 \neq \mu_5\)
anova_result <- aov(total_pay ~ borough, data=data)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## borough 4 1.401e+14 3.503e+13 9336 <2e-16 ***
## Residuals 541456 2.032e+15 3.752e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova_result)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = total_pay ~ borough, data = data)
##
## $borough
## diff lwr upr p adj
## BROOKLYN-BRONX 1502.7977 222.3484 2783.2469 0.0119370
## MANHATTAN-BRONX -34855.2510 -35923.3741 -33787.1279 0.0000000
## QUEENS-BRONX 1347.9818 102.5632 2593.4003 0.0261849
## RICHMOND-BRONX 13590.2851 11305.7499 15874.8203 0.0000000
## MANHATTAN-BROOKLYN -36358.0487 -37156.6663 -35559.4310 0.0000000
## QUEENS-BROOKLYN -154.8159 -1178.5203 868.8885 0.9939367
## RICHMOND-BROOKLYN 12087.4875 9915.8511 14259.1238 0.0000000
## QUEENS-MANHATTAN 36203.2328 35462.0802 36944.3854 0.0000000
## RICHMOND-MANHATTAN 48445.5361 46391.9196 50499.1527 0.0000000
## RICHMOND-QUEENS 12242.3033 10091.1359 14393.4708 0.0000000
library(scales)
## Warning: package 'scales' was built under R version 4.4.3
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
ggplot(data, aes(x=borough, y=total_pay)) +
geom_boxplot() +
stat_summary( ## need to show mean not median
fun = mean,
geom = "point",
shape = 18,
size = 3,
color = "red"
) +
scale_y_continuous(labels = label_currency()) +
labs(
title = "Total pay by borough",
x = "Borough",
y = "Total Pay"
) +
theme_minimal()
The ANOVA model’s p-value is \(< 2*10^{-16}\), which indicates strong statistical significance. The TukeyHSD test shows the differences between individual boroughs, and the boxplot shows them all on the same scale. The adjusted p-values from the Tukey test shows that there are statistically significant differences in the mean pay of each borough in relation to each other, except for the mean pay of Queens and Brooklyn. It also shows that there are minorly significant differences in the mean pay of Brooklyn-Bronx and Queens-Bronx, with the Bronx being about $1300-$1500 behind in both cases. Surprisingly enough, Manhattan’s average pay is significantly lower than every other borough, though the boxplot does note that there are many outliers. The largest difference in means is between Richmond (Staten Island) and Manhattan, with the mean difference being $48,445. Richmond led the other boroughs by about $12,000 dollars each.
There are significant differences in the mean pay for civil servants in each of New York City’s five boroughs. Manhattan’s mean pay is $12,000 lower than most of the other boroughs, and is $36,000 lower than Richmond’s mean pay. Civil servants in Queens and Brooklyn are payed almost exactly the same, while the mean pay from Bronx is about $1500 below them.
While this analysis is interesting, it leads to more questions than answers? Why does the wealthiest borough have the lowest average pay for civil servants? What’s going on on Staten Island that the pay is so high? Is the pay gap between Queens and Brooklyn growing or shrinking, and why so? These questions could be answered through normal research and by analyzing these mean pay values over the years, as is possible from the original dataset before it was filtered for this project. Another avenue of future research is to analyze the mean pay by department, as the location of these departments could say a lot about these mean pay values by borough.
“Citywide Payroll Data (Fiscal Year).” NYC OpenData (2025). Retrieved from https://data.cityofnewyork.us/City-Government/Citywide-Payroll-Data-Fiscal-Year-/k397-673e/ on December 7, 2025.
“Parse numbers, flexibly.” Tidyverse readr. Retrieved from https://readr.tidyverse.org/reference/parse_number.html on December 15, 2025.
“Summarise y values at unique/binned x.” Tidyverse ggplot2. Retrieved from https://ggplot2.tidyverse.org/reference/stat_summary.html on December 15, 2025.
“Label currencies ($100, €2.50, etc.).” R-lib scales. Retrieved from https://scales.r-lib.org/reference/label_currency.html on December 15, 2025.