Data 101 Final Project

Introduction

Are there significant differences in the average pay for civil servants between New York City’s five boroughs?

New York City is the most populated city in the United States by far, and as such, depends on hundreds of thousands of civil servants to keep the city going. This project will measure the differences in how civil servants are payed across the city’s five wildly different boroughs.

The parent dataset to be used is sourced directly from the city of New York. It contains citywide payroll data going back to 2014 and lists each employee plus a variety of relevant information. The dataset specifically imported into this project has already been filtered to solely contain payroll data from 2024. It contains payroll data for 562898 employees. The columns relevant to the upcoming analysis are:

Column	Description
Payroll Number	Numerical payroll identifier, unique to each agency
Work Location Borough	Employee’s primary work location
Base Salary	Base salary assigned to employee
Regular Gross Paid	Amount paid to employee from base salary
Total OT Paid	Amount paid to employee from overtime
Total Other Pay	Any compensation in addition to gross and OT pay; includes differentials, backpay, settlements, allowances

Data Analysis

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)

data05 <- read_csv("D:/DATA 101/Datasets/Citywide_Payroll_Data_(Fiscal_Year)_20251207.csv")

## Rows: 562898 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): Agency Name, Last Name, First Name, Mid Init, Agency Start Date, W...
## dbl  (2): Fiscal Year, Payroll Number
## num  (2): Regular Hours, OT Hours
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

I will now use a series of R chunks to reduce my dataset to the parts essential for answering our question. We begin with some exploratory data analysis, looking at NAs by column and the locations under the “Work Location Borough” column. The first step in preparing the data for analysis is to only keep the variables relevant to my analysis. This dataset is so large that we are saving large amounts of storage and memory by filtering out these columns. Next is renaming the remaining columns to a more computer-friendly format. This analysis is looking at NYC’s five boroughs, so we filter for those. There is one NA value in the “borough” column, but it is functionally irrelevant against 557308 other observations and will be automatically filtered out. The pay values are currently in the dataset as character sequences because of the dollar sign, so we use the parse_number column from readr to return numerical values. We’re also removing Staten Island DOT employees with “other pay” over a thousand dollars, as the city handed out over $30 million in back pay to them that shouldn’t affect this analysis. The final step is to create the total_pay column by adding gross pay (from base pay), overtime pay, and other pay.

colSums(is.na(data05))

##                Fiscal Year             Payroll Number 
##                          0                          0 
##                Agency Name                  Last Name 
##                          0                        416 
##                 First Name                   Mid Init 
##                        428                     235478 
##          Agency Start Date      Work Location Borough 
##                         77                          1 
##          Title Description Leave Status as of June 30 
##                         68                          0 
##                Base Salary                  Pay Basis 
##                          0                          0 
##              Regular Hours         Regular Gross Paid 
##                          0                          0 
##                   OT Hours              Total OT Paid 
##                          0                          0 
##            Total Other Pay 
##                          0

unique(data05$"Work Location Borough")

##  [1] "MANHATTAN"     "BROOKLYN"      "BRONX"         "QUEENS"       
##  [5] "RICHMOND"      "ALBANY"        "WESTCHESTER"   "NASSAU"       
##  [9] "DELAWARE"      "SULLIVAN"      "ORANGE"        "ULSTER"       
## [13] NA              "OTHER"         "SCHOHARIE"     "PUTNAM"       
## [17] "DUTCHESS"      "GREENE"        "WASHINGTON DC"

data <- data05 |>
  select("Payroll Number", "Work Location Borough", "Base Salary", "Regular Gross Paid", "Total OT Paid", "Total Other Pay") |>
  rename(
    "agency" = "Payroll Number",
    "borough" = "Work Location Borough",
    "base_pay" = "Base Salary",
    "gross_pay" = "Regular Gross Paid",
    "overtime_pay" = "Total OT Paid",
    "other_pay" = "Total Other Pay"
  ) |>
  filter(borough %in% c("MANHATTAN", "BRONX", "QUEENS", "BROOKLYN", "RICHMOND")) ## Richmond is Staten Island

data$gross_pay <- parse_number(data$gross_pay)
data$overtime_pay <- parse_number(data$overtime_pay)
data$other_pay <- parse_number(data$other_pay)

data <- data |>
  filter(gross_pay > 0) |>
  filter(!(agency==841 & borough=="RICHMOND" & other_pay>=1000))

data <- data |>
  mutate("total_pay" = gross_pay + overtime_pay + other_pay)

Statistical Analysis

To answer this question, we will use an ANOVA (Analysis of Varaince) test. This makes the most sense because our question concerns differences in mean (variance), which is exactly what an ANOVA test analyzes.

Hypotheses

$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$

$H_A: \mu_1 \neq \mu_2 \neq \mu_3 \neq \mu_4 \neq \mu_5$

anova_result <- aov(total_pay ~ borough, data=data)

summary(anova_result)

##                 Df    Sum Sq   Mean Sq F value Pr(>F)    
## borough          4 1.401e+14 3.503e+13    9336 <2e-16 ***
## Residuals   541456 2.032e+15 3.752e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(anova_result)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = total_pay ~ borough, data = data)
## 
## $borough
##                           diff         lwr         upr     p adj
## BROOKLYN-BRONX       1502.7977    222.3484   2783.2469 0.0119370
## MANHATTAN-BRONX    -34855.2510 -35923.3741 -33787.1279 0.0000000
## QUEENS-BRONX         1347.9818    102.5632   2593.4003 0.0261849
## RICHMOND-BRONX      13590.2851  11305.7499  15874.8203 0.0000000
## MANHATTAN-BROOKLYN -36358.0487 -37156.6663 -35559.4310 0.0000000
## QUEENS-BROOKLYN      -154.8159  -1178.5203    868.8885 0.9939367
## RICHMOND-BROOKLYN   12087.4875   9915.8511  14259.1238 0.0000000
## QUEENS-MANHATTAN    36203.2328  35462.0802  36944.3854 0.0000000
## RICHMOND-MANHATTAN  48445.5361  46391.9196  50499.1527 0.0000000
## RICHMOND-QUEENS     12242.3033  10091.1359  14393.4708 0.0000000

library(scales)

## Warning: package 'scales' was built under R version 4.4.3

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

ggplot(data, aes(x=borough, y=total_pay)) +
  geom_boxplot() +
  stat_summary( ## need to show mean not median
    fun = mean,
    geom = "point",
    shape = 18,
    size = 3,
    color = "red"
  ) +
  scale_y_continuous(labels = label_currency()) +
  labs(
    title = "Total pay by borough",
    x = "Borough",
    y = "Total Pay"
  ) +
  theme_minimal()

The ANOVA model’s p-value is $< 2*10^{-16}$, which indicates strong statistical significance. The TukeyHSD test shows the differences between individual boroughs, and the boxplot shows them all on the same scale. The adjusted p-values from the Tukey test shows that there are statistically significant differences in the mean pay of each borough in relation to each other, except for the mean pay of Queens and Brooklyn. It also shows that there are minorly significant differences in the mean pay of Brooklyn-Bronx and Queens-Bronx, with the Bronx being about $1300-$1500 behind in both cases. Surprisingly enough, Manhattan’s average pay is significantly lower than every other borough, though the boxplot does note that there are many outliers. The largest difference in means is between Richmond (Staten Island) and Manhattan, with the mean difference being $48,445. Richmond led the other boroughs by about $12,000 dollars each.

Conclusion and Future Directions

There are significant differences in the mean pay for civil servants in each of New York City’s five boroughs. Manhattan’s mean pay is $12,000 lower than most of the other boroughs, and is $36,000 lower than Richmond’s mean pay. Civil servants in Queens and Brooklyn are payed almost exactly the same, while the mean pay from Bronx is about $1500 below them.

While this analysis is interesting, it leads to more questions than answers? Why does the wealthiest borough have the lowest average pay for civil servants? What’s going on on Staten Island that the pay is so high? Is the pay gap between Queens and Brooklyn growing or shrinking, and why so? These questions could be answered through normal research and by analyzing these mean pay values over the years, as is possible from the original dataset before it was filtered for this project. Another avenue of future research is to analyze the mean pay by department, as the location of these departments could say a lot about these mean pay values by borough.

References

“Citywide Payroll Data (Fiscal Year).” NYC OpenData (2025). Retrieved from https://data.cityofnewyork.us/City-Government/Citywide-Payroll-Data-Fiscal-Year-/k397-673e/ on December 7, 2025.

“Parse numbers, flexibly.” Tidyverse readr. Retrieved from https://readr.tidyverse.org/reference/parse_number.html on December 15, 2025.

“Summarise y values at unique/binned x.” Tidyverse ggplot2. Retrieved from https://ggplot2.tidyverse.org/reference/stat_summary.html on December 15, 2025.

“Label currencies ($100, €2.50, etc.).” R-lib scales. Retrieved from https://scales.r-lib.org/reference/label_currency.html on December 15, 2025.